Small Data Challenges of Studying Rare Diseases (2020)

The age of big data is in full swing, with researchers in both clinical medicine and public health seeking to take advantage of the increasing availability of massive amounts of electronic and administrative health data. In turn, this has led to substantial resources and efforts being poured into the development and teaching of methods for data collection and storage as well as machine learning analytic methods.¹

However, big data are not always available. In the study of rare diseases, small sample sizes are inevitable, particularly when the primary end point is also uncommon. As an example, Avadhanula et al² used data from a cohort of 125 patients with alkaptonuria, a rare autosomal recessive disorder, to investigate the prevalence of thyroid dysfunction among patients with the condition. Patients were recruited between 2000 and 2018 as part of a prospective longitudinal study conducted at the National Human Genome Research Institute. While this is by no means a generous sample size, the cohort is the largest of its kind for patients with alkaptonuria, according to the authors.

In the US, a rare disease is defined as a health condition that affects fewer than 200 000 individuals.³ This definition was created by Congress as part of the Orphan Drug Act of 1983, which aimed to use financial incentives to motivate pharmaceutical and medical device companies to develop new treatments for patients with rare diseases. Close to 7000 conditions meet this definition. Although a relatively small number of individuals are affected by each rare disease, the estimated total number of individuals living with any rare disease is between 25 million and 30 million.³ Support for rare disease research continues today. In 2016, the US Food and Drug Administration awarded $23 million over 4 years to support research in 21 different rare diseases.⁴ The Patient-Centered Outcomes Research Institute also has a special advisory board for rare disease research and thus far has funded more than 28 patient-centered comparative effectiveness studies that focus on the treatment and management of rare diseases.⁵

It is well recognized that the study of rare diseases poses unique challenges. From the perspective of study design, researchers investigating rare diseases have many options, including crossover and adaptive trials.⁶ For observational studies, Whicher et al⁷ list self-controlled study designs, case-control designs, and prospective inception cohorts as potential designs suitable for rare disease research. Beyond the choice of study design, researchers must also be wary of the analytic challenges that arise from studying rare diseases, including the extent to which the available data can be viewed as representative of the entire population of patients with the condition and whether there is sufficient statistical power to draw definitive conclusions (ie, those that could inform decision-making). It is perhaps less well recognized that, when the sample size is small, P values are especially sensitive to small changes in the observed number of outcomes. For example, in the study by Avadhanula et al,² 1 patient was diagnosed with hyperthyroidism in the cohort of 125 individuals. Based on the exact test for a 1-sample proportion, Avadhanula et al² found insufficient evidence that the estimated prevalence in the study population (ie, 1 of 125 [0.8%]) was different from that in the general population (ie, 0.5%), with a P value of .88. As a thought experiment, suppose 2 patients instead of 1 had been diagnosed with hyperthyroidism. The same test would yield a P value of .23. Furthermore, if 3 patients were diagnosed, the resulting P value would then be .04. Thus, hypothetically observing just 2 more cases would change the P value dramatically, a change that would likely alter decision-making.
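The sensitivity described in this thought experiment can be explored with a minimal sketch of an exact 1-sample binomial test. The sketch assumes a 2-sided P value defined as the total probability of all outcomes no more likely than the observed count, which is one common convention; the convention and software used in the cited study are not stated here, so the output is illustrative and need not reproduce the P values quoted above. The function and variable names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k events in n independent trials with event probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_two_sided_p(k, n, p0):
    """Two-sided exact binomial P value (assumed convention).

    Sums the probabilities, under the null prevalence p0, of every outcome
    that is no more likely than the observed count k.
    """
    observed = binom_pmf(k, n, p0)
    # Small relative tolerance so that floating-point ties are counted.
    threshold = observed * (1 + 1e-7)
    total = sum(
        pmf
        for pmf in (binom_pmf(i, n, p0) for i in range(n + 1))
        if pmf <= threshold
    )
    return min(1.0, total)


# Inputs taken from the text: 125 patients, reference prevalence of 0.5%.
N_PATIENTS = 125
REFERENCE_PREVALENCE = 0.005

# P values for 1 observed case and for the 2 hypothetical alternatives.
p_values = {
    cases: exact_two_sided_p(cases, N_PATIENTS, REFERENCE_PREVALENCE)
    for cases in (1, 2, 3)
}

for cases, p in p_values.items():
    print(f"{cases} of {N_PATIENTS} cases: P = {p:.3f}")
```

Whatever convention is used, the pattern to look for is the same: with so few expected events, each additional observed case moves the P value by a large amount.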

This is all the more important to acknowledge when it is placed against the backdrop of a 2019 editorial by the American Statistical Association that called on researchers to move away from using the term statistical significance to describe results with a P value of less than .05.⁸ As part of the editorial, the American Statistical Association solicited suggestions for alternative paradigms. One interesting proposal was that journals adopt a so-called results-blind review process in which study results are omitted from the initial manuscript submission. In doing so, the central criteria for publication would be whether the study objective is relevant and interesting from either a clinical or public health perspective and whether the study design and methods are appropriate. Rare disease research that lacks statistical power or fails to achieve the conventional levels of statistical significance may especially benefit from this type of review process. More publications and dissemination of knowledge of rare disease research would increase awareness and possibly foster new collaborations among different institutions that could lead to small data becoming bigger.

In a 2019 study, Rees et al⁹ reported on the completion and publication status of 659 clinical trials for rare diseases registered at ClinicalTrials.gov between January 2010 and December 2012. They found that, as of December 2014, 199 trials (30.2%) were discontinued, with insufficient patient accrual as the most cited reason. Furthermore, among those completed, about two-thirds (306 [66.5%]) remained unpublished at 2 years and nearly one-third (142 [31.5%]) remained unpublished at 4 years. Although the authors were unable to ascertain whether sample size and statistical significance factored into whether a study was published, it seems highly plausible that they did in many instances.¹⁰

Currently, JAMA Network Open does not use a results-blind review process. However, although not explicitly stated in the Instructions to Authors, statistical significance is not considered a criterion for publication. Driven by the desire to publish important science, JAMA Network Open is open to publishing high-quality studies with an important research question, a sound study design, appropriate methodology, and conclusions that are a reasonable and accurate reflection of the nature and strength of the evidence. Consequently, this journal represents an important venue for the publication of studies of rare diseases and embraces the challenges that arise from studying diseases that are often overlooked. After all, do we not hope that every disease will become rare in the future?

Article Information

Published: March 23, 2020. doi:10.1001/jamanetworkopen.2020.1965