Data curation during a pandemic and lessons learned from COVID-19

Moritz U. G. Kraemer, Samuel V. Scarpino, Vukosi Marivate, Bernardo Gutierrez, Bo Xu, Graham Lee, Jared B. Hawkins, Caitlin Rivers, David M. Pigott, Rebecca Katz & John S. Brownstein
Nature Computational Science
January 14, 2021


Detailed, accurate data related to a disease outbreak enable informed public health decision making. Given the variety of data types available across different regions, global data curation and standardization efforts are essential to guarantee rapid data integration and dissemination in times of a pandemic.

A wide range of data are critical to characterizing disease outbreaks and informing public health responses [1]. Pathogen genomic data have become essential for identifying the causative agent of an infection, and they can also help track mutations and investigate transmission networks and the geographic spread of an infectious disease [2]. Clinical data are useful for understanding disease severity, developing clinical case definitions, evaluating pharmaceutical interventions and monitoring disease outcomes [3]. Serological data are important for characterizing immunity, antibody responses and how they may relate to clinical outcomes [4]. Additionally, epidemiological data ranging from aggregated case counts [5] to detailed contact-tracing data have been used extensively to characterize basic parameters such as the reproduction number, key time distributions (for example, onset of symptoms to hospitalization) and heterogeneity in transmission [6]. Metadata associated with individual epidemiological cases can be of great importance for understanding early disease dynamics and critical transitions from imported infections to locally acquired ones [7]. Further, disease data that contain demographic information have been used extensively to understand population-level attack rates [8]. All of these data inform response actions designed to mitigate the consequences of the disease event.

Although many countries and jurisdictions collect detailed data during outbreaks, these data may not be shared openly because of ethical, legal and privacy issues, political regulations and concerns, and/or computational limitations [9]. Computational frameworks for rapid ingestion of these data are not widely available either. In addition, there are no standardized data formats that facilitate the open reporting of such information while ensuring compliance with data privacy regulations (primarily concerning de-anonymization). This makes international comparisons of large, detailed outbreak data difficult and prevents inferences drawn from such data from being used effectively in outbreak response. As a consequence, the opportunity is missed to have a single platform where all of the data, irrespective of type and region, can be easily and quickly shared among the scientific community, which would greatly accelerate research related to a disease outbreak.

During the COVID-19 pandemic, an initiative was established to create a global infrastructure for consolidating, standardizing and sharing individual-level epidemiological data across different geographic regions. Nevertheless, challenges related to data ingestion and curation persist, and addressing them is crucial to enabling rapid analysis of open data during future outbreaks.
