New open source COVID-19 notebooks offer a toolkit that helps developers and data scientists answer important questions about the pandemic.
By Frederick Reiss, Chief Architect, Center for Open Source Data and AI Technologies, IBM
For data scientists and policy makers who are analyzing the effects of COVID-19 and trying to come up with actionable plans based on data, the information landscape is overwhelming. A near-constant flow of data from research studies, news outlets, social media, and health organizations makes the task of turning data into useful action nearly impossible. Developers and data scientists need answers to their questions about data sources, tools, and how to draw meaningful and statistically valid conclusions from the ever-changing data.
Policy makers face similar challenges. The United States has over 3,000 counties, each with a unique story of how COVID-19 is impacting its community. Policy makers are asking questions including: What stories can we tell in the aggregate? Are there patterns we see across the country? Which regions or demographics are most affected by the pandemic?
Open source Jupyter notebooks offer actionable data
To help, IBM’s Center for Open-Source Data and AI Technologies (CODAIT) created COVID notebooks, a toolkit that helps developers and data scientists answer important questions about the pandemic. We’ve taken care of the more mundane tasks, including:
- Obtaining authoritative data on the current status of the outbreak
- Cleaning up the most serious data-quality problems
- Collating the data into a format amenable to easy analysis with tools like Pandas and Scikit-Learn
- Building an initial set of example reports and graphs
Taking care of these tasks frees developers and data scientists to focus on advanced analysis and modeling tasks instead of worrying about things like data formats and data cleaning. Our repository uses developer-friendly Jupyter notebooks to cover each of these initial data analysis steps.
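To give a feel for the collating step, here is a minimal sketch of reshaping a wide time series table (one column per date) into a long table with one row per county and date, which is usually easier to analyze with Pandas. This is not the repository's actual code. The tiny inline CSV, its county and state names, its case counts, and the helper name `wide_to_long` are all made up for illustration.

```python
import io

import pandas as pd

# Synthetic stand-in for a wide time series file: one row per county,
# one column per date. Names and counts are placeholders, not real data.
WIDE_CSV = """FIPS,Admin2,Province_State,1/22/20,1/23/20,1/24/20
1,County A,State X,0,1,3
2,County B,State X,0,0,2
"""


def wide_to_long(wide, id_columns):
    """Turn one-column-per-date data into one row per (county, date)."""
    tidy = wide.melt(id_vars=id_columns, var_name="date", value_name="confirmed")
    # Parse the date column headers (month/day/two-digit year) into timestamps.
    tidy["date"] = pd.to_datetime(tidy["date"], format="%m/%d/%y")
    return tidy.sort_values(id_columns + ["date"]).reset_index(drop=True)


wide = pd.read_csv(io.StringIO(WIDE_CSV))
tidy = wide_to_long(wide, ["FIPS", "Admin2", "Province_State"])
print(tidy)
```

Once the data is in this long form, grouping by county or filtering by date range becomes a one-line Pandas operation.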
It’s important to note that the underlying data for COVID-19 changes on a daily basis. As you build your own analysis, you’ll want to update the results of your own notebooks frequently. But rerunning a collection of interconnected notebooks can be challenging. There are multiple stages of analysis, and the output of one step often feeds into multiple other steps. To simplify the process of updating your results with the latest data, we’ve created data processing pipelines using the Elyra Notebook Pipelines Visual Editor and Kubeflow Pipelines. One example is the pipeline for county-level time series data for the United States.
These pipelines let you rerun the entire extract-transform-analyze workflow with a single click of a button.
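The core idea behind such a pipeline, running each step only after the steps it depends on, can be sketched in a few lines of plain Python. This is an illustration of the dependency-ordering idea only, not how Elyra or Kubeflow Pipelines work internally. The step names and the `run_pipeline` helper are hypothetical, and the sketch assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical notebook steps. Each key maps to the set of steps whose
# output it needs; one step's output can feed several later steps.
DEPENDENCIES = {
    "extract_us_data": set(),
    "clean_us_data": {"extract_us_data"},
    "analyze_time_series": {"clean_us_data"},
    "build_reports": {"clean_us_data", "analyze_time_series"},
}


def run_order(dependencies):
    """Return an order in which every step comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(dependencies, run_step):
    """Call run_step(name) once per step, in dependency order."""
    order = run_order(dependencies)
    for name in order:
        run_step(name)
    return order


# Stand-in for executing a notebook: just report which step would run.
run_pipeline(DEPENDENCIES, lambda name: print(f"running {name}"))
```

A real pipeline tool adds what this sketch leaves out, such as passing files between steps, running steps in containers, and scheduling daily runs.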
Authoritative data sources
The tools in our repository use authoritative sources to arrive at aggregate insights policy makers can use to make real-time, critical decisions. For county-level data about the United States, our data extraction notebook downloads the latest data from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. This data set is the primary source for many of the predictive models used by organizations working with the Centers for Disease Control and Prevention.
Our notebook fills in known gaps in this primary source with additional data from the New York Times Coronavirus (Covid-19) Data in the United States repository (for more-complete data on Rhode Island and Utah) and New York newspaper THE CITY’s digest of the daily reports from the New York City Department of Health and Mental Hygiene (for borough-level data on New York City). Our notebooks use the European Centre for Disease Prevention and Control’s data on the geographic distribution of COVID-19 cases worldwide as their data source for worldwide data at the granularity of individual countries.
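One common way to fill gaps in a primary table from a secondary one in Pandas is `DataFrame.combine_first`, which keeps the primary value wherever it exists and falls back to the secondary value otherwise. The sketch below shows that technique on synthetic data. It is an assumption about how such gap filling could be done, not a copy of the notebook's logic, and the county labels, dates, counts, and the `fill_gaps` helper are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic example: the primary source is missing values for county_b.
index = pd.MultiIndex.from_tuples(
    [
        ("county_a", "2020-04-01"),
        ("county_a", "2020-04-02"),
        ("county_b", "2020-04-01"),
        ("county_b", "2020-04-02"),
    ],
    names=["county", "date"],
)
primary = pd.DataFrame({"confirmed": [10.0, 12.0, np.nan, np.nan]}, index=index)

# The secondary source only covers the rows that the primary one lacks.
secondary = pd.DataFrame({"confirmed": [7.0, 9.0]}, index=index[2:])


def fill_gaps(primary, secondary):
    """Keep primary values where present; otherwise use secondary values."""
    return primary.combine_first(secondary)


filled = fill_gaps(primary, secondary)
print(filled)
```

Because the primary value always wins when both sources have one, this approach suits the case where the secondary source is only trusted for the known gaps.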
The notebooks download all of these data sets as they run, for two reasons. First of all, these data sets change on a daily basis. Second, the license terms of the data sets do not allow commercial entities to redistribute the data. If you use our open source code in a commercial application, be sure to verify that you are staying within the bounds of the license terms for the underlying data!
You can use our code as a jumping off point for more in-depth, interesting analysis. For example, you might want to analyze the data to find correlations between poverty level and infection rates. Open source developers and data scientists can easily build on these tools to extend the analysis to their individual use cases.
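As a starting point for that kind of question, the sketch below computes a Pearson correlation between a poverty measure and a population-normalized infection rate. Everything in it is an assumption for illustration: the column names `poverty_pct`, `population`, and `confirmed`, the helper names, and the numbers, which are synthetic placeholders and say nothing about the real pandemic. You would need to join in a real poverty data set, which our repository does not provide.

```python
import pandas as pd


def cases_per_100k(confirmed, population):
    """Normalize case counts so counties of different sizes are comparable."""
    return confirmed / population * 100_000


def poverty_infection_correlation(counties):
    """Pearson correlation between poverty level and cases per 100,000 people."""
    rate = cases_per_100k(counties["confirmed"], counties["population"])
    return counties["poverty_pct"].corr(rate)


# Synthetic placeholder values, one row per hypothetical county.
counties = pd.DataFrame(
    {
        "poverty_pct": [8.0, 12.0, 15.0, 21.0],
        "population": [50_000, 120_000, 80_000, 30_000],
        "confirmed": [40, 150, 90, 60],
    }
)
print(poverty_infection_correlation(counties))
```

A correlation like this is only a first look. Drawing statistically valid conclusions would also require checking for confounding factors such as population density and testing rates.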
The technology, explained for developers and data scientists
The notebooks in our repository are Jupyter notebooks. We use common Python data science libraries, including Pandas, NumPy, Matplotlib, seaborn, and scipy.optimize.
Most of the cleaning and data analysis is done using Pandas dataframes, a framework that underlies a majority of analysis done by data scientists. IBM is also working to extend Pandas for natural language processing use cases. Parts of our COVID-19 time series analysis notebooks use the TensorArray extension type from our Text Extensions for Pandas library to store time series tensors in the cells of Pandas dataframes. Our team also leveraged the graphical workflow editor that we have built as part of the Elyra project to tie our notebooks into workflows that you can run each day as new data becomes available.
Leveraging the open source tool and data set
IBM and our team believe in the importance of democratizing technology, activating developers with the most up-to-date datasets and tools, which can help policy makers make the most informed decisions for citizens’ well-being.
You can start contributing by using the repo to build analysis in your community, county, and country. Developers and data scientists can also contribute directly to the tools that were used to do this analysis by making pull requests to our GitHub repository.