Unless you have been living on top of a mountain without WiFi or hiding under a rock, you are quite familiar with the chaos that the coronavirus (COVID-19) has thrust the world into. Let’s try to gain a sense of what has transpired until now, using data analysis techniques. We will use Python and its many data science-friendly libraries to go through this.
By the end of this post, you will be more comfortable handling data and be able to analyze it in different ways. In later posts in this series, we will learn how to visualize data better and apply machine learning techniques to see if we can predict the outbreak’s spread and impact using statistical modeling and extrapolation.
The instructions below are for a Mac; use their equivalents on Linux/Windows where needed.
Get the dataset from Johns Hopkins University. From the command line, run the following:
git clone https://github.com/CSSEGISandData/COVID-19.git
cd COVID-19
You will notice a set of ‘.csv’ files containing global and U.S. data for COVID-19 confirmed cases, recoveries and deaths.
The quickest and simplest way to get started is to install Anaconda, which provides the data science packages and manages your Python environments (or you can just use ‘pip install’ if you have a good handle on tools like ‘pyenv’ or ‘virtualenv’).
Download Miniconda and install via terminal:
Create a conda environment and activate it:
You can install individual libraries like pandas, matplotlib, seaborn and Jupyter, or get all of them together using pyviz/holoviz (useful in later posts when we consider visualization aspects).
By the way, all the Jupyter notebook code used in the following sections of this blog can be found here. So, if you just want to read through the blog and not actually code along, be my guest. For those of you who want to walk through it with me, use the sections below and feel free to copy/paste into your notebook.
Start a new notebook and you are in business.
Import the pandas library and load the data from the downloaded files:
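As a minimal sketch of this step, the snippet below reads the three global time-series files into a dictionary of data frames. The directory path assumes the notebook runs from the folder that contains the cloned repo, and `DATA_DIR`, `CASE_FILES` and `load_case_files` are illustrative names rather than the ones used in the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the global time-series files inside the cloned repo.
DATA_DIR = Path("COVID-19/csse_covid_19_data/csse_covid_19_time_series")

# Illustrative mapping from case type to CSV file name.
CASE_FILES = {
    "Confirmed": "time_series_covid19_confirmed_global.csv",
    "Deaths": "time_series_covid19_deaths_global.csv",
    "Recovered": "time_series_covid19_recovered_global.csv",
}


def load_case_files(data_dir, case_files=CASE_FILES):
    """Read each CSV into a DataFrame and return them keyed by case type."""
    data_dir = Path(data_dir)
    return {case: pd.read_csv(data_dir / name) for case, name in case_files.items()}


# Only load when the repo is actually present next to the notebook.
if DATA_DIR.exists():
    dfs = load_case_files(DATA_DIR)
    print({case: df.shape for case, df in dfs.items()})
```

If you start the notebook from inside the COVID-19 folder instead, drop the leading `COVID-19/` from the path.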
Let’s do a quick review of the contents of the data frames: number of rows, columns, column names, etc. Since we loaded each of the files into their respective dictionaries, we can access them like so:
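A quick review along these lines might look like the sketch below. The data frame is a tiny made-up stand-in for one of the wide-format files (the values are placeholders, not real figures), and the dictionary name `dfs` is an assumption.

```python
import pandas as pd

# Made-up stand-in for one wide-format file: one column per day.
sample = pd.DataFrame({
    "Province/State": [None, None, "Province X"],
    "Country/Region": ["Country A", "Country B", "Country C"],
    "Lat": [0.0, 1.0, 2.0],
    "Long": [0.0, 1.0, 2.0],
    "4/11/20": [10, 20, 30],
    "4/12/20": [12, 25, 30],
})
dfs = {"Confirmed": sample}  # assumed layout: case type -> DataFrame

for case, df in dfs.items():
    n_rows, n_cols = df.shape
    print(f"{case}: {n_rows} rows x {n_cols} columns")
    print(list(df.columns))
```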
You can access the information about the first 3 rows in each of them like below. Here is a sample of what you might see (this is up to April 12th):
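One way to peek at the first 3 rows of every data frame is `head(3)`. The sketch below uses a small made-up frame in place of the real files, so the printed values are placeholders only.

```python
import pandas as pd

# Made-up wide-format sample with four rows, so head(3) actually trims it.
sample = pd.DataFrame({
    "Province/State": [None, None, "Province X", "Province Y"],
    "Country/Region": ["Country A", "Country B", "Country C", "Country C"],
    "4/11/20": [10, 20, 30, 5],
    "4/12/20": [12, 25, 30, 6],
})
dfs = {"Confirmed": sample}  # assumed layout: case type -> DataFrame

# Keep the first 3 rows of each frame and print them.
first_rows = {case: df.head(3) for case, df in dfs.items()}
for case, rows in first_rows.items():
    print(case)
    print(rows)
```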
We can see that the daily measurements are listed as separate columns. In order to do some analysis, it would be helpful if each measurement (day) were a row in the dataframe. This calls for some data pre-processing:
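A common way to turn one-column-per-day data into one-row-per-day data in pandas is `melt()`. The sketch below assumes the identifying columns are `Province/State`, `Country/Region`, `Lat` and `Long`, and that the date headers look like `4/12/20`; the helper name `to_long` and the sample values are made up.

```python
import pandas as pd

# Made-up wide-format sample: one column per day.
wide = pd.DataFrame({
    "Province/State": [None, "Province X"],
    "Country/Region": ["Country A", "Country B"],
    "Lat": [0.0, 1.0],
    "Long": [0.0, 1.0],
    "4/11/20": [10, 30],
    "4/12/20": [12, 31],
})

ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]


def to_long(df, case_name):
    """Unpivot the date columns into 'Date' and `case_name` columns."""
    long_df = df.melt(id_vars=ID_COLS, var_name="Date", value_name=case_name)
    # Parse headers such as '4/12/20' into real timestamps.
    long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")
    return long_df


confirmed_long = to_long(wide, "Confirmed")
print(confirmed_long)
```

Each (location, day) pair now has its own row, which makes filtering by date and grouping by country much simpler.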
We can merge the different case types to get a consolidated dataframe. Let’s see the last 5 rows in the dataframe.
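The merge could be sketched as a chain of `merge()` calls on the shared key columns. The key list, the `make_long` helper and the numbers are assumptions for illustration. A left join is used here so that every Confirmed row is kept; in the real files the rows may not line up one-to-one across case types, which would show up as missing values after the join.

```python
import pandas as pd

KEYS = ["Province/State", "Country/Region", "Date"]


def make_long(case_name, values):
    """Build a made-up long-format frame for one case type."""
    return pd.DataFrame({
        "Province/State": ["", "", "Province X", "Province X"],
        "Country/Region": ["Country A", "Country A", "Country B", "Country B"],
        "Date": pd.to_datetime(["2020-04-11", "2020-04-12"] * 2),
        case_name: values,
    })


long_dfs = {
    "Confirmed": make_long("Confirmed", [10, 12, 30, 31]),
    "Deaths": make_long("Deaths", [1, 2, 3, 3]),
    "Recovered": make_long("Recovered", [4, 5, 20, 21]),
}

# Start from Confirmed and attach the other case types column by column.
df_all = long_dfs["Confirmed"]
for case in ("Deaths", "Recovered"):
    df_all = df_all.merge(long_dfs[case], on=KEYS, how="left")

print(df_all.tail(5))
```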
Let’s get the latest data loaded into a separate dataframe for easy analysis. Choose the most recent date based on when you are working on this analysis. Also, let’s get a subset of the data for each of the case types.
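One possible way to do this is to filter on the maximum date rather than hard-coding it. The sketch below works on a made-up consolidated frame; `df_all`, `df_latest` and `subsets` are illustrative names.

```python
import pandas as pd

# Made-up consolidated frame: two days for two countries.
df_all = pd.DataFrame({
    "Country/Region": ["Country A", "Country A", "Country B", "Country B"],
    "Date": pd.to_datetime(["2020-04-11", "2020-04-12"] * 2),
    "Confirmed": [10, 12, 30, 31],
    "Deaths": [1, 2, 3, 3],
    "Recovered": [4, 5, 20, 21],
})

# Keep only the rows for the most recent date in the data.
latest_date = df_all["Date"].max()
df_latest = df_all[df_all["Date"] == latest_date].reset_index(drop=True)

# One small frame per case type, each keeping the country column.
subsets = {
    case: df_latest[["Country/Region", case]]
    for case in ("Confirmed", "Deaths", "Recovered")
}
print(df_latest)
```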
We can now answer a few questions based on the pre-processed data we have:
How many countries are represented in this latest data set?
You can find that out using the ‘nunique()’ method on the df_latest dataframe.
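For instance, assuming the country column is named `Country/Region`, the count could be obtained as follows (the frame is a made-up sample in which one country appears twice because it reports by province):

```python
import pandas as pd

# Made-up latest-day sample; 'Country A' is split across two provinces.
df_latest = pd.DataFrame({
    "Province/State": ["Province X", "Province Y", "", ""],
    "Country/Region": ["Country A", "Country A", "Country B", "Country C"],
    "Confirmed": [30, 5, 12, 25],
})

# nunique() counts distinct values, so repeated countries count once.
n_countries = df_latest["Country/Region"].nunique()
print(n_countries)
```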
What is the total number of confirmed cases worldwide?
We can find this out using the ‘sum()’ method, like so:
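A minimal sketch, assuming the confirmed counts live in a column named `Confirmed` (the numbers are placeholders, not real figures):

```python
import pandas as pd

# Made-up latest-day sample.
df_latest = pd.DataFrame({
    "Country/Region": ["Country A", "Country A", "Country B", "Country C"],
    "Confirmed": [30, 5, 12, 25],
})

# Summing the column adds up every row, i.e. every reported location.
total_confirmed = df_latest["Confirmed"].sum()
print(total_confirmed)
```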
What are the top 3 countries with the most confirmed cases?
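One way to answer this is to total the cases per country first, since some countries are reported province by province, and then take the three largest totals. The sketch below uses made-up numbers and assumed column names.

```python
import pandas as pd

# Made-up latest-day sample; 'Country A' is split across two provinces.
df_latest = pd.DataFrame({
    "Province/State": ["Province X", "Province Y", "", "", ""],
    "Country/Region": ["Country A", "Country A", "Country B", "Country C", "Country D"],
    "Confirmed": [30, 5, 12, 25, 8],
})

# Total per country, then keep the three largest totals.
confirmed_by_country = df_latest.groupby("Country/Region")["Confirmed"].sum()
top_3_confirmed = confirmed_by_country.nlargest(3)
print(top_3_confirmed)
```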
What are the top 5 countries with COVID-19 recovered cases?
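The recovered cases can be ranked in the same spirit. As a variation, this sketch sorts the per-country totals in descending order and keeps the first five; the column names and values are made up for illustration.

```python
import pandas as pd

# Made-up latest-day sample with six countries.
df_latest = pd.DataFrame({
    "Country/Region": ["Country A", "Country B", "Country C",
                       "Country D", "Country E", "Country F"],
    "Recovered": [50, 10, 40, 5, 30, 20],
})

# Total per country, sort from highest to lowest, keep the first five.
recovered_by_country = df_latest.groupby("Country/Region")["Recovered"].sum()
top_5_recovered = recovered_by_country.sort_values(ascending=False).head(5)
print(top_5_recovered)
```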
What are the top 3 countries with the lowest Deaths percentage in relation to Confirmed cases?
While raw counts are accurate, they may not present a fair picture of a country’s death rate, so maybe the percentage of deaths in relation to confirmed cases is a better indicator.
You can add a new ‘feature’ (column) to your data set; say, “Deaths Percent.” We will use a lambda function to compute it. The apply() method applies the function across the data frame. We use the sort_values() method and its ascending parameter to order the results. Notice the percent() function, which is used to calculate the percentage of one type of case over another.
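The approach described above could be sketched as follows. The `percent()` helper here is a hypothetical stand-in written for this example (the original notebook's version may differ), the frame is assumed to already hold one row per country, and the numbers are made up.

```python
import pandas as pd


def percent(part, whole):
    """Hypothetical helper: `part` as a percentage of `whole` (0.0 if `whole` is 0)."""
    return round(100.0 * part / whole, 2) if whole else 0.0


# Made-up per-country totals for the latest day.
df_latest = pd.DataFrame({
    "Country/Region": ["Country A", "Country B", "Country C", "Country D"],
    "Confirmed": [100, 200, 50, 400],
    "Deaths": [10, 4, 2, 4],
})

# Row-wise apply: compute the new column from two existing columns.
df_latest["Deaths Percent"] = df_latest.apply(
    lambda row: percent(row["Deaths"], row["Confirmed"]), axis=1
)

# Ascending order puts the lowest percentages first.
lowest_3 = df_latest.sort_values("Deaths Percent", ascending=True).head(3)
print(lowest_3)
```

For large data sets a vectorized expression such as `100 * df["Deaths"] / df["Confirmed"]` is usually faster than a row-wise `apply()`, but the lambda version mirrors the description in the text.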
What are the top 3 countries with the highest Recovered percentage in relation to Confirmed cases?
We will do something similar to the last analysis, only this time on the ‘Recovered’ dataset with sorting done in descending order:
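A sketch of the recovered variant, under the same assumptions (a hypothetical `percent()` helper, one row per country, made-up numbers), only sorted from highest to lowest:

```python
import pandas as pd


def percent(part, whole):
    """Hypothetical helper: `part` as a percentage of `whole` (0.0 if `whole` is 0)."""
    return round(100.0 * part / whole, 2) if whole else 0.0


# Made-up per-country totals for the latest day.
df_latest = pd.DataFrame({
    "Country/Region": ["Country A", "Country B", "Country C", "Country D"],
    "Confirmed": [100, 200, 50, 400],
    "Recovered": [50, 150, 10, 360],
})

df_latest["Recovered Percent"] = df_latest.apply(
    lambda row: percent(row["Recovered"], row["Confirmed"]), axis=1
)

# Descending order puts the highest percentages first.
top_3_recovered = df_latest.sort_values("Recovered Percent", ascending=False).head(3)
print(top_3_recovered)
```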
In summary, we covered the different stages of data analysis: setup, review, pre-processing and answering interesting questions. Hopefully, you noticed that once you have the data curated, performing the analysis is pretty straightforward (i.e., once you get the hang of ‘pandas’). In a later post we will build on this work and visualize the data in engaging ways, followed by applying AI/ML to it for prediction capabilities.
Disclaimer: The data is from the Johns Hopkins University data set and may not provide an exhaustive collection, and is dependent on the reports from different countries and the CDC.