Dissecting COVID-19 data using data science techniques

May 11, 2020

Unless you have been living on top of a mountain without WiFi or hiding under a rock, you are quite familiar with the chaos that the coronavirus (COVID-19) has thrust the world into. Let’s try to gain a sense of what has transpired until now, using data analysis techniques. We will use Python and its many data science-friendly libraries to go through this.

By the end of this post, you will be more comfortable handling data and be able to analyze it in different ways. In later blog(s) in this series, we will learn how to visualize data better and apply machine learning techniques to see if we can predict the outbreak’s spread/impact using statistical modeling and extrapolation.

There are a few notable websites that provide great coronavirus (COVID-19) dashboards and heat maps, such as Johns Hopkins University and Google.

The instructions below are for a Mac and you can use their equivalents on Linux/Windows where needed.

Set up

Get the dataset from Johns Hopkins University. From the command line, do the following:

git clone https://github.com/CSSEGISandData/COVID-19.git
cd COVID-19
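The time-series CSVs used below live in a subdirectory of the repo; assuming the repo layout at the time of writing, change into it before loading the data:

cd csse_covid_19_data/csse_covid_19_time_series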

You will notice a set of ‘.csv’ files containing global and U.S. time-series data for COVID-19 confirmed cases, deaths, and recoveries.

The quickest and simplest way to get started is to install Anaconda to get the data science packages and manage your Python environments (or you can just use ‘pip install’ if you have a good handle on tools like ‘pyenv’ or ‘virtualenv’).

Download Miniconda and install via terminal:

bash Miniconda3-latest-MacOSX-x86_64.sh

Create a conda environment and activate it:

conda create -n covid19 
conda activate covid19

You can install individual libraries like pandas, matplotlib, seaborn, jupyter, etc., or get all of them together using pyviz/holoviz (useful in later blogs when we consider visualization aspects).

conda install pandas jupyter
# conda install -c pyviz holoviz
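As an optional sanity check, confirm that pandas imports cleanly in the new environment:

python -c "import pandas; print(pandas.__version__)"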

By the way, all the Python Jupyter notebook code used in the following sections of this blog can be found here (see reference 1 below). So, if you just want to read through the blog and not actually code along, be my guest. For those of you who want to work through it with me, use the sections below and feel free to copy/paste into your notebook.

Start a new notebook and you are in business.

jupyter notebook

Load Data

Import the pandas library and load the data from the downloaded files:

import pandas as pd

# The file names follow the pattern time_series_covid19_<type>_global.csv,
# where <type> is lowercase (confirmed, deaths, recovered)
global_data = 'time_series_covid19_{}_global.csv'
global_categories = ['Confirmed', 'Deaths', 'Recovered']
global_df = dict()
for cat in global_categories:
    global_df[cat] = pd.read_csv(global_data.format(cat.lower()))

# The U.S. files only cover confirmed cases and deaths
us_data = 'time_series_covid19_{}_US.csv'
us_categories = ['Confirmed', 'Deaths']
us_df = dict()
for ucat in us_categories:
    us_df[ucat] = pd.read_csv(us_data.format(ucat.lower()))

Review

Let’s do a quick review of the contents of the data frames: number of rows, columns, column names, etc. Since we loaded each of the files into its respective dictionary, we can access them like so:

global_df['Confirmed'].info()
 
[screenshot: output of info() listing the row count, column names, and dtypes]

You can view the first 3 rows in each of them like below. Here is a sample of what you might see (this is up to April 12th):

global_df['Confirmed'].head(3)
 
[screenshot: first 3 rows of the global Confirmed dataframe]
 
us_df['Deaths'].head(3)
 
[screenshot: first 3 rows of the U.S. Deaths dataframe]

Pre-Processing

We can see that the measurements (dates) are listed as separate columns. In order to do some analysis, it would be helpful if each measurement (day) were a row item in the dataframe. This calls for some data pre-processing using pandas’ melt() method.
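To see what melt() does before we apply it to the real dataset, here is a minimal toy example on hypothetical data: each date column becomes its own row, with the date in a ‘variable’ column and the measurement in a ‘Count’ column.

wide = pd.DataFrame({'Country/Region': ['A'], 'Lat': [0.0], 'Long': [0.0],
                     '1/22/20': [5], '1/23/20': [8]})
wide.melt(id_vars=['Country/Region', 'Lat', 'Long'], value_name='Count')
# -> two rows: ('A', 0.0, 0.0, '1/22/20', 5) and ('A', 0.0, 0.0, '1/23/20', 8)

With that in mind, the actual pre-processing: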

# In the global dataset, for each of the different types of cases
# (Confirmed, Deaths and Recovered)
for field, df in global_df.items():
    # group by Country
    df = df.groupby('Country/Region', as_index=False).sum()
    # convert each date column into a separate row item, using the melt() method
    df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                 value_name='Count')
    # keep track of the case type (Confirmed, Deaths, or Recovered)
    df['Type'] = field
    # simplify the column names
    df.columns = ['Country', 'Latitude', 'Longitude', 'Date', 'Count', 'Type']
    # replace the dataframe in the global dataframe dictionary
    global_df[field] = df

We can concatenate the different case types to get a consolidated dataframe. Let’s see the last 5 rows in the dataframe.

# Concatenate all case types into one data frame
df_complete = pd.concat(global_df.values())
df_complete['Date'] = pd.to_datetime(df_complete['Date'])
df_complete.tail(5)

[screenshot: last 5 rows of the consolidated dataframe]

Let’s load the latest data into a separate dataframe for easy analysis. Choose the most recent date based on when you are working through this analysis. Also, let’s get a subset of the data for each of the case types.

# Load the latest data
df_latest = df_complete[df_complete['Date']=='2020-05-10']
# Subset by case type (copies, since we will add columns to these later)
df_deaths = df_latest[df_latest['Type']=='Deaths'].copy()
df_confirmed = df_latest[df_latest['Type']=='Confirmed'].copy()
df_recovered = df_latest[df_latest['Type']=='Recovered'].copy()
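If you would rather not hard-code the date, here is a small sketch that picks whatever the newest date in the dataset happens to be:

# use the most recent date present in the data
latest_date = df_complete['Date'].max()
df_latest = df_complete[df_complete['Date'] == latest_date]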

Analysis

We can now answer a few questions based on the pre-processed data we have:

How many countries’ data are represented in this latest data set?

You can find that out using the ‘nunique()’ method on the df_latest dataframe.

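A minimal sketch, counting the distinct values in the ‘Country’ column built during pre-processing:

df_latest['Country'].nunique()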

What is the total number of confirmed cases worldwide?

We can find this out using the ‘sum()’ method on the Count column, like so:

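A minimal sketch, summing the Count column of the confirmed-cases subset created above:

df_confirmed['Count'].sum()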

What are the top 3 countries with the most confirmed cases?

df_confirmed = df_confirmed.sort_values(by='Count', ascending=False)
df_confirmed.head(3)
 
[screenshot: top 3 countries by confirmed cases]

What are the top 5 countries with COVID-19 recovered cases?

df_recovered = df_recovered.sort_values(by='Count', ascending=False)
df_recovered.head()
 
[screenshot: top 5 countries by recovered cases]

What are the top 3 countries with the lowest Deaths percentage in relation to Confirmed cases?

While raw death counts are accurate, they may not present a complete picture of a country’s death rate, so a percentage of deaths in relation to confirmed cases may be a better indicator.

You can add a new ‘feature’ (column) to your data set; say, “Deaths Percent.” We will use a lambda function to perform this. The apply() method applies the function to every element in the series. We use the sort_values() method with the ascending parameter to order our results. Notice the percent() helper function, which calculates the percentage of one case type relative to another.

def percent(df, country, case_type1, case_type2):
    case_type1_count = 0
    case_type2_count = 0
    # scan the rows for this country's counts of each case type
    for i, j in df.iterrows():
        if (j['Country'] == country) and (j['Type'] == case_type1):
            case_type1_count = j['Count']
        if (j['Country'] == country) and (j['Type'] == case_type2):
            case_type2_count = j['Count']
        if (case_type1_count != 0 and case_type2_count != 0):
            return case_type1_count * 100 / case_type2_count

df_deaths['Deaths Percent'] = df_deaths['Country'].apply(lambda x: percent(df_latest, x, 'Deaths', 'Confirmed'))
# Remove missing values
df_deaths = df_deaths.dropna(subset=['Deaths Percent'], inplace=False)
# Sort
df_deaths_sorted = df_deaths.sort_values(by='Deaths Percent', ascending=True)
print("Sorted DataFrame based on Deaths Percent in relation to Confirmed Cases - Lowest to Highest: ")
df_deaths_sorted.head(3)
 
[screenshot: 3 countries with the lowest death percentage relative to confirmed cases]
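As an aside, iterating over the dataframe for every country is quadratic in the number of rows. Here is a vectorized sketch that computes the same percentages by pivoting the case types into columns, using the df_latest dataframe built above:

# pivot case types into columns, one row per country
pivot = df_latest.pivot_table(index='Country', columns='Type',
                              values='Count', aggfunc='sum')
# countries with zero confirmed cases yield inf here; the percent() helper skips them
pivot['Deaths Percent'] = pivot['Deaths'] * 100 / pivot['Confirmed']
pivot.sort_values(by='Deaths Percent').head(3)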

What are the top 3 countries with the highest Recovered percentage in relation to Confirmed cases?

We will do something similar to the last analysis, only this time on the ‘Recovered’ dataset, with the sorting done in descending order:

df_recovered['Recovered Percent'] = df_recovered['Country'].apply(lambda x: percent(df_latest, x, 'Recovered', 'Confirmed'))
# Remove missing values
df_recovered = df_recovered.dropna(subset=['Recovered Percent'], inplace=False)
# Sort
df_recovered_sorted = df_recovered.sort_values(by='Recovered Percent', ascending=False)
print("Sorted DataFrame based on Recovered Percent in relation to Confirmed Cases - Highest to Lowest: ")
df_recovered_sorted.head(3)
 
[screenshot: top 3 countries by recovered percentage relative to confirmed cases]

In summary, we covered the different stages of data analysis: Set up, Review, Pre-Processing, and answering interesting questions. Hopefully, you noticed that once you have the data curated, performing the analysis is pretty straightforward (i.e., once you get the hang of pandas). In a later blog we will leverage this work to visualize the data in engaging ways, followed by applying AI/ML to it for prediction capabilities.

References:

  1. https://github.com/ciseburg/cov19-data-analysis/blob/master/cov19-analysis-notebook.ipynb
  2. https://coronavirus.jhu.edu/map.html
  3. https://www.google.com/covid19-map/

Disclaimer: The data is from the Johns Hopkins University data set, may not be an exhaustive collection, and is dependent on the reports from different countries and the CDC.
