
    Dissecting COVID-19 data using data science techniques

    May 11, 2020

    Unless you have been living on top of a mountain without WiFi or hiding under a rock, you are quite familiar with the chaos that the coronavirus (COVID-19) has thrust the world into. Let’s try to gain a sense of what has transpired until now, using data analysis techniques. We will use Python and its many data science-friendly libraries to go through this.

    By the end of this post, you will be more comfortable handling data and able to analyze it in different ways. In later blogs in this series, we will learn how to visualize data better and apply machine learning techniques to see if we can predict the outbreak’s spread/impact using statistical modeling and extrapolation.

    There are a few notable websites that provide great coronavirus (COVID-19) dashboards and heat maps, such as Johns Hopkins University and Google.

    The instructions below are for a Mac and you can use their equivalents on Linux/Windows where needed.

    Set up

    Get the dataset from Johns Hopkins University. From the command line, do the following:

    git clone https://github.com/CSSEGISandData/COVID-19.git

    You will notice a set of ‘.csv’ files containing global and U.S. data for COVID-19 confirmed cases, recoveries, and deaths.

    The quickest and simplest way to get started is to install Anaconda for getting data science packages and managing your Python environments (or you can just use ‘pip install’ if you have a good handle on tools like ‘pyenv’ or ‘virtualenv’).

    Download Miniconda and install via terminal:
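A typical install, assuming the standard Miniconda installer name for macOS (verify the current filename on the Miniconda download page), looks like:

```shell
# Download the latest macOS installer (filename assumes Miniconda's standard naming)
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
# Run the installer and follow the prompts
bash Miniconda3-latest-MacOSX-x86_64.sh
```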


    Create a conda environment and activate it:

    conda create -n covid19 
    conda activate covid19

    You can install individual libraries like pandas, matplotlib, seaborn, jupyter, etc.; or get all of them together using pyviz/holoviz (useful in later blogs when we consider visualization aspects).

    conda install pandas jupyter
    # conda install -c pyviz holoviz

    By the way, all the Python Jupyter notebook code used in the following sections of this blog can be found here. So, if you just want to read through the blog and not actually code along, be my guest. For those of you who want to follow along with me, use the sections below and feel free to copy/paste into your notebook.

    Start a new notebook and you are in business.

    jupyter notebook

    Load Data

    Import the pandas library and load the data from the downloaded files:

    import pandas as pd
    # Filenames follow the JHU CSSE naming convention (lowercase case type), e.g.
    # time_series_covid19_confirmed_global.csv in csse_covid_19_data/csse_covid_19_time_series
    global_data = 'time_series_covid19_{}_global.csv'
    global_categories = ['Confirmed', 'Deaths', 'Recovered']
    global_df = dict()
    for cat in global_categories:
        global_df[cat] = pd.read_csv(global_data.format(cat.lower()))
    us_data = 'time_series_covid19_{}_US.csv'
    us_categories = ['Confirmed', 'Deaths']
    us_df = dict()
    for ucat in us_categories:
        us_df[ucat] = pd.read_csv(us_data.format(ucat.lower()))


    Let’s do a quick review of the contents of the data frames: number of rows and columns, column names, etc. Since we loaded each of the files into their respective dictionaries, we can access them like so:


    You can view the first 3 rows of each of them like below. Here is a sample of what you might see (this is up to April 12th):
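As a sketch on a tiny hypothetical frame (the real CSVs are much wider, with one column per date), the review steps look like:

```python
import pandas as pd

# Hypothetical mini-frame mimicking the wide layout of the JHU global files
df = pd.DataFrame({
    'Province/State': [None, None],
    'Country/Region': ['Italy', 'Spain'],
    'Lat': [41.9, 40.5],
    'Long': [12.6, -3.7],
    '4/11/20': [3, 5],
    '4/12/20': [4, 7],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names
print(df.head(3))        # first 3 rows
```

In the actual notebook you would run these on `global_df['Confirmed']`, `us_df['Confirmed']`, and so on.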



    We can see that the measurements are listed as separate columns. In order to do some analysis, it would be helpful if each measurement (day) were a row item in the dataframe. This calls for some data pre-processing:

    # In the global dataset, for each of the different types of cases (Confirmed, Deaths and Recovered)
    for field, df in global_df.items():
        # Group by country
        df = df.groupby('Country/Region', as_index=False).sum()
        # Convert each date column into a separate row item, using the melt() method
        df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                     var_name='Date', value_name='Count')
        # Keep track of the case type (Confirmed, Deaths, or Recovered)
        df['Type'] = field
        # Simplify column names
        df.columns = ['Country', 'Latitude', 'Longitude', 'Date', 'Count', 'Type']
        # Replace the dataframe in the global dataframe dictionary
        global_df[field] = df

    We can merge the different case types to get a consolidated dataframe. Let’s see the last 5 rows in the dataframe.

    # Concatenate all case types into one data frame
    df_complete = pd.concat(global_df.values())
    df_complete['Date'] = pd.to_datetime(df_complete['Date'])
    df_complete.tail()


    Let’s get the latest data loaded into a separate dataframe for easy analysis. You can choose the most recent date accordingly based on when you are working on this analysis. Also, let’s get a subset of the data for each of the case types.

    # Load the latest data
    df_latest = df_complete[df_complete['Date']=='2020-05-10'] 
    df_deaths = df_latest[df_latest['Type']=='Deaths']
    df_confirmed = df_latest[df_latest['Type']=='Confirmed']
    df_recovered = df_latest[df_latest['Type']=='Recovered']


    We can now answer a few questions based on the pre-processed data we have:

    How many countries’ data are represented in this latest data set?

    You can find that out using the ‘nunique()’ method on the df_latest dataframe.
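For example, on a toy stand-in for df_latest (hypothetical values), counting distinct countries looks like:

```python
import pandas as pd

# Toy stand-in for df_latest (hypothetical values)
df_latest = pd.DataFrame({
    'Country': ['Italy', 'Spain', 'US', 'Italy'],
    'Type': ['Confirmed', 'Confirmed', 'Confirmed', 'Deaths'],
    'Count': [10, 20, 30, 2],
})
num_countries = df_latest['Country'].nunique()
print(num_countries)  # 3 distinct countries
```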


    What is the total number of confirmed cases worldwide?

    We can find this out using the ‘sum()’ method, like so:
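A minimal sketch on toy values (the real numbers come from your df_confirmed frame):

```python
import pandas as pd

# Toy stand-in for df_confirmed (hypothetical values)
df_confirmed = pd.DataFrame({
    'Country': ['Italy', 'Spain', 'US'],
    'Type': ['Confirmed'] * 3,
    'Count': [10, 20, 30],
})
total_confirmed = df_confirmed['Count'].sum()
print(total_confirmed)  # 60
```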


    What are top 3 countries with the most number of confirmed cases?

    df_confirmed = df_confirmed.sort_values(by='Count', ascending=False)
    df_confirmed.head(3)

    What are the top 5 countries with COVID-19 recovered cases?

    df_recovered = df_recovered.sort_values(by='Count', ascending=False)
    df_recovered.head(5)

    What are the top 3 countries with the lowest Deaths percentage in relation to Confirmed cases?

    While raw death counts are accurate, they may not present a complete picture of a country’s death rate, so a percentage of deaths in relation to confirmed cases may be a better indicator.

    You can add a new ‘feature’ (column) to your data set; say, “Deaths Percent.” We will use a lambda function to perform this. The apply() method applies the function to every element in the series. We then use the sort_values() method with the ascending parameter to order our results. Notice the percent() helper function, which calculates the percentage of one type of case over another.

    def percent(df, country, case_type1, case_type2):
        case_type1_count = 0
        case_type2_count = 0
        for i, j in df.iterrows():
            if (j['Country'] == country) and (j['Type'] == case_type1):
                case_type1_count = j['Count']
            if (j['Country'] == country) and (j['Type'] == case_type2):
                case_type2_count = j['Count']
            if (case_type1_count != 0 and case_type2_count != 0):
                return case_type1_count*100/case_type2_count
    df_deaths['Deaths Percent'] = df_deaths['Country'].apply(lambda x: percent(df_latest, x, 'Deaths', 'Confirmed'))
    # Remove missing values
    df_deaths = df_deaths.dropna(subset=['Deaths Percent'], inplace=False)
    # Sort
    df_deaths_sorted = df_deaths.sort_values(by='Deaths Percent', ascending=True)
    print("Sorted DataFrame based on Deaths Percent in relation to Confirmed Cases - Lowest to Highest: ")
    df_deaths_sorted.head(3)
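As a quick sanity check, here is a self-contained run of the same percentage logic on toy data (a single hypothetical country):

```python
import pandas as pd

def percent(df, country, case_type1, case_type2):
    # Find both counts for the country, then return their ratio as a percentage
    case_type1_count = 0
    case_type2_count = 0
    for _, row in df.iterrows():
        if row['Country'] == country and row['Type'] == case_type1:
            case_type1_count = row['Count']
        if row['Country'] == country and row['Type'] == case_type2:
            case_type2_count = row['Count']
    if case_type1_count != 0 and case_type2_count != 0:
        return case_type1_count * 100 / case_type2_count

# Toy frame shaped like df_latest (hypothetical numbers)
df_toy = pd.DataFrame({
    'Country': ['Atlantis', 'Atlantis'],
    'Type': ['Deaths', 'Confirmed'],
    'Count': [5, 100],
})
print(percent(df_toy, 'Atlantis', 'Deaths', 'Confirmed'))  # 5.0
```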

    What are the top 3 countries with the highest Recovered percentage in relation to Confirmed cases ?

    We will do something similar to the last analysis, only this time on the ‘Recovered’ dataset with sorting done in descending order:

    df_recovered['Recovered Percent'] = df_recovered['Country'].apply(lambda x: percent(df_latest, x, 'Recovered', 'Confirmed'))
    # Remove missing values
    df_recovered = df_recovered.dropna(subset=['Recovered Percent'], inplace=False)
    # Sort
    df_recovered_sorted = df_recovered.sort_values(by='Recovered Percent', ascending=False)
    print("Sorted DataFrame based on Recovered Percent in relation to Confirmed Cases - Highest to Lowest: ")
    df_recovered_sorted.head(3)

    In summary, we covered different stages of data analysis: set-up, review, pre-processing, and answering interesting questions. Hopefully, you noticed that once you have the data curated, performing analysis is pretty straightforward (i.e., once you get the hang of pandas). In a later blog we will leverage this work to visualize the data in engaging ways, followed by applying AI/ML on it for prediction capabilities.



    Disclaimer: The data is from the Johns Hopkins University data set and may not provide an exhaustive collection, and is dependent on the reports from different countries and the CDC.
