Pandas analysis of coronavirus pandemic

Photo by Pascal Müller on Unsplash

Coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. Many blog posts analyze the coronavirus pandemic. Some of them are informative; others make misleading claims. The purpose of this blog post is to show how to analyze coronavirus data with pandas. I also answer a question that I’ve been asking for a while: “How many people are currently sick from coronavirus in a certain country?”

To run the examples, download this Jupyter notebook.

Dataset

The Coronavirus disease 2019 (COVID-19) time series dataset lists confirmed cases, reported deaths and reported recoveries. The data is in CSV format and is updated daily. It tracks the number of people affected by COVID-19 worldwide, including:

  • confirmed tested cases of Coronavirus infection,
  • the number of people who have reportedly died while sick with Coronavirus,
  • the number of people who have reportedly recovered from it.

Getting coronavirus data into a pandas DataFrame is as simple as reading it from a file. Let’s define the URL of the CSV file and load the data into a DataFrame.

import pandas as pd
source_url = "https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv"
df = pd.read_csv(source_url)
df.shape

We convert the date strings to the datetime type and sort the entries by date.

df.Date = pd.to_datetime(df.Date)
df = df.sort_values('Date').reset_index(drop=True)
df.head()
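As a possible shortcut, pd.read_csv can parse the date column while loading. The sketch below is only an illustration: it reads a tiny in-memory CSV with made-up numbers (an assumption, so that it runs offline), and the name sample is hypothetical. The column names mirror the ones used in this post.

```python
import io

import pandas as pd

# Made-up sample that mimics the column layout used in this post.
sample_csv = io.StringIO(
    "Date,Country,Confirmed,Recovered,Deaths\n"
    "2020-01-23,Italy,0,0,0\n"
    "2020-01-22,Italy,0,0,0\n"
    "2020-01-22,US,1,0,0\n"
    "2020-01-23,US,1,0,0\n"
)

# parse_dates converts the Date column to datetime64 during loading.
sample = pd.read_csv(sample_csv, parse_dates=["Date"])

# Sort by date (then by country for a stable order) and rebuild the index.
sample = sample.sort_values(["Date", "Country"]).reset_index(drop=True)
print(sample.dtypes)
```

Sorting by both Date and Country is a design choice of this sketch; sorting by Date alone, as in the post, works as well.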

There are 67 entries for each country, one per day from 22 January 2020 to 28 March 2020.

df.Country.value_counts()
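To double-check the date range as well as the row count, one option is to aggregate the Date column per country. This is a minimal sketch on made-up toy data (two countries, three days each); the names toy, coverage and same_length are hypothetical.

```python
import pandas as pd

# Made-up toy data: two countries, three days each.
dates = pd.date_range("2020-01-22", periods=3, freq="D")
toy = pd.DataFrame({
    "Date": list(dates) * 2,
    "Country": ["US"] * 3 + ["Italy"] * 3,
    "Confirmed": [1, 1, 2, 0, 0, 0],
})

# First date, last date and number of rows for each country.
coverage = toy.groupby("Country")["Date"].agg(["min", "max", "count"])
print(coverage)

# True when every country has the same number of rows.
same_length = coverage["count"].nunique() == 1
```

On the real DataFrame, the same aggregation could be run as df.groupby('Country')['Date'].agg(['min', 'max', 'count']).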

How many people are currently sick from coronavirus in a certain country

The media mostly report the number of new coronavirus infections and disregard the people who have recovered. The number of people who are currently sick matters because high growth of confirmed cases may push hospitals past their tipping point.

Let’s calculate these statistics and look closely at the US, Italy, South Korea and China.

countries = ['US', 'Italy', 'Korea, South', 'China']

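Let’s do the math: the number of people currently sick in a country is the number of confirmed cases minus the recoveries and the deaths. The column is named n_hospitalized, although it counts all active cases, not only hospital patients.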

df.loc[:, 'n_hospitalized'] = df.Confirmed - df.Recovered - df.Deaths
df.head()
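The arithmetic is easy to verify by hand. The sketch below uses made-up numbers for a hypothetical country "A"; the column name n_active is also hypothetical and plays the role of n_hospitalized above.

```python
import pandas as pd

# Made-up cumulative counts for two consecutive days.
toy = pd.DataFrame({
    "Country": ["A", "A"],
    "Confirmed": [100, 150],
    "Recovered": [20, 40],
    "Deaths": [5, 10],
})

# Active cases = confirmed - recovered - deaths.
toy["n_active"] = toy["Confirmed"] - toy["Recovered"] - toy["Deaths"]
print(toy["n_active"].tolist())  # [75, 100]
```

On the first day, 100 - 20 - 5 = 75 people are still sick; on the second day, 150 - 40 - 10 = 100.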

In the plots below, we can observe the staggering growth in the number of coronavirus patients in the US and Italy. As the media reported, South Korea and China were more successful in combating the coronavirus, and the numbers support this: both countries were able to flatten the curve.

import matplotlib.pyplot as plt
# One subplot per country, side by side.
fig, ax = plt.subplots(1, 4, figsize=(28, 7))
for i in range(4):
    country = countries[i]
    ax[i].set_title('Corona virus patients in %s' % country)
    df[df.Country == country][['Date', 'n_hospitalized']].plot(ax=ax[i], x='Date')

The dataset has cumulative numbers for each country. Let’s calculate the number of confirmed cases, recoveries and deaths per day for each country.

df.loc[:, 'n_confirmed_per_day'] = df.sort_values('Date').groupby('Country')['Confirmed'].diff().fillna(0).astype(int)
df.loc[:, 'n_recovered_per_day'] = df.sort_values('Date').groupby('Country')['Recovered'].diff().fillna(0).astype(int)
df.loc[:, 'n_deaths_per_day'] = df.sort_values('Date').groupby('Country')['Deaths'].diff().fillna(0).astype(int)
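To see what the grouped diff() does, here is a minimal sketch on made-up cumulative counts for two hypothetical countries, "A" and "B". The differences are taken within each country, so the first row of "B" is never compared with a row of "A".

```python
import pandas as pd

# Made-up cumulative counts, already sorted by date.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-03-01", "2020-03-01",
        "2020-03-02", "2020-03-02",
        "2020-03-03", "2020-03-03",
    ]),
    "Country": ["A", "B", "A", "B", "A", "B"],
    "Confirmed": [10, 100, 15, 100, 25, 130],
})

# diff() runs per country; each country's first day has no previous
# value, so it yields NaN, which fillna(0) turns into 0.
toy["n_confirmed_per_day"] = (
    toy.groupby("Country")["Confirmed"].diff().fillna(0).astype(int)
)
print(toy["n_confirmed_per_day"].tolist())  # [0, 0, 5, 0, 10, 30]
```

Note that fillna(0) reports zero new cases on each country's first day, even when the cumulative count on that day is already above zero.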

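With daily numbers, we can analyze whether more people are getting sick from coronavirus than are getting over it. A positive value means the number of active cases grew that day; a negative value means it shrank.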

df.loc[:, 'n_hospitalized_per_day'] = df.n_confirmed_per_day - df.n_recovered_per_day - df.n_deaths_per_day
df[df.Country == "US"].head()
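The net daily change is the same quantity as the day-over-day difference of the active cases, because (C_t - C_{t-1}) - (R_t - R_{t-1}) - (D_t - D_{t-1}) equals the change in C - R - D. The sketch below checks this on made-up numbers; all names in it (toy, n_active, net_from_daily, net_from_active) are hypothetical.

```python
import pandas as pd

# Made-up cumulative counts for one hypothetical country.
toy = pd.DataFrame({
    "Country": ["A"] * 3,
    "Confirmed": [100, 150, 180],
    "Recovered": [20, 40, 90],
    "Deaths": [5, 10, 12],
})
toy["n_active"] = toy["Confirmed"] - toy["Recovered"] - toy["Deaths"]

# Route 1: daily differences of each column, then combine them.
cols = ["Confirmed", "Recovered", "Deaths"]
daily = toy.groupby("Country")[cols].diff().fillna(0).astype(int)
net_from_daily = daily["Confirmed"] - daily["Recovered"] - daily["Deaths"]

# Route 2: daily difference of the active cases directly.
net_from_active = (
    toy.groupby("Country")["n_active"].diff().fillna(0).astype(int)
)

print(net_from_daily.tolist())   # [0, 25, -22]
print(net_from_active.tolist())  # [0, 25, -22]
```

On the third day, recoveries outnumber new cases, so the net change is negative and the number of active cases falls from 100 to 78.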
fig, ax = plt.subplots(1, 4, figsize=(28, 7))
countries = ['US', 'Italy', 'Korea, South', 'China']
for i in range(4):
    country = countries[i]
    ax[i].set_title('Corona virus patients in %s' % country)
    df[df.Country == country][['Date', 'n_hospitalized_per_day']].plot(ax=ax[i], x='Date')
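A slightly more compact way to write the plotting loop is to pair each axis with its country using zip. The sketch below is one possible variant, not the code used for the figures in this post: it runs on made-up toy data for two hypothetical countries, and the zero line and the "Agg" backend are additions of this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up net daily changes for two hypothetical countries.
dates = pd.date_range("2020-03-01", periods=3, freq="D")
toy = pd.DataFrame({
    "Date": list(dates) * 2,
    "Country": ["A"] * 3 + ["B"] * 3,
    "n_hospitalized_per_day": [5, 10, -2, 0, 3, 1],
})
countries = ["A", "B"]

fig, axes = plt.subplots(1, len(countries), figsize=(7 * len(countries), 7))
for ax, country in zip(axes, countries):
    subset = toy[toy["Country"] == country]
    subset.plot(ax=ax, x="Date", y="n_hospitalized_per_day", legend=False)
    # Zero line: below it, the number of active cases is shrinking.
    ax.axhline(0, color="grey", linewidth=1)
    ax.set_title("Net daily change in %s" % country)
```

The zero line makes it easier to spot the days on which more people recovered or died than were newly confirmed.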

We can observe that the growth of positive coronavirus cases in the US is staggering. The pandemic hasn’t reached its peak in the US. The situation in Italy also shows an upward trend, with a few downward spikes. In South Korea and China, more people are getting over coronavirus than getting sick.

Conclusion

This was by no means an in-depth analysis of the coronavirus pandemic. It merely serves as a template that you can use to replicate a study or answer your own questions about the pandemic. Take a look at Exploratory Data Analysis with pandas to get some ideas on how to plot the data. Pandas analytics server can be used to make your analysis interactive.

Before you go

Follow me on Twitter, where I regularly tweet about Data Science and Machine Learning.

Photo by Courtney Hedger on Unsplash

Written by

Senior Data Scientist, tweeting twitter.com/romanorac.
