Pandas analysis of coronavirus pandemic

Coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. Many blog posts analyze the coronavirus pandemic. Some of them are informative, others make misleading claims. The purpose of this blog post is to show how to analyze coronavirus data with pandas. I also answer a question that I’ve been asking for a while: “How many people are currently sick from coronavirus in a given country?”
To run the examples, download this Jupyter notebook.
Dataset
The Coronavirus disease 2019 (COVID-19) time series lists confirmed cases, reported deaths and reported recoveries. The data is in CSV format and is updated daily. It tracks the number of people affected by COVID-19 worldwide, including:
- confirmed tested cases of Coronavirus infection,
- the number of people who have reportedly died while sick with Coronavirus,
- the number of people who have reportedly recovered from it.
Getting the coronavirus data into a pandas DataFrame is as simple as reading it from a file. Let’s define the URL of the CSV file and load the data into a DataFrame.
import pandas as pd
source_url = "https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv"
df = pd.read_csv(source_url)
df.shape
We convert the date strings to the datetime type and sort the entries by date.
df.Date = pd.to_datetime(df.Date)
df = df.sort_values('Date').reset_index(drop=True)
df.head()
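The loading and date-conversion steps can also be tried offline. The sketch below assumes the file has the columns Date, Country, Confirmed, Recovered and Deaths (the names used throughout this post) and reads a tiny made-up sample from an in-memory string instead of the URL. The numbers are illustrative, not real data.

```python
import io

import pandas as pd

# Made-up sample that mimics the assumed layout of countries-aggregated.csv.
sample_csv = io.StringIO(
    "Date,Country,Confirmed,Recovered,Deaths\n"
    "2020-01-23,Italy,0,0,0\n"
    "2020-01-22,US,1,0,0\n"
    "2020-01-22,Italy,0,0,0\n"
    "2020-01-23,US,1,0,0\n"
)

sample = pd.read_csv(sample_csv)

# Dates arrive as plain strings, so convert them before sorting.
sample["Date"] = pd.to_datetime(sample["Date"])

# mergesort is stable: rows that share a date keep their original order.
sample = sample.sort_values("Date", kind="mergesort").reset_index(drop=True)

print(sample.dtypes)
print(sample)
```

Passing parse_dates=["Date"] to pd.read_csv would do the conversion during loading; the two-step version is kept here because it mirrors the code above.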

There are 67 entries for each country, one per day from 22 January 2020 to 28 March 2020.
df.Country.value_counts()
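A row count alone does not show whether every country covers the same dates without gaps. The sketch below first confirms the day count for the range mentioned above (10 days in January, 29 in February and 28 in March), then runs a per-country coverage check on a small made-up frame. The toy frame and the names coverage, n_days and complete are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Inclusive number of days between the two dates mentioned in the post.
n_days_in_range = len(pd.date_range("2020-01-22", "2020-03-28"))
print(n_days_in_range)  # 67

# Made-up data with the same column names as the real dataset.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-01-22", "2020-01-23", "2020-01-24",
        "2020-01-22", "2020-01-23", "2020-01-24",
    ]),
    "Country": ["US"] * 3 + ["Italy"] * 3,
    "Confirmed": [1, 1, 2, 0, 0, 0],
})

# Number of rows plus first and last date for every country.
coverage = toy.groupby("Country")["Date"].agg(["count", "min", "max"])

# Days spanned, counting both endpoints.
coverage["n_days"] = (coverage["max"] - coverage["min"]).dt.days + 1

# A country is complete when it has exactly one row per day in its span.
coverage["complete"] = coverage["count"] == coverage["n_days"]
print(coverage)
```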

How many people are currently sick from coronavirus in a certain country
The media mostly report the number of new coronavirus infections and disregard the people who have recovered. The question is important because rapid growth in the number of sick people may push hospitals past their tipping point.
Let’s calculate these statistics and look closely at the US, Italy, South Korea and China.
countries = ['US', 'Italy', 'Korea, South', 'China']
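Let’s do the math: the number of people currently sick in a country is the number of confirmed cases minus the recoveries and the deaths. The column is named n_hospitalized, although it counts all active cases rather than only patients in hospital.
df.loc[:, 'n_hospitalized'] = df.Confirmed - df.Recovered - df.Deaths
df.head()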
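As a worked example of this subtraction, here is a minimal sketch on three made-up rows for a hypothetical country "A". The column names match the dataset, the values do not come from it.

```python
import pandas as pd

# Made-up cumulative counts for a hypothetical country "A".
toy = pd.DataFrame({
    "Country": ["A", "A", "A"],
    "Confirmed": [10, 25, 40],
    "Recovered": [0, 5, 15],
    "Deaths": [0, 1, 2],
})

# Active cases: everyone confirmed who has neither recovered nor died.
toy["n_hospitalized"] = toy["Confirmed"] - toy["Recovered"] - toy["Deaths"]

print(toy["n_hospitalized"].tolist())  # [10, 19, 23]
```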

In the plot below, we can observe the staggering growth in the number of coronavirus patients in the US and Italy. As we hear from the media, South Korea and China were more successful in combating the coronavirus, and the numbers support this: both were able to flatten the curve.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 4, figsize=(28, 7))
for i in range(4):
    country = countries[i]
    ax[i].set_title('Corona virus patients in %s' % country)
    df[df.Country == country][['Date', 'n_hospitalized']].plot(ax=ax[i], x='Date')

The dataset has cumulative numbers for each country. Let’s calculate the number of confirmed cases, recoveries and deaths per day for each country.
df.loc[:, 'n_confirmed_per_day'] = df.sort_values('Date').groupby('Country')['Confirmed'].diff().fillna(0).astype(int)
df.loc[:, 'n_recovered_per_day'] = df.sort_values('Date').groupby('Country')['Recovered'].diff().fillna(0).astype(int)
df.loc[:, 'n_deaths_per_day'] = df.sort_values('Date').groupby('Country')['Deaths'].diff().fillna(0).astype(int)
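To see what the grouped diff does, here is a minimal sketch on made-up data for two hypothetical countries, "A" and "B". The differences are taken within each country, so the first row of "B" is not compared with a row of "A".

```python
import pandas as pd

# Made-up cumulative counts, interleaved by date as in the real file.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-03-01", "2020-03-01",
        "2020-03-02", "2020-03-02",
        "2020-03-03", "2020-03-03",
    ]),
    "Country": ["A", "B", "A", "B", "A", "B"],
    "Confirmed": [5, 100, 8, 110, 15, 125],
})

# diff() runs inside each country group; the result is aligned back to
# the original rows through the index.
toy["n_confirmed_per_day"] = (
    toy.sort_values("Date")
    .groupby("Country")["Confirmed"]
    .diff()
    .fillna(0)
    .astype(int)
)

print(toy["n_confirmed_per_day"].tolist())  # [0, 0, 3, 10, 7, 15]
```

Note that the first day of every country becomes 0 because diff() has no previous row to subtract, so any cases already present on the first day are not counted as new.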
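With the daily numbers, we can check whether more people are getting sick from coronavirus each day than are recovering from it or dying. A positive value means the number of sick people grew that day, a negative value means it shrank.
df.loc[:, 'n_hospitalized_per_day'] = df.n_confirmed_per_day - df.n_recovered_per_day - df.n_deaths_per_day
df[df.Country == "US"].head()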
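The daily net change is simply the day-over-day difference of the number of people currently sick, because the difference of (confirmed - recovered - deaths) equals the difference of each term taken separately. A minimal sketch that checks this on made-up numbers for a hypothetical country "A" (the names active, daily and net are illustrative):

```python
import pandas as pd

# Made-up cumulative counts for a hypothetical country "A".
toy = pd.DataFrame({
    "Country": ["A"] * 4,
    "Confirmed": [10, 25, 40, 45],
    "Recovered": [0, 5, 15, 30],
    "Deaths": [0, 1, 2, 3],
})

# People currently sick on each day.
active = toy["Confirmed"] - toy["Recovered"] - toy["Deaths"]

# Per-day counts, computed the same way as in the post.
daily = {
    col: toy.groupby("Country")[col].diff().fillna(0).astype(int)
    for col in ["Confirmed", "Recovered", "Deaths"]
}
net = daily["Confirmed"] - daily["Recovered"] - daily["Deaths"]

print(active.tolist())  # [10, 19, 23, 12]
print(net.tolist())     # [0, 9, 4, -11]

# The net daily change matches the change in active cases.
print((net == active.diff().fillna(0)).all())  # True
```

On the last day the value is negative: 5 new cases against 15 recoveries and 1 death, so the number of sick people drops by 11.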

fig, ax = plt.subplots(1, 4, figsize=(28, 7))
countries = ['US', 'Italy', 'Korea, South', 'China']
for i in range(4):
    country = countries[i]
    ax[i].set_title('Corona virus patients in %s' % country)
    df[df.Country == country][['Date', 'n_hospitalized_per_day']].plot(ax=ax[i], x='Date')

We can observe that the growth of positive coronavirus cases in the US is staggering. The pandemic hasn’t reached its peak in the US. Italy is also on an uptrend, with a few negative spikes. In South Korea and China, more people are getting over coronavirus than getting sick.
Conclusion
This was not an in-depth analysis of the coronavirus pandemic by any means. It merely serves as a template that you can use to replicate a study or answer your own question about the pandemic. Take a look at Exploratory Data Analysis with pandas to get some ideas on how to plot the data. Pandas analytics server can be used to make your analysis interactive.