Exploratory Data Analysis on Covid-19 dataset

As part of my course project for Data Analysis with Python, I first had to find a real-world dataset and perform an exploratory data analysis on it. Without much thought, I decided to work on the most talked-about topic in the world today: Covid-19. I downloaded the latest Covid-19 dataset from https://ourworldindata.org/coronavirus-source-data, which gives figures for every country starting from February 24, 2020. I also downloaded a second dataset from https://www.kaggle.com/fernandol/countries-of-the-world, which contains general (non-Covid) information about each country. My plan was to merge selected columns from both datasets for the analysis.

For this project, I used Pandas, Matplotlib, and Seaborn.

Pandas is a Python library that provides data structures and data manipulation tools for working with tabular or heterogeneous data. Matplotlib is a Python plotting library designed for basic plots (bars, lines, scatter plots, etc.). Seaborn is built on top of Matplotlib; it offers a variety of ready-made statistical visualizations, for example to summarize data or show a distribution, and usually needs less code to produce them.

Data Preparation and Cleaning:

Let me first load the required datasets into Pandas dataframes:
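A minimal sketch of the loading step. To keep it runnable without the downloads, it reads two tiny in-memory CSV strings with made-up values; the variable names and the handful of columns shown are assumptions, not the full schema of either file.

```python
import io

import pandas as pd

# Toy CSV text standing in for the two downloaded files (values are made up).
# With the real downloads, pass the file path to pd.read_csv instead.
covid_csv = io.StringIO(
    "location,continent,date,total_cases,total_deaths\n"
    "Country A,Asia,2021-06-08,100,10\n"
    "Country A,Asia,2021-06-09,120,12\n"
    "Country B,Europe,2021-06-09,45,2\n"
)
countries_csv = io.StringIO(
    "Country,Region,Population\n"
    "Country A,ASIA,1000\n"
    "Country B,EUROPE,2000\n"
)

# parse_dates turns the date column into real datetimes instead of strings
covid_raw_df = pd.read_csv(covid_csv, parse_dates=["date"])
countries_raw_df = pd.read_csv(countries_csv)

print(covid_raw_df.head())
print(countries_raw_df.head())
```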


This is followed by cleaning the dataset. This process involves handling missing and invalid data, grouping by certain columns, selecting required columns, and finally merging the datasets to get a finalized dataframe.

For the first dataset:
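A rough sketch of what this cleaning might look like on a toy version of the Covid data. The choice of columns to keep, and the idea that aggregate rows (such as a "World" row) can be recognized by a missing continent, are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw Covid data (made-up values).
covid_raw_df = pd.DataFrame({
    "location": ["A", "A", "B", "World"],
    "continent": ["Asia", "Asia", "Europe", None],
    "date": ["2021-06-08", "2021-06-09", "2021-06-09", "2021-06-09"],
    "total_cases": [100.0, 120.0, None, 500.0],
    "new_cases": [5.0, 20.0, 3.0, 40.0],
    "hospital_beds_per_thousand": [1.2, 1.2, 4.5, None],
})

# Keep only the columns needed later; .copy() avoids SettingWithCopyWarning
wanted = ["location", "continent", "date", "total_cases", "hospital_beds_per_thousand"]
covid_df = covid_raw_df[wanted].copy()

# Drop rows that are not individual countries (assumed: continent is missing)
covid_df = covid_df.dropna(subset=["continent"])

# Convert the date strings to datetimes so they sort chronologically
covid_df["date"] = pd.to_datetime(covid_df["date"])

# Count the missing values that remain in each column
print(covid_df.isna().sum())
```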

The first dataframe:


For the second dataset:
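A small sketch of preparing the second dataset, again on toy data. It assumes the country names may carry stray whitespace and that the name column has to be renamed to match the Covid data before merging; the column names are illustrative.

```python
import pandas as pd

# Toy stand-in for the countries-of-the-world data (made-up values).
countries_raw_df = pd.DataFrame({
    "Country": ["A ", " B", "C"],
    "Region": ["ASIA ", "EUROPE", "OCEANIA "],
    "Population": [1000, 2000, 3000],
})

# Select the columns of interest and work on a copy
countries_df = countries_raw_df[["Country", "Population"]].copy()

# Remove leading/trailing whitespace so the names match the Covid data
countries_df["Country"] = countries_df["Country"].str.strip()

# Use the same key name as the Covid dataframe to simplify the merge
countries_df = countries_df.rename(
    columns={"Country": "location", "Population": "population"}
)

print(countries_df)
```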

The second dataframe:


It looks like the second dataframe does not need any further cleaning, so let's move on to the next step.

Next, I create a new dataframe that holds the cumulative numbers for each country, which I will use later. The original Covid dataset has one row per country per date, from 22nd Feb 2020 to 9th June 2021, so the figures for 9th June are the latest cumulative totals. Let's call this the third dataframe.

Data cleaning:
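One possible way to build this third dataframe is to sort by date and keep the last available value per country. This is a sketch on made-up numbers; the column names are assumptions.

```python
import pandas as pd

# Toy daily data, deliberately out of date order (made-up values).
covid_df = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-06-09", "2021-06-08", "2021-06-08", "2021-06-09"]),
    "total_cases": [120.0, 100.0, 40.0, 45.0],
    "total_deaths": [12.0, 10.0, 1.0, 2.0],
})

# Sort chronologically, then take the last non-missing value for each country
latest_df = (
    covid_df.sort_values("date")
    .groupby("location", as_index=False)[["total_cases", "total_deaths"]]
    .last()
)

print(latest_df)
```

Because the totals are cumulative, taking the last value per country gives the figure as of the final date in the data.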

The third dataframe:


Now, I merge the three dataframes to get my finalized dataframe:
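A sketch of the merge, assuming all three dataframes share a `location` key and that only countries present in every table should be kept (an inner join). The dataframe names and values are made up.

```python
import pandas as pd

# Toy versions of the three dataframes (made-up values).
covid_static_df = pd.DataFrame({
    "location": ["A", "B", "C"],
    "continent": ["Asia", "Europe", "Oceania"],
    "hospital_beds_per_thousand": [1.2, 4.5, 2.0],
})
countries_df = pd.DataFrame({
    "location": ["A", "B", "D"],
    "population": [1000, 2000, 3000],
})
latest_df = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_cases": [120.0, 45.0, 9.0],
    "total_deaths": [12.0, 2.0, 0.0],
})

# Chain two inner merges on the shared key; unmatched countries are dropped
final_df = (
    covid_static_df
    .merge(countries_df, on="location", how="inner")
    .merge(latest_df, on="location", how="inner")
)

print(final_df)
```

Here "C" and "D" drop out because each is missing from one of the tables, which is worth checking on real data where country names are spelled differently across sources.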

My finalized dataframe:


Now let's find some details about our dataframe:
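The usual Pandas calls for this kind of overview are `shape`, `info()`, and `describe()`. A minimal sketch on a toy dataframe with made-up values:

```python
import pandas as pd

# Toy stand-in for the finalized dataframe (made-up values).
final_df = pd.DataFrame({
    "location": ["A", "B", "C"],
    "continent": ["Asia", "Europe", "Asia"],
    "population": [1000, 2000, 4000],
    "total_cases": [120.0, 45.0, 300.0],
})

print(final_df.shape)       # (number of rows, number of columns)
final_df.info()             # column names, dtypes, and non-null counts
print(final_df.describe())  # count, mean, std, min, quartiles, max per numeric column
print(final_df["continent"].nunique())  # number of distinct continents
```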


From the above computations, we learn a lot about our dataframe. For example, it covers 189 countries across 14 columns, and most of the columns are of float datatype. We also get summary statistics. For example, the mean of total cases across all countries is 1007963 and the standard deviation is 3713711. The standard deviation is more than three times the mean, which tells us that case counts vary enormously from country to country: a few countries with very large totals pull the mean well above the typical value.
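One way to quantify this spread is the ratio of the standard deviation to the mean (the coefficient of variation). The sketch below uses the two figures quoted above, followed by a small made-up series showing how a single large value can push the standard deviation above the mean.

```python
import pandas as pd

# The two statistics quoted above for total cases
mean_cases = 1_007_963
std_cases = 3_713_711

# Coefficient of variation: standard deviation relative to the mean
cv = std_cases / mean_cases
print(round(cv, 2))

# Made-up series: four small values and one very large one
toy = pd.Series([10, 20, 30, 40, 5000])
print(toy.mean(), toy.median(), toy.std())
```

In the toy series the median stays at the typical value while the mean and standard deviation are dominated by the outlier, so comparing the mean and the median is a quick check for this kind of skew.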

Similarly, making some other calculations:
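For instance, the country with the highest value in a column can be looked up with `idxmax()`. A sketch on toy data; the column names and values are assumptions.

```python
import pandas as pd

# Toy stand-in for the finalized dataframe, indexed by country (made-up values).
final_df = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_cases": [120.0, 300.0, 45.0],
    "total_deaths": [12.0, 30.0, 2.0],
}).set_index("location")

# idxmax returns the index label (here the country) of the largest value
most_cases = final_df["total_cases"].idxmax()
most_deaths = final_df["total_deaths"].idxmax()

print(most_cases, final_df.loc[most_cases, "total_cases"])
print(most_deaths, final_df.loc[most_deaths, "total_deaths"])
```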

Since both the highest number of cases and the highest number of deaths are in the United States, it is the most affected country in absolute terms.

Now, plotting different graphs:
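A sketch of a line graph of totals per continent, drawn with Matplotlib on made-up numbers. The three metric columns are assumptions based on the discussion that follows, and the `Agg` backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the finalized dataframe (made-up values).
final_df = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "continent": ["Asia", "Asia", "Europe", "Oceania"],
    "total_tests": [900.0, 700.0, 1200.0, 100.0],
    "total_cases": [120.0, 300.0, 250.0, 10.0],
    "total_deaths": [12.0, 20.0, 40.0, 1.0],
})

# Sum each metric over the countries of a continent
metrics = ["total_tests", "total_cases", "total_deaths"]
continent_df = final_df.groupby("continent")[metrics].sum()

# One line per metric, with continents on the x-axis
fig, ax = plt.subplots(figsize=(8, 4))
for col in metrics:
    ax.plot(continent_df.index, continent_df[col], marker="o", label=col)
ax.set_xlabel("Continent")
ax.set_ylabel("Count")
ax.set_title("Totals by continent")
ax.legend()
```

When the metrics differ by orders of magnitude, the smallest one is squeezed against the x-axis; calling `ax.set_yscale("log")` is one way to keep all three readable on a shared axis.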


From the graph, we can see that the highest number of tests has been carried out in Asia and the lowest in Oceania. Similarly, the highest number of cases is in Asia and the lowest in Oceania. We cannot tell the same for total deaths, however, because the line is almost flat. So, let's use a barplot to show the total deaths across the continents.
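A sketch of such a barplot with Matplotlib, using made-up numbers. Giving the deaths their own axis means they are no longer flattened by the much larger test and case counts.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the finalized dataframe (made-up values).
final_df = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "continent": ["Asia", "Asia", "Europe", "Oceania"],
    "total_deaths": [12.0, 20.0, 40.0, 1.0],
})

# Total deaths per continent, largest first
deaths_by_continent = (
    final_df.groupby("continent")["total_deaths"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(deaths_by_continent.index, deaths_by_continent.values)
ax.set_xlabel("Continent")
ax.set_ylabel("Total deaths")
ax.set_title("Total deaths by continent")
```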


From the barplot, it is clear that the total number of deaths is the highest in Europe and the lowest in Oceania. Hence, Oceania seems to be the least affected continent on all three measures.

Using a scatterplot from the Seaborn library to plot the total cases and the total population of the continents:


So, among the 10 most populous countries, China has the fewest deaths and the USA has the most.

Multiple Plotting:


Now it’s time to take our analysis to the next step. Let's ask some questions about our data and answer them using visualizations and calculations.

1. How many of the 50 countries with the lowest hospital_beds_per_thousand are also among the 50 countries with the highest cases per thousand?

2. How many of the 50 countries with the lowest hospital_beds_per_thousand are also among the 50 countries with the highest deaths per thousand?

3. How many of the 50 countries with the lowest hospital_beds_per_thousand are also among the 50 countries with the lowest tests per thousand?

To answer these, let's start off by adding three new columns to our dataframe: tests_per_thousand, deaths_per_thousand, and cases_per_thousand.
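A sketch of the three derived columns, assuming the dataframe has `total_tests`, `total_deaths`, `total_cases`, and `population` columns (toy values below). Each rate is the total divided by the population, scaled to 1,000 people.

```python
import pandas as pd

# Toy stand-in for the finalized dataframe (made-up values).
final_df = pd.DataFrame({
    "location": ["A", "B"],
    "population": [1000, 2000],
    "total_tests": [500.0, 100.0],
    "total_cases": [120.0, 40.0],
    "total_deaths": [10.0, 2.0],
})

# For each total, add a matching per-thousand column
for metric in ["tests", "deaths", "cases"]:
    final_df[f"{metric}_per_thousand"] = (
        1000 * final_df[f"total_{metric}"] / final_df["population"]
    )

print(final_df[["location", "tests_per_thousand", "deaths_per_thousand", "cases_per_thousand"]])
```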

Now I’ll create separate dataframes for 50 countries with the lowest hospital beds per thousand, 50 countries with the highest cases per thousand, 50 countries with the highest deaths per thousand, and 50 countries with the lowest tests per thousand:
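`nsmallest` and `nlargest` are a convenient way to pick such groups. The sketch below uses a four-country toy table with a group size of 2 so the result is easy to check by eye; for the real data the group size would be 50. All values are made up.

```python
import pandas as pd

# Toy stand-in for the finalized dataframe (made-up values).
final_df = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "hospital_beds_per_thousand": [0.5, 4.0, 1.0, 6.0],
    "cases_per_thousand": [90.0, 20.0, 10.0, 80.0],
    "deaths_per_thousand": [1.0, 0.2, 3.0, 2.5],
    "tests_per_thousand": [50.0, 400.0, 30.0, 300.0],
})

N = 2  # group size for the toy example; 50 for the real data

least_beds = final_df.nsmallest(N, "hospital_beds_per_thousand")
most_cases = final_df.nlargest(N, "cases_per_thousand")
most_deaths = final_df.nlargest(N, "deaths_per_thousand")
least_tests = final_df.nsmallest(N, "tests_per_thousand")

print(least_beds["location"].tolist())
print(most_cases["location"].tolist())
```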

Finally, let's calculate the required counts:
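One simple way to count the overlap between two groups is `isin`. This is a sketch with toy group dataframes; `count_overlap` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd

# Toy group dataframes holding only the country names (made-up groups).
least_beds = pd.DataFrame({"location": ["A", "C"]})
most_cases = pd.DataFrame({"location": ["A", "D"]})
most_deaths = pd.DataFrame({"location": ["C", "D"]})
least_tests = pd.DataFrame({"location": ["C", "A"]})


def count_overlap(left, right):
    """Return how many countries in `left` also appear in `right`."""
    return int(left["location"].isin(right["location"]).sum())


q1 = count_overlap(least_beds, most_cases)   # fewest beds and most cases
q2 = count_overlap(least_beds, most_deaths)  # fewest beds and most deaths
q3 = count_overlap(least_beds, least_tests)  # fewest beds and fewest tests

print(q1, q2, q3)
```

An inner merge on `location` would give the same counts and also returns the matching rows, which helps when you want to see which countries overlap.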

4. How many countries are there that have the highest number of smokers and also have the highest cases?

5. How many countries fall under the group of the highest number of smokers and also have the highest number of deaths?

From questions 4 and 5, we can see that more than half of the countries with the highest numbers of cases and deaths are also among the countries with the most smokers.

6. Create a dataframe consisting of the 10 countries with the highest number of cases.


7. Create a dataframe consisting of the 10 countries with the highest number of deaths.


Creating a multiplot to plot these two values:
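A sketch of a side-by-side plot using `plt.subplots`, on a four-country toy table with a group size of 3 (10 for the real data). The values and column names are made up, and the `Agg` backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the finalized dataframe (made-up values).
final_df = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "total_cases": [120.0, 300.0, 250.0, 10.0],
    "total_deaths": [12.0, 20.0, 40.0, 1.0],
})

N = 3  # group size for the toy example; 10 for the real data
top_cases = final_df.nlargest(N, "total_cases")
top_deaths = final_df.nlargest(N, "total_deaths")

# One row, two columns of axes: cases on the left, deaths on the right
fig, (ax_cases, ax_deaths) = plt.subplots(1, 2, figsize=(12, 4))

ax_cases.bar(top_cases["location"], top_cases["total_cases"])
ax_cases.set_title("Countries with the most cases")
ax_cases.set_ylabel("Total cases")

ax_deaths.bar(top_deaths["location"], top_deaths["total_deaths"])
ax_deaths.set_title("Countries with the most deaths")
ax_deaths.set_ylabel("Total deaths")

fig.tight_layout()
```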


The graphs give us an even clearer idea about the respective countries.

This brings my exploratory analysis of the Covid dataset to an end. The analysis let me test my skills and apply the tools and techniques I learned throughout the Zero to Pandas lessons. I have learned a lot from this project and from the Data Analysis course as a whole, and I feel more confident and comfortable working with data. The project has also given me plenty of ideas about what one can do with data, and I'm very excited about my next steps! I want to conclude by thanking the instructor and the entire Jovian team for making this awesome course available to aspirants like me. I truly appreciate your efforts and the time you spent preparing this course.



