The coronavirus (COVID-19) outbreak has brought the whole world to a standstill, with complete lockdowns in several countries. We salute every health and security professional. Here we will attempt a simple data analysis of a COVID-19 dataset using Python. https://www.datatobiz.com/blog/unraveling-the-u-meaning-from-covid-19-dataset-using-python-a-tutorial-for-beginners/
2. Introduction
Today, we will attempt a simple data analysis of a COVID-19 dataset using Python. The data set is available on Kaggle. The following are the Python libraries we'll be using for this exercise.
Pandas: Open-source Python library that provides a variety of tools for data analysis. Mainly used for data analysis and manipulation.
Seaborn: Another Python library for data visualization, based on Matplotlib. Provides a wide range of graphics for presentation purposes.
Matplotlib: Python library for multi-platform data visualization. Widely used for creating and customizing plots, including interactive visualizations.
3. What Data Does It Hold
The available dataset has details of the number of COVID-19 cases on a daily basis. Let us begin by understanding the columns and what they represent.
Column description for the dataset:
Sno: Serial number.
ObservationDate: Date of observation in mm/dd/yyyy format.
Province/State: Province or state of the case.
Country/Region: Country or region of the case.
Last Update: Time, in UTC, at which the row was last updated.
Confirmed: Cumulative number of confirmed cases.
Deaths: Cumulative number of deaths.
Recovered: Cumulative number of recovered cases.
These are the columns within the file. Most of our work will revolve around three columns: Confirmed, Deaths and Recovered.
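As a minimal sketch of the column layout described above, the snippet below builds a tiny data frame with the same column names and converts ObservationDate from mm/dd/yyyy text into real dates. The rows are hypothetical values made up for illustration; they are not taken from the Kaggle file.

```python
import pandas as pd

# Hypothetical sample rows that follow the column layout described above.
sample = pd.DataFrame({
    "Sno": [1, 2],
    "ObservationDate": ["01/22/2020", "01/23/2020"],
    "Province/State": ["Kerala", "Kerala"],
    "Country/Region": ["India", "India"],
    "Last Update": ["1/22/2020 17:00", "1/23/2020 17:00"],
    "Confirmed": [1, 3],
    "Deaths": [0, 0],
    "Recovered": [0, 1],
})

# Parse the mm/dd/yyyy strings so that dates sort and plot chronologically.
sample["ObservationDate"] = pd.to_datetime(sample["ObservationDate"], format="%m/%d/%Y")

print(sample.dtypes)
```

Parsing the dates is optional for this exercise, but it avoids relying on text ordering when the data is later grouped by date.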
4. Let Us Begin: First, we’ll import our first library, pandas, and read the source file.
import pandas as pd
df = pd.read_csv("covid_19_data.csv")
Now that we have read the data, let us print the head of the file, which shows the first
five rows along with the column names.
df.head()
5. As you can see in the above screenshot, we have printed the top five rows of the data file,
with the columns explained earlier.
Let us now go into some depth, where we can look at the mean and
standard deviation of the data, along with other summary statistics.
df.describe()
6. The describe function in pandas returns basic summary statistics for the numeric columns of the data.
The mean of the confirmed cases is “1972.956586” and the standard deviation is “10807.777684”.
The mean and standard deviation for the Deaths and Recovered columns are listed, too.
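To make the describe() output easier to read, here is a small sketch on made-up numbers (the name toy and its values are assumptions for illustration, not the real dataset). It shows which rows the summary contains and how to pick out the mean and standard deviation of a single column.

```python
import pandas as pd

# Made-up numbers, used only to show the shape of the describe() output.
toy = pd.DataFrame({
    "Confirmed": [10, 20, 30, 40],
    "Deaths": [0, 1, 1, 2],
    "Recovered": [1, 2, 3, 10],
})

summary = toy.describe()

# Each statistic is a row of the summary; each numeric column is a column.
mean_confirmed = summary.loc["mean", "Confirmed"]
std_confirmed = summary.loc["std", "Confirmed"]

print(summary)
print(mean_confirmed, std_confirmed)
```

The standard deviation reported by describe() is the sample standard deviation, the same value that toy["Confirmed"].std() returns.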
Let us now begin plotting the data, which means placing these data points on a graph or histogram.
We have used only the pandas library until now; to proceed, we need to import the other two libraries.
import seaborn as sns
import matplotlib.pyplot as plt
We have now imported all three libraries. We will now plot our data on a graph; the output
will be a figure with three lines, one per column, showing how they move towards the latest date.
plt.figure(figsize = (12,8))
# Select the column before taking the mean so that text columns are not averaged
df.groupby('ObservationDate')['Confirmed'].mean().plot()
df.groupby('ObservationDate')['Recovered'].mean().plot()
df.groupby('ObservationDate')['Deaths'].mean().plot()
7. Code Explanation: plt.figure initializes the plot with the given width and height.
figsize defines the size of the figure; it takes two floats, the width and the height
in inches. If it is not provided, the default comes from
rcParams["figure.figsize"], which is [6.4, 4.8].
Then we have grouped the rows by the ObservationDate column and taken the mean of three
columns: Confirmed, Recovered and Deaths. The observation date runs along the horizontal
axis and the mean count along the vertical axis.
The above code plots the three columns one by one, and the output after execution is
shown in the following image.
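The grouping and plotting step described above can be sketched on a few made-up rows. The data frame toy and its numbers are assumptions for illustration; the sketch also adds axis labels and a legend, which the original code does not have, and uses a non-interactive backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows: two regions reported on each of two dates.
toy = pd.DataFrame({
    "ObservationDate": ["01/22/2020", "01/22/2020", "01/23/2020", "01/23/2020"],
    "Confirmed": [10, 2, 30, 4],
    "Recovered": [1, 0, 5, 1],
    "Deaths": [0, 0, 2, 0],
})
toy["ObservationDate"] = pd.to_datetime(toy["ObservationDate"], format="%m/%d/%Y")

# One row per date, holding the mean of each column over that date's rows.
daily_mean = toy.groupby("ObservationDate")[["Confirmed", "Recovered", "Deaths"]].mean()

fig, ax = plt.subplots(figsize=(12, 8))
daily_mean.plot(ax=ax)  # draws one line per column
ax.set_xlabel("Observation date")
ax.set_ylabel("Mean count per row")
ax.legend()
```

Because the counts are cumulative per region, replacing .mean() with .sum() would give the total across regions for each date, if that is the quantity of interest.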
9. This data reflects the impact of COVID-19 across the globe, spread over three columns. The
same data could be used to build prediction models, but it is quite uncertain and is
not well suited for prediction. Moving on, we will focus on India as the country and
analyze its data.
Country Focus: India
Let us specifically check the data for India.
ind = df[df['Country/Region'] == 'India']
ind.head()
The above lines of code keep only the rows with India as Country/Region and place those
rows in “ind”; calling head() on it shows the top five rows. Check
the screenshot attached below.
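The filter above works through a boolean mask: the comparison produces one True or False per row, and indexing with that mask keeps only the True rows. A minimal sketch on made-up rows (toy, is_india and ind_toy are illustrative names, not part of the original code):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Country/Region": ["India", "Italy", "India"],
    "Confirmed": [3, 20, 5],
})

# The comparison yields one boolean per row.
is_india = toy["Country/Region"] == "India"

# Indexing with the mask keeps the matching rows and all of the columns.
ind_toy = toy[is_india]

print(is_india.tolist())
print(ind_toy)
```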
10. Let’s plot the data for India:
plt.figure(figsize = (12,8))
# Select the column before taking the mean so that text columns are not averaged
ind.groupby('ObservationDate')['Confirmed'].mean().plot()
ind.groupby('ObservationDate')['Recovered'].mean().plot()
ind.groupby('ObservationDate')['Deaths'].mean().plot()
11. Similar to the earlier example, this code will return a figure with the columns plotted on
it. The output for the above code will be:
12. This is how data is represented graphically, making it easy to read and understand.
Moving forward, we will implement a scatter plot using the Seaborn library. Our next figure
will place data points with respect to the sex of the patient.
Code: First, we’ll make some minor changes to the values in the sex column.
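# df2 is assumed to be the patient-level data frame (with sex, longitude and latitude columns) used by the scatter plot
df2['sex'] = df2['sex'].replace(to_replace = 'male', value = 'Male')
df2['sex'] = df2['sex'].replace(to_replace = 'female', value = 'Female')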
The above code simply standardizes the capitalization of the values in the sex column. Then
we’ll plot the data points on the figure.
plt.figure(figsize = (15,8))
sns.scatterplot(x = 'longitude', y = 'latitude', data = df2, hue = 'sex', alpha = 0.2)
13. Code Explanation: x and y name the columns used for the horizontal and vertical
positions, here longitude and latitude. data is the data frame, or source, in which
columns are variables and rows are observations. hue names the column whose values
determine the color of the points, so each sex is drawn in a different color. alpha,
which takes a float value, sets the opacity of the points. Refer to the screenshot attached below for the output.
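Here is a self-contained sketch of the clean-up and scatter plot described above, under stated assumptions: the data frame patients stands in for the patient-level data, and its rows are made up for illustration. A dictionary passed to replace does both label fixes in one call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical patient-level rows; the real data frame is not shown in the text.
patients = pd.DataFrame({
    "sex": ["male", "Female", "female", "Male"],
    "longitude": [77.2, 12.5, 116.4, -74.0],
    "latitude": [28.6, 41.9, 39.9, 40.7],
})

# Standardize the capitalization so each sex forms a single hue group.
patients["sex"] = patients["sex"].replace({"male": "Male", "female": "Female"})

plt.figure(figsize=(15, 8))
# hue colors the points by sex; alpha makes overlapping points visible.
ax = sns.scatterplot(x="longitude", y="latitude", data=patients, hue="sex", alpha=0.2)
```

Without the clean-up step, 'male' and 'Male' would be treated as two separate categories and drawn in different colors.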
15. Future Scope: Now that we have understood how to read raw data and present
it in readable figures, the future scope could be implementing a time
series forecasting module and getting a prediction. Using an RNN, we could
possibly arrive at a realistic number of future COVID-19 cases. But at
present, it could be difficult to get a realistic prediction, as the data we possess
now is too uncertain and too sparse.
Considering the current situation and the fight we have all been putting up, we
have decided not to implement a prediction module to produce any number
that could lead to unnecessary unrest.