At Stratada we are always looking for ways to highlight Power BI by using it for data science projects.  The ability to use Python in Power BI is one of our favorite methods for deeper analysis.  We decided to use easily available COVID-19 data and write up our steps to share in this post.

COVID-19 has become a global public health emergency, having grown even bigger than the SARS outbreak which occurred a few years ago. The virus emerged in Wuhan (China) in December 2019. As of 3rd April 2020, more than one million people worldwide have been tested positive and a total of 51364 deaths have been recorded.

The data available to today can be analyzed to estimate (predict) the number of cases for the upcoming days. In this article we will apply machine learning techniques like Time Series Analysis and Regression on a COVID-19 dataset and perform exploratory data analysis to create visualizations in Power BI using Python.

Python integration has been available in Power BI for some time. It’s possible to run Python scripts directly in Power Bi Desktop and import the resulting datasets into a Power BI Desktop data model. We will be using the CSSE COVID-19 dataset present in this GitHub repository. The folder csse_covid_19_time_series contains time series data in csv files for confirmed cases, deaths and recoveries starting from 22nd January, 2020. The dataset is updated daily to account for the increase or decrease in the cases on that day.

This article is divided into following topics:

  1. Setting up Python in Power BI
  2. Creating the data import script
  3. Create visuals using Python
  4. Regression and Time Series Analysis

Setting up Python in Power BI

Installing python and required packages

To run Python scripts in Power BI Desktop, you need to install Python on your local machine. You can download Python from the Python website. Along with this, you need to install the following python packages which will be used in this article –

  • Pandas – for data manipulation and analysis
  • Matplotlib – for plotting and visualization
  • Sklearn – performing machine learning in python
  • Fbprophet – Time Series forecasting in python
  • Plotly – Python Open Source Graphing Library

In a console or shell, use the pip command-line tool to install these packages.  For example to install pandas, run pip install pandas in command prompt.

Configure Power BI to use Python –

  1. Start Power BI and click on File located at top left hand corner. Click on Options and Settings  and then on Options.
  2. Click on Python Scripting to open the Python script options. As default Power BI lists the Python environments it has been able to detect in the system. You can specify a path of any other python environment as well.

For example – In my case as shown in the image, the default path is E:\Python

Next click on Ok

 

Creating the Data Import Script

    1. Go to the Home tab and click Get data and choose More
    2. Select the Python script option from the list and click Connect. A Python script dialog opens where you can add your own code. Copy and paste the below code and click Ok

import pandas as pd

import numpy as np

import math

 

confirmed_global_cases = pd.read_csv(‘https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv‘)

 

cols = confirmed_global_cases.keys()

confirmed = confirmed_global_cases.loc[:, cols[4]:cols[-1]]

dates = confirmed.keys()

world_cases_cummulative = []

for i in dates:

world_cases_cummulative.append(confirmed[i].sum())

 

def daily_increase(data):

d = []

for i in range(len(data)):

if i == 0:

d.append(data[0])

else:

d.append(data[i]-data[i-1])

return d

 

world_daily_increase = daily_increase(world_cases_cummulative)

cols = cols[4:]

df1 = pd.DataFrame(data = cols, columns=[‘ds’])

world_daily_increase = np.array(world_daily_increase).reshape(-1,1)

df2 = pd.DataFrame(data = world_daily_increase, columns = [‘y’])

df1[‘ds’] = pd.to_datetime(df1[‘ds’])

df_daily_increase = pd.concat([df1, df2], axis=1)

 

world_cases_cummulative = np.array(world_cases_cummulative).reshape(-1,1)

df3 = pd.DataFrame(data = world_cases_cummulative, columns = [‘y’])

df_cummulative_increase = pd.concat([df1, df3], axis=1)

 

num_weeks = math.ceil(len(df_daily_increase)/7)

df_weekly_increase = np.array_split(df_daily_increase, num_weeks)

x = []

for i in range(len(df_weekly_increase)):

x.append(df_weekly_increase[i][‘y’].cumsum().iloc[-1])

x = np.array(x).reshape(-1,1)

df_weekly = pd.DataFrame(data=x, columns=[‘y’])

week_array = []

for i in range(len(df_weekly)):

week_array.append(‘Week ‘+str(i+1))

week_array = np.array(week_array).reshape(-1,1)

df_week_array = pd.DataFrame(data=week_array, columns=[‘Week’])

df_weekly_increase = pd.concat([df_weekly, df_week_array], axis=1)

3. In the next window we can choose which datasets to import. Select df_cummulative_increase, df_daily_increase, df_weekly_increase. Next click on Load

Create visuals using Python

 

  1. Go to the Report view to start working on the visualizations. We will first plot the number of COVID-19 positive cases by week using a pie chart. Click on Py in Visualizations. Then from the df_weekly_increase table located on right, drag both the columns into the Values field. A python Script Editor will appear. Copy and paste the following script into the editor and click Run Script

import matplotlib.pyplot as plt

plt.pie(df_weekly_increase[‘y’], labels = df_weekly_increase[‘Week’], autopct=’%1.0f%%’, pctdistance=0.7, labeldistance=1.2, radius=8)

plt.axis(‘equal’)

plt.tight_layout()

plt.show()

The result should be something like this:

2. Next we will plot a bar graph of positive cases to visualize the trend using plotly. Follow the steps from point 1. This time drag both columns from df_cummulative_increase into the values field. Copy and paste the following script into the editor and click Run Script

from fbprophet import Prophet

from fbprophet.plot import plot_plotly, add_changepoints_to_plot

n=Prophet(interval_width=0.9)

n.fit(df_daily_increase_resampled)

future=n.make_future_dataframe(periods=5)

forecast=n.predict(future)

import plotly.offline as py

fig = plot_plotly(n, forecast)

fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')

This will produce the following graph in your default browser. You can also work with plotly offline which will display the graph in the Power BI window.

Regression and Time Series Analysis

We will first apply polynomial regression to predict the number of positive cases for the next 5 days. For this, we will use the values from df_cummulative_increase. Paste the following code in script editor

from fbprophet import Prophet

from fbprophet.plot import plot_plotly, add_changepoints_to_plot

n=Prophet(interval_width=0.9)

n.fit(df_daily_increase_resampled)

future=n.make_future_dataframe(periods=5)

forecast=n.predict(future)

import plotly.offline as py

fig = plot_plotly(n, forecast)

fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')

Next, we will use time series forecasting to predict the number of worldwide cases in the upcoming days using Prophet. Prophet is open source software released by Facebook’s Core Data Science Team. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. Some advantages of Prophet are –

  1. Accuracy and speed
  2. Fully Automatic
  3. Tunable forecasts
  4. Available in Python or R

The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric and represents the measurement we wish to forecast.

Since we have only daily data of the number of cases in the past 10-11 weeks, we need to first upsample and interpolate the time series data to get more accurate results. Upsampling means increasing the frequency of the samples, such as from days to hours. We will use the resample() function provided by pandas library to upsample the daily (24 hour) values to 4 hour values and use linear interpolation to fill in the missing values. Click on Py in Visualizations. Then from the df_daily_increase table located on right,  drag both the columns into the Values field. Then paste the following script in the editor:

from fbprophet import Prophet

from fbprophet.plot import plot_plotly, add_changepoints_to_plot

n=Prophet(interval_width=0.9)

n.fit(df_daily_increase_resampled)

future=n.make_future_dataframe(periods=5)

forecast=n.predict(future)

import plotly.offline as py

fig = plot_plotly(n, forecast)

fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')

This will generate a data frame df_daily_increase_resampled. Add the following code to the editor and then click on Run Script.

from fbprophet import Prophet

from fbprophet.plot import plot_plotly, add_changepoints_to_plot

n=Prophet(interval_width=0.9)

n.fit(df_daily_increase_resampled)

future=n.make_future_dataframe(periods=5)

forecast=n.predict(future)

import plotly.offline as py

fig = plot_plotly(n, forecast)

fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')

The output plot generated is:

The black dots in the plot are the actual values of the cases on those dates. The continuous blue line is predicted by the model. The interval_width=0.9 passed in the code while initializing Prophet is the confidence level (90% in this case). The faint blue region represents the region in which the predicted values may fluctuate based on the confidence level.

There are many more inputs needed to accurately forecast coming COVID-19 cases.  This model is probably only vaguely predictive for a few days ahead at best.  We only use this as good way to show off the use of Python and some modern Data Science techniques in Power BI.  Whatever data you have, Stratada data scientists can help you interpret it using the latest techniques and the best tools.  

Do NOT follow this link or you will be banned from the site!