At Stratada we are always looking for ways to highlight Power BI by using it for data science projects. The ability to use Python in Power BI is one of our favorite methods for deeper analysis. We decided to use easily available COVID19 data and write up our steps to share in this post.
COVID19 has become a global public health emergency, having grown even bigger than the SARS outbreak which occurred a few years ago. The virus emerged in Wuhan (China) in December 2019. As of 3rd April 2020, more than one million people worldwide have been tested positive and a total of 51364 deaths have been recorded.
The data available to today can be analyzed to estimate (predict) the number of cases for the upcoming days. In this article we will apply machine learning techniques like Time Series Analysis and Regression on a COVID19 dataset and perform exploratory data analysis to create visualizations in Power BI using Python.
Python integration has been available in Power BI for some time. It’s possible to run Python scripts directly in Power Bi Desktop and import the resulting datasets into a Power BI Desktop data model. We will be using the CSSE COVID19 dataset present in this GitHub repository. The folder csse_covid_19_time_series contains time series data in csv files for confirmed cases, deaths and recoveries starting from 22nd January, 2020. The dataset is updated daily to account for the increase or decrease in the cases on that day.
This article is divided into following topics:
 Setting up Python in Power BI
 Creating the data import script
 Create visuals using Python
 Regression and Time Series Analysis
Setting up Python in Power BI
Installing python and required packages
To run Python scripts in Power BI Desktop, you need to install Python on your local machine. You can download Python from the Python website. Along with this, you need to install the following python packages which will be used in this article –
 Pandas – for data manipulation and analysis
 Matplotlib – for plotting and visualization
 Sklearn – performing machine learning in python
 Fbprophet – Time Series forecasting in python
 Plotly – Python Open Source Graphing Library
In a console or shell, use the pip commandline tool to install these packages. For example to install pandas, run pip install pandas in command prompt.
Configure Power BI to use Python –
 Start Power BI and click on File located at top left hand corner. Click on Options and Settings and then on Options.
 Click on Python Scripting to open the Python script options. As default Power BI lists the Python environments it has been able to detect in the system. You can specify a path of any other python environment as well.
For example – In my case as shown in the image, the default path is E:\Python
Next click on Ok
Creating the Data Import Script

 Go to the Home tab and click Get data and choose More
 Select the Python script option from the list and click Connect. A Python script dialog opens where you can add your own code. Copy and paste the below code and click Ok
import pandas as pd
import numpy as np
import math
confirmed_global_cases = pd.read_csv(‘https://raw.githubusercontent.com/CSSEGISandData/COVID19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv‘)
cols = confirmed_global_cases.keys()
confirmed = confirmed_global_cases.loc[:, cols[4]:cols[1]]
dates = confirmed.keys()
world_cases_cummulative = []
for i in dates:
world_cases_cummulative.append(confirmed[i].sum())
def daily_increase(data):
d = []
for i in range(len(data)):
if i == 0:
d.append(data[0])
else:
d.append(data[i]data[i1])
return d
world_daily_increase = daily_increase(world_cases_cummulative)
cols = cols[4:]
df1 = pd.DataFrame(data = cols, columns=[‘ds’])
world_daily_increase = np.array(world_daily_increase).reshape(1,1)
df2 = pd.DataFrame(data = world_daily_increase, columns = [‘y’])
df1[‘ds’] = pd.to_datetime(df1[‘ds’])
df_daily_increase = pd.concat([df1, df2], axis=1)
world_cases_cummulative = np.array(world_cases_cummulative).reshape(1,1)
df3 = pd.DataFrame(data = world_cases_cummulative, columns = [‘y’])
df_cummulative_increase = pd.concat([df1, df3], axis=1)
num_weeks = math.ceil(len(df_daily_increase)/7)
df_weekly_increase = np.array_split(df_daily_increase, num_weeks)
x = []
for i in range(len(df_weekly_increase)):
x.append(df_weekly_increase[i][‘y’].cumsum().iloc[1])
x = np.array(x).reshape(1,1)
df_weekly = pd.DataFrame(data=x, columns=[‘y’])
week_array = []
for i in range(len(df_weekly)):
week_array.append(‘Week ‘+str(i+1))
week_array = np.array(week_array).reshape(1,1)
df_week_array = pd.DataFrame(data=week_array, columns=[‘Week’])
df_weekly_increase = pd.concat([df_weekly, df_week_array], axis=1)
3. In the next window we can choose which datasets to import. Select df_cummulative_increase, df_daily_increase, df_weekly_increase. Next click on Load
Create visuals using Python
 Go to the Report view to start working on the visualizations. We will first plot the number of COVID19 positive cases by week using a pie chart. Click on Py in Visualizations. Then from the df_weekly_increase table located on right, drag both the columns into the Values field. A python Script Editor will appear. Copy and paste the following script into the editor and click Run Script
import matplotlib.pyplot as plt
plt.pie(df_weekly_increase[‘y’], labels = df_weekly_increase[‘Week’], autopct=’%1.0f%%’, pctdistance=0.7, labeldistance=1.2, radius=8)
plt.axis(‘equal’)
plt.tight_layout()
plt.show()
The result should be something like this:
2. Next we will plot a bar graph of positive cases to visualize the trend using plotly. Follow the steps from point 1. This time drag both columns from df_cummulative_increase into the values field. Copy and paste the following script into the editor and click Run Script
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
n=Prophet(interval_width=0.9)
n.fit(df_daily_increase_resampled)
future=n.make_future_dataframe(periods=5)
forecast=n.predict(future)
import plotly.offline as py
fig = plot_plotly(n, forecast)
fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')
This will produce the following graph in your default browser. You can also work with plotly offline which will display the graph in the Power BI window.
Regression and Time Series Analysis
We will first apply polynomial regression to predict the number of positive cases for the next 5 days. For this, we will use the values from df_cummulative_increase. Paste the following code in script editor
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
n=Prophet(interval_width=0.9)
n.fit(df_daily_increase_resampled)
future=n.make_future_dataframe(periods=5)
forecast=n.predict(future)
import plotly.offline as py
fig = plot_plotly(n, forecast)
fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')
Next, we will use time series forecasting to predict the number of worldwide cases in the upcoming days using Prophet. Prophet is open source software released by Facebook’s Core Data Science Team. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. Some advantages of Prophet are –
 Accuracy and speed
 Fully Automatic
 Tunable forecasts
 Available in Python or R
The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYYMMDD for a date or YYYYMMDD HH:MM:SS for a timestamp. The y column must be numeric and represents the measurement we wish to forecast.
Since we have only daily data of the number of cases in the past 1011 weeks, we need to first upsample and interpolate the time series data to get more accurate results. Upsampling means increasing the frequency of the samples, such as from days to hours. We will use the resample() function provided by pandas library to upsample the daily (24 hour) values to 4 hour values and use linear interpolation to fill in the missing values. Click on Py in Visualizations. Then from the df_daily_increase table located on right, drag both the columns into the Values field. Then paste the following script in the editor:
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
n=Prophet(interval_width=0.9)
n.fit(df_daily_increase_resampled)
future=n.make_future_dataframe(periods=5)
forecast=n.predict(future)
import plotly.offline as py
fig = plot_plotly(n, forecast)
fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')
This will generate a data frame df_daily_increase_resampled. Add the following code to the editor and then click on Run Script.
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
n=Prophet(interval_width=0.9)
n.fit(df_daily_increase_resampled)
future=n.make_future_dataframe(periods=5)
forecast=n.predict(future)
import plotly.offline as py
fig = plot_plotly(n, forecast)
fig = n.plot(forecast,xlabel='Date',ylabel='Confirmed Count')
The output plot generated is:
The black dots in the plot are the actual values of the cases on those dates. The continuous blue line is predicted by the model. The interval_width=0.9 passed in the code while initializing Prophet is the confidence level (90% in this case). The faint blue region represents the region in which the predicted values may fluctuate based on the confidence level.
There are many more inputs needed to accurately forecast coming COVID19 cases. This model is probably only vaguely predictive for a few days ahead at best. We only use this as good way to show off the use of Python and some modern Data Science techniques in Power BI. Whatever data you have, Stratada data scientists can help you interpret it using the latest techniques and the best tools.