An Introduction to Prophet

Posted on July 1, 2018

Prophet, released by Facebook Open Source, is a forecasting library for Python and R. It straddles the line between statistical modeling and machine learning, and I have not seen it talked about much, even though it is rather interesting. Prophet forecasts time series data, such as weather or page views.

To get started, here is a little tutorial of Prophet.

Note:

As seen on the issues page on GitHub, Prophet has some problems on Windows computers. While most of the functionality works, some parts, such as graphing the predictions, do not, partly because of Prophet itself and partly because of its main dependency, PyStan, which also seems to have problems on Windows. As usual, install PyStan first to get started, for example with pip install pystan.

Once PyStan is installed, install Prophet itself on your machine, for example with pip install fbprophet (the package name at the time of writing).

For this example, we will use a dataset from PageViews, a Wikimedia service that shows the daily page views of any Wikipedia page, along with related features such as redirect views and comparisons across languages. It is a useful way to track the popularity of a subject.

For this tutorial, we will look at the page views of newly famous politician Alexandria Ocasio-Cortez, mainly because it is interesting how much the reporting around her has exploded, which shows up clearly in her page view numbers.

To start, import matplotlib and Pandas and load the CSV file exported from PageViews.

One does not need matplotlib for Prophet, but it is useful for a first look at the dataset. We need Pandas to shape the CSV file into a data frame usable by Prophet. Prophet requires two columns: ds, the date of each observation (here in month/day/year form), and y, the value we are forecasting, which is page views in this case.
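As a rough sketch of that shaping step, the snippet below reads a small inline CSV and renames its columns to the ds and y names Prophet expects. The column headers Date and Alexandria Ocasio-Cortez are my assumption about how the export is labeled, and the first two rows are placeholder values rather than real page views; only the 6/27/2018 figure comes from the dataset used here.

```python
import io

import pandas as pd

# Stand-in for the exported CSV file. Header names are assumed, and the
# first two rows are placeholders, not real page view counts.
csv_text = """Date,Alexandria Ocasio-Cortez
06/25/2018,10
06/26/2018,20
06/27/2018,181780
"""

df = pd.read_csv(io.StringIO(csv_text))

# Prophet looks for exactly these two column names.
df = df.rename(columns={"Date": "ds", "Alexandria Ocasio-Cortez": "y"})

# Parse the month/day/year strings into real timestamps.
df["ds"] = pd.to_datetime(df["ds"], format="%m/%d/%Y")

print(df.dtypes)
```

With a real export, io.StringIO(csv_text) would be replaced by the path to the downloaded file.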

Before working with Prophet, here are some simple tools from Pandas for taking a look at the dataset.

Here is the result of describe():

                   y
count      21.000000
mean     8683.809524
std     39661.332898
min         2.000000
25%        12.000000
50%        24.000000
75%        32.000000
max    181780.000000

As seen, the page is quite new, with only 21 days of data and a very large difference between the min, max, mean, and standard deviation (std). That max is from the last day of the set, 6/27/2018, the day after she won her primary in New York's 14th congressional district against incumbent Democratic Representative Joseph Crowley.

Now here is the basic code to work with Prophet.

First, Prophet() creates the Prophet object that will analyze the data frame. We then fit our dataset to that object. In the last line of code, we build the forecast, with periods=365 telling Prophet to look 365 days, or one year, past the end of the data.
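For reference, those three steps can be sketched as below. This is a minimal version under some assumptions, not necessarily the exact code behind the results in this post: forecast_one_year is a name I made up, df is assumed to already have the ds and y columns, and the import tries fbprophet first and falls back to prophet, the name later releases are published under.

```python
def forecast_one_year(df, periods=365):
    """Fit Prophet on a ds/y data frame and forecast `periods` days ahead."""
    # Imported inside the function so either package name can be used.
    try:
        from fbprophet import Prophet
    except ImportError:
        from prophet import Prophet

    m = Prophet()   # create the model with default settings
    m.fit(df)       # fit it to the historical ds/y data

    # Extend the date column `periods` days past the last observation,
    # then predict for every date, historical and future alike.
    future = m.make_future_dataframe(periods=periods)
    forecast = m.predict(future)

    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```

Calling forecast_one_year(df).tail() would then show the last few forecast rows.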

And now for the last portion, where we can look at the results of the analysis.

Note:

Sometimes when running the code, the phrase Hessian Matrix will show up. The Hessian matrix is the matrix of second partial derivatives of a function.

For example, let us say we have f(x, y, z) = x^2 + y^2 + z^2. A partial derivative is the derivative with respect to one variable while the others are treated as constants, so the first partial derivatives are 2x, 2y, and 2z.

Taking the derivative of each of those again, with respect to each variable in turn, gives the second partial derivatives. For 2x, that is 2 with respect to x and 0 with respect to y and z. We can arrange these results into a matrix then!

For this function:

[2,0,0   (second partial derivatives of the x term)
 0,2,0   (second partial derivatives of the y term)
 0,0,2]  (second partial derivatives of the z term)
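To check that matrix numerically, here is a small sketch that approximates second partial derivatives with central finite differences. It has nothing to do with how Prophet or PyStan compute a Hessian internally; the function names and the evaluation point (1, 2, 3) are just for illustration.

```python
def f(x, y, z):
    return x**2 + y**2 + z**2


def hessian(func, point, h=1e-3):
    """Approximate the matrix of second partial derivatives at `point`."""
    n = len(point)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                # Nudge variable i by di*h and variable j by dj*h.
                p = list(point)
                p[i] += di * h
                p[j] += dj * h
                return func(*p)

            # Central difference estimate of d^2 f / (dx_i dx_j).
            H[i][j] = (
                shifted(1, 1) - shifted(1, -1)
                - shifted(-1, 1) + shifted(-1, -1)
            ) / (4 * h * h)
    return H


H = hessian(f, [1.0, 2.0, 3.0])
for row in H:
    print([round(abs(v), 6) for v in row])
```

Up to floating point error, the diagonal entries come out as 2 and the rest as 0, whatever point is chosen, because the second derivatives of this function are constant.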

Now for the results of the analysis.

First let us look at the print statement:

            ds           yhat     yhat_lower     yhat_upper
381 2019-06-23  687963.404739  652022.556382  728502.770550
382 2019-06-24  687961.602067  649358.178347  726809.029122
383 2019-06-25  688009.537470  649664.221806  725880.952722
384 2019-06-26  748545.400747  708648.671636  789416.559359
385 2019-06-27  700957.348619  659259.490072  738761.610851

Y-hat (yhat) is the predicted y value, in this case the forecast daily page views. Here we are printing the tail of our forecast, so for example row 381 (2019-06-23) has a yhat of 687963.404739. The yhat_lower and yhat_upper values are the lower and upper bounds of Prophet's uncertainty interval around the forecast, which is an 80% interval by default.

Now a look, with Pandas, at a description of the forecast values.

               trend     yhat_lower      ...        multiplicative_terms_upper           yhat
count     386.000000     386.000000      ...                             386.0     386.000000
mean   347070.886224  307883.437338      ...                               0.0  347062.884321
std    206902.507379  208053.515036      ...                               0.0  207938.495639
min     -9902.806374  -54435.244935      ...                               0.0  -13006.850626
25%    168583.245236  129992.390964      ...                               0.0  168733.079942
50%    347070.857915  310056.019017      ...                               0.0  350458.653974
75%    525558.470594  487103.625723      ...                               0.0  525716.758356
max    704046.083273  708648.671636      ...                               0.0  748545.400747

The count in this case includes the 21 initial data points from before we ran Prophet, plus the 365 forecast days. The min value is a bit strange, since page views cannot be negative, but aside from that the numbers look reasonable. Still, one should not take these values at face value, not because they are wrong, but because it is good to double-check one's claims.

To finish off, let us compare our predictions to the actual Pageviews results for the past two days.

6/28/2018: 133,285
6/29/2018: 83,014

With the last value, the forecast is rather close! The discrepancy can be explained by her page being so new and by the changes in its values being rather extreme. To test Prophet properly, it would be better to use a larger dataset that goes back further in time, for example the page for the United States of America.
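To put a number on how extreme those changes are, here is a quick bit of arithmetic on the three figures quoted in this post: the 6/27/2018 maximum from the describe() output and the two later days above. The helper name pct_change is my own.

```python
# Daily page views quoted in this post.
views = [
    ("6/27/2018", 181780),
    ("6/28/2018", 133285),
    ("6/29/2018", 83014),
]


def pct_change(previous, current):
    """Percentage change from one day to the next."""
    return (current - previous) / previous * 100


changes = [
    (views[i][0], pct_change(views[i - 1][1], views[i][1]))
    for i in range(1, len(views))
]

for day, change in changes:
    print(f"{day}: {change:+.1f}%")
# 6/28/2018: -26.7%
# 6/29/2018: -37.7%
```

Swings of that size from one day to the next are hard for any model to follow with only 21 days of history behind it.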

Future ways to use Prophet include applying it to other time series datasets, such as stocks, adding an additional variable to look at special changes in the time series, or checking for seasonality in the model.