Can Ravens Forecast?

Graphic by Carl Goodwin

Humans have the magical ability to plan for future events, for future gain. It’s not quite a uniquely human trait. Because apparently ravens can match a 4-year-old.

An abundance of data, and some very nice R packages, make our ability to plan all the more powerful.

In the Spring of 2018 I looked at sales from an historical perspective in Six Months Later.. Here I’ll use the data to model a time-series forecast for the year ahead. The techniques apply to any time series with characteristics of trend, seasonality or longer-term cycles.

Why forecast sales? Business plans require a budget, e.g. for resources, marketing and office space. A good projection of revenue provides the foundation for the budget. And, for an established business, with historical data, time-series forecasting is one way to deliver a robust projection.

Without exogenous data, the forecast assumes one continues to do what one’s doing. So, it provides a good starting-point. Then one might, for example, add assumptions about new products or services. And, if there is forward-looking data available, for example, market size projections (with past projections to train the model), then one could feed this into the forecast modelling too.

library(tidyverse)
library(wesanderson)
library(rebus)
library(kableExtra)
library(fpp3)
library(scales)

First I’ll check the encoding of the data.

gcloud_csv <-
  "https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/703943/G-Cloud_spend_data_to_end_March_2018.csv"
  
dos_csv <-
  "https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/703952/DOS_spend_data_to_end_March_2018.csv"

names <- c(gcloud_csv, dos_csv)

map(names, guess_encoding)
## [[1]]
## # A tibble: 2 x 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1       0.4 
## 2 ISO-8859-2       0.22
## 
## [[2]]
## # A tibble: 2 x 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1       0.36
## 2 ISO-8859-2       0.24

Next I’ll set up a vector of column names to apply consistently to both files, and import the data with the suggested encoding.

colnam <- 
  c("sector", "lot", "month", "spend", "status", "supplier", "customer", "framework")

read_dm <- function(x){
  read_csv(
    x,
    col_names = colnam,
    skip = 1,
    locale(encoding = "ISO-8859-1"),
    col_types = NULL)
}

raw <- map(names, read_dm) %>% 
  set_names(c("gcloud", "dos")) %>% 
  bind_rows() %>% 
  mutate(framework = if_else(is.na(framework), "DOS", framework))

I’d like to create some new features: Month-end dates, something to distinguish between the two frameworks (G-Cloud or DOS). The spend has a messy format and needs a bit of cleaning too.

The lot structure for G-Cloud has evolved over time, but fortunately, there is a simple mapping, i.e. PaaS and IaaS became Cloud Hosting, SaaS became Cloud Software, and Specialist Cloud Services became Cloud Support, so I’ll standardise on the latter.

both <- raw %>%
  mutate(
    month_start = myd(str_c(month, "-01")),
    month_end = month_start + months(1) - 1,
    month = yearmonth(month_end),
    framework = str_extract(framework, ".{3,7}"),
    spend = str_remove(spend, fixed("£")),
    spend = str_replace(spend, "^\\(", "-"),
    spend = parse_number(spend) / 1000000,
    lot = recode(
      lot,
      "Software as a Service (SaaS)" = "Cloud Software",
      "Infrastructure as a Service (IaaS)" = "Cloud Hosting",
      "Platform as a Service (PaaS)" = "Cloud Hosting",
      "Specialist Cloud Services" = "Cloud Support"
      )
)

The tidied data now needs to be converted to a tsibble, the temporal equivalent of a tibble.

R has evolved since I first wrote this post. At that time, it was necessary to either split the data into the two frameworks (G-Cloud and DOS) and forecast them separately. Or, as I did with the three G-Cloud lots, use the purrr package to iterate through a forecast.

The tsibble package combined with the newer fable and feasts packages, make this easier. One of the defining feature of the tsibble is the key. I want a model for each framework, so I’m setting this as the tsibble key (and the temporal variable as the tsibble index).

both_ts <- both %>%
  group_by(month, framework) %>% 
  summarise(spend = sum(spend)) %>% 
  as_tsibble(key = framework, index = month)

both_ts %>% 
  ggplot(aes(month, spend, colour = framework)) +
  geom_line(key_glyph = "timeseries") +
  scale_y_continuous(labels = dollar_format(prefix = "£", suffix = "m")) +
  scale_colour_manual(values = cols[c(4, 8)]) +
  labs(x = NULL, y = NULL, title = "Monthly Digital Marketplace Sales")

By decomposing the historical data we can tease out the underlying trend and seasonality:

  • Trend: The sales for both frameworks have grown over time as more Suppliers have added their services to the Government frameworks, and more Public Sector organizations have found the benefits of purchasing Cloud services through this faster, simpler, more transparent and more competitive contracting vehicle.

  • Seasonality: Suppliers often manage their sales and financials based on a quarterly cycle, with a particular emphasis on a strong close to the financial year. And Government Buyers may want to make optimal use of their budgets at the close of their financial year (March 31st). Consequently, we see quarterly seasonality with an extra spike at financial year-end.

both_ts %>%
  model(stl = STL(spend ~ trend(window = 7) + season(window = "periodic"))) %>%
  components() %>% 
  autoplot() +
  scale_colour_manual(values = cols[c(4, 8)]) +
  labs(x = NULL, title = "Timeseries Decomposition")

I’ll use auto.arima: AutoRegressive Integrated Moving Average modelling which aims to describe the autocorrelations in the data.

By setting stepwise and approximation to FALSE, auto.arima will explore a wider range of potential models.

I’ll forecast with the default 80% and 95% prediction intervals. This means the darker-shaded 80% range should include the future sales value with an 80% probability. Likewise with a 95% probability when adding the wider and lighter-shaded area.

Use of autoplot would simplify the code, but personally I like to expose all the data, for example unpacking the prediction intervals, and have finer control over the visualisation.

mod_ts <- both_ts %>%
  model(ARIMA = ARIMA(spend, stepwise = TRUE, approximation = FALSE))

mod_ts %>% 
  glance() %>%
  select(-ar_roots, -ma_roots) %>% 
  kable()
framework .model sigma2 log_lik AIC AICc BIC
DOS ARIMA 41.95495 -32.32154 70.64308 74.64308 71.55084
G-Cloud ARIMA 34.75275 -191.61547 393.23094 394.34205 403.70267
mod_ts %>% 
  tidy() %>%
  kable()
framework .model term estimate std.error statistic p.value
DOS ARIMA ar1 0.8227685 0.2064670 3.984989 0.0025801
DOS ARIMA constant 4.5060745 1.3493985 3.339321 0.0074991
G-Cloud ARIMA ar1 0.9390150 0.0595354 15.772391 0.0000000
G-Cloud ARIMA ma1 -0.5777210 0.1094066 -5.280494 0.0000019
G-Cloud ARIMA sar1 -0.5124417 0.1142670 -4.484597 0.0000336
G-Cloud ARIMA constant 1.4084703 0.3168155 4.445712 0.0000385
fcast_ts <- mod_ts %>%
  forecast(h = "2 years") %>% 
  mutate(`95%` = hilo(spend, 95), `80%` = hilo(spend, 80)) %>% 
  unpack_hilo(c("95%", "80%")) %>% 
  select(-spend) %>% 
  bind_rows(both_ts)

fcast_ts %>%
  ggplot(aes(month, fill = framework)) +
  geom_line(aes(y = spend), colour = cols[8]) +
  geom_ribbon(aes(ymin = `95%_lower`, ymax = `95%_upper`), fill = cols[3], colour = NA) +
  geom_ribbon(aes(ymin = `80%_lower`, ymax = `80%_upper`), fill = cols[6], colour = NA) +
  geom_line(aes(y = .mean), colour = "white") +
  facet_wrap(~ framework) +
  labs(title = "Digital Marketplace Sales Forecast by Framework", x = NULL, y = "Spend $m",
       subtitle = "80 & 95% Prediction Intervals") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

The G-Cloud framework compromises three lots: Cloud Hosting, Cloud Software and Cloud Support.

I previously combined auto.arima (from the forecast package) with functions from the sweep package, to create multiple forecasts in one shot. tsibble coupled fabletools handle this with the key set to the lot variable.

An alternative option is hierarchical time-series forecasting which models bottom-up, top-down or middle-out, and ensures the sum of the forecasts at the lower level sum to the top-level forecast. This approach has pros and cons and is not considered here.

gcloud_ts <- both %>%
  filter(framework == "G-Cloud") %>% 
  group_by(month, lot) %>% 
  summarise(spend = sum(spend)) %>% 
  as_tsibble(key = lot, index = month)

gc_ts <- gcloud_ts %>%
  model(ARIMA = ARIMA(spend, stepwise = TRUE, approximation = FALSE))

gc_ts %>% 
  glance() %>%
  select(-ar_roots, -ma_roots) %>% 
  kable()
lot .model sigma2 log_lik AIC AICc BIC
Cloud Hosting ARIMA 1.347179 -109.3173 224.6346 224.9982 231.3801
Cloud Software ARIMA 2.275664 -107.5551 221.1102 221.5465 227.3428
Cloud Support ARIMA 18.606529 -212.6556 435.3112 436.2342 446.6246
gc_ts %>% 
  tidy() %>%
  kable()
lot .model term estimate std.error statistic p.value
Cloud Hosting ARIMA ma1 -0.8269314 0.0765747 -10.799011 0.0000000
Cloud Hosting ARIMA constant 0.1713346 0.0258894 6.617935 0.0000000
Cloud Software ARIMA ma1 -0.9981867 0.1304426 -7.652304 0.0000000
Cloud Software ARIMA ma2 0.2242994 0.1224834 1.831264 0.0721116
Cloud Support ARIMA ma1 -0.7044948 0.0743797 -9.471597 0.0000000
Cloud Support ARIMA sma1 0.3878660 0.1458579 2.659205 0.0096735
Cloud Support ARIMA sma2 0.7866476 0.4284425 1.836064 0.0705352
Cloud Support ARIMA constant 0.7601155 0.2889763 2.630373 0.0104520
fcgc_ts <- gc_ts %>%
  forecast(h = "2 years") %>% 
  mutate(`95%` = hilo(spend, 95), `80%` = hilo(spend, 80)) %>% 
  unpack_hilo(c("95%", "80%")) %>% 
  select(-spend) %>% 
  bind_rows(gcloud_ts)

fcgc_ts %>%
  ggplot(aes(month, fill = lot)) +
  geom_line(aes(y = spend), colour = cols[8]) +
  geom_ribbon(aes(ymin = `95%_lower`, ymax = `95%_upper`), fill = cols[3], colour = NA) +
  geom_ribbon(aes(ymin = `80%_lower`, ymax = `80%_upper`), fill = cols[6], colour = NA) +
  geom_line(aes(y = .mean), colour = "white") +
  facet_wrap(~ lot) +
  labs(title = "G-Cloud Sales Forecast by Lot", x = NULL, y = "Spend $m",
       subtitle = "80 & 95% Prediction Intervals") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

So ravens are not yet ready for forecasting with R. But then neither are 4-year-olds, are they?

R Toolbox

Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts.

Package Function
base library[6]; c[5]; sum[3]; function[2]; conflicts[1]; cumsum[1]; is.na[1]; months[1]; search[1]
dplyr mutate[8]; filter[5]; select[5]; if_else[4]; bind_rows[3]; group_by[3]; summarise[3]; tibble[2]; arrange[1]; as_tibble[1]; desc[1]; recode[1]
fable ARIMA[2]; as_tsibble[2]
fabletools hilo[4]; model[3]; as_tsibble[2]; forecast[2]; glance[2]; tidy[2]; unpack_hilo[2]; autoplot[1]; components[1]
feasts as_tsibble[2]; autoplot[1]; STL[1]
ggplot2 aes[11]; geom_line[5]; geom_ribbon[4]; labs[4]; ggplot[3]; element_text[2]; facet_wrap[2]; scale_colour_manual[2]; theme[2]; autoplot[1]; scale_y_continuous[1]; theme_bw[1]; theme_set[1]
kableExtra kable[5]
lubridate month[2]; myd[1]
purrr map[3]; set_names[2]; map2_dfr[1]; possibly[1]
readr guess_encoding[1]; locale[1]; parse_number[1]; read_csv[1]; read_lines[1]
rebus literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]
scales dollar_format[1]
stringr str_c[3]; str_detect[3]; str_remove[3]; fixed[1]; str_count[1]; str_extract[1]; str_remove_all[1]; str_replace[1]
tibble enframe[1]
tidyr tibble[2]; as_tibble[1]; unnest[1]
tsibble as_tsibble[2]; tibble[2]; as_tibble[1]; yearmonth[1]
wesanderson wes_palette[1]
Carl Goodwin
Carl Goodwin
IBM Data Scientist & Growth Leader
comments powered by Disqus

Related