Let's Jitter

Graphic by Carl Goodwin
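# Packages assumed attached: tidyverse, lubridate, janitor, scales, glue, wesanderson
theme_set(theme_bw())

# Palette reused for the points, smoothers, fills and labels below
cols <- wes_palette(name = "Royal1", type = "discrete")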

I want to start with an unambitious little project to explore and visualise some sales data. The UK Government’s Digital Marketplace is something I’m familiar with, and it provides a rich and varied source of public data under the Open Government Licence. So I’m using it in a series of posts.

The marketplace was set up with an intent to break down barriers that impede Small and Medium Enterprises (SMEs) from bidding for Public Sector contracts. So, let’s see how that’s going.

The tidyverse sits at the heart of all my data science work as clearly evidenced in my favourite things. So I’ll begin by using two of my most used tidyverse packages (readr and dplyr) to import, clean and tidy the cloud services (G-Cloud) sales data.

Datasets are often scruffy affairs. Importing, cleaning and tidying is a necessary first step. In the case of these data, there are characters in an otherwise numeric spend column. And the date column is a mix of two formats.

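gcloud_df <-
  read_csv("https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/639799/g-cloud-sales-figures-july2017.csv") %>%
  clean_names() %>%
  mutate(
    # Keep only digits and minus signs, then convert the text to numeric
    evidenced_spend = str_remove_all(evidenced_spend, "[^0-9-]") %>% parse_number(),
    # First date format: serial day counts from an origin of 1899-12-30
    date = as_date(as.numeric(return_month), origin = "1899-12-30"),
    # Second date format: day-month-year text, used where the first attempt gave NA
    date = if_else(is.na(date), dmy(return_month), date),
    # Collapse every label other than "SME" into "Non-SME"
    sme_status = if_else(sme_status == "SME", "SME", "Non-SME"),
    # Spend counts towards the SME total only for SME suppliers
    sme_spend = if_else(sme_status == "SME", evidenced_spend, 0)
  )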
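To see what those two fixes do, here is a minimal sketch on a couple of made-up rows (the values in `toy_df` are hypothetical, not taken from the G-Cloud file). It assumes one date arrives as a serial day count and the other as day/month/year text. Note that the pattern keeps only digits and minus signs, so a decimal point would be stripped too; the sketch assumes whole-pound values.

```r
library(dplyr)
library(lubridate)
library(readr)
library(stringr)

# Hypothetical rows for illustration only
toy_df <- tibble(
  return_month = c("42917", "01/07/2017"),
  evidenced_spend = c("£1,200", "-350")
)

toy_clean <- toy_df %>%
  mutate(
    # Strip everything except digits and minus signs, then parse
    evidenced_spend = str_remove_all(evidenced_spend, "[^0-9-]") %>% parse_number(),
    # Text that is not a plain number becomes NA here (warning silenced)
    serial = suppressWarnings(as.numeric(return_month)),
    date = as_date(serial, origin = "1899-12-30"),
    # Fall back to day-month-year parsing where the serial route gave NA
    date = if_else(is.na(date), dmy(return_month, quiet = TRUE), date)
  )

toy_clean
```

Both rows should end up with the same date, since 42917 days on from 1899-12-30 is 1 July 2017.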

With that done, I can move on to summarising and visualising how the SME share has changed over time. For that I’ll need my second-most used tidyverse package: ggplot2.

share_df <- gcloud_df %>%
  group_by(date) %>%
  summarise(
    evidenced_spend = sum(evidenced_spend, na.rm = TRUE),
    sme_spend = sum(sme_spend, na.rm = TRUE),
    pct = sme_spend / evidenced_spend
  )

last_date <- gcloud_df %>% summarise(date = max(date)) %>% pull()

share_df %>%
  ggplot(aes(date, pct)) +
  geom_point(colour = cols[4]) +
  geom_smooth(colour = cols[2], fill = cols[3]) +
  scale_y_continuous(labels = percent_format()) +
  scale_x_date(date_breaks = "years", date_labels = "%Y") +
  labs(
    x = NULL, y = NULL,
    title = glue("SME Share of G-Cloud to {stamp('July 1, 2000')(last_date)}"), 
    subtitle = "Dots = % Monthly Sales via SMEs",
    caption = "Source: GOV.UK G-Cloud Sales"
  )
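The monthly share calculation is easy to check by hand on a toy table. This is only a sketch with invented figures (`toy_df` is hypothetical); it shows that `summarise()` evaluates its arguments in order, so `pct` is computed from the monthly sums rather than the row-level values, and that `na.rm = TRUE` lets a missing spend drop out of the totals.

```r
library(dplyr)

# Invented figures for illustration only
toy_df <- tibble(
  date = as.Date(c("2017-06-01", "2017-06-01", "2017-07-01", "2017-07-01")),
  evidenced_spend = c(100, 300, 200, NA),
  sme_spend = c(100, 0, 150, NA)
)

toy_share <- toy_df %>%
  group_by(date) %>%
  summarise(
    evidenced_spend = sum(evidenced_spend, na.rm = TRUE),
    sme_spend = sum(sme_spend, na.rm = TRUE),
    # Uses the two sums just created, not the original columns
    pct = sme_spend / evidenced_spend
  )

toy_share
```

June works out at 100 / 400 and July at 150 / 200.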

I can see that sales grew steadily to a cumulative £2.4bn by July 2017. And as the volume of sales grew, an increasingly clear picture of sustained growth in the SME share emerged. However, in the last few months of the period, SMEs lost a little ground.

Digging a little deeper, I also see variation by sub-sector. And that’s after setting aside those buyers with cumulative G-Cloud spend below £100k, where large enterprise suppliers are less inclined to compete.

sector_df <- gcloud_df %>%
  mutate(sector = if_else(
    sector %in% c("Central Government", "Local Government", "Police", "Health"),
    sector,
    "Other Sector"
  )) %>%
  group_by(customer_name, sector) %>%
  summarise(
    evidenced_spend = sum(evidenced_spend, na.rm = TRUE),
    sme_spend = sum(sme_spend, na.rm = TRUE),
    pct = sme_spend / evidenced_spend
  ) %>% 
  filter(evidenced_spend >= 100000) %>% 
  group_by(sector) %>%
  mutate(median_pct = median(pct)) %>% 
  ungroup() %>% 
  mutate(sector = fct_reorder(sector, median_pct))

n_df <- sector_df %>% group_by(sector) %>% summarise(n = n())

sector_df %>%
  ggplot(aes(sector, pct)) +
  geom_boxplot(outlier.shape = NA, fill = cols[3]) +
  geom_jitter(width = 0.2, alpha = 0.5, colour = cols[2]) +
  geom_label(aes(y = .75, label = glue("n = {n}")),
    data = n_df,
    fill = cols[1], colour = "white"
  ) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    x = NULL, y = NULL,
    title = glue("SME Share of G-Cloud to {stamp('July 1, 2000')(last_date)}"),
    subtitle = "% Sales via SMEs for Buyers with Cumulative Sales >= £100k",
    caption = "Source: GOV.UK G-Cloud Sales"
  )
  )
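The buyer-level steps can be traced on a handful of invented buyers. This is a sketch under stated assumptions: the names and spend values in `toy_df` are hypothetical, and `.groups = "drop"` is my addition to make the ungrouping explicit. It shows a buyer below the £100k threshold dropping out before the sector medians are taken, and the sectors then being ordered by median share.

```r
library(dplyr)
library(forcats)

# Hypothetical buyers and spend values, for illustration only
toy_df <- tibble(
  customer_name = c("A", "A", "B", "C", "D"),
  sector = c("Health", "Health", "Health", "Police", "Police"),
  evidenced_spend = c(80000, 40000, 50000, 200000, 150000),
  sme_spend = c(80000, 0, 50000, 50000, 0)
)

toy_sector <- toy_df %>%
  group_by(customer_name, sector) %>%
  summarise(
    evidenced_spend = sum(evidenced_spend),
    sme_spend = sum(sme_spend),
    pct = sme_spend / evidenced_spend,
    .groups = "drop"
  ) %>%
  # Buyer B (cumulative spend of 50,000) is set aside here
  filter(evidenced_spend >= 100000) %>%
  group_by(sector) %>%
  mutate(median_pct = median(pct)) %>%
  ungroup() %>%
  # Lowest median share first, so the plot reads low to high
  mutate(sector = fct_reorder(sector, median_pct))

levels(toy_sector$sector)
```

With these numbers Police (median 12.5%) sorts ahead of Health (about 67%).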

The box plot, overlaid with jittered points to avoid over-plotting, shows:

  • Central government, with its big-spending departments, and police favouring large suppliers. This may reflect, among other things, their ability to scale.
  • Local government and health, in contrast, favouring SMEs. And this despite their looser tether to central government strategy.
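For anyone wanting to reproduce the box-plus-jitter layering, here is a minimal sketch on invented shares (`toy_df` is hypothetical). It uses `outlier.shape = NA`, which is ggplot2's documented way of hiding the boxplot's own outlier markers so they are not drawn twice. The `seed` and `height = 0` settings are my assumptions rather than the post's: the seed makes the horizontal offsets repeatable, and a zero height leaves the plotted shares untouched, whereas the default height also nudges points a little vertically.

```r
library(ggplot2)

# Invented shares for two sectors; not G-Cloud figures
toy_df <- data.frame(
  sector = rep(c("A", "B"), each = 5),
  pct = c(0.10, 0.20, 0.20, 0.30, 0.90, 0.40, 0.50, 0.50, 0.60, 0.70)
)

p <- ggplot(toy_df, aes(sector, pct)) +
  # Hide the boxplot's own outlier markers
  geom_boxplot(outlier.shape = NA) +
  # Spread points by up to 0.2 either side of each category position
  geom_jitter(position = position_jitter(width = 0.2, height = 0, seed = 1))

# Inspect the positions the jitter layer will actually draw
jittered <- layer_data(p, 2)
range(as.numeric(jittered$x) - round(as.numeric(jittered$x)))
```

Every point keeps its original share on the y-axis and sits within 0.2 of its sector's position on the x-axis.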

So, irrespective of whether service integration is taken in-house or handled by a service integrator, large enterprise suppliers have much to offer:

  • The ability to deliver at scale;
  • A breadth and depth of capabilities exploitable during discovery to better articulate the “art of the possible”;
  • The reassurance that extensive capability is always on hand.

SMEs offer flexibility, fresh thinking and broader competition, often deploying their resources and building their mission around a narrower focus. They tend to do one thing, or a few things, exceptionally well.

I’ll return to these data in “Six months later” and again in “Can Ravens Forecast”.

R Toolbox

Listing the packages and functions used in this post lets me separately build a toolbox visualisation of package and function usage across all posts.

Package      Function
base         library[8]; sum[5]; as.numeric[1]; c[1]; conflicts[1]; cumsum[1]; function[1]; is.na[1]; max[1]; search[1]
dplyr        mutate[8]; if_else[7]; filter[5]; group_by[5]; summarise[5]; tibble[2]; arrange[1]; as_tibble[1]; desc[1]; n[1]; pull[1]; select[1]; ungroup[1]
forcats      fct_reorder[1]
ggplot2      aes[3]; ggplot[2]; labs[2]; scale_y_continuous[2]; geom_boxplot[1]; geom_jitter[1]; geom_label[1]; geom_point[1]; geom_smooth[1]; scale_x_date[1]; theme_bw[1]; theme_set[1]
glue         glue[4]
janitor      clean_names[1]
kableExtra   kable_material[1]; kbl[1]
lubridate    date[4]; stamp[2]; as_date[1]; dmy[1]
purrr        map[1]; map2_dfr[1]; possibly[1]; set_names[1]
readr        parse_number[1]; read_csv[1]; read_lines[1]
rebus        literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]
scales       percent_format[2]
stats        median[1]
stringr      str_detect[3]; str_c[2]; str_remove[2]; str_remove_all[2]; str_count[1]
tibble       enframe[1]
tidyr        unnest[1]
wesanderson  wes_palette[1]
Carl Goodwin
IBM Data Scientist & Growth Strategy Leader