Set Operations

Graphic by Carl Goodwin

In Let’s Jitter I looked at a relatively simple set of sales data.

G-Cloud offers a much richer source of data, with many thousands of services documented by several thousand suppliers and hosted across myriad web pages. These services straddle many categories. I’ll use these data to explore sets and their intersections.

I’m going to focus on the Cloud Hosting lot. Suppliers document the services they want to offer to Public Sector buyers. Each supplier is free to assign each of their services to one or more service categories. It would be interesting to see how these categories overlap when looking at the aggregated data.

I’ll begin by harvesting the URL for each category’s search results. I’ll also capture the number of services listed in each category and convert it to a number of search pages, at 100 results per page. This will later let me control how R iterates through the web pages to extract the required data.

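# Packages used in this post
library(tidyverse)     # dplyr, purrr, readr, stringr, tibble, tidyr, ggplot2
library(rvest)         # read_html(), html_nodes()
library(furrr)         # future_map_dfr(), future_pmap_dfr()
library(rebus)         # %R%, lookbehind(), one_or_more()
library(tictoc)        # tic(), toc()
library(ggVennDiagram)
library(ggupset)       # scale_x_upset()
library(kableExtra)    # kable()

plan(multisession)     # parallel backend for the furrr calls

# Search-results URL for each G-Cloud lot
lot_urls <-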
  list(
    "https://www.digitalmarketplace.service.gov.uk/g-cloud/search?lot=cloud-hosting",
    "https://www.digitalmarketplace.service.gov.uk/g-cloud/search?lot=cloud-software",
    "https://www.digitalmarketplace.service.gov.uk/g-cloud/search?lot=cloud-support"
  )

cat_urls <- future_map_dfr(lot_urls, function(x) {
  nodes <- x %>%
    read_html() %>%
    html_nodes(".lot-filters--last-list a")

  # html_attr() and html_text() are vectorised over the node set
  tibble(
    url = html_attr(nodes, "href"),
    pages = html_text(nodes)
  )
}) %>%
  mutate(
    pages = parse_number(as.character(pages)),
    pages = if_else(pages %% 100 > 0, pages %/% 100 + 1, pages %/% 100),
    lot = str_extract(url, lookbehind("cloud-") %R% one_or_more(WRD)),
    url = str_remove(url, ".*" %R% lookahead("&"))
  )
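
The page count in the mutate step above is a ceiling division: each results page lists up to 100 services, so a category with 101 services needs two pages. Here is a minimal base-R sketch, using made-up service counts rather than anything scraped from the site, which suggests the remainder test and ceiling() agree:

```r
# Hypothetical service counts per category (not taken from the site)
services <- c(1, 99, 100, 101, 250)

# The form used in the post: add a page when there is a remainder
pages_if_else <- ifelse(services %% 100 > 0, services %/% 100 + 1, services %/% 100)

# Equivalent ceiling division
pages_ceiling <- ceiling(services / 100)

print(data.frame(services, pages_if_else, pages_ceiling))
```

Either form would do; ceiling() is simply a little shorter.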

Now I’m all set to process the data in parallel at two levels: across categories and, within each category, across the multiple pages of search results, harvesting up to 100 service IDs per page.

I’ll also auto-abbreviate the category names so I’ll have the option of more concise names for less-cluttered plotting later on.

tic()

data_df <-
  future_pmap_dfr(list(cat_urls$url, cat_urls$pages, cat_urls$lot), function(x, y, z) {
    future_map_dfr(seq_len(y), function(page) {
      str_c(
        "https://www.digitalmarketplace.service.gov.uk/g-cloud/search?page=",
        page,
        x,
        "&lot=cloud-",
        z
      ) %>%
        read_html() %>%
        html_nodes(".search-result-title a") %>%
        html_attr("href") %>%
        tibble(
          lot = str_c("Cloud ", str_to_title(z)),
          id = str_extract(., digit(15)),
          cat = str_remove(x, "&serviceCategories=") %>%
            str_replace_all(literal("+"), " ") %>%
            str_remove(fixed("%28") %R% one_or_more(PRINT) %R% fixed("%29"))
        )
    })
  }) %>%
  select(lot:cat) %>%
  mutate(
    cat = str_trim(cat) %>% str_to_title(),
    abbr = str_remove(cat, "and") %>% abbreviate(3) %>% str_to_upper()
  )

toc()
## 297.353 sec elapsed
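
To make the name clean-up and auto-abbreviation steps easier to follow, here is a small base-R sketch on a single query fragment. The fragment is hypothetical, shaped like the url values built earlier, and the title-casing regex is my stand-in for str_to_title rather than what the pipeline itself calls.

```r
# Hypothetical query fragment, shaped like the `url` values built earlier
x <- "&serviceCategories=object+storage"

# Strip the parameter name and turn "+" back into spaces
cat_name <- sub("&serviceCategories=", "", x, fixed = TRUE)
cat_name <- gsub("+", " ", cat_name, fixed = TRUE)

# Base-R stand-in for str_to_title(): upper-case the first letter of each word
cat_name <- gsub("\\b([a-z])", "\\U\\1", cat_name, perl = TRUE)

# Abbreviate towards three characters, then upper-case
abbr <- toupper(abbreviate(cat_name, 3))

print(cat_name)
print(unname(abbr))
```

Note that abbreviate treats its length argument as a target rather than a hard limit, which is why some abbreviations in this post run to four characters.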

Now that I have a nice tidy tibble, I can start to think about visualisations.

I like Venn diagrams. But to create one I’ll first need to do a little prep, as ggVennDiagram requires a list with a separate character vector for each set.

all_cats <- data_df %>%
  filter(lot == "Cloud Hosting") %>%
  split(.$abbr) %>%
  map("id")
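
The shape this produces is a named list with one vector of IDs per category. A toy base-R sketch of the same idea, with invented IDs and a data frame standing in for data_df:

```r
# Toy stand-in for data_df: one row per (service id, category) pair
toy_df <- data.frame(
  id   = c("s1", "s2", "s1", "s3", "s2"),
  abbr = c("OBS", "OBS", "IND", "IND", "PAAS"),
  stringsAsFactors = FALSE
)

# One character vector of ids per category, named by abbreviation
toy_sets <- split(toy_df$id, toy_df$abbr)

str(toy_sets)
```

Here split is applied directly to the id column, which should give the same structure as splitting the whole tibble and then plucking id from each piece.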

Venn diagrams work best with a small number of sets. So we’ll select four categories.

four_cats <- all_cats[c("CAAH", "PAAS", "OBS", "IND")]

four_cats %>% 
  ggVennDiagram(label = "count", label_alpha = 0) +
  scale_fill_gradient(low = cols[1], high = cols[4]) +
  labs(
    x = "Category Combinations", y = NULL, fill = "# Services",
    title = "The Most Frequent Category Combinations",
    subtitle = "Focusing on Four G-Cloud 11 Service Categories",
    caption = "Source: digitalmarketplace.service.gov.uk\n"
  )

Let’s suppose I want to find out which service IDs lie in a particular intersection. Perhaps I want to go back to the website with those IDs to search for, and read up on, those particular services. I could use purrr’s reduce to achieve this. For example, let’s extract the IDs at the heart of the Venn, which belong to all four categories.

four_cats %>% reduce(intersect)
##  [1] "145652469242314" "201177724411485" "352535503829591" "763538670774955"
##  [5] "268767108656008" "170541097989831" "945423345174171" "876456734412491"
##  [9] "988342286965038" "290706137917789"

And if we wanted the IDs intersecting the “OBS” and “IND” categories?

list(
  four_cats$OBS,
  four_cats$IND
) %>%
  reduce(intersect)
##  [1] "145652469242314" "680290702876876" "457525792504478" "443222617062706"
##  [5] "723018193062072" "201177724411485" "352535503829591" "763538670774955"
##  [9] "268767108656008" "170541097989831" "922359812838932" "594319457574863"
## [13] "528742387707190" "925602375598889" "783631222110810" "647937587343042"
## [17] "149006357644340" "758628476072691" "945423345174171" "876456734412491"
## [21] "435089868976159" "988342286965038" "290706137917789" "305548050583747"
## [25] "344916994679297" "314875063746602" "291514633094017" "628926673613678"
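
One thing worth noticing: a pairwise intersection includes services that also sit in other categories, which is why the ten IDs from the four-way intersection reappear in the list of 28 above. If I wanted only the single Venn region belonging to two sets and nothing else, I could subtract the remaining sets. A base-R sketch on toy sets (the IDs and set names are invented; Reduce is base R’s counterpart to purrr’s reduce):

```r
# Toy sets standing in for the category id vectors
sets <- list(
  A = c("s1", "s2", "s3", "s4"),
  B = c("s2", "s3", "s4", "s5"),
  C = c("s3", "s6")
)

# In all sets: the base-R counterpart of reduce(intersect)
in_all <- Reduce(intersect, sets)

# In A and B, regardless of C
in_a_and_b <- intersect(sets$A, sets$B)

# In A and B but not C: a single Venn region
only_a_and_b <- setdiff(in_a_and_b, sets$C)

print(in_all)        # "s3"
print(in_a_and_b)    # "s2" "s3" "s4"
print(only_a_and_b)  # "s2" "s4"
```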

Sometimes though we need something a little more scalable than a Venn diagram. The ggupset package provides a good solution. Before we try more than four sets though, I’ll first use the same four categories so we may compare the visualisation to the Venn.

set_df <- data_df %>%
  filter(lot == "Cloud Hosting", abbr %in% c("CAAH", "PAAS", "OBS", "IND")) %>%
  group_by(id) %>%
  mutate(category = list(cat)) %>%
  distinct(id, category) %>%
  group_by(category) %>%
  mutate(n = n()) %>%
  ungroup()
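
The idea behind this preparation is to collapse each service’s categories into a single combination and then count how many services share each one. A rough base-R sketch on toy data follows; the sorted, pasted key is my own simplification in place of the list column, and sorting is a precaution so that the same categories in a different row order give the same key.

```r
# Toy (id, category) pairs
toy_df <- data.frame(
  id  = c("s1", "s1", "s2", "s3", "s3", "s4", "s5"),
  cat = c("Object Storage", "Networking", "Networking",
          "Networking", "Object Storage", "Networking", "Networking"),
  stringsAsFactors = FALSE
)

# Collapse each service's categories into one order-independent key
combo <- vapply(
  split(toy_df$cat, toy_df$id),
  function(x) paste(sort(x), collapse = " | "),
  character(1)
)

# Count services per combination, most frequent first
combo_counts <- sort(table(combo), decreasing = TRUE)

print(combo_counts)
```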

set_df %>%
  ggplot(aes(category)) +
  geom_bar(fill = cols[1]) +
  geom_label(aes(y = n, label = n), vjust = -0.1, size = 3, fill = cols[5]) +
  scale_x_upset() +
  theme(panel.border = element_blank()) +
  labs(
    x = "Category Combinations", y = NULL,
    title = "The Most Frequent Category Combinations",
    subtitle = "Focusing on Four G-Cloud 11 Service Categories",
    caption = "Source: digitalmarketplace.service.gov.uk"
  ) +
  expand_limits(y = c(0, 800))

Now let’s take a look at the intersections across all the categories. And let’s suppose that our particular interest is all services which appear in one, and only one, category.

set_df <- data_df %>%
  group_by(id) %>%
  filter(n() == 1, lot == "Cloud Hosting") %>%
  mutate(category = list(cat)) %>%
  distinct(id, category) %>%
  group_by(category) %>%
  mutate(n = n()) %>%
  ungroup()

set_df %>%
  ggplot(aes(category)) +
  geom_bar(fill = cols[2]) +
  geom_label(aes(y = n, label = n), vjust = -0.1, size = 3, fill = cols[3]) +
  scale_x_upset(n_sets = 10) +
  theme(panel.border = element_blank()) +
  labs(
    x = "Category Combinations", y = NULL,
    title = "10 Most Frequent Single-Category Services",
    subtitle = "Focused on Service Categories in the Cloud Hosting Lot",
    caption = "Source: digitalmarketplace.service.gov.uk"
  ) +
  expand_limits(y = c(0, 800))

Suppose we want to extract the intersection data for the top intersections across all sets. I could use functions from the tidyr package to achieve this.

cat_mix <- data_df %>%
  filter(lot == "Cloud Hosting") %>%
  mutate(x = cat) %>%
  pivot_wider(id_cols = id, names_from = cat, values_from = x, values_fill = "^") %>%
  unite(col = intersect, -id, sep = "/") %>%
  count(intersect) %>%
  mutate(
    intersect = str_replace_all(intersect, or(literal("/^"), literal("^/")), ""),
    intersect = str_replace_all(intersect, "/", " | ")
  ) %>%
  arrange(desc(n)) %>%
  slice(1:21)

cat_mix %>%
  kable(col.names = c("Intersecting Categories", "Services Count"))

| Intersecting Categories | Services Count |
|---|---|
| Platform As A Service | 600 |
| Compute And Application Hosting | 249 |
| Archiving Backup And Disaster Recovery | 168 |
| Archiving Backup And Disaster Recovery \| Compute And Application Hosting \| Nosql Database \| Relational Database \| Other Database Services \| Message Queuing And Processing \| Networking \| Platform As A Service \| Object Storage \| Other Storage Services | 150 |
| Networking | 124 |
| Other Storage Services | 94 |
| Compute And Application Hosting \| Platform As A Service | 86 |
| Compute And Application Hosting \| Platform As A Service \| Block Storage \| Object Storage \| Other Storage Services | 68 |
| Logging And Analysis | 65 |
| Container Service | 59 |
| Infrastructure And Platform Security | 48 |
| Message Queuing And Processing | 40 |
| Other Database Services | 35 |
| Relational Database | 32 |
| Content Delivery Network | 32 |
| Block Storage \| Object Storage \| Other Storage Services | 25 |
| Archiving Backup And Disaster Recovery \| Compute And Application Hosting \| Container Service \| Distributed Denial Of Service Attack Protection \| Firewall \| Infrastructure And Platform Security \| Load Balancing \| Networking \| Protective Monitoring \| Block Storage | 25 |
| Archiving Backup And Disaster Recovery \| Compute And Application Hosting | 24 |
| Object Storage | 22 |
| Archiving Backup And Disaster Recovery \| Compute And Application Hosting \| Platform As A Service | 22 |
| Block Storage | 20 |
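
The placeholder clean-up inside cat_mix is easier to trust after trying it on a single string. Below is a base-R sketch with one invented row as it might look after unite; the regular expression is my base-R equivalent of the rebus pattern, not the exact call used above.

```r
# One toy row after unite(): "^" marks categories the service is not in
united <- "^/Block Storage/^/Object Storage/^"

# Drop each placeholder together with one neighbouring separator
cleaned <- gsub("/\\^|\\^/", "", united)

# Swap the remaining separators for a readable divider
label <- gsub("/", " | ", cleaned, fixed = TRUE)

print(cleaned)  # "Block Storage/Object Storage"
print(label)    # "Block Storage | Object Storage"

# Replacing only the first separator leaves later ones untouched
first_only <- sub("/", " | ", "A/B/C", fixed = TRUE)
print(first_only)  # "A | B/C"
```

The last two lines show why the replace-all form matters once a service sits in three or more categories.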

And I can compare this table to the equivalent ggupset visualisation.

set_df <- data_df %>%
  group_by(id) %>%
  filter(lot == "Cloud Hosting") %>%
  mutate(category = list(cat)) %>%
  distinct(id, category) %>%
  group_by(category) %>%
  mutate(n = n()) %>%
  ungroup()

set_df %>%
  ggplot(aes(category)) +
  geom_bar(fill = cols[5]) +
  geom_label(aes(y = n, label = n), vjust = -0.1, size = 3, fill = cols[4]) +
  scale_x_upset(n_sets = 22, n_intersections = 21) +
  theme(panel.border = element_blank()) +
  labs(
    x = "Category Combinations", y = NULL,
    title = "Top Intersections Across all Sets",
    subtitle = "Focused on Service Categories in the Cloud Hosting Lot",
    caption = "Source: digitalmarketplace.service.gov.uk"
  ) +
  expand_limits(y = c(0, 700))

And if I want to extract all the service IDs for the top 5 intersections, I could use dplyr and tidyr verbs to achieve this too.

I won’t print them all out though!

top5_int <- data_df %>%
  filter(lot == "Cloud Hosting") %>%
  select(id, abbr) %>%
  mutate(x = abbr) %>%
  pivot_wider(names_from = abbr, values_from = x, values_fill = "^") %>%
  unite(col = intersect, -id, sep = "/") %>%
  mutate(
    intersect = str_replace_all(intersect, or(literal("/^"), literal("^/")), ""),
    intersect = str_replace_all(intersect, "/", " | ")
  ) %>%
  group_by(intersect) %>%
  mutate(count = n_distinct(id)) %>%
  arrange(desc(count), intersect, id) %>%
  ungroup() %>%
  add_count(intersect, wt = count, name = "temp") %>%
  mutate(temp = dense_rank(desc(temp))) %>%
  filter(temp %in% 1:5) %>%
  distinct(id)

top5_int %>%
  summarise(ids = n_distinct(id))
## # A tibble: 1 x 1
##     ids
##   <int>
## 1  1291
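
As a quick sanity check, the five largest counts in the table earlier should add up to this figure. The sketch below does that sum and then shows, on invented data, a simpler base-R way to pull the IDs for the top combinations. Unlike dense_rank, taking the first k names does not keep ties together, so treat it as an approximation of the approach above.

```r
# Five largest intersection counts from the table above
top5_counts <- c(600, 249, 168, 150, 124)
print(sum(top5_counts))  # 1291

# Toy combination label per service id (hypothetical)
combo <- c(s1 = "A", s2 = "A", s3 = "A | B", s4 = "B", s5 = "A | B", s6 = "A")

# Rank combinations by how many services share them
counts <- sort(table(combo), decreasing = TRUE)

# Keep the two most frequent combinations and look up their ids
top2 <- names(counts)[1:2]
top2_ids <- names(combo)[combo %in% top2]

print(top2_ids)  # "s1" "s2" "s3" "s5" "s6"
```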

R Toolbox

Summarising the packages and functions used in this post enables me to separately create a toolbox visualisation of package and function usage across all posts.

| Package | Function |
|---|---|
| base | library[9]; list[6]; c[5]; cat[5]; function[4]; abbreviate[1]; as.character[1]; conflicts[1]; cumsum[1]; search[1]; split[1]; sum[1] |
| dplyr | mutate[18]; filter[11]; group_by[8]; n[8]; id[7]; desc[4]; distinct[4]; if_else[4]; intersect[4]; tibble[4]; ungroup[4]; arrange[3]; select[3]; count[2]; n_distinct[2]; summarise[2]; add_count[1]; as_tibble[1]; dense_rank[1]; slice[1] |
| furrr | future_map_dfr[2]; future_pmap_dfr[1] |
| future | multiprocess[1]; plan[1] |
| ggplot2 | aes[6]; labs[4]; element_blank[3]; expand_limits[3]; geom_bar[3]; geom_label[3]; ggplot[3]; theme[3]; scale_fill_gradient[1]; theme_bw[1]; theme_set[1] |
| ggupset | scale_x_upset[3] |
| ggVennDiagram | ggVennDiagram[2] |
| kableExtra | kable[2] |
| purrr | map[4]; reduce[2]; map2_dfr[1]; possibly[1]; set_names[1] |
| readr | parse_number[1]; read_lines[1] |
| rebus | literal[9]; lookahead[4]; one_or_more[3]; or[3]; lookbehind[2]; whole_word[2]; ALPHA[1]; digit[1]; PRINT[1]; WRD[1] |
| rvest | html_nodes[2]; html_attr[1]; html_text[1] |
| stringr | str_remove[6]; str_c[4]; str_replace_all[4]; str_detect[3]; fixed[2]; str_extract[2]; str_to_title[2]; str_count[1]; str_remove_all[1]; str_replace[1]; str_to_upper[1]; str_trim[1] |
| tibble | enframe[1] |
| tictoc | tic[1]; toc[1] |
| tidyr | pivot_wider[2]; unite[2]; unnest[1] |
| wesanderson | wes_palette[1] |
| xml2 | read_html[2] |

Carl Goodwin
IBM Data Scientist & Growth Strategy Leader