Plots Thicken

Graphic by Carl Goodwin
theme_set(theme_bw())

cols <- wes_palette(8, name =  "IsleofDogs1", type = "continuous")

One could think of data science as “art grounded in facts”. It tells a story through visualisation. Both story and visualisation rely on a good plot. And an abundance of those has evolved over time. Many have their own dedicated Wikipedia page!

Which generate the most interest? How is the interest in each trending over time? Let’s build an interactive app to find out.

I’m going to start by harvesting some data from Wikipedia’s “Statistical charts and diagrams” category. I can use this to build a list of all chart types which have a dedicated Wikipedia article page. Using rvest inside the app ensures it will respond to any newly-created articles.

charts <-
  read_html("https://en.wikipedia.org/wiki/Category:Statistical_charts_and_diagrams") %>%
  html_nodes(".mw-category-group a") %>%
  html_text() %>%
  tibble(chart = .)
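
To see what the ".mw-category-group a" selector picks up without hitting Wikipedia, here is a minimal offline sketch. The HTML snippet is a hypothetical stand-in that I am assuming resembles the category page's markup, and the name demo_charts is mine, not part of the app.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for the category page, so the example runs offline.
demo_html <- paste0(
  '<div class="mw-category-group"><h3>B</h3>',
  '<ul><li><a href="/wiki/Bar_chart">Bar chart</a></li></ul></div>',
  '<div class="mw-category-group"><h3>H</h3>',
  '<ul><li><a href="/wiki/Histogram">Histogram</a></li></ul></div>',
  '<a href="/wiki/Main_Page">Main page</a>'
)

# Same pipeline as above: keep only links nested inside a category group.
demo_charts <- read_html(demo_html) %>%
  html_nodes(".mw-category-group a") %>%
  html_text() %>%
  tibble(chart = .)

demo_charts
```

The link outside the category groups is ignored, so only the two chart titles end up in the tibble.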

The pageviews package provides an interface to Wikipedia’s page-view API. I’ll write a wrapper function around article_pageviews so I can later iterate through a subset of the list established in the prior code chunk.

pv <- function(article) {
  article_pageviews(
    project = "en.wikipedia",
    article,
    user_type = "user",
    start = "2015070100",
    end = Sys.Date()
  )
}

I want an input selector so that a user can choose plot types for comparison. I also want to provide user control of the y-axis scale. A combination of fixed and log10 is better for comparing plots. Free scaling reveals more detail in the individual trends.

ui <-
  fluidPage(
    theme = shinytheme("sandstone"),
    titlePanel(NULL, windowTitle = "Plot plotter"),
    sidebarLayout(
      sidebarPanel(
        wellPanel(
          helpText(
            "Choose up to 8 Wikipedia article titles to compare. The selection list is from the category: \"Statistical charts and diagrams\"."
          ),
          selectizeInput(
            inputId = "article",
            label = "Chart type:",
            choices = charts,
            selected = c(
              "Violin plot",
              "Dendrogram",
              "Histogram",
              "Pie chart"
            ),
            options = list(maxItems = 8),
            multiple = TRUE
          )
        ),
        wellPanel(
          helpText(
            "\"Fixed\" with \"log 10\" scaling (i.e. 10, 100, 1,000) works best for a visual comparison of chart types. \"Free\" is better for examining individual chart trends."
          ),
          selectInput(
            inputId = "scales",
            label = "Fixed or free (y-axis) scale:",
            choices = c("Fixed" = "fixed", "Free" = "free_y"),
            selected = "free_y"
          ),
          selectInput(
            inputId = "log10",
            label = "Log 10 or normal (y-axis) scale:",
            choices = c("Log 10" = "log10", "Normal" = "norm"),
            selected = "norm"
          )
        ),
        img(
          src = "logo.png",
          height = 55
        )
      ),

      mainPanel(plotOutput(outputId = "line"))
    )
  )

An earlier version used map_dfr to pre-load a data frame with the pageview data for all chart types (there are more than 100). Profiling with profvis prompted the more efficient approach of loading the data only for the user’s selection (a maximum of 8).

Profvis also showed that attempting to round the corners of the plot.background with additional grid package code was expensive. App efficiency felt more important than minor cosmetic detailing that users would probably barely notice.

server <- function(input, output, session) {
  subsetr <- reactive({
    req(input$article)
    map_dfr(input$article, pv) %>%
      mutate(
        date = ymd(date),
        article = str_replace_all(article, "_", " ")
      )
  })

  output$line <- renderPlot({
    p <- ggplot(
      subsetr(),
      aes(date,
        views,
        colour = article
      )
    ) +
      geom_line() +
      theme_bw() +
      theme(
        rect = element_rect(fill = "#f9f5f1"),
        plot.background = element_rect(fill = "#f9f5f1")
      ) +
      scale_colour_manual(values = cols) +
      geom_smooth(colour = cols[7]) +
      facet_wrap(~article, nrow = 1, scales = input$scales) +
      theme(
        plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1)
      ) +
      labs(
        x = NULL, y = NULL,
        title = "Wikipedia Daily Page Views\n",
        caption = "\nSource: en.wikipedia (excludes bots)"
      )

    p2 <- switch(input$log10,
      norm =
        p,
      log10 =
        p + scale_y_log10(breaks = c(1, 10, 100, 1000, 10000))
    )
    p2
  })
}

shinyApp(ui = ui, server = server, options = list(height = 1000))

Note the utility of selecting the right scaling. The combination of “fixed” and “normal” reveals what must have been “world histogram day” on July 27th 2015, but little else.
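
The effect of the two scale controls can be tried outside the app. The sketch below uses simulated counts (hypothetical numbers, not Wikipedia data), and the object names are my own.

```r
library(ggplot2)

set.seed(1)

# Simulated daily views for two articles with very different traffic levels.
demo_views <- data.frame(
  date = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 60), 2),
  article = rep(c("Histogram", "Violin plot"), each = 60),
  views = c(rpois(60, 5000), rpois(60, 200))
)

base_plot <- ggplot(demo_views, aes(date, views)) +
  geom_line()

# "Fixed" with "Log 10": one shared y-axis, so the panels are directly comparable.
fixed_log <- base_plot +
  facet_wrap(~article, nrow = 1, scales = "fixed") +
  scale_y_log10()

# "Free" with "Normal": each panel gets its own y-axis, revealing individual detail.
free_normal <- base_plot +
  facet_wrap(~article, nrow = 1, scales = "free_y")
```

Printing fixed_log and free_normal side by side shows the trade-off: the first makes the gap between the two articles obvious, the second makes the day-to-day variation within each article easier to read.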

Turning non-interactive code into an app sharpens the mind’s focus on performance. And profvis, integrated into RStudio via the profile menu option, is a wonderful “tool for helping you understand how R spends its time”.

My first version of the app was finger-tappingly slow.

Profvis revealed the main culprit to be the pre-loading of a dataframe with the page-view data for all chart types (there are more than 100). Profiling prompted the more efficient “reactive” approach of loading the data only for the user’s selection (maximum of 8).
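
A rough sketch of that comparison follows, assuming a hypothetical fake_fetch function that stands in for the real API call. It uses profvis's pause function rather than Sys.sleep so the simulated wait is visible to the profiler. The timings are artificial and only the relative difference matters.

```r
library(profvis)

# Hypothetical stand-in for a slow per-article API request.
fake_fetch <- function(article) {
  pause(0.01)
  data.frame(article = article, views = 1)
}

all_articles <- paste("Chart", 1:100) # the category has more than 100 articles
selected <- all_articles[1:8]         # the app allows a maximum of 8

# Pre-load everything versus load only the user's selection.
load_all <- function() do.call(rbind, lapply(all_articles, fake_fetch))
load_selected <- function() do.call(rbind, lapply(selected, fake_fetch))

# Profile both approaches; print `prof` in an interactive session to explore it.
prof <- profvis({
  load_all()
  load_selected()
})
```

In the resulting flame graph, load_all should account for most of the time, which is the pattern that pointed me towards fetching on demand inside a reactive.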

Profiling also showed that rounding the corners of the plot.background with additional grid-package code was expensive. App efficiency felt more important than minor cosmetic detailing (to the main panel to match the theme’s side panel). And most users would probably barely notice (had I not drawn attention to it here).

R Toolbox

Listing the packages and functions used in this post lets me separately build a toolbox visualisation of package and function usage across all posts.

Package      Function
base         library[9]; c[4]; function[3]; list[2]; conflicts[1]; cumsum[1]; search[1]; sum[1]; switch[1]; Sys.Date[1]
dplyr        mutate[5]; filter[4]; if_else[3]; tibble[3]; arrange[1]; as_tibble[1]; desc[1]; group_by[1]; select[1]; summarise[1]
ggplot2      element_rect[2]; element_text[2]; theme[2]; theme_bw[2]; aes[1]; facet_wrap[1]; geom_line[1]; geom_smooth[1]; ggplot[1]; labs[1]; scale_colour_manual[1]; scale_y_log10[1]; theme_set[1]
graphics     axis[2]
kableExtra   kable[1]
lubridate    date[1]; ymd[1]
pageviews    article_pageviews[1]
purrr        map[1]; map_dfr[1]; map2_dfr[1]; possibly[1]; set_names[1]
readr        cols[1]; read_lines[1]
rebus        literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]
rvest        html_nodes[1]; html_text[1]
shiny        helpText[2]; selectInput[2]; wellPanel[2]; fluidPage[1]; img[1]; mainPanel[1]; plotOutput[1]; reactive[1]; renderPlot[1]; req[1]; selectizeInput[1]; shinyApp[1]; sidebarLayout[1]; sidebarPanel[1]; titlePanel[1]
shinythemes  shinytheme[1]
stringr      str_detect[3]; str_c[2]; str_remove[2]; str_count[1]; str_remove_all[1]; str_replace_all[1]
tibble       enframe[1]
tidyr        tibble[3]; as_tibble[1]; unnest[1]
wesanderson  wes_palette[1]
xml2         read_html[1]
Carl Goodwin
IBM Data Scientist & Growth Leader