A Frosty Deal?

Graphic by Carl Goodwin

Following the news on the will-they-won’t-they post-Brexit trade negotiations with the EU, days of optimism are jarred by days of gloom. Do negative news articles, when one wants a positive outcome, leave a deeper impression?

I wondered if I could get a more objective view from quantitative analysis of textual data. To do this, I’m going to look at hundreds of articles published in the Guardian newspaper over the course of the year to see how trade-talk sentiment has changed week-to-week.

library(tidyverse)
library(rebus)
library(wesanderson)
library(kableExtra)
library(lubridate)
library(GuardianR)
library(quanteda)
library(scales)
library(tictoc)
library(patchwork)
library(text2vec)
library(topicmodels)
library(zoo)
library(glue)
theme_set(theme_bw())

cols <- wes_palette(name = "Chevalier1")

The Withdrawal Agreement between the UK and the European Union was signed on the 24th of January 2020. I’ll import Brexit-related newspaper articles from that date.

The Guardian newspaper asks for requests to span no more than 1 month at a time. So I’ll first create a set of monthly date ranges.

dates_df <- tibble(start_date = seq(ymd("2020-01-24"), today(), by = "1 month")) %>%
  mutate(end_date = start_date + months(1) - 1)

dates_df %>%
  kable()
start_date end_date
2020-01-24 2020-02-23
2020-02-24 2020-03-23
2020-03-24 2020-04-23
2020-04-24 2020-05-23
2020-05-24 2020-06-23
2020-06-24 2020-07-23
2020-07-24 2020-08-23
2020-08-24 2020-09-23
2020-09-24 2020-10-23

I’ll import the newspaper articles in monthly chunks. Note that access to the Guardian’s API requires a key, which may be requested here.
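
The requests below reference a key object which isn’t defined in this post. As a minimal sketch, the key could be read from an environment variable rather than hard-coded in the script (the variable name GUARDIAN_API_KEY is purely illustrative):

# Read the API key from an environment variable, e.g. one set in .Renviron.
# The variable name here is illustrative, not one mandated by the Guardian API.
key <- Sys.getenv("GUARDIAN_API_KEY")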

tic()

# Pull the articles month by month; the key object holds the Guardian API key.
article_df <-
  dates_df %>%
  pmap_dfr(function(start_date, end_date) {
    Sys.sleep(1) # pause briefly between requests
    get_guardian(
      "brexit",
      from.date = start_date,
      to.date = end_date,
      api.key = key
    )
  })

toc()

The data need a little cleaning: for example, removing multi-topic live-blog articles, HTML tags and non-breaking spaces.

trade_df <-
  article_df %>%
  filter(!str_detect(id, "/live/"), sectionId %in% c("world", "politics", "business")) %>% 
  mutate(
    body = str_remove_all(body, "<.*?>") %>% str_to_lower(),
    body = str_remove_all(body, "[^a-z0-9 .-]"),
    body = str_remove_all(body, "nbsp")
  )

A corpus then gives me a collection of texts in which each document is a newspaper article.

trade_corp <- trade_df %>% 
  corpus(docid_field = "shortUrl", text_field = "body")

Although I’ve only imported articles mentioning Brexit since the Withdrawal Agreement was signed, some of these articles will not be related to trade negotiations with the EU. For example, there are on-going negotiations with many countries around the world. So, I’m going to use word embeddings to help narrow the focus to the specific context of the UK-EU trade deal.

The chief negotiator for the EU is Michel Barnier, so I’ll quantitatively identify words in close proximity to “Barnier” in the context of these Brexit news articles.

window <- 5

trade_fcm <- 
  trade_corp %>% 
  fcm(context = "window", window = window, count = "weighted", weights = window:1)

glove <- GlobalVectors$new(rank = 60, x_max = 10)

set.seed(42)

wv_main <- glove$fit_transform(trade_fcm, n_iter = 10)
## INFO  [18:12:41.398] epoch 1, loss 0.3801 
## INFO  [18:12:43.295] epoch 2, loss 0.2511 
## INFO  [18:12:45.176] epoch 3, loss 0.2226 
## INFO  [18:12:47.151] epoch 4, loss 0.2020 
## INFO  [18:12:49.027] epoch 5, loss 0.1846 
## INFO  [18:12:50.908] epoch 6, loss 0.1710 
## INFO  [18:12:52.791] epoch 7, loss 0.1607 
## INFO  [18:12:54.655] epoch 8, loss 0.1527 
## INFO  [18:12:56.530] epoch 9, loss 0.1462 
## INFO  [18:12:58.403] epoch 10, loss 0.1407
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)

search_coord <- 
  word_vectors["barnier", , drop = FALSE]

word_vectors %>% 
  sim2(search_coord, method = "cosine") %>% 
  as_tibble(rownames = NA) %>% 
  rownames_to_column("term") %>% 
  rename(similarity = 2) %>% 
  arrange(desc(similarity)) %>% 
  slice(1:10) %>%
  kable()
term similarity
barnier 1.0000000
michel 0.8209618
negotiator 0.7974649
frost 0.7412635
brussels 0.6720059
eus 0.6494242
chief 0.6235152
negotiators 0.5975188
eu 0.5534461
downing 0.5496184

Word embedding is a learned modelling technique which places words in a multi-dimensional vector space such that contextually-similar words lie close together. Not surprisingly, one of the closest words contextually is “michel”. And as he is the chief negotiator for the EU, “eus”, “chief” and “negotiator” also rank among the most contextually-similar words.
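
To make “close by” concrete, the cosine similarity that sim2 computes above can be reproduced directly from two rows of the embedding matrix. A minimal sketch, assuming the word_vectors matrix built earlier (the cosine_sim helper is hypothetical, not part of text2vec):

# Cosine similarity: the dot product of two vectors scaled by their lengths.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# This should reproduce the "frost" entry in the similarity table above.
cosine_sim(word_vectors["barnier", ], word_vectors["frost", ])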

The word-embedding algorithm, through word co-occurrence, has also identified the name of Michel Barnier’s UK counterpart, David Frost. So filtering articles for “Barnier”, “Frost” and “UK-EU” should help narrow the focus.

context_df <- 
  trade_df %>% 
  filter(str_detect(body, "barnier|frost|uk-eu")) 

context_corp <- 
  context_df %>% 
  corpus(docid_field = "shortUrl", text_field = "body")

I can then use quanteda’s kwic function to review the key phrases in context to ensure I’m homing in on the texts I want. Short URLs are included below so I can click on any to read the actual article as presented by The Guardian.

set.seed(123)

context_corp %>%
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  kwic(pattern = phrase(c("trade negotiation", "trade deal", "trade talks")), 
       valuetype = "regex", window = 7) %>%
  as_tibble() %>%
  left_join(article_df, by = c("docname" = "shortUrl")) %>% 
  slice_sample(n = 10) %>% 
  select(docname, pre, keyword, post, headline) %>%
  kable()
docname pre keyword post headline
https://gu.com/p/ej6yt eu has said there will be no trade deal without an agreement on fishing as it Time-wasting UK makes post-Brexit deal unlikely, says Barnier
https://gu.com/p/ezkxc it threatens to damage british prospects of trade deals with the us and eu it puts Tuesday briefing: Rancour as law-breaking bill goes forward
https://gu.com/p/ene7e all circumstances who would want to agree trade deals with a country that doesnt implement international Senior Tories urge ministers to scrap ‘illegal’ Brexit rule plan
https://gu.com/p/d7d9t people we have who have been negotiating trade deals forever she said while people criticise the Brexit trade talks: EU to back Spain over Gibraltar claims
https://gu.com/p/ezxv5 uk and japan have agreed a free trade deal as britain races to secure easy access UK government hails ‘historic’ trade deal with Japan
https://gu.com/p/emhd4 even if it does not strike a trade deal with the eu during a visit to Brussels rules out summit intervention in troubled Brexit talks
https://gu.com/p/d7n4t alignment with eu rules in any brexit trade deal while brussels threatened to put tariffs on Pound falls as Boris Johnson takes tough line on EU trade deal
https://gu.com/p/dnvbj recovery comes first and last and a trade deal with the uk is of course a Fears Brexit talks could collapse in June but UK still optimistic
https://gu.com/p/d94j9 to be willing to accept an australian-style trade deal with the eu by reminding meps that Ursula von der Leyen mocks Boris Johnson’s stance on EU trade deal
https://gu.com/p/fxzfx democrats in warning that a us-uk free trade deal could be blocked if the northern ireland Von der Leyen says parts of Brexit talks ‘completely open’ ahead of showdown

Quanteda provides a sentiment dictionary which, in addition to identifying positive and negative words, also captures negated negatives and negated positives, for example “not effective”. For each week’s worth of articles, I’ll calculate the proportion of positive sentiment.

tic()

sent_df <- 
  context_corp %>% 
  dfm(dictionary = data_dictionary_LSD2015) %>% 
  as_tibble() %>%
  left_join(context_df, by = c("doc_id" = "shortUrl")) %>% 
  mutate(
    pos = positive + neg_negative, # negated negatives count as positive
    neg = negative + neg_positive, # negated positives count as negative
    date = ceiling_date(as_date(webPublicationDate), "week"), # roll up to the week boundary
    pct_pos = pos / (pos + neg)
  )

sent_df %>% 
  select(doc_id, starts_with("pos"), starts_with("neg")) %>% 
  slice(1:10) %>% 
  kable()
doc_id positive pos negative neg_positive neg_negative neg
https://gu.com/p/d6qhb 40 40 22 0 0 22
https://gu.com/p/d9e9j 27 27 15 0 0 15
https://gu.com/p/d6bt2 37 37 7 0 0 7
https://gu.com/p/d6kzd 51 52 27 0 1 27
https://gu.com/p/d9vjq 13 13 23 0 0 23
https://gu.com/p/d7n8b 57 57 34 1 0 35
https://gu.com/p/d79cn 56 57 48 3 1 51
https://gu.com/p/d6t3c 28 28 26 0 0 26
https://gu.com/p/d696t 15 15 21 1 0 22
https://gu.com/p/d9xtf 33 33 13 1 0 14

summary_df <- sent_df %>%
  group_by(date) %>% 
  summarise(pct_pos = mean(pct_pos), n = n())

toc()
## 0.686 sec elapsed

Plotting the changing proportion of positive sentiment over time did surprise me a little. The outcome was more balanced than I expected, which perhaps confirms the deeper impression left on me by negative articles.

The upper plot shows a violin for each week summarising the distribution of sentiment across that week’s articles, with a dashed line tracking the weekly mean. Individual articles range from 20% to 80% positive, with discernible periods of relatively negative and relatively positive sentiment.

The lower plot shows the volume of articles. As we draw closer to the crunch point, the volume appears to be picking up.

p1 <- sent_df %>% 
  ggplot(aes(date, pct_pos)) +
  geom_violin(aes(group = date), alpha = 0.5, fill = cols[1]) +
  geom_line(data = summary_df, aes(date, pct_pos), colour = cols[1], linetype = "dashed") +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = cols[4], size = 1) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = NULL, y = "Positive Sentiment")

p2 <- summary_df %>% 
  ggplot(aes(date, n)) +
  geom_line(colour = cols[1]) +
  labs(x = "Weeks", y = "Article Count",
       caption = "Source: Guardian Newspaper")

p1 / p2 + 
  plot_layout(heights = c(2, 1))

Adopting an approach suggested by titaniumtroop in the Disqus comments, we could also visualise a rolling 7-day mean, with a ribbon that narrows as the variation in sentiment narrows.

width <- 7

sent_df2 <- sent_df %>%
  mutate(web_date = as_date(webPublicationDate)) %>% 
  group_by(web_date) %>%
  summarise(pct_pos = sum(pos) / sum(neg + pos)) %>% 
  mutate(
    roll_mean = rollapply(pct_pos, width, mean, partial = TRUE, align = "right"),
    roll_lq = rollapply(pct_pos, width, quantile, probs = 0.25, partial = TRUE, align = "right"),
    roll_uq = rollapply(pct_pos, width, quantile, probs = 0.75, partial = TRUE, align = "right")
  )

sent_df2 %>%
  ggplot(aes(web_date)) +
  geom_line(aes(y = roll_mean), colour = cols[1]) +
  geom_ribbon(aes(ymin = roll_lq, ymax = roll_uq), alpha = 0.33, fill = cols[1]) +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = cols[4], size = 1) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Changing Sentiment Towards a UK-EU Trade Deal",
    subtitle = glue("Rolling {width} days Since the Withdrawal Agreement"),
    x = NULL, y = "Positive Sentiment"
  )

Some writers exhibit more sentiment variation than others.

byline_df <- 
  sent_df %>% 
  mutate(byline = word(byline, 1, 2) %>% str_remove_all(PUNCT)) %>% 
  group_by(byline, date) %>% 
  summarise(pct_pos = mean(pct_pos), n = n())

top_3 <- byline_df %>% 
  count(byline, sort = TRUE) %>% 
  ungroup() %>% 
  filter(!is.na(byline)) %>% 
  slice(c(3, 2)) %>% 
  pull(byline)

byline_df %>% 
  filter(byline %in% top_3) %>% 
  ggplot(aes(date, pct_pos, colour = byline)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[2]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  scale_colour_manual(values = cols[c(1, 4)]) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = "Weeks", y = "Positive Sentiment", colour = "Byline", 
       caption = "Source: Guardian Newspaper")

R Toolbox

The packages and functions used in this post are summarised below. This feeds a separate toolbox visualisation of package and function usage across all posts.

Package Function
base c[8]; sum[3]; function[2]; mean[2]; set.seed[2]; conflicts[1]; cumsum[1]; is.na[1]; months[1]; search[1]; seq[1]; Sys.sleep[1]
dplyr mutate[10]; filter[8]; as_tibble[4]; group_by[4]; summarise[4]; if_else[3]; n[3]; select[3]; slice[3]; tibble[3]; arrange[2]; desc[2]; left_join[2]; starts_with[2]; count[1]; pull[1]; rename[1]; slice_sample[1]; ungroup[1]
ggplot2 aes[8]; geom_line[4]; ggplot[4]; labs[4]; geom_hline[3]; scale_y_continuous[3]; geom_ribbon[1]; geom_violin[1]; scale_colour_manual[1]; theme_bw[1]; theme_set[1]
glue glue[1]
GuardianR get_guardian[1]
kableExtra kable[5]
lubridate date[3]; as_date[2]; ceiling_date[1]; today[1]; ymd[1]
patchwork plot_layout[1]
purrr map[1]; map2_dfr[1]; pmap_dfr[1]; possibly[1]; set_names[1]
quanteda corpus[2]; data_dictionary_LSD2015[1]; dfm[1]; fcm[1]; kwic[1]; phrase[1]; t[1]; tokens[1]
readr read_lines[1]
rebus literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]; PUNCT[1]
scales percent_format[3]
stringr str_detect[5]; str_remove_all[5]; str_c[2]; str_remove[2]; str_count[1]; str_starts[1]; str_to_lower[1]; word[1]
text2vec sim2[1]
tibble enframe[1]; rownames_to_column[1]
tictoc tic[2]; toc[2]
tidyr unnest[1]
wesanderson wes_palette[1]
zoo rollapply[3]