A Frosty Deal?

Graphic by Carl Goodwin

Reading the news on the will-they-won’t-they post-Brexit trade negotiations with the EU, one sees days of optimism jarred by days of gloom. Do negative news articles, when one wants a positive outcome, leave a deeper impression?

I wondered whether I could get a more objective view from a quantitative analysis of the text itself. To do this, I’m going to look at hundreds of articles published in the Guardian newspaper over the course of the year to see how trade-talk sentiment has changed week-to-week.

library(tidyverse)
library(rebus)
library(wesanderson)
library(kableExtra)
library(lubridate)
library(GuardianR)
library(quanteda)
library(scales)
library(tictoc)
library(patchwork)
library(text2vec)
library(topicmodels)
library(zoo)
library(glue)
theme_set(theme_bw())

cols <- wes_palette(name = "Chevalier1")

The Withdrawal Agreement between the UK and the European Union was signed on the 24th of January 2020. I’ll import Brexit-related newspaper articles from that date.

The Guardian newspaper asks for requests to span no more than 1 month at a time. So I’ll first create a set of monthly date ranges.

dates_df <- tibble(start_date = seq(ymd("2020-01-24"), today(), by = "1 month")) %>%
  mutate(end_date = start_date + months(1) - 1)

dates_df %>%
  kbl() %>% 
  kable_material()
start_date end_date
2020-01-24 2020-02-23
2020-02-24 2020-03-23
2020-03-24 2020-04-23
2020-04-24 2020-05-23
2020-05-24 2020-06-23
2020-06-24 2020-07-23
2020-07-24 2020-08-23
2020-08-24 2020-09-23
2020-09-24 2020-10-23
2020-10-24 2020-11-23

I’ll import the newspaper articles in monthly chunks. Note that access to the Guardian’s API requires a key, which may be requested from the Guardian’s open platform.
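The key object used in the next chunk holds that API key. As a minimal sketch, and an assumption rather than the author’s method, it could be kept in an environment variable (for example set in .Renviron) rather than hard-coded:

# Assumption: the API key is stored in an environment variable named
# GUARDIAN_KEY; the variable name is illustrative, not from the original post.
key <- Sys.getenv("GUARDIAN_KEY")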

tic()

article_df <-
  dates_df %>%
  pmap_dfr(function(start_date, end_date) {
    Sys.sleep(1) # brief pause between monthly requests
    get_guardian(
      "brexit",
      from.date = start_date,
      to.date = end_date,
      api.key = key
    )
  })

toc()

The data need a little cleaning, for example to remove multi-topic live articles, HTML tags and non-breaking spaces.

trade_df <-
  article_df %>%
  filter(!str_detect(id, "/live/"), sectionId %in% c("world", "politics", "business")) %>% 
  mutate(
    body = str_remove_all(body, "<.*?>") %>% str_to_lower(),
    body = str_remove_all(body, "[^a-z0-9 .-]"),
    body = str_remove_all(body, "nbsp")
  )

A corpus then gives me a collection of texts in which each document is a newspaper article.

trade_corp <- trade_df %>% 
  corpus(docid_field = "shortUrl", text_field = "body")
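A quick sanity check of the corpus, not part of the original post, confirms one document per article and shows the metadata carried along as document variables:

ndoc(trade_corp)            # number of articles in the corpus
head(docnames(trade_corp))  # the short URLs used as document ids
head(docvars(trade_corp))   # article metadata retained as docvars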

Although I’ve only imported articles mentioning Brexit since the Withdrawal Agreement was signed, some of these articles will not be related to trade negotiations with the EU. For example, there are on-going negotiations with many countries around the world. So, I’m going to use word embeddings to help narrow the focus to the specific context of the UK-EU trade deal.

The chief negotiator for the EU is Michel Barnier, so I’ll quantitatively identify words in close proximity to “Barnier” in the context of these Brexit news articles.

window <- 5

trade_fcm <- 
  trade_corp %>% 
  fcm(context = "window", window = window, count = "weighted", weights = window:1)

glove <- GlobalVectors$new(rank = 60, x_max = 10)

set.seed(42)

wv_main <- glove$fit_transform(trade_fcm, n_iter = 10)
## INFO  [20:09:01.914] epoch 1, loss 0.3798 
## INFO  [20:09:04.047] epoch 2, loss 0.2526 
## INFO  [20:09:06.307] epoch 3, loss 0.2243 
## INFO  [20:09:08.458] epoch 4, loss 0.2038 
## INFO  [20:09:10.571] epoch 5, loss 0.1868 
## INFO  [20:09:12.687] epoch 6, loss 0.1736 
## INFO  [20:09:14.824] epoch 7, loss 0.1636 
## INFO  [20:09:16.982] epoch 8, loss 0.1558 
## INFO  [20:09:19.130] epoch 9, loss 0.1495 
## INFO  [20:09:21.292] epoch 10, loss 0.1442
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)

search_coord <- 
  word_vectors["barnier", , drop = FALSE]

word_vectors %>% 
  sim2(search_coord, method = "cosine") %>% 
  as_tibble(rownames = NA) %>% 
  rownames_to_column("term") %>% 
  rename(similarity = 2) %>% 
  arrange(desc(similarity)) %>% 
  slice(1:10) %>%
  kbl() %>% 
  kable_material()
term similarity
barnier 1.0000000
frost 0.8202823
michel 0.7897546
negotiator 0.7760485
brussels 0.6902393
eus 0.6723040
chief 0.6580897
negotiators 0.6517455
team 0.6461421
talks 0.5924516

Word embedding is a learned modelling technique which places words in a multi-dimensional vector space such that contextually-similar words sit close together. Not surprisingly, one of the closest terms is “michel”. And as he is the EU’s chief negotiator, “eus”, “chief” and “negotiator” also appear among the most contextually-similar terms.
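As an aside, and not part of the original analysis, the same word_vectors matrix can be queried for any other term of interest, for example “frost”, assuming that token is in the fitted vocabulary:

# Illustrative extra query: the terms closest to "frost" in the embedding space
frost_coord <- word_vectors["frost", , drop = FALSE]

word_vectors %>% 
  sim2(frost_coord, method = "cosine") %>% 
  as_tibble(rownames = NA) %>% 
  rownames_to_column("term") %>% 
  rename(similarity = 2) %>% 
  arrange(desc(similarity)) %>% 
  slice(1:10)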

The word-embedding model, through word co-occurrence, has also picked out the surname of Michel Barnier’s UK counterpart, David Frost. So filtering articles for “barnier”, “frost” and “uk-eu” should help narrow the focus.

context_df <- 
  trade_df %>% 
  filter(str_detect(body, "barnier|frost|uk-eu")) 

context_corp <- 
  context_df %>% 
  corpus(docid_field = "shortUrl", text_field = "body")

I can then use quanteda’s kwic function to review the key phrases in context to ensure I’m homing in on the texts I want. Short URLs are included below so I can click on any to read the actual article as presented by The Guardian.

set.seed(123)

context_corp %>%
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  kwic(pattern = phrase(c("trade negotiation", "trade deal", "trade talks")), 
       valuetype = "regex", window = 7) %>%
  as_tibble() %>%
  left_join(article_df, by = c("docname" = "shortUrl")) %>% 
  slice_sample(n = 10) %>% 
  select(docname, pre, keyword, post, headline) %>%
  kbl() %>% 
  kable_material()
docname pre keyword post headline
https://gu.com/p/ezjdz said the downsides with the eu free trade deal the us free trade deal and our Brexit bill hugely damaging to UK’s reputation, says ex-ambassador
https://gu.com/p/d7d9t people we have who have been negotiating trade deals forever she said while people criticise the Brexit trade talks: EU to back Spain over Gibraltar claims
https://gu.com/p/eyzhq played down the prospect of reaching a trade deal with the eu in time for december No 10 blames EU and plays down prospects of Brexit trade deal
https://gu.com/p/dnvbj personal rapport when communicating remotely related post-brexit trade talks with eu on course to fail johnson Fears Brexit talks could collapse in June but UK still optimistic
https://gu.com/p/f63be for a no-deal outcome on january the trade talks are over the eu has effectively ended What was the point of Johnson’s Brexit statement? To save face
https://gu.com/p/fbqf9 in the approach of the us on trade talks with the uk biden was barack obamas What does Biden’s win mean for Brexit?
https://gu.com/p/fd8tz use a provisional application meaning a renewed trade deal would come into force before the 21-day UK trade department faces race to get £80bn of trade agreements ratified
https://gu.com/p/ea933 was essential regardless of the outcome of trade talks between britain and the eu and called Three in four UK firms unprepared for Brexit, study shows
https://gu.com/p/dbhnh rejected the eus opening offer for a trade deal and said it did not recognise the Downing Street rejects EU’s ‘onerous’ opening trade offer
https://gu.com/p/dbhnh such as canada and japan when signing trade deals with them we just want the same Downing Street rejects EU’s ‘onerous’ opening trade offer

Quanteda provides the Lexicoder Sentiment Dictionary (data_dictionary_LSD2015) which, in addition to identifying positive and negative words, also captures negated negatives and negated positives such as, for example, “not effective”. For each week’s worth of articles, I’ll calculate the proportion of positive sentiment.
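As a toy illustration, not part of the original analysis, applying the dictionary to a single negated phrase shows the four categories it scores (positive, negative, neg_positive, neg_negative):

# Toy example only: count dictionary matches in one short sentence
tokens("the deal was not effective") %>% 
  tokens_lookup(dictionary = data_dictionary_LSD2015) %>% 
  dfm()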

tic()

sent_df <- 
  context_corp %>% 
  dfm(dictionary = data_dictionary_LSD2015) %>% 
  as_tibble() %>%
  left_join(context_df, by = c("doc_id" = "shortUrl")) %>% 
  mutate(
    pos = positive + neg_negative,
    neg = negative + neg_positive,
    date = ceiling_date(as_date(webPublicationDate), "week"),
    pct_pos = pos / (pos + neg)
  )

sent_df %>% 
  select(doc_id, starts_with("pos"), starts_with("neg")) %>% 
  slice(1:10) %>% 
  kbl() %>% 
  kable_material()
doc_id positive pos negative neg_positive neg_negative neg
https://gu.com/p/d6qhb 40 40 22 0 0 22
https://gu.com/p/d9e9j 27 27 15 0 0 15
https://gu.com/p/d6bt2 37 37 7 0 0 7
https://gu.com/p/d9vjq 13 13 23 0 0 23
https://gu.com/p/d6kzd 51 52 27 0 1 27
https://gu.com/p/d7n8b 57 57 34 1 0 35
https://gu.com/p/d6t3c 28 28 26 0 0 26
https://gu.com/p/d79cn 56 57 48 3 1 51
https://gu.com/p/d9xtf 33 33 13 1 0 14
https://gu.com/p/d696t 15 15 21 1 0 22
summary_df <- sent_df %>% 
  group_by(date) %>% 
  summarise(pct_pos = mean(pct_pos), n = n())

toc()
## 1.202 sec elapsed

Plotting the changing proportion of positive sentiment over time did surprise me a little. The outcome was more balanced than I expected, which perhaps confirms the deeper impression left on me by negative articles.

The upper plot shows the weekly distribution of sentiment as violins, with the dashed line tracing the weekly mean. Individual articles range from roughly 20% to 80% positive, with discernible periods of relatively negative and relatively positive sentiment.

The lower plot shows the volume of articles. As we draw closer to the crunch point, the volume appears to be picking up.

p1 <- sent_df %>% 
  ggplot(aes(date, pct_pos)) +
  geom_violin(aes(group = date), alpha = 0.5, fill = cols[1]) +
  geom_line(data = summary_df, aes(date, pct_pos), colour = cols[1], linetype = "dashed") +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = cols[4], size = 1) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = NULL, y = "Positive Sentiment")

p2 <- summary_df %>% 
  ggplot(aes(date, n)) +
  geom_line(colour = cols[1]) +
  labs(x = "Weeks", y = "Article Count",
       caption = "Source: Guardian Newspaper")

p1 / p2 + 
  plot_layout(heights = c(2, 1))

Adopting an approach suggested by titaniumtroop in the Disqus comments, we could visualise a rolling 7-day mean, with a ribbon spanning the rolling lower and upper quartiles so that a narrowing band represents less day-to-day variation in sentiment.

width <- 7

sent_df2 <- sent_df %>%
  mutate(web_date = as_date(webPublicationDate)) %>% 
  group_by(web_date) %>%
  summarise(pct_pos = sum(pos) / sum(neg + pos)) %>% 
  mutate(
    roll_mean = rollapply(pct_pos, width, mean, partial = TRUE, align = "right"),
    roll_lq = rollapply(pct_pos, width, quantile, probs = 0.25, partial = TRUE, align = "right"),
    roll_uq = rollapply(pct_pos, width, quantile, probs = 0.75, partial = TRUE, align = "right")
  )

sent_df2 %>%
  ggplot(aes(web_date)) +
  geom_line(aes(y = roll_mean), colour = cols[1]) +
  geom_ribbon(aes(ymin = roll_lq, ymax = roll_uq), alpha = 0.33, fill = cols[1]) +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = cols[4], size = 1) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Changing Sentiment Towards a UK-EU Trade Deal",
    subtitle = glue("Rolling {width} days Since the Withdrawal Agreement"),
    x = NULL, y = "Positive Sentiment"
  )

Some writers exhibit more sentiment variation than others.

byline_df <- 
  sent_df %>% 
  mutate(byline = word(byline, 1, 2) %>% str_remove_all(PUNCT)) %>% 
  group_by(byline, date) %>% 
  summarise(pct_pos = mean(pct_pos), n = n())

top_bylines <- byline_df %>% 
  count(byline, sort = TRUE) %>% 
  ungroup() %>% 
  filter(!is.na(byline)) %>% 
  slice(c(3, 2)) %>%  # the second and third most prolific bylines
  pull(byline)

byline_df %>% 
  filter(byline %in% top_bylines) %>% 
  ggplot(aes(date, pct_pos, colour = byline)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[2]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  scale_colour_manual(values = cols[c(1, 4)]) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = "Weeks", y = "Positive Sentiment", colour = "Byline", 
       caption = "Source: Guardian Newspaper")

R Toolbox

The packages and functions used in this post are summarised below; this feeds a separate toolbox visualisation of package and function usage across all posts.

Package Function
base c[8]; sum[3]; function[2]; mean[2]; set.seed[2]; conflicts[1]; cumsum[1]; is.na[1]; months[1]; search[1]; seq[1]; Sys.sleep[1]
dplyr mutate[10]; filter[8]; as_tibble[4]; group_by[4]; summarise[4]; if_else[3]; n[3]; select[3]; slice[3]; tibble[3]; arrange[2]; desc[2]; left_join[2]; starts_with[2]; count[1]; pull[1]; rename[1]; slice_sample[1]; ungroup[1]
ggplot2 aes[8]; geom_line[4]; ggplot[4]; labs[4]; geom_hline[3]; scale_y_continuous[3]; geom_ribbon[1]; geom_violin[1]; scale_colour_manual[1]; theme_bw[1]; theme_set[1]
glue glue[1]
GuardianR get_guardian[1]
kableExtra kable_material[5]; kbl[5]
lubridate date[3]; as_date[2]; ceiling_date[1]; today[1]; ymd[1]
patchwork plot_layout[1]
purrr map[1]; map2_dfr[1]; pmap_dfr[1]; possibly[1]; set_names[1]
quanteda corpus[2]; data_dictionary_LSD2015[1]; dfm[1]; fcm[1]; kwic[1]; phrase[1]; t[1]; tokens[1]
readr read_lines[1]
rebus literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]; PUNCT[1]
scales percent_format[3]
stringr str_detect[5]; str_remove_all[5]; str_c[2]; str_remove[2]; str_count[1]; str_starts[1]; str_to_lower[1]; word[1]
text2vec sim2[1]
tibble enframe[1]; rownames_to_column[1]
tictoc tic[2]; toc[2]
tidyr unnest[1]
wesanderson wes_palette[1]
zoo rollapply[3]