A Frosty Deal?

Following the news on the will-they-won’t-they post-Brexit trade negotiations with the EU means days of optimism jarred by days of gloom. Do negative news articles, when one wants a positive outcome, leave a deeper impression?
I wondered if I could get a more objective view from quantitative analysis of textual data. To do this, I’m going to look at hundreds of articles published in the Guardian newspaper over the course of the year to see how trade-talk sentiment has changed week-to-week.
library(tidyverse)
library(rebus)
library(wesanderson)
library(kableExtra)
library(lubridate)
library(GuardianR)
library(quanteda)
library(scales)
library(tictoc)
library(patchwork)
library(text2vec)
library(topicmodels)
library(zoo)
library(glue)
theme_set(theme_bw())
cols <- wes_palette(name = "Chevalier1")
The Withdrawal Agreement between the UK and the European Union was signed on the 24th of January 2020. I’ll import Brexit-related newspaper articles from that date.
The Guardian asks that requests to its API span no more than one month at a time, so I’ll first create a set of monthly date ranges.
dates_df <- tibble(start_date = seq(ymd("2020-01-24"), today(), by = "1 month")) %>%
  mutate(end_date = start_date + months(1) - 1)
dates_df %>%
  kbl() %>%
  kable_material()
start_date | end_date |
---|---|
2020-01-24 | 2020-02-23 |
2020-02-24 | 2020-03-23 |
2020-03-24 | 2020-04-23 |
2020-04-24 | 2020-05-23 |
2020-05-24 | 2020-06-23 |
2020-06-24 | 2020-07-23 |
2020-07-24 | 2020-08-23 |
2020-08-24 | 2020-09-23 |
2020-09-24 | 2020-10-23 |
2020-10-24 | 2020-11-23 |
I’ll import the newspaper articles in monthly chunks. Note that access to the Guardian’s API requires a key, which may be requested here.
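The key itself stays out of the post; a minimal sketch of one way to load it, assuming it has been saved to a local text file (the file name here is purely illustrative):
# illustrative: read a personal Guardian API key kept outside version control
key <- read_lines("guardian_api_key.txt")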
tic()
article_df <-
  dates_df %>%
  pmap_dfr(function(start_date, end_date) {
    Sys.sleep(1)
    get_guardian(
      "brexit",
      from.date = start_date,
      to.date = end_date,
      api.key = key
    )
  })
toc()
The data need a little cleaning, for example to remove multi-topic articles, HTML tags and non-breaking spaces.
trade_df <-
  article_df %>%
  filter(!str_detect(id, "/live/"), sectionId %in% c("world", "politics", "business")) %>%
  mutate(
    body = str_remove_all(body, "<.*?>") %>% str_to_lower(),
    body = str_remove_all(body, "[^a-z0-9 .-]"),
    body = str_remove_all(body, "nbsp")
  )
A corpus then gives me a collection of texts in which each document is a newspaper article.
trade_corp <- trade_df %>%
  corpus(docid_field = "shortUrl", text_field = "body")
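A quick sanity check on the corpus (ndoc() simply counts the documents; output omitted):
ndoc(trade_corp)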
Although I’ve only imported articles mentioning Brexit since the Withdrawal Agreement was signed, some of these articles will not be related to trade negotiations with the EU; there are, for example, ongoing trade negotiations with many other countries around the world. So I’m going to use word embeddings to help narrow the focus to the specific context of the UK-EU trade deal.
The chief negotiator for the EU is Michel Barnier, so I’ll quantitatively identify words in close proximity to “Barnier” in the context of these Brexit news articles.
window <- 5
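# feature co-occurrence matrix: within a 5-word window, nearer co-occurrences get higher weights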
trade_fcm <-
  trade_corp %>%
  fcm(context = "window", window = window, count = "weighted", weights = window:1)
glove <- GlobalVectors$new(rank = 60, x_max = 10)
set.seed(42)
wv_main <- glove$fit_transform(trade_fcm, n_iter = 10)
## INFO [20:09:01.914] epoch 1, loss 0.3798
## INFO [20:09:04.047] epoch 2, loss 0.2526
## INFO [20:09:06.307] epoch 3, loss 0.2243
## INFO [20:09:08.458] epoch 4, loss 0.2038
## INFO [20:09:10.571] epoch 5, loss 0.1868
## INFO [20:09:12.687] epoch 6, loss 0.1736
## INFO [20:09:14.824] epoch 7, loss 0.1636
## INFO [20:09:16.982] epoch 8, loss 0.1558
## INFO [20:09:19.130] epoch 9, loss 0.1495
## INFO [20:09:21.292] epoch 10, loss 0.1442
wv_context <- glove$components
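# combine the main and context embeddings; the text2vec docs suggest summing (or averaging) the two sets of vectors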
word_vectors <- wv_main + t(wv_context)
search_coord <-
  word_vectors["barnier", , drop = FALSE]
word_vectors %>%
  sim2(search_coord, method = "cosine") %>%
  as_tibble(rownames = NA) %>%
  rownames_to_column("term") %>%
  rename(similarity = 2) %>%
  arrange(desc(similarity)) %>%
  slice(1:10) %>%
  kbl() %>%
  kable_material()
term | similarity |
---|---|
barnier | 1.0000000 |
frost | 0.8202823 |
michel | 0.7897546 |
negotiator | 0.7760485 |
brussels | 0.6902393 |
eus | 0.6723040 |
chief | 0.6580897 |
negotiators | 0.6517455 |
team | 0.6461421 |
talks | 0.5924516 |
Word embedding is a learned modelling technique that places words in a multi-dimensional vector space such that contextually-similar words sit close together. Not surprisingly, one of the words closest to “barnier” is “michel”. And as he is the chief negotiator for the EU, “eus”, “chief” and “negotiator” also appear among the most contextually-similar terms.
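The same lookup can be run for any term in the vocabulary. As a quick cross-check, a sketch of the nearest neighbours of “frost” using the word vectors fitted above (output omitted):
word_vectors %>%
  sim2(word_vectors["frost", , drop = FALSE], method = "cosine") %>%
  as_tibble(rownames = NA) %>%
  rownames_to_column("term") %>%
  rename(similarity = 2) %>%
  arrange(desc(similarity)) %>%
  slice(1:10)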
The word embeddings algorithm, through word co-occurrence, has identified the name of Michel Barnier’s UK counterpart David Frost. So filtering articles for “Barnier”, “Frost” and “UK-EU” should help narrow the focus.
context_df <-
  trade_df %>%
  filter(str_detect(body, "barnier|frost|uk-eu"))
context_corp <-
  context_df %>%
  corpus(docid_field = "shortUrl", text_field = "body")
I can then use quanteda’s kwic (keywords-in-context) function to review the key phrases in context and ensure I’m homing in on the texts I want. Short URLs are included below so I can click on any to read the actual article as presented by The Guardian.
set.seed(123)
context_corp %>%
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  kwic(pattern = phrase(c("trade negotiation", "trade deal", "trade talks")),
       valuetype = "regex", window = 7) %>%
  as_tibble() %>%
  left_join(article_df, by = c("docname" = "shortUrl")) %>%
  slice_sample(n = 10) %>%
  select(docname, pre, keyword, post, headline) %>%
  kbl() %>%
  kable_material()
docname | pre | keyword | post | headline |
---|---|---|---|---|
https://gu.com/p/ezjdz | said the downsides with the eu free | trade deal | the us free trade deal and our | Brexit bill hugely damaging to UK’s reputation, says ex-ambassador |
https://gu.com/p/d7d9t | people we have who have been negotiating | trade deals | forever she said while people criticise the | Brexit trade talks: EU to back Spain over Gibraltar claims |
https://gu.com/p/eyzhq | played down the prospect of reaching a | trade deal | with the eu in time for december | No 10 blames EU and plays down prospects of Brexit trade deal |
https://gu.com/p/dnvbj | personal rapport when communicating remotely related post-brexit | trade talks | with eu on course to fail johnson | Fears Brexit talks could collapse in June but UK still optimistic |
https://gu.com/p/f63be | for a no-deal outcome on january the | trade talks | are over the eu has effectively ended | What was the point of Johnson’s Brexit statement? To save face |
https://gu.com/p/fbqf9 | in the approach of the us on | trade talks | with the uk biden was barack obamas | What does Biden’s win mean for Brexit? |
https://gu.com/p/fd8tz | use a provisional application meaning a renewed | trade deal | would come into force before the 21-day | UK trade department faces race to get £80bn of trade agreements ratified |
https://gu.com/p/ea933 | was essential regardless of the outcome of | trade talks | between britain and the eu and called | Three in four UK firms unprepared for Brexit, study shows |
https://gu.com/p/dbhnh | rejected the eus opening offer for a | trade deal | and said it did not recognise the | Downing Street rejects EU’s ‘onerous’ opening trade offer |
https://gu.com/p/dbhnh | such as canada and japan when signing | trade deals | with them we just want the same | Downing Street rejects EU’s ‘onerous’ opening trade offer |
Quanteda provides a sentiment dictionary which, in addition to identifying positive and negative words, also captures negated negatives and negated positives such as “not effective”. For each week’s worth of articles, I’ll calculate the proportion of positive sentiment.
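Before scoring the articles, it’s worth a quick peek at the dictionary itself; printing the object shows its four categories: positive, negative and their negated counterparts.
data_dictionary_LSD2015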
tic()
sent_df <-
  context_corp %>%
  dfm(dictionary = data_dictionary_LSD2015) %>%
  as_tibble() %>%
  left_join(context_df, by = c("doc_id" = "shortUrl")) %>%
  mutate(
    pos = positive + neg_negative,
    neg = negative + neg_positive,
    date = ceiling_date(as_date(webPublicationDate), "week"),
    pct_pos = pos / (pos + neg)
  )
sent_df %>%
  select(doc_id, starts_with("pos"), starts_with("neg")) %>%
  slice(1:10) %>%
  kbl() %>%
  kable_material()
doc_id | positive | pos | negative | neg_positive | neg_negative | neg |
---|---|---|---|---|---|---|
https://gu.com/p/d6qhb | 40 | 40 | 22 | 0 | 0 | 22 |
https://gu.com/p/d9e9j | 27 | 27 | 15 | 0 | 0 | 15 |
https://gu.com/p/d6bt2 | 37 | 37 | 7 | 0 | 0 | 7 |
https://gu.com/p/d9vjq | 13 | 13 | 23 | 0 | 0 | 23 |
https://gu.com/p/d6kzd | 51 | 52 | 27 | 0 | 1 | 27 |
https://gu.com/p/d7n8b | 57 | 57 | 34 | 1 | 0 | 35 |
https://gu.com/p/d6t3c | 28 | 28 | 26 | 0 | 0 | 26 |
https://gu.com/p/d79cn | 56 | 57 | 48 | 3 | 1 | 51 |
https://gu.com/p/d9xtf | 33 | 33 | 13 | 1 | 0 | 14 |
https://gu.com/p/d696t | 15 | 15 | 21 | 1 | 0 | 22 |
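For example, taking the fifth row above (https://gu.com/p/d6kzd), pos = 51 + 1 = 52 and neg = 27 + 0 = 27, so that article’s positive proportion is:
(51 + 1) / ((51 + 1) + (27 + 0))
## [1] 0.6582278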
summary_df <- sent_df %>%
  group_by(date) %>%
  summarise(pct_pos = mean(pct_pos), n = n())
toc()
## 1.202 sec elapsed
Plotting the changing proportion of positive sentiment over time did surprise me a little. The outcome was more balanced than I expected, which perhaps confirms the deeper impression left on me by negative articles.
The upper violin plot shows the distribution of sentiment across each week’s articles, with the dashed line tracking the weekly average. Individually the articles range from 20% to 80% positive, with discernible periods of relatively negative and relatively positive sentiment.
The lower plot shows the volume of articles. As we draw closer to the crunch point, the volume appears to be picking up.
p1 <- sent_df %>%
  ggplot(aes(date, pct_pos)) +
  geom_violin(aes(group = date), alpha = 0.5, fill = cols[1]) +
  geom_line(data = summary_df, aes(date, pct_pos), colour = cols[1], linetype = "dashed") +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = cols[4], size = 1) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = NULL, y = "Positive Sentiment")

p2 <- summary_df %>%
  ggplot(aes(date, n)) +
  geom_line(colour = cols[1]) +
  labs(x = "Weeks", y = "Article Count",
       caption = "Source: Guardian Newspaper")

p1 / p2 +
  plot_layout(heights = c(2, 1))
Adopting titaniumtroop’s approach, suggested via Disqus, we could visualise a rolling 7-day mean, with a ribbon spanning the interquartile range of daily sentiment that narrows as the variation in sentiment narrows.
width <- 7
sent_df2 <- sent_df %>%
  mutate(web_date = as_date(webPublicationDate)) %>%
  group_by(web_date) %>%
  summarise(pct_pos = sum(pos) / sum(neg + pos)) %>%
  mutate(
    roll_mean = rollapply(pct_pos, width, mean, partial = TRUE, align = "right"),
    roll_lq = rollapply(pct_pos, width, quantile, probs = 0.25, partial = TRUE, align = "right"),
    roll_uq = rollapply(pct_pos, width, quantile, probs = 0.75, partial = TRUE, align = "right")
  )
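As a quick illustration of what partial = TRUE and align = "right" do in the rollapply calls above, applied to a toy vector (a sketch, not part of the analysis):
# right-aligned rolling mean of width 3; partial = TRUE averages whatever
# observations are available at the start rather than returning NA
rollapply(1:5, 3, mean, partial = TRUE, align = "right")
## [1] 1.0 1.5 2.0 3.0 4.0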
sent_df2 %>%
  ggplot(aes(web_date)) +
  geom_line(aes(y = roll_mean), colour = cols[1]) +
  geom_ribbon(aes(ymin = roll_lq, ymax = roll_uq), alpha = 0.33, fill = cols[1]) +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = cols[4], size = 1) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Changing Sentiment Towards a UK-EU Trade Deal",
    subtitle = glue("Rolling {width} days Since the Withdrawal Agreement"),
    x = NULL, y = "Positive Sentiment"
  )
Some writers exhibit more sentiment variation than others.
byline_df <-
  sent_df %>%
  mutate(byline = word(byline, 1, 2) %>% str_remove_all(PUNCT)) %>%
  group_by(byline, date) %>%
  summarise(pct_pos = mean(pct_pos), n = n())
top_3 <- byline_df %>%
  count(byline, sort = TRUE) %>%
  ungroup() %>%
  filter(!is.na(byline)) %>%
  slice(c(3, 2)) %>%
  pull(byline)
byline_df %>%
  filter(byline %in% top_3) %>%
  ggplot(aes(date, pct_pos, colour = byline)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[2]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  scale_colour_manual(values = cols[c(1, 4)]) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = "Weeks", y = "Positive Sentiment", colour = "Byline",
       caption = "Source: Guardian Newspaper")
R Toolbox
Summarising the packages and functions used in this post, as below, enables me to separately create a toolbox visualisation summarising package and function usage across all posts.
Package | Function |
---|---|
base | c[8]; sum[3]; function[2]; mean[2]; set.seed[2]; conflicts[1]; cumsum[1]; is.na[1]; months[1]; search[1]; seq[1]; Sys.sleep[1] |
dplyr | mutate[10]; filter[8]; as_tibble[4]; group_by[4]; summarise[4]; if_else[3]; n[3]; select[3]; slice[3]; tibble[3]; arrange[2]; desc[2]; left_join[2]; starts_with[2]; count[1]; pull[1]; rename[1]; slice_sample[1]; ungroup[1] |
ggplot2 | aes[8]; geom_line[4]; ggplot[4]; labs[4]; geom_hline[3]; scale_y_continuous[3]; geom_ribbon[1]; geom_violin[1]; scale_colour_manual[1]; theme_bw[1]; theme_set[1] |
glue | glue[1] |
GuardianR | get_guardian[1] |
kableExtra | kable_material[5]; kbl[5] |
lubridate | date[3]; as_date[2]; ceiling_date[1]; today[1]; ymd[1] |
patchwork | plot_layout[1] |
purrr | map[1]; map2_dfr[1]; pmap_dfr[1]; possibly[1]; set_names[1] |
quanteda | corpus[2]; data_dictionary_LSD2015[1]; dfm[1]; fcm[1]; kwic[1]; phrase[1]; t[1]; tokens[1] |
readr | read_lines[1] |
rebus | literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]; PUNCT[1] |
scales | percent_format[3] |
stringr | str_detect[5]; str_remove_all[5]; str_c[2]; str_remove[2]; str_count[1]; str_starts[1]; str_to_lower[1]; word[1] |
text2vec | sim2[1] |
tibble | enframe[1]; rownames_to_column[1] |
tictoc | tic[2]; toc[2] |
tidyr | unnest[1] |
wesanderson | wes_palette[1] |
zoo | rollapply[3] |