Criminal Goings-on

Graphic by Carl Goodwin

When I first posted this back in early 2018 I used the caret package to model crime in London. Since then, the newer tidymodels framework, consistent with tidy data principles, has rapidly evolved. So, I’ll briefly re-model the data from data.gov.uk with four supervised machine learning models supported by tidymodels.

raw_df <-
  read_csv(
    "https://data.london.gov.uk/download/recorded_crime_rates/c051c7ec-c3ad-4534-bbfe-6bdfee2ef6bb/crime%20rates.csv",
    col_types = "cfcfdn"
  ) %>%
  clean_names() %>%
  mutate(year = str_extract(year, number_range(1999, 2017)),
         year = as.numeric(year)) %>%
  group_by(year, borough, offences) %>%
  summarise(number_of_offences = sum(number_of_offences)) %>% 
  ungroup()

First, I’ll use a faceted plot to get a sense of the data.

I found I was initially losing some text in the legend and the facet headings, so whilst polishing the visualisation, guides helped me force the wordy legend onto four lines, and stringr’s str_wrap fixed the longer borough names.

raw_df %>%
  mutate(borough = str_wrap(borough, 11)) %>%
  ggplot(aes(year, number_of_offences, colour = offences, group = offences)) +
  geom_line() +
  facet_wrap(~borough, scales = "free_y", ncol = 4) +
  labs(
    x = NULL, y = NULL, title = "London Crime by Borough",
    colour = "Offence", caption = "Source: data.gov.uk"
  ) +
  scale_colour_manual(values = cols) +
  guides(colour = guide_legend(nrow = 4)) +
  theme(
    strip.background = element_rect(fill = cols[9]),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Visualising data in small multiples using facet_wrap or facet_grid can be a useful way to explore data. When there are a larger number of these however, as we’re starting to see in the example above, there are alternative techniques one can employ. I’ll explore these in Bigger Brighter.

Nonetheless, one can anyway see there are data aggregated at multiple levels. So to net these data down to purely borough-level, I’ll filter out the summarised rows, for example, “England and Wales” and “Inner London”.

crime_df <- raw_df %>%
  filter(
    offences != "All recorded offences",
    !borough %in% c("England and Wales", "Met Police Area", "Inner London", "Outer London")
  )

There are 9 types of offences in 33 boroughs. The dataset covers the period 1999 to 2016.

The faceted plot hints at a potential interaction between borough and type of offence. In more affluent boroughs, and/or those attracting greater visitor numbers, e.g. Westminster and Kensington & Chelsea, “theft and handling” is the more dominant category. In Lewisham, for example, “violence against the person” exhibits higher counts. However, for the purpose of this basic model comparison, I’m going to set aside the potential interaction term.

Before modelling, I’ll visualise the dependent variable against each independent variable.

crime_df %>%
  group_by(offences, borough) %>%
  summarise(number_of_offences = sum(number_of_offences)) %>%
  group_by(offences) %>%
  mutate(
    median_offences = median(number_of_offences),
    offences = str_wrap(offences, 10),
    ) %>% 
  ggplot(aes(fct_reorder(offences, median_offences), number_of_offences)) +
  geom_boxplot(fill = cols[7]) +
  scale_y_log10(labels = comma) +
  labs(
    x = NULL, y = NULL, 
    title = "Number of Offences by Type",
    caption = "Source: data.gov.uk"
  )

crime_df %>%
  group_by(offences, borough) %>%
  summarise(number_of_offences = sum(number_of_offences)) %>%
  group_by(borough) %>%
  mutate(
    median_offences = median(number_of_offences),
    offences = str_wrap(offences, 10),
    ) %>% 
  ggplot(aes(fct_reorder(borough, median_offences), number_of_offences)) +
  geom_boxplot(fill = cols[6]) +
  scale_y_log10(labels = comma) +
  coord_flip() +
  labs(
    x = NULL, y = NULL, 
    title = "Number of Offences by Borough",
    caption = "Source: data.gov.uk"
  )

The offences and borough variables show significant variation in crime counts. And there is also evidence of a change over time.

crime_df %>%
  group_by(year) %>%
  summarise(number_of_offences = sum(number_of_offences)) %>%
  ggplot(aes(year, number_of_offences)) +
  geom_line(colour = cols[10]) +
  geom_smooth(colour = cols[5], fill = cols[4]) +
  scale_y_continuous(labels = comma) +
  labs(
    x = NULL, y = NULL,
    title = "Number of Offences by Year",
    caption = "Source: data.gov.uk"
  )

I’ll separate out some test data so I can compare the performance of the models on data they have not see during model training.

set.seed(123)

data_split <- 
  crime_df %>%
  initial_split(strata = offences)

crime_train <- data_split %>%
  training()

crime_test <- data_split %>%
  testing()

I’m using the recipes package to establish the role of the variables. Alternatively I could have used a formula-based approach, i.e. number_of_offences ~ borough + offences + year.

Whilst borough and offences are nominal, I’m not creating any dummy variables since I intend to use tree-based models which will anyway branch left and right based on groups of values.

crime_recipe <-
  crime_train %>%
  recipe() %>%
  update_role(number_of_offences, new_role = "outcome") %>%
  update_role(-has_role("outcome"), new_role = "predictor")

summary(crime_recipe)
## # A tibble: 4 x 4
##   variable           type    role      source  
##   <chr>              <chr>   <chr>     <chr>   
## 1 year               numeric predictor original
## 2 borough            nominal predictor original
## 3 offences           nominal predictor original
## 4 number_of_offences numeric outcome   original

I’ll start with a Recursive Partitioning And Regression Trees (rpart) model. The feature importance plot tells me which variables are having the biggest influence on the model. The type of offence is the most important predictor in the rpart model, followed by the location of the offences. This makes intuitive sense.

Clearly there is a temporal component too otherwise there would be no trend.

rp_model <- 
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

rp_wflow <- workflow() %>%
  add_recipe(crime_recipe) %>%
  add_model(rp_model)

rp_fit <- rp_wflow %>% 
  fit(crime_train)

rp_fit %>%
  pull_workflow_fit() %>% 
  vip(fill = cols[1]) +
  labs(title = "Feature Importance -- rpart")

rp_results <- 
  crime_test %>%
  bind_cols(predict(rp_fit, new_data = crime_test) %>% select(.pred)) %>% 
  mutate(model = "rpart")

Ranger is an implementation of random forests or recursive partitioning that, according to the documentation, is particularly suited to high dimensional data. My data is not high-dimensional, but let’s throw it into the mix.

ranger_model <- 
  rand_forest() %>%
  set_engine("ranger", mtry = 2, importance = "impurity") %>%
  set_mode("regression")

ranger_wflow <- workflow() %>%
  add_recipe(crime_recipe) %>%
  add_model(ranger_model)

ranger_fit <- ranger_wflow %>% 
  fit(crime_train)

ranger_fit %>%
  pull_workflow_fit() %>% 
  vip(fill = cols[3]) +
  labs(title = "Feature Importance -- Ranger")

ranger_results <- 
  crime_test %>%
  bind_cols(predict(ranger_fit, new_data = crime_test) %>% select(.pred)) %>% 
  mutate(model = "ranger")

And of course my project title would make little sense without a Random Forest.

rf_model <- 
  rand_forest() %>%
  set_engine("randomForest", mtry = 2) %>%
  set_mode("regression")

rf_wflow <- workflow() %>%
  add_recipe(crime_recipe) %>%
  add_model(rf_model)

rf_fit <- rf_wflow %>% 
  fit(crime_train)

rf_fit %>%
  pull_workflow_fit() %>% 
  vip(fill = cols[5]) +
  labs(title = "Feature Importance -- Random Forest")

rf_results <- 
  crime_test %>%
  bind_cols(predict(rf_fit, new_data = crime_test) %>% select(.pred)) %>% 
  mutate(model = "random forest")

For good measure, I’ll also include a generalized linear model (glm)

poisson_model <- 
  poisson_reg() %>%
  set_engine("glm") %>%
  set_mode("regression")

poisson_wflow <- workflow() %>%
  add_recipe(crime_recipe) %>%
  add_model(poisson_model)

poisson_fit <- poisson_wflow %>% 
  fit(crime_train)

poisson_fit %>%
  pull_workflow_fit() %>% 
  vip(fill = cols[7]) +
  labs(title = "Feature Importance -- glm")

poisson_results <- 
  crime_test %>%
  bind_cols(predict(poisson_fit, new_data = crime_test) %>% select(.pred)) %>% 
  mutate(model = "glm")

The Random Forest and the glm models performed the best here, with the former edging the Mean Absolute Error and R Squared metrics, and the latter with its nose in front on the Root Mean Squared Error.

model_results <- 
  rp_results %>% 
  bind_rows(ranger_results) %>% 
  bind_rows(rf_results) %>% 
  bind_rows(poisson_results) %>% 
  group_by(model) %>% 
  metrics(truth = number_of_offences, estimate = .pred)

model_results %>% 
  ggplot(aes(model, .estimate, fill = model)) +
  geom_col() +
  geom_label(aes(label = round(.estimate, 2)), size = 3) +
  facet_wrap(~ .metric, scales = "free_y") +
  scale_fill_manual(values = cols[c(1, 3, 5, 7)]) +
  labs(x = NULL, y = NULL, title = "Comparison of Model Metrics") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Another way of approaching all this would be to use time-series forecasting. This would major on auto-regression, i.e. looking at how the lagged number-of-offences influence future values. And one could further include exogenous data such as, say, the numbers of police. It would be reasonable to expect that increasing police numbers would, in time, lead to decreased crime levels.

I explored time-series in other posts such as Digging Deep, so I won’t go down that path here.

What I could do though is to strengthen my tree-based models above by engineering some additional temporal features. Let’s try that just with the Random Forest to see if it improves the outcome.

temp_df <- 
  crime_df %>% 
  mutate(num_lag1 = lag(number_of_offences),
         num_lag2 = lag(number_of_offences, 2),
         num_lag3 = lag(number_of_offences, 3)) %>% 
  drop_na()

So, when predicting the number of offences, the model will now additionally consider, for each borough, type of offence and year, the number of offences in each of the three prior years.

set.seed(123)

data_split <- 
  temp_df %>%
  initial_split(strata = offences)

temp_train <- data_split %>%
  training()

temp_test <- data_split %>%
  testing()

temp_recipe <-
  temp_train %>%
  recipe() %>%
  update_role(number_of_offences, new_role = "outcome") %>%
  update_role(-has_role("outcome"), new_role = "predictor")

summary(temp_recipe)
## # A tibble: 7 x 4
##   variable           type    role      source  
##   <chr>              <chr>   <chr>     <chr>   
## 1 year               numeric predictor original
## 2 borough            nominal predictor original
## 3 offences           nominal predictor original
## 4 number_of_offences numeric outcome   original
## 5 num_lag1           numeric predictor original
## 6 num_lag2           numeric predictor original
## 7 num_lag3           numeric predictor original
temp_model <- 
  rand_forest() %>%
  set_engine("randomForest", mtry = 2) %>%
  set_mode("regression")

temp_wflow <- workflow() %>%
  add_recipe(temp_recipe) %>%
  add_model(temp_model)

temp_fit <- temp_wflow %>% 
  fit(temp_train)

temp_fit %>%
  pull_workflow_fit() %>% 
  vip(fill = cols[5]) +
  labs(title = "Feature Importance -- Random Forest with Lags")

temp_results <- 
  temp_test %>%
  bind_cols(predict(temp_fit, new_data = temp_test) %>% select(.pred)) %>% 
  metrics(truth = number_of_offences, estimate = .pred) %>% 
  mutate(model = "rf with lags")

The recipe summary includes the three new predictors. And the feature importance plot shows the lags playing a larger role in the model than the year variable, so looks like we should anticipate a model improvement.

updated_results <- 
  model_results %>% 
  bind_rows(temp_results)

updated_results %>% 
  ggplot(aes(model, .estimate, fill = model)) +
  geom_col() +
  geom_label(aes(label = round(.estimate, 2)), size = 3) +
  facet_wrap(~ .metric, scales = "free_y") +
  scale_fill_manual(values = cols[c(1, 3, 5, 7, 9)]) +
  labs(x = NULL, y = NULL, title = "Comparison of Model Metrics") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

The model metrics bear this out. The mae and rmse are markedly smaller, and the rsq significantly improved. We could have tried further lags. We could have tried tweaking some parameters. We could have tried time-series forecasting with, for example a statistical model like ARIMA, or a Neural Network model such as NNETAR.

The best approach would depend upon a more precise definition of the objective. And some trial and error, comparing approaches after more extensive feature-engineering, validation, testing and tuning. For the purposes of this post though I wanted to merely explore some techniques. So I’ll leave it there.

R Toolbox

Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts.

Package Function
base library[9]; sum[5]; round[2]; set.seed[2]; summary[2]; as.numeric[1]; c[1]; conflicts[1]; cumsum[1]; function[1]; search[1]
dplyr mutate[14]; group_by[8]; select[6]; bind_cols[5]; filter[5]; summarise[5]; bind_rows[4]; if_else[3]; lag[3]; tibble[2]; arrange[1]; as_tibble[1]; desc[1]; ungroup[1]
forcats fct_reorder[2]
ggplot2 labs[11]; aes[8]; ggplot[6]; element_text[3]; facet_wrap[3]; theme[3]; geom_boxplot[2]; geom_col[2]; geom_label[2]; geom_line[2]; scale_fill_manual[2]; scale_y_log10[2]; coord_flip[1]; element_rect[1]; geom_smooth[1]; guide_legend[1]; guides[1]; scale_colour_manual[1]; scale_y_continuous[1]; theme_bw[1]; theme_set[1]
janitor clean_names[1]
kableExtra kable[1]
parsnip fit[5]; set_engine[5]; set_mode[5]; rand_forest[3]; decision_tree[1]
poissonreg poisson_reg[1]
purrr map[1]; map2_dfr[1]; possibly[1]; set_names[1]
readr cols[1]; read_csv[1]; read_lines[1]
rebus literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; number_range[1]; one_or_more[1]; or[1]
recipes update_role[4]; has_role[2]; recipe[2]
rsample initial_split[2]; testing[2]; training[2]
scales comma[3]
stats predict[5]; median[2]
stringr str_detect[3]; str_wrap[3]; str_c[2]; str_remove[2]; str_count[1]; str_extract[1]; str_remove_all[1]
tibble enframe[1]
tidyr tibble[2]; as_tibble[1]; drop_na[1]; unnest[1]
vip vip[6]
wesanderson wes_palette[1]
workflows add_model[5]; add_recipe[5]; pull_workflow_fit[5]; workflow[5]
yardstick metrics[2]
Carl Goodwin
Carl Goodwin
IBM Data Scientist & Growth Leader
comments powered by Disqus

Related