Criminal goings-on in a random forest and predictions with tree-based and glm models
Author
Carl Goodwin
Published
March 1, 2018
Modified
January 20, 2023
When first posted in 2018 this project used the caret package to model crime in London. Since then, the newer tidymodels(Kuhn and Wickham 2020) framework, consistent with tidy data principles, has rapidly evolved.
This custom palette was created in Adobe Colour as the basis for the feature image above and with the hex codes loaded for use in ggplot. colorRampPalette enables interpolation of an extended set of colours to support the number of offence types.
url<-str_c("https://data.london.gov.uk/","download/recorded_crime_rates/","c051c7ec-c3ad-4534-bbfe-6bdfee2ef6bb/","crime%20rates.csv")raw_df<-read_csv(url, col_types ="cfcfdn")|>clean_names()|>mutate( year =str_extract(year, "(?:1999|200[0-9]|201[0-7])"), # 1999-2007 year =as.numeric(year))|>summarise(number_of_offences =sum(number_of_offences), .by =c(year, borough, offences))
A faceted plot is one way to get a sense of the data.
raw_df|>mutate(borough =str_wrap(borough, 11))|>ggplot(aes(year, number_of_offences, colour =offences, group =offences))+geom_line()+facet_wrap(~borough, scales ="free_y", ncol =4)+labs( x =NULL, y =NULL, title ="London Crime by Borough", colour ="Offence", caption ="Source: data.gov.uk")+scale_colour_manual(values =cols10)+guides(colour =guide_legend(nrow =4))+theme( legend.position ="bottom", axis.text.x =element_text(angle =45, hjust =1))
Visualising data in small multiples using facet_wrap or facet_grid can be a useful way to explore data. When there are a larger number of these however, as we’re starting to see in the example above, there are alternative techniques one can employ. This is explored in Seeing the Wood for the Trees.
Nonetheless, one can anyway see there are data aggregated at multiple levels. So to net these data down to purely borough-level, I’ll filter out the summarised rows, for example, “England and Wales” and “Inner London”.
crime_df<-raw_df|>filter(offences!="All recorded offences",!borough%in%c("England and Wales","Met Police Area", "Inner London", "Outer London"))
There are 9 types of offence in 33 boroughs. The dataset covers the period 1999 to 2016.
The faceted plot hints at a potential interaction between borough and type of offence. In more affluent boroughs, and/or those attracting greater visitor numbers, e.g. Westminster and Kensington & Chelsea, “theft and handling” is the more dominant category. In Lewisham, for example, “violence against the person” exhibits higher counts. However, for the purpose of this basic model comparison, I’m going to set aside the potential interaction term.
Before modelling, I’ll visualise the dependent variable against each independent variable.
crime_df|>summarise(number_of_offences =sum(number_of_offences), .by =c(offences, borough))|>mutate( median_offences =median(number_of_offences), offences =str_wrap(offences, 10), .by =offences)|>ggplot(aes(fct_reorder(offences, median_offences), number_of_offences))+geom_boxplot(fill =cols[1])+scale_y_log10(labels =label_number(scale_cut =cut_short_scale()))+labs( x =NULL, y =NULL, title ="Number of Offences by Type", caption ="Source: data.gov.uk")
The offences and borough variables show significant variation in crime counts. And there is also evidence of a change over time.
crime_df|>summarise(number_of_offences =sum(number_of_offences), .by =year)|>ggplot(aes(year, number_of_offences))+geom_line(colour =cols[4], linetype ="dashed")+geom_smooth(colour =cols[5], fill =cols[1])+scale_y_continuous(labels =label_number(scale_cut =cut_short_scale()))+labs( x =NULL, y =NULL, title ="Number of Offences by Year", caption ="Source: data.gov.uk")
I’ll separate out some test data so I can compare the performance of the models on data they have not see during model training.
I’m using the recipes package to establish the role of the variables. Alternatively I could have used a formula-based approach, i.e. number_of_offences ~ borough + offences + year.
Whilst borough and offences are nominal, I’m not creating any dummy variables since I intend to use tree-based models which will anyway branch left and right based on groups of values.
I’ll start with a Recursive Partitioning And Regression Trees (rpart) model. The feature importance plot tells me which variables are having the biggest influence on the model. The type of offence is the most important predictor in the rpart model, followed by the location of the offences. This makes intuitive sense.
Clearly there is a temporal component too otherwise there would be no trend.
Ranger is an implementation of random forests or recursive partitioning that, according to the documentation, is particularly suited to high dimensional data. My data is not high-dimensional, but let’s throw it into the mix.
The Random Forest and the glm models performed the best here, with the former edging the Mean Absolute Error and R Squared metrics, and the latter with its nose in front on the Root Mean Squared Error.
model_results<-rp_results|>bind_rows(ranger_results)|>bind_rows(rf_results)|>bind_rows(poisson_results)|>group_by(model)|>metrics(truth =number_of_offences, estimate =.pred)model_results|>ggplot(aes(model, .estimate, fill =model))+geom_col()+geom_label(aes(label =round(.estimate, 2)), size =3, fill ="white")+facet_wrap(~.metric, scales ="free_y")+scale_fill_manual(values =as.character(cols[c(4, 5, 3, 1)]))+labs(x =NULL, y =NULL, title ="Comparison of Model Metrics")+theme(axis.text.x =element_text(angle =45, hjust =1), legend.position ="none")
Another way of approaching all this would be to use time-series forecasting. This would major on auto-regression, i.e. looking at how the lagged number-of-offences influence future values. And one could further include exogenous data such as, say, the numbers of police. It would be reasonable to expect that increasing police numbers would, in time, lead to decreased crime levels.
I explored time-series in other posts such as Digging Deep, so I won’t go down that path here.
What I could do though is to strengthen my tree-based models above by engineering some additional temporal features. Let’s try that just with the Random Forest to see if it improves the outcome.
So, when predicting the number of offences, the model will now additionally consider, for each borough, type of offence and year, the number of offences in each of the three prior years.
temp_model<-rand_forest()|>set_engine("randomForest")|>set_mode("regression")temp_wflow<-workflow()|>add_recipe(temp_recipe)|>add_model(temp_model)temp_fit<-temp_wflow|>fit(temp_train)temp_fit|>extract_fit_parsnip()|>vip(aesthetics =list(fill =cols[2]))+labs(title ="Feature Importance -- Random Forest with Lags")
temp_results<-temp_fit|>augment(temp_test)|>metrics(truth =number_of_offences, estimate =.pred)|>mutate(model ="rf with lags")
The recipe summary includes the three new predictors. And the feature importance plot shows the lags playing a larger role in the model than the year variable, so looks like we should anticipate a model improvement.
updated_results<-model_results|>bind_rows(temp_results)updated_results|>ggplot(aes(model, .estimate, fill =model))+geom_col()+geom_label(aes(label =round(.estimate, 2)), size =3, fill ="white")+facet_wrap(~.metric, scales ="free_y")+scale_fill_manual(values =as.character(cols[c(4, 5, 3, 2, 1)]))+labs(x =NULL, y =NULL, title ="Comparison of Model Metrics")+theme(axis.text.x =element_text(angle =45, hjust =1), legend.position ="none")
The model metrics bear this out. The mae and rmse are markedly smaller, and the rsq significantly improved. We could have tried further lags. We could have tried tweaking some parameters. We could have tried time-series forecasting with, for example a statistical model like ARIMA, or a Neural Network model such as NNETAR.
The best approach would depend upon a more precise definition of the objective. And some trial and error, comparing approaches after more extensive feature-engineering, validation, testing and tuning. For the purposes of this post though I wanted to merely explore some techniques. So I’ll leave it there.
R Toolbox
Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts.
Kuhn, Max, and Hadley Wickham. 2020. “Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.”https://www.tidymodels.org.