---
title: "RFM Customer Segmentation with K Means"
subtitle: "Individual customer choice is paramout"
author: "Dessi G"
date: "28/09/2020"
output: html_document
---
```{r setup, include= FALSE, echo=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
include= FALSE, echo=FALSE, warning=FALSE, message=FALSE, cache =TRUE)
## Load the required libraries
library(stringr) # for string data manipulation
library(readr) # for data load
library(dplyr) # for data wrangling
library(tidyr) # for reshaping data (used by gather() later on)
library(ggplot2) # for awesome graphics
library(GGally) # extension of ggplot for nice correlation visualisations
library(rsample) # for creating validation splits
library(recipes) # for feature engineering
library(lubridate) # for calculating recency
library(knitr) # for visualising outputs in a nice table
library(ggstatsplot) # for visualising outliers
library(factoextra) # for cluster visualisation and choosing the number of clusters
library(pdp) # for arranging multiple visualisations together (grid.arrange)
## for geomarketing analysis
library(tmap)
library(sf)
library(datasets)
library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)
## Load the data
df <- read_csv("Data/data.csv")
country_coord <- read_csv("Data/countries_geographic_coordinates.csv")
# join the 2 tables to get the geographic coordinates of the countries
df <- inner_join(df, country_coord)
# check the output table
df %>% distinct(Country)
df %>% distinct(Description)
```
# RFM Customer segmentation with K Means
This article is part of a bigger project for the development of a basic marketing analytics stack. The idea is to build a stack of reusable and replicable templates for ML-based marketing analysis. Template here is used in the sense of process logic, reusable code snippets and generalised business case scenarios for which the model is applicable.
For every analytical model an interactive Shiny application is suggested. The idea of the app is to serve as a reusable framework for model presentation, allowing non-technical users to get a visual understanding of the data and to play with different scenarios directly in the app. In addition, the application allows the analysis to be automated: once the application is integrated with the data source, every change in the data automatically updates the output figures of the analysis. This enables near-instantaneous marketing action, such as grasping time-sensitive opportunities for up-sell promotions.
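As an illustration only, a minimal sketch of what such an app skeleton could look like is shown below; the file path, column names and refresh interval are hypothetical placeholders rather than part of this analysis.
```{r shiny-app-sketch, eval=FALSE}
# Hypothetical sketch of a self-updating segmentation app; not run in this article.
library(shiny)
library(ggplot2)

ui <- fluidPage(
  sliderInput("k", "Number of clusters", min = 2, max = 8, value = 4),
  plotOutput("clusters")
)

server <- function(input, output, session) {
  # re-read the (hypothetical) RFM extract every 5 minutes, so new transactions
  # automatically flow into the clustering and the plot below
  rfm <- reactiveFileReader(300000, session, "Data/rfm_extract.csv", read.csv)
  output$clusters <- renderPlot({
    scaled <- scale(rfm()[, c("recency", "frequency", "total_value")])
    km <- kmeans(scaled, centers = input$k, nstart = 25)
    ggplot(as.data.frame(scaled),
           aes(recency, frequency, colour = factor(km$cluster))) +
      geom_point()
  })
}

shinyApp(ui, server)
```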
### Structure of the article:
1. Business objectives of the model;
2. Algorithm behind the model;
3. Model development process;
4. Ideas for further analysis of the model output.
### Business objectives of customer segmentation
**Customer segmentation** helps you understand how, why, and when your product or service is purchased or used. These insights are crucial for the efficient allocation of marketing resources.
**Recency, frequency and monetary (RFM)** customer segmentation gives you a base model for analysing customers according to their transactional behaviour: how recent their last transaction was, how often they purchase, and how much they spend.
**RFM customer segmentation** is therefore an effective way to prevent customer churn and to identify and act on up-sell and cross-sell opportunities.
### Why K Means clustering
**K-means** is an unsupervised algorithm and probably the most widely used clustering algorithm. One of its major advantages is that it can handle larger data sets than, for example, hierarchical clustering approaches.
However, K-means comes with some disadvantages as well: it is sensitive to outliers, and the data needs to be scaled before running the algorithm.
### Why R
The advantages of using R:
- Availability of well-developed analytical packages;
- Ease of creating beautiful layered visualisations;
- Ease of presenting ML models through interactive applications (Shiny App).
### Model development process
The development process of the model has 2 main stages:
**Data Preparation**
1. Data cleaning;
2. Feature engineering;
3. Summary statistics of the engineered data set:
- remove outliers;
- scale the data;
4. Build a heatmap to define the correlations between the variables.
**Model Development**
5. K means model development:
- Define the data point distance measurement method;
- Define the optimal number of clusters;
- Build the clusters;
- Visualise the clusters.
6. Ideas for cluster analysis and persona building:
- Categorise the clusters according to their value for the business;
- Provide a geospatial visualisation of the most dominant clusters per country.
## Data preparation
The main objective of this stage is to remove missing values, convert data types where necessary, and examine the data for any discrepancies.
```{r clean-the-data, echo=FALSE, warning=FALSE, message=FALSE}
## remove NAs and convert data types where necessary
# identify the location of NAs
which(is.na(df))
# define the count of NAs
sum(is.na(df)) # [1] 135621
# identify the columns containing missing values
colSums(is.na(df)) # the NAs are missing customer ids
# extract the NAs in case you need to analyse them later
NA_ids <- filter(df, is.na(CustomerID))
# take a look at the missing data, the missing customers might be high value customers
glimpse(NA_ids)
# remove the rows with missing customer ids
df <- filter(df, !is.na(CustomerID))
glimpse(df) # Rows: 397,932
# reformat the dates field from character to date format
df$InvoiceDate <- as.Date(df$InvoiceDate, format = "%m/%d/%Y")
# convert customer id from double to character
df$CustomerID <- as.character(df$CustomerID)
```
Check the basic statistics of the cleaned data set:
```{r echo=FALSE, warning=FALSE, message=FALSE}
summary(df)
```
The time span of the data set is about a year:
* maxDate = 2011-12-09
* minDate = 2010-12-01
In a real business scenario the time span needed for customer segmentation should be defined by the business objective, the maturity of the business, the dynamics of the market, and the type of product or service sold.
Also, in cases where there is seasonality in the purchase behaviour, the time span for customer segmentation should capture at least 2 similar seasons. Usually this means that the analysis will be done over 24 months of transactional data.
The summary of the data shows that there are transactions with a total product price of 0 and products with negative quantities. This means that we might be dealing with refunds and discounts. We will remove the refunds and discounts together with their original transactions. These can be analysed later at segment level, for example to see which of the segments claimed the most refunds. This type of analysis can turn out to be quite important not only for the development of the products and services, but also for the further profiling of the segments.
```{r echo=FALSE, warning=FALSE, message=FALSE}
# take an overall look at the smallest (negative) price and quantity values
df %>%
  arrange(UnitPrice)
df %>%
  arrange(Quantity)
df %>%
  filter(UnitPrice == 0) # 37 transactions without monetary value
# extract the refunds for further analysis
refund <- df %>%
  filter(Quantity < 0)
str(refund) # 8,535 refunded transactions
# from the stock code values you can see that some of the negative quantities are actually discounts
# check all the transactions that have a non-numeric StockCode
not_revenue <- df %>% filter(!str_detect(StockCode, "[0-9]"))
str(not_revenue) # 1,774 discounts, not product purchase transactions
## remove all transactions that are not revenue-generating product purchases
df <- filter(df, UnitPrice != 0)
df <- filter(df, Quantity > 0)
df <- filter(df, str_detect(StockCode, "[0-9]"))
## check whether the remaining data set contains the initial transactions that were refunded
inner_join(df, refund)
# 0 rows - the original transactions of the refunds are not in the remaining data
```
## Features engineering
To build the RFM model we first need to calculate the values for the recency, frequency and monetary variables. We can use the definitions below.
**recency** = the most recent date in the data set minus the date of the customer's last purchase. The lower the value, the more recent the customer's visit to the store.
**frequency** = the number of distinct purchases (invoices) per customer.
**monetary** = the customer's spend; in the code below it is calculated as the average value per order (total spend divided by the number of distinct invoices).
These variables, together with the customer id, will form the data frame for our clusters.
```{r feature-engineering , echo=FALSE, warning=FALSE, message=FALSE}
grouped_df <- df %>%
group_by(CustomerID) %>%
summarise(recency = as.numeric(as.Date("2011-12-09")-max(InvoiceDate)),
frequency = n_distinct(InvoiceNo),
total_value = round(sum(Quantity*UnitPrice)/n_distinct(InvoiceNo),1))
glimpse(grouped_df) # 4,312
```
As we mentioned earlier, K-means is calculated based on the distances between the data points, which makes it sensitive to the presence of outliers. Running summary statistics on the data set will help us define whether there are any outlier values.
```{r echo=FALSE, warning=FALSE, message=FALSE}
summary(grouped_df)
```
The summary statistics show that we have outliers in the frequency and monetary variables. For the purpose of getting the clustering right we need to remove these outliers. However, we should not forget about them, as they represent our most valuable customers.
To remove the outliers we need to identify a numerical cut-off range that differentiates an outlier from a non-outlier. We will use the [1.5*iqr rule](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule) for that purpose.
We can calculate the interquartile range by first finding the quartiles. Here we use the quantile() function to find the 25th and the 75th percentiles of the data set and the IQR() function, which gives the difference between the 75th and 25th percentiles.
```{r iqr-estimate, echo=FALSE, warning=FALSE, message=FALSE}
Q_value <- quantile(grouped_df$total_value, probs=c(.25, .75), na.rm = FALSE)
Q_frequency <- quantile(grouped_df$frequency, probs=c(.25, .75), na.rm = FALSE)
iqr_value <-IQR(grouped_df$total_value)
iqr_frequency <-IQR(grouped_df$frequency)
```
Then we calculate the upper and lower range limits of our data:
**upper** <- Q3+1.5*iqr
**lower** <- Q1-1.5*iqr
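As a quick illustration, the explicit cut-off bounds can be written out using the Q_value, Q_frequency, iqr_value and iqr_frequency objects computed above (the variable names below are illustrative only):
```{r iqr-bounds}
# Q_value[1] / Q_frequency[1] hold Q1, Q_value[2] / Q_frequency[2] hold Q3
upper_value <- Q_value[2] + 1.5 * iqr_value
lower_value <- Q_value[1] - 1.5 * iqr_value
upper_frequency <- Q_frequency[2] + 1.5 * iqr_frequency
lower_frequency <- Q_frequency[1] - 1.5 * iqr_frequency
```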
We can use the subset() function to apply the 1.5*iqr rule and remove the outliers from the data set.
```{r remove-outliers, echo=FALSE, warning=FALSE, message=FALSE}
grouped_df<- subset(grouped_df, grouped_df$total_value > (Q_value[1] - 1.5*iqr_value)
& grouped_df$total_value < (Q_value[2]+1.5*iqr_value))
grouped_df<- subset(grouped_df, grouped_df$frequency > (Q_frequency[1] - 1.5*iqr_frequency)
& grouped_df$frequency < (Q_frequency[2]+1.5*iqr_frequency))
```
Now we can check whether the value ranges look more evenly spread by visualising them with boxplots.
```{r box-plot, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE}
grouped_df%>%
ggplot(aes(total_value, recency ))+
geom_boxplot()
grouped_df%>%
ggplot(aes(total_value,frequency))+
geom_boxplot()
```
### Scaling (normalising) the data
The next step is to normalise the variables to have mean = 0 and variance = 1. For this we will use the R scale() function. This step is important because we want the K-means algorithm to weight the variables equally when creating the clusters.
```{r scale, echo=FALSE, warning=FALSE, message=FALSE, include= FALSE}
grouped_df_scale <-
grouped_df %>%
subset(select = - CustomerID) %>%
scale()
glimpse(grouped_df_scale)
```
Now we can visualise the correlation between the variables. We can use a heatmap for that purpose.
```{r heatmap, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE}
ggcorr(grouped_df, method = c("everything", "pearson"),label = TRUE, label_size = 5, label_round = 2)
```
We have only 3 variables and just 2 of them are slightly correlated. The lack of strong correlation is a good sign, as it will prevent bias in the model development. The interesting observation here is the negative correlation between recency and frequency. This is because of the opposite ways in which we assess the two variables: the lower the recency value, the better ranked the customer, while the opposite applies for frequency - the higher, the better. Another observation is that the more frequent transactions have a higher value, which makes sense.
```{r include=TRUE}
grouped_df%>%
ggplot(aes(recency,frequency))+
geom_col()
grouped_df%>%
ggplot(aes(frequency,total_value))+
geom_smooth(method='lm', formula= y~x)
```
Overall, the more recent transactions have a higher frequency. This might be an indicator of seasonality in the business performance. As our cut-off date is in December (the Christmas period), the closer to Christmas we get, the more frequent the orders per customer become. This means that in order to be accurate in our segmentation we would need data from at least 2 years, e.g. 2 shopping seasons. Unfortunately, the available data set covers only one year. We will use it just to showcase the way the algorithm works.
## Building the K means model
### Choose a method to measure the distances between the data points
The main objective of K-means is to group the data into uniform sets of similar observations that are distinguishable from other sets of similar observations. The similarity and dissimilarity between the observations is defined by computing the distances between the data points. There are [different ways](https://machinelearningmastery.com/distance-measures-for-machine-learning/) of measuring the distances between data points. The Euclidean method is the only method that [works](https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric) with K-means.
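As a quick illustration, the pairwise Euclidean distances between a handful of the scaled observations can be computed with base R's dist() function (shown here only to make the distance idea concrete):
```{r distance-example, include=TRUE}
# Euclidean distances between the first 5 scaled customer observations
d <- dist(head(grouped_df_scale, 5), method = "euclidean")
round(as.matrix(d), 2)
```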
### Define the number of clusters
The next step is to specify the optimal number of clusters to split the data into. First we can try to find the number of clusters manually, just to get a feeling for how K-means works. Then we can automate this process by using an algorithm.
For the manual calculation we can use the kmeans() function. Let's calculate and visualise our options with clusters from 2 to 8. The "centers" argument of the function defines the number of clusters, while "nstart" defines the number of times the data will be reshuffled to find the most cohesive clusters given the set-up. Here we will set nstart = 25. This means that R will run 25 random initial assignments and will select the one with the lowest within-cluster variation.
```{r}
k2 <- kmeans(grouped_df_scale, centers = 2, nstart = 25)
k3 <- kmeans(grouped_df_scale, centers = 3, nstart = 25)
k4 <- kmeans(grouped_df_scale, centers = 4, nstart = 25)
k5 <- kmeans(grouped_df_scale, centers = 5, nstart = 25)
k6 <- kmeans(grouped_df_scale, centers = 6, nstart = 25)
k7 <- kmeans(grouped_df_scale, centers = 7, nstart = 25)
k8 <- kmeans(grouped_df_scale, centers = 8, nstart = 25)
```
```{r include=TRUE}
# to view the results
viz2 <- fviz_cluster(k2, geom = "point", data = grouped_df_scale) + ggtitle("k = 2")
viz3 <- fviz_cluster(k3, geom = "point", data = grouped_df_scale) + ggtitle("k = 3")
viz4 <- fviz_cluster(k4, geom = "point", data = grouped_df_scale) + ggtitle("k = 4")
viz5 <- fviz_cluster(k5, geom = "point", data = grouped_df_scale) + ggtitle("k = 5")
viz6 <- fviz_cluster(k6, geom = "point", data = grouped_df_scale) + ggtitle("k = 6")
viz7 <- fviz_cluster(k7, geom = "point", data = grouped_df_scale) + ggtitle("k = 7")
viz8 <- fviz_cluster(k8, geom = "point", data = grouped_df_scale) + ggtitle("k = 8")
grid.arrange(viz2, viz3, viz4, viz5, viz6, viz7, viz8, nrow = 2)
```
There are different automated ways to define the optimal number of clusters. We will compare the output of 2 methods - the Elbow and the Average Silhouette method. For both of them we use the fviz_nbclust() function.
**Elbow method** - the basic idea of this method is to choose the number of clusters such that the total intra-cluster variation, i.e. the total within-cluster sum of squares, is minimised.
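For intuition, the same elbow curve can be sketched manually by computing the total within-cluster sum of squares over a range of k values; this is effectively what fviz_nbclust() with method = "wss" does for us:
```{r elbow-manual, include=TRUE}
# total within-cluster sum of squares for k = 1..8
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(grouped_df_scale, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
```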
```{r include=TRUE}
set.seed(123)
fviz_nbclust(grouped_df_scale, kmeans, method = "wss")
```
**Average Silhouette Method** - measures the quality of the clustering, i.e. how well each object lies within its cluster.
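For intuition, the average silhouette width of an individual solution (for example the k = 4 model fitted above) can also be computed directly with the cluster package, which ships with R; this is just a sketch to show what the metric measures:
```{r silhouette-example, include=TRUE}
# average silhouette width of the 4-cluster solution fitted earlier
library(cluster)
sil <- silhouette(k4$cluster, dist(grouped_df_scale))
mean(sil[, "sil_width"])
```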
```{r include=TRUE}
set.seed(123)
fviz_nbclust(grouped_df_scale, kmeans, method = "silhouette")
```
Both methods suggest that the best number of clusters is four. Let's apply this result in a final K-means calculation. We use the set.seed() function to get a reproducible result - remember from above that K-means uses random reshuffling of the observations to get the optimal cluster split.
```{r}
set.seed(123)
K_final <- kmeans(grouped_df_scale, 4, nstart = 25)
viz_final <- fviz_cluster(K_final, geom = "point", data = grouped_df_scale) + ggtitle("k final")
viz_final
```
We have got 4 clusters of fairly similar sizes, apart from one that is about double the size of the rest of the clusters.
In the remaining part of this article we will explore some ideas for further analysis of the segments.
## Segment Analysis
To begin with, we can get some basic figures describing the size of the segments and the way they score within the RFM model.
The size can be defined by the number of observations assigned to each cluster, while the mean value of each feature (recency, frequency and monetary) will give us an idea of which are our best and worst performing segments.
We can get these figures by running a summary table of the clustered data using the summarise_all() function. This function works only with numeric data, so before running it we need to exclude the customer id column, which is in character format. After we get our summary table we will transpose it to facilitate its visualisation.
```{r}
# exclude the customer id column, add the cluster assignments and summarise per cluster
summary_K <- grouped_df %>%
  subset(select = -CustomerID) %>%
  mutate(Cluster = K_final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all(list(n = "length", mean = "mean"))
summary_K <- summary_K %>%
  subset(select = -c(recency_n, frequency_n)) %>%
  rename(segment_size = total_value_n)
total_n_customers <- sum(summary_K$segment_size)
# transpose the summary table to facilitate the visualisation
summary_K_transposed <- gather(summary_K,
                               key = "feature",
                               value = "value",
                               -Cluster)
```
Now we can visualise the segments to determine which are the most and the least valuable ones. This will give us an idea of how and where to focus our marketing efforts.
```{r include=TRUE}
summary_K_transposed %>%
ggplot(aes(x=reorder(Cluster,-value),y=value, fill=as.factor(Cluster)))+
geom_col()+
facet_grid(.~feature)+
labs(title="Segments performance", subtitle = paste("Total N of customers: ", sum(summary_K$segment_size)))
```
From the visualisation of the summary statistics we can conclude that our most valuable customers are in segments 1 and 2. Segment 3 can be defined as lapsed, as it has fewer than 2 purchases made almost a year ago. Our most active segment is segment 2, while segment 1 has the highest propensity to spend. Therefore, if we put some marketing effort into increasing the frequency of purchase of segment 1, we can yield significant revenue for the company.
Segment 4 is our biggest segment. The only feature that makes it better than the lapsed segment 3 is the recency of purchase; otherwise the customers from this segment made on average no more than 2 purchases. As this is the biggest segment, it is worth putting effort into increasing its frequency of purchase by testing promotions and cross-sell campaigns.
### Geographic Profiling of the segments
Another interesting analysis of the segments is their geographic distribution. To perform this analysis we need to attach the segments to the initial data frame and get the geographic attributes of the customers. Also, do not forget to deduplicate your data set and leave only unique customer records.
```{r}
clustered_df <- grouped_df %>%
  mutate(Cluster = K_final$cluster) %>%
  group_by(Cluster)
glimpse(clustered_df)
clustered_all <- inner_join(x = clustered_df, y = df)
# get only the unique records per segment
segmented_data <- clustered_all %>%
  distinct(CustomerID, .keep_all = TRUE)
```
Now you can run some bar chart visualisations of the segments' performance per country.
```{r include=TRUE}
segmented_data %>%
  ggplot(aes(Country, fill = as.factor(Cluster))) +
  geom_bar() +
  coord_flip()
```
We can see that the number of customers for the UK is disproportionately higher compared to the rest of the countries. Therefore we will separate the analysis of the UK from the rest of the countries.
Let's see how the segments are represented in the UK and in the rest of the world separately.
```{r include=TRUE}
segmented_data %>%
  filter(Country != "United Kingdom") %>%
  ggplot(aes(Country, Cluster, fill = as.factor(Cluster))) +
  geom_col() +
  labs(title = "Segments performance for the rest of the world", y = "Cluster")
segmented_data %>%
  filter(Country == "United Kingdom") %>%
  ggplot(aes(Cluster, fill = as.factor(Cluster))) +
  geom_bar() +
  labs(title = "Segments performance for UK")
```
Our best segments, 1 and 2, are well represented in Germany and France. The business has a good base to develop its market share within these 2 geographies.
Segment 2, our most active segment, has a dominant representation in the UK market.
For both markets - the UK and the rest of the world - segment 1, the segment with the highest propensity to spend, is smaller in size compared to segment 2. It would make sense to focus our marketing efforts on increasing the representation and activity of segment 1.
Finally, we can get a geospatial representation of the dominant segment per country.
```{r segment-map, include = TRUE}
map <- ne_countries(scale = "medium", returnclass = "sf")
# attach the dominant (most represented) cluster per country, rather than the first matching customer record
dominant <- count(segmented_data, iso_a3, Cluster) %>% group_by(iso_a3) %>% filter(n == max(n))
map$Segment <- dominant$Cluster[match(map$iso_a3, dominant$iso_a3)]
plot(map["Segment"])
```