---
title: "RFM Customer Segmentation with K Means"
subtitle: "Individual customer choice is paramout"
author: "Dessi G"
date: "28/09/2020"
output: html_document
---
```{r setup, include= FALSE, echo=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
include= FALSE, echo=FALSE, warning=FALSE, message=FALSE, cache =TRUE)
## Load the required libraries
library(stringr) # for string data manipulation
library(readr) # for data load
library(dplyr) # for data wrangling
library(tidyr) # for reshaping data (used by gather() later on)
library(ggplot2) # for awesome graphics
library(GGally) # extension of ggplot for nice correlation visualisations
library(rsample) # for creating validation splits
library(recipes) # for feature engineering
library(lubridate) # for calculating recency
library(knitr) # for visualising outputs in a nice table
library(ggstatsplot) # for visualising outliers
library(factoextra) # for cluster visualisation and choosing the number of clusters
library(pdp) # for arranging multiple visualisations together (grid.arrange)
## for geomarketing analysis
library(tmap)
library(sf)
library(datasets)
library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)
## Load the data
df <- read_csv("Data/data.csv")
country_coord <- read_csv("Data/countries_geographic_coordinates.csv")
# join the 2 tables to get the geographic coordinates of the countries
df <- inner_join(df, country_coord)
# check the output table
df %>% distinct(Country)
df %>% distinct(Description)
```
# RFM Customer segmentation with K Means
This article is part of a bigger project for the development of a basic marketing analytics stack. The idea is to build a stack of reusable and replicable templates for ML-based marketing analysis. Template here is used in the sense of process logic, reusable code snippets and generalised business case scenarios for which the model is applicable.
For every analytical model an interactive Shiny application is suggested. The idea of the app is to serve as a reusable framework for model presentation, allowing non-technical users to get a visual understanding of the data and to play with different scenarios directly in the app. In addition, the application allows the analysis to be automated: once the application is integrated with the data source, every change in the data automatically updates the output figures of the analysis. This enables near-instantaneous marketing action, such as grasping time-sensitive opportunities for up-sell promotions.
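As an illustration only, a minimal sketch of what such an app skeleton could look like is shown below; the file path, column names and refresh interval are hypothetical placeholders rather than part of this analysis.
```{r shiny-app-sketch, eval=FALSE}
# Hypothetical sketch of a self-updating segmentation app; not run in this article.
library(shiny)
library(ggplot2)

ui <- fluidPage(
  sliderInput("k", "Number of clusters", min = 2, max = 8, value = 4),
  plotOutput("clusters")
)

server <- function(input, output, session) {
  # re-read the (hypothetical) RFM extract every 5 minutes, so new transactions
  # automatically flow into the clustering and the plot below
  rfm <- reactiveFileReader(300000, session, "Data/rfm_extract.csv", read.csv)
  output$clusters <- renderPlot({
    scaled <- scale(rfm()[, c("recency", "frequency", "total_value")])
    km <- kmeans(scaled, centers = input$k, nstart = 25)
    ggplot(as.data.frame(scaled),
           aes(recency, frequency, colour = factor(km$cluster))) +
      geom_point()
  })
}

shinyApp(ui, server)
```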
### Structure of the article:
1. Business objectives of the model;
2. Algorithm behind the model;
3. Model development process;
4. Ideas for further analysis of the model output.
### Business objectives of customer segmentation
**Customer segmentation** helps you understand how, why, and when your product or service is purchased or used. These insights are crucial for the efficient allocation of marketing resources.
**Recency, frequency and monetary (RFM)** customer segmentation gives you a base model for analysing customers according to their transactional behaviour: how recent their last transaction was, how often they purchase, and how much they spend.
**RFM customer segmentation** is therefore an effective way to prevent customer churn and to identify and act on up-sell and cross-sell opportunities.
### Why K Means clustering
**K-means** is an unsupervised algorithm and probably the most widely used clustering algorithm. One of its major advantages is that it can handle larger data sets than, for example, hierarchical clustering approaches.
However, K-means comes with some disadvantages as well: it is sensitive to outliers, and the data needs to be scaled before running the algorithm.
### Why R
The advantages of using R:
- Availability of well-developed analytical packages;
- Ease of creating beautiful layered visualisations;
- Ease of presenting ML models through interactive applications (Shiny App).
### Model development process
The development process of the model has 2 main stages:
**Data Preparation**
1. Data cleaning;
2. Feature engineering;
3. Summary statistics of the engineered data set:
- remove outliers;
- scale the data;
4. Build a heatmap to define the correlations between the variables.
**Model Development**
5. K means model development:
- Define the data point distance measurement method;
- Define the optimal number of clusters;
- Build the clusters;
- Visualise the clusters.
6. Ideas for cluster analysis and persona building:
- Categorise the clusters according to their value for the business;
- Provide a geospatial visualisation of the most dominant clusters per country.
## Data preparation
The main objective of this stage is to remove missing values, convert data types where necessary, and examine the data for any discrepancies.
```{r clean-the-data, echo=FALSE, warning=FALSE, message=FALSE}
## remove NAs and convert data types where necessary
# identify the location of NAs
which(is.na(df))
# define the count of NAs
sum(is.na(df)) # [1] 135621
# identify the columns containing missing values
colSums(is.na(df)) # the NAs are missing customer ids
# extract the NAs in case you need to analyse them later
NA_ids <- filter(df, is.na(CustomerID))
# take a look at the missing data, the missing customers might be high value customers
glimpse(NA_ids)
# remove the rows with missing customer ids
df <- filter(df, !is.na(CustomerID))
glimpse(df) # Rows: 397,932
# reformat the dates field from character to date format
df$InvoiceDate <- as.Date(df$InvoiceDate, format = "%m/%d/%Y")
# convert customer id from double to character
df$CustomerID <- as.character(df$CustomerID)
```
Check the basic statistics of the cleaned data set:
```{r echo=FALSE, warning=FALSE, message=FALSE}
summary(df)
```
The time span of the data set is about a year:
* maxDate = 2011-12-09
* minDate = 2010-12-01
In a real business scenario the time span needed for customer segmentation should be defined by the business objective, the maturity of the business, the dynamics of the market, and the type of product or service sold.
Also, in cases where there is seasonality in the purchase behaviour, the time span for customer segmentation should capture at least 2 similar seasons. Usually this means that the analysis will be done over 24 months of transactional data.
The summary of the data shows that there are transactions with a total product price of 0 and products with negative quantities. This means that we might be dealing with refunds and discounts. We will remove the refunds and discounts together with their original transactions. These can be analysed later at segment level, for example to see which of the segments claimed the most refunds. This type of analysis can turn out to be quite important not only for the development of the products and services, but also for the further profiling of the segments.
```{r echo=FALSE, warning=FALSE, message=FALSE}
# take an overall look at the smallest (negative) price and quantity values
df %>%
  arrange(UnitPrice)
df %>%
  arrange(Quantity)
df %>%
  filter(UnitPrice == 0) # 37 transactions without monetary value
# extract the refunds for further analysis
refund <- df %>%
  filter(Quantity < 0)
str(refund) # 8,535 refunded transactions
# from the stock code values you can see that some of the negative quantities are actually discounts
# check all the transactions that have a non-numeric StockCode
not_revenue <- df %>% filter(!str_detect(StockCode, "[0-9]"))
str(not_revenue) # 1,774 discounts, not product purchase transactions
## remove all transactions that are not revenue-generating product purchases
df <- filter(df, UnitPrice != 0)
df <- filter(df, Quantity > 0)
df <- filter(df, str_detect(StockCode, "[0-9]"))
## check whether the remaining data set contains the initial transactions that were refunded
inner_join(df, refund)
# 0 rows - the original transactions of the refunds are not in the remaining data
```
## Features engineering
To build the RFM model we first need to calculate the values for the recency, frequency and monetary variables. We can use the definitions below.
**recency** = the most recent date in the data set minus the date of the customer's last purchase. The lower the value, the more recent the customer's visit to the store.
**frequency** = the number of distinct purchases (invoices) per customer.
**monetary** = the customer's spend; in the code below it is calculated as the average value per order (total spend divided by the number of distinct invoices).
These variables, together with the customer id, will form the data frame for our clusters.
```{r feature-engineering , echo=FALSE, warning=FALSE, message=FALSE}
grouped_df <- df %>%
group_by(CustomerID) %>%
summarise(recency = as.numeric(as.Date("2011-12-09")-max(InvoiceDate)),
frequency = n_distinct(InvoiceNo),
total_value = round(sum(Quantity*UnitPrice)/n_distinct(InvoiceNo),1))
glimpse(grouped_df) # 4,312
```
As we mentioned earlier, K-means is calculated based on the distances between the data points, which makes it sensitive to the presence of outliers. Running summary statistics on the data set will help us define whether there are any outlier values.
```{r echo=FALSE, warning=FALSE, message=FALSE}
summary(grouped_df)
```
The summary statistics show that we have outliers in the frequency and monetary variables. For the purpose of getting the clustering right we need to remove these outliers. However, we should not forget about them, as they represent our most valuable customers.
To remove the outliers we need to identify a numerical cut-off range that differentiates an outlier from a non-outlier. We will use the [1.5*iqr rule](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule) for that purpose.
We can calculate the interquartile range by first finding the quartiles. Here we use the quantile() function to find the 25th and the 75th percentiles of the data set and the IQR() function, which gives the difference between the 75th and 25th percentiles.
```{r iqr-estimate, echo=FALSE, warning=FALSE, message=FALSE}
Q_value <- quantile(grouped_df$total_value, probs=c(.25, .75), na.rm = FALSE)
Q_frequency <- quantile(grouped_df$frequency, probs=c(.25, .75), na.rm = FALSE)
iqr_value <-IQR(grouped_df$total_value)
iqr_frequency <-IQR(grouped_df$frequency)
```
Then we calculate the upper and lower range limits of our data:
**upper** <- Q3+1.5*iqr
**lower** <- Q1-1.5*iqr
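As a quick illustration, the explicit cut-off bounds can be written out using the Q_value, Q_frequency, iqr_value and iqr_frequency objects computed above (the variable names below are illustrative only):
```{r iqr-bounds}
# Q_value[1] / Q_frequency[1] hold Q1, Q_value[2] / Q_frequency[2] hold Q3
upper_value <- Q_value[2] + 1.5 * iqr_value
lower_value <- Q_value[1] - 1.5 * iqr_value
upper_frequency <- Q_frequency[2] + 1.5 * iqr_frequency
lower_frequency <- Q_frequency[1] - 1.5 * iqr_frequency
```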
We can use the subset() function to apply the 1.5*iqr rule and remove the outliers from the data set.
```{r remove-outliers, echo=FALSE, warning=FALSE, message=FALSE}
grouped_df<- subset(grouped_df, grouped_df$total_value > (Q_value[1] - 1.5*iqr_value)
& grouped_df$total_value < (Q_value[2]+1.5*iqr_value))
grouped_df<- subset(grouped_df, grouped_df$frequency > (Q_frequency[1] - 1.5*iqr_frequency)
& grouped_df$frequency < (Q_frequency[2]+1.5*iqr_frequency))
```
Now we can check whether the value ranges look more evenly spread by visualising them with boxplots.
```{r box-plot, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE}
grouped_df%>%
ggplot(aes(total_value, recency ))+
geom_boxplot()
grouped_df%>%
ggplot(aes(total_value,frequency))+
geom_boxplot()
```
### Scaling (normalising) the data
The next step is to normalise the variables to have mean = 0 and variance = 1. For this we will use the R scale() function. This step is important because we want the K-means algorithm to weight the variables equally when creating the clusters.
```{r scale, echo=FALSE, warning=FALSE, message=FALSE, include= FALSE}
grouped_df_scale <-
grouped_df %>%
subset(select = - CustomerID) %>%
scale()
glimpse(grouped_df_scale)
```
Now we can visualise the correlation between the variables. We can use a heatmap for that purpose.
```{r heatmap, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE}
ggcorr(grouped_df, method = c("everything", "pearson"),label = TRUE, label_size = 5, label_round = 2)
```
We have only 3 variables and just 2 of them are slightly correlated. The lack of strong correlation is a good sign, as it will prevent bias in the model development. The interesting observation here is the negative correlation between recency and frequency. This is because of the opposite ways in which we assess the two variables: the lower the recency value, the better ranked the customer, while the opposite applies for frequency - the higher, the better. Another observation is that the more frequent transactions have a higher value, which makes sense.
```{r include=TRUE}
grouped_df%>%
ggplot(aes(recency,frequency))+
geom_col()
grouped_df%>%
ggplot(aes(frequency,total_value))+
geom_smooth(method='lm', formula= y~x)
```
Overall, the more recent transactions have a higher frequency. This might be an indicator of seasonality in the business performance. As our cut-off date is in December (the Christmas period), the closer to Christmas we get, the more frequent the orders per customer become. This means that in order to be accurate in our segmentation we would need data from at least 2 years, e.g. 2 shopping seasons. Unfortunately, the available data set covers only one year. We will use it just to showcase the way the algorithm works.
## Building the K means model
### Choose a method to measure the distances between the data points
The main objective of K-means is to group the data into uniform sets of similar observations that are distinguishable from other sets of similar observations. The similarity and dissimilarity between the observations is defined by computing the distances between the data points. There are [different ways](https://machinelearningmastery.com/distance-measures-for-machine-learning/) of measuring the distances between data points. The Euclidean method is the only method that [works](https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric) with K-means.
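As a quick illustration, the pairwise Euclidean distances between a handful of the scaled observations can be computed with base R's dist() function (shown here only to make the distance idea concrete):
```{r distance-example, include=TRUE}
# Euclidean distances between the first 5 scaled customer observations
d <- dist(head(grouped_df_scale, 5), method = "euclidean")
round(as.matrix(d), 2)
```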
### Define the number of clusters
The next step is to specify the optimal number of clusters to split the data into. First we can try to find the number of clusters manually, just to get a feeling for how K-means works. Then we can automate this process by using an algorithm.
For the manual calculation we can use the kmeans() function. Let's calculate and visualise our options with clusters from 2 to 8. The "centers" argument of the function defines the number of clusters, while "nstart" defines the number of times the data will be reshuffled to find the most cohesive clusters given the set-up. Here we will set nstart = 25. This means that R will run 25 random initial assignments and will select the one with the lowest within-cluster variation.
```{r}
k2 <- kmeans(grouped_df_scale, centers = 2, nstart = 25)
k3 <- kmeans(grouped_df_scale, centers = 3, nstart = 25)
k4 <- kmeans(grouped_df_scale, centers = 4, nstart = 25)
k5 <- kmeans(grouped_df_scale, centers = 5, nstart = 25)
k6 <- kmeans(grouped_df_scale, centers = 6, nstart = 25)
k7 <- kmeans(grouped_df_scale, centers = 7, nstart = 25)
k8 <- kmeans(grouped_df_scale, centers = 8, nstart = 25)
```
```{r include=TRUE}
# to view the results
viz2 <- fviz_cluster(k2, geom = "point", data = grouped_df_scale) + ggtitle("k = 2")
viz3 <- fviz_cluster(k3, geom = "point", data = grouped_df_scale) + ggtitle("k = 3")
viz4 <- fviz_cluster(k4, geom = "point", data = grouped_df_scale) + ggtitle("k = 4")
viz5 <- fviz_cluster(k5, geom = "point", data = grouped_df_scale) + ggtitle("k = 5")
viz6 <- fviz_cluster(k6, geom = "point", data = grouped_df_scale) + ggtitle("k = 6")
viz7 <- fviz_cluster(k7, geom = "point", data = grouped_df_scale) + ggtitle("k = 7")
viz8 <- fviz_cluster(k8, geom = "point", data = grouped_df_scale) + ggtitle("k = 8")
grid.arrange(viz2, viz3, viz4, viz5, viz6, viz7, viz8, nrow = 2)
```
There are different automated ways to define the optimal number of clusters. We will compare the output of 2 methods - the Elbow and the Average Silhouette method. For both of them we use the fviz_nbclust() function.
**Elbow method** - the basic idea of this method is to choose the number of clusters such that the total intra-cluster variation, i.e. the total within-cluster sum of squares, is minimised.
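For intuition, the same elbow curve can be sketched manually by computing the total within-cluster sum of squares over a range of k values; this is effectively what fviz_nbclust() with method = "wss" does for us:
```{r elbow-manual, include=TRUE}
# total within-cluster sum of squares for k = 1..8
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(grouped_df_scale, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
```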
```{r include=TRUE}
set.seed(123)
fviz_nbclust(grouped_df_scale, kmeans, method = "wss")
```
**Average Silhouette Method** - measures the quality of the clustering, i.e. how well each object lies within its cluster.
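For intuition, the average silhouette width of an individual solution (for example the k = 4 model fitted above) can also be computed directly with the cluster package, which ships with R; this is just a sketch to show what the metric measures:
```{r silhouette-example, include=TRUE}
# average silhouette width of the 4-cluster solution fitted earlier
library(cluster)
sil <- silhouette(k4$cluster, dist(grouped_df_scale))
mean(sil[, "sil_width"])
```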
```{r include=TRUE}
set.seed(123)
fviz_nbclust(grouped_df_scale, kmeans, method = "silhouette")
```
Both methods suggest that the best number of clusters is four. Let's apply this result in a final K-means calculation. We use the set.seed() function to get a reproducible result - remember from above that K-means uses random reshuffling of the observations to get the optimal cluster split.
```{r}
set.seed(123)
K_final <- kmeans(grouped_df_scale, 4, nstart = 25)
viz_final <- fviz_cluster(K_final, geom = "point", data = grouped_df_scale) + ggtitle("k final")
viz_final
```
We have got 4 clusters of fairly similar sizes, apart from one that is about double the size of the rest of the clusters.
In the remaining part of this article we will explore some ideas for further analysis of the segments.
## Segment Analysis
To begin with, we can get some basic figures describing the size of the segments and the way they score within the RFM model.
The size can be defined by the number of observations assigned to each cluster, while the mean value of each feature (recency, frequency and monetary) will give us an idea of which are our best and worst performing segments.
We can get these figures by running a summary table of the clustered data using the summarise_all() function. This function works only with numeric data, so before running it we need to exclude the customer id column, which is in character format. After we get our summary table we will transpose it to facilitate its visualisation.
```{r}
# exclude the customer id column, add the cluster assignments and summarise per cluster
summary_K <- grouped_df %>%
  subset(select = -CustomerID) %>%
  mutate(Cluster = K_final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all(list(n = "length", mean = "mean"))
summary_K <- summary_K %>%
  subset(select = -c(recency_n, frequency_n)) %>%
  rename(segment_size = total_value_n)
total_n_customers <- sum(summary_K$segment_size)
# transpose the summary table to facilitate the visualisation
summary_K_transposed <- gather(summary_K,
                               key = "feature",
                               value = "value",
                               -Cluster)
```
Now we can visualise the segments to determine which are the most and the least valuable ones. This will give us an idea of how and where to focus our marketing efforts.
```{r include=TRUE}
summary_K_transposed %>%
ggplot(aes(x=reorder(Cluster,-value),y=value, fill=as.factor(Cluster)))+
geom_col()+
facet_grid(.~feature)+
labs(title="Segments performance", subtitle = paste("Total N of customers: ", sum(summary_K$segment_size)))
```
From the visualisation of the summary statistics we can conclude that our most valuable customers are in segments 1 and 2. Segment 3 can be defined as lapsed, as it has fewer than 2 purchases made almost a year ago. Our most active segment is segment 2, while segment 1 has the highest propensity to spend. Therefore, if we put some marketing effort into increasing the frequency of purchase of segment 1, we can yield significant revenue for the company.
Segment 4 is our biggest segment. The only feature that makes it better than the lapsed segment 3 is the recency of purchase; otherwise the customers from this segment made on average no more than 2 purchases. As this is the biggest segment, it is worth putting effort into increasing its frequency of purchase by testing promotions and cross-sell campaigns.
### Geographic Profiling of the segments
Another interesting analysis of the segments is their geographic distribution. To perform this analysis we need to attach the segments to the initial data frame and get the geographic attributes of the customers. Also, do not forget to deduplicate your data set and leave only unique customer records.
```{r}
clustered_df <- grouped_df %>%
  mutate(Cluster = K_final$cluster) %>%
  group_by(Cluster)
glimpse(clustered_df)
clustered_all <- inner_join(x = clustered_df, y = df)
# get only the unique records per segment
segmented_data <- clustered_all %>%
  distinct(CustomerID, .keep_all = TRUE)
```
Now you can run some bar chart visualisations of the segments' performance per country.
```{r include=TRUE}
segmented_data %>%
  ggplot(aes(Country, fill = as.factor(Cluster))) +
  geom_bar() +
  coord_flip()
```
We can see that the number of customers for the UK is disproportionately higher compared to the rest of the countries. Therefore we will separate the analysis of the UK from the rest of the countries.
Let's see how the segments are represented in the UK and in the rest of the world separately.
```{r include=TRUE}
segmented_data %>%
  filter(Country != "United Kingdom") %>%
  ggplot(aes(Country, Cluster, fill = as.factor(Cluster))) +
  geom_col() +
  labs(title = "Segments performance for the rest of the world", y = "Cluster")
segmented_data %>%
  filter(Country == "United Kingdom") %>%
  ggplot(aes(Cluster, fill = as.factor(Cluster))) +
  geom_bar() +
  labs(title = "Segments performance for UK")
```
Our best segments, 1 and 2, are well represented in Germany and France. The business has a good base to develop its market share within these 2 geographies.
Segment 2, our most active segment, has a dominant representation in the UK market.
For both markets - the UK and the rest of the world - segment 1, the segment with the highest propensity to spend, is smaller in size compared to segment 2. It would make sense to focus our marketing efforts on increasing the representation and activity of segment 1.
Finally, we can get a geospatial representation of the dominant segment per country.
```{r segment-map, include = TRUE}
map <- ne_countries(scale = "medium", returnclass = "sf")
# attach the dominant (most represented) cluster per country, rather than the first matching customer record
dominant <- count(segmented_data, iso_a3, Cluster) %>% group_by(iso_a3) %>% filter(n == max(n))
map$Segment <- dominant$Cluster[match(map$iso_a3, dominant$iso_a3)]
plot(map["Segment"])
```