---
title: "Spotify Artists Analysis"
author: "James Le"
date: 'Updated: `r Sys.Date()`'
output:
  html_document:
    df_print: paged
    toc: yes
    code_folding: hide
    number_sections: yes
---
# Introduction
Each musician has their own unique musical style: from Ed Sheeran, who devotes his life to the acoustic guitar, to Drake, who masters the art of rapping; from Adele, who can belt some crazy high notes on her pop ballads, to Kygo, who creates EDM magic on his DJ set. Music is about creativity, originality, inspiration, and feeling, and it is the perfect gateway to connect people across differences.
Spotify is the largest music streaming service available. With more than 35 million songs and 170 million monthly active users, it is the ideal platform for musicians to reach their audience. On the app, music can be browsed or searched for via various parameters, such as artist, album, genre, playlist, or record label. Users can create, edit, and share playlists, share tracks on social media, and build playlists with other users. Additionally, Spotify offers a variety of interesting playlists tailor-made for its users, of which these three are my favorites:
* **Discover Weekly**: a weekly generated playlist (updated on Monday) that brings users 2 hours of custom-made music recommendations, mixing a user's personal taste with songs enjoyed by similar listeners.
* **Release Radar**: a personalized playlist that allows users to stay up-to-date on new music released by artists they listen to the most.
* **Daily Mix**: a series of playlists with "near endless playback" that mix the user's favorite tracks with new, recommended songs.
I recently discovered the ['This Is'](https://open.spotify.com/search/playlists/this%20is%20) playlist series. One of Spotify’s best original features, `This Is` delivers on a major promise of the streaming revolution - the canonization and preservation of great artists’ repertoires for future generations to discover and appreciate. Each one is dedicated to a different legendary artist, chronicling the high points of iconic discographies. This is: Kanye West. This is: Maroon 5. This is: Elton John. Spotify has provided a shortcut, giving us curated lists of the greatest songs from the greatest artists.
The purpose of this project is to analyse how similar or different the music produced by different artists on Spotify is. The focus will be placed on disentangling the musical taste of 50 different artists from a wide range of genres. Throughout the process, I also identify clusters of artists that share a similar musical style.
For the study, I will access the [Spotify Web API](https://beta.developer.spotify.com/web-api/), which provides data from the Spotify music catalog and can be accessed via standard HTTPS requests to an API endpoint. The Spotify API provides, among other things, track information for each song, including audio statistics such as *danceability*, *instrumentalness*, or *tempo*. I will focus on retrieving this audio feature information from the 'This Is' playlists of 50 different artists. Each feature measures an aspect of a song. Detailed information on how each feature is calculated can be found on the Spotify API website.
# Getting Data
The first step is registering my application in the [API Website](https://beta.developer.spotify.com/web-api/) and getting the keys (Client ID and Client Secret) for future requests.
The Spotify Web API uses different URIs (Uniform Resource Identifiers) to access playlist, artist, or track information. Consequently, the process of getting the data divides into two key steps:
* Get the "This Is" playlist for each musician.
* Get the audio features for the tracks in each artist's playlist.
## Web API Credentials
First, I created two variables for the *Client ID* and the *Client Secret* credentials.
```{r}
# Note: treat these credentials as secrets; avoid committing real keys to version control
spotifyKey <- "182878ec396d424283c951d6769e9497"
spotifySecret <- "2a6d8f846edc4667ba9f0ba43cd7fe4c"
```
After that, I requested an access token in order to authorise my app to retrieve and manage Spotify data.
```{r}
library(Rspotify)
library(httr)
library(jsonlite)
spotifyEndpoint <- oauth_endpoint(NULL,
                                  "https://accounts.spotify.com/authorize",
                                  "https://accounts.spotify.com/api/token")
spotifyToken <- spotifyOAuth("Spotify Analysis", spotifyKey, spotifySecret)
```
## "This Is" Playlist Series
The first step in pulling the artists' ["This Is" series](https://open.spotify.com/search/playlists/this%20is%20) is to get the URI for each playlist. For reference, here are the 50 musicians I chose, using popularity, modernity, and diversity as the main criteria:
* Pop: Taylor Swift, Ariana Grande, Shawn Mendes, Maroon 5, Adele, Justin Bieber, Ed Sheeran, Justin Timberlake, Charlie Puth, John Mayer, Lorde, Fifth Harmony, Lana Del Rey, James Arthur, Zara Larsson, Pentatonix.
* Hip-Hop / Rap: Kendrick Lamar, Post Malone, Drake, Kanye West, Eminem, Future, 50 Cent, Lil Wayne, Wiz Khalifa, Snoop Dogg, Macklemore, Jay-Z.
* R & B: Bruno Mars, Beyonce, Enrique Iglesias, Stevie Wonder, John Legend, Alicia Keys, Usher, Rihanna.
* EDM / House: Kygo, The Chainsmokers, Avicii, Marshmello, Calvin Harris, Martin Garrix.
* Rock: Coldplay, Elton John, One Republic, The Script, Jason Mraz.
* Jazz: Frank Sinatra, Michael Buble, Norah Jones.
I went to each musician's individual playlist, copied its URI, stored the URIs in a .csv file, and imported that file into R.
```{r}
library(readr)
playlistURI <- read.csv("this-is-playlist-URI.csv", header = T, sep = ";")
```
With each playlist URI, I applied *getPlaylistSongs* from the *Rspotify* package and appended the playlist information to an empty dataframe.
```{r}
# Empty dataframe
PlaylistSongs <- data.frame(PlaylistID = character(),
                            Musician = character(),
                            tracks = character(),
                            id = character(),
                            popularity = integer(),
                            artist = character(),
                            artistId = character(),
                            album = character(),
                            albumId = character(),
                            stringsAsFactors = FALSE)
```
```{r}
# Getting each playlist; use a separate variable rather than
# overwriting the loop index i
for (i in 1:nrow(playlistURI)) {
  playlist_i <- cbind(PlaylistID = as.factor(playlistURI[i, 2]),
                      Musician = as.factor(playlistURI[i, 1]),
                      getPlaylistSongs("spotify",
                                       playlistid = as.factor(playlistURI[i, 2]),
                                       token = spotifyToken))
  PlaylistSongs <- rbind(PlaylistSongs, playlist_i)
}
```
As we can see below, the dataframe has 2129 rows and 10 columns.
```{r}
dim(PlaylistSongs)
```
The following table shows the first 86 rows of my dataframe PlaylistSongs. It contains the tracks by Taylor Swift and Ariana Grande.
```{r}
library(knitr)
library(kableExtra)
library(dplyr)
options(knitr.table.format = "html")
options(width = 12)
# Only Taylor Swift and Ariana Grande
kable(head(PlaylistSongs,86)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12) %>%
scroll_box(width = "1000px", height = "750px")
```
## Audio Features
First, I wrote a function (*getFeatures*) that requests the audio features for a set of track IDs passed in as a single comma-separated string.
```{r}
getFeatures <- function(vector_id, token) {
  link <- httr::GET(paste0("https://api.spotify.com/v1/audio-features/?ids=",
                           vector_id),
                    httr::config(token = token))
  list <- httr::content(link)
  return(list)
}
```
Next, I wrapped *getFeatures* in another function (*get_features*), which extracts the audio features for a string of track IDs and returns them in a dataframe.
```{r}
get_features <- function(x) {
  track_features <- getFeatures(vector_id = x, token = spotifyToken)
  features_output <- do.call(rbind,
                             lapply(track_features$audio_features,
                                    data.frame, stringsAsFactors = FALSE))
  return(features_output)
}
```
Using the functions created above, I can extract the audio features for each track. To do so, I need a string containing the track IDs. The Spotify API accepts at most 100 track IDs per request, so I created one ID string per musician.
```{r}
TaylorSwift_vc <- paste(as.character(PlaylistSongs$id[1:38]), sep="", collapse=",")
ArianaGrande_vc <- paste(as.character(PlaylistSongs$id[39:86]), sep="", collapse=",")
KendrickLamar_vc <- paste(as.character(PlaylistSongs$id[87:124]), sep="", collapse=",")
ShawnMendes_vc <- paste(as.character(PlaylistSongs$id[125:177]), sep="", collapse=",")
Maroon5_vc <- paste(as.character(PlaylistSongs$id[178:226]), sep="", collapse=",")
PostMalone_vc <- paste(as.character(PlaylistSongs$id[227:261]), sep="", collapse=",")
Kygo_vc <- paste(as.character(PlaylistSongs$id[262:299]), sep="", collapse=",")
TheChainsmokers_vc <- paste(as.character(PlaylistSongs$id[300:333]), sep="", collapse=",")
Adele_vc <- paste(as.character(PlaylistSongs$id[334:358]), sep="", collapse=",")
Drake_vc <- paste(as.character(PlaylistSongs$id[359:408]), sep="", collapse=",")
JustinBieber_vc <- paste(as.character(PlaylistSongs$id[409:457]), sep="", collapse=",")
Coldplay_vc <- paste(as.character(PlaylistSongs$id[458:494]), sep="",collapse=",")
KanyeWest_vc <- paste(as.character(PlaylistSongs$id[495:545]), sep="", collapse=",")
BrunoMars_vc <- paste(as.character(PlaylistSongs$id[546:584]), sep="", collapse=",")
EdSheeran_vc <- paste(as.character(PlaylistSongs$id[585:624]), sep="", collapse=",")
Eminem_vc <- paste(as.character(PlaylistSongs$id[625:679]), sep="", collapse=",")
Beyonce_vc <- paste(as.character(PlaylistSongs$id[680:711]), sep="", collapse=",")
Avicii_vc <- paste(as.character(PlaylistSongs$id[712:770]), sep="", collapse=",")
Marshmello_vc <- paste(as.character(PlaylistSongs$id[771:808]), sep="", collapse=",")
CalvinHarris_vc <- paste(as.character(PlaylistSongs$id[809:846]), sep="", collapse=",")
JustinTimberlake_vc <- paste(as.character(PlaylistSongs$id[847:912]), sep="", collapse=",")
FrankSinatra_vc <- paste(as.character(PlaylistSongs$id[913:962]), sep="", collapse=",")
CharliePuth_vc <- paste(as.character(PlaylistSongs$id[963:993]), sep="", collapse=",")
MichaelBuble_vc <- paste(as.character(PlaylistSongs$id[994:1035]), sep="", collapse=",")
MartinGarrix_vc <- paste(as.character(PlaylistSongs$id[1036:1084]), sep="", collapse=",")
EnriqueIglesias_vc <- paste(as.character(PlaylistSongs$id[1085:1125]), sep="", collapse=",")
JohnMayer_vc <- paste(as.character(PlaylistSongs$id[1126:1184]), sep="", collapse=",")
Future_vc <- paste(as.character(PlaylistSongs$id[1185:1224]), sep="", collapse=",")
EltonJohn_vc <- paste(as.character(PlaylistSongs$id[1225:1265]), sep="", collapse=",")
FiftyCent_vc <- paste(as.character(PlaylistSongs$id[1266:1315]), sep="", collapse=",")
Lorde_vc <- paste(as.character(PlaylistSongs$id[1316:1346]), sep="", collapse=",")
LilWayne_vc <- paste(as.character(PlaylistSongs$id[1347:1396]), sep="", collapse=",")
WizKhalifa_vc <- paste(as.character(PlaylistSongs$id[1397:1446]), sep="", collapse=",")
FifthHarmony_vc <- paste(as.character(PlaylistSongs$id[1447:1479]), sep="", collapse=",")
LanaDelRay_vc <- paste(as.character(PlaylistSongs$id[1480:1524]), sep="",collapse=",")
NorahJones_vc <- paste(as.character(PlaylistSongs$id[1525:1562]), sep="", collapse=",")
JamesArthur_vc <- paste(as.character(PlaylistSongs$id[1563:1581]), sep="", collapse=",")
OneRepublic_vc <- paste(as.character(PlaylistSongs$id[1582:1614]), sep="", collapse=",")
TheScript_vc <- paste(as.character(PlaylistSongs$id[1615:1658]), sep="", collapse=",")
StevieWonder_vc <- paste(as.character(PlaylistSongs$id[1659:1708]), sep="", collapse=",")
JasonMraz_vc <- paste(as.character(PlaylistSongs$id[1709:1758]), sep="", collapse=",")
JohnLegend_vc <- paste(as.character(PlaylistSongs$id[1759:1795]), sep="", collapse=",")
Pentatonix_vc <- paste(as.character(PlaylistSongs$id[1796:1834]), sep="", collapse=",")
AliciaKeys_vc <- paste(as.character(PlaylistSongs$id[1835:1884]), sep="", collapse=",")
Usher_vc <- paste(as.character(PlaylistSongs$id[1885:1934]), sep="", collapse=",")
SnoopDogg_vc <- paste(as.character(PlaylistSongs$id[1935:1984]), sep="", collapse=",")
Macklemore_vc <- paste(as.character(PlaylistSongs$id[1985:2007]), sep="",collapse=",")
ZaraLarsson_vc <- paste(as.character(PlaylistSongs$id[2008:2043]), sep="", collapse=",")
JayZ_vc <- paste(as.character(PlaylistSongs$id[2044:2093]), sep="", collapse=",")
Rihanna_vc <- paste(as.character(PlaylistSongs$id[2094:2129]), sep="", collapse=",")
```
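The fifty assignments above can also be generated in one step with *tapply*, grouping the IDs by the *Musician* column instead of hard-coding the row ranges. This is a sketch, not the approach used in the rest of the notebook, and it assumes the *Musician* groups match the manual row ranges above:
```{r}
# Build one comma-separated ID string per musician in a single pass
id_vectors <- tapply(as.character(PlaylistSongs$id),
                     PlaylistSongs$Musician,
                     paste, collapse = ",")
```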
Next, I apply the *get_features* function to each vector, obtaining the audio features for each musician.
```{r}
TaylorSwift <- get_features(TaylorSwift_vc)
ArianaGrande <- get_features(ArianaGrande_vc)
KendrickLamar <- get_features(KendrickLamar_vc)
ShawnMendes <- get_features(ShawnMendes_vc)
Maroon5 <- get_features(Maroon5_vc)
PostMalone <- get_features(PostMalone_vc)
Kygo <- get_features(Kygo_vc)
TheChainsmokers <- get_features(TheChainsmokers_vc)
Adele <- get_features(Adele_vc)
Drake <- get_features(Drake_vc)
JustinBieber <- get_features(JustinBieber_vc)
Coldplay <- get_features(Coldplay_vc)
KanyeWest <- get_features(KanyeWest_vc)
BrunoMars <- get_features(BrunoMars_vc)
EdSheeran <- get_features(EdSheeran_vc)
Eminem <- get_features(Eminem_vc)
Beyonce <- get_features(Beyonce_vc)
Avicii <- get_features(Avicii_vc)
Marshmello <- get_features(Marshmello_vc)
CalvinHarris <- get_features(CalvinHarris_vc)
JustinTimberlake <- get_features(JustinTimberlake_vc)
FrankSinatra <- get_features(FrankSinatra_vc)
CharliePuth <- get_features(CharliePuth_vc)
MichaelBuble <- get_features(MichaelBuble_vc)
MartinGarrix <- get_features(MartinGarrix_vc)
EnriqueIglesias <- get_features(EnriqueIglesias_vc)
JohnMayer <- get_features(JohnMayer_vc)
Future <- get_features(Future_vc)
EltonJohn <- get_features(EltonJohn_vc)
FiftyCent <- get_features(FiftyCent_vc)
Lorde <- get_features(Lorde_vc)
LilWayne <- get_features(LilWayne_vc)
WizKhalifa <- get_features(WizKhalifa_vc)
FifthHarmony <- get_features(FifthHarmony_vc)
LanaDelRay <- get_features(LanaDelRay_vc)
NorahJones <- get_features(NorahJones_vc)
JamesArthur <- get_features(JamesArthur_vc)
OneRepublic <- get_features(OneRepublic_vc)
TheScript <- get_features(TheScript_vc)
StevieWonder <- get_features(StevieWonder_vc)
JasonMraz <- get_features(JasonMraz_vc)
JohnLegend <- get_features(JohnLegend_vc)
Pentatonix <- get_features(Pentatonix_vc)
AliciaKeys <- get_features(AliciaKeys_vc)
Usher <- get_features(Usher_vc)
SnoopDogg <- get_features(SnoopDogg_vc)
Macklemore <- get_features(Macklemore_vc)
ZaraLarsson <- get_features(ZaraLarsson_vc)
JayZ <- get_features(JayZ_vc)
Rihanna <- get_features(Rihanna_vc)
```
After that, I merged each musician's audio-features dataframe into a new one, *all_features*. It contains the audio features for every track in each musician's "This Is" playlist.
```{r}
library(gdata)
all_features <- combine(TaylorSwift,ArianaGrande,KendrickLamar,ShawnMendes,Maroon5,PostMalone,Kygo,TheChainsmokers,Adele,Drake,JustinBieber,Coldplay,KanyeWest,BrunoMars,EdSheeran,Eminem,Beyonce,Avicii,Marshmello,CalvinHarris,JustinTimberlake,FrankSinatra,CharliePuth,MichaelBuble,MartinGarrix,EnriqueIglesias,JohnMayer,Future,EltonJohn,FiftyCent,Lorde,LilWayne,WizKhalifa,FifthHarmony,LanaDelRay,NorahJones,JamesArthur,OneRepublic,TheScript,StevieWonder,JasonMraz,JohnLegend,Pentatonix,AliciaKeys,Usher,SnoopDogg,Macklemore,ZaraLarsson,JayZ,Rihanna)
```
A preview of the *all_features* dataframe can be found below. It only shows 86 rows with the data for Taylor Swift and Ariana Grande. The last column (*Source*) contains the musician.
```{r}
options(knitr.table.format = "html")
options(width = 12)
kable(head(all_features, 86)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12) %>%
scroll_box(width = "1000px", height = "750px")
```
Finally, I computed the mean of each musician's audio features using the *aggregate* function. The resulting dataframe summarises each musician's audio features as the mean over the tracks in their playlist.
```{r}
mean_features <- aggregate(all_features[, c(1:11,17)], list(all_features$source), mean)
names(mean_features) <- c("Musician", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms")
```
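Since *dplyr* is already loaded for the tables, the same per-musician means can be expressed as a pipeline. This is an equivalent sketch, not the original code; it assumes the column layout returned by the audio-features endpoint (where *danceability* through *tempo* are contiguous) and a dplyr version that provides *across()*:
```{r}
# Per-musician means via dplyr (equivalent to the aggregate() call above)
mean_features_dplyr <- all_features %>%
  group_by(Musician = source) %>%
  summarise(across(c(danceability:tempo, duration_ms), mean))
```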
```{r}
options(knitr.table.format = "html")
options(width = 12)
kable(mean_features) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12) %>%
scroll_box(width = "1000px", height = "500px")
```
## Audio Features Description
The description of each feature from the [Spotify Web API Guidance](https://beta.developer.spotify.com/web-api/get-audio-features/) can be found below:
* **Danceability**: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* **Energy**: Is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
* **Key**: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
* **Loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
* **Mode**: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
* **Speechiness**: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
* **Acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
* **Instrumentalness**: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
* **Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
* **Valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
* **Tempo**: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
* **Duration_ms**: The duration of the track in milliseconds.
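As a small illustration of the *key* and *mode* encodings above, the integers can be decoded into readable labels. This is a sketch for illustration only; the pitch-class vector follows the mapping quoted in the *Key* description:
```{r}
# Decode Spotify's integer key/mode encoding into readable labels
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
decode_key <- function(key, mode) {
  paste(pitch_classes[key + 1], ifelse(mode == 1, "major", "minor"))
}
decode_key(0, 1)  # "C major"
decode_key(9, 0)  # "A minor"
```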
# Data Visualization
## Radar Chart
A radar chart is useful for comparing the musical vibes of these musicians in a more visual way. The first visualisation is an R implementation of the radar chart from the [chart.js](http://www.chartjs.org/) javascript library and evaluates the audio features for 10 selected musicians.
To make the chart clearer and more readable, I normalised the values to range from 0 to 1.
```{r}
mean_features_norm <- cbind(mean_features[1],
                            apply(mean_features[-1], 2,
                                  function(x) {(x - min(x)) / diff(range(x))}))
```
Okay, let's plot these interactive radar charts in batches of 10 musicians. Each chart displays dataset labels when you hover over a radial line, showing the value for the selected feature.
**Batch 1: Taylor Swift, Ariana Grande, Kendrick Lamar, Shawn Mendes, Maroon 5, Post Malone, Kygo, The Chainsmokers, Adele, Drake**
```{r}
library(radarchart)
library(tidyr)
sample1 <- mean_features[mean_features$Musician %in% c("TaylorSwift", "ArianaGrande", "KendrickLamar", "ShawnMendes", "Maroon5", "PostMalone", "Kygo", "TheChainsmokers", "Adele", "Drake"),]
mean_features_norm_1 <- cbind(sample1[1],
                              apply(sample1[-1], 2,
                                    function(x) {(x - min(x)) / diff(range(x))}))
radarDF_1 <- gather(mean_features_norm_1, key=Attribute, value=Score, -Musician) %>%
  spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_1,
             scaleStartValue = -1,
             maxScale = 1,
             showToolTipLabel = TRUE)
```
**Batch 2: Justin Bieber, Coldplay, Kanye West, Bruno Mars, Ed Sheeran, Eminem, Beyonce, Avicii, Marshmello, Calvin Harris**
```{r}
sample2 <- mean_features[mean_features$Musician %in% c("JustinBieber", "Coldplay", "KanyeWest", "BrunoMars", "EdSheeran", "Eminem", "Beyonce", "Avicii", "Marshmello", "CalvinHarris"),]
mean_features_norm_2 <- cbind(sample2[1], apply(sample2[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_2 <- gather(mean_features_norm_2, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_2, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
**Batch 3: Justin Timberlake, Frank Sinatra, Charlie Puth, Michael Buble, Martin Garrix, Enrique Iglesias, John Mayer, Future, Elton John, 50 Cent**
```{r}
sample3 <- mean_features[mean_features$Musician %in% c("JustinTimberlake", "FrankSinatra", "CharliePuth", "MichaelBuble", "MartinGarrix", "EnriqueIglesias", "JohnMayer", "Future", "EltonJohn", "FiftyCent"),]
mean_features_norm_3 <- cbind(sample3[1], apply(sample3[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_3 <- gather(mean_features_norm_3, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_3, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
**Batch 4: Lorde, Lil Wayne, Wiz Khalifa, Fifth Harmony, Lana Del Rey, Norah Jones, James Arthur, One Republic, The Script, Stevie Wonder**
```{r}
sample4 <- mean_features[mean_features$Musician %in% c("Lorde", "LilWayne", "WizKhalifa", "FifthHarmony", "LanaDelRay", "NorahJones", "JamesArthur", "OneRepublic", "TheScript", "StevieWonder"),]
mean_features_norm_4 <- cbind(sample4[1], apply(sample4[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_4 <- gather(mean_features_norm_4, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_4, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
**Batch 5: Jason Mraz, John Legend, Pentatonix, Alicia Keys, Usher, Snoop Dogg, Macklemore, Zara Larsson, Jay-Z, Rihanna**
```{r}
sample5 <- mean_features[mean_features$Musician %in% c("JasonMraz", "JohnLegend", "Pentatonix", "AliciaKeys", "Usher", "SnoopDogg", "Macklemore", "ZaraLarsson", "JayZ", "Rihanna"),]
mean_features_norm_5 <- cbind(sample5[1], apply(sample5[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_5 <- gather(mean_features_norm_5, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_5, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
## Cluster Analysis
Another way to explore the differences in these musicians' repertoires is to group them into clusters. The general idea of a clustering algorithm is to divide a dataset into groups on the basis of similarity in the data. In this case, musicians are grouped into clusters according to the characteristics of their music. Rather than defining groups before looking at the data, clustering lets me find and analyze the groups that form organically.
Prior to clustering, it is important to rescale the numeric variables of the dataset. After that, I kept the musicians as the row names so they can be shown as labels in the plot.
```{r}
scaled.features <- scale(mean_features[-1])
rownames(scaled.features) <- mean_features$Musician
```
I applied **K-Means Clustering**, one of the most popular unsupervised learning techniques for unlabeled data. The algorithm works iteratively to assign each data point to one of **K** groups based on the features provided, so that points within a group are similar to one another. In this instance, I chose *K = 6*: since the artists were selected from six genres (Pop, Hip-Hop, R&B, EDM, Rock, and Jazz), I expect the clusters to form roughly along those lines.
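The choice of K can also be sanity-checked with the elbow method, plotting the total within-cluster sum of squares for a range of K values and looking for the point where the curve flattens. This is a sketch for validation, not part of the original analysis:
```{r}
# Elbow method: total within-cluster sum of squares for K = 1..10
set.seed(5000)
wss <- sapply(1:10, function(k) {
  kmeans(scaled.features, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```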
Having applied the K-Means algorithm, I can plot a two-dimensional view of the data. The x-axis and y-axis correspond to the first and second principal components, and the loadings (represented by red arrows) indicate the directional influence each variable has on those components. Let's have a look at the clusters that result from applying the K-Means algorithm to my dataset.
```{r}
library(ggfortify)
library(ggthemes)
set.seed(5000)
k_means <- kmeans(scaled.features, 6)
kmeans_plot <- autoplot(k_means,
                        main = "K-means Clustering",
                        data = scaled.features,
                        loadings = TRUE,
                        loadings.colour = "#CC0000",
                        loadings.label.colour = "#CC0000",
                        loadings.label = TRUE,
                        loadings.label.size = 2.2,
                        loadings.label.repel = T,
                        label.size = 2.2,
                        label.repel = T) +
  scale_fill_manual(values = c("#000066", "#9999CC", "#66CC99", "#FB7201", "#21CDFF", "#FF219C")) +
  scale_color_manual(values = c("#000066", "#9999CC", "#66CC99", "#FB7201", "#21CDFF", "#FF219C")) +
  theme(plot.title = element_text(size = 18, face = "bold"))
kmeans_plot
```
Let's see which artists belong to which clusters:
```{r}
k_means$cluster
```
I also plotted another radar chart containing the mean features for each cluster. It is useful for comparing the attributes of the songs in each cluster.
```{r}
mean_features_norm_50 <- cbind(mean_features[1], apply(mean_features[-1],2,scale))
```
```{r}
library(radarchart)
library(tidyr)
cluster_centers <- as.data.frame(k_means$centers)
cluster <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")
cluster_centers <- cbind(cluster, cluster_centers)
```
```{r}
radarDF_6 <- gather(cluster_centers, key=Attribute, value=Score, -cluster) %>%
spread(key=cluster, value=Score)
# we change the colours according to clusters
colMatrix = matrix(c(c(4,24,102), c(135,133,193), c(87,196,135), c(251,114,1), c(33,205,255), c(255,33,156)), nrow = 3)
# chart
chartJSRadar(scores = radarDF_6,
scaleStartValue = -4,
maxScale =1.5,
showToolTipLabel = TRUE,
colMatrix = colMatrix)
```
* *Cluster 1* includes 4 artists: Coldplay, Avicii, Marshmello, and Martin Garrix. Their music is largely instrumental and often performed live, usually loud and full of energy with a high tempo. Not too surprising, as 3 of the 4 artists make EDM / House music, and Coldplay is known for its live concerts.
* *Cluster 2* includes 2 artists: Frank Sinatra and Norah Jones (any Jazz fans out there?). Their music scores high on acousticness and the Major mode, but low on all the remaining attributes. Typical Jazz tunes.
* *Cluster 3* includes 10 artists: Post Malone, Kygo, The Chainsmokers, Adele, Lorde, Lana Del Rey, James Arthur, One Republic, John Legend, and Alicia Keys. This cluster scores around average on almost all attributes, which suggests a well-balanced, versatile group of artists, hence the diversity of genres represented here (EDM, Pop, R&B).
* *Cluster 4* includes 15 artists: Ariana Grande, Maroon 5, Drake, Justin Bieber, Bruno Mars, Calvin Harris, Charlie Puth, Enrique Iglesias, Future, Wiz Khalifa, Fifth Harmony, Usher, Macklemore, Zara Larsson, and Rihanna. Their music is danceable, loud, high-tempo, and energetic. This group contains many young mainstream artists in the Pop and Hip-Hop genres.
* *Cluster 5* includes 10 artists: Taylor Swift, Shawn Mendes, Ed Sheeran, Michael Buble, John Mayer, Elton John, The Script, Stevie Wonder, Jason Mraz, and Pentatonix. This is my favorite group! Taylor Swift? Ed Sheeran? John Mayer? Jason Mraz? Elton John? I guess I listen to a lot of singer-songwriters. Their music is mostly in the Major mode, while striking a near-perfect balance (average scores) on all other attributes.
* *Cluster 6* includes 9 artists: Kendrick Lamar, Kanye West, Eminem, Beyonce, Justin Timberlake, 50 Cent, Lil Wayne, Snoop Dogg, and Jay-Z. You can already see the trend: 7 of them are rappers, and even Beyonce and Justin Timberlake regularly collaborate with rappers. Their songs have a high number of spoken words and speech-like sections, are long in duration, and are often performed live. Is there any better description of rap music?
## Analysis by Feature
The following charts show the values for each feature for every musician.
### Danceability
```{r}
library(stringr)
# Converting cluster to vector
clusters <- as.vector(k_means$cluster)
clusters <- str_replace_all(clusters, "1", "Cluster 1")
clusters <- str_replace_all(clusters, "2", "Cluster 2")
clusters <- str_replace_all(clusters, "3", "Cluster 3")
clusters <- str_replace_all(clusters, "4", "Cluster 4")
clusters <- str_replace_all(clusters, "5", "Cluster 5")
clusters <- str_replace_all(clusters, "6", "Cluster 6")
mean_features_norm_50 <- cbind(mean_features_norm_50, cluster = clusters)
```
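The six `str_replace_all` calls above can be collapsed into a single `paste0`, since each label is just the cluster number with a "Cluster " prefix. A minimal sketch, using simulated cluster ids (in the analysis they come from `k_means$cluster`):

```r
# Simulated cluster assignments; in the analysis these come from k_means$cluster
cluster_ids <- c(1, 3, 2, 6, 4, 5)

# Prefix every id with "Cluster " in one step -- no repeated str_replace_all
clusters <- paste0("Cluster ", cluster_ids)
clusters
#> [1] "Cluster 1" "Cluster 3" "Cluster 2" "Cluster 6" "Cluster 4" "Cluster 5"
```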
```{r}
# Danceability feature
library(ggplot2)
danceability_subset <- mean_features_norm_50[,c("Musician","danceability", "cluster")]
danceability_subset <- danceability_subset[order(danceability_subset$danceability, decreasing = TRUE), ]
danceability_plot <- ggplot(danceability_subset,
aes(x = reorder(Musician, danceability),
y = danceability, label = danceability)) +
xlab("Musician") + ylab("Danceability") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Danceability Feature") + coord_flip()
danceability_plot
```
If you want to bust a move and impress your crush, try listening to more Future, Drake, Wiz Khalifa, Snoop Dogg, and Eminem. On the other hand, don't even attempt to dance to Frank Sinatra's or Lana Del Rey's tunes.
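The remaining feature charts reuse the chunk above with only the feature name changed, so a small helper function could generate all of them. A sketch (`feature_plot` and `cluster_colors` are hypothetical names, not part of the original analysis); note that `reorder()` already sorts the bars, so pre-sorting the data frame with `order()` is not required:

```r
library(ggplot2)

# Shared palette for all feature charts
cluster_colors <- c("Cluster 1" = "#000066", "Cluster 2" = "#9999CC",
                    "Cluster 3" = "#66CC99", "Cluster 4" = "#FB7201",
                    "Cluster 5" = "#21CDFF", "Cluster 6" = "#FF219C")

# Horizontal bar chart of one feature, coloured by cluster
feature_plot <- function(data, feature, title = tools::toTitleCase(feature)) {
  ggplot(data, aes(x = reorder(Musician, .data[[feature]]),
                   y = .data[[feature]], fill = cluster)) +
    geom_bar(stat = "identity", width = .5) +
    scale_fill_manual(name = "Cluster", values = cluster_colors) +
    xlab("Musician") + ylab(title) +
    labs(title = paste(title, "Feature")) +
    coord_flip()
}
```

Each section below would then reduce to a single call, e.g. `feature_plot(mean_features_norm_50, "energy")`.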
### Energy
```{r}
# Energy feature
energy_subset <- mean_features_norm_50[,c("Musician","energy", "cluster")]
energy_subset <- energy_subset[order(energy_subset$energy, decreasing = TRUE), ]
energy_plot <- ggplot(energy_subset,
aes(x = reorder(Musician, energy),
y = energy, label = energy)) +
xlab("Musician") + ylab("Energy") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Energy Feature") + coord_flip()
energy_plot
```
You're a fairly energetic person if you listen to lots of Marshmello, Calvin Harris, Enrique Iglesias, Martin Garrix, Eminem, and Jay-Z. The opposite is true if you're a fan of Frank Sinatra and Norah Jones.
### Loudness
```{r}
# Loudness feature
loudness_subset <- mean_features_norm_50[,c("Musician","loudness", "cluster")]
loudness_subset <- loudness_subset[order(loudness_subset$loudness, decreasing = TRUE), ]
loudness_plot <- ggplot(loudness_subset,
aes(x = reorder(Musician, loudness),
y = loudness, label = loudness)) +
xlab("Musician") + ylab("Loudness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Loudness Feature") + coord_flip()
loudness_plot
```
The Loudness ranking is nearly identical to the Energy ranking.
### Speechiness
```{r}
# Speechiness feature
speechiness_subset <- mean_features_norm_50[,c("Musician","speechiness", "cluster")]
speechiness_subset <- speechiness_subset[order(speechiness_subset$speechiness, decreasing = TRUE), ]
speechiness_plot <- ggplot(speechiness_subset,
aes(x = reorder(Musician, speechiness),
y = speechiness, label = speechiness)) +
xlab("Musician") + ylab("Speechiness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Speechiness Feature") + coord_flip()
speechiness_plot
```
All the Rap fans out there: what are your favorite songs from Kendrick Lamar? Or 50 Cent? Or Jay-Z? Hmm, I'm surprised Eminem does not rank higher, as I personally consider him the GOAT of rappers.
### Acousticness
```{r}
# Acousticness feature
acousticness_subset <- mean_features_norm_50[,c("Musician","acousticness", "cluster")]
acousticness_subset <- acousticness_subset[order(acousticness_subset$acousticness, decreasing = TRUE), ]
acousticness_plot <- ggplot(acousticness_subset,
aes(x = reorder(Musician, acousticness),
y = acousticness, label = acousticness)) +
xlab("Musician") + ylab("Acousticness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Acousticness Feature") + coord_flip()
acousticness_plot
```
Acousticness is almost the exact opposite of Loudness and Energy. Mr. Sinatra and Ms. Jones released some powerful acoustic tracks throughout their careers.
### Instrumentalness
```{r}
# Instrumentalness feature
instrumentalness_subset <- mean_features_norm_50[,c("Musician","instrumentalness", "cluster")]
instrumentalness_subset <- instrumentalness_subset[order(instrumentalness_subset$instrumentalness, decreasing = TRUE), ]
instrumentalness_plot <- ggplot(instrumentalness_subset,
aes(x = reorder(Musician, instrumentalness),
y = instrumentalness, label = instrumentalness)) +
xlab("Musician") + ylab("Instrumentalness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Instrumentalness Feature") + coord_flip()
instrumentalness_plot
```
EDM for the win! Martin Garrix, Avicii, and Marshmello produce tracks that contain almost no vocals.
### Liveness
```{r}
# Liveness feature
liveness_subset <- mean_features_norm_50[,c("Musician","liveness", "cluster")]
liveness_subset <- liveness_subset[order(liveness_subset$liveness, decreasing = TRUE), ]
liveness_plot <- ggplot(liveness_subset,
aes(x = reorder(Musician, liveness),
y = liveness, label = liveness)) +
xlab("Musician") + ylab("Liveness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Liveness Feature") + coord_flip()
liveness_plot
```
So who are the five artists with the most live-sounding recordings? Jason Mraz, Coldplay, Martin Garrix, Kanye West, and Kendrick Lamar, in that order.
### Valence
```{r}
# Valence feature
valence_subset <- mean_features_norm_50[,c("Musician","valence", "cluster")]
valence_subset <- valence_subset[order(valence_subset$valence, decreasing = TRUE), ]
valence_plot <- ggplot(valence_subset,
aes(x = reorder(Musician, valence),
y = valence, label = valence)) +
xlab("Musician") + ylab("Valence") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Valence Feature") + coord_flip()
valence_plot
```
Valence describes the musical positiveness conveyed by a track. Music by Bruno Mars, Stevie Wonder, and Enrique Iglesias is very positive, while music by Lana Del Rey, Coldplay, and Martin Garrix sounds quite negative.
### Tempo
```{r}
# Tempo feature
tempo_subset <- mean_features_norm_50[,c("Musician","tempo", "cluster")]
tempo_subset <- tempo_subset[order(tempo_subset$tempo, decreasing = TRUE), ]
tempo_plot <- ggplot(tempo_subset,
aes(x = reorder(Musician, tempo),
y = tempo, label = tempo)) +
xlab("Musician") + ylab("Tempo") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Tempo Feature") + coord_flip()
tempo_plot
```
Future, Marshmello, and Wiz Khalifa are the kings of speed, producing tracks with the highest tempos in beats per minute. And Snoop Dogg, lol? He tends to take his time to utter his magic words.
### Duration
```{r}
# Duration feature
duration_subset <- mean_features_norm_50[,c("Musician","duration_ms", "cluster")]
duration_subset <- duration_subset[order(duration_subset$duration_ms, decreasing = TRUE), ]
duration_plot <- ggplot(duration_subset,
aes(x = reorder(Musician, duration_ms),
y = duration_ms, label = duration_ms)) +
xlab("Musician") + ylab("Duration") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Duration Feature") + coord_flip()
duration_plot
```
Last but not least, songs by Justin Timberlake, followed by Elton John and Eminem, are, sometimes excruciatingly, long. In contrast, Frank Sinatra, Zara Larsson, and Pentatonix favor short and snappy tracks.