---
title: "Spotify Artists Analysis"
author: "James Le"
date: 'Updated: `r Sys.Date()`'
output:
  html_document:
    df_print: paged
    toc: yes
    code_folding: hide
    number_sections: yes
---
# Introduction
Each musician has their own unique musical style: from Ed Sheeran, who devotes his life to the acoustic guitar, to Drake, who masters the art of rapping; from Adele, who can belt some crazy high notes on her pop ballads, to Kygo, who creates EDM magic on his DJ set. Music is about creativity, originality, inspiration, and feeling, and it is the perfect gateway to connect people across differences.
Spotify is the largest music streaming service available. With more than 35 million songs and 170 million monthly active users, it is the ideal platform for musicians to reach their audience. On the app, music can be browsed or searched for via various parameters, such as artist, album, genre, playlist, or record label. Users can create, edit, and share playlists, share tracks on social media, and build playlists with other users. Additionally, Spotify offers a variety of interesting playlists tailor-made for its users, of which these three are my favorites:
* **Discover Weekly**: a weekly generated playlist (updated on Monday) that brings users 2 hours of custom-made music recommendations, mixing a user's personal taste with songs enjoyed by similar listeners.
* **Release Radar**: a personalized playlist that allows users to stay up-to-date on new music released by artists they listen to the most.
* **Daily Mix**: a series of playlists with "near endless playback" that mix the user's favorite tracks with new, recommended songs.
I recently discovered the ['This Is'](https://open.spotify.com/search/playlists/this%20is%20) playlist series. One of Spotify’s best original features, `This Is` delivers on a major promise of the streaming revolution - the canonization and preservation of great artists’ repertoires for future generations to discover and appreciate. Each one is dedicated to a different legendary artist, chronicling the high points of iconic discographies. This is: Kanye West. This is: Maroon 5. This is: Elton John. Spotify has provided a shortcut, giving us curated lists of the greatest songs from the greatest artists.
The purpose of this project is to analyse how similar or different the music produced by different artists on Spotify is. The focus will be placed on disentangling the musical taste of 50 different artists from a wide range of genres. Throughout the process, I also identify clusters of artists that share a similar musical style.
For the study, I will access the [Spotify Web API](https://beta.developer.spotify.com/web-api/), which provides data from the Spotify music catalog and can be accessed via standard HTTPS requests to an API endpoint. The Spotify API provides, among other things, track information for each song, including audio statistics such as *danceability*, *instrumentalness*, or *tempo*. I will focus on retrieving this audio feature information from the 'This Is' playlists of 50 different artists. Each feature measures an aspect of a song. Detailed information on how each feature is calculated can be found on the Spotify API website.
# Getting Data
The first step is registering my application in the [API Website](https://beta.developer.spotify.com/web-api/) and getting the keys (Client ID and Client Secret) for future requests.
The Spotify Web API uses different URIs (Uniform Resource Identifiers) to access playlist, artist, or track information. Consequently, the process of getting the data divides into two key steps:
* Get the "This Is" playlist for each musician.
* Get the audio features for the tracks in each artist's playlist.
## Web API Credentials
First, I created two variables for the *Client ID* and the *Client Secret* credentials.
```{r}
# Note: treat these credentials as secrets; avoid committing real keys to version control
spotifyKey <- "182878ec396d424283c951d6769e9497"
spotifySecret <- "2a6d8f846edc4667ba9f0ba43cd7fe4c"
```
After that, I requested an access token in order to authorise my app to retrieve and manage Spotify data.
```{r}
library(Rspotify)
library(httr)
library(jsonlite)
spotifyEndpoint <- oauth_endpoint(NULL,
                                  "https://accounts.spotify.com/authorize",
                                  "https://accounts.spotify.com/api/token")
spotifyToken <- spotifyOAuth("Spotify Analysis", spotifyKey, spotifySecret)
```
## "This Is" Playlist Series
The first step in pulling the artists' ["This Is" series](https://open.spotify.com/search/playlists/this%20is%20) is to get the URI for each playlist. For reference, here are the 50 musicians I chose, using popularity, modernity, and diversity as the main criteria:
* Pop: Taylor Swift, Ariana Grande, Shawn Mendes, Maroon 5, Adele, Justin Bieber, Ed Sheeran, Justin Timberlake, Charlie Puth, John Mayer, Lorde, Fifth Harmony, Lana Del Rey, James Arthur, Zara Larsson, Pentatonix.
* Hip-Hop / Rap: Kendrick Lamar, Post Malone, Drake, Kanye West, Eminem, Future, 50 Cent, Lil Wayne, Wiz Khalifa, Snoop Dogg, Macklemore, Jay-Z.
* R & B: Bruno Mars, Beyonce, Enrique Iglesias, Stevie Wonder, John Legend, Alicia Keys, Usher, Rihanna.
* EDM / House: Kygo, The Chainsmokers, Avicii, Marshmello, Calvin Harris, Martin Garrix.
* Rock: Coldplay, Elton John, One Republic, The Script, Jason Mraz.
* Jazz: Frank Sinatra, Michael Buble, Norah Jones.
I went to each musician's individual playlist, copied its URI, stored the URIs in a .csv file, and imported that file into R.
```{r}
library(readr)
playlistURI <- read.csv("this-is-playlist-URI.csv", header = T, sep = ";")
```
With each playlist URI, I applied *getPlaylistSongs* from the *Rspotify* package and appended the playlist information to an empty dataframe.
```{r}
# Empty dataframe
PlaylistSongs <- data.frame(PlaylistID = character(),
                            Musician = character(),
                            tracks = character(),
                            id = character(),
                            popularity = integer(),
                            artist = character(),
                            artistId = character(),
                            album = character(),
                            albumId = character(),
                            stringsAsFactors = FALSE)
```
```{r}
# Getting each playlist; use a separate variable rather than
# overwriting the loop index i
for (i in 1:nrow(playlistURI)) {
  playlist_i <- cbind(PlaylistID = as.factor(playlistURI[i, 2]),
                      Musician = as.factor(playlistURI[i, 1]),
                      getPlaylistSongs("spotify",
                                       playlistid = as.factor(playlistURI[i, 2]),
                                       token = spotifyToken))
  PlaylistSongs <- rbind(PlaylistSongs, playlist_i)
}
```
As we can see below, the dataframe has 2129 rows and 10 columns.
```{r}
dim(PlaylistSongs)
```
The following table shows the first 86 rows of my dataframe PlaylistSongs. It contains the tracks by Taylor Swift and Ariana Grande.
```{r}
library(knitr)
library(kableExtra)
library(dplyr)
options(knitr.table.format = "html")
options(width = 12)
# Only Taylor Swift and Ariana Grande
kable(head(PlaylistSongs,86)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12) %>%
scroll_box(width = "1000px", height = "750px")
```
## Audio Features
First, I wrote a function (*getFeatures*) that requests the audio features for a set of track IDs passed in as a single comma-separated string.
```{r}
getFeatures <- function(vector_id, token) {
  link <- httr::GET(paste0("https://api.spotify.com/v1/audio-features/?ids=",
                           vector_id),
                    httr::config(token = token))
  list <- httr::content(link)
  return(list)
}
```
Next, I wrapped *getFeatures* in another function (*get_features*), which extracts the audio features for a string of track IDs and returns them in a dataframe.
```{r}
get_features <- function(x) {
  track_features <- getFeatures(vector_id = x, token = spotifyToken)
  features_output <- do.call(rbind,
                             lapply(track_features$audio_features,
                                    data.frame, stringsAsFactors = FALSE))
  return(features_output)
}
```
Using the functions created above, I can extract the audio features for each track. To do so, I need a string containing the track IDs. The Spotify API accepts at most 100 track IDs per request, so I created one ID string per musician.
```{r}
TaylorSwift_vc <- paste(as.character(PlaylistSongs$id[1:38]), sep="", collapse=",")
ArianaGrande_vc <- paste(as.character(PlaylistSongs$id[39:86]), sep="", collapse=",")
KendrickLamar_vc <- paste(as.character(PlaylistSongs$id[87:124]), sep="", collapse=",")
ShawnMendes_vc <- paste(as.character(PlaylistSongs$id[125:177]), sep="", collapse=",")
Maroon5_vc <- paste(as.character(PlaylistSongs$id[178:226]), sep="", collapse=",")
PostMalone_vc <- paste(as.character(PlaylistSongs$id[227:261]), sep="", collapse=",")
Kygo_vc <- paste(as.character(PlaylistSongs$id[262:299]), sep="", collapse=",")
TheChainsmokers_vc <- paste(as.character(PlaylistSongs$id[300:333]), sep="", collapse=",")
Adele_vc <- paste(as.character(PlaylistSongs$id[334:358]), sep="", collapse=",")
Drake_vc <- paste(as.character(PlaylistSongs$id[359:408]), sep="", collapse=",")
JustinBieber_vc <- paste(as.character(PlaylistSongs$id[409:457]), sep="", collapse=",")
Coldplay_vc <- paste(as.character(PlaylistSongs$id[458:494]), sep="",collapse=",")
KanyeWest_vc <- paste(as.character(PlaylistSongs$id[495:545]), sep="", collapse=",")
BrunoMars_vc <- paste(as.character(PlaylistSongs$id[546:584]), sep="", collapse=",")
EdSheeran_vc <- paste(as.character(PlaylistSongs$id[585:624]), sep="", collapse=",")
Eminem_vc <- paste(as.character(PlaylistSongs$id[625:679]), sep="", collapse=",")
Beyonce_vc <- paste(as.character(PlaylistSongs$id[680:711]), sep="", collapse=",")
Avicii_vc <- paste(as.character(PlaylistSongs$id[712:770]), sep="", collapse=",")
Marshmello_vc <- paste(as.character(PlaylistSongs$id[771:808]), sep="", collapse=",")
CalvinHarris_vc <- paste(as.character(PlaylistSongs$id[809:846]), sep="", collapse=",")
JustinTimberlake_vc <- paste(as.character(PlaylistSongs$id[847:912]), sep="", collapse=",")
FrankSinatra_vc <- paste(as.character(PlaylistSongs$id[913:962]), sep="", collapse=",")
CharliePuth_vc <- paste(as.character(PlaylistSongs$id[963:993]), sep="", collapse=",")
MichaelBuble_vc <- paste(as.character(PlaylistSongs$id[994:1035]), sep="", collapse=",")
MartinGarrix_vc <- paste(as.character(PlaylistSongs$id[1036:1084]), sep="", collapse=",")
EnriqueIglesias_vc <- paste(as.character(PlaylistSongs$id[1085:1125]), sep="", collapse=",")
JohnMayer_vc <- paste(as.character(PlaylistSongs$id[1126:1184]), sep="", collapse=",")
Future_vc <- paste(as.character(PlaylistSongs$id[1185:1224]), sep="", collapse=",")
EltonJohn_vc <- paste(as.character(PlaylistSongs$id[1225:1265]), sep="", collapse=",")
FiftyCent_vc <- paste(as.character(PlaylistSongs$id[1266:1315]), sep="", collapse=",")
Lorde_vc <- paste(as.character(PlaylistSongs$id[1316:1346]), sep="", collapse=",")
LilWayne_vc <- paste(as.character(PlaylistSongs$id[1347:1396]), sep="", collapse=",")
WizKhalifa_vc <- paste(as.character(PlaylistSongs$id[1397:1446]), sep="", collapse=",")
FifthHarmony_vc <- paste(as.character(PlaylistSongs$id[1447:1479]), sep="", collapse=",")
LanaDelRay_vc <- paste(as.character(PlaylistSongs$id[1480:1524]), sep="",collapse=",")
NorahJones_vc <- paste(as.character(PlaylistSongs$id[1525:1562]), sep="", collapse=",")
JamesArthur_vc <- paste(as.character(PlaylistSongs$id[1563:1581]), sep="", collapse=",")
OneRepublic_vc <- paste(as.character(PlaylistSongs$id[1582:1614]), sep="", collapse=",")
TheScript_vc <- paste(as.character(PlaylistSongs$id[1615:1658]), sep="", collapse=",")
StevieWonder_vc <- paste(as.character(PlaylistSongs$id[1659:1708]), sep="", collapse=",")
JasonMraz_vc <- paste(as.character(PlaylistSongs$id[1709:1758]), sep="", collapse=",")
JohnLegend_vc <- paste(as.character(PlaylistSongs$id[1759:1795]), sep="", collapse=",")
Pentatonix_vc <- paste(as.character(PlaylistSongs$id[1796:1834]), sep="", collapse=",")
AliciaKeys_vc <- paste(as.character(PlaylistSongs$id[1835:1884]), sep="", collapse=",")
Usher_vc <- paste(as.character(PlaylistSongs$id[1885:1934]), sep="", collapse=",")
SnoopDogg_vc <- paste(as.character(PlaylistSongs$id[1935:1984]), sep="", collapse=",")
Macklemore_vc <- paste(as.character(PlaylistSongs$id[1985:2007]), sep="",collapse=",")
ZaraLarsson_vc <- paste(as.character(PlaylistSongs$id[2008:2043]), sep="", collapse=",")
JayZ_vc <- paste(as.character(PlaylistSongs$id[2044:2093]), sep="", collapse=",")
Rihanna_vc <- paste(as.character(PlaylistSongs$id[2094:2129]), sep="", collapse=",")
```
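The fifty assignments above can also be generated in one step with *tapply*, grouping the IDs by the *Musician* column instead of hard-coding the row ranges. This is a sketch, not the approach used in the rest of the notebook, and it assumes the *Musician* groups match the manual row ranges above:
```{r}
# Build one comma-separated ID string per musician in a single pass
id_vectors <- tapply(as.character(PlaylistSongs$id),
                     PlaylistSongs$Musician,
                     paste, collapse = ",")
```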
Next, I apply the *get_features* function to each vector, obtaining the audio features for each musician.
```{r}
TaylorSwift <- get_features(TaylorSwift_vc)
ArianaGrande <- get_features(ArianaGrande_vc)
KendrickLamar <- get_features(KendrickLamar_vc)
ShawnMendes <- get_features(ShawnMendes_vc)
Maroon5 <- get_features(Maroon5_vc)
PostMalone <- get_features(PostMalone_vc)
Kygo <- get_features(Kygo_vc)
TheChainsmokers <- get_features(TheChainsmokers_vc)
Adele <- get_features(Adele_vc)
Drake <- get_features(Drake_vc)
JustinBieber <- get_features(JustinBieber_vc)
Coldplay <- get_features(Coldplay_vc)
KanyeWest <- get_features(KanyeWest_vc)
BrunoMars <- get_features(BrunoMars_vc)
EdSheeran <- get_features(EdSheeran_vc)
Eminem <- get_features(Eminem_vc)
Beyonce <- get_features(Beyonce_vc)
Avicii <- get_features(Avicii_vc)
Marshmello <- get_features(Marshmello_vc)
CalvinHarris <- get_features(CalvinHarris_vc)
JustinTimberlake <- get_features(JustinTimberlake_vc)
FrankSinatra <- get_features(FrankSinatra_vc)
CharliePuth <- get_features(CharliePuth_vc)
MichaelBuble <- get_features(MichaelBuble_vc)
MartinGarrix <- get_features(MartinGarrix_vc)
EnriqueIglesias <- get_features(EnriqueIglesias_vc)
JohnMayer <- get_features(JohnMayer_vc)
Future <- get_features(Future_vc)
EltonJohn <- get_features(EltonJohn_vc)
FiftyCent <- get_features(FiftyCent_vc)
Lorde <- get_features(Lorde_vc)
LilWayne <- get_features(LilWayne_vc)
WizKhalifa <- get_features(WizKhalifa_vc)
FifthHarmony <- get_features(FifthHarmony_vc)
LanaDelRay <- get_features(LanaDelRay_vc)
NorahJones <- get_features(NorahJones_vc)
JamesArthur <- get_features(JamesArthur_vc)
OneRepublic <- get_features(OneRepublic_vc)
TheScript <- get_features(TheScript_vc)
StevieWonder <- get_features(StevieWonder_vc)
JasonMraz <- get_features(JasonMraz_vc)
JohnLegend <- get_features(JohnLegend_vc)
Pentatonix <- get_features(Pentatonix_vc)
AliciaKeys <- get_features(AliciaKeys_vc)
Usher <- get_features(Usher_vc)
SnoopDogg <- get_features(SnoopDogg_vc)
Macklemore <- get_features(Macklemore_vc)
ZaraLarsson <- get_features(ZaraLarsson_vc)
JayZ <- get_features(JayZ_vc)
Rihanna <- get_features(Rihanna_vc)
```
After that, I merged each musician's audio-features dataframe into a new one, *all_features*. It contains the audio features for every track in each musician's "This Is" playlist.
```{r}
library(gdata)
all_features <- combine(TaylorSwift,ArianaGrande,KendrickLamar,ShawnMendes,Maroon5,PostMalone,Kygo,TheChainsmokers,Adele,Drake,JustinBieber,Coldplay,KanyeWest,BrunoMars,EdSheeran,Eminem,Beyonce,Avicii,Marshmello,CalvinHarris,JustinTimberlake,FrankSinatra,CharliePuth,MichaelBuble,MartinGarrix,EnriqueIglesias,JohnMayer,Future,EltonJohn,FiftyCent,Lorde,LilWayne,WizKhalifa,FifthHarmony,LanaDelRay,NorahJones,JamesArthur,OneRepublic,TheScript,StevieWonder,JasonMraz,JohnLegend,Pentatonix,AliciaKeys,Usher,SnoopDogg,Macklemore,ZaraLarsson,JayZ,Rihanna)
```
A preview of the *all_features* dataframe can be found below. It only shows 86 rows with the data for Taylor Swift and Ariana Grande. The last column (*Source*) contains the musician.
```{r}
options(knitr.table.format = "html")
options(width = 12)
kable(head(all_features, 86)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12) %>%
scroll_box(width = "1000px", height = "750px")
```
Finally, I computed the mean of each musician's audio features using the *aggregate* function. The resulting dataframe summarises each musician's audio features as the mean over the tracks in their playlist.
```{r}
mean_features <- aggregate(all_features[, c(1:11,17)], list(all_features$source), mean)
names(mean_features) <- c("Musician", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms")
```
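Since *dplyr* is already loaded for the tables, the same per-musician means can be expressed as a pipeline. This is an equivalent sketch, not the original code; it assumes the column layout returned by the audio-features endpoint (where *danceability* through *tempo* are contiguous) and a dplyr version that provides *across()*:
```{r}
# Per-musician means via dplyr (equivalent to the aggregate() call above)
mean_features_dplyr <- all_features %>%
  group_by(Musician = source) %>%
  summarise(across(c(danceability:tempo, duration_ms), mean))
```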
```{r}
options(knitr.table.format = "html")
options(width = 12)
kable(mean_features) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12) %>%
scroll_box(width = "1000px", height = "500px")
```
## Audio Features Description
The description of each feature from the [Spotify Web API Guidance](https://beta.developer.spotify.com/web-api/get-audio-features/) can be found below:
* **Danceability**: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* **Energy**: Is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
* **Key**: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
* **Loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
* **Mode**: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
* **Speechiness**: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
* **Acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
* **Instrumentalness**: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
* **Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
* **Valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
* **Tempo**: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
* **Duration_ms**: The duration of the track in milliseconds.
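As a small illustration of the *key* and *mode* encodings above, the integers can be decoded into readable labels. This is a sketch for illustration only; the pitch-class vector follows the mapping quoted in the *Key* description:
```{r}
# Decode Spotify's integer key/mode encoding into readable labels
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
decode_key <- function(key, mode) {
  paste(pitch_classes[key + 1], ifelse(mode == 1, "major", "minor"))
}
decode_key(0, 1)  # "C major"
decode_key(9, 0)  # "A minor"
```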
# Data Visualization
## Radar Chart
A radar chart is useful for comparing the musical vibes of these musicians in a more visual way. The first visualisation is an R implementation of the radar chart from the [chart.js](http://www.chartjs.org/) javascript library and evaluates the audio features for 10 selected musicians.
To make the chart clearer and more readable, I normalised the values to range from 0 to 1.
```{r}
mean_features_norm <- cbind(mean_features[1],
                            apply(mean_features[-1], 2,
                                  function(x) {(x - min(x)) / diff(range(x))}))
```
Okay, let's plot these interactive radar charts in batches of 10 musicians. Each chart displays dataset labels when you hover over a radial line, showing the value for the selected feature.
**Batch 1: Taylor Swift, Ariana Grande, Kendrick Lamar, Shawn Mendes, Maroon 5, Post Malone, Kygo, The Chainsmokers, Adele, Drake**
```{r}
library(radarchart)
library(tidyr)
sample1 <- mean_features[mean_features$Musician %in% c("TaylorSwift", "ArianaGrande", "KendrickLamar", "ShawnMendes", "Maroon5", "PostMalone", "Kygo", "TheChainsmokers", "Adele", "Drake"),]
mean_features_norm_1 <- cbind(sample1[1],
                              apply(sample1[-1], 2,
                                    function(x) {(x - min(x)) / diff(range(x))}))
radarDF_1 <- gather(mean_features_norm_1, key=Attribute, value=Score, -Musician) %>%
  spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_1,
             scaleStartValue = -1,
             maxScale = 1,
             showToolTipLabel = TRUE)
```
**Batch 2: Justin Bieber, Coldplay, Kanye West, Bruno Mars, Ed Sheeran, Eminem, Beyonce, Avicii, Marshmello, Calvin Harris**
```{r}
sample2 <- mean_features[mean_features$Musician %in% c("JustinBieber", "Coldplay", "KanyeWest", "BrunoMars", "EdSheeran", "Eminem", "Beyonce", "Avicii", "Marshmello", "CalvinHarris"),]
mean_features_norm_2 <- cbind(sample2[1], apply(sample2[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_2 <- gather(mean_features_norm_2, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_2, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
**Batch 3: Justin Timberlake, Frank Sinatra, Charlie Puth, Michael Buble, Martin Garrix, Enrique Iglesias, John Mayer, Future, Elton John, 50 Cent**
```{r}
sample3 <- mean_features[mean_features$Musician %in% c("JustinTimberlake", "FrankSinatra", "CharliePuth", "MichaelBuble", "MartinGarrix", "EnriqueIglesias", "JohnMayer", "Future", "EltonJohn", "FiftyCent"),]
mean_features_norm_3 <- cbind(sample3[1], apply(sample3[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_3 <- gather(mean_features_norm_3, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_3, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
**Batch 4: Lorde, Lil Wayne, Wiz Khalifa, Fifth Harmony, Lana Del Rey, Norah Jones, James Arthur, One Republic, The Script, Stevie Wonder**
```{r}
sample4 <- mean_features[mean_features$Musician %in% c("Lorde", "LilWayne", "WizKhalifa", "FifthHarmony", "LanaDelRay", "NorahJones", "JamesArthur", "OneRepublic", "TheScript", "StevieWonder"),]
mean_features_norm_4 <- cbind(sample4[1], apply(sample4[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_4 <- gather(mean_features_norm_4, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_4, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
**Batch 5: Jason Mraz, John Legend, Pentatonix, Alicia Keys, Usher, Snoop Dogg, Macklemore, Zara Larsson, Jay-Z, Rihanna**
```{r}
sample5 <- mean_features[mean_features$Musician %in% c("JasonMraz", "JohnLegend", "Pentatonix", "AliciaKeys", "Usher", "SnoopDogg", "Macklemore", "ZaraLarsson", "JayZ", "Rihanna"),]
mean_features_norm_5 <- cbind(sample5[1], apply(sample5[-1],2,function(x){(x-min(x)) / diff(range(x))}))
radarDF_5 <- gather(mean_features_norm_5, key=Attribute, value=Score, -Musician) %>%
spread(key=Musician, value=Score)
chartJSRadar(scores = radarDF_5, scaleStartValue = -1, maxScale = 1, showToolTipLabel = TRUE)
```
## Cluster Analysis
Another way to explore the differences in these musicians' repertoires is to group them into clusters. The general idea of a clustering algorithm is to divide a dataset into groups on the basis of similarity in the data. In this case, musicians are grouped into clusters according to the characteristics of their music. Rather than defining groups before looking at the data, clustering lets me find and analyze the groups that form organically.
Prior to clustering, it is important to rescale the numeric variables of the dataset. After that, I kept the musicians as the row names so they can be shown as labels in the plot.
```{r}
scaled.features <- scale(mean_features[-1])
rownames(scaled.features) <- mean_features$Musician
```
I applied **K-Means Clustering**, one of the most popular unsupervised learning techniques for unlabeled data. The algorithm works iteratively to assign each data point to one of **K** groups based on the features provided, so that points within a group are similar to one another. In this instance, I chose *K = 6*: since the artists were selected from six genres (Pop, Hip-Hop, R&B, EDM, Rock, and Jazz), I expect the clusters to form roughly along those lines.
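The choice of K can also be sanity-checked with the elbow method, plotting the total within-cluster sum of squares for a range of K values and looking for the point where the curve flattens. This is a sketch for validation, not part of the original analysis:
```{r}
# Elbow method: total within-cluster sum of squares for K = 1..10
set.seed(5000)
wss <- sapply(1:10, function(k) {
  kmeans(scaled.features, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```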
Having applied the K-Means algorithm, I can plot a two-dimensional view of the data. The x-axis and y-axis correspond to the first and second principal components, and the loadings (represented by red arrows) indicate the directional influence each variable has on those components. Let's have a look at the clusters that result from applying the K-Means algorithm to my dataset.
```{r}
library(ggfortify)
library(ggthemes)
set.seed(5000)
k_means <- kmeans(scaled.features, 6)
kmeans_plot <- autoplot(k_means,
                        main = "K-means Clustering",
                        data = scaled.features,
                        loadings = TRUE,
                        loadings.colour = "#CC0000",
                        loadings.label.colour = "#CC0000",
                        loadings.label = TRUE,
                        loadings.label.size = 2.2,
                        loadings.label.repel = T,
                        label.size = 2.2,
                        label.repel = T) +
  scale_fill_manual(values = c("#000066", "#9999CC", "#66CC99", "#FB7201", "#21CDFF", "#FF219C")) +
  scale_color_manual(values = c("#000066", "#9999CC", "#66CC99", "#FB7201", "#21CDFF", "#FF219C")) +
  theme(plot.title = element_text(size = 18, face = "bold"))
kmeans_plot
```
Let's see which artists belong to which clusters:
```{r}
k_means$cluster
```
I also plotted another radar chart containing the mean features for each cluster. It is useful for comparing the attributes of the songs in each cluster.
```{r}
mean_features_norm_50 <- cbind(mean_features[1], apply(mean_features[-1],2,scale))
```
```{r}
library(radarchart)
library(tidyr)
cluster_centers <- as.data.frame(k_means$centers)
cluster <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")
cluster_centers <- cbind(cluster, cluster_centers)
```
```{r}
radarDF_6 <- gather(cluster_centers, key=Attribute, value=Score, -cluster) %>%
spread(key=cluster, value=Score)
# we change the colours according to clusters
colMatrix = matrix(c(c(4,24,102), c(135,133,193), c(87,196,135), c(251,114,1), c(33,205,255), c(255,33,156)), nrow = 3)
# chart
chartJSRadar(scores = radarDF_6,
scaleStartValue = -4,
maxScale =1.5,
showToolTipLabel = TRUE,
colMatrix = colMatrix)
```
* *Cluster 1* includes 4 artists: Coldplay, Avicii, Marshmello, and Martin Garrix. Their music is largely instrumental and often performed live, usually loud and full of energy with a high tempo. Not too surprising, as 3 of the 4 artists make EDM / House music, and Coldplay is known for its live concerts.
* *Cluster 2* includes 2 artists: Frank Sinatra and Norah Jones (any Jazz fans out there?). Their music scores high on acousticness and the Major mode, but low on all the remaining attributes. Typical Jazz tunes.
* *Cluster 3* includes 10 artists: Post Malone, Kygo, The Chainsmokers, Adele, Lorde, Lana Del Rey, James Arthur, One Republic, John Legend, and Alicia Keys. This cluster scores around average on almost all attributes, which suggests a well-balanced, versatile group of artists, hence the diversity of genres represented here (EDM, Pop, R&B).
* *Cluster 4* includes 15 artists: Ariana Grande, Maroon 5, Drake, Justin Bieber, Bruno Mars, Calvin Harris, Charlie Puth, Enrique Iglesias, Future, Wiz Khalifa, Fifth Harmony, Usher, Macklemore, Zara Larsson, and Rihanna. Their music is danceable, loud, high-tempo, and energetic. This group contains many young mainstream artists in the Pop and Hip-Hop genres.
* *Cluster 5* includes 10 artists: Taylor Swift, Shawn Mendes, Ed Sheeran, Michael Buble, John Mayer, Elton John, The Script, Stevie Wonder, Jason Mraz, and Pentatonix. This is my favorite group! Taylor Swift? Ed Sheeran? John Mayer? Jason Mraz? Elton John? I guess I listen to a lot of singer-songwriters. Their music is mostly in the Major mode, while striking a near-perfect balance (average scores) on all other attributes.
* *Cluster 6* includes 9 artists: Kendrick Lamar, Kanye West, Eminem, Beyonce, Justin Timberlake, 50 Cent, Lil Wayne, Snoop Dogg, and Jay-Z. You can already see the trend: 7 of them are rappers, and even Beyonce and Justin Timberlake regularly collaborate with rappers. Their songs have a high number of spoken words and speech-like sections, are long in duration, and are often performed live. Is there any better description of rap music?
## Analysis by Feature
The following charts show the values for each feature for every musician.
### Danceability
```{r}
library(stringr)
# Converting cluster to vector
clusters <- as.vector(k_means$cluster)
clusters <- str_replace_all(clusters, "1", "Cluster 1")
clusters <- str_replace_all(clusters, "2", "Cluster 2")
clusters <- str_replace_all(clusters, "3", "Cluster 3")
clusters <- str_replace_all(clusters, "4", "Cluster 4")
clusters <- str_replace_all(clusters, "5", "Cluster 5")
clusters <- str_replace_all(clusters, "6", "Cluster 6")
mean_features_norm_50 <- cbind(mean_features_norm_50, cluster = clusters)
```
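The six `str_replace_all` calls above can be collapsed into a single `paste0`, since each label is just the cluster number with a "Cluster " prefix. A minimal sketch, using simulated cluster ids (in the analysis they come from `k_means$cluster`):

```r
# Simulated cluster assignments; in the analysis these come from k_means$cluster
cluster_ids <- c(1, 3, 2, 6, 4, 5)

# Prefix every id with "Cluster " in one step -- no repeated str_replace_all
clusters <- paste0("Cluster ", cluster_ids)
clusters
#> [1] "Cluster 1" "Cluster 3" "Cluster 2" "Cluster 6" "Cluster 4" "Cluster 5"
```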
```{r}
# Danceability feature
library(ggplot2)
danceability_subset <- mean_features_norm_50[,c("Musician","danceability", "cluster")]
danceability_subset <- danceability_subset[order(danceability_subset$danceability, decreasing = TRUE), ]
danceability_plot <- ggplot(danceability_subset,
aes(x = reorder(Musician, danceability),
y = danceability, label = danceability)) +
xlab("Musician") + ylab("Danceability") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Danceability Feature") + coord_flip()
danceability_plot
```
If you want to bust a move and impress your crush, try listening to more Future, Drake, Wiz Khalifa, Snoop Dogg, and Eminem. On the other hand, don't even attempt to dance to Frank Sinatra's or Lana Del Rey's tunes.
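The remaining feature charts reuse the chunk above with only the feature name changed, so a small helper function could generate all of them. A sketch (`feature_plot` and `cluster_colors` are hypothetical names, not part of the original analysis); note that `reorder()` already sorts the bars, so pre-sorting the data frame with `order()` is not required:

```r
library(ggplot2)

# Shared palette for all feature charts
cluster_colors <- c("Cluster 1" = "#000066", "Cluster 2" = "#9999CC",
                    "Cluster 3" = "#66CC99", "Cluster 4" = "#FB7201",
                    "Cluster 5" = "#21CDFF", "Cluster 6" = "#FF219C")

# Horizontal bar chart of one feature, coloured by cluster
feature_plot <- function(data, feature, title = tools::toTitleCase(feature)) {
  ggplot(data, aes(x = reorder(Musician, .data[[feature]]),
                   y = .data[[feature]], fill = cluster)) +
    geom_bar(stat = "identity", width = .5) +
    scale_fill_manual(name = "Cluster", values = cluster_colors) +
    xlab("Musician") + ylab(title) +
    labs(title = paste(title, "Feature")) +
    coord_flip()
}
```

Each section below would then reduce to a single call, e.g. `feature_plot(mean_features_norm_50, "energy")`.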
### Energy
```{r}
# Energy feature
energy_subset <- mean_features_norm_50[,c("Musician","energy", "cluster")]
energy_subset <- energy_subset[order(energy_subset$energy, decreasing = TRUE), ]
energy_plot <- ggplot(energy_subset,
aes(x = reorder(Musician, energy),
y = energy, label = energy)) +
xlab("Musician") + ylab("Energy") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Energy Feature") + coord_flip()
energy_plot
```
You're a fairly energetic person if you listen to lots of Marshmello, Calvin Harris, Enrique Iglesias, Martin Garrix, Eminem, and Jay-Z. The opposite is true if you're a fan of Frank Sinatra and Norah Jones.
### Loudness
```{r}
# Loudness feature
loudness_subset <- mean_features_norm_50[,c("Musician","loudness", "cluster")]
loudness_subset <- loudness_subset[order(loudness_subset$loudness, decreasing = TRUE), ]
loudness_plot <- ggplot(loudness_subset,
aes(x = reorder(Musician, loudness),
y = loudness, label = loudness)) +
xlab("Musician") + ylab("Loudness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Loudness Feature") + coord_flip()
loudness_plot
```
The Loudness ranking is nearly identical to the Energy ranking.
### Speechiness
```{r}
# Speechiness feature
speechiness_subset <- mean_features_norm_50[,c("Musician","speechiness", "cluster")]
speechiness_subset <- speechiness_subset[order(speechiness_subset$speechiness, decreasing = TRUE), ]
speechiness_plot <- ggplot(speechiness_subset,
aes(x = reorder(Musician, speechiness),
y = speechiness, label = speechiness)) +
xlab("Musician") + ylab("Speechiness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Speechiness Feature") + coord_flip()
speechiness_plot
```
All the Rap fans out there: what are your favorite songs from Kendrick Lamar? Or 50 Cent? Or Jay-Z? Hmm, I'm surprised Eminem does not rank higher, as I personally consider him the GOAT of rappers.
### Acousticness
```{r}
# Acousticness feature
acousticness_subset <- mean_features_norm_50[,c("Musician","acousticness", "cluster")]
acousticness_subset <- acousticness_subset[order(acousticness_subset$acousticness, decreasing = TRUE), ]
acousticness_plot <- ggplot(acousticness_subset,
aes(x = reorder(Musician, acousticness),
y = acousticness, label = acousticness)) +
xlab("Musician") + ylab("Acousticness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Acousticness Feature") + coord_flip()
acousticness_plot
```
Acousticness is almost the exact opposite of Loudness and Energy. Mr. Sinatra and Ms. Jones released some powerful acoustic tracks throughout their careers.
### Instrumentalness
```{r}
# Instrumentalness feature
instrumentalness_subset <- mean_features_norm_50[,c("Musician","instrumentalness", "cluster")]
instrumentalness_subset <- instrumentalness_subset[order(instrumentalness_subset$instrumentalness, decreasing = TRUE), ]
instrumentalness_plot <- ggplot(instrumentalness_subset,
aes(x = reorder(Musician, instrumentalness),
y = instrumentalness, label = instrumentalness)) +
xlab("Musician") + ylab("Instrumentalness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Instrumentalness Feature") + coord_flip()
instrumentalness_plot
```
EDM for the win! Martin Garrix, Avicii, and Marshmello produce tracks that contain almost no vocals.
### Liveness
```{r}
# Liveness feature
liveness_subset <- mean_features_norm_50[,c("Musician","liveness", "cluster")]
liveness_subset <- liveness_subset[order(liveness_subset$liveness, decreasing = TRUE), ]
liveness_plot <- ggplot(liveness_subset,
aes(x = reorder(Musician, liveness),
y = liveness, label = liveness)) +
xlab("Musician") + ylab("Liveness") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Liveness Feature") + coord_flip()
liveness_plot
```
So who are the five artists with the most live-sounding recordings? Jason Mraz, Coldplay, Martin Garrix, Kanye West, and Kendrick Lamar, in that order.
### Valence
```{r}
# Valence feature
valence_subset <- mean_features_norm_50[,c("Musician","valence", "cluster")]
valence_subset <- valence_subset[order(valence_subset$valence, decreasing = TRUE), ]
valence_plot <- ggplot(valence_subset,
aes(x = reorder(Musician, valence),
y = valence, label = valence)) +
xlab("Musician") + ylab("Valence") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Valence Feature") + coord_flip()
valence_plot
```
Valence describes the musical positiveness conveyed by a track. Music by Bruno Mars, Stevie Wonder, and Enrique Iglesias is very positive, while music by Lana Del Rey, Coldplay, and Martin Garrix sounds quite negative.
### Tempo
```{r}
# Tempo feature
tempo_subset <- mean_features_norm_50[,c("Musician","tempo", "cluster")]
tempo_subset <- tempo_subset[order(tempo_subset$tempo, decreasing = TRUE), ]
tempo_plot <- ggplot(tempo_subset,
aes(x = reorder(Musician, tempo),
y = tempo, label = tempo)) +
xlab("Musician") + ylab("Tempo") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Tempo Feature") + coord_flip()
tempo_plot
```
Future, Marshmello, and Wiz Khalifa are the kings of speed, producing tracks with the highest tempos in beats per minute. And Snoop Dogg, lol? He tends to take his time to utter his magic words.
### Duration
```{r}
# Duration feature
duration_subset <- mean_features_norm_50[,c("Musician","duration_ms", "cluster")]
duration_subset <- duration_subset[order(duration_subset$duration_ms, decreasing = TRUE), ]
duration_plot <- ggplot(duration_subset,
aes(x = reorder(Musician, duration_ms),
y = duration_ms, label = duration_ms)) +
xlab("Musician") + ylab("Duration") +
geom_bar(stat = 'identity', aes(fill = cluster), width = .5) +
scale_fill_manual(name = "Cluster",
labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6"),
values = c("Cluster 1" = "#000066",
"Cluster 2" = "#9999CC",
"Cluster 3" = "#66CC99",
"Cluster 4" = "#FB7201",
"Cluster 5" = "#21CDFF",
"Cluster 6" = "#FF219C")) +
labs(title = "Duration Feature") + coord_flip()
duration_plot
```
Last but not least, songs by Justin Timberlake, followed by Elton John and Eminem, are, sometimes excruciatingly, long. In contrast, Frank Sinatra, Zara Larsson, and Pentatonix favor short and snappy tracks.