presentation-text.txt
On 24 March 2015, Germanwings Flight 9525 crashed in the French Alps, killing all 150 people aboard.
Global interest in the story was high from the start, but the way it unfolded (with the gradual revelation
of the co-pilot's suicide) held the world's attention for several days. Breaking news on the story caused
spikes in internet traffic over this period.
.... [graph of traffic by timezone from bit.ly]
Global interest cannot be measured from any single website. Instead, data from bit.ly's URL
shortening service serves as a proxy for global activity.
This presentation covers the internet story, as revealed by topic modeling on bit.ly decode data.
First, the data:
- bit.ly provided me with short URL decodes for the first 10 minutes of each hour over a 72-hour period
- 15 000 pages referenced by 9 million URLs were found to be relevant to the Germanwings story
- document content was extracted by screen scraping
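As a rough sketch of how decode records like these might be reduced to per-page hit counts: the record layout below (timestamp, decoded URL, country code) and the example URLs are assumptions for illustration, not bit.ly's actual feed format. The filter keeps only decodes from the first 10 minutes of each hour, mirroring the sampling described above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical decode records: (ISO timestamp, decoded long URL, country code).
# The real bit.ly record format is not described in this presentation.
DECODES = [
    ("2015-03-25T10:02:11", "http://example.com/crash-report", "DE"),
    ("2015-03-25T10:07:45", "http://example.com/crash-report", "US"),
    ("2015-03-25T10:31:00", "http://example.com/crash-report", "US"),  # outside window
    ("2015-03-25T11:04:20", "http://example.org/victims", "DE"),
]


def in_sample_window(timestamp, minutes=10):
    """Return True if the decode falls within the first `minutes` of its hour."""
    return datetime.fromisoformat(timestamp).minute < minutes


def hits_per_page(decodes):
    """Count sampled decodes for each decoded page URL."""
    return Counter(url for ts, url, _country in decodes if in_sample_window(ts))


hits = hits_per_page(DECODES)
for url, count in hits.most_common():
    print(count, url)
```

The resulting counts are what "number of hits" refers to when documents are weighted later on.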
Technical notes
NMF (non-negative matrix factorization) + word clouds
Ordinarily, topic modeling weights words by how often they occur in a document, assuming that all documents are equally important.
In my model, each document was additionally weighted by the number of hits it received, to reflect its relative importance.
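To make the weighting idea concrete, here is a minimal sketch in plain Python. It assumes one particular weighting (multiplying each document's term counts by its raw hit count) and uses the classic multiplicative-update rule for NMF. The toy counts, the hit numbers, and the helper names are all made up for illustration, and the actual model may have weighted documents differently. In practice a library implementation such as scikit-learn's `NMF` would replace the hand-written loop.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Toy term counts (rows = documents, columns = terms) and hypothetical hit counts.
counts = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
]
hits = [1000, 10, 500, 5]

# Assumed weighting: scale each document's row by its number of hits.
weighted = [[c * n for c in row] for row, n in zip(counts, hits)]
w, h = nmf(weighted, k=2)
```

Each row of `h` is a topic (a non-negative weighting over terms, which is what a word cloud would draw), and each row of `w` says how strongly a document expresses each topic. Heavily shared documents dominate the fit because their rows are larger.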
Interestingly, countries' interest in the latent topics was not uniform.
Here is each country's interest in each topic:
.... [slide of topic versus country]
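One way a topic-versus-country view like this could be computed is sketched below. The inputs are hypothetical: `doc_topics` stands in for per-document topic weights (for example, rows of an NMF document-topic matrix) and `doc_country_hits` for hit counts split by country. Neither name nor the numbers come from the original analysis.

```python
from collections import defaultdict

# Hypothetical per-document topic weights and per-country hit counts.
doc_topics = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.2, 0.8],
}
doc_country_hits = {
    "doc_a": {"DE": 80, "US": 20},
    "doc_b": {"DE": 20, "US": 80},
}


def topic_share_by_country(doc_topics, doc_country_hits):
    """Return {country: [share of that country's hits attributed to each topic]}."""
    n_topics = len(next(iter(doc_topics.values())))
    totals = defaultdict(lambda: [0.0] * n_topics)
    for doc, weights in doc_topics.items():
        norm = sum(weights) or 1.0  # make each document's weights sum to 1
        for country, hits in doc_country_hits.get(doc, {}).items():
            for t, weight in enumerate(weights):
                totals[country][t] += hits * weight / norm
    shares = {}
    for country, row in totals.items():
        total = sum(row) or 1.0
        shares[country] = [value / total for value in row]
    return shares


shares = topic_share_by_country(doc_topics, doc_country_hits)
```

Normalizing within each country makes countries with very different traffic volumes comparable on the same scale.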
Here is the top German story, revealing interest in relatives, family, and victims.
Here is the main US topic, which shows more focus on the mental illness of the co-pilot.
In addition, interest in the topics changed over time, as we would expect for a story whose details emerged gradually:
.... [slide of topic versus time]
Topics that drew little interest on the 25th shot up on the 26th as investigators uncovered the truth about the crash. More broadly, focus shifted
between the 25th and the 26th, with some topics gaining importance and others losing it.
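The shift between the two days can be expressed as a change in each topic's share of total interest. The sketch below uses made-up numbers and hypothetical topic labels purely to show the arithmetic. It does not reproduce the values on the slide.

```python
# Hypothetical hit-weighted interest per topic for each day (made-up numbers).
daily_interest = {
    "2015-03-25": {"victims": 600.0, "co-pilot": 100.0, "investigation": 300.0},
    "2015-03-26": {"victims": 300.0, "co-pilot": 500.0, "investigation": 200.0},
}


def to_shares(interest):
    """Convert raw interest per topic into fractions of the day's total."""
    total = sum(interest.values())
    return {topic: value / total for topic, value in interest.items()}


def share_change(before, after):
    """Difference in topic share between two days (after minus before)."""
    b, a = to_shares(before), to_shares(after)
    return {t: a.get(t, 0.0) - b.get(t, 0.0) for t in sorted(set(a) | set(b))}


change = share_change(daily_interest["2015-03-25"], daily_interest["2015-03-26"])
gaining = [topic for topic, delta in change.items() if delta > 0]
losing = [topic for topic, delta in change.items() if delta < 0]
```

Working with shares rather than raw hits separates a change in focus from a change in overall traffic volume.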
-----
Complications
- cleaning HTML artifacts out of the data
- so many different page layouts that pages could not be scraped uniformly
- different languages
- documents whose text does not accurately describe their content (e.g. streaming video)
- pages that retired content over the 6-month period
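For the first complication, a minimal sketch of stripping HTML artifacts with only the Python standard library is shown below. It is an illustration of the general approach, not the scraper actually used. The class name and sample page are invented, and real pages with varied layouts would need considerably more handling (navigation menus, comment sections, and so on).

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside script/style/noscript."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the markup.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def extract_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><h1>Crash in the Alps</h1>"
    "<p>Investigators &amp; families arrive.</p></body></html>"
)
print(extract_text(sample))
```

A generic extractor like this sidesteps per-site layouts at the cost of keeping boilerplate text such as menus and footers, which then has to be filtered out downstream.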