<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>VAST Challenge with datadr and Trelliscope</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">
<link href="assets/bootstrap/css/bootstrap.css" rel="stylesheet">
<link href="assets/custom/custom.css" rel="stylesheet">
<!-- font-awesome -->
<link href="assets/font-awesome/css/font-awesome.min.css" rel="stylesheet">
<!-- prism -->
<link href="assets/prism/prism.css" rel="stylesheet">
<link href="assets/prism/prism.r.css" rel="stylesheet">
<script type='text/javascript' src='assets/prism/prism.js'></script>
<script type='text/javascript' src='assets/prism/prism.r.js'></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
"HTML-CSS": { scale: 100}
});
</script>
<script type="text/javascript" src="assets/MathJax/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="js/html5shiv.js"></script>
<![endif]-->
<link href='http://fonts.googleapis.com/css?family=Lato' rel='stylesheet' type='text/css'>
<!-- <link href='http://fonts.googleapis.com/css?family=Lustria' rel='stylesheet' type='text/css'> -->
<link href='http://fonts.googleapis.com/css?family=Bitter' rel='stylesheet' type='text/css'>
<!-- Fav and touch icons -->
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="ico/apple-touch-icon-144-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="ico/apple-touch-icon-114-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="ico/apple-touch-icon-72-precomposed.png">
<link rel="apple-touch-icon-precomposed" href="ico/apple-touch-icon-57-precomposed.png">
<!-- <link rel="shortcut icon" href="ico/favicon.png"> -->
</head>
<body>
<div class="container-narrow">
<div class="masthead">
<ul class="nav nav-pills pull-right">
<li class=''><a href='http://hafen.github.io/datadr/'>datadr</a></li><li class=''><a href='http://hafen.github.io/trelliscope/'>trelliscope</a></li>
</ul>
<p class="myHeader">VAST Challenge with datadr and Trelliscope</p>
</div>
<hr>
<div class="container-fluid">
<div class="row-fluid">
<div class="col-md-3 well">
<ul class = "nav nav-list" id="toc">
<li class='nav-header unselectable' data-edit-href='00_init.Rmd'>Getting Set Up</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#introduction'>Introduction</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#environment-setup'>Environment setup</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#file-setup'>File Setup</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#session-initialization'>Session Initialization</a>
</li>
<li class='nav-header unselectable' data-edit-href='01_read.Rmd'>Raw Data ETL</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#text-data-to-r-objects'>Text Data to R Objects</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#netflow-data'>NetFlow Data</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#ips-data'>IPS Data</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#big-brother-data'>Big Brother Data</a>
</li>
<li class='nav-header unselectable' data-edit-href='02_explore.Rmd'>NetFlow Exploration</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#sourcedestination-ip-frequency'>Source/Destination IP Frequency</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#busiest-host-ips'>Busiest Host IPs</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#sourcedest-ip-payload'>Source/Dest IP Payload</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#inside-to-inside'>Inside to Inside</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#connection-duration'>Connection Duration</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#top-ports'>Top Ports</a>
</li>
<li class='nav-header unselectable' data-edit-href='03_dnr.Rmd'>NetFlow D&R</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#division-by-inside-host'>Division by Inside Host</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#time-aggregated-recombination'>Time-Aggregated Recombination</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#trelliscope-displays'>Trelliscope Displays</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#closer-investigation'>Closer Investigation</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#more-trelliscope-displays'>More Trelliscope Displays</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#division-by-external-host'>Division by External Host</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#division-by-time'>Division by Time</a>
</li>
<li class='nav-header unselectable' data-edit-href='04_bb.Rmd'>Network Health Data</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#bb-exploration'>BB Exploration</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#bb-by-host-division'>BB By Host Division</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#joining-with-netflow'>Joining with NetFlow</a>
</li>
<li class='nav-header unselectable' data-edit-href='05_ips.Rmd'>IPS Data</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#ips-exploration'>IPS Exploration</a>
</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#ips-by-host-division'>IPS By Host Division</a>
</li>
<li class='nav-header unselectable' data-edit-href='A_code.Rmd'>Appendix</li>
<li class='active'>
<a target='_self' class='nav-not-header' href='#r-code'>R Code</a>
</li>
</ul>
</div>
<div class="col-md-9 tab-content" id="main-content">
<div class='tab-pane active' id='introduction'>
<h3>Introduction</h3>
<p>The goal of this tutorial is to provide useful examples of how to use <a href="https://github.com/hafen/datadr" title="datadr github page">datadr</a> and <a href="https://github.com/hafen/trelliscope" title="trelliscope github page">Trelliscope</a> as a supplement to the introductory tutorials provided <a href="http://hafen.github.io/datadr" title="datadr tutorial">here</a> and <a href="http://hafen.github.io/trelliscope" title="trelliscope tutorial">here</a>, which focus more on illustrating functionality than doing something useful with data. It is based around the <a href="http://vacommunity.org/VAST+Challenge+2013%3A+Mini-Challenge+3">2013 VAST Mini-Challenge 3 dataset</a>.</p>
<div class="callout callout-danger"><strong>Note: </strong>This tutorial is an evolving document. Some sections may be less filled out than others. Expect changes and updates. Also note that serious analysis of data requires a great deal of investigation and currently this document only provides examples that will get you started down the path. Please send any comments or report issues to <a href="mailto:[email protected]">[email protected]</a>.</div>
<h4>Data sources</h4>
<p>The data available for download on the <a href="http://vacommunity.org/VAST+Challenge+2013%3A+Mini-Challenge+3">VAST challenge</a> page provides files that contain Network Flow (netflow), Network Health, and Intrusion Protection System data. Documentation that describes these data, as well as a diagram of the network, is available here:</p>
<ul>
<li><a href="docs/data/NetFlow_NetworkHealth.pdf">Netflow and network health</a></li>
<li><a href="docs/data/IPS.pdf">Intrusion protection system</a></li>
<li><a href="docs/data/NetworkArhictecture.pdf">Network diagram</a></li>
</ul>
<p><a href="http://en.wikipedia.org/wiki/NetFlow">Netflow</a> data provides summaries of connections between computers on a network. For example, if you visit a web page, you initiate a connection between your computer and a web server. The connection is identified by the IP address of your computer and the network port from which it originated, as well as the IP address and network port of the machine it is connecting to. In the course of a connection, packets containing data are sent back and forth. A netflow record provides a summary of the connection, including the source and destination information we just discussed, as well as the total number of packets sent/received, total bytes sent/received, <a href="http://en.wikipedia.org/wiki/Internet_protocol_suite">internet protocol</a> used (the two most common are <a href="http://en.wikipedia.org/wiki/Transmission_Control_Protocol">TCP</a> and <a href="http://en.wikipedia.org/wiki/User_Datagram_Protocol">UDP</a>), etc.</p>
<p>The other types of data are a bit more self-explanatory. The IPS data is simply a log of suspicious network activity. The network health data is a record of statistics of machines polled at some time interval to provide information such as the amount of memory or CPU usage.</p>
<p>We will get more familiar with the data as we begin to explore it, and we will endeavor to provide descriptions for aspects of the data that may be difficult to understand for someone who has not worked with this type of data before.</p>
<h4>Analysis goals</h4>
<p>According to the VAST Challenge website:</p>
<blockquote>
<p>Your job is to understand events taking place on your networks over a two week period. To support your mission, your choice of visual analytics should support near real-time situation awareness. In other words, as network manager, your goal for your department is to notice network events as quickly as possible.</p>
</blockquote>
<p>We are asked to provide a timeline of notable events and to speculate on narratives that describe the events on the network.</p>
<p>Keeping those goals in mind, we will address a more general goal of simply trying to get an understanding of the data through exploratory analysis, making heavy use of visualization throughout, and highlighting the use of datadr and Trelliscope. </p>
<!-- After getting an understanding of the data, we will attempt to try to statistically model some of the behaviors that we see and look for behavior that is atypical according to these models. -->
<div class="callout callout-danger"><strong>Note: </strong>Keep in mind that this data is synthetically generated, so there are some limitations to treating this like a "real" analysis. Synthetic generation is something we must accept, because otherwise it would be very difficult to provide publicly available sources for these modalities of network sensor data. Another limitation is that in a real analysis scenario, we would ideally have domain experts very familiar with the network helping us understand the things we are seeing in the data and guiding the evolution of the analytical process.</div>
<!-- #### Analysis paradigm -->
<h4>"Prerequisites"</h4>
<p>It is assumed that the reader is familiar with the R programming language. If not, there are several references, including:</p>
<ul>
<li><a href="http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf">R for Beginners</a>.</li>
</ul>
<p>Some familiarity with datadr and Trelliscope is also a plus. It is recommended to spend some time visiting these tutorials:</p>
<ul>
<li><a href="http://hafen.github.io/datadr" title="datadr tutorial">datadr</a></li>
<li><a href="http://hafen.github.io/trelliscope" title="trelliscope tutorial">Trelliscope</a></li>
</ul>
<p>Everything in this demonstration is done from the R console. Since the data is not very large, we will mainly use R's multicore capabilities for parallel processing and local disk for storage, although a more scalable backend such as Hadoop could be used simply by replacing calls to <code>localDiskConn()</code> with <code>hdfsConn()</code>. Using multicore mode lowers the barrier to entry, since building and configuring a Hadoop cluster is not a casual endeavor.</p>
<div class="callout callout-danger"><strong>Note: </strong>This data is not that large - about 6 GB uncompressed. There are other tools in R that can handle data of this size, and some systems could handle it in memory. But imagine that there are many more hosts, a much longer time period, etc. Computer network sensor data is typically much larger than this - terabytes and beyond - and these tools scale to tackle such problems. Also, regardless of size, the analysis paradigm these tools provide is useful for data of any size.</div>
</div>
<div class='tab-pane' id='environment-setup'>
<h3>Environment setup</h3>
<p>To follow along in this tutorial, you simply need to have <a href="http://cran.r-project.org">R</a> installed along with the <code>datadr</code> and <code>trelliscope</code> packages. To get these packages, we can install them from github using the <code>devtools</code> package by entering the following commands at the R command prompt:</p>
<pre><code class="r">install.packages("devtools")
library(devtools)
install_github("datadr", "hafen")
install_github("trelliscope", "hafen")
</code></pre>
<!-- such as converting IP addresses to [CIDRs](http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing) -->
<p>Additionally, we have packaged together some helper functions and data sets particular to this data, which can be installed with:</p>
<pre><code class="r">install_github("vastChallenge", "hafen", subdir = "package")
</code></pre>
<p>The following section will cover how to set up the raw data download to get going. You can replicate every step of this tutorial on your own, and are encouraged to do so and to be creative and explore your own analyses. For convenience, all of the code in the tutorial is provided as <code>.R</code> source files <a href="#r-code">here</a>.</p>
</div>
<div class='tab-pane' id='file-setup'>
<h3>File Setup</h3>
<p>We will organize all of the data and analysis code into a project directory. For us, this directory is located at <code>~/Documents/Code/vastChallenge</code>. Choose an appropriate directory for your project and then set that as the working directory in R:</p>
<pre><code class="r">setwd("~/Documents/Code/vastChallenge")
</code></pre>
<p>Inside this directory we will create a directory for our raw data.</p>
<pre><code class="r"># create directory for raw text data
dir.create("data/raw", recursive = TRUE)
</code></pre>
<p>Now we need the raw data to put in it. The raw data can be obtained by following the download link on <a href="http://vacommunity.org/VAST+Challenge+2013%3A+Mini-Challenge+3">this page</a>. Here we are only looking at the "Week 2" data.</p>
<p>Unzip the files and move the csv files to the directory <code>data/raw</code>.</p>
<p>Aside from the larger csv files, there are other files, including pdf files of data descriptions and a small text file describing the hosts, <code>BigMktNetwork.txt</code>. We have already parsed this file and its contents are available as a data set called <code>hostList</code> in the <code>cyberTools</code> R package installed previously.</p>
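<p>As a quick sanity check, we can peek at this data set (a sketch, assuming the <code>cyberTools</code> package installed earlier loads cleanly):</p>
<pre><code class="r"># load the helper package and look at the first few parsed host records
library(cyberTools)
head(hostList)
</code></pre>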
<p>At this point, we should have the following files in our project directory:</p>
<pre><code>data/raw/bb-week2.csv
data/raw/IPS-syslog-week2.csv
data/raw/nf-week2.csv
</code></pre>
</div>
<div class='tab-pane' id='session-initialization'>
<h3>Session Initialization</h3>
<p>To initialize an R session for this or any subsequent analyses of this data, we simply launch R and load the required R packages, set the working directory, and initialize a local "cluster":</p>
<pre><code class="r"># use this code to initialize a new R session
library(datadr)
library(trelliscope)
library(cyberTools)
setwd("~/Documents/Code/vastChallenge")
# make a local "cluster" of 8 cores
# makeCluster() comes from the parallel package; load it explicitly if needed
library(parallel)
cl <- makeCluster(8)
clc <- localDiskControl(cluster = cl)
</code></pre>
</div>
<div class='tab-pane' id='text-data-to-r-objects'>
<h3>Text Data to R Objects</h3>
<p>One of the more tedious parts of data analysis can be getting the data into the proper format for analysis. <code>datadr</code> aspires to make this process as painless as possible, but there will always be special situations that require unique solutions.</p>
<p>For analysis in <code>datadr</code>, we want to take the raw data and store it as native R objects. This provides a great degree of flexibility in what type of data structures we can use, such as non-tabular data or special classes of R objects like time series or spatial objects.</p>
<p>Here, all of our input data is text. Text files are used quite often for storing and sharing big data. For example, <a href="https://hive.apache.org">Hive</a> tables are often stored as text files. <code>datadr</code> provides some helpful functions that make it easy to read in text data and store it as R objects.</p>
<p>In this section we will go through how to read each of the data sources in from text. In each case, we read the data in chunks. These examples read the data into <code>datadr</code>'s "local disk" storage mode using the helper function <code>drRead.csv()</code>. This method also works for reading in text data on HDFS.</p>
</div>
<div class='tab-pane' id='netflow-data'>
<h3>NetFlow Data</h3>
<p>The NetFlow data is located here: <code>data/raw/nf-week2.csv</code>. To get a feel for what it looks like, we'll read in the first few rows using R's built-in function <code>read.csv()</code>.</p>
<div class="callout callout-danger"><strong>Note: </strong>A common paradigm when using datadr is to test code on a subset of the data prior to applying it to the entire data set. We will see this frequently throughout this document.</div>
<h4>Looking at a subset</h4>
<p>To read in and look at the first 10 rows:</p>
<pre><code class="r"># read in 10 rows of netflow data
nfHead <- read.csv("data/raw/nf-week2.csv", nrows = 10, stringsAsFactors = FALSE)
</code></pre>
<p>Here's what the first few rows and some of the columns of this data look like:</p>
<pre><code class="r">nfHead[1:10,3:7]
</code></pre>
<pre><code> dateTimeStr ipLayerProtocol ipLayerProtocolCode firstSeenSrcIp firstSeenDestIp
1 2.013e+13 17 UDP 172.20.2.19 239.255.255.250
2 2.013e+13 17 UDP 172.20.2.18 239.255.255.250
3 2.013e+13 17 UDP 172.20.2.17 239.255.255.250
4 2.013e+13 17 UDP 172.20.2.16 239.255.255.250
5 2.013e+13 17 UDP 172.20.2.14 239.255.255.250
6 2.013e+13 17 UDP 172.20.2.13 239.255.255.250
7 2.013e+13 17 UDP 172.20.2.12 239.255.255.250
8 2.013e+13 17 UDP 172.20.2.11 239.255.255.250
9 2.013e+13 17 UDP 172.20.2.10 239.255.255.250
10 2.013e+13 17 UDP 172.20.2.35 239.255.255.250
</code></pre>
<p>Let's look at the structure of the object to see all the columns and their data types:</p>
<pre><code class="r"># look at structure of the data
str(nfHead)
</code></pre>
<pre><code>'data.frame': 10 obs. of 19 variables:
$ TimeSeconds : num 1.37e+09 1.37e+09 1.37e+09 1.37e+09 1.37e+09 ...
$ parsedDate : chr "2013-04-10 08:32:36" "2013-04-10 08:32:36" "2013-04-10 08:32:36" "2013-04-10 08:32:36" ...
$ dateTimeStr : num 2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
$ ipLayerProtocol : int 17 17 17 17 17 17 17 17 17 17
$ ipLayerProtocolCode : chr "UDP" "UDP" "UDP" "UDP" ...
$ firstSeenSrcIp : chr "172.20.2.19" "172.20.2.18" "172.20.2.17" "172.20.2.16" ...
$ firstSeenDestIp : chr "239.255.255.250" "239.255.255.250" "239.255.255.250" "239.255.255.250" ...
$ firstSeenSrcPort : int 29987 29986 29985 29984 29983 29982 29981 29980 29979 29978
$ firstSeenDestPort : int 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900
$ moreFragments : int 0 0 0 0 0 0 0 0 0 0
$ contFragments : int 0 0 0 0 0 0 0 0 0 0
$ durationSeconds : int 0 0 0 0 0 0 0 0 0 0
$ firstSeenSrcPayloadBytes : int 133 133 133 133 133 133 133 133 133 133
$ firstSeenDestPayloadBytes: int 0 0 0 0 0 0 0 0 0 0
$ firstSeenSrcTotalBytes : int 175 175 175 175 175 175 175 175 175 175
$ firstSeenDestTotalBytes : int 0 0 0 0 0 0 0 0 0 0
$ firstSeenSrcPacketCount : int 1 1 1 1 1 1 1 1 1 1
$ firstSeenDestPacketCount : int 0 0 0 0 0 0 0 0 0 0
$ recordForceOut : int 0 0 0 0 0 0 0 0 0 0
</code></pre>
<p>This looks like it is almost in a suitable form for analysis. However, there are two columns that correspond to time, and neither is in a handy R-native format. Instead of having a column for <code>TimeSeconds</code> and <code>parsedDate</code>, let's create a new column <code>date</code> that is an R <code>POSIXct</code> object.</p>
<pre><code class="r"># make new date variable
nfHead$date <- as.POSIXct(nfHead$TimeSeconds, origin = "1970-01-01", tz = "UTC")
# remove old time variables
nfHead <- nfHead[,setdiff(names(nfHead), c("TimeSeconds", "parsedDate"))]
</code></pre>
<p>Let's now make this operation a function, so that when we read in new rows of the data, we can just pass it through the function to obtain the preferred format:</p>
<pre><code class="r">nfTransform <- function(x) {
x$date <- as.POSIXct(x$TimeSeconds, origin = "1970-01-01", tz = "UTC")
x[,setdiff(names(x), c("TimeSeconds", "parsedDate"))]
}
</code></pre>
<p>We will use this function later.</p>
<p>Now that we have figured out what we want to do with the data, we can read the whole thing in. But first we need to talk a little bit about disk connections in <code>datadr</code>.</p>
<h4>Local disk connections</h4>
<p>We will store the data we read in through a <code>datadr</code> <em>local disk connection</em>. A local disk connection is defined by the path where we would like the data to be stored. This should be an empty directory; it can also be a directory that does not yet exist.</p>
<p>Here, we would like to store our parsed netflow data in <code>data/nfRaw</code>. We initialize this connection with a call to <code>localDiskConn()</code>:</p>
<pre><code class="r"># initiate a new connection where parsed netflow data will be stored
nfConn <- localDiskConn("data/nfRaw")
</code></pre>
<p>This will prompt for whether you want the directory to be created if it does not exist. <code>nfConn</code> is now simply an R object that points to this location on disk:</p>
<pre><code class="r"># look at the connection
nfConn
</code></pre>
<pre><code>localDiskConn connection
loc=/Users/hafe647/Documents/Code/vastChallenge/data/nfRaw; nBins=0
</code></pre>
<p>We can either add data to this connection using <code>addData()</code> or we can pass it as the <code>output</code> argument to our csv reader, as we will do in the following section.</p>
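<p>For illustration, manually adding data with <code>addData()</code> might look like the following (a sketch, assuming the <code>nfHead</code> data frame from above is still in the workspace; the hypothetical key <code>"firstRows"</code> simply names the subset):</p>
<pre><code class="r"># hypothetical sketch: add one key-value pair to the connection by hand;
# in this tutorial we instead let drRead.csv() populate the connection
addData(nfConn, list(list("firstRows", nfHead)))
</code></pre>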
<h4>Reading it all in</h4>
<p>It turns out that there is a handy function in <code>datadr</code> called <code>drRead.csv</code>, the analog of <code>read.csv</code>, which reads the data in blocks. It has the same calling interface as R's <code>read.csv</code>, with additional arguments to specify where to store the output, how many rows to put in each block, and an optional transformation function to apply to each block prior to storing it.</p>
<p>We will read in the netflow csv file using the default number of rows per block (<code>50000</code>), apply our <code>nfTransform</code> function that adds the <code>date</code> variable, and save the output to our <code>nfConn</code> local disk connection:</p>
<pre><code class="r"># read in netflow data
nfRaw <- drRead.csv("data/raw/nf-week2.csv", output = nfConn, postTransFn = nfTransform)
</code></pre>
<p>Be prepared - the ETL operations using local disk are the most time-consuming tasks in this tutorial. On my machine, the above command takes about 10 minutes to execute. We will see that subsequent operations applied to the divided, parsed data are much faster.</p>
<div class="callout callout-danger"><strong>Note: </strong>The drRead.csv function for local disk reads the data in sequentially. However, drRead.csv operates in parallel when using the Hadoop backend. There are a couple of reasons for sequential operation in local disk mode. One is that simultaneous reads from the same single disk will probably not be faster, and could actually have worse performance (this is one of the most compelling reasons to use a distributed file system comprised of many disks such as what Hadoop provides). Another related reason is the difficulty of having multiple processes scanning to different locations in a single file.</div>
<h4>Distributed data objects</h4>
<p>Let's take a look at <code>nfRaw</code> to see what the object looks like:</p>
<pre><code class="r">nfRaw
</code></pre>
<pre><code>
Distributed data object of class 'kvLocalDisk' with attributes:
'ddo' attribute | value
----------------+--------------------------------------------------------------------------
keys | [empty] call updateAttributes(dat) to get this value
totStorageSize | 171.98 MB
totObjectSize | [empty] call updateAttributes(dat) to get this value
nDiv | 466
splitSizeDistn | [empty] call updateAttributes(dat) to get this value
example | use kvExample(dat) to get an example subset
bsvInfo | [empty] no BSVs have been specified
'ddf' attribute | value
----------------+--------------------------------------------------------------------------
vars | dateTimeStr(num), ipLayerProtocol(int), and 16 more
transFn | identity (original data is a data frame)
nRow | [empty] call updateAttributes(dat) to get this value
splitRowDistn | [empty] call updateAttributes(dat) to get this value
summary | [empty] call updateAttributes(dat) to get this value
localDiskConn connection
loc=/Users/hafe647/Documents/Code/vastChallenge/data/nfRaw; nBins=0
</code></pre>
<p><code>nfRaw</code> is a <em>distributed data frame</em> (ddf), and we see several aspects about the data printed. For example, we see that there are 466 subsets and that the size of the parsed data in native R format is much smaller (<code>totStorageSize</code> = 171.98 MB) than the input text data. The other attributes will be updated in a moment.</p>
<p>The <code>nfRaw</code> object itself is simply a special R object that contains metadata and pointers to the actual data stored on disk. For more background on ddf and related objects, see <a href="http://hafen.github.io/datadr/index.html#distributed-data-objects">here</a> and <a href="http://hafen.github.io/datadr/index.html#distributed-data-frames">here</a>, and particularly for ddf objects on local disk, see <a href="http://hafen.github.io/datadr/index.html#medium-disk--multicore">here</a>.</p>
<p>In any subsequent R session, we can "reload" this data object with the following:</p>
<pre><code class="r">nfRaw <- ddf(localDiskConn("data/nfRaw"))
</code></pre>
<p>Earlier we saw in the printout of <code>nfRaw</code> that it has many attributes that have not yet been determined. We can fix this by calling <code>updateAttributes()</code>:</p>
<pre><code class="r">nfRaw <- updateAttributes(nfRaw, control = clc)
</code></pre>
<p>Here, through the <code>control</code> parameter, we specified that our local "cluster" we initialized at the beginning of our session should be used for the computation. The update job takes about 30 seconds on my machine with 8 cores.</p>
<div class="callout callout-danger"><strong>Note: </strong>This and almost all other <code>datadr</code> methods can operate in a parallel fashion, with the configuration parameters for the parallel environment specified through a <code>control</code> argument.</div>
<p>Now we can see more meaningful information about our data:</p>
<pre><code class="r">nfRaw
</code></pre>
<pre><code>
Distributed data object of class 'kvLocalDisk' with attributes:
'ddo' attribute | value
----------------+--------------------------------------------------------------------------
keys | keys are available through getKeys(dat)
totStorageSize | 171.98 MB
totObjectSize | 2 GB
nDiv | 466
splitSizeDistn | use splitSizeDistn(dat) to get distribution
example | use kvExample(dat) to get an example subset
bsvInfo | [empty] no BSVs have been specified
'ddf' attribute | value
----------------+--------------------------------------------------------------------------
vars | dateTimeStr(num), ipLayerProtocol(int), and 16 more
transFn | identity (original data is a data frame)
nRow | 23258685
splitRowDistn | use splitRowDistn(dat) to get distribution
summary | use summary(dat) to see summaries
localDiskConn connection
loc=/Users/hafe647/Documents/Code/vastChallenge/data/nfRaw; nBins=0
</code></pre>
<p>We now see that there are about 23 million rows of data, and we are supplied, among other things, with summary statistics for the variables in the ddf, which we will explore in the next section.</p>
<h4>Reading the data in to HDFS</h4>
<p>Before moving on, it is worth noting how this data would be read in using Hadoop/HDFS as the backend. The steps are identical, except that we must first put the data on HDFS and then create an HDFS connection instead of a local disk connection.</p>
<p>To copy the data to HDFS:</p>
<pre><code class="r">library(Rhipe)
rhinit()
# create directory on HDFS for csv file
rhmkdir("/tmp/vast/raw")
# copy netflow csv from local disk to /tmp/vast/raw on HDFS
rhput("data/raw/nf-week2.csv", "/tmp/vast/raw")
</code></pre>
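<p>Before reading the file in, we can confirm the copy succeeded by listing the HDFS directory with Rhipe's <code>rhls()</code> (a quick sanity check, assuming the same session as above):</p>
<pre><code class="r"># list the contents of /tmp/vast/raw on HDFS to verify the csv landed there
rhls("/tmp/vast/raw")
</code></pre>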
<p>Now to read the data in as a distributed data frame:</p>
<pre><code class="r">nfRaw <- drRead.csv(hdfsConn("/tmp/vast/raw/nf-week2.csv", type = "text"),
output = hdfsConn("/tmp/vast/nfRaw"),
postTransFn = nfTransform)
</code></pre>
</div>
<div class='tab-pane' id='ips-data'>
<h3>IPS Data</h3>
<p>We follow a similar approach for the IPS data.</p>
<pre><code class="r"># take a look at the data
ipsHead <- read.csv("data/raw/IPS-syslog-week2.csv", nrows = 10, stringsAsFactors = FALSE)
str(ipsHead)
</code></pre>
<pre><code>'data.frame': 10 obs. of 13 variables:
$ dateTime : chr "10/Apr/2013 07:02:35" "10/Apr/2013 07:02:35" "10/Apr/2013 07:02:35" "10/Apr/2013 07:02:35" ...
$ priority : chr "Info" "Info" "Info" "Info" ...
$ operation : chr "Built" "Teardown" "Teardown" "Built" ...
$ messageCode: chr "ASA-6-302013" "ASA-6-302014" "ASA-6-302014" "ASA-6-302013" ...
$ protocol : chr "TCP" "TCP" "TCP" "TCP" ...
$ SrcIp : chr "172.10.2.35" "172.30.1.104" "172.10.1.246" "172.10.1.138" ...
$ destIp : chr "10.1.0.75" "10.0.0.14" "10.1.0.77" "10.1.0.100" ...
$ srcPort : int 2507 2651 2504 1893 2506 2260 2673 2509 2261 2507
$ destPort : int 80 80 80 80 80 80 80 80 80 80
$ destService: chr "http" "http" "http" "http" ...
$ direction : chr "outbound" "outbound" "outbound" "outbound" ...
$ flags : chr "(empty)" "TCP FINs" "TCP FINs" "(empty)" ...
$ command : chr "(empty)" "(empty)" "(empty)" "(empty)" ...
</code></pre>
<p>Here, we have a different date/time format to deal with. The standard tool for parsing it would be <code>strptime</code>, but the <code>lubridate</code> package has a much faster implementation called <code>fast_strptime</code>. To use it, we first replace <code>"Apr"</code> with <code>"04"</code> in the date/time string, and then call <code>fast_strptime</code> to convert the variable.</p>
<pre><code class="r">library(lubridate)
ipsHead$dateTime <- gsub("Apr", "04", ipsHead$dateTime)
ipsHead$dateTime <- fast_strptime(ipsHead$dateTime, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
</code></pre>
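<p>As a rough illustration of why <code>fast_strptime</code> is worth the extra <code>gsub()</code> step, we can time both parsers on a synthetic vector (a sketch; timings will vary by machine):</p>
<pre><code class="r">library(lubridate)
# 100,000 copies of a date/time string in the IPS format (month already numeric)
x <- rep("10/04/2013 07:02:35", 100000)
# base R parser
system.time(strptime(x, format = "%d/%m/%Y %H:%M:%S", tz = "UTC"))
# lubridate's faster implementation
system.time(fast_strptime(x, format = "%d/%m/%Y %H:%M:%S", tz = "UTC"))
</code></pre>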
<p>Now we can build this into the transformation function with the additional step of renaming a couple of the columns of data:</p>
<pre><code class="r"># transformation to handle time variable
ipsTransform <- function(x) {
require(lubridate)
x$dateTime <- gsub("Apr", "04", x$dateTime)
x$dateTime <- fast_strptime(x$dateTime, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
names(x)[c(1, 6)] <- c("time", "srcIp")
x
}
# read the data in
ipsRaw <- drRead.csv("data/raw/IPS-syslog-week2.csv",
output = localDiskConn("data/ipsRaw"),
postTransFn = ipsTransform)
</code></pre>
<p>As with the NetFlow data, we can call <code>updateAttributes()</code>:</p>
<pre><code class="r">ipsRaw <- updateAttributes(ipsRaw, control = clc)
</code></pre>
<pre><code class="r">ipsRaw
</code></pre>
<pre><code>
Distributed data object of class 'kvLocalDisk' with attributes:
'ddo' attribute | value
----------------+--------------------------------------------------------------------------
keys | keys are available through getKeys(dat)
totStorageSize | 101.02 MB
totObjectSize | 1.69 GB
nDiv | 333
splitSizeDistn | use splitSizeDistn(dat) to get distribution
example | use kvExample(dat) to get an example subset
bsvInfo | [empty] no BSVs have been specified
'ddf' attribute | value
----------------+--------------------------------------------------------------------------
vars | time(POS), priority(POS), operation(cha), messageCode(cha), and 9 more
transFn | identity (original data is a data frame)
nRow | 16600931
splitRowDistn | use splitRowDistn(dat) to get distribution
summary | use summary(dat) to see summaries
localDiskConn connection
loc=/Users/hafe647/Documents/Code/vastChallenge/data/ipsRaw; nBins=0
</code></pre>
</div>
<div class='tab-pane' id='big-brother-data'>
<h3>Big Brother Data</h3>
<p>The "big brother" data is handled similarly:</p>
<pre><code class="r"># look at first few rows
bbHead <- read.csv("data/raw/bb-week2.csv", nrows = 10, stringsAsFactors = FALSE)
str(bbHead)
</code></pre>
<pre><code>'data.frame': 10 obs. of 14 variables:
$ id : int 29903 29911 29913 29920 29932 29933 29951 29956 29967 29975
$ hostname : chr "web02b.bigmkt2.com" "web03d.bigmkt3.com" "web01d.bigmkt1.com" "mail02.bigmkt2.com" ...
$ servicename : chr "cpu" "cpu" "cpu" "cpu" ...
$ currenttime : int 1365605774 1365605790 1365605791 1365605795 1365605801 1365605801 1365605827 1365605832 1365605867 1365605885
$ statusVal : int 1 1 1 2 2 1 1 1 1 1
$ bbcontent : chr " Wed Apr 10 07:56:14 PDT 2013 [WEB02B.BIGMKT2.COM] up: 18 days, 1 users, 38 procs, load=0%, PhysicalMem: 4GB(14%)\n\n\n\nMemory"| __truncated__ " Wed Apr 10 07:56:29 PDT 2013 [WEB03D.BIGMKT3.COM] up: 18 days, 1 users, 38 procs, load=0%, PhysicalMem: 4GB(14%)\n\n\n\nMemory"| __truncated__ " Wed Apr 10 07:56:31 PDT 2013 [WEB01D.BIGMKT1.COM] up: 18 days, 1 users, 39 procs, load=0%, PhysicalMem: 4GB(14%)\n\n\n\nMemory"| __truncated__ " Wed Apr 10 07:56:35 PDT 2013 [MAIL02.BIGMKT2.COM] up: 0:46, 1 users, 58 procs, load=2%, PhysicalMem: 4GB(25%)\n\n&yellow Machi"| __truncated__ ...
$ receivedfrom : chr "172.20.0.6" "172.30.0.8" "172.10.0.8" "172.20.0.3" ...
$ diskUsagePercent : logi NA NA NA NA NA NA ...
$ pageFileUsagePercent : logi NA NA NA NA NA NA ...
$ numProcs : int 38 38 39 58 61 38 39 24 44 43
$ loadAveragePercent : int 0 0 0 2 1 0 0 0 0 1
$ physicalMemoryUsagePercent: int 14 14 14 25 27 14 14 11 16 17
$ connMade : logi NA NA NA NA NA NA ...
$ parsedDate : chr "2013-04-10 07:56:14" "2013-04-10 07:56:30" "2013-04-10 07:56:31" "2013-04-10 07:56:35" ...
</code></pre>
<p>One column in this data, <code>bbcontent</code>, is very large. As before, we need to parse the time variable and remove some columns:</p>
<pre><code class="r"># transformation to handle time parsing
bbTransform <- function(x) {
x$time <- as.POSIXct(x$parsedDate, tz = "UTC")
x[,setdiff(names(x), c("currenttime", "parsedDate"))]
}
bbRaw <- drRead.csv("data/raw/bb-week2.csv",
output = localDiskConn("data/bbRaw"),
postTransFn = bbTransform,
autoColClasses = FALSE)
bbRaw <- updateAttributes(bbRaw, control = clc)
</code></pre>
<pre><code class="r">bbRaw
</code></pre>
<pre><code>
Distributed data object of class 'kvLocalDisk' with attributes:
'ddo' attribute | value
----------------+--------------------------------------------------------------------------
keys | keys are available through getKeys(dat)
totStorageSize | 55.18 MB
totObjectSize | 1.07 GB
nDiv | 44
splitSizeDistn | use splitSizeDistn(dat) to get distribution
example | use kvExample(dat) to get an example subset
bsvInfo | [empty] no BSVs have been specified
'ddf' attribute | value
----------------+--------------------------------------------------------------------------
vars | id(int), hostname(fac), servicename(fac), statusVal(int), and 9 more
transFn | identity (original data is a data frame)
nRow | 2165507
splitRowDistn | use splitRowDistn(dat) to get distribution
summary | use summary(dat) to see summaries
localDiskConn connection
loc=/Users/hafe647/Documents/Code/vastChallenge/data/bbRaw; nBins=0
</code></pre>
</div>
<div class='tab-pane' id='sourcedestination-ip-frequency'>
<h3>Source/Destination IP Frequency</h3>
<p>We'll start exploring the data by looking at some summaries of the NetFlow data through our <code>nfRaw</code> data object. As we saw before, simply printing the object gives us some high-level information about the data:</p>
<pre><code class="r"># load our data back if we are in a new session
nfRaw <- ddf(localDiskConn("data/nfRaw"))
nfRaw
</code></pre>
<pre><code>
Distributed data object of class 'kvLocalDisk' with attributes:
'ddo' attribute | value
----------------+--------------------------------------------------------------------------
keys | keys are available through getKeys(dat)
totStorageSize | 171.98 MB
totObjectSize | 2 GB
nDiv | 466
splitSizeDistn | use splitSizeDistn(dat) to get distribution
example | use kvExample(dat) to get an example subset
bsvInfo | [empty] no BSVs have been specified
'ddf' attribute | value
----------------+--------------------------------------------------------------------------
vars | dateTimeStr(num), ipLayerProtocol(int), and 16 more
transFn | identity (original data is a data frame)
nRow | 23258685
splitRowDistn | use splitRowDistn(dat) to get distribution
summary | use summary(dat) to see summaries
localDiskConn connection
loc=/Users/hafe647/Documents/Code/vastChallenge/data/nfRaw; nBins=0
</code></pre>
<p>Since <code>nfRaw</code> is a distributed data frame, we can look at various aspects of the data frame through familiar R methods.</p>
<p>We can see variable names:</p>
<pre><code class="r"># see what variables are available
names(nfRaw)
</code></pre>
<pre><code> [1] "dateTimeStr" "ipLayerProtocol" "ipLayerProtocolCode"
[4] "firstSeenSrcIp" "firstSeenDestIp" "firstSeenSrcPort"
[7] "firstSeenDestPort" "moreFragments" "contFragments"
[10] "durationSeconds" "firstSeenSrcPayloadBytes" "firstSeenDestPayloadBytes"
[13] "firstSeenSrcTotalBytes" "firstSeenDestTotalBytes" "firstSeenSrcPacketCount"
[16] "firstSeenDestPacketCount" "recordForceOut" "date"
</code></pre>
<p>We can get the number of rows:</p>
<pre><code class="r"># get total number of rows
nrow(nfRaw)
</code></pre>
<pre><code>NULL
</code></pre>
<p>We can grab the first subset and look at its structure:</p>
<pre><code class="r"># look at the structure of the first key-value pair
str(nfRaw[[1]])
</code></pre>
<pre><code>List of 2
$ : num 343
$ :'data.frame': 50000 obs. of 18 variables:
..$ dateTimeStr : num [1:50000] 2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
..$ ipLayerProtocol : int [1:50000] 6 6 6 6 6 6 6 6 6 6 ...
..$ ipLayerProtocolCode : chr [1:50000] "TCP" "TCP" "TCP" "TCP" ...
..$ firstSeenSrcIp : chr [1:50000] "10.15.7.85" "10.15.7.85" "10.15.7.85" "10.15.7.85" ...
..$ firstSeenDestIp : chr [1:50000] "172.30.0.4" "172.30.0.4" "172.30.0.4" "172.30.0.4" ...
..$ firstSeenSrcPort : int [1:50000] 16165 16164 16643 16162 16163 16642 27444 16436 17052 16437 ...
..$ firstSeenDestPort : int [1:50000] 80 80 80 80 80 80 80 80 80 80 ...
..$ moreFragments : int [1:50000] 0 0 0 0 0 0 0 0 0 0 ...
..$ contFragments : int [1:50000] 0 0 0 0 0 0 0 0 0 0 ...
..$ durationSeconds : int [1:50000] 5 5 2 5 5 2 0 3 0 3 ...
..$ firstSeenSrcPayloadBytes : int [1:50000] 19 19 19 19 19 19 19 19 19 19 ...
..$ firstSeenDestPayloadBytes: int [1:50000] 503 503 503 503 503 503 503 503 503 503 ...
..$ firstSeenSrcTotalBytes : int [1:50000] 297 297 297 297 297 297 297 297 297 297 ...
..$ firstSeenDestTotalBytes : int [1:50000] 619 619 619 619 619 619 619 619 619 619 ...
..$ firstSeenSrcPacketCount : int [1:50000] 5 5 5 5 5 5 5 5 5 5 ...
..$ firstSeenDestPacketCount : int [1:50000] 2 2 2 2 2 2 2 2 2 2 ...
..$ recordForceOut : int [1:50000] 0 0 0 0 0 0 0 0 0 0 ...
..$ date : POSIXct[1:50000], format: "2013-04-14 14:42:14" "2013-04-14 14:42:14" ...
</code></pre>
<p>We can view summaries of the variables in the distributed data frame:</p>
<pre><code class="r"># look at summaries (computed from updateAttributes)
summary(nfRaw)
</code></pre>
<pre><code> dateTimeStr ipLayerProtocol ipLayerProtocolCode firstSeenSrcIp
-------------------- ----------------- ------------------- -----------------------
missing : 0 missing : 0 levels : 3 levels : 1390
min : 2.013e+13 min : 1 missing : 0 missing : 0
max : 2.013e+13 max : 17 > freqTable head < > freqTable head <
mean : 2.013e+13 mean : 6.09 TCP : 23062987 10.138.214.18 : 1300759
std dev : 317466 std dev : 0.9961 UDP : 191395 10.170.32.181 : 1259035
skewness : 4.299 skewness : 10.79 OTHER : 4303 10.170.32.110 : 1257747
kurtosis : 35.2 kurtosis : 118.5 10.10.11.102 : 1251990
-------------------- ----------------- ------------------- -----------------------
firstSeenDestIp firstSeenSrcPort firstSeenDestPort moreFragments
--------------------- ------------------ ----------------- -------------------
levels : 1277 missing : 0 missing : 0 missing : 0
missing : 0 min : 0 min : 0 min : 0
> freqTable head < max : 65534 max : 65534 max : 1
172.30.0.4 : 8122427 mean : 30523 mean : 595.9 mean : 1.75e-05
172.10.0.4 : 4652570 std dev : 18235 std dev : 4261 std dev : 0.004183
172.20.0.4 : 4341038 skewness : 0.05421 skewness : 10.51 skewness : 239
172.20.0.15 : 4029911 kurtosis : 1.809 kurtosis : 121.3 kurtosis : 57145
--------------------- ------------------ ----------------- -------------------
contFragments durationSeconds firstSeenSrcPayloadBytes
-------------------- ---------------- ------------------------
missing : 0 missing : 0 missing : 0
min : 0 min : 0 min : 0
max : 1 max : 1800 max : 3050256
mean : 1.741e-05 mean : 11.36 mean : 691.6
std dev : 0.004173 std dev : 37.15 std dev : 38955
skewness : 239.6 skewness : 10.48 skewness : 67.35
kurtosis : 57427 kurtosis : 221 kurtosis : 4739
-------------------- ---------------- ------------------------
firstSeenDestPayloadBytes firstSeenSrcTotalBytes firstSeenDestTotalBytes
------------------------- ---------------------- -----------------------
missing : 0 missing : 0 missing : 0
min : 0 min : 43 min : 0
max : 3129878 max : 3326672 max : 3762470
mean : 22561 mean : 1497 mean : 23576
std dev : 245130 std dev : 41306 std dev : 254859
skewness : 11.84 skewness : 65.7 skewness : 11.84
kurtosis : 143 kurtosis : 4598 kurtosis : 143
------------------------- ---------------------- -----------------------
firstSeenSrcPacketCount firstSeenDestPacketCount recordForceOut
----------------------- ------------------------ --------------
missing : 0 missing : 0 missing : 0
min : 1 min : 0 min : 0
max : 13033 max : 13969 max : 0
mean : 14.51 mean : 18.64 mean : 0
std dev : 109.3 std dev : 182.4 std dev : 0
skewness : 14.66 skewness : 12.02 skewness : NaN
kurtosis : 334.4 kurtosis : 155.5 kurtosis : NaN
----------------------- ------------------------ --------------
date
------------------------
missing : 0
min : 13-04-10 06:50
max : 13-04-15 10:00
------------------------
</code></pre>
<p>The <code>summary()</code> method provides a nice overview of the variables in our distributed data frame. For categorical variables, it provides a frequency table; for numeric variables, it provides summary statistics such as the moments (mean, standard deviation, skewness, kurtosis) and the range.</p>
<div class="callout callout-danger"><strong>Note: </strong>A good place to start in an exploratory analysis is to look at summary statistics. The summary information that comes with distributed data frames provides a simple way to start looking at the data.</div>
<p>There are several insights we can get from the data by simply scanning the summary output printed above. For example, the variable <code>ipLayerProtocolCode</code> tells us that the vast majority of the connections monitored are TCP connections, while UDP connections make up a little less than 1% of the traffic; all other protocols are rolled up into an "OTHER" category. We also see that the timestamp of the data ranges from April 10, 2013 to April 15, 2013. Finally, the variable <code>recordForceOut</code> is all zeros (its min and max are zero), meaning that no records were forced out (recall that all variables are described <a href="docs/data/NetFlow_NetworkHealth.pdf">here</a>).</p>
<p>There are other simple insights we can gain from scanning the summary output, but we can get better insights by visualizing the summaries in more detail.</p>
<h4>First seen source IP</h4>
<p>We want to better understand the distribution of first seen source IP addresses in the data. Note that in the summary printout above, we only see the top 4 IP addresses in the summary info for <code>firstSeenSrcIp</code>. We can extract the full frequency table from the summary with the following:</p>
<pre><code class="r"># grab the full frequency table for firstSeenSrcIp
srcIpFreq <- summary(nfRaw)$firstSeenSrcIp$freqTable
# look at the top few IPs
head(srcIpFreq)
</code></pre>
<pre><code> value Freq
35 10.138.214.18 1300759
65 10.170.32.181 1259035
64 10.170.32.110 1257747
24 10.10.11.102 1251990
86 10.247.106.27 1233811
28 10.12.15.152 1148983
</code></pre>
<p>To get more information about the IP addresses in this table, we can rely on the list of hosts provided with the raw data. We have included this data, called <code>hostListOrig</code>, with the <code>cyberTools</code> package:</p>
<pre><code class="r">head(hostListOrig)
</code></pre>
<pre><code> IP hostName type externalIP
1 172.10.0.2 dc01.bigmkt1.com Domain controller 10.0.2.2
2 172.10.0.3 mail01.bigmkt1.com SMTP 10.0.2.3
3 172.10.0.4 web01.bigmkt1.com HTTP 10.0.2.4
4 172.10.0.40 administrator.bigmkt1.com Administrator <NA>
5 172.10.0.5 web01a.bigmkt1.com HTTP 10.0.2.5
6 172.10.0.7 web01c.bigmkt1.com HTTP 10.0.2.6
</code></pre>
<p>This list provides additional information about the IP addresses in our data, such as the type of machine and the name of the host, and makes a nice augmentation for our frequency table. We can merge it in with the <code>mergeHostList()</code> function provided with <code>cyberTools</code>. This function expects to receive an input data frame and the name of the variable containing the IP addresses to merge on. We also specify <code>original = TRUE</code> so that the function uses the original host list provided with the data, as opposed to incorporating modifications we will discover later.</p>
<pre><code class="r">srcIpFreq <- mergeHostList(srcIpFreq, "value", original = TRUE)
head(srcIpFreq)
</code></pre>
<pre><code> value Freq hostName type externalIP
1 172.10.0.4 151100 web01.bigmkt1.com HTTP 10.0.2.4
2 172.30.0.4 93584 web03.bigmkt3.com HTTP 10.0.4.4
3 172.20.0.4 47719 web02.bigmkt2.com HTTP 10.0.3.4
4 172.20.0.15 38855 web02l.bigmkt2.com HTTP 10.0.3.15
5 172.10.2.66 29283 wss1-319.bigmkt1.com Workstation <NA>
6 172.30.1.223 29270 wss3-223.bigmkt3.com Workstation <NA>
</code></pre>
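<p>Conceptually, <code>mergeHostList()</code> acts like a left join of the frequency table against the host list on the IP column. A base R sketch with toy data (not the actual implementation) looks like this:</p>
<pre><code class="r"># toy versions of the frequency table and host list
freq <- data.frame(value = c("172.10.0.4", "172.99.0.1"), Freq = c(151100, 12),
  stringsAsFactors = FALSE)
hosts <- data.frame(IP = "172.10.0.4", hostName = "web01.bigmkt1.com", type = "HTTP",
  stringsAsFactors = FALSE)
# left join: keep every row of freq, matching host info where available
m <- merge(freq, hosts, by.x = "value", by.y = "IP", all.x = TRUE)
# internal 172.* IPs missing from the host list get the "Other 172.*" label
m$type[is.na(m$type) & grepl("^172\\.", m$value)] <- "Other 172.*"
</code></pre>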
<p>Now we can see, for example, what types of hosts are in the data:</p>
<pre><code class="r"># see how many of each type we have
table(srcIpFreq$type)
</code></pre>
<pre><code>
Administrator Domain controller External HTTP Other 172.*
1 3 164 16 103
SMTP Workstation
3 1100
</code></pre>
<p>Most are workstations. There are 103 "Other 172.*" addresses that warrant further scrutiny.</p>
<h4>A potential issue with the provided host list</h4>
<p>From the documentation, it appears that IPs that are inside the network are of the form <code>172.x.x.x</code>. <code>mergeHostList()</code> finds IPs that are of this form that are not listed in <code>hostListOrig</code> and gives them the classification <code>"Other 172.*"</code>. Let's look at these:</p>
<pre><code class="r"># look at 172.x addresses that aren't in our host list
sort(subset(srcIpFreq, type == "Other 172.*")$value)
</code></pre>
<pre><code> [1] "172.0.0.1" "172.10.0.50" "172.10.0.6" "172.20.1.101" "172.20.1.102"
[6] "172.20.1.103" "172.20.1.104" "172.20.1.105" "172.20.1.106" "172.20.1.107"
[11] "172.20.1.108" "172.20.1.109" "172.20.1.110" "172.20.1.111" "172.20.1.112"
[16] "172.20.1.113" "172.20.1.114" "172.20.1.115" "172.20.1.116" "172.20.1.117"
[21] "172.20.1.118" "172.20.1.119" "172.20.1.120" "172.20.1.121" "172.20.1.122"
[26] "172.20.1.123" "172.20.1.124" "172.20.1.125" "172.20.1.126" "172.20.1.127"
[31] "172.20.1.128" "172.20.1.129" "172.20.1.130" "172.20.1.131" "172.20.1.132"
[36] "172.20.1.133" "172.20.1.134" "172.20.1.135" "172.20.1.136" "172.20.1.137"
[41] "172.20.1.138" "172.20.1.139" "172.20.1.140" "172.20.1.141" "172.20.1.142"
[46] "172.20.1.143" "172.20.1.144" "172.20.1.145" "172.20.1.146" "172.20.1.147"
[51] "172.20.1.148" "172.20.1.149" "172.20.1.150" "172.20.1.151" "172.20.1.152"
[56] "172.20.1.153" "172.20.1.154" "172.20.1.155" "172.20.1.156" "172.20.1.157"
[61] "172.20.1.158" "172.20.1.159" "172.20.1.160" "172.20.1.161" "172.20.1.162"
[66] "172.20.1.163" "172.20.1.164" "172.20.1.165" "172.20.1.166" "172.20.1.167"