-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathscratch.txt
89 lines (52 loc) · 1.5 KB
/
scratch.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
Master,Driver,W1,W2,W3,W4
7 nodes connected to LAN?
tevc???
A B C D
1st stage - 1a, 1b
2nd stage - 2a, 2b
Master, Driver, Worker (2)
Worker -- some executor processes, one executor per job
Job-sort :
3 stages … sort
2 stages … wordcount
Input: 1000 numbers
2 partitions -- 2 tasks
500 numbers (partition) -- HDFS -- RDD -- executor+worker1, sort, -- RDD (intermediate) -- TaskA
500 numbers (partition) -- HDFS -- RDD -- executor+worker2, sort, -- RDD (intermediate) -- TaskB
stage2:
TaskC (worker1) -- shuffling -- RDD
TaskD (worker2) -- shuffling -- RDD
Stage3:
TaskC --- TaskE -- write the entire data to HDFS file?
TaskD --- TaskE
4 stages: A-read, B-sort1, C-sort2, D-write
2 partitions
A1-B1 ??
A2-B2 ??
Does A1 send something B2?? Is this 1-on-1 communication? Or all-to-all communication?
B1-C1
B1-C2
B2-C1
B2-C2
(please confirm this is what happens)
C1-D1
C2-D2
(please confim this is what happens)
------------------------------------------------------------
Deterministic placement:
A1,B1 -- on W1
A2,B2 -- on W2
C1,D1 -- on W3
C2,D2 -- on W4
A0,B0 -- on W1
A1,B1 -- on W2
C0,D0 -- on W3
C1,D1 -- on W4
I want to delay W1 network card (w1.expername.emulab.net)
Maybe no A1’ (NFS read from local file system)
B1’ (backup task)
----------------------------------------
You need to see where the data transfer …
Where can you get something like data.size() … RDD.size()
Time it .. RPC … go deep to the framework code, Spark engine
--------------------------------------