-
Notifications
You must be signed in to change notification settings - Fork 2
Overview and Architecture
netSLS uses the Hadoop Scheduler Load Simulator (SLS) to simulate Hadoop job traces and then emulates the network behavior of these jobs in MaxiNet. First, it is essential to understand how the simulation of jobs in SLS works.
Note, that netSLS only simulates MapReduce-type jobs. If you are not familiar with MapReduce or the execution cycle of jobs in Hadoop please refer to the official documentation.
Every job in a job trace has a relative start time (cf. Job Traces). Therefore, instead of submitting the jobs and letting them be processed by the Scheduler like on a real Hadoop cluster, SLS schedules the start of all Application Masters ahead based on their specified start times. Once an Application Master is started it processes its dummy-tasks in the usual manner, i.e. it request and launch containers and monitor their status.
As SLS is only a simulation, no actual computations are performed during a containers execution. The container simply sends heartbeat messages and waits for its specified runtime, which is also specified in the job trace, to elapse.
To model the network behavior of jobs one must first understand when data would be transmitted through the network during a real execution:
Every Map-task requires a part of the input data, the so called input split. The split is located
at one or multiple split locations, i.e. Hadoop nodes storing a replica of the split. If a
Map-task is executed on a host different from its split locations the data has to be fetched prior
to executing the task.
Reduce-tasks on the other hand have to fetch a portion of each Map-task's output (intermediate
output) prior to performing any computations.
As the job traces do not contain any information on how much of a container's runtime is caused by
data transmissions the network behavior in netSLS is modeled as follows:
At the beginning of a containers execution all necessary data transmissions are emulated. Once all
transmissions are completed, the container waits for the runtime specified in the job trace (i.e.
its simulated computation time) before it terminates.
The file sizes and split locations for the Map-tasks are specified in the job trace. A Reduce-task
fetches data amounting to its input size divided by the number of Map-tasks from each Map-task's
node.
This approximation of the network behavior could be viewed as running the same jobs on a cluster with lower computational power, because in general the execution time of containers is extended. The modified simulator, however, only shows the same behavior as the original execution regarding the following simplifying assumptions:
- The simulated computation time is set to the runtime specified in the job trace, including the
time for data transmissions in the original execution. This assumption is valid under three
conditions:
- The data transmission times of tasks in the original execution are not significantly larger then the actual computation times.
- For any job, the data transmission time in the original execution must not vary significantly for tasks of the same type.
- The sizes of the intermediate files fetched by one Reduce-tasks in the original execution must not vary significantly (an effect referred to as partitioning skew).
- Control- and heartbeat-messages are omitted due to their minor size (though their delay might be interesting)
After modeling the network behavior this sections describes how SLS interacts with MaxiNet and how the data transmissions are performed. The following figure gives an overview of the netSLS architecture:
SLS does not operate directly on MaxiNet but through a middleware, the Network Emulator. The Network Emulator has two main tasks:
- Control a MaxiNet experiment (i.e. emulated network).
This especially includes setting up the desired network topology for the current simulation. Note, that the topology has to be specified explicitly, because otherwise SLS assumes a flat one-rack hierarchy (cf. Job Traces). - Perform data transmissions for SLS.
In general job traces do not come with actual data, so the Network Emulator simply transmits a given number of bytes between two MaxiNet hosts.
Due to the simulation structure of SLS (thread pool) it would be problematic to perform the transmissions blocking. Thus all commands to the Network Emulator are issued via JSON-RPC (synchronous) and the transmission results are published via ZeroMQ Pub/Sub (asynchronous).
One of the main motivations for netSLS was to try out and compare different SDN concepts. Hadoop uses TCP for data transmissions in the default configuration. Originally it was planned to use netSLS to compare TCP to Varys, which until now however hasn't been done. For legacy reasons the Coflow concept of Varys is still part of the API but can be ignored for now.
These are the RPC methods:
-
start_simulation(topology)
Resets the Network Emulator and starts a new MaxiNet experiment with the given topology. -
register_coflow(coflow_id)
Varys specific. -
unregister_coflow(coflow_id)
Varys specific. -
transmit_n_bytes(coflow_id, source, destination, n_bytes, subscription_key)
Invokes the transmission of n_bytes from source to destination. The result of the transmission will be published via ZeroMQ Pub/Sub under the given subscription_key. That means, the Hadoop container issuing the transmission has to subscribe to subscription_key.
Like mentioned above, netSLS is designed with support for different transport APIs in mind. Each transport API has to provide means for transmitting a given number of bytes from the source host to destination host in MaxiNet.
As an example consider TCP (currently the only implemented transport API): There are two python
scripts, tcp_receive and tcp_send. When start_simulation
is called these executables are copied
to all MaxiNet workers and a tcp_receive process is started on every MaxiNet host, opening a TCP
socket that simply receives all incoming data.
When transmit_n_bytes(coflow_id, source, destination, n_bytes, subscription_key)
is called, a
tcp_send process is started on the source MaxiNet host. This process simply transmits n_bytes to
the receiving socket of the destination MaxiNet host.
In order to determine when transmissions have completed, a thread periodically polls each MaxiNet worker for terminated processes. The result of each completed transmission process then is published via ZeroMQ Pub/Sub.