Fig. 1: Three hop linear topology
In today's exercise we will implement a state of the art packet loss detection system called LossRadar
. This exercise is based on recent research and a paper published
few years ago: LossRadar: Fast Detection of Lost Packets in Data Center Networks. We will provide you with all the knowledge and information needed to implement the exercise.
If you want to know more you can find the paper here.
Packet losses are very common in nowadays networks. Those losses can be caused by many reasons: congestion, blackholes, faulty links or corrupted forwarding tables. Even if small, losses can have a significant impact on the network performance and thus detecting and localizing them fast is fundamental. Furthermore, being able to get information (i.e header fields) of the packets that get dropped can be extremely useful when trying to find the root cause of the drop.
In this exercise we will show you how to achieve all those requirements by combining the data plane per packet processing capabilities and a small decoding algorithm that will run in the controller. The main idea behind this packet loss detection system are the invertible bloom lookup tables (IBLT), a probabilistic data structure that you have seen during this lecture. We saw that IBLTs can be used to store key value pairs. In this exercise, we will see that an IBLT can also be used to efficiently track missing items between two entities (we will call them traffic digests). Let's see how this works with an example:
Imagine that for every unidirectional link in the network, we keep two IBLTs. One called Upstream meter (UM, placed at the sending switch) and one called Downstream meter (DM, placed at the receiving switch). Each
IBLT consists of an array of cells, and each cell contains a packet count
and a xorSum
.
For each packet that crosses the link we do the following:
- We hash some header fields (i.e 5-tuple + IP id, as we will do in this exercise) using multiple hash functions.
- Then we use the hash indexes to update the IBLT array fields as follows:
count = count + 1
andxorSum = xorSum XOR pkt.signature
. The signature should contain all the cells used by to obtain the hash indexes and could have some extra ones if wanted.
The figure below shows one Upstream and one Downstream traffic digest or IBLT. As you can see with the coloured boxes, the upstream (the sending
switch) contains 4 different packets (different colours) that have been hashed 3 times in different cells. Also note that each cell contains a count with
the number of packets stored in the cell and an XOR of all the pkt.signatures
. Then those packets are transmitted to the downstream switch where the same process is repeated.
However, as you can see in this crafted example, the blue, orange and purple packets have been lost and thus are not contained inside the DM digest.
Fig. 2: Upstream and downstream traffic digests. Credit: LossRadar
Once the packets have passed both meters the IBLT can be decoded. To do that the controller will have to pull the content of the register
arrays and subtract the digests L=UM-DM
. In order words, for each counter cell i
we have to do L_count(i) = UM_count(i) - DM_count(i)
, and for each xorSum
cell
we do L_xorSum(i) = UM_xorSum(i) XOR DM_xorSum(i)
. Some reminder on Xoring: A xor A = 0
, A xor B xor A = B
. Thus if the packets have been
added to both digests they will always be canceled leaving only the dropped packets in L
.
One of the challenges of this system and in general network measurements is synchronization. If both meters are not perfectly synchronized we will detect
packet losses that did not happen. To make sure that the meters that are exchanged where exposed to the exact same set of packets batches
are used. As you will
see during the exercise, one way of doing that is by letting the UM switch tag packets with a batch_id
such that the DM switch counts them in the same batch. This batch
id can be something that changes every 1 second for example. When counting packets for 1 batch the controller has some time to pull and clean the region of the meters
that correspond to the previous batch while the switches are not modifying it. One way of synchronizing this with the controller is making the UM switch track all
the time if there is a batch ID change and if so notify the controller so it pulls the previous batch meters.
In order to decode L
and find out which packets have been dropped, the controller does the following:
- Finds a the cells with
count = 1
(we call them pure cells). ThexorSum
of a pure cells contains the packet header (pkt.signature
) information we decided to add to it. For example the packet's 5-tuple and some extra fields. - Uses the
pkt.signature
from the pure cell to obtain theK
hash indexes that the switch computed when adding the packet to the meter. Then it uses those indexes to subtract 1 to thecounter
field in those array cells and XOR thepkt.signature
. By doing this, some new pure cells appear, for example, in the figure above. If you remove the blue packets, then the fourth cell will have a pure orange packet. This process is then repeated until there are no more pure cells or the digest is empty. - All the packets found during the iterative process can be reported as dropped packets in the link (with direction UM->DM) where the UM and DM where placed.
In order to cover all the links in the network we will install one upstream meter at each output port of every switch, and one downstream meter at each input port of every switch. In this way both directions of every link are covered. In fact, we need two pairs of upstream meters and downstream meters for each bidirectional link. You can see depicted the placement of this meters in the picture below. However, for conciseness there is only one pair drawn.
Fig. 3: Shows how the UM and DM are placed in the pipeline. Credit: LossRadar
As explained in the paper, it is important to note that both meters are placed at the end of the ingress pipeline after all the tables and before the traffic manager. Also it is important to note that the upstream meter is executed before than the downstream meter so that the segments between the UM and the DM are also covered.
Now its time to implement your own packet loss detection application using P4 switches and a controller to decode the packet digests.
Before you start this exercise, update p4-utils
, some new features have been added for the completion of this exercise.
cd ~/p4-tools/p4-utils
git pull
As you already did for the sketch exercise you can disable the debugging options by compiling the bmv2 again using a new build directory:
cd ~/p4-tools/
git clone https://github.com/p4lang/behavioral-model.git bmv2-opt
cd bmv2-opt
git checkout 62a013a15ed2c42b1063c26331d73c2560d1e4d0
./autogen.sh
./configure --without-nanomsg --disable-elogger --disable-logging-macros 'CFLAGS=-g -O2' 'CXXFLAGS=-g -O2'
make -j 2
sudo make install
sudo ldconfig
IMPORTANT: It is recommended that you do not run the sudo make install
command until you have a working solution. When using this optimized compilation the
switch will not generate log files, and thus it will be hard for you to properly debug your program.
Since we keep the two compiled versions of bmv2
in different folders, you can enable the one with the debugging
enabled by just running the make install
command
again:
cd ~/p4-tools/bmv2
sudo make install
sudo ldconfig
Thus by running sudo make install
in ~/p4-tools/bmv2
or ~/p4-tools/bmv2-opt
you can easily enable each compiled version.
For this exercise we provide you with the following files:
p4app.json
: describes the topology we want to create with the help of mininet and p4-utils package.network.py
: a Python scripts that initializes the topology using Mininet and P4-Utils. One can use indifferentlynetwork.py
orp4app.json
to start the network.p4src/loss-detection.p4
: p4 program skeleton to use as a starting point. Furthermore, we provide you an almost completeheaders.p4
andparsers.p4
files.send.py
: python program that uses raw sockets to send packets (scapy
was not fast enough for what we needed in this exercise).packet-loss-controller.py
: controller that uses thethrift_API
to communicate with the switch. You will use it to configure the switches, to receive event notifications and decode the IBLT.crc.py
a pythoncrc32
implementation that will be used by the controller to match the same hash functions that the switch used.
Since the goal of this exercise is forwarding we will use the l2
strategy to create the topology. This strategy places all the hosts
in the same subnet. This will force hosts to send packets with the right destination MAC address making it easier for us so that we do
not have to set it with the switches. Thus, h1
and h2
get assigned with 10.0.0.1
and 10.0.0.2
respectively.
You can find all the documentation about p4app.json
in the p4-utils
documentation.
To solve this exercise we have to do two things: i) implement in P4 all the logic needed to create the packet digests, ii) decode the content of the digest using the controller to determine which packets got lost.
Before starting have a look at the header and parser files, we provide you with some headers and parser logic, however you will still have to add a bit.
-
Define the
loss_t
header. It should have 3 fields, 1 bit for abatch_id
, 7 bits of padding and 8 bits for anextProtocol
field. Since we will place this header between the IP and Transport protocols, we will use the next protocol id to tell the parser whats next. -
Go to the parser file and enhance the logic such that the parser will parse the loss header after the IP and if so, it will also parse TCP and UDP packets. For the loss header use the IP type (
TYPE_LOSS
) defined as a constant in theheader.p4
file.
Now you should be ready to start the main P4 program. Here we will have to define a bunch of registers, tables, actions and control logic (some will be provided).
-
Define the
remove_loss_header
table. This table should match to theegress_spec
and call an action calledremove_header
. This table will be used to remove the loss header once the packet travels from a switch to a host. Note that the controller will populate this table for you and thus you have to make sure you use the same naming. -
Define the
remove_header
action. This action should set the loss header to invalid. Set the right IP next protocol, and subtract 2 to theIP LEN
field (which is the size in bytes of theloss_t
header) -
At the beginning of the ingress pipeline define 8 registers, 4 for the UM and 4 for the DM. The register cells should be
64bit
long and the size of the register should beREGISTER_SIZE_TOTAL
which is a constant defined at the beginning of the file. In these registers we will store 2 things, the packetcount
and thexorSum
of the packets that cross the switch. Due to limitations with the software switch API the maximum size a register can have isint 64
and thus we will split the fields we want to store inxorSum
into multiple registers as follows:- One Register for the source ip.
- One Register for the destination ip.
- One register for the srcport + dstport + ip proto + id.
- One register for the packet
count
.
For consistency you can call them
um_ip_src
anddm_ip_src
, etc.IMPORTANT: We asked you to define the registers with size
REGISTER_SIZE_TOTAL
which is set to 2048. If you look at the constants on top of the file you will see two more:REGISTER_BATCH_SIZE
andREGISTER_PORT_SIZE
. As you can see in Figure 4 we will split our registers in 4 regions, this will be very important because you will need to be careful when reading them or when you generate hash indexes. As you can see in the image, we will reuse the same register for the different ports the switch has and for the two meter batches that we considered in the implementation. Those meter batches are used so that the controller can read/clean one part of the meter while the other is being filled by packets. As you will see later, switches use time to keep track of the current batch id.
Fig. 4: Register port and batch structure
-
Define the
compute_hash_indexes
action. This action has to define 6 crc32_custom hash functions (as we did in the sketch exercise). We will use 3 hashes to get indexes for the UM digest, and 3 hashes to get the indexes for the DM digest. There are three important things you have to consider while defining the hashes:- Define 6 metadata fields in which you will store the hash outputs.
- For the hash fields, use the packet 5 tuple and the ip protocol. Keep in mind that the transport layer protocols are saved in
meta.tmp_src_port
andmeta.tmp_dst_port
(this will be explained a bit below). Also, the IP protocol should be stored inhdr.loss.nextProtocol
. - You need to consider the array distribution explained above to select the hash min and max values. For the Upstream meters you have to compute min as:
((meta.batch_id * REGISTER_BATCH_SIZE) + ((((bit<16>)standard_metadata.egress_spec-1)*REGISTER_PORT_SIZE)))
and max asmin+REGISTER_PORT_SIZE
. Note that we use the locally generated batc_id, and the output port. For the downstream meters you have to compute the hash min as:((meta.previous_batch_id * REGISTER_BATCH_SIZE) + ((((bit<16>)standard_metadata.ingress_port-1)*REGISTER_PORT_SIZE)))
and the max asmin+REGISTER_PORT_SIZE
. Note that we use thebatch_id
received from the upstream switch, and theingres_port
.
-
Implement the
apply_um_meter
action. Here you have to use the 3 indexes obtained form theum_hashes
and update the 4 registers defined in task 5. For thecounter
register you just have to increase by 1. For thesrc_ip
,dst_ip
andports_proto_id
registers you have toxor
its content with the receive packets fields. For each register you have to perform 3 reads and 3 writes. You will need to store the values you read from the registers somewhere, you can use local variables or metadata fields that you can reuse for the next action. -
Implement the
apply_dm_meter
action. Do the same for the DM registers using the 3 DM indexes obtained incompute_hash_indexes
. -
Since this time the ingress pipeline is quite complex we will provide it to you. The pipeline consists of the following blocks:
- We first set the udp/tcp ports to the
meta.tmp_src/dst_port
field. This is used as a indirect reference so that no matter which transport protocol do we have, we will always find the ports values in the same metadata field. - We run the forwarding table to set the output port.
- If the packet does not have a
loss
header, due to the fact that this is the first switch, we add the header, and update some IP header fields. Note that at the end we set to 1 themeta.dont_execute_dm
field. This is used to avoid executing the DM meter since this packet comes from a host. In the case the packet already had aloss
header we set themeta.previous_batch_id
to the one the packet carried in its header. - We compute the local batch id, this is done using the
ingress_global_timestamp
shifting it by 2**21 (approximately 2 seconds) and doing a modulo 2. This will make thebatch_id
to oscillate between 1 and 0 approximately every 2 seconds. - In the next block we will read the previous
batch_id
we observed in this switch. If the id changed since the last packet, we will generate a clone packet to notify the controller that it should export the meters for this batch id from this switch and all neighbors. - We update the headers
batch_id
- Compute the hash indexes.
- Remove the
loss
header if the next hop is a host. - Apply the UM and DM meters.
- Decrease the packet if
ip.ttl = 1
. Decrease it by one otherwise.
- We first set the udp/tcp ports to the
-
If we receive a packet that was cloned, we have to add the
loss
header, set thebatch_id
to the last local id observed by this switch and send it to the controller. We truncate the packet to 16 bytes since we only needethernet+loss
headers.
Once you have the data plane implementation finished you can start with the controller. We provide you with some basic functions and you will have to implement the main decoding algorithm. If you have a look at the controller you can see the following functions implemented:
-
run_cpu_port_loop
andrecv_msg_cpu
. The first function uses scapy and listen to all the cpu ports that connect to the switches. The second function extract the control packets your data plane application will be sending to the controller every time a batch_id gets updated. This function will finally callcheck_sw_links
which is where you will run the main algorithm. -
connect_to_switches
: creates a controller API for all the switches and stores them in a dictionary of the form: {sw_name:controller}. -
configure_switches
: populates the tables and sets the mirroing id for cloned packets. -
set_crc_custom_hashes
sets the 3 UM hashes and the 3 DM hashes, similarly to what we did in the sketch exercise. -
create_local_hashes
emulates the hashes we created in the switch so we have them in the controller. -
reset_all_registers
sets all the registers to 0 -
reset_registers
this functions is very importat it allows you to reset a specific area of a UM or DM digest. -
read_registers
loads all the registers into `self.registers -
extract_register_information
for a given switch, stream direction, port and batch id it returns the array region where the information for those parameters is stored.
By using all this functions and the decoding algorithm explained in the introduction you have to:
-
Implement
check_sw_links
function. If you have a look at the code you can see that this function will be called byrecv_msg_cpu
function which is already implemented. This function will be called with two parameters: the name of the switch that notified the controller, and the batch ID that need to be decoded. In this function you have to do the following:- Wait a bit of time using the sleep library. This is done to make sure that packets that could have been buffered between the switches that have been tagged with this batch id have already arrived at the DM switch.
- Read the content of the registers from this switch and directly connected switches.
- For all the neighboring switches call the function
verify_link(sw1, sw2, batch_id)
make sure you the switch that triggered the controller assw1
.
-
Implement the
verify_link
function. This is the function you have to call in your previous task. Here you have to do the following:- Get the port numbers (using the controller API) of the link that connects
sw1
andsw2
. - Extract the right region of the registers from
sw1
UM andsw2
DM. For that you can use the already implemented functionextract_register_information
. - Call the function
decode_meter_pair(um, dm)
, this function should return a set with the dropped packets. You will implement it in the next task. - Clean the array region that belongs to the ports and batch id for the UM and DM of these switches. You can use the
reset_registers
function. - Print or show nicely which packets get dropped so you can verify that your solution works.
- Get the port numbers (using the controller API) of the link that connects
-
Implement the decoding function described in the exercise introduction. As a parameter you will get 2 dictionaries with the 4 registers read from the switch, you can have a look at the dictionary to see the namings. What you should do is:
- For each array cell do the counter difference UM-DM. And for all the other registers do the XOR between UM and DM.
- Implement the algorithm described above. In brief, while there is a counter cell equal to 1, extract the packet, use it to get the 3 hash indexes, and clean
the correspondent cells, by doing -1 to the counter ones, and
xor
of the packet values to the other register cells. - Each time you find a dropped packet add it to a set/list.
- Once you do not find more ones, return the list of dropped packets.
NOTE: Once you think that you have a stable working solution, use the optimized software switch. Follow the steps descrived above.
To test our solution we will use the send.py
script you can find in the provided files. This script uses raw sockets
to generate random packets. Depending on the parameters you use those packets will be dropped or not. For the packets
to get dropped the script will set the TTL to X, that will determine which switch drops a given packet.
parser.add_argument('--n-pkt', help='number of packets', type=int, required=False, default=200)
parser.add_argument('--n-drops', help='number of packts to drop',type=float, required=False, default=20)
parser.add_argument('--fail-hops', help='Number of hops until packet is dropped, can be random', type=int, required=False, default=1)
To test the program:
-
Start the topology with
sudo p4run
orsudo python network.py
. -
Start the controller so switches get configured and you start listening to batch id changes.
sudo python packet-loss-controller.py --option run
This should populate the forwarding, remove_header table and mirroring id. At this point you should be able to ping or iperf, and the controller should receive packets when traffic is generated. However, since those packets are not getting dropped the controller should report no drops.
-
Generate some constant traffic between
h1
andh2
so the switches keep on sending control messages to the controller indicating that the batch id was modified. -
Get a terminal in
h1
and send some traffic with some dropped packets. To make them be dropped betweens1
ands2
set the hops option to 1:
mx h1
python send.py --n-pkt 100 --n-drops 5 --fail-hops 1
This will send 100 packets from which 5 should be dropped at the first hop.
If everything worked the controller should tell you which packets were dropped, and they should be the same than printed by
the send.py
script.
Now you can play with this, changing the number of hops, the amount of packets to be dropped, you can do iperfs, etc.
For example if you do:
mx h1
python send.py --n-pkt 100 --n-drops 5 --fail-hops 2
You should be able to detect that the drops have happened between s2
and s3
.
We have added a small guideline in the documentation section. Use it as a reference when things do not work as expected.