[feature]: Integrate RIPE Atlas Probe Measurements Data Source #101

mohamedawnallah · 2023-12-25T00:41:30Z

Description

This PR integrates RIPE Atlas Measurements Data Source, adding the following relationships and node labels:

(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:DomainName)
(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:IP)

Motivation and Context

Closes #81

How Has This Been Tested?

I ran the atlas_measurements.py crawler and it successfully created the AtlasMeasurement Node Label and TARGET, and PART_OF relationship links.

Recordings

ripe_atlas_measurements.mov

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

This PR integrates RIPE Atlas Measurements Data Source, adding the following relationships and node labels: - "(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:DomainName)" - "(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:IP)"

m-appel

Thanks for the implementation! I added some comments.

I noticed that this crawler adds a lot of relationships (~21m), so I don't expect you to run it in its entirety for testing. I discussed a bit with Romain and we have some ideas to maybe reduce the number of relationships, but we will do that at a later point in time, so for now addressing just the few comments is fine.

m-appel · 2023-12-25T06:40:53Z

iyp/crawlers/ripe/atlas_measurements.py

+            next_url, next_data = self.__execute_query(next_url)
+            data += next_data
+            logging.info(f'Added {len(next_data)} probes. Total: {len(data)}')
+            break


This is probably leftover from debugging and should be removed

Aha yes 👍

m-appel · 2023-12-25T06:43:47Z

iyp/crawlers/ripe/atlas_measurements.py

+
+            # Extracting target information
+            target_info = {
+                "domain": item.pop("target", None),


This project uses single quotes. If you install the pre-commit hook described here this should be handled automatically :)

Yes, thanks for sharing this!

I'm also thinking of configuring the pre-commit hooks in the CI this way we enforce all the rules implemented by .pre-commit-config.yml e.g running in the CI:

pre-commit run --all-files

That would be great! When I setup the github actions I did not (and still do not really) know how they work, so they actually don't check exactly the same things as the pre-commit hook.

Okay I would love to submit PR for this!

m-appel · 2023-12-25T06:45:53Z

iyp/crawlers/ripe/atlas_measurements.py

+    @staticmethod
+    def __process_response(response: requests.Response):
+        if response.status_code != requests.codes.ok:
+            sys.exit(f'Request to {response.url} failed with status: {response.status_code}')


I know this is from the probe crawler (and it is on our list of ToDos), but our crawlers should not call sys.exit, since this terminates the entire create_db process. You can just raise an exception with response.raise_for_status() (or raise your own with a descriptive message, either is fine).

We are planning to write a best-practices kind of thing for crawlers at some point.

Yes, this makes much sense. Acknowledged!

m-appel · 2023-12-25T06:46:42Z

iyp/crawlers/ripe/atlas_measurements.py

+            item["target"] = target_info
+
+            # Extracting group information
+            group_info = {


I don't think keeping track of the group info is needed?

The problem arises from data retrieved from the RIPE Atlas Measurement dataset. The group or target info is prefixed with target/group e.g:

"group": null, "group_id": null, "target": "[k.root-servers.net](http://k.root-servers.net/)", "target_asn": null, "target_ip": "193.0.14.129", "target_prefix": null,

When we attempt to flatten this data using the _ delimiter, it triggers a Type Error Exception within the flatdict library.

For example, running the provided code snippet below generates an error:

import flatdict object = {"group": None, "group_id": None} flatdict.FlatterDict(object, delimiter='_')

This error resembles the one shown:

Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 389, in __init__ super(FlatterDict, self).__init__(value, delimiter, dict_class) File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 29, in __init__ self.update(value) File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 356, in update [self.__setitem__(k, v) for k, v in dict(other or kwargs).items()] File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 356, in <listcomp> [self.__setitem__(k, v) for k, v in dict(other or kwargs).items()] ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 420, in __setitem__ raise TypeError( TypeError: Assignment to invalid type for key group

This issue persists with the target information as well so I needed to flatten them out myself.

That said I'm gonna update the comments in the code to let anyone who read the code know why we extracted the information in this kind of format.

I see! Yes that makes sense then and good idea to add a comment.

m-appel · 2023-12-25T06:49:50Z

iyp/crawlers/ripe/atlas_measurements.py

+                probe_af = int(probe_measurement['af'])
+                if probe_af == 6:
+                    try:
+                        resolved_ips[i] = ipaddress.ip_address(resolved_ips[i]).compressed


You can actually just do this for all IPs, since ipaddress.ip_address accepts both IPv4 and IPv6 and there is no harm in checking if for some reason there are malformed IPv4 addresses (there shouldn't be).

There are some data points where the IP address is either "" or null.

It seems these absence values are not malformed IP addresses we should handle this in the __get_all_resolved_ips method instead.

m-appel · 2023-12-25T06:52:28Z

iyp/crawlers/ripe/atlas_measurements.py

+    @staticmethod
+    def __get_all_resolved_ips(probe_measurement):
+        # Ensure "resolved_ips" takes precedence over "target_ip".
+        resolved_ips = probe_measurement['target']['resolved_ips'] or probe_measurement['target']['ip'] or []


First time I've seen this syntax, thanks for that! You always learn something new :D

m-appel · 2023-12-25T07:06:07Z

iyp/crawlers/ripe/atlas_measurements.py

+            probe_measurement_domain = probe_measurement['target']['domain']
+            if probe_measurement_domain:
+                domain_qid = domain_ids[probe_measurement_domain]
+                target_links.append({'src_id': probe_measurement_qid, 'dst_id': domain_qid, 'props': [self.reference]})


It would be cool if the self.reference['reference_url'] could point to the respective measurement for each link.
Currently it points to https://atlas.ripe.net/api/v2/measurements for everything, but would be nice if it could point to https://atlas.ripe.net/api/v2/measurements/x for the links of measurement x. Note that this requires many dict copies in the "compute first / push later" approach.

I got what you means regards adding the reference on the relationship link but could you please elaborate more on this: Note that this requires many dict copies in the "compute first / push later" approach. ?

Do you mean, should we include the reference field when we compute or pre-compute the nodes rather than creating this field ad-hoc during the computation of the links? Additionally, considering that the result field in the data result already contains the precise path (https://atlas.ripe.net/api/v2/measurements/x), could we potentially access it directly, making the addition of the reference field or updating self.reference with /{id} unnecessary?

Regarding the first question: I made the mistake that I've updated the reference dict inside a for loop when computing the relationships, but since only a reference of the dict is stored in the list you actually only overwrite the url field of the same dict. For example

links = list() for x, y in nodes: self.reference['reference_url'] = ... links.append({'src_id': x, 'dst_id': y, 'props': [self.reference]}) self.iyp.batch_add_links(...)

would lead to all but one relationship having the wrong URL so a self.reference.copy() is needed instead. Or maybe there is a better way I don't know of :)

For the second question, even though it is a bit tedious I would prefer to update the reference_url field. The result field technically points to the results of the measurement, not the metadata. In addition, you have to know that this field exists to find it, and you have to know how to get to the metadata from there.
In general the reference_url should be as precise as possible, although this does not always work, but for Atlas it does quite well :)

Yes this makes sense, Thanks!

m-appel · 2023-12-25T13:24:31Z

iyp/crawlers/ripe/atlas_measurements.py

+        probe_measurement_ids = dict()
+
+        attrs_flattened = []
+        for probe_measurement in valid_probe_measurements:


I think we should exclude the current_probes field from the node properties as it adds a lot of fields on some cases.

m-appel · 2023-12-26T03:35:23Z

iyp/crawlers/ripe/atlas_measurements.py

+        probe_ids = set()
+        ips = set()
+        ases = set()
+        prefixes = set()


Let's not create Prefix nodes from this crawler. We are unsure where Atlas actually gets this information from. Maybe we can add it later, but for now let's just keep it in the property.

m-appel · 2023-12-26T03:42:08Z

iyp/crawlers/ripe/atlas_measurements.py

+
+            probe_measurement_ips = self.__get_all_resolved_ips(probe_measurement)
+            for probe_measurement_ip in probe_measurement_ips:
+                ip_qid = ip_ids[probe_measurement_ip]


There are measurements with an empty string in the resolved_ips list, where this breaks (example).

mohamedawnallah · 2023-12-27T05:52:06Z

Thank you, @m-appel, for the fantastic feedback. I've incorporated your suggestions and also updated the documentation.

m-appel · 2023-12-28T02:26:27Z

Thanks, looks good to me!

mohamedawnallah added 2 commits December 25, 2023 02:32

Fix linting issues

9eea7be

m-appel self-requested a review December 25, 2023 01:21

m-appel requested changes Dec 26, 2023

View reviewed changes

mohamedawnallah added 2 commits December 27, 2023 07:35

Resolve PR review

d4f40d2

Add atlas_measurements docs

938df23

mohamedawnallah force-pushed the integrate-atlas-measurements-data-source branch from e9e0b6a to 938df23 Compare December 27, 2023 05:42

mohamedawnallah requested a review from m-appel December 27, 2023 05:49

m-appel merged commit 7da908f into InternetHealthReport:main Dec 28, 2023
2 checks passed

mohamedawnallah deleted the integrate-atlas-measurements-data-source branch January 4, 2024 00:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature]: Integrate RIPE Atlas Probe Measurements Data Source #101

[feature]: Integrate RIPE Atlas Probe Measurements Data Source #101

mohamedawnallah commented Dec 25, 2023 •

edited

Loading

m-appel left a comment

m-appel Dec 25, 2023

mohamedawnallah Dec 26, 2023

m-appel Dec 25, 2023

mohamedawnallah Dec 26, 2023

mohamedawnallah Dec 26, 2023 •

edited

Loading

m-appel Dec 26, 2023

mohamedawnallah Dec 26, 2023

m-appel Dec 25, 2023

mohamedawnallah Dec 26, 2023

m-appel Dec 25, 2023

mohamedawnallah Dec 26, 2023 •

edited

Loading

mohamedawnallah Dec 26, 2023 •

edited

Loading

m-appel Dec 26, 2023

m-appel Dec 25, 2023

mohamedawnallah Dec 26, 2023

mohamedawnallah Dec 26, 2023

m-appel Dec 25, 2023

m-appel Dec 25, 2023

mohamedawnallah Dec 26, 2023

mohamedawnallah Dec 26, 2023 •

edited

Loading

m-appel Dec 26, 2023

mohamedawnallah Dec 27, 2023

m-appel Dec 25, 2023

m-appel Dec 26, 2023

m-appel Dec 26, 2023

mohamedawnallah commented Dec 27, 2023

m-appel commented Dec 28, 2023

[feature]: Integrate RIPE Atlas Probe Measurements Data Source #101

[feature]: Integrate RIPE Atlas Probe Measurements Data Source #101

Conversation

mohamedawnallah commented Dec 25, 2023 • edited Loading

Description

Motivation and Context

How Has This Been Tested?

Recordings

Types of changes

Checklist:

m-appel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mohamedawnallah Dec 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mohamedawnallah Dec 26, 2023 • edited Loading

Choose a reason for hiding this comment

mohamedawnallah Dec 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mohamedawnallah Dec 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mohamedawnallah commented Dec 27, 2023

m-appel commented Dec 28, 2023

mohamedawnallah commented Dec 25, 2023 •

edited

Loading

mohamedawnallah Dec 26, 2023 •

edited

Loading

mohamedawnallah Dec 26, 2023 •

edited

Loading

mohamedawnallah Dec 26, 2023 •

edited

Loading

mohamedawnallah Dec 26, 2023 •

edited

Loading