-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature]: Integrate RIPE Atlas Probe Measurements Data Source #101
[feature]: Integrate RIPE Atlas Probe Measurements Data Source #101
Conversation
This PR integrates RIPE Atlas Measurements Data Source, adding the following relationships and node labels: - "(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:DomainName)" - "(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:IP)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the implementation! I added some comments.
I noticed that this crawler adds a lot of relationships (~21m), so I don't expect you to run it in its entirety for testing. I discussed a bit with Romain and we have some ideas to maybe reduce the number of relationships, but we will do that at a later point in time, so for now addressing just the few comments is fine.
next_url, next_data = self.__execute_query(next_url) | ||
data += next_data | ||
logging.info(f'Added {len(next_data)} probes. Total: {len(data)}') | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably leftover from debugging and should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha yes 👍
|
||
# Extracting target information | ||
target_info = { | ||
"domain": item.pop("target", None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This project uses single quotes. If you install the pre-commit hook described here this should be handled automatically :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for sharing this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also thinking of configuring the pre-commit
hooks in the CI this way we enforce all the rules implemented by .pre-commit-config.yml
e.g running in the CI:
pre-commit run --all-files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great! When I setup the github actions I did not (and still do not really) know how they work, so they actually don't check exactly the same things as the pre-commit hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I would love to submit PR for this!
@staticmethod | ||
def __process_response(response: requests.Response): | ||
if response.status_code != requests.codes.ok: | ||
sys.exit(f'Request to {response.url} failed with status: {response.status_code}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is from the probe crawler (and it is on our list of ToDos), but our crawlers should not call sys.exit
, since this terminates the entire create_db
process. You can just raise an exception with response.raise_for_status()
(or raise your own with a descriptive message, either is fine).
We are planning to write a best-practices kind of thing for crawlers at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this makes much sense. Acknowledged!
item["target"] = target_info | ||
|
||
# Extracting group information | ||
group_info = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think keeping track of the group info is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem arises from data retrieved from the RIPE Atlas Measurement dataset. The group
or target
info is prefixed with target/group
e.g:
"group": null,
"group_id": null,
"target": "[k.root-servers.net](http://k.root-servers.net/)",
"target_asn": null,
"target_ip": "193.0.14.129",
"target_prefix": null,
When we attempt to flatten this data using the _
delimiter, it triggers a Type Error Exception within the flatdict
library.
For example, running the provided code snippet below generates an error:
import flatdict
object = {"group": None, "group_id": None}
flatdict.FlatterDict(object, delimiter='_')
This error resembles the one shown:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 389, in __init__
super(FlatterDict, self).__init__(value, delimiter, dict_class)
File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 29, in __init__
self.update(value)
File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 356, in update
[self.__setitem__(k, v) for k, v in dict(other or kwargs).items()]
File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 356, in <listcomp>
[self.__setitem__(k, v) for k, v in dict(other or kwargs).items()]
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mohamedawnallah/Desktop/internet-yellow-pages/.venv/lib/python3.11/site-packages/flatdict.py", line 420, in __setitem__
raise TypeError(
TypeError: Assignment to invalid type for key group
This issue persists with the target
information as well so I needed to flatten them out myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said I'm gonna update the comments in the code to let anyone who read the code know why we extracted the information in this kind of format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! Yes that makes sense then and good idea to add a comment.
probe_af = int(probe_measurement['af']) | ||
if probe_af == 6: | ||
try: | ||
resolved_ips[i] = ipaddress.ip_address(resolved_ips[i]).compressed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can actually just do this for all IPs, since ipaddress.ip_address
accepts both IPv4 and IPv6 and there is no harm in checking if for some reason there are malformed IPv4 addresses (there shouldn't be).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some data points where the IP address is either "" or null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems these absence values are not malformed IP addresses we should handle this in the __get_all_resolved_ips
method instead.
@staticmethod | ||
def __get_all_resolved_ips(probe_measurement): | ||
# Ensure "resolved_ips" takes precedence over "target_ip". | ||
resolved_ips = probe_measurement['target']['resolved_ips'] or probe_measurement['target']['ip'] or [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First time I've seen this syntax, thanks for that! You always learn something new :D
probe_measurement_domain = probe_measurement['target']['domain'] | ||
if probe_measurement_domain: | ||
domain_qid = domain_ids[probe_measurement_domain] | ||
target_links.append({'src_id': probe_measurement_qid, 'dst_id': domain_qid, 'props': [self.reference]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cool if the self.reference['reference_url'] could point to the respective measurement for each link.
Currently it points to https://atlas.ripe.net/api/v2/measurements
for everything, but would be nice if it could point to https://atlas.ripe.net/api/v2/measurements/x
for the links of measurement x
. Note that this requires many dict copies in the "compute first / push later" approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got what you means regards adding the reference on the relationship link but could you please elaborate more on this: Note that this requires many dict copies in the "compute first / push later" approach.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean, should we include the reference
field when we compute or pre-compute the nodes rather than creating this field ad-hoc during the computation of the links? Additionally, considering that the result
field in the data result already contains the precise path (https://atlas.ripe.net/api/v2/measurements/x
), could we potentially access it directly, making the addition of the reference
field or updating self.reference
with /{id}
unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the first question: I made the mistake that I've updated the reference dict inside a for loop when computing the relationships, but since only a reference of the dict is stored in the list you actually only overwrite the url field of the same dict. For example
links = list()
for x, y in nodes:
self.reference['reference_url'] = ...
links.append({'src_id': x, 'dst_id': y, 'props': [self.reference]})
self.iyp.batch_add_links(...)
would lead to all but one relationship having the wrong URL so a self.reference.copy()
is needed instead. Or maybe there is a better way I don't know of :)
For the second question, even though it is a bit tedious I would prefer to update the reference_url
field. The result
field technically points to the results of the measurement, not the metadata. In addition, you have to know that this field exists to find it, and you have to know how to get to the metadata from there.
In general the reference_url
should be as precise as possible, although this does not always work, but for Atlas it does quite well :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this makes sense, Thanks!
probe_measurement_ids = dict() | ||
|
||
attrs_flattened = [] | ||
for probe_measurement in valid_probe_measurements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should exclude the current_probes
field from the node properties as it adds a lot of fields on some cases.
probe_ids = set() | ||
ips = set() | ||
ases = set() | ||
prefixes = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not create Prefix nodes from this crawler. We are unsure where Atlas actually gets this information from. Maybe we can add it later, but for now let's just keep it in the property.
|
||
probe_measurement_ips = self.__get_all_resolved_ips(probe_measurement) | ||
for probe_measurement_ip in probe_measurement_ips: | ||
ip_qid = ip_ids[probe_measurement_ip] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are measurements with an empty string in the resolved_ips
list, where this breaks (example).
e9e0b6a
to
938df23
Compare
Thank you, @m-appel, for the fantastic feedback. I've incorporated your suggestions and also updated the documentation. |
Thanks, looks good to me! |
Description
This PR integrates RIPE Atlas Measurements Data Source, adding the following relationships and node labels:
(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:DomainName)
(:AtlasProbe)-[:PART_OF]->(:AtlasMeasurement)-[:TARGET]->(:IP)
Motivation and Context
Closes #81
How Has This Been Tested?
I ran the
atlas_measurements.py
crawler and it successfully created theAtlasMeasurement
Node Label andTARGET
, andPART_OF
relationship links.Recordings
ripe_atlas_measurements.mov
Types of changes
Checklist: