Ooni crawler #137

Merged 32 commits into main from ooni_crawler on Aug 30, 2024.

Commits (32)
17b0780
added ooni crawler and link cleaner post script
Fredddi43 Aug 6, 2024
8dbffe4
Merge branch 'InternetHealthReport:main' into ooni_crawler
Fredddi43 Aug 6, 2024
39d5429
fixed missing failure category in riseupvpn test
Fredddi43 Aug 6, 2024
9bff79d
Fix code style
m-appel Aug 7, 2024
f2e7d36
Rename signal crawler to prevent namespace collision
m-appel Aug 7, 2024
d7238b1
Fix logging format typo
m-appel Aug 7, 2024
4a4c778
Moved common functionality to OoniCrawler baseclass, currently only i…
Fredddi43 Aug 8, 2024
4c390ea
finished refactor
Fredddi43 Aug 8, 2024
8295474
refactoring fixes
Fredddi43 Aug 8, 2024
3797850
refactoring fixes
Fredddi43 Aug 8, 2024
b88083e
stunreachability refactor fixes
Fredddi43 Aug 8, 2024
acdfea7
stunreachability fixes
Fredddi43 Aug 8, 2024
30be06e
further refactoring fixes
Fredddi43 Aug 8, 2024
f6028dd
Refactor OONI grabber
m-appel Aug 8, 2024
2773e7f
stunreachability fixes
Fredddi43 Aug 9, 2024
4a262c4
stunreachability fixes
Fredddi43 Aug 9, 2024
59d5fa6
webconnectivity fixes
Fredddi43 Aug 9, 2024
ffbd93d
further stunreachability fixes
Fredddi43 Aug 9, 2024
263b10b
corrected indentation
Fredddi43 Aug 9, 2024
19acef3
webconnectivity fixes
Fredddi43 Aug 9, 2024
0118150
webconnectivity link fixes
Fredddi43 Aug 9, 2024
896a6f6
added result check
Fredddi43 Aug 9, 2024
a2a09eb
tor result check
Fredddi43 Aug 9, 2024
393e71f
fixed tor
Fredddi43 Aug 9, 2024
0e7c1fe
code style
m-appel Aug 9, 2024
b66e8a7
Adjust logging
m-appel Aug 9, 2024
64c06d1
More refactoring
m-appel Aug 9, 2024
b6fb125
added readme and comments
Fredddi43 Aug 13, 2024
2719417
raise ASN exception
Fredddi43 Aug 13, 2024
f3ccca5
Merge branch 'main' into ooni_crawler
m-appel Aug 30, 2024
17117c6
Add unit tests
m-appel Aug 30, 2024
ff36cc2
Update example config
m-appel Aug 30, 2024
7 changes: 5 additions & 2 deletions .gitignore
@@ -129,5 +129,8 @@ dmypy.json
 .pyre/
 
 .history/
-
-**/dumps
+dumps/
+data/
+.vscode/
+neo4j/
+tmp/
12 changes: 6 additions & 6 deletions .pre-commit-config.yaml
@@ -14,17 +14,17 @@ repos:
     rev: v2.0.4
     hooks:
       - id: autopep8
-  - repo: https://github.com/PyCQA/docformatter
-    rev: v1.7.5
-    hooks:
-      - id: docformatter
-        args: [--in-place, --wrap-summaries, '88', --wrap-descriptions, '88']
+  - repo: https://github.com/PyCQA/docformatter
+    rev: v1.7.5
+    hooks:
+      - id: docformatter
+        args: [--in-place, --wrap-summaries, '88', --wrap-descriptions, '88']
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.6.0
     hooks:
       - id: double-quote-string-fixer
       - id: mixed-line-ending
-        args: [--fix, lf]
+        args: ['--fix', 'lf']
   - repo: https://github.com/PyCQA/flake8
     rev: 7.0.0
     hooks:
18 changes: 15 additions & 3 deletions config.json.example
@@ -40,8 +40,7 @@
     },
 
     "ooni": {
-        "username": "",
-        "password": ""
+        "parallel_downloads": 40
     },
 
     "iyp": {
@@ -94,7 +93,20 @@
         "iyp.crawlers.openintel.dnsgraph_nl",
         "iyp.crawlers.openintel.dnsgraph_rdns",
         "iyp.crawlers.cloudflare.dns_top_locations",
-        "iyp.crawlers.cloudflare.dns_top_ases"
+        "iyp.crawlers.cloudflare.dns_top_ases",
+        "iyp.crawlers.ooni.facebookmessenger",
+        "iyp.crawlers.ooni.httpheaderfieldmanipulation",
+        "iyp.crawlers.ooni.httpinvalidrequestline",
+        "iyp.crawlers.ooni.osignal",
+        "iyp.crawlers.ooni.psiphon",
+        "iyp.crawlers.ooni.riseupvpn",
+        "iyp.crawlers.ooni.stunreachability",
+        "iyp.crawlers.ooni.telegram",
+        "iyp.crawlers.ooni.tor",
+        "iyp.crawlers.ooni.torsf",
+        "iyp.crawlers.ooni.vanillator",
+        "iyp.crawlers.ooni.webconnectivity",
+        "iyp.crawlers.ooni.whatsapp"
     ],
 
     "post": [
91 changes: 91 additions & 0 deletions iyp/crawlers/ooni/README.md
@@ -0,0 +1,91 @@
# IYP OONI Implementation Tracker
This crawler pulls the censorship measurement data provided by the [Open
Observatory of Network Interference (OONI)](https://ooni.org/) into
IYP. OONI runs a number of tests on devices provided by volunteers;
each test has its own crawler, and the tests are listed in the table
below.

As for the implementation:

The OoniCrawler base class, which extends BaseCrawler, is defined in
`__init__.py`. Each crawler then extends the base class with its own
attributes. Common to all crawlers are the attributes `reference`,
`repo`, `dataset`, `all_asns`, `all_countries`, `all_results`,
`all_percentages`, `all_dns_resolvers`, and `unique_links`:

- `reference` and `repo` are set to OONI-specific values to identify
  the crawler and the OONI data repository.
- `dataset` needs to be set to the dataset the specific crawler pulls,
  e.g. `whatsapp`.
- `all_asns` tracks all ASNs in the dataset and is populated by the
  `process_one_line()` function.
- `all_countries` tracks all countries in the dataset and is populated
  by the `process_one_line()` function.
- `all_results` contains all results the `process_one_line()` function
  produces. Since there are crawler-specific attributes, each crawler
  extends `process_one_line()` and also modifies this variable: the
  base function runs first, and the extended crawler class then
  accesses the last result. Therefore, if you choose not to proceed
  with a given result in the extended `process_one_line()` for any
  reason, e.g. invalid parameters, be careful to `pop()` the last
  entry from `all_results`, or it will contain an invalid result.
- `all_percentages` is calculated by each crawler-specific
  `calculate_percentages()` function, which depends heavily on the
  respective OONI test implementation. See each test's GitHub page for
  that implementation.
- `all_dns_resolvers` is handled in the base OoniCrawler class to
  track DNS resolvers and add them to IYP. No changes need to be made
  in extended crawlers.
- `unique_links` is a dictionary that currently contains the following
  sets:
      'COUNTRY': set(),
      'CENSORED': set(),
      'RESOLVES_TO': set(),
      'PART_OF': set(),
      'CATEGORIZED': set(),
  If you are adding a new link type, make sure to add it to this
  dictionary. This is done to prevent link duplication stemming from
  the same crawler: e.g. if multiple result files add the same PART_OF
  relationship, the link would be duplicated if we did not track
  existing links. Whenever you create a link in the extended
  `batch_add_to_iyp()` function, add it to the corresponding
  `unique_links` set, and before you create a link, check the set for
  the existence of that link, as in the sketch after this list.
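
For illustration, here is a minimal sketch of that pattern in an
extended `batch_add_to_iyp()`. The `ExampleCrawler` class and its
`candidate_pairs` attribute are hypothetical; `unique_links`,
`reference`, and `iyp` come from the base class:

```python
from iyp.crawlers.ooni import OoniCrawler


class ExampleCrawler(OoniCrawler):  # hypothetical extended crawler

    def batch_add_to_iyp(self):
        # The base class creates the nodes and, for the webconnectivity
        # dataset, the COUNTRY links; see iyp/crawlers/ooni/__init__.py.
        super().batch_add_to_iyp()
        links = []
        for src_id, dst_id in self.candidate_pairs:  # hypothetical node-ID pairs
            if (src_id, dst_id) in self.unique_links['CENSORED']:
                continue  # skip links already created for an earlier result file
            self.unique_links['CENSORED'].add((src_id, dst_id))
            links.append(
                {'src_id': src_id, 'dst_id': dst_id, 'props': [self.reference]}
            )
        self.iyp.batch_add_links('CENSORED', links)
```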

Functions:

- Each run starts by calling the `download_and_extract()` function of
  the grabber module. This function is shared amongst all OONI
  crawlers and takes the repo, a directory, and the dataset as input.
  If you are implementing a new crawler, just set the dataset to the
  same name OONI uses and you do not need to interact with this module
  otherwise.
- Then, each line in the downloaded and extracted result files is
  processed in `process_one_line()`. This needs to be done in both the
  base and the extended class, as there are test-specific attributes
  the extended class needs to process. See the `all_results`
  description above and the comments in the `__init__.py` code for
  implementation specifics.
- `calculate_percentages()` calculates the link percentages based on
  test-specific attributes. This is done entirely in the extended
  crawler and needs to be implemented by you if you are adding a new
  crawler.
- Finally, `batch_add_to_iyp()` is called to add the results to IYP. A
  sketch of a minimal extended crawler is shown below.

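The following is a minimal sketch of an extended crawler. The dataset
name `exampletest` and the `blocking` result field are hypothetical;
the overridden hooks match the structure described above:

```python
from iyp.crawlers.ooni import OoniCrawler


class Crawler(OoniCrawler):

    def __init__(self, organization, url, name):
        # 'exampletest' is a hypothetical dataset name; use the exact
        # name OONI uses for the test you are crawling.
        super().__init__(organization, url, name, 'exampletest')

    def process_one_line(self, one_line):
        # The base class appends a (probe_asn, probe_cc, None, None) skeleton.
        super().process_one_line(one_line)
        blocking = one_line.get('test_keys', {}).get('blocking')  # hypothetical field
        if blocking is None:
            # Discarding this result: pop the skeleton added by the base class.
            self.all_results.pop()
            return
        # Fill in the dummy fields of the last entry.
        self.all_results[-1] = self.all_results[-1][:2] + (blocking, None)

    def calculate_percentages(self):
        # Test-specific aggregation into self.all_percentages (omitted here).
        ...
```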

| Test Name | Implementation Tracker | GitHub URL |
|------------------------------------------|------------------------------|---------------------------------------------------------------------------------------------------------------|
| Dash (Video Performance Test) | X - Won’t Fix | [Dash Test](https://github.com/ooni/spec/blob/master/nettests/ts-016-dash.md) |
| DNS Check (new DNS Test)                  | ? - All Results Fail, Results Questionable | [DNS Check](https://github.com/ooni/spec/blob/master/nettests/ts-020-dns-check.md) |
| Facebook Messenger | O - Done | [Facebook Messenger](https://github.com/ooni/spec/blob/master/nettests/ts-023-facebook-messenger.md) |
| HTTP Header Field Manipulation | O - Done | [HTTP Header Field Manipulation](https://github.com/ooni/spec/blob/master/nettests/ts-012-http-header-field.md)|
| HTTP Invalid Requestline | O - Done | [HTTP Invalid Requestline](https://github.com/ooni/spec/blob/master/nettests/ts-011-http-invalid-requestline.md)|
| NDT (Speed Test) | X - Won’t Fix | [NDT Test](https://github.com/ooni/spec/blob/master/nettests/ts-022-ndt.md) |
| Psiphon (Censorship Circumvention VPN) | O - Done | [Psiphon Test](https://github.com/ooni/spec/blob/master/nettests/ts-007-psiphon.md) |
| RiseUp VPN | O - Done | [RiseUp VPN](https://github.com/ooni/spec/blob/master/nettests/ts-019-riseup-vpn.md) |
| Run | ??? | [OONI Run](https://github.com/ooni/run) |
| Signal | O - Done | [Signal Test](https://github.com/ooni/spec/blob/master/nettests/ts-018-signal.md) |
| STUN Reachability | O - Done | [STUN Reachability](https://github.com/ooni/spec/blob/master/nettests/ts-021-stun-reachability.md) |
| Telegram | O - Done | [Telegram Test](https://github.com/ooni/spec/blob/master/nettests/ts-009-telegram.md) |
| TOR | O - Done | [TOR Test](https://github.com/ooni/spec/blob/master/nettests/ts-001-tor.md) |
| TORSF | O - Done | [TORSF Test](https://github.com/ooni/spec/blob/master/nettests/ts-014-torsf.md) |
| Vanilla TOR | O - Done | [Vanilla TOR](https://github.com/ooni/spec/blob/master/nettests/ts-002-vanilla-tor.md) |
| Webconnectivity | O - Done | [Webconnectivity](https://github.com/ooni/spec/blob/master/nettests/ts-017-web-connectivity.md) |
| Whatsapp | O - Done | [WhatsApp Test](https://github.com/ooni/spec/blob/master/nettests/ts-010-whatsapp.md) |
135 changes: 135 additions & 0 deletions iyp/crawlers/ooni/__init__.py
@@ -0,0 +1,135 @@
import ipaddress
import json
import logging
import os

from iyp import BaseCrawler
from iyp.crawlers.ooni.utils import grabber


# OONI Crawler base class
class OoniCrawler(BaseCrawler):

    def __init__(self, organization, url, name, dataset):
        """OoniCrawler initialization requires the dataset name."""
        super().__init__(organization, url, name)
        self.reference['reference_url_info'] = 'https://ooni.org/post/mining-ooni-data'
        self.repo = 'ooni-data-eu-fra'
        self.dataset = dataset
        self.all_asns = set()
        self.all_countries = set()
        self.all_results = list()
        self.all_percentages = list()
        self.all_dns_resolvers = set()
        self.unique_links = {
            'COUNTRY': set(),
            'CENSORED': set(),
            'RESOLVES_TO': set(),
            'PART_OF': set(),
            'CATEGORIZED': set(),
        }

    def run(self):
        """Fetch data and push to IYP."""

        # Create a temporary directory.
        tmpdir = self.create_tmp_dir()

        # Fetch data.
        grabber.download_and_extract(self.repo, tmpdir, self.dataset)
        logging.info('Successfully downloaded and extracted all files.')
        # Now that we have downloaded the jsonl files for the test we want, we
        # can extract the data we want.
        logging.info('Processing files...')
        for entry in os.scandir(tmpdir):
            if not entry.is_file() or not entry.name.endswith('.jsonl'):
                continue
            file_path = os.path.join(tmpdir, entry.name)
            with open(file_path, 'r') as file:
                for line in file:
                    data = json.loads(line)
                    self.process_one_line(data)
        logging.info('Calculating percentages...')
        self.calculate_percentages()
        logging.info('Adding entries to IYP...')
        self.batch_add_to_iyp()
        logging.info('Done.')

    def process_one_line(self, one_line):
        """Process a single line from the jsonl file and store the results locally."""

        # Extract the ASN; raise an exception if it is malformed.
        probe_asn = (
            int(one_line['probe_asn'][2:])
            if one_line.get('probe_asn', '').startswith('AS')
            else (_ for _ in ()).throw(Exception('Invalid ASN'))
        )

        # Add the DNS resolver to the set, unless it is not a valid IP address.
        try:
            self.all_dns_resolvers.add(
                ipaddress.ip_address(one_line.get('resolver_ip'))
            )
        except ValueError:
            pass
        probe_cc = one_line.get('probe_cc')

        # Append the results to the list.
        self.all_asns.add(probe_asn)
        self.all_countries.add(probe_cc)
        self.all_results.append((probe_asn, probe_cc, None, None))
        """The base function appends a skeleton entry to the all_results list,
        containing the probe_asn and the probe_cc plus two dummy fields.

        Each extended crawler then modifies this entry by accessing the
        latest entry via self.all_results[-1] (e.g. keeping the first two
        fields with [:2]) and filling in the unpopulated fields. Adding
        more fields (i.e. more than 4) is also possible, as is using
        fewer; in that case, only fill in the third field. Attention: if
        you are discarding a result in the extended class, you need to
        make sure you specifically pop() the entry created here in the
        base class, or you WILL end up with malformed entries that only
        contain the probe_asn and probe_cc, and mess up your data.
        """

    def batch_add_to_iyp(self):
        """Add the results to the IYP."""
        country_links = []

        # First, add the nodes and store their IDs directly as returned
        # dictionaries.
        self.node_ids = {
            'asn': self.iyp.batch_get_nodes_by_single_prop(
                'AS', 'asn', self.all_asns, all=False
            ),
            'country': self.iyp.batch_get_nodes_by_single_prop(
                'Country', 'country_code', self.all_countries
            ),
            'dns_resolver': self.iyp.batch_get_nodes_by_single_prop(
                'IP', 'ip', self.all_dns_resolvers, all=False
            ),
        }
        # To avoid duplication of country links, we only add them from the
        # webconnectivity dataset.
        if self.dataset == 'webconnectivity':
            for entry in self.all_results:
                asn, country = entry[:2]
                asn_id = self.node_ids['asn'].get(asn)
                country_id = self.node_ids['country'].get(country)

                # Check if the COUNTRY link is unique.
                if (asn_id, country_id) not in self.unique_links['COUNTRY']:
                    self.unique_links['COUNTRY'].add((asn_id, country_id))
                    country_links.append(
                        {
                            'src_id': asn_id,
                            'dst_id': country_id,
                            'props': [self.reference],
                        }
                    )
            self.iyp.batch_add_links('COUNTRY', country_links)

        # Batch add node labels.
        self.iyp.batch_add_node_label(
            list(self.node_ids['dns_resolver'].values()), 'Resolver'
        )