Improved readme, processing, & added more data (#6)
* Improved readme, processing, much more data

* Typo

* Disable strict

* Update process.py

* There's just too much data in that

Maybe it can be incorporated at a later date if the processing is made faster

* Formatting changes
BelleNottelling authored Dec 10, 2023
1 parent f055a02 commit 3c41ea2
Showing 8 changed files with 44 additions and 24 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/build.yml
@@ -49,8 +49,9 @@ jobs:
curl --show-error --output digitalocean-geofeed.csv --location "https://digitalocean.com/geo/google.csv"
curl --show-error --output vultr-geofeed.csv --location "https://geofeed.constant.com/"
curl --show-error --output starlink-geofeed.csv --location "https://geoip.starlinkisp.net/feed.csv"
curl --show-error --output google-geofeed.csv --location "https://www.gstatic.com/ipranges/cloud_geofeed"
#curl --show-error --output geolocatemuch-geofeed.csv --location "https://geolocatemuch.com/geofeeds/validated-all.csv"
- name: Setup Python
uses: actions/setup-python@v4
with:
22 changes: 16 additions & 6 deletions README.md
@@ -1,24 +1,29 @@
# IP Database Testing Data

This repository automatically builds both IPv4 and IPv6 information to be used for testing IP address databases (geolocation primarily).
This repository automatically builds both IPv4 and IPv6 information to be used for testing IP address databases. Because of how this data is collected, it can be considered known-good, and it may also be valuable as supplemental data when building a database.

## Data sources utilized

The data is built utilizing self-published data from various providers. No third-party data is utilized, as it is considered inherently unreliable for the purposes of this dataset.

- [Pingdom probe server data](https://www.pingdom.com/rss/probe_servers.xml)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Country Name`, `City`
- [Hetrix Monitoring IPs](https://docs.hetrixtools.com/uptime-monitoring-ip-addresses/)
- [Hetrix Monitoring IPs](https://hetrixtools.com/resources/uptime-monitor-ips.txt)
- IP address types: `IPv4`
- Data available: `Country Code`, `City`, `Subdivision Code`
  - Note: Utilizes a [hand-built mapping](https://github.com/HostByBelle/ip-db-test-data/blob/main/scripts/parse-hetrix.py#L6) between Hetrix's hostnames and their locations (see the sketch after this list).
- [Updown.io Monitoring IPs](https://updown.io/api/nodes)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `City`, `Latitude`, `Longitude`
- [AWS IP Address Ranges](https://ip-ranges.amazonaws.com/ip-ranges.json)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`
- Note: Utilizes a [hand-built mapping](https://github.com/HostByBelle/ip-db-test-data/blob/main/scripts/parse-aws.py#L6) between AWS's region IDs and their locations.
- [Oracle Cloud IP Address Ranges](https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json)
- IP address types: `IPv4`
- Data available: `Country Code`
- Note: Utilizes a [hand-built mapping](https://github.com/HostByBelle/ip-db-test-data/blob/main/scripts/parse-oracle.py#L5) between Oracle's region IDs and their locations.
- [Linode Geofeed](https://geoip.linode.com/)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Subdivision Code`, `City Name`, `Postal Code`
@@ -31,8 +36,11 @@ This repository automatically builds both IPv4 and IPv6 information to be used f
- [Starlink Geofeed](https://geoip.starlinkisp.net/feed.csv)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Subdivision Code`, `City Name`
- [Google Cloud Geofeed](https://www.gstatic.com/ipranges/cloud_geofeed)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Subdivision Code`, `City Name`
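
Where a provider publishes no machine-readable locations, the parse scripts resolve provider-specific identifiers through manually curated dictionaries. The sketch below illustrates the general approach; the hostnames and locations in it are invented for illustration and are not taken from the actual mappings.

```python
# Hypothetical mapping in the style of wk_mapping in parse-hetrix.py:
# a curated dict resolving provider identifiers to locations.
location_mapping = {
    'wk1': {'country_code': 'US', 'subdivision_code': 'US-NY', 'city': 'New York'},
    'wk2': {'country_code': 'DE', 'subdivision_code': 'DE-HE', 'city': 'Frankfurt'},
}

def locate(hostname: str):
    # Hostnames are assumed to look like "wk1-monitor.example.com";
    # the prefix before the first "-" selects the mapping entry.
    prefix = hostname.split('-', 1)[0]
    return location_mapping.get(prefix)

print(locate('wk1-monitor.example.com'))  # -> the New York entry
```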

### Data Processing
## Data Processing

Each release will undergo a final "processing" step to ensure the generated data is of good quality.
The order of processing is as follows:
@@ -42,8 +50,10 @@ The order of processing is as follows:
- Only the first instance of a CIDR will be retained in the final data source.
2. The de-duplicated list is then sorted in descending order by the quantity of IP addresses in each CIDR.
3. Any CIDRs which are private networks are discarded.
4. Any 3-letter country codes are converted to 2-letter country codes.
5. Next, all CIDRs are looped through and compared against previous CIDRs to identify any overlaps / subnets.
4. Any CIDRs which have no data associated with them are discarded.
5. Any 3-letter country codes are converted to 2-letter country codes.
6. Next, all CIDRs are looped through and compared against previous CIDRs to identify any overlaps / subnets.
   - A subnet is retained, and any of its data that differs from the parent (supernet) network is considered valid.
   - Any overlapping CIDRs are currently discarded with a message.
6. The final dataset after processing is written to the JSON file and then uploaded to the release.
   - If a subnet has identical information to its supernet, it is removed from the dataset.
7. The final dataset after processing is written to the JSON file and then uploaded to the release.
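
A condensed sketch of steps 1 through 4 follows, assuming entries shaped like the parser output (a dict with an `ip_range` CIDR plus location fields); the function and variable names are illustrative, not the actual process.py.

```python
import ipaddress

def preprocess(entries):
    # Step 1: keep only the first occurrence of each CIDR.
    seen, deduped = set(), []
    for entry in entries:
        if entry['ip_range'] not in seen:
            seen.add(entry['ip_range'])
            deduped.append(entry)

    # Step 2: sort in descending order by the number of addresses per CIDR.
    deduped.sort(
        key=lambda e: ipaddress.ip_network(e['ip_range'], strict=False).num_addresses,
        reverse=True,
    )

    # Steps 3 & 4: discard private networks and CIDRs carrying no data.
    result = []
    for entry in deduped:
        network = ipaddress.ip_network(entry['ip_range'], strict=False)
        has_data = any(k != 'ip_range' for k in entry)
        if not network.is_private and has_data:
            result.append(entry)
    return result
```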
10 changes: 5 additions & 5 deletions scripts/parse-geofeed.py
@@ -18,7 +18,7 @@ def parse(geofeed_csv, json_file, ipver):

for row in csv_reader:
if not row[0].startswith('#'):
network = ipaddress.ip_network(row[0])
network = ipaddress.ip_network(row[0], strict=False)
if (ipver == 'ip' and network.version == 4) or (ipver == 'ipv6' and network.version == 6):
data_list.append({
'ip_range': row[0],
@@ -32,11 +32,11 @@ def parse(geofeed_csv, json_file, ipver):
with open(json_file, 'w', encoding='utf-8') as json_file:
json.dump(data_list, json_file, indent=4, ensure_ascii=False)

if __name__ == "__main__":
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("geofeed_csv", help="path to the RFC 8805 Geofeed CSV file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ipver", help="IP version (ip or ipv6)")
parser.add_argument('geofeed_csv', help='path to the RFC 8805 Geofeed CSV file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ipver', help='IP version (ip or ipv6)')
args = parser.parse_args()

parse(args.geofeed_csv, args.json_file, args.ipver)
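
For context on the `strict=False` change (the "Disable strict" item in the commit message), here is a quick illustration of the behavior difference; it uses only the standard library and is not part of the repository.

```python
import ipaddress

# With the default strict=True, a CIDR whose host bits are set is rejected,
# so a slightly sloppy geofeed row would crash the parser.
try:
    ipaddress.ip_network('192.0.2.1/24')
except ValueError as err:
    print(err)  # "192.0.2.1/24 has host bits set"

# With strict=False, the host bits are masked off instead.
print(ipaddress.ip_network('192.0.2.1/24', strict=False))  # 192.0.2.0/24
```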
2 changes: 2 additions & 0 deletions scripts/parse-hetrix.py
@@ -3,6 +3,8 @@
import json
import ipaddress

# Each "wk*-" hostname is associated with a location, and this mapping was manually built utilizing their documentation
# See https://docs.hetrixtools.com/uptime-monitoring-ip-addresses/ and https://hetrixtools.com/resources/uptime-monitor-ips.txt
wk_mapping = {
'wk1': {
'country_code': 'US',
8 changes: 4 additions & 4 deletions scripts/parse-pingdom.py
@@ -47,12 +47,12 @@ def parse_xml(xml_file, ip_type, json_file):

def main():
parser = argparse.ArgumentParser()
parser.add_argument("xml_file", help="path to XML file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ip_type", help="type of IP address")
parser.add_argument('xml_file', help='path to XML file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ip_type', help='type of IP address')
args = parser.parse_args()

parse_xml(args.xml_file, args.ip_type, args.json_file)

if __name__ == "__main__":
if __name__ == '__main__':
main()
8 changes: 4 additions & 4 deletions scripts/parse-statuscake.py
@@ -30,11 +30,11 @@ def parse(updown_data, json_file, ipver):
with open(json_file, 'w', encoding='utf-8') as json_file:
json.dump(data_list, json_file, indent=4, ensure_ascii=False)

if __name__ == "__main__":
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("updown_data", help="path to the statuscake JSON file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ipver", help="IP version (ip or ipv6)")
parser.add_argument('updown_data', help='path to the statuscake JSON file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ipver', help='IP version (ip or ipv6)')
args = parser.parse_args()

parse(args.updown_data, args.json_file, args.ipver)
6 changes: 3 additions & 3 deletions scripts/parse-updown.py
@@ -33,9 +33,9 @@ def parse(updown_data, json_file, ipver):

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("updown_data", help="path to the updown JSON file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ipver", help="IP version (ip or ip6)")
parser.add_argument('updown_data', help='path to the updown JSON file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ipver', help='IP version (ip or ip6)')
args = parser.parse_args()

parse(args.updown_data, args.json_file, args.ipver)
9 changes: 8 additions & 1 deletion scripts/process.py
@@ -66,6 +66,13 @@ def process(json_file):
result = []

for entry in ip_data_list:

# Make a copy of the entry, remove the IP range, and then check if it evaluates to false.
# If it does, that IP range has no data associated with it and can be discarded.
entry_copy = entry.copy()
del entry_copy['ip_range']
if not entry_copy:
continue

ip_range = entry['ip_range']

# Convert IP range to ipaddress object
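
As an aside, the emptiness check added above leans on empty dicts being falsy in Python; a standalone illustration (not repository code):

```python
entry = {'ip_range': '198.51.100.0/24'}  # a range with no location data
entry_copy = entry.copy()
del entry_copy['ip_range']
print(bool(entry_copy))  # False -> this range would be discarded
```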
@@ -85,7 +92,7 @@
was_in_subnet = False

for kept_entry in result:
existing_range = ipaddress.ip_network(kept_entry['ip_range'])
existing_range = ipaddress.ip_network(kept_entry['ip_range'], strict=False)
if ip_network.subnet_of(existing_range):
test_data_1 = entry.copy()
test_data_2 = kept_entry.copy()
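
The comparison this hunk feeds into matches step 6 in the README; below is a rough sketch of the idea with an invented helper name, assuming entries of the form `{'ip_range': ..., <location fields>}`. It is not the actual process.py logic.

```python
import ipaddress

def is_redundant_subnet(entry, kept_entry):
    # Sketch only: a subnet is redundant when it sits inside an already
    # kept network and carries exactly the same data as that supernet.
    child = ipaddress.ip_network(entry['ip_range'], strict=False)
    parent = ipaddress.ip_network(kept_entry['ip_range'], strict=False)
    if child.version != parent.version or not child.subnet_of(parent):
        return False
    strip = lambda e: {k: v for k, v in e.items() if k != 'ip_range'}
    return strip(entry) == strip(kept_entry)
```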
