Improved readme, processing, & added more data (#6)
* Improved readme, processing, much more data

* Typo

* Disable strict

* Update process.py

* There's just too much data in that

Maybe it can be incorporated at a later date if the processing is made faster

* Formatting changes
BelleNottelling authored Dec 10, 2023
1 parent f055a02 commit 3c41ea2
Showing 8 changed files with 44 additions and 24 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/build.yml
@@ -49,8 +49,9 @@ jobs:
curl --show-error --output digitalocean-geofeed.csv --location "https://digitalocean.com/geo/google.csv"
curl --show-error --output vultr-geofeed.csv --location "https://geofeed.constant.com/"
curl --show-error --output starlink-geofeed.csv --location "https://geoip.starlinkisp.net/feed.csv"
curl --show-error --output google-geofeed.csv --location "https://www.gstatic.com/ipranges/cloud_geofeed"
#curl --show-error --output geolocatemuch-geofeed.csv --location "https://geolocatemuch.com/geofeeds/validated-all.csv"
- name: Setup Python
uses: actions/setup-python@v4
with:
22 changes: 16 additions & 6 deletions README.md
@@ -1,24 +1,29 @@
# IP Database Testing Data

This repository automatically builds both IPv4 and IPv6 information to be used for testing IP address databases (geolocation primarily).
This repository automatically builds both IPv4 and IPv6 information to be used for testing IP address databases. Because of how this data is collected, it can be considered known-good, and it may also be valuable as supplemental data when building a database.

## Data sources utilized

The data is built utilizing self-published data from various providers. No third-party data is utilized, as it is considered inherently unreliable for the purposes of this dataset.

- [Pingdom probe server data](https://www.pingdom.com/rss/probe_servers.xml)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Country Name`, `City`
- [Hetrix Monitoring IPs](https://docs.hetrixtools.com/uptime-monitoring-ip-addresses/)
- [Hetrix Monitoring IPs](https://hetrixtools.com/resources/uptime-monitor-ips.txt)
- IP address types: `IPv4`
- Data available: `Country Code`, `City`, `Subdivision Code`
  - Note: Utilizes a [hand-built mapping](https://github.com/HostByBelle/ip-db-test-data/blob/main/scripts/parse-hetrix.py#L6) between Hetrix's hostnames and their locations (see the sketch after this list).
- [Updown.io Monitoring IPs](https://updown.io/api/nodes)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `City`, `Latitude`, `Longitude`
- [AWS IP Address Ranges](https://ip-ranges.amazonaws.com/ip-ranges.json)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`
- Note: Utilizes a [hand-built mapping](https://github.com/HostByBelle/ip-db-test-data/blob/main/scripts/parse-aws.py#L6) between AWS's region IDs and their locations.
- [Oracle Cloud IP Address Ranges](https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json)
- IP address types: `IPv4`
- Data available: `Country Code`
- Note: Utilizes a [hand-built mapping](https://github.com/HostByBelle/ip-db-test-data/blob/main/scripts/parse-oracle.py#L5) between Oracle's region IDs and their locations.
- [Linode Geofeed](https://geoip.linode.com/)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Subdivision Code`, `City Name`, `Postal Code`
@@ -31,8 +36,11 @@ This repository automatically builds both IPv4 and IPv6 information to be used f
- [Starlink Geofeed](https://geoip.starlinkisp.net/feed.csv)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Subdivision Code`, `City Name`
- [Google Cloud Geofeed](https://www.gstatic.com/ipranges/cloud_geofeed)
- IP address types: `IPv4`, `IPv6`
- Data available: `Country Code`, `Subdivision Code`, `City Name`
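
Where a provider publishes no machine-readable locations, the parse scripts resolve provider-specific identifiers through manually curated dictionaries. The sketch below illustrates the general approach; the hostnames and locations in it are invented for illustration and are not taken from the actual mappings.

```python
# Hypothetical mapping in the style of wk_mapping in parse-hetrix.py:
# a curated dict resolving provider identifiers to locations.
location_mapping = {
    'wk1': {'country_code': 'US', 'subdivision_code': 'US-NY', 'city': 'New York'},
    'wk2': {'country_code': 'DE', 'subdivision_code': 'DE-HE', 'city': 'Frankfurt'},
}

def locate(hostname: str):
    # Hostnames are assumed to look like "wk1-monitor.example.com";
    # the prefix before the first "-" selects the mapping entry.
    prefix = hostname.split('-', 1)[0]
    return location_mapping.get(prefix)

print(locate('wk1-monitor.example.com'))  # -> the New York entry
```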

### Data Processing
## Data Processing

Each release will undergo a final "processing" step to ensure the generated data is of good quality.
The order of processing is as follows:
@@ -42,8 +50,10 @@ The order of processing is as follows:
- Only the first instance of a CIDR will be retained in the final data source.
2. The de-duplicated list is then sorted in descending order by the quantity of IP addresses in each CIDR.
3. Any CIDRs which are private networks are discarded.
4. Any 3-letter country codes are converted to 2-letter country codes.
5. Next, all CIDRs are looped through and compared against previous CIDRs to identify any overlaps / subnets.
4. Any CIDRs which have no data associated with them are discarded.
5. Any 3-letter country codes are converted to 2-letter country codes.
6. Next, all CIDRs are looped through and compared against previous CIDRs to identify any overlaps / subnets.
   - A subnet is retained, and any of its data that differs from the parent (supernet) network is considered valid.
   - Any overlapping CIDRs are currently discarded with a message.
6. The final dataset after processing is written to the JSON file and then uploaded to the release.
   - If a subnet has identical information to its supernet, it is removed from the dataset.
7. The final dataset after processing is written to the JSON file and then uploaded to the release.
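
A condensed sketch of steps 1 through 4 follows, assuming entries shaped like the parser output (a dict with an `ip_range` CIDR plus location fields); the function and variable names are illustrative, not the actual process.py.

```python
import ipaddress

def preprocess(entries):
    # Step 1: keep only the first occurrence of each CIDR.
    seen, deduped = set(), []
    for entry in entries:
        if entry['ip_range'] not in seen:
            seen.add(entry['ip_range'])
            deduped.append(entry)

    # Step 2: sort in descending order by the number of addresses per CIDR.
    deduped.sort(
        key=lambda e: ipaddress.ip_network(e['ip_range'], strict=False).num_addresses,
        reverse=True,
    )

    # Steps 3 & 4: discard private networks and CIDRs carrying no data.
    result = []
    for entry in deduped:
        network = ipaddress.ip_network(entry['ip_range'], strict=False)
        has_data = any(k != 'ip_range' for k in entry)
        if not network.is_private and has_data:
            result.append(entry)
    return result
```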
10 changes: 5 additions & 5 deletions scripts/parse-geofeed.py
@@ -18,7 +18,7 @@ def parse(geofeed_csv, json_file, ipver):

for row in csv_reader:
if not row[0].startswith('#'):
network = ipaddress.ip_network(row[0])
network = ipaddress.ip_network(row[0], strict=False)
if (ipver == 'ip' and network.version == 4) or (ipver == 'ipv6' and network.version == 6):
data_list.append({
'ip_range': row[0],
@@ -32,11 +32,11 @@ def parse(geofeed_csv, json_file, ipver):
with open(json_file, 'w', encoding='utf-8') as json_file:
json.dump(data_list, json_file, indent=4, ensure_ascii=False)

if __name__ == "__main__":
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("geofeed_csv", help="path to the RFC 8805 Geofeed CSV file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ipver", help="IP version (ip or ipv6)")
parser.add_argument('geofeed_csv', help='path to the RFC 8805 Geofeed CSV file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ipver', help='IP version (ip or ipv6)')
args = parser.parse_args()

parse(args.geofeed_csv, args.json_file, args.ipver)
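
For context on the `strict=False` change (the "Disable strict" item in the commit message), here is a quick illustration of the behavior difference; it uses only the standard library and is not part of the repository.

```python
import ipaddress

# With the default strict=True, a CIDR whose host bits are set is rejected,
# so a slightly sloppy geofeed row would crash the parser.
try:
    ipaddress.ip_network('192.0.2.1/24')
except ValueError as err:
    print(err)  # "192.0.2.1/24 has host bits set"

# With strict=False, the host bits are masked off instead.
print(ipaddress.ip_network('192.0.2.1/24', strict=False))  # 192.0.2.0/24
```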
2 changes: 2 additions & 0 deletions scripts/parse-hetrix.py
@@ -3,6 +3,8 @@
import json
import ipaddress

# Each "wk*-" hostname is associated with a location, and this mapping was manually built utilizing their documentation
# See https://docs.hetrixtools.com/uptime-monitoring-ip-addresses/ and https://hetrixtools.com/resources/uptime-monitor-ips.txt
wk_mapping = {
'wk1': {
'country_code': 'US',
8 changes: 4 additions & 4 deletions scripts/parse-pingdom.py
@@ -47,12 +47,12 @@ def parse_xml(xml_file, ip_type, json_file):

def main():
parser = argparse.ArgumentParser()
parser.add_argument("xml_file", help="path to XML file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ip_type", help="type of IP address")
parser.add_argument('xml_file', help='path to XML file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ip_type', help='type of IP address')
args = parser.parse_args()

parse_xml(args.xml_file, args.ip_type, args.json_file)

if __name__ == "__main__":
if __name__ == '__main__':
main()
8 changes: 4 additions & 4 deletions scripts/parse-statuscake.py
@@ -30,11 +30,11 @@ def parse(updown_data, json_file, ipver):
with open(json_file, 'w', encoding='utf-8') as json_file:
json.dump(data_list, json_file, indent=4, ensure_ascii=False)

if __name__ == "__main__":
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("updown_data", help="path to the statuscake JSON file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ipver", help="IP version (ip or ipv6)")
parser.add_argument('updown_data', help='path to the statuscake JSON file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ipver', help='IP version (ip or ipv6)')
args = parser.parse_args()

parse(args.updown_data, args.json_file, args.ipver)
6 changes: 3 additions & 3 deletions scripts/parse-updown.py
@@ -33,9 +33,9 @@ def parse(updown_data, json_file, ipver):

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("updown_data", help="path to the updown JSON file")
parser.add_argument("json_file", help="path to output JSON file")
parser.add_argument("ipver", help="IP version (ip or ip6)")
parser.add_argument('updown_data', help='path to the updown JSON file')
parser.add_argument('json_file', help='path to output JSON file')
parser.add_argument('ipver', help='IP version (ip or ip6)')
args = parser.parse_args()

parse(args.updown_data, args.json_file, args.ipver)
9 changes: 8 additions & 1 deletion scripts/process.py
@@ -66,6 +66,13 @@ def process(json_file):
result = []

for entry in ip_data_list:

# Make a copy of the entry, remove the IP range, and then check if it evaluates to false.
# If it does, that IP range has no data associated with it and can be discarded.
entry_copy = entry.copy()
del entry_copy['ip_range']
if not entry_copy:
continue

ip_range = entry['ip_range']

# Convert IP range to ipaddress object
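
As an aside, the emptiness check added above leans on empty dicts being falsy in Python; a standalone illustration (not repository code):

```python
entry = {'ip_range': '198.51.100.0/24'}  # a range with no location data
entry_copy = entry.copy()
del entry_copy['ip_range']
print(bool(entry_copy))  # False -> this range would be discarded
```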
@@ -85,7 +92,7 @@
was_in_subnet = False

for kept_entry in result:
existing_range = ipaddress.ip_network(kept_entry['ip_range'])
existing_range = ipaddress.ip_network(kept_entry['ip_range'], strict=False)
if ip_network.subnet_of(existing_range):
test_data_1 = entry.copy()
test_data_2 = kept_entry.copy()
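
The comparison this hunk feeds into matches step 6 in the README; below is a rough sketch of the idea with an invented helper name, assuming entries of the form `{'ip_range': ..., <location fields>}`. It is not the actual process.py logic.

```python
import ipaddress

def is_redundant_subnet(entry, kept_entry):
    # Sketch only: a subnet is redundant when it sits inside an already
    # kept network and carries exactly the same data as that supernet.
    child = ipaddress.ip_network(entry['ip_range'], strict=False)
    parent = ipaddress.ip_network(kept_entry['ip_range'], strict=False)
    if child.version != parent.version or not child.subnet_of(parent):
        return False
    strip = lambda e: {k: v for k, v in e.items() if k != 'ip_range'}
    return strip(entry) == strip(kept_entry)
```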
