This repository is part of a personal project. I'm exploring the price transparency data posted by large insurance companies.
Insurance companies must post transparency data on their websites, but they don't make it easy to retrieve the data in aggregate.
collectRawDataURLs.py is a script that retrieves the URLs where the data lives. It collects thousands of URLs, each pointing to a .json.gz file.
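To illustrate the idea of URL collection, here is a minimal sketch, not the actual contents of collectRawDataURLs.py. It assumes an insurer's index document has already been downloaded and parsed as JSON, and it simply walks that structure looking for links whose path ends in .json.gz. The function name `collect_data_urls` and the shape of the index are hypothetical.

```python
from urllib.parse import urlparse


def collect_data_urls(node, suffix=".json.gz"):
    """Return every URL string inside a parsed JSON value whose path ends with `suffix`.

    Hypothetical helper: it makes no assumption about key names and just
    walks dicts and lists recursively.
    """
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(collect_data_urls(value, suffix))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_data_urls(item, suffix))
    elif isinstance(node, str):
        try:
            parsed = urlparse(node)
        except ValueError:
            # Not a well-formed URL, so skip it.
            return found
        # Compare against the path only, so query strings (e.g. signed
        # download links) do not hide the file extension.
        if parsed.scheme in ("http", "https") and parsed.path.endswith(suffix):
            found.append(node)
    return found
```

Matching on the URL path rather than the whole string is a deliberate choice here, since download links may carry query parameters after the file name.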
parseRecords.py is a tool that decompresses and parses the large files themselves.
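As a rough sketch of the decompress-and-parse step (again, not the actual parseRecords.py), the standard library can read a .json.gz file without writing an uncompressed copy to disk first. The function names below are hypothetical. Note that `json.load` builds the whole document in memory, so this approach is only an assumption-laden starting point; the largest files would likely need a streaming parser instead.

```python
import gzip
import json


def load_gzipped_json(path):
    """Decompress a .json.gz file on the fly and return the parsed JSON value."""
    # "rt" opens the gzip stream in text mode so json.load can read it directly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def top_level_keys(path):
    """Return the sorted top-level keys of a gzipped JSON object (hypothetical helper)."""
    document = load_gzipped_json(path)
    if not isinstance(document, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(document)
```

Listing the top-level keys first is a cheap way to check whether a given file follows the recommended schema before parsing it in depth.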
Each file is built around a schema recommended, but not required, by the Centers for Medicare & Medicaid Services (CMS). Here's a diagram of part of it, generated by JSON Crack.
Figure 1 - Part of the extremely complex schema defined by CMS for JSON files containing price transparency data.