This project is not designed for deployment; however, it will set up a local database to be used in development.
You must have Node.js and npm installed. To check whether they are already installed, and which versions you have, run the following commands:
node -v
npm -v
Clone the repo
git clone https://github.com/enBloc-org/Scraper.git
Install all the dependencies with the command below.
npm install
Enter your environment variables in .env:
BASE_URL=
COOKIE=
DELAY=
STATE_LIST=
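A minimal sketch of how these variables might be read at startup, assuming the project loads them with the dotenv package. The meaning of DELAY (milliseconds between requests) and STATE_LIST (a comma-separated list of states) is an assumption:

```js
// Sketch only: assumes dotenv is used to load .env into process.env.
require('dotenv').config();

// Fail fast if any of the four documented variables is missing.
for (const name of ['BASE_URL', 'COOKIE', 'DELAY', 'STATE_LIST']) {
  if (!process.env[name]) {
    throw new Error(`Missing environment variable ${name}`);
  }
}

const config = {
  baseUrl: process.env.BASE_URL,               // root URL of the source site
  cookie: process.env.COOKIE,                  // session cookie sent with each request
  delay: Number(process.env.DELAY),            // pause between requests (assumed to be milliseconds)
  stateList: process.env.STATE_LIST
    .split(',')
    .map((state) => state.trim()),             // assumed comma-separated list of states
};
```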
The code is designed to be run in five phases:
- Crawl the source page to collect all data endpoints needed
- Download the base64 files available at each endpoint
- Convert each file to PDF
- Validate the PDF files to ensure they can be scraped
- Scrape the data from the validated PDF files
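For orientation, the sketch below chains the five phases together using the commands documented in the sections that follow. It is a convenience wrapper written for this README, not a script that ships with the repo:

```js
// Hypothetical one-shot runner for the five phases, in the documented order.
const { execSync } = require('child_process');

const commands = [
  'npm run crawl',         // phase 1: collect all data endpoints
  'npm run download',      // phase 2: download the base64 files
  'npm run convert',       // phase 3: convert each file to PDF
  'node validateFiles.js', // phase 4: validate the PDF files
  'npm run scrape',        // phase 5: scrape the validated PDFs
];

for (const command of commands) {
  console.log(`Running: ${command}`);
  // stdio: 'inherit' streams each phase's output; execSync throws if a phase fails,
  // which stops the pipeline.
  execSync(command, { stdio: 'inherit' });
}
```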
crawler.js
will navigate through all the necessary endpoints to assemble a large JSON file that the scraper can iterate through.
This file needs to be run only once, during setup. It saves the recovered JSON file to a local database, which is the basis the scraper works from. If you need access to all endpoints and are setting up your environment now, complete the installation and then run this command in the terminal:
npm run crawl
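The real crawl logic lives in crawler.js; the sketch below only illustrates the general shape of this phase, walking the configured states and writing one JSON file of endpoints. The endpoint path, response fields, and output file name are hypothetical, and it assumes Node 18+ for the global fetch:

```js
// Illustrative sketch only: not the repo's crawler.js.
const fs = require('fs/promises');
require('dotenv').config();

async function crawl() {
  const states = process.env.STATE_LIST.split(',').map((s) => s.trim());
  const endpoints = [];

  for (const state of states) {
    // Hypothetical listing endpoint for each state.
    const response = await fetch(`${process.env.BASE_URL}/schools?state=${state}`, {
      headers: { Cookie: process.env.COOKIE },
    });
    const schools = await response.json();
    for (const school of schools) {
      endpoints.push({ state, id: school.id, url: school.documentUrl });
    }
  }

  // One large JSON file for the later phases to iterate over.
  await fs.writeFile('states.json', JSON.stringify(endpoints, null, 2));
}

crawl().catch(console.error);
```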
downloader.js
will run through the assembled states file and check how many files are available for download at each endpoint. Finally, it will download and save the base64 files:
npm run download
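A minimal sketch of this phase, assuming the crawl phase produced a states.json file of endpoint records; the directory name, file naming, and record shape are assumptions, and it assumes Node 18+ for the global fetch:

```js
// Illustrative sketch only: not the repo's downloader.js.
const fs = require('fs/promises');
require('dotenv').config();

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function download() {
  const endpoints = JSON.parse(await fs.readFile('states.json', 'utf8'));
  await fs.mkdir('base64_downloads', { recursive: true });

  for (const endpoint of endpoints) {
    const response = await fetch(endpoint.url, {
      headers: { Cookie: process.env.COOKIE },
    });
    const base64 = await response.text();
    await fs.writeFile(`base64_downloads/${endpoint.id}.txt`, base64);

    // Respect the configured DELAY between requests.
    await delay(Number(process.env.DELAY));
  }
}

download().catch(console.error);
```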
converter.js
will monitor the directory to which all the base64 files have been downloaded and convert each one into a PDF file:
npm run convert
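The core of this phase is decoding base64 text into PDF bytes. The sketch below does a single pass over the download directory rather than watching it, and the directory names are assumptions:

```js
// Illustrative sketch only: not the repo's converter.js.
const fs = require('fs/promises');
const path = require('path');

async function convertAll(inputDir = 'base64_downloads', outputDir = 'pdf_downloads') {
  await fs.mkdir(outputDir, { recursive: true });

  for (const fileName of await fs.readdir(inputDir)) {
    const base64 = await fs.readFile(path.join(inputDir, fileName), 'utf8');
    // Decode the base64 payload into raw PDF bytes.
    const pdfBytes = Buffer.from(base64, 'base64');
    const pdfName = `${path.parse(fileName).name}.pdf`;
    await fs.writeFile(path.join(outputDir, pdfName), pdfBytes);
  }
}

convertAll().catch(console.error);
```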
Run node validateFiles.js. This will check the files and move any that are corrupted or in an unreadable format to corrupted_downloads.
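One way such a check can work is sketched below: a file is treated as corrupted if it does not start with the "%PDF" magic bytes and is moved into corrupted_downloads. The check itself and the source directory name are assumptions, not the repo's actual validation rules:

```js
// Illustrative sketch only: not the repo's validateFiles.js.
const fs = require('fs/promises');
const path = require('path');

async function validate(pdfDir = 'pdf_downloads', corruptedDir = 'corrupted_downloads') {
  await fs.mkdir(corruptedDir, { recursive: true });

  for (const fileName of await fs.readdir(pdfDir)) {
    const filePath = path.join(pdfDir, fileName);
    const header = (await fs.readFile(filePath)).subarray(0, 4).toString('ascii');

    if (header !== '%PDF') {
      // Quarantine unreadable files so the scraper never sees them.
      await fs.rename(filePath, path.join(corruptedDir, fileName));
    }
  }
}

validate().catch(console.error);
```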
The scraper runs in two parts in order to deal with the different layouts of the data on the PDFs. The first section of data is processed by the processGeneralData function; the second section is processed by the processTableData function. These functions take the PDF path passed to them inside scraperCallback.js. In this file, the scraper function is called for each file that has been downloaded and saved during the crawl, download, convert and validate processes. To trigger this process, you can run
npm run scrape
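The sketch below mirrors the flow described above: the callback is invoked once per validated PDF, calls the two processing functions, and hands their combined output to the database. The module paths, the insertSchoolData helper, and the directory name are assumptions; the real logic lives in scraperCallback.js and the two processing functions:

```js
// Illustrative sketch only: not the repo's scraperCallback.js.
const fs = require('fs/promises');
const path = require('path');

// Assumed imports; the actual modules may be organised differently.
const { processGeneralData } = require('./processGeneralData');
const { processTableData } = require('./processTableData');
const { insertSchoolData } = require('./database');

async function scraperCallback(pdfPath) {
  // First section of the PDF: general, free-form data.
  const generalData = await processGeneralData(pdfPath);
  // Second section: tabular data with a different layout.
  const tableData = await processTableData(pdfPath);

  await insertSchoolData({ ...generalData, ...tableData });
  console.log('Scraped');
}

async function run(pdfDir = 'pdf_downloads') {
  for (const fileName of await fs.readdir(pdfDir)) {
    await scraperCallback(path.join(pdfDir, fileName));
  }
}

run().catch(console.error);
```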
This will automatically create a school_data.sql database to which the data from each PDF is transferred during the scraping process. After each PDF is scraped, you will see the word 'Scraped' appear in the terminal. You will also see the following warning: Warning: fetchStandardFontData: failed to fetch file "LiberationSans-Bold.ttf" with "UnknownErrorException: The standard font "baseUrl" parameter must be specified, ensure that the "standardFontDataUrl" API parameter is provided.".
This is a known issue and will not interfere with the scraping process.
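If you do want to silence it, the warning message itself points at the fix: pass a standardFontDataUrl when opening the document with pdf.js. A hedged sketch, assuming the project uses pdfjs-dist (v3, legacy build for CommonJS) and that its bundled standard fonts are acceptable:

```js
// Sketch: open a PDF with the standard font directory supplied so pdf.js stops warning.
const fs = require('fs');
const pdfjs = require('pdfjs-dist/legacy/build/pdf.js');

async function openPdf(pdfPath) {
  const data = new Uint8Array(fs.readFileSync(pdfPath));
  return pdfjs.getDocument({
    data,
    // Point pdf.js at the fonts bundled with pdfjs-dist.
    standardFontDataUrl: 'node_modules/pdfjs-dist/standard_fonts/',
  }).promise;
}
```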