
Scraper

Development Notes

This project is not designed for deployment; however, it will set up a local database for use in development.

Prerequisites

You must have Node.js and npm installed. To check whether they are installed and which versions you have, run the following commands:

node -v
npm -v

Clone the repo

git clone https://github.com/enBloc-org/Scraper.git

Install all the dependencies with the command below.

npm install

Enter your environment variables in a .env file:

BASE_URL=
COOKIE=
DELAY=
STATE_LIST=
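
As an illustration only, a filled-in .env might look like the example below. Every value shown is a placeholder rather than anything taken from the project: substitute your own source URL and session cookie, the delay between requests (assumed here to be in milliseconds), and your list of states (the comma-separated format is an assumption).

BASE_URL=https://example.org/reports
COOKIE=session=abc123
DELAY=2000
STATE_LIST=StateA,StateB,StateC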

Usage

The code is designed to be run in five phases:

  1. Crawl the source page to collect all data endpoints needed
  2. Download the base64 files available at each endpoint
  3. Convert each file to PDF
  4. Validate the PDF files to ensure they can be scraped
  5. Scrape the data from the validated PDF files
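
Each phase maps onto one of the commands described in the sections below. Purely as orientation, and assuming a conventional npm setup (the actual script-to-file mapping is not spelled out in this README), the scripts block of package.json would look roughly like this; validation is run directly with node validateFiles.js rather than through an npm script.

"scripts": {
  "crawl": "node crawler.js",
  "download": "node downloader.js",
  "convert": "node converter.js",
  "scrape": "node scraperCallback.js"
}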

Crawl

crawler.js will navigate through all the necessary endpoints to assemble a large JSON file which the scraper can iterate through.

This file only needs to be run once, on setup. It will save the recovered JSON to a local database, which is the basis the scraper works from. If you need access to all endpoints and are setting up your environment, complete the installation and then enter this command in the terminal:

npm run crawl
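
For orientation, the crawl phase boils down to something like the sketch below: fetch the listing for each state, collect the endpoints that identify each downloadable report, and persist the assembled JSON locally. The function names, response shape, use of dotenv and the Node 18+ global fetch, and the output path are all illustrative assumptions, not the project's actual code.

// crawl-sketch.js -- illustrative only; names and storage are assumptions
require('dotenv').config();
const fs = require('fs');

async function crawl() {
  const states = (process.env.STATE_LIST || '').split(',');
  const endpoints = [];

  for (const state of states) {
    // Fetch the listing for this state, reusing the session cookie
    const res = await fetch(`${process.env.BASE_URL}?state=${encodeURIComponent(state)}`, {
      headers: { Cookie: process.env.COOKIE },
    });
    const page = await res.json(); // assumes JSON; the real crawler may parse HTML instead
    for (const item of page.items ?? []) {
      endpoints.push({ state, url: item.url });
    }
  }

  // Persist the assembled endpoint list for the later phases to iterate over
  fs.writeFileSync('states.json', JSON.stringify(endpoints, null, 2));
}

crawl().catch(console.error);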

Download

downloader.js will run through the assembled states file and check how many files are available for download at each endpoint. Finally, it will download and save the available base64 files.

npm run download
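
A minimal sketch of the download phase, assuming the states file produced by the crawl is a JSON array of endpoint records, that each endpoint returns the document body as a base64 string, and that DELAY is a pause in milliseconds between requests; file locations and field names are also assumptions.

// download-sketch.js -- illustrative only
require('dotenv').config();
const fs = require('fs');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function download() {
  const endpoints = JSON.parse(fs.readFileSync('states.json', 'utf8'));
  fs.mkdirSync('downloads', { recursive: true });

  for (const [index, endpoint] of endpoints.entries()) {
    const res = await fetch(endpoint.url, {
      headers: { Cookie: process.env.COOKIE },
    });
    // Assume the endpoint responds with the document as a base64 string
    const base64 = await res.text();
    fs.writeFileSync(`downloads/file-${index}.txt`, base64);

    // Throttle requests between downloads
    await sleep(Number(process.env.DELAY));
  }
}

download().catch(console.error);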

Convert

converter.js will monitor the directory the base64 files have been downloaded to and convert each one into a PDF file.

npm run convert
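
Conceptually the conversion is just decoding each base64 payload into raw bytes and writing the result back out with a .pdf extension. The sketch below uses Node's built-in fs.watch as a stand-in for however converter.js actually monitors the directory, and the directory names are assumptions.

// convert-sketch.js -- illustrative only
const fs = require('fs');
const path = require('path');

const inputDir = 'downloads';       // assumed location of the base64 files
const outputDir = 'pdf_downloads';  // assumed destination for the PDFs
fs.mkdirSync(outputDir, { recursive: true });

fs.watch(inputDir, (eventType, filename) => {
  if (!filename || eventType !== 'rename') return; // 'rename' fires when a file appears
  const source = path.join(inputDir, filename);
  if (!fs.existsSync(source)) return;

  // Decode the base64 text into bytes and save it as a PDF
  const base64 = fs.readFileSync(source, 'utf8');
  const pdfName = filename.replace(/\.[^.]+$/, '') + '.pdf';
  fs.writeFileSync(path.join(outputDir, pdfName), Buffer.from(base64, 'base64'));
});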

Validate

Run node validateFiles.js. This will check the PDF files and move any that are corrupted or in an unreadable format to corrupted_downloads.
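
One simple way to implement such a check, sketched here as an assumption about how this could work rather than a description of validateFiles.js itself, is to confirm that each file starts with the %PDF magic bytes and move anything that does not into corrupted_downloads.

// validate-sketch.js -- illustrative only
const fs = require('fs');
const path = require('path');

const pdfDir = 'pdf_downloads';             // assumed location of the converted PDFs
const corruptedDir = 'corrupted_downloads';
fs.mkdirSync(corruptedDir, { recursive: true });

for (const filename of fs.readdirSync(pdfDir)) {
  const filePath = path.join(pdfDir, filename);
  // A readable PDF begins with the bytes "%PDF-"
  const header = fs.readFileSync(filePath).subarray(0, 5).toString('ascii');
  if (header !== '%PDF-') {
    fs.renameSync(filePath, path.join(corruptedDir, filename));
  }
}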

Scraper

The scraper runs in two parts in order to deal with the different layouts of the data on the PDFs. The first section of data is processed by the processGeneralData function; the second section is processed by the processTableData function. Both functions take the PDF path passed to them inside scraperCallback.js, where the scraper function is called for each file that was downloaded and saved during the crawl, download, convert and validate phases. To trigger this process, run

npm run scrape
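
As a rough sketch of the flow described above, scraperCallback.js loops over the validated PDFs and hands each path to both extraction functions. processGeneralData and processTableData are the project's real function names, but their signatures, the module paths, the PDF directory and how the results are persisted to school_data.sql are all assumptions here.

// scraper-callback-sketch.js -- illustrative only
const fs = require('fs');
const path = require('path');

// The two extraction functions; module paths and signatures are assumed
const { processGeneralData } = require('./processGeneralData');
const { processTableData } = require('./processTableData');

const pdfDir = 'pdf_downloads'; // assumed location of the validated PDFs

async function scrapeAll() {
  for (const filename of fs.readdirSync(pdfDir)) {
    const pdfPath = path.join(pdfDir, filename);
    // First section of each report: free-form general data
    const general = await processGeneralData(pdfPath);
    // Second section: tabular data with a different layout
    const table = await processTableData(pdfPath);
    // In the real project the results are written to school_data.sql here
    console.log('Scraped', filename, { general, table });
  }
}

scrapeAll().catch(console.error);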

This will automatically create a school_data.sql database, to which the data from each PDF is transferred during the scraping process. After each PDF is scraped, the word 'Scraped' will appear in the terminal. You will also see Warning: fetchStandardFontData: failed to fetch file "LiberationSans-Bold.ttf" with "UnknownErrorException: The standard font "baseUrl" parameter must be specified, ensure that the "standardFontDataUrl" API parameter is provided.". This is a known issue and will not interfere with the scraping process.

Acknowledgments

  • Authored by Alphonso and Beth
  • Funded by the University of Cambridge
