The home for the spider that supports search.gov. The spider uses the open-source Scrapy framework, and the main spider can be found at `search_gov_crawler/search_gov_spiders/spiders/domain_spider.py`.
*Note: Other files and directories exist within the repository, but the folders and files below are the ones needed for the Scrapy framework.*
```
├── search_gov_crawler           (scrapy root)
│   ├── search_gov_spiders       (scrapy project; multiple projects can exist within a project root)
│   │   ├── spiders
│   │   │   ├── domain_spider.py (main spider)
│   │   ├── utility_files        (includes text files with domains to scrape)
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   ├── scrapy.cfg
```
The spider can either scrape for URLs from the list of required domains, or take a domain and starting URL and scrape that single site/domain. Running the spider produces a list of the URLs found, written to `search_gov_crawler/search_gov_spiders/spiders/scrapy_urls/{spider_name}/{spider_name}_{date}-{UTC_time}.txt`.
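The spiders are normally run from the command line as shown in the sections below, but as a rough illustration of the same behavior, here is a minimal sketch (not part of the repository) of launching `domain_spider` from a Python script with Scrapy's `CrawlerProcess`, passing the same `domain`/`urls` arguments described below:

```python
# Minimal sketch (assumed example, not repo code): run domain_spider programmatically.
# Must be executed from within the Scrapy project so the project settings and
# spider registry can be located.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# Equivalent to: scrapy crawl domain_spider -a domain=example.com -a urls=www.example.com
process.crawl("domain_spider", domain="example.com", urls="www.example.com")
process.start()  # blocks until the crawl finishes
```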
## Running Against All Listed Search.gov Domains

Navigate down to `search_gov_crawler/search_gov_spiders/spiders/`, then enter the command below. Note that crawling all listed domains will take a long time:

```
scrapy crawl domain_spider
```
To run the crawler against a single site or domain instead, enter the command below in the same directory, adding the domain and starting URL for the crawler:

```
scrapy crawl domain_spider -a domain=example.com -a urls=www.example.com
```
Alternatively, you can run a spider file directly with `scrapy runspider`:

- Navigate to the spiders directory.
- Enter one of the two following commands:
  - This command will output the yielded URLs to the destination (relative to the spiders directory) and in the file format specified in the `FEEDS` variable of the `settings.py` file (see the sketch of a `FEEDS` entry after this list):

    ```
    $ scrapy runspider <spider_file.py>
    ```

  - This command will output the yielded URLs to the destination and in the file format specified by the user:

    ```
    $ scrapy runspider <spider_file.py> -o <filepath_to_output_folder/spider_output_filename.csv>
    ```
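For reference, a `FEEDS` entry in a Scrapy `settings.py` looks roughly like the sketch below. This is an assumed illustration, not the repository's actual configuration, so the destination pattern and format here are placeholders:

```python
# Illustrative FEEDS setting (assumed example; check the project's settings.py
# for the real destination and format). %(name)s and %(time)s are Scrapy's
# built-in feed URI placeholders for the spider name and crawl timestamp.
FEEDS = {
    "output/%(name)s/%(name)s_%(time)s.csv": {
        "format": "csv",
        "overwrite": False,
    },
}
```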
To run the spiders through a Scrapyd server, first install Scrapyd and scrapyd-client (a library that helps eggify and deploy the Scrapy project to the Scrapyd server):

```
$ pip install scrapyd
$ pip install git+https://github.com/scrapy/scrapyd-client.git
```

Next, navigate to the `scrapyd_files` directory and start the server:

```
$ scrapyd
```
Note: the directory where you start the server is arbitrary. It is simply where the logs and the Scrapy project FEED destination (relative to the server directory) will be.
Navigate to the Scrapy project root directory and run this command to eggify the Scrapy project and deploy it to the Scrapyd server:

```
$ scrapyd-deploy default
```

Note: this will simply deploy it to a local Scrapyd server. To add custom deployment endpoints, open the `scrapy.cfg` file and add or customize endpoints. For instance, if you wanted local and production endpoints:

```
[settings]
default = search_gov_spiders.settings

[deploy: local]
url = http://localhost:6800/
project = search_gov_spiders

[deploy: production]
url = <IP_ADDRESS>
project = search_gov_spiders
```

To deploy:

```
# deploy locally
scrapyd-deploy local

# deploy to production
scrapyd-deploy production
```
For a web interface to view jobs (pending, running, finished) and logs, go to http://localhost:6800/. However, to actually manipulate the spiders deployed to the Scrapyd server, you'll need to use the Scrapyd JSON API. Some of the most-used commands:

- Schedule a job:

  ```
  $ curl http://localhost:6800/schedule.json -d project=search_gov_spiders -d spider=<spider_name>
  ```

- Check the load status of a service:

  ```
  $ curl http://localhost:6800/daemonstatus.json
  ```
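The same API can also be called from Python. Here is a minimal sketch (not part of the repository) that schedules a `domain_spider` crawl through `schedule.json` using only the standard library; the endpoint and project name mirror the curl examples above:

```python
# Minimal sketch (assumed example, not repo code): schedule a crawl via the
# Scrapyd JSON API using only the Python standard library.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

payload = urlencode({"project": "search_gov_spiders", "spider": "domain_spider"}).encode()
with urlopen("http://localhost:6800/schedule.json", data=payload) as response:
    # A successful call returns something like {"status": "ok", "jobid": "..."}
    print(json.load(response))
```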
To add a new spider:

- Navigate to anywhere within the *Scrapy project root* directory and run this command:

  ```
  $ scrapy genspider -t crawl <spider_name> "<spider_starting_domain>"
  ```

- Open the `/search_gov_spiders/search_gov_spiders/spiders/boilerplate.py` file and replace the lines of the generated spider with the lines of the boilerplate spider, as dictated in the boilerplate file.
- Modify the `rules` in the new spider as needed; see the Scrapy rules documentation for the specifics. (A rough sketch of a crawl spider with `rules` follows these steps.)
- To update the Scrapyd server with the new spider, run:

  ```
  $ scrapyd-deploy <default or endpoint_name>
  ```
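For illustration only, here is a minimal sketch of what a crawl spider with `rules` looks like; the spider name, domain, and callback below are placeholders, not the repository's actual boilerplate:

```python
# Minimal sketch (placeholder names, not the repo's boilerplate spider).
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    # Follow every in-domain link and hand each visited page to parse_item.
    rules = (
        Rule(LinkExtractor(allow_domains=["example.com"]), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Yield the visited URL; FEEDS (or the -o flag) controls where it is written.
        yield {"url": response.url}
```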