This is a demonstration of how to crawl images.
- Automate web crawling to retrieve images and flip pages automatically on a website.
- Add support for multiple browsers.
Python 3.7+
Chrome / Firefox / Edge / Safari
git clone https://github.com/HuaDeity/CrawlDemo.git
cd CrawlDemo
pip install -r requirements.txt
cd millitary
scrapy crawl gettyimages -a search_term=aircraftcarrier -a page_number=3 -a browser=chrome
scrapy crawl baidu -a search_term=航空母舰 -a image_number=10000 -a browser=chrome
To retrieve images, provide the following information:
- Website (gettyimages/alamy/google/baidu)
- Search term (keyword)
- Desired page number, or image number (for baidu only)
- Preferred web browser (chrome/firefox/edge/safari)
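
The four inputs above map directly onto the `-a` arguments of `scrapy crawl`. As a minimal sketch, the helper below assembles that command line from them. The function name `build_crawl_command` and its validation rules are hypothetical and not part of the project; only the argument names (`search_term`, `page_number`, `image_number`, `browser`) come from the commands shown earlier.

```python
import shlex

# Allowed values copied from the list above.
WEBSITES = ("gettyimages", "alamy", "google", "baidu")
BROWSERS = ("chrome", "firefox", "edge", "safari")


def build_crawl_command(website, search_term, count, browser="chrome"):
    """Return the argument list for one `scrapy crawl` invocation (hypothetical helper)."""
    if website not in WEBSITES:
        raise ValueError(f"unsupported website: {website!r}")
    if browser not in BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")
    if count < 1:
        raise ValueError("count must be a positive integer")
    # Baidu is limited by image count; the other sites by page count.
    count_arg = "image_number" if website == "baidu" else "page_number"
    return [
        "scrapy", "crawl", website,
        "-a", f"search_term={search_term}",
        "-a", f"{count_arg}={count}",
        "-a", f"browser={browser}",
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("gettyimages", "aircraftcarrier", 3)
    # Quote each part so the line can be pasted into a shell.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can also be passed to `subprocess.run`, provided the working directory is the Scrapy project directory.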
Notes on the currently supported websites:
- GettyImages: search results differ between regions, so the keyword may need adjusting, for example by adding hyphens.
- Alamy: images need a uniform crop of approximately 20px from the bottom.
- Baidu: the spider does not obey Baidu's robots.txt because of the site's anti-crawling mechanism; this may violate the website's terms of service.
- Google and Baidu load results as a continuous stream, so there is no page-by-page loading time.
- GettyImages uses regular pagination.
- Alamy has to wait for the images on each page to load, so it is relatively slow.
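
The 20px bottom crop mentioned for Alamy can be expressed as a crop box. The sketch below only computes the box; the helper name `bottom_crop_box` is hypothetical, and applying it with Pillow (shown in the comment) is an assumption about tooling, not something the project prescribes.

```python
def bottom_crop_box(width, height, strip=20):
    """Return a (left, upper, right, lower) box that drops `strip` pixels from the bottom."""
    if strip < 0 or strip >= height:
        raise ValueError("strip must be between 0 and height - 1")
    return (0, 0, width, height - strip)


# With Pillow installed, the box could be applied like this
# (file names are placeholders):
#     from PIL import Image
#     with Image.open("alamy.jpg") as img:
#         img.crop(bottom_crop_box(*img.size)).save("alamy_cropped.jpg")
```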
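
For continuously loading result pages such as Google and Baidu, a common approach is to keep scrolling until the page height stops growing. The following is a rough sketch of that idea, not the project's actual spider code. It assumes `driver` is a Selenium WebDriver (or any object with a compatible `execute_script` method); the function name and the `pause` and `max_rounds` parameters are hypothetical.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give newly requested images time to load
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended, so the end was reached
        last_height = new_height
    return rounds
```

The `max_rounds` cap keeps the loop bounded on pages that never stop loading. A paginated site such as GettyImages does not need this loop, since each result page is requested separately.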
- To avoid GettyImages restricting access to your IP, reduce your crawling frequency. If downloads still fail, change your IP or wait a while before trying again.
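
One way to reduce crawling frequency is through Scrapy's throttling settings. The values below are illustrative assumptions, not the project's configuration, and the name `POLITE_SETTINGS` is hypothetical. The setting keys themselves are standard Scrapy settings, and the dictionary could be used as a spider's `custom_settings` or copied into `settings.py`.

```python
# Illustrative values only; tune them for the target site.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 3,                  # seconds between requests to the same site
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the delay between 0.5x and 1.5x
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 5,        # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 60,         # upper bound when the server is slow
}
```

These settings only affect requests that go through Scrapy's downloader. If pages are fetched by a browser driven from the spider, an explicit pause between page loads may be needed as well.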
Contributions are welcome! Please refer to our CONTRIBUTING.md for details on how to contribute.
This project is licensed under the MIT License - see the LICENSE file for details.
HuaDeity
Email