Harvester is a lightweight, pure Python library designed for straightforward web scraping without external dependencies.
- Pure Python: No third-party dependencies required.
- Model-Field structure: Define scraping targets using a clear, class-based approach.
- Flexible parsing: Use Python's standard libraries to parse and extract data.
Installing via pip:
pip install harvester
Or directly from the source code:
pip install git+https://github.com/blazaid/harvester
Harvester is compatible with Python 3.8 and later. There are no mandatory external dependencies. However, certain features may benefit from the chardet library. If chardet is not installed, those features are skipped with a warning.
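The optional-dependency behaviour described above follows a common Python pattern, sketched below. This is not Harvester's actual code: the `decode_bytes` helper and its `default` parameter are hypothetical names used only for illustration. The only third-party call is `chardet.detect`, and it is used only when chardet is importable.

```python
import warnings

try:
    import chardet  # optional third-party package
except ImportError:
    chardet = None


def decode_bytes(raw: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes, guessing the encoding if chardet is present."""
    if chardet is None:
        # Feature is skipped with a warning instead of failing.
        warnings.warn(
            "chardet is not installed; skipping encoding detection",
            RuntimeWarning,
            stacklevel=2,
        )
        encoding = default
    else:
        # chardet.detect returns a dict; its "encoding" value may be None.
        encoding = chardet.detect(raw).get("encoding") or default
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The guessed encoding name is unknown to Python; fall back.
        return raw.decode(default, errors="replace")


print(decode_bytes(b"plain ascii text"))
```

Installing chardet alongside the library (`pip install chardet`) would silence the warning in a setup like this one.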
Define your data models by subclassing Model and specifying fields:
from harvester import Model, StringField, IntegerField
class Product(Model):
    name = StringField()
    price = IntegerField()
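To make the Model-Field idea concrete, here is a minimal sketch of how a declarative, class-based model can be built in pure Python. It is an illustration under assumptions, not Harvester's implementation: every `Sketch*` class name is hypothetical, and the real Model and field classes may work differently.

```python
class SketchField:
    """Hypothetical base field: converts raw text into a Python value."""

    def convert(self, raw: str):
        return raw


class SketchStringField(SketchField):
    def convert(self, raw: str) -> str:
        return raw.strip()


class SketchIntegerField(SketchField):
    def convert(self, raw: str) -> int:
        return int(raw.strip())


class SketchModel:
    """Hypothetical base model: collects field declarations from subclasses."""

    _fields = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Start from inherited fields, then add those declared in this class body.
        fields = dict(cls._fields)
        fields.update(
            {name: value for name, value in vars(cls).items()
             if isinstance(value, SketchField)}
        )
        cls._fields = fields

    def __init__(self, **raw_values):
        # Convert each raw string with the matching field; missing values become None.
        for name, field in self._fields.items():
            raw = raw_values.get(name)
            setattr(self, name, None if raw is None else field.convert(raw))

    def to_dict(self) -> dict:
        return {name: getattr(self, name) for name in self._fields}


class SketchProduct(SketchModel):
    name = SketchStringField()
    price = SketchIntegerField()


product = SketchProduct(name=" Example Product ", price="100")
print(product.to_dict())  # {'name': 'Example Product', 'price': 100}
```

Keeping conversion logic in the field classes means the model itself only declares what to extract, which is the appeal of this style.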
Parse the HTML content and extract data using the model:
from harvester import parse_html
html_content = """
<html>
<body>
<h1 class="product-name">Example Product</h1>
<span class="product-price">100</span>
</body>
</html>
"""
mapping = {
    "name": "h1.product-name",
    "price": "span.product-price",
}
product = parse_html(html_content, Product, mapping=mapping)
print(product.to_dict())
This will output:
{'name': 'Example Product', 'price': 100}
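For readers curious how selectors such as `h1.product-name` can be resolved with the standard library alone, the sketch below uses `html.parser.HTMLParser`. It is only an approximation under stated assumptions: `TagClassExtractor` and `select_text` are hypothetical names, only the simple `tag.class` selector form is handled, and Harvester's own parsing may differ.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagClassExtractor(HTMLParser):
    """Hypothetical helper: collect text of the first element matching tag and class."""

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._depth = 0  # greater than 0 while inside the matched element
        self._chunks = []
        self._done = False

    def handle_starttag(self, tag, attrs):
        if self._done or tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside the match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.tag and self.css_class in classes:
                self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth == 0:
                self._done = True  # stop after the first match

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    @property
    def text(self) -> str:
        return "".join(self._chunks).strip()


def select_text(html: str, selector: str) -> str:
    """Return the text of the first element matching a 'tag.class' selector."""
    tag, _, css_class = selector.partition(".")
    parser = TagClassExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.text


html = '<h1 class="product-name">Example Product</h1><span class="product-price">100</span>'
print(select_text(html, "h1.product-name"))
print(int(select_text(html, "span.product-price")))
```

A mapping like the one above could then be applied by calling `select_text` once per field and passing each result to the field's conversion step.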
Comprehensive documentation is forthcoming and will be available on Read the Docs. In the meantime, the source code is the best place to find information.
Contributions are welcome! Please review the issues for current topics and feel free to submit pull requests. Also make sure to read the contributing guidelines to get started.
Harvester is licensed under the GNU General Public License v3.0. See the LICENSE file for detailed information.