Harvester: An easy-to-use Web Scraping tool.

Harvester is a lightweight, pure Python library designed for straightforward web scraping without external dependencies.

Features

Pure Python: No third-party dependencies required.
Model-Field structure: Define scraping targets using a clear, class-based approach.
Flexible parsing: Use Python's standard libraries to parse and extract data.
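
To give a feel for what parsing with only the standard library can look like, here is a minimal sketch built on html.parser. It is not Harvester's implementation; the ClassTextExtractor class and the extract_text helper are hypothetical names introduced purely for illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element with a given tag and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested elements of the same tag so the right end tag closes us.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_text(html, tag, css_class):
    """Hypothetical helper: return the text of all matching elements."""
    parser = ClassTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_text(html, "h1", "product-name") on a page returns a list with the stripped text of each matching h1 element.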

Installation

Install via pip:

pip install harvester

Or directly from the source code:

pip install git+https://github.com/blazaid/harvester

Requirements

Harvester requires Python 3.8 or later and has no mandatory external dependencies. Some features can use the optional chardet library; if chardet is not installed, those features are skipped with a warning.
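
The optional-dependency behavior described above follows a common Python pattern, sketched here. This is an illustration under assumptions, not Harvester's code: the decode_bytes function and its fallback parameter are hypothetical, and encoding detection is only assumed to be the kind of feature chardet would support. chardet.detect is the library's real detection call.

```python
import warnings

try:
    import chardet  # optional third-party package
except ImportError:
    chardet = None


def decode_bytes(raw, fallback="utf-8"):
    """Hypothetical helper: decode bytes, guessing the encoding when chardet is available."""
    if chardet is None:
        # Skip detection and tell the user why.
        warnings.warn(
            "chardet is not installed; skipping encoding detection "
            f"and decoding as {fallback}.",
            stacklevel=2,
        )
        encoding = fallback
    else:
        # chardet.detect returns a dict; "encoding" may be None if detection fails.
        encoding = chardet.detect(raw).get("encoding") or fallback
    return raw.decode(encoding, errors="replace")
```

With this shape, code that calls decode_bytes keeps working whether or not chardet is installed; only the accuracy of the encoding guess changes.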

Usage

Define your data models by subclassing Model and specifying fields:

from harvester import Model, StringField, IntegerField

class Product(Model):
    name = StringField()
    price = IntegerField()

Parse the HTML content and extract data using the model:

from harvester import parse_html

html_content = """
<html>
<body>
    <h1 class="product-name">Example Product</h1>
    <span class="product-price">100</span>
</body>
</html>
"""

mapping = {
    "name": "h1.product-name",
    "price": "span.product-price"
}

product = parse_html(html_content, Product, mapping=mapping)
print(product.to_dict())

This will output:

{'name': 'Example Product', 'price': 100}
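
Note that price comes back as an integer even though the HTML contains the text "100". One way a model-field design can achieve this is sketched below. Everything here (Field, TextField, NumberField, Record, ProductRecord) is a hypothetical stand-in written for illustration and should not be read as Harvester's internals.

```python
class Field:
    """Hypothetical field: converts raw scraped text to a Python value."""

    def convert(self, raw):
        return raw


class TextField(Field):
    def convert(self, raw):
        return raw.strip()


class NumberField(Field):
    def convert(self, raw):
        return int(raw.strip())


class Record:
    """Hypothetical base class that converts raw text through declared fields."""

    def __init__(self, raw_values):
        for name, field in self._fields().items():
            # The converted value is stored on the instance, shadowing the class-level Field.
            setattr(self, name, field.convert(raw_values[name]))

    @classmethod
    def _fields(cls):
        # Only fields declared directly on this class are considered in this sketch.
        return {n: v for n, v in vars(cls).items() if isinstance(v, Field)}

    def to_dict(self):
        return {name: getattr(self, name) for name in self._fields()}


class ProductRecord(Record):
    name = TextField()
    price = NumberField()


record = ProductRecord({"name": " Example Product ", "price": "100"})
print(record.to_dict())  # {'name': 'Example Product', 'price': 100}
```

Keeping conversion logic inside each field class means a model declaration stays a plain list of names and types, which is the appeal of the class-based approach.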

Documentation

Comprehensive documentation is forthcoming and will be available on Read the Docs. In the meantime, the source code is the best place to find information.

Contributing

Contributions are welcome! Review the open issues for current topics and feel free to submit pull requests. Please read the contributing guidelines before getting started.

License

Harvester is licensed under the GNU General Public License v3.0. See the LICENSE file for detailed information.
