This project scrapes data from e-commerce websites and stores it in JSON and CSV formats. The scraped data is then read and processed with PySpark to perform various transformations and analyses. Finally, the processed data is saved in JSON format and loaded into a MySQL database for further use and analysis.
- Web Scraping: Scrape data from e-commerce websites and save it in JSON and CSV formats (see the scraping sketch below).
- Data Processing: Read the JSON files and process the data with PySpark (see the processing sketch below).
- Data Storage: Save the processed data in JSON format and load it into a MySQL database for further analysis and use (see the loading sketch below).
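The scraping step might look roughly like the sketch below. It is only an illustration: the URL, CSS selectors, and output file names are placeholder assumptions, not the project's actual targets, and it assumes a Chrome driver is available for Selenium.

```python
import csv
import json

from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholder URL and selectors -- the real project targets its own e-commerce pages.
driver = webdriver.Chrome()
driver.get("https://example.com/products")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

products = []
for card in soup.select(".product-card"):  # hypothetical product card selector
    products.append({
        "name": card.select_one(".product-name").get_text(strip=True),
        "price": card.select_one(".product-price").get_text(strip=True),
    })

# Persist the raw results in both JSON and CSV.
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, ensure_ascii=False, indent=2)

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(products)
```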
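The PySpark step could then read the scraped JSON, clean it up, and write the processed result back out. The file paths and column names ("name", "price") below are assumptions carried over from the scraping sketch, not the project's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ecommerce-processing").getOrCreate()

# The scraping sketch writes a pretty-printed JSON array, so multiLine is needed.
raw = spark.read.option("multiLine", True).json("products.json")

# Example transformation: strip currency symbols, cast price to a number,
# and drop rows that are missing either field.
processed = (
    raw.withColumn("price", F.regexp_replace("price", "[^0-9.]", "").cast("double"))
       .dropna(subset=["name", "price"])
)

# Save the processed data as JSON for the load step.
processed.write.mode("overwrite").json("processed_products")
```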
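One way to load the result into MySQL is directly from Spark over JDBC, as in the sketch below. The connection URL, credentials, database, and table name are placeholders, and the MySQL Connector/J driver jar must be available on Spark's classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ecommerce-load").getOrCreate()

# Read the output of the processing step.
processed = spark.read.json("processed_products")

# Append the rows into a MySQL table; all connection details are placeholders.
processed.write.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/ecommerce",
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="products",
    user="root",
    password="example",  # match the credentials configured for the MySQL container
).mode("append").save()
```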
To run this project, create a virtual environment and install the necessary libraries:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
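The requirements file is expected to pin the libraries from the tech stack listed at the end of this README; the sketch below is only a guess, and the project's actual file may list more packages and exact versions.

```text
selenium
beautifulsoup4
pyspark
```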
Start the MySQL container:
docker compose up -d
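The compose file this command picks up should define a MySQL service along the lines of the sketch below; the image tag, credentials, database name, and port mapping are placeholders rather than the project's actual configuration.

```yaml
# Minimal MySQL service sketch; all values below are placeholders.
services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: example
      MYSQL_DATABASE: ecommerce
    ports:
      - "3306:3306"
    volumes:
      - mysql_data:/var/lib/mysql

volumes:
  mysql_data:
```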
- Data Processing: Python, PySpark
- Database: MySQL
- Web Scraping: Selenium, Beautiful Soup
- Containerization: Docker