This project automates the process of extracting, transforming, and loading (ETL) data about the world's largest banks by market capitalization. It fetches data from a Wikipedia page, processes it, and stores the results in both CSV and SQLite database formats.
- Extracts bank data from a Wikipedia page
- Transforms market capitalization data from USD to GBP, EUR, and INR
- Loads processed data into a CSV file and SQLite database
- Logs each step of the ETL process
- Executes sample SQL queries on the resulting database
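The currency conversion can be illustrated with a minimal sketch. The column names (`MC_USD_Billion`, `MC_GBP_Billion`, and so on), the two-decimal rounding, and the rates below are assumptions made for illustration; they are not taken from `main.py` or `exchange_rate.csv`.

```python
import pandas as pd


def transform(df, rates):
    """Return a copy of df with GBP, EUR and INR market-cap columns added."""
    out = df.copy()
    for currency in ("GBP", "EUR", "INR"):
        # Assumed column naming scheme; adjust to match the real data.
        out[f"MC_{currency}_Billion"] = (out["MC_USD_Billion"] * rates[currency]).round(2)
    return out


# Made-up sample data and exchange rates, for illustration only.
banks = pd.DataFrame({"Name": ["Bank A", "Bank B"], "MC_USD_Billion": [100.0, 250.0]})
rates = {"GBP": 0.8, "EUR": 0.9, "INR": 80.0}

result = transform(banks, rates)
print(result)
```

In the actual script the rates would come from `exchange_rate.csv` rather than a hard-coded dictionary.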
- Python 3.x
- Required Python packages:
  - requests
  - pandas
  - beautifulsoup4
  - sqlite3 (part of the Python standard library; no separate installation needed)
- `main.py`: The main Python script that performs the ETL process
- `exchange_rate.csv`: CSV file containing currency exchange rates
- `Largest_banks_data.csv`: Output CSV file with processed bank data
- `code_log.txt`: Log file that records the progress of the ETL process
- `Banks.db`: SQLite database file storing the processed data
- `docs/`: Directory containing project documentation
  - `HLD.md`: High-Level Design document
  - `LLD.md`: Low-Level Design document
- Ensure all required Python packages are installed:
  `pip install requests pandas beautifulsoup4`
- Place the `exchange_rate.csv` file in the same directory as the script.
- Run the script:
  `python main.py`
- The script will:
  - Extract data from the specified Wikipedia page
  - Transform the data using exchange rates from `exchange_rate.csv`
  - Load the data into `Largest_banks_data.csv` and the SQLite database `Banks.db`
  - Log the progress in `code_log.txt`
  - Execute and display results of sample SQL queries
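The load and query steps above could look roughly like the following sketch. It is not the code from `main.py`: the table name `Largest_banks`, the sample rows, and the extra `conn` and `table_name` parameters of `load_to_db` are assumptions, and the demo writes to a temporary directory and an in-memory database instead of `Largest_banks_data.csv` and `Banks.db`.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def load_to_csv(df, file_path):
    """Write the DataFrame to a CSV file without the index column."""
    df.to_csv(file_path, index=False)


def load_to_db(df, conn, table_name):
    """Write the DataFrame to a SQLite table, replacing it if it exists."""
    df.to_sql(table_name, conn, if_exists="replace", index=False)


def run_query(query_statement, conn):
    """Run a SQL query, print the statement and its result, and return the result."""
    result = pd.read_sql(query_statement, conn)
    print(query_statement)
    print(result)
    return result


# Made-up sample data, for illustration only.
banks = pd.DataFrame({"Name": ["Bank A", "Bank B"], "MC_USD_Billion": [100.0, 250.0]})

# CSV output, written to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Largest_banks_data.csv")
    load_to_csv(banks, csv_path)
    csv_rows = len(pd.read_csv(csv_path))

# Database output, using an in-memory SQLite database for the demo.
conn = sqlite3.connect(":memory:")
load_to_db(banks, conn, "Largest_banks")
avg = run_query("SELECT AVG(MC_USD_Billion) AS avg_mc FROM Largest_banks", conn)
conn.close()
```

Passing `"Banks.db"` to `sqlite3.connect` instead of `":memory:"` would persist the table to disk.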
- `log_progress(message)`: Logs messages with timestamps
- `extract(url, table_att)`: Extracts data from the Wikipedia page
- `transform(df_)`: Transforms the data using exchange rates
- `load_to_csv(df_, file_path)`: Saves data to a CSV file
- `load_to_db(df_)`: Saves data to the SQLite database
- `run_query(query_statement, conn_)`: Executes SQL queries and displays results
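As an illustration of the simplest of these functions, here is one way `log_progress` might be written. The timestamp format, the `timestamp : message` line layout, and the optional `log_file` parameter are assumptions; the real function in `main.py` may differ.

```python
import os
import tempfile
from datetime import datetime


def log_progress(message, log_file="code_log.txt"):
    """Append a timestamped message to the log file."""
    # Assumed timestamp format: year-month-day hour:minute:second
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(f"{timestamp} : {message}\n")


# Demo against a temporary log file so the real code_log.txt is not touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_log = os.path.join(tmp_dir, "code_log.txt")
    log_progress("Data extraction complete", demo_log)
    log_progress("Data transformation complete", demo_log)
    with open(demo_log, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

Opening the file in append mode keeps earlier entries, so one log file can record every stage of a run.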
The `docs` directory contains detailed design documents:
- `HLD.pdf`: High-Level Design document outlining the overall architecture and components of the project
- `LLD.pdf`: Low-Level Design document providing detailed specifications and function descriptions
Refer to these documents for a comprehensive understanding of the project's design and implementation.
- To modify the source URL, update the `url_data` variable
- To change the output file names or locations, update the respective variables at the beginning of the script
- The script uses a web archive version of the Wikipedia page to ensure consistency
- Ensure you have proper permissions to read/write files in the script's directory
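For readers who want to see how table rows can be pulled out of a saved or archived page, here is a small sketch using `beautifulsoup4`. The HTML snippet, the column order, and the helper name `extract_rows` are hypothetical; the real page layout and the real `extract` function may differ. In practice the HTML would come from something like `requests.get(url_data).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the table on the archived page.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
    <tr><td>1</td><td>Bank A</td><td>100.50</td></tr>
    <tr><td>2</td><td>Bank B</td><td>250.25</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Parse the first table in the HTML into a list of row dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 3:
            continue  # Skip header rows, which use <th> cells
        rows.append(
            {
                "Name": cells[1].get_text(strip=True),
                "MC_USD_Billion": float(cells[2].get_text(strip=True)),
            }
        )
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame` for the transformation step.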
- Add error handling for network issues or data inconsistencies
- Implement command-line arguments for flexible file paths and URLs
- Create a configuration file for easy customization of parameters
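A possible starting point for the command-line arguments idea is sketched below with `argparse` from the standard library. All flag names (`--url`, `--rates-csv`, `--output-csv`, `--db-path`, `--log-file`) are hypothetical; only the default file names come from this README.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for the ETL script (hypothetical flags)."""
    parser = argparse.ArgumentParser(
        description="ETL for the world's largest banks by market capitalization"
    )
    parser.add_argument("--url", required=True, help="Source page to extract from")
    parser.add_argument("--rates-csv", default="exchange_rate.csv",
                        help="CSV file with currency exchange rates")
    parser.add_argument("--output-csv", default="Largest_banks_data.csv",
                        help="Path of the output CSV file")
    parser.add_argument("--db-path", default="Banks.db",
                        help="Path of the SQLite database file")
    parser.add_argument("--log-file", default="code_log.txt",
                        help="Path of the progress log file")
    return parser.parse_args(argv)


# Demo with an explicit argument list; a real script would call parse_args().
args = parse_args(["--url", "https://example.org/banks", "--db-path", "out/Banks.db"])
print(args.url, args.rates_csv, args.output_csv, args.db_path, args.log_file)
```

Keeping the current file names as defaults means existing usage (`python main.py` plus a URL) would need only one extra argument.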