This repository contains a collection of Python, Node.js, and Jupyter Notebook files for creating an Instruct-Style dataset of Telugu news articles for the purpose of supervised fine-tuning (SFT) of Large Language Models (LLMs).
The Telugu News Articles dataset created with the code in this repository is open-sourced as a HuggingFace Dataset under the Apache 2.0 License. You can access the dataset here: aya-telugu-news-articles.
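Once published, the dataset can be loaded directly with the HuggingFace `datasets` library. Below is a minimal sketch; the full repository ID on the Hub, including the namespace prefix, is an assumption, so check the dataset page for the exact name:

```python
# Minimal sketch: load the dataset from the HuggingFace Hub.
# The repository ID below (including the namespace) is an assumption.
from datasets import load_dataset

dataset = load_dataset("SuryaKrishna02/aya-telugu-news-articles")
print(dataset)  # inspect the available splits and columns
```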
This repository is useful for users who want to:
- Reproduce the Telugu News Articles dataset creation workflow.
- Extend the existing Telugu News Articles dataset.
- Integrate parts of the Telugu News Articles dataset creation workflow into their own dataset creation workflow.
Note: Scraping copyrighted websites without permission is unethical and not advisable. Please check a website's terms and conditions before proceeding with the workflow.
Make sure you have Python version 3.9.13 or higher installed. You can check your Python version by running:
```bash
python --version
```
If you don't have Python installed or have an older version, you can download the latest version from the official Python website: https://www.python.org
It is recommended to create a virtual environment to isolate the project dependencies. To create a virtual environment, run:
```bash
python -m venv venv
```
Activate the virtual environment:
- For Windows:

  ```bash
  venv\Scripts\activate
  ```

- For macOS and Linux:

  ```bash
  source venv/bin/activate
  ```
To install the required dependencies for the Python files in the virtual environment, run:
```bash
pip install -r requirements.txt
```
Make sure you have Node.js version 18.13.0 or higher installed. You can check your Node.js version by running:
```bash
node --version
```
If you don't have Node.js installed or have an older version, you can download the latest version from the official Node.js website: https://nodejs.org
To install the required dependencies for the Node.js files, run:
```bash
npm install
```
The dataset creation workflow consists of the following three steps, which need to be performed sequentially.
- Edit the `src/utils/scraper-constants.js` file according to your specifications, such as the timeout, the links to be scraped, etc.
- To scrape the content specified in the previous step, run:

  ```bash
  node index.js
  ```

- After successful execution, you can find the scraped content JSON file at the `SCRAPED_CONTENT_FILE_PATH` mentioned in the `scraper-constants.js` file (a quick way to inspect the output is sketched below).
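To sanity-check the scrape, you can inspect the output JSON from Python. This is a minimal sketch; the file path and the record fields are assumptions for illustration, so substitute the actual `SCRAPED_CONTENT_FILE_PATH` value from `scraper-constants.js`:

```python
# Minimal sketch: inspect the scraped content JSON.
# The path below is a placeholder; use the SCRAPED_CONTENT_FILE_PATH
# configured in src/utils/scraper-constants.js.
import json

with open("src/data/content/scraped_content.json", encoding="utf-8") as f:
    articles = json.load(f)

print(f"Scraped {len(articles)} articles")
print(articles[0])  # field names (e.g. url, title, content) are hypothetical
```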
- Edit the `src/utils/sft_constants.py` file according to your specifications.
- Run the `notebooks/exploratory_data_analysis.ipynb` notebook. The notebook has detailed steps that perform exploratory data analysis, dataset cleaning, and removal of outliers (the flavor of this cleaning is sketched below).
- After successful execution of the notebook, you can find the cleaned scraped content CSV file at the `FINAL_SCRAPED_DATASET_PATH` mentioned in `sft_constants.py`.
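To give a feel for the cleaning the notebook performs, here is a minimal pandas sketch. The file paths, the `content` column name, and the quantile thresholds are illustrative assumptions, not the notebook's exact logic:

```python
# Minimal sketch of dataset cleaning and length-based outlier removal.
# Paths, column names, and thresholds are assumptions for illustration.
import pandas as pd

df = pd.read_json("src/data/content/scraped_content.json")

# Drop empty and duplicate articles.
df = df.dropna(subset=["content"]).drop_duplicates(subset=["content"])

# Trim extreme outliers by article length.
lengths = df["content"].str.len()
low, high = lengths.quantile([0.01, 0.99])
df = df[lengths.between(low, high)]

df.to_csv("src/data/content/final_scraped_dataset.csv", index=False)
```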
- Edit the `src/utils/sft_constants.py` file according to your specifications.
- To create the Instruct-Style SFT dataset from the scraped content, run:

  ```bash
  python main.py
  ```

- After successful execution, you can find the final SFT dataset with prompts and completions at the `SFT_DATASET_PATH` mentioned in `sft_constants.py` (the underlying transformation is sketched below).
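Conceptually, this step pairs each cleaned article with an instruction-style prompt. Below is a minimal sketch of that transformation, assuming `title` and `content` columns and an English prompt template; the real templates and output format live in `src/python/sft.py` and may differ (e.g. prompts written in Telugu):

```python
# Minimal sketch of building prompt/completion pairs from cleaned articles.
# Column names, the prompt template, and the output path are assumptions.
import pandas as pd

df = pd.read_csv("src/data/content/final_scraped_dataset.csv")

records = [
    {
        "prompt": f"Write a Telugu news article with the headline: {row.title}",
        "completion": row.content,
    }
    for row in df.itertuples()
]

pd.DataFrame(records).to_json(
    "src/data/content/sft_dataset.json", orient="records", force_ascii=False
)
```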
The repository has the following structure:
```
├── src/
│   ├── python/
│   │   ├── sft.py
│   │   ├── post_processor.py
│   │   └── ...
│   ├── nodejs/
│   │   ├── content-scraper.js
│   │   ├── links-scraper.js
│   │   └── ...
│   ├── data/
│   │   ├── content/tmp
│   │   └── links/
│   └── utils/
│       ├── scraper-constants.js
│       └── sft_constants.py
├── notebooks/
│   └── exploratory_data_analysis.ipynb
├── index.js
├── main.py
├── requirements.txt
├── package.json
└── README.md
```
- The `src/` directory contains the main source code files.
  - The `python/` directory contains the Python files.
  - The `nodejs/` directory contains the Node.js files.
  - The `data/` directory contains data files.
    - The `content/` directory contains content-related data files.
    - The `links/` directory contains link-related data files.
  - The `utils/` directory contains utility files for both Python and Node.js.
- The `notebooks/` directory contains the notebook to perform exploratory data analysis.
- The `main.py` file creates the Instruct-Style SFT dataset.
- The `index.js` file scrapes the content from the website.
- The `requirements.txt` file lists the Python dependencies.
- The `package.json` file lists the Node.js dependencies.
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
When contributing, please follow the existing code style and conventions used in the project.
This code is licensed under the MIT License.
If you use this code in your work, please cite it as follows:
```bibtex
@software{Guthikonda_SFT_LLM_News_2024,
  author = {Guthikonda, Surya},
  license = {MIT},
  month = apr,
  title = {{SFT LLM News Articles Telugu}},
  url = {https://github.com/SuryaKrishna02/sft-llm-news-articles-telugu},
  version = {1.0.0},
  year = {2024}
}
```