Curate Alpaca

title	emoji	colorFrom	colorTo	sdk	sdk_version	app_file	pinned
Alpaca-Guarani	🦙 🇵🇾	green	blue	gradio	5.11.0	main.py	false

Curate Alpaca

About

Simple tool to curate alpaca dataset for training LLMs in other language.

In this case we built this tool to curate dataset transalated to Guarani and prepare for finetunning LLMs with this new language.

Key Features

Use Alpaca based instruction dataset
Using nllb-200-distilled-600M to translate into the desired language (Guaraní in this case)
Supabase for manage the dataset
Gradio to build the UI
HF Transformers pipeline to translate the dataset
Multiple translation backends supported:
- OpenAI GPT-4 (default) for high-quality translations
- NLLB (Neural Language Learning Benchmark) model for offline/local processing
- Google Cloud Translation for reliable machine translation

How to use

Clone the repo

git clone this repo

Install the requirements

pip install -r requirements.txt

Set env vars.

export SUPABASE_URL=your_supabase_url
export SUPABASE_KEY=your_supabase_key
export OPENAI_API_KEY=your_openai_api_key
export TRANSLATION_MODEL=openai
export GOOGLE_CREDENTIALS_PATH=/path/to/your/credentials.json

Run the app

python main.py

Alpaca-Guarani Database Schema Used for this project

Table: alpaca-guarani

This table stores original Spanish texts and their Guarani translations for the Alpaca dataset.

Column Name	Data Type	Constraints	Description
id	int8	PRIMARY KEY	Unique identifier for each entry
original_instruction	text		Original instruction text in Spanish
original_input	text		Original input text in Spanish
original_output	text		Original output text in Spanish
translated_instruction	text		Translated instruction text in Guarani
translated_input	text		Translated input text in Guarani
translated_output	text		Translated output text in Guarani
translated	bool		Flag indicating if the entry has been translated

Notes:

The id column is used as the primary key and should be automatically incremented.
The translated column is a boolean flag to easily identify which entries have been translated.
All text fields (original and translated) are stored as text type to accommodate varying lengths.
The schema follows the structure of the Alpaca dataset, with separate fields for instruction, input, and output.

Usage:

When inserting new entries, ensure the id is unique.
Set translated to false when adding new original texts.
Update translated to true when the translation is completed.
Query untranslated entries with: SELECT * FROM alpaca-guarani WHERE translated = false;
Query translated entries with: SELECT * FROM alpaca-guarani WHERE translated = true;

This schema allows for efficient storage and retrieval of both original and translated texts, facilitating the translation process and progress tracking.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
assets		assets
src		src
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Curate Alpaca

About

Key Features

How to use

Alpaca-Guarani Database Schema Used for this project

Table: alpaca-guarani

Notes:

Usage:

About

Releases

Packages

Languages

josebenitezg/alpaca-guarani

Folders and files

Latest commit

History

Repository files navigation

Curate Alpaca

About

Key Features

How to use

Alpaca-Guarani Database Schema Used for this project

Table: alpaca-guarani

Notes:

Usage:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages