Dataset Needed for Unified Nigerian Language Detection #4

sir-temi · 2025-01-05T11:52:53Z

We are building a unified language detection model for Padie to classify input text into one of the supported Nigerian languages: English, Pidgin, Yoruba, Igbo, and Hausa. To improve the model’s accuracy and diversity, we need high-quality datasets for these languages.

Your contribution will directly help Padie understand and process Nigerian languages better! 🌟

Dataset Structure

📂 The datasets are organised in the datasets/language_detection/ directory, with a separate JSON file for each language:

datasets/language_detection/english.json
datasets/language_detection/pidgin.json
datasets/language_detection/yoruba.json
datasets/language_detection/igbo.json
datasets/language_detection/hausa.json

Each file should contain a JSON array of objects, each in the following format. For instance, an entry in datasets/language_detection/pidgin.json:

{
    "text": "Wetin dey happen for here?",
    "label": "pidgin"
}
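For contributors who prefer to script their additions, here is a minimal Python sketch of appending samples to one of the language files. The helper name `append_samples` and the `SUPPORTED_LABELS` set are hypothetical, not part of Padie; the sketch assumes each file holds a single JSON array of `{"text", "label"}` objects as described above, and that label values are the lowercase language names.

```python
import json
from pathlib import Path

# Assumed label values: lowercase names of the five supported languages.
SUPPORTED_LABELS = {"english", "pidgin", "yoruba", "igbo", "hausa"}


def append_samples(path, texts, label):
    """Append texts to the JSON array stored at path, tagging each with label."""
    if label not in SUPPORTED_LABELS:
        raise ValueError(f"unsupported label: {label!r}")
    file = Path(path)
    # Start from the existing array if the file exists, otherwise from an empty one.
    entries = json.loads(file.read_text(encoding="utf-8")) if file.exists() else []
    entries.extend({"text": text, "label": label} for text in texts)
    # ensure_ascii=False keeps diacritics (common in Yoruba, Igbo, and Hausa) readable.
    file.write_text(
        json.dumps(entries, ensure_ascii=False, indent=4) + "\n", encoding="utf-8"
    )
    return entries
```

Calling `append_samples("datasets/language_detection/pidgin.json", ["Wetin dey happen for here?"], "pidgin")` would add the example entry shown above to the Pidgin file.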

How You Can Contribute

✨ Ways to Help:

Provide Data for Any Language:
- Add sentences or paragraphs for any of the supported languages (English, Pidgin, Yoruba, Igbo, Hausa).
- Ensure the label correctly matches the language of the text.
Ensure Dataset Quality:
- ✅ Avoid duplicates, sensitive information, or offensive content.
- 🌍 Include diverse content (formal, informal, conversational).
Submission Guidelines:
- Update the corresponding JSON file in datasets/language_detection/.
- If submitting multiple samples, ensure they are formatted correctly and well-organised.
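
Some of the checks above (matching labels, no duplicates, well-formed entries) can be automated before opening a pull request. The following is a rough sketch only: `find_problems` is a hypothetical helper, not an existing Padie tool, and it assumes each entry has exactly the two keys `text` and `label`. It cannot tell whether the text is really written in the labelled language, or whether it is sensitive or offensive, so manual review is still needed.

```python
def find_problems(entries, expected_label):
    """Return (index, message) pairs for entries that break the basic guidelines."""
    problems = []
    seen = set()
    for i, entry in enumerate(entries):
        # Assumption: every entry is an object with exactly these two keys.
        if not isinstance(entry, dict) or set(entry) != {"text", "label"}:
            problems.append((i, "entry must be an object with 'text' and 'label'"))
            continue
        text, label = entry["text"], entry["label"]
        if not isinstance(text, str) or not text.strip():
            problems.append((i, "text is empty"))
            continue
        if label != expected_label:
            problems.append((i, f"label {label!r} does not match {expected_label!r}"))
        # Treat texts that differ only in case or spacing as duplicates.
        key = " ".join(text.split()).casefold()
        if key in seen:
            problems.append((i, "duplicate text"))
        seen.add(key)
    return problems
```

An empty result means none of these mechanical checks failed for the file being validated.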

Call to Action

🚀 Contribute Today!
Submit your changes as a pull request or share your dataset here if you're unable to format it. Let’s collaborate to build the best language detection model for Padie! Thank you.

Let’s make Padie exceptional! 🌟

sir-temi added the help wanted, good first issue, and dataset needed labels on Jan 5, 2025.
