Dataset Needed for Unified Nigerian Language Detection #4

Open
sir-temi opened this issue Jan 5, 2025 · 0 comments
Labels
  • dataset needed: Specific requests for datasets to improve models.
  • good first issue: Good for newcomers
  • help wanted: Extra attention is needed

Comments

Owner

sir-temi commented Jan 5, 2025

We are building a unified language detection model for Padie to classify input text into one of the supported Nigerian languages: English, Pidgin, Yoruba, Igbo, and Hausa. To improve the model’s accuracy and diversity, we need high-quality datasets for these languages.

Your contribution will directly help Padie understand and process Nigerian languages better! 🌟


Dataset Structure

📂 The datasets are organised in the datasets/language_detection/ directory, with a separate JSON file for each language:

  • datasets/language_detection/english.json
  • datasets/language_detection/pidgin.json
  • datasets/language_detection/yoruba.json
  • datasets/language_detection/igbo.json
  • datasets/language_detection/hausa.json

Each file should contain a JSON array of objects, with each object in the following format. For instance, one entry in datasets/language_detection/pidgin.json:

{
    "text": "Wetin dey happen for here?",
    "label": "pidgin"
}
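
As a rough sketch of how a contributor could check a file before opening a pull request, the snippet below validates entries against the format described above. It assumes each file is a JSON array of objects with "text" and "label" keys, and that labels are the lowercase language names used in the file names. The function name `validate_entries` and the `SUPPORTED_LABELS` set are hypothetical and not part of Padie.

```python
import json

# Assumed label set, inferred from the file names listed in this issue.
SUPPORTED_LABELS = {"english", "pidgin", "yoruba", "igbo", "hausa"}


def validate_entries(entries, expected_label):
    """Return a list of human-readable problems found in `entries`.

    `expected_label` is the label every entry in the file should carry,
    e.g. "pidgin" for pidgin.json. An empty list means no problems found.
    """
    if not isinstance(entries, list):
        return ["top-level JSON value must be an array"]

    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not a JSON object")
            continue
        text = entry.get("text")
        label = entry.get("label")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"entry {i}: 'text' must be a non-empty string")
        if not isinstance(label, str) or label not in SUPPORTED_LABELS:
            problems.append(f"entry {i}: unknown label {label!r}")
        elif label != expected_label:
            problems.append(
                f"entry {i}: label {label!r} does not match {expected_label!r}"
            )
    return problems


# Example using the entry shown in this issue, wrapped in an array.
sample = json.loads('[{"text": "Wetin dey happen for here?", "label": "pidgin"}]')
print(validate_entries(sample, "pidgin"))
```

To check a real file, the same function could be called on `json.load(f)` for a file opened with `encoding="utf-8"`.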

How You Can Contribute

Ways to Help:

  1. Provide Data for Any Language:

    • Add sentences or paragraphs for any of the supported languages (English, Pidgin, Yoruba, Igbo, Hausa).
    • Ensure the label correctly matches the language of the text.
  2. Ensure Dataset Quality:

    • ✅ Avoid duplicates, sensitive information, or offensive content.
    • 🌍 Include diverse content (formal, informal, conversational).
  3. Submission Guidelines:

    • Update the corresponding JSON file in datasets/language_detection/.
    • If submitting multiple samples, ensure they are formatted correctly and well-organised.
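
The guidelines above (no duplicates, correctly formatted submissions) could be followed with a small helper like the one below. This is only a sketch under stated assumptions: the helper names `normalise`, `merge_samples`, and `save_entries` are hypothetical, duplicates are detected by comparing text after collapsing whitespace and ignoring case, and the 4-space indent simply mirrors the example entry in this issue. The second sample sentence is illustrative and not taken from the Padie dataset.

```python
import json
from pathlib import Path


def normalise(text):
    """Collapse runs of whitespace and ignore case for duplicate checks."""
    return " ".join(text.split()).casefold()


def merge_samples(existing, new_samples):
    """Return `existing` plus every new sample whose text is not already present."""
    seen = {normalise(entry["text"]) for entry in existing}
    merged = list(existing)
    for sample in new_samples:
        key = normalise(sample["text"])
        if key not in seen:
            seen.add(key)
            merged.append(sample)
    return merged


def save_entries(path, entries):
    """Write entries as a UTF-8 JSON array.

    ensure_ascii=False keeps diacritics (common in Yoruba and Igbo text)
    readable in the file instead of escaping them as \\uXXXX sequences.
    """
    payload = json.dumps(entries, ensure_ascii=False, indent=4)
    Path(path).write_text(payload + "\n", encoding="utf-8")


existing = [{"text": "Wetin dey happen for here?", "label": "pidgin"}]
new_samples = [
    {"text": "wetin dey  happen for here?", "label": "pidgin"},  # duplicate
    {"text": "How you dey?", "label": "pidgin"},                 # illustrative
]
merged = merge_samples(existing, new_samples)
print(len(merged))
```

The merged list could then be written back with `save_entries("datasets/language_detection/pidgin.json", merged)`.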

Call to Action

🚀 Contribute Today!
Submit your changes as a pull request or share your dataset here if you're unable to format it. Let’s collaborate to build the best language detection model for Padie! Thank you.

Let’s make Padie exceptional! 🌟
