
How to improve accuracy when classifying short text with little context #558

29swastik opened this issue Sep 20, 2024 · 1 comment

29swastik commented Sep 20, 2024

Hi, my use case is to classify job titles into functional areas. I fine-tuned all-mpnet-base-v2 with SetFit, providing 10+ examples for each class (functional area).

I got 82% accuracy when running the evaluation on my test set, but I observed that some simple and straightforward job titles are classified with the wrong label at around a 0.6 confidence score.

For example:

```
Query: SDET
Predicted Label: Big Data / DWH / ETL
Confidence Scores:
Label: Accounting / Finance, Confidence: 0.0111
Label: Backend Development, Confidence: 0.0140
Label: Big Data / DWH / ETL, Confidence: 0.6092
```

Here SDET should have been labelled QA / SDET, but it was classified as Big Data / DWH / ETL with a ~0.61 score. The few-shot examples used for the two classes don't have anything in common that could confuse the model, except for one example titled Data Quality Engineer, which sits under Big Data / DWH / ETL.
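For reference, per-class scores like the ones above can be produced with SetFit's `predict_proba`; a minimal sketch, where the checkpoint path is a placeholder and it is assumed the model stores its class names in `model.labels`:

```python
from setfit import SetFitModel

# Placeholder path to the fine-tuned checkpoint.
model = SetFitModel.from_pretrained("path/to/finetuned-setfit-model")

# predict_proba returns one probability per class for each input text.
probs = model.predict_proba(["SDET"])[0]
for label, p in zip(model.labels, probs):
    print(f"Label: {label}, Confidence: {float(p):.4f}")
```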

Few-shot examples (shown for only 2 classes here):

{    "QA / SDET": [
        "Quality Assurance Engineer",
        "Software Development Engineer in Test (SDET)",
        "QA Automation Engineer",
        "Test Engineer",
        "QA Analyst",
        "Manual Tester",
        "Automation Tester",
        "Performance Test Engineer",
        "Security Test Engineer",
        "Mobile QA Engineer",
        "API Tester",
        "Load & Stress Test Engineer",
        "Senior QA Engineer",
        "Test Automation Architect",
        "QA Lead",
        "QA Manager",
        "End-to-End Tester",
        "Game QA Tester",
        "UI/UX Tester",
        "Integration Test Engineer",
        "Quality Control Engineer",
        "Test Data Engineer",
        "DevOps QA Engineer",
        "Continuous Integration (CI) Tester",
        "Software Test Consultant"
    ],
    
    "Big Data / DWH / ETL": [
        "Big Data Engineer",
        "Data Warehouse Developer",
        "ETL Developer",
        "Hadoop Developer",
        "Spark Developer",
        "Data Engineer",
        "Data Integration Specialist",
        "Data Pipeline Engineer",
        "Data Architect",
        "Database Administrator",
        "ETL Architect",
        "Data Lake Engineer",
        "Informatica Developer",
        "DataOps Engineer",
        "BI Developer",
        "Data Migration Specialist",
        "Data Warehouse Architect",
        "ETL Tester",
        "Big Data Platform Engineer",
        "Apache Kafka Engineer",
        "Snowflake Developer",
        "Data Quality Engineer",
        "Data Ingestion Engineer",
        "Big Data Consultant",
        "ETL Manager"
    ]
}

TrainingArguments:

```python
from setfit import TrainingArguments

args = TrainingArguments(
    batch_size=16,
    num_epochs=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
```
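For context, here is a minimal sketch of how these arguments plug into a SetFit run; the dataset construction, the train/eval split, and the `few_shot_examples` variable are illustrative assumptions, not the exact training script:

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer

# Few-shot dict as in the JSON above (truncated here for brevity).
few_shot_examples = {
    "QA / SDET": ["Quality Assurance Engineer", "QA Automation Engineer"],
    "Big Data / DWH / ETL": ["Big Data Engineer", "ETL Developer"],
}
rows = [{"text": title, "label": label}
        for label, titles in few_shot_examples.items()
        for title in titles]
ds = Dataset.from_list(rows).train_test_split(test_size=0.25, seed=42)

model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

trainer = Trainer(
    model=model,
    args=args,                # the TrainingArguments shown above
    train_dataset=ds["train"],
    eval_dataset=ds["test"],  # needed for evaluation_strategy / load_best_model_at_end
)
trainer.train()
print(trainer.evaluate(ds["test"]))
```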

Here is the complete set of functional areas.

```python
functional_areas = [
    "Accounting / Finance",
    "Backend Development",
    "Big Data / DWH / ETL",
    "Brand Management",
    "Content Writing",
    "Customer Service",
    "Data Analysis / Business Intelligence",
    "Data Science / Machine Learning",
    "Database Admin / Development",
    "DevOps / Cloud",
    "Embedded / Kernel Development",
    "Event Management",
    "Frontend Development",
    "Full-Stack Development",
    "Functional / Technical Consulting",
    "General Management / Strategy",
    "IT Management / IT Support",
    "IT Security",
    "Mobile Development",
    "Network Administration",
    "Online Marketing",
    "Operations Management",
    "PR / Communications",
    "QA / SDET",
    "SEO / SEM",
    "Sales / Business Development"
]
```

My guess is that accuracy is low because the input text is so short (just a job title). Please suggest a few things I can try to improve the accuracy of the model.


lsiarov commented Nov 11, 2024

Not sure if you figured this out already, but let me drop a comment in case someone else has the same problem.

It's very important to think about what your model is working on, which is essentially a vector representation of your text (the job title) that is then transformed into a prediction by a prediction model (the head). I had to look up SDET, and a non-specialised model will almost certainly have a poor representation of that title. You would therefore need a lot of examples to classify it correctly, and even then another very specific term is as likely as not to occupy the same region of the embedding space, so your fine-tuning goes to waste. You can find the datasets this model was trained on at https://huggingface.co/sentence-transformers/all-mpnet-base-v2 (and you can see that the vast majority do not relate to your domain at all).
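One way to sanity-check this is to look at where the base encoder places the query relative to a few exemplar titles before any fine-tuning. A minimal sketch (the exemplar titles are taken from the few-shot examples above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

query = "SDET"
exemplars = {
    "QA / SDET": "Software Development Engineer in Test (SDET)",
    "Big Data / DWH / ETL": "Data Quality Engineer",
}

# Cosine similarity between the query and each exemplar embedding.
q_emb = model.encode(query, convert_to_tensor=True)
for label, title in exemplars.items():
    t_emb = model.encode(title, convert_to_tensor=True)
    print(f"{label}: cos_sim({query!r}, {title!r}) = "
          f"{util.cos_sim(q_emb, t_emb).item():.3f}")
```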

Things you can do to improve this relatively painlessly (as in, there are multiple tutorials on these):

  • Fine-tune your embedding model on software job descriptions that are likely to contain your specific terms; this is probably the best way forward. Alternatively, use a model already pre-trained for your domain.
  • Use a different embedding model, e.g. a count vectorizer (over words or even n-grams); semantic embeddings are not always the best way forward, especially if your titles are quite structured (see the sketch below). You can also combine both approaches in various interesting ways.
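For the second bullet, a character n-gram baseline might look like the sketch below; the hyperparameters are arbitrary and the tiny training set is just a handful of the few-shot examples from the issue:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Character n-grams pick up acronyms like "SDET" that a word-level
# semantic embedding may represent poorly.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)

titles = [
    "Quality Assurance Engineer", "Software Development Engineer in Test (SDET)",
    "Test Engineer", "Big Data Engineer", "ETL Developer", "Data Engineer",
]
labels = ["QA / SDET"] * 3 + ["Big Data / DWH / ETL"] * 3
clf.fit(titles, labels)

# "SDET" shares many character n-grams with the spelled-out SDET title,
# so this should likely recover "QA / SDET".
print(clf.predict(["SDET"]))
```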

Whether adding more labelled examples actually helps you is an open question.

A more involved approach would be to restate the objective (i.e. change the model), for example by feeding it more information than just the job title.

Hopefully this helps a little.
