- Figure out PDF metadata that could be useful
- Find tools that can extract coordinates & font styles from PDF
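Tools that expose coordinates typically return word boxes; a minimal sketch of one use for them, assuming word dicts shaped like pdfplumber's `extract_words()` output (the 3-pt vertical tolerance is an assumption):

```python
def group_words_into_lines(words, y_tolerance=3):
    """Group word boxes into visual lines: words whose 'top' coordinates
    differ by at most y_tolerance are treated as the same line."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(lines[-1][0]["top"] - word["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Join each line's words left-to-right into a text string
    return [" ".join(w["text"] for w in line) for line in lines]

words = [
    {"text": "John", "x0": 10, "top": 50},
    {"text": "Doe", "x0": 60, "top": 51},
    {"text": "Software", "x0": 10, "top": 70},
    {"text": "Engineer", "x0": 80, "top": 70},
]
print(group_words_into_lines(words))  # ['John Doe', 'Software Engineer']
```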
- Test existing parsers
- Tested 3 tools
- Commercial (https://affinda.com/resume-parser/) (https://labs.hrflow.ai/profile-analysis/parsing/)
- Free (https://demos.pragnakalp.com/resume-parser/)
- None of them is perfect (all still failed to parse some info)
- Study Named Entity Recognition (NER) [Completed]
- Because there are many tagging schemes
- BIO, BIOSE, IOB, BILOU, BMEWO give different performance
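To see why the scheme choice matters, here is a sketch of how the same entity span is tagged under BIO vs BILOU (the `EDU` label is illustrative):

```python
def tag_span(n_tokens, label, scheme="BIO"):
    """Tag an n-token entity span under the given scheme.
    BIO marks Begin/Inside tokens; BILOU additionally distinguishes
    Last tokens and Unit-length (single-token) entities."""
    if scheme == "BIO":
        return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 1)
    if scheme == "BILOU":
        if n_tokens == 1:
            return [f"U-{label}"]
        return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 2) + [f"L-{label}"]
    raise ValueError(f"unknown scheme: {scheme}")

print(tag_span(3, "EDU", "BIO"))    # ['B-EDU', 'I-EDU', 'I-EDU']
print(tag_span(3, "EDU", "BILOU"))  # ['B-EDU', 'I-EDU', 'L-EDU']
print(tag_span(1, "EDU", "BILOU"))  # ['U-EDU']
```

The extra boundary information in BILOU gives the model more signal, which is one reason different schemes yield different performance.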
- Text annotation tools [On-going]
- Doccano (*make a comparison)
- Brat
- Proposed pipeline to mentor
- How to compare our parser performance with the existing ones? (Quantitative)
- How to test parser result accuracy?
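One common quantitative answer to both questions: compare each parser's extracted fields against a hand-labeled gold set and report precision/recall/F1. A minimal sketch (the field names are assumptions):

```python
def field_prf(predicted, gold):
    """Precision, recall and F1 over extracted (field, value) pairs,
    comparing a parser's output against hand-labeled ground truth."""
    pred, true = set(predicted.items()), set(gold.items())
    tp = len(pred & true)  # pairs the parser got exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"name": "John Doe", "email": "john@example.com", "degree": "BSc"}
pred = {"name": "John Doe", "email": "john@example.com", "degree": "MSc"}
print(field_prf(pred, gold))  # each metric is 2/3 here
```

Running the same gold set through each commercial and free tool gives directly comparable numbers.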
- Development (Prove proposed pipeline)
- Classify 500 resumes into "scanned images" and "contains text" folders
- Will abandon scanned-image PDFs for now
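The routing step above can be sketched as follows; `extract_text` stands in for whatever extraction library is chosen (e.g. pdfplumber or pdfminer), and the 50-character threshold is an assumption:

```python
def classify_pdf(path, extract_text, min_chars=50):
    """Route a PDF into 'contains_text' or 'scanned_images' depending on
    whether the extractor recovers a meaningful amount of text."""
    text = extract_text(path)
    return "contains_text" if len(text.strip()) >= min_chars else "scanned_images"

# Stub extractors simulating a text-based and a scanned (image-only) PDF
print(classify_pdf("a.pdf", lambda p: "John Doe\nSoftware Engineer\n" * 5))  # contains_text
print(classify_pdf("b.pdf", lambda p: ""))                                   # scanned_images
```

Scanned-image PDFs yield little or no extractable text, which is what makes this simple heuristic workable.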
- Research on GROBID (an extraction model for scholarly articles) [Deprecated: less useful]
- Proposed a new pipeline & received feedback (mentor suggested researching deep learning further)
- Comparing annotation tools (to annotate text lines): Doccano vs Label Studio
- Compared Label Studio with Doccano
- Label Studio cannot perform multi-label per line (TBC)
- Label Studio's output is a bit cluttered
- Doccano supports multi-label & exports annotated text in JSON & CSV formats
- If using Doccano, the input file must contain all resumes' text; separate files are not supported
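Since Doccano expects a single input file rather than one file per resume, the texts can be merged into one JSONL file before import (one JSON object per line; the `meta` field here is an assumption, used to keep track of the source file):

```python
import json

def build_doccano_jsonl(resume_texts):
    """Merge many resume texts into one JSONL string for Doccano import.
    Each line is a JSON object holding the raw text plus source metadata."""
    lines = [
        json.dumps({"text": text, "meta": {"source": name}}, ensure_ascii=False)
        for name, text in resume_texts.items()
    ]
    return "\n".join(lines)

resumes = {
    "resume_001.txt": "John Doe\nSoftware Engineer",
    "resume_002.txt": "Jane Lee\nData Analyst",
}
jsonl = build_doccano_jsonl(resumes)
print(jsonl)
```

Keeping the source filename in `meta` makes it possible to split the annotations back into per-resume records after export.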
- Decide annotation labels
- 6 classes (education, ...)
- 3 general (header, content, other)
- Consider adding sentence writing styles (simple, key-value, complex)
- Brainstorm how to use the annotated text
- ???
- Study the Machine Learning course by Andrew Ng (neural network part only)
- Refine the resume tags
- Improve the literature review of 10 papers
- Annotate 100 resumes using Doccano
- Back to stage 1: segregate all PDFs that contain tables using Camelot (in progress)
- Extracted 100 resumes
- Annotate 100 resumes using Doccano
- Annotate 100 resumes using Google Sheets
- Built CNN model for sentiment analysis
- Annotate 100 resumes using Google Sheets (50% done)
- Load resume data into the CNN model
- Mainly to find the root cause of the low test accuracy
- Check model accuracy when reducing the number of targeted classes
- Assumption: a model with fewer classes will have higher accuracy
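The experiment can reuse the same annotations by collapsing the fine-grained labels into coarser classes before training; the mapping below is illustrative, based on the 6-class + 3-general scheme:

```python
# Illustrative mapping from fine-grained line labels to coarser classes;
# the exact label names are assumptions based on the 6 + 3 class scheme.
COARSE_MAP = {
    "education": "content", "experience": "content", "skills": "content",
    "header": "header", "other": "other",
}

def collapse_labels(labels, mapping):
    """Map each fine-grained label to its coarse class (default: 'other')."""
    return [mapping.get(label, "other") for label in labels]

print(collapse_labels(["education", "header", "skills"], COARSE_MAP))
# ['content', 'header', 'content']
```

Training once on the original labels and once on the collapsed ones then tests the fewer-classes assumption directly, without re-annotating anything.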
- Improve the classification report presentation for better visualization
- Result of previous experiment: the model is overfitting
- Prepare annotation guidelines
- Rectified previous annotations
- Literature review on word embeddings
- Adapt different word embedding to current CNN model
- Results show FastText performs best
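Swapping embeddings into the CNN typically means rebuilding the embedding matrix from each pretrained vector table; a minimal pure-Python sketch, where the 3-dimensional toy vectors stand in for real FastText vectors:

```python
def build_embedding_matrix(vocab, vectors, dim):
    """Build one embedding row per vocabulary word from a pretrained
    vector lookup; out-of-vocabulary words get a zero vector."""
    return [vectors.get(word, [0.0] * dim) for word in vocab]

# Toy 3-dimensional vectors standing in for pretrained FastText embeddings
vectors = {"python": [0.1, 0.2, 0.3], "java": [0.4, 0.5, 0.6]}
vocab = ["python", "java", "cobol"]
matrix = build_embedding_matrix(vocab, vectors, dim=3)
print(matrix)  # 'cobol' is out of vocabulary, so it gets the zero vector
```

Note the zero-vector fallback is a simplification: real FastText can compose vectors for out-of-vocabulary words from character n-grams, which is one reason it can outperform word-level embeddings on noisy resume text.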
- Find out the bottom-line accuracy of the current CNN model
- Try out different NN models (deeper NNs)
- By the end of the week, should push accuracy to 95%
- Improve preprocessing pipeline
- Declutter embedding corpus