- Figure out PDF metadata that could be useful
- Find tools that can extract coordinates & font styles from PDF
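Tools that expose coordinates typically return word boxes; a minimal sketch of one use for them, assuming word dicts shaped like pdfplumber's `extract_words()` output (the 3-pt vertical tolerance is an assumption):

```python
def group_words_into_lines(words, y_tolerance=3):
    """Group word boxes into visual lines: words whose 'top' coordinates
    differ by at most y_tolerance are treated as the same line."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(lines[-1][0]["top"] - word["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Join each line's words left-to-right into a text string
    return [" ".join(w["text"] for w in line) for line in lines]

words = [
    {"text": "John", "x0": 10, "top": 50},
    {"text": "Doe", "x0": 60, "top": 51},
    {"text": "Software", "x0": 10, "top": 70},
    {"text": "Engineer", "x0": 80, "top": 70},
]
print(group_words_into_lines(words))  # ['John Doe', 'Software Engineer']
```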
- Test existing parsers
- Tested 3 tools
- Commercial (https://affinda.com/resume-parser/) (https://labs.hrflow.ai/profile-analysis/parsing/)
- Free (https://demos.pragnakalp.com/resume-parser/)
- None of them is perfect (all still failed to parse some info)
- Study Named Entity Recognition (NER) [Completed]
- Because there are many tagging schemes
- BIO, BIOSE, IOB, BILOU, BMEWO give different performance
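To see why the scheme choice matters, here is a sketch of how the same entity span is tagged under BIO vs BILOU (the `EDU` label is illustrative):

```python
def tag_span(n_tokens, label, scheme="BIO"):
    """Tag an n-token entity span under the given scheme.
    BIO marks Begin/Inside tokens; BILOU additionally distinguishes
    Last tokens and Unit-length (single-token) entities."""
    if scheme == "BIO":
        return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 1)
    if scheme == "BILOU":
        if n_tokens == 1:
            return [f"U-{label}"]
        return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 2) + [f"L-{label}"]
    raise ValueError(f"unknown scheme: {scheme}")

print(tag_span(3, "EDU", "BIO"))    # ['B-EDU', 'I-EDU', 'I-EDU']
print(tag_span(3, "EDU", "BILOU"))  # ['B-EDU', 'I-EDU', 'L-EDU']
print(tag_span(1, "EDU", "BILOU"))  # ['U-EDU']
```

The extra boundary information in BILOU gives the model more signal, which is one reason different schemes yield different performance.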
- Text annotation tools [On-going]
- Doccano (*make a comparison)
- Brat
- Proposed pipeline to mentor
- How to compare our parser performance with the existing ones? (Quantitative)
- How to test parser result accuracy?
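One common quantitative answer to both questions: compare each parser's extracted fields against a hand-labeled gold set and report precision/recall/F1. A minimal sketch (the field names are assumptions):

```python
def field_prf(predicted, gold):
    """Precision, recall and F1 over extracted (field, value) pairs,
    comparing a parser's output against hand-labeled ground truth."""
    pred, true = set(predicted.items()), set(gold.items())
    tp = len(pred & true)  # pairs the parser got exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"name": "John Doe", "email": "john@example.com", "degree": "BSc"}
pred = {"name": "John Doe", "email": "john@example.com", "degree": "MSc"}
print(field_prf(pred, gold))  # each metric is 2/3 here
```

Running the same gold set through each commercial and free tool gives directly comparable numbers.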
- Development (Prove proposed pipeline)
- Classify 500 resumes into "scanned images" and "contains text" folders
- Will abandon scanned-image PDFs for now
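The routing step above can be sketched as follows; `extract_text` stands in for whatever extraction library is chosen (e.g. pdfplumber or pdfminer), and the 50-character threshold is an assumption:

```python
def classify_pdf(path, extract_text, min_chars=50):
    """Route a PDF into 'contains_text' or 'scanned_images' depending on
    whether the extractor recovers a meaningful amount of text."""
    text = extract_text(path)
    return "contains_text" if len(text.strip()) >= min_chars else "scanned_images"

# Stub extractors simulating a text-based and a scanned (image-only) PDF
print(classify_pdf("a.pdf", lambda p: "John Doe\nSoftware Engineer\n" * 5))  # contains_text
print(classify_pdf("b.pdf", lambda p: ""))                                   # scanned_images
```

Scanned-image PDFs yield little or no extractable text, which is what makes this simple heuristic workable.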
- Research on GROBID (an extraction model for scholarly articles) [Deprecated: less useful]
- Proposed a new pipeline & received feedback (mentor suggested researching deep learning further)
- Comparing annotation tools (to annotate text lines): Doccano vs Label Studio
- Compared Label Studio with Doccano
- Label Studio cannot perform multi-label per line (TBC)
- Label Studio's output is a bit cluttered
- Doccano supports multi-label & exports annotated text in JSON & CSV formats
- If using Doccano, the input file must contain all resumes' text; separate files are not supported
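Since Doccano expects a single input file rather than one file per resume, the texts can be merged into one JSONL file before import (one JSON object per line; the `meta` field here is an assumption, used to keep track of the source file):

```python
import json

def build_doccano_jsonl(resume_texts):
    """Merge many resume texts into one JSONL string for Doccano import.
    Each line is a JSON object holding the raw text plus source metadata."""
    lines = [
        json.dumps({"text": text, "meta": {"source": name}}, ensure_ascii=False)
        for name, text in resume_texts.items()
    ]
    return "\n".join(lines)

resumes = {
    "resume_001.txt": "John Doe\nSoftware Engineer",
    "resume_002.txt": "Jane Lee\nData Analyst",
}
jsonl = build_doccano_jsonl(resumes)
print(jsonl)
```

Keeping the source filename in `meta` makes it possible to split the annotations back into per-resume records after export.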
- Decide annotation labels
- 6 classes (education, ...)
- 3 general (header, content, other)
- Consider adding sentence writing styles (simple, key-value, complex)
- Brainstorm how to use the annotated text
- ???
- Study the Machine Learning course by Andrew Ng (neural network part only)
- Refine the resume tags
- Improve the literature review of 10 papers
- Annotate 100 resumes using Doccano
- Back to stage 1: segregate all PDFs that contain tables using Camelot (in progress)
- Extracted 100 resumes
- Annotate 100 resumes using Doccano
- Annotate 100 resumes using Google Sheets
- Built CNN model for sentiment analysis
- Annotate 100 resumes using Google Sheets (50% done)
- Load resume data into the CNN model
- Mainly to find the root cause of the low test accuracy
- Check model accuracy when reducing the number of targeted classes
- Assumption: a model with fewer classes will have higher accuracy
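The experiment can reuse the same annotations by collapsing the fine-grained labels into coarser classes before training; the mapping below is illustrative, based on the 6-class + 3-general scheme:

```python
# Illustrative mapping from fine-grained line labels to coarser classes;
# the exact label names are assumptions based on the 6 + 3 class scheme.
COARSE_MAP = {
    "education": "content", "experience": "content", "skills": "content",
    "header": "header", "other": "other",
}

def collapse_labels(labels, mapping):
    """Map each fine-grained label to its coarse class (default: 'other')."""
    return [mapping.get(label, "other") for label in labels]

print(collapse_labels(["education", "header", "skills"], COARSE_MAP))
# ['content', 'header', 'content']
```

Training once on the original labels and once on the collapsed ones then tests the fewer-classes assumption directly, without re-annotating anything.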
- Improve the classification report presentation for better visualization
- Result of previous experiment: the model is overfitting
- Prepare annotation guidelines
- Rectified previous annotations
- Literature review on word embeddings
- Adapt different word embedding to current CNN model
- Results show FastText performs best
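Swapping embeddings into the CNN typically means rebuilding the embedding matrix from each pretrained vector table; a minimal pure-Python sketch, where the 3-dimensional toy vectors stand in for real FastText vectors:

```python
def build_embedding_matrix(vocab, vectors, dim):
    """Build one embedding row per vocabulary word from a pretrained
    vector lookup; out-of-vocabulary words get a zero vector."""
    return [vectors.get(word, [0.0] * dim) for word in vocab]

# Toy 3-dimensional vectors standing in for pretrained FastText embeddings
vectors = {"python": [0.1, 0.2, 0.3], "java": [0.4, 0.5, 0.6]}
vocab = ["python", "java", "cobol"]
matrix = build_embedding_matrix(vocab, vectors, dim=3)
print(matrix)  # 'cobol' is out of vocabulary, so it gets the zero vector
```

Note the zero-vector fallback is a simplification: real FastText can compose vectors for out-of-vocabulary words from character n-grams, which is one reason it can outperform word-level embeddings on noisy resume text.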
- Find out the bottom-line accuracy of the current CNN model
- Try out different NN models (deeper NNs)
- By the end of the week, should push accuracy to 95%
- Improve preprocessing pipeline
- Declutter embedding corpus