Quoted from Brian Spiering:
A deep learning parser should be sequence-based (e.g., a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) network).
To apply deep learning, you'll need many thousands of examples with each section labeled.
There is HR-XML (Human Resources Extensible Markup Language), which is the industry standard for labeling resume sections.
HR-XML: https://schemas.liquid-technologies.com/hr-xml/2007-04-15/?page=http___ns_hr-xml_org_2007-04-15.html
Question: https://datascience.stackexchange.com/questions/71372/how-to-approach-deep-learning-cv-resume-parser-using-convolutions
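To make "each section labeled" concrete, here is a hypothetical BIO-style labeling of resume tokens. The tag names are illustrative only and would need to be mapped to the HR-XML labels mentioned above:

```python
# Hypothetical BIO-tagged training example for resume section labeling.
# Tag names are made up; a real dataset could map them to HR-XML labels.
tokens = ["John", "Doe", "Experience", "Software", "Engineer", "at", "Acme"]
labels = ["B-NAME", "I-NAME", "B-SECTION", "B-TITLE", "I-TITLE", "O", "B-ORG"]

training_example = list(zip(tokens, labels))
print(training_example)
```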
Used stacked bidirectional GRU/LSTM recurrent layers.
This approach was also compared with convolutional layers, which are generally faster than recurrent layers, but the recurrent layers showed better accuracy.
We experimented with different vector embeddings, including fastText and custom Word2vec. This helped us significantly reduce the following:
- Dependency on the amount of labeled data
- Time needed for model training
- Used several unsupervised approaches for text clustering by topics.
- In particular, we used BigARTM for this because it showed better performance and accuracy than other libraries such as Gensim.
BroutonLab AI parser: https://broutonlab.com/broutonlab-data-science-success-stories/ai-nlp-for-resume-parsing-and-job-matching
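A minimal sketch of training the custom Word2vec embeddings the answer mentions, using Gensim; the corpus and hyperparameters below are placeholders:

```python
from gensim.models import Word2Vec

# Placeholder corpus: each resume is pre-tokenized into a list of words.
corpus = [
    ["experienced", "python", "developer", "with", "nlp", "skills"],
    ["data", "scientist", "skilled", "in", "machine", "learning"],
]

# Train custom embeddings (Gensim 4.x API); sizes are illustrative.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)

vector = model.wv["python"]                        # 100-dim embedding for a token
similar = model.wv.most_similar("python", topn=3)  # nearest neighbors in vector space
```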
https://www.listendata.com/2018/05/named-entity-recognition-using-python.html
https://apilayer.com/marketplace/description/skills-api
- Quite simple & straight forward
Field | Method |
---|---|
Name | NLTK |
Skills | Corpus/API with the help of N-grams
Institution | Find an institution corpus & train an NER model using spaCy
https://blog.apilayer.com/build-your-own-resume-parser-using-python-and-nlp/
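A rough sketch of the table above: NLTK's off-the-shelf NE chunker for the name, plus uni/bi-gram lookup against a skills corpus. The resume text and skills set here are made up:

```python
import nltk
from nltk.util import ngrams

# One-time downloads needed: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words
text = "John Doe is a data scientist skilled in machine learning and Python."

tokens = nltk.word_tokenize(text)
tree = nltk.ne_chunk(nltk.pos_tag(tokens))

# Name: take PERSON chunks from NLTK's NE chunker.
names = [" ".join(word for word, _ in t.leaves())
         for t in tree.subtrees(lambda t: t.label() == "PERSON")]

# Skills: match n-grams against a (made-up) skills corpus.
skills_corpus = {"python", "machine learning", "data analysis"}
words = [w.lower() for w in tokens]
found = {" ".join(g) for n in (1, 2) for g in ngrams(words, n)
         if " ".join(g) in skills_corpus}

print(names, found)
```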
- Returns output in JSON format with 7 main keys (headers)
Personal info | My comments |
---|---|
Name | Accurate because it's simple
Email, Phone | Accurate because it's simple (regex; see the sketch after this block)
Home Address | Segmented into designated fields. Detection of country, state & postcode is accurate. Data enrichment is good (e.g., converts Pulau Pinang to Penang)
URL | Able to differentiate social media domains such as LinkedIn
Gender | Gender prediction using the picture provided
Educations | My comments |
---|---|
Title | Accurate. E.g., Master of Science (MSc)
Institution name | Can detect non-English names
Course titles | Accurate
Start & end dates | Inaccurate. Why?
Other headers | My comments |
---|---|
Working experiences | Can't even process 'Research Fellow' or 'tutor'. |
Skills | Accurate |
Languages | Accurate |
Tasks | Less useful |
Attachments | Less useful |
https://labs.hrflow.ai/profile-analysis/parsing/
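The email/phone rows above amount to a couple of regexes. A minimal sketch; the patterns are simplified and will miss edge cases:

```python
import re

# Simplified patterns; real resumes need more robust variants.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

text = "Contact: jane.doe@example.com, +60 12-345 6789"
print(EMAIL_RE.findall(text))  # ['jane.doe@example.com']
print(PHONE_RE.findall(text))  # ['+60 12-345 6789']
```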
- Collected 420 resumes
- Annotated them manually using Doccano
- Split into 80% (training) / 20% (testing)
- Developed the model using spaCy
- Because spaCy offers higher speed & accuracy
- Can refer to the spaCy architecture
- Trained the model (see the training sketch after this block)
- Used techniques like dropout & shuffling the data after each iteration
- Evaluated the model
- Used metrics such as accuracy score, precision, recall, and F-score
https://www.kharpann.com/portfolio/named-entity-recognition-from-resumes/
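A minimal sketch of the spaCy training loop described above (spaCy 3.x API), with dropout and shuffling after each iteration; the training data is a placeholder:

```python
import random
import spacy
from spacy.training import Example

# Placeholder annotations in spaCy's (text, {"entities": [...]}) format.
TRAIN_DATA = [
    ("John Doe studied at MIT",
     {"entities": [(0, 8, "NAME"), (20, 23, "INSTITUTION")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for start, end, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)  # shuffle data after each iteration
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], drop=0.35, sgd=optimizer, losses=losses)  # dropout

# Precision/recall/F-score on held-out examples:
# scores = nlp.evaluate(dev_examples)  # returns ents_p, ents_r, ents_f
```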
- Can capture previous & current words but NOT the forward (future) words
- Need to do extra feature engineering
- Evaluate performance using the F1 score (to get a balance between precision & recall)
- Use a Bi-LSTM to take past & future info into account (i.e., one LSTM runs LEFT to RIGHT, another runs RIGHT to LEFT); see the sketch after the list below
Bi-directional architectures
- Bi-LSTM-CRF
- Bi-LSTM-CNN
- LSTM-CNN-CRF
- ELMo
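A bare-bones Keras sketch of the Bi-LSTM idea above (token-level tagging with one left-to-right and one right-to-left LSTM). Vocabulary/tag/sequence sizes are made up, and the CRF/CNN layers from the list are omitted:

```python
from tensorflow.keras import layers, models

n_words, n_tags, max_len = 5000, 12, 75  # hypothetical sizes

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(input_dim=n_words, output_dim=64),
    # Bidirectional wrapper: one LSTM reads left-to-right, the other right-to-left
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # One tag prediction per token
    layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```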
- Input a JPG to export the layout
- Able to detect headers
- Can look at the code & the terminologies used
- *Can do more research on DLA (Document Layout Analysis)
https://huggingface.co/spaces/nielsr/dit-document-layout-analysis
- Can take it as a reference since the concept is similar
- Try installing it & uploading a resume (a request sketch follows below)
https://github.com/kermitt2/grobid
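Assuming a local GROBID server (e.g., started from its Docker image on port 8070), a try-it sketch could look like this; note GROBID targets scholarly PDFs, so output quality on resumes is unverified:

```python
import requests

# Assumes GROBID is running locally on port 8070.
with open("resume.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": f},
    )
print(resp.text[:500])  # TEI XML describing the document structure
```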
- pdfplumber seems to outperform Apache Tika. (Not concrete! Needs more research)
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
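For reference, a minimal pdfplumber extraction sketch (the file name is a placeholder):

```python
import pdfplumber

with pdfplumber.open("resume.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
print(text[:300])
```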
- Can upload annotated sentences here
- Visualize how near the sentences are to each other
https://projector.tensorflow.org/
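The projector accepts tab-separated vector and metadata files; a sketch of exporting hypothetical sentence embeddings for upload:

```python
import numpy as np

# Hypothetical: `embeddings` is (n_sentences, dim); `sentences` are their labels.
embeddings = np.random.rand(3, 128)
sentences = ["education section", "work experience", "skills list"]

np.savetxt("vectors.tsv", embeddings, delimiter="\t")
with open("metadata.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))
```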
- The text output from Tika contains line breaks
https://www.trainingdragon.co.uk/blog/how-to-remove-empty-lines-in-visual-studio-code
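Instead of cleaning in the editor, the empty lines from Tika's output can be dropped in Python; `raw_text` is a placeholder for the extracted text:

```python
raw_text = "John Doe\n\n\nEducation\n\nMIT\n"  # placeholder for Tika output

# Keep only non-blank lines.
clean_text = "\n".join(line for line in raw_text.splitlines() if line.strip())
print(clean_text)
```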
- Very complete NER model training process
https://developers.arcgis.com/python/guide/how-named-entity-recognition-works/
- The author explains reasons to use Label Studio and ways to deal with its constraints.
https://medium.com/@astha.agarwal/label-studio-data-collection-for-nlp-tasks-7592ad661e32
- How to train and extract custom named entities from your training data using spaCy and Python
https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7
- Approaches were published in a 2022 paper
- A GitHub link is provided
- Uses the Bi-LSTM method (layman's explanation in the paper)
- Layman's explanation with a super detailed demo
https://towardsdatascience.com/basics-of-countvectorizer-e26677900f9c
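A minimal CountVectorizer example along the lines of that post (the documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data scientist with python skills",
        "python developer with nlp experience"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform(docs)                # sparse count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```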
- Explanation + Codes
https://nanonets.com/blog/how-to-use-deep-learning-when-you-have-limited-data/
https://medium.com/analytics-vidhya/word-embeddings-in-nlp-word2vec-glove-fasttext-24d4d4286a73
- Covers the comparison between different vectorization techniques
- GloVe vs Word2Vec vs fastText
- And when to use each of them
- GloVe has been found to outperform other models on word analogy, word similarity, and Named Entity Recognition tasks, so if the nature of the problem you're trying to solve is similar to any of these, GloVe would be a smart choice.
https://neptune.ai/blog/vectorization-techniques-in-nlp-guide
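One practical difference worth noting: fastText composes vectors from character n-grams, so it can embed out-of-vocabulary words, unlike vanilla Word2Vec or GloVe. A quick Gensim sketch on a toy corpus:

```python
from gensim.models import FastText

corpus = [["python", "developer"], ["data", "scientist"]]
model = FastText(corpus, vector_size=50, window=3, min_count=1)

# fastText builds vectors from character n-grams, so even an unseen
# (out-of-vocabulary) token still gets an embedding:
vec = model.wv["pythonista"]  # works despite never appearing in the corpus
```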
What are some examples of data science techniques behind it and why is it hard?
- Some of the NLP techniques we use include word and sentence embedding models such as BERT and USE (Universal Sentence Encoder), which convert a piece of text into a numerical vector.
- Following this, we use the associated algorithms to compute distances between vectors (cosine similarity, Euclidean distance...).
- One of the various tasks we need to achieve is to calculate the 'distance' between two jobs. To do that, as mentioned, we use sentence embeddings, which represent any phrase as a vector of numbers that captures its semantic meaning (i.e., the sentence's meaning in context).
- From there, we can calculate the distance between two jobs by simply calculating the distance between their vector representations (see the sketch after this list).
- Another technique we use is artificial neural networks. The sentence embedding models we use rest on an advanced type of neural network architecture that was developed to capture the meaning of texts and was trained on all of Wikipedia's articles to do so.
- Neural networks offer several benefits:
- They model non-linear relations well, which suits language processing.
- They scale easily to accommodate training with very large amounts of data.
- They mimic the architecture of the brain, which so far performs very well on language.
- Other AI techniques we use include (non-exhaustive): feed-forward neural networks, linear regression, multivariate algorithms, clustering techniques, and topic modelling. Read the Boostrs white paper here.
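A sketch of the job-distance idea above using the sentence-transformers library; the library and model name are assumptions, since the source mentions BERT/USE but not a specific implementation:

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

jobs = ["Data Scientist building NLP models",
        "Machine Learning Engineer working on text pipelines"]
emb = model.encode(jobs)

# 'Distance' between two jobs = similarity of their vector representations.
print(util.cos_sim(emb[0], emb[1]))
```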