Quoted from Brian Spiering:
A deep learning parser should be sequence-based (e.g., a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) network).
To apply deep learning, you'll need many thousands of examples with each section labeled.
There is HR-XML (Human Resources Extensible Markup Language), which is the industry standard for labeling resume sections.
HR-XML: https://schemas.liquid-technologies.com/hr-xml/2007-04-15/?page=http___ns_hr-xml_org_2007-04-15.html
Question: https://datascience.stackexchange.com/questions/71372/how-to-approach-deep-learning-cv-resume-parser-using-convolutions
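To make "each section labeled" concrete, here is a hypothetical BIO-style labeling of resume tokens. The tag names are illustrative only and would need to be mapped to the HR-XML labels mentioned above:

```python
# Hypothetical BIO-tagged training example for resume section labeling.
# Tag names are made up; a real dataset could map them to HR-XML labels.
tokens = ["John", "Doe", "Experience", "Software", "Engineer", "at", "Acme"]
labels = ["B-NAME", "I-NAME", "B-SECTION", "B-TITLE", "I-TITLE", "O", "B-ORG"]

training_example = list(zip(tokens, labels))
print(training_example)
```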
Used stacked bidirectional GRU/LSTM recurrent layers.
This approach was also compared with convolutional layers, which are generally faster than recurrent layers, but the recurrent layers showed better accuracy.
We experimented with different vector embeddings, including fastText and custom Word2vec. This helped us significantly reduce the following:
- Dependency on the amount of labeled data
- Time needed for model training
- Used several unsupervised approaches for text clustering by topics.
- In particular, we used BigARTM for this because it showed better performance and accuracy than other libraries such as Gensim.
BroutonLab AI parser: https://broutonlab.com/broutonlab-data-science-success-stories/ai-nlp-for-resume-parsing-and-job-matching
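A minimal sketch of training the custom Word2vec embeddings the answer mentions, using Gensim; the corpus and hyperparameters below are placeholders:

```python
from gensim.models import Word2Vec

# Placeholder corpus: each resume is pre-tokenized into a list of words.
corpus = [
    ["experienced", "python", "developer", "with", "nlp", "skills"],
    ["data", "scientist", "skilled", "in", "machine", "learning"],
]

# Train custom embeddings (Gensim 4.x API); sizes are illustrative.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)

vector = model.wv["python"]                        # 100-dim embedding for a token
similar = model.wv.most_similar("python", topn=3)  # nearest neighbors in vector space
```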
https://www.listendata.com/2018/05/named-entity-recognition-using-python.html
https://apilayer.com/marketplace/description/skills-api
- Quite simple & straight forward
Field | Method |
---|---|
Name | NLTK |
Skills | Corpus/API with the help of N-grams
Institution | Find an institution corpus & train an NER model using spaCy
https://blog.apilayer.com/build-your-own-resume-parser-using-python-and-nlp/
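A rough sketch of the table above: NLTK's off-the-shelf NE chunker for the name, plus uni/bi-gram lookup against a skills corpus. The resume text and skills set here are made up:

```python
import nltk
from nltk.util import ngrams

# One-time downloads needed: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words
text = "John Doe is a data scientist skilled in machine learning and Python."

tokens = nltk.word_tokenize(text)
tree = nltk.ne_chunk(nltk.pos_tag(tokens))

# Name: take PERSON chunks from NLTK's NE chunker.
names = [" ".join(word for word, _ in t.leaves())
         for t in tree.subtrees(lambda t: t.label() == "PERSON")]

# Skills: match n-grams against a (made-up) skills corpus.
skills_corpus = {"python", "machine learning", "data analysis"}
words = [w.lower() for w in tokens]
found = {" ".join(g) for n in (1, 2) for g in ngrams(words, n)
         if " ".join(g) in skills_corpus}

print(names, found)
```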
- Returns output in JSON format with 7 main keys (headers)
Personal info | My comments |
---|---|
Name | Accurate because it's simple
Email, Phone | Accurate because it's simple (regex; see the sketch after this block)
Home Address | Segmented into designated fields. Detection of country, state & postcode is accurate. Data enrichment is good (e.g., converts Pulau Pinang to Penang)
URL | Able to differentiate social media domains such as LinkedIn
Gender | Gender prediction using the picture provided
Educations | My comments |
---|---|
Title | Accurate. E.g., Master of Science (MSc)
Institution name | Can detect non-English names
Course titles | Accurate
Start & end dates | Inaccurate. Why?
Other headers | My comments |
---|---|
Working experiences | Can't even process 'Research Fellow' or 'tutor'. |
Skills | Accurate |
Languages | Accurate |
Tasks | Less useful |
Attachments | Less useful |
https://labs.hrflow.ai/profile-analysis/parsing/
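The email/phone rows above amount to a couple of regexes. A minimal sketch; the patterns are simplified and will miss edge cases:

```python
import re

# Simplified patterns; real resumes need more robust variants.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

text = "Contact: jane.doe@example.com, +60 12-345 6789"
print(EMAIL_RE.findall(text))  # ['jane.doe@example.com']
print(PHONE_RE.findall(text))  # ['+60 12-345 6789']
```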
- Collected 420 resumes
- Annotated them manually using Doccano
- Split into 80% (training) / 20% (testing)
- Developed the model using spaCy
- Because spaCy offers higher speed & accuracy
- Can refer to the spaCy architecture
- Trained the model (see the training sketch after this block)
- Used techniques like dropout & shuffling the data after each iteration
- Evaluated the model
- Used metrics such as accuracy score, precision, recall, and F-score
https://www.kharpann.com/portfolio/named-entity-recognition-from-resumes/
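A minimal sketch of the spaCy training loop described above (spaCy 3.x API), with dropout and shuffling after each iteration; the training data is a placeholder:

```python
import random
import spacy
from spacy.training import Example

# Placeholder annotations in spaCy's (text, {"entities": [...]}) format.
TRAIN_DATA = [
    ("John Doe studied at MIT",
     {"entities": [(0, 8, "NAME"), (20, 23, "INSTITUTION")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for start, end, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)  # shuffle data after each iteration
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], drop=0.35, sgd=optimizer, losses=losses)  # dropout

# Precision/recall/F-score on held-out examples:
# scores = nlp.evaluate(dev_examples)  # returns ents_p, ents_r, ents_f
```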
- Can capture previous & current words but NOT the forward (future) words
- Need to do extra feature engineering
- Evaluate performance using the F1 score (to get a balance between precision & recall)
- Use a Bi-LSTM to take past & future info into account (i.e., one LSTM runs LEFT to RIGHT, another runs RIGHT to LEFT); see the sketch after the list below
Bi-directional architectures
- Bi-LSTM-CRF
- Bi-LSTM-CNN
- LSTM-CNN-CRF
- ELMo
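A bare-bones Keras sketch of the Bi-LSTM idea above (token-level tagging with one left-to-right and one right-to-left LSTM). Vocabulary/tag/sequence sizes are made up, and the CRF/CNN layers from the list are omitted:

```python
from tensorflow.keras import layers, models

n_words, n_tags, max_len = 5000, 12, 75  # hypothetical sizes

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(input_dim=n_words, output_dim=64),
    # Bidirectional wrapper: one LSTM reads left-to-right, the other right-to-left
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # One tag prediction per token
    layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```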
- Input a JPG to export the layout
- Able to detect headers
- Can look at the code & the terminologies used
- *Can do more research on DLA (Document Layout Analysis)
https://huggingface.co/spaces/nielsr/dit-document-layout-analysis
- Can take it as a reference since the concept is similar
- Try installing it & uploading a resume (a request sketch follows below)
https://github.com/kermitt2/grobid
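Assuming a local GROBID server (e.g., started from its Docker image on port 8070), a try-it sketch could look like this; note GROBID targets scholarly PDFs, so output quality on resumes is unverified:

```python
import requests

# Assumes GROBID is running locally on port 8070.
with open("resume.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": f},
    )
print(resp.text[:500])  # TEI XML describing the document structure
```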
- pdfplumber seems to outperform Apache Tika. (Not concrete! Needs more research)
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
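For reference, a minimal pdfplumber extraction sketch (the file name is a placeholder):

```python
import pdfplumber

with pdfplumber.open("resume.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
print(text[:300])
```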
- Can upload annotated sentences here
- Visualize how near the sentences are to each other
https://projector.tensorflow.org/
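The projector accepts tab-separated vector and metadata files; a sketch of exporting hypothetical sentence embeddings for upload:

```python
import numpy as np

# Hypothetical: `embeddings` is (n_sentences, dim); `sentences` are their labels.
embeddings = np.random.rand(3, 128)
sentences = ["education section", "work experience", "skills list"]

np.savetxt("vectors.tsv", embeddings, delimiter="\t")
with open("metadata.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))
```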
- The text output from Tika contains line breaks
https://www.trainingdragon.co.uk/blog/how-to-remove-empty-lines-in-visual-studio-code
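Instead of cleaning in the editor, the empty lines from Tika's output can be dropped in Python; `raw_text` is a placeholder for the extracted text:

```python
raw_text = "John Doe\n\n\nEducation\n\nMIT\n"  # placeholder for Tika output

# Keep only non-blank lines.
clean_text = "\n".join(line for line in raw_text.splitlines() if line.strip())
print(clean_text)
```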
- Very complete NER model training process
https://developers.arcgis.com/python/guide/how-named-entity-recognition-works/
- The author explains reasons to use Label Studio and ways to deal with its constraints.
https://medium.com/@astha.agarwal/label-studio-data-collection-for-nlp-tasks-7592ad661e32
- How to train and extract custom named entities from your training data using spaCy and Python
https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7
- Approaches were published in a 2022 paper
- A GitHub link is provided
- Uses the Bi-LSTM method (layman's explanation in the paper)
- Layman's explanation with a super detailed demo
https://towardsdatascience.com/basics-of-countvectorizer-e26677900f9c
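A minimal CountVectorizer example along the lines of that post (the documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data scientist with python skills",
        "python developer with nlp experience"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform(docs)                # sparse count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```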
- Explanation + Codes
https://nanonets.com/blog/how-to-use-deep-learning-when-you-have-limited-data/
https://medium.com/analytics-vidhya/word-embeddings-in-nlp-word2vec-glove-fasttext-24d4d4286a73
- Covers the comparison between different vectorization techniques
- GloVe vs Word2Vec vs fastText
- And when to use each of them
- GloVe has been found to outperform other models on word analogy, word similarity, and Named Entity Recognition tasks, so if the nature of the problem you're trying to solve is similar to any of these, GloVe would be a smart choice.
https://neptune.ai/blog/vectorization-techniques-in-nlp-guide
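One practical difference worth noting: fastText composes vectors from character n-grams, so it can embed out-of-vocabulary words, unlike vanilla Word2Vec or GloVe. A quick Gensim sketch on a toy corpus:

```python
from gensim.models import FastText

corpus = [["python", "developer"], ["data", "scientist"]]
model = FastText(corpus, vector_size=50, window=3, min_count=1)

# fastText builds vectors from character n-grams, so even an unseen
# (out-of-vocabulary) token still gets an embedding:
vec = model.wv["pythonista"]  # works despite never appearing in the corpus
```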
What are some examples of data science techniques behind it and why is it hard?
- Some of the NLP techniques we use include word and sentence embedding models such as BERT and USE (Universal Sentence Encoder), which convert a piece of text into a numerical vector.
- Following this, we use the associated algorithms to compute distances between vectors (cosine similarity, Euclidean distance...).
- One of the various tasks we need to achieve is to calculate the 'distance' between two jobs. To do that, as mentioned, we use sentence embeddings, which represent any phrase as a vector of numbers that captures its semantic meaning (i.e., the sentence's meaning in context).
- From there, we can calculate the distance between two jobs by simply calculating the distance between their vector representations (see the sketch after this list).
- Another technique we use is artificial neural networks. The sentence embedding models we use rest on an advanced type of neural network architecture that was developed to capture the meaning of texts and was trained on all of Wikipedia's articles to do so.
- Neural networks offer several benefits:
- They model non-linear relations well, which suits language processing.
- They scale easily to accommodate training with very large amounts of data.
- They mimic the architecture of the brain, which so far performs very well on language.
- Other AI techniques we use include (non-exhaustive): feed-forward neural networks, linear regression, multivariate algorithms, clustering techniques, and topic modelling. Read the Boostrs white paper here.
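A sketch of the job-distance idea above using the sentence-transformers library; the library and model name are assumptions, since the source mentions BERT/USE but not a specific implementation:

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

jobs = ["Data Scientist building NLP models",
        "Machine Learning Engineer working on text pipelines"]
emb = model.encode(jobs)

# 'Distance' between two jobs = similarity of their vector representations.
print(util.cos_sim(emb[0], emb[1]))
```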