A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.
Library
Description
Programming Languages
Features
License
Author & Link
JTCC
Thai Character Cluster
Java
GPL-3.0
Wittawat
TCC
Thai Character Cluster
Python
Apache 2.0
Wannaphong
Library
Description
Programming Languages
Features
License
Author & Link
sentiment_analysis_thai
JagerV3
Library
Description
Programming Languages
Features
License
Author & Link
PyThaiNLP
Python 3
LK82 + Udom83
Apache 2.0
Korakot, GitHub
Library
Description
Programming Languages
Features
License
Author & Link
Chamkho
Lao/Thai word segmentation
Rust
LGPL
GitHub
CutKum
Thai word segmentation with Deep Learning in Tensorflow. RNN.
Python
93% F-measure.
MIT
Pucktada, GitHub
CutThai
Thai word segmentation written in CoffeeScript
CoffeeScript
MIT
Pureexe/cutthai GitHub
DeepCut
A Thai word tokenization library using Deep Neural Network. CNN.
Python
98.8% F-measure.
MIT
rkcosmos, GitHub
Lexto: Thai Lexeme Tokenizer
Java
LGPL
NECTEC
Lexto
Python 2
LGPL
GitHub
Lexto
Python 3
LGPL
GitHub
Multi-Candidate-Word-Segmentation
Multi Candidate Word Segmentation for Thai language
Python, RNN, LSTM
97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level)
MIT
paper, GitHub
PyThaiNLP
Python 3
Maximal matching and various other engines
Apache 2.0
GitHub
Swath
SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai
C
Longest Matching, Maximal Matching and Part-of-Speech Bigram.
GPL
Paisarn Charoenpornsawat, CMU
SynThai
Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.
Python
99.2% F-measure
MIT
KenjiroAI, GitHub
Thai Language Toolkit (tltk)
Based on a 2002 paper by Wirote Aroonmanakun. Word segmentation uses a maximum collocation approach. Syllable segmentation uses trigram statistics. (Dataset is included.)
Python
97.86% F-measure. (Measured on a different test set, so it is not directly comparable with the other models.)
GPLv3
PyPI
Wordcut
Thai word breaker for Node.js
JavaScript, Node.JS
LGPL-3.0
veer66, GitHub
wordcutpy
A simple Thai word tokenizer written in 1 Python file
Python 3
LGPL-3.0
veer66, GitHub
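Several of the segmenters listed above are dictionary-based, and Swath names Longest Matching as one of its strategies. As a rough illustration of that idea only, here is a minimal sketch. It is not the implementation used by Swath or any other listed library, and the function name `longest_matching` and the four-word `toy_dictionary` are made up for this example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    Characters that start no dictionary word are emitted one at a time.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until a word is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: single-character token
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, chosen to show a well-known ambiguity.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_matching("ตากลม", toy_dictionary))  # ['ตาก', 'ลม']
```

The greedy choice commits to ตาก and never considers the alternative split ตา + กลม, which is one reason practical tools add further strategies such as maximal matching or statistical models.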
Part of Speech Tagging (POS Tagging)
Library
Description
Programming Languages
Features
License
Author & Link
Chart-POS
Thai POS Tagger
C
All rights reserved
AIAT, KINDML, Thanaruk T. ([email protected] ), tchayintr, Demo at iApp
Jitar+NAiST
A simple Trigram HMM part-of-speech tagger
Java
Ver66 , Jitar + NAiST, 1 + NAiST, 2
SynThai
Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.
Python
0.9163 F-measure. RNN. LSTM
MIT
KenjiroAI, GitHub
Library
Description
Programming Languages
Features
License
Author & Link
Named Entity Tagging (Thai NEST)
Thai Named Entity tagging Specification and Tools
GPL
KINDML, SIIT, AIAT
ThaiNER
Thai Named Entity Recognition for PyThaiNLP
Python
Apache 2.0 (code) & CC BY 3.0 (Dataset)
ThaiNER
Library
Description
Programming Languages
Features
License
Author & Link
News Structure Tagging Program
Thai News Structure Tagging Program
Metadata tagging, Structure tagging, Automatic News Title Generation
GPL
AIAT
Syntactic Parsing & Tools
Library
Description
Programming Languages
Features
License
Author & Link
Chart-parser
Extract Syntactic Structure from POS Tagged Sentence.
C
All rights reserved
AIAT, KINDML, Thanaruk T. ([email protected] ), tchayintr, Demo at iApp
Grammar Processing
Labelled Brackets -> Context Free Grammars (CFGs)
Python
Transform and compute probability
tchayintr
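The Grammar Processing entry above describes turning labelled brackets into context-free grammar rules and computing probabilities. The sketch below shows one way that transformation can look. It assumes, without confirmation from the tool itself, that "compute probability" means a relative-frequency estimate per left-hand side; all function names and the two-sentence treebank are hypothetical and are not taken from tchayintr's code.

```python
from collections import Counter


def parse_brackets(s):
    """Parse a string like "(S (NP (N dog)) (VP (V runs)))" into (label, children)."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())      # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return parse()


def count_rules(tree, counts):
    """Count one rule per tree node; terminals are quoted on the right-hand side."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else f"'{c}'" for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def rule_probabilities(bracketed_sentences):
    """Estimate P(rhs | lhs) as count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for sentence in bracketed_sentences:
        count_rules(parse_brackets(sentence), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical two-sentence treebank, for illustration only.
treebank = [
    "(S (NP (N dog)) (VP (V runs)))",
    "(S (NP (N cat)) (VP (V runs)))",
]
for (lhs, rhs), p in sorted(rule_probabilities(treebank).items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

Here both sentences share the rule S -> NP VP, so it gets probability 1.0, while N splits evenly between 'dog' and 'cat'.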
Library
Description
Programming Languages
Features
License
Author & Link
kobkrit-word-embedding
Tensorflow implementation of Thai word embedding
Python
Source code, Example, Word distance graph
LGPL
Kobkrit V.
Question Answering (Machine Comprehension)
Service
Description
License
Author & Link
Thai Machine Comprehension (ThaiMC)
Bidirectional Attention Flow
Copyright (As the service)
iApp-AI
Dictionaries / Translation Pairs
Library
Description
Size
Features
License
Link
LEXiTRON
Thai<->English Dictionary
TH->EN, EN->TH
LEXiTRON License
NECTEC
Transliteration Corpus
31K pairs
Thai-Eng Translation Pair
CC BY-NC-SA 3.0 TH
NECTEC
Yaitron
LEXiTRON in machine readable format (XML)
TH->EN, EN->TH
LEXiTRON License
Veer66 Schema , Data & Conversion Code
Library
Description
Size
Features
License
Link
Click Bait Sentences
Thai Click Bait Sentence
330 sent. (90.7KB)
MIT
Wannaphongcom
InterBEST 2009/2010
5M words
Word Seg.
CC BY-NC-SA 3.0 TH
NECTEC
ORCHID
30K sent.
Word Seg., POS Tagged.
CC BY-NC-SA 3.0 TH
NECTEC
Prime Minister 29
Prime Minister 29's Speech Sentences
338KB
Word segmented, Named Entity tagged
MIT
Wannaphongcom
thai-jokes-corpus
Cleaned Thai Jokes Corpus
457 jokes
GPLv3
iApp Technology
Thai named entity corpora
Named entity corpora by Wirote Aroonmanakun's students
266KB-1.5MB
syllable seg., word seg., Named Entity tagged
GPLv3 (not sure, but tltk is using this license)
นัชชา ถิระสาโรช Data ศศิวิมล กาลันสีมา Data ณัฐดาพร เลิศชีวะ Data
THAI-NEST
Thai-NEST: Thai Named Entity tagging Specification and Tools
45K+ Named Entity Tokens
Named Entity Tagged
LGPL
KINDML
Thai Sentimental Word List
Thai Sentimental Words List
52KB
Separated Words as Adj, V
MIT
Wannaphongcom
Thai Wikipedia
Formal Articles
1.49GB (~213.1 MB compressed)
XML
GFDL
WIKIPEDIA
Thai WordNet
THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร)
WordNet
N/A
ธนนท์ หลีน้อย 2008 ปริศนา อัครพุทธิพร Data 2008
TNC Top-5000 Words
Word frequency
5,000 words
Frequency of Thai words in various genres, EXCEL
All rights reserved
CHULA
Toxicity in Thai Tweet Corpus
Tokyo Metropolitan University Natural Language Processing Group
Each tweet is labeled as toxic or non-toxic
CC BY-NC 4.0
tmu-nlp
Wisesight Sentiment Corpus
Social media messages with sentiment labels (positive, neutral, negative, question).
~26,700 messages
Sentiment label, Question label
Public domain
PyThaiNLP
Library
Description
Size
Features
License
Link
Thai National Corpus 2
32M words
Query text by genre, domain
All rights reserved
CHULA
Thai Medical Document
3,594 docs
Document and dynamic keyword map
All rights reserved
KINDML, SIIT
Southeast Asian Languages Library
Thai News, Web Text, Pop Music, Literature, Toponyms
20M chars
Phrases around a search text
SEALang
HSE Thai Corpus
Modern texts written in Thai language (mostly news websites)
50M tokens
Query by word form, lexeme, translation, grammatical attributes, lexical attributes
HSE School of Linguistics
Library
Description
Size
Features
License
Link
TALPCo
TUFS Asian Language Parallel Corpus
1,327 sent.
Open parallel corpus of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese, and English
CC BY 4.0
TALPCo
Pre-trained Language Models
Pre-trained Model
Description
Size
Dimensions
License
Link
fastText
Skip-Gram model trained on Wikipedia using fastText
300
CC BY-SA 3.0
Facebook + Bin & Text + Text Only
thai2fit
ULMFiT on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings.
70MB
300
MIT
thai2vec / PyThaiNLP
thbert
Yet another pre-trained BERT model, specifically for Thai
Apache 2.0
tchayintr
Thai Text Classification Benchmarks
Corpus extractors
Library
Description
Programming Languages
Features
License
Author & Link
BEST2010 cooker
A tool for extracting segmented words from the word-segmented BEST2010 Thai corpus
Python 3
Extracting segmented words, features, and data divisions
Apache 2.0
tchayintr
Not found? Try another Thai NLP awesome list or resource, such as this one:
https://resources.aiat.or.th/