EUREKA

Source:

original authors: Xylander23 is the original author; Lyrichu then ported the code to Python 3.
references: Code for Chinese Word Segmentation, Blog about New Words Detection, 中文分词新词发现 (New Word Discovery for Chinese Word Segmentation).
old version: an immature earlier version of mine can be found here.

Data:

stop-words dictionary: a stop-words dictionary file can improve the final performance of EUREKA; an example can be seen here (this dictionary is copied from Lyrichu).
input corpus: the input corpus is one long string, such as the text of a novel or several documents concatenated together. See an example.
corpus in MongoDB: you can store each document as one sample in a collection of a MongoDB database, in a format like this:

{"_id": ObjectId("123456789"), "content": your_corpus(long string)}
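As a minimal sketch of the two input forms above, the snippet below joins several document strings into one corpus string and wraps each document in the `{"content": ...}` shape before storing it. The helper names `build_corpus`, `to_mongo_docs` and `store_docs` are hypothetical and not part of EUREKA; `store_docs` assumes it is given a pymongo collection and leaves `_id` for MongoDB to generate.

```python
def build_corpus(documents, sep="\n"):
    """Join document strings into one long corpus string, skipping empty ones."""
    return sep.join(doc.strip() for doc in documents if doc and doc.strip())


def to_mongo_docs(documents):
    """Wrap each non-empty document as {"content": text}; "_id" is added on insert."""
    return [{"content": doc.strip()} for doc in documents if doc and doc.strip()]


def store_docs(collection, documents):
    """Insert the wrapped documents into a pymongo collection (assumed to exist)."""
    docs = to_mongo_docs(documents)
    if docs:  # pymongo's insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)


# Illustrative input only; replace with your own documents.
texts = ["第一篇文档。", "", "第二篇文档。"]
corpus = build_corpus(texts)      # one long string for the plain-text path
mongo_docs = to_mongo_docs(texts)  # one dict per document for the MongoDB path
```

With a real collection, `store_docs(client["your_database_name"]["your_collection_name"], texts)` would write one sample per document.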

Code Dependency:

eureka -> model

Usage Example:

from eureka import Eureka
model = Eureka()
model.load_dictionary()

# data from .txt file
####################################################################
# read the whole file as one UTF-8 string; the file is closed automatically
with open("document.txt", "r", encoding="utf-8") as f:
    corpus = f.read()

n = len(corpus)
if n < 5000:
    print("The corpus is too small.")
elif n < 250000:
    res = model.discover_corpus(corpus)
else:
    res = model.discover_corpus_multi(corpus, corpus_size=200000, re_list=True)  # corpus_size is the length of each sub-corpus taken from the input corpus

# data from mongo
####################################################################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
col = client["your_database_name"]["your_collection_name"]
res = model.discover_corpus_mongo(col, n=20000, corpus_size=200000, re_list=True)  # n is the number of samples taken from the collection
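
The `corpus_size` argument above is described as the length of each sub-corpus. One way to picture that split is the sketch below, which cuts a long string into consecutive pieces of at most `corpus_size` characters. `split_corpus` is a hypothetical helper for illustration only; how EUREKA actually splits the corpus internally (for example, whether it respects sentence boundaries) is not stated in this README.

```python
def split_corpus(corpus, corpus_size=200000):
    """Yield consecutive slices of `corpus`, each at most `corpus_size` characters."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    for start in range(0, len(corpus), corpus_size):
        yield corpus[start:start + corpus_size]


# Small illustrative input; a real corpus would be far longer.
chunks = list(split_corpus("abcdefghij", corpus_size=4))
print(chunks)  # ['abcd', 'efgh', 'ij']
```

A plain character split like this can cut a word in half at a chunk boundary, so candidate words near the boundaries may be missed or truncated.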

Requirements

Python>=3.5
pandas>=0.22.0
pkuseg
jieba>=0.39
tqdm>=4.19.5
Flask (optional, only if running server.py)
pymongo (optional; EUREKA can read data from MongoDB, but its core does not need this library)
ipdb (optional, if debugging on the command line)

Allusion

Eureka comes from the Ancient Greek word heúrēka, which means "I have found".
Eureka is also the heroine of the Japanese anime Eureka Seven.
