EUREKA

Source:


Data:

  • stop-words dictionary: a stop-words dictionary file can improve EUREKA's final results; an example can be seen here (this dictionary is copied from Lyrichu).
  • input corpus: the input corpus is one long string, such as the text of a novel or several documents concatenated together. See an example.
  • corpus in mongodb: you can store each document as one sample in a collection of a MongoDB database, in a format like this:
{"_id": ObjectId("123456789"), "content": your_corpus(long string)}
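A minimal sketch of preparing samples in this shape, assuming each sample is already a Python string. The helper names `build_documents` and `store_documents` are hypothetical and not part of EUREKA. The `_id` field is left out because MongoDB generates an ObjectId for any document inserted without one.

```python
def build_documents(texts):
    """Wrap each non-empty text in the {"content": ...} shape shown above."""
    return [{"content": t} for t in texts if t and t.strip()]


def store_documents(texts, uri="mongodb://localhost:27017/",
                    database="your_database_name",
                    collection="your_collection_name"):
    """Insert the documents into MongoDB (needs pymongo and a running server)."""
    import pymongo  # imported here so build_documents works without pymongo

    docs = build_documents(texts)
    if not docs:
        # insert_many rejects an empty list, so return early
        return 0
    col = pymongo.MongoClient(uri)[database][collection]
    result = col.insert_many(docs)
    return len(result.inserted_ids)
```

Calling `store_documents(["first document ...", "second document ..."])` would add two samples to the collection, provided a MongoDB server is reachable at the given URI.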

Code Dependency:

eureka -> model   

Usage Example:

from eureka import Eureka
model = Eureka()
model.load_dictionary()

# data from .txt file
####################################################################
# read the whole file as one string; the with-block closes the file afterwards
with open("document.txt", "r", encoding="utf-8") as f:
    corpus = f.read()

n = len(corpus)
if n < 5000:
    print("The corpus is too small.")
elif n < 250000:
    res = model.discover_corpus(corpus)
else:
    res = model.discover_corpus_multi(corpus, corpus_size=200000, re_list=True)  # corpus_size is the length of each sub-corpus taken from the input corpus

# data from mongo
####################################################################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
col = client["your_database_name"]["your_collection_name"]
res = model.discover_corpus_mongo(col, n=20000, corpus_size=200000, re_list=True)  # n is the number of samples used from the collection
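
The `corpus_size` argument above is described as the length of each sub-corpus. As an illustration of that idea only, and not of EUREKA's actual implementation, a long string can be cut into consecutive pieces of at most `corpus_size` characters. The function name `split_corpus` is hypothetical.

```python
def split_corpus(corpus, corpus_size=200000):
    """Cut a long string into consecutive chunks of at most corpus_size characters."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return [corpus[i:i + corpus_size] for i in range(0, len(corpus), corpus_size)]


chunks = split_corpus("abcdefgh", corpus_size=3)
print(chunks)  # ['abc', 'def', 'gh']
```

The last chunk may be shorter than `corpus_size`; whether EUREKA keeps, pads, or drops such a remainder is not stated in this README.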

Requirements

  • Python>=3.5
  • pandas>=0.22.0
  • pkuseg
  • jieba>=0.39
  • tqdm>=4.19.5
  • Flask (optional, only needed when running server.py)
  • pymongo (optional, only needed when reading data from MongoDB; EUREKA itself does not depend on it)
  • ipdb (optional, only needed for debugging on the command line)

Allusion

  • Eureka comes from the Ancient Greek word heúrēka, which means "I have found (it)".
  • Eureka is also the heroine of the Japanese anime Eureka Seven.