Commit

mrow
Persephone Karnstein authored and Persephone Karnstein committed Jul 3, 2023
1 parent 4cb1db1 commit d2189e2
Showing 4 changed files with 10 additions and 3 deletions.
2 changes: 1 addition & 1 deletion pdf-texts/alexjones.txt

Large diffs are not rendered by default.

9 changes: 8 additions & 1 deletion terfy/zip_texts.py
@@ -25,9 +25,16 @@ def get_corpus_data():

     with open("valid.txt", 'w') as f:
         f.write("\n\n".join(text_list))
+
+    with open("train.txt", 'w') as f:
+        with open("pdf-texts/alexjones.txt", 'r') as g:
+            data = g.read()
+            tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
+            sentences = tokenizer.tokenize(data)
+            f.write("\n\n".join(sentences))

     with tarfile.open("texts.tar.gz", "w:gz") as tarhandle:
-        for a in ["valid.txt"]:
+        for a in ["valid.txt","train.txt"]:
             tarhandle.add(a)
             os.remove(a)
     # with zipfile.ZipFile("texts.zip", "w") as f:
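For readers who want to try the flow this hunk adds (read a source text, split it into sentences, write them blank-line separated to train.txt, then pack that file into a gzipped tarball and delete the loose copy), here is a minimal standalone sketch. It is not the code from the commit: `split_sentences` is a hypothetical regex stand-in for the NLTK Punkt tokenizer the commit loads, chosen only so the example runs without NLTK data, and `build_archive`, its parameters, and the temporary staging directory are likewise assumptions made for illustration.

```python
import os
import re
import tarfile
import tempfile


def split_sentences(text):
    """Naive splitter (hypothetical stand-in for the Punkt tokenizer).

    Splits on whitespace that follows '.', '!' or '?'. Punkt handles
    abbreviations and similar cases far better; this does not.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_archive(source_path, archive_path, member_name="train.txt"):
    """Write one sentence per paragraph to member_name inside a .tar.gz.

    Mirrors the shape of the hunk: join sentences with blank lines,
    add the file to the tarball, then remove the loose file.
    """
    with open(source_path, "r", encoding="utf-8") as src:
        sentences = split_sentences(src.read())

    # Stage the file in a temporary directory (an assumption of this
    # sketch; the commit writes train.txt to the working directory).
    with tempfile.TemporaryDirectory() as workdir:
        staged = os.path.join(workdir, member_name)
        with open(staged, "w", encoding="utf-8", newline="\n") as out:
            out.write("\n\n".join(sentences))

        with tarfile.open(archive_path, "w:gz") as tarhandle:
            # arcname keeps the temporary path out of the archive.
            tarhandle.add(staged, arcname=member_name)
        # The staged file is deleted when the temporary directory closes.

    return sentences
```

Calling `build_archive("pdf-texts/alexjones.txt", "texts.tar.gz")` would produce an archive containing only train.txt; the commit itself also adds valid.txt to the same tarball, which this sketch leaves out.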
Binary file modified texts.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion training-texts/alexjones-decimated.txt

Large diffs are not rendered by default.
