newsnlp

Python library that transforms news data using a variety of NLP-based processing tasks. You may:

Generate a summary, caption and category for scraped content using NLP.
Gather posts by affinity, e.g. all similar titles from a set of websites.
Generate meta-summaries from posts grouped by affinity (the concept of sibling posts) using NLP.
Remove ads and undesirable content using NLP.
Download and cache pre-trained Transformer models from HuggingFace for the summarization tasks (check supported languages).
Work inside Anaconda environments for scientific computing.
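
To illustrate what gathering posts by affinity can look like, here is a minimal sketch that groups titles by TF-IDF cosine similarity using only the standard library. It is not newsnlp's actual API: the names `tokenize`, `tfidf_vectors`, `cosine` and `siblings`, and the `threshold` value, are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms present in every document keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def siblings(titles, threshold=0.3):
    """Map each title index to the indices of titles at least `threshold` similar."""
    vecs = tfidf_vectors(titles)
    return {
        i: [j for j in range(len(titles)) if j != i and cosine(vecs[i], vecs[j]) >= threshold]
        for i in range(len(titles))
    }


titles = [
    "Floods hit the capital city",
    "Capital city hit by floods",
    "Football team wins national cup",
]
groups = siblings(titles)
```

With these three titles, the first two end up as siblings of each other and the third has none. The threshold is arbitrary and would need tuning on real data.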

Features

Uses Playwright to render dynamic website content (e.g. ads, JavaScript) before processing.
Downloads and caches pretrained NLP models locally, suitable for fast inference.
Pretrained deep learning ad detection model.
Pretrained Transformer-based LLM for the summarization tasks (only French is supported so far).
Ad detection partially implements the (Kushmerick, 1999) paper (./doc/kushmerick99learning.pdf), but relies on deep learning rather than statistical fitting.
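
As an illustration of the first feature, the sketch below fetches a page's HTML after its scripts have run, using Playwright's synchronous API. The function name `render_page` and the `timeout_ms` parameter are assumptions made for this example; newsnlp's own wrapper may differ.

```python
def render_page(url, timeout_ms=30_000):
    """Return the HTML of `url` after dynamic content has loaded."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has stopped for a short period.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Running it requires the `playwright` package and a browser installed with `playwright install chromium`.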

Dev

Env setup

conda create --name newsbot python=3
# activate the environment so that pip-sync installs into it
conda activate newsbot
pip-sync

Summariser's dependencies: sentencepiece, PyTorch, transformers

conda install -c conda-forge sentencepiece
conda install pytorch torchvision -c pytorch
#conda install -c conda-forge transformers
conda install -c huggingface transformers

TF-IDF dependencies: spaCy

conda install -c conda-forge spacy
python -m spacy download en_core_web_sm

TODO

Optimisations

Use an optimised TF-IDF implementation from spaCy or scikit-learn.
Use only half of the symmetric TF-IDF similarity matrix.
Resume vectorization of the corpus where the last task left off. This implies saving the vectorization result to disk and merging it with documents newly added to the db.
Cython?
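
The idea of using only half of a symmetric matrix can be sketched as follows: since similarity(i, j) equals similarity(j, i), only the pairs with i < j need to be computed and stored. The names `dot`, `pairwise_upper` and `lookup` are hypothetical and not part of newsnlp; the example assumes unit-length vectors, so that the dot product is the cosine similarity.

```python
from itertools import combinations


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_upper(vectors, similarity):
    """Compute similarity(i, j) only for i < j, stored under the key (i, j)."""
    return {
        (i, j): similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


def lookup(scores, i, j):
    """Read any entry of the full symmetric matrix from its upper half."""
    if i == j:
        return 1.0  # assumes similarity(v, v) == 1, as for cosine on non-zero vectors
    return scores[(i, j) if i < j else (j, i)]


vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
scores = pairwise_upper(vectors, dot)
```

For n documents this stores n(n-1)/2 values instead of n², which roughly halves both the computation and the memory.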

Feature request

Multiple language support - OK 07/06/2023
The summarizer currently supports at most 1024 words. Find a more powerful model, or extend the model's capacity?
Require conda in setuptools/pyproject.toml?
What is the sumy summariser?
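
Until a larger model is found, one possible workaround for the 1024-word limit is to split long texts into chunks and summarise each chunk separately. The sketch below only covers the splitting step. `chunk_words` is a hypothetical helper, not part of newsnlp, and it counts whitespace-separated words, which may not match how the model itself counts its input.

```python
def chunk_words(text, max_words=1024):
    """Split `text` into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


long_text = " ".join(str(i) for i in range(2500))
chunks = chunk_words(long_text)
```

Splitting on fixed word counts can cut sentences in half; aligning chunk boundaries with sentence ends would likely give better summaries.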
