Entry of VMW-THU-Hackathon @ Aug 2019
Topic #1: Trend analysis of AI techniques
Language: JavaScript, Python, HTML & CSS
As technical reference documentation, this file does not cover the basic information of the project. For general information, including our team, the problem statement, overall design, project progress, results, and discussion, see the KARMA Readme (README.md).
- (optional_2) Make sure you have Python 3 >= 3.7 successfully installed and `pip` up to date.
- (optional_1) Download and configure swig. Note that Windows users should download the swigwin package, which includes a prebuilt executable, instead of swig.
- (optional_2) Install the `PyPDF2` module via `pip install pypdf2`.
- (optional_1) Install the `pdfminer` module via `pip install pdfminer`.
- Make sure you have Node.js >= 10.16.3 LTS successfully installed and `yarn` up to date, then run `yarn install` to install the Node.js dependencies.
- Make sure your web browser supports ECMA2017 (Firefox >= 67; Chrome >= 75; Opera >= 62; Safari >= 12). Edge and IE may not display normally.
Note 1: if you would like to use the same dataset as we do, the requirements labeled "optional_1" and "optional_2" can be ignored. Our dataset can be downloaded from HERE.
Note 2: if you would like to use PyPDF2 instead of pdfminer, the requirements labeled "optional_1" can be ignored.
To use the same dataset as we do, download raw_data and unzip it into the root directory. You may then be able to skip the PDF content extraction part and step 1 of keywords matching.
- Run `pdfExtract/GetPaperInfo.py` to extract the papers' basic info from the target directory. Output files will be saved as `rawdata/info/info.json`.
- Run `pdfExtract/pdf2Txt_PyPDF2.py` to extract the main body of the papers in the target directory. Output files will be saved in the same directory as the PDF files.
- Target keywords are already provided in `analysis/keywordExtraction/dict.json` and can be replaced with user-defined keywords (a hypothetical illustration follows this item).
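The shipped `dict.json` defines the exact schema the keyword-extraction code expects, so inspect it before substituting your own keywords. Purely as a hypothetical illustration (not necessarily the project's actual format), a replacement dictionary might simply list the target keywords:

```json
{
  "keywords": [
    "reinforcement learning",
    "generative adversarial network",
    "transfer learning"
  ]
}
```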
- Configure the configuration file `analysis/config.js` (a sketch follows this item): `infoPrefix` should be the directory containing the info data, `infoList` should be the filenames of the info data files, and `dataList` should be the directories containing the text files of the papers.
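The concrete values depend on where you exported the data. As a minimal sketch, assuming the `rawinfo/` layout shown in the directory tree below and a plain `module.exports` style (the actual export shape is whatever the analysis code expects):

```js
// analysis/config.js -- sketch only; field names follow the description above,
// but the values and export style must match your own data layout and the analysis code.
module.exports = {
  infoPrefix: 'rawinfo/info/',                          // directory containing the info data
  infoList: ['info0.json', 'info1.json', 'info2.json'], // filenames of the info data files
  dataList: ['rawinfo/text/'],                          // directories containing the papers' text files
};
```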
- Run `node index.js` to generate `result.json` in the root directory of the project. Network access is required. (A quick sanity check of the output is sketched below.)
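If you want to confirm that the analysis step produced usable output before moving on to visualization, a minimal check is to make sure `result.json` exists and parses; its internal structure is whatever the analysis step emits, so this sketch only verifies that much:

```js
// Sanity-check sketch: confirm result.json exists and is valid JSON.
// Its internal structure is defined by the analysis step, not by this snippet.
const fs = require('fs');

const result = JSON.parse(fs.readFileSync('./result.json', 'utf8'));
console.log('result.json parsed OK,', Object.keys(result).length, 'top-level entries');
```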
- Copy `result.json` to `visualization/data/`.
- In the directory `visualization/`, run `python3 -m http.server 3000`; the website can then be visited at `http://localhost:3000`.
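The directory is served over HTTP rather than opened directly from the file system presumably because the page loads the copied `result.json` at runtime, which browsers block under `file://`. As a hypothetical sketch (the real loading code lives in `visualization/js/` and may use a different path or library):

```js
// Hypothetical sketch only: how the page might pick up the copied data.
// The actual loading logic is in visualization/js/ and may differ.
fetch('data/result.json')
  .then((res) => res.json())
  .then((result) => {
    console.log('loaded result.json with', Object.keys(result).length, 'top-level entries');
  })
  .catch((err) => console.error('failed to load data/result.json:', err));
```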
- Click the '+' button in the bottom-right corner to add a new info card to the canvas.
- Type in target keywords and click the search button to visualize the results.
- Use the move buttons to move the cards.
KARMA //
├─ .gitignore //
├─ KARMA_Reference.md // KARMA technical documentation
├─ LICENSE //
├─ README.md //
├─ README_ch.md //
├─ analysis // keywords extraction and matching
│ ├─ cache //
│ │ ├─ articles.json //
│ │ ├─ info.json //
│ │ ├─ keywords.json //
│ │ └─ unused.json //
│ ├─ config.js //
│ ├─ index.js //
│ ├─ infoExtraction.js //
│ └─ keywordExtraction //
│ ├─ datacleaner.js //
│ ├─ dict.json // target keywords dictionary
│ ├─ filestats.js //
│ ├─ index.js //
│ └─ readfile.js //
├─ package.json //
├─ pdfExtract // extraction of paper info and plain texts from pdf files
│ ├─ GetPaperInfo.py // extraction of paper info
│ ├─ Pdf2Txt_PyPDF2.py // extraction of plain text, more accurate than pdfminer but does not preserve whitespace
│ ├─ fbasic.py // helper methods
│ └─ pdf_to_txt_pdfminer.py // extraction of plain text, contains thorough info but is slower and less accurate
├─ rawinfo // paper data is expected to be exported and stored here
│ ├─ info // paper basic info is expected to be exported and stored here
│ │ ├─ info0.json // paper basic info 1906.00554 - 1906.06027
│ │ ├─ info1.json // paper basic info 1906.06032 - 1907.11917
│ │ └─ info2.json // paper basic info 1907.11922 - 1908.02269
│ └─ text // paper plain texts are expected to be exported and stored here
├─ result.json //
├─ visualization // data visualization
│ ├─ .DS_Store //
│ ├─ css //
│ ├─ data //
│ ├─ font //
│ ├─ img //
│ ├─ index.html //
│ └─ js //
└─ yarn.lock //