new tika flask process

autolordz · Aug 10, 2019 · 4cc6b9b
1 parent bf06602
Showing 14 changed files with 1,171 additions and 418 deletions.
diff --git a/.gitignore b/.gitignore
@@ -107,5 +107,7 @@ venv.bak/
 *.zip
 *.exe
 *.txt
+*.jar
 tmp/
 exe-win7-tmp/
+README1.md
diff --git a/README.md b/README.md
@@ -1,74 +1,141 @@
-# file-batch-renamer: Python batch file-renaming script
+## Python batch file renamer
 
-> a file batch renamer based on python (include Chinese)
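+* The ultimate Python-based renaming machine
+* a file batch renamer based on python (include Chinese)
+* Automatically analyzes most file types in a folder and renames them in batch
+* Renaming files has always been tedious; try it and you will see
+* Handy for tidying IT/office documents and cluttered download folders
+* A simple practice project that exercises third-party packages and touches many aspects of coding; useful for Python beginners
+* Runs on cloud plus local, or fully local
+* An exe is provided for non-technical users, and a temporary cloud server is available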
 
-- Updated 2019.1.2:
+[![](https://img.shields.io/badge/github-source-orange.svg?style=popout&logo=github)](https://github.com/autolordz/file-batch-renamer)
+[![](https://img.shields.io/github/license/autolordz/file-batch-renamer.svg?style=popout&logo=github)](https://github.com/autolordz/file-batch-renamer/blob/master/LICENSE)
+
+## Tika version architecture
+
+![](img_tmp/flow.jpg)
+(If a cloud setup is not an option, everything can run locally)  
+
+## Updated
+
+- Updated 2019.8.10:
+    - Improved **Apache Tika** version, cloud plus local, the ultimate automatic renaming machine
+
+- Updated 2019.1.2:  
     - New version: parses all file types with **Apache Tika**
     - Old version: parses files with **Python 3rd party** packages
 
+<!--more-->
 ----------------
 
-## Tutorial
-
-### 1. Tika | Tesseract OCR
-
-- Files
-    - batch-renamer-tika.py
-
-- Requirements  
-    - [zhon](https://pypi.org/project/zhon/) zhon to deal with Chinese
-    - [tika](https://pypi.org/project/tika/) tika for python
-    - [Java Jre jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum; this package is a suitable choice
-    - [Tesseract v4.0.0.20181030](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR
-
-- Supported Platform:
-    - [x] win7 32bit, win10 64bit; other platforms untested
-
-- Supported Files:
-    - [x] docx,pptx,xlsx
-    - [x] doc,ppt,xls
-    - [x] epub,rar,zip,tar,html,pdf
-    - [x] png,jpg,jpeg,bmp,tif
-    - [x] others(follows [tika](http://tika.apache.org/1.20/formats.html))
-
-- Usage:
-    - Install prerequisites
-        - installplug.bat
-        - setenv.bat
-    - Put the files to rename in the current directory
-    - Run batch-renamer-tika.(py|exe)
-
-#### 2. Python 3rd party | Tesseract OCR
-
-- Files
-    - batch-renamer.py
-    - extrectImage.py (Author: BJ Jang (jangbi882 at gmail.com))
-
-- Requirements
-    - [python-pptx](https://pypi.org/project/python-pptx/) PPT format
-    - [python-docx](https://pypi.org/project/python-docx/) Word format
-    - [xlrd](https://pypi.org/project/xlrd/) Excel format
-    - [zhon](https://pypi.org/project/zhon/) extracts Chinese text
-    - [PyPDF2](https://github.com/mstamy2/PyPDF2) extracts PDF content
-    - [PDFMiner](https://github.com/euske/pdfminer/) extracts PDF content
-    - [pytesseract](https://pypi.org/project/pytesseract/) image recognition (OCR)
-
-- Supported Platform:
-    - [x] win7 32bit, win10 64bit; other platforms untested
-
-- Supported Files:
-    - [x] docx,pptx,xlsx
-    - [x] doc,ppt,xls
-    - [x] pdf
-    - [x] png,jpg,jpeg,bmp,tif
-
-- Usage:
-    - Install prerequisites, or install the packages manually
-        - installplug.bat
-        - setenv.bat
-    - Put the files to rename in the current directory
-    - Run batch-renamer.(py|exe)
-
-[![ForTheBadge built-with-science](http://ForTheBadge.com/images/badges/built-with-science.svg)](https://github.com/autolordz/docx-content-modify/blob/master/LICENSE)
+## Environment
+
+* conda : 4.6.14
+* python : 3.7.3
+* Win10 + Spyder 3.3.4 (open the script and run it top to bottom, or add your own main to run it as a .py file)
+
+* Components: Tika version
+    - [zhon](https://pypi.org/project/zhon/) provides Chinese character sets
+    - [opencv](https://pypi.org/project/opencv-python/) image processing (thresholding, filters, etc.)
+    - [PIL](https://pypi.org/project/Pillow/) image processing
+    - [fitz](https://pypi.org/project/PyMuPDF/) extracts images from PDFs
+    - [jieba](https://github.com/fxsjy/jieba) word segmentation and stem recognition
+    - [numpy,requests,string,json,glob,time,os,re,subprocess,configparser,BeautifulSoup4]
+    - [Java jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum; this package is a suitable choice
+    - [tika server](https://www.apache.org/dyn/closer.cgi/tika/tika-server-1.22.jar) not bundled with the project; must be downloaded
+    - **Tesseract (cloud)**: see the cloud [Tesseract] installation section
+
+* Components: standard version
+    - [Tesseract v4.0](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR
+    - [PyPDF2,pdfminer,pytesseract,docx,pptx,xlrd,PIL,extrectImage]
+
+* Packaging: pyinstaller
+
+- **Going forward, the Tika version is the focus of updates and maintenance; the standard version's code is kept as-is**
+
+## Contents
+
+- [x] File formats that are renamed
+    - [x] ['.txt','.html','.epub','.chm','.wps','.md',
+             '.doc','.odt','.docx','.xlsx','.csv','.xls','.rtf',
+             '.rar','.zip','.tar','.tgz','.7z',
+             '.mp4','.gif','.flv','.mkv','.swf','.psd',
+             '.mp3','.m4a','.flac',
+             '.pdf',]
+    - [x] ['.ppt','.pptx','.pptm']
+    - [x] ['.png','.jpg','.jpeg','.bmp','.tif']
+    - [x] others (rules follow [tika](http://tika.apache.org/1.20/formats.html))
+
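+- [x] File formats that are filtered out (not renamed)
+    - [x] ['.bat','.jar','.exe','.py','.ini']
+
+- [x] Supported platforms
+    - [x] win7 32bit, win10 64bit; on other platforms, adjust the code according to the errors you hit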
+
+## Usage
+
+The related files are in the flask_app directory
+
+- Cloud [tika] deployment
+
+```shell
+# Start tika on CentOS
+nohup java -Djava.awt.headless=true -jar tika-server.jar --host=yourhost --port=3232 >/dev/null &
+
+# Stop tika on CentOS
+ps -ef | grep tika-server | grep -v grep | awk '{print $2}' | xargs kill -9 
+```
+
+- Local [tika] deployment
+
+```shell
+
+# Start tika on Windows
+
+start /b java -Djava.awt.headless=true -jar tika-server.jar --config=tika-config.xml --host=127.0.0.1 --port=3232
+
+# [tika-config.xml skips the local Tesseract to speed up reading non-image files]
+
+# Stop tika on Windows
+
+taskkill /F /FI "IMAGENAME eq java.exe"
+```
+- Cloud [flask] deployment
+
+```shell
+# Start
+nohup python3 /pyweb/app.py >/dev/null &
+
+# Stop
+ps -ef | grep pyweb | grep -v grep | awk '{print $2}' | xargs kill -9
+```
+
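+- Cloud [Tesseract] installation
+
+    - Installing Tesseract 4+ on CentOS 6.5  
+    - Follow https://www.jianshu.com/p/bf8521703143 with these differences:  
+    - autoconf-2.63-5.1.el6.noarch works, 2.69 is not required; keep it  
+    - What was actually installed: autoconf-archive-2015.02.24-1.sdl6.noarch.rpm  
+
+- Client installation
+    - installplug.bat -> installs the Java environment
+    - Put the files to process in the target directory
+    - Double-click -> batch-renamer-tika.exe -> processes the target directory
+    - cmd -> batch-renamer-tika.py 'yourfile' -> processes yourfile (file|directory)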
+
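+## Roadmap
+
+- [x] Name files by their opening content
+- [x] Name files by recognized image content
+- [ ] Name files by keywords extracted from the text (jieba)
+- [ ] Name files by an extracted summary (NLP)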
+
+## Licence
+
+[See Licence](#file-batch-renamer)
 
 That's it,enjoy.
+
+
+
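
Note on the README diff above: it lists the extensions that are filtered out and never renamed (`.bat`, `.jar`, `.exe`, `.py`, `.ini`). A minimal sketch of how that filter might look follows. The function name `split_candidates` is hypothetical and does not come from the repository, and matching extensions case-insensitively is an assumption, not something the README states.

```python
import os

# Extensions the README lists as filtered out (never renamed).
SKIP_EXTENSIONS = {'.bat', '.jar', '.exe', '.py', '.ini'}


def split_candidates(filenames, skip=SKIP_EXTENSIONS):
    """Split filenames into (to_rename, skipped) by file extension.

    Hypothetical helper for illustration only. Extensions are compared
    in lowercase, which is an assumption about the intended behavior.
    """
    to_rename, skipped = [], []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext in skip:
            skipped.append(name)
        else:
            to_rename.append(name)
    return to_rename, skipped


if __name__ == '__main__':
    files = ['report.pdf', 'setenv.bat', 'tika-server.jar', 'notes.docx']
    print(split_candidates(files))
```

Keeping the skip list as a set in one place would let the launcher scripts and the Tika jar sit next to the files being renamed without being touched.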
diff --git a/batch-renamer.py b/batch-renamer-old/batch-renamer.py
@@ -26,21 +26,16 @@
 """
 
 #%%
-
 import zhon.hanzi,zhon.cedict
 import os,re,io,glob,shutil,string,platform
 import itertools as it
-
-import extrectImage
 from pdfminer.high_level import extract_text_to_fp
-
 import pytesseract
-from PIL import Image 
-
+from PIL import Image
 from docx import Document
 from pptx import Presentation
 from xlrd import open_workbook
-from win32com.client import Dispatch # for office 97-2003 
+from win32com.client import Dispatch # for office 97-2003
 
 #%%
 def parse_subpath(path,file):
@@ -83,12 +78,12 @@ def clean_txt_func(x,**kwargs):
     return xx
 
 
-#%% rename office,officex 
-    
+#%% rename office,officex
+
 def rename_officex(file,**kwargs):
     '''rename only judgment doc files'''
     suffix = os.path.splitext(file)[1]
-    
+
     if suffix == '.docx':
         try:
             doc = Document(file)
@@ -99,7 +94,7 @@ def rename_officex(file,**kwargs):
             return x
         except Exception as e:
             print('>>> 读取 %s 失败,可能格式不正确 => %s'%(file,e))
-    
+
     if suffix == '.pptx':
         try:
             prs = Presentation(file)
@@ -117,7 +112,7 @@ def rename_officex(file,**kwargs):
             return x
         except Exception as e:
             print('>>> 读取 %s 失败,可能格式不正确 => %s'%(file,e))
-    
+
     if suffix in ['.xlsx','.xls']:
         try:
             exl = open_workbook(file)
@@ -149,12 +144,12 @@ def get_txt_text(file,**kwargs):
 def rename_office(file,**kwargs):
     name = os.path.splitext(file)[0]
     suffix = os.path.splitext(file)[1]
-    
+
     if suffix == '.txt':
         x = get_txt_text(file,**kwargs)
         print('>>> 找到 %s 内容: %s'%(file,x))
         os_rename(file,x)
-        
+
     if suffix == '.doc':
         file_txt = name + '_doc.txt'
         word = Dispatch("Word.Application")
@@ -165,7 +160,7 @@ def rename_office(file,**kwargs):
         print('>>> 找到 %s 内容: %s'%(file,x))
         os_rename(file,x)
         os.remove(file_txt)
-            
+
     if suffix == '.ppt':
         txt = []
         try:
@@ -186,7 +181,7 @@ def rename_office(file,**kwargs):
         x = clean_txt_func(','.join(txt),**kwargs)
         print('>>> 找到 %s 内容: %s'%(file,x))
         os_rename(file,x)
-    
+
     if suffix == '.xls':
         try:
             app = Dispatch("Excel.Application")
@@ -206,7 +201,7 @@ def rename_office(file,**kwargs):
         x = clean_txt_func(','.join(txt),**kwargs)
         print('>>> 找到 %s 内容: %s'%(file,x))
         os_rename(file,x)
-        
+
     return True
 
 #%% rename image
@@ -220,10 +215,10 @@ def get_image_txt(file,**kwargs):
         print('image size :',img.size)
         img = img.crop((0,0,img.width,img.height/img_h))
         print('image size 2:',img.size)
-        
+
         pytesseract.pytesseract.tesseract_cmd = 'c:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe' \
-        if '64bit' in platform.architecture() else 'c:\\Program Files\\Tesseract-OCR\\tesseract.exe' 
-        
+        if '64bit' in platform.architecture() else 'c:\\Program Files\\Tesseract-OCR\\tesseract.exe'
+
         x = pytesseract.image_to_string(img,lang='chi_sim') # eng
         x = re.sub(r'\s+',',',x)
         print('>>> 解析 %s \n 内容: %s'%(file,x))
@@ -256,7 +251,7 @@ def get_pdf_txt(ifile,**kwargs):
         if len(txt) < 10:
             print('====decode images===')
             extrectImage.main(sourceName=ifile,outputFolder=odir,**kwargs)
-            subext = [parse_subpath(odir,x) for x in 
+            subext = [parse_subpath(odir,x) for x in
                       ['*.png','*.jpg','*.jpeg','*.bmp','*.tif']]
             images = list(it.chain(*(glob.iglob(e) for e in subext)))
             print(images)

diff --git a/extrectImage.py b/batch-renamer-old/extractImage.py
@@ -171,7 +171,6 @@ def get_pdfObj_contents(pdfObj,**kwargs):
                             img = Image.open(jpgData)
                             if mode == "CMYK":
                                 # case of CMYK invert all channel
-
                                 # imgData = list(img.tobytes())
                                 # invData = [(255 - val) & 0xff for val in imgData]
                                 # data = struct.pack("{}B".format(len(invData)), *invData)
@@ -190,7 +189,7 @@ def get_pdfObj_contents(pdfObj,**kwargs):
                             img.write(data)
                             img.close()
                             print('save to:',outFileName + ".jp2")
-                        
+
                         # case of JBIG2
                         elif len(leftFilters) == 1 and leftFilters[0] == '/JBIG2Decode':
                             img = open(outFileName + ".jbig2", "wb")
@@ -222,11 +221,11 @@ def main(sourceName,**kwargs):
     outputFolder = kwargs.get('outputFolder',None)
     os.makedirs(outputFolder,exist_ok=True)
     fileBase = os.path.splitext(os.path.basename(sourceName))[0]
-    
+
     with open(sourceName, "rb") as fp:
         pdfObj = PyPDF2.PdfFileReader(fp,strict=False)
         get_pdfObj_contents(pdfObj,fileBase=fileBase,**kwargs)
-    
+
     print("Completed.")
 
 # main(sourceName = 'aa.pdf', outputFolder = ".\\Temp",num_pages = 1,targetPage = None)
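
Note on the two Python diffs above: the first hunk of batch-renamer.py removes `import extrectImage`, and the helper file is renamed to extractImage.py, yet the unchanged context in `get_pdf_txt` still calls `extrectImage.main(...)`. Unless the import is restored somewhere not visible in this diff, that call would likely raise a `NameError` when a PDF falls back to image extraction. One possible way to tolerate both spellings is sketched below. The helper name `load_image_extractor` is hypothetical, and this is an illustration, not the author's fix.

```python
import importlib


def load_image_extractor(candidates=("extractImage", "extrectImage")):
    """Return the first importable module from candidates, or None.

    Hypothetical helper: tries the new module name first and falls back
    to the old (misspelled) one, so either file layout keeps working.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


if __name__ == '__main__':
    extractor = load_image_extractor()
    if extractor is None:
        print('image extractor module not found; PDF image fallback disabled')
```

With something like this in place, the call site would use the returned module (for example `extractor.main(...)`) instead of the bare `extrectImage` name. The simpler alternative is to restore the import under the new file name.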