new tika flask process
autolordz committed Aug 10, 2019
1 parent bf06602 commit 4cc6b9b
Showing 14 changed files with 1,171 additions and 418 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -107,5 +107,7 @@ venv.bak/
*.zip
*.exe
*.txt
*.jar
tmp/
exe-win7-tmp/
README1.md
197 changes: 132 additions & 65 deletions README.md
@@ -1,74 +1,141 @@
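# file-batch-renamer: a Python script for batch-renaming files
## Batch-renaming files with Python

> A batch file renamer based on Python (handles Chinese text)
* The ultimate Python-based renaming machine
* A batch file renamer based on Python (handles Chinese text)
* Automatically analyzes most file types in a folder and renames them in batch
* Renaming files has always been tedious; try it and you will see
* Handy for tidying IT/office documents and cluttered download folders
* A simple practice project that exercises third-party packages and touches many aspects of coding; a must for Python beginners
* Works with a cloud server plus a local client, or entirely locally
* Ships an exe for non-programmers; a temporary server is provided in the cloud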

- Updated 2019.1.2:
[![](https://img.shields.io/badge/github-source-orange.svg?style=popout&logo=github)](https://github.com/autolordz/file-batch-renamer)
[![](https://img.shields.io/github/license/autolordz/file-batch-renamer.svg?style=popout&logo=github)](https://github.com/autolordz/file-batch-renamer/blob/master/LICENSE)

## Architecture of the Tika version

![](img_tmp/flow.jpg)
(Everything can run locally if a cloud setup is not an option.)

## Updated

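- Updated 2019.8.10:
- Improved **Apache Tika** version: cloud plus local, the ultimate automatic renaming machine

- Updated 2019.1.2:
- New version: parses all file types with **Apache Tika**
- Old version: parses files with **Python 3rd party** packages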

<!--more-->
----------------

## Tutorial

### 1. Tika | Tesseract OCR

- Files
- batch-renamer-tika.py

- Requirements
- [zhon](https://pypi.org/project/zhon/) zhon to deal with Chinese
- [tika](https://pypi.org/project/tika/) tika for python
- [Java Jre jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum; pick the package that fits your system
- [Tesseract v4.0.0.20181030](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR

- Supported Platform:
- [x] Win7 32-bit, Win10 64-bit; other platforms untested

- Supported Files:
- [x] docx,pptx,xlsx
- [x] doc,ppt,xls
- [x] epub,rar,zip,tar,html,pdf
- [x] png,jpg,jpeg,bmp,tif
- [x] others(follows [tika](http://tika.apache.org/1.20/formats.html))

- Usage:
- Install the prerequisites
- installplug.bat
- setenv.bat
- Put the files to rename in the current directory
- Run batch-renamer-tika.(py|exe)

#### 2. Python 3rd party | Tesseract OCR

- Files
- batch-renamer.py
- extrectImage.py (Author: BJ Jang (jangbi882 at gmail.com))

- Requirements
- [python-pptx](https://pypi.org/project/python-pptx/) PowerPoint format
- [python-docx](https://pypi.org/project/python-docx/) Word format
- [xlrd](https://pypi.org/project/xlrd/) Excel format
- [zhon](https://pypi.org/project/zhon/) extracts Chinese text
- [PyPDF2](https://github.com/mstamy2/PyPDF2) extracts PDF content
- [PDFMiner](https://github.com/euske/pdfminer/) extracts PDF content
- [pytesseract](https://pypi.org/project/pytesseract/) image recognition (OCR)

- Supported Platform:
- [x] Win7 32-bit, Win10 64-bit; other platforms untested

- Supported Files:
- [x] docx,pptx,xlsx
- [x] doc,ppt,xls
- [x] pdf
- [x] png,jpg,jpeg,bmp,tif

- Usage:
- Install the prerequisites, or install the packages manually
- installplug.bat
- setenv.bat
- Put the files to rename in the current directory
- Run batch-renamer.(py|exe)

[![ForTheBadge built-with-science](http://ForTheBadge.com/images/badges/built-with-science.svg)](https://github.com/autolordz/docx-content-modify/blob/master/LICENSE)
## Environment

* conda : 4.6.14
* python : 3.7.3
* Win10 + Spyder 3.3.4 (open the script and run it top to bottom, or add your own main to run it with python)

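* Components: Tika version
- [zhon](https://pypi.org/project/zhon/) provides Chinese character sets
- [opencv](https://pypi.org/project/opencv-python/) image processing (thresholding, filters, etc.)
- [PIL](https://pypi.org/project/Pillow/) image processing
- [fitz](https://pypi.org/project/PyMuPDF/) extracts images from PDFs
- [jieba](https://github.com/fxsjy/jieba) word segmentation and stem recognition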
- [numpy,requests,string,json,glob,time,os,re,string,subprocess,configparser,BeautifulSoup4]
- [Java jre-8u91-windows-x64](https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html) JRE 8 is the minimum; pick the package that fits your system
- [tika server](https://www.apache.org/dyn/closer.cgi/tika/tika-server-1.22.jar) not bundled with the project; it must be downloaded
- **Tesseract (cloud)**: see the cloud [Tesseract] installation notes

* Components: plain version
- [Tesseract v4.0](https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0.20181030.exe) Tesseract for Image OCR
- [PyPDF2,pdfminer,pytesseract,docx,pptx,xlrd,PIL,extrectImage]

* Packaging: pyinstaller

- **From here on, updates and maintenance focus on the Tika version; the plain version's code is kept.**

## Features

- [x] Renames files in the following formats
- [x] ['.txt','.html','.epub','.chm','.wps','.md',
'.doc','.odt','.docx','.xlsx','.csv','.xls','.rtf',
'.rar','.zip','.tar','.tgz','.7z',
'.mp4','.gif','.flv','.mkv','.swf','.psd',
'.mp3','.m4a','.flac',
'.pdf',]
- [x] ['.ppt','.pptx','.pptm']
- [x] ['.png','.jpg','.jpeg','.bmp','.tif']
- [x] others (rules follow [tika](http://tika.apache.org/1.20/formats.html))

- [x] Filters out the following formats (not renamed)
- [x] ['.bat','.jar','.exe','.py','.ini']

- [x] Supported platforms
- [x] Win7 32-bit, Win10 64-bit; on other platforms, adjust the code according to the errors you hit
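
The skip list above can be sketched as a small extension check. This is only an illustration, not the script's actual code: the names `SKIPPED_EXTENSIONS`, `should_rename` and `split_candidates` are hypothetical, and the case-insensitive comparison is an assumption.

```python
import os

# Extensions the README lists as filtered out (never renamed).
SKIPPED_EXTENSIONS = {'.bat', '.jar', '.exe', '.py', '.ini'}


def should_rename(path):
    """Return True when the file's extension is not on the skip list.

    Assumption: extensions are compared case-insensitively.
    """
    suffix = os.path.splitext(path)[1].lower()
    return suffix not in SKIPPED_EXTENSIONS


def split_candidates(paths):
    """Split paths into (to_rename, skipped), preserving input order."""
    to_rename = [p for p in paths if should_rename(p)]
    skipped = [p for p in paths if not should_rename(p)]
    return to_rename, skipped
```

Keeping the skip list as a set makes it easy to extend when a new helper file type (for example a config format) should be left alone.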

## Usage

The related files are in the flask_app directory.

- Cloud [tika] deployment

```shell
# Start tika on CentOS
nohup java -Djava.awt.headless=true -jar tika-server.jar --host=yourhost --port=3232 >/dev/null &

# Stop tika on CentOS
ps -ef | grep tika-server | grep -v grep | awk '{print $2}' | xargs kill -9
```

- Local [tika] deployment

```shell

# Start tika on Windows

start /b java -Djava.awt.headless=true -jar tika-server.jar --config=tika-config.xml --host=127.0.0.1 --port=3232

# [tika-config.xml is used to skip the local Tesseract, which speeds up reading non-image files]

# Stop tika on Windows

taskkill /F /FI "IMAGENAME eq java.exe"
```
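
Once a Tika server is listening on port 3232, a client can send it a file and read back plain text. The sketch below is an assumption about how that could look, not the project's own client code (which is not shown here and, judging by the component list, may use `requests`). It relies on the Tika server's `PUT /tika` endpoint with an `Accept: text/plain` header; the helper names `build_tika_request` and `extract_text` are hypothetical.

```python
import urllib.request


def build_tika_request(data, host='127.0.0.1', port=3232):
    """Build a PUT request for the Tika server's /tika endpoint.

    The Accept header asks the server for plain-text output.
    """
    url = 'http://%s:%d/tika' % (host, port)
    return urllib.request.Request(
        url, data=data, method='PUT',
        headers={'Accept': 'text/plain'})


def extract_text(path, host='127.0.0.1', port=3232, timeout=60):
    """Send one file to a running Tika server and return the extracted text."""
    with open(path, 'rb') as fp:
        req = build_tika_request(fp.read(), host, port)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode('utf-8', errors='replace')
```

Calling `extract_text('sample.docx')` needs the server started as shown above; for the cloud deployment, pass the server's address as `host`.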
- Cloud [flask] deployment

```shell
# Start
nohup python3 /pyweb/app.py >/dev/null &

# Stop
ps -ef | grep pyweb | grep -v grep | awk '{print $2}' | xargs kill -9
```

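- Cloud [Tesseract] installation

- Installing Tesseract 4+ on CentOS 6.5
- Follow https://www.jianshu.com/p/bf8521703143 with these differences:
- autoconf-2.63-5.1.el6.noarch works; 2.69 is not required, so keep the existing package
- autoconf-archive-2015.02.24-1.sdl6.noarch.rpm is what was actually installed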

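- Client installation
- installplug.bat -> installs the Java environment
- Put the files to process in the target directory
- Double-click batch-renamer-tika.exe -> processes the target directory
- cmd -> batch-renamer-tika.py 'yourfile' -> processes yourfile (a file or a directory)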
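
The two client modes above (default `target` directory, or a file or directory given on the command line) could be resolved roughly as follows. This is a hedged sketch: `collect_targets` is a hypothetical helper, and scanning only the top level of a directory is an assumption, since the script's real traversal is not shown.

```python
import os


def collect_targets(argv, default_dir='target'):
    """Return the files to process.

    argv[1], when present, may be a single file or a directory;
    without it the default 'target' directory is used.
    Assumption: only files directly inside a directory are listed.
    """
    target = argv[1] if len(argv) > 1 else default_dir
    if os.path.isfile(target):
        return [target]
    if os.path.isdir(target):
        return sorted(
            os.path.join(target, name)
            for name in os.listdir(target)
            if os.path.isfile(os.path.join(target, name)))
    # Nothing usable was given: return an empty work list.
    return []
```

In a script this would typically be called as `collect_targets(sys.argv)` after `import sys`.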

## Roadmap

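- [x] Name files after the beginning of their content
- [x] Name files after recognized image content
- [ ] Name files after keywords extracted from the text (jieba)
- [ ] Name files after an extracted summary of the text (NLP)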
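
The first item, naming a file after the beginning of its content, might look like the sketch below. It is not the project's `clean_txt_func`, which is only partly visible in this commit; `name_from_text`, the 30-character limit and the `untitled` fallback are all hypothetical choices.

```python
import re

# Characters Windows forbids in file names, plus any whitespace.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|\s]+')


def name_from_text(text, max_len=30, fallback='untitled'):
    """Build a file-name stem from the first characters of extracted text."""
    # Collapse each run of forbidden characters into one underscore.
    cleaned = _FORBIDDEN.sub('_', text).strip('_.')
    # Truncate, then drop a separator left dangling at the cut.
    stem = cleaned[:max_len].rstrip('_.')
    return stem or fallback
```

The original extension would then be appended to the returned stem before calling `os.rename`.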

## Licence

[See Licence](#file-batch-renamer)

That's it, enjoy.



37 changes: 16 additions & 21 deletions batch-renamer.py → batch-renamer-old/batch-renamer.py
@@ -26,21 +26,16 @@
"""

#%%

import zhon.hanzi,zhon.cedict
import os,re,io,glob,shutil,string,platform
import itertools as it

import extrectImage
from pdfminer.high_level import extract_text_to_fp

import pytesseract
from PIL import Image

from PIL import Image
from docx import Document
from pptx import Presentation
from xlrd import open_workbook
from win32com.client import Dispatch # for office 97-2003
from win32com.client import Dispatch # for office 97-2003

#%%
def parse_subpath(path,file):
@@ -83,12 +78,12 @@ def clean_txt_func(x,**kwargs):
return xx


#%% rename office,officex
#%% rename office,officex

def rename_officex(file,**kwargs):
'''rename only judgment doc files'''
suffix = os.path.splitext(file)[1]

if suffix == '.docx':
try:
doc = Document(file)
@@ -99,7 +94,7 @@ def rename_officex(file,**kwargs):
return x
except Exception as e:
print('>>> 读取 %s 失败,可能格式不正确 => %s'%(file,e))

if suffix == '.pptx':
try:
prs = Presentation(file)
@@ -117,7 +112,7 @@ def rename_officex(file,**kwargs):
return x
except Exception as e:
print('>>> 读取 %s 失败,可能格式不正确 => %s'%(file,e))

if suffix in ['.xlsx','.xls']:
try:
exl = open_workbook(file)
@@ -149,12 +144,12 @@ def get_txt_text(file,**kwargs):
def rename_office(file,**kwargs):
name = os.path.splitext(file)[0]
suffix = os.path.splitext(file)[1]

if suffix == '.txt':
x = get_txt_text(file,**kwargs)
print('>>> 找到 %s 内容: %s'%(file,x))
os_rename(file,x)

if suffix == '.doc':
file_txt = name + '_doc.txt'
word = Dispatch("Word.Application")
@@ -165,7 +160,7 @@ def rename_office(file,**kwargs):
print('>>> 找到 %s 内容: %s'%(file,x))
os_rename(file,x)
os.remove(file_txt)

if suffix == '.ppt':
txt = []
try:
@@ -186,7 +181,7 @@ def rename_office(file,**kwargs):
x = clean_txt_func(','.join(txt),**kwargs)
print('>>> 找到 %s 内容: %s'%(file,x))
os_rename(file,x)

if suffix == '.xls':
try:
app = Dispatch("Excel.Application")
@@ -206,7 +201,7 @@ def rename_office(file,**kwargs):
x = clean_txt_func(','.join(txt),**kwargs)
print('>>> 找到 %s 内容: %s'%(file,x))
os_rename(file,x)

return True

#%% rename image
@@ -220,10 +215,10 @@ def get_image_txt(file,**kwargs):
print('image size :',img.size)
img = img.crop((0,0,img.width,img.height/img_h))
print('image size 2:',img.size)

pytesseract.pytesseract.tesseract_cmd = 'c:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe' \
if '64bit' in platform.architecture() else 'c:\\Program Files\\Tesseract-OCR\\tesseract.exe'
if '64bit' in platform.architecture() else 'c:\\Program Files\\Tesseract-OCR\\tesseract.exe'

x = pytesseract.image_to_string(img,lang='chi_sim') # eng
x = re.sub(r'\s+',',',x)
print('>>> 解析 %s \n 内容: %s'%(file,x))
@@ -256,7 +251,7 @@ def get_pdf_txt(ifile,**kwargs):
if len(txt) < 10:
print('====decode images===')
extrectImage.main(sourceName=ifile,outputFolder=odir,**kwargs)
subext = [parse_subpath(odir,x) for x in
subext = [parse_subpath(odir,x) for x in
['*.png','*.jpg','*.jpeg','*.bmp','*.tif']]
images = list(it.chain(*(glob.iglob(e) for e in subext)))
print(images)
7 changes: 3 additions & 4 deletions extrectImage.py → batch-renamer-old/extractImage.py
@@ -171,7 +171,6 @@ def get_pdfObj_contents(pdfObj,**kwargs):
img = Image.open(jpgData)
if mode == "CMYK":
# case of CMYK invert all channel

# imgData = list(img.tobytes())
# invData = [(255 - val) & 0xff for val in imgData]
# data = struct.pack("{}B".format(len(invData)), *invData)
@@ -190,7 +189,7 @@ def get_pdfObj_contents(pdfObj,**kwargs):
img.write(data)
img.close()
print('save to:',outFileName + ".jp2")

# case of JBIG2
elif len(leftFilters) == 1 and leftFilters[0] == '/JBIG2Decode':
img = open(outFileName + ".jbig2", "wb")
@@ -222,11 +221,11 @@ def main(sourceName,**kwargs):
outputFolder = kwargs.get('outputFolder',None)
os.makedirs(outputFolder,exist_ok=True)
fileBase = os.path.splitext(os.path.basename(sourceName))[0]

with open(sourceName, "rb") as fp:
pdfObj = PyPDF2.PdfFileReader(fp,strict=False)
get_pdfObj_contents(pdfObj,fileBase=fileBase,**kwargs)

print("Completed.")

# main(sourceName = 'aa.pdf', outputFolder = ".\\Temp",num_pages = 1,targetPage = None)