freebuf_crawler

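Uses Scrapy to crawl FreeBuf article metadata and generates a searchable (filterable) single page.
A pocket-sized but fully featured search for FreeBuf articles.
For a live demo, see https://dog.wtf/freebuf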

Running

Change into freebuf_crawler/ and run:

./run.sh

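Background

😊 FreeBuf has broad article coverage and a deep archive.
😕 Its search is hard to use, possibly because it relies on full-text retrieval and has to trade off result relevance.
😕 Sorting by relevance and sorting by time cannot be combined. When sorted by relevance, results are not in chronological order.
As a result, it is hard to get the desired results in scenarios such as:
- Searching by title only for all vulnerabilities and news about a given application, in order to study it systematically
- Searching for "挖洞经验" (bug-hunting experience) and sorting the results by time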

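Requirements

A complete, efficient, and fast search over:
- Title
- Date
- Article grade
  - Red title
  - Gold-coin reward
  - Cash reward
- Tags
- View count
- Comment count
- Summary
- AND combination of all of the above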
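
To make the AND semantics concrete, here is a minimal sketch in Python. The real page does its filtering in the browser with Tabulator, so this only illustrates the intended behavior. The record fields (`title`, `date`, `views`) and the `search` helper are assumptions for illustration, not the crawler's actual schema.

```python
from datetime import date

# Hypothetical records; the field names are assumptions, not the real schema.
articles = [
    {"title": "WebLogic RCE analysis", "date": date(2020, 4, 1), "views": 12000},
    {"title": "Bug hunting experience: IDOR", "date": date(2020, 5, 2), "views": 3000},
    {"title": "WebLogic patch news", "date": date(2019, 1, 15), "views": 500},
]

def search(records, *predicates):
    """Return the records that satisfy every predicate (logical AND)."""
    return [r for r in records if all(p(r) for p in predicates)]

hits = search(
    articles,
    lambda r: "weblogic" in r["title"].lower(),   # title filter
    lambda r: r["date"] >= date(2020, 1, 1),      # date filter
)
# Sort the matches by date, newest first.
hits.sort(key=lambda r: r["date"], reverse=True)
print([r["title"] for r in hits])
```

Each filter is an independent predicate, so adding another field to the query just means passing one more function.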
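Page
- A clean single page
- Dark theme
Data crawling
- Can be paused and resumed: see "Jobs: pausing and resuming crawls" in the Scrapy documentation
- Incremental crawling with de-duplicated results, so new articles are easy to fetch: see keeping-persistent-state-between-batches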
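
For pausing and resuming, Scrapy provides the `JOBDIR` setting (for example `scrapy crawl <spider> -s JOBDIR=<dir>`). The de-duplication requirement can be sketched independently of Scrapy, as below. This is a standard-library illustration under assumptions: the `url` key and both helper names are hypothetical and may not match the project's actual item fields or code.

```python
import json
import os
import tempfile

def load_seen_urls(jl_path):
    """Collect the URLs already stored in a JSON Lines file (empty set if missing)."""
    seen = set()
    if os.path.exists(jl_path):
        with open(jl_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    seen.add(json.loads(line)["url"])
    return seen

def append_new(jl_path, items):
    """Append only unseen items and return how many were written."""
    seen = load_seen_urls(jl_path)
    written = 0
    with open(jl_path, "a", encoding="utf-8") as f:
        for item in items:
            if item["url"] in seen:
                continue  # already stored by an earlier run
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            seen.add(item["url"])
            written += 1
    return written

# Demo against a temporary file rather than the real freebuf.jl.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jl")
first = append_new(demo_path, [{"url": "a", "title": "A"}, {"url": "b", "title": "B"}])
second = append_new(demo_path, [{"url": "b", "title": "B"}, {"url": "c", "title": "C"}])
print(first, second)
```

Because the state lives in the output file itself, a later run needs no extra bookkeeping to know what it has already stored.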

Solution

Environment

Python 3.7.5
scrapy 2.1.0
tabulator

Approach

Crawl in order: https://www.freebuf.com/page/1 ~ https://www.freebuf.com/page/[N]
- If a page has no articles, exit
- If a page has no new articles, exit
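
The two exit conditions above can be sketched as a plain loop. This is an illustration, not the spider's actual code: `crawl_pages`, `fetch_page`, and the `url` key are hypothetical names, and the network is replaced by a small in-memory stand-in.

```python
def crawl_pages(fetch_page, seen_urls, max_pages=10000):
    """Walk page 1, 2, 3, ... and stop at the first page with no articles or no new ones."""
    new_items = []
    for page in range(1, max_pages + 1):
        articles = fetch_page(page)  # list of dicts for that listing page
        if not articles:
            break  # exit: the page has no articles
        fresh = [a for a in articles if a["url"] not in seen_urls]
        if not fresh:
            break  # exit: the page has no new articles
        new_items.extend(fresh)
        seen_urls.update(a["url"] for a in fresh)
    return new_items

# Stand-in for the network: three fake listing pages, newest articles first.
fake_site = {
    1: [{"url": "u5"}, {"url": "u4"}],
    2: [{"url": "u3"}, {"url": "u2"}],
    3: [{"url": "u1"}],
}

def fetch(page):
    return fake_site.get(page, [])

first_run = crawl_pages(fetch, set())
later_run = crawl_pages(fetch, {"u1", "u2", "u3"})
print(len(first_run), len(later_run))
```

The second exit condition is what makes later runs short: listing pages are ordered newest first, so the first page without new articles marks the point where the previous run left off.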
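First run
1. Crawl all article metadata and write it to the JSON Lines file freebuf_crawler/freebuf.jl
2. Build a single-page HTML template with Tabulator for convenient querying (filtering)
3. Inject the crawled data into the template to produce the final single page, freebuf_crawler/freebuf.html
Subsequent runs
1. Crawl new article metadata and append it to the JSON Lines file freebuf_crawler/freebuf.jl
2. Regenerate the single page freebuf_crawler/freebuf.html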
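
The injection step could look roughly like the sketch below. The placeholder token `/*DATA*/`, the variable name `tabledata`, and the `render` helper are hypothetical; the repository's actual template may use a different mechanism.

```python
import json

def render(template, jl_lines, placeholder="/*DATA*/"):
    """Replace the placeholder with a JSON array built from JSON Lines records."""
    rows = [json.loads(line) for line in jl_lines if line.strip()]
    # Escape "</" so article text cannot close the surrounding <script> tag early.
    payload = json.dumps(rows, ensure_ascii=False).replace("</", "<\\/")
    return template.replace(placeholder, payload)

# Tiny stand-ins for the template file and the crawled JSON Lines file.
template = "<script>var tabledata = /*DATA*/;</script>"
lines = ['{"title": "A", "views": 1}', '{"title": "B", "views": 2}']
page = render(template, lines)
print(page)
```

Embedding the data directly in the page keeps the result a single static file that can be served without a backend, at the cost of a larger download.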

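Limitations

Because
- the next page is reached by incrementing the page number
- the exit condition depends on whether a page contains articles
- there are currently 18,000+ articles
it follows that
- crawling is effectively sequential (single-threaded)
- the first crawl takes about an hour
- embedding all 18,000+ article records in a single page makes it fairly large, so the first load may take a moment
  - Uncompressed HTML size: 5.4 MB
  - gzip-compressed response size: 2.1 MB
Since the crawler is incremental, once the first crawl completes, later runs fetch only new articles and therefore finish quickly.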
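
To compare raw and gzip-compressed sizes for a generated page, a check along these lines can be used. The sample payload is synthetic and the `sizes` helper is hypothetical, so the numbers it prints are unrelated to the figures quoted above.

```python
import gzip

def sizes(data: bytes):
    """Return (raw_bytes, gzip_bytes) for a payload."""
    return len(data), len(gzip.compress(data))

# Synthetic, highly repetitive sample; real article data will compress less well.
sample = ('{"title": "example article", "views": 100}\n' * 1000).encode("utf-8")
raw, packed = sizes(sample)
print(raw, packed)
```

For the real file, read it with `open(path, "rb").read()` and pass the bytes to `sizes`.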

References

How to prevent duplicates on Scrapy fetching depending on an existing JSON list
