Home

matteoredaelli edited this page Aug 28, 2010 · 25 revisions

Architecture

Features

Crawlers

many crawlers can be run concurrently, also remotely
urls to be analysed are divided into several queues (depending on their depth/priority)
urls/domains can be filtered using regular expressions
normalize_url: urls can be normalized/rewritten using many options (max_depth, remove_queries, … )
– normalize_url can be customized for specific urls/domains using regular expression matching
– custom/external normalize_url functions are allowed
custom/external body analyzers are supported: either inside the ebot system (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better (asynchronous and possibly remote), outside it by developing a custom queue consumer
external/internal url referrals can be saved to the database
many other options: see files ebot.app and sys.config
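
The queueing, filtering and normalization features listed above can be illustrated with a minimal sketch. It is written in Python for readability and is not ebot's actual code (ebot is written in Erlang). Only the option names max_depth and remove_queries come from the list above; the option values, the example.com patterns and the helper names (options_for, url_depth, enqueue) are assumptions made for this illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical default options, named after the ones mentioned in the list above.
DEFAULT_OPTIONS = {"max_depth": 2, "remove_queries": True}

# Hypothetical per-domain overrides, selected by regular expression matching.
CUSTOM_OPTIONS = [
    (re.compile(r"^https?://blog\.example\.com/"),
     {"max_depth": 4, "remove_queries": False}),
]

# Hypothetical filter: only urls under example.com are accepted.
ALLOWED = re.compile(r"^https?://([a-z0-9-]+\.)*example\.com(/|$)")


def url_depth(url):
    """Return the number of non-empty path segments of the url."""
    return len([s for s in urlsplit(url).path.split("/") if s])


def options_for(url):
    """Return the first matching custom options, else the defaults."""
    for pattern, opts in CUSTOM_OPTIONS:
        if pattern.search(url):
            return opts
    return DEFAULT_OPTIONS


def normalize_url(url):
    """Lowercase scheme and host, cut the path at max_depth, drop the
    fragment, and drop the query string when remove_queries is set."""
    opts = options_for(url)
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s][: opts["max_depth"]]
    path = "/" + "/".join(segments)
    query = "" if opts["remove_queries"] else parts.query
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def enqueue(queues, url):
    """Normalize and filter the url, then append it to the queue for its
    depth. Returns the queue key, or None if the url was filtered out."""
    url = normalize_url(url)
    if not ALLOWED.search(url):
        return None
    depth = url_depth(url)
    queues.setdefault(depth, []).append(url)
    return depth
```

In this sketch a crawler would consume the lowest-numbered queue first, so that shallow urls are visited before deep ones; how ebot itself maps depth to priority is configured in the files mentioned above.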

Database

the urls are saved to a NoSQL database (Apache CouchDB or Riak) that supports map/reduce queries; NoSQL databases are typically cheaper to run and easier to scale out than relational databases
supported backends: apache couchdb and riak (experimental)
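
As an illustration of the map/reduce query style mentioned above, the following sketch counts stored urls per domain in plain Python. It does not use the CouchDB or Riak APIs, and the document shape ({"url": ...}) is an assumption, not ebot's actual schema.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_doc(doc):
    """Map step: emit one (domain, 1) pair per stored url document."""
    yield urlsplit(doc["url"]).netloc, 1


def reduce_values(values):
    """Reduce step: sum the values emitted for one key."""
    return sum(values)


def map_reduce(docs):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}
```

A real backend runs the map and reduce steps inside the database, close to the data, which is what lets such queries scale with the number of stored urls.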

Statistics

ebot statistics are saved to Round Robin Databases (using rrdtool)

Web Services

web REST interface for:
managing start/stop of crawlers
submitting urls to crawlers (sync or async)
showing ebot statistics

Licence
