Home
matteoredaelli edited this page Aug 28, 2010 · 25 revisions
- URL data is saved to a NoSQL database (Apache CouchDB or Riak) that supports map/reduce queries: NoSQL databases are cheaper and more scalable than relational databases!
- Supported backends: Apache CouchDB and Riak (experimental)
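To illustrate why map/reduce views fit this kind of data, here is a toy map/reduce pass in Python over URL documents, mimicking the kind of per-domain aggregation a CouchDB or Riak view could perform. The document fields (`url`, `status`) are hypothetical, not ebot's actual schema.

```python
# Toy map/reduce over URL documents (illustrative, not ebot's schema).
from collections import defaultdict
from urllib.parse import urlparse

docs = [
    {"url": "http://example.com/a", "status": "visited"},
    {"url": "http://example.com/b", "status": "queued"},
    {"url": "http://example.org/", "status": "visited"},
]

def map_phase(doc):
    # emit (domain, 1) for every visited URL
    if doc["status"] == "visited":
        yield urlparse(doc["url"]).netloc, 1

def reduce_phase(pairs):
    # sum the emitted counts per key
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_phase(kv for doc in docs for kv in map_phase(doc))
print(counts)  # {'example.com': 1, 'example.org': 1}
```

In CouchDB the map and reduce steps would be stored as a design-document view and run inside the database, so the crawler never has to pull all URL documents over the wire.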
- Processed and to-be-processed URLs are sent to AMQP queues (for persistence and load balancing of crawlers)
- To-be-processed URLs are distributed across different priority queues: you can run more crawlers for the highest-priority queues, and you can use your own module/function to decide the priority of URLs
- Many crawlers can run concurrently, also on remote nodes
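A custom priority function might route each URL to a queue like the following Python sketch. The queue names and the priority rule are invented for illustration; in ebot this decision is made by a pluggable Erlang module/function.

```python
# Minimal sketch of URL-to-priority-queue routing.
# Queue names and rules are hypothetical, not ebot defaults.
from urllib.parse import urlparse

PREFERRED_DOMAINS = {"example.com"}  # hypothetical configuration

def queue_for(url: str) -> str:
    parts = urlparse(url)
    if parts.netloc in PREFERRED_DOMAINS:
        return "ebot.url.high"    # run more crawler workers on this queue
    if parts.path.count("/") <= 1:
        return "ebot.url.medium"  # shallow pages before deep ones
    return "ebot.url.low"

print(queue_for("http://example.com/news"))      # ebot.url.high
print(queue_for("http://other.org/index.html"))  # ebot.url.medium
print(queue_for("http://other.org/a/b/c.html"))  # ebot.url.low
```

Because the queues live in an AMQP broker, attaching more consumers to `ebot.url.high` is all it takes to crawl high-priority URLs faster.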
- URLs/domains can be filtered using regular expressions
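Regex-based filtering typically combines allow and deny lists, as in this Python sketch; the patterns shown are examples, not ebot's defaults.

```python
# Hedged sketch of regex URL/domain filtering (example patterns only).
import re

ALLOW = [re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")]
DENY = [re.compile(r"\.(jpg|png|gif|css|js)$", re.IGNORECASE)]

def accept(url: str) -> bool:
    # a URL passes if it matches at least one allow pattern
    # and no deny pattern
    if not any(p.search(url) for p in ALLOW):
        return False
    return not any(p.search(url) for p in DENY)

print(accept("http://www.example.com/page.html"))  # True
print(accept("http://www.example.com/logo.png"))   # False
print(accept("http://elsewhere.org/page.html"))    # False
```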
- normalize_url: URLs can be normalized/rewritten using many options (max_depth, remove_queries, …)
  - normalize_url can be customized for specific URLs/domains using regular expression matching
  - custom/external normalize_url functions are allowed
- Custom/external body analyzers are supported: internally within the ebot system (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better (async and possibly remote), externally by developing a custom queue consumer
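A URL normalizer with options in the spirit of max_depth and remove_queries might look like this Python sketch; the option names mirror the text above, but the implementation is illustrative, not ebot's Erlang code.

```python
# Illustrative URL normalizer; option names echo ebot's wiki, the
# behavior is a plausible guess, not ebot's actual implementation.
from urllib.parse import urlparse, urlunparse

def normalize_url(url, max_depth=None, remove_queries=False):
    parts = urlparse(url)
    path = parts.path
    if max_depth is not None:
        # keep at most max_depth path segments
        segments = [s for s in path.split("/") if s]
        path = "/" + "/".join(segments[:max_depth])
    query = "" if remove_queries else parts.query
    # lowercase the host; drop params and fragment
    return urlunparse((parts.scheme, parts.netloc.lower(), path, "", query, ""))

print(normalize_url("http://Example.com/a/b/c?id=1",
                    max_depth=2, remove_queries=True))
# http://example.com/a/b
```

Normalization like this keeps the crawler from treating `http://Example.com/a?id=1` and `http://example.com/a` as distinct pages.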
- External/internal URL referrals can be saved to the database
- Many other options: see the files ebot.app and sys.config
- ebot statistics are saved to Round Robin Databases (using rrdtool)
- Web REST interface for:
  - managing start/stop of crawlers
  - submitting URLs to crawlers (sync or async)
  - showing ebot statistics