Home


matteoredaelli edited this page Aug 28, 2010 · 25 revisions

Architecture

Features

Crawlers

many crawlers can run concurrently, including remotely
urls to be analysed are divided into several queues (depending on their depth/priority)
urls/domains can be filtered using regular expressions
urls can be normalized/rewritten using many options (max_depth, remove_queries, … )
custom body analyzers are supported: either inside the ebot system (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better, outside it by developing a custom queue consumer (asynchronous and possibly remote)
external/internal url referrals can be saved to the database
many other options: see files ebot.app and ebot_local.config
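
The normalization, filtering and queueing steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not ebot's Erlang code: the names `normalize`, `submit` and `next_url`, the accept pattern, and the reading of max_depth as "keep at most this many path segments" are all assumptions made for the example.

```python
import re
from collections import deque
from urllib.parse import urlsplit, urlunsplit

# Hypothetical settings, loosely named after the options mentioned above.
MAX_DEPTH = 2            # assumption: keep at most this many path segments
REMOVE_QUERIES = True    # assumption: drop the ?query part of each url
ACCEPT_URL = re.compile(r"^https?://([a-z0-9-]+\.)*example\.org(/|$)")


def path_segments(url):
    """Return the non-empty path segments of a url."""
    return [s for s in urlsplit(url).path.split("/") if s]


def normalize(url):
    """Lowercase scheme and host, drop the fragment, trim path and query."""
    parts = urlsplit(url)
    path = "/" + "/".join(path_segments(url)[:MAX_DEPTH])
    query = "" if REMOVE_QUERIES else parts.query
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


# One queue per depth; a lower depth is treated as a higher priority.
queues = {d: deque() for d in range(MAX_DEPTH + 1)}


def submit(url):
    """Normalize a url, filter it with the regular expression, then enqueue it."""
    url = normalize(url)
    if not ACCEPT_URL.match(url):
        return None  # rejected by the filter
    queues[len(path_segments(url))].append(url)
    return url


def next_url():
    """Pop the next url to crawl, shallowest queue first."""
    for d in sorted(queues):
        if queues[d]:
            return queues[d].popleft()
    return None
```

Splitting by depth lets a crawler drain shallow pages before deep ones; the real options and their exact semantics are the ones documented in ebot.app and ebot_local.config.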

Database

the urls are saved to a NoSQL database that supports map/reduce queries; NoSQL databases are generally cheaper to run and easier to scale than relational databases
supported backends: Apache CouchDB and Riak
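
As a rough idea of what a map/reduce query over stored urls looks like, the sketch below counts urls per host in plain Python. It does not use the CouchDB or Riak APIs; the document fields and the function names (`map_doc`, `reduce_values`, `map_reduce`) are hypothetical and do not reflect ebot's actual schema.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical url documents; the field names are illustrative only.
docs = [
    {"url": "http://example.org/", "http_status": 200},
    {"url": "http://example.org/about", "http_status": 200},
    {"url": "http://example.net/", "http_status": 404},
]


def map_doc(doc):
    """Map step: emit one (key, value) pair per document."""
    yield urlsplit(doc["url"]).netloc, 1


def reduce_values(values):
    """Reduce step: combine all values emitted for the same key."""
    return sum(values)


def map_reduce(documents):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}
```

In a real backend the map and reduce functions run inside the database, which is what allows the query to be distributed across nodes.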

Statistics

ebot statistics are saved to Round Robin Databases (using rrdtool)
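
The idea behind a round robin database is a fixed number of slots in which the newest sample overwrites the oldest, so storage never grows. The Python sketch below illustrates only that idea; the class `RoundRobinSeries` is hypothetical and is not rrdtool's interface or file format.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size series: once full, each new sample evicts the oldest one."""

    def __init__(self, slots):
        # deque(maxlen=...) drops items from the left when it overflows
        self.samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        """Record one (timestamp, value) sample."""
        self.samples.append((timestamp, value))

    def average(self):
        """Mean of the values currently retained, or None if empty."""
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)
```

A series created with three slots and fed four samples keeps only the last three, which is why this kind of store suits long-running crawler statistics.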

Web Services

REST web interface for:
starting and stopping crawlers
submitting urls to crawlers (sync or async)
showing ebot statistics
