matteoredaelli edited this page Aug 28, 2010 · 25 revisions

Architecture

Features

Crawlers

  • many crawlers can run concurrently, including remotely
  • urls to be analysed are divided into several queues (depending on their depth/priority)
  • urls/domains can be filtered using regular expressions
  • normalize_url: urls can be normalized/rewritten using many options (max_depth, remove_queries, … )
    – normalize_url can be customized for specific urls/domains using regular expression matching
    – custom/external normalize_url functions are allowed
  • custom/external body analyzers are supported: either inside the ebot system as plugins (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better, outside it by developing a custom queue consumer (asynchronous and possibly remote)
  • external/internal url referrals can be saved to the database
  • many other options: see files ebot.app and sys.config
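
The normalization, filtering and queueing features above can be sketched as follows. This is a minimal illustration in Python, not ebot's Erlang implementation: the option values, the regular expressions, the helper names (options_for, url_depth, enqueue, next_url) and the rule that shallower urls are served first are all assumptions made for the example, and max_depth is assumed here to truncate the url path.

```python
import re
from collections import deque
from urllib.parse import urlsplit, urlunsplit

# Hypothetical defaults; only the option names come from the feature list.
DEFAULT_OPTIONS = {"max_depth": 2, "remove_queries": True}

# Hypothetical per-domain overrides, selected by regular expression matching.
OVERRIDES = [
    (re.compile(r"^https?://blog\.example\.org/"),
     {"max_depth": 4, "remove_queries": False}),
]

# Hypothetical filter: urls whose path matches any pattern are dropped.
REJECT = [re.compile(r"\.(jpg|png|zip)$", re.IGNORECASE)]


def options_for(url):
    """Return the default options merged with the first matching override."""
    for pattern, opts in OVERRIDES:
        if pattern.search(url):
            return {**DEFAULT_OPTIONS, **opts}
    return DEFAULT_OPTIONS


def url_depth(url):
    """Depth of a url = number of non-empty path segments."""
    return len([s for s in urlsplit(url).path.split("/") if s])


def normalize_url(url):
    """Lower-case the host, truncate the path, optionally drop the query."""
    opts = options_for(url)
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    path = "/" + "/".join(segments[: opts["max_depth"]])
    query = "" if opts["remove_queries"] else parts.query
    # The fragment is always dropped in this sketch.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def enqueue(queues, url):
    """Normalize and filter a url, then append it to the queue for its depth."""
    url = normalize_url(url)
    if any(p.search(urlsplit(url).path) for p in REJECT):
        return None
    depth = url_depth(url)
    queues.setdefault(depth, deque()).append(url)
    return depth


def next_url(queues):
    """Pop from the shallowest non-empty queue (assumed priority rule)."""
    for depth in sorted(queues):
        if queues[depth]:
            return queues[depth].popleft()
    return None
```

With these assumed options, "HTTP://Example.org/a/b/c/d?x=1" is rewritten to "http://example.org/a/b", while urls under blog.example.org keep up to four path segments and their query string.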

Database

  • the urls are saved to a NoSQL database (Apache CouchDB or Riak) that supports map/reduce queries: NoSQL databases are generally cheaper and more scalable than relational databases!
  • supported backends: Apache CouchDB and Riak (experimental)
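
To illustrate what a map/reduce query over the saved urls looks like, here is a small Python sketch of the query model only. It does not talk to CouchDB or Riak, and the document fields (_id, http_status) and function names are hypothetical, not ebot's actual schema.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical url documents; field names are illustrative only.
docs = [
    {"_id": "http://example.org/", "http_status": 200},
    {"_id": "http://example.org/about", "http_status": 200},
    {"_id": "http://example.net/", "http_status": 404},
]


def map_doc(doc):
    """Map step: emit a (domain, 1) pair for every stored url."""
    yield urlsplit(doc["_id"]).netloc, 1


def reduce_values(values):
    """Reduce step: sum the values emitted for one key."""
    return sum(values)


def run_view(documents):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


urls_per_domain = run_view(docs)
```

In a real backend the map and reduce functions run inside the database, so the same count-per-domain query can be spread over many nodes.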

Statistics

ebot statistics are saved to Round Robin Databases (using rrdtool)
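
The idea behind a round robin database is a fixed number of slots in which the newest sample overwrites the oldest, so the store never grows. A minimal Python sketch of that idea follows; the class and method names are hypothetical, and the consolidation functions and fixed time steps that rrdtool provides are left out.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size series: once full, each new sample evicts the oldest."""

    def __init__(self, slots):
        # deque(maxlen=...) discards items from the left when it is full.
        self.samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        """Record one (timestamp, value) sample."""
        self.samples.append((timestamp, value))

    def average(self):
        """Mean of the values currently retained, or None if empty."""
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)
```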

Web Services

  • a REST web interface for:
    – managing start/stop of crawlers
    – submitting urls to crawlers (sync or async)
    – showing ebot statistics

Licence

GPL V3+
