Home
matteoredaelli edited this page Aug 28, 2010
- many crawlers can run concurrently, including remotely
- urls to be analysed are divided into several queues (depending on their depth/priority)
- urls/domains can be filtered using regular expressions
- normalize_url: urls can be normalized/rewritten using many options (max_depth, remove_queries, … )
– normalize_url can be customized for specific urls/domains using regular expression matching
– custom/external normalize_url functions are allowed
- custom/external body analyzers are supported: either inside the ebot system (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better (asynchronously and possibly remotely), outside it by developing a custom queue consumer
- external/internal url referrals can be saved to the database
- many other options: see files ebot.app and sys.config
- the urls are saved to a NoSQL database (apache couchdb or riak) that supports map/reduce queries: NoSQL databases are cheaper and more scalable than relational databases!
- supported backends: apache couchdb and riak (experimental)
- ebot statistics are saved to Round Robin Databases (using rrdtool)
- web REST interface for
- managing start/stop of crawlers
- submitting urls to crawlers (sync or async)
- showing ebot statistics
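
The url handling features listed above (normalize_url options, per-domain rules, regular expression filters and depth-based queues) can be illustrated with a minimal sketch. This is not ebot's Erlang code: it is written in Python for brevity, and every function name, rule table and option semantic below is an assumption made for illustration. In particular, max_depth is assumed to truncate the url path to a number of segments, remove_queries is assumed to drop the query string, and "depth" is assumed to mean the number of path segments. The real settings live in ebot.app and sys.config.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, max_depth=None, remove_queries=False):
    """Hypothetical normalizer; option names come from the list above, semantics are assumed."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if max_depth is not None:
        segments = [s for s in path.split("/") if s]
        if len(segments) > max_depth:
            # Assumed meaning of max_depth: keep only the first N path segments.
            path = "/" + "/".join(segments[:max_depth]) + "/"
    # Assumed meaning of remove_queries: drop everything after "?".
    query = "" if remove_queries else parts.query
    # The fragment is always dropped; scheme and host are lowercased.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


# Hypothetical per-domain rules: the first matching pattern selects the options.
RULES = [
    (re.compile(r"^https?://[^/]*\.example\.org/"),
     {"max_depth": 1, "remove_queries": True}),
]
DEFAULT_OPTIONS = {"max_depth": None, "remove_queries": False}


def normalize_with_rules(url, rules=RULES, default=DEFAULT_OPTIONS):
    """Apply the options of the first rule whose regular expression matches the url."""
    for pattern, options in rules:
        if pattern.search(url):
            return normalize_url(url, **options)
    return normalize_url(url, **default)


def is_allowed(url, include, exclude):
    """Accept a url if it matches any include pattern and no exclude pattern."""
    return (any(p.search(url) for p in include)
            and not any(p.search(url) for p in exclude))


def url_depth(url):
    """Assumed definition of depth: the number of non-empty path segments."""
    return len([s for s in urlsplit(url).path.split("/") if s])


def queue_for(url, queue_count=3):
    """Map a url to a queue index; urls deeper than the last queue share that queue."""
    return min(url_depth(url), queue_count - 1)
```

With these assumptions, `normalize_url("HTTP://Example.COM/a/b/c/page.html?x=1#frag", max_depth=2, remove_queries=True)` yields `http://example.com/a/b/`, and a url four segments deep lands in the last of three queues.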