Home
matteoredaelli edited this page Aug 28, 2010 · 25 revisions
- URL data is saved to a NoSQL database (Apache CouchDB or Riak) that supports map/reduce queries: NoSQL databases are cheaper and more scalable than relational databases!
- Supported backends: Apache CouchDB and Riak (experimental)
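To illustrate why map/reduce views fit this kind of data, here is a toy map/reduce pass in Python over URL documents, mimicking the kind of per-domain aggregation a CouchDB or Riak view could perform. The document fields (`url`, `status`) are hypothetical, not ebot's actual schema.

```python
# Toy map/reduce over URL documents (illustrative, not ebot's schema).
from collections import defaultdict
from urllib.parse import urlparse

docs = [
    {"url": "http://example.com/a", "status": "visited"},
    {"url": "http://example.com/b", "status": "queued"},
    {"url": "http://example.org/", "status": "visited"},
]

def map_phase(doc):
    # emit (domain, 1) for every visited URL
    if doc["status"] == "visited":
        yield urlparse(doc["url"]).netloc, 1

def reduce_phase(pairs):
    # sum the emitted counts per key
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_phase(kv for doc in docs for kv in map_phase(doc))
print(counts)  # {'example.com': 1, 'example.org': 1}
```

In CouchDB the map and reduce steps would be stored as a design-document view and run inside the database, so the crawler never has to pull all URL documents over the wire.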
- Processed and to-be-processed URLs are sent to AMQP queues (for persistence and load balancing of crawlers)
- To-be-processed URLs are distributed across different priority queues: you can run more crawlers for the highest-priority queues, and you can use your own module/function to decide the priority of URLs
- Many crawlers can run concurrently, also on remote nodes
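A custom priority function might route each URL to a queue like the following Python sketch. The queue names and the priority rule are invented for illustration; in ebot this decision is made by a pluggable Erlang module/function.

```python
# Minimal sketch of URL-to-priority-queue routing.
# Queue names and rules are hypothetical, not ebot defaults.
from urllib.parse import urlparse

PREFERRED_DOMAINS = {"example.com"}  # hypothetical configuration

def queue_for(url: str) -> str:
    parts = urlparse(url)
    if parts.netloc in PREFERRED_DOMAINS:
        return "ebot.url.high"    # run more crawler workers on this queue
    if parts.path.count("/") <= 1:
        return "ebot.url.medium"  # shallow pages before deep ones
    return "ebot.url.low"

print(queue_for("http://example.com/news"))      # ebot.url.high
print(queue_for("http://other.org/index.html"))  # ebot.url.medium
print(queue_for("http://other.org/a/b/c.html"))  # ebot.url.low
```

Because the queues live in an AMQP broker, attaching more consumers to `ebot.url.high` is all it takes to crawl high-priority URLs faster.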
- URLs/domains can be filtered using regular expressions
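Regex-based filtering typically combines allow and deny lists, as in this Python sketch; the patterns shown are examples, not ebot's defaults.

```python
# Hedged sketch of regex URL/domain filtering (example patterns only).
import re

ALLOW = [re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")]
DENY = [re.compile(r"\.(jpg|png|gif|css|js)$", re.IGNORECASE)]

def accept(url: str) -> bool:
    # a URL passes if it matches at least one allow pattern
    # and no deny pattern
    if not any(p.search(url) for p in ALLOW):
        return False
    return not any(p.search(url) for p in DENY)

print(accept("http://www.example.com/page.html"))  # True
print(accept("http://www.example.com/logo.png"))   # False
print(accept("http://elsewhere.org/page.html"))    # False
```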
- normalize_url: URLs can be normalized/rewritten using many options (max_depth, remove_queries, …)
  - normalize_url can be customized for specific URLs/domains using regular expression matching
  - custom/external normalize_url functions are allowed
- Custom/external body analyzers are supported: internally within the ebot system (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better (async and possibly remote), externally by developing a custom queue consumer
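A URL normalizer with options in the spirit of max_depth and remove_queries might look like this Python sketch; the option names mirror the text above, but the implementation is illustrative, not ebot's Erlang code.

```python
# Illustrative URL normalizer; option names echo ebot's wiki, the
# behavior is a plausible guess, not ebot's actual implementation.
from urllib.parse import urlparse, urlunparse

def normalize_url(url, max_depth=None, remove_queries=False):
    parts = urlparse(url)
    path = parts.path
    if max_depth is not None:
        # keep at most max_depth path segments
        segments = [s for s in path.split("/") if s]
        path = "/" + "/".join(segments[:max_depth])
    query = "" if remove_queries else parts.query
    # lowercase the host; drop params and fragment
    return urlunparse((parts.scheme, parts.netloc.lower(), path, "", query, ""))

print(normalize_url("http://Example.com/a/b/c?id=1",
                    max_depth=2, remove_queries=True))
# http://example.com/a/b
```

Normalization like this keeps the crawler from treating `http://Example.com/a?id=1` and `http://example.com/a` as distinct pages.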
- External/internal URL referrals can be saved to the database
- Many other options: see the files ebot.app and sys.config
- ebot statistics are saved to Round Robin Databases (using rrdtool)
- Web REST interface for:
  - managing start/stop of crawlers
  - submitting URLs to crawlers (sync or async)
  - showing ebot statistics