Home
matteoredaelli edited this page Aug 28, 2010
·
25 revisions
- url data is saved to a NoSQL database (Apache CouchDB or Riak) that supports map/reduce queries: NoSQL databases are often cheaper to run and easier to scale out than relational databases
- supported backends: Apache CouchDB and Riak (experimental)
- processed and to-be-processed urls are sent to AMQP queues (for persistence and load balancing across crawlers)
- to-be-processed urls are distributed across different priority queues: you can run more crawlers for the highest-priority queues, and you can use your own module/function to decide the priority of urls
- many crawlers can run concurrently, including on remote hosts
- urls/domains can be filtered using regular expressions and custom functions
- urls can be normalized/rewritten using many options (max_depth, remove_queries, string replacements, custom functions, …)
- custom/external body analyzers are supported: either inside the ebot system (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better, outside it by developing a custom queue consumer (asynchronous and possibly remote)
- url referrals can be saved to the database, optionally restricted to external, same-domain and/or same-main-domain referrals
- many other options: see the files ebot.app and sys.config
- ebot statistics are saved to Round Robin Databases (using rrdtool)
- web REST interface for
  - managing start/stop of crawlers
  - submitting urls to crawlers (sync or async)
  - showing ebot statistics
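
To make the normalization, filtering and priority features above more concrete, here is a minimal sketch in Python. It is not ebot's code (ebot is written in Erlang and is configured through ebot.app and sys.config). The function names `normalize_url`, `is_allowed` and `url_priority`, the exact meaning given to `max_depth` and `remove_queries`, and the "shallower paths first" priority rule are all assumptions made for illustration only.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, max_depth=None, remove_queries=False, replacements=()):
    """Return a normalized copy of url (hypothetical semantics, not ebot's).

    max_depth:      keep at most this many path segments
    remove_queries: drop the query string
    replacements:   (regex, replacement) pairs applied to the final string
    """
    parts = urlsplit(url)
    path = parts.path
    if max_depth is not None:
        segments = [s for s in path.split("/") if s]
        path = "/" + "/".join(segments[:max_depth])
    query = "" if remove_queries else parts.query
    # This sketch also drops the fragment and lowercases scheme and host.
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, query, "")
    )
    for pattern, replacement in replacements:
        normalized = re.sub(pattern, replacement, normalized)
    return normalized


def is_allowed(url, accept_patterns, reject_patterns=()):
    """Regex filter: reject patterns win, then at least one accept must match."""
    if any(re.search(p, url) for p in reject_patterns):
        return False
    return any(re.search(p, url) for p in accept_patterns)


def url_priority(url, queues=3):
    """Assumed priority rule: shallower paths go to higher-priority queues.

    Returns a queue index in [0, queues - 1], where 0 is the highest priority.
    """
    depth = len([s for s in urlsplit(url).path.split("/") if s])
    return min(depth, queues - 1)
```

A crawler built along these lines would normalize each discovered url, discard it if `is_allowed` returns false, and publish the rest to the queue chosen by `url_priority`. For example, `normalize_url("HTTP://Example.COM/a/b/c/d?x=1#top", max_depth=2, remove_queries=True)` gives `http://example.com/a/b`.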