Architecture Overview
Adam Hooper edited this page Aug 10, 2017 · 1 revision
Here's what you need to know if you're developing Overview or designing your own deployment.
- The User: the most important person. Everything is for the user.
- Postgres: stores user information and document sets -- including document text, tags, metadata fields and user-created notes.
- BlobStorage: stores "blobs" -- raw file data. This includes uploaded documents (exactly the bytes uploaded), PDF versions of the documents (for viewing), thumbnails, and -- if splitting by page -- the split versions of said documents. BlobStorage can be configured as S3 or a directory on the filesystem. Postgres contains the information needed to locate and read each blob.
- Web: responds to the user's requests. A large part of this is JavaScript code. There's a public API (which stays stable and authenticates via an "API token") and our private API (which changes and authenticates via a cookie). Web listens on port 9000.
- Reverse Proxy: Users expect to access your website on port 443 -- or if you're insecure, port 80. So you'll need a process that forwards requests on port 443 to the web server on port 9000. Elastic Load Balancer is nice in the cloud; HAProxy is great locally; we also have users who use nginx.
- Redis: Our web server is very fast, except for one key operation: paginating through lists of millions of documents. We cache each search result as a list of document integer IDs (up to 80 MB) in Redis. Fetching a page means asking Redis for the page of IDs, and then fetching the original documents from Postgres.
- Plugins: Every plugin is a website. Web presents a plugin ("View") in an iframe; the iframe's URL contains an API token, so the plugin can query the Web API.
- Worker: this is where the document processing happens:
  - Tree Plugin: an architectural relic, still integrated into Overview proper. Runs processing jobs that take minutes or hours.
  - Search Index: a full-text search engine. Each document set's text and metadata are indexed into a Lucene Directory on the worker's filesystem. The engine can search, extract snippets for search-result presentation, and highlight all search matches in a document.
  - File Importer, CSV Importer and DocumentCloud Importer: processing pipelines. Web sends Worker user-provided data (storing it in Postgres along the way), and Worker converts that user-provided data into documents.
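The Redis bullet above describes a two-step page fetch: read a slice of cached document IDs, then load those documents from Postgres. Here is a minimal Python sketch of that idea. It is not Overview's actual code: `FakeIdCache`, `fetch_page` and the dictionary standing in for Postgres are all hypothetical names, and the way Overview really encodes the ID list in Redis is not covered here.

```python
from typing import Dict, List


class FakeIdCache:
    """In-memory stand-in for Redis (hypothetical): search key -> ordered document IDs."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[int]] = {}

    def store(self, key: str, ids: List[int]) -> None:
        # Cache the full, ordered result of a search once.
        self._lists[key] = list(ids)

    def page(self, key: str, offset: int, limit: int) -> List[int]:
        # Return only the requested slice of IDs; an unknown key yields an empty page.
        return self._lists.get(key, [])[offset:offset + limit]


def fetch_page(cache: FakeIdCache, documents: Dict[int, str],
               key: str, offset: int, limit: int) -> List[str]:
    """Step 1: ask the cache for one page of IDs. Step 2: load those documents.

    `documents` stands in for a Postgres lookup by primary key (an assumption).
    """
    ids = cache.page(key, offset, limit)
    # Preserve the cached ordering; skip IDs whose documents no longer exist.
    return [documents[doc_id] for doc_id in ids if doc_id in documents]
```

The point of the design is that the expensive part (computing and ordering millions of matching IDs) happens once per search, while each page request touches only `limit` IDs and `limit` rows.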
- The User connects to the reverse proxy on port 443 (or, if you're being insecure, port 80).
- The reverse proxy connects to Web on port 9000. The protocol is HTTP.
- Each plugin connects to the reverse proxy the same way the user does.
- Web connects to Worker via akka-remote, on Worker's port 9030.
- Worker responds to Web via akka-remote too, by connecting to Web's port 9031. (Worker never initiates any communications, but Akka is built for peer-to-peer connections. It's simplest to just open the port.)
- Web and Worker connect to Postgres on port 9010 (in development/Docker) or port 5432 (in production).
- Web and Worker connect to BlobStorage via a shared filesystem (in development/Docker) or through HTTP requests to S3.
- Web connects to Redis on port 9020 (in development/Docker) or port 6379 (production).
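When designing a deployment, it can help to write the port list above down as data and check that each port is reachable from the machine that needs it. The following Python sketch does that under some assumptions: the service names (`web`, `worker_akka`, `web_akka`, and so on) and the helper functions are made up for illustration, and the Web and akka-remote ports are assumed to be the same in both environments because the list above does not qualify them.

```python
import socket
from typing import Dict

# Ports taken from the connection list above; the key names are hypothetical.
PORTS: Dict[str, Dict[str, int]] = {
    "development": {
        "web": 9000,          # reverse proxy -> Web (HTTP)
        "worker_akka": 9030,  # Web -> Worker (akka-remote)
        "web_akka": 9031,     # Worker -> Web (akka-remote)
        "postgres": 9010,     # Web and Worker -> Postgres
        "redis": 9020,        # Web -> Redis
    },
    "production": {
        "web": 9000,
        "worker_akka": 9030,
        "web_akka": 9031,
        "postgres": 5432,
        "redis": 6379,
    },
}


def port_for(service: str, environment: str) -> int:
    """Look up the port a service listens on in the given environment."""
    return PORTS[environment][service]


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, running `is_reachable("localhost", port_for("redis", "development"))` on the Web machine checks the Web-to-Redis link; the hostname here is a placeholder for wherever Redis runs in your setup.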