Handling huge document sets

Jonathan Stray edited this page Apr 21, 2017 · 19 revisions

Overview is designed to support document sets that contain millions of documents. Here are some tips to make sure it's fast enough.

Before reading these tips, try simply uploading your huge document set. Maybe it will work! If it doesn't, read on.

Use the command line

The command line uploader can be a far more robust way to import large numbers of files, for two reasons. First, web browsers don't cope well with selecting more than about 10,000 files at a time. Second, the uploader is restartable: it resumes where it left off if something goes wrong (such as network trouble).
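As a rough pre-flight check, a sketch like the one below counts the files under a folder and compares the total with the "about 10,000 files" figure mentioned above. The function names, the `BROWSER_FILE_LIMIT` constant and the folder path are illustrative assumptions; none of them are part of Overview.

```python
import os

# Assumption: the "about 10,000 files" figure from the tip above,
# treated here as a rough threshold rather than a hard limit.
BROWSER_FILE_LIMIT = 10_000


def count_files(root):
    """Count files (non-directory entries) under root, including subfolders."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def should_use_command_line(file_count, limit=BROWSER_FILE_LIMIT):
    """Return True when the file count exceeds the rough browser limit."""
    return file_count > limit


if __name__ == "__main__":
    # Hypothetical path: replace "." with the folder holding your documents.
    n = count_files(".")
    print(n, "files; prefer the command line uploader:", should_use_command_line(n))
```

Even below the threshold, the command line uploader may still be worth using, because it can resume after a failure.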

Set the maximum number of documents

Overview limits document-set sizes so it doesn't get overloaded in production. Set MAX_DOCUMENTS in your Configuration if you need more than two million documents in a set.

Give Docker more resources

If you're on Windows or Mac, some or all of Overview is running in a virtual machine. A virtual machine restricts the resources Overview can use. See the overview-local instructions to increase the amount of memory.

Disk usage and vacuuming

You'll need to ensure Overview has enough disk space for the uploaded documents. Generally, you'll need somewhat more than twice the storage required for the original documents, because Overview stores the original documents, a display PDF version, and various other data. You can check how much space is being used by each component. If Postgres is using more disk space than blob storage (PDF and source file storage), you may need to vacuum the database.
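The "somewhat more than twice" rule of thumb can be turned into a quick estimate. The sketch below is only an illustration: the factor of 2.5 is an assumption standing in for "somewhat more than twice", and the function names, the 100 GiB figure and the path passed to `shutil.disk_usage` are hypothetical.

```python
import shutil

# Assumption: "somewhat more than twice" is modelled as a factor of 2.5.
# Overview does not specify an exact number, so adjust this as needed.
STORAGE_FACTOR = 2.5


def estimated_bytes_needed(original_bytes, factor=STORAGE_FACTOR):
    """Estimate the disk space needed for a given size of original documents."""
    return int(original_bytes * factor)


def has_enough_space(original_bytes, free_bytes, factor=STORAGE_FACTOR):
    """Return True if free_bytes covers the estimated requirement."""
    return free_bytes >= estimated_bytes_needed(original_bytes, factor)


if __name__ == "__main__":
    original = 100 * 1024**3  # hypothetical: 100 GiB of original documents
    # Hypothetical path: point this at the disk that holds Overview's data.
    free = shutil.disk_usage(".").free
    print("needed (bytes):", estimated_bytes_needed(original))
    print("enough space:", has_enough_space(original, free))
```

If Overview runs inside a virtual machine, the free space that matters is the space available to that virtual machine, which may be less than the host's.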

Increase clustering memory

Overview's Tree view eats up a lot of memory -- the more words in your document set, the more memory. Increase CLUSTERING_MEMORY in your Configuration if the worker outputs OutOfMemoryError.

If you set CLUSTERING_MEMORY too high, Overview may munch away at all your system's memory until the kernel kills it. When that happens, you'll see nothing but a graceless Process killed message on your console.

Use 64-bit Java

Windows users, beware! Make sure you're running a 64-bit version of Java: type java -version on the command line and scan the output for the key words "64-bit". If it doesn't say "64-bit", it isn't 64-bit; uninstall Java and install a 64-bit version from the JDK download page.
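The same check can be scripted. The sketch below runs `java -version` and looks for "64-bit" in the output, ignoring case. The function names are illustrative, and the script assumes `java` is on your PATH (it raises `FileNotFoundError` otherwise) and that you have Python 3.7 or later.

```python
import subprocess


def is_64_bit(version_output):
    """Return True if the `java -version` text mentions "64-bit" (any case)."""
    return "64-bit" in version_output.lower()


def java_version_output():
    """Run `java -version` and return its text.

    Java prints version information on stderr, so both streams are collected.
    """
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    return result.stderr + result.stdout


if __name__ == "__main__":
    print("64-bit Java:", is_64_bit(java_version_output()))
```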

Use CSV upload

Overview processes CSV data far more quickly than files or DocumentCloud.

Avoid swap usage (advanced)

Overview expects memory accesses to be fast. If you set CLUSTERING_MEMORY too high and your machine has lots of swap space, Overview will become thousands of times slower than it should be. Try disabling swap on your machine before uploading.

If you suspect Overview is flowing into your swap space, stop. You're better off walking to a computer store, buying more memory, installing it in your computer and then restarting the clustering job. You'll finish clustering sooner, and you'll get some fresh air, too.
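One way to see whether swap is in use on Linux is to read `/proc/meminfo` and subtract `SwapFree` from `SwapTotal`. The sketch below does that; the function name is illustrative, and it is Linux-only. It reports swap use for the whole machine, not for Overview specifically, so treat a non-zero value as a hint rather than proof.

```python
def swap_used_kib(meminfo_text):
    """Return swap in use, in KiB, from the text of /proc/meminfo.

    Returns 0 when the SwapTotal and SwapFree fields are missing.
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if key in ("SwapTotal", "SwapFree") and fields:
            values[key] = int(fields[0])  # the kernel reports these in kB
    return values.get("SwapTotal", 0) - values.get("SwapFree", 0)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print("swap in use (KiB):", swap_used_kib(f.read()))
```

If Overview runs in a virtual machine (as on Windows or Mac), the check would need to run inside that virtual machine to be meaningful.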

Ask the mailing list

At Overview, we love working with huge document sets. Ask the overview-users group for help and you'll probably get some good advice.
