Handling huge document sets

Adam Hooper edited this page Dec 2, 2015 · 19 revisions

Overview is designed to support document sets that contain millions of documents. Here are some tips to make sure it's fast enough.

Before reading these tips, try simply uploading your huge document set. Maybe it will work! If it doesn't, read on.

Set the maximum number of documents

Overview limits document-set sizes so it doesn't get overloaded in production. Set MAX_DOCUMENTS in your Configuration if you need more than two million documents.

Give Docker more resources

If you're on Windows or Mac, some or all of Overview is running in a virtual machine. A virtual machine restricts the resources Overview can use. See the overview-local instructions to increase the amount of memory.

Increase clustering memory

Overview's Tree view eats up a lot of memory -- the more words in your document set, the more memory. Increase CLUSTERING_MEMORY in your Configuration if the worker outputs OutOfMemoryError.

If you set CLUSTERING_MEMORY too high, Overview may munch away at all your system's memory until the kernel kills it. When that happens, you'll see nothing but a graceless Process killed message on your console.

Use 64-bit Java

Windows users, beware! Make sure you're running a 64-bit version of Java: type java -version on the command line and look for "64-bit" in the output (Java usually capitalizes it as "64-Bit"). If the output doesn't mention 64-bit, your Java isn't 64-bit; uninstall it and install a 64-bit version from the JDK download page.
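
If you'd rather not eyeball the output, the check can be scripted. This is a minimal sketch, not part of Overview: the function names are made up for illustration, and it assumes a java executable is on your PATH. It matches case-insensitively because Java tends to print "64-Bit".

```python
import subprocess


def is_64bit_java(version_output: str) -> bool:
    """Return True if `java -version` output mentions 64-bit (any capitalization)."""
    return "64-bit" in version_output.lower()


def read_java_version(java_cmd: str = "java") -> str:
    """Run `java -version` and return everything it printed.

    Java prints its version banner to stderr, so both streams are collected.
    """
    result = subprocess.run(
        [java_cmd, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    try:
        output = read_java_version()
    except FileNotFoundError:
        print("No java executable found on the PATH.")
    else:
        print(output.strip())
        print("64-bit Java:", "yes" if is_64bit_java(output) else "NO")
```

If the last line says NO, follow the uninstall-and-reinstall advice above.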

Use CSV upload

Overview processes CSV data far more quickly than uploaded files or DocumentCloud imports.
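
If your documents are already plain text, you can bundle them into a single CSV before uploading. Here is a rough sketch under stated assumptions: the function name is hypothetical, the input is a folder of UTF-8 .txt files, and the "title" and "text" column names are assumptions. Check Overview's CSV import documentation for the columns it actually expects.

```python
import csv
from pathlib import Path


def write_overview_csv(text_dir: Path, csv_path: Path) -> int:
    """Write one CSV row per .txt file in text_dir; return the number of rows."""
    count = 0
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Assumed header: verify the column names against Overview's CSV docs.
        writer.writerow(["title", "text"])
        for path in sorted(text_dir.glob("*.txt")):
            # Undecodable bytes are replaced rather than aborting the whole export.
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([path.name, text])
            count += 1
    return count
```

Calling write_overview_csv(Path("my-documents"), Path("documents.csv")) would produce one file to upload instead of thousands.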

Avoid swap usage (advanced)

Overview expects memory accesses to be fast. If you set CLUSTERING_MEMORY too high and your machine has lots of swap space, Overview will become thousands of times slower than it should be. Try disabling swap on your machine before uploading.

If you suspect Overview is flowing into your swap space, stop. You're better off walking to a computer store, buying more memory, installing it in your computer and then restarting the clustering job. You'll finish clustering sooner, and you'll get some fresh air, too.
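
To see whether swap is in use at all, you can read the kernel's memory counters. This sketch is Linux-only (it parses /proc/meminfo) and the function names are made up for illustration. Run it on the machine, or virtual machine, that actually runs Overview. Swap in use doesn't prove Overview is the culprit, but a number that keeps growing during clustering is a strong hint.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few, such as HugePages_Total, are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kib(meminfo: dict) -> int:
    """Swap currently in use, in kB: SwapTotal minus SwapFree."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


if __name__ == "__main__":
    path = Path("/proc/meminfo")
    if path.exists():
        info = parse_meminfo(path.read_text())
        print("Physical memory (MiB):", info.get("MemTotal", 0) // 1024)
        print("Swap in use (MiB):", swap_used_kib(info) // 1024)
    else:
        print("/proc/meminfo not found; this sketch only works on Linux.")
```

The physical memory figure is also a useful ceiling to keep in mind when choosing CLUSTERING_MEMORY.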

Ask the mailing list

At Overview, we love working with huge document sets. Ask the overview-users group for help and you'll probably get some good advice.
