Handling huge document sets
Overview is designed to support document sets that contain hundreds of thousands of documents or more. (Our eventual goal is 10 million.) If you want to handle a large document set, here are some hints.
Before reading these tips, try simply uploading your huge document set. Maybe it will work! If it doesn't, read on.
When we say "set max_documents to 200000", what do we mean? We have a page about that: Setting configuration options.
Overview runs with a maximum document set size. It will read a CSV file or uploaded PDFs past this size, but all of the documents over this limit will be discarded. It's set in worker-conf/application.conf:
# Maximum number of documents to retrieve for a document set
max_documents=200000
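A minimal sketch of making that edit from the command line, assuming GNU sed and the config path shown above (here run against a scratch copy so nothing real is touched):

```shell
# Sketch: raise max_documents in worker-conf/application.conf.
# We write a scratch copy first; point conf at your real checkout instead.
conf=/tmp/application.conf
printf '%s\n' \
  '# Maximum number of documents to retrieve for a document set' \
  'max_documents=200000' > "$conf"

# Replace the existing value in place (GNU sed syntax).
sed -i 's/^max_documents=.*/max_documents=500000/' "$conf"

grep '^max_documents' "$conf"
```

Restart the worker after changing the file, or the old limit stays in effect.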
But that doesn't guarantee that the worker process has enough memory to actually succeed. If you see an OutOfMemoryError in the console, or a similar message on your failed job within the Overview UI, your document set is too large for the worker process.
Try increasing the -Xmx parameter of your worker instance. To do this, change the command line the worker JVM starts with, in runner/src/main/scala/org/overviewproject/runner/Main.scala:
Seq(Flags.DatabaseUrl, Flags.DatabaseDriver, "-Dlogback.configurationFile=workerdevlog.xml", "-Xmx2g"),
That "-Xmx2g" means give the worker a maximum of 2GB of memory to play with.
In normal operation you would set max_documents so that oversized document sets get truncated rather than running the worker out of memory. The limit works out to roughly 2GB of worker memory per hundred thousand documents. It does depend on the length of each document, though, so really you can only aim for "most document sets will succeed."
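The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is only a heuristic, not anything Overview computes itself:

```shell
# Rough -Xmx sizing from the ~2GB-per-100,000-documents rule of thumb.
# Rounds the document count up to the next 100,000 before multiplying.
max_documents=200000
heap_gb=$(( (max_documents + 99999) / 100000 * 2 ))
echo "-Xmx${heap_gb}g"   # -Xmx4g for 200,000 documents
```

So a 250,000-document limit would suggest -Xmx6g; remember this is per-worker heap, on top of what the OS and database need.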
If you want to try your luck getting that huge document set in, set max_documents to some huge number and give the worker as much physical memory as you have, less a half gig or so for the OS.
If you set -Xmx too high, Overview may munch away at all your system's memory until the kernel kills it. When that happens, you'll see nothing but a graceless "Process killed" message on your console. Overview should behave equivalently after either "Process killed" or an OutOfMemoryError, so pick your poison.
Windows users, beware: avoid 32-bit Java if you want to upload tens of thousands of documents. Unfortunately, 32-bit versions of Java don't allow you to increase -Xmx beyond around 1.5 gigabytes. Make sure you're running a 64-bit version of Java: type java -version on the command line and scan for the key words "64-bit". If it doesn't say "64-bit", it isn't 64-bit; uninstall Java and install a 64-bit version from the JDK download page.
(Note: 64-bit Java won't run on 32-bit machines. Also, avoid Java's automatic installers: they tend to default to a 32-bit version on many 64-bit machines, for reasons that hopefully don't apply to you. That's why we linked to the JDK download page directly.)
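The version check above can be scripted. A small sketch (the exact banner text varies between JDK vendors, so this pattern match is an approximation):

```shell
# Sketch: report whether the installed JVM identifies itself as 64-bit.
# "java -version" prints its banner to stderr, hence the 2>&1.
is_64bit_java() {
  java -version 2>&1 | grep -q '64-[Bb]it'
}

if is_64bit_java; then
  echo "64-bit Java: large -Xmx values should work"
else
  echo "no 64-bit Java found: -Xmx beyond ~1.5g will fail"
fi
```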
CSV is the quickest format to upload and parse, so try it if you're frustrated by other methods.
Overview expects memory accesses to be fast. If you set -Xmx too high and your machine has lots of swap space, those memory accesses may become very slow indeed. Try disabling swap on your machine before uploading: if that makes you run out of memory, then you were relying on swap space.
If you suspect Overview is flowing into your swap space, stop. You're better off walking to a computer store, buying more memory, installing it in your computer and then restarting the clustering job. You'll finish clustering sooner; and you'll get some fresh air, too.
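One quick way to tell whether you're dipping into swap, on Linux only (a diagnostic sketch, not part of Overview itself): read the swap counters from /proc/meminfo before and during a clustering job. A growing number means the worker has spilled into swap.

```shell
# Linux-only sketch: how much swap is currently in use, in kB.
# SwapTotal minus SwapFree = swap in use.
if [ -r /proc/meminfo ]; then
  swap_used_kb=$(awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {print t - f}' /proc/meminfo)
  echo "swap in use: ${swap_used_kb} kB"
else
  echo "no /proc/meminfo; this check is Linux-only"
fi
```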
At Overview, we love working with huge document sets. Ask the overview-users group for help and you'll probably get some good advice.