Handling huge document sets
Overview is designed to support document sets that contain millions of documents. Here are some tips to make sure it's fast enough.
Before reading these tips, try simply uploading your huge document set. Maybe it will work! If it doesn't, read on....
Overview limits document-set sizes so it doesn't get overloaded in production. Set MAX_DOCUMENTS in your Configuration if you want more than two million documents.
If you're on Windows or Mac, some or all of Overview is running in a virtual machine. A virtual machine restricts the resources Overview can use. See the overview-local instructions to increase the amount of memory.
Overview's Tree view eats up a lot of memory: the more words in your document set, the more memory. Increase CLUSTERING_MEMORY in your Configuration if the worker outputs OutOfMemoryError.
If you set CLUSTERING_MEMORY too high, Overview may munch away at all your system's memory until the kernel kills it. When that happens, you'll see nothing but a graceless Process killed message on your console.
Windows users, beware! Make sure you're running a 64-bit version of Java: type java -version on the command line and scan the output for the key words "64-bit" (Java prints them as "64-Bit"). If the output doesn't say "64-bit", it isn't 64-bit; uninstall Java and install a 64-bit version from the JDK download page.
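If you'd rather not squint at the output, the check above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes java is on your PATH. It relies on the JVM printing its version banner to stderr, which is why both output streams are collected.

```python
import re
import subprocess


def is_64bit_java(version_output: str) -> bool:
    """Return True if `java -version` output mentions a 64-bit VM."""
    # Match case-insensitively: the JVM banner spells it "64-Bit".
    return re.search(r"64-bit", version_output, re.IGNORECASE) is not None


def read_java_version() -> str:
    """Run `java -version` and return everything it printed."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    # The version banner goes to stderr, so collect both streams.
    return result.stdout + result.stderr


if __name__ == "__main__":
    try:
        output = read_java_version()
    except FileNotFoundError:
        print("java was not found on the PATH")
    else:
        print("64-bit Java" if is_64bit_java(output) else "Not 64-bit Java")
```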
Overview processes CSV data far more quickly than files or DocumentCloud.
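If your documents are plain-text files, one way to take advantage of this is to bundle them into a single CSV before uploading. The sketch below is an assumption-laden example, not Overview's own tooling: the function name is hypothetical, and the column names (id, title, text) are assumed here, so check Overview's CSV import instructions for the columns it actually expects.

```python
import csv
from pathlib import Path


def write_overview_csv(source_dir: str, csv_path: str) -> int:
    """Write one CSV row per .txt file under source_dir; return the row count."""
    count = 0
    # newline="" lets the csv module control line endings, so newlines
    # inside a document's text are quoted instead of breaking the row.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Assumed column names; adjust them to match Overview's CSV format.
        writer.writerow(["id", "title", "text"])
        for path in sorted(Path(source_dir).rglob("*.txt")):
            # Read one file at a time so memory use stays flat.
            text = path.read_text(encoding="utf-8", errors="replace")
            count += 1
            writer.writerow([count, path.name, text])
    return count


if __name__ == "__main__":
    total = write_overview_csv("documents", "documents.csv")
    print(f"Wrote {total} documents")
```

The script reads a single file at a time, so it should cope with a very large folder; the resulting CSV is the only big thing it produces.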
Overview expects memory accesses to be fast. If you set CLUSTERING_MEMORY too high and your machine has lots of swap space, Overview will become thousands of times slower than it should be. Try disabling swap on your machine before uploading.
If you suspect Overview is spilling into your swap space, stop. You're better off walking to a computer store, buying more memory, installing it in your computer and then restarting the clustering job. You'll finish clustering sooner, and you'll get some fresh air, too.
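On Linux, one way to test that suspicion is to compare the SwapTotal and SwapFree fields in /proc/meminfo while clustering runs. The sketch below assumes a Linux machine (or the Linux virtual machine Overview runs in); the helper names are invented for this example, and other operating systems report swap differently.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Memory fields look like "SwapTotal:  2097152 kB"; keep the number.
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_in_use_kib(meminfo: dict) -> int:
    """Return swap currently in use, in the kB units /proc/meminfo reports."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")
    if meminfo_path.exists():
        used = swap_in_use_kib(parse_meminfo(meminfo_path.read_text()))
        print(f"Swap in use: {used} kB")
    else:
        print("No /proc/meminfo here; this check only works on Linux")
```

If the number keeps climbing during clustering, that's a hint that CLUSTERING_MEMORY is set higher than your physical memory can handle.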
At Overview, we love working with huge document sets. Ask the overview-users group for help and you'll probably get some good advice.