Handling huge document sets
Overview is designed to support document sets that contain millions of documents. Here are some tips to make sure it's fast enough.
Before reading these tips, try simply uploading your huge document set. Maybe it will work! If it doesn't, read on.
The command line uploader can be a far more robust way to import large numbers of files, in part because web browsers don't do well with selecting more than about 10,000 files at a time, and in part because it's restartable and will resume where it left off if something goes wrong (like network troubles).
Overview limits document-set sizes so it doesn't get overloaded on production. Set MAX_DOCUMENTS in your Configuration if you want more than two million.
If you're on Windows or Mac, some or all of Overview is running in a virtual machine. A virtual machine restricts the resources Overview can use. See the overview-local instructions to increase the amount of memory.
You'll need to ensure Overview has enough disk space for the uploaded documents. Generally, you'll need somewhat more than twice the storage required for the original documents, because Overview stores the original documents, a display PDF version, and various other data. You can check how much space is being used by each component. If Postgres is using more disk space than blob storage (PDF and source file storage), you may need to vacuum the database.
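The disk-space rule of thumb above can be checked before you upload. Here's a rough Python sketch: it adds up the size of your source files and compares a multiple of that against the free space on the disk that will hold Overview's data. The function names and the 2.5 multiplier are assumptions made for illustration; the text only says "somewhat more than twice", and Overview doesn't ship this script.

```python
import os
import shutil

# Assumption: "somewhat more than twice" is approximated as 2.5x.
# This is a guess for illustration, not a figure published by Overview.
SPACE_MULTIPLIER = 2.5


def total_size_bytes(root):
    """Sum the sizes of all regular files under root, including subfolders."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def estimated_space_needed(source_bytes, multiplier=SPACE_MULTIPLIER):
    """Estimate the bytes Overview may need for source_bytes of originals."""
    return int(source_bytes * multiplier)


def has_enough_space(source_dir, storage_dir, multiplier=SPACE_MULTIPLIER):
    """Return True if storage_dir's disk has room for the estimated total."""
    needed = estimated_space_needed(total_size_bytes(source_dir), multiplier)
    free = shutil.disk_usage(storage_dir).free
    return free >= needed
```

Call has_enough_space with the folder of documents you plan to upload and any folder on the disk where Overview keeps its data. If Overview runs in a virtual machine, measure the disk inside the virtual machine, not the host's.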
Overview's Tree view eats up a lot of memory -- the more words in your document set, the more memory. Increase CLUSTERING_MEMORY in your Configuration if the worker outputs OutOfMemoryError.
If you set CLUSTERING_MEMORY too high, Overview may munch away at all your system's memory until the kernel kills it. When that happens, you'll see nothing but a graceless "Process killed" message on your console.
Windows users, beware! Make sure you're running a 64-bit version of Java: type java -version on the command line and scan for the keyword "64-bit". If it doesn't say "64-bit", it isn't 64-bit; uninstall Java and install a 64-bit version from the JDK download page.
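If you'd rather not eyeball the output, the same check can be scripted. This Python sketch runs java -version and looks for "64-bit" in what it prints. The function names are made up for illustration. Two details matter: java -version writes to standard error rather than standard output, and common JVMs capitalize the phrase as "64-Bit", so the match ignores case.

```python
import subprocess


def is_64_bit(version_output):
    """Return True if the java -version text mentions 64-bit (any letter case)."""
    return "64-bit" in version_output.lower()


def java_version_output():
    """Run java -version and return everything it printed.

    java -version writes to stderr, so stdout and stderr are both captured.
    Raises FileNotFoundError if java is not on the PATH.
    """
    result = subprocess.run(
        ["java", "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr


def check_java():
    """Return a short human-readable verdict about the installed Java."""
    try:
        output = java_version_output()
    except FileNotFoundError:
        return "java was not found on the PATH"
    if is_64_bit(output):
        return "Java is 64-bit"
    return "Java is not 64-bit: install a 64-bit version"
```

Running print(check_java()) in a Python prompt gives the verdict for whichever java comes first on your PATH.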
Overview processes CSV data far more quickly than files or DocumentCloud.
Overview expects memory accesses to be fast. If you set CLUSTERING_MEMORY too high and your machine has lots of swap space, Overview will become thousands of times slower than it should be. Try disabling swap on your machine before uploading.
If you suspect Overview is flowing into your swap space, stop. You're better off walking to a computer store, buying more memory, installing it in your computer and then restarting the clustering job. You'll finish clustering sooner, and you'll get some fresh air, too.
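One way to test that suspicion on Linux is to look at how much swap is in use while clustering runs. The Python sketch below reads the SwapTotal and SwapFree fields from /proc/meminfo. It assumes a Linux machine (or the Linux virtual machine Overview runs in); the function names are hypothetical, and Windows and Mac hosts report swap through other tools.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: number}.

    Most fields are in kilobytes; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kb(meminfo):
    """Return kilobytes of swap in use, from a parsed meminfo dict."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def read_swap_used_kb(path="/proc/meminfo"):
    """Read the given meminfo file (Linux only) and return swap in use, in kB."""
    with open(path) as f:
        return swap_used_kb(parse_meminfo(f.read()))
```

If read_swap_used_kb() keeps climbing while the clustering job runs, Overview is probably spilling into swap, and it's time to stop the job.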
At Overview, we love working with huge document sets. Ask the overview-users group for help and you'll probably get some good advice.