Handling huge document sets
Overview is designed to support document sets that contain tens of thousands of documents or more. (Our eventual goal is 10 million.) If you want to handle a large document set, here are some hints.
Before reading these tips, try simply uploading your huge document set. Maybe it will work! If it doesn't, read on....
If you see an OutOfMemoryError, your document set is too large for the worker process. Try increasing the -Xmx parameter of your worker instance.
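To confirm that a larger -Xmx value actually reached the JVM, a small check like the one below may help. It is only a sketch: HeapCheck and toMebibytes are hypothetical names, not part of Overview, and it reports the limit for whichever JVM runs it rather than for the Overview worker itself.

```java
// HeapCheck.java: a hypothetical helper, not part of Overview.
class HeapCheck {
    /** Converts a byte count to whole mebibytes, rounding down. */
    static long toMebibytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap this JVM will attempt to use.
        // It tracks -Xmx but may come out slightly lower than the flag.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMebibytes(maxHeap) + " MiB");
    }
}
```

Compile it with javac HeapCheck.java, then run it with the same -Xmx value you intend to give the worker (for example, java -Xmx4g HeapCheck). If the reported figure is far below what you asked for, the flag is probably not being applied.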
Windows users, beware: avoid 32-bit Java if you want to upload tens of thousands of documents. Unfortunately, 32-bit versions of Java don't allow you to increase -Xmx beyond around 1.5 gigabytes. Make sure you're running a 64-bit version of Java: type java -version on the command line and look for the keyword "64-bit". If the output doesn't say "64-bit", it isn't 64-bit; uninstall Java and install a 64-bit version from the JDK download page.
(Note: 64-bit Java won't run on 32-bit machines. Also, avoid Java's automatic installers: they tend to default to a 32-bit version on many 64-bit machines, for reasons that hopefully don't apply to you. That's why we linked to the JDK download page directly.)
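The same 64-bit check can also be made from inside Java. The sketch below is an assumption-laden heuristic, not an official test: ArchCheck and is64Bit are hypothetical names, the sun.arch.data.model property is specific to HotSpot-style JVMs and may be missing elsewhere, and the os.arch fallback simply looks for "64" in the architecture name.

```java
// ArchCheck.java: a hypothetical helper, not part of Overview.
class ArchCheck {
    /**
     * Guesses whether the running JVM is 64-bit.
     * dataModel is the value of sun.arch.data.model ("32", "64", "unknown", or null);
     * osArch is the value of os.arch (for example "x86" or "amd64").
     */
    static boolean is64Bit(String dataModel, String osArch) {
        // Prefer the explicit data model when the JVM reports one.
        if (dataModel != null && !dataModel.equals("unknown")) {
            return dataModel.equals("64");
        }
        // Fallback heuristic: most 64-bit architecture names contain "64".
        return osArch != null && osArch.contains("64");
    }

    public static void main(String[] args) {
        String dataModel = System.getProperty("sun.arch.data.model");
        String osArch = System.getProperty("os.arch");
        System.out.println(is64Bit(dataModel, osArch)
                ? "This looks like a 64-bit JVM."
                : "This does not look like a 64-bit JVM.");
    }
}
```

When in doubt, trust the output of java -version over this heuristic.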
If you set -Xmx too high, Overview may munch away at all your system's memory until the kernel kills it. When that happens, you'll see nothing but a graceless Process killed message on your console. Overview should behave equivalently after either Process killed or OutOfMemoryError, so pick your poison.
Overview expects memory accesses to be fast. If you set -Xmx too high and your machine has lots of swap space, those memory accesses may become very slow indeed. Try disabling swap on your machine before uploading: if that causes you to run out of memory, then you're using swap space.
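One rough way to spot an -Xmx value that invites swapping is to compare the JVM's heap limit with the machine's physical memory. The sketch below assumes a HotSpot/OpenJDK-style JVM that exposes com.sun.management.OperatingSystemMXBean; SwapRiskCheck and heapExceedsRam are hypothetical names, not part of Overview. A heap limit below physical RAM does not guarantee you will stay out of swap, since other processes need memory too.

```java
import java.lang.management.ManagementFactory;

// SwapRiskCheck.java: a hypothetical helper, not part of Overview.
class SwapRiskCheck {
    /** True when the heap limit alone is larger than physical RAM. */
    static boolean heapExceedsRam(long maxHeapBytes, long physicalBytes) {
        return maxHeapBytes > physicalBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();

        // The extended bean is JVM-specific, so check before casting.
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            com.sun.management.OperatingSystemMXBean ext =
                    (com.sun.management.OperatingSystemMXBean) os;
            long ram = ext.getTotalPhysicalMemorySize();
            long swap = ext.getTotalSwapSpaceSize();
            System.out.println("Physical RAM (bytes): " + ram);
            System.out.println("Swap space (bytes):   " + swap);
            System.out.println("Max heap (bytes):     " + maxHeap);
            if (heapExceedsRam(maxHeap, ram)) {
                System.out.println("Warning: -Xmx exceeds physical RAM; the heap may spill into swap.");
            }
        } else {
            System.out.println("This JVM does not expose physical memory figures.");
        }
    }
}
```

Run it with the -Xmx value you plan to use for the worker, on the machine that will do the clustering.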
If you suspect Overview is flowing into your swap space, stop. You're better off walking to a computer store, buying more memory, installing it in your computer and then restarting the clustering job. You'll finish clustering sooner, and you'll get some fresh air, too.
CSV is the quickest format to upload and parse, so try it if you're frustrated by other methods.
At Overview, we love working with huge document sets. Ask the overview-users group for help and you'll probably get some good advice.