Ingest Pipeline
Overview's unit of analysis is a "Document," but the user uploads "Files." How do we convert one to the other?
A single Document is Overview's unit of analysis, presented to the user. It has:
- `text`: valid UTF-8 with a character-count limit, stored in Postgres.
- `pdf`: valid PDF, optimized to be viewed quickly, stored in S3 or a file.
- `thumbnail`: valid PNG or JPEG image (optional).
- `metadata`: JSON Object, supplied by the user and supplemented by Overview's ingest pipeline steps.
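For concreteness, a Document might be modeled roughly like this (a minimal sketch; the field names follow the list above, but the exact Scala types are assumptions, not Overview's actual schema):

```scala
// Sketch only: illustrative types, not Overview's real model.
case class Document(
  text: String,                    // valid UTF-8, length-limited, stored in Postgres
  pdf: BlobLocation,               // valid, view-optimized PDF, stored in S3 or a file
  thumbnail: Option[BlobLocation], // optional PNG or JPEG
  metadata: Map[String, Any]       // JSON Object: user-supplied keys plus pipeline-supplied keys
)

// Hypothetical pointer to a blob stored in S3 or on disk.
case class BlobLocation(bucket: String, key: String)
```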
A File is input provided by the user.
But let's add derived Files to the mix: data of the same "shape" as a File, produced by running some processing Steps on a user-input File or previously-derived File. (A "Step" is a single operation like "unzip", "OCR" or "convert to PDF.")
A File (user-input or derived) has:
- `blob`: binary data, stored in S3 or on disk.
- `metadata`: JSON Object, supplied by the user and optionally supplemented by Overview's processing pipeline.
- `pipelineOptions`: user-supplied (and Step-augmented) options to be consulted during processing. For instance, `ocr` (Boolean) determines whether to run OCR on a file.
- `filename`: the filename of a user-input file, or the "derived" filename of a derived File: for instance, if the input `filename` is `Archive.zip`, the `filename` of a derived File might be `Archive.zip/report.pdf`.
- `contentType`: what one would see in an HTTP `Content-Type` header. Overview stores all user input as `application/octet-stream`, but some derived files will have known values, such as `application/pdf`.
- `text`, `thumbnail`: (derived Files only) supplemental data about the File, used to create Documents.
- `processingError`: optional error message that should be presented to the user, describing why the user input is invalid or how Overview is broken.
- `nChildren`: number of derived Files produced by processing this File in a single Step.
- `parentId`, `rootId`: `null` for user-input Files. As for derived Files: let's look at the tree structure.
A Step runs on a single File and produces zero, one or many children
(derived Files). For example, if the user uploads `Archive.zip` and
`screenshot.png`, the trees might look like this:
- `Archive.zip`
  - `subarchive.tar.gz`
    - `stats.txt` (unprocessed)
      - **`stats.txt`** (`application/pdf` with `text` and `thumbnail`)
    - `summary.pdf` (unprocessed)
      - **`summary.pdf`** (`application/pdf` with `text` and `thumbnail`)
    - **`run.bin`** (fails to process: `unhandled`)
  - `instructions.docx` (unprocessed)
    - **`instructions.docx`** (`application/pdf` with `text` and `thumbnail`)
- `screenshot.png`
  - **`screenshot.png`** (`application/pdf` with `text` and `thumbnail`)
Files in bold will appear in Overview as Documents and Errors.
Overview creates a (user-visible) Document from leaf Files that are
`application/pdf` with `text` and `thumbnail`. Overview creates Errors from all
Files with `processingError` set. Overview ignores all other Files. (These
decisions come at the "ingest" phase of the pipeline.)
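In code, that ingest-phase decision is a three-way branch. A minimal sketch, using illustrative types and the File attributes described above (not Overview's actual code):

```scala
// Illustrative stand-in for a processed File; field names follow the attributes above.
case class ProcessedFile(
  contentType: String,
  nChildren: Int,
  text: Option[String],
  thumbnail: Option[Array[Byte]],
  processingError: Option[String]
)

def ingest(file: ProcessedFile): Unit =
  if (file.processingError.isDefined) {
    // processingError is set: create a user-visible Error
  } else if (
    file.nChildren == 0 &&
    file.contentType == "application/pdf" &&
    file.text.isDefined && file.thumbnail.isDefined
  ) {
    // leaf application/pdf with text and thumbnail: create a user-visible Document
  } else {
    // any other File: ignore it
  }
```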
Processing a single Zipfile full of PDFs that need OCR can take hours (or days).
Overview is built to crash: it runs on laptops that sometimes lose power, and it
runs in the cloud with frequent deploys. It would be unacceptable to restart
processing of a user-input file from the beginning. Furthermore, Overview may
stop during any write: it expects `SIGKILL` at any time. On start, Overview
must pick up where it left off: it must reload state from the database.
The solution: convert a single File to a Tree to Documents and Errors through atomic state transitions. With each transition, Overview makes more data immutable.
To make the Tree a state machine, we make each File a state machine:
- A File is **created**. Think of a "created" File as an open file handle on disk. Its `blob`, `thumbnail` and `text` may be set or unset. (Once they are set, we ignore attempts to set them again.)
- The File transitions to **written**. At this point, `blob`, `thumbnail` and `text` are immutable and the File is ready to be processed by a Step.
- A Step sets `processingError` and `nChildren`, and the File transitions to **processed**. A processed File will never be processed by a Step again.
- Documents and Errors are created, and the File transitions to **ingested**. An ingested File has reached its terminal state: it should disappear from the pipeline's memory.
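Those four states, as a sketch (illustrative names, not Overview's actual classes):

```scala
// Each File moves through these states in order, one atomic transition at a time,
// so a crash always leaves the File in a well-defined state.
sealed trait FileState
case object Created   extends FileState // blob/thumbnail/text may still be unset
case object Written   extends FileState // blob/thumbnail/text immutable; ready for a Step
case object Processed extends FileState // processingError and nChildren set; never processed again
case object Ingested  extends FileState // Documents/Errors created; forgotten by the pipeline
```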
Finally, we have a definition for Step:
A Step operates on a **written** File: it invokes a sequence of atomic database
writes that creates 0, 1 or many **written** and **processed** children. Finally, it
outputs its input as a **processed** File.
(Generally, a Step outputs **written** children; but if it outputs a child that is
`application/pdf` with `text` and `thumbnail`, Overview considers the child
**processed**.)
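As a type, then, a Step might look roughly like this: a sketch under the assumption that "written or processed output" is modeled as an `Either` (Overview's code calls these outputs ConvertOutputElements, as we'll see later):

```scala
import akka.NotUsed
import akka.stream.scaladsl.Source

// Illustrative stand-ins for Overview's pipeline types.
trait WrittenFile
trait ProcessedFile

trait Step {
  // Consume one written File; emit zero or more children (written or processed),
  // then emit the input itself, as processed, last of all.
  def process(input: WrittenFile): Source[Either[WrittenFile, ProcessedFile], NotUsed]
}
```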
One last definition: a Step worker is a single process that continually runs a Step. For instance, an "unzip" worker waits for files to unzip and then unzips them, one at a time. If there are two "unzip" workers, then Overview can unzip two files simultaneously.
Certain rules follow logically:
- A Step operates on a **written** File (converting it to **processed**).
- Steps can recurse, and multiple workers can run a Step concurrently within the same File tree. For instance, unzip might run on `archive.zip`, outputting `archive.zip/subarchive.zip`, which is fed to a second "unzip" Step worker that runs at the same time as the first.
- While the Step runs, a written File may have zero or more children; children may have any state. For instance, `archive.zip`'s "unzip" Step might output `archive.zip/summary.pdf`, which might be completely ingested before the "unzip" Step finishes.
- During resume, a Step may run on a written File that has already output children. When the Step outputs those children, Overview must ignore them: a Step cannot overwrite children, because they are immutable.
- A Step must be deterministic: if it's run multiple times on the same input, it must produce the exact same children and optional error message, in the exact same order.
- Within a Tree, written Files transition to processed in nondeterministic order.
- The pipeline forgets ingested files. If a File is ingested, all its children are ingested.
Basically: when Overview restarts, it may "replay" certain operations: it may unzip a file that was mostly-unzipped already, for instance. Replayed database writes must be no-ops. In-memory, things are different: replaying an unzip should make Overview track child Files' states.
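One common way to get those no-op replays (an assumption about technique, not necessarily Overview's schema): because a Step is deterministic, its Nth child is always the same, so a deterministic unique key lets Postgres skip duplicates.

```scala
// Hypothetical table and column names; the point is the ON CONFLICT clause, which
// turns a replayed child-creation into a no-op (given a unique index on
// (parent_id, index_in_parent)).
val insertChildSql =
  """INSERT INTO file (parent_id, index_in_parent, filename, content_type, blob_location)
    |VALUES (?, ?, ?, ?, ?)
    |ON CONFLICT (parent_id, index_in_parent) DO NOTHING""".stripMargin
```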
Overview's ingest pipeline is often idle and often backlogged. To scale, we want to spin up workers per-Step, on demand.
"On demand" here means database demand. We can't launch infinite workers, because we can't handle infinite concurrent database writes. The bottleneck is the database: the perfect number of workers is exactly enough to keep us writing to the database as quickly as possible.
In other words: the ingest phase dictates demand. It should ask the rest of the pipeline to produce as many processed Files as it can handle. From there, scaling is easy: if a Step's workers are operating at 100% CPU for a long time, we know the ingest phase isn't writing to the database as quickly as it can: we can spin up more workers.
Conversely, we don't want to scale too high. Step workers connect to Overview over HTTP. Overview reduces their CPU load by only reading from their TCP sockets when there's demand. When there's no demand, the workers' HTTP POSTs stall, reducing their CPU usage. Low CPU usage means we can shut down workers.
Overview uses Akka Streams to build its on-demand pipeline. The concepts are universal, though: a stream system like Apache Kafka could serve the same purpose.
By the way: these flow diagrams are event-driven. Conceptually, the whole system is evented and single-threaded.
Streams are demand-driven, and the demand comes from the ingest phase. So let's sketch Overview's design, backwards:
╭─────────╮
ProcessedFile ▶─┤ Ingester├─▶ IngestedFile
╰─────────╯
Ingester demands processed Files. They can come in any order; though as we'll see later, the pipeline makes a best effort to finish processing a File Tree soon after beginning it.
ProcessedFiles have `id`, `parentId`, `rootId` and `nChildren`. From these,
Ingester infers enough Tree information to detect leaf ProcessedFiles and
parent ProcessedFiles whose leaves have all been ingested. It "ingests" them:
creates Documents and Errors and marks them as "ingested."
Then it asks for more.
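Here is a minimal sketch of that bookkeeping (illustrative names, not Overview's code): a File is ingested once its own `nChildren` matches the number of its children already ingested, and ingesting it may cascade up to its parent.

```scala
import scala.collection.mutable

final class IngesterState {
  private val parent           = mutable.Map.empty[Long, Option[Long]] // fileId -> parentId (once processed)
  private val expected         = mutable.Map.empty[Long, Int]          // fileId -> nChildren (once processed)
  private val ingestedChildren = mutable.Map.empty[Long, Int]          // fileId -> children ingested so far

  /** Record a ProcessedFile; return every fileId that can now be ingested, bottom-up. */
  def onProcessed(id: Long, parentId: Option[Long], nChildren: Int): List[Long] = {
    parent(id) = parentId
    expected(id) = nChildren
    maybeIngest(id)
  }

  private def maybeIngest(id: Long): List[Long] =
    if (expected.get(id).contains(ingestedChildren.getOrElse(id, 0))) {
      // ...here Overview would create the Document or Error and mark the File ingested...
      id :: (parent(id) match {
        case Some(p) =>
          ingestedChildren(p) = ingestedChildren.getOrElse(p, 0) + 1
          maybeIngest(p)
        case None => Nil
      })
    } else Nil
}
```

Because `nChildren`, `parentId` and `rootId` are stored on each File, this state can be rebuilt from the database after a crash.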
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ╭───────╮ ╭────────╮ WrittenFile ┃
WrittenFile ━>━╉─>┤ merge ├<─────┤ buffer ├<──────╮ ┃
┃ ╰───┬───╯ ╰────────╯ │ ┃
┃ │ ╭───────────────╮ ╭───┴───╮ ┃
┃ ╰──>┤ StepProcessor ├──>┤ split ├>╊━>━ProcessedFile
┃ ╰───────────────╯ ╰───────╯ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Processor demands written Files and outputs processed Files.
The crux of its logic is in StepProcessor, which we'll see next.
As we know, a Step consumes a written File and outputs one or more written and/or processed Files. (It always outputs its input as processed.) Processor splits the output into two streams: processed Files are destined for the Ingester, and written Files are fed back (recursively) to the processor.
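That recursive feedback maps naturally onto a graph cycle. Here's a minimal Akka Streams sketch, reusing the hypothetical `Either`-based output from earlier (Overview's real graph differs in its details):

```scala
import akka.NotUsed
import akka.stream.{FlowShape, OverflowStrategy}
import akka.stream.scaladsl.{Flow, GraphDSL, MergePreferred, Partition}

object ProcessorSketch {
  trait WrittenFile   // illustrative stand-ins
  trait ProcessedFile
  type Out = Either[WrittenFile, ProcessedFile] // "ConvertOutputElement"

  def processor(stepProcessor: Flow[WrittenFile, Out, NotUsed]): Flow[WrittenFile, ProcessedFile, NotUsed] =
    Flow.fromGraph(GraphDSL.create() { implicit b =>
      import GraphDSL.Implicits._

      val merge = b.add(MergePreferred[WrittenFile](1)) // prefer recursed children over new input
      val steps = b.add(stepProcessor)
      val split = b.add(Partition[Out](2, {
        case Left(_)  => 0 // written child: feed back into the cycle
        case Right(_) => 1 // processed: on to the Ingester
      }))
      // The buffer keeps the cycle from deadlocking while a Step emits children.
      val buffer = b.add(
        Flow[Out].collect { case Left(w) => w }.buffer(256, OverflowStrategy.backpressure)
      )
      val processed = b.add(Flow[Out].collect { case Right(p) => p })

      merge.out ~> steps ~> split
      split.out(0) ~> buffer ~> merge.preferred
      split.out(1) ~> processed

      FlowShape(merge.in(0), processed.out)
    })
}
```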
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ╭──────╮ ┃
┃ ╭─>┤ Step ├>╮ ┃
┃ │ ╰──────╯ │ ┃
┃ ╭─────────╮│ ╭──────╮ │ ╭───────╮ ┃ WrittenFile or
WrittenFile ━>━╉─>┤ Decider ╞╡─>┤ Step ├>╞═╡ merge ├─>╊━>━ ProcessedFile
┃ ╰─────────╯│ ╰──────╯ │ ╰───────╯ ┃ ("ConvertOutputElement")
┃ │ ╭──────╮ │ ┃
┃ ╰─>┤ Step ├>╯ ┃
┃ ╰──────╯ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
StepProcessor demands written Files and outputs written or processed Files. (We call these outputs ConvertOutputElements in our code.)
Decider runs MIME-detection on the input to decide which Step applies. (It holds a hard-coded registry of Steps.) It outputs each file to the correct Step's input.
Step runs logic specific to its file type. For instance, "unzip" produces written-File children and then its input File, as processed.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ╭───────╮ ╭──────────╮ ╭───────╮ ┃ WrittenFile or
WrittenFile ━>━╉>┤ retry ├────────>─┤ HttpStep │ │ merge ├>╊━>━ ProcessedFile
┃ ╰──┬────╯ ╰────────╥─╯ ╰──╥────╯ ┃ ("ConvertOutputElement")
┃ ╎ ╭───────╨╮ ║ ┃
POST ━>━╉─StepOutputFragment─>┤ worker ├───>─╢ ┃
┃ ╎ ╰─┬─────╥╯ ║ ┃
┃ ╎ ╭─┴─────╨╮ ║ ┃
POST ━>━╉─StepOutputFragment─>┤ worker ├───>─╢ ┃
┃ ╎ ╰─┬─────╥╯ ║ ┃
┃ ╎ ╭─┴─────╨╮ ║ ┃
POST ━>━╉─StepOutputFragment─>┤ worker ├───>─╜ ┃
┃ ╎ ╰─┬──────╯ ┃
┃ ╰╌<╌╌╌(timeout)╌╌<╌╯ ┃
┃ WrittenFile ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
HttpStep maintains a rather complex state. This diagram is reductive.
Here are the details:
- A WrittenFile input is a unit of work: to convert one WrittenFile into one or many ConvertOutputElements.
- A worker is a simple HTTP client. It loops (see the sketch after this list):
  - Demand fresh work (a `WrittenFile`)
  - Download its `blob`
  - Derive and upload StepOutputFragments: atomic operations such as "create new child"; "update progress"; "write blob on current child"; and finally, "done" or "error".
- `HttpStep`'s Workers are server-side representations of workers' states. They make sure each worker only sends a single HTTP POST at a time. They stream "fragments" from the workers, executing the database writes they represent, and output `ConvertOutputElement`s in the process. `Worker` is the part of Overview that demands TCP messages from the workers; when Overview's pipeline is at capacity, `Worker` will stop reading from the HTTP client -- temporarily stalling it so the ingest phase can catch up.
- A `Worker` can time out. That always means the client has been killed (intentionally) and that another worker must start the same work from the beginning.
- The following are the only ways for a Step to stop work on a WrittenFile:
  - A worker sends a `done` fragment
  - A worker sends an `error` fragment
  - The user cancels processing of the WrittenFile, and then a worker sends an HTTP request for it.
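A worker's loop can therefore be very small. A sketch (the endpoint paths and task format here are assumptions, not Overview's actual worker protocol; only the shape of the loop follows the rules above):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object UnzipWorkerSketch {
  private val http = HttpClient.newHttpClient()
  private val base = "http://overview-server/Step/unzip" // hypothetical

  def main(args: Array[String]): Unit =
    while (true) {
      // 1. Demand fresh work. The server may stall this request until work exists.
      val task = http.send(
        HttpRequest.newBuilder(URI.create(s"$base/work")).GET().build(),
        HttpResponse.BodyHandlers.ofString()
      )
      val taskUrl = task.body.trim // hypothetical: a real worker would parse the task here
      // 2. Download the blob, unzip it, and POST StepOutputFragments in order:
      //    "0.json", "0.blob", "progress", ..., then finally "done" (or "error").
      val done = http.send(
        HttpRequest.newBuilder(URI.create(s"$taskUrl/done"))
          .POST(HttpRequest.BodyPublishers.noBody()).build(),
        HttpResponse.BodyHandlers.discarding()
      )
      // 3. A 404 on any of these requests means "canceled": drop the task and loop.
      if (done.statusCode() == 404) () // nothing to clean up in this sketch
    }
}
```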
In Overview, a user-triggered "cancel" is completely ordinary. The File-Tree state machine follows the same rules as always.
"Cancel" allows Overview to speed up import by using alternative Step logic:
"output a ProcessedFile
immediately with a processingError
of "canceled"
.
A user-triggered "cancel" is advisory: Overview may choose not to use the alternative logic. (The ambiguity makes it easy to avoid races.)
Overview checks for cancellation in two places:
- In `HttpStep`, after `retry`: Overview may skip assigning a worker to a WrittenFile, opting instead to complete the conversion itself.
- In `HttpStep`, when a `worker` sends any request concerning its work: Overview completes the conversion and responds with `404`, which the worker should interpret to mean: "stop working on this task."
A long-running worker task should send periodic `HEAD` requests or POST
`progress` fragments. If Overview replies with `404 Not Found`, the worker
should end processing of its task. (This pattern doubles as a heartbeat: it
defers Overview's worker timeout.)
While converting "root" WrittenFiles to Overview documents, the user expects to know: how much work is done, and how much remains?
The answer is a number between `0.0` and `1.0`. It's always an estimate.
Each Step has a hard-coded `progressWeight`.
The "generate thumbnails" Step (along with any other terminal Step) has
progressWeight=1.0
, meaning: "when this Step is done, processing of the
input WrittenFile is finished."
The "unzip" Step is more pessimistic. It has progressWeight=0.1
, meaning:
"when this Step is done, 90 percent of processing is still unfinished."
While a Step runs, it generates `progress` fragments, from `0.0` to `1.0`.
Those progress fragments determine how much of the input WrittenFile has
been processed. When the WrittenFile is output as a ProcessedFile, Overview
considers its `progress` to be `1.0` -- multiplied by `progressWeight`.

The Step's output WrittenFiles will be processed in subsequent Steps. They
together account for `1.0 - progressWeight` of the input file's `progress`.
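A toy example of how these weights combine (assumed numbers, purely to illustrate the arithmetic):

```scala
// Suppose "unzip" (progressWeight = 0.1) has finished on Archive.zip, producing two
// children that are each handled by a terminal Step (progressWeight = 1.0).
val unzipWeight   = 0.1
val childShare    = Seq(0.5, 0.5)  // how the remaining 0.9 is split between the children
val childProgress = Seq(1.0, 0.25) // first child done; second child a quarter done

val overall =
  unzipWeight * 1.0 + // unzip's own, completed slice
  childShare.zip(childProgress).map { case (share, p) => share * (1.0 - unzipWeight) * p }.sum
// overall == 0.1 + 0.45 * 1.0 + 0.45 * 0.25 == 0.6625
```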
Overview generates a WrittenFile when it receives an `n.json` fragment (where
`n>0`) or `done`; so Steps should be sure to send `progress` fragments
immediately before `n.json` fragments. That way, Overview will know how much
of the remaining progress to assign to that WrittenFile.
In this way, each Step outputs `progress` fragments from `0.0` to `1.0`. Each
Step contributes a slice of progress itself and allocates the rest of the
progress to its children (recursively). Overview sums up the progress of each
slice to determine the progress of the entire import job. (Optimization:
Overview maintains these progress slices in a `RangeSet` instead of summing
from scratch each time it reports progress.)
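(A sketch of that optimization, assuming a Guava-style `RangeSet`; Overview's actual bookkeeping may differ:)

```scala
import com.google.common.collect.{Range, TreeRangeSet}
import scala.jdk.CollectionConverters._

// Each completed slice of overall progress [start, end) is added as it arrives;
// reported progress is the total measure of the set, never recomputed from scratch.
val done = TreeRangeSet.create[java.lang.Double]()
done.add(Range.closedOpen(Double.box(0.0), Double.box(0.1)))  // e.g. unzip's own slice
done.add(Range.closedOpen(Double.box(0.1), Double.box(0.55))) // e.g. the first child's slice

val progress = done.asRanges().asScala.toList
  .map(r => r.upperEndpoint() - r.lowerEndpoint())
  .sum // 0.55
```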