Datapusher COPY mode #221
Conversation
Currently working on an implementation where a typical resource runs to millions of rows. When I started, xloader had not yet been migrated to CKAN 2.9, so I used datapusher and tried to improve it with HA capabilities, which were merged upstream. Even with multiple workers it was still too slow for my use case, so I investigated switching to xloader. However, xloader doesn't do type guessing, which my client requires; hence this PR. I borrowed heavily from xloader. :) It is very fast now, even a tad faster than xloader: I squeezed extra performance out of COPY by using the COPY FREEZE option, which avoids writing WAL during the COPY, and by running a VACUUM ANALYZE immediately afterwards. @amercader @davidread I'd appreciate your review/feedback.
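For context, here is a minimal sketch of the COPY FREEZE plus VACUUM ANALYZE approach described above, using psycopg2. This is not the PR's actual code: the table name, column list, connection URL variable and CSV path are placeholders.

```python
# Sketch only: COPY FREEZE requires the target table to be created or
# truncated in the same transaction, which is what lets Postgres skip
# most WAL writes; VACUUM ANALYZE then refreshes visibility/statistics.
import psycopg2

conn = psycopg2.connect(copy_write_engine_url)  # placeholder DSN/URL
try:
    with conn.cursor() as cur:
        # Truncate in the same transaction so FREEZE is allowed.
        cur.execute('TRUNCATE TABLE "my_resource_table"')
        with open(csv_path) as f:  # placeholder path to the dumped CSV
            cur.copy_expert(
                'COPY "my_resource_table" ("col_a", "col_b") '
                "FROM STDIN WITH (FORMAT CSV, FREEZE TRUE, HEADER TRUE)",
                f,
            )
    conn.commit()

    # VACUUM cannot run inside a transaction block, so use autocommit.
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute('VACUUM ANALYZE "my_resource_table"')
finally:
    conn.close()
```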
PEP8 reformatting; Improve Datapusher non-COPY messages; remove unnecessary TRYs in COPY mode
Took the opportunity to fully fix #219. Progress messages are now emitted only every 20 seconds and include the row count and elapsed time. Here are some benchmarks for the same file, a sample of NYC 311 calls with 100k rows, on a VirtualBox VM running Postgres 11 and CKAN 2.9.1 on an 8 GB Ubuntu 16.04 instance. Existing datapusher process: 288.17 seconds
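A rough sketch of the throttled progress reporting described above; the helper name, the 20-second interval constant and the logger are assumptions for illustration, not the merged code.

```python
# Sketch: log progress at most once every REPORT_INTERVAL seconds,
# including the running row count and elapsed time.
import time

REPORT_INTERVAL = 20  # seconds, per the description above


def push_rows(rows, logger):
    start = time.perf_counter()
    last_report = start
    count = 0
    for row in rows:
        # ... push/COPY the row here ...
        count += 1
        now = time.perf_counter()
        if now - last_report >= REPORT_INTERVAL:
            logger.info("Copied %s rows in %.1f seconds...", count, now - start)
            last_report = now
    logger.info("Done: %s rows in %.1f seconds", count, time.perf_counter() - start)
```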
When COPY fails, include the error in the info message.
Going back to the old behavior; I misunderstood the meaning of 'url_type'.
If COPY_MODE_SIZE is zero, or the file size is less than the COPY_MODE_SIZE threshold in bytes, push through the Datastore API. Otherwise, use COPY if we have a COPY_WRITE_ENGINE_URL. Also corrected minor typos and streamlined exception handling.
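As a sketch of the gate described in that commit message (the config keys are the ones named in this PR; the helper name and config object are placeholders):

```python
# Sketch of the COPY-vs-Datastore-API decision described above.
def use_copy_mode(file_size, config):
    copy_mode_size = int(config.get('COPY_MODE_SIZE', 0))
    write_engine_url = config.get('COPY_WRITE_ENGINE_URL')
    if copy_mode_size == 0 or file_size < copy_mode_size:
        return False               # push through the Datastore API as before
    return bool(write_engine_url)  # use Postgres COPY only if a URL is set
```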
with open(tmp.name, 'rb') as f:
    header_line = f.readline()
try:
    sniffer = csv.Sniffer()
If I'm understanding this correctly, you're pushing the file to the database as raw CSV, relying on the database's native interpretation of the data types/nulls/etc. for conversion. This will only work for files that came in as CSV, not ones uploaded as .xls/.ods and interpreted/converted by Messytables. The only criterion for choosing legacy vs. COPY is file size, so some previously handled files won't be handled anymore.
@EricSoroos, this is an old PR, and I have since created a fork of Datapusher (https://github.com/dathere/datapusher-plus) that also supports xls/ods files, always uses Postgres COPY, drops the legacy datapusher code path, and replaces messytables with qsv.
Would be interested in your feedback.
Closing this now that there is https://github.com/dathere/datapusher-plus
Resolves #220.
When a queued file's size is > COPY_MODE_SIZE (bytes) and a COPY_ENGINE_WRITE_URL is specified, datapusher uses Postgres COPY, similar to xloader; otherwise, it uses the existing datapusher logic. The main difference is that we still use messytables to guess the column data types, which xloader currently doesn't do (see the type-guessing sketch below).
Also fixes #219, as the Datastore message is now more informative.
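To illustrate the "messytables guesses the types, Postgres does the load" split, here is a rough sketch using the messytables API that datapusher already depends on. The type map, table name and CSV path are illustrative assumptions, not this PR's exact code.

```python
# Sketch: guess column types from a sample with messytables, map them to
# Postgres types for the CREATE TABLE that precedes the COPY.
from messytables import (
    CSVTableSet, headers_guess, headers_processor, offset_processor, type_guess
)

TYPE_MAP = {  # assumed mapping, loosely mirroring datapusher's defaults
    'Integer': 'numeric',
    'Decimal': 'numeric',
    'DateUtil': 'timestamp',
    'String': 'text',
}

with open(csv_path, 'rb') as f:  # placeholder path to the downloaded file
    row_set = CSVTableSet(f).tables[0]
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))
    guessed = type_guess(row_set.sample, strict=True)

columns = ', '.join(
    '"{}" {}'.format(
        header,
        TYPE_MAP.get(type(guess).__name__.replace('Type', ''), 'text'),
    )
    for header, guess in zip(headers, guessed)
)
create_sql = 'CREATE TABLE "my_resource_table" ({})'.format(columns)
# ...then create the table and COPY the raw CSV into it, as sketched earlier.
```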