Datapusher COPY mode #221
Conversation
Currently working on an implementation where a typical resource runs to millions of rows. When I started, xloader had not yet been migrated to CKAN 2.9, so I used datapusher and tried to improve it with HA capabilities, which were merged upstream. Even with multiple workers it was still too slow for my use case, so I investigated switching to xloader. However, xloader doesn't do type guessing, which my client requires; hence this PR. I borrowed heavily from xloader. :) It is very fast now, even a tad faster than xloader: I squeezed extra performance out of COPY by using the COPY FREEZE option, which avoids writing WAL during the COPY, and by running a VACUUM ANALYZE immediately afterwards. @amercader @davidread I'd appreciate your review/feedback.
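For context, here is a minimal sketch of the COPY FREEZE plus VACUUM ANALYZE approach described above, using psycopg2. This is not the PR's actual code: the table name, column list, connection URL variable and CSV path are placeholders.

```python
# Sketch only: COPY FREEZE requires the target table to be created or
# truncated in the same transaction, which is what lets Postgres skip
# most WAL writes; VACUUM ANALYZE then refreshes visibility/statistics.
import psycopg2

conn = psycopg2.connect(copy_write_engine_url)  # placeholder DSN/URL
try:
    with conn.cursor() as cur:
        # Truncate in the same transaction so FREEZE is allowed.
        cur.execute('TRUNCATE TABLE "my_resource_table"')
        with open(csv_path) as f:  # placeholder path to the dumped CSV
            cur.copy_expert(
                'COPY "my_resource_table" ("col_a", "col_b") '
                "FROM STDIN WITH (FORMAT CSV, FREEZE TRUE, HEADER TRUE)",
                f,
            )
    conn.commit()

    # VACUUM cannot run inside a transaction block, so use autocommit.
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute('VACUUM ANALYZE "my_resource_table"')
finally:
    conn.close()
```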
PEP8 reformatting; Improve Datapusher non-COPY messages; remove unnecessary TRYs in COPY mode
Took the opportunity to fully fix #219. Progress messages are now emitted only every 20 seconds and include the row count and elapsed time. Here are some benchmarks for the same file, a sample of NYC 311 calls with 100k rows, on a VirtualBox VM running Postgres 11 and CKAN 2.9.1 on an 8 GB Ubuntu 16.04 instance. Existing datapusher process: 288.17 seconds
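A rough sketch of the throttled progress reporting described above; the helper name, the 20-second interval constant and the logger are assumptions for illustration, not the merged code.

```python
# Sketch: log progress at most once every REPORT_INTERVAL seconds,
# including the running row count and elapsed time.
import time

REPORT_INTERVAL = 20  # seconds, per the description above


def push_rows(rows, logger):
    start = time.perf_counter()
    last_report = start
    count = 0
    for row in rows:
        # ... push/COPY the row here ...
        count += 1
        now = time.perf_counter()
        if now - last_report >= REPORT_INTERVAL:
            logger.info("Copied %s rows in %.1f seconds...", count, now - start)
            last_report = now
    logger.info("Done: %s rows in %.1f seconds", count, time.perf_counter() - start)
```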
When COPY fails, include the error in the info message.
Going back to the old behavior; I misunderstood the meaning of 'url_type'.
If COPY_MODE_SIZE is zero, or the file size is less than the COPY_MODE_SIZE threshold in bytes, push through the Datastore API. Otherwise, use COPY if we have a COPY_WRITE_ENGINE_URL. Also corrected minor typos and streamlined exception handling.
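As a sketch of the gate described in that commit message (the config keys are the ones named in this PR; the helper name and config object are placeholders):

```python
# Sketch of the COPY-vs-Datastore-API decision described above.
def use_copy_mode(file_size, config):
    copy_mode_size = int(config.get('COPY_MODE_SIZE', 0))
    write_engine_url = config.get('COPY_WRITE_ENGINE_URL')
    if copy_mode_size == 0 or file_size < copy_mode_size:
        return False               # push through the Datastore API as before
    return bool(write_engine_url)  # use Postgres COPY only if a URL is set
```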
with open(tmp.name, 'rb') as f:
    header_line = f.readline()
try:
    sniffer = csv.Sniffer()
If I'm understanding this correctly, you're pushing the file to the database as raw CSV, relying on the database's native interpretation of the data types/nulls/etc. for conversion. This will only work for files that came in as CSV, not ones uploaded as .xls/.ods and interpreted/converted by Messytables. The only criterion for choosing legacy vs. COPY is file size, so some previously handled files won't be handled anymore.
@EricSoroos, this is an old PR, and I have since created a fork of Datapusher (https://github.com/dathere/datapusher-plus) that also supports xls/ods files, always uses Postgres COPY, drops the legacy datapusher code path, and replaces messytables with qsv.
Would be interested in your feedback.
Closing this now that there is https://github.com/dathere/datapusher-plus
Resolves #220.
When a queued file's size is > COPY_MODE_SIZE (bytes) and a COPY_ENGINE_WRITE_URL is specified, datapusher uses Postgres COPY, similar to xloader; otherwise, it uses the existing datapusher logic. The main difference is that we still use messytables to guess the column data types, which xloader currently doesn't do (see the type-guessing sketch below).
Also fixes #219, as the Datastore message is now more informative.
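To illustrate the "messytables guesses the types, Postgres does the load" split, here is a rough sketch using the messytables API that datapusher already depends on. The type map, table name and CSV path are illustrative assumptions, not this PR's exact code.

```python
# Sketch: guess column types from a sample with messytables, map them to
# Postgres types for the CREATE TABLE that precedes the COPY.
from messytables import (
    CSVTableSet, headers_guess, headers_processor, offset_processor, type_guess
)

TYPE_MAP = {  # assumed mapping, loosely mirroring datapusher's defaults
    'Integer': 'numeric',
    'Decimal': 'numeric',
    'DateUtil': 'timestamp',
    'String': 'text',
}

with open(csv_path, 'rb') as f:  # placeholder path to the downloaded file
    row_set = CSVTableSet(f).tables[0]
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))
    guessed = type_guess(row_set.sample, strict=True)

columns = ', '.join(
    '"{}" {}'.format(
        header,
        TYPE_MAP.get(type(guess).__name__.replace('Type', ''), 'text'),
    )
    for header, guess in zip(headers, guessed)
)
create_sql = 'CREATE TABLE "my_resource_table" ({})'.format(columns)
# ...then create the table and COPY the raw CSV into it, as sketched earlier.
```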