Append to existing parquet file? #13
The nature of the Parquet format is such that you can't really update an existing file. (Wellllll, maybe it's technically possible to do surgery and add new row groups and then update the footer. It'd require using relatively low-level APIs in the parquet library, though.) We could provide an experience that mimics this, but it'd be recreating the entire file each time. Would some other options work?

1. Concatenate the CSVs yourself, then convert the combined file:

```sh
(cat file1.csv; tail -n+2 file2.csv; tail -n+2 file3.csv) > tmp.csv
csv2parquet tmp.csv
rm tmp.csv
```

2. Use process substitution:

```sh
csv2parquet <(cat file1.csv; tail -n+2 file2.csv; tail -n+2 file3.csv)
```

Currently, this will fail.

3. Teach csv2parquet to accept multiple input files:

```sh
csv2parquet file1.csv file2.csv file3.csv
```

My gut feeling is that (3) would be a nice addition, and a relatively small change to the existing code. If the concern is that you want to append in order to avoid doing the computation work that compresses/optimizes the parquet file, none of these is suitable, though.
Hi,

For now I had come up with some bash for loops to walk my nested dir structure and cat all the *.csv.gz files into one large .csv.gz. I was then gunzip'ing this and calling csv2parquet, so pretty much a variant of (1). Unfortunately, at this point I discovered that occasionally the rows in the CSVs have extra commas, hence yesterday's pull request to help me discover which lines are broken!

So if you have lots of CSV files then (3) is good, although how would this work with xargs, for example? However, there is an argument for (2), e.g. reading from stdin, which would allow you to pipe the output of zcat directly into csv2parquet without needing to decompress to disk first.

Thanks
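For reference, a rough sketch of that combine-then-convert workflow might look like the following (the data/ directory and file names are hypothetical, and it assumes repeated header rows are dealt with separately):

```sh
# Hypothetical sketch of the combine-then-convert approach described above.
# gzip members can be concatenated directly, so catting the .csv.gz files
# yields one valid archive; note that each input's header row is repeated.
find data/ -name '*.csv.gz' -print0 | xargs -0 cat > combined.csv.gz
gunzip combined.csv.gz          # leaves combined.csv on disk
csv2parquet combined.csv        # convert the single large CSV
```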
Wouldn't it just work? :) e.g.:
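Presumably something along these lines, sketched here with hypothetical file names and assuming (3) is implemented:

```sh
# With multi-file support (option 3), xargs would just expand the file list
# into arguments for csv2parquet:
find data/ -name '*.csv' -print0 | xargs -0 csv2parquet

# which is roughly the same as letting the shell expand a glob:
csv2parquet data/*.csv
```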
The only case where it'd fail is if your list of files was so big that your shell environment space was exhausted. Is that what you're getting at?
If this is common (and it probably is), I'd support teaching csv2parquet how to sniff the file and do the gzip decompression on the fly. Process substitution has some subtle warts - it's not supported across all shells, and failures in the underlying process don't bubble up, e.g.:
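For instance, a minimal bash demonstration of the swallowed failure:

```sh
# The command inside <(...) fails, but the outer command never notices:
# it just sees an empty stream, and $? reflects only the outer command.
wc -l <(cat no-such-file.csv)
echo $?    # prints 0, even though cat exited non-zero inside <(...)
```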
I feel like this is probably done better without loading the files into Python. You can do it from the shell: suppose you have file1.csv, ..., file3.csv and you'd like to concatenate them so that they are stacked in order 1, 2, 3.
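For example, a minimal sketch (assuming the files share the same columns and each starts with a header row):

```sh
# Keep the header from file1.csv and strip it from the rest, stacking the
# rows in order, then convert the combined file:
{ cat file1.csv; tail -n +2 file2.csv; tail -n +2 file3.csv; } > stacked.csv
csv2parquet stacked.csv
```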
Hi,
Is it possible to run csv2parquet in a way that means it appends to an existing parquet file? I have a large number of CSV files with a fixed schema that I'd like to convert into a single parquet file.
Thanks