Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First release #31

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

First release #31

wants to merge 21 commits into from

Conversation

drin
Copy link

@drin drin commented Nov 16, 2024

This would be a first release of my fork of the arrow extension. I'm not sure what to version it, so I'll think about the exact logistics over the next week.

This does not change much of the upstream arrow extension for duckdb, but I would like to add quite a bit of convenience code as well as close the gap with some substrait functionality. Specifically, I would like to eventually:

  1. Handle substrait ReadRel operators backed by arrow formatted files (IPC file and IPC stream).
  2. Add convenience code that makes it much easier to get Arrow data into duckdb and out of duckdb (via C++ bindings).
  3. Perhaps add bindings for flight to make it easy to connect to a remote duckdb server.

This release would include initial implementations to support (1) and refactors code (and adds convenience functions) towards satisfying (2). For (3), I have code in another repository that I will try to "upstream" here in the future, depending on if I can think of a way for it to feel cohesive (maybe I'll look at the http extension or something of the sort for guidance).

Also, recent updates to the substrait extension replace TableFunction definitions that return QueryResults to TableFunciton definitions that return TableRefs. So, I would like to look into that as well.

Anyways, this release is a satisfactory first step in these directions and should make for a good base on which to make future PRs.

drin added 21 commits August 16, 2024 11:00
this way I can get my bearings and make necessary changes. I will figure
out what to upstream in the future, if anything
Commit ID for Arrow was updated to Arrow 17.0.0 from Arrow 9.0.0 (wow,
old)
To some degree, this refactoring is mostly for personal style and
readability. Functionality was not changed anywhere, nor were there
algorithm changes. Some code was re-organized to make it clearer what
variables actually needed to be kept around and some was re-organized to
reuse existing functions. Also, some comments were added to make some
decisions more understandable.
most simplifications are removing unnecessary code (the arrow stream
wrapper calls release in destructor).

also changed `SchemaFromIPCBuffer` to take a buffer as an argument
instead of a uintptr_t because it can be called at bind time and does
not need to go through the C data interface.
The issue was that the `ipc_buffer` attribute wasn't keeping the correct
reference alive. So instead of passing a pointer to the super-class
constructor, a nullptr is passed and the `stream_factory_ptr` is
explicitly set to be sure it points to the `ArrowIPCStreamBuffer` that
is kept alive
the one non-style change is to pass an `ArrowIPCBuffer` to
`SchemaFromIPCBuffer` instead of a `uintptr_t`. Otherwise, I moved
`ArrowIPCTableFunction::GetFunction()` to the end (since it depends on
other functions and nothing depends on it) and I made some spacing
changes

NOTE: The commit removing the duckdb submodule was accidentally skipped
during a rebase, so I am adding it to this commit.
This is so that it's easier for duckdb to support using arrow files for
ReadRel ops
Also added `arrow_readers.cpp`
Segfault issues were occuring because the arrow buffers were being
released too early. Also added a map to keep track of files to avoid
opening and re-reading a file many times.
This is due to ubuntu 22.04 having cmake 3.22.1, but changed to match
the minimum version specified by duckdb core
This is due to the changes from upstream regarding shared pointers and
a desire to make it clearer whether a duckdb::shared_ptr is being used
or a std::shared_ptr
"batch" related function and type names were renamed to "partition"
related function and type names. This was missed because of misalignment
on rebase (due to my style changes).
tired of seeing warnings about cmake <= 3.10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant