Add iceberg catalog read support #98

Open
Tmonster wants to merge 28 commits into main
Conversation

Tmonster

Piggybacking off of #95.

This PR adds support for attaching an Iceberg catalog and reading Iceberg tables from a data lake. It specifically supports the following SQL commands:

Create an ICEBERG secret to access your catalog
CREATE SECRET (
    TYPE ICEBERG,
    CLIENT_ID '<your-catalog-client-id>',
    CLIENT_SECRET '<your-catalog-client-secret>',
    ENDPOINT 'https://<your-catalog-host>/api/catalog',
    AWS_REGION 'us-east-1'
);

Attach your catalog

ATTACH 'my_catalog' AS my_datalake (TYPE ICEBERG);
-- Select your iceberg tables in the datalake via the catalog
SHOW ALL TABLES;
SELECT * FROM my_datalake.schema.table;

There are also some tests to make sure everything works. There is also a double check on the config, since it's possible some catalogs won't return one; in that case the user is required to have a second key.

@samansmink (Collaborator) left a comment

Looks great! I've added a few comments

CMakeLists.txt Outdated
@@ -4,7 +4,7 @@ cmake_minimum_required(VERSION 2.8.12)
set(TARGET_NAME iceberg)
project(${TARGET_NAME})

-set(CMAKE_CXX_STANDARD 14)
+set(CMAKE_CXX_STANDARD 17)
Collaborator

do we really need to bump this? I would prefer not to, to avoid any CI issues

}
return result;

// throw std::runtime_error("No AWS credentials found for table");
Collaborator

nit: can be removed

}
}

// ICConnection &ICTransaction::GetConnection() {
Collaborator

nit: let's just remove all commented-out code. This has been copied from the uc_catalog extension, which has a prototype-quality codebase.

require httpfs

statement ok
CREATE SECRET (
Collaborator

I think this is fine for a first PR, but we should think about the UX a little here: right now there is only one ICEBERG secret, which is then fetched when calling ATTACH '' AS my_datalake (TYPE ICEBERG);

However it's probably desirable to be able to do something like:

CREATE SECRET irc_secret_1 (TYPE ICEBERG, ENDPOINT 'http://127.0.0.1:8181', BEARER_TOKEN 'bla');
CREATE SECRET irc_secret_2 (TYPE ICEBERG, ENDPOINT 'http://some.other.thing.com', BEARER_TOKEN 'bla');
ATTACH 'irc_secret_1' AS irc1 (TYPE ICEBERG);
ATTACH 'irc_secret_2' AS irc2 (TYPE ICEBERG);

or perhaps

CREATE SECRET irc_secret_1 (TYPE ICEBERG, SCOPE 'http://127.0.0.1:8181', BEARER_TOKEN 'bla');
CREATE SECRET irc_secret_2 (TYPE ICEBERG, SCOPE 'http://some.other.thing.com', BEARER_TOKEN 'bla');
ATTACH 'http://127.0.0.1:8181' AS irc1 (TYPE ICEBERG);
ATTACH 'http://some.other.thing.com' AS irc2 (TYPE ICEBERG);

We can have some discussion on what is nicest; these are currently also open questions for other catalog extensions such as postgres and mysql, I think.

}

auto &ic_catalog = catalog.Cast<ICCatalog>();
// TODO: handle out-of-order columns using position property
Collaborator

This TODO is copy-pasted from uc_catalog: we should figure out whether it is also relevant to iceberg. If so it can stay, otherwise we should remove it to avoid confusion.

auto &get = (LogicalGet &)*op;
bind_data = std::move(get.bind_data);

return parquet_scan_function;
Collaborator

I think this doesn't work for tables with deletes and could return incorrect results. Can we add a test for this?

If it does indeed fail, we should look into how to solve it, or at least throw an error.
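
A sketch of what such a test could look like, in the same sqllogictest format as the other tests in this PR (the schema, table name, and expected count are placeholders; the secret and endpoint setup are elided):

require httpfs

# Secret creation and catalog attach as in the other catalog tests

statement ok
ATTACH 'my_catalog' AS my_datalake (TYPE ICEBERG);

# table_with_deletes is a hypothetical table that has delete files;
# the expected value should be the row count after deletes are applied
query I
SELECT count(*) FROM my_datalake.my_schema.table_with_deletes;
----
<row count after deletes>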

auto table_ref = iceberg_scan_function.bind_replace(context, bind_input);

// 1) Create a Binder and bind the parser-level TableRef -> BoundTableRef
auto binder = Binder::CreateBinder(context);
Collaborator

nit: formatting

throw NotImplementedException("BindUpdateConstraints");
}

struct MyIcebergFunctionData : public FunctionData {
Collaborator

This is dead code, I think?

CMakeLists.txt Outdated
src/common/utils.cpp
src/common/schema.cpp
src/common/iceberg.cpp
src/iceberg_functions/iceberg_snapshots.cpp
src/iceberg_functions/iceberg_scan.cpp
-src/iceberg_functions/iceberg_metadata.cpp)
+src/iceberg_functions/iceberg_metadata.cpp
+src/storage/ic_catalog.cpp
Collaborator

I feel like this structure could be a bit clearer

What do you think about renaming the IC prefix to IRC (this code is meant for the Iceberg REST Catalog after all) and moving everything to:
src/storage/irc/irc_catalog.cpp?

This would allow us to later create another catalog which can handle static iceberg tables, similar to duckdb/duckdb-delta#110, which could then live in src/storage/static/static_catalog.cpp

Author

Do you want all of these files to be icr_*, or just ic_catalog.cpp to be icr_catalog.cpp?

Collaborator

I think irc_* (not icr_ 😄) everywhere that is related to the Iceberg Rest Catalog makes the most sense. We can also look to using C++ namespacing for the iceberg extension at some point, but I think just prefixing everything with IRC is fine for now.

@sean-lynch commented Feb 4, 2025

Very excited future user here! Chiming in on a review like this may be frowned upon, so please ignore if it's a distraction.

One minor request: make the ICEBERG secret agnostic of the object storage provider. Rather than accepting AWS_REGION here, can that be left entirely to the existing httpfs secret? Taking some inspiration from @samansmink's feedback, maybe a future addition lets you specify which object storage provider secret to use, but at least leaving AWS_REGION out here avoids mixing up that configuration.
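
For illustration, the split might look something like this (the ICEBERG parameters are just the ones from this PR's description minus AWS_REGION, the S3 secret uses the existing httpfs secret syntax, and all values are placeholders):

-- Catalog access only
CREATE SECRET (
    TYPE ICEBERG,
    CLIENT_ID '<your-catalog-client-id>',
    CLIENT_SECRET '<your-catalog-client-secret>',
    ENDPOINT 'https://<your-catalog-host>/api/catalog'
);

-- Object storage credentials stay in a regular httpfs/S3 secret
CREATE SECRET (
    TYPE S3,
    KEY_ID '<your-aws-key-id>',
    SECRET '<your-aws-secret-key>',
    REGION 'us-east-1'
);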

table_data->data_source_format);
}

if (table_data->storage_location.find("file://") != 0) {
Author

The hardcoded file:// check is maybe unnecessary? I feel like we can omit the condition and just try to pull the credentials.

// Inject secret into secret manager scoped to this path
CreateSecretInfo info(OnCreateConflict::ERROR_ON_CONFLICT, SecretPersistType::TEMPORARY);
info.name = "__internal_ic_" + table_data->table_id;
info.type = "s3";
Author

Also, based on the comment here, we should see if the spec returns information on the type of the secret, so we don't have to hard-code this kind of information.
