Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic CMake support for the iceberg library #3

Merged
merged 6 commits into from
Dec 5, 2024
Merged

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Nov 30, 2024

Added basic CMake files for a toy iceberg library. This is a start point for following PoC implementations. I have assumed the minimum CMake version is 3.20 and C++ version is C++20. They are subject to change in the future.

@wgtmac
Copy link
Member Author

wgtmac commented Nov 30, 2024

This is the first effort to make sure the project can build successfully before providing concrete PoC implementations. Please take a look and provide your feedback. Thanks! @Fokko @Xuanwo @zeroshade @gaborkaszab @raulcd

@Fokko
Copy link
Contributor

Fokko commented Dec 1, 2024

Thanks for kicking this off @wgtmac 🙌

I don't have a strong opinion on this, mostly due to my lack of C++ knowledge. I do think it would be good to add a small section to the README.md with both instructions and the lower bounds of CMake and the C++ version.

@@ -0,0 +1,22 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about the purpose of the 'include' folder. IS this for storing all the .h files separately from the .cc files? I find it easier to navigate if the headers are next to the cc files tbh.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The include folder is for public header files that will be exported and installed. I'm also fine to move them next to the cc files. In that case, I think it would be better to add _internal.h suffix to non-exported header files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a second thought: I think in the Java implementation there is a purpose of having an api/ module: Whatever is there isn't expected to change out of the Blue. Once you have anything in the spec you have to deprecate first in one of the upcoming releases and you can drop only in the following major release.

I think we can follow this as well: Have an api/ folder for the headers that we will consider the part of the lib the users should interact with. In Java these are interfaces if I'm not mistaken and then the actual implementations are part of the other modules like core/ etc. For instance Table is in the api/ and is a pure interface while inherent classes like BaseTable are in core/
I find this something we should also follow.

src/demo.cc Outdated
@@ -0,0 +1,26 @@
/*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd structure the code a bit differently. How I imagined the structure of the c++ lib is something like this:

iceberg-cpp/
      - api/
          - src/
          - test/
...
      - core/
            - src/
            - test/
...
      - example/
            - src/
            - test/

and so on. (I hope the indentation is displayed as I intended to :) )

So in general I think there should be separate libs for each of the modules of the implementation and each of them should have their own src/ test/ folders. As I see now these folders are provided on the top level of the directory structure.
If we want to have an include/ folder for the headers separately, I think we should also have them on the module-level and not at the top level.

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need more libraries other than libiceberg (which is the core library in your structure)? In my mind, libiceberg provides the building blocks for types, io, exceptions, metadata, transformations, expressions, table, catalog, reader/writer, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion we'd need at leas api/ and core/ (I'd prefer core/ over libiceberg just to be in line with the Java impl. No point of reinventing the wheel). Also I'd imagine puffin could also get a separate lib. And then if there is a need each engine could put their own connector into their own folders too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand it correctly. Do you mean that the api folder stores all public headers? Then each folder (e.g. core, puffin, example, etc) uses its own subset of headers from the api folder to create library or executable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'm not in favor of splitting this repo into multiple libraries, which is an overkill. Originally I thought that the api and core folders from the apache/iceberg repo are all that we need. However, it seems that puffin is a good example which does not need to depend on other modules. Therefore I'm open to discuss.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! It is still unclear what it will look like eventually. I'm adding third-party libraries (arrow, avro and simdjson and their indirect dependencies) now but let me finish this patch first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is api in this context? is this the public file headers to include?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, just read the other comment. No strong opinion on calling it include or api, I've seen it call it both on different projects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this?

iceberg-cpp/
├── api/
│   ├── table.h
│   └── puffin.h
├── example/
│   └── demo.cc
├── src/
│   ├── core/
│   │   ├── base_table.h
│   │   └── base_table.cc
│   └── puffin/
│       └── puffin.cc
└── test/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems great!

src/demo.cc Outdated
@@ -0,0 +1,26 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I don't get now is the include/ folder. Does that meant to hold the header files that clients of this lib could include? Isn't it something that is called 'API' in the Java implementation? If yes, could we rename it to 'api'? If no, then I think there is no point of separating thos headers from the cc files, they could live next to each other.

Could you please help me understand this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments above. I think this relates to how many libraries we are going to add.

CMakeLists.txt Outdated
# specific language governing permissions and limitations
# under the License.

cmake_minimum_required(VERSION 3.20)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason for CMake 3.20? In Arrow C++ there was a discussion lately to try and move to CMake 3.25 to allow using some of the newer feature like:

CMake is easily installable but the default versions on newer distributuions are:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I just chose a random one. CMake 3.25 seems to be a good option.

@wgtmac
Copy link
Member Author

wgtmac commented Dec 3, 2024

I think I have addressed all comments. The BuildUtils.cmake is borrowed from Apache Arrow (a comment has been added to reflect this). Let me know what do you think. Thanks! @Fokko @gaborkaszab @raulcd

Copy link
Collaborator

@gaborkaszab gaborkaszab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working of this @wgtmac !

CMakeLists.txt Show resolved Hide resolved
set(ICEBERG_INSTALL_DOCDIR "share/doc/${PROJECT_NAME}")

add_subdirectory(api)
add_subdirectory(src)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we add example too as a subdirectory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it by purpose to verify that the libraries can be found successfully from the installed CMake config files.

# under the License.

install(
DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/iceberg"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there is an iceberg/ subfolder within the api/ folder. I don't see the reason to have that extra layer. Do we expect other subfolders within api/ ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wisdom from Apache Arrow, Parquet and ORC libraries where their headers should be included via #include "arrow/xxx.h", #include "parquet/xxx.h" and #include "orc/xxx.h". Otherwise, it is unclear which library it comes from when #include "api/xxx.h" or #include "xxx.h" is used.

include(FindPackageHandleStandardArgs)
include(GNUInstallDirs)

set(ICEBERG_API_DIR "${CMAKE_SOURCE_DIR}/api")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which config controls this but I built the project locally and I see that there are generated files all over the place next to the source files. I think it would be way cleaner to put these files into a build/ or such folder.
I'm open to where this build/ folder should live. The Java code is structured a bit differently as I see, they have for instance:

api/
    - build/
    - src/
core/
    - build/
    - src/
...

We decided to follow a bit different structure, so following that logic we could have the build/ folder on the top level next to test/, src/ and such.
And then build/ could still have subfolders for the different modules.
What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do anything special with the in-source build. Usually I build it (or other CMake projects) with following steps:

cd iceberg-cpp
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/iceberg
make install

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried and LGTM

README.md Show resolved Hide resolved

#pragma once

#include "iceberg/table.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we can set it up, but would it make sense to have an include like "api/table.h"? It would be more intuitive to see that this includes something from the API module. In another comment I also expressed that the api/iceberg/ subfolders seems confusing for me.

Copy link
Member Author

@wgtmac wgtmac Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A header file is in the api folder if and only if it follows the pattern #include "iceberg/xxx.h". These files are included by iceberg-cpp project as well as downstream projects, therefore I think it is super clear that they are APIs. We do not export other header files in core, puffin or other folders and they can be included directly by relative paths to the src folder (e.g. #include "core/base_table.h"). Sounds good?

# Borrowed the file from Apache Arrow:
# https://github.com/apache/arrow/blob/main/cpp/cmake_modules/BuildUtils.cmake

function(iceberg_install_cmake_package PACKAGE_NAME EXPORT_NAME)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now realise that my CMake skills aren't as great TBH :)
This seems a lot of code, so my question is do we need all of this to kickstart the project? Additionally, could you help me understand the purpose of all these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are boilerplate code to add libraries with essential arguments for includes, linking libraries, and exporting install targets. I copied them from Apache Arrow with simplification. They are super useful if we will add more libraries in the future. In the next step, I will add third-party libraries and these lines will be used so I don't want to delete them for now and add them back in the next patch :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @raulcd is familiar with these lines :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much to add here, as @wgtmac pointed out those are utility functions to help adding new libraries in the future. The ADD_ICEBERG_LIB function is a highly configurable function that simplifies the process of defining, building, and installing both shared and static libraries, along with their associated settings (e.g., include directories, link dependencies, etc.).
To be honest if we only plan to build iceberg core and puffin at the moment it might be overkill but I don't think it's harmful to reuse what Arrow has done here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much to add here, as @wgtmac pointed out those are utility functions to help adding new libraries in the future. The ADD_ICEBERG_LIB function is a highly configurable function that simplifies the process of defining, building, and installing both shared and static libraries, along with their associated settings (e.g., include directories, link dependencies, etc.). To be honest if we only plan to build iceberg core and puffin at the moment it might be overkill but I don't think it's harmful to reuse what Arrow has done here.

I also think it is overkill, I suggest we stick to modern cmake(which is target oriented) and start from easy, improve as project gets complicate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it target-oriented already?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it target-oriented already?

I guess not, you are using Set for adding source files, there is a target_sources which might be a better option. I don't insist if you think it's ok. In general, it LGTM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I have switched to use target_sources.

# Please set CMAKE_PREFIX_PATH to the install prefix of iceberg.
cmake_minimum_required(VERSION 3.25)

project(demo_example)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name of the project should follow the name of the folder for more clarity. 'example' instead of 'demo_example'.

# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this file is icebergConfig.cmake.in and in the puffin/ folder we have puffinConfig.cmake.in Wouldn't it make sense to name this to coreConfig.cmake.in to keep consistency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the package name which is used by find_package(iceberg CONFIG REQUIRED)
and find_package(puffin CONFIG REQUIRED). If we use coreConfig.cmake.in then it is weird to call find_package(core CONFIG REQUIRED) from an external project. Additionally, I think icebergCore is a little bit awkward. WDYT?

@Zheaoli
Copy link
Member

Zheaoli commented Dec 4, 2024

Should we think about the Bazel project? I think it's more advanced than cmake

@raulcd
Copy link
Member

raulcd commented Dec 4, 2024

Should we think about the Bazel project? I think it's more advanced than cmake

We can provide CMake and if someone wants to add meson, bazel or other build systems this can be done at a later stage?

@wgtmac
Copy link
Member Author

wgtmac commented Dec 4, 2024

Should we think about the Bazel project? I think it's more advanced than cmake

That's a good question. I think Bazel is a good option if the majority of active contributors have a good knowledge of it (unfortunately I don't).

@hawkingrei
Copy link

Should we think about the Bazel project? I think it's more advanced than cmake

In fact, there are already projects that can support both bazel and cmake such as abseil/abseil-cpp. Supporting bazel can be more friendly to some large projects.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as a base start. We will have to keep refining once we start adding CI and functionality.
Once this is merged I plan to start adding some workflows CI to start building on different OS's

# Borrowed the file from Apache Arrow:
# https://github.com/apache/arrow/blob/main/cpp/cmake_modules/BuildUtils.cmake

function(iceberg_install_cmake_package PACKAGE_NAME EXPORT_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much to add here, as @wgtmac pointed out those are utility functions to help adding new libraries in the future. The ADD_ICEBERG_LIB function is a highly configurable function that simplifies the process of defining, building, and installing both shared and static libraries, along with their associated settings (e.g., include directories, link dependencies, etc.).
To be honest if we only plan to build iceberg core and puffin at the moment it might be overkill but I don't think it's harmful to reuse what Arrow has done here.

include(FindPackageHandleStandardArgs)
include(GNUInstallDirs)

set(ICEBERG_API_DIR "${CMAKE_SOURCE_DIR}/api")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried and LGTM

# Borrowed the file from Apache Arrow:
# https://github.com/apache/arrow/blob/main/cpp/cmake_modules/BuildUtils.cmake

function(iceberg_install_cmake_package PACKAGE_NAME EXPORT_NAME)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it target-oriented already?

I guess not, you are using Set for adding source files, there is a target_sources which might be a better option. I don't insist if you think it's ok. In general, it LGTM.

@raulcd
Copy link
Member

raulcd commented Dec 5, 2024

@Fokko I think this is ready to be merged so we can kick start other things. I am unsure if there are other committers that could help with merging tasks on the iceberg-cpp implementation :)

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you all for the review. I believe this PR is a solid starting point, and I didn't notice any blockers for merging. We can always refine it further later.

@Xuanwo Xuanwo merged commit 7e97d8e into apache:main Dec 5, 2024
@gaborkaszab
Copy link
Collaborator

Just a general comment for the future: I don't think all of the active reviewers gave an approve to this PR (I certainly didn't). For me it seemed a bit rushed to merge. There was even open questions there that we haven't closed entirely. E.g. there was a discussion about other build tools than CMake. Even though there seemed to be a preference of CMake we haven't wrapped that discussion up. @Xuanwo

@raulcd
Copy link
Member

raulcd commented Dec 5, 2024

@gaborkaszab sorry about that. I'm sorry about that. Maybe I had a different view about the initial PR having an initial set up were more people can start working around other things and that's why I asked for this to be merged. I'll take that into consideration next time. As per the build tools, several projects don't limit themselves to a single build tool, in example Apache nanoarrow uses both meson and cmake, that's why I don't see the problem of having CMake initially and have also bazel or others. So I'm unsure on what the specific problem about that is.

@Fokko
Copy link
Contributor

Fokko commented Dec 5, 2024

I was following this and waiting for the discussion to dry up :) We should ensure everyone has the time to review, considering the time zones and the upcoming holidays. I often try to get approval from every active reviewer, to make sure that we have a consensus. Until a certain extent, because we also don't want to block for too long.

@gaborkaszab
Copy link
Collaborator

@gaborkaszab sorry about that. I'm sorry about that. Maybe I had a different view about the initial PR having an initial set up were more people can start working around other things and that's why I asked for this to be merged. I'll take that into consideration next time. As per the build tools, several projects don't limit themselves to a single build tool, in example Apache nanoarrow uses both meson and cmake, that's why I don't see the problem of having CMake initially and have also bazel or others. So I'm unsure on what the specific problem about that is.

Sure, no worries. It's good to have people unblocked by merging this PR. I just wanted to raise attention that later on we should be a bit more patient so that every stakeholder can have a chance to reply.
Thanks everyone for the active involvement! Let's keep up!

@Xuanwo
Copy link
Member

Xuanwo commented Dec 5, 2024

Sure, no worries. It's good to have people unblocked by merging this PR. I just wanted to raise attention that later on we should be a bit more patient so that every stakeholder can have a chance to reply.

Thank you @gaborkaszab for bringing this up. Moving forward, I'll make sure to allow sufficient time for all stakeholders to share their feedback.

@wgtmac
Copy link
Member Author

wgtmac commented Dec 5, 2024

Thank you all for review and comment! @Xuanwo @raulcd @Fokko @gaborkaszab @Zheaoli @zhjwpku @hawkingrei

It is always difficult to start something new like this because there are a lot of things missing and people have various imaginations about it. So I took the first step to write the proposal and created this PR so that we can start building things and discussing on concrete PoC pull requests. This PR leverages a proven approach (i.e. CMake scripts from Apache Arrow) so we can proceed to next steps (e.g. add minimal required third-party libraries and write initial type system APIs, etc.)

This is a brand new and immature project and may involves a lot of trial and error when add more items to it. Therefore, I totally understand the intention from @raulcd @Xuanwo that we need move fast and fix things when we find a better alternative later on. As a community, I believe we all want to make it a better project so I really appreciate and welcome opinion from every one.

wgtmac pushed a commit to wgtmac/iceberg-cpp that referenced this pull request Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants