Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add iceberg_arrow library #6

Merged
merged 34 commits into from
Jan 9, 2025
Merged

Add iceberg_arrow library #6

merged 34 commits into from
Jan 9, 2025

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Dec 7, 2024

  • Add a ThirdpartyToolchain CMake module to manage dependencies using FetchContent.
  • Add Apache Arrow 18.1.0 as a third-party dependency to libiceberg_arrow.
  • Add a new library libiceberg_arrow to leverage arrow-backed implementations (e.g. filesystem and parquet).
  • Rename libiceberg_core to libiceberg.
  • Both libiceberg_arrow and libiceberg_puffin link to libiceberg.
  • Removed api directory and let each library manage its own header files.
  • Consolidate all targets in the same export target file.
  • Add component support to the exported config file.

src/arrow/CMakeLists.txt Outdated Show resolved Hide resolved
src/arrow/IcebergArrowConfig.cmake.in Outdated Show resolved Hide resolved
@wgtmac wgtmac marked this pull request as ready for review December 8, 2024 04:56
@wgtmac
Copy link
Member Author

wgtmac commented Dec 8, 2024

I have added a iceberg-arrow library to opt in an Arrow-backed implementation in the future. I follow a similar pattern of arrow-cpp to manage third-party libraries. Let me know what you think. @lidavidm @pitrou @raulcd @Fokko @Xuanwo @gaborkaszab

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me, but if we're going to require CMake 3.25, is there anything that would simplify dependency management? Arrow has been discussing requiring this version since there are apparently new features that greatly simplify things like this.

https://ursalabs.zulipchat.com/#narrow/channel/180245-dev/topic/CMake.20minimum.20required.20version/near/481934139

@lidavidm
Copy link
Member

lidavidm commented Dec 8, 2024

e.g. it seems we could integrate find_package and FetchContent more naturally and not rely on the pile of code Arrow uses: https://stackoverflow.com/questions/73621403/cmake-integrating-fetchcontent-with-find-package

@pitrou
Copy link
Member

pitrou commented Dec 9, 2024

e.g. it seems we could integrate find_package and FetchContent more naturally and not rely on the pile of code Arrow uses

That would certainly be preferrable. The "Arrow way" of handling this has been a huge time sink.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have some minor CI before starting to push more code. I've created:
#7
to have at least a build validation.
I agree with David and Antoine that the idea of using a newer CMake is to try and simplify all the Thirdparty dependency management

@wgtmac
Copy link
Member Author

wgtmac commented Dec 10, 2024

That makes sense! Let me spend some time to simplify it.

@wgtmac wgtmac force-pushed the add_arrow branch 3 times, most recently from 3311817 to 9f7e725 Compare December 14, 2024 16:04
@wgtmac
Copy link
Member Author

wgtmac commented Dec 14, 2024

Well, it seems that I cannot make the Windows CI happy. It complains unresolved Arrow symbols during linking. I think I need a Windows PC to debug it.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can continue refining the CMake setup over time. If possible it would be good to use modern features since Arrow's setup (while well tested by now) can be rather brittle

@zhjwpku
Copy link
Collaborator

zhjwpku commented Dec 23, 2024

This seems real complicated, ISTM that find_package and FetchContent mentioned by @lidavidm should be an easy way, I tried that in #13, maybe you can check that.

@wgtmac
Copy link
Member Author

wgtmac commented Dec 23, 2024

@zhjwpku Yes, I am already working on simplifying it by integrating FetchContent_declare with FIND_PACKAGE_ARGS. They are still in my local copy and I'm polishing them. It is pretty easy to import third-party libraries. The dirty part is to export targets and config files. I'm too busy these days so it may take a while to complete.

@zhjwpku
Copy link
Collaborator

zhjwpku commented Dec 23, 2024

@zhjwpku Yes, I am already working on simplifying it by integrating FetchContent_declare with FIND_PACKAGE_ARGS. They are still in my local copy and I'm polishing them. It is pretty easy to import third-party libraries. The dirty part is to export targets and config files. I'm too busy these days so it may take a while to complete.

Sure, thanks. Take your time :)

- add libiceberg_arrow
- use single config file with three separate targets to support components
- add fetchcontent support
src/CMakeLists.txt Outdated Show resolved Hide resolved
@wgtmac
Copy link
Member Author

wgtmac commented Dec 27, 2024

This is ready for review now.

Thanks @lidavidm and @zhjwpku for approval!

Comment on lines 22 to 34
#if defined(_WIN32) || defined(__CYGWIN__)
# ifdef ICEBERG_STATIC
# define ICEBERG_EXPORT
# else
# ifdef ICEBERG_EXPORTING
# define ICEBERG_EXPORT __declspec(dllexport)
# else
# define ICEBERG_EXPORT __declspec(dllimport)
# endif
# endif
#else
# define ICEBERG_EXPORT
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to prepare this for each shared library. (We need ICEBERG_CORE_EXPORT/ICEBERG_PUFFIN_EXPORT/ICEBERG_ARROW_EXPORT.)
Because libiceberg_arrow.dll will use exported symbols in libiceberg_core.dll. In this case, libiceberg_arrow.dll uses _declspec(dllexport) for libiceberg_arrow.dll symbols and _declspec(dllimport) for libiceberg_core.dll symbols.

Copy link
Member Author

@wgtmac wgtmac Jan 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. What if we define ICEBERG_EXPORTING to all of libiceberg_core.dll, libiceberg_arrow.dll and libiceberg_puffin.dll? I think we can simply keep the same visibility to all libraries for now because the core library provides the API and core logic while puffin and arrow provide concrete implementations of the core library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it causes link errors. If libiceberg_arrow.dll uses a libiceberg_core.dll symbol, libiceberg_arrow.dll uses __declspec(dllexport)-ed libiceberg_core.dll symbols...

Can we add a simple function to libiceberg_core.dll and use it from libiceberg_arrow.dll?

Copy link
Member Author

@wgtmac wgtmac Jan 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try it on my Windows PC tonight. Thanks!

EDIT: I have added a new function DemoTable::name() in demo_table.h and called it in demo_arrow.cc. It does not have linking error. @kou

Copy link
Member Author

@wgtmac wgtmac Jan 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is relevant to how we manage the public headers from different iceberg libraries. As per previous discussion, we use a separate api folder holds all public header files from all libraries (i.e. core, puffin, arrow, etc.). IIUC, the current approach makes it difficult to manage the visibility of the header file across libraries. I'd propose to remove the current api folder and let each library manage its own header files. WDYT? @kou @raulcd @zhjwpku @gaborkaszab @lidavidm @pitrou

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having separate directories per library is reasonable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new function DemoTable::name() in demo_table.h and called it in demo_arrow.cc. It does not have linking error.

Hmm. It may have a real world problem because libiceberg_arrow.lib is always used with libiceberg_core.lib...
FYI: If we use the same ICEBERG_EXPORTING for all libraries, libiceberg_arrow.lib also exports symbols in libiceberg_core.dll but libiceberg_arrow.dll doesn't have them:

$ nm iceberg_arrow.lib | grep Demo
0000000000000000 T ??0DemoTable@iceberg@@QEAA@AEBV01@@Z
0000000000000000 I __imp_??0DemoTable@iceberg@@QEAA@AEBV01@@Z
0000000000000000 T ??0DemoTable@iceberg@@QEAA@XZ
0000000000000000 I __imp_??0DemoTable@iceberg@@QEAA@XZ
0000000000000000 T ??1DemoTable@iceberg@@UEAA@XZ
0000000000000000 I __imp_??1DemoTable@iceberg@@UEAA@XZ
0000000000000000 T ??4DemoArrow@arrow@iceberg@@QEAAAEAV012@$$QEAV012@@Z
0000000000000000 I __imp_??4DemoArrow@arrow@iceberg@@QEAAAEAV012@$$QEAV012@@Z
0000000000000000 T ??4DemoArrow@arrow@iceberg@@QEAAAEAV012@AEBV012@@Z
0000000000000000 I __imp_??4DemoArrow@arrow@iceberg@@QEAAAEAV012@AEBV012@@Z
0000000000000000 T ??4DemoTable@iceberg@@QEAAAEAV01@AEBV01@@Z
0000000000000000 I __imp_??4DemoTable@iceberg@@QEAAAEAV01@AEBV01@@Z
0000000000000000 I __imp_??_7DemoTable@iceberg@@6B@
0000000000000000 T ?name@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string_view@DU?$char_traits@D@std@@@std@@XZ
0000000000000000 I __imp_?name@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string_view@DU?$char_traits@D@std@@@std@@XZ
0000000000000000 T ?print@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@XZ
0000000000000000 I __imp_?print@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@XZ

If we use separated _EXPORTING macro for each library, DemoTable symbols aren't exported by libiceberg_arrow.dll:

$ nm iceberg_arrow.lib | grep Demo
0000000000000000 T ??4DemoArrow@arrow@iceberg@@QEAAAEAV012@$$QEAV012@@Z
0000000000000000 I __imp_??4DemoArrow@arrow@iceberg@@QEAAAEAV012@$$QEAV012@@Z
0000000000000000 T ??4DemoArrow@arrow@iceberg@@QEAAAEAV012@AEBV012@@Z
0000000000000000 I __imp_??4DemoArrow@arrow@iceberg@@QEAAAEAV012@AEBV012@@Z
0000000000000000 T ?name@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string_view@DU?$char_traits@D@std@@@std@@XZ
0000000000000000 I __imp_?name@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string_view@DU?$char_traits@D@std@@@std@@XZ
0000000000000000 T ?print@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@XZ
0000000000000000 I __imp_?print@DemoArrow@arrow@iceberg@@QEBA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@XZ

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd propose to remove the current api folder and let each library manage its own header files.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the api folder and move the header files to each library instead.

cmake_modules/BuildUtils.cmake Outdated Show resolved Hide resolved
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
if(BUILD_SHARED)
generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
target_compile_definitions(${LIB_NAME}_shared PUBLIC ${LIB_NAME}_shared_EXPORTS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this?
It seems that this is defined automatically: https://cmake.org/cmake/help/latest/prop_tgt/DEFINE_SYMBOL.html

Copy link
Member Author

@wgtmac wgtmac Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I added this to fix a linking error. I can try to remove this again to confirm.

Update: It failed at linking again: https://github.com/apache/iceberg-cpp/actions/runs/12591221855/job/35094013478?pr=6

Copy link
Member

@kou kou Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/iceberg-cpp/actions/runs/12591221855/job/35094013478#step:3:422

demo_puffin.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: virtual __cdecl iceberg::Puffin::~Puffin(void)" (__imp_??1Puffin@iceberg@@UEAA@XZ) referenced in function "public: virtual __cdecl iceberg::puffin::DemoPuffin::~DemoPuffin(void)" (??1DemoPuffin@puffin@iceberg@@UEAA@XZ) [D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\puffin\iceberg_puffin_shared.vcxproj]
demo_puffin.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: __cdecl iceberg::Puffin::Puffin(void)" (__imp_??0Puffin@iceberg@@QEAA@XZ) referenced in function "public: __cdecl iceberg::puffin::DemoPuffin::DemoPuffin(void)" (??0DemoPuffin@puffin@iceberg@@QEAA@XZ) [D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\puffin\iceberg_puffin_shared.vcxproj]
demo_puffin.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: __cdecl iceberg::Puffin::Puffin(class iceberg::Puffin const &)" (__imp_??0Puffin@iceberg@@QEAA@AEBV01@@Z) referenced in function "public: __cdecl iceberg::puffin::DemoPuffin::DemoPuffin(class iceberg::puffin::DemoPuffin const &)" (??0DemoPuffin@puffin@iceberg@@QEAA@AEBV012@@Z) [D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\puffin\iceberg_puffin_shared.vcxproj]
demo_puffin.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: class iceberg::Puffin & __cdecl iceberg::Puffin::operator=(class iceberg::Puffin const &)" (__imp_??4Puffin@iceberg@@QEAAAEAV01@AEBV01@@Z) referenced in function "public: class iceberg::puffin::DemoPuffin & __cdecl iceberg::puffin::DemoPuffin::operator=(class iceberg::puffin::DemoPuffin const &)" (??4DemoPuffin@puffin@iceberg@@QEAAAEAV012@AEBV012@@Z) [D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\puffin\iceberg_puffin_shared.vcxproj]

OK. It's an implementation problem. We should remove this line. (Or we should use PRIVATE not PUBLIC.)

iceberg::Puffin should exist in libiceberg_core.dll not libiceberg_puffin.dll. Can we define iceberg::Puffin in libiceberg_core.dll something like the following?

diff --git a/src/iceberg/demo_table.cc b/src/iceberg/demo_table.cc
index 9e46d7a..d960f1b 100644
--- a/src/iceberg/demo_table.cc
+++ b/src/iceberg/demo_table.cc
@@ -18,6 +18,7 @@
  */
 
 #include "iceberg/demo_table.h"
+#include "iceberg/puffin.h"
 
 namespace iceberg {
 

BTW, is it OK that iceberg::Puffin exists in libiceberg_core.dll not libiceberg_puffin.dll?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/iceberg-cpp/actions/runs/12584082126/job/35072964038#step:3:410

LINK : fatal error LNK1104: cannot open file '..\Debug\iceberg.lib' [D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\arrow\iceberg_arrow_shared.vcxproj]
  Building Custom Rule D:/a/iceberg-cpp/iceberg-cpp/src/iceberg/CMakeLists.txt
  demo_table.cc
  iceberg_static.vcxproj -> D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\Debug\iceberg_static.lib
  Building Custom Rule D:/a/iceberg-cpp/iceberg-cpp/src/iceberg/arrow/CMakeLists.txt
  demo_arrow.cc
  iceberg_arrow_static.vcxproj -> D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\arrow\Debug\iceberg_arrow_static.lib
  Building Custom Rule D:/a/iceberg-cpp/iceberg-cpp/src/iceberg/puffin/CMakeLists.txt
  demo_puffin.cc
LINK : fatal error LNK1104: cannot open file '..\Debug\iceberg.lib' [D:\a\iceberg-cpp\iceberg-cpp\build\src\iceberg\puffin\iceberg_puffin_shared.vcxproj]

I think that it's a different problem. Could you share the URL of the commit?

Copy link
Member

@kou kou Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining the context.
How about discussing it as a follow-up task if it's needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can merge this now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your help!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac I think let's go with the approach that is the most convenient at this point just to get going. We can re-evaluate later on once we actually do the implementation of Puffin. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Let's keep the puffin library as is and revisit its design when you start the implementation.

cmake_modules/BuildUtils.cmake Outdated Show resolved Hide resolved
cmake_modules/BuildUtils.cmake Outdated Show resolved Hide resolved
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

cmake_modules/BuildUtils.cmake Outdated Show resolved Hide resolved
@wgtmac
Copy link
Member Author

wgtmac commented Jan 9, 2025

I think this PR is ready to merge. @Fokko @Xuanwo

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wgtmac for your efforts on this, and thanks as well to @kou, @zhjwpku, and @lidavidm for the review.

Let's move.

@Xuanwo Xuanwo merged commit 95b592d into apache:main Jan 9, 2025
5 checks passed
@wgtmac
Copy link
Member Author

wgtmac commented Jan 9, 2025

Thanks @Xuanwo!

@wgtmac wgtmac deleted the add_arrow branch January 24, 2025 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants