
[Feature] Paimon Spark 2025 Roadmap #4816

Open · Zouxxyy opened this issue Jan 2, 2025 · 9 comments
Labels: enhancement (New feature or request)

Zouxxyy (Contributor) commented Jan 2, 2025

Motivation

2025 has arrived, and we would like to thank everyone for their contributions over the past year! Here we present the 2025 Paimon Spark roadmap; you are welcome to take ownership of these items or to expand upon them!

| Name | Introduction | Link |
|------|--------------|------|
| Variant Type | [feat] Support variant type, unlocking support for semi-structured data. | #4471 |
| Optimized Write | [perf] Optimize table writing, including automatic repartitioning, data rebalancing, and so on. | |
| Distributed Planning | [perf] Support distributed planning in the scan phase. | #4864 |
| DataFrame Writer V2 | [feat] Integrate Spark's DataFrame Writer V2. | |
| Liquid Clustering | [perf] Support liquid clustering. | #4815 |
| Isolation Level | [feat] Support more transaction isolation levels, such as serializable. | #4616 |
| Support for Spark Connect | [feat] Support Spark Connect ("Paimon Connect"). | |
| Default Value | [feat] Support default values for specified fields. | |
| Constraints | [feat] Support adding constraints to fields, such as NOT NULL or other custom constraints. | |
| Partition Stats | [feat] Support partition statistics. | |
| Row Lineage | [feat] Support tracking row lineage. | |
| Identity Column | [feat] Generate unique values for an identity column when no explicit values are provided during writes. | |
| Generated Columns | [feat] Support generated columns whose values are derived from a user-specified function over other columns. | |
| CDC for Non-PK Table | [feat] Support CDC for non-primary-key tables. | |
YannByron (Contributor) commented:
If there are any other features or requirements you would like to see, please comment here so we can discuss and revise this roadmap together. Thanks.

YannByron (Contributor) commented:
And if you would like to take on one or more of these items, go ahead and let us know.

Aiden-Dong (Contributor) commented:
Can I take on this task?

> Distributed Planning: [perf] Support distributed planning in the scan phase.

Zouxxyy (Contributor, Author) commented Jan 8, 2025

@Aiden-Dong Yes, feel free to take it; you can create an issue for it. Note that this feature actually requires changes in the core, after which each compute engine will need to support it.

Aiden-Dong (Contributor) commented:
> Yes, feel free to take it; you can create an issue for it. Note that this feature actually requires changes in the core, after which each compute engine will need to support it.

Yes, I understand that we need to extend the functionality of AbstractFileStoreScan.readAndMergeFileEntries.
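
To make the idea concrete, here is a rough sketch of what fanning the manifest reads out to executors might look like. `FileEntry` and `readManifest` are placeholders for Paimon internals rather than real APIs, and the actual change would live in paimon-core:

```scala
// Rough sketch only: read each manifest on an executor instead of reading
// them all on the driver, then collect the entries back for planning.
import org.apache.spark.sql.SparkSession

object DistributedPlanningSketch {
  final case class FileEntry(filePath: String, partition: String)

  // Placeholder for the manifest-decoding logic that today runs inside
  // AbstractFileStoreScan.readAndMergeFileEntries on the driver.
  def readManifest(manifestPath: String): Seq[FileEntry] = Seq.empty

  def planDistributed(spark: SparkSession, manifests: Seq[String]): Seq[FileEntry] =
    spark.sparkContext
      .parallelize(manifests, math.max(1, math.min(manifests.size, 64)))
      .flatMap(readManifest)   // manifest reads happen on executors
      .collect()               // only the decoded entries return to the driver
      .toSeq
}
```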

Aiden-Dong (Contributor) commented:
#4864

zhongyujiang (Contributor) commented:
@Zouxxyy Thank you for raising this; these optimizations are all highly anticipated!

> DataFrame Writer V2: [feat] Integrate Spark's DataFrame Writer V2.

If no one has worked on this yet, I would like to volunteer to take it on. We are currently working to improve write performance by using the V2 write interface RequiresDistributionAndOrdering. In fact, I am close to completing an MVP version locally.
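
For illustration, a minimal sketch of what such a write could look like, assuming the Spark 3.2+ DSv2 interfaces; the class name and constructor arguments are hypothetical, not Paimon's actual code:

```scala
// Minimal sketch of a DSv2 Write using RequiresDistributionAndOrdering.
// Only the Spark connector interfaces are real; everything else is assumed.
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

class PaimonV2WriteSketch(bucketColumns: Seq[String], numBuckets: Int)
    extends Write with RequiresDistributionAndOrdering {

  // Ask Spark to cluster incoming rows so that all rows belonging to one
  // bucket land in the same write task, avoiding cross-task bucket writes.
  override def requiredDistribution(): Distribution =
    Distributions.clustered(
      bucketColumns.map(c => Expressions.column(c): Expression).toArray)

  // For a fixed-bucket table, matching write partitions to buckets
  // one-to-one is natural; returning 0 lets Spark pick the parallelism.
  override def requiredNumPartitions(): Int = numBuckets

  // No intra-partition ordering required in this sketch.
  override def requiredOrdering(): Array[SortOrder] = Array.empty
}
```

With this in place, Spark shuffles the rows to satisfy the clustered distribution before the write tasks run, which is what the V1 path has to arrange by hand today.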

YannByron (Contributor) commented:
> If no one has worked on this yet, I would like to volunteer to take it on.

So glad you can take it on. Just a reminder: be aware of the support for scenarios with different bucket modes in your implementation, especially dynamic bucket mode. This is why we compromised on V1 write at first.

zhongyujiang (Contributor) commented:
> especially dynamic bucket mode in your implementation

Yeah, I haven't found an easy way to support this yet. In fact, I've only implemented V2 write for the fixed bucket mode. I think we can first let the unsupported bucket modes fall back to V1 write.
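
For example, the fallback routing could be as simple as the following sketch; `BucketMode` and the mode names here are illustrative, not Paimon's actual enums:

```scala
// Sketch of the fallback idea: route only the bucket modes the V2 path
// handles to V2 write, and keep everything else on the existing V1 write.
object WritePathSelection {
  sealed trait BucketMode
  case object FixedBucket   extends BucketMode
  case object DynamicBucket extends BucketMode
  case object BucketUnaware extends BucketMode

  def useV2Write(mode: BucketMode): Boolean = mode match {
    case FixedBucket => true   // V2 write with a clustered distribution
    case _           => false  // dynamic/unaware modes fall back to V1 for now
  }
}
```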
