Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyBallista - Python SQL client for Ballista #970

Merged
merged 9 commits into from
Feb 5, 2024
Merged

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Feb 4, 2024

Which issue does this PR close?

N/A

Rationale for this change

The Python bindings in https://github.com/apache/arrow-ballista-python were created by cloning the DataFusion Python bindings and then making some modifications. This project has been unmaintained for around one year now.

This PR adds new Python bindings which depend on the datafusion-python project rather than copying all of the code. I propose that we archive the old repo.

This project will be versioned and released independently from the main project and is not part of the default Cargo workspace, so will not get in the way of Rust development work.

>>> from pyballista import SessionContext
>>> ctx = SessionContext("localhost", 50050)
>>> ctx.sql("create external table t stored as parquet location '/mnt/bigdata/tpch/sf10-parquet/lineitem.parquet'")
>>> df = ctx.sql("select * from t limit 5")
>>> df.collect()

Output:

[pyarrow.RecordBatch
l_orderkey: int64 not null
l_partkey: int64 not null
l_suppkey: int64 not null
l_linenumber: int32 not null
l_quantity: decimal128(11, 2) not null
l_extendedprice: decimal128(11, 2) not null
l_discount: decimal128(11, 2) not null
l_tax: decimal128(11, 2) not null
l_returnflag: string not null
l_linestatus: string not null
l_shipdate: date32[day] not null
l_commitdate: date32[day] not null
l_receiptdate: date32[day] not null
l_shipinstruct: string not null
l_shipmode: string not null
l_comment: string not null
_ignore: string not null
----
l_orderkey: [17500001,17500001,17500001,17500001,17500001]
l_partkey: [1661786,1635206,907114,1849041,820472]
l_suppkey: [36835,60223,32124,24096,20473]
l_linenumber: [2,3,4,5,6]
l_quantity: [50.00,26.00,25.00,14.00,33.00]
l_extendedprice: [87385.00,29669.12,28026.75,13859.30,45950.19]
l_discount: [0.09,0.09,0.08,0.08,0.06]
l_tax: [0.00,0.04,0.04,0.04,0.07]
l_returnflag: ["N","N","N","N","N"]
l_linestatus: ["O","O","O","O","O"]
...]

What changes are included in this PR?

New pyballista folder containing the new Python bindings.

Are there any user-facing changes?

No

@andygrove andygrove requested a review from yahoNanJing February 4, 2024 17:38
@github-actions github-actions bot added the python label Feb 4, 2024
@andygrove andygrove changed the title PyBallista - Python client for Ballista PyBallista - Python SQL client for Ballista Feb 4, 2024
@andygrove andygrove merged commit e5cce8a into apache:main Feb 5, 2024
16 checks passed
@andygrove andygrove deleted the python branch February 5, 2024 14:28
@honeyAndSw
Copy link

Hi @andygrove, after this change, the datafusion functions submodule seems not included in PyBallista.

If I try to find functions within PyBallista

from pyballista import SessionContext, functions as f

filename = "sample.parquet"
ctx = SessionContext("localhost", 50050)

df = ctx.read_parquet(path=filename)
uri = df.select(f.col("stream_id"))
df.limit(10).show()

I'm getting the following error:

ImportError: cannot import name 'functions' from 'pyballista' (/Users/.../datafusion-ballista/python/pyballista/__init__.py)

Or if I change my imports to:

from pyballista import SessionContext
from datafusion import functions as f

Then it says:

  File "examples/test.py", line 11, in <module>
    uri = df.select(f.col("stream_id"))
TypeError: argument 'args': 'Expr' object cannot be converted to 'Expr'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants