[BUG] DataFrame from_dict silently fails and generates invalid data with inputs in the form of orient="dict" #13614

Closed
beckernick opened this issue Jun 23, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@beckernick
Member

beckernick commented Jun 23, 2023

cudf.DataFrame.from_dict was added in #12048 to close #11934. In at least one scenario, from_dict fails silently on data generated by to_dict and generates columns of range(0, N). We should either succeed or prohibit this input orientation.

The round trip works when the data is in the "list" orientation:

import cudf

data = {"a":[10,4,6], "b":[3,5,3]}
df = cudf.DataFrame.from_dict(data)
print(df)

rawdata = df.to_dict(orient="list")
print(rawdata)

print(cudf.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': [10, 4, 6], 'b': [3, 5, 3]}
    a  b
0  10  3
1   4  5
2   6  3

But from_dict fails silently and generates columns of range(0, N) if the data is in the to_dict default "dict" orientation:

import cudf

data = {"a":[10,4,6], "b":[3,5,3]}
df = cudf.DataFrame.from_dict(data)
print(df)

rawdata = df.to_dict() # default of orient="dict"
print(rawdata)

print(cudf.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': {0: 10, 1: 4, 2: 6}, 'b': {0: 3, 1: 5, 2: 3}}
   a  b
0  0  0
1  1  1
2  2  2

In contrast, pandas succeeds:

import pandas as pd

data = {"a":[10,4,6], "b":[3,5,3]}
df = pd.DataFrame.from_dict(data)
print(df)

rawdata = df.to_dict() # default of orient="dict"
print(rawdata)

print(pd.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': {0: 10, 1: 4, 2: 6}, 'b': {0: 3, 1: 5, 2: 3}}
    a  b
0  10  3
1   4  5
2   6  3
!conda list | grep "rapids\|pandas\|numpy\|arrow"
# packages in environment at /home/nicholasb/miniconda3/envs/rapids-23.06:
cudf_kafka                23.06.00        py310_230607_gf881d40c63_0    rapidsai
cusignal                  23.06.00        py39_230607_g22c7120_0    rapidsai
geopandas                 0.13.2             pyhd8ed1ab_1    conda-forge
geopandas-base            0.13.2             pyha770c72_1    conda-forge
libarrow                  11.0.0          hc00ebf5_25_cpu    conda-forge
libcucim                  23.06.00        cuda11_230607_gfdc657b_0    rapidsai
libcudf                   23.06.00        cuda11_230607_gf881d40c63_0    rapidsai
libcudf_kafka             23.06.00        230607_gf881d40c63_0    rapidsai
libcugraph                23.06.02        cuda11_230613_gdb9d3c12_0    rapidsai
libcugraph_etl            23.06.02        cuda11_230613_gdb9d3c12_0    rapidsai
libcuml                   23.06.00        cuda11_230607_ga381e03f2_0    rapidsai
libcuspatial              23.06.00        cuda11_230607_g7b3284af_0    rapidsai
libkvikio                 23.06.00        cuda11_230607_gd3b823c_0    rapidsai
libraft                   23.06.01        cuda11_230612_g9147c907_0    rapidsai
libraft-headers           23.06.01        cuda11_230612_g9147c907_0    rapidsai
libraft-headers-only      23.06.01        cuda11_230612_g9147c907_0    rapidsai
librmm                    23.06.00        cuda11_230607_gacaf3f5e_0    rapidsai
libxgboost                1.7.5dev.rapidsai23.06        cuda11_0    rapidsai
numpy                     1.24.3                   pypi_0    pypi
pandas                    1.5.3                    pypi_0    pypi
py-xgboost                1.7.5dev.rapidsai23.06  cuda11_py310_0    rapidsai
pyarrow                   11.0.0                   pypi_0    pypi
rapids                    23.06.02        cuda11_py310_230613_g9b052fc_0    rapidsai
rapids-xgboost            23.06.02        cuda11_py310_230613_g9b052fc_0    rapidsai
ucx-proc                  1.0.0                       gpu    rapidsai
@beckernick added the bug and python labels on Jun 23, 2023
@beckernick changed the title on Jun 23, 2023
  from: [BUG] DataFrame from_dict silently fails with inputs in the form of orient="dict"
  to: [BUG] DataFrame from_dict silently fails and generates invalid data with inputs in the form of orient="dict"
@wence-
Contributor

wence- commented Jun 29, 2023

Thanks. This happens because from_dict(data) hands off to cudf.DataFrame(data) in this case, and the constructor then handles this input incorrectly.

In contrast, the pandas DataFrame constructor also handles this input correctly:

import pandas as pd

data = {"a":[10,4,6], "b":[3,5,3]}
df = pd.DataFrame.from_dict(data)
new = pd.DataFrame(df.to_dict())

This is, I think, mostly a consequence of the asymmetry between the orient flags offered by to_dict and from_dict. In this case, cudf also does less introspection of the data than pandas to determine the format of the input to from_dict.

To summarise the pandas behaviour:

  • to_dict supports orient= "dict", "list", "series", "split", "tight", "records", "index"
  • from_dict supports orient= "columns", "index", "tight"
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
orients = ["dict", "index", "list", "records", "series", "split", "tight"]
for o in orients:
    # Step 1: can from_dict read the to_dict output at all?
    try:
        new = pd.DataFrame.from_dict(df.to_dict(orient=o))
    except Exception:
        print(f"# Unable to read for {o}")
        continue
    # Step 2: does the result match the original? A failed comparison
    # (e.g. mismatched labels) counts as wrong data.
    try:
        same = bool((new == df).all().all())
    except Exception:
        same = False
    print(f"# Success for {o}" if same else f"# Read, but data wrong for {o}")
# Success for dict
# Read, but data wrong for index
# Success for list
# Success for records
# Success for series
# Unable to read for split
# Unable to read for tight

So if to_dict is called with orient="index" or orient="tight", from_dict needs an explicit orient argument; with "index" and no orient argument you silently get incorrect results. As far as I can tell, to_dict(orient="split") can't be round-tripped through from_dict at all.
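
A minimal pandas sketch of the point above, assuming pandas 1.4 or later (where the "tight" orientation exists). The helper name `same` is made up for this illustration. Unpacking the "split" output into the DataFrame constructor is a separate idiom, not a from_dict round trip.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})


def same(new, old):
    """Hypothetical helper: True if both frames have identical labels and values."""
    try:
        return bool((new == old).all().all())
    except ValueError:
        # Raised when the two frames are not identically labelled.
        return False


# "index" and "tight" round-trip once from_dict is told the orientation.
from_index = pd.DataFrame.from_dict(df.to_dict(orient="index"), orient="index")
from_tight = pd.DataFrame.from_dict(df.to_dict(orient="tight"), orient="tight")

# "split" has no from_dict counterpart, but its keys (index, columns, data)
# match DataFrame constructor arguments, so the dict can be unpacked directly.
from_split = pd.DataFrame(**df.to_dict(orient="split"))

# Without the orient argument, "index" data is read as column-oriented.
wrong = pd.DataFrame.from_dict(df.to_dict(orient="index"))

for name, frame in [
    ("index", from_index),
    ("tight", from_tight),
    ("split", from_split),
    ("index, no orient", wrong),
]:
    print(name, same(frame, df))
```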

In contrast, cudf:

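import cudf as pd  # aliased so the loop is identical to the pandas version

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
orients = ["dict", "index", "list", "records", "series", "split", "tight"]
for o in orients:
    # Step 1: can from_dict read the to_dict output at all?
    try:
        new = pd.DataFrame.from_dict(df.to_dict(orient=o))
    except Exception:
        print(f"# Unable to read for {o}")
        continue
    # Step 2: does the result match the original? A failed comparison
    # (e.g. mismatched labels) counts as wrong data.
    try:
        same = bool((new == df).all().all())
    except Exception:
        same = False
    print(f"# Success for {o}" if same else f"# Read, but data wrong for {o}")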
# Read, but data wrong for dict
# Read, but data wrong for index
# Success for list
# Success for records
# Success for series
# Unable to read for split
# Unable to read for tight

So I suspect the dict case is the only one that needs to be handled with some introspection of the input.
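
As a rough illustration of what such introspection could look like, here is a pure-Python sketch. It is not the cudf implementation; the helper name `normalize_dict_orient` and its return shape are assumptions made for this example. It detects a {column: {index: value}} input and converts it to list-oriented columns plus an index. Note that to_dict(orient="index") output is also a dict of dicts, so this check cannot tell the two apart; like the from_dict default of orient="columns", it treats the outer keys as column names.

```python
from collections.abc import Mapping


def normalize_dict_orient(data):
    """Hypothetical helper: turn {column: {index: value}} into
    ({column: [values]}, [index labels]); return None for other shapes."""
    if not isinstance(data, Mapping) or not data:
        return None
    if not all(isinstance(col, Mapping) for col in data.values()):
        return None
    # Union of the index labels across columns, in first-seen order.
    index = list(dict.fromkeys(label for col in data.values() for label in col))
    # Labels missing from a column become None in that column.
    columns = {
        name: [col.get(label) for label in index]
        for name, col in data.items()
    }
    return columns, index


rawdata = {"a": {0: 10, 1: 4, 2: 6}, "b": {0: 3, 1: 5, 2: 3}}
result = normalize_dict_orient(rawdata)
print(result)
# ({'a': [10, 4, 6], 'b': [3, 5, 3]}, [0, 1, 2])
```

The resulting pair could then be handed to a constructor call along the lines of cudf.DataFrame(columns, index=index), which is the list-oriented form that already round-trips in the examples above.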

@vyasr removed the python label on Feb 23, 2024
@vyasr
Contributor

vyasr commented May 15, 2024

This no longer reproduces for me on the latest cudf:

import cudf

data = {"a":[10,4,6], "b":[3,5,3]}
df = cudf.DataFrame.from_dict(data)
print(df)

rawdata = df.to_dict() # default of orient="dict"
print(rawdata)

print(cudf.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': {0: 10, 1: 4, 2: 6}, 'b': {0: 3, 1: 5, 2: 3}}
    a  b
0  10  3
1   4  5
2   6  3

@vyasr closed this as completed on May 15, 2024