[BUG] DataFrame from_dict silently fails and generates invalid data with inputs in the form of `orient="dict"` #13614

beckernick · 2023-06-23T21:19:57Z

cudf.DataFrame.from_dict was added in #12048 to close #11934. In at least one scenario, from_dict fails silently on data generated by to_dict and produces columns of range(0, N) instead. We should either succeed or prohibit this input orientation.

import cudf

data = {"a":[10,4,6], "b":[3,5,3]}
df = cudf.DataFrame.from_dict(data)
print(df)

rawdata = df.to_dict(orient="list")
print(rawdata)

print(cudf.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': [10, 4, 6], 'b': [3, 5, 3]}
    a  b
0  10  3
1   4  5
2   6  3

But from_dict will fail silently and generate columns of range(0, N) if the data is in to_dict's default "dict" orientation:

import cudf

data = {"a":[10,4,6], "b":[3,5,3]}
df = cudf.DataFrame.from_dict(data)
print(df)

rawdata = df.to_dict() # default of orient="dict"
print(rawdata)

print(cudf.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': {0: 10, 1: 4, 2: 6}, 'b': {0: 3, 1: 5, 2: 3}}
   a  b
0  0  0
1  1  1
2  2  2

In contrast, pandas succeeds:

import pandas as pd

data = {"a":[10,4,6], "b":[3,5,3]}
df = pd.DataFrame.from_dict(data)
print(df)

rawdata = df.to_dict() # default of orient="dict"
print(rawdata)

print(pd.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': {0: 10, 1: 4, 2: 6}, 'b': {0: 3, 1: 5, 2: 3}}
    a  b
0  10  3
1   4  5
2   6  3

!conda list | grep "rapids\|pandas\|numpy\|arrow"
# packages in environment at /home/nicholasb/miniconda3/envs/rapids-23.06:
cudf_kafka                23.06.00        py310_230607_gf881d40c63_0    rapidsai
cusignal                  23.06.00        py39_230607_g22c7120_0    rapidsai
geopandas                 0.13.2             pyhd8ed1ab_1    conda-forge
geopandas-base            0.13.2             pyha770c72_1    conda-forge
libarrow                  11.0.0          hc00ebf5_25_cpu    conda-forge
libcucim                  23.06.00        cuda11_230607_gfdc657b_0    rapidsai
libcudf                   23.06.00        cuda11_230607_gf881d40c63_0    rapidsai
libcudf_kafka             23.06.00        230607_gf881d40c63_0    rapidsai
libcugraph                23.06.02        cuda11_230613_gdb9d3c12_0    rapidsai
libcugraph_etl            23.06.02        cuda11_230613_gdb9d3c12_0    rapidsai
libcuml                   23.06.00        cuda11_230607_ga381e03f2_0    rapidsai
libcuspatial              23.06.00        cuda11_230607_g7b3284af_0    rapidsai
libkvikio                 23.06.00        cuda11_230607_gd3b823c_0    rapidsai
libraft                   23.06.01        cuda11_230612_g9147c907_0    rapidsai
libraft-headers           23.06.01        cuda11_230612_g9147c907_0    rapidsai
libraft-headers-only      23.06.01        cuda11_230612_g9147c907_0    rapidsai
librmm                    23.06.00        cuda11_230607_gacaf3f5e_0    rapidsai
libxgboost                1.7.5dev.rapidsai23.06        cuda11_0    rapidsai
numpy                     1.24.3                   pypi_0    pypi
pandas                    1.5.3                    pypi_0    pypi
py-xgboost                1.7.5dev.rapidsai23.06  cuda11_py310_0    rapidsai
pyarrow                   11.0.0                   pypi_0    pypi
rapids                    23.06.02        cuda11_py310_230613_g9b052fc_0    rapidsai
rapids-xgboost            23.06.02        cuda11_py310_230613_g9b052fc_0    rapidsai
ucx-proc                  1.0.0                       gpu    rapidsai

wence- · 2023-06-29T12:00:06Z

Thanks. This happens because from_dict(data) hands off to cudf.DataFrame(data) in this case, which then handles the dict-of-dicts input incorrectly.

In contrast the DataFrame constructor in pandas also works here:

import pandas as pd

data = {"a":[10,4,6], "b":[3,5,3]}
df = pd.DataFrame.from_dict(data)
new = pd.DataFrame(df.to_dict())

This is, I think, mostly a consequence of there not being a symmetry in the orient flags offered by to_dict and from_dict, and in this case, cudf does less introspection of the data than pandas to determine the format of the input to from_dict.

To summarise the pandas behaviour:

to_dict supports orient= "dict", "list", "series", "split", "tight", "records", "index"
from_dict supports orient= "columns", "index", "tight"

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
orient = {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'}
for o in sorted(orient):
    try:
        new = pd.DataFrame.from_dict(df.to_dict(orient=o))
    except Exception:
        print(f"# Unable to read for {o}")
        continue
    try:
        same = (new == df).all().all()
    except Exception:
        same = False  # the comparison itself failed, e.g. mismatched labels
    if same:
        print(f"# Success for {o}")
    else:
        print(f"# Read, but data wrong for {o}")
# Success for dict
# Read, but data wrong for index
# Success for list
# Success for records
# Success for series
# Unable to read for split
# Unable to read for tight

So if to_dict is called with index or tight, from_dict needs an explicit orient argument; with index, omitting it silently gives incorrect results. to_dict(orient="split") can't be round-tripped at all as far as I can tell.
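
For the split case, one possible workaround is to reshape the payload before handing it to from_dict. The following is a minimal sketch in plain Python, assuming the split layout has the keys "index", "columns" and "data", with "data" holding a list of rows. `split_to_list_orient` is a hypothetical helper name used only for illustration; it is not a pandas or cudf function.

```python
def split_to_list_orient(payload):
    """Reshape a split-oriented dict into {"col": [values, ...]}.

    Hypothetical helper for illustration, not part of pandas or cudf.
    Assumes payload has "columns" (list of names) and "data" (list of rows).
    """
    columns = payload["columns"]
    rows = payload["data"]
    # Transpose the list of rows into one list per column.
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


split = {
    "index": [0, 1, 2],
    "columns": ["a", "b"],
    "data": [[1, 3], [2, 4], [3, 5]],
}
print(split_to_list_orient(split))
# {'a': [1, 2, 3], 'b': [3, 4, 5]}
```

The result is in the "list" layout, which both libraries read correctly in the runs shown in this comment. The row labels in payload["index"] are dropped, so this only round-trips frames that use the default index.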

In contrast, cudf:

import cudf as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
orient = {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'}
for o in sorted(orient):
    try:
        new = pd.DataFrame.from_dict(df.to_dict(orient=o))
    except Exception:
        print(f"# Unable to read for {o}")
        continue
    try:
        same = (new == df).all().all()
    except Exception:
        same = False  # the comparison itself failed, e.g. mismatched labels
    if same:
        print(f"# Success for {o}")
    else:
        print(f"# Read, but data wrong for {o}")
# Read, but data wrong for dict
# Read, but data wrong for index
# Success for list
# Success for records
# Success for series
# Unable to read for split
# Unable to read for tight

So the dict case is the only one that needs to be handled with some introspection of the input, I suspect.
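
A rough sketch of the kind of introspection this would involve, written in plain Python so it runs without cudf. The helper name `normalize_dict_orient` is hypothetical and is not a cudf or pandas API. It assumes every inner dict carries the same set of row labels, and it is not a description of how cudf actually resolved this.

```python
def normalize_dict_orient(data):
    """Turn {"col": {label: value}} into (index, {"col": [values, ...]}).

    Hypothetical helper for illustration, not a cudf API. Inputs that are
    not dict-of-dicts are returned unchanged with an index of None.
    """
    if not data or not all(isinstance(v, dict) for v in data.values()):
        return None, data

    # Take the row labels, in order, from the first column.
    index = list(next(iter(data.values())))
    for name, col in data.items():
        # Assumption: all columns share the same row labels.
        if set(col) != set(index):
            raise ValueError(f"column {name!r} has different row labels")

    columns = {name: [col[i] for i in index] for name, col in data.items()}
    return index, columns


rawdata = {"a": {0: 10, 1: 4, 2: 6}, "b": {0: 3, 1: 5, 2: 3}}
index, columns = normalize_dict_orient(rawdata)
print(index)    # [0, 1, 2]
print(columns)  # {'a': [10, 4, 6], 'b': [3, 5, 3]}
```

The returned columns are in the "list" layout that from_dict already reads correctly, and the index can be passed along separately. Unlike pandas, this sketch rejects columns with differing row labels rather than aligning them.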

vyasr · 2024-05-15T19:09:37Z

This no longer reproduces for me on the latest cudf:

In [24]: import cudf
    ...: 
    ...: data = {"a":[10,4,6], "b":[3,5,3]}
    ...: df = cudf.DataFrame.from_dict(data)
    ...: print(df)
    ...: 
    ...: rawdata = df.to_dict() # default of orient="dict"
    ...: print(rawdata)
    ...: 
    ...: print(cudf.DataFrame.from_dict(rawdata))
    a  b
0  10  3
1   4  5
2   6  3
{'a': {0: 10, 1: 4, 2: 6}, 'b': {0: 3, 1: 5, 2: 3}}
    a  b
0  10  3
1   4  5
2   6  3

beckernick added the bug and python labels Jun 23, 2023

beckernick changed the title ~~[BUG] DataFrame from_dict silently fails with inputs in the form of orient="dict"~~ [BUG] DataFrame from_dict silently fails and generates invalid data with inputs in the form of orient="dict" Jun 23, 2023

wence- added this to the Pandas API Alignment and Coverage milestone Jul 3, 2023

vyasr removed the python label Feb 23, 2024

vyasr closed this as completed May 15, 2024
