Retrieve loaded data as pandas.DataFrame #7

leferrad · 2020-03-27T02:18:57Z

I think this library could be a great alternative to pyarrow and pyspark for reading and writing ORC files without pulling in a big library just for that (not to mention the current issues that pyarrow is having, and the overhead of loading a SparkSession just to read/write data). Therefore, to make it usable in most data processing applications, an easy connection with pandas (the most popular library for local processing of tabular data) would be convenient.
I'm new to this library, but I was evaluating two options:

As a method, to indicate that data should be retrieved as a pandas.DataFrame (that requires adding pandas as a dependency, which may not be desired)
Just an example in the documentation, to let interested users understand how an ORC file could be loaded as a pandas.DataFrame.

So far, I've solved that through this snippet (from what I could understand of the library):

import pandas as pd
import pyorc

path_to_data = "path/to/data.orc"

with open(path_to_data, "rb") as f:
    reader = pyorc.Reader(f)
    # "fields" is a dict, so sort the column names by column id to ensure the correct order
    columns = sorted(reader.schema.fields, key=reader.schema.find_column_id)
    df = pd.DataFrame(reader, columns=columns)

noirello · 2020-03-29T16:15:06Z

Honestly, I'm not that familiar with pandas. I have to look into it more, but the simplest solution would be to add pandas as an extra to the module, and inherit a new Reader/Writer from the existing ones with methods that can read and write pandas dataframes. There's probably a need for special converter functions for pandas' special types as well.

As you've already noticed, my goal with this library is to be a simple ORC reader and writer with the smallest possible overhead. A few smaller tasks are on my todo list now, but I'm not against the idea. It would be best if someone with a better knowledge of pandas could contribute. 😉

leferrad · 2020-03-30T20:47:25Z

So, what do you suggest to solve this issue? These are the options, I guess:

PandasReader & PandasWriter (I don't like this one)
"as_pandas" as a method or argument on the reader (but this adds pandas as a dependency)
Just an example of how to get a pandas DataFrame from loaded data.
Nothing to do (just add a comment to let users understand the scope of pyorc)

Let me know your thoughts on this and then I can make a PR for it.

noirello · 2020-03-31T22:16:12Z

You're probably right, separate classes for reading and writing are overkill. Maybe two simple functions defined in a submodule would be sufficient.

I don't want to add pandas as a default requirement to the module.
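For illustration, a helper function that keeps pandas optional could look roughly like the sketch below. It is only a minimal sketch under stated assumptions: the name `rows_to_dataframe` is hypothetical and not part of pyorc, and pandas is imported lazily inside the function so that importing the module itself never requires it.

```python
import importlib.util


def rows_to_dataframe(rows, columns):
    """Build a pandas.DataFrame from an iterable of row tuples.

    Hypothetical helper, not part of pyorc. pandas is imported only when
    the function is called, so it remains an optional dependency.
    """
    if importlib.util.find_spec("pandas") is None:
        raise ImportError(
            "pandas is required for rows_to_dataframe; please install it separately"
        )
    import pandas as pd

    # list() materialises the rows, so any iterable of tuples is accepted
    return pd.DataFrame(list(rows), columns=list(columns))
```

With pandas installed, calling `rows_to_dataframe([(1, "a"), (2, "b")], ["id", "name"])` gives a two-row DataFrame; without it, the call fails with an ImportError that names the missing package.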

leferrad · 2020-04-01T20:47:10Z

OK, so you prefer to add functions/methods for that, but not to require pandas to use pyorc. I could add this behavior without adding pandas as a requirement and raise an error if pandas is not installed, but that would be very bad practice. Therefore, the best option is to just add an example with pandas to start with, and if the example is not enough you could consider expanding the scope of pyorc to integrate more closely with pandas (which is my suggestion, since it's a pretty common requirement in most data processing libraries). Let me know your thoughts and I can make a PR accordingly (either adding an example or adding methods with pandas).

noirello · 2020-04-02T06:34:42Z

I think an example would be great as a start. A PR about that would be much appreciated. Thank you.

fehtemam · 2020-12-10T18:33:02Z

Can someone please include a short example of how to use converters in the Reader? I tried really hard to figure this out but I couldn't. I'm reading a file like this where orc_bytes is of class bytes:
orc = pyorc.Reader(fileo=io.BytesIO(orc_bytes))
This works fine and I can convert the resulting Reader object to a pandas dataframe. Now I am trying to add a converter to Reader to convert decimal to float upon reading. I know it needs a dictionary with keys being TypeKind but I can't figure out how to pass the dictionary values. So I am stuck at:
orc = pyorc.Reader(fileo=io.BytesIO(orc_bytes), converters={pyorc.TypeKind.DECIMAL: ???})
I found an example in the test_reader.py file (https://github.com/noirello/pyorc/blob/master/tests/test_reader.py#L325) that uses ORCConverter to define a class with a from_orc method for TypeKind.TIMESTAMP, but I have no idea how this should be defined to convert decimal to float. Any help please?

fehtemam · 2020-12-10T20:18:57Z

This is what I have so far but it is not doing any conversion:

import io

import numpy as np
import pyorc
from pyorc.converters import ORCConverter

class TypeConverter(ORCConverter):
    @staticmethod
    def from_orc(decimal_input):
        return np.array(decimal_input, dtype=float)

# orc_bytes holds the raw ORC file content as bytes
orc = pyorc.Reader(fileo=io.BytesIO(orc_bytes), converters={pyorc.TypeKind.DECIMAL: TypeConverter})

Any suggestions?!

noirello · 2020-12-15T21:51:39Z

I updated the docs about ORCConverter

Your converter above should return a numpy array with a float in it, for every item in a decimal ORC column.
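If a plain Python float per value is wanted instead of a numpy array, a converter could be sketched as below. This is a hedged sketch, not the documented approach: the class name `DecimalToFloat` is hypothetical, it assumes the reader passes each decimal value as a string (or something else `float()` accepts), and it uses `*_extra` to absorb any additional positional arguments, such as a timezone, that some pyorc versions may pass. The `try`/`except` fallback exists only so the sketch can be run without pyorc installed.

```python
try:
    from pyorc.converters import ORCConverter
except ImportError:
    # Fallback so this sketch can be run without pyorc installed.
    ORCConverter = object


class DecimalToFloat(ORCConverter):
    """Hypothetical converter that reads ORC decimal values as floats."""

    @staticmethod
    def from_orc(decimal_input, *_extra):
        # Assumption: decimal_input is a string such as "12.50";
        # *_extra ignores any further arguments a given version may pass.
        return float(decimal_input)

    @staticmethod
    def to_orc(*args):
        # Writing decimals is out of scope for this read-only sketch.
        raise NotImplementedError("DecimalToFloat only supports reading")
```

It would be passed the same way as in the snippet above, for example `converters={pyorc.TypeKind.DECIMAL: DecimalToFloat}`. Note that converting to float gives up the exact precision that decimal columns carry.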

fehtemam · 2020-12-15T23:19:59Z

@noirello Thanks a lot! Highly appreciated.
