Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UnionChecker to Optimize Union or | Type Deserialization for Common Python Types (e.g., int, Enum, Literal) #166

Open
rnag opened this issue Dec 8, 2024 · 0 comments
Labels
Great To Have! Important / Worth Addressing help wanted Extra attention is needed performance self-created Opened by me!

Comments

@rnag
Copy link
Owner

rnag commented Dec 8, 2024

Adding my thoughts here in the form of pseudo-code. I think one approach can be to create a class UnionChecker or similar in loaders.py or a new module. It would be modeled similar to LoadMixIn and have a method for each common type that would be in a Union or | in a type annotation, such as int, float, enum, datetime, ... and so on.

Again, pseudo-code of how that could look:

class UnionChecker:

  def parse_int_if_valid(o: int | str | float) -> int | None:
    if type(o) is int: return o
    if type(o) is float: return int(round(o))
    if type(o) is str: return int(o) if o.removeprefix('-').isdigit() else None
    # catch-all return statement
    return None

  def parse_enum_if_valid(o: Any) -> Enum | None:
    if type(o) is Enum Class: return o
    return enum_class(o) if o is a valid value for Enum Class else None
    # catch-all return statement
    return None

Then, in parsers.py where we call Parser.__contains__ in UnionParser.__call__:

class UnionParser:
    ...
    def __call__(self, o: Any) -> Optional[T]:
        ...
        for parser in self.parsers:
   >        if o in parser:
                return parser(o)

We can replace that check with something like:

for parser in parsers:
  possible_value = parse_to_type_if_valid(o, base_type)
  if possible_value is not None:
    return possible_value

This should give a good compromise b/w efficiency and also fix the parsing support for Union, which is currently not fully supported in this library, and which I would like to address.

Some Benchmarks

Also, just in case someone tells me I'm worried about micro-optimizations which is possible, I made a small benchmark to test validating vs blind parsing for common types like Enum and int:

from enum import Enum
from timeit import timeit


class MyEnum(Enum):
    VAL_1 = 'ONE PLEASE'
    VAL_2 = 'TWO IS GREAT'
    VAL_3 = "THREE'S AWESOME"
    VAL_4 = 97


def is_enum1(o):
    try:
        _ = MyEnum(o)
        return True
    except ValueError:
        return False


def is_enum2(o):
    return o in MyEnum._value2member_map_


def is_int1(o):
    try:
        _ = int(o)
        return True
    except ValueError:
        return False


def is_int2(o: str):
    return o.removeprefix('-').isdigit()


def is_float1(o):
    try:
        _ = float(o)
        return True
    except ValueError:
        return False


def is_float2(o: str):
    return o.removeprefix('-').replace('.', '', 1).isdigit()


n = 1_000


print(timeit('is_enum1("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_enum2("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_float1("54.321")', globals=globals(), number=n))
print(timeit('is_float2("54.321")', globals=globals(), number=n))
print(timeit("is_int1('123')", globals=globals(), number=n))
print(timeit("is_int2('123')", globals=globals(), number=n))
# 0.00019675004296004772
# 4.5042019337415695e-05
# 6.279093213379383e-05
# 8.941697888076305e-05
# 8.983299994724803e-05
# 4.8417001380585134e-05
print()
print(timeit('is_enum1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_enum2("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float2("OOPS!!")', globals=globals(), number=n))
print(timeit("is_int1('1.abc')", globals=globals(), number=n))
print(timeit("is_int2('1.abc')", globals=globals(), number=n))
# 0.0014168331399559975
# 3.700004890561104e-05
# 0.00046787504106760025
# 6.270897574722767e-05
# 0.000634250001894543
# 4.758400245918892e-05

assert is_enum1("OOPS!!") is is_enum2("OOPS!!") is False
assert is_enum1("TWO IS GREAT") is is_enum2("TWO IS GREAT") is True
assert is_float1("OOPS!!") is is_float2("OOPS!!") is False
assert is_float1("54.321") is is_float2("54.321") is True
assert is_int1('123') is is_int2('123') is True
assert is_int1('1.abc') is is_int2('1.abc') is False

Results show that validating values for types like Enum and int is an overall better approach. There are some types like float, for which it might be better to try parsing using float constructor, but overall validating the value/type first seems a solid approach to have.

Hope this helps - I currently don't have time to implement this myself, which is why I'm jotting my thoughts down, in case anyone wants to take a stab at it. I'm also going to update the issue with "help wanted" label. For now, I'm going to focus on addressing other issues with the Dataclass Wizard library, but I'll keep a tab on this issue and will hopefully be back to look at it with fresh eyes sometime soon.

Originally posted by @rnag in #67 (comment)

@rnag rnag added help wanted Extra attention is needed performance Great To Have! Important / Worth Addressing self-created Opened by me! labels Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Great To Have! Important / Worth Addressing help wanted Extra attention is needed performance self-created Opened by me!
Projects
None yet
Development

No branches or pull requests

1 participant