-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Implement msgspec encoding #2541
base: main
Are you sure you want to change the base?
refactor: Implement msgspec encoding #2541
Conversation
CodSpeed Performance ReportMerging #2541 will improve performances by ×12Comparing Summary
Benchmarks breakdown
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2541 +/- ##
==========================================
+ Coverage 91.29% 91.35% +0.06%
==========================================
Files 62 63 +1
Lines 5190 5227 +37
Branches 669 669
==========================================
+ Hits 4738 4775 +37
Misses 319 319
Partials 133 133 ☔ View full report in Codecov by Sentry. |
97de9f7
to
1d9e947
Compare
1d9e947
to
4febeba
Compare
4febeba
to
cbe10bd
Compare
8876b80
to
f691f78
Compare
I found that it helped to add a
|
In the https://jcristharif.com/msgspec/perf-tips.html#line-delimited-json json.py:
SingerWriter:
|
Do you mean in sdk/singer_sdk/_singerlib/encoding/_msgspec.py Lines 73 to 94 in 08b58bf
? |
29dea7a
to
d23a8ab
Compare
Yes, that is exactly what I meant. Could have definitely been stated clearer on my part😅.
|
d23a8ab
to
3169b58
Compare
Naive of me to think I could get this across in 1/2 a day of work 😅. I'll come back to this later, there's plenty of time until the planned release date. |
Like the pun 😊. Great dad joke material. Kind an inside joke now since you dropped (naive) from the title of the PR. |
Might have to punt this if jcrist/msgspec#711 doesn't get merged before we're ready to officially declare Python 3.13 support |
Well, found that making this generic by using
In summary I had to change the tap's overwrite of |
712466f
to
07f348e
Compare
Ok, the tests are passing. Now I want to think of how to make it easy and straightforward for a developer to use msgspec as the SerDe layer, and also keep the door open to the user being the one deciding which serialization layer to use. |
@sourcery-ai review |
Reviewer's Guide by SourceryThis pull request refactors the SDK to use msgspec for encoding and decoding Singer messages, which should improve performance. It also updates the test suite to use the new encoding. Sequence diagram for MsgSpec message encoding flowsequenceDiagram
participant T as Tap
participant MW as MsgSpecWriter
participant SO as System Output
T->>MW: write_message(message)
activate MW
MW->>MW: serialize_message(message)
MW->>MW: format_message()
MW->>SO: write to stdout buffer
MW->>SO: flush stdout
deactivate MW
Class diagram for MsgSpec encoding implementationclassDiagram
class GenericSingerReader~T~ {
<<abstract>>
+default_input
+deserialize_json(line: str)* dict
}
class GenericSingerWriter~T, M~ {
<<abstract>>
+serialize_message(message: M)* T
+write_message(message: M) void
}
class MsgSpecReader {
+default_input: sys.stdin
+deserialize_json(line: str) dict
}
class MsgSpecWriter {
+serialize_message(message: Message) bytes
+write_message(message: Message) void
}
class Message {
+to_dict() dict
}
GenericSingerReader <|-- MsgSpecReader
GenericSingerWriter <|-- MsgSpecWriter
MsgSpecWriter ..> Message
note for MsgSpecReader "Uses msgspec for fast JSON deserialization"
note for MsgSpecWriter "Uses msgspec for fast JSON serialization"
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @edgarrmondragon - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider making the IO implementation an attribute of the Singer class rather than using multiple inheritance, to avoid MRO ordering issues. This would provide a cleaner and more explicit design.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟡 Security: 1 issue found
- 🟡 Testing: 2 issues found
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
||
encoder = msgspec.json.Encoder(enc_hook=enc_hook, decimal_format="number") | ||
decoder = msgspec.json.Decoder(dec_hook=dec_hook, float_hook=decimal.Decimal) | ||
_jsonl_msg_buffer = bytearray(64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚨 issue (security): The global bytearray buffer could cause thread-safety issues and the fixed size might be insufficient for larger messages
Consider using a thread-local buffer or creating a new buffer for each message. Also, the buffer size should either be dynamic or much larger to handle varying message sizes safely.
from singer_sdk._singerlib.encoding._msgspec import ( | ||
MsgSpecReader, | ||
MsgSpecWriter, | ||
dec_hook, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Test coverage for encoding helper functions.
It would be beneficial to add tests specifically for enc_hook
and dec_hook
to ensure they handle various data types and edge cases correctly. For example, test with datetimes, decimals, and other non-native JSON types.
Suggested implementation:
import decimal
import io
from datetime import datetime, timezone
from contextlib import nullcontext, redirect_stdout
from textwrap import dedent
import pytest
from singer_sdk._singerlib.exceptions import InvalidInputLine
def test_enc_hook_datetime():
"""Test encoding datetime objects."""
dt = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
result = enc_hook(dt)
assert result == "2023-01-01T12:00:00+00:00"
def test_enc_hook_decimal():
"""Test encoding decimal objects."""
dec = decimal.Decimal("123.45")
result = enc_hook(dec)
assert result == "123.45"
def test_enc_hook_none():
"""Test encoding None values."""
result = enc_hook(None)
assert result is None
def test_dec_hook_datetime():
"""Test decoding datetime strings."""
dt_str = "2023-01-01T12:00:00+00:00"
result = dec_hook(dt_str)
assert isinstance(result, datetime)
assert result == datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
def test_dec_hook_decimal():
"""Test decoding decimal strings."""
dec_str = "123.45"
result = dec_hook(dec_str)
assert isinstance(result, decimal.Decimal)
assert result == decimal.Decimal("123.45")
def test_dec_hook_none():
"""Test decoding None values."""
result = dec_hook(None)
assert result is None
def test_dec_hook_invalid_datetime():
"""Test decoding invalid datetime strings."""
invalid_dt = "not-a-datetime"
result = dec_hook(invalid_dt)
assert result == invalid_dt # Should return original string if parsing fails
def test_enc_hook_unsupported_type():
"""Test encoding unsupported type."""
class UnsupportedType:
pass
obj = UnsupportedType()
with pytest.raises(TypeError):
enc_hook(obj)
|
||
|
||
def test_write_message(): | ||
writer = SingerWriter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Benchmark tests should include both simple and msgspec encoding.
Now that we have two encoding methods, the benchmark tests should compare both SimpleSingerWriter
and MsgSpecWriter
to assess the performance impact of the change.
Suggested implementation:
def test_bench_format_message_msgspec(benchmark, bench_record_message: RecordMessage):
"""Run benchmark for message formatting using MsgSpecWriter."""
number_of_runs = 1000
writer = MsgSpecWriter()
def test_bench_format_message_simple(benchmark, bench_record_message: RecordMessage):
"""Run benchmark for message formatting using SimpleSingerWriter."""
number_of_runs = 1000
writer = SimpleSingerWriter()
You'll also need to:
- Import SimpleSingerWriter if not already imported
- Ensure bench_record_message fixture is compatible with both writers
@@ -111,7 +111,12 @@ def get_standard_target_tests( | |||
return [] | |||
|
|||
|
|||
def tap_sync_test(tap: Tap) -> tuple[io.StringIO, io.StringIO]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (complexity): Consider creating a type alias to simplify the function signatures.
The type annotations have become overly verbose while not adding proportional value. Create a type alias to simplify the signatures while maintaining type safety:
# At the top of the file
SingerIO = io.TextIOWrapper[io.BytesIO]
# Then simplify function signatures like:
def tap_sync_test(
tap: Tap,
) -> tuple[SingerIO, SingerIO]:
stdout_buf = SingerIO(io.BytesIO(), encoding="utf-8")
...
def tap_to_target_sync_test(
tap: Tap,
target: Target,
) -> tuple[SingerIO, SingerIO, SingerIO, SingerIO]:
...
This maintains the same type safety while improving readability and making future type changes easier to maintain.
SQLiteTap(MsgSpecWriter, SQLTap)
and notSQLiteTap(SQLTap, MsgSpecWriter)
. A better approach might be to make the IO implementation an attribute of the Singer class.📚 Documentation preview 📚: https://meltano-sdk--2541.org.readthedocs.build/en/2541/
Summary by Sourcery
Implement
msgspec
encoding for improved performance.Enhancements:
msgspec
for serialization and deserialization.Tests:
msgspec
implementation.