Improved fast serializer code generation logic for Enum type #54

Merged 2 commits on Jun 8, 2020

volauvent
Collaborator

Improves fast serialization speed under Avro 1.4 by avoiding the cost of storing and retrieving
Enum schemas in a HashMap. We have seen a ~44% reduction in serialization latency for an Enum array.

This PR helps to resolve issue #50
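
As a rough illustration of the change described above, the sketch below contrasts two shapes of generated code: one that looks the schema up in a map on every enum write, and one that resolves it once into a final field. EnumSchemaStub, MapBackedEnumWriter, and FieldBackedEnumWriter are hypothetical stand-ins invented for this sketch, not Avro or fast-avro classes, and the real generated serializers will differ in detail.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an enum schema; NOT org.apache.avro.Schema.
final class EnumSchemaStub {
    private final Map<String, Integer> ordinals = new HashMap<>();

    EnumSchemaStub(List<String> symbols) {
        for (int i = 0; i < symbols.size(); i++) {
            ordinals.put(symbols.get(i), i);
        }
    }

    // Throws NullPointerException for an unknown symbol; good enough for a sketch.
    int getEnumOrdinal(String symbol) {
        return ordinals.get(symbol);
    }
}

// "Before" shape (assumed): every write pays for a map lookup keyed by fingerprint.
final class MapBackedEnumWriter {
    private final Map<Long, EnumSchemaStub> enumSchemaMap = new ConcurrentHashMap<>();
    private final List<String> symbols;

    MapBackedEnumWriter(List<String> symbols) {
        this.symbols = symbols;
    }

    int ordinalOf(long fingerprint, String symbol) {
        EnumSchemaStub schema = enumSchemaMap.get(fingerprint);
        if (schema == null) {
            schema = new EnumSchemaStub(symbols);
            enumSchemaMap.put(fingerprint, schema);
        }
        return schema.getEnumOrdinal(symbol);
    }
}

// "After" shape (assumed): the schema is resolved once and kept in a final field.
final class FieldBackedEnumWriter {
    private final EnumSchemaStub enumSchema;

    FieldBackedEnumWriter(List<String> symbols) {
        this.enumSchema = new EnumSchemaStub(symbols);
    }

    int ordinalOf(String symbol) {
        return enumSchema.getEnumOrdinal(symbol);
    }
}

final class EnumWriterDemo {
    public static void main(String[] args) {
        List<String> symbols = Arrays.asList("SPADES", "HEARTS", "CLUBS");
        MapBackedEnumWriter before = new MapBackedEnumWriter(symbols);
        FieldBackedEnumWriter after = new FieldBackedEnumWriter(symbols);
        // Both shapes compute the same ordinal; only the lookup cost differs.
        System.out.println(before.ordinalOf(1L, "HEARTS") == after.ordinalOf("HEARTS"));
    }
}
```

Both writers return the same ordinals; the second simply skips the per-element map access, which is the cost this PR targets.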

JMH benchmark results for fast serialization of a 200-element EnumArray under Avro 1.4:

Before

FastAvroSerdesBenchmark.testFastAvroSerialization                                   avgt   10  12054.915 ± 1852.177   ns/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.alloc.rate                    avgt   10    255.063 ±   42.062  MB/sec
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.alloc.rate.norm               avgt   10   3352.001 ±    0.001    B/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Eden_Space           avgt   10    272.156 ±  362.710  MB/sec
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Eden_Space.norm      avgt   10   3489.396 ± 4814.784    B/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Survivor_Space       avgt   10      0.103 ±    0.477  MB/sec
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Survivor_Space.norm  avgt   10      1.505 ±    6.959    B/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.count                         avgt   10      6.000             counts
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.time                          avgt   10     46.000                 ms

After

FastAvroSerdesBenchmark.testFastAvroSerialization                                   avgt   10  6709.067 ±  225.951   ns/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.alloc.rate                    avgt   10   453.950 ±   14.888  MB/sec
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.alloc.rate.norm               avgt   10  3352.000 ±    0.001    B/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Eden_Space           avgt   10   452.412 ±  247.752  MB/sec
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Eden_Space.norm      avgt   10  3337.223 ± 1836.797    B/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Survivor_Space       avgt   10     0.104 ±    0.480  MB/sec
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.churn.PS_Survivor_Space.norm  avgt   10     0.802 ±    3.719    B/op
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.count                         avgt   10     9.000             counts
FastAvroSerdesBenchmark.testFastAvroSerialization:·gc.time                          avgt   10    37.000                 ms

@FelixGV
Collaborator

FelixGV commented May 29, 2020

Cool result! And how does it compare with vanilla Avro now? Is it equal or better?

Can we get a diff of the generated code? Actually, I'm wondering... should we make the build trigger the code-gen for a set of schemas we care about, and check the output into source control (not in the main resources, but perhaps in tests or in some kind of audit resource)? If we're going to make it part of the workflow to review the generated code, we might as well think of a way to really institutionalize it properly...
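
One way the idea above could look in practice is a "golden file" check: the build regenerates the source for a fixed set of schemas and fails when the output differs from the copy kept under source control. The sketch below uses only the JDK. generateSerializerSource, the GoldenFileCheck class, and the file naming are assumptions made up for illustration; they are not part of this project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class GoldenFileCheck {
    // Hypothetical placeholder for the real code generator.
    static String generateSerializerSource(String schemaName) {
        return "class " + schemaName + "Serializer { /* generated */ }\n";
    }

    // Returns true when the generated source matches the checked-in copy.
    // If no copy exists yet, the generated source is recorded and treated as a match.
    static boolean matchesGolden(Path goldenDir, String schemaName) throws IOException {
        Path golden = goldenDir.resolve(schemaName + "Serializer.java.txt");
        String generated = generateSerializerSource(schemaName);
        if (!Files.exists(golden)) {
            Files.write(golden, generated.getBytes(StandardCharsets.UTF_8));
            return true;
        }
        String expected = new String(Files.readAllBytes(golden), StandardCharsets.UTF_8);
        return expected.equals(generated);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("golden");
        System.out.println(matchesGolden(dir, "EnumArray")); // first call records the file
        System.out.println(matchesGolden(dir, "EnumArray")); // second call compares against it
    }
}
```

With something like this in the test tree, a change to the generator shows up as a reviewable diff of the golden files in the same PR.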

@volauvent
Collaborator Author

Fast-avro serialization under Avro 1.4 is now ~10% faster than vanilla Avro 1.4 for the Enum array. Below is the benchmark result for vanilla Avro:

FastAvroSerdesBenchmark.testAvroSerialization                                       avgt   10  7432.819 ±  279.983   ns/op
FastAvroSerdesBenchmark.testAvroSerialization:·gc.alloc.rate                        avgt   10   410.783 ±   15.542  MB/sec
FastAvroSerdesBenchmark.testAvroSerialization:·gc.alloc.rate.norm                   avgt   10  3360.000 ±    0.001    B/op
FastAvroSerdesBenchmark.testAvroSerialization:·gc.churn.PS_Eden_Space               avgt   10   402.187 ±  326.169  MB/sec
FastAvroSerdesBenchmark.testAvroSerialization:·gc.churn.PS_Eden_Space.norm          avgt   10  3303.211 ± 2694.281    B/op
FastAvroSerdesBenchmark.testAvroSerialization:·gc.churn.PS_Survivor_Space           avgt   10     0.024 ±    0.094  MB/sec
FastAvroSerdesBenchmark.testAvroSerialization:·gc.churn.PS_Survivor_Space.norm      avgt   10     0.201 ±    0.791    B/op
FastAvroSerdesBenchmark.testAvroSerialization:·gc.count                             avgt   10     8.000             counts
FastAvroSerdesBenchmark.testAvroSerialization:·gc.time                              avgt   10    22.000                 ms
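
The percentages quoted in this thread can be re-derived from the mean ns/op figures in the three benchmark runs. The snippet below is only that arithmetic (the class and method names are made up for illustration); it ignores the error bars, which are wide for the "before" run.

```java
final class SpeedupCheck {
    // Relative reduction in time per operation, expressed as a percentage.
    static double percentFaster(double baselineNs, double improvedNs) {
        return (baselineNs - improvedNs) / baselineNs * 100.0;
    }

    public static void main(String[] args) {
        double fastBefore = 12054.915; // fast-avro before this PR, ns/op
        double fastAfter = 6709.067;   // fast-avro after this PR, ns/op
        double vanilla = 7432.819;     // vanilla Avro 1.4, ns/op

        System.out.printf("after vs. before:  %.1f%%%n", percentFaster(fastBefore, fastAfter));
        System.out.printf("after vs. vanilla: %.1f%%%n", percentFaster(vanilla, fastAfter));
    }
}
```

The two results come out at roughly 44% and just under 10%, in line with the figures stated in the conversation.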

Here are the fast serializers generated for the EnumArray schema:

I like the idea of keeping the generated code under source control so that related changes can be reviewed more carefully.

@FelixGV
Collaborator

FelixGV commented Jun 1, 2020

Thanks for the gist... the change looks good. It looks quite weird to have a map lookup/set in this code. I wonder why it was done like this in the first place, and if that's a hint that there may be edge cases where this is important (I can't think of any, though).

Also, you'll notice that there is yet another map lookup hidden in the getEnumOrdinal function. This is kind of silly. If the EnumSymbol class kept an ordinal field within it, the lookup would be unnecessary. This is non-trivial to change because it would require extending or replacing some of the core Avro classes within GenericData, which we haven't done so far, but I had other reasons for wanting to do that in PR #45, so maybe we'll go there at some point... Anyhow, this is clearly out of scope for your current change.
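
To make the idea above concrete: if the symbol object carried its own ordinal, the serializer could read a field instead of doing a string-keyed lookup per element. OrdinalEnumSymbol below is a hypothetical class invented for this sketch. It is not Avro's GenericData.EnumSymbol, and wiring such a type into GenericData is the non-trivial part mentioned above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical symbol type that remembers its ordinal; not an Avro class.
final class OrdinalEnumSymbol {
    private final String symbol;
    private final int ordinal;

    private OrdinalEnumSymbol(String symbol, int ordinal) {
        this.symbol = symbol;
        this.ordinal = ordinal;
    }

    // The ordinal is resolved once, when the symbol is created.
    static OrdinalEnumSymbol of(List<String> schemaSymbols, String symbol) {
        int ordinal = schemaSymbols.indexOf(symbol);
        if (ordinal < 0) {
            throw new IllegalArgumentException("Unknown enum symbol: " + symbol);
        }
        return new OrdinalEnumSymbol(symbol, ordinal);
    }

    // Serialization can then read a plain field, with no lookup.
    int ordinal() {
        return ordinal;
    }

    @Override
    public String toString() {
        return symbol;
    }
}

final class OrdinalEnumSymbolDemo {
    public static void main(String[] args) {
        List<String> symbols = Arrays.asList("SPADES", "HEARTS", "CLUBS");
        OrdinalEnumSymbol hearts = OrdinalEnumSymbol.of(symbols, "HEARTS");
        System.out.println(hearts + " -> " + hearts.ordinal());
    }
}
```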

Collaborator

FelixGV left a comment

Looks good, but I'd like to suggest some readability / code-style improvement while we're in this section of the code, if you don't mind...

thenBody.invoke(enumSchemaMapField, "put").arg(JExpr.lit(enumSchemaFingerprint)).arg(enumSchemaVar);

codeModel.ref(Schema.class).staticInvoke("parse").arg(enumSchema.toString()));
enumSchemaVarMap.put(enumSchemaFingerprint, enumSchemaVar);
valueToWrite = JExpr.invoke(enumSchemaVar, "getEnumOrdinal").arg(enumValueCasted.invoke("toString"));

Collaborator

I find this code to be a bit spaghetti... could we instead follow a pattern of

JVar enumSchemaVar = enumSchemaVarMap.computeIfAbsent(enumSchemaFingerprint, fingerprint ->
    serializerClass.field(
        JMod.PRIVATE | JMod.FINAL,
        Schema.class,
        getVariableName(enumSchema.getName() + "EnumSchema"),
        codeModel.ref(Schema.class).staticInvoke("parse").arg(enumSchema.toString()))
);
valueToWrite = JExpr.invoke(enumSchemaVar, "getEnumOrdinal").arg(enumValueCasted.invoke("toString"));

... or something like that? Otherwise, the JExpr.invoke part is duplicated, and since it is meta-code, that makes things extra confusing :D ...
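
For reference, Map.computeIfAbsent takes the key plus a mapping function that receives that key, and it only invokes the function when the key has no value yet. Below is a small JDK-only sketch of the pattern, with hypothetical names (FieldCache, fieldFor) and plain strings standing in for the JVar fields:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache of "declared fields", keyed by schema fingerprint.
final class FieldCache {
    private final Map<Long, String> fieldsByFingerprint = new HashMap<>();
    private int declarations = 0;

    String fieldFor(long fingerprint, String enumName) {
        // The lambda runs only on the first request for a given fingerprint.
        return fieldsByFingerprint.computeIfAbsent(fingerprint, key -> {
            declarations++;
            return enumName + "EnumSchema";
        });
    }

    int declarations() {
        return declarations;
    }
}

final class FieldCacheDemo {
    public static void main(String[] args) {
        FieldCache cache = new FieldCache();
        cache.fieldFor(1L, "Suit");
        cache.fieldFor(1L, "Suit"); // cached, no second declaration
        cache.fieldFor(2L, "Color");
        System.out.println(cache.declarations());
    }
}
```

This keeps the "declare the field once" logic and the "use the field" logic on a single code path, which is the readability point being made in the comment.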

Collaborator Author

Thanks for the suggestion. Fixed.

Collaborator

Beautiful! Love the red tint :D !

volauvent force-pushed the improve-enum-ser-code-gen branch from 27c2c74 to 373916b on June 1, 2020 23:10
@volauvent
Collaborator Author

I wonder why it was done like this in the first place, and if that's a hint that there may be edge cases where this is important (I can't think of any, though)

Support for Avro 1.4 Enum type serialization was initially introduced here by @gaojieliu. Do you have any concerns?

Collaborator

FelixGV left a comment

LGTM! Thank you!

Improved fast serialization speed by avoiding the cost of storing and retrieving Enum schemas
in a Hashmap in Avro 1.4
volauvent force-pushed the improve-enum-ser-code-gen branch from 373916b to 14b17a3 on June 4, 2020 21:39
Collaborator

FelixGV left a comment

Thanks a lot Bingfeng! Looks great!

@@ -14,7 +11,6 @@
implements FastSerializer<IndexedRecord>
{

-    private Map<Long, Schema> enumSchemaMap = new ConcurrentHashMap<Long, Schema>();

Collaborator

Heh... deleting an unused variable... that sounds like the right call!

volauvent merged commit f5bf1aa into linkedin:master on Jun 8, 2020
3 participants