Adds support for 'Redshift' shredding format #32
@miike has signed the Software Grant and Corporate Contributor License Agreement |
Thanks @miike! The 'contexts' and 'unstruct_event' prefixes for the JSON objects probably seem a bit odd in hindsight. If I remember correctly, the original design was partly because contexts could be an array of a given type, while unstruct event is always a singleton. In pseudocode:
So prefixing them with
I'm not particularly attached to the 'contexts' and 'unstruct_event' prefixes - I just wanted to share their origins... |
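The distinction described above (contexts can repeat per type, an unstruct event is always a single object) can be sketched in Python. This is only an illustration under stated assumptions: the helper names `key_for` and `flatten` are hypothetical, and the `<prefix>_<vendor>_<name>_<model>` key layout is assumed for the example rather than taken from the SDK source.

```python
import re
from collections import defaultdict

# Matches Iglu schema URIs such as iglu:com.acme/context/jsonschema/1-0-0
SCHEMA_RE = re.compile(r"^iglu:([^/]+)/([^/]+)/jsonschema/(\d+)-\d+-\d+$")


def key_for(prefix, schema_uri):
    """Build a prefixed key (hypothetical layout: prefix_vendor_name_model)."""
    match = SCHEMA_RE.match(schema_uri)
    if match is None:
        raise ValueError("not an Iglu schema URI: %r" % schema_uri)
    vendor, name, model = match.groups()
    return "_".join([prefix, vendor.replace(".", "_").lower(), name, model])


def flatten(contexts, unstruct_event):
    """Contexts of one type collect into a list; the unstruct event stays a single object."""
    grouped = defaultdict(list)
    for ctx in contexts:
        grouped[key_for("contexts", ctx["schema"])].append(ctx["data"])
    result = dict(grouped)
    result[key_for("unstruct_event", unstruct_event["schema"])] = unstruct_event["data"]
    return result


if __name__ == "__main__":
    ctx = {"schema": "iglu:com.acme/context/jsonschema/1-0-0", "data": {"key": "value"}}
    ue = {"schema": "iglu:com.acme/event/jsonschema/1-0-0", "data": {"key": "value"}}
    print(flatten([ctx, ctx], ue))
```

The prefix lets a consumer tell from the key alone whether to expect a list or a single object.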
@chuwy - this looks pretty exciting! Can you do a first pass review on this please? |
@chuwy @alexanderdean What do you think of adding similar functionality (including data + schema) for the Scala Analytics SDK? Is there a bit of additional information we should include to reference the source (contexts/unstruct)? |
Sure - worth creating placeholder tickets to track the same idea in the Scala and .NET Analytics SDKs... |
Hey @miike, I'm very sorry you had to wait so long. The idea looks really great, though I agree with @alexanderdean that
{
  "event": {
    "app_id": "foo",
    "platform": "mob",
    ...
  },
  "unstruct_event": {
    "schema": "iglu:com.acme/event/jsonschema/1-0-0",
    "data": {"key": "value"}
  },
  "contexts": [
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    },
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    }
  ]
}
Here are some highlights:
Overall, I like this idea very much. |
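As a rough sketch of how the layout proposed above could be assembled and read back in Python: the function name `assemble_event` and its arguments are hypothetical and are not part of the SDK; the only thing taken from the comment is the three top-level keys (`event`, `unstruct_event`, `contexts`).

```python
def assemble_event(atomic_fields, unstruct_event=None, contexts=()):
    """Assemble the nested layout: atomic fields, one optional unstruct event, a list of contexts."""

    def check(sdj):
        # Each entity is expected to carry exactly a 'schema' and a 'data' key.
        if set(sdj) != {"schema", "data"}:
            raise ValueError("expected a self-describing JSON with 'schema' and 'data'")
        return sdj

    record = {
        "event": dict(atomic_fields),
        "contexts": [check(ctx) for ctx in contexts],
    }
    if unstruct_event is not None:
        record["unstruct_event"] = check(unstruct_event)
    return record


if __name__ == "__main__":
    record = assemble_event(
        {"app_id": "foo", "platform": "mob"},
        unstruct_event={"schema": "iglu:com.acme/event/jsonschema/1-0-0", "data": {"key": "value"}},
        contexts=[{"schema": "iglu:com.acme/context/jsonschema/1-0-0", "data": {"key": "value"}}],
    )
    # Schema and data travel together, so a consumer can dispatch on the schema URI.
    for ctx in record["contexts"]:
        print(ctx["schema"], ctx["data"])
```

Keeping `schema` next to `data` means the source of each entity stays recoverable without parsing key names.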
Commented in PR.
Thanks @chuwy for the detailed review. I agree with pretty much everything.
&&
The second one sounds odd because, as you say, the Redshift format is not really "Redshift" - it's just a pure-JSON intermediate form. As well as renaming, we should make this intermediate form self-describing and register its schema in Iglu Central.
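To illustrate what "self-describing" could mean for the intermediate form, here is a minimal sketch that wraps a record in a `schema`/`data` envelope. The schema URI below is a made-up placeholder (no such schema is registered in Iglu Central as far as this thread shows), and `wrap_self_describing` and `unwrap` are hypothetical helper names.

```python
import json

# Placeholder URI for illustration only; the real name and vendor are undecided in this thread.
INTERMEDIATE_SCHEMA = "iglu:com.example/intermediate_event/jsonschema/1-0-0"


def wrap_self_describing(record, schema=INTERMEDIATE_SCHEMA):
    """Wrap an intermediate record so that it declares its own schema."""
    return {"schema": schema, "data": record}


def unwrap(envelope, expected_schema=INTERMEDIATE_SCHEMA):
    """Return the payload, refusing envelopes that declare a different schema."""
    if envelope.get("schema") != expected_schema:
        raise ValueError("unexpected schema: %r" % envelope.get("schema"))
    return envelope["data"]


if __name__ == "__main__":
    record = {"event": {"app_id": "foo", "platform": "mob"}, "contexts": []}
    line = json.dumps(wrap_self_describing(record))
    print(unwrap(json.loads(line)) == record)
```

With the envelope in place, downstream loaders can validate the intermediate form against its schema before loading it anywhere.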
I think there are some Elasticsearch-isms in this format (the geopoint?), so I'm not so bothered by that name. |
Thanks guys - definitely agreed on coming up with some clearer names for this stuff. I'm not sure about the semi-nested format for |
I see what you are saying @miike... It's nice to be able to "dot operator all the things". Anton is off for the next fortnight, so I suspect we will return to this then. |
@alexanderdean Definitely nice - but I wonder if the contexts-as-a-column is too strictly opinionated? Possibly grounds for having multiple intermediate (self-describing) formats depending on what the use case is. See you in a fortnight Anton! |
This is a WIP but I'd like to get some comments/thoughts on it based on #31
This introduces the concept of the 'redshift' (or a more appropriate name) shredding which more closely follows what is shredded into Redshift in terms of the table design. This differs from the current implementation which is more consistent with how data is sunk into Elasticsearch.
The shredding format introduced below makes the following changes:
shred_format='elasticsearch' by default
As a side note, the code likely needs some refactoring/reformatting, and the existing docstrings for the methods have not been updated to reflect the new behaviour.