
Adds support for 'Redshift' shredding format #32

Closed
wants to merge 4 commits into from

Conversation

@miike commented Aug 17, 2017

This is a WIP but I'd like to get some comments/thoughts on it based on #31

This introduces the concept of 'redshift' shredding (or a more appropriate name), which more closely follows what is shredded into Redshift in terms of the table design. This differs from the current implementation, which is more consistent with how data is sunk into Elasticsearch.

The shredding format introduced below makes the following changes:

  • Removes the 'contexts' and 'unstruct_event' prefix for the JSON objects
  • Retains backwards compatibility by passing shred_format='elasticsearch' by default
  • Adds a nested schema object to contexts/unstruct_events which contains vendor, name, format and version
  • Adds a nested data object which contains the contents of the payload
  • Adds additional tests to clarify the behaviour of both shredding formats in partial and complete payloads

As a side note, the code likely needs some refactoring/reformatting, and the existing docstrings for the methods have not been updated to reflect the new behaviour.
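The changes listed above can be sketched roughly as follows. This is an illustrative sketch only, under the assumption that the envelope has the shape described in the bullet points; the function name and exact regex are not from the PR's actual code.

```python
import re

# Vendor may contain dots (e.g. com.acme); version is MODEL-REVISION-ADDITION.
SCHEMA_URI = re.compile(
    r"^iglu:([a-zA-Z0-9_.-]+)/([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)/(\d+-\d+-\d+)$"
)

def shred_redshift(schema_uri, payload):
    """Build the nested schema/data envelope proposed in this PR:
    a 'schema' object holding vendor, name, format and version, and a
    'data' object holding the payload, with no contexts_/unstruct_event_
    key prefix."""
    match = SCHEMA_URI.match(schema_uri)
    if match is None:
        raise ValueError("Not a valid Iglu schema URI: %s" % schema_uri)
    vendor, name, fmt, version = match.groups()
    return {
        "schema": {"vendor": vendor, "name": name, "format": fmt, "version": version},
        "data": payload,
    }

print(shred_redshift("iglu:com.acme/event/jsonschema/1-0-0", {"key": "value"}))
```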

@snowplowcla

@miike has signed the Software Grant and Corporate Contributor License Agreement

@alexanderdean (Member) commented Aug 17, 2017

Thanks @miike!

The 'contexts' and 'unstruct_event' prefixes for the JSON objects probably seem a bit odd in hindsight. If I remember correctly, the original design was partly because contexts could be an array of a given type, while unstruct event is always a singleton. In pseudocode:

+ enriched event
-> contexts(array[A], B, C, array[D])
-> unstruct_event(E)

So prefixing them with contexts or unstruct_event served a couple of purposes:

  1. Cheap (but clunky) way of indicating the source of this entity
  2. Tells you whether you can just access it as event.E or whether it has to be event.A[0] (i.e. an array)

I'm not particularly attached to the 'contexts' and 'unstruct_event' prefixes - I just wanted to share their origins...
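The two purposes of the prefix can be sketched like this. The helper below is an assumption about the general shape of the prefixed key scheme described above (snake-cased vendor and name plus the schema's major version), not the SDK's actual implementation.

```python
def prefixed_key(prefix, vendor, name, model):
    """Sketch of the 'contexts'/'unstruct_event' key scheme: the prefix
    tells you both where the entity came from and whether to expect an
    array (contexts) or a singleton (unstruct_event)."""
    def snake(s):
        return s.replace(".", "_").replace("-", "_").lower()
    return "%s_%s_%s_%s" % (prefix, snake(vendor), snake(name), model)

# Contexts of a given type land in an array under a 'contexts_' key...
print(prefixed_key("contexts", "com.acme", "context", 1))
# → contexts_com_acme_context_1  (access as event[key][0])

# ...while the unstruct event is always a singleton.
print(prefixed_key("unstruct_event", "com.acme", "event", 1))
# → unstruct_event_com_acme_event_1  (access as event[key])
```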

@alexanderdean (Member)

@chuwy - this looks pretty exciting! Can you do a first pass review on this please?

@chuwy chuwy self-requested a review August 23, 2017 09:29
@alexanderdean alexanderdean added this to the Version 0.3.0 milestone Aug 23, 2017
@miike (Author) commented Sep 5, 2017

@chuwy @alexanderdean What do you think of adding similar functionality (including data + schema) for the Scala Analytics SDK? Is there a bit of additional information we should include to reference the source (contexts/unstruct)?

@alexanderdean (Member)

Sure - worth creating placeholder tickets to track the same idea in the Scala and .NET Analytics SDKs...

@chuwy (Contributor) commented Sep 6, 2017

Hey @miike, I'm very sorry you had to wait so long. The idea looks really great, though I agree with @alexanderdean that the contexts_ prefix serves almost no purpose here. Therefore I'm thinking about something similar to the following structure:

{
  "event": {
    "app_id": "foo",
    "platform": "mob",
    ...
  },
  "unstruct_event": {
      "schema": "iglu:com.acme/event/jsonschema/1-0-0",
      "data": {"key": "value"}
  },
  "contexts": [
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    },
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    }
  ]
}

Here are some highlights:

  1. In your implementation shredded JSONs carry vendor, name, etc. as schema metadata. While it's quite convenient to access fields this way, that metadata is more common for schemas than for the data envelope.
    People too often confuse our data/schema envelopes, and we don't want to introduce even more confusion. Instead we should have (and already have in Iglu Scala Core, as parseSchemaKey) a function that safely converts an Iglu URI string into a vendor/name/format/version record.
  2. I see this format as a generalization of what we have now. In other words, given the functions parseSchemaKey and an imaginary schemaKeyToElasticsearchKey, we can derive the current format via EnrichedTsv -> Redshift -> Elasticsearch, since "Redshift" doesn't lose any data, unlike "Elasticsearch". I like it very much.
  3. Instead of Elasticsearch-style keys I propose to use two separate keys for the shredded types array. We just need a schema and we'll be able to query events with something like SchemaCriterion (iglu:com.acme/event/jsonschema/2-?-?).
  4. We really really need to come up with better names than "Redshift" and "Elasticsearch". Better now than later.

Overall, I like this idea very much.
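The parseSchemaKey and SchemaCriterion ideas from the highlights above could look roughly like this in Python. This is a sketch under stated assumptions: the names mirror Iglu Scala Core's parseSchemaKey and SchemaCriterion, but the Python API here is hypothetical, not the SDK's.

```python
import re
from collections import namedtuple

SchemaKey = namedtuple("SchemaKey", "vendor name format model revision addition")

def parse_schema_key(uri):
    """Safely convert an Iglu URI string into a vendor/name/format/version
    record, as chuwy describes for Iglu Scala Core's parseSchemaKey."""
    m = re.match(r"^iglu:([^/]+)/([^/]+)/([^/]+)/(\d+)-(\d+)-(\d+)$", uri)
    if m is None:
        raise ValueError("Invalid Iglu URI: %s" % uri)
    vendor, name, fmt, model, rev, add = m.groups()
    return SchemaKey(vendor, name, fmt, int(model), int(rev), int(add))

def matches_criterion(uri, criterion):
    """SchemaCriterion-style check, where '?' in the version wildcards
    a component, e.g. 'iglu:com.acme/event/jsonschema/2-?-?'."""
    key = parse_schema_key(uri)
    c_vendor, c_name, c_fmt, c_ver = criterion[len("iglu:"):].split("/")
    version_pairs = zip((key.model, key.revision, key.addition), c_ver.split("-"))
    return (key.vendor == c_vendor and key.name == c_name and key.format == c_fmt
            and all(p == "?" or int(p) == v for v, p in version_pairs))

print(matches_criterion("iglu:com.acme/event/jsonschema/2-1-0",
                        "iglu:com.acme/event/jsonschema/2-?-?"))  # → True
```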

@chuwy (Contributor) left a review comment:

Commented in PR.

@alexanderdean (Member) commented Sep 6, 2017

Thanks @chuwy for the detailed review.

I agree with pretty much everything.

We really really need to come up with better names than "Redshift" and "Elasticsearch". Better now than later.

&&

we can derive current format via EnrichedTsv -> Redshift -> Elasticsearch

The second one sounds odd because, as you say, the Redshift format is not really "Redshift" - it's just a pure-JSON intermediate form. As well as renaming, we should make this intermediate form self-describing and register its schema in Iglu Central.

better names than ... "Elasticsearch"

I think there are some Elasticsearch-isms in this format (the geopoint?), so I'm not so bothered by that name.

@miike (Author) commented Sep 6, 2017

Thanks guys - definitely agreed on coming up with some clearer names for this stuff.

I'm not sure about the semi-nested format for contexts. It definitely makes sense from a data-structure point of view, but it may make it more difficult/intensive for applications using the analytics SDK to read off what they are interested in (e.g., if only one context is of interest, you still need to iterate through five additional contexts in the contexts array, which seems like more work). The other reason I'm leaning towards the context-as-a-column model is that it gives a predictable, structured format (or is this the responsibility of the downstream consumer?) that can be used to sink into databases like BigQuery.
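The lookup cost described above can be made concrete with a short sketch. The helper name and the envelope shape are assumptions based on chuwy's example earlier in this thread, not SDK code: with the semi-nested format, pulling out one context type means scanning the whole contexts array rather than a single dictionary access.

```python
def get_contexts(event, vendor, name):
    """Return the data of every context matching vendor/name. Note the
    linear scan over all contexts, versus the O(1) key lookup that the
    context-as-a-column model would allow."""
    prefix = "iglu:%s/%s/" % (vendor, name)
    return [c["data"] for c in event.get("contexts", [])
            if c["schema"].startswith(prefix)]

event = {
    "contexts": [
        {"schema": "iglu:com.acme/context/jsonschema/1-0-0", "data": {"key": "a"}},
        {"schema": "iglu:com.other/geo/jsonschema/1-0-0", "data": {"lat": 1}},
    ]
}
print(get_contexts(event, "com.acme", "context"))  # → [{'key': 'a'}]
```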

@alexanderdean (Member)

I see what you are saying @miike... It's nice to be able to "dot operator all the things".

Anton is off for the next fortnight, so I suspect we will return to this then.

@miike (Author) commented Sep 6, 2017

@alexanderdean Definitely nice - but I wonder if the contexts-as-a-column is too strictly opinionated? Possibly grounds for having multiple intermediate (self-describing) formats depending on what the use case is. See you in a fortnight Anton!

@poplindata poplindata closed this by deleting the head repository Feb 28, 2024
5 participants