
Adds support for 'Redshift' shredding format #32

Closed
wants to merge 4 commits into from

Conversation

@miike commented Aug 17, 2017

This is a WIP but I'd like to get some comments/thoughts on it based on #31

This introduces the concept of 'redshift' shredding (or a more appropriate name), which more closely follows what is shredded into Redshift in terms of the table design. This differs from the current implementation, which is more consistent with how data is sunk into Elasticsearch.

The shredding format introduced below makes the following changes:

  • Removes the 'contexts' and 'unstruct_event' prefix for the JSON objects
  • Retains backwards compatibility by passing shred_format='elasticsearch' by default
  • Adds a nested schema object to contexts/unstruct_events which contains vendor, name, format and version
  • Adds a nested data object which contains the contents of the payload
  • Adds additional tests to clarify the behaviour of both shredding formats in partial and complete payloads

As a side note, the code likely needs some refactoring/reformatting, and the existing docstrings for the methods have not been updated to reflect the new behaviour.
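The changes listed above can be sketched roughly as follows. This is an illustrative sketch only, under the assumption that the envelope has the shape described in the bullet points; the function name and exact regex are not from the PR's actual code.

```python
import re

# Vendor may contain dots (e.g. com.acme); version is MODEL-REVISION-ADDITION.
SCHEMA_URI = re.compile(
    r"^iglu:([a-zA-Z0-9_.-]+)/([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)/(\d+-\d+-\d+)$"
)

def shred_redshift(schema_uri, payload):
    """Build the nested schema/data envelope proposed in this PR:
    a 'schema' object holding vendor, name, format and version, and a
    'data' object holding the payload, with no contexts_/unstruct_event_
    key prefix."""
    match = SCHEMA_URI.match(schema_uri)
    if match is None:
        raise ValueError("Not a valid Iglu schema URI: %s" % schema_uri)
    vendor, name, fmt, version = match.groups()
    return {
        "schema": {"vendor": vendor, "name": name, "format": fmt, "version": version},
        "data": payload,
    }

print(shred_redshift("iglu:com.acme/event/jsonschema/1-0-0", {"key": "value"}))
```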

@snowplowcla

@miike has signed the Software Grant and Corporate Contributor License Agreement

@alexanderdean (Member) commented Aug 17, 2017

Thanks @miike!

The 'contexts' and 'unstruct_event' prefixes for the JSON objects probably seem a bit odd in hindsight. If I remember correctly, the original design was partly because contexts could be an array of a given type, while unstruct event is always a singleton. In pseudocode:

+ enriched event
-> contexts(array[A], B, C, array[D])
-> unstruct_event(E)

So prefixing them with contexts or unstruct_event served a couple of purposes:

  1. Cheap (but clunky) way of indicating the source of this entity
  2. Tells you whether you can just access it as event.E or whether it has to be event.A[0] (i.e. an array)

I'm not particularly attached to the 'contexts' and 'unstruct_event' prefixes - I just wanted to share their origins...
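The two purposes of the prefix can be sketched like this. The helper below is an assumption about the general shape of the prefixed key scheme described above (snake-cased vendor and name plus the schema's major version), not the SDK's actual implementation.

```python
def prefixed_key(prefix, vendor, name, model):
    """Sketch of the 'contexts'/'unstruct_event' key scheme: the prefix
    tells you both where the entity came from and whether to expect an
    array (contexts) or a singleton (unstruct_event)."""
    def snake(s):
        return s.replace(".", "_").replace("-", "_").lower()
    return "%s_%s_%s_%s" % (prefix, snake(vendor), snake(name), model)

# Contexts of a given type land in an array under a 'contexts_' key...
print(prefixed_key("contexts", "com.acme", "context", 1))
# → contexts_com_acme_context_1  (access as event[key][0])

# ...while the unstruct event is always a singleton.
print(prefixed_key("unstruct_event", "com.acme", "event", 1))
# → unstruct_event_com_acme_event_1  (access as event[key])
```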

@alexanderdean (Member)

@chuwy - this looks pretty exciting! Can you do a first pass review on this please?

@chuwy chuwy self-requested a review August 23, 2017 09:29
@alexanderdean alexanderdean added this to the Version 0.3.0 milestone Aug 23, 2017
@miike (Author) commented Sep 5, 2017

@chuwy @alexanderdean What do you think of adding similar functionality (including data + schema) for the Scala Analytics SDK? Is there a bit of additional information we should include to reference the source (contexts/unstruct)?

@alexanderdean (Member)

Sure - worth creating placeholder tickets to track the same idea in the Scala and .NET Analytics SDKs...

@chuwy (Contributor) commented Sep 6, 2017

Hey @miike, I'm very sorry you had to wait so long. The idea looks really great, though I agree with @alexanderdean that the contexts_ prefix serves almost no purpose here. Therefore I'm thinking about something similar to the following structure:

{
  "event": {
    "app_id": "foo",
    "platform": "mob",
    ...
  },
  "unstruct_event": {
      "schema": "iglu:com.acme/event/jsonschema/1-0-0",
      "data": {"key": "value"}
  },
  "contexts": [
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    },
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    }
  ]
}

Here are some highlights:

  1. In your implementation shredded JSONs carry vendor, name, etc. as schema metadata. While it's quite convenient to access fields this way, that metadata is more common for schemas than for the data envelope.
    People too often confuse our data/schema envelopes, and we don't want to introduce even more confusion. Instead we should have (and already have in Iglu Scala Core, as parseSchemaKey) a function that safely converts an Iglu URI string into a vendor/name/format/version record.
  2. I see this format as a generalization of what we have now. In other words, given the functions parseSchemaKey and an imaginary schemaKeyToElasticsearchKey, we can derive the current format via EnrichedTsv -> Redshift -> Elasticsearch, since "Redshift" doesn't lose any data, unlike "Elasticsearch". I like it very much.
  3. Instead of Elasticsearch-style keys I propose to use two separate keys for the shredded types array. We just need a schema and we'll be able to query events with something like SchemaCriterion (iglu:com.acme/event/jsonschema/2-?-?).
  4. We really really need to come up with better names than "Redshift" and "Elasticsearch". Better now than later.

Overall, I like this idea very much.
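The parseSchemaKey and SchemaCriterion ideas from the highlights above could look roughly like this in Python. This is a sketch under stated assumptions: the names mirror Iglu Scala Core's parseSchemaKey and SchemaCriterion, but the Python API here is hypothetical, not the SDK's.

```python
import re
from collections import namedtuple

SchemaKey = namedtuple("SchemaKey", "vendor name format model revision addition")

def parse_schema_key(uri):
    """Safely convert an Iglu URI string into a vendor/name/format/version
    record, as chuwy describes for Iglu Scala Core's parseSchemaKey."""
    m = re.match(r"^iglu:([^/]+)/([^/]+)/([^/]+)/(\d+)-(\d+)-(\d+)$", uri)
    if m is None:
        raise ValueError("Invalid Iglu URI: %s" % uri)
    vendor, name, fmt, model, rev, add = m.groups()
    return SchemaKey(vendor, name, fmt, int(model), int(rev), int(add))

def matches_criterion(uri, criterion):
    """SchemaCriterion-style check, where '?' in the version wildcards
    a component, e.g. 'iglu:com.acme/event/jsonschema/2-?-?'."""
    key = parse_schema_key(uri)
    c_vendor, c_name, c_fmt, c_ver = criterion[len("iglu:"):].split("/")
    version_pairs = zip((key.model, key.revision, key.addition), c_ver.split("-"))
    return (key.vendor == c_vendor and key.name == c_name and key.format == c_fmt
            and all(p == "?" or int(p) == v for v, p in version_pairs))

print(matches_criterion("iglu:com.acme/event/jsonschema/2-1-0",
                        "iglu:com.acme/event/jsonschema/2-?-?"))  # → True
```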

@chuwy (Contributor) left a review comment:

Commented in PR.

@alexanderdean (Member) commented Sep 6, 2017

Thanks @chuwy for the detailed review.

I agree with pretty much everything.

We really really need to come up with better names than "Redshift" and "Elasticsearch". Better now than later.

&&

we can derive current format via EnrichedTsv -> Redshift -> Elasticsearch

The second one sounds odd because, as you say, the Redshift format is not really "Redshift" - it's just a pure-JSON intermediate form. As well as renaming, we should make this intermediate form self-describing and register its schema in Iglu Central.

better names than ... "Elasticsearch"

I think there are some Elasticsearch-isms in this format (the geopoint?), so I'm not so bothered by that name.

@miike (Author) commented Sep 6, 2017

Thanks guys - definitely agreed on coming up with some clearer names for this stuff.

I'm not sure about the semi-nested format for contexts. It definitely makes sense from a data-structure point of view, but it may make it more difficult/intensive for applications using the analytics SDK to read off what they are interested in (e.g., if only one context is of interest, you still need to iterate through five additional contexts in the contexts array, which seems like more work). The other reason I'm leaning towards the context-as-a-column model is that it gives a predictable, structured format (or is this the responsibility of the downstream consumer?) that can be used to sink into databases like BigQuery.
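The lookup cost described above can be made concrete with a short sketch. The helper name and the envelope shape are assumptions based on chuwy's example earlier in this thread, not SDK code: with the semi-nested format, pulling out one context type means scanning the whole contexts array rather than a single dictionary access.

```python
def get_contexts(event, vendor, name):
    """Return the data of every context matching vendor/name. Note the
    linear scan over all contexts, versus the O(1) key lookup that the
    context-as-a-column model would allow."""
    prefix = "iglu:%s/%s/" % (vendor, name)
    return [c["data"] for c in event.get("contexts", [])
            if c["schema"].startswith(prefix)]

event = {
    "contexts": [
        {"schema": "iglu:com.acme/context/jsonschema/1-0-0", "data": {"key": "a"}},
        {"schema": "iglu:com.other/geo/jsonschema/1-0-0", "data": {"lat": 1}},
    ]
}
print(get_contexts(event, "com.acme", "context"))  # → [{'key': 'a'}]
```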

@alexanderdean (Member)

I see what you are saying @miike... It's nice to be able to "dot operator all the things".

Anton is off for the next fortnight, so I suspect we will return to this then.

@miike (Author) commented Sep 6, 2017

@alexanderdean Definitely nice - but I wonder if the contexts-as-a-column is too strictly opinionated? Possibly grounds for having multiple intermediate (self-describing) formats depending on what the use case is. See you in a fortnight Anton!

@poplindata poplindata closed this by deleting the head repository Feb 28, 2024
5 participants