This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Provide an STMO-accessible version database #260

Open

Dexterp37 opened this issue Oct 5, 2017 · 10 comments

Comments

@Dexterp37

It would be very useful to have the information provided by buildhub available as a dataset accessible from STMO. This would allow, among other things, performing a join with other datasets (e.g. the update dataset) and making precise build information available to other consumers.

@Dexterp37
Author

See bug 1405614 for an example :)

@Natim
Contributor

Natim commented Oct 5, 2017

Do you know what kind of APIs we need to provide for that to happen?

@Dexterp37
Author

I'm not familiar with this project so I'm not quite sure, but AFAICT an intermediate CSV can be generated. If that's correct, we could simply schedule a recurring job that grabs the CSV, loads it into a temp view in Spark, and saves it to a Parquet file. I'm not an expert, so I'll happily flag @mreid-moz to make sure I didn't say anything too stupid :-D
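
For illustration, here's a minimal PySpark sketch of that recurring job. The CSV location, output path, and job naming are all assumptions for the example, not anything buildhub actually publishes:

```python
# Hypothetical recurring job: fetch a buildhub CSV export, expose it to
# Spark SQL, and persist it as Parquet so STMO can query it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("buildhub-to-parquet").getOrCreate()

# Assumed CSV export location (made up for this sketch).
builds = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3://example-bucket/buildhub/builds.csv"))

# Temp view so the data can be joined against other datasets in SQL.
builds.createOrReplaceTempView("buildhub_builds")

# Overwrite the Parquet dataset on each run.
builds.write.mode("overwrite").parquet("s3://example-output/buildhub/v1/")
```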

@Natim
Contributor

Natim commented Oct 5, 2017

If we can call a webhook each time we add a new build, that would be better and near real-time.

@mreid-moz

If you can incorporate an HTTP POST when a new build arrives, that could work very well and integrate with the generic ingestion service.
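
As a sketch, the webhook on buildhub's side could be a plain HTTP POST of each new build record. The endpoint URL and payload shape below are purely illustrative, not the actual generic ingestion API:

```python
import requests

def notify_new_build(build_record: dict) -> None:
    """POST a newly ingested build to a (hypothetical) ingestion endpoint."""
    resp = requests.post(
        "https://ingestion.example.mozilla.org/submit/buildhub/1",  # made-up URL
        json=build_record,
        timeout=10,
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing silently

# Example payload; the field names are assumptions for this sketch.
notify_new_build({
    "source": {"product": "firefox"},
    "build": {"id": "20171005100118", "date": "2017-10-05T10:01:18Z"},
    "target": {"version": "57.0b7", "channel": "beta"},
})
```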

@Natim
Contributor

Natim commented Oct 5, 2017

That would be perfect

@willkg
Contributor

willkg commented Oct 5, 2017

Pretty sure @peterbe did something along these lines for Socorro.

@peterbe
Contributor

peterbe commented Oct 5, 2017

What we did for Socorro is upload (a subset of) every single crash we process into an S3 bucket. The uploads are put into an S3 "directory" whose name is a date, e.g. "/20171013/".
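
Roughly, the upload side looks like this (a boto3 sketch with a made-up bucket name and key layout, not Socorro's actual code):

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def upload_crash_subset(crash_id: str, subset: dict) -> None:
    """Write one processed-crash subset under a date-named S3 'directory'."""
    date_dir = datetime.now(timezone.utc).strftime("%Y%m%d")  # e.g. "20171013"
    s3.put_object(
        Bucket="example-socorro-telemetry",  # hypothetical bucket
        Key="{}/{}.json".format(date_dir, crash_id),
        Body=json.dumps(subset).encode("utf-8"),
        ContentType="application/json",
    )
```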

Additionally, we uploaded the JSON Schema that describes this subset into the root of that S3 bucket. That way, when @mreid-moz's cron job runs, it fetches the JSON Schema to generate the Scala code (right?) that packages up the JSON blobs into Parquet files.
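
A simplified sketch of the consuming side, assuming the schema lives at the bucket root (this validates records with the jsonschema library instead of generating Scala, just to keep the example self-contained; the bucket and key names are invented):

```python
import json

import boto3
import jsonschema

s3 = boto3.client("s3")
BUCKET = "example-socorro-telemetry"  # hypothetical bucket

# The JSON Schema sits at the root of the bucket.
schema = json.loads(
    s3.get_object(Bucket=BUCKET, Key="crash_subset.schema.json")["Body"].read()
)

def validate_record(record: dict) -> None:
    """Check one uploaded JSON blob against the published schema."""
    jsonschema.validate(instance=record, schema=schema)
```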

In terms of configuration we just let the Socorro Ops people talk to the Telemetry Ops people so they can set up IAM policies for reading. The S3 bucket belongs to the AWS org that Socorro uses (if that matters).

Also, Mark and I wrote a Python script that converts the JSON Schema into Scala code, but I'm not sure if that's still used.

We also have a policy about the versioning of the JSON Schema (basically a key called $version) so that Mark's code knows not to make Parquet files that bundle different schemas. I can go into more detail if necessary.
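
To illustrate the versioning policy (the function and output layout here are assumptions, not Socorro's actual implementation): the bundling job can group records by their $version key so each Parquet output contains exactly one schema version:

```python
from collections import defaultdict

def group_by_schema_version(records):
    """Bucket JSON records by their '$version' key so that each Parquet
    file is written from records of exactly one schema version."""
    by_version = defaultdict(list)
    for record in records:
        if "$version" not in record:
            raise ValueError("record is missing the '$version' key")
        by_version[record["$version"]].append(record)
    return by_version

# Each group can then be written to its own output path,
# e.g. .../crash_summary/v{version}/, so no file mixes schemas.
```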

@Natim
Contributor

Natim commented Oct 5, 2017

I would like to wait for the HTTP service to be ready then :)

@mreid-moz

cc @jasonthomas re: ingestion stuff
