This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Provide an STMO-accessible version database #260

Open

Dexterp37 opened this issue Oct 5, 2017 · 10 comments

Comments

@Dexterp37

It would be very useful to have the information provided by buildhub available as a dataset accessible from STMO. This would allow, among other things, performing a join with other datasets (e.g. the update dataset) and making precise build information available to other consumers.

@Dexterp37
Author

See bug 1405614 for an example :)

@Natim
Contributor

Natim commented Oct 5, 2017

Do you know what kind of APIs we need to provide for that to happen?

@Dexterp37
Author

I'm not familiar with this project so I'm not quite sure, but AFAICT an intermediate CSV can be generated. If that's correct, we could simply schedule a recurring job that grabs the CSV, loads it into a temp view in Spark, and saves it to a Parquet file. I'm not an expert, so I'll happily flag @mreid-moz to make sure I didn't say anything too stupid :-D
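
For illustration, here's a minimal PySpark sketch of that recurring job. The CSV location, output path, and job naming are all assumptions for the example, not anything buildhub actually publishes:

```python
# Hypothetical recurring job: fetch a buildhub CSV export, expose it to
# Spark SQL, and persist it as Parquet so STMO can query it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("buildhub-to-parquet").getOrCreate()

# Assumed CSV export location (made up for this sketch).
builds = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3://example-bucket/buildhub/builds.csv"))

# Temp view so the data can be joined against other datasets in SQL.
builds.createOrReplaceTempView("buildhub_builds")

# Overwrite the Parquet dataset on each run.
builds.write.mode("overwrite").parquet("s3://example-output/buildhub/v1/")
```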

@Natim
Contributor

Natim commented Oct 5, 2017

If we can call a webhook each time we add a new build, that would be better and near real-time.

@mreid-moz

If you can incorporate an HTTP POST when a new build arrives, that could work very well and integrate with the generic ingestion service.
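
As a sketch, the webhook on buildhub's side could be a plain HTTP POST of each new build record. The endpoint URL and payload shape below are purely illustrative, not the actual generic ingestion API:

```python
import requests

def notify_new_build(build_record: dict) -> None:
    """POST a newly ingested build to a (hypothetical) ingestion endpoint."""
    resp = requests.post(
        "https://ingestion.example.mozilla.org/submit/buildhub/1",  # made-up URL
        json=build_record,
        timeout=10,
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing silently

# Example payload; the field names are assumptions for this sketch.
notify_new_build({
    "source": {"product": "firefox"},
    "build": {"id": "20171005100118", "date": "2017-10-05T10:01:18Z"},
    "target": {"version": "57.0b7", "channel": "beta"},
})
```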

@Natim
Contributor

Natim commented Oct 5, 2017

That would be perfect

@willkg
Contributor

willkg commented Oct 5, 2017

Pretty sure @peterbe did something along these lines for Socorro.

@peterbe
Contributor

peterbe commented Oct 5, 2017

What we did for Socorro is upload (a subset of) every single crash we process into an S3 bucket. The uploads are put into an S3 "directory" whose name is a date, e.g. "/20171013/".
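
Roughly, the upload side looks like this (a boto3 sketch with a made-up bucket name and key layout, not Socorro's actual code):

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def upload_crash_subset(crash_id: str, subset: dict) -> None:
    """Write one processed-crash subset under a date-named S3 'directory'."""
    date_dir = datetime.now(timezone.utc).strftime("%Y%m%d")  # e.g. "20171013"
    s3.put_object(
        Bucket="example-socorro-telemetry",  # hypothetical bucket
        Key="{}/{}.json".format(date_dir, crash_id),
        Body=json.dumps(subset).encode("utf-8"),
        ContentType="application/json",
    )
```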

Additionally, we uploaded the JSON Schema that describes this subset into the root of that S3 bucket. That way, when @mreid-moz's cron job runs, it fetches the JSON Schema to generate the Scala code (right?) that packages up the JSON blobs into Parquet files.
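
A simplified sketch of the consuming side, assuming the schema lives at the bucket root (this validates records with the jsonschema library instead of generating Scala, just to keep the example self-contained; the bucket and key names are invented):

```python
import json

import boto3
import jsonschema

s3 = boto3.client("s3")
BUCKET = "example-socorro-telemetry"  # hypothetical bucket

# The JSON Schema sits at the root of the bucket.
schema = json.loads(
    s3.get_object(Bucket=BUCKET, Key="crash_subset.schema.json")["Body"].read()
)

def validate_record(record: dict) -> None:
    """Check one uploaded JSON blob against the published schema."""
    jsonschema.validate(instance=record, schema=schema)
```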

In terms of configuration we just let the Socorro Ops people talk to the Telemetry Ops people so they can set up IAM policies for reading. The S3 bucket belongs to the AWS org that Socorro uses (if that matters).

Also, Mark and I wrote a Python script that converts the JSON Schema into Scala code, but I'm not sure if that's still used.

We also have a policy about the versioning of the JSON Schema (basically a key called $version) so that Mark's code knows not to make Parquet files that bundle different schemas. I can go into more detail if necessary.
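
To illustrate the versioning policy (the function and output layout here are assumptions, not Socorro's actual implementation): the bundling job can group records by their $version key so each Parquet output contains exactly one schema version:

```python
from collections import defaultdict

def group_by_schema_version(records):
    """Bucket JSON records by their '$version' key so that each Parquet
    file is written from records of exactly one schema version."""
    by_version = defaultdict(list)
    for record in records:
        if "$version" not in record:
            raise ValueError("record is missing the '$version' key")
        by_version[record["$version"]].append(record)
    return by_version

# Each group can then be written to its own output path,
# e.g. .../crash_summary/v{version}/, so no file mixes schemas.
```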

@Natim
Contributor

Natim commented Oct 5, 2017

I would like to wait for the HTTP service to be ready then :)

@mreid-moz

cc @jasonthomas re: ingestion stuff
