Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmarking #46

Closed
brainstorm opened this issue Jul 6, 2021 · 14 comments · Fixed by #59
Closed

Benchmarking #46

brainstorm opened this issue Jul 6, 2021 · 14 comments · Fixed by #59
Assignees

Comments

@brainstorm
Copy link
Member

brainstorm commented Jul 6, 2021

As mentioned on the original GSoC proposals, this project is not complete until we validate that is well performant :)

For this, I've been cooking this repo: https://github.com/umccr/aws-benchmarks

As an example, using AWS S3 client libraries as subjects to test both CPU and I/O throughput on... but the idea is to apply those benches over here.

This task depends on #45 being finished.

@brainstorm brainstorm added the enhancement New feature or request label Jul 6, 2021
@brainstorm brainstorm removed the enhancement New feature or request label Jul 7, 2021
@brainstorm
Copy link
Member Author

Other than measuring throughput on the different htsget-rs layers, we should keep a close eye on this common htsget problem and metric: googlegenomics/htsget#7

@brainstorm
Copy link
Member Author

brainstorm commented Aug 1, 2021

Discussing this over GA4GH's Slack...

I’d argue that the best course of action is to create a benches subdirectory for each htsget-rs crate instead of creating a dedicated “htsget-benchmarks” crate within the workspace (as the aws-benchmarks crate).

I’d love the benchmarks for the server to be as end-to-end to measure real world throughput.

/cc @victorskl thoughts? ;)

@victorskl
Copy link
Member

Yah, benchmarks for the server sound good. On top of my head

  • pick and stage some (large enough) public dataset (a BAM, a VCF)
  • config server to serve from this dataset
  • prepare few target slices -- i.e. pick different variant regions / alignment references -- may be 10 to 20s in a array
  • baseline
    • try sequential calls to (hts)get each target slices; one by one
    • try simultaneous calls to get all target slices at once
  • varying bench settings
    • try apply any compiler optimisation, if any
    • try apply platform specific optimisation flag, if any
    • etc...

And by then by doing so, you could come up with more methodology around it, at that point...

This need to do for both referencing impl and Rust impl here. You can start with local for this arrangement; and then by reaching at the point; it will be clearer what come next, I reckon.

Please also to make sure implementation "correctness", first i.e. server responses are correct according to the Spec and/or same with the referencing impl. If anything differ, need to note it.

@brainstorm
Copy link
Member Author

Thanks a ton Victor for the thoughts and guidelines! There's another bit I just recalled and that is the amount of MB's this implementation returns for a given query.

I'm thinking and referring to this concerning issue with htsget: googlegenomics/htsget#7

@CastilloDel
Copy link
Contributor

Well, I still need to tinker a bit to see which problems I will find, but Victor's guidelines seem to make a lot of sense for now.

@CastilloDel
Copy link
Contributor

I'm having problems with the reference implementation 😭 After following the instruction in the README for building and running the server, I'm getting this error:

samtools flagstat http://localhost:8081/reads/public-bucket/test.bam
[E::hts_open_format] Failed to open file "http://localhost:8081/reads/public-bucket/test.bam" : Permission denied
samtools flagstat: Cannot open input file "http://localhost:8081/reads/public-bucket/test.bam": Permission denied

I didn't find anything like this in the issues, should I add a new one? The project seems somewhat abandoned, with the most recent issue/PR being from 2019 😥
Also, it can't be used as a local server so comparing the actual state of our implementation with it might be pointless (?).

@brainstorm
Copy link
Member Author

Oh, please don't try to deploy the googlegenomics htsget, it's indeed quite Google-tailored. I pointed at that particular issue because I wanted to know how many extra reads the GA4GH reference implementation returned, not Google's one, sorry for not being clear there :-S

@CastilloDel
Copy link
Contributor

Seems like I got some things mixed up. Where can I find the reference implementation?

@brainstorm
Copy link
Member Author

Sorry, when I say reference implementation, I usually refer to GA4GH's one here:

https://github.com/ga4gh/htsget-refserver

... perhaps sooner our later ours (htsget-rs) will be the GA4GH "reference implementation", we'll see after the benchmarking ;)

@CastilloDel
Copy link
Contributor

That would be awesome! 💯

@CastilloDel
Copy link
Contributor

I think I have hit a blocker :(. I've been using reqwest to perform the client requests in the benchmarks. Until now, I had only tried to get the initial response from the servers, but now I tried to also download the files with the urls the servers provided and I found a problem with our implementation. The urls it gives use the "file://" scheme, which isn't allowed by reqwest. This was something that already needed to be changed, because the htsget spec doesn't allow it, but we might have to deal with it sooner than expected.

@brainstorm
Copy link
Member Author

Yes, I reckon that this should be fixed, it's good that we find those as we go, we still have time to iron those things before the end GSoC :)

@brainstorm
Copy link
Member Author

@andrewpatto You were right, only the file:// scheme is in place as @CastilloDel mentions above 👆🏻(instead of http://)... Daniel, is there something actix-specific we could change for this behaviour or there's some middleware/router that could change those file:// pointers to http://?

@CastilloDel
Copy link
Contributor

CastilloDel commented Oct 23, 2021

Yes! I used actix_files for that. Check out 8adfd7a and cbf58e0

Those changes are live in the benchmarks branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants