Reproducibility Brainstorm #4

gwaybio opened this issue May 6, 2019 · 5 comments
gwaybio (Member) commented May 6, 2019

@shntnu @jccaicedo @MarziehHaghighi

I was thinking about our recent discussion on GitHub reproducibility a bit more. I am wondering about different potential workflows and can think of some additional, potentially helpful setups.

First Setup

The first setup is as I described on Friday:

  • All code and discussion (experiments, tasks, etc.) live in the repo
  • The repo includes a numbered folder (e.g. 0.generate-profiles) that stores the processing code, QC, and profile results.
  • The rest of the repository is structured into various other modules that represent other experiments and/or common themes (e.g. correlating the present dataset with alternatives for MOA discovery). Basically, these are any downstream analyses that use the profiles generated in 0.generate-profiles (a sketch of this layout follows the list).
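
As a concrete illustration, a repository following this first setup might be laid out roughly like the sketch below; every name other than 0.generate-profiles is hypothetical and only meant to show the pattern of numbered modules with downstream analyses alongside the processing module.

```
repo/
├── 0.generate-profiles/     # processing code, QC, and the resulting profiles
│   ├── scripts/
│   ├── qc/
│   └── profiles/
├── 1.moa-discovery/          # hypothetical downstream module (e.g. correlation with other datasets)
├── 2.another-analysis/       # hypothetical: any other experiment or common theme
└── README.md
```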

Potential Alternative?

Perhaps a second setup could separate the processing code and downstream analysis into two distinct repositories. This setup could work well for a few reasons.

  1. It cleanly separates profile generation from profile analysis (we would add a link in each repo's README cross-referencing the other).
  2. It frees up the naming convention of the analysis repos (no more dates in the name 😉).
  3. It could potentially aid in automation. Based on my, albeit limited, experience, it seems like a lot of the profile generation is relatively consistent, and the differences are mainly nuance. I wonder if we could set up something that creates a profiling template (much like the handbook) that is ready to run once initiated. The workflow could be something like "New Repo" --> git clone --> profiling init (and then bash scripts would be auto-populated; a rough sketch follows this list).
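
To make the "profiling init" idea a bit more concrete, here is a minimal sketch of what such a template initializer could do. Everything in it is hypothetical (the script name, module names, and bash stub contents are illustrative only, not an existing tool); it just mirrors the "New Repo" --> git clone --> profiling init flow described above.

```python
# profiling_init.py -- hypothetical sketch of a "profiling init" step.
# After creating and cloning a new repo, running this script would scaffold
# a numbered module layout and auto-populate bash script stubs.
from pathlib import Path

# Hypothetical module names; a real template would mirror the profiling handbook.
MODULES = [
    "0.generate-profiles",
    "1.downstream-analysis",
]

# Hypothetical bash stub dropped into each module, to be edited per project.
BASH_STUB = """#!/bin/bash
# Auto-populated stub -- fill in batch/plate details for this project.
set -euo pipefail

BATCH_ID="REPLACE_ME"
PLATES=()  # e.g. (plate1 plate2)

echo "Profiling pipeline not configured yet (batch: ${BATCH_ID})"
"""


def init_repo(repo_root: str = ".") -> None:
    """Scaffold the numbered-module layout inside an existing (cloned) repo."""
    root = Path(repo_root)
    for module in MODULES:
        module_dir = root / module
        module_dir.mkdir(parents=True, exist_ok=True)

        stub = module_dir / "run.sh"
        if not stub.exists():
            stub.write_text(BASH_STUB)
            stub.chmod(0o755)

        readme = module_dir / "README.md"
        if not readme.exists():
            readme.write_text(f"# {module}\n\nTODO: describe this module.\n")

    print(f"Initialized profiling template in {root.resolve()}")


if __name__ == "__main__":
    # Workflow sketch: create repo on GitHub -> git clone -> run this script.
    init_repo()
```

Running something like python profiling_init.py inside a freshly cloned repo would then leave behind the numbered folders and editable run.sh stubs, which each project could fill in with its plate- and batch-specific details.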

Of course, every project is different, and individual decisions are required. (The same goes for whether to store the profiles in the actual repo, and for the public/private repo debate.)

gwaybio (Member Author) commented May 6, 2019

Also note that I am brainstorming this general idea in this specific repository, relating to the STARR grant, because it is open source.

shntnu (Collaborator) commented May 8, 2019

Thanks for leading this discussion!

I like this idea because:

  1. There are many different analyses one could do, given the same dataset.

So the profile-generation repo would capture everything that we do in the profiling handbook, and nothing else. Ideally, at some point, this repo would only contain the WDL workflow (or equivalent) used to process the data.

The automation question merits a separate discussion and is out of scope right now. Profile generation certainly is relatively consistent, so automation is possible, but it would need a lot more work to be fully automated. Still, that's another reason to consider this option.

What's next? Do you want to try this out on this project? @gwaygenomics

gwaybio (Member Author) commented May 9, 2019

> There are many different analyses one could do, given the same dataset.

Yeah definitely! Also, depending on the size of the profiles specifically, GitHub can handle data versioning. BBBC will store the raw images?

> Ideally, at some point, this repo would only contain the WDL workflow (or equivalent) used to process the data.

Depending on the size of the data, I think it could also store processed profiles. Data versioning FTW 🎉

> What's next? Do you want to try this out on this project?

Yes, let's try it out! Currently, I don't think the profile processing lives here (do we know where it lives?), so it will be natural to use this strategy here.

Another thing to consider is whether the analysis should live in the broadinstitute org or the carpenterlab org. I am thinking carpenterlab since (presumably) we have more control over it and it gives the lab more visibility. (There is also a new and nifty issue-transfer feature on GitHub, so an ownership transfer should be relatively painless.)

shntnu (Collaborator) commented May 31, 2019

> BBBC will store the raw images?

Not sure yet; ideally IDR, but it isn't easy to access images there directly.

> Yes, let's try it out! Currently, I don't think the profile processing lives here (do we know where it lives?), so it will be natural to use this strategy here.

Indeed, I don't see any profile processing notes; you'd need to check with Beth.

> Another thing to consider is whether the analysis should live

broadinstitute works well, especially for collaborative projects.

gwaybio transferred this issue from broadinstitute/profiling-resistance-mechanisms on Mar 10, 2020
gwaybio (Member Author) commented Mar 10, 2020

Note that I transferred this issue over from https://github.com/broadinstitute/profiling-resistance-mechanisms.

This repo currently has the closest workflow to what is described above.
