-
Notifications
You must be signed in to change notification settings - Fork 745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add subworkflow for functional enrichment analysis #7254
Changes from 67 commits
4ea1b92
a434377
1567d0a
318898d
e8a1f7f
b3ce956
c0833c3
aa95ebe
88ff6c8
9837349
c9b8a88
643b7f7
e86ece4
7030f37
a7b7bd9
a36f2fd
8d9c31a
6c3df56
e8f105c
e3eb9df
99f81ba
294b1de
e175066
77ed5ea
64ac34f
01b1791
df8945f
954fde6
b187583
392dc08
79965f4
c15d4cd
3f4aecb
de82a8f
8017e21
da67af8
d42f2da
fb3f1ae
c439740
1759a86
877946b
e9f069b
93345a6
75399ec
12b7f8c
ca61db0
170e870
51229b9
af0a0a7
4f24202
cac0b07
2b78395
709abba
aeedb60
4b3013b
65ea9e5
5454a7d
f536a4e
05edb99
85d60f5
5888db2
3003d04
d795904
d88df5d
3b3a05f
1b49463
99d5669
c4e4bca
25f94ed
6a4b1c7
dbb6637
4f7cae8
6e5e1a6
3f10e16
ecd84ed
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,150 @@ | ||
|
||
// | ||
// Perform enrichment analysis | ||
// | ||
include { GPROFILER2_GOST } from "../../../modules/nf-core/gprofiler2/gost/main.nf" | ||
include { CUSTOM_TABULARTOGSEAGCT } from '../../../modules/nf-core/custom/tabulartogseagct/main.nf' | ||
include { CUSTOM_TABULARTOGSEACLS } from '../../../modules/nf-core/custom/tabulartogseacls/main.nf' | ||
include { CUSTOM_TABULARTOGSEACHIP } from '../../../modules/nf-core/custom/tabulartogseachip/main.nf' | ||
include { GSEA_GSEA } from '../../../modules/nf-core/gsea/gsea/main.nf' | ||
include { PROPR_GREA } from "../../../modules/nf-core/propr/grea/main.nf" | ||
|
||
// Combine meta maps, including merging non-identical values of shared keys (e.g. 'id') | ||
def mergeMaps(meta, meta2){ | ||
(meta + meta2).collectEntries { k, v -> | ||
meta[k] && meta[k] != v ? [k, "${meta[k]}_${v}"] : [k, v] | ||
} | ||
} | ||
|
||
workflow DIFFERENTIAL_FUNCTIONAL_ENRICHMENT { | ||
take: | ||
// input data for functional analysis | ||
// They can be the results from differential expression analysis or abundance matrix | ||
// The functional analysis method to run should be explicitly provided | ||
ch_input // [ meta_input, input file, method to run ] | ||
|
||
// gene sets and background | ||
ch_gene_sets // [ meta_gmt, gmt file ] | ||
ch_background // [ meta_background, background file ] | ||
|
||
// other - for the moment these files are only needed for GSEA | ||
ch_contrasts // [ meta_contrast, contrast_variable, reference, target ] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (1) When providing results from differential expression analysis, a contrast would not be needed. IMO it would make sense to keep differential testing entirely out of this workflow. In case (1), it's not needed, and in case (2), a sample x signature matrix is produced. This matrix could just be fed into a differential analysis workflow (e.g. limma) again. Like that we can keep the complexity of this subworkflow low. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm aware that the gsea module does support providing a contrast directly when using gene expression data. However, I believe that there's no point in using this mode. In this case, GSEA anyway just computes a metric (signal2noise, t-statistic, ...) based on these variables (see docs), so we can as well provide a fold change or DE-statistic directly, while having the advantage that we can provide a full model definition including covariates to the DE method. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Fair point, but it's what diff. ab. does right now. I'd recognised the possbility of a switch, but hadn't got round to it: nf-core/differentialabundance#36 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair point that we could model the current matrix-driven GSEA as part of the differential subworkflow, alongside LIMMA et al though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wanted the current subworkflow to be able to produce the exact behaviour of how the modules are used in the DA pipeline, hence GSEA is taking gene expression data instead of DE. Not sure why this is mode is chosen for the pipeline, maybe @pinin4fjords can step in here? But yes, I do agree that it would be conceptually cleaner if the subworkflow just takes in DE output, and so for GSEA. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd be in favor of taking this chance of streamlining the workflow now that we are anyway changing quite a few things. We can help with implementing a module for preranked GSEA if required1. (alternatively, the decoupler module is kind of ready, and it comes with a very fast GSEA implementation -- it should generate the same results in terms of scores and p-values, but it doesn't produce all outputs such as the "leading edge" plots). Footnotes
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not to be a pain, but I do like those leading edge plots, they are useful. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not against a preranked GSEA module. But what I really would like to get rid of is the contrast specification in this subworkflow. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would actually like to start integrating the subworkflows in the current DA pipeline hopefully next week. |
||
ch_samplesheet // [ meta_exp, samples sheet ] | ||
ch_featuresheet // [ meta_exp, features sheet, features id, features symbol ] | ||
|
||
main: | ||
|
||
ch_versions = Channel.empty() | ||
|
||
// Add method information into meta map of ch_input | ||
// This information is used later to determine which method to run for each input | ||
|
||
ch_input = ch_input | ||
.combine(ch_gene_sets) | ||
.combine(ch_background) | ||
.multiMap { | ||
meta_input, file, analysis_method, meta_gmt, gmt, meta_background, background -> | ||
def meta_new = meta_input + [ 'method': analysis_method ] | ||
input: | ||
[ meta_new, file ] | ||
gene_sets: | ||
[ meta_new, gmt ] // NOTE here we assume that the modules will not make use of meta_gmt and meta_background | ||
background: | ||
[ meta_new, background ] | ||
} | ||
|
||
// In the case of GSEA, it needs additional files coming from other channels that other methods don't use | ||
// here we define the input channel for the GSEA section | ||
|
||
def criteria = multiMapCriteria { meta_input, input, gmt, meta_exp, samplesheet, featuresheet, features_id, features_symbol, meta_contrasts, variable, reference, target -> | ||
def meta_contrasts_new = meta_contrasts + [ 'variable': variable, 'reference': reference, 'target': target ] // make sure variable, reference, target are in the meta | ||
def meta_all = mergeMaps(meta_contrasts_new, meta_input) | ||
input: | ||
[ meta_all, input ] | ||
gene_sets: | ||
[ meta_all, gmt ] | ||
contrasts_and_samples: | ||
[ meta_all, samplesheet ] | ||
features: | ||
[ meta_exp, featuresheet ] | ||
features_cols: | ||
[ features_id, features_symbol ] | ||
} | ||
ch_preinput_for_gsea = ch_input.input | ||
.join(ch_input.gene_sets) | ||
.filter{ it[0].method == 'gsea' } | ||
.combine(ch_samplesheet.join(ch_featuresheet)) | ||
.combine(ch_contrasts) | ||
.multiMap(criteria) | ||
|
||
// ---------------------------------------------------- | ||
// Perform enrichment analysis with gprofiler2 | ||
// ---------------------------------------------------- | ||
|
||
GPROFILER2_GOST( | ||
ch_input.input.filter{ it[0].method == 'gprofiler2' }, | ||
ch_input.gene_sets.filter{ it[0].method == 'gprofiler2'}, | ||
ch_input.background.filter{ it[0].method == 'gprofiler2'} | ||
) | ||
|
||
// ---------------------------------------------------- | ||
// Perform enrichment analysis with GSEA | ||
// ---------------------------------------------------- | ||
|
||
// NOTE that GCT input can be more than 1, if they come from different tools (eg. limma, deseq2). | ||
// CLS input can be as many as combinations of input x contrasts | ||
// Whereas features can be only one file. | ||
|
||
CUSTOM_TABULARTOGSEAGCT(ch_preinput_for_gsea.input) | ||
|
||
CUSTOM_TABULARTOGSEACLS(ch_preinput_for_gsea.contrasts_and_samples) | ||
|
||
CUSTOM_TABULARTOGSEACHIP( | ||
ch_preinput_for_gsea.features.first(), | ||
ch_preinput_for_gsea.features_cols.first() | ||
) | ||
|
||
ch_input_for_gsea = CUSTOM_TABULARTOGSEAGCT.out.gct | ||
.join(CUSTOM_TABULARTOGSEACLS.out.cls) | ||
.join( ch_preinput_for_gsea.gene_sets ) | ||
|
||
GSEA_GSEA( | ||
ch_input_for_gsea, | ||
ch_input_for_gsea.map{ tuple(it[0].reference, it[0].target) }, | ||
CUSTOM_TABULARTOGSEACHIP.out.chip.first() | ||
) | ||
|
||
// ---------------------------------------------------- | ||
// Perform enrichment analysis with GREA | ||
// ---------------------------------------------------- | ||
|
||
PROPR_GREA( | ||
ch_input.input.filter{ it[0].method == 'grea' }, | ||
ch_input.gene_sets.filter{ it[0].method == 'grea' } | ||
) | ||
|
||
emit: | ||
// here we emit the outputs that will be useful afterwards in the | ||
// nf-core/differentialabundance pipeline | ||
|
||
// gprofiler2-specific outputs | ||
gprofiler2_all_enrich = GPROFILER2_GOST.out.all_enrich | ||
gprofiler2_sub_enrich = GPROFILER2_GOST.out.sub_enrich | ||
gprofiler2_plot_html = GPROFILER2_GOST.out.plot_html | ||
|
||
// gsea-specific outputs | ||
gsea_report = GSEA_GSEA.out.report_tsvs_ref | ||
.join(GSEA_GSEA.out.report_tsvs_target) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. strange indent |
||
|
||
// grea-specific outputs | ||
grea_results = PROPR_GREA.out.results | ||
|
||
// tool versions | ||
versions = ch_versions | ||
.mix(GPROFILER2_GOST.out.versions) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. strange indents |
||
.mix(CUSTOM_TABULARTOGSEAGCT.out.versions) | ||
.mix(CUSTOM_TABULARTOGSEACLS.out.versions) | ||
.mix(CUSTOM_TABULARTOGSEACHIP.out.versions) | ||
.mix(GSEA_GSEA.out.versions) | ||
.mix(PROPR_GREA.out.versions) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why are there no emissions? |
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this necessarily need to be a GMT file?
E.g. decoupler uses weighted gene sets for PROGENy and collecTRI analyses that are typically provided as a long-form data frame. If we included deconvolution as functional analysis tool, it would typically use a signature matrix.
It really depends a bit on the scope of this subworkflow. But if the plan is to support a wide range of functional analysis tools as suggested in nf-core/differentialabundance#367, it would be good to keep this generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but I'm always wary of premature over-engineering. I don't object to gmt in the first instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically my suggestion comes down to making the gene sets + background method-specific. Already the current workflow logic doesn't really care about whether it's a gmt file or not. It's only specified in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it is possible to standardize the gene set input format for all methods, and then each module deals with the reformatting to the proper format specifically needed for the method?
Otherwise I would imagine it become confusing from the pipeline's user perspective to have to provide the input gene set with certain format depending on the method chosen, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. It would be really nice to include omnipathdb, then the coupling between signature and method can be automatized based on name.
For now, we could have a method specific gene sets channel just as input channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not thrilled by the prospect of a multiplicity of gene_set inputs per method.
@suzannejin Maybe the gene sets should actually be part of ch_input. That way, there could be method-specific gene set files in that channel, and we wouldn't need multiple input channels