Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fc thresholds are recommended for analyses with large sample sizes #12

Open
micahpf opened this issue Aug 7, 2024 · 1 comment
Open

Comments

@micahpf
Copy link
Contributor

micahpf commented Aug 7, 2024

Both Mike Love and Gordon Smyth recommend using a lfc cutoff in analyses with massive sample sizes:

https://support.bioconductor.org/p/9153413/

Note however that it's not recommended to simply run the test with the null hypothesis that lfc=0 and then filter out the genes with lfc less than some threshold, because the pvalues will still be inflated and they've both developed methods to more rigorously imbed an lfc threshold into the null hypothesis of the test. These are DESeq2::lfcShrink, edgeR::glmTreat, and limma::treat.

from ?limma::treat:

treat computes empirical Bayes moderated-t p-values relative to a minimum fold-change threshold. Instead of testing for genes that have true log-fold-changes different from zero, it tests whether the true log2-fold-change is greater than lfc in absolute value (McCarthy and Smyth, 2009). In other words, it uses an interval null hypothesis, where the interval is [-lfc,lfc]. When the number of DE genes is large, treat is often useful for giving preference to larger fold-changes and for prioritizing genes that are biologically important. treat is concerned with p-values rather than posterior odds, so it does not compute the B-statistic lods. The idea of thresholding doesn't apply to F-statistics in a straightforward way, so moderated F-statistics are also not computed. When fc=1and lfc=0, treat is identical to eBayes, except that F-statistics and B-statistics are not computed. The fc threshold is usually chosen relatively small, because genes need to have fold changes substantially greater than the testing threshold in order to be considered statistically significant. Typical values for fc are 1.1, 1.2 or 1.5. The top genes chosen by treat can be examined using topTreat.

The treat threshold can be specified either as a fold-change via fc or as a log2-fold-change via lfc, with lfc = log2(fc). Note that the treat testing procedure is considerably more rigorous and conservative than simply applying same fc values as a fold-change cutoff to the list of differentially expressed genes. Indeed, the observed log2-fold-change needs to substantially larger than lfc for a gene to be called as statistically significant by treat. The threshold should be chosen as a small value below which results should be ignored rather than as a target fold-change. In practice, modest values for fc such as 1.1, 1.2 or 1.5 are usually the most useful. Setting fc=1.2 or fc=1.5 will usually cause most differentially expressed genes to have estimated fold-changes of 2-fold or greater, depending on the sample size and precision of the experiment. Larger thresholds are usually overly conservative and counter productive. In general, the fc threshold should be chosen sufficiently small so that a worthwhile number of DE genes remain, otherwise the purpose of prioritizing genes with larger fold-changes will be defeated.

Note that if we want to modify the fitVoomLm to allow fc thresholds using limma::treat we will also need to change the createResultsTable function to use limma::topTreat because limma::treat doesn't generate B stats and limma::topTable uses the B stat to arrange the genes.

This would look something like this:

fitVoomLm <- function(bulkObj, design = bulk$md$design, contr.matrix, 
                      block = NULL, sample.weights = TRUE, 
                      var.design = NULL, var.group = NULL, plotVoom = TRUE,
                      ebayes_fun = "eBayes", fc = 1.2) {
    bulkObj$fit <- edgeR::voomLmFit(counts = bulkObj$dge, # Defaults to the normalized (effective) library sizes in counts if counts is a DGEList or to the columnwise count totals if counts is a matrix.
                                    design = design,
                                    block = block,
                                    sample.weights = sample.weights,
                                    var.design = var.design,
                                    var.group = var.group,
                                    plot = plotVoom)
  bulkObj$fit.contr <- limma::contrasts.fit(bulkObj$fit, contrasts = contr.matrix)
  if (ebayes_fun == "eBayes") {
    bulkObj$fit.contr <- limma::eBayes(bulkObj$fit.contr, robust=TRUE)
  } else if (ebayes_fun == "treat") {
    bulkObj$fit.contr <- limma::treat(bulkObj$fit.contr, fc = fc, robust=TRUE)
  }
  limma::plotSA(bulkObj$fit.contr, main="Final model: Mean-variance trend")
  return(bulkObj)
}
createResTable <- function (fit, contr.matrix, ebayes_fun = "eBayes") {
    resultsTables_list <- list()
    for (contrast in colnames(fit$coefficients)) {
      if (ebayes_fun == "eBayes") {
        resultsTables_list[[contrast]] <- limma::topTable(fit, 
            coef = contrast, n = Inf) %>% dplyr::rename(log2FoldChange = .data$logFC, 
            pvalue = .data$P.Value, padj = .data$adj.P.Val)
      } else if (ebayes_fun == "treat") {
        resultsTables_list[[contrast]] <- limma::topTreat(fit, 
            coef = contrast, n = Inf) %>% dplyr::rename(log2FoldChange = .data$logFC, 
            pvalue = .data$P.Value, padj = .data$adj.P.Val)
      }
    }
    resultsTable <- lapply(resultsTables_list, function(one_tbl) {
        one_tbl %>% tibble::rownames_to_column(var = "gene")
    }) %>% dplyr::bind_rows(.id = "contrast") %>% dplyr::as_tibble() %>% 
        dplyr::mutate(contrast = forcats::fct(contrast, levels = colnames(contr.matrix)))
    return(resultsTable)
}
@micahpf
Copy link
Contributor Author

micahpf commented Aug 7, 2024

Also note that if you run limma::treat then the pvalue distribution is no longer expected to be uniform (see the same bioconductor post above), so don't be worried if they look crazy and comment those out in the report for clients!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant