Skip to content

Latest commit

 

History

History
82 lines (52 loc) · 3.22 KB

README.md

File metadata and controls

82 lines (52 loc) · 3.22 KB

mGSEA (modified gene set enrichment analysis)

To perform GSEA with a fuzzy ranked gene list

If you use this script in your research work, please cite at least one of the following paper(s):

Genomic diversity and post-admixture adaptation in the Uyghurs (National Science Review, 2021)

Dependencies:

  • python2.7
  • argparse
  • numpy
  • pandas
  • concurrent

Usage:

python mGSEA.py -h
python mGSEA.py \
	--qrank genelist.rnk \
	--gset kegg.gmt \
	--qsize 0.005 \
	--perm 2000 \
	--threads 50 \
	--out output

Details about the arguments

--qrank (required): file format: , tab-delimited, no header, no duplicated genes, better to keep only genes in the background gene set, e.g., KEGG genes. “quantile value” ranges from 0 to 1.0 (or something will go wrong)

--gset (required): Priori target gene list, all of the genes should be in the ranked genelist, gmt format

--qsize (required): Window size of the non-overlapping quantile regions, from 1.0 to 0.0, e.g., 0.005

--perm (optional): permutation, Default: 2000

--threads (optional): Number of threads, Default: 50

--out (required): prefix name for output file

Output:

*.enrichment_score_quarter.txt.gz: enrichment scores after permutation

*.cumulative_score.txt.gz: cumulative scores for the observed data

*.enrichment_Pvalue.txt: combined summary statistics

Rationale:

Gene Set Enrichment Analysis (GSEA) is a statistical method developed based on the Kolmogorove-Smirnov test to identify the enrichment of biologically functional categories in a ranked gene list. But the gene list was not strictly ranked in our study due to the resolution of the empirical P-values. We modified the traditional GSEA by sorting genes according to their quantiles (1.0 - P-values) into 200 bins of size 0.005.

For a given gene set $S$ (a pathway or a priori functional gene set) and a quantile bin $B$, we respectively defined the “hit” score $P_{hit}(S, B)$ and “miss” score $P_{miss}(S,B)$ as:

$$ P_{hit}(S, B)=\frac{n_{BS}}{N_S};P_{miss}(S, B)=\frac{n_B-n_{BS}}{N_{tot}-N_S}, $$

in which the $N_{tot}$ stands for the total number of genes (across all the quantile bins), $N_S$ stands for the number of genes in $S$, $n_B$ stands for the number of genes in $B$, and $n_{BS}$ stands for the number of genes in in $B$ belonging to $S$. Thus, we defined the net score $P_{net}(S,B)$ for a given bin as:

$$ P_{net}(S,B)=P_{hit}(S,B)-P_{miss}(S,B). $$

The cumulative score $P_{net}(S)$ was calculated as

$$ P_{net}(S)=\sum P_{net}(S, B_i), $$

in which the $B_i$ was the ith bin in the ranking list. In addition, the enrichment score $ES(S)$ was defined as the maximum deviation of $P_{net}(S)$ from zero. The significance of an observed $ES(S)$ was estimated by comparing it with the NULL distribution of $ES$ ($ES_{NULL}$), which was generated by permuting the gene-pathway relationship 20,000 times. Finally, the nominal empirical P-value for the gene set $S$ was calculated with one-tailed significance test using either the positive or negative portion of the NULL distribution according to the sign of the observed $ES(S)$.


By: Yuwen Pan, 2021
Contact: [email protected]