GRIMER v1.0.0 (#3)

* v1.0.0
* metadata code, sorted panels user
* fix small typos, print md
* updated external reference files
* fix
* option to input table with commulative values
* check md data length, fix typos, env decv
* update readme
pirovc · Jul 21, 2022 · 2a5bd2f
1 parent 43d2feb

Showing 17 changed files with 7,233 additions and 5,225 deletions.
diff --git a/README.md b/README.md
@@ -2,23 +2,30 @@
 
 ![GRIMER](grimer/img/logo.png)
 
-GRIMER performs analysis of microbiome data and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata.
+GRIMER performs analysis of microbiome data and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata, with a focus on contamination detection. More information about the method can be found in the [pre-print](https://doi.org/10.1101/2021.06.22.449360).
 
 ## Examples
 
 Online examples of reports generated with GRIMER: https://pirovc.github.io/grimer-reports/
 
 ## Installation
 
+Via conda
+
+```bash
+conda install -c bioconda -c conda-forge grimer
+```
+
+or locally installing only dependencies via conda:
+
 ```bash
 git clone https://github.com/pirovc/grimer.git
 cd grimer
-conda env create -f env.yaml
-conda activate grimer # source activate grimer
+conda env create -f env.yaml # or mamba env create -f env.yaml
+conda activate grimer # or source activate grimer
 python setup.py install --record files.txt # Uninstall: xargs rm -rf < files.txt
 grimer -h
 ```
-***Soon GRIMER will be available as a package in BioConda.***
 
 ## Usage
 
@@ -52,11 +59,94 @@ grimer -i input_table.tsv -m metadata.tsv -t ncbi #optional -b taxdump.tar.gz
 grimer -i input_table.tsv -m metadata.tsv -t ncbi -c config/default.yaml -d -g
 ```
 
-### List all options 
+### Analyzing any MGnify public study
+
 ```bash
-grimer -h
+./grimer-mgnify.py -i MGYS00006024 -o output_folder/
 ```
 
+## Parameters
+
+	grimer
+
+	optional arguments:
+	  -h, --help            show this help message and exit
+	  -v, --version         show program's version number and exit
+
+	required arguments:
+	  -i INPUT_FILE, --input-file INPUT_FILE
+	                        Main input table with counts (Observation table, Count table, Contingency Tables, ...) or .biom file. By default rows contain observations and columns contain
+	                        samples (use --transpose if your file is reversed). First column and first row are used as headers.
+
+	main arguments:
+	  -m METADATA_FILE, --metadata-file METADATA_FILE
+	                        Input metadata file in simple tabular format with samples in rows and metadata fields in columns. QIIME 2 metadata format is also accepted, with an extra row to
+	                        define categorical and numerical fields. If not provided and --input-file is a .biom file, will attempt to get metadata from it.
+	  -t {ncbi,gtdb,silva,greengenes,ott}, --taxonomy {ncbi,gtdb,silva,greengenes,ott}
+	                        Define taxonomy to convert entry and annotate samples. Will automatically download and parse or files can be provided with --tax-files.
+	  -b [TAX_FILES ...], --tax-files [TAX_FILES ...]
+	                        Optional specific taxonomy files to use.
+	  -r [RANKS ...], --ranks [RANKS ...]
+	                        Taxonomic ranks to generate visualizations. Use 'default' to use entries from the table directly. Default: default
+	  -c CONFIG, --config CONFIG
+	                        Configuration file with definitions of references, controls and external tools.
+
+	output arguments:
+	  -g, --mgnify          Plot MGnify chart
+	  -d, --decontam        Run and plot DECONTAM
+	  -l TITLE, --title TITLE
+	                        Title to display on the header of the report.
+	  -p [{overview,samples,heatmap,correlation} ...], --output-plots [{overview,samples,heatmap,correlation} ...]
+	                        Plots to generate. Default: overview,samples,heatmap,correlation
+	  -o OUTPUT_HTML, --output-html OUTPUT_HTML
+	                        File to output report. Default: output.html
+	  --full-offline        Embed javascript library in the output file. File will be around 1.5MB bigger but also work without internet connection. That way your report will live forever.
+
+	general data options:
+	  -f LEVEL_SEPARATOR, --level-separator LEVEL_SEPARATOR
+	                        If provided, consider --input-file to be a hierarchical multi-level table where the observation headers are separated by the indicated separator character
+	                        (usually ';' or '|')
+	  -y VALUES, --values VALUES
+	                        Force 'count' or 'normalized' data parsing. Empty to auto-detect.
+	  -w, --cumm-levels     Activate if input table already has cumulative values among levels.
+	  -s, --transpose       Transpose --input-file (if samples are listed on rows and observations on columns)
+	  -u [UNASSIGNED_HEADER ...], --unassigned-header [UNASSIGNED_HEADER ...]
+	                        Define one or more header names containing unassigned/unclassified counts.
+	  --obs-replace [OBS_REPLACE ...]
+	                        Replace values on table observations labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching
+	                        regex.
+	  --sample-replace [SAMPLE_REPLACE ...]
+	                        Replace values on table sample labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching regex.
+	  -z REPLACE_ZEROS, --replace-zeros REPLACE_ZEROS
+	                        INT (add 'smallest count'/INT to every raw count), FLOAT (add FLOAT to every raw count). Default: 1000
+	  --min-frequency MIN_FREQUENCY
+	                        Define minimum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].
+	  --max-frequency MAX_FREQUENCY
+	                        Define maximum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].
+	  --min-count MIN_COUNT
+	                        Define minimum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].
+	  --max-count MAX_COUNT
+	                        Define maximum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].
+
+	Samples options:
+	  -j TOP_OBS_BARS, --top-obs-bars TOP_OBS_BARS
+	                        Top abundant observations to show in the bars.
+
+	Heatmap and clustering options:
+	  -a TRANSFORMATION, --transformation TRANSFORMATION
+	                        none (counts), norm (percentage), log (log10), clr (centre log ratio). Default: log
+	  -e METADATA_COLS, --metadata-cols METADATA_COLS
+	                        How many metadata cols to show on the heatmap. Higher values make the plot slower to navigate.
+	  --optimal-ordering    Activate optimal_ordering on linkage, takes longer for large number of samples.
+	  --show-zeros          Do not skip zeros on heatmap. File will be bigger and interaction with heatmap slower.
+	  --linkage-methods [{single,complete,average,centroid,median,ward,weighted} ...]
+	  --linkage-metrics [{braycurtis,canberra,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,jensenshannon,kulsinski,mahalanobis,minkowski,rogerstanimoto,russellrao,seuclidean,sokalmichener,sokalsneath,sqeuclidean,wminkowski,yule} ...]
+	  --skip-dendrogram     Disable dendrogram. Will create smaller files.
+
+	Correlation options:
+	  -x TOP_OBS_CORR, --top-obs-corr TOP_OBS_CORR
+	                        Top abundant observations to build the correlation matrix, based on the avg. percentage counts/sample. 0 for all
+
 ## Powered by
 
 [<img src="https://static.bokeh.org/branding/logos/bokeh-logo.png" height="60">](https://bokeh.org)

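Note on the `--min-frequency`/`--max-frequency` and `--min-count`/`--max-count` help text in the README diff: each option accepts either a fraction (values between 0 and 1) or an absolute number (values above 1). The following is a minimal Python sketch of one way such a value could be interpreted. It is not GRIMER's actual code: the names `resolve_threshold` and `keep_by_frequency` are hypothetical, and treating exactly 1 as 100% is an assumption, since the help text leaves that boundary open.

```python
def resolve_threshold(value, total):
    """Turn a CLI threshold into an absolute number.

    Values between 0 and 1 are read as a fraction of `total`;
    values above 1 are taken as an absolute number.
    Assumption: exactly 1 is treated as a fraction (100%).
    """
    if value < 0:
        raise ValueError("threshold must be non-negative")
    if value <= 1:
        return value * total
    return value


def keep_by_frequency(present_in, n_samples, min_frequency=None, max_frequency=None):
    """Decide whether an observation found in `present_in` samples is kept."""
    if min_frequency is not None:
        if present_in < resolve_threshold(min_frequency, n_samples):
            return False
    if max_frequency is not None:
        if present_in > resolve_threshold(max_frequency, n_samples):
            return False
    return True


# An observation seen in 3 of 40 samples, with a 10% minimum frequency.
print(keep_by_frequency(3, 40, min_frequency=0.1))
# The same observation with an absolute minimum of 2 samples.
print(keep_by_frequency(3, 40, min_frequency=2))
```

Under this reading, `0.1` on 40 samples and `4` describe the same cutoff, which is why the two forms can share one option.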
diff --git a/config/default.yaml b/config/default.yaml
@@ -7,7 +7,7 @@ references:
   # "Negative Controls": "path/file1.tsv"
 
 external:
-  mgnify: "files/mgnify.tsv"
+  mgnify: "files/mgnify5989.tsv"
   decontam:
     threshold: 0.1 # [0-1] P* hyperparameter
     method: "frequency" # frequency, prevalence, combined
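
Note on the `decontam` settings above: the inline comments give the allowed range for `threshold` (`[0-1]`) and the allowed values for `method`. A minimal Python sketch of checking those two constraints follows. It assumes the YAML has already been parsed into a plain dictionary (written out by hand here), and the function name `check_decontam_settings` is hypothetical; this is not validation code from GRIMER.

```python
VALID_METHODS = {"frequency", "prevalence", "combined"}


def check_decontam_settings(config):
    """Return a list of problems found in the external.decontam section."""
    problems = []
    decontam = config.get("external", {}).get("decontam", {})

    threshold = decontam.get("threshold")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0 <= threshold <= 1:
        problems.append(f"threshold must be a number in [0, 1], got {threshold!r}")

    method = decontam.get("method")
    if method not in VALID_METHODS:
        problems.append(f"method must be one of {sorted(VALID_METHODS)}, got {method!r}")

    return problems


# Dictionary mirroring the values shown in config/default.yaml after this commit.
default_config = {
    "external": {
        "mgnify": "files/mgnify5989.tsv",
        "decontam": {"threshold": 0.1, "method": "frequency"},
    }
}
print(check_decontam_settings(default_config))
```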

diff --git a/env.yaml b/env.yaml
@@ -9,10 +9,9 @@ dependencies:
  - numpy
  - scipy>=1.6.0
  - scikit-bio>=0.5.6
- - multitax==1.1.0
+ - multitax==1.1.1
  - markdown
+ - biom-format>=2.1.10
  - r-base>=4.0.0 #DECONTAM
  - bioconductor-decontam==1.10.0 #DECONTAM
- - r-optparse==1.6.6 #DECONTAM
- - biom-format>=2.1.10 #biom
- - jsonapi-client>=0.9.7 #mgnify scripts
+ - r-optparse==1.6.6 #DECONTAM
diff --git a/files/README.md b/files/README.md
@@ -1,4 +1,4 @@
-# GRIMER References and aux. files
+# GRIMER References and other files
 
 ## Reference file format
 
@@ -27,33 +27,57 @@ references:
 
 ### contaminants.yml
 
-Last update: 2021-04-01
+Last update: 2022-03-09
 
- | Organism group | Genus | Species |
- |----------------|-------|---------|
- | Bacteria | 6 | 0 | 1998 Tanner, M.A. et al. |
- | Bacteria | 4 | 0 | 2003 Grahn, N. et al. |
- | Bacteria | 16 | 0 | 2006 Barton, H.A. et al. |
- | Bacteria | 11 | 1 | 2014 Laurence, M. et al. |
- | Bacteria | 92 | 0 | 2014 Salter, S.J. et al. |
- | Bacteria | 7 | 0 | 2015 Jervis-Bardy, J. et al. |
+Manually curated from diverse publications:
+
+ | Organism group | Genus | Species | Reference |
+ |----------------|-------|---------|-----------|
+ | Bacteria | 6 | 0 | 1998 Tanner, M.A. et al. | 
+ | Bacteria | 0 | 10 | 2002 Kulakov, L.A. et al. | 
+ | Bacteria | 4 | 0 | 2003 Grahn, N. et al. | 
+ | Bacteria | 16 | 0 | 2006 Barton, H.A. et al. | 
+ | Bacteria | 11 | 1 | 2014 Laurence, M. et al. |
+ | Bacteria | 92 | 0 | 2014 Salter, S.J. et al. | 
+ | Bacteria | 7 | 0 | 2015 Jervis-Bardy, J. et al. | 
  | Bacteria | 28 | 0 | 2015 Jousselin, E. et al. | 
- | Bacteria | 23 | 0 | 2016 Lauder, A.P. et al. |
+ | Bacteria | 77 | 127 | 2016 Glassing, A. et al. |
+ | Bacteria | 23 | 0 | 2016 Lauder, A.P. et al. | 
  | Bacteria | 6 | 0 | 2016 Lazarevic, V. et al. | 
- | Bacteria | 77 | 127 | 2016 Glassing, A. et al. |
- | Bacteria | 62 | 0 | 2017 Salter, S.J. et al. |
- | Bacteria | 0 | 122 | 2018 Kirstahler, P. et al. |
+ | Bacteria | 62 | 0 | 2017 Salter, S.J. et al. | 
+ | Bacteria | 0 | 122 | 2018 Kirstahler, P. et al. | 
+ | Bacteria | 34 | 0 | 2018 Stinson, L.F. et al. | 
+ | Bacteria | 18 | 0 | 2019 Stinson, L.F. et al. | 
+ | Bacteria | 52 | 2 | 2019 Weyrich, L.S. et al. | 
  | Bacteria | 8 | 26 | 2019 de Goffau, M.C. et al. | 
- | Bacteria | 52 | 2 | 2019 Weyrich, L.S. et al. |
- | Bacteria | 15 | 93 | 2020 Nejman D. et al. |
- | Viruses | 0 | 1 | 2015 Mukherjee, S. et al. |
- | Viruses | 0 | 1 | 2015 Kjartansdóttir, K.R. et al. |
- | Viruses | 0 | 301 | 2019 Asplund, M. et al. |
- | Total (unique) | 201 | 625 | |
+ | Bacteria | 15 | 93 | 2020 Nejman D. et al. | 
+ | Viruses | 0 | 1 | 2015 Kjartansdóttir, K.R. et al. | 
+ | Viruses | 0 | 1 | 2015 Mukherjee, S. et al. | 
+ | Viruses | 0 | 291 | 2019 Asplund, M. et al. |
+ | Eukaryota | 0 | 3 | 2016 Czurda, S. et al. | 
+ | Eukaryota | 0 | 1 | PRJNA168 |
+ | Total (unique) | 210 | 627 |  | 
 
 ### human-related.yml
 
-BacDive and eHOMD dump date: 2021-04-13
+Last update: 2022-03-09
+
+Manually curated from: Byrd, A., Belkaid, Y. & Segre, J. The human skin microbiome. Nat Rev Microbiol 16, 143–155 (2018). https://doi.org/10.1038/nrmicro.2017.157
+
+```yaml
+"Top organisms form the human skin microbiome":
+  "Bacteria":
+    url: "https://doi.org/10.1038/nrmicro.2017.157"
+    ids: [257758, 225324, 169292, 161879, 146827, 43765, 38304, 38287, 38286, 29466, 29388, 28037, 1747, 1305, 1303, 1290, 1282, 1270]
+  "Eukarya":
+    url: "https://doi.org/10.1038/nrmicro.2017.157"
+    ids: [2510778, 1047171, 379413, 119676, 117179, 76777, 76775, 76773, 44058, 41880, 36894, 34391, 31312, 5480, 5068, 3074, 2762]
+  "Viruses":
+    url: "https://doi.org/10.1038/nrmicro.2017.157"
+    ids: [185639, 746832, 10566, 493803, 10279, 746830, 746831, 46771]
+```
+
+BacDive and eHOMD specific subsets. Dump date: 2022-03-09
 
 ```bash
 scripts/bacdive_download.py
@@ -64,15 +88,15 @@ scripts/ehomd_download.py
 
 The downloaded MGnify database file should be provided in the main configuration file for grimer as follows:
 
-    external:
-      mgnify: "files/mgnify.tsv"
-
-## mgnify.tsv
+```yaml
+external:
+  mgnify: "files/mgnify5989.tsv"
+```
+### mgnify.tsv
 
-MGnify dump date: 2021-04-08 (latest study accession MGYS00005724)
+MGnify dump date: 2022-03-09 (latest study accession MGYS00005989)
 
 ```bash
-seq -f "MGYS%08g" 256 5724 | xargs -P 24 -I {} scripts/mgnify_download.py {} mgnify_dump_20210408/ > mgnify_dump_20210408.log 2>&1
-
-scripts/mgnify_extract.py -f mgnify_dump_20210408 -t 10 -o files/mgnify.tsv
+seq -f "MGYS%08g" 256 5989 | xargs -P 24 -I {} scripts/mgnify_download.py -i {} -v -g -o mgnify_dump_5989/ > mgnify_dump_5989.log 2>&1
+scripts/mgnify_extract.py -f mgnify_dump_5989 -t 10 -o files/mgnify.tsv
 ```
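
Note on the `seq -f "MGYS%08g" 256 5989` step in the download command: it produces the zero-padded MGnify study accessions that are then passed one by one to `scripts/mgnify_download.py`. A minimal Python sketch of the same accession list, for readers without `seq`; the function name `mgnify_accessions` is hypothetical and the sketch only builds the identifiers, it does not download anything.

```python
def mgnify_accessions(first, last):
    """Build MGnify study accessions from `first` to `last`, inclusive.

    Each accession is the prefix "MGYS" followed by the number
    zero-padded to 8 digits, matching the "MGYS%08g" format string.
    """
    return [f"MGYS{number:08d}" for number in range(first, last + 1)]


accessions = mgnify_accessions(256, 5989)
print(accessions[0], accessions[-1], len(accessions))
```

The last accession in the list matches the "latest study accession" stated for this dump.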