diff --git a/LICENSE b/LICENSE
index 993f608..adf1d0c 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
MIT License
-Copyright (c) 2021 pirovc.github.io
+Copyright (c) 2022 pirovc.github.io
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index 4d2e5e2..5c7e69a 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
![GRIMER](grimer/img/logo.png)
-GRIMER automates analysis and reports an offline and interactive dashboard integrating annotation, taxonomy and metadata to analyse microbiome studies and detect contamination.
+GRIMER performs analysis of microbiome data and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata.
## Examples
@@ -22,19 +22,34 @@ grimer -h
## Usage
-### Basic
+### Tab-separated input table
```bash
-grimer -i input_table.tsv -m metadata.tsv -c config/default.yaml
+grimer -i input_table.tsv
+```
+
+### BIOM file
+```bash
+grimer -i myfile.biom
+```
+
+### Tab-separated input table with taxonomically annotated observations (e.g. sk__Bacteria;k__;p__Actinobacteria;c__Actinobacteria...)
+```bash
+grimer -i input_table.tsv -f ";"
+```
+
+### Tab-separated input table with metadata
+```bash
+grimer -i input_table.tsv -m metadata.tsv
```
### With taxonomy integration (ncbi)
```bash
-grimer -i input_table.tsv -m metadata.tsv -c config/default.yaml -t ncbi #optional -b taxdump.tar.gz
+grimer -i input_table.tsv -m metadata.tsv -t ncbi #optional -b taxdump.tar.gz
```
-### With DECONTAM and MGnify annotations
+### With a configuration file to set up external tools, references and annotations
```bash
-grimer -i input_table.tsv -m metadata.tsv -c config/default.yaml -d -g
+grimer -i input_table.tsv -m metadata.tsv -t ncbi -c config/default.yaml -d -g
```
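+
+The configuration file is YAML. A minimal sketch, using only the `references` keys that appear in the bundled config/default.yaml (paths are illustrative):
+
+```yaml
+references:
+  "Contaminants": "files/contaminants.yml"
+  "Human-related": "files/human-related.yml"
+```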
### List all options
@@ -44,7 +59,7 @@ grimer -h
## Powered by
-[](https://bokeh.org)
-[](https://pandas.org)
-[](https://scipy.org)
-[](https://scikit-bio.org)
+[](https://bokeh.org)
+[](https://pandas.org)
+[](https://scipy.org)
+[](https://scikit-bio.org)
diff --git a/config/default.yaml b/config/default.yaml
index 88db139..4d28169 100644
--- a/config/default.yaml
+++ b/config/default.yaml
@@ -1,26 +1,15 @@
-sources:
- contaminants:
- "CC Bacteria": "sources/contaminants/cc_bacteria.yml"
- "CC Viruses": "sources/contaminants/cc_viruses.yml"
- "CC Eukaryota": "sources/contaminants/cc_eukaryota.yml"
- # "Custom Contaminants 1": "path/contfile1.tsv"
- # "Custom Contaminants 2": "path/contfile2.tsv"
- references:
- "Human-Related": "sources/references/human-related.yml"
- "Skin": "sources/references/skin.yml"
- "Oral": "sources/references/oral.yml"
- # "Custom References 1": "path/reffile1.tsv"
- # "Custom References 2": "path/reffile2.tsv"
-
-# samples:
-# controls:
-# "Positve Controls": "path/file1.tsv"
-# "Negative Controls": "path/file1.tsv"
+references:
+ "Contaminants": "files/contaminants.yml"
+ "Human-related": "files/human-related.yml"
+
+#controls:
+ # "Positve Controls": "path/file1.tsv"
+ # "Negative Controls": "path/file1.tsv"
external:
- mgnify: "sources/mgnify/taxa_counts_top10.tsv"
+ mgnify: "files/mgnify.tsv"
decontam:
- threshold: 0.2 # [0-1]
+ threshold: 0.1 # [0-1] P* hyperparameter
method: "frequency" # frequency, prevalence, combined
# # frequency (default: use sum of counts)
# frequency_file: "path/file1.txt"
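+  # Example (sketch): a stricter P* cutoff with the prevalence method,
+  # using only the options listed above:
+  # threshold: 0.05
+  # method: "prevalence"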
diff --git a/sources/README.md b/files/README.md
similarity index 67%
rename from sources/README.md
rename to files/README.md
index 0dc401e..5e74fba 100644
--- a/sources/README.md
+++ b/files/README.md
@@ -1,31 +1,33 @@
-# GRIMER Sources
+# GRIMER References and Auxiliary Files
-## File formats
-
-Contaminant and reference sources can be provided to grimer in two formats:
+## Reference file format
+
+References can be provided in two formats:
1) File with a list (one per line) of taxonomic identifiers or taxonomic names
-2) Formatted .yml file:
+2) a formatted `.yml` file:
- Description/Group 1:
- Description/Group 2:
- url: ""
- ids: []
+```yaml
+"General Description":
+ "Specific description":
+ url: "www.website.com?id={}"
+ ids: [1,2,3]
+```
The url can link to the entries listed in `ids`; use `{}` as a placeholder for the id. Example: `https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={}`
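+
+For the list format (1), the file simply contains one taxonomic identifier or name per line. A minimal sketch with hypothetical entries:
+
+```
+816
+817
+Escherichia coli
+```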
The files should be provided in the main configuration file for grimer as follows:
- sources:
- contaminants:
- "CUSTOM CONTAMINANTS 1": "file.txt"
- "LAB CONT": "another_file.yml"
- references:
- "Human gut": "listofnames.txt"
- "XYZ": "another.yaml"
+```yaml
+references:
+ "Contaminants": "files/contaminants.yml"
+ "Human-related": "files/human-related.yml"
+ "CUSTOM CONTAMINANTS": "file.txt"
+ "LAB RELATED BACTERIA": "another_file.yml"
+```
-## Contaminants
+### contaminants.yml
+
+Last update: 2021-04-01
| Organism group | Genus | Species |
|----------------|-------|---------|
@@ -49,20 +51,28 @@ The files should be provided in the main configuration file for grimer as follow
| Viruses | 0 | 301 | 2019 Asplund, M. et al. |
| Total (unique) | 201 | 625 | |
-## References
+### human-related.yml
+
+BacDive and eHOMD dump date: 2021-04-13
```bash
scripts/bacdive_download.py
scripts/ehomd_download.py
```
-BacDive and eHOMD dump date: 2021-04-13
+## MGnify
+
+The downloaded MGnify database file should be provided in the main configuration file for grimer as follows:
+
+ external:
+ mgnify: "files/mgnify.tsv"
+
+### mgnify.tsv
-## MGNify
+MGnify dump date: 2021-04-08 (latest study accession MGYS00005724)
```bash
seq -f "MGYS%08g" 256 5724 | xargs -P 24 -I {} scripts/mgnify_download.py {} mgnify_dump_20210408/ > mgnify_dump_20210408.log 2>|1 |
-scripts/mgnify_extract.py -f mgnify_dump_20210408 -t 10 -o taxa_counts_top10.tsv
+scripts/mgnify_extract.py -f mgnify_dump_20210408 -t 10 -o files/mgnify.tsv
```
-MGnify dump date 2021-04-08 (latest study accession MGYS00005724)
\ No newline at end of file
diff --git a/sources/contaminants/cc_bacteria.yml b/files/contaminants.yml
similarity index 55%
rename from sources/contaminants/cc_bacteria.yml
rename to files/contaminants.yml
index 593f500..4b77536 100644
--- a/sources/contaminants/cc_bacteria.yml
+++ b/files/contaminants.yml
@@ -4,7 +4,7 @@
ids: [407, 335058, 504481, 1747, 40324, 1033, 38304, 28037, 470, 29448, 1828, 92793, 75, 375, 180282, 851, 301302, 853, 816, 33870, 85698, 87883, 147207, 68909, 1043493, 293256, 1134405, 410, 321895, 432308, 1416628, 1314, 1343, 69359]
"2018 Kirstahler, P. et al.":
url: "http://doi.org/10.1038/s41598-018-22416-4"
- ids: [1747, 40324, 470, 29448, 294, 1828, 39491, 40214, 28090, 134533, 108981, 202956, 239935, 28117, 28116, 818, 820, 161879, 33011, 80866, 853, 823, 821, 296, 303, 34073, 1050843, 1055192, 106648, 1075768, 1076, 1112209, 1131812, 1150298, 1159870, 1160721, 1198452, 1217692, 1276755, 1320556, 134534, 1353941, 136273, 1435036, 147645, 1492737, 1492738, 1497615, 1504823, 1509403, 1519439, 1538644, 1619232, 162426, 1646498, 165179, 1654716, 1665556, 1678129, 169292, 1706231, 1714344, 172088, 1736272, 1736280, 1736296, 1736316, 1736528, 1736532, 1740090, 1833, 1835254, 192843, 202952, 202954, 211589, 216465, 245, 246602, 246787, 247, 266749, 2702, 2736, 285, 28901, 29536, 310297, 310298, 33010, 34062, 346179, 362413, 362418, 370974, 38303, 387661, 40520, 418240, 46506, 47920, 503361, 50340, 52133, 529884, 53412, 55197, 55508, 5665, 64974, 70863, 75659, 756892, 76773, 80878, 80882, 86182, 96345, 986, 989370, 991, 99158]
+ ids: [1747, 40324, 470, 29448, 294, 1828, 39491, 40214, 28090, 134533, 108981, 202956, 239935, 28117, 28116, 818, 820, 161879, 33011, 80866, 853, 823, 821, 296, 303, 34073, 2559073, 1055192, 106648, 1075768, 1076, 1112209, 1131812, 1150298, 1159870, 1160721, 1198452, 1217692, 1276755, 1320556, 134534, 1353941, 136273, 1435036, 147645, 1492737, 1492738, 1497615, 1504823, 1509403, 1519439, 1538644, 1619232, 162426, 1646498, 165179, 1654716, 1665556, 1678129, 169292, 1706231, 1714344, 172088, 1736272, 1736280, 1736296, 1736316, 1736528, 1736532, 1740090, 1833, 1835254, 192843, 202952, 202954, 211589, 216465, 245, 246602, 246787, 247, 266749, 2702, 2736, 285, 28901, 29536, 310297, 310298, 33010, 34062, 346179, 362413, 362418, 370974, 38303, 387661, 40520, 418240, 46506, 47920, 503361, 50340, 52133, 529884, 53412, 55197, 55508, 5665, 64974, 70863, 75659, 756892, 76773, 80878, 80882, 86182, 96345, 986, 989370, 991, 99158]
"2015 Jervis-Bardy, J. et al.":
url: "http://doi.org/10.1186/s40168-015-0083-8"
ids: [286, 48736, 59732, 335058, 41275, 28100, 34072]
@@ -44,3 +44,29 @@
"2017 Salter, S.J. et al.":
url: "http://doi.org/10.1371/journal.pntd.0005975"
ids: [50709, 299566, 1375, 2040, 507, 31988, 165779, 161492, 150247, 92793, 374, 55080, 1696, 41275, 369926, 32008, 194, 2717, 75, 10, 59732, 1716, 37914, 231454, 423604, 212791, 117563, 963, 1004300, 682522, 1357, 149698, 906, 68287, 407, 33882, 1839, 528, 376469, 84567, 335058, 28100, 838, 286, 83618, 48736, 379, 1835, 45669, 22, 28453, 13687, 40323, 1054211, 13275, 33057, 157, 213484, 29465, 1827, 265, 1386]
+ "2018 Stinson, L.F. et al.":
+ url: "http://doi.org/10.3389/fmicb.2018.00270"
+ ids: [1696, 1716, 43668, 37914, 1269, 32207, 1743, 836, 838, 1016, 308865, 1386, 2755, 1279, 66831, 1350, 1578, 1301, 29465, 374, 407, 434, 165696, 13687, 283, 80865, 93681, 48736, 570, 713, 469, 212791, 286, 40323]
+ "2019 Stinson, L.F. et al.":
+ url: "http://doi.org/10.1111/lam.13091"
+ ids: [561, 335058, 407, 13687, 407, 374, 165696, 222, 1716, 547, 48736, 1004302, 1827, 1743, 1269, 204456, 106589, 1678]
+ "2002 Kulakov, L.A. et al.":
+ url: "http://doi.org/10.1128/AEM.68.4.1548-1555.2002"
+ ids: [329, 376, 239, 36773, 69392, 1785, 1409, 304, 28214, 294]
+"Common Viral contaminants":
+ "2019 Asplund, M. et al.":
+ url: "http://doi.org/10.1016/j.cmi.2019.04.028"
+ ids: [12071, 742919, 11103, 31647, 1678143, 10298, 10376, 10359, 11676, 129951, 10583, 31552, 10798, 11908, 585044, 518981, 1225745, 11620, 1891767, 493803, 11033, 159150, 35306, 68887, 11870, 11958, 11861, 11946, 11864, 363745, 363020, 242521, 11866, 11960, 31668, 31669, 31670, 11867, 11955, 11874, 11876, 11878, 11885, 36381, 11886, 11888, 269447, 269448, 11950, 11948, 1332312, 354090, 11884, 1352534, 1395610, 1395611, 1395612, 1395613, 1395614, 1395615, 1395616, 1395617, 1395618, 1395619, 1395620, 1341019, 11801, 11809, 1511763, 1394983, 697906, 1072204, 1148801, 1574422, 12104, 763552, 10264, 85708, 759804, 28344, 85506, 33747, 10345, 285986, 220638, 1154691, 185638, 1169627, 1045778, 185636, 72201, 345198, 176652, 1301280, 68347, 1618248, 1618254, 10288, 198112, 1454023, 1454024, 1454025, 1278278, 1278246, 1278252, 1278247, 1278248, 1278249, 1278250, 1278251, 399781, 1278255, 346932, 1278261, 1278263, 1278265, 1474867, 1379694, 1521385, 1521387, 1521389, 938081, 938082, 880162, 251749, 455370, 169864, 1379788, 1608440, 642253, 642255, 1224510, 1592207, 1592212, 1592083, 1592085, 1592086, 1592088, 1592093, 1592095, 1592096, 1592081, 1843761, 1519405, 1557033, 1608451, 664785, 1435438, 1170653, 40979, 12235, 12138, 11987, 51680, 12056, 146500, 554168, 212035, 1269028, 693272, 1420594, 1094892, 1128140, 1235314, 1128143, 1128151, 1128131, 1450746, 1461100, 181522, 1424633, 1010698, 1299317, 1450749, 1416631, 1128422, 1034806, 1592112, 1592113, 1592127, 938080, 1074214, 1519385, 1519387, 1519389, 1519390, 1519395, 1519396, 1519397, 186617, 1262072, 1407671, 743583, 340016, 745107, 745102, 745100, 1416009, 1187128, 889876, 760732, 1243183, 1229760, 1481186, 1505225, 1560342, 233894, 115987, 260149, 227470, 926067, 1127514, 1296654, 294382, 1486657, 1084719, 10756, 1486662, 1285382, 1497851, 1127515, 145579, 263375, 764562, 1133292, 1133022, 242527, 260373, 279280, 644524, 242861, 1132026, 1357714, 1197951, 1327981, 1327976, 1327979, 1327992, 1328030, 1327990, 1327980, 1327972, 1327982, 1327995, 1327983, 1327970, 1327971, 756279, 1327977, 1327993, 1328029, 1327975, 1327974, 1327985, 756280, 756282, 1527524, 1540094, 1042123, 541865, 1567016, 765765, 1176422, 1327037, 1162295, 1141135, 1141136, 335924, 536444, 929832, 682650, 1137745, 536473, 749413, 1477406, 1048515, 1048516, 1048517, 1048520, 1048521, 1537091, 1264700, 1609634, 1455074, 414970, 10863, 10864, 1222338, 1147148, 1237364, 1414766, 1977402, 948870, 1524881, 10665, 10760, 1147094, 1429767, 925983, 925984, 1527519, 1527506, 1229753, 1540097, 1540098, 1054461, 1391223, 294631, 1325731, 908819, 1458858, 1458842, 90963, 1536592, 1527515, 551895, 1129191, 139872, 201847, 287412, 1262517, 754044, 1385658, 1176423, 889949, 446529, 1034128, 1056830, 1089119, 1486472, 1034111, 205879, 1340709, 1567475, 1472912, 1204539, 1399915, 1283076, 1283077, 1168479, 1168478, 440250, 400567, 994601, 1465639, 889956, 445700, 444862, 536454, 445688, 444861, 1229794, 1229793, 1229792, 1229791, 1229790, 1229789, 1229786, 1229787, 1229788, 1229784, 1229782, 376758, 1498188, 504501, 504553, 1235647, 1235648, 1235649, 1235650, 1235653, 1235654, 1235655, 1235656, 1235657, 877240, 754052, 1316739, 347326, 1235689, 31535, 757342, 582345, 1462581, 386793, 1204517, 347327, 1335230, 743813, 1348912, 1327964, 270673, 188350, 1541891, 169683, 998086, 1500757, 1458843, 1129146, 1279082, 1114179, 1548900, 1231048, 1548901, 1449437, 1548918, 1476390, 462590, 754048, 948071, 1481785, 1417599, 1131316, 691965, 136084, 754067, 1161935, 1173749, 1173761, 1173759, 
1173762, 590739, 1406795, 1141134, 1204529, 1540099, 1168549, 866889, 1458859, 1458860, 1458861, 10761, 754060, 1524882, 1357423, 373126, 1150991, 1195080, 320843, 55510, 1434319, 320850, 369581, 537874, 1208587, 1566990, 10732, 490913, 1526550, 1340810, 756277, 753084, 753085, 756275, 1026955, 1340812, 238854, 555387, 754042, 444860, 981335, 469660, 215796, 1478972, 1385659, 926697, 336724, 278008, 1211417, 271647, 754075, 573173, 573174, 979525, 979534, 1529058, 1283071, 573176, 1589298, 1076759, 1461743, 1150989, 754058, 754051, 929835, 1414739, 754072, 1524880, 194802, 1168281, 1204514, 1188795, 331278]
+ "2015 Mukherjee, S. et al.":
+ url: "http://doi.org/10.1186/1944-3277-10-18"
+ ids: [10847]
+ "2015 Kjartansdóttir, K.R. et al.":
+ url: "https://doi.org/10.1073/pnas.1423756112"
+ ids: [322019]
+"Common Eukaryotic contaminants":
+ "PRJNA168":
+ url: "https://www.ncbi.nlm.nih.gov/genome/guide/human/"
+ ids: [9606]
+ "2016 Czurda, S. et al.":
+ url: "https://doi.org/10.1128/JCM.02112-15"
+ ids: [1895944, 76775, 5308]
diff --git a/sources/references/oral.yml b/files/human-related.yml
similarity index 72%
rename from sources/references/oral.yml
rename to files/human-related.yml
index 22e21dd..d994ffa 100644
--- a/sources/references/oral.yml
+++ b/files/human-related.yml
@@ -1,3 +1,16 @@
+"Human-related bacterial isolates from BacDive":
+ "Limbs":
+ url: "https://bacdive.dsmz.de/search?search=taxid:{}"
+ ids: [326522, 82380, 1701, 131110, 29388, 84698, 729, 69968, 28264, 200476, 58172, 490, 1280, 1290, 41276, 220685, 28449, 644, 1660, 147645, 1351, 90367, 391, 1717, 1720, 1314, 646, 29391, 732, 1365628, 90245, 495, 192066, 753, 1979962, 82633, 53363, 539, 37637, 37329, 755171, 29466, 291112, 614, 28090, 1402, 217204, 1509, 326523, 2014, 1303, 28038, 676, 105219, 13076, 66228, 504, 28189, 752, 108980, 1747, 1282, 1504, 33028, 303, 672, 28035, 1286, 485, 1311, 28188, 28132, 1328, 1506, 652, 29380, 760, 46124, 1379, 755172, 193461, 158822, 479, 68892, 479117, 33889, 670, 420404, 1305, 1697053, 71254, 310300, 47920, 669, 1245, 38289, 36740, 354351, 48296, 29318, 192, 38313, 180332, 135487, 33007, 287, 754, 29317, 1648, 1713, 1352, 550, 53437, 2054, 38284, 1667, 511, 1015, 40091, 59561, 411577, 587, 370622, 206506, 37326, 90239, 161902, 137732, 52132, 34105, 180588, 33968, 386414, 283734, 1891233, 478, 156979, 28125, 1529, 306, 123899, 220687, 620903, 1239307, 1348, 316, 28091, 178214, 84112, 44737, 487, 1536, 1273, 24, 630, 1034, 322095, 488730, 70348, 650, 43765, 43770, 39791, 115545, 150055, 411570, 196, 131111, 472, 38301, 51671, 292, 146827, 1785, 1977869, 40542, 29432, 28450, 1890675, 47312, 38875, 1710, 739, 47917, 33010, 1292, 169292, 158877, 1781, 400946, 501496, 488, 239, 361500, 470, 1430326, 29354, 82347, 65058, 714, 521520, 38303, 1513, 502790, 747, 1141657, 38304]
+ "Ear":
+ url: "https://bacdive.dsmz.de/search?search=taxid:{}"
+ ids: [1747, 2702, 1869190, 32002, 85698, 131111, 29388, 28037, 44750, 51671, 1353, 28264, 545, 292, 89093, 1872515, 1280, 511, 29379, 68766, 59561, 29321, 480, 1311, 285091, 727, 199591, 43263, 1313, 739, 760, 1661, 52769, 1421, 1314, 156979, 35703, 1898, 585, 87883, 90245, 123899, 306, 1895474, 670, 47770, 319939, 184870, 134375, 72557, 753, 663, 316, 1343, 217203, 267212, 678, 53364, 1014, 1776741, 93220, 1639, 666, 38313, 1652, 105219, 38287, 293, 33007, 287]
+ "Eye":
+ url: "https://bacdive.dsmz.de/search?search=taxid:{}"
+ ids: [40216, 28037, 1280, 490, 147645, 1351, 90367, 1824, 813, 1720, 38290, 29391, 732, 192066, 616, 161879, 753, 1304, 1655, 539, 37329, 28172, 161890, 90241, 504, 752, 253, 457921, 1871047, 1309, 154288, 280147, 485, 760, 46124, 1931, 1379, 29394, 1671023, 68892, 479, 1396, 1544416, 2035, 420404, 735, 47846, 666, 571, 2055, 1401, 1270, 34062, 545, 38284, 247, 498, 40091, 59561, 370622, 37326, 727, 945844, 1313, 180588, 1685, 1671022, 478, 1302, 134375, 477, 726, 47478, 197575, 207340, 38287, 650, 756689, 43765, 69392, 723, 72556, 187491, 472, 51671, 2047, 1177728, 46125, 29432, 480, 47312, 739, 134533, 740, 37330, 488, 1544413, 239, 483, 29354, 41202, 38304]
+ "Nose":
+ url: "https://bacdive.dsmz.de/search?search=taxid:{}"
+ ids: [59823, 72556, 131111, 1282, 28264, 38284, 1280, 520, 43990, 615, 727, 1328, 1313, 90367, 760, 181487, 29394, 478, 732, 40324, 33889, 306, 39950, 1304, 1673725, 65058, 74319, 1591, 90241, 105219, 504, 286802, 195105, 574]
"Human-related bacterial isolates from BacDive":
"Oral":
url: "https://bacdive.dsmz.de/search?search=taxid:{}"
@@ -15,4 +28,17 @@
"Nasal":
url: "http://www.ehomd.org/?name=HOMD"
ids: [553601, 406557, 497962, 1715211, 170187, 553573, 196620, 93062, 553594, 406559, 1008452, 1608882, 488223, 1203632, 857577, 727, 869269, 478, 553568, 760810, 1739522, 548470, 857573, 281310, 455227, 521004, 359787, 1095736, 374931, 1069623, 585203, 1715217, 497980, 585202, 1715123, 1069628, 553580, 1203562, 1834153, 1203627, 866630, 1415766, 553583, 158879, 760787, 548475, 453361, 406561, 451516, 869215, 553574, 553590, 282458, 1608898, 456482, 374928, 553592, 553588, 1069626, 452948, 480, 1130804, 1095746, 453362, 375177, 406556, 71421, 273036, 451515, 548474, 521005, 374930, 406560, 1035187, 1203619, 553565, 406562, 1095745, 1859695, 1069625, 857578, 585204, 1739317, 862964, 1340484, 681288, 1239793, 487213, 857581, 497963, 760791, 857574, 374933, 1739280, 869216, 406563, 453366, 574093, 516950, 1415765, 453363, 262727, 857575, 453364, 1203625, 548473, 1340485, 406558, 703339, 760746, 1340486, 656912, 189423, 1239792, 512767, 857572, 857579, 760834, 760861, 1203622, 585161, 1203557, 1203566, 373153, 1203559, 1095735, 1203561, 656913, 886289, 262728, 488221, 553571, 869309, 553577, 171101, 375432, 359786, 857576, 1236608, 1095737, 1121367, 375063, 888828, 374927, 1203624, 365659, 525381, 760809, 512769, 418127, 595501, 246201, 512566, 546342, 158878, 883103, 488222, 1008453, 857571, 1739254, 487214, 453365, 1739452, 90241, 553567, 28037, 512768, 553581, 426430, 553596, 93061, 935897, 450394, 282459, 561276, 374932, 862965]
-
+"Human-related bacterial isolates from BacDive":
+ "Skin/Nail/Hair":
+ url: "https://bacdive.dsmz.de/search?search=taxid:{}"
+ ids: [1747, 106654, 1986155, 59823, 1869190, 1270, 71999, 1283, 1276, 131110, 1648, 1656, 472, 1352, 34062, 729, 29388, 2047, 28264, 1314, 1280, 1290, 672, 59561, 1780, 33918, 37326, 29432, 1286, 1891644, 74703, 90367, 1931, 33010, 1720, 1965292, 181487, 169292, 38290, 29506, 1622, 281920, 1292, 1781, 861, 1698, 1260, 2035, 202789, 521392, 470, 663, 29382, 1659, 1288, 37923, 1655, 45254, 1753, 1261, 38289, 36740, 1273, 1347368, 33034, 1347369, 1282, 66228, 132933, 43765, 287]
+"Top organisms form the human skin microbiome" :
+ "Bacteria":
+ url: "https://doi.org/10.1038/nrmicro.2017.157"
+ ids: [257758, 225324, 169292, 161879, 146827, 43765, 38304, 38287, 38286, 29466, 29388, 28037, 1747, 1305, 1303, 1290, 1282, 1270]
+ "Eukarya":
+ url: "https://doi.org/10.1038/nrmicro.2017.157"
+ ids: [2510778, 1047171, 379413, 119676, 117179, 76777, 76775, 76773, 44058, 41880, 36894, 34391, 31312, 5480, 5068, 3074, 2762]
+ "Viruses":
+ url: "https://doi.org/10.1038/nrmicro.2017.157"
+ ids: [185639, 746832, 10566, 493803, 10279, 746830, 746831, 46771]
diff --git a/sources/mgnify/taxa_counts_top10.tsv b/files/mgnify.tsv
similarity index 100%
rename from sources/mgnify/taxa_counts_top10.tsv
rename to files/mgnify.tsv
diff --git a/grimer-mgnify.py b/grimer-mgnify.py
new file mode 100755
index 0000000..1bfa519
--- /dev/null
+++ b/grimer-mgnify.py
@@ -0,0 +1,42 @@
+#!/usr/bin/env python3
+
+import scripts.mgnify_download
+import grimer.grimer
+import argparse
+import os
+import glob
+import sys
+
+parser = argparse.ArgumentParser(description='grimer-mgnify')
+parser.add_argument('-i', '--mgnify-study-accession', required=True, type=str, help="MGnify study accession (e.g. MGYS00002462)")
+parser.add_argument('-g', '--grimer-params', type=str, help="Extra params for grimer")
+parser.add_argument('-o', '--output-prefix', type=str, help="Output prefix for files and report")
+args = parser.parse_args()
+
+if args.output_prefix:
+ prefix = args.output_prefix
+else:
+ prefix = args.mgnify_study_accession
+
+# download files
+print("Downloading files for study accession " + args.mgnify_study_accession)
+scripts.mgnify_download.main(['-i', args.mgnify_study_accession, '-o', prefix, '-v'])
+
+# Locate downloaded abundance tables and sort them by size ASC (the largest is used)
+files = sorted(filter(os.path.isfile, glob.glob(prefix + '*taxonomy_abundances*')),
+               key=lambda x: os.stat(x).st_size)
+md = glob.glob(prefix + '*_metadata.tsv*')
+if not files or not md:
+    sys.exit("Could not find downloaded table/metadata files for " + args.mgnify_study_accession)
+
+if args.grimer_params:
+ grimer_params = args.grimer_params.split(" ")
+else:
+ grimer_params = []
+grimer.grimer.main(["-i", files[-1],
+ "-m", md[-1],
+ "-c", 'config/default.yaml',
+ "-f", ";",
+ "--obs-replace", "^.+__", "", "_", " ",
+ "-r", "superkingdom", "kingdom", "phylum", "class", "order", "family", "genus", "species",
+ "-t", "ncbi",
+ "-o", prefix + ".html",
+ "--title", "MGnify study accession " + args.mgnify_study_accession,
+ ] + grimer_params)
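+
+# Example invocation (hypothetical accession; the output prefix defaults to it):
+#   ./grimer-mgnify.py -i MGYS00002462
+# Extra options can be forwarded to grimer, e.g. --grimer-params="-d" to also run DECONTAM.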
diff --git a/grimer/callbacks.py b/grimer/callbacks.py
index 3689965..5db1d14 100644
--- a/grimer/callbacks.py
+++ b/grimer/callbacks.py
@@ -2,19 +2,21 @@
def link_obstable_samplebars(ele,
- cds_p_obstable,
+ cds_m_obstable,
cds_p_samplebars,
cds_d_samples,
- cds_d_sampleobs,
+ dict_d_sampleobs,
cds_d_metadata,
cds_p_decontam,
cds_p_decontam_models,
cds_d_decontam,
+ cds_p_references,
active_ranks,
min_obs_perc,
max_total_count,
cds_p_mgnify,
- dict_d_refs):
+ dict_d_refs,
+ dict_d_taxname):
bar_select_callback = CustomJS(
args=dict(y1_select=ele["samplebars"]["wid"]["y1_select"],
@@ -25,14 +27,13 @@ def link_obstable_samplebars(ele,
y_range=ele["samplebars"]["fig"].y_range,
max_total_count=max_total_count),
code='''
- console.log("bar_select_callback");
const pdata = cds_p_samplebars.data;
const ddata = cds_d_samples.data;
const total = cds_d_samples.data["cnt|total"];
const key = "cnt|" + annotbar_rank_select.value + "|" + annotbar_select.value;
- console.log(key);
+
if (y1_select.value=="%"){
for (var i = 0; i < total.length; ++i) {
pdata['bar|selected'][i] = (ddata[key][i]/total[i])*100;
@@ -56,14 +57,12 @@ def link_obstable_samplebars(ele,
change_y_obs_label_callback = CustomJS(
args=dict(yaxis=ele["samplebars"]["fig"].yaxis[1]),
code='''
- console.log("change_y_obs_label_callback");
yaxis.axis_label = this.value + " observations";
''')
change_y_counts_label_callback = CustomJS(
args=dict(yaxis=ele["samplebars"]["fig"].yaxis[0]),
code='''
- console.log("change_y_counts_label_callback");
yaxis.axis_label = this.value + " counts";
''')
@@ -76,19 +75,23 @@ def link_obstable_samplebars(ele,
cds_d_metadata=cds_d_metadata,
samplebars=ele["samplebars"]["fig"]),
code='''
- console.log("sort_groupby_callback");
- const samples = cds_d_samples.data["index"];
-
// Define value from Sort by select
var sort_col;
+ var annot_samples = cds_d_samples.data["index"];
if (sort_select.value=="input_order"){
sort_col = cds_d_samples.data["aux|input_order"];
- }else if (sort_select.value=="#"){
+ }else if (sort_select.value=="counts"){
sort_col = cds_d_samples.data["cnt|total"];
}else if (sort_select.value=="selected_annotation"){
sort_col = cds_p_samplebars.data["bar|selected"];
}else if (sort_select.value.startsWith("metadata_num|")){
sort_col = cds_d_metadata.data[sort_select.value.replace('metadata_num|','')];
+
+                // Annotate label with value
+                annot_samples = annot_samples.map(function(s, i) {
+                    return sort_col[i] + " | " + s;
+                });
+
}else if (sort_select.value.startsWith("tax|")){
sort_col = cds_p_samplebars.data[sort_select.value];
}
@@ -102,7 +105,7 @@ def link_obstable_samplebars(ele,
// Zip sample index and metadata field to create nested factors
factors = groupby_col1.map(function(m, i) {
- return [m, samples[i]];
+ return [m, annot_samples[i]];
});
// second grouping level
@@ -111,17 +114,17 @@ def link_obstable_samplebars(ele,
var groupby_col2 = cds_d_metadata.data[groupby2_select.value.replace('metadata_cat|','')];
factors = groupby_col2.map(function(m, i) {
- return [m, groupby_col1[i], samples[i]];
+ return [groupby_col1[i], m, annot_samples[i]];
});
- sorted_factors = grimer_sort(factors, sort_col, "numeric", false, groupby_col1, groupby_col2);
+ sorted_factors = grimer_sort(factors, sort_col, "numeric", false, groupby_col2, groupby_col1);
}else{
sorted_factors = grimer_sort(factors, sort_col, "numeric", false, groupby_col1);
}
}else{
// Single factors, just use the sample index
- factors = samples;
+ factors = annot_samples;
sorted_factors = grimer_sort(factors, sort_col, "numeric", false);
}
@@ -136,39 +139,38 @@ def link_obstable_samplebars(ele,
args=dict(y2_select=ele["samplebars"]["wid"]["y2_select"],
cds_p_samplebars=cds_p_samplebars,
cds_d_samples=cds_d_samples,
- cds_p_obstable=cds_p_obstable,
- cds_d_sampleobs=cds_d_sampleobs,
- samplebars=ele["samplebars"]["fig"],
+ cds_m_obstable=cds_m_obstable,
+ dict_d_sampleobs=dict_d_sampleobs,
y_range=ele["samplebars"]["fig"].extra_y_ranges['obs'],
min_obs_perc=min_obs_perc,
max_total_count=max_total_count,
active_ranks=active_ranks),
code='''
- console.log("plot_obs_callback");
// get selected row from obstable [0 to get just the first]
- var row = cds_p_obstable.selected.indices[0];
+ var row = cds_m_obstable.selected.indices[0];
if (row!=undefined){
// get totals
const total = cds_d_samples.data["cnt|total"];
// for each rank
for(let r = 0; r < active_ranks.length; r++){
+ // get rank
+ let rank = active_ranks[r];
// get taxid of the rank
- let rank_taxid = cds_p_obstable.data["tax|"+active_ranks[r]][row];
+ let taxid = cds_m_obstable.data["tax|"+rank][row];
// for each sample
- for (var i = 0; i < cds_d_sampleobs.length; i++) {
+ for (var i = 0; i < cds_d_samples.length; i++) {
+ let sample = cds_d_samples.data["index"][i];
let val = 0;
- // if taxid for the rank exists, [transform and] copy values over to the cds_p_samplebars
- if (rank_taxid){
- val = cds_d_sampleobs.data[rank_taxid][i];
+ // if taxid exists in the lineage, [transform and] copy values over to the cds_p_samplebars
+ if (taxid){
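+                        // dict_d_sampleobs is a nested mapping: rank -> taxid -> sample -> count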
+ val = dict_d_sampleobs[rank][taxid][sample];
if(val>0){
- if (y2_select.value=="#"){
- val = cds_d_sampleobs.data[rank_taxid][i];
- }else if (y2_select.value=="%"){
- val = (cds_d_sampleobs.data[rank_taxid][i]/total[i])*100;
+ if (y2_select.value=="%"){
+ val = (val/total[i])*100;
}else if (y2_select.value=="log10(%)"){
- val = Math.log10((cds_d_sampleobs.data[rank_taxid][i]/total[i])*100);
+ val = Math.log10((val/total[i])*100);
}else if (y2_select.value=="log10(#)"){
- val = Math.log10(cds_d_sampleobs.data[rank_taxid][i]);
+ val = Math.log10(val);
}
}
}
@@ -195,20 +197,28 @@ def link_obstable_samplebars(ele,
''')
change_text_legend_obs_callback = CustomJS(
- args=dict(cds_p_obstable=cds_p_obstable,
+ args=dict(cds_m_obstable=cds_m_obstable,
legend_obs=ele["samplebars"]["legend_obs"],
+ samplebars=ele["samplebars"]["fig"],
active_ranks=active_ranks),
code='''
- console.log("change_text_legend_obs_callback");
// selected row
- var row = cb_obj.indices[0];
+ const row = cb_obj.indices[0];
+ const selected_rank = cds_m_obstable.data['col|rank'][row];
+
for(let r = 0; r < active_ranks.length; r++){
- let taxid = cds_p_obstable.data["tax|"+active_ranks[r]][row];
+ let taxid = cds_m_obstable.data["tax|"+active_ranks[r]][row];
if (taxid){
- legend_obs.items[r].label = active_ranks[r] + "|" + cds_p_obstable.data['col|name'][cds_p_obstable.data['index'].indexOf(taxid)];
+ legend_obs.items[r].label = active_ranks[r] + "|" + cds_m_obstable.data['col|name'][cds_m_obstable.data['index'].indexOf(taxid)];
}else{
legend_obs.items[r].label = active_ranks[r];
}
+            // ensure the renderer of the selected rank is visible (others keep their state)
+            if(active_ranks[r]==selected_rank){
+                samplebars.renderers[r+3].visible=true;
+            }
}
''')
@@ -217,32 +227,57 @@ def link_obstable_samplebars(ele,
annotbar_rank_select=ele["samplebars"]["wid"]["annotbar_rank_select"],
legend_bars=ele["samplebars"]["legend_bars"]),
code='''
- console.log("change_text_legend_bars_callback");
legend_bars.items[0].label = annotbar_rank_select.value + "|" + annotbar_select.value;
''')
load_infopanel = CustomJS(
args=dict(infopanel=ele["infopanel"]["textarea"],
- cds_p_obstable=cds_p_obstable,
- dict_d_refs=dict_d_refs),
+ cds_m_obstable=cds_m_obstable,
+ dict_d_refs=dict_d_refs,
+ dict_d_taxname=dict_d_taxname,
+ active_ranks=active_ranks),
code='''
- console.log("load_infopanel");
-
// selected row
var row = cb_obj.indices[0];
- const name = cds_p_obstable.data['col|name'][row];
- const rank = cds_p_obstable.data['col|rank'][row];
- const taxid = cds_p_obstable.data['index'][row];
+
+ const name = cds_m_obstable.data['col|name'][row];
+ const rank = cds_m_obstable.data['col|rank'][row];
+ const taxid = cds_m_obstable.data['index'][row];
var text = "";
- text+="Name: " + name;
+ text+="[ Obs ]";
+ text+="\\n";
+ text+=name;
+ if(taxid!=name){
+ text+="\\n";
+ text+="taxid: " + taxid;
+ }
text+="\\n";
- text+="Id: " + taxid;
+ text+="[ Rank ]";
text+="\\n";
- text+="Rank: " + rank;
+ text+=rank;
text+="\\n";
- text+="\\n"
+ var lineage = "";
+ for(let r = 0; r < active_ranks.length; r++){
+ var obs_lin = cds_m_obstable.data["tax|" + active_ranks[r]][row];
+ if(taxid!=name){
+ if(dict_d_taxname[obs_lin])
+ lineage+=dict_d_taxname[obs_lin]+" | ";
+ else
+ lineage+=" | ";
+ }else{
+ lineage+=obs_lin+" | ";
+ }
+ if(active_ranks[r]==rank)
+ break;
+ }
+ text+="[ Lineage ]";
+ text+="\\n";
+ text+=lineage.slice(0, -3);
+ text+="\\n";
+
+            text+="\\n";
for (var source in dict_d_refs[taxid]){
text+="[ "+source+" ]";
text+="\\n";
@@ -262,20 +297,25 @@ def link_obstable_samplebars(ele,
decontam_callback = CustomJS(
args=dict(cds_d_samples=cds_d_samples,
- cds_p_obstable=cds_p_obstable,
- cds_d_sampleobs=cds_d_sampleobs,
+ cds_m_obstable=cds_m_obstable,
+ dict_d_sampleobs=dict_d_sampleobs,
cds_p_decontam=cds_p_decontam,
cds_p_decontam_models=cds_p_decontam_models,
cds_d_decontam=cds_d_decontam,
- pvalue_input=ele["decontam"]["wid"]["pvalue_input"]),
+ pscore_input=ele["decontam"]["wid"]["pscore_input"]),
code='''
- console.log("decontam_callback");
// selected row
const row = cb_obj.indices[0];
- const taxid = cds_p_obstable.data["index"][row];
+ const taxid = cds_m_obstable.data["index"][row];
+ const rank = cds_m_obstable.data["col|rank"][row];
const total = cds_d_samples.data["cnt|total"];
- for(let i = 0; i < cds_p_decontam.data["counts"].length; i++){
- cds_p_decontam.data["counts"][i] = cds_d_sampleobs.data[taxid][i]/total[i];
+ for(let i = 0; i < cds_p_decontam.length; i++){
+ let sample = cds_p_decontam.data["index"][i];
+ if (dict_d_sampleobs[rank][taxid][sample]!=undefined){
+ cds_p_decontam.data["counts"][i] = dict_d_sampleobs[rank][taxid][sample]/total[i];
+ }else{
+ cds_p_decontam.data["counts"][i] = 0;
+ }
}
cds_p_decontam.change.emit();
@@ -283,11 +323,11 @@ def link_obstable_samplebars(ele,
if (lines!=undefined){
cds_p_decontam_models.data["y_cont"] = [lines[0], lines[1]];
cds_p_decontam_models.data["y_noncont"] = [lines[2], lines[2]];
- pvalue_input.value = lines[3].toString();
+ pscore_input.value = lines[3].toString();
}else{
cds_p_decontam_models.data["y_cont"] = [];
cds_p_decontam_models.data["y_noncont"] = [];
- pvalue_input.value = "";
+ pscore_input.value = "";
}
cds_p_decontam_models.change.emit();
''')
@@ -296,15 +336,14 @@ def link_obstable_samplebars(ele,
args=dict(mgnify_fig=ele["mgnify"]["fig"],
biome_spinner=ele["mgnify"]["wid"]["biome_spinner"],
mgnify_filter=ele["mgnify"]["filter"],
- cds_p_obstable=cds_p_obstable,
+ cds_m_obstable=cds_m_obstable,
cds_p_mgnify=cds_p_mgnify),
code='''
- console.log("mgnify_callback");
// selected row
- const row = cds_p_obstable.selected.indices[0];
+ const row = cds_m_obstable.selected.indices[0];
const indices = [];
if (row!=undefined){
- const taxid = cds_p_obstable.data["index"][row];
+ const taxid = cds_m_obstable.data["index"][row];
for(let i = 0; i < cds_p_mgnify.length; i++){
if(cds_p_mgnify.data["taxa"][i]==taxid &&
cds_p_mgnify.data["level"][i]==biome_spinner.value.toString()){
@@ -316,12 +355,56 @@ def link_obstable_samplebars(ele,
cds_p_mgnify.change.emit();
''')
+ references_callback = CustomJS(
+ args=dict(references_fig=ele["references"]["fig"],
+ references_filter=ele["references"]["filter"],
+ references_select=ele["references"]["wid"]["references_select"],
+ cds_m_obstable=cds_m_obstable,
+ cds_p_references=cds_p_references,
+ active_ranks=active_ranks),
+ code='''
+ // selected row
+ const row = cds_m_obstable.selected.indices[0];
+ const indices = [];
+ if (row!=undefined){
+ for(let i = 0; i < cds_p_references.length; i++){
+ // for each rank
+ for(let r = 0; r < active_ranks.length; r++){
+ // get taxid of the rank
+ let rank_obs = cds_m_obstable.data["tax|"+active_ranks[r]][row];
+ if(cds_p_references.data["obs"][i]==rank_obs &&
+ cds_p_references.data["rank"][i]==active_ranks[r] &&
+ cds_p_references.data["ref"][i]==references_select.value){
+ indices.push(i);
+ }
+ }
+ }
+ }
+ references_filter.indices = indices;
+ cds_p_references.change.emit();
+ ''')
+
+ toggle_label_callback = CustomJS(
+ args=dict(xaxis=ele["samplebars"]["fig"].xaxis[0]),
+ code='''
+ if(this.active.includes(0)){
+ xaxis.major_label_text_font_size = "10px";
+ xaxis.major_tick_line_color="black";
+ }else{
+ xaxis.major_label_text_font_size = "0px";
+ xaxis.major_tick_line_color=null;
+ }
+ ''')
+
obstable_callbacks = [plot_obs_callback, change_text_legend_obs_callback, sort_groupby_callback, load_infopanel]
if cds_p_decontam:
obstable_callbacks.append(decontam_callback)
if cds_p_mgnify:
obstable_callbacks.append(mgnify_callback)
- cds_p_obstable.selected.js_on_change('indices', *obstable_callbacks)
+ if ele["references"]["filter"]:
+ obstable_callbacks.append(references_callback)
+
+ cds_m_obstable.selected.js_on_change('indices', *obstable_callbacks)
ele["samplebars"]["wid"]["sort_select"].js_on_change('value', sort_groupby_callback)
ele["samplebars"]["wid"]["groupby1_select"].js_on_change('value', sort_groupby_callback)
@@ -330,12 +413,15 @@ def link_obstable_samplebars(ele,
ele["samplebars"]["wid"]["annotbar_rank_select"].js_on_change('value', bar_select_callback, change_text_legend_bars_callback, sort_groupby_callback)
ele["samplebars"]["wid"]["y1_select"].js_on_change('value', bar_select_callback, change_y_counts_label_callback, sort_groupby_callback)
ele["samplebars"]["wid"]["y2_select"].js_on_change('value', plot_obs_callback, change_y_obs_label_callback, sort_groupby_callback)
+ ele["samplebars"]["wid"]["toggle_label"].js_on_click(toggle_label_callback)
ele["mgnify"]["wid"]["biome_spinner"].js_on_change('value', mgnify_callback)
+ ele["references"]["wid"]["references_select"].js_on_change('value', references_callback)
def link_heatmap_widgets(ele,
cds_d_samples,
cds_d_metadata,
+ cds_p_metadata,
dict_d_hcluster_x,
dict_d_hcluster_y,
cds_p_dendro_x,
@@ -343,111 +429,159 @@ def link_heatmap_widgets(ele,
dict_d_dedro_x,
dict_d_dedro_y,
cds_p_annotations,
- cds_p_obstable,
- cds_p_heatmap):
+ cds_m_obstable,
+ cds_p_heatmap,
+ active_ranks,
+ dict_d_taxname):
x_dendro_callback = CustomJS(
args=dict(rank_select=ele["heatmap"]["wid"]["rank_select"],
+ x_groupby_select=ele["heatmap"]["wid"]["x_groupby_select"],
x_sort_select=ele["heatmap"]["wid"]["x_sort_select"],
- x_method_select=ele["heatmap"]["wid"]["x_method_select"],
+ dendrox=ele["dendrox"]["fig"],
cds_p_dendro_x=cds_p_dendro_x,
dict_d_dedro_x=dict_d_dedro_x),
code='''
- console.log("x_dendro_callback");
- if (x_sort_select.value.startsWith("metric|")){
- const key = rank_select.value+"|"+x_method_select.value+"|"+x_sort_select.value.replace("metric|","");
+ if (x_groupby_select.value.startsWith("cluster|")){
+ const key = rank_select.value+"|"+x_groupby_select.value.replace("cluster|","");
cds_p_dendro_x.data = {"x": dict_d_dedro_x[key+"|x"],
"y": dict_d_dedro_x[key+"|y"],
"c": dict_d_dedro_x[key+"|c"]};
- // Enable method select
- x_method_select.disabled=false;
+ x_sort_select.value="none";
+ x_sort_select.disabled=true;
+ dendrox.visible=true;
}else{
cds_p_dendro_x.data = {"x": [], "y": [], "c": []};
- // Disable method select
- x_method_select.disabled=true;
+ x_sort_select.disabled=false;
+ dendrox.visible=false;
}
''')
x_select_callback = CustomJS(
args=dict(heatmap=ele["heatmap"]["fig"],
+ active_ranks=active_ranks,
rank_select=ele["heatmap"]["wid"]["rank_select"],
- x_method_select=ele["heatmap"]["wid"]["x_method_select"],
x_sort_select=ele["heatmap"]["wid"]["x_sort_select"],
+ x_groupby_select=ele["heatmap"]["wid"]["x_groupby_select"],
dict_d_hcluster_x=dict_d_hcluster_x,
cds_p_annotations=cds_p_annotations,
- cds_p_obstable=cds_p_obstable),
+ cds_m_obstable=cds_m_obstable,
+ cds_p_heatmap=cds_p_heatmap,
+ dict_d_taxname=dict_d_taxname),
code='''
- console.log("x_select_callback");
+ // selected rank
const rank = rank_select.value;
+
+ // get index to access data from observations from cds_m_obstable
+ var obs_index = [];
+ for (let i = 0; i < cds_m_obstable.data["index"].length; i++) {
+ if(cds_m_obstable.data["col|rank"][i]==rank){
+ obs_index.push(i);
+ }
+ }
+
+ var annot_obs = obs_index.map( s => cds_m_obstable.data["index"][s] );
+
var sorted_factors = [];
- if (x_sort_select.value=="none"){
- // None
- sorted_factors = dict_d_hcluster_x["default|" + rank];
- }else if (x_sort_select.value.startsWith("metric|")){
- // Clustering
- // Get sorted elements based on rank|method|metric
- const key = rank+"|"+x_method_select.value+"|"+x_sort_select.value.replace("metric|","");
- sorted_factors = dict_d_hcluster_x[key];
+ var dict_factors = {};
+ if (x_groupby_select.value.startsWith("cluster|")){
+ // Default dict_factors
+ for(let i = 0; i < annot_obs.length; i++){
+ dict_factors[annot_obs[i]] = annot_obs[i];
+ }
+ // Clustering - Get sorted elements based on rank|method|metric
+ sorted_factors = dict_d_hcluster_x[rank+"|"+x_groupby_select.value.replace("cluster|","")];
}else{
- // Sorting
- var factors = [];
+ // Define value from Sort by select
var sort_col = [];
- if (x_sort_select.value=="counts"){
- for (let i = 0; i < cds_p_obstable.data["index"].length; i++) {
- if(cds_p_obstable.data["col|rank"][i]==rank){
- factors.push(cds_p_obstable.data["index"][i])
- sort_col.push(cds_p_obstable.data["col|total_counts"][i]);
- }
- }
- sorted_factors = grimer_sort(factors, sort_col, "numeric", false);
+ var sort_col_type = "numeric";
+ if (x_sort_select.value=="none"){
+ sort_col = obs_index;
}else if (x_sort_select.value=="observations"){
- for (let i = 0; i < cds_p_obstable.data["index"].length; i++) {
- if(cds_p_obstable.data["col|rank"][i]==rank){
- factors.push(cds_p_obstable.data["index"][i])
- sort_col.push(cds_p_obstable.data["col|name"][i]);
- }
- }
- sorted_factors = grimer_sort(factors, sort_col, "string", false);
- }else{
- // copy array of factors
- factors = [...dict_d_hcluster_x["default|" + rank]];
+ sort_col = obs_index.map( s => cds_m_obstable.data["col|name"][s] );
+ sort_col_type = "string";
+ }else if (x_sort_select.value=="counts"){
+ sort_col = obs_index.map( s => cds_m_obstable.data["col|total_counts"][s] );
+ }else if (x_sort_select.value.startsWith("annot|")){
const annot = x_sort_select.value.replace("annot|","");
+            // create array with zeros, mark with the transformed annotation value if present
+            sort_col = new Array(annot_obs.length).fill(0);
+            for (let i = 0; i < cds_p_annotations.data["index"].length; i++) {
+                if (cds_p_annotations.data["rank"][i]==rank &&
+                    cds_p_annotations.data["annot"][i]==annot &&
+                    annot_obs.indexOf(cds_p_annotations.data["index"][i]) > -1) {
- sorted_factors.push(factors[index]); // add to the sorted_factors
- factors.splice(index, 1); //remove from factors
- }
+ sort_col[annot_obs.indexOf(cds_p_annotations.data["index"][i])] = cds_p_annotations.data["tv"][i];
+ }
+ }
+ }
+
+ if(x_groupby_select.value=="none"){
+ // Default dict_factors
+ for(let i = 0; i < annot_obs.length; i++){
+ dict_factors[annot_obs[i]] = annot_obs[i];
+ }
+ sorted_factors = grimer_sort(annot_obs, sort_col, sort_col_type, false);
+ }else if (x_groupby_select.value.startsWith("tax|")){
+ const group_rank = x_groupby_select.value.replace("tax|","");
+ // if grouping with a higher rank
+ if(active_ranks.indexOf(rank) > active_ranks.indexOf(group_rank)){
+ // group entries without selected rank with space " "
+ var groupby_col = obs_index.map(function(s) { return cds_m_obstable.data["tax|" + group_rank][s] == "" ? " " : dict_d_taxname[cds_m_obstable.data["tax|" + group_rank][s]]; });
+ var factors = [];
+ for(let i = 0; i < annot_obs.length; i++){
+ dict_factors[annot_obs[i]] = [groupby_col[i], annot_obs[i]];
+ factors.push([groupby_col[i], annot_obs[i]]);
}
+ sorted_factors = grimer_sort(factors, sort_col, sort_col_type, false, groupby_col);
+ }else{
+ // Default dict_factors
+ for(let i = 0; i < annot_obs.length; i++){
+ dict_factors[annot_obs[i]] = annot_obs[i];
+ }
+ // normal sort
+ sorted_factors = grimer_sort(annot_obs, sort_col, sort_col_type, false);
}
- // join annotated and left-overs
- sorted_factors = sorted_factors.concat(factors);
}
}
+
+ // update factors on heatmap col otherwise remove
+ for (let i = 0; i < cds_p_heatmap.data["index"].length; i++) {
+ if(cds_p_heatmap.data["rank"][i]==rank){
+ cds_p_heatmap.data["factors_obs"][i] = dict_factors[cds_p_heatmap.data["obs"][i]];
+ }else{
+ cds_p_heatmap.data["factors_obs"][i] = "";
+ }
+ }
+
+ for (let i = 0; i < cds_p_annotations.data["index"].length; i++) {
+ if(cds_p_annotations.data["rank"][i]==rank){
+ cds_p_annotations.data["factors"][i] = dict_factors[cds_p_annotations.data["index"][i]];
+ }else{
+ cds_p_annotations.data["factors"][i] = "";
+ }
+ }
+
heatmap.x_range.factors = sorted_factors;
''')
y_dendro_callback = CustomJS(
args=dict(rank_select=ele["heatmap"]["wid"]["rank_select"],
+ y_groupby_select=ele["heatmap"]["wid"]["y_groupby_select"],
y_sort_select=ele["heatmap"]["wid"]["y_sort_select"],
- y_method_select=ele["heatmap"]["wid"]["y_method_select"],
+ dendroy=ele["dendroy"]["fig"],
cds_p_dendro_y=cds_p_dendro_y,
dict_d_dedro_y=dict_d_dedro_y),
code='''
- console.log("y_dendro_callback");
- if (y_sort_select.value.startsWith("metric|")){
- const key = rank_select.value+"|"+y_method_select.value+"|"+y_sort_select.value.replace("metric|","");
+ if (y_groupby_select.value.startsWith("cluster|")){
+ const key = rank_select.value+"|"+y_groupby_select.value.replace("cluster|","");
cds_p_dendro_y.data = {"x": dict_d_dedro_y[key+"|x"],
- "y": dict_d_dedro_y[key+"|y"],
- "c": dict_d_dedro_y[key+"|c"]};
- // Enable method select
- y_method_select.disabled=false;
+ "y": dict_d_dedro_y[key+"|y"],
+ "c": dict_d_dedro_y[key+"|c"]};
+ y_sort_select.value="none";
+ y_sort_select.disabled=true;
+ dendroy.visible=true;
}else{
cds_p_dendro_y.data = {"x": [], "y": [], "c": []};
- // Disable method select
- y_method_select.disabled=true;
+ y_sort_select.disabled=false;
+ dendroy.visible=false;
}
''')
@@ -455,33 +589,80 @@ def link_heatmap_widgets(ele,
args=dict(heatmap=ele["heatmap"]["fig"],
cds_d_samples=cds_d_samples,
cds_d_metadata=cds_d_metadata,
+ cds_p_metadata=cds_p_metadata,
+ cds_p_heatmap=cds_p_heatmap,
rank_select=ele["heatmap"]["wid"]["rank_select"],
- y_method_select=ele["heatmap"]["wid"]["y_method_select"],
y_sort_select=ele["heatmap"]["wid"]["y_sort_select"],
+ y_groupby_select=ele["heatmap"]["wid"]["y_groupby_select"],
dict_d_hcluster_y=dict_d_hcluster_y),
code='''
- console.log("y_select_callback");
+ // selected rank
+ const rank = rank_select.value;
+ var annot_samples = cds_d_samples.data["index"];
+
var sorted_factors = [];
- if (y_sort_select.value=="none"){
- // None
- sorted_factors = dict_d_hcluster_y["default"];
- }else if (y_sort_select.value.startsWith("metric|")){
- // Clustering
- // Get sorted elements based on rank|method|metric
- const key = rank_select.value+"|"+y_method_select.value+"|"+y_sort_select.value.replace("metric|","");
- sorted_factors = dict_d_hcluster_y[key];
+ var dict_factors = {};
+ if (y_groupby_select.value.startsWith("cluster|")){
+ // Clustering - Get sorted elements based on rank|method|metric
+ sorted_factors = dict_d_hcluster_y[rank+"|"+y_groupby_select.value.replace("cluster|","")];
+ // Default dict_factors
+ for(let i = 0; i < annot_samples.length; i++){
+ dict_factors[annot_samples[i]] = annot_samples[i];
+ }
}else{
- // Sorting
- if (y_sort_select.value=="counts"){
- sorted_factors = grimer_sort(cds_d_samples.data["index"], cds_d_samples.data["cnt|total"], "numeric", false);
+ // Define value from Sort by select
+ var sort_col = [];
+ var sort_col_type = "string";
+ if (y_sort_select.value=="none"){
+ sort_col = dict_d_hcluster_y["default"];
}else if (y_sort_select.value=="samples"){
- sorted_factors = grimer_sort(cds_d_samples.data["index"], cds_d_samples.data["index"], "string", false);
- }else if (y_sort_select.value.startsWith("metadata_cat|")){
- sorted_factors = grimer_sort(cds_d_samples.data["index"], cds_d_metadata.data[y_sort_select.value.replace("metadata_cat|","")], "string", false);
+ sort_col = annot_samples;
+ }else if (y_sort_select.value=="counts"){
+ sort_col = cds_d_samples.data["cnt|total"];
+ sort_col_type = "numeric";
}else if (y_sort_select.value.startsWith("metadata_num|")){
- sorted_factors = grimer_sort(cds_d_samples.data["index"], cds_d_metadata.data[y_sort_select.value.replace("metadata_num|","")], "numeric", false);
+ sort_col = cds_d_metadata.data[y_sort_select.value.replace("metadata_num|","")];
+ sort_col_type = "numeric";
+ }else if (y_sort_select.value.startsWith("metadata_cat|")){
+ sort_col = cds_d_metadata.data[y_sort_select.value.replace("metadata_cat|","")];
+ }
+
+ if(y_groupby_select.value=="none"){
+ sorted_factors = grimer_sort(annot_samples, sort_col, sort_col_type, false);
+ // Default dict_factors
+ for(let i = 0; i < annot_samples.length; i++){
+ dict_factors[annot_samples[i]] = annot_samples[i];
+ }
+ }else if (y_groupby_select.value.startsWith("group_metadata|")){
+ const group_metadata = y_groupby_select.value.replace("group_metadata|","");
+
+ // group entries and replace empty with space " "
+ var groupby_col = cds_d_metadata.data[group_metadata].map(function(m) { return m == "" ? " " : m; });
+
+ var factors = [];
+ for(let i = 0; i < annot_samples.length; i++){
+ dict_factors[annot_samples[i]] = [groupby_col[i], annot_samples[i]];
+ factors.push([groupby_col[i], annot_samples[i]]);
+ }
+ sorted_factors = grimer_sort(factors, sort_col, sort_col_type, false, groupby_col);
+ }
+ }
+
+ // update factors on heatmap col otherwise remove
+ for (let i = 0; i < cds_p_heatmap.data["index"].length; i++) {
+ if(cds_p_heatmap.data["rank"][i]==rank){
+ cds_p_heatmap.data["factors_sample"][i] = dict_factors[cds_p_heatmap.data["index"][i]];
+ }else{
+ cds_p_heatmap.data["factors_sample"][i] = "";
}
}
+
+ if (cds_p_metadata){
+ for (let i = 0; i < cds_p_metadata.data["index"].length; i++) {
+ cds_p_metadata.data["factors"][i] = dict_factors[cds_p_metadata.data["index"][i]];
+ }
+ }
+
heatmap.y_range.factors = sorted_factors;
''')
@@ -491,14 +672,14 @@ def link_heatmap_widgets(ele,
yaxis=ele["heatmap"]["fig"].yaxis[0]),
code='''
if(this.active.includes(0)){
- xaxis.major_label_text_font_size = "12px";
+ xaxis.major_label_text_font_size = "10px";
xaxis.major_tick_line_color="black";
}else{
xaxis.major_label_text_font_size = "0px";
xaxis.major_tick_line_color=null;
}
if(this.active.includes(1)){
- yaxis.major_label_text_font_size = "12px";
+ yaxis.major_label_text_font_size = "10px";
yaxis.major_tick_line_color="black";
}else{
yaxis.major_label_text_font_size = "0px";
@@ -506,38 +687,64 @@ def link_heatmap_widgets(ele,
}
''')
- ele["heatmap"]["wid"]["toggle_label_heatmap"].js_on_click(toggle_label_callback)
- ele["heatmap"]["wid"]["rank_select"].js_on_change('value', x_select_callback, x_dendro_callback, y_select_callback, y_dendro_callback)
- ele["heatmap"]["wid"]["x_method_select"].js_on_change('value', x_select_callback, x_dendro_callback)
+ ele["heatmap"]["wid"]["toggle_label"].js_on_click(toggle_label_callback)
+ ele["heatmap"]["wid"]["rank_select"].js_on_change('value', y_select_callback, x_select_callback, x_dendro_callback, y_dendro_callback)
ele["heatmap"]["wid"]["x_sort_select"].js_on_change('value', x_select_callback, x_dendro_callback)
- ele["heatmap"]["wid"]["y_method_select"].js_on_change('value', y_select_callback, y_dendro_callback)
+ ele["heatmap"]["wid"]["x_groupby_select"].js_on_change('value', x_select_callback, x_dendro_callback)
ele["heatmap"]["wid"]["y_sort_select"].js_on_change('value', y_select_callback, y_dendro_callback)
+ ele["heatmap"]["wid"]["y_groupby_select"].js_on_change('value', y_select_callback, y_dendro_callback)
def link_metadata_widgets(ele, cds_p_metadata, cds_d_metadata, max_metadata_cols):
metadata_multiselect_callback = CustomJS(
args=dict(metadata_heatmap=ele["metadata"]["fig"],
+ metadata_heatmap_xaxis=ele["metadata"]["fig"].xaxis[0] if cds_p_metadata else None,
+ metadata_multiselect=ele["metadata"]["wid"]["metadata_multiselect"],
+ legend_colorbars=ele["metadata"]["wid"]["legend_colorbars"],
+ toggle_legend=ele["metadata"]["wid"]["toggle_legend"],
max_metadata_cols=max_metadata_cols,
cds_p_metadata=cds_p_metadata,
cds_d_metadata=cds_d_metadata),
code='''
const index_len = cds_d_metadata.data["index"].length;
- console.log(cds_d_metadata.data)
+
var x_factors = [];
var empty_y_values = new Array(index_len);
- for (var i = 0; i < index_len; ++i) empty_y_values[i]=["", ""];
+ for (var i = 0; i < index_len; ++i) empty_y_values[i]="";
+ // hide all legends
+ for (let md_header in legend_colorbars) legend_colorbars[md_header].visible = false;
+
+ // set legend orientation
+ if(metadata_heatmap_xaxis){
+ if(metadata_multiselect.value.length==1)
+ metadata_heatmap_xaxis.major_label_orientation = "horizontal";
+ else
+ metadata_heatmap_xaxis.major_label_orientation = 0.7;
+ }
+
for(var s=0; s < max_metadata_cols; ++s){
- if (s 0 ){
var found = false;
for(let r = 0; r < active_ranks.length; r++){
- if (name_multichoice.value.indexOf(cds_p_obstable.data["tax|"+active_ranks[r]][i]) >= 0){
+ // Compare all names on multichoice (array) against cell
+ if (name_multichoice.value.indexOf(cds_m_obstable.data["tax|"+active_ranks[r]][i]) >= 0){
found = true;
break;
}
@@ -582,10 +791,10 @@ def link_obstable_filter(ele, cds_p_obstable, active_ranks):
continue;
}
}
- indices.push(i)
+ indices.push(i);
}
- widgets_filter.indices = indices;
- cds_p_obstable.change.emit();
+ filter.indices = indices;
+ cds_m_obstable.change.emit();
''')
ele["obstable"]["wid"]["frequency_spinner"].js_on_change('value', filter_callback)
ele["obstable"]["wid"]["counts_perc_avg_spinner"].js_on_change('value', filter_callback)
@@ -598,27 +807,27 @@ def link_correlation_widgets(ele, cds_p_correlation):
args=dict(correlation=ele["correlation"]["fig"],
cds_p_correlation=cds_p_correlation),
code='''
- console.log("rank_select_callback");
- const factors = new Set();
+ var factors = new Set();
for(let i = 0; i < cds_p_correlation.data["index"].length; i++){
if(cds_p_correlation.data["rank"][i]==this.value){
factors.add(cds_p_correlation.data["index"][i]);
+ factors.add(cds_p_correlation.data["taxid"][i]);
}
}
- correlation.x_range.factors = [...factors];
- correlation.y_range.factors = [...factors].reverse();
+ var sorted_factors = [...factors].sort();
+ correlation.y_range.factors = sorted_factors;
+ var rev_sorted_factors = [...sorted_factors].reverse();
+ correlation.x_range.factors = rev_sorted_factors;
''')
filter_callback = CustomJS(
- args=dict(rho_filter=ele["correlation"]["rho_filter"],
+ args=dict(filter=ele["correlation"]["filter"],
neg_slider=ele["correlation"]["wid"]["neg_slider"],
pos_slider=ele["correlation"]["wid"]["pos_slider"],
- pval_spinner=ele["correlation"]["wid"]["pval_spinner"],
cds_p_correlation=cds_p_correlation),
code='''
const indices = [];
for (var i = 0; i < cds_p_correlation.data["index"].length; i++) {
- if (cds_p_correlation.data["pval_corr"][i] > pval_spinner.value) continue;
const rho = cds_p_correlation.data["rho"][i];
if ((rho >= neg_slider.value[0] && rho <= neg_slider.value[1]) ||
(rho >= pos_slider.value[0] && rho <= pos_slider.value[1]))
@@ -626,28 +835,43 @@ def link_correlation_widgets(ele, cds_p_correlation):
indices.push(i)
}
}
- rho_filter.indices = indices;
+ filter.indices = indices;
cds_p_correlation.change.emit();
''')
+ toggle_label_callback = CustomJS(
+ args=dict(xaxis=ele["correlation"]["fig"].xaxis[0], yaxis=ele["correlation"]["fig"].yaxis[0]),
+ code='''
+ if(this.active.includes(0)){
+ xaxis.major_label_text_font_size = "10px";
+ xaxis.major_tick_line_color="black";
+ yaxis.major_label_text_font_size = "10px";
+ yaxis.major_tick_line_color="black";
+ }else{
+ xaxis.major_label_text_font_size = "0px";
+ xaxis.major_tick_line_color=null;
+ yaxis.major_label_text_font_size = "0px";
+ yaxis.major_tick_line_color=null;
+ }
+ ''')
+
+ ele["correlation"]["wid"]["toggle_label"].js_on_click(toggle_label_callback)
ele["correlation"]["wid"]["pos_slider"].js_on_change('value', filter_callback)
ele["correlation"]["wid"]["neg_slider"].js_on_change('value', filter_callback)
- ele["correlation"]["wid"]["pval_spinner"].js_on_change('value', filter_callback)
ele["correlation"]["wid"]["rank_select"].js_on_change('value', rank_select_callback)
-def link_obsbars_widgets(ele, cds_p_obsbars, dict_d_topobs, cds_d_sampleobs, cds_d_samples, top_obs_bars, dict_d_taxname, cds_d_metadata):
+def link_obsbars_widgets(ele, cds_p_obsbars, dict_d_topobs, dict_d_sampleobs, cds_d_samples, top_obs_bars, dict_d_taxname, cds_d_metadata, cds_p_sampletable):
rank_select_callback = CustomJS(
args=dict(sort_select=ele["obsbars"]["wid"]["sort_select"],
legend=ele["obsbars"]["legend"],
cds_p_obsbars=cds_p_obsbars,
- cds_d_sampleobs=cds_d_sampleobs,
+ dict_d_sampleobs=dict_d_sampleobs,
cds_d_samples=cds_d_samples,
dict_d_topobs=dict_d_topobs,
dict_d_taxname=dict_d_taxname,
top_obs_bars=top_obs_bars),
code='''
- console.log("rank_select_callback");
const rank = this.value;
const n_sample = cds_p_obsbars.data["index"].length;
const total = cds_d_samples.data["cnt|total"];
@@ -666,11 +890,16 @@ def link_obsbars_widgets(ele, cds_p_obsbars, dict_d_topobs, cds_d_sampleobs, cds
var sum_bars = 0;
if (taxid!=undefined){
for(let s = 0; s 0 ){
+ var found = false;
+ for (var m=0; m < metadata_multichoice.value.length; ++m){
+ const md = metadata_multichoice.value[m].split("|");
+ if(cds_d_metadata.data[md[0]][i]==md[1]){
+ found = true;
+ break;
+ }
+ }
+ if (!found) {
+ continue;
+ }
+ }
+ selected_indices.push(i);
+ }
+ cds_p_sampletable.selected.indices = selected_indices;
+ ''')
+ ele["sampletable"]["wid"]["total_counts_spinner"].js_on_change('value', select_callback)
+ ele["sampletable"]["wid"]["assigned_spinner"].js_on_change('value', select_callback)
+ if cds_d_metadata:
+ ele["sampletable"]["wid"]["metadata_multichoice"].js_on_change('value', select_callback)
diff --git a/grimer/cds.py b/grimer/cds.py
index be90b1f..2a0d1a8 100644
--- a/grimer/cds.py
+++ b/grimer/cds.py
@@ -4,16 +4,17 @@
from math import pi
#Internal
-from grimer.utils import print_df, transform_table, print_log, fdrcorrection_bh
+from grimer.func import print_df, transform_table, print_log, format_js_toString
#Bokeh
from bokeh.models import ColumnDataSource
-# Scipy
-from scipy import stats
-
-def generate_dict_taxname(tax, taxids):
+def dict_taxname(tax, taxids):
+ """
+ mapping taxids to names
+ (or names to names if taxid is not used)
+ """
id_name = {}
for i in taxids:
n = tax.name(i) if tax else i
@@ -21,45 +22,87 @@ def generate_dict_taxname(tax, taxids):
return id_name
-def generate_cds_annotations(table, contaminants, references, controls, decontam):
+def cds_plot_references(table, tax, references):
+ # Stacked list of references, accounting for lineage matches
+ # index -> observations (repeated)
+ # columns -> "rank", "ref", "direct", "parent"
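+ # e.g. a hypothetical row: obs "562", rank "species", ref "Contaminants",
+ # with direct=1 (one direct match) and parent=2 (two matches in its lineage)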
+ clist = []
+ if references is not None:
+ for rank in table.ranks():
+ for obs in table.observations(rank):
+ for desc, ref in references.items():
+ direct = ref.get_refs_count(obs, direct=True)
+ parent = ref.get_refs_count(obs, parents=True)
+ if direct + parent > 0:
+ clist.append([obs, rank, desc, direct, parent])
+
+ df_references = pd.DataFrame(clist, columns=["obs", "rank", "ref", "direct", "parent"])
+ df_references.set_index('obs', inplace=True)
+
+ print_df(df_references, "cds_p_references")
+ return ColumnDataSource(df_references)
+
+
+def cds_annotations(table, references, controls, decontam, control_samples):
# Stacked matrix of true annotations (omit false)
# index -> taxids
# columns -> rank, annot
- df_annotations = pd.DataFrame(columns=["rank", "annot"])
- for rank in table.ranks():
+ df_annotations = pd.DataFrame(columns=["rank", "annot", "factors", "ov", "tv"])
+ for i, rank in enumerate(table.ranks()):
# Generate a DataFrame to use as source in tables
df_rank = pd.DataFrame(index=table.observations(rank))
- if decontam:
- df_rank["decontam"] = decontam.get_contaminants(rank, df_rank.index)
-
- for desc, ref in references.items():
- df_rank[desc] = table.observations(rank).isin(ref.lineage)
+ if decontam is not None:
+ contaminants = decontam.get_contaminants(rank, df_rank.index).values
+ if contaminants.any():
+ df_rank["decontam"] = decontam.get_pscore(rank, df_rank.index)[contaminants]
- for desc, cont in contaminants.items():
- df_rank[desc] = table.observations(rank).isin(cont.lineage)
+ if references is not None:
+ for desc, ref in references.items():
+ df_rank[desc] = table.observations(rank).map(lambda x: ref.get_refs_count(x, direct=True))
+ df_rank.loc[df_rank[desc] == 0, desc] = np.nan
- if controls:
+ if controls is not None:
for desc, ctrl in controls.items():
- df_rank[desc] = table.observations(rank).isin(ctrl.lineage)
+ control_table = table.get_subtable(samples=control_samples[desc], rank=rank)
+ freq_perc_control = control_table.gt(0).sum(axis=0) / control_table.shape[0]
+ df_rank[desc] = table.observations(rank).map(freq_perc_control).to_list()
- df_rank = pd.DataFrame(df_rank.stack(), columns=["val"]).reset_index(1)
+ df_rank = pd.DataFrame(df_rank.stack(), columns=["ov"]).reset_index(1)
df_rank.rename(columns={"level_1": "annot"}, inplace=True)
- df_rank = df_rank[df_rank["val"]] # keep only true entries
- df_rank["rank"] = rank # set rank
- if "val" in df_rank.columns:
- df_rank.drop(columns="val", inplace=True) # drop boolean col
+ # add transformed values (tv) to fit the same scale on the heatmap
+ # DECONTAM: reversed, min-max normalized p-score
+ if not df_rank[df_rank["annot"] == "decontam"].empty:
+ min_val = df_rank[df_rank["annot"] == "decontam"]["ov"].min()
+ max_val = df_rank[df_rank["annot"] == "decontam"]["ov"].max()
+ df_rank.loc[df_rank["annot"] == "decontam", "tv"] = 1 - ((df_rank[df_rank["annot"] == "decontam"]["ov"] - min_val) / (max_val - min_val))
+
+ # references: counts divided by the maximum value
+ if references is not None:
+ for desc, ref in references.items():
+ if not df_rank[df_rank["annot"] == desc].empty:
+ max_val = df_rank[df_rank["annot"] == desc]["ov"].max()
+ df_rank.loc[df_rank["annot"] == desc, "tv"] = df_rank.loc[df_rank["annot"] == desc, "ov"] / max_val
+
+ # controls: keep the frequency percentage as-is
+ if controls is not None:
+ for desc, ctrl in controls.items():
+ if not df_rank.loc[df_rank["annot"] == desc].empty:
+ df_rank.loc[df_rank["annot"] == desc, "tv"] = df_rank.loc[df_rank["annot"] == desc, "ov"]
+
+ df_rank["rank"] = rank # set rank
+ df_rank["factors"] = df_rank.index if i == 0 else "" # initialize just for first rank (save space)
# Concat in the main df
df_annotations = pd.concat([df_annotations, df_rank], axis=0)
- print_df(df_annotations, "df_annotations -> cds_p_annotations")
+ print_df(df_annotations, "cds_p_annotations")
return ColumnDataSource(df_annotations)
-def generate_cds_obstable(table, tax, contaminants, references, controls, control_samples, decontam):
+def cds_obstable(table, tax, references, controls, control_samples, decontam):
# index unique taxids
# col|... values to plot to columns in the datatable
# tax|... auxiliary lineage of taxa entries
@@ -78,61 +121,67 @@ def generate_cds_obstable(table, tax, contaminants, references, controls, contro
# Frequency of taxa among all samples
df_rank["col|frequency_perc"] = table.get_frequency_perc(rank)
+ df_rank["col|counts_perc_avg"] = table.get_counts_perc_avg_samples(rank)
# Average percentage of counts among all samples
- df_rank["col|counts_perc_avg"] = table.get_counts_perc_avg(rank)
- df_rank["col|total_counts"] = table.get_total_counts(rank)
+ df_rank["col|total_counts"] = table.get_counts(rank)
# If active - add decontam True/False results
if decontam:
df_rank["col|decontam"] = decontam.get_contaminants(rank, df_rank.index)
- # Add a single column, concatenating ("|") references sources
- df_rank["col|references"] = ""
- for desc, ref in references.items():
- # Check if taxids are in the lineage of the reference
- bool_ref = table.observations(rank).isin(ref.lineage)
- df_rank["col|references"] = df_rank["col|references"] + np.where(bool_ref, desc + " | ", "")
-
- # Add a column for each Contaminant source
- for desc, cont in contaminants.items():
- #print(table.observations(rank).isin(cont.ids))
- #print(table.observations(rank).isin(cont.lineage))
-
- df_rank["col|" + desc] = table.observations(rank).map(cont.get_refs_count).to_list()
- #df_rank["col|" + desc] = table.observations(rank).isin(cont.ids)
+ # Add a column for each Annotation source
+ if references is not None:
+ for desc, ref in references.items():
+ df_rank["col|" + desc] = table.observations(rank).map(lambda x: ref.get_refs_count(x, direct=True)).to_list()
# Add a column for each Control source
- if controls:
+ if controls is not None:
# calculate frequency for each group of control provided
for desc, ctrl in controls.items():
control_table = table.get_subtable(samples=control_samples[desc], rank=rank)
freq_perc_control = control_table.gt(0).sum(axis=0) / control_table.shape[0]
df_rank["col|" + desc] = table.observations(rank).map(freq_perc_control).fillna(0).to_list()
- # # Add col for each rank with parent taxid if exists, linking entries in their lineage for filtering and plotting
-
# Add col for each rank with parent taxid if exists, linking entries in their lineage for filtering and plotting
-
for other_rank in table.ranks():
-
if table.ranks().index(other_rank) > table.ranks().index(rank):
df_rank["tax|" + other_rank] = ""
elif other_rank != rank:
df_rank["tax|" + other_rank] = table.observations(rank).map(lambda txid: table.get_lineage(txid, rank, other_rank)).fillna("")
else:
df_rank["tax|" + other_rank] = df_rank.index
-
# Sort values by frequency to show on table
df_rank.sort_values(by="col|frequency_perc", ascending=False, inplace=True)
# Concat in the main df
df_obstable = pd.concat([df_obstable, df_rank], axis=0)
- print_df(df_obstable, "df_obstable -> cds_p_obstable")
+ print_df(df_obstable, "cds_m_obstable")
return ColumnDataSource(df_obstable)
-def generate_cds_bars(table):
+def cds_sampletable(table):
+ # index unique sample-ids
+ # col|... values to plot to columns in the datatable
+
+ df_sampletable = pd.DataFrame(index=table.samples)
+ df_sampletable["col|total"] = table.get_total() if not table.normalized else 0
+ df_sampletable["col|assigned"] = table.get_assigned() if not table.normalized else 0
+ df_sampletable["col|assigned_perc"] = table.get_assigned_perc()
+ df_sampletable["col|unassigned"] = table.get_unassigned() if not table.normalized else 0
+ df_sampletable["col|unassigned_perc"] = table.get_unassigned_perc()
+
+ # assigned by rank
+ for rank in table.ranks():
+ df_sampletable["col|" + rank] = table.data[rank].sum(axis=1).divide(table.get_total(), axis=0)
+
+ df_sampletable.fillna(0, inplace=True)
+
+ print_df(df_sampletable, "cds_p_sampletable")
+ return ColumnDataSource(df_sampletable)
+
+
+def cds_samplebars(table):
# index unique sample-ids
# aux| auxiliary values (not plotted)
# bar| values plotted as bars (sample counts)
@@ -141,20 +190,20 @@ def generate_cds_bars(table):
df_bars = pd.DataFrame(index=table.samples)
# factors: set the x-axis reference for plotting, it can be dynamically changed (with groups)
df_bars["aux|factors"] = df_bars.index
- df_bars["bar|unassigned"] = table.unassigned
+ df_bars["bar|unassigned"] = table.get_unassigned()
# Initialized with "Assigned" of first rank
- df_bars["bar|selected"] = table.data[table.ranks()[0]].sum(axis=1)
+ df_bars["bar|selected"] = table.get_subtable(table.ranks()[0]).sum(axis=1)
# Total assigned - assigned to rank
- df_bars["bar|others"] = (table.total - table.unassigned) - df_bars["bar|selected"]
+ df_bars["bar|others"] = (table.get_total() - table.get_unassigned()) - df_bars["bar|selected"]
# Add empty cols for taxa values, to be dynamically inserted (None to avoid printing 0)
for rank in table.ranks():
df_bars["tax|" + rank] = None
- print_df(df_bars, "df_bars -> cds_p_samplebars")
+ print_df(df_bars, "cds_p_samplebars")
return ColumnDataSource(df_bars)
-def generate_cds_samples(table, references, contaminants, controls, decontam):
+def cds_samples(table, references, controls, decontam):
# index unique sample-ids
# aux| auxiliary values (not plotted)
# cnt| count values to be copied/transformed to bars
@@ -162,25 +211,27 @@ def generate_cds_samples(table, references, contaminants, controls, decontam):
df_samples = pd.DataFrame(index=table.samples)
# index to retrieve default input order
df_samples["aux|input_order"] = range(df_samples.shape[0], 0, -1)
- df_samples["cnt|total"] = table.total
- df_samples["cnt|unassigned"] = table.unassigned
+ df_samples["cnt|total"] = table.get_total()
+ df_samples["cnt|unassigned"] = table.get_unassigned()
# Keep total number of assignments for calculations
- df_samples["cnt|assigned"] = table.total - table.unassigned
+ df_samples["cnt|assigned"] = table.get_total() - table.get_unassigned()
# Add specific rank assignements
for rank in table.ranks():
df_samples["cnt|" + rank + "|assigned"] = table.data[rank].sum(axis=1)
# Add counts specific to sources
- source_list = [references.items(), contaminants.items()]
- if controls:
+ source_list = []
+ if references is not None:
+ source_list.append(references.items())
+ if controls is not None:
source_list.append(controls.items())
for sources in source_list:
for desc, src in sources:
for rank in table.ranks():
- idx = table.observations(rank).isin(src.lineage)
+ idx = table.observations(rank).map(lambda x: src.get_refs_count(x, direct=True)) >= 1
df_samples["cnt|" + rank + "|" + desc] = table.data[rank][table.observations(rank)[idx]].sum(axis=1)
if decontam:
@@ -189,36 +240,42 @@ def generate_cds_samples(table, references, contaminants, controls, decontam):
idx = table.observations(rank).isin(contaminants)
df_samples["cnt|" + rank + "|decontam"] = table.data[rank][table.observations(rank)[idx]].sum(axis=1)
- print_df(df_samples, "df_samples -> cds_d_samples")
+ # fill NaN with zero so bars do not "disappear" when plotting
+ df_samples.fillna(0, inplace=True)
+
+ print_df(df_samples, "cds_d_samples")
return ColumnDataSource(df_samples)
-def generate_cds_metadata(metadata):
+def cds_metadata(metadata):
# index -> sample-ids
# columns -> metadata fields
# values -> metadata values
df_md = metadata.get_data()
- print_df(df_md, "df_md -> cds_d_metadata")
+ print_df(df_md, "cds_d_metadata")
return ColumnDataSource(df_md)
-def generate_cds_plot_metadata(metadata, max_metadata_cols):
+def cds_plot_metadata(metadata, max_metadata_cols):
# index (unique sample-ids)
# md0, md1, ..., md(max_metadata_cols)
# values (metadata field, metadata values)
- df_plot_md = pd.DataFrame(index=metadata.data.index)
- # Fill with first N metadata fields
- metadata_fields = metadata.get_col_headers().to_list()
- for i in range(max_metadata_cols):
- # Same transformation done in the colormap for numeric entries
- df_plot_md["md" + str(i)] = [(metadata_fields[i], '{:.16g}'.format(md_value) if not isinstance(md_value, str) else md_value) for md_value in metadata.get_col(metadata_fields[i])]
+ df_plot_md = pd.DataFrame(index=metadata.data.index, columns=["factors"] + [str(i) for i in range(1, max_metadata_cols + 1)])
+ df_plot_md["factors"] = df_plot_md.index
+ # Fill in only first metadata field
+ first_field = metadata.get_col_headers()[0]
+
+ df_plot_md["1"] = [(first_field, format_js_toString(md_value)) for md_value in metadata.get_col(first_field)]
- print_df(df_plot_md, "df_plot_md -> cds_p_metadata")
+ # Fill with empty strings to match js output when not selected
+ df_plot_md.fillna("", inplace=True)
+
+ print_df(df_plot_md, "cds_p_metadata")
return ColumnDataSource(df_plot_md)
-def generate_cds_plot_decontam(decontam):
+def cds_plot_decontam(decontam):
# index unique sample-ids
# concentrations from decontam inputs
# controls from decontam inputs
@@ -226,11 +283,11 @@ def generate_cds_plot_decontam(decontam):
df_decontam = decontam.get_data()
df_decontam["controls"] = df_decontam["controls"].map({True: 'Control', False: 'Sample'})
df_decontam["counts"] = None
- print_df(df_decontam, "df_decontam -> cds_p_decontam")
+ print_df(df_decontam, "cds_p_decontam")
return ColumnDataSource(df_decontam)
-def generate_cds_decontam(decontam, ranks):
+def cds_decontam(decontam, ranks):
"""
cds based on a dict with valid values to plot model lines
{taxid: (contam_y1, contam_y2, non_contam_y, pval)}
@@ -238,15 +295,15 @@ def generate_cds_decontam(decontam, ranks):
dict_coord_mod = {}
for rank in ranks:
df_valid_vals = decontam.rank[rank].dropna(subset=['contam'])
- pval = decontam.get_pvalues(rank, df_valid_vals.index)
+ pval = decontam.get_pscore(rank, df_valid_vals.index)
vals = list(zip(df_valid_vals["contam"], df_valid_vals["contam_2"], df_valid_vals["non.contam"], pval))
dict_coord_mod.update(dict(zip(df_valid_vals.index, vals)))
- print_df(dict_coord_mod, "dict_coord_mod -> cds_d_decontam_models")
+ print_df(dict_coord_mod, "cds_d_decontam_models")
return ColumnDataSource(dict_coord_mod)
-def generate_cds_plot_decontam_models(decontam):
+def cds_plot_decontam_models(decontam):
"""
cds based on a dict with 3 pairs of values to plot. x is shared among y_cont and y_noncont
# {x: [min,max], y_cont: [None,None], y_noncont: [None,None]}
@@ -256,23 +313,27 @@ def generate_cds_plot_decontam_models(decontam):
decontam.get_data()["concentration"].max()]
dict_decontam_models["y_cont"] = [None, None]
dict_decontam_models["y_noncont"] = [None, None]
- print_df(dict_decontam_models, "dict_decontam_models -> cds_p_decontam_models")
+ print_df(dict_decontam_models, "cds_p_decontam_models")
return ColumnDataSource(dict_decontam_models)
-def generate_cds_sampleobs(table):
- # matrix-like cds with raw counts
- # index -> sample-ids
- # columns -> taxids (from all ranks)
- # values are observation raw counts
- df_sampleobs = pd.DataFrame(index=table.samples)
+def dict_sampleobs(table):
+ # dict with raw counts (not storing zeros)
+ # dict_sampleobs[rank][obs][sample] = count
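+ # e.g. dict_sampleobs["species"]["562"]["sample1"] = 10 (hypothetical identifiers)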
+ dict_sampleobs = {}
for rank in table.ranks():
- df_sampleobs = pd.concat([df_sampleobs, table.data[rank]], axis=1)
- print_df(df_sampleobs, "df_sampleobs -> cds_d_sampleobs")
- return ColumnDataSource(df_sampleobs)
+ dict_sampleobs[rank] = {}
+ for obs, sample_val in table.data[rank].to_dict().items():
+ dict_sampleobs[rank][obs] = {}
+ for sample, val in sample_val.items():
+ if val > 0:
+ dict_sampleobs[rank][obs][sample] = val
+
+ print_df(dict_sampleobs, "dict_d_sampleobs")
+ return dict_sampleobs
-def generate_cds_heatmap(table, transformation, replace_zero_value, show_zeros):
+def cds_heatmap(table, transformation, show_zeros):
# Stacked matrix of raw counts + transformed value
# index -> sample-ids (repeated)
# obs
@@ -280,25 +341,31 @@ def generate_cds_heatmap(table, transformation, replace_zero_value, show_zeros):
# ov -> original value (raw counts)
# tv -> transformed values (user choice: log10, clr, ...)
- df_heatmap = pd.DataFrame(columns=["obs", "rank", "ov", "tv"])
- for rank in table.ranks():
+ df_heatmap = pd.DataFrame(columns=["obs", "rank", "ov", "tv", "factors_sample", "factors_obs"])
+ for i, rank in enumerate(table.ranks()):
stacked_rank_df = pd.DataFrame(table.data[rank].stack(), columns=["ov"]).reset_index(1)
# Rename first col to obs
stacked_rank_df.rename(columns={stacked_rank_df.columns[0]: "obs"}, inplace=True)
stacked_rank_df["rank"] = rank
- tv = transform_table(table.data[rank], table.total, transformation, replace_zero_value)
+ tv = transform_table(table.data[rank], table.get_total(), transformation, table.zerorep)
stacked_rank_df["tv"] = tv.stack().values
#Drop zeros based on original counts
if not show_zeros:
stacked_rank_df = stacked_rank_df[stacked_rank_df["ov"] > 0]
+ # initialize factors only for first rank
+ #stacked_rank_df["factors_sample"] = stacked_rank_df.index
+ #stacked_rank_df["factors_obs"] = stacked_rank_df["obs"]
+ stacked_rank_df["factors_sample"] = stacked_rank_df.index if i == 0 else ""
+ stacked_rank_df["factors_obs"] = stacked_rank_df["obs"] if i == 0 else ""
df_heatmap = pd.concat([df_heatmap, stacked_rank_df], axis=0)
- print_df(df_heatmap, "df_heatmap -> cds_p_heatmap")
+ df_heatmap.drop('ov', axis=1, inplace=True)
+ print_df(df_heatmap, "cds_p_heatmap")
return ColumnDataSource(df_heatmap)
-def generate_dict_hcluster(table, hcluster):
+def dict_hcluster(table, hcluster):
# keys -> combination of hclusters
# values -> sorted sample-ids
@@ -318,21 +385,21 @@ def generate_dict_hcluster(table, hcluster):
# taxa
leaves_x[key] = hcluster[rank][method][metric]["x"]["index"]
- print_df(leaves_x, "leaves_x -> dict_d_hcluster_x")
- print_df(leaves_y, "leaves_y -> dict_d_hcluster_y")
+ print_df(leaves_x, "dict_d_hcluster_x")
+ print_df(leaves_y, "dict_d_hcluster_y")
return leaves_x, leaves_y
-def generate_cds_plot_dendro():
+def cds_plot_dendro():
# Empty CDS {"x": [], "y": [], "c": []}
dendro_x = {"x": [], "y": [], "c": []}
dendro_y = {"x": [], "y": [], "c": []}
- print_df(dendro_x, "dendro_x -> cds_p_dendro_x")
- print_df(dendro_y, "dendro_y -> cds_p_dendro_y")
+ print_df(dendro_x, "cds_p_dendro_x")
+ print_df(dendro_y, "cds_p_dendro_y")
return ColumnDataSource(dendro_x), ColumnDataSource(dendro_y)
-def generate_dict_dendro(table, dendro):
+def dict_dendro(table, dendro):
# dict_d_dedro_x and dict_d_dedro_y:
# key -> key + "|x" , key + "|y" , key + "|c"
# value -> list of lists (x and y) or list (c)
@@ -355,15 +422,15 @@ def generate_dict_dendro(table, dendro):
return dict_d_dedro_x, dict_d_dedro_y
-def generate_dict_topobs(table, top_obs_bars):
+def dict_topobs(table, top_obs_bars):
dict_top_taxa = {}
for rank in table.ranks():
dict_top_taxa[rank] = table.get_top(rank, top_obs_bars)
- print_df(dict_top_taxa, "dict_top_taxa -> dict_d_topobs")
+ print_df(dict_top_taxa, "dict_d_topobs")
return dict_top_taxa
-def generate_dict_refs(table, contaminants, references):
+def dict_refs(table, references):
# dict with information about sources and references
# references can be repeated among descriptions, sources and taxids
# {taxid: {source: {desc: [refs]}}}
@@ -373,10 +440,10 @@ def generate_dict_refs(table, contaminants, references):
for rank in table.ranks():
used_ids.update(table.observations(rank))
- for i in used_ids:
- for source in [contaminants.items(), references.items()]:
- for sname, s in source:
- for ref, descs in s.get_refs_desc(i).items():
+ if references is not None:
+ for i in used_ids:
+ for sname, s in references.items():
+ for ref, descs in s.get_refs_desc(i, direct=True).items():
for desc in descs:
# Only add items if they have a reference to it
if i not in d_refs:
@@ -387,66 +454,30 @@ def generate_dict_refs(table, contaminants, references):
d_refs[i][sname][desc] = []
d_refs[i][sname][desc].append(ref)
- print_df(d_refs, "d_refs -> dict_d_refs")
+ print_df(d_refs, "dict_d_refs")
return d_refs
-def generate_cds_correlation(table, top_obs_corr):
- # index (repeated taxids)
- # other taxid
- # rank
- # rho
- # pval
-
- df_corr = pd.DataFrame(columns=["taxid", "rank", "rho", "pval"])
+def cds_correlation(table, corr):
+ df_corr = pd.DataFrame(columns=["taxid", "rank", "rho"])
for rank in table.ranks():
- if top_obs_corr:
- top_taxids = table.get_top(rank, top_obs_corr)
- matrix = table.get_subtable(taxids=top_taxids, rank=rank)
- else:
- top_taxids = sorted(table.observations(rank))
- matrix = table.data[rank]
-
- # No correlation with just one observation
- if len(matrix.columns) >= 2:
-
- rho, pval = stats.spearmanr(matrix)
-
- if len(matrix.columns) == 2:
- # If there are only 2 observations, return in a float
- # re-format in a matrix shape
- rho = np.array([[np.nan, np.nan], [rho, np.nan]])
- pval = np.array([[np.nan, np.nan], [pval, np.nan]])
- else:
- # fill upper triangular matrix (mirrored values) with nan to be ignored by pandas
- # to save half of the space
- rho[np.triu_indices(rho.shape[0])] = np.nan
-
- stacked_rank_df = pd.DataFrame(rho, index=top_taxids, columns=top_taxids).stack(dropna=False).reset_index(1)
- stacked_rank_df.rename(columns={"level_1": "taxid"}, inplace=True)
- stacked_rank_df.rename(columns={0: "rho"}, inplace=True)
- stacked_rank_df["rank"] = rank
- #stack pval
- stacked_rank_df["pval"] = np.ravel(pval)
-
- # Drop NA for rho (missing values and upper triangular matrix)
- stacked_rank_df.dropna(subset=['rho'], inplace=True)
-
- # Calculate corrected pvals
- stacked_rank_df["pval_corr"] = fdrcorrection_bh(stacked_rank_df["pval"].to_list())
+ stacked_rank_df = pd.DataFrame(corr[rank]["rho"], index=corr[rank]["observations"], columns=corr[rank]["observations"]).stack(dropna=False).reset_index(1)
+ stacked_rank_df.rename(columns={"level_1": "taxid"}, inplace=True)
+ stacked_rank_df.rename(columns={0: "rho"}, inplace=True)
+ stacked_rank_df["rank"] = rank
- # Filter by p-value
- #stacked_rank_df = stacked_rank_df[stacked_rank_df["pval_corr"] <= pval_cutoff]
+ # Drop NA for rho (missing values and upper triangular matrix)
+ stacked_rank_df.dropna(subset=['rho'], inplace=True)
- df_corr = pd.concat([df_corr, stacked_rank_df], axis=0)
+ df_corr = pd.concat([df_corr, stacked_rank_df], axis=0)
- print_df(df_corr, "df_corr -> cds_p_correlation")
+ print_df(df_corr, "cds_p_correlation")
return ColumnDataSource(df_corr)
-def generate_cds_obsbars(table, top_obs_bars):
+def cds_obsbars(table, top_obs_bars):
# index (unique sample-ids)
- # cols: 0, 1, ..., top_obs_bars, unassigned, others, factors
+ # cols: 1, 2, ..., top_obs_bars, unassigned, others, factors
#Load with data from first rank
top_taxids = table.get_top(table.ranks()[0], top_obs_bars)
@@ -458,16 +489,16 @@ def generate_cds_obsbars(table, top_obs_bars):
df_obsbars[str(ncol)] = 0
ncol += 1
# Other account for filtered taxa (not on top) and left over percentage for the rank without assignment
- df_obsbars["others"] = table.total - table.unassigned - df_obsbars.sum(axis=1)
- df_obsbars["unassigned"] = table.unassigned
- df_obsbars = transform_table(df_obsbars, table.total, "norm", 0) * 100
+ df_obsbars["others"] = table.get_total() - table.get_unassigned() - df_obsbars.sum(axis=1)
+ df_obsbars["unassigned"] = table.get_unassigned()
+ df_obsbars = transform_table(df_obsbars, table.get_total(), "norm", 0) * 100
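+ # rows now hold percentages per sample: top taxa + others + unassigned sum to 100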
df_obsbars["factors"] = df_obsbars.index.to_list()
- print_df(df_obsbars, "df_obsbars -> cds_p_obsbars")
+ print_df(df_obsbars, "cds_p_obsbars")
return ColumnDataSource(df_obsbars)
-def generate_cds_mgnify(mgnify, table, tax):
+def cds_mgnify(mgnify, table, tax):
# index (taxa, level, lineage)
# count for each combination of index
@@ -475,7 +506,7 @@ def generate_cds_mgnify(mgnify, table, tax):
# Match uids (taxid or names) from input and keep only found elements
uids = [txid for rank in table.ranks() for txid in table.observations(rank)]
- df_tmp = mgnify.data[mgnify.data['taxa'].isin(uids)]
+ df_tmp = mgnify[mgnify['taxa'].isin(uids)]
# reset index to properly concate later with biome lineages
df_tmp.reset_index(drop=True, inplace=True)
@@ -506,7 +537,7 @@ def generate_cds_mgnify(mgnify, table, tax):
# Calculate angle for each taxa/level for wedges
total_taxa_level = df_biome.groupby("taxa").sum().to_dict()["count"]
- df_biome["angle"] = (df_biome['count'] / df_biome['taxa'].map(total_taxa_level)) * (2*pi)
+ df_biome["angle"] = (df_biome['count'] / df_biome['taxa'].map(total_taxa_level)) * (2 * pi)
# Group to the final df
df_mgnify = pd.concat([df_mgnify, df_biome], axis=0, ignore_index=True)
@@ -514,5 +545,5 @@ def generate_cds_mgnify(mgnify, table, tax):
# set index
df_mgnify.set_index('taxa', inplace=True)
- print_df(df_mgnify, "df_mgnify -> cds_p_mgnify")
+ print_df(df_mgnify, "cds_p_mgnify")
return ColumnDataSource(df_mgnify)
diff --git a/grimer/config.py b/grimer/config.py
new file mode 100644
index 0000000..3f3d640
--- /dev/null
+++ b/grimer/config.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python3
+import argparse
+from scipy.spatial.distance import _METRICS_NAMES
+from scipy.cluster.hierarchy import _LINKAGE_METHODS
+
+
+class Config:
+
+ version = "1.0.0-alpha1"
+ default_rank_name = "default"
+ output_plots = ["overview", "samples", "heatmap", "correlation"]
+
+ def __new__(self, argv=None):
+
+ parser = argparse.ArgumentParser(description='grimer')
+
+ required_group = parser.add_argument_group('required arguments')
+ required_group.add_argument('-i', '--input-file', required=True, type=str, help="Main input table with counts (Observation table, Count table, Contingency Tables, ...) or .biom file. By default rows contain observations and columns contain samples (use --transpose if your file is reversed). First column and first row are used as headers.")
+
+ main_group = parser.add_argument_group('main arguments')
+ main_group.add_argument('-m', '--metadata-file', type=str, help="Input metadata file in simple tabular format with samples in rows and metadata fields in columns. QIIME 2 metadata format is also accepted, with an extra row to define categorical and numerical fields. If not provided and --input-file is a .biom file, GRIMER will attempt to get metadata from it.")
+ main_group.add_argument('-t', '--taxonomy', type=str, default=None, help="Define taxonomy to convert entries and annotate samples. Will automatically download and parse taxonomy files, or specific files can be provided with --tax-files.", choices=["ncbi", "gtdb", "silva", "greengenes", "ott"])
+ main_group.add_argument('-b', '--tax-files', nargs="*", type=str, default=[], help="Optional specific taxonomy files to use.")
+ main_group.add_argument('-r', '--ranks', nargs="*", default=[Config.default_rank_name], type=str, help="Taxonomic ranks to generate visualizations. Use '" + Config.default_rank_name + "' to use entries from the table directly. Default: " + Config.default_rank_name)
+ main_group.add_argument('-c', '--config', type=str, help="Configuration file with definitions of references, controls and external tools.")
+
+ output_group = parser.add_argument_group('output arguments')
+ output_group.add_argument('-g', '--mgnify', default=False, action='store_true', help="Use MGNify data")
+ output_group.add_argument('-d', '--decontam', default=False, action='store_true', help="Run DECONTAM")
+ output_group.add_argument('-l', '--title', type=str, default="", help="Title to display on the header of the report.")
+ output_group.add_argument('-p', '--output-plots', nargs="*", type=str, default=Config.output_plots, help="Plots to generate. Default: " + ",".join(Config.output_plots), choices=Config.output_plots)
+ output_group.add_argument('-o', '--output-html', type=str, default="output.html", help="File to output report. Default: output.html")
+ output_group.add_argument('--full-offline', default=False, action='store_true', help="Embed javascript library in the output file. File will be around 1.5MB bigger but will also work without an internet connection. That way your report will live forever.")
+
+ data_group = parser.add_argument_group('general data options')
+ data_group.add_argument('-f', '--level-separator', default=None, type=str, help="If provided, consider --input-file to be a hierarchical multi-level table where the observation headers are separated by the indicated separator character (usually ';' or '|')")
+ data_group.add_argument('-y', '--values', default=None, type=str, help="Force 'count' or 'normalized' data parsing. Empty to auto-detect.")
+ data_group.add_argument('-s', '--transpose', default=False, action='store_true', help="Transpose --input-file (if samples are listed on columns and observations on rows)")
+ data_group.add_argument('-u', '--unassigned-header', nargs="*", type=str, default=None, help="Define one or more header names containing unassigned/unclassified counts.")
+ data_group.add_argument('--obs-replace', nargs="*", type=str, default=[], help="Replace values on table observations labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching regex.")
+ data_group.add_argument('--sample-replace', nargs="*", type=str, default=[], help="Replace values on table sample labels/headers (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching regex.")
+ data_group.add_argument('-z', '--replace-zeros', type=str, default="1000", help="INT (add 'smallest count divided by INT' to every raw count) or FLOAT (add FLOAT to every raw count). Default: 1000")
+ data_group.add_argument('--min-frequency', type=float, help="Define minimum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].")
+ data_group.add_argument('--max-frequency', type=float, help="Define maximum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].")
+ data_group.add_argument('--min-count', type=float, help="Define minimum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].")
+ data_group.add_argument('--max-count', type=float, help="Define maximum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].")
+
+ sample_group = parser.add_argument_group('Sample options')
+ sample_group.add_argument('-j', '--top-obs-bars', type=int, default=20, help="Top abundant observations to show in the bars.")
+
+ heatmap_group = parser.add_argument_group('Heatmap and clustering options')
+ heatmap_group.add_argument('-a', '--transformation', type=str, default="log", help="none (counts), norm (percentage), log (log10), clr (centre log ratio). Default: log")
+ heatmap_group.add_argument('-e', '--metadata-cols', type=int, default=3, help="How many metadata cols to show on the heatmap. Higher values make the plot slower to navigate.")
+ heatmap_group.add_argument('--optimal-ordering', default=False, action='store_true', help="Activate optimal_ordering on linkage, takes longer for large number of samples.")
+ heatmap_group.add_argument('--show-zeros', default=False, action='store_true', help="Do not skip zeros on heatmap. File will be bigger and interaction with the heatmap slower.")
+ heatmap_group.add_argument('--linkage-methods', type=str, nargs="*", default=["complete"], choices=list(_LINKAGE_METHODS))
+ heatmap_group.add_argument('--linkage-metrics', type=str, nargs="*", default=["euclidean"], choices=_METRICS_NAMES)
+ heatmap_group.add_argument('--skip-dendrogram', default=False, action='store_true', help="Disable dendrogram. Will create smaller files.")
+
+ correlation_group = parser.add_argument_group('Correlation options')
+ correlation_group.add_argument('-x', '--top-obs-corr', type=int, default=50, help="Top abundant observations to build the correlation matrix, based on the avg. percentage of counts per sample. Use 0 for all")
+
+ parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + Config.version)
+ parser.add_argument('-D', '--debug', default=False, action='store_true', help=argparse.SUPPRESS)
+
+ return parser.parse_args(argv)
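+
+ # Note: __new__ returns the parsed arguments directly, so a call such as
+ # Config(["-i", "table.tsv", "-t", "ncbi"]) yields an argparse.Namespace
+ # rather than a Config instance (the file name here is hypothetical).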
diff --git a/grimer/css/popup.css b/grimer/css/popup.css
index 6656aa3..d0262a2 100644
--- a/grimer/css/popup.css
+++ b/grimer/css/popup.css
@@ -49,4 +49,4 @@
color: #fff;
font-size: 32px;
cursor: pointer;
-}
\ No newline at end of file
+}
diff --git a/grimer/decontam.py b/grimer/decontam.py
index b8853b4..5b2c7fa 100644
--- a/grimer/decontam.py
+++ b/grimer/decontam.py
@@ -17,18 +17,19 @@ def add_rank_results(self, rank, decontam_out_file, decontam_mod_file):
# Parse models enforcing index as string
mod = pd.read_table(decontam_mod_file, sep='\t', header=0, skiprows=0, index_col=0, dtype={0: str})
-
+
# Remove point counter at end (.1 or .1000)
mod.index = mod.index.map(lambda txid: txid[:-5] if txid.endswith(".1000") else txid[:-2]).to_list()
-
+
# Merge first point of model
self.rank[rank] = self.rank[rank].merge(mod.iloc[0::2, 0], left_index=True, right_index=True)
-
+
# Merge second point of model and non-contant line
self.rank[rank] = self.rank[rank].merge(mod.iloc[1::2, :], suffixes=["", "_2"], left_index=True, right_index=True)
def add_rank_empty(self, rank, idx):
self.rank[rank] = pd.DataFrame(index=idx, columns=self.cols_rank + ["contam", "contam_2", "non.contam"])
+ self.rank[rank]["contaminant"] = False
def get_data(self):
return self.data.fillna(False)
@@ -36,11 +37,11 @@ def get_data(self):
def get_contaminants(self, rank, idx):
return self.rank[rank].reindex(idx)["contaminant"]
- def get_pvalues(self, rank, idx):
+ def get_pscore(self, rank, idx):
return self.rank[rank].reindex(idx)["p"]
def get_contaminant_list(self):
clist = []
for r in self.rank:
- clist.extend(self.rank[r].index[self.rank[r]["contaminant"]==True].to_list())
+ clist.extend(self.rank[r].index[self.rank[r]["contaminant"] == True].to_list())
return clist
diff --git a/grimer/utils.py b/grimer/func.py
similarity index 56%
rename from grimer/utils.py
rename to grimer/func.py
index 533bd3f..3738403 100644
--- a/grimer/utils.py
+++ b/grimer/func.py
@@ -6,15 +6,21 @@
import shlex
import pandas as pd
import yaml
-from collections import OrderedDict
#Internal
from grimer.decontam import Decontam
-from grimer.plots import make_color_palette
-from grimer.source import Source
+from grimer.metadata import Metadata
+from grimer.reference import Reference
+from grimer.table import Table
+
+# Bokeh
+from bokeh.palettes import Category10, Category20, Colorblind, linear_palette, Turbo256
+
+# MultiTax
+from multitax import *
#biom
-from biom import parse_table as parse_table_biom
+import biom
# scikit-bio
from skbio.stats.composition import clr
@@ -23,11 +29,251 @@
import scipy.cluster.hierarchy as sch
-def parse_input_table(input_file, unassigned_header, transpose, min_frequency, max_frequency, min_count, max_count):
+def parse_config_file(config):
+ cfg = None
+ if config:
+ try:
+ with open(config, 'r') as file:
+ cfg = yaml.safe_load(file)
+ except Exception as e:
+ print_log("Failed loading configuration file [" + config + "], skipping")
+ print_log(str(e))
+ else:
+ print_log("Not provided, skipping")
+ return cfg
+
+
+def parse_taxonomy(taxonomy, tax_files):
+ tax = None
+ if taxonomy is not None:
+ try:
+ if not tax_files:
+ print_log("Downloading taxonomy")
+ if taxonomy == "ncbi":
+ tax = NcbiTx(files=tax_files, extended_names=True)
+ elif taxonomy == "gtdb":
+ tax = GtdbTx(files=tax_files)
+ elif taxonomy == "silva":
+ tax = SilvaTx(files=tax_files)
+ elif taxonomy == "greengenes":
+ tax = GreengenesTx(files=tax_files)
+ elif taxonomy == "ott":
+ tax = OttTx(files=tax_files, extended_names=True)
+ else:
+ raise ValueError("Unknown taxonomy: " + taxonomy)
+ except Exception as e:
+ print_log("Failed loading " + taxonomy + " taxonomy, skipping")
+ print_log(str(e))
+ else:
+ print_log("Not provided, skipping")
+ return tax
+
+
+def parse_table(args, tax):
+ # Specific default params if biom file is provided
+ if args.input_file.endswith(".biom"):
+ if not args.level_separator:
+ args.level_separator = ";"
+ args.transpose = True
+
+ # Read and return full table with separated total and unassigned counts (sharing same index)
+ table_df, total, unassigned = parse_input_file(args.input_file, args.unassigned_header, args.transpose, args.sample_replace)
+
+ if table_df.empty:
+ raise Exception("Error parsing input file")
+
+ # Define if table is already normalized (0-100) or has count data
+ if args.values == "count":
+ normalized = False
+ elif args.values == "normalized":
+ normalized = True
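+ # auto-detect: samples summing to ~100 or containing fractional values are treated as normalized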
+ elif (table_df.sum(axis=1).round() == 100).all() or (table_df % 1 != 0).any().any():
+ normalized = True
+ else:
+ normalized = False
+
+ # Zero replacement
+ try:
+ replace_zero_value = table_df[table_df.gt(0)].min().min() / int(args.replace_zeros)
+ except (ValueError, ZeroDivisionError):
+ replace_zero_value = float(args.replace_zeros)
+ if replace_zero_value == 1 and args.transformation == "log":
+ replace_zero_value = 0.999999 # do not allow the value 1 with log (log10(1) = 0)
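+ # e.g. with a smallest count of 1 and the default --replace-zeros 1000,
+ # 0.001 is added to every raw count before the log transformation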
+
+ # Split table into ranks: taken from the observation headers of multi-level tables or created as a single default rank for one-level tables
+ if args.level_separator:
+ ranked_tables, lineage = parse_multi_table(table_df, args.ranks, tax, args.level_separator, args.obs_replace)
+ else:
+ ranked_tables, lineage = parse_single_table(table_df, args.ranks, tax, Config.default_rank_name)
+
+ if not ranked_tables:
+ raise Exception("Error parsing input file")
+
+ table = Table(table_df.index, total, unassigned, lineage, normalized, replace_zero_value)
+
+ print_log("")
+ print_log("Total valid samples: " + str(len(table.samples)))
+ # Check for long sample headers, break some plots
+ long_sample_headers = [h for h in table_df.index if len(h) > 70]
+ if long_sample_headers:
+ print_log("Long sample labels/headers detected, plots may break: ")
+ print_log("\n".join(long_sample_headers))
+ print_log("")
+
+ for r, t in ranked_tables.items():
+ print_log("--- " + r + " ---")
+ filtered_trimmed_t = trim_table(filter_input_table(t, total, args.min_frequency, args.max_frequency, args.min_count, args.max_count, normalized))
+ if filtered_trimmed_t.empty:
+ print_log("No valid entries, skipping")
+ else:
+ # Add the filtered and trimmed table for this rank
+ table.add_rank(r, filtered_trimmed_t)
+ print_log("Total valid observations: " + str(len(table.observations(r))))
+
+ print_log("")
+
+ if not normalized:
+ print_log("Total assigned (counts): " + str(table.get_total().sum() - table.get_unassigned().sum()))
+ print_log("Total unassigned (counts): " + str(table.get_unassigned().sum()))
+ print_log("")
+
+ return table
+
+
+def parse_metadata(args, table):
+ metadata = None
+ if args.metadata_file:
+ metadata = Metadata(metadata_file=args.metadata_file, samples=table.samples.to_list())
+ elif args.input_file.endswith(".biom"):
+ try:
+ biom_in = biom.load_table(args.input_file)
+ if biom_in.metadata() is not None:
+ metadata = Metadata(metadata_table=biom_in.metadata_to_dataframe(axis="sample"), samples=table.samples.to_list())
+ except Exception:
+ print_log("Error parsing metadata from BIOM file, skipping")
+ return None
+
+ if metadata is None or metadata.data.empty:
+ print_log("No valid metadata, skipping")
+ return None
+
+ print_log("Samples: " + str(metadata.data.shape[0]))
+ print_log("Numeric Fields: " + str(metadata.get_data("numeric").shape[1]))
+ print_log("Categorical Fields: " + str(metadata.get_data("categorical").shape[1]))
+ if len(metadata.get_col_headers()) < args.metadata_cols:
+ args.metadata_cols = len(metadata.get_col_headers())
+
+ return metadata
+
+
+def parse_references(cfg, tax, taxonomy, ranks):
+ references = None
+ if cfg is not None and "references" in cfg:
+ if taxonomy == "ncbi":
+ references = {}
+ for desc, sf in cfg["references"].items():
+ references[desc] = Reference(file=sf)
+ if tax:
+ # Update taxids / get taxid from name
+ references[desc].update_taxids(update_tax_nodes(references[desc].ids, tax))
+ for i in list(references[desc].ids.keys()):
+ # lineage of all parent nodes (without itself)
+ for l in tax.lineage(i)[:-1]:
+ references[desc].add_parent(l, i)
+ else:
+ print_log("References only possible with ncbi taxonomy, skipping")
+ else:
+ print_log("No references defined in the configuration file, skipping")
+ return references
+
+
+def parse_controls(cfg, table):
+ controls = None
+ control_samples = None
+ if cfg is not None and "controls" in cfg:
+ controls = {}
+ control_samples = {}
+ for desc, cf in cfg["controls"].items():
+ with open(cf, "r") as file:
+ samples = file.read().splitlines()
+ obs = set()
+ valid_samples = set()
+ for rank in table.ranks():
+ # Retrieve sub-table for every rank
+ control_table = table.get_subtable(rank, samples=samples)
+ obs.update(control_table.columns.to_list())
+ valid_samples.update(control_table.index.to_list())
+
+ # Add control observations as a reference
+ controls[desc] = Reference(ids=obs)
+ control_samples[desc] = list(valid_samples)
+ else:
+ print_log("No controls defined in the configuration file, skipping")
+
+ return controls, control_samples
+
+
+def parse_mgnify(run_mgnify, cfg, tax, ranks):
+ mgnify = None
+ if run_mgnify:
+ if cfg is not None and "mgnify" in cfg["external"]:
+ try:
+ mgnify = pd.read_table(cfg["external"]["mgnify"], header=None, names=["rank", "taxa", "biome", "count"])
+ except Exception as e:
+ print_log("Failed parsing MGnify database file [" + cfg["external"]["mgnify"] + "], skipping")
+ print_log(str(e))
+ if mgnify is not None:
+ # Filter to keep only used ranks, if provided
+ if ranks:
+ mgnify = mgnify.loc[mgnify['rank'].isin(ranks)]
+ mgnify.reset_index(drop=True, inplace=True)
+ # Convert taxids if tax is provided
+ if tax:
+ updated_nodes = update_tax_nodes([tuple(x) for x in mgnify[["rank", "taxa"]].to_numpy()], tax)
+ mgnify["taxa"] = mgnify[["rank", "taxa"]].apply(lambda rt: updated_nodes[(rt[0], rt[1])] if updated_nodes[(rt[0], rt[1])] is not None else rt[1], axis=1)
+ if mgnify.empty:
+ mgnify = None
+ print_log("No matches with MGnify database, skipping")
+ else:
+ print_log("Not defined in the configuration file, skipping")
+ else:
+ print_log("Not activated, skipping")
+ return mgnify
+
+
+def run_correlation(table, top_obs_corr):
+ corr = {}
+ for rank in table.ranks():
+ corr[rank] = {}
+ if top_obs_corr:
+ top_taxids = sorted(table.get_top(rank, top_obs_corr))
+ matrix = table.get_subtable(taxids=top_taxids, rank=rank)
+ else:
+ top_taxids = sorted(table.observations(rank))
+ matrix = table.data[rank]
+
+ corr[rank]["observations"] = top_taxids
+ corr[rank]["rho"] = []
+ # No correlation with just one observation
+ if len(matrix.columns) >= 2:
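+ # rho is computed on clr-transformed counts (see pairwise_rho below)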
+ rho = pairwise_rho(transform_table(matrix, 0, "clr", table.zerorep).values)
+ if len(matrix.columns) == 2:
+ # If there are only 2 observations, return in a float
+ # re-format in a matrix shape
+ rho = np.array([[np.nan, np.nan], [rho[1, 0], np.nan]])
+ else:
+ # fill upper triangular matrix (mirrored values) with nan to be ignored by pandas
+ # to save half of the space
+ rho[np.triu_indices(rho.shape[0])] = np.nan
+
+ corr[rank]["rho"] = rho
+
+ return corr
+
+
+def parse_input_file(input_file, unassigned_header, transpose, sample_replace):
if input_file.endswith(".biom"):
- with open(input_file, "r") as f:
- table_df = parse_table_biom(f).to_dataframe(dense=True)
+ table_df = biom.load_table(input_file).to_dataframe(dense=True)
else:
# Default input_file: index=observations, columns=samples
# table_df should have samples on indices and observations on columns
@@ -38,19 +284,34 @@ def parse_input_table(input_file, unassigned_header, transpose, min_frequency, m
table_df = table_df.transpose()
# Remove header on rows
- table_df.index.names = [None]
+ table_df.index.name = None
+
+ # Replace text on sample labels
+ if sample_replace:
+ print_log("Replacing sample values:")
+ before_replace = table_df.head(1).index
+ # get index as a series to use the replace method
+ new_index = table_df.reset_index()["index"].replace(regex=dict(zip(sample_replace[::2], sample_replace[1::2])))
+ table_df.set_index(new_index, inplace=True)
+ for b, a in zip(before_replace, table_df.head(1).index):
+ print_log(" " + b + " -> " + a)
+ print_log(" ...")
# Sum total before split unassigned or filter
total = table_df.sum(axis=1)
# unique unassigned/unclassified for table
+ # Separate unassigned counts column from main data frame
unassigned = pd.Series(0, index=table_df.index)
if unassigned_header:
for header in unassigned_header:
if header in table_df.columns:
- # Separate unassigned counts column from main data frame
- # Sum in case there are several equally named headers
- unassigned = unassigned + table_df[header].sum(axis=1)
+ if isinstance(table_df[header], pd.DataFrame):
+ # Sum in case there are several equally named headers
+ unassigned += table_df[header].sum(axis=1)
+ else:
+ # return a pd.Series
+ unassigned += table_df[header]
table_df.drop(columns=header, inplace=True)
else:
print_log("'" + header + "' header not found")
@@ -58,23 +319,22 @@ def parse_input_table(input_file, unassigned_header, transpose, min_frequency, m
if unassigned.sum() == 0:
print_log("No unassigned entries defined")
- print_log("")
- print_log("- Filtering table")
- table_df = filter_input_table(table_df, total, min_frequency, max_frequency, min_count, max_count)
+ print_log("Trimming table")
+ table_df = trim_table(table_df)
- # Filter based on the table
+ # Filter based on the final table
unassigned = unassigned.reindex(table_df.index)
total = total.reindex(table_df.index)
return table_df, total, unassigned
-def filter_input_table(table_df, total, min_frequency, max_frequency, min_count, max_count):
+def filter_input_table(table_df, total, min_frequency, max_frequency, min_count, max_count, normalized):
if min_count:
cnt = table_df.sum().sum()
if min_count < 1:
- table_df_norm = transform_table(table_df, total, "norm", 0)
+ table_df_norm = transform_table(table_df, total, "norm", 0) if not normalized else table_df
table_df = table_df[table_df_norm >= min_count].fillna(0)
elif min_count > 1:
table_df = table_df[table_df >= min_count].fillna(0)
@@ -83,7 +343,7 @@ def filter_input_table(table_df, total, min_frequency, max_frequency, min_count,
if max_count:
cnt = table_df.sum().sum()
if max_count < 1:
- table_df_norm = transform_table(table_df, total, "norm", 0)
+ table_df_norm = transform_table(table_df, total, "norm", 0) if not normalized else table_df
table_df = table_df[table_df_norm <= max_count].fillna(0)
elif max_count > 1:
table_df = table_df[table_df <= max_count].fillna(0)
@@ -109,27 +369,33 @@ def filter_input_table(table_df, total, min_frequency, max_frequency, min_count,
table_df = table_df.loc[:, table_df_freq <= max_frequency]
print_log(str(int(cnt - table_df.shape[1])) + " observations removed with --max-frequency " + str(max_frequency))
+ return table_df
+
+
+def trim_table(table_df):
# Check for cols/rows with sum zero
- zero_rows = table_df.sum(axis=1) == 0
+ zero_rows = table_df.sum(axis=1).eq(0)
if any(zero_rows):
table_df = table_df.loc[~zero_rows, :]
- print_log(str(sum(zero_rows)) + " samples without valid counts removed")
+ print_log(str(sum(zero_rows)) + " samples with only zeros removed")
- zero_cols = table_df.sum(axis=0) == 0
+ zero_cols = table_df.sum(axis=0).eq(0)
if any(zero_cols):
table_df = table_df.loc[:, ~zero_cols]
- print_log(str(sum(zero_cols)) + " observations without valid counts removed")
+ print_log(str(sum(zero_cols)) + " observations with only zeros removed")
return table_df
def parse_multi_table(table_df, ranks, tax, level_separator, obs_replace):
+ from grimer.grimer import _debug
+
# Transpose table (observations as index) and expand ranks in columns
ranks_df = table_df.T.index.str.split(level_separator, expand=True).to_frame(index=False)
# For every pair of replace arguments
if obs_replace:
- print_log("Replacing values:")
+ print_log("Replacing observation values:")
before_replace = ranks_df.dropna().head(1).values[0]
ranks_df.replace(regex=dict(zip(obs_replace[::2], obs_replace[1::2])), inplace=True)
for b, a in zip(before_replace, ranks_df.dropna().head(1).values[0]):
@@ -150,7 +416,7 @@ def parse_multi_table(table_df, ranks, tax, level_separator, obs_replace):
ranks_df.rename(columns=parsed_ranks, inplace=True)
# Update taxids
- if tax:
+ if tax is not None:
unmatched_nodes = 0
for i, r in parsed_ranks.items():
rank_nodes = ranks_df[r].dropna().unique()
@@ -163,9 +429,9 @@ def parse_multi_table(table_df, ranks, tax, level_separator, obs_replace):
else:
updated_nodes = update_tax_nodes(rank_nodes, tax)
- # Add nan to convert empty ranks
- updated_nodes[np.nan] = tax.undefined_node
- ranks_df[r] = ranks_df[r].map(lambda t: updated_nodes[t] if updated_nodes[t] else t)
+ # Add nan to keep missing ranks (different from tax.undefined_node [None], which keeps the name)
+ updated_nodes[np.nan] = np.nan
+ ranks_df[r] = ranks_df[r].map(lambda t: updated_nodes[t] if updated_nodes[t] is not None else t)
del updated_nodes[np.nan]
unmatched_nodes += list(updated_nodes.values()).count(tax.undefined_node)
@@ -180,28 +446,31 @@ def parse_multi_table(table_df, ranks, tax, level_separator, obs_replace):
invalid = lin_count[(lin_count > 1).any(axis=1)].index.to_list()
if invalid:
print_log(str(len(invalid)) + " observations removed with invalid lineage at " + r)
+ if _debug:
+ print_log("Invalid lineage at " + r + ": " + ",".join(invalid))
# Set to NaN to keep shape of ranks_df
ranks_df.loc[ranks_df[r].isin(invalid), r] = np.nan
-
+
ranked_tables = {}
for i, r in parsed_ranks.items():
# ranks_df and table_df.T have the same shape
ranked_table_df = pd.concat([ranks_df[r], table_df.T.reset_index(drop=True)], axis=1)
ranked_tables[r] = ranked_table_df.groupby([r], dropna=True).sum().T
+ ranked_tables[r].columns.name = None
lineage = ranks_df
-
return ranked_tables, lineage
def parse_single_table(table_df, ranks, tax, default_rank_name):
# Update taxids
- if tax:
+ if tax is not None:
updated_nodes = update_tax_nodes(table_df.columns, tax)
unmatched_nodes = list(updated_nodes.values()).count(tax.undefined_node)
if unmatched_nodes:
print_log(str(unmatched_nodes) + " observations not found in taxonomy")
+
for node, upd_node in updated_nodes.items():
if upd_node is not None and upd_node != node:
# If updated node is a merge on an existing taxid, sum values
@@ -242,29 +511,12 @@ def parse_single_table(table_df, ranks, tax, default_rank_name):
return ranked_tables, lineage
-def fdrcorrection_bh(pvals):
- """
- Correct multiple p-values with the Benjamini/Hochberg method
- Code copied and adapted from: statsmodels.stats.multitest.multipletests
- https://github.com/statsmodels/statsmodels/blob/77bb1d276c7d11bc8657497b4307aa7575c3e65c/statsmodels/stats/multitest.py
- """
- pvals_sortind = np.argsort(pvals)
- pvals_sorted = np.take(pvals, pvals_sortind)
-
- nobs = len(pvals_sorted)
- factors = np.arange(1, nobs + 1) / float(nobs)
-
- pvals_corrected_raw = pvals_sorted / factors
- pvals_corrected = np.minimum.accumulate(pvals_corrected_raw[::-1])[::-1]
- pvals_corrected[pvals_corrected > 1] = 1
-
- pvals_corrected_ = np.empty_like(pvals_corrected)
- pvals_corrected_[pvals_sortind] = pvals_corrected
-
- return pvals_corrected_
-
-
def transform_table(df, total_counts, transformation, replace_zero_value):
+ # Special case: clr with one observation results in all zeros
+ if transformation == "clr" and df.shape[1] == 1:
+ print_log("WARNING: using log instead of clr with one observation")
+ transformation = "log"
+
if transformation == "log":
transformed_df = (df + replace_zero_value).apply(np.log10)
elif transformation == "clr":
@@ -309,7 +561,16 @@ def update_tax_nodes(nodes, tax):
return updated_nodes
-def run_decontam(cfg, table, metadata, control_samples):
+def run_decontam(run_decontam, cfg, table, metadata, control_samples):
+
+ if not run_decontam:
+ print_log("Not activated, skipping")
+ return None
+
+ if cfg is None:
+ print_log("Not defined in the configuration file, skipping")
+ return None
+
df_decontam = pd.DataFrame(index=table.samples, columns=["concentration", "controls"])
cfg_decontam = cfg["external"]["decontam"]
tmp_output_prefix = "tmp_"
@@ -327,23 +588,25 @@ def run_decontam(cfg, table, metadata, control_samples):
df_decontam["concentration"] = pd.read_table(cfg_decontam["frequency_file"], sep='\t', header=None, skiprows=0, index_col=0).reindex(table.samples)
# If any entry is unknown, input is incomplete
if df_decontam["concentration"].isnull().values.any():
- print_log("File " + cfg_decontam["frequency_file"] + " is incomplete (Missing: " + ",".join(df_decontam[df_decontam.isnull().any(axis=1)].index.to_list()) + ") Skipping DECONTAM.")
+ print_log("File " + cfg_decontam["frequency_file"] + " is incomplete (Missing: " + ",".join(df_decontam[df_decontam.isnull().any(axis=1)].index.to_list()) + "), skipping")
return None
else:
- print_log("File " + cfg_decontam["frequency_file"] + " not found. Skipping DECONTAM.")
+ print_log("File " + cfg_decontam["frequency_file"] + " not found, skipping")
return None
elif "frequency_metadata" in cfg_decontam:
if cfg_decontam["frequency_metadata"] in metadata.get_col_headers():
# Get concentrations from metadata
df_decontam["concentration"] = metadata.get_col(cfg_decontam["frequency_metadata"])
else:
- print_log("Could not find " + cfg_decontam["frequency_metadata"] + " in the metadata. Skipping DECONTAM.")
+ print_log("Could not find " + cfg_decontam["frequency_metadata"] + " in the metadata, skipping.")
return None
- else:
+ elif not table.normalized:
# Use total from table
- print_log("WARNING: Using total counts as frequency for DECONTAM")
- df_decontam["concentration"] = table.total
-
+ print_log("No concentration provided, using total counts as concentration (frequency for DECONTAM)")
+ df_decontam["concentration"] = table.get_total()
+ else:
+ print_log("Cannot run DECONTAM without defined concentration and normalized input values, skipping")
+ return None
# Print concentrations to file
df_decontam["concentration"].to_csv(out_concentration, sep="\t", header=False, index=True)
@@ -376,18 +639,20 @@ def run_decontam(cfg, table, metadata, control_samples):
print("\n".join(df_decontam.index[df_decontam["controls"]]), file=outf)
outf.close()
else:
- print("Could not find valid control entries. Skipping DECONTAM")
+ print("Could not find valid control entries, skipping")
return None
decontam = Decontam(df_decontam)
# Run DECONTAM for each rank
for rank in table.ranks():
-
if len(table.observations(rank)) == 1:
decontam.add_rank_empty(rank, table.observations(rank))
else:
- # normalized and write temporary table for each rank
- transform_table(table.data[rank], table.total, "norm", 0).to_csv(out_table, sep="\t", header=True, index=True)
+ # normalize and write temporary table for each rank
+ if not table.normalized:
+ transform_table(table.data[rank], table.get_total()[table.data[rank].index], "norm", 0).to_csv(out_table, sep="\t", header=True, index=True)
+ else:
+ table.data[rank].to_csv(out_table, sep="\t", header=True, index=True)
cmd = " ".join(["scripts/run_decontam.R",
"--resout " + tmp_output_prefix + "decontam_out.tsv",
@@ -404,16 +669,18 @@ def run_decontam(cfg, table, metadata, control_samples):
for file in [out_table, out_concentration, out_controls, tmp_output_prefix + "decontam_out.tsv", tmp_output_prefix + "decontam_mod.tsv"]:
if os.path.isfile(file):
os.remove(file)
+
return decontam
-def run_hclustering(table, linkage_methods, linkage_metrics, transformation, replace_zero_value, skip_dendrogram, optimal_ordering):
+def run_hclustering(table, linkage_methods, linkage_metrics, transformation, skip_dendrogram, optimal_ordering):
hcluster = {}
dendro = {}
for rank in table.ranks():
+
# Get .values of transform, numpy array
- matrix = transform_table(table.data[rank], table.total, transformation, replace_zero_value).values
+ matrix = transform_table(table.data[rank], table.get_total(), transformation, table.zerorep).values
hcluster[rank] = {}
dendro[rank] = {}
@@ -481,6 +748,17 @@ def dendro_lines_color(dendro, axis):
return icoord, dcoord, colors
+def pairwise_vlr(mat):
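+ # Variance of the log-ratio (VLR) for every pair of columns, assuming the
+ # input is already on a log/clr scale: vlr_ij = var_i + var_j - 2*cov_ij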
+ cov = np.cov(mat.T, ddof=1)
+ diagonal = np.diagonal(cov)
+ return -2 * cov + diagonal[:, np.newaxis] + diagonal
+
+
+def pairwise_rho(mat):
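+ # Proportionality coefficient: rho_ij = 1 - vlr_ij / (var_i + var_j),
+ # a correlation-like measure for compositional (clr-transformed) data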
+ variances = np.var(mat, axis=0, ddof=1)
+ return 1 - (pairwise_vlr(mat) / np.add.outer(variances, variances))
+
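+# Sanity check (illustrative): log-scale columns differing only by a constant
+# correspond to perfectly proportional raw counts, so rho is exactly 1:
+# pairwise_rho(np.array([[0., 3.], [1., 4.], [2., 5.]]))[1, 0] == 1.0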
+
def include_scripts(scripts):
# Insert global js functions and css and return template
template = "{% block postamble %}"
@@ -493,60 +771,38 @@ def include_scripts(scripts):
return template
-def parse_sources(cfg, tax, ranks):
- contaminants = {}
- references = {}
- for desc, sf in cfg["sources"]["contaminants"].items():
- contaminants[desc] = Source(file=sf)
-
- # Update taxids / get taxid from name
- if tax:
- contaminants[desc].update_taxids(update_tax_nodes(contaminants[desc].ids, tax))
- ids = contaminants[desc].ids
-
- # lineage of all children nodes
- for i in ids:
- for lin in map(lambda txid: tax.lineage(txid, root_node=i), tax.leaves(i)):
- contaminants[desc].update_lineage(lin)
- contaminants[desc].update_refs({i: l for l in lin})
-
- for desc, sf in cfg["sources"]["references"].items():
- references[desc] = Source(file=sf)
- # Update lineage and refs based on given taxonomy
- if tax:
- references[desc].update_taxids(update_tax_nodes(references[desc].ids, tax))
- ids = references[desc].ids
-
- # lineage of all children nodes
- for i in ids:
- for lin in map(lambda txid: tax.lineage(txid, root_node=i), tax.leaves(i)):
- references[desc].update_lineage(lin)
- references[desc].update_refs({i: l for l in lin})
-
- return contaminants, references
-
-
-def parse_controls(cfg, tax, table):
- controls = {}
- control_samples = {}
-
- if "samples" in cfg:
- if "controls" in cfg["samples"]:
- for desc, cf in cfg["samples"]["controls"].items():
- with open(cf, "r") as file:
- samples = file.read().splitlines()
- control_table = table.get_subtable(table.ranks()[-1], samples=samples)
- controls[desc] = Source(ids=control_table.columns.to_list())
- control_samples[desc] = control_table.index.to_list()
-
- if tax:
- ids = controls[desc].ids
- # lineage of all children nodes
- for i in ids:
- for lin in map(lambda txid: tax.lineage(txid, root_node=i), tax.leaves(i)):
- controls[desc].update_lineage(lin)
+def format_js_toString(val):
+ # Convert numeric values to float before stringifying, to match the javascript toString representation
+ return str(float(val)) if isinstance(val, (int, float)) else str(val)
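+ # e.g. format_js_toString(5) -> "5.0"; format_js_toString("abc") -> "abc"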
- return controls, control_samples
+
+def make_color_palette(n_colors, linear: bool=False, palette: dict=None):
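+ # small n: fixed categorical palettes; large n or linear=True: evenly spaced colors from a 256-color palette
+ # e.g. make_color_palette(4) -> 4 colors from Colorblind[8]; make_color_palette(30) -> 30 evenly spaced Turbo256 colors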
+ if isinstance(palette, dict) and n_colors <= max(palette.keys()):
+ # Special case for 1 and 2 (not in palettes)
+ palette = palette[3 if n_colors < 3 else n_colors]
+
+ if linear or n_colors > 20:
+ if not palette:
+ palette = Turbo256
+ if n_colors <= 256:
+ return linear_palette(palette, n_colors)
+ else:
+ # more colors requested than palette entries: reuse colors evenly across the palette
+ return [palette[int(i * 256.0 / n_colors)] for i in range(n_colors)]
+ else:
+ # Select color palette based on number of requested colors
+ # Return the closest palette with the most distinct set of colors
+ if not palette:
+ if n_colors <= 8:
+ palette = Colorblind[8]
+ elif n_colors <= 10:
+ palette = Category10[10]
+ elif n_colors <= 20:
+ palette = Category20[20]
+ else:
+ palette = Turbo256
+
+ return palette[:n_colors]
def run_cmd(cmd, print_stderr: bool=False, exit_on_error: bool=True):
errcode = 0
@@ -588,21 +844,16 @@ def print_log(text):
def print_df(df, name: str=None):
from grimer.grimer import _debug
if _debug:
- print("-----------------------------------------------")
print(name)
if isinstance(df, dict):
if df:
- print(list(df.keys())[0])
- print("...")
- print(list(df.keys())[-1])
- print(list(df.values())[0])
- print("...")
- print(list(df.values())[-1])
- print(len(df.keys()))
+ print(len(df.keys()), "keys:", list(df.keys())[0], "...", list(df.keys())[-1])
+ #print(list(df.values())[0], "...", list(df.values())[-1])
else:
- print(df.columns)
+ #print(df.columns)
print(df.head())
print(df.shape)
+ print("size:", sys.getsizeof(df))
print("-----------------------------------------------")
diff --git a/grimer/grimer.py b/grimer/grimer.py
index 337fc2e..467b530 100755
--- a/grimer/grimer.py
+++ b/grimer/grimer.py
@@ -2,293 +2,184 @@
_debug = False
#General
-import argparse
-import yaml
+import sys
#Internal
-from grimer.table import Table
-from grimer.metadata import Metadata
-from grimer.mgnify import MGnify
from grimer.callbacks import *
from grimer.cds import *
+from grimer.config import Config
from grimer.layout import *
from grimer.plots import *
-from grimer.utils import *
-
-# MultiTax
-from multitax import *
+from grimer.func import *
#Bokeh
from bokeh.io import save
from bokeh.plotting import output_file
-# Scipy
-from scipy.spatial.distance import _METRICS_NAMES
-from scipy.cluster.hierarchy import _LINKAGE_METHODS
-
-
-def main():
-
- default_rank_name = "default"
-
- version = "1.0.0-alpha0"
- parser = argparse.ArgumentParser(description='grimer')
- parser.add_argument('-i', '--input-file', required=True, type=str, help="Main input table with counts (Observation table, Count table, Contingency Tables, ...) or .biom file. By default rows contain observations and columns contain samples (use --tranpose if your file is reversed). First column and first row are used as headers.")
- parser.add_argument('-c', '--config', required=True, type=str, help="Configuration file")
- parser.add_argument('-m', '--metadata', type=str, help="Input metadata file in simple tabular format. Sample identifiers will be matched with ones provided by --input-table. QIIME 2 metadata format is also accepted, with categorical and numerical fields.")
- parser.add_argument('-t', '--tax', type=str, default=None, help="Define taxonomy to use. By default, do not use any taxonomy.", choices=["ncbi", "gtdb", "silva", "greengenes", "ott"])
- parser.add_argument('-b', '--tax-files', nargs="*", type=str, default=None, help="Taxonomy files. If not provided, will automatically be downloaded.")
-
- parser.add_argument('-r', '--ranks', nargs="*", default=[default_rank_name], type=str, help="Taxonomic ranks to generate visualizations. Use '" + default_rank_name + "' to use entries from the table directly. Default: " + default_rank_name)
- parser.add_argument('-l', '--title', type=str, default="", help="Title to display on the header of the report.")
- parser.add_argument('-o', '--output-html', type=str, default="output.html", help="File to output report. Default: output.html")
- parser.add_argument('--full-offline', default=False, action='store_true', help="Embed javascript library in the output file. File will be around 1.5MB bigger but also work without internet connection. That way your report will live forever.")
-
- table_group = parser.add_argument_group('Table options')
- table_group.add_argument('-f', '--level-separator', default=None, type=str, help="If provided, consider --input-table to be a hiearchical multi-level table where the observations headers are separated by the indicated separator characther (usually ';' or '|')")
- table_group.add_argument('-s', '--transpose', default=False, action='store_true', help="Transpose --input-table (if samples are listed on columns and observations on rows)")
- table_group.add_argument('-u', '--unassigned-header', nargs="*", type=str, default=None, help="Define one or more header names containing unsassinged/unclassified counts.")
- table_group.add_argument('--obs-replace', nargs="*", type=str, default=[], help="Replace values on observations headers usin (support regex). Example: '_' ' ' will replace underscore with spaces, '^.+__' '' will remove the matching regex.")
-
- filter_group = parser.add_argument_group('Observation filter options')
- filter_group.add_argument('--min-frequency', type=float, help="Define minimum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].")
- filter_group.add_argument('--max-frequency', type=float, help="Define maximum number/percentage of samples containing an observation to keep the observation [values between 0-1 for percentage, >1 specific number].")
- filter_group.add_argument('--min-count', type=float, help="Define minimum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].")
- filter_group.add_argument('--max-count', type=float, help="Define maximum number/percentage of counts to keep an observation [values between 0-1 for percentage, >1 specific number].")
-
- overview_group = parser.add_argument_group('Overview options')
- overview_group.add_argument('-g', '--mgnify', default=False, action='store_true', help="Use MGNify data")
- overview_group.add_argument('-d', '--decontam', default=False, action='store_true', help="Run DECONTAM")
-
- heatmap_group = parser.add_argument_group('Heatmap and clustering options')
- heatmap_group.add_argument('-a', '--transformation', type=str, default="log", help="none (counts), norm (percentage), log (log10), clr (centre log ratio). Default: log")
- heatmap_group.add_argument('-z', '--replace-zeros', type=str, default="1000", help="INT (add 'smallest count'/INT to every raw count), FLOAT (add FLOAT to every raw count). Default: 1000")
- heatmap_group.add_argument('-e', '--metadata-cols', type=int, default=5, help="How many metadata cols to show on the heatmap. Higher values makes plot slower to navigate.")
- heatmap_group.add_argument('--optimal-ordering', default=False, action='store_true', help="Activate optimal_ordering on linkage, takes longer for large number of samples.")
- heatmap_group.add_argument('--show-zeros', default=False, action='store_true', help="Do not skip zeros on heatmap. File will be bigger and iteraction with heatmap slower.")
- heatmap_group.add_argument('--linkage-methods', type=str, nargs="*", default=["complete"], choices=list(_LINKAGE_METHODS))
- heatmap_group.add_argument('--linkage-metrics', type=str, nargs="*", default=["euclidean", "braycurtis"], choices=_METRICS_NAMES)
- heatmap_group.add_argument('--skip-dendrogram', default=False, action='store_true', help="Disable dendogram. Will create smaller files.")
-
- correlation_group = parser.add_argument_group('Correlation options')
- correlation_group.add_argument('-x', '--top-obs-corr', type=int, default=20, help="Top abundant observations to build the correlationn matrix, based on the avg. percentage counts/sample. 0 for all")
-
- bars_group = parser.add_argument_group('Bars options')
- bars_group.add_argument('-j', '--top-obs-bars', type=int, default=20, help="Top abundant observations to show in the bars.")
-
- parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + version)
- parser.add_argument('-D', '--debug', default=False, action='store_true', help=argparse.SUPPRESS)
- args = parser.parse_args()
-
- print_logo_cli(version)
+def main(argv=sys.argv[1:]):
+ """
+ GRIMER code overview
+ 1) Load data/analysis: parse configuration, load files and run analyses into data objects
+ e.g. args.input_file to Table() and decontam
+ 2) Generate data sources: convert objects and analyses into cds/dict
+ e.g. table to cds_m_obstable
+ 3) Plot elements: plot figures and widgets based on cds/dict (and some objects)
+ e.g. cds_m_obstable to ele["obstable"]["fig"]
+ 4) Link javascript: link data sources and javascript custom callbacks
+ 5) Draw layout: put elements into the layout scheme and generate the report
+ """
+
+ # Parse CLI arguments
+ args = Config(argv)
+ print_logo_cli(Config.version)
+ # Setup global _debug variable, used by other modules via: from grimer.grimer import _debug
global _debug
_debug = args.debug
- # Config file
- with open(args.config, 'r') as file:
- cfg = yaml.safe_load(file)
-
- # Taxonomy
+ # 1) Load data/analysis
+ # Each variable stays None if its input is not parsed, is skipped or fails
+ cfg = None
tax = None
- if args.tax:
- if args.tax_files:
- print_log("- Parsing taxonomy")
- else:
- print_log("- Downloading and parsing taxonomy")
- print_log(args.tax)
- if args.tax == "ncbi":
- tax = NcbiTx(files=args.tax_files, extended_names=True)
- elif args.tax == "gtdb":
- tax = GtdbTx(files=args.tax_files)
- elif args.tax == "silva":
- tax = SilvaTx(files=args.tax_files)
- elif args.tax == "greengenes":
- tax = GreengenesTx(files=args.tax_files)
- elif args.tax == "ott":
- tax = OttTx(files=args.tax_files, extended_names=True)
- else:
- print_log(" - No taxonomy set")
- print_log("")
-
- # Table of counts
- print_log("- Parsing table")
- if not args.ranks:
- args.ranks = [default_rank_name]
-
- if args.input_file.endswith(".biom"):
- args.level_separator = ";"
- args.transpose = True
-
- table_df, total, unassigned = parse_input_table(args.input_file, args.unassigned_header, args.transpose, args.min_frequency, args.max_frequency, args.min_count, args.max_count)
-
- if args.level_separator:
- ranked_tables, lineage = parse_multi_table(table_df, args.ranks, tax, args.level_separator, args.obs_replace)
- else:
- ranked_tables, lineage = parse_single_table(table_df, args.ranks, tax, default_rank_name)
-
- if not ranked_tables:
- print_log("Could not parse input table")
- return 1
-
- table = Table(table_df.index, total, unassigned)
- table.lineage = lineage
- for r, t in ranked_tables.items():
- if t.empty:
- print_log("Skipping rank without valid entries (" + r + ")")
- else:
- table.add_rank(r, t)
-
- print_log("")
- print_log("Samples: " + str(len(table.samples)))
- print_log("Observations: ")
-
- for rank in table.ranks():
- print_log(" - " + rank + ": " + str(len(table.observations(rank))))
-
- print_log("Total assigned (sum): " + str(table.total.sum()))
- print_log("Total unassigned (sum): " + str(table.unassigned.sum()))
- print_log("")
-
- # Zero replacement
+ table = None
+ metadata = None
+ references = None
+ controls = None
+ control_samples = None
+ hcluster = None
+ dendro = None
+ corr = None
+
+ print_log("- Parsing configuration file")
+ cfg = parse_config_file(args.config)
+
+ print_log("- Parsing taxonomy")
+ tax = parse_taxonomy(args.taxonomy, args.tax_files)
+
+ print_log("- Parsing input table")
try:
- replace_zero_value = table_df[table_df.gt(0)].min().min() / int(args.replace_zeros)
- except:
- replace_zero_value = float(args.replace_zeros)
-
- # Do not allow value 1 using log
- if replace_zero_value == 1 and args.transformation == "log":
- replace_zero_value = 0.999999
-
- # Parse Metadata
- max_metadata_cols = args.metadata_cols
- if args.metadata:
- print_log("- Parsing metadata")
- metadata = Metadata(args.metadata, samples=table.samples.to_list())
- if metadata.data.empty:
- metadata = None
- print_log("No valid metadata")
- else:
- print_log("Samples: " + str(metadata.data.shape[0]))
- print_log("Numeric Fields: " + str(metadata.get_data("numeric").shape[1]))
- print_log("Categorical Fields: " + str(metadata.get_data("categorical").shape[1]))
- if len(metadata.get_col_headers()) < args.metadata_cols:
- max_metadata_cols = len(metadata.get_col_headers())
- print_log("")
- else:
- metadata = None
+ table = parse_table(args, tax)
+ except Exception as e:
+ print(e)
+ return 1
- # Sources of contamination/references/controls
- print_log("- Parsing sources (contamination/references/controls)")
- if args.tax == "ncbi":
- contaminants, references = parse_sources(cfg, tax, table.ranks())
- else:
- contaminants, references = [{}, {}]
+ print_log("- Parsing metadata")
+ metadata = parse_metadata(args, table)
- controls, control_samples = parse_controls(cfg, tax, table)
- print_log("")
+ print_log("- Parsing references")
+ references = parse_references(cfg, tax, args.taxonomy, table.ranks())
- # Run and load decontam results
- if args.decontam:
- print_log("- Running DECONTAM")
- decontam = run_decontam(cfg, table, metadata, control_samples)
- print_log("")
- else:
- decontam = None
-
- # Mgnify
- if args.mgnify and "mgnify" in cfg["external"]:
- print_log("- Parsing MGNify")
- mgnify = MGnify(cfg["external"]["mgnify"], ranks=table.ranks() if args.ranks != [default_rank_name] else [])
- if tax:
- mgnify.update_taxids(update_tax_nodes([tuple(x) for x in mgnify.data[["rank", "taxa"]].to_numpy()], tax))
- print_log("")
- else:
- mgnify = None
-
- # Hiearchical clustering
- print_log("- Running hiearchical clustering")
- hcluster, dendro = run_hclustering(table, args.linkage_methods, args.linkage_metrics, args.transformation, replace_zero_value, args.skip_dendrogram, args.optimal_ordering)
- print_log("")
+ print_log("- Parsing controls")
+ controls, control_samples = parse_controls(cfg, table)
- # save max/min values to control ranges
- max_total_count = table.total.max()
- min_obs_perc = min([table.get_counts_perc(rank)[table.get_counts_perc(rank) > 0].min().min() for rank in table.ranks()])
+ print_log("- Parsing MGnify database")
+ mgnify = parse_mgnify(args.mgnify, cfg, tax, table.ranks())
- print_log("- Generating GRIMER report")
- ############ cds (ColumnDataSource) and dict containers: data structures loaded and parsed by bokehjs
- ############ "cds" for matrix like dataframes with fixed column sizes
- ############ "dict" for variable column sizes
- ############ _p_ : plot -> direct source of figures
- ############ _d_ : data -> auxiliar containers to be used/shared among plots
- ############ usually by copying and/or transforming values into a _p_ container
+ print_log("- Running DECONTAM")
+ decontam = run_decontam(args.decontam, cfg, table, metadata, control_samples)
- # _p_
+ print_log("- Running hierarchical clustering")
+ hcluster, dendro = run_hclustering(table, args.linkage_methods, args.linkage_metrics, args.transformation, args.skip_dendrogram, args.optimal_ordering)
+
+ print_log("- Running correlation")
+ corr = run_correlation(table, args.top_obs_corr)
+
+ # 2) Generate data sources:
+ # cds (ColumnDataSource) and dict containers: data structures loaded and parsed by bokehjs
+ # "cds" for matrix like dataframes with fixed column sizes
+ # "dict" for variable column sizes
+ # _p_ : plot -> direct source of figures either pre-loaded or empty
+ # _d_ : data -> auxiliary containers to be used/shared among plots
+ # usually by copying and/or transforming values into a _p_ container
+ # _m_ : mixed -> contain both plot and data properties
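+ # e.g. cds_p_samplebars is read directly by the samplebars figure; cds_d_samples is shared among js callbacks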
+
+ print_log("- Generating data sources")
+ # _m_
# df: index (unique observations), col|..., tax|..., aux|ref
- # this cds an exeption and contains data to plot (col|) and auxiliary data (tax|)
- cds_p_obstable = generate_cds_obstable(table, tax, contaminants, references, controls, control_samples, decontam)
+ cds_m_obstable = cds_obstable(table, tax, references, controls, control_samples, decontam)
+ # _p_
# df: index (unique sample-ids), aux|..., bar|..., tax|...
- cds_p_samplebars = generate_cds_bars(table)
+ cds_p_samplebars = cds_samplebars(table)
+ # stacked: index (repeated observations), rank, ref, direct, parent
+ cds_p_references = cds_plot_references(table, tax, references)
# matrix: index (unique sample-ids), concentrations, controls, counts
- cds_p_decontam = generate_cds_plot_decontam(decontam) if decontam else None
+ cds_p_decontam = cds_plot_decontam(decontam) if decontam else None
# {x: [min,max], y_cont: [None,None], y_noncont: [None,None]}
- cds_p_decontam_models = generate_cds_plot_decontam_models(decontam) if decontam else None
+ cds_p_decontam_models = cds_plot_decontam_models(decontam) if decontam else None
# stacked: index (taxa, level, lineage), count, perc
- cds_p_mgnify = generate_cds_mgnify(mgnify, table, tax) if mgnify else None
+ cds_p_mgnify = cds_mgnify(mgnify, table, tax) if mgnify is not None else None
# stacked: index (repeated sample-ids), obs, rank, ov, tv
- cds_p_heatmap = generate_cds_heatmap(table, args.transformation, replace_zero_value, args.show_zeros)
- # matrix: index (unique sample-ids), md0, md1, ..., md(max_metadata_cols) -> (metadata field, metadata values)
- cds_p_metadata = generate_cds_plot_metadata(metadata, max_metadata_cols) if metadata else None
+ cds_p_heatmap = cds_heatmap(table, args.transformation, args.show_zeros)
+ # matrix: index (unique sample-ids), md0, md1, ..., md(args.metadata_cols) -> (metadata field, metadata values)
+ cds_p_metadata = cds_plot_metadata(metadata, args.metadata_cols) if metadata else None
# stacked: index (repeated observations), rank, annot
- cds_p_annotations = generate_cds_annotations(table, contaminants, references, controls, decontam)
+ cds_p_annotations = cds_annotations(table, references, controls, decontam, control_samples)
# empty matrix {"x": [], "y": [], "c": []}
- cds_p_dendro_x, cds_p_dendro_y = generate_cds_plot_dendro() if not args.skip_dendrogram else [None, None]
- # stacked: index (repeated observations), other observation, rank, rho, pval, pval_corr
- cds_p_correlation = generate_cds_correlation(table, args.top_obs_corr)
+ cds_p_dendro_x, cds_p_dendro_y = cds_plot_dendro() if not args.skip_dendrogram else [None, None]
+ # stacked: index (repeated observations), other observation, rank, rho
+ cds_p_correlation = cds_correlation(table, corr)
# matrix: index (unique sample-ids), 0, 1, ..., top_obs_bars, unassigned, others, factors
- cds_p_obsbars = generate_cds_obsbars(table, args.top_obs_bars)
-
+ cds_p_obsbars = cds_obsbars(table, args.top_obs_bars)
+ # df: index (unique sample-ids), col|...
+ cds_p_sampletable = cds_sampletable(table)
# _d_
- # matrix: index (unique sample-ids), columns (unique observations) -> raw counts
- cds_d_sampleobs = generate_cds_sampleobs(table)
# df: index (unique sample-ids), aux|..., cnt|...,
- cds_d_samples = generate_cds_samples(table, references, contaminants, controls, decontam)
+ cds_d_samples = cds_samples(table, references, controls, decontam)
# matrix: index (unique sample-ids) x columns (metadata fields) -> metadata values
- cds_d_metadata = generate_cds_metadata(metadata) if metadata else None
+ cds_d_metadata = cds_metadata(metadata) if metadata else None
# {taxid: (contam_y1, contam_y2, non_contam_y, pval)}
- cds_d_decontam = generate_cds_decontam(decontam, table.ranks()) if decontam else None
+ cds_d_decontam = cds_decontam(decontam, table.ranks()) if decontam else None
# key = rank + "|" + method + "|" + metric
# y: {"default": sorted sample-ids, key: sorted sample-ids, ...}
# x: {"default|rank": sorted sample-ids, key: sorted sample-ids, ...}
- dict_d_hcluster_x, dict_d_hcluster_y = generate_dict_hcluster(table, hcluster)
+ dict_d_hcluster_x, dict_d_hcluster_y = dict_hcluster(table, hcluster)
# {key+"|x": x-values, key+"|y": y-values , key+"|c": colors}
- dict_d_dedro_x, dict_d_dedro_y = generate_dict_dendro(table, dendro) if not args.skip_dendrogram else [None, None]
+ dict_d_dedro_x, dict_d_dedro_y = dict_dendro(table, dendro) if not args.skip_dendrogram else [None, None]
# {taxid: name}
- dict_d_taxname = generate_dict_taxname(tax, [txid for rank in table.ranks() for txid in table.observations(rank)])
+ dict_d_taxname = dict_taxname(tax, [txid for rank in table.ranks() for txid in table.observations(rank)])
# {rank: [taxid1,taxid2, ..., taxid(top_obs_bars)]}
- dict_d_topobs = generate_dict_topobs(table, args.top_obs_bars)
+ dict_d_topobs = dict_topobs(table, args.top_obs_bars)
# {taxid: {source: {desc: [refs]}}
- dict_d_refs = generate_dict_refs(table, contaminants, references)
-
- ############ PLOT ELEMENTS (Figures, Widgets, ...)
- ############ "fig": main figure
- ############ "wid": widgets
-
+ dict_d_refs = dict_refs(table, references)
+ # dict: {rank: {obs: {sample: count}}}
+ dict_d_sampleobs = dict_sampleobs(table)
+
+ # 3) Plot elements
+ print_log("- Plotting elements")
+
+ # Define fixed layout and plot sizes
+ sizes = {}
+ sizes["overview_top_panel_height"] = 300
+ sizes["overview_top_panel_width_left"] = 250
+ sizes["overview_top_panel_width_right"] = 450
+
+ # Elements to plot
+ # ele[name]["fig"] -> main figure/element
+ # ele[name]["filter"] -> filter applied to the figure
+ # ele[name]["wid"][widget1] -> widgets attached to the figure
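+ # e.g. ele["obstable"]["fig"] holds the observations table and ele["obstable"]["wid"] its filtering widgets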
ele = {}
# obstable
ele["obstable"] = {}
- ele["obstable"]["fig"], ele["obstable"]["widgets_filter"] = plot_obstable(cds_p_obstable, table.ranks(), contaminants.keys(), controls.keys())
- ele["obstable"]["wid"] = plot_obstable_widgets(dict_d_taxname, max(cds_p_obstable.data["col|total_counts"]))
+ ele["obstable"]["fig"], ele["obstable"]["filter"] = plot_obstable(sizes, cds_m_obstable, table.ranks(), references, controls)
+ ele["obstable"]["wid"] = plot_obstable_widgets(sizes, dict_d_taxname, max(cds_m_obstable.data["col|total_counts"]))
# infopanel
ele["infopanel"] = {}
ele["infopanel"]["textarea"] = plot_infopanel()
+ # references
+ ele["references"] = {}
+ if references:
+ ele["references"]["fig"], ele["references"]["filter"] = plot_references(sizes, table, cds_p_references, dict_d_taxname)
+ else:
+ ele["references"]["fig"], ele["references"]["filter"] = None, None
+ ele["references"]["wid"] = plot_references_widgets(sizes, references)
+
# mgnify
ele["mgnify"] = {}
if cds_p_mgnify:
- ele["mgnify"]["fig"], ele["mgnify"]["filter"] = plot_mgnify(cds_p_mgnify)
+ ele["mgnify"]["fig"], ele["mgnify"]["filter"] = plot_mgnify(sizes, cds_p_mgnify)
else:
ele["mgnify"]["fig"], ele["mgnify"]["filter"] = None, None
ele["mgnify"]["wid"] = plot_mgnify_widgets()
@@ -297,21 +188,26 @@ def main():
ele["decontam"] = {}
ele["decontam"]["wid"] = {}
if decontam:
- ele["decontam"]["fig"] = plot_decontam(cds_p_decontam, cds_p_decontam_models, min_obs_perc)
+ ele["decontam"]["fig"] = plot_decontam(sizes, cds_p_decontam, cds_p_decontam_models, table.get_min_valid_count_perc())
else:
ele["decontam"]["fig"] = None
- ele["decontam"]["wid"] = plot_decontam_widgets()
+ ele["decontam"]["wid"] = plot_decontam_widgets(sizes)
# samplebars
ele["samplebars"] = {}
- ele["samplebars"]["fig"], ele["samplebars"]["legend_obs"], ele["samplebars"]["legend_bars"] = plot_samplebars(cds_p_samplebars, max_total_count, table.ranks())
- ele["samplebars"]["wid"] = plot_samplebars_widgets(table.ranks(), metadata, list(contaminants.keys()), list(references.keys()), list(controls.keys()), decontam)
+ ele["samplebars"]["fig"], ele["samplebars"]["legend_obs"], ele["samplebars"]["legend_bars"] = plot_samplebars(cds_p_samplebars, table)
+ ele["samplebars"]["wid"] = plot_samplebars_widgets(table.ranks(), metadata, references, controls, decontam, table.normalized)
+
+ # sampletable
+ ele["sampletable"] = {}
+ ele["sampletable"]["fig"] = plot_sampletable(cds_p_sampletable, sizes, table.ranks())
+ ele["sampletable"]["wid"] = plot_sampletable_widgets(sizes, max(cds_p_sampletable.data["col|total"]), metadata)
# heatmap
tools_heatmap = "hover,save,box_zoom,reset,crosshair,box_select"
ele["heatmap"] = {}
ele["heatmap"]["fig"] = plot_heatmap(table, cds_p_heatmap, tools_heatmap, args.transformation, dict_d_taxname)
- ele["heatmap"]["wid"] = plot_heatmap_widgets(table.ranks(), args.linkage_methods, args.linkage_metrics, list(contaminants.keys()), list(references.keys()), list(controls.keys()), metadata, decontam)
+ ele["heatmap"]["wid"] = plot_heatmap_widgets(table.ranks(), args.linkage_methods, args.linkage_metrics, references, controls, metadata, decontam)
# metadata (heatmap)
ele["metadata"] = {}
@@ -321,10 +217,15 @@ def main():
else:
ele["metadata"]["fig"] = Spacer()
ele["metadata"]["wid"]["metadata_multiselect"] = Spacer()
+ ele["metadata"]["wid"]["legend_colorbars"] = Spacer()
+ ele["metadata"]["wid"]["toggle_legend"] = Spacer()
# annotations
ele["annotations"] = {}
- ele["annotations"]["fig"] = plot_annotations(ele["heatmap"]["fig"], tools_heatmap, cds_p_annotations, dict_d_taxname)
+ if cds_p_annotations.data["index"].size:
+ ele["annotations"]["fig"] = plot_annotations(ele["heatmap"]["fig"], tools_heatmap, cds_p_annotations, dict_d_taxname)
+ else:
+ ele["annotations"]["fig"] = Spacer()
# dendrograms
ele["dendrox"] = {}
@@ -337,36 +238,38 @@ def main():
# correlation
ele["correlation"] = {}
- ele["correlation"]["fig"], ele["correlation"]["rho_filter"], ele["correlation"]["pval_filter"] = plot_correlation(cds_p_correlation, table.ranks(), dict_d_taxname)
+ ele["correlation"]["fig"], ele["correlation"]["filter"] = plot_correlation(cds_p_correlation, table.ranks(), dict_d_taxname)
ele["correlation"]["wid"] = plot_correlation_widgets(table.ranks(), args.top_obs_corr)
# obsbars
ele["obsbars"] = {}
- ele["obsbars"]["fig"], ele["obsbars"]["legend"] = plot_obsbars(cds_p_obsbars, dict_d_topobs, table.ranks(), args.top_obs_bars, dict_d_taxname)
ele["obsbars"]["wid"] = plot_obsbars_widgets(table.ranks(), metadata, dict_d_topobs, dict_d_taxname, args.top_obs_bars)
+ ele["obsbars"]["fig"], ele["obsbars"]["legend"] = plot_obsbars(cds_p_obsbars, dict_d_topobs, table.ranks(), args.top_obs_bars, dict_d_taxname, ele["obsbars"]["wid"]["rank_select"])
- ############ JAVASCRIPT LINKING
-
- link_obstable_filter(ele, cds_p_obstable, table.ranks())
+ # 4) Link javascript:
+ print_log("- Linking javascript")
+ link_obstable_filter(ele, cds_m_obstable, table.ranks())
link_obstable_samplebars(ele,
- cds_p_obstable,
+ cds_m_obstable,
cds_p_samplebars,
cds_d_samples,
- cds_d_sampleobs,
+ dict_d_sampleobs,
cds_d_metadata,
cds_p_decontam,
cds_p_decontam_models,
cds_d_decontam,
+ cds_p_references,
table.ranks(),
- min_obs_perc,
- max_total_count,
+ table.get_min_valid_count_perc(),
+ table.get_total().max(),
cds_p_mgnify,
- dict_d_refs)
-
+ dict_d_refs,
+ dict_d_taxname)
link_heatmap_widgets(ele,
cds_d_samples,
cds_d_metadata,
+ cds_p_metadata,
dict_d_hcluster_x,
dict_d_hcluster_y,
cds_p_dendro_x,
@@ -374,36 +277,45 @@ def main():
dict_d_dedro_x,
dict_d_dedro_y,
cds_p_annotations,
- cds_p_obstable,
- cds_p_heatmap)
-
- link_metadata_widgets(ele, cds_p_metadata, cds_d_metadata, max_metadata_cols)
-
+ cds_m_obstable,
+ cds_p_heatmap,
+ table.ranks(),
+ dict_d_taxname)
+ link_metadata_widgets(ele, cds_p_metadata, cds_d_metadata, args.metadata_cols)
link_correlation_widgets(ele, cds_p_correlation)
-
link_obsbars_widgets(ele,
cds_p_obsbars,
dict_d_topobs,
- cds_d_sampleobs,
+ dict_d_sampleobs,
cds_d_samples,
args.top_obs_bars,
dict_d_taxname,
- cds_d_metadata)
-
- ############ LAYOUT
+ cds_d_metadata,
+ cds_p_sampletable)
+ link_sampletable_select(ele, cds_p_sampletable, cds_d_metadata)
+ # 5) Draw layout
+ print_log("- Drawing layout")
# Define path of running script to get static files
script_dir, _ = os.path.split(__file__)
logo_path = os.path.join(script_dir, "img", "logo.png")
- final_layout = make_layout(ele, version, logo_path, args.title)
+ final_layout = make_layout(ele, sizes, Config.version, logo_path, args.title, args.output_plots)
template = include_scripts({os.path.join(script_dir, "js", "func.js"): "script",
os.path.join(script_dir, "js", "popup.js"): "script",
os.path.join(script_dir, "css", "popup.css"): "style"})
+ if args.full_offline:
+ mode = "inline" # configure to provide entire Bokeh JS and CSS inline
+ elif _debug:
+ mode = "absolute-dev" # non-minimized - configure to load from the installed Bokeh library static directory
+ else:
+ mode = "cdn" # configure to load Bokeh JS and CSS from https://cdn.bokeh.org
+
# setup output file and JS mode
- output_file(args.output_html, title="GRIMER" if not args.title else "GRIMER - " + args.title, mode="inline" if args.full_offline else "cdn")
+ print_log("- Saving report")
+ output_file(args.output_html, title="GRIMER" if not args.title else "GRIMER - " + args.title, mode=mode)
save(final_layout, template=template)
print_log("File: " + args.output_html)
file_size_bytes = os.path.getsize(args.output_html)
diff --git a/grimer/js/func.js b/grimer/js/func.js
index ee1c52d..b9593b4 100644
--- a/grimer/js/func.js
+++ b/grimer/js/func.js
@@ -1,14 +1,28 @@
function sort_numeric(a, b){ return a - b; }
function sort_string(a, b){ return a.localeCompare(b); }
-function grimer_sort(factors, sort_col, sort_mode="numeric", desc=false, group_col1=[], group_col2=[]) {
- //mode : numeric, string
+function grimer_sort(factors, sort_col, sort_mode="numeric", desc=false, group_col1=[], group_col2=[], index=[]) {
+ //sort_mode : numeric, string
+
+ // subset data if index provided
+ if(index.length){
+ factors = index.map( s => factors[s] );
+ sort_col = index.map( s => sort_col[s] );
+ if(group_col1.length){
+ group_col1 = index.map( s => group_col1[s] );
+ }
+ if(group_col2.length){
+ group_col2 = index.map( s => group_col2[s] );
+ }
+ }
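+ // e.g. index=[2,0] keeps only entries 2 and 0 (in that order) of factors and the sort/group columns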
// Generate numerical index to sort arrays
var idx = new Array(factors.length);
for (var i = 0; i < idx.length; ++i) idx[i] = i;
- //If numeric, replace NaN with sortable value (false)
- if (sort_mode=="numeric") sort_col = sort_col.map(function(v){ return isNaN(v) ? false : v })
+ //If numeric, replace NaN with a sortable value (-Infinity) so missing values group together at one end of the sorted array
+ if (sort_mode=="numeric"){
+ sort_col = sort_col.map(function(v){ return isNaN(v) ? -Infinity : v })
+ }
if(group_col1.length && group_col2.length){
if (sort_mode=="numeric" && desc==false)
@@ -38,7 +52,7 @@ function grimer_sort(factors, sort_col, sort_mode="numeric", desc=false, group_c
else if (sort_mode=="string" && desc==true)
idx.sort((a, b) => sort_string(sort_col[b],sort_col[a]));
}
-
+
var sorted_factors = new Array(idx.length);
for (var i = 0; i < idx.length; ++i) sorted_factors[i] = factors[idx[i]];
return sorted_factors;
diff --git a/grimer/js/popup.js b/grimer/js/popup.js
index 62f9525..24612bd 100644
--- a/grimer/js/popup.js
+++ b/grimer/js/popup.js
@@ -46,4 +46,4 @@ var pop = {
pop.pWrap.classList.remove("open");
}
};
-window.addEventListener("DOMContentLoaded", pop.init);
\ No newline at end of file
+window.addEventListener("DOMContentLoaded", pop.init);
diff --git a/grimer/layout.py b/grimer/layout.py
index 50fa659..6991520 100644
--- a/grimer/layout.py
+++ b/grimer/layout.py
@@ -3,96 +3,107 @@
import base64
-def make_layout(ele, version, logo_path, title):
+def make_layout(ele, sizes, version, logo_path, title, output_plots):
- top_panel_height = 200
- top_panel_width_sides = 300
- filterwidgets = column(row(ele["obstable"]["wid"]["frequency_spinner"],
+ main_panels = []
+ if "overview" in output_plots:
+ filterwidgets = column(ele["obstable"]["wid"]["frequency_spinner"],
ele["obstable"]["wid"]["counts_perc_avg_spinner"],
- ele["obstable"]["wid"]["help_button"]),
- ele["obstable"]["wid"]["total_counts_spinner"],
- ele["obstable"]["wid"]["name_multichoice"])
-
- filterwidgetstabs = Tabs(tabs=[Panel(child=filterwidgets, title="Filter")],
- sizing_mode="fixed",
- height=top_panel_height + 20,
- width=top_panel_width_sides)
-
- info_tabs = [Panel(child=ele["infopanel"]["textarea"], title="Info")]
-
- if ele["mgnify"]["fig"]:
- info_tabs.append(Panel(child=column(ele["mgnify"]["fig"], row(ele["mgnify"]["wid"]["biome_spinner"], ele["mgnify"]["wid"]["help_button"])), title="MGNify"))
-
- if ele["decontam"]["fig"]:
- info_tabs.append(Panel(child=column(ele["decontam"]["fig"], row(ele["decontam"]["wid"]["pvalue_text"], ele["decontam"]["wid"]["pvalue_input"], ele["decontam"]["wid"]["help_button"])), title="DECONTAM"))
-
- infotabs = Tabs(tabs=info_tabs,
- sizing_mode="fixed",
- height=top_panel_height + 20,
- width=top_panel_width_sides)
-
- row_obstable = row(filterwidgetstabs,
- ele["obstable"]["fig"],
- infotabs,
- sizing_mode="stretch_width")
+ ele["obstable"]["wid"]["total_counts_spinner"],
+ ele["obstable"]["wid"]["name_multichoice"],
+ ele["obstable"]["wid"]["help_button"])
+ filterwidgetstabs = Tabs(tabs=[Panel(child=filterwidgets, title="Filter")],
+ sizing_mode="fixed",
+ height=sizes["overview_top_panel_height"] + 20,
+ width=sizes["overview_top_panel_width_left"])
+ info_tabs = [Panel(child=ele["infopanel"]["textarea"], title="Info")]
+ if ele["references"]["fig"]:
+ info_tabs.append(Panel(child=column(ele["references"]["fig"],
+ row(ele["references"]["wid"]["references_select"],
+ ele["references"]["wid"]["help_button"])
+ ), title="References"))
+ if ele["mgnify"]["fig"]:
+ info_tabs.append(Panel(child=column(ele["mgnify"]["fig"],
+ row(ele["mgnify"]["wid"]["biome_spinner"],
+ ele["mgnify"]["wid"]["help_button"])
+ ), title="MGNify"))
+ if ele["decontam"]["fig"]:
+ info_tabs.append(Panel(child=column(ele["decontam"]["fig"],
+ row(ele["decontam"]["wid"]["pscore_text"],
+ ele["decontam"]["wid"]["pscore_input"],
+ ele["decontam"]["wid"]["help_button"])
+ ), title="DECONTAM"))
+ infotabs = Tabs(tabs=info_tabs,
+ sizing_mode="fixed",
+ height=sizes["overview_top_panel_height"] + 20,
+ width=sizes["overview_top_panel_width_right"])
+ row_obstable = row(filterwidgetstabs,
+ ele["obstable"]["fig"],
+ infotabs,
+ sizing_mode="stretch_width")
+ row_barpot = column(row(ele["samplebars"]["fig"]),
+ row(ele["samplebars"]["wid"]["y1_select"],
+ ele["samplebars"]["wid"]["annotbar_rank_select"],
+ ele["samplebars"]["wid"]["annotbar_select"],
+ ele["samplebars"]["wid"]["groupby1_select"],
+ ele["samplebars"]["wid"]["groupby2_select"],
+ ele["samplebars"]["wid"]["sort_select"],
+ ele["samplebars"]["wid"]["y2_select"],
+ ele["samplebars"]["wid"]["help_button"]),
+ ele["samplebars"]["wid"]["toggle_label"])
+ main_panels.append(Panel(child=column(row_obstable, row_barpot, sizing_mode="stretch_width"), title="Overview"))
+
+ if "samples" in output_plots:
+ selectwidgets = column(ele["sampletable"]["wid"]["total_counts_spinner"],
+ ele["sampletable"]["wid"]["assigned_spinner"],
+ ele["sampletable"]["wid"]["metadata_multichoice"],
+ ele["sampletable"]["wid"]["help_button"])
+ selectwidgetstabs = Tabs(tabs=[Panel(child=selectwidgets, title="Select")],
+ sizing_mode="fixed",
+ height=sizes["overview_top_panel_height"] + 20,
+ width=sizes["overview_top_panel_width_left"])
+ row_sampletable = row(selectwidgetstabs,
+ ele["sampletable"]["fig"],
+ sizing_mode="stretch_width")
+ row_obsbars = column(row(ele["obsbars"]["fig"]),
+ row(ele["obsbars"]["wid"]["rank_select"],
+ ele["obsbars"]["wid"]["groupby1_select"],
+ ele["obsbars"]["wid"]["groupby2_select"],
+ ele["obsbars"]["wid"]["sort_select"],
+ ele["obsbars"]["wid"]["help_button"]),
+ ele["obsbars"]["wid"]["toggle_label"])
+ main_panels.append(Panel(child=column(row_sampletable, row_obsbars, sizing_mode="stretch_width"), title="Samples"))
+
+ if "heatmap" in output_plots:
+ row_heatmap = gridplot([[ele["heatmap"]["fig"], ele["dendroy"]["fig"], ele["metadata"]["fig"]],
+ [ele["dendrox"]["fig"]],
+ [ele["annotations"]["fig"], None, ele["heatmap"]["wid"]["help_button"]]],
+ toolbar_location='right',
+ merge_tools=True)
+ row_heatmap_widgets = row(column(ele["heatmap"]["wid"]["rank_select"],
+ ele["heatmap"]["wid"]["toggle_label"],
+ width=300),
+ row(column(ele["heatmap"]["wid"]["x_groupby_select"],
+ ele["heatmap"]["wid"]["x_sort_select"]),
+ column(ele["heatmap"]["wid"]["y_groupby_select"],
+ ele["heatmap"]["wid"]["y_sort_select"]),
+ sizing_mode="stretch_width"),
+ column(ele["metadata"]["wid"]["metadata_multiselect"],
+ ele["metadata"]["wid"]["toggle_legend"],
+ sizing_mode="stretch_height",
+ width=300))
+ main_panels.append(Panel(child=column(row_heatmap, row_heatmap_widgets, sizing_mode="stretch_width"), title="Heatmap"))
+
+ if "correlation" in output_plots:
+ row_correlation = row(column(ele["correlation"]["wid"]["rank_select"],
+ ele["correlation"]["wid"]["neg_slider"],
+ ele["correlation"]["wid"]["pos_slider"],
+ ele["correlation"]["wid"]["toggle_label"],
+ ele["correlation"]["wid"]["help_button"]),
+ ele["correlation"]["fig"])
+ main_panels.append(Panel(child=column(row_correlation, sizing_mode="stretch_width"), title="Correlation"))
- row_barpot = column(row(ele["samplebars"]["fig"]),
- row(ele["samplebars"]["wid"]["y1_select"],
- ele["samplebars"]["wid"]["annotbar_rank_select"],
- ele["samplebars"]["wid"]["annotbar_select"],
- ele["samplebars"]["wid"]["groupby1_select"],
- ele["samplebars"]["wid"]["groupby2_select"],
- ele["samplebars"]["wid"]["sort_select"],
- ele["samplebars"]["wid"]["y2_select"],
- ele["samplebars"]["wid"]["help_button"]))
-
- row_heatmap = gridplot([[ele["heatmap"]["fig"], ele["dendroy"]["fig"], ele["metadata"]["fig"]],
- [ele["dendrox"]["fig"]],
- [ele["annotations"]["fig"], ele["heatmap"]["wid"]["help_button"]]],
- toolbar_location='right',
- merge_tools=True)
-
- row_heatmap_widgets = row(column(ele["heatmap"]["wid"]["rank_select"],
- row(ele["heatmap"]["wid"]["toggle_label_text"],
- ele["heatmap"]["wid"]["toggle_label_heatmap"]),
- sizing_mode="stretch_height",
- width=300),
- column(row(ele["heatmap"]["wid"]["x_sort_select"],
- ele["heatmap"]["wid"]["y_sort_select"],
- sizing_mode="stretch_width"),
- row(ele["heatmap"]["wid"]["x_method_select"],
- ele["heatmap"]["wid"]["y_method_select"],
- sizing_mode="stretch_width"),
- #row(ele["heatmap"]["wid"]["x_group_select"],
- # Spacer(),
- # sizing_mode="stretch_width"),
- sizing_mode="stretch_width"),
- column(ele["metadata"]["wid"]["metadata_multiselect"],
- sizing_mode="stretch_height",
- width=300))
-
- row_correlation = row(column(ele["correlation"]["wid"]["rank_select"],
- ele["correlation"]["wid"]["neg_slider"],
- ele["correlation"]["wid"]["pos_slider"],
- ele["correlation"]["wid"]["pval_spinner"],
- ele["correlation"]["wid"]["help_button"]),
- ele["correlation"]["fig"])
-
- row_obsbars = column(row(ele["obsbars"]["fig"]),
- ele["obsbars"]["wid"]["toggle_label"],
- row(ele["obsbars"]["wid"]["rank_select"],
- ele["obsbars"]["wid"]["groupby1_select"],
- ele["obsbars"]["wid"]["groupby2_select"],
- ele["obsbars"]["wid"]["sort_select"],
- ele["obsbars"]["wid"]["help_button"]))
-
- main_panels = []
- main_panels.append(Panel(child=column(row_obstable, row_barpot, sizing_mode="stretch_width"), title="Overview"))
- main_panels.append(Panel(child=column(row_heatmap, row_heatmap_widgets, sizing_mode="stretch_width"), title="Heatmap"))
- main_panels.append(Panel(child=column(row_correlation, sizing_mode="stretch_width"), title="Correlation"))
- main_panels.append(Panel(child=column(row_obsbars, sizing_mode="stretch_width"), title="Bars"))
main_tab = Tabs(tabs=main_panels)
-
logo_base64 = base64.b64encode(open(logo_path, 'rb').read()) # encode to base64
logo_base64 = logo_base64.decode() # convert to string
logo_div = Div(text='' + 'v' + version + '', width=300, height=40, sizing_mode="fixed")
@@ -102,4 +113,4 @@ def make_layout(ele, version, logo_path, title):
title_div = Spacer()
final = column([row(logo_div, title_div), main_tab], sizing_mode="stretch_width")
- return final
\ No newline at end of file
+ return final
diff --git a/grimer/metadata.py b/grimer/metadata.py
index 4399ade..8fa3359 100644
--- a/grimer/metadata.py
+++ b/grimer/metadata.py
@@ -1,15 +1,18 @@
import pandas as pd
from pandas.api.types import is_numeric_dtype
-from grimer.utils import print_log
class Metadata:
valid_types = ["categorical", "numeric"]
default_type = "categorical"
- def __init__(self, metadata_file, samples: list=[]):
- # Read metadata and let pandas guess dtypes, index as str
- self.data = pd.read_table(metadata_file, sep='\t', header=0, skiprows=0, index_col=0, dtype={0:str})
+ def __init__(self, metadata_file: str="", metadata_table: pd.DataFrame=None, samples: list=[]):
+ if metadata_file:
+ # Read metadata and let pandas guess dtypes, index as str
+ self.data = pd.read_table(metadata_file, sep='\t', header=0, skiprows=0, index_col=0, dtype={0: str})
+ else:
+ # Metadata from table (biom file)
+ self.data = metadata_table
# Enforce string index
self.data.index = self.data.index.astype('str')
@@ -26,7 +29,7 @@ def __init__(self, metadata_file, samples: list=[]):
self.types[self.data.dtypes.map(is_numeric_dtype)] = "numeric"
# Convert datatypes to adequate numeric values (int, float)
- self.data = self.data.convert_dtypes(infer_objects=False, convert_string=False)
+ self.data = self.data.convert_dtypes(infer_objects=False, convert_string=False, convert_boolean=False)
# Re-convert everything to object to standardize (int64 NA is not serializable on bokeh)
self.data = self.data.astype("object")
@@ -39,6 +42,9 @@ def __init__(self, metadata_file, samples: list=[]):
# Convert NaN on categorical to ""
self.data[self.types[self.types == "categorical"].index] = self.data[self.types[self.types == "categorical"].index].fillna('')
+ # Convert boolean values in categorical fields to strings ('True'/'False')
+ mask = self.data[self.types[self.types == "categorical"].index].applymap(type) != bool
+ self.data[self.types[self.types == "categorical"].index] = self.data[self.types[self.types == "categorical"].index].where(mask, self.data[self.types[self.types == "categorical"].index].replace({True: 'True', False: 'False'}))
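+ # e.g. a categorical column [True, False, "other"] becomes ["True", "False", "other"]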
# Remove names
self.data.index.names = [None]
@@ -85,13 +91,7 @@ def get_col(self, col):
return self.data[col]
def get_unique_values(self, col):
- return sorted(self.get_col(col).dropna().unique())
-
- def get_formatted_unique_values(self, col):
- if self.types[col] == "categorical":
- return self.get_unique_values(col)
- else:
- return list(map('{:.16g}'.format, self.get_unique_values(col)))
+ return self.get_col(col).dropna().unique()
def get_type(self, col):
return self.types[col]
diff --git a/grimer/mgnify.py b/grimer/mgnify.py
deleted file mode 100644
index 84d2eb5..0000000
--- a/grimer/mgnify.py
+++ /dev/null
@@ -1,27 +0,0 @@
-import pandas as pd
-
-
-class MGnify:
-
- def __init__(self, mgnify_file, ranks: list=[]):
- self.data = self.parse(mgnify_file, ranks)
-
- def __repr__(self):
- args = ['{}={}'.format(k, repr(v)) for (k, v) in vars(self).items()]
- return 'MGnify({})'.format(', '.join(args))
-
- def parse(self, file, ranks):
- mgnify_df = pd.read_table(file, header=None, names=["rank", "taxa", "biome", "count"])
- # Filter by ranks if provided
- if ranks:
- mgnify_df = mgnify_df.loc[mgnify_df['rank'].isin(ranks)]
- mgnify_df.reset_index(drop=True, inplace=True)
-
- #mgnify_df.drop(columns="rank", inplace=True)
- return mgnify_df
-
- def update_taxids(self, taxid_updated):
- # Update taxonomy to taxid or keep name if not available
- self.data["taxa"] = self.data[["rank", "taxa"]].apply(lambda rt: taxid_updated[(rt[0], rt[1])] if taxid_updated[(rt[0], rt[1])] is not None else rt[1], axis=1)
-
-
diff --git a/grimer/plots.py b/grimer/plots.py
index 9ee5481..97aa742 100644
--- a/grimer/plots.py
+++ b/grimer/plots.py
@@ -1,28 +1,40 @@
import markdown
+
# Bokeh
-from bokeh.models import AdaptiveTicker, Button, CategoricalColorMapper, CDSView, CheckboxGroup, ColorBar, ColumnDataSource, CustomJS, CustomJSHover, Div, FactorRange, FuncTickFormatter, HoverTool, Legend, LinearAxis, LinearColorMapper, MultiChoice, MultiSelect, NumberFormatter, Panel, Paragraph, Range1d, RangeSlider, Select, Spinner, Tabs, TextAreaInput, TextInput
+from bokeh.models import AdaptiveTicker, Button, CategoricalColorMapper, CDSView, CheckboxGroup, ColorBar, ColumnDataSource, CustomJS, CustomJSHover, FactorRange, FixedTicker, FuncTickFormatter, HoverTool, Legend, LegendItem, LinearAxis, LinearColorMapper, MultiChoice, MultiSelect, NumberFormatter, Panel, Paragraph, Range1d, RangeSlider, Select, Spacer, Spinner, Tabs, TextAreaInput, TextInput
from bokeh.models.filters import IndexFilter, GroupFilter
from bokeh.models.widgets import DataTable, TableColumn
-from bokeh.palettes import Blues, Category10, Category20, Colorblind, Dark2, linear_palette, Magma256, Reds, Turbo256
+from bokeh.palettes import Blues, Dark2, Magma256, Reds
from bokeh.plotting import figure
from bokeh.transform import cumsum, factor_cmap
+from grimer.func import format_js_toString, make_color_palette
+
-def plot_samplebars(cds_p_samplebars, max_total_count, ranks):
+def plot_samplebars(cds_p_samplebars, table):
# Bar plot has 3 main stacks: selection, others, unassigned
- # stacks can be annotated with sources
+ # stacks can be annotated with references and controls
samplebars_fig = figure(x_range=FactorRange(factors=cds_p_samplebars.data["aux|factors"]),
- y_range=Range1d(start=0, end=max_total_count),
+ y_range=Range1d(start=0, end=table.get_total().max()),
plot_height=400,
sizing_mode="stretch_width",
- tools="box_zoom,reset,save,hover",
- tooltips=[("Sample", "@index"), ("Value", "@$name")])
+ tools="box_zoom,reset,save")
+
+ samplebars_fig.add_tools(HoverTool(
+ tooltips=[
+ ("Sample", "@index"),
+ ("Value", "@$name")
+ ],
+ mode="mouse",
+ point_policy="follow_mouse",
+ ))
fixed_bar_options = ["selected", "others", "unassigned"]
vbar_ren = samplebars_fig.vbar_stack(["bar|" + f for f in fixed_bar_options],
x="aux|factors",
width=1,
source=cds_p_samplebars,
+ line_color=None, # avoid drawing a thin border around zero-height bars
color=make_color_palette(len(fixed_bar_options)))
# Second y-axis to plot observations
@@ -30,19 +42,20 @@ def plot_samplebars(cds_p_samplebars, max_total_count, ranks):
samplebars_fig.add_layout(LinearAxis(y_range_name="obs"), 'right')
# Plot obs ranks
- obs_palette = make_color_palette(len(ranks), palette=Dark2)
+ obs_palette = make_color_palette(len(table.ranks()), palette=Dark2)
legend_obs_items = []
- for i, rank in enumerate(ranks):
+ for i, rank in enumerate(table.ranks()):
ren = samplebars_fig.scatter(x="aux|factors", y="tax|" + rank,
y_range_name="obs",
- name="tax|" + rank, #to work with hover properly
+ name="tax|" + rank, # to work with hover properly
source=cds_p_samplebars,
marker="circle", size=7, line_color="navy", alpha=0.6,
- fill_color=obs_palette[i])
+ fill_color=obs_palette[i],
+ visible=False)
legend_obs_items.append((rank, [ren]))
# Legend counts (vbars)
- legend_bars_items = [(f, [vbar_ren[i]]) for i, f in enumerate([ranks[0] + "|assigned"] + fixed_bar_options[1:])]
+ legend_bars_items = [(f, [vbar_ren[i]]) for i, f in enumerate([table.ranks()[0] + "|assigned"] + fixed_bar_options[1:])]
legend_bars = Legend(items=legend_bars_items)
legend_bars.margin = 0
legend_bars.border_line_width = 0
@@ -62,8 +75,10 @@ def plot_samplebars(cds_p_samplebars, max_total_count, ranks):
legend_obs.orientation = "horizontal"
legend_obs.location = "bottom_right"
legend_obs.click_policy = "hide"
+ legend_obs.label_text_color = "#606c38"
samplebars_fig.add_layout(legend_obs, "above")
+ samplebars_fig.xaxis.major_label_orientation = "vertical"
samplebars_fig.xaxis.major_label_text_font_size = '0pt'
samplebars_fig.xgrid.grid_line_color = None
samplebars_fig.xaxis.major_tick_line_color = None
@@ -74,54 +89,61 @@ def plot_samplebars(cds_p_samplebars, max_total_count, ranks):
samplebars_fig.xaxis.subgroup_label_orientation = "vertical"
samplebars_fig.xaxis.axis_label = "samples"
- samplebars_fig.yaxis[0].axis_label = "# counts"
+ samplebars_fig.yaxis[0].axis_label = "# counts" if not table.normalized else "% counts"
samplebars_fig.yaxis[1].axis_label = "% observations"
+ samplebars_fig.yaxis[1].axis_label_text_color = "#606c38"
return samplebars_fig, legend_obs, legend_bars
-def plot_obsbars(cds_p_obsbars, dict_d_topobs, ranks, top_obs_bars, dict_d_taxname):
+def plot_obsbars(cds_p_obsbars, dict_d_topobs, ranks, top_obs_bars, dict_d_taxname, rank_select):
obsbars_fig = figure(x_range=FactorRange(factors=cds_p_obsbars.data["factors"]),
y_range=Range1d(start=0, end=100),
height=450,
sizing_mode="stretch_width",
- tools="box_zoom,reset,hover,save",
- tooltips=[("Sample", "@index"),
- ("Label", "$name"),
- ("Value", "@$name")])
-
- # TODO Need to know which rank to get the correct set of top taxa
- # taxid_name_custom = CustomJSHover(
- # args=dict(dict_d_taxname=ColumnDataSource(dict(dict_d_taxname=[dict_d_taxname])),
- # dict_d_topobs=ColumnDataSource(dict(dict_d_topobs=[dict_d_topobs]))),
- # code='''
- # //var taxid = dict_d_topobs.data.dict_d_topobs[0][RANK][value];
- # //console.log(dict_d_topobs);
- # //return dict_d_taxname.data.dict_d_taxname[0][taxid]; // value holds the @taxid
- # ''')
- # # Add custom tooltip for heatmap (taxid->name)
- # obsbars.add_tools(HoverTool(
- # tooltips=[
- # ('Sample', '@index'),
- # ('Taxa', '$name{custom}'),
- # ("Value", "@$name")
- # ],
- # formatters={"$name": taxid_name_custom}
- # ))
+ tools="box_zoom,reset,save")
+
+ taxid_name_custom = CustomJSHover(
+ args=dict(dict_d_taxname=ColumnDataSource(dict(dict_d_taxname=[dict_d_taxname])),
+ dict_d_topobs=ColumnDataSource(dict(dict_d_topobs=[dict_d_topobs])),
+ rank_select=rank_select),
+ code='''
+ // value holds the column index
+ var taxid = dict_d_topobs.data.dict_d_topobs[0][rank_select.value][value];
+ if(taxid!=undefined){
+ return dict_d_taxname.data.dict_d_taxname[0][taxid];
+ }else{
+ return value;
+ }
+ ''')
+
+ # Add custom tooltip for obsbars (taxid->name)
+ obsbars_fig.add_tools(HoverTool(
+ tooltips=[("Sample", "@index"),
+ ("Observation", "$name{custom}"),
+ ("Value", "@$name{0.2f}%")],
+ mode="mouse",
+ point_policy="follow_mouse",
+ formatters={"$name": taxid_name_custom}
+ ))
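+ # "$name{custom}" maps the stacked bar column index to its taxon name via taxid_name_custom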
bars = [str(i) for i in range(top_obs_bars)] + ["others", "unassigned"]
# Plot stacked bars with counts
vbar_ren = obsbars_fig.vbar_stack(bars, x="factors",
source=cds_p_obsbars,
width=1,
+ #line_color=None, # to avoid printing small border for zeros
#color=make_color_palette(top_obs_bars, linear=True) + ("#868b8e", "#eeede7"))
color=make_color_palette(top_obs_bars) + ("#868b8e", "#eeede7"))
obsbars_fig.xaxis.major_label_orientation = "vertical"
obsbars_fig.xaxis.group_label_orientation = "horizontal"
obsbars_fig.xaxis.subgroup_label_orientation = "vertical"
- obsbars_fig.xaxis.major_label_text_font_size = "8px"
+ obsbars_fig.xaxis.minor_tick_line_color = None
+ obsbars_fig.xaxis.major_tick_line_color = None
+ obsbars_fig.xaxis.major_label_text_font_size = "0px"
obsbars_fig.xgrid.grid_line_color = None
+ obsbars_fig.ygrid.grid_line_color = None
obsbars_fig.xaxis.axis_label = "samples"
obsbars_fig.yaxis.axis_label = "% counts"
@@ -143,10 +165,10 @@ def plot_obsbars(cds_p_obsbars, dict_d_topobs, ranks, top_obs_bars, dict_d_taxna
legend_bars_items = []
for i in range(top_obs_bars):
if i < len(dict_d_topobs[ranks[0]]):
- label = str(i) + ") " + dict_d_taxname[dict_d_topobs[ranks[0]][i]]
+ label = str(i + 1) + ") " + dict_d_taxname[dict_d_topobs[ranks[0]][i]]
else:
label = None
- legend_bars_items.append((label, [vbar_ren[i]]))
+ legend_bars_items.append(LegendItem(label=label, renderers=[vbar_ren[i]]))
legend_obsbars = Legend(items=legend_bars_items)
# legend_bars.label_text_font_size="9px"
@@ -173,8 +195,7 @@ def plot_obsbars_widgets(ranks, metadata, dict_d_topobs, dict_d_taxname, top_obs
sort_options["Default"].append(("col|others", "others"))
sort_options["Default"].append(("col|unassigned", "unassigned"))
- #sort_options["Observation"] = [("col|" + str(i), dict_d_taxname[t]) for i, t in enumerate(dict_d_topobs[ranks[0]])]
- sort_options["Observation"] = [("col|" + str(i), str(i)) for i in range(top_obs_bars)]
+ sort_options["Observation"] = [("col|" + str(i), str(i + 1)) for i in range(top_obs_bars)]
sort_options["Numeric Metadata"] = []
if metadata:
@@ -193,7 +214,7 @@ def plot_obsbars_widgets(ranks, metadata, dict_d_topobs, dict_d_taxname, top_obs
groupby1_select = Select(title="1) Group samples by", value="none", options=groupby_options, sizing_mode="stretch_width")
groupby2_select = Select(title="2) Group samples by", value="none", options=groupby_options, sizing_mode="stretch_width")
- toggle_label = CheckboxGroup(labels=["Hide sample labels"], active=[])
+ toggle_label = CheckboxGroup(labels=["Show sample labels"], active=[])
help_text = """
Observation bars showing proportions of top """ + str(top_obs_bars) + """ most abundant observations.
@@ -211,23 +232,28 @@ def plot_obsbars_widgets(ranks, metadata, dict_d_topobs, dict_d_taxname, top_obs
"help_button": help_button(title="Observation bars", text=help_text)}
-def plot_samplebars_widgets(ranks, metadata, contaminant_names, reference_names, control_names, decontam):
+def plot_samplebars_widgets(ranks, metadata, references, controls, decontam, normalized):
annotbar_rank_select = Select(title="Annotate bars at rank:", value=ranks[0], options=[r for r in ranks])
annotbar_options = {}
annotbar_options["Default"] = ["assigned"]
- annotbar_options["Contaminant"] = [c for c in contaminant_names]
- annotbar_options["References"] = [r for r in reference_names]
- annotbar_options["Controls"] = [c for c in control_names]
+ if references is not None:
+ annotbar_options["References"] = [r for r in references.keys()]
+ if controls is not None:
+ annotbar_options["Controls"] = [c for c in controls.keys()]
if decontam:
annotbar_options["Decontam"] = ["decontam"]
annotbar_select = Select(title="Annotate bars by:", value="assigned", options=annotbar_options)
- y1_select = Select(title="Counts", value="#", options=["#", "%"], width=80)
- y2_select = Select(title="Observations", value="%", options=["#", "%", "log10(#)", "log10(%)"], width=80)
+ if normalized:
+ y1_select = Select(title="Counts", value="%", options=["%"], width=80)
+ y2_select = Select(title="Observations", value="%", options=["%", "log10(%)"], width=80)
+ else:
+ y1_select = Select(title="Counts", value="#", options=["#", "%"], width=80)
+ y2_select = Select(title="Observations", value="%", options=["#", "%", "log10(#)", "log10(%)"], width=80)
sort_options = {}
- sort_options["Default"] = [("input_order", "input order"), ("#", "# counts"), ("selected_annotation", "selected annotation")]
+ sort_options["Default"] = [("input_order", "input order"), ("counts", "counts"), ("selected_annotation", "selected annotation")]
sort_options["Selected Rank"] = [("tax|" + r, r) for r in ranks]
sort_options["Numeric Metadata"] = []
if metadata:
@@ -246,6 +272,8 @@ def plot_samplebars_widgets(ranks, metadata, contaminant_names, reference_names,
groupby1_select = Select(title="1) Group samples by", value="none", options=groupby_options, sizing_mode="stretch_width")
groupby2_select = Select(title="2) Group samples by", value="none", options=groupby_options, sizing_mode="stretch_width")
+ toggle_label = CheckboxGroup(labels=["Show sample labels"], active=[])
+
help_text = """
Bars showing total counts (left y-axis) for each sample (x-axis).
@@ -255,7 +283,7 @@ def plot_samplebars_widgets(ranks, metadata, contaminant_names, reference_names,
Bars can be annotated by taxonomic rank. The annotation will show the proportion of counts matching the selected annotation.
-Raw counts (#) can be normalized (%). Observation counts (#) can be normalized (%) or log transformed (log10(#)) for better visualization.
+Raw counts and observations (#) can be normalized (%) and/or log transformed (log10) for better visualization.
"""
return {"y1_select": y1_select,
@@ -265,10 +293,11 @@ def plot_samplebars_widgets(ranks, metadata, contaminant_names, reference_names,
"sort_select": sort_select,
"groupby1_select": groupby1_select,
"groupby2_select": groupby2_select,
+ "toggle_label": toggle_label,
"help_button": help_button(title="Sample bars", text=help_text)}
-def plot_obstable(cds_p_obstable, ranks, contaminant_names, control_names):
+def plot_obstable(sizes, cds_m_obstable, ranks, references, controls):
# General filter for widgets
widgets_filter = IndexFilter()
@@ -277,24 +306,26 @@ def plot_obstable(cds_p_obstable, ranks, contaminant_names, control_names):
# Create table with view for each rank
for rank in ranks:
rank_filter = GroupFilter(column_name='col|rank', group=rank)
- cds_view = CDSView(source=cds_p_obstable, filters=[rank_filter, widgets_filter])
+ cds_view = CDSView(source=cds_m_obstable, filters=[rank_filter, widgets_filter])
+
table_cols = []
table_cols.append(TableColumn(field="col|name", title="Name"))
table_cols.append(TableColumn(field="col|frequency_perc", title="Frequency", default_sort="descending", formatter=NumberFormatter(format="0.00%")))
table_cols.append(TableColumn(field="col|counts_perc_avg", title="Avg. counts/sample", default_sort="descending", formatter=NumberFormatter(format="0.00%")))
table_cols.append(TableColumn(field="col|total_counts", title="Total counts", default_sort="descending"))
- for ctrl_name in control_names:
- table_cols.append(TableColumn(field="col|" + ctrl_name, title="(F) " + ctrl_name, default_sort="descending", formatter=NumberFormatter(format="0.00%")))
+ if references is not None:
+ for ref_name in references.keys():
+ table_cols.append(TableColumn(field="col|" + ref_name, title=ref_name, default_sort="descending"))
- for cont_name in contaminant_names:
- table_cols.append(TableColumn(field="col|" + cont_name, title=cont_name, default_sort="descending"))
+ if controls is not None:
+ for ctrl_name in controls.keys():
+ table_cols.append(TableColumn(field="col|" + ctrl_name, title="(F) " + ctrl_name, default_sort="descending", formatter=NumberFormatter(format="0.00%")))
- table_cols.append(TableColumn(field="col|references", title="References", default_sort="descending"))
- if "col|decontam" in cds_p_obstable.data:
+ if "col|decontam" in cds_m_obstable.data:
table_cols.append(TableColumn(field="col|decontam", title="DECONTAM", default_sort="descending"))
- datatable = DataTable(height=200,
+ datatable = DataTable(height=sizes["overview_top_panel_height"],
sizing_mode="stretch_width",
index_position=None,
autosize_mode="fit_viewport",
@@ -302,7 +333,7 @@ def plot_obstable(cds_p_obstable, ranks, contaminant_names, control_names):
#selectable="checkbox",
frozen_columns=1,
columns=table_cols,
- source=cds_p_obstable,
+ source=cds_m_obstable,
view=cds_view)
obstable_tabs.append(Panel(child=datatable, title=rank))
@@ -312,39 +343,103 @@ def plot_obstable(cds_p_obstable, ranks, contaminant_names, control_names):
return obstable, widgets_filter
-def plot_obstable_widgets(dict_d_taxname, max_count_rank):
+def plot_obstable_widgets(sizes, dict_d_taxname, max_count_rank):
# Filtering options
- frequency_spinner = Spinner(title="Frequency", low=0, high=100, value=0, step=1, width=100, height=50)
- counts_perc_avg_spinner = Spinner(title="Avg. counts/sample", low=0, high=100, value=0, step=0.1, width=100, height=50)
- total_counts_spinner = Spinner(title="Total counts", low=1, high=max_count_rank, step=1, value=1, width=100, height=50)
+ spinner_width = sizes["overview_top_panel_width_left"] - 20
+ frequency_spinner = Spinner(title="Frequency", low=0, high=100, value=0, step=1, width=spinner_width, height=50)
+ counts_perc_avg_spinner = Spinner(title="Avg. counts/sample", low=0, high=100, value=0, step=0.1, width=spinner_width, height=50)
+ total_counts_spinner = Spinner(title="Total counts", low=1, high=max_count_rank, step=1, value=1, width=spinner_width, height=50)
# Create unique list of names with taxids for filtering. map to str and set to get unique
- unique_dict_d_taxname_tuples = set(zip(dict_d_taxname.keys(), map(str, dict_d_taxname.values())))
- name_multichoice = MultiChoice(title="Obs. name or identifier", options=list(unique_dict_d_taxname_tuples), sizing_mode="fixed", width=250, height=100)
-
+ name_multichoice = MultiChoice(title="Observation name or id",
+ options=list(set(zip(dict_d_taxname.keys(), map(str, dict_d_taxname.values())))),
+ sizing_mode="fixed",
+ width=sizes["overview_top_panel_width_left"] - 20, height=60)
+
help_text = """
-Summary of observations among all samples. If taxonomy is provided, panels will show different taxonomic ranks.
+Summary of observations among all samples. If taxonomy is provided, panels will show different taxonomic ranks.
-Clicking on the entries will load further information of the observation in the other plots/panels.
+Clicking on the entries will load further information of the observation in the other plots/panels.
The table contains the following fixed columns:
- **Name**: Taxonomic or given observation name
-- **Frequency**: How often the observation is occurring among all samples
+- **Frequency**: How often the observation is occurring among all samples
- **Avg. counts/sample**: Average percentage of the observation among all samples
- **Total counts**: Sum of counts of this observation in all samples
- **(F) Controls**: (F)requency for controls: how often the observation is occurring in the given control samples
-- **CC Bacteria, CC Viruses, ...**: How many times the observation was reported as a Common Contaminant
-- **References**: In which reference sets this observation occurs
-- **DECONTAM**: Contamination results from DECONTAM method
+- **References**: How many times the observation was reported in the references
+- **DECONTAM**: Final contamination output from the DECONTAM method
-Widgets can filter entries of the table. "Obs. name or identifier" filters the lineage of the entries, if taxonomy is provided. With that is possible to, for example, filter a certain genus and the table will show only children species.
+Widgets can filter entries of the table. "Observation name or id" filters the lineage of the entries, if taxonomy is provided. With that it is possible to, for example, filter a certain genus so the table shows only its children species.
"""
return {"frequency_spinner": frequency_spinner,
"counts_perc_avg_spinner": counts_perc_avg_spinner,
"total_counts_spinner": total_counts_spinner,
"name_multichoice": name_multichoice,
- "help_button": help_button(title="Observation table", text=help_text)}
+ "help_button": help_button(title="Observation table", text=help_text, align="start")}
+
+
+def plot_sampletable(cds_p_sampletable, sizes, ranks):
+
+ table_cols = []
+ table_cols.append(TableColumn(field="index", title="Sample"))
+ table_cols.append(TableColumn(field="col|total", title="Total counts", default_sort="descending"))
+ table_cols.append(TableColumn(field="col|assigned", title="Assigned"))
+ table_cols.append(TableColumn(field="col|assigned_perc", title="Assigned %", default_sort="descending", formatter=NumberFormatter(format="0.00%")))
+ table_cols.append(TableColumn(field="col|unassigned", title="Unassigned"))
+ table_cols.append(TableColumn(field="col|unassigned_perc", title="Unassigned %", formatter=NumberFormatter(format="0.00%")))
+
+ # Pre-select all checkboxes
+ cds_p_sampletable.selected.indices = list(range(len(cds_p_sampletable.data["index"])))
+
+ for rank in ranks:
+ table_cols.append(TableColumn(field="col|" + rank, title=rank, formatter=NumberFormatter(format="0.00%")))
+
+ sampletable = DataTable(height=sizes["overview_top_panel_height"],
+ sizing_mode="stretch_width",
+ index_position=None,
+ autosize_mode="fit_viewport",
+ selectable="checkbox",
+ frozen_columns=1,
+ columns=table_cols,
+ source=cds_p_sampletable)
+
+ return sampletable
+
+
+def plot_sampletable_widgets(sizes, max_count_samples, metadata):
+ # Filtering options
+ spinner_width = sizes["overview_top_panel_width_left"] - 20
+
+ total_counts_spinner = Spinner(title="Total counts", low=1, high=max_count_samples, step=1, value=1, width=spinner_width, height=50)
+ assigned_spinner = Spinner(title="Assigned %", low=0, high=100, value=0, step=1, width=spinner_width, height=50)
+
+ if metadata:
+ metadata_values = []
+ for field in metadata.get_data().columns.to_list():
+ for value in metadata.get_unique_values(field):
+ metadata_values.append((field + "|" + str(value), field + " = " + str(value)))
+
+ metadata_multichoice = MultiChoice(title="Metadata",
+ options=metadata_values,
+ sizing_mode="fixed",
+ width=sizes["overview_top_panel_width_left"] - 20, height=60)
+ else:
+ metadata_multichoice = Spacer()
+
+ help_text = """
+Summary of samples. Entries selected in the table are shown in the barplot below.
+
+Widgets can select batches of entries in the table by multiple criteria.
+
+Multiple metadata fields/values can be chosen and the union of the matching results will be selected in the table.
+"""
+
+ return {"total_counts_spinner": total_counts_spinner,
+ "assigned_spinner": assigned_spinner,
+ "metadata_multichoice": metadata_multichoice,
+ "help_button": help_button(title="Sample selection", text=help_text, align="start")}
def plot_infopanel():
@@ -353,73 +448,152 @@ def plot_infopanel():
disabled=False)
-def plot_decontam(cds_p_decontam, cds_p_decontam_lines, min_obs_perc):
+def plot_decontam(sizes, cds_p_decontam, cds_p_decontam_lines, min_obs_perc):
decontam_fig = figure(x_axis_type="log",
y_axis_type="log",
- height=170, width=300,
+ height=sizes["overview_top_panel_height"] - 50,
+ width=sizes["overview_top_panel_width_right"],
sizing_mode="stretch_width",
tools="save")
- palette = make_color_palette(2)
- factors = list(set(cds_p_decontam.data["controls"]))
-
+ palette = make_color_palette(2) # Control, Sample
+ factors = list(sorted(set(cds_p_decontam.data["controls"]), reverse=True))
# Add legend on top
decontam_fig.add_layout(Legend(), 'above')
- decontam_fig.circle(x="concentration", y="counts",
- source=cds_p_decontam,
- color=factor_cmap('controls', palette=palette, factors=factors),
- legend_group="controls",
- size=3)
-
- # If there are controls, show legend
- if len(factors) > 1:
- decontam_fig.legend.margin = 0
- decontam_fig.legend.border_line_width = 0
- decontam_fig.legend.spacing = 0
- decontam_fig.legend.padding = 0
- decontam_fig.legend.orientation = "horizontal"
- decontam_fig.legend.location = "bottom_right"
- else:
- decontam_fig.legend.visible = False
+ points = decontam_fig.circle(x="concentration", y="counts",
+ source=cds_p_decontam,
+ color=factor_cmap('controls', palette=palette, factors=factors),
+ legend_group="controls",
+ size=3)
+
+ # Add tooltip just for points
+ decontam_fig.add_tools(HoverTool(renderers=[points], tooltips=[('Sample', '@index')]))
decontam_fig.line(x="x",
y="y_cont",
source=cds_p_decontam_lines,
- color="red")
+ color="red",
+ legend_label="Cont.")
decontam_fig.line(x="x",
y="y_noncont",
source=cds_p_decontam_lines,
- color="black", line_dash="dashed")
-
- decontam_fig.xaxis.axis_label = 'DNA Concentration/Total counts'
- decontam_fig.yaxis.axis_label = 'obs. counts'
+ color="black",
+ line_dash="dashed",
+ legend_label="Non-cont.")
+
+ decontam_fig.legend.margin = 0
+ decontam_fig.legend.border_line_width = 0
+ decontam_fig.legend.spacing = 0
+ decontam_fig.legend.padding = 0
+ decontam_fig.legend.orientation = "horizontal"
+ decontam_fig.legend.location = "bottom_right"
+
+ decontam_fig.xaxis.axis_label = 'Concentration'
+ decontam_fig.yaxis.axis_label = 'Counts'
decontam_fig.y_range.start = min_obs_perc
decontam_fig.y_range.end = 1
-
-
return decontam_fig
-def plot_decontam_widgets():
- pvalue_text = Paragraph(text="P-value")
- pvalue_input = TextInput(value="", width=180, align='end')
+def plot_decontam_widgets(sizes):
+ pscore_text = Paragraph(text="P-score")
+ pscore_input = TextInput(value="", width=sizes["overview_top_panel_width_right"] - 150, align='end', disabled=True)
help_text = """
-Plot to verify the DECONTAM [1] output. Proportion of counts (y-axis) against DNA Concentration (if provided) or Total number of counts (x-axis), both in log10 scale. If provided, controls samples are displayed in a different color.
+Plot to verify the DECONTAM [1] output. Proportion of counts of the selected observation (y-axis) against DNA Concentration (if provided) or Total number of counts (x-axis) of each sample, both in log10 scale. If provided, control samples are displayed in a different color.
An indication of contamination is when counts are inversely proportional to DNA concentration. The red (solid) and black (dashed) lines are the expected models for contamination and non-contamination, respectively. A good sign of contamination is when the dots (excluding control samples) "fit" the red line model.
+The P-score statistic is not a P-value and it is not associated with any guarantees on the type 1 error rate [1]. Small scores indicate the contaminant model is a better fit, and high scores indicate that the non-contaminant model is a better fit.
+
More details can be found in the [DECONTAM Introduction guide](https://benjjneb.github.io/decontam/vignettes/decontam_intro.html)
[1] Davis, N. M. et al. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome (2018) 10.1186/s40168-018-0605-2.
"""
- return {"pvalue_text": pvalue_text,
- "pvalue_input": pvalue_input,
+ return {"pscore_text": pscore_text,
+ "pscore_input": pscore_input,
"help_button": help_button(title="DECONTAM", text=help_text, align="start")}
-def plot_mgnify(cds_p_mgnify):
- mgnify_fig = figure(height=150, width=300,
+
+def plot_references(sizes, table, cds_p_references, dict_d_taxname):
+ references_fig = figure(x_range=table.ranks(),
+ height=sizes["overview_top_panel_height"] - 50,
+ width=sizes["overview_top_panel_width_right"],
+ tools="save,reset")
+
+ # Need to pass dict_d_taxname inside a one column data
+ taxid_name_custom = CustomJSHover(
+ args=dict(dict_d_taxname=ColumnDataSource(dict(dict_d_taxname=[dict_d_taxname]))),
+ code="return dict_d_taxname.data.dict_d_taxname[0][value]; // value holds the @taxid"
+ )
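+ # client-side lookup: the hovered @obs taxid indexes the name dict stored
+ # in the single-row ColumnDataSource above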
+ # Add custom tooltip for heatmap (taxid->name)
+ references_fig.add_tools(HoverTool(
+ tooltips=[
+ ('Observation', '@obs{custom}'),
+ ('# reported (directly)', '@direct'),
+ ('as parent', '@parent'),
+ ],
+ mode="mouse",
+ point_policy="follow_mouse",
+ formatters={"@obs": taxid_name_custom}
+ ))
+
+ references_filter = IndexFilter(indices=[])
+ cds_view_references = CDSView(source=cds_p_references, filters=[references_filter])
+
+ references_fig.add_layout(Legend(), 'above')
+
+ fixed_bar_options = ["direct", "parent"]
+ palette = ["red", "black"]
+ references_fig.vbar_stack(fixed_bar_options,
+ x="rank",
+ width=1,
+ source=cds_p_references,
+ view=cds_view_references,
+ color=palette,
+ line_color=None, # to avoid printing small border for zeros
+ fill_alpha=[1, 0.3],
+ legend_label=fixed_bar_options)
+
+ references_fig.y_range.start = 0
+
+ references_fig.xaxis.major_label_orientation = "vertical"
+ references_fig.xgrid.grid_line_color = None
+ references_fig.xaxis.minor_tick_line_color = None
+ references_fig.yaxis.minor_tick_line_color = None
+ references_fig.xaxis.major_tick_line_color = None
+ references_fig.yaxis.major_tick_line_color = None
+ references_fig.yaxis.axis_label = "# reported"
+
+ references_fig.legend.margin = 0
+ references_fig.legend.border_line_width = 0
+ references_fig.legend.spacing = 0
+ references_fig.legend.padding = 0
+ references_fig.legend.orientation = "horizontal"
+ references_fig.legend.location = "bottom_right"
+
+ return references_fig, references_filter
+
+
+def plot_references_widgets(sizes, references):
+ ref_names = list(references.keys()) if references is not None else []
+ references_select = Select(value=ref_names[0] if ref_names else None, width=sizes["overview_top_panel_width_right"] - 70, options=ref_names)
+ help_text = """
+Plot of the number of occurrences of the provided references for each observation and its lineage.
+
+**direct** counts represent direct matches with reference identifiers
+
+**parent** counts account for the number of times children in the lineage of the selected observation node (not necessarily the node itself) were reported among the references
+"""
+
+ return {"references_select": references_select,
+ "help_button": help_button(title="References", text=help_text, align="start")}
+
+
+def plot_mgnify(sizes, cds_p_mgnify):
+ mgnify_fig = figure(height=sizes["overview_top_panel_height"] - 50,
+ width=sizes["overview_top_panel_width_right"],
tools="save,wheel_zoom,reset")
mgnify_filter = IndexFilter(indices=[])
@@ -439,16 +613,17 @@ def plot_mgnify(cds_p_mgnify):
factors.extend(lineages)
palette.extend(make_color_palette(len(lineages)))
- # Custom hover to follow mouse
+ # Add custom tooltip to show percentage (based on angle)
mgnify_fig.add_tools(HoverTool(
tooltips=[("Biome", "@lineage"),
- ("studies", "@count")],
+ ("Studies", "@count"),
+ ("Percentage", "@angle{custom}%")],
mode="mouse",
- point_policy="follow_mouse"
+ point_policy="follow_mouse",
+ formatters={"@angle": CustomJSHover(code="return ((value/6.2831853071795)*100).toFixed(2);")}
))
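+ # 6.2831853071795 ~ 2*pi: wedge angle divided by the full circle gives the fraction of studies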
#mgnify_fig.text(0, 1, text=["No data"], text_baseline="middle", text_align="center")
-
mgnify_fig.wedge(x=0, y=1, radius=0.5,
start_angle=cumsum('angle', include_zero=True),
end_angle=cumsum('angle'),
@@ -466,7 +641,7 @@ def plot_mgnify(cds_p_mgnify):
def plot_mgnify_widgets():
- biome_spinner = Spinner(title="Biome level", low=1, high=5, value=1, step=1, width=100, height=50, orientation="horizontal")
+ biome_spinner = Spinner(title="Biome level", low=1, high=5, value=1, step=1, width=100, height=50) # orientation="horizontal")
help_text = """
Pie chart with the number of occurrences of the selected taxa in the table by environment (biome) in other microbiome studies analyzed and publicly available at the [MGNify](https://www.ebi.ac.uk/metagenomics) [1] resource.
@@ -476,7 +651,7 @@ def plot_mgnify_widgets():
"""
return {"biome_spinner": biome_spinner,
- "help_button": help_button(title="MGNify", text=help_text)}
+ "help_button": help_button(title="MGnify", text=help_text)}
def plot_heatmap(table, cds_p_heatmap, tools_heatmap, transformation, dict_d_taxname):
@@ -499,7 +674,6 @@ def plot_heatmap(table, cds_p_heatmap, tools_heatmap, transformation, dict_d_tax
tooltips=[
('Sample', '@index'),
('Observation', '@obs{custom}'),
- ('Original value', '@ov'),
('Transformed value (' + transformation + ')', '@tv')
],
formatters={"@obs": taxid_name_custom}
@@ -510,24 +684,30 @@ def plot_heatmap(table, cds_p_heatmap, tools_heatmap, transformation, dict_d_tax
color_mapper.low = min(cds_p_heatmap.data["tv"])
color_mapper.high = max(cds_p_heatmap.data["tv"])
- # Convert taxid ticks to taxa names on client-side
- heatmap.xaxis.formatter = FuncTickFormatter(args=dict(dict_d_taxname=dict_d_taxname), code='''
- return dict_d_taxname[tick];
- ''')
-
- heatmap.rect(x="obs", y="index", width=1, height=1,
+ heatmap.rect(x="factors_obs", y="factors_sample", width=1, height=1,
source=cds_p_heatmap,
fill_color={'field': 'tv', 'transform': color_mapper},
line_color=None)
- color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12, border_line_color=None, location="center", orientation="horizontal")
- heatmap.add_layout(color_bar, 'above')
+ color_bar = ColorBar(color_mapper=color_mapper,
+ label_standoff=2,
+ width=6,
+ height=200,
+ border_line_color=None,
+ location="center",
+ orientation="vertical",
+ major_label_text_align="left",
+ major_label_text_font_size="9px",
+ title=transformation)
+ heatmap.add_layout(color_bar, 'left')
# Convert taxid ticks to taxa names on client-side
heatmap.xaxis.formatter = FuncTickFormatter(args=dict(dict_d_taxname=dict_d_taxname), code='''
return dict_d_taxname[tick];
''')
+ heatmap.xaxis.group_label_orientation = "vertical"
+ heatmap.yaxis.group_label_orientation = "horizontal"
heatmap.xaxis.major_label_orientation = "vertical"
heatmap.xgrid.grid_line_color = None
heatmap.ygrid.grid_line_color = None
@@ -543,78 +723,97 @@ def plot_heatmap(table, cds_p_heatmap, tools_heatmap, transformation, dict_d_tax
return heatmap
-def plot_heatmap_widgets(ranks, linkage_methods, linkage_metrics, contaminant_names, reference_names, controls_names, metadata, decontam):
+def plot_heatmap_widgets(ranks, linkage_methods, linkage_metrics, references, controls, metadata, decontam):
rank_select = Select(title="Taxonomic rank:", value=ranks[0], options=ranks)
+ cluster_options = []
+ for lmetric in linkage_metrics:
+ for lmethod in linkage_methods:
+ cluster_options.append(("cluster|" + lmethod + "|" + lmetric, lmethod + "/" + lmetric))
+
+ x_groupby_options = {}
+ x_groupby_options["Default"] = [("none", "none")]
+ x_groupby_options["Clustering Method/Metric"] = cluster_options
+ x_groupby_options["Group by taxonomic rank"] = [("tax|" + r, r) for r in ranks]
+
x_sort_options = {}
- x_sort_options["Clustering Metric"] = [("metric|" + lm, lm) for lm in linkage_metrics]
- x_sort_options["Default order"] = [("none", "none"), ("counts", "counts"), ("observations", "observations")]
- x_sort_options["Sort by Contaminants"] = [("annot|" + c, c) for c in contaminant_names]
- x_sort_options["Sort by References"] = [("annot|" + r, r) for r in reference_names]
- if controls_names:
- x_sort_options["Sort by Controls"] = [("annot|" + c, c) for c in controls_names]
+ x_sort_options["Default"] = [("none", "none"), ("counts", "counts"), ("observations", "observations")]
+ if references is not None:
+ x_sort_options["References"] = [("annot|" + r, r) for r in references.keys()]
+ if controls is not None:
+ x_sort_options["Controls"] = [("annot|" + c, c) for c in controls.keys()]
if decontam:
- x_sort_options["Sort by DECONTAM"] = [("annot|decontam", "decontam")]
+ x_sort_options["DECONTAM"] = [("annot|decontam", "decontam")]
- y_sort_options = {}
- y_sort_options["Clustering Metric"] = [("metric|" + lm, lm) for lm in linkage_metrics]
- y_sort_options["Default order"] = [("none", "none"), ("counts", "counts"), ("samples", "samples")]
+ y_groupby_options = {}
+ y_groupby_options["Default"] = [("none", "none")]
+ y_groupby_options["Clustering Method/Metric"] = cluster_options
+ if metadata:
+ categorical_md_data = metadata.get_data(metadata_type="categorical").columns.to_list()
+ if categorical_md_data:
+ y_groupby_options["Group by Categorical Metadata"] = [("group_metadata|" + md, md) for md in categorical_md_data]
+ y_sort_options = {}
+ y_sort_options["Default"] = [("none", "none"), ("counts", "counts"), ("samples", "samples")]
if metadata:
numeric_md_data = metadata.get_data(metadata_type="numeric").columns.to_list()
if numeric_md_data:
- y_sort_options["Sort by Numeric Metadata"] = [("metadata_num|" + md, md) for md in numeric_md_data]
+ y_sort_options["Numeric Metadata"] = [("metadata_num|" + md, md) for md in numeric_md_data]
categorical_md_data = metadata.get_data(metadata_type="categorical").columns.to_list()
if categorical_md_data:
- y_sort_options["Sort by Categorical Metadata"] = [("metadata_cat|" + md, md) for md in categorical_md_data]
-
+ y_sort_options["Categorical Metadata"] = [("metadata_cat|" + md, md) for md in categorical_md_data]
- x_sort_select = Select(title="Observation cluster/sort:", value="none", options=x_sort_options)
- x_method_select = Select(title="Observation clustering method:", value=linkage_methods[0], options=linkage_methods, disabled=True)
- #x_group_select = Select(title="Observation group by:", value="none", options={"Default": ["none"], "Taxonomic ranks": ranks})
-
- y_sort_select = Select(title="Sample cluster/sort:", value="none", options=y_sort_options)
- y_method_select = Select(title="Sample clustering method:", value=linkage_methods[0], options=linkage_methods, disabled=True)
+ x_groupby_select = Select(title="Observation cluster/group by:", value="none", options=x_groupby_options)
+ x_sort_select = Select(title="Observation sort:", value="none", options=x_sort_options)
+ y_groupby_select = Select(title="Sample cluster/group by:", value="none", options=y_groupby_options)
+ y_sort_select = Select(title="Sample sort:", value="none", options=y_sort_options)
- toggle_label_text = Div(text="show/hide labels")
- toggle_label_heatmap = CheckboxGroup(labels=["Observations", "Samples"], active=[])
+ toggle_label = CheckboxGroup(labels=["Show observations labels", "Show samples label"], active=[])
help_text = """
-The heatmap shows [transformed] values from the input table (color bar on top). If taxonomy is provided, one heatmap for each taxonomic rank is generated.
+***Heatmap***
-Values on axis can be independently clustered or sorted. Hierarchical clustering is done with [scipy linkage](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html). If clustering is selected, dendrograms will be plotted in the panels around the heatmap.
+The heatmap shows [transformed] values from the input table (color bar on the left). Values on both axes can be independently clustered, grouped or sorted.
+Hierarchical clustering uses [scipy linkage](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html). Dendrograms will be plotted in the panels around the heatmap if clustering is selected.
-The right-most panel will show metadata values related to each sample (y-axis), if provided. Colors are automatically generated for categorical (distinct colors) and numeric (sequential colors) fields. Multiple metadata fields can be select with Ctrl + click in the metadata selection widget.
+***Metadata***
-The bottom-most panel shows annotations for each observation/taxa values (x-axis).
+The right-most panel will show metadata values related to each sample (y-axis). Colors are automatically generated for categorical (distinct colors) and numeric (sequential colors) fields. Multiple metadata fields can be selected with Ctrl + click in the metadata selection widget.
+
+***Annotations***
+
+The bottom-most panel shows annotations for each observation (x-axis). Values are transformed and normalized to a 0-1 scale. Reference values are normalized by the max. occurrence in the selected rank.
+Control values are their frequency in the specific control annotation. DECONTAM values are normalized P-scores, with higher values (1) representing low P-scores.
The metadata and annotation plots are automatically sorted to reflect the clustering/sort of the heatmap.
+
+**If the panels are not properly aligned after data selection, use the reset tool (top right) to re-align them**
"""
return {"rank_select": rank_select,
- "x_method_select": x_method_select,
+ "x_groupby_select": x_groupby_select,
"x_sort_select": x_sort_select,
- #"x_group_select": x_group_select,
- "y_method_select": y_method_select,
+ "y_groupby_select": y_groupby_select,
"y_sort_select": y_sort_select,
- "toggle_label_heatmap": toggle_label_heatmap,
- "toggle_label_text": toggle_label_text,
- "help_button": help_button(title="Heatmap/Clustering", text=help_text)}
+ "toggle_label": toggle_label,
+ "help_button": help_button(title="Heatmap", text=help_text)}
def plot_dendrogram(heatmap, tools_heatmap, cds_p_dendro_x, cds_p_dendro_y):
-
+
dendrox_fig = figure(x_range=heatmap.x_range,
tools="save",
height=80,
sizing_mode="stretch_width",
- tooltips=[("y", "$y{(0.00)}"), ("c", "$swatch:c")])
+ #tooltips=[("y", "$y{(0.00)}"), ("c", "$swatch:c")],
+ visible=False)
dendroy_fig = figure(y_range=heatmap.y_range,
tools="save",
width=80,
height=heatmap.height,
- tooltips=[("x", "$x{(0.00)}"), ("c", "$swatch:c")])
+ #tooltips=[("x", "$x{(0.00)}"), ("c", "$swatch:c")],
+ visible=False)
dendroy_fig.multi_line(xs="x", ys="y", color="c",
source=cds_p_dendro_y)
@@ -647,15 +846,17 @@ def plot_dendrogram(heatmap, tools_heatmap, cds_p_dendro_x, cds_p_dendro_y):
def plot_metadata(heatmap, tools_heatmap, metadata, cds_d_metadata, cds_p_metadata):
- # Number of cols added to the plot cds (user input)
- ncols = len(cds_p_metadata.data) - 1
+ # Get fixed headers from cds
+ cols = list(cds_p_metadata.data.keys())[2:]
- metadata_fig = figure(x_range=["md" + str(c) for c in range(ncols)],
+ metadata_fig = figure(x_range=cols,
y_range=heatmap.y_range,
tools=tools_heatmap,
x_axis_location="above",
- width=80, height=heatmap.height,
+ width=300,
+ height=heatmap.height,
tooltips="")
+
metadata_fields = metadata.get_col_headers().to_list()
# A bit of a hack of the "proper" user of a colormap, since I need mixed types (categorical and numeric)
@@ -664,54 +865,90 @@ def plot_metadata(heatmap, tools_heatmap, metadata, cds_d_metadata, cds_p_metada
# (metadata header, str(metadata value)) -> color
# Add them to a CategoricalColorMapper which will be applied for the whole plot
# Use different palettes for numeric types, but convert to string to be treated as a category
- # Need to be careful with int and float, since the value of str(0.0)
+ # Need to be careful with int and float, since the value of str(0.0)
# will not match the 0 in the javascript data conversion, therefore use the numeric to calculate palette
# but make get_formatted_unique_values on the dictionary
factors = []
palette = []
+ legend_colorbars = {}
for i, md_header in enumerate(metadata_fields):
- unique_values = metadata.get_unique_values(md_header)
+ unique_values = sorted(metadata.get_unique_values(md_header))
if unique_values:
n = len(unique_values)
+ legend_colorbars[md_header] = ColorBar(label_standoff=1,
+ width=10,
+ border_line_color=None,
+ location=(0, 0),
+ title=md_header,
+ title_text_align="left",
+ title_text_font_size="11px",
+ major_label_text_align="left",
+ major_label_text_font_size="9px",
+ visible=False)
if metadata.get_type(md_header) == "numeric":
unique_palette = make_color_palette(n, linear=True)
+ legend_colorbars[md_header].color_mapper = LinearColorMapper(palette=unique_palette, low=min(unique_values), high=max(unique_values))
+ #legend_colorbars[md_header].formatter = PrintfTickFormatter(format="%f")
else:
unique_palette = make_color_palette(n)
+ legend_colorbars[md_header].color_mapper = LinearColorMapper(palette=unique_palette, low=0, high=n)
+ legend_colorbars[md_header].ticker = FixedTicker(ticks=[t + 0.5 for t in range(n)])
+ legend_colorbars[md_header].major_label_overrides = {i + 0.5: unique_values[i] for i in range(n)}
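+ # ticks at t + 0.5 center each categorical label on its color block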
+
assert len(unique_palette) == n, 'Wrong number of colors on palette'
palette.extend(unique_palette)
- factors.extend([(md_header, md_value) for md_value in metadata.get_formatted_unique_values(md_header)])
+ factors.extend([(md_header, md_value) for md_value in map(format_js_toString, unique_values)])
+
metadata_colormap = CategoricalColorMapper(palette=palette, factors=factors)
# Custom tooltip to show metadata field and value
- md_custom = CustomJSHover(code='return value[0] ? "(" + value[0] + ") " + value[1] : "";')
+ md_custom = CustomJSHover(code='return value ? "(" + value[0] + ") " + value[1] : "";')
tooltips = [('Sample', '@index')]
formatters = {}
- for i in range(ncols):
- tooltips.append(("md" + str(i), "@md" + str(i) + "{custom}"))
- formatters["@md" + str(i)] = md_custom
+ for col in cols:
+ tooltips.append((col, "@" + col + "{custom}"))
+ formatters["@" + col] = md_custom
metadata_fig.add_tools(HoverTool(tooltips=tooltips, formatters=formatters))
- for i in range(ncols):
- metadata_fig.rect(x=i + 0.5, y="index",
+ for col in cols:
+ metadata_fig.rect(x={"value": col}, y="factors",
width=1, height=1,
source=cds_p_metadata,
- fill_color={'field': "md" + str(i), 'transform': metadata_colormap},
+ fill_color={'field': col, 'transform': metadata_colormap},
line_color=None)
+ # Show just first when loading
+ metadata_fig.x_range.factors = ["1"]
+
+ for i, md_header in enumerate(legend_colorbars.keys()):
+ # Start showing only first
+ if i == 0:
+ legend_colorbars[md_header].visible = True
+ metadata_fig.add_layout(legend_colorbars[md_header], 'right')
metadata_fig.xaxis.axis_label = "metadata"
- metadata_fig.xaxis.major_label_orientation = "vertical"
+ metadata_fig.xaxis.major_label_orientation = "horizontal"
metadata_fig.xaxis.major_label_text_font_size = "11px"
metadata_fig.xaxis.minor_tick_line_color = None
+ metadata_fig.xgrid.grid_line_color = None
metadata_fig.yaxis.major_tick_line_color = None
metadata_fig.yaxis.minor_tick_line_color = None
metadata_fig.yaxis.major_label_text_font_size = '0pt'
metadata_fig.yaxis.axis_line_color = None
+ metadata_fig.yaxis.group_text_font_size = "0px"
metadata_fig.ygrid.grid_line_color = None
- metadata_multiselect = MultiSelect(title="Metadata to show ('md#', max. " + str(ncols) + "):", value=[metadata_fields[c] for c in range(ncols)], options=metadata_fields)
+ metadata_multiselect = MultiSelect(title="Metadata to show (select max. " + str(len(cols)) + " columns):", value=[metadata_fields[0]], options=metadata_fields)
- return metadata_fig, {"metadata_multiselect": metadata_multiselect}
+ # Rename ticker to selected metadata
+ metadata_fig.xaxis.formatter = FuncTickFormatter(
+ args=dict(metadata_multiselect=metadata_multiselect),
+ code='''
+ return metadata_multiselect.value[tick-1];
+ ''')
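+ # x-axis factors are the slot strings "1".."N", so tick-1 indexes the
+ # currently selected metadata field names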
+
+ toggle_legend = CheckboxGroup(labels=["Show metadata legend"], active=[0])
+ return metadata_fig, {"metadata_multiselect": metadata_multiselect, "legend_colorbars": legend_colorbars, "toggle_legend": toggle_legend}
def plot_annotations(heatmap, tools_heatmap, cds_p_annotations, dict_d_taxname):
@@ -731,14 +968,32 @@ def plot_annotations(heatmap, tools_heatmap, cds_p_annotations, dict_d_taxname):
)
# Add custom tooltip for heatmap (taxid->name)
annot_fig.add_tools(HoverTool(
- tooltips=[('Annotation', '@annot'), ('Observation', '@index{custom}')],
+ tooltips=[('Annotation', '@annot'),
+ ('Observation', '@index{custom}'),
+ ('Original Value', '@ov'),
+ ('Transformed Value', '@tv')],
formatters={"@index": taxid_name_custom}
))
- annot_fig.rect(x="index", y="annot",
+ color_palette = Magma256[::-1]
+ color_mapper = LinearColorMapper(palette=color_palette, low=0, high=1)
+
+ color_bar = ColorBar(color_mapper=color_mapper,
+ label_standoff=2,
+ width=6,
+ height=60,
+ border_line_color=None,
+ location="center",
+ orientation="vertical",
+ major_label_text_align="left",
+ major_label_text_font_size="9px")
+ annot_fig.add_layout(color_bar, 'left')
+
+ annot_fig.rect(x="factors", y="annot",
width=1, height=1,
source=cds_p_annotations,
- fill_color="black",
+ #fill_color="black",
+ fill_color={'field': 'tv', 'transform': color_mapper},
line_color=None)
annot_fig.yaxis.axis_label = "annotations"
@@ -748,22 +1003,28 @@ def plot_annotations(heatmap, tools_heatmap, cds_p_annotations, dict_d_taxname):
annot_fig.yaxis.minor_tick_line_color = None
annot_fig.xaxis.major_tick_line_color = None
annot_fig.xaxis.major_label_text_font_size = "0px"
+ annot_fig.xaxis.group_text_font_size = "0px"
return annot_fig
def plot_correlation(cds_p_correlation, ranks, dict_d_taxname):
taxids = set()
+ taxids.update(cds_p_correlation.data["index"])
+ taxids.update(cds_p_correlation.data["taxid"])
+ corr_fig = figure(x_range=sorted(taxids, reverse=True),
+ y_range=sorted(taxids),
+ tools="save,reset,crosshair,tap,box_zoom",
+ sizing_mode="scale_height")
+
+ # Start showing only first rank
+ factors = set()
for i, rank in enumerate(cds_p_correlation.data["rank"]):
if rank == ranks[0]:
- taxids.add(cds_p_correlation.data["index"][i])
-
- taxids = sorted(taxids)
- corr_fig = figure(x_range=taxids,
- y_range=list(reversed(taxids)),
- tools="hover,save,reset,crosshair,tap,box_zoom",
- tooltips="",
- sizing_mode="scale_height")
+ factors.add(cds_p_correlation.data["index"][i])
+ factors.add(cds_p_correlation.data["taxid"][i])
+ corr_fig.x_range.factors = sorted(factors, reverse=True)
+ corr_fig.y_range.factors = sorted(factors)
# Need to pass dict_d_taxname inside a one column data
taxid_name_custom = CustomJSHover(
@@ -775,9 +1036,7 @@ def plot_correlation(cds_p_correlation, ranks, dict_d_taxname):
tooltips=[
('x', '@taxid{custom}'),
('y', '@index{custom}'),
- ('Correlation coefficient (rho)', '@rho'),
- ('Raw p-value', '@pval'),
- ('Corrected p-value', '@pval_corr')
+ ('Correlation (rho)', '@rho'),
],
formatters={"@taxid": taxid_name_custom, "@index": taxid_name_custom}
))
@@ -786,9 +1045,8 @@ def plot_correlation(cds_p_correlation, ranks, dict_d_taxname):
color_mapper = LinearColorMapper(palette=color_palette, low=-1, high=1)
rho_filter = IndexFilter()
- pval_filter = IndexFilter()
- cds_view_correlation = CDSView(source=cds_p_correlation, filters=[rho_filter, pval_filter])
- corr_fig.rect(x="taxid", y="index",
+ cds_view_correlation = CDSView(source=cds_p_correlation, filters=[rho_filter])
+ corr_fig.rect(x="index", y="taxid",
width=1, height=1,
source=cds_p_correlation,
view=cds_view_correlation,
@@ -816,29 +1074,37 @@ def plot_correlation(cds_p_correlation, ranks, dict_d_taxname):
corr_fig.yaxis.minor_tick_line_color = None
corr_fig.xaxis.major_label_orientation = "vertical"
- return corr_fig, rho_filter, pval_filter
+ corr_fig.xaxis.major_tick_line_color = None
+ corr_fig.xaxis.major_label_text_font_size = "0px"
+ corr_fig.yaxis.major_tick_line_color = None
+ corr_fig.yaxis.major_label_text_font_size = "0px"
+
+ return corr_fig, rho_filter
def plot_correlation_widgets(ranks, top_obs_corr):
rank_select = Select(title="Taxonomic rank:", value=ranks[0], options=ranks)
neg_slider = RangeSlider(start=-1, end=0, value=(-1, 0), step=.01, title="Negative correlation")
pos_slider = RangeSlider(start=0, end=1, value=(0, 1), step=.01, title="Positive correlation")
- pval_spinner = Spinner(title="Corrected P-value", low=0, high=1, step=0.01, value=1, width=100, height=50)
-
+ toggle_label = CheckboxGroup(labels=["Show observations labels"], active=[])
+
help_text = """
-Spearman correlation coefficient with associated and corrected (Benjamini/Hochberg) p-values between the top """ + str(top_obs_corr) + """ most abundant observations. [spearmanr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html) function from scipy is used.
+Symmetric proportionality coefficient (rho correlation) [1,2] between the top """ + str(top_obs_corr) + """ most abundant observations, based on log-ratios (clr). Only half of the matrix is displayed, since the values are symmetric.
- Negative correlation values [-1 .. 0] are displayed in blue.
- Positive correlation values [0 .. 1] are displayed in red.
-Only half matrix is displayed, since the values are symmetric. Widgets can control level of correlation and corrected p-value to show.
+[1] Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S. & Bähler, J. Proportionality: A Valid Alternative to Correlation for Relative Data. PLOS Computational Biology 11, e1004075 (2015).
+
+[2] Erb, I. & Notredame, C. How should we measure proportionality on relative gene expression data? Theory Biosci. 135, 21–36 (2016).
"""
return {"rank_select": rank_select,
"neg_slider": neg_slider,
"pos_slider": pos_slider,
- "pval_spinner": pval_spinner,
+ "toggle_label": toggle_label,
"help_button": help_button(title="Correlation", text=help_text)}
+
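+# Hedged sketch of the rho proportionality statistic cited above (assumes
+# clr-transformed abundance vectors a, b as numpy arrays; illustrative only):
+#
+# import numpy as np
+#
+# def rho(a, b):
+#     # 1 - var(a-b)/(var(a)+var(b)), symmetric in a and b [1,2]
+#     return 1 - np.var(a - b) / (np.var(a) + np.var(b))
+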
def help_button(title: str="", text: str="", align: str="end"):
hb = Button(width=32, height=32, label="?", align=align, button_type="warning")
@@ -851,32 +1117,3 @@ def help_button(title: str="", text: str="", align: str="end"):
hb.js_on_click(CustomJS(code="pop.open('" + title + "', '" + html_text + "');"))
return hb
-
-
-def make_color_palette(n_colors, linear: bool=False, palette: dict=None):
- if isinstance(palette, dict) and n_colors <= max(palette.keys()):
- # Special case for 1 and 2 (not in palettes)
- palette = palette[3 if n_colors < 3 else n_colors]
-
- if linear or n_colors > 20:
- if not palette:
- palette = Turbo256
- if n_colors <= 256:
- return linear_palette(palette, n_colors)
- else:
- # Repeat colors
- return [palette[int(i * 256.0 / n_colors)] for i in range(n_colors)]
- else:
- # Select color palette based on number of requested colors
- # Return the closest palette with most distinc set of colors
- if not palette:
- if n_colors <= 8:
- palette = Colorblind[8]
- elif n_colors <= 10:
- palette = Category10[10]
- elif n_colors <= 20:
- palette = Category20[20]
- else:
- palette = Turbo256
-
- return palette[:n_colors]
diff --git a/grimer/reference.py b/grimer/reference.py
new file mode 100644
index 0000000..82c8e62
--- /dev/null
+++ b/grimer/reference.py
@@ -0,0 +1,64 @@
+import yaml
+
+
+class Reference:
+ def __init__(self, file: str=None, ids: list=[]):
+ self.ids = {} # {refid: {ref1: set(desc1, desc2,...), ref2: set(desc3,...)}}
+ self.parents = {} # {parent_id: set(refids)}
+
+ if file is not None:
+ self.parse(file)
+ elif ids:
+ for i in ids:
+ self.add(i, "", "")
+
+ def __repr__(self):
+ args = ['{}={}'.format(k, repr(v)) for (k, v) in vars(self).items()]
+ return 'Reference({})'.format(', '.join(args))
+
+ def add(self, i, ref: str=None, desc: str=None):
+ if i not in self.ids:
+ self.ids[i] = {}
+ if ref is not None:
+ if ref not in self.ids[i]:
+ self.ids[i][ref] = set()
+ if desc is not None:
+ self.ids[i][ref].add(desc)
+
+ def add_parent(self, parent, refid):
+ if parent not in self.parents:
+ self.parents[parent] = set()
+ self.parents[parent].add(refid)
+
+ def parse(self, file):
+ with open(file, 'r') as fh:
+ if file.endswith(".yml") or file.endswith(".yaml"):
+ src = yaml.safe_load(fh)
+ for desc, val in src.items():
+ for ref, v in val.items():
+ for i in map(str, v["ids"]):
+ self.add(i, (ref, v["url"]), desc)
+ else:
+ for line in fh:
+ main_id = line.rstrip()
+ self.add(main_id)
+
+ def update_taxids(self, taxid_updated):
+ for node, upd_node in taxid_updated.items():
+ if upd_node is not None and upd_node != node:
+ print("Updated taxonomic node: " + node + " -> " + upd_node)
+ self.add(upd_node)
+ self.ids[upd_node].update(self.ids[node])
+ del self.ids[node]
+
+ def get_refs_desc(self, i, direct: bool=False, parents: bool=False):
+ refs_desc = {}
+ if direct and i in self.ids:
+ refs_desc.update(self.ids[i])
+ if parents and i in self.parents:
+ for refid in self.parents[i]:
+ refs_desc.update(self.ids[refid])
+ return refs_desc
+
+ def get_refs_count(self, i, direct: bool=False, parents: bool=False):
+ return len(self.get_refs_desc(i, direct, parents))
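+
+
+# Hedged usage sketch (file name illustrative; YAML layout follows the
+# "description" -> "reference" -> {url, ids} structure of the bundled .yml files):
+#
+# ref = Reference(file="files/contaminants.yml")
+# ref.get_refs_count("9606", direct=True)  # times taxid 9606 is reported directly
+# ref.get_refs_desc("9606", direct=True, parents=True)  # {(name, url): {descriptions}}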
diff --git a/grimer/source.py b/grimer/source.py
deleted file mode 100644
index 7c48e19..0000000
--- a/grimer/source.py
+++ /dev/null
@@ -1,77 +0,0 @@
-import yaml
-
-
-class Source:
- def __init__(self, file: str=None, ids: list=[]):
- # Only leaf ids/nodes
- self.ids = set()
- self.lineage = set()
-
- # {id: {ref1: set(desc1, desc2,...), ref2: set(desc3,...)}
- self.refs = {}
-
- if file is not None:
- self.parse(file)
- elif ids:
- self.ids.update(ids)
- self.lineage.update(ids)
-
- def __repr__(self):
- args = ['{}={}'.format(k, repr(v)) for (k, v) in vars(self).items()]
- return 'Source({})'.format(', '.join(args))
-
- def parse(self, file):
- with open(file, 'r') as fh:
- if file.endswith(".yml") or file.endswith(".yaml"):
- src = yaml.safe_load(fh)
- for desc, val in src.items():
- for ref, v in val.items():
- str_ids = list(map(str, v["ids"]))
- self.ids.update(str_ids)
- self.lineage.update(str_ids)
- for i in str_ids:
- self.add_refs_desc(i, (ref, v["url"]), desc)
- else:
- for line in fh:
- main_id = line.rstrip()
- self.ids.add(main_id)
- self.lineage.add(main_id)
-
- def update_lineage(self, ids):
- self.lineage.update(ids)
-
- def update_taxids(self, taxid_updated):
- # Update taxonomy entries or convert names to taxid
- for node, upd_node in taxid_updated.items():
- if upd_node is not None and upd_node != node:
- print("Updated taxonomic node: " + node + " -> " + upd_node)
- self.ids.discard(node)
- self.ids.add(upd_node)
- self.lineage.discard(node)
- self.lineage.add(upd_node)
- if node in self.refs:
- self.refs[upd_node] = self.refs.pop(node)
-
- def add_refs_desc(self, i, ref, desc):
- if i not in self.refs:
- self.refs[i] = {}
- if ref not in self.refs[i]:
- self.refs[i][ref] = set()
- if desc is not None:
- self.refs[i][ref].add(desc)
-
- def get_refs_desc(self, i):
- return self.refs[i] if i in self.refs else {}
-
- def get_refs(self, i):
- return list(self.refs[i].keys()) if i in self.refs else ()
-
- def get_refs_count(self, i):
- return len(self.refs[i]) if i in self.refs else 0
-
- def update_refs(self, taxid_parent_rank):
- for taxid, parent_taxid in taxid_parent_rank.items():
- if parent_taxid is not None and taxid in self.refs:
- for i in self.refs[taxid]:
- for r in self.refs[taxid][i]:
- self.add_refs_desc(parent_taxid, i, r)
diff --git a/grimer/table.py b/grimer/table.py
index 11e9874..2206533 100644
--- a/grimer/table.py
+++ b/grimer/table.py
@@ -2,13 +2,15 @@
class Table:
- def __init__(self, samples, total, unassigned):
+ def __init__(self, samples, total, unassigned, lineage, normalized, zerorep):
# Ordered dict to keep rank insert order
self.data = OrderedDict()
- self.lineage = None
+ self.lineage = lineage
self.samples = samples
self.total = total
self.unassigned = unassigned
+ self.normalized = normalized
+ self.zerorep = zerorep
def __repr__(self):
args = ['{}={}'.format(k, repr(v)) for (k, v) in vars(self).items()]
@@ -23,6 +25,24 @@ def observations(self, rank):
def ranks(self):
return list(self.data.keys())
+ def get_min_valid_count_perc(self):
+ return min([self.get_counts_perc(rank)[self.get_counts_perc(rank) > 0].min().min() for rank in self.ranks()])
+
+ def get_total(self):
+ return self.total
+
+ def get_unassigned(self):
+ return self.unassigned
+
+ def get_assigned(self):
+ return self.get_total() - self.get_unassigned()
+
+ def get_unassigned_perc(self):
+ return self.get_unassigned().divide(self.get_total(), axis=0) if not self.normalized else self.get_unassigned().divide(100, axis=0)
+
+ def get_assigned_perc(self):
+ return self.get_assigned().divide(self.get_total(), axis=0) if not self.normalized else self.get_assigned().divide(100, axis=0)
+
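+ # Hedged example of the percentage semantics above (illustrative numbers):
+ # with raw counts, total=[100, 200] and unassigned=[10, 40] give
+ # get_assigned_perc() == [0.90, 0.80]; for a pre-normalized input table
+ # (values already in %), the columns are simply divided by 100.
+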
def get_lineage(self, taxid, rank, other_rank):
# get lineage up-to requested rank
return self.lineage[self.lineage[rank] == taxid][other_rank].values[0]
@@ -33,17 +53,17 @@ def get_frequency(self, rank):
def get_frequency_perc(self, rank):
return self.get_frequency(rank) / len(self.samples)
- def get_total_counts(self, rank):
- return self.data[rank].sum(axis=0)
+ def get_counts(self, rank):
+ return self.data[rank].sum(axis=0) if not self.normalized else 0
def get_counts_perc(self, rank):
- return self.data[rank].divide(self.total, axis=0)
+ return self.data[rank].divide(self.get_total(), axis=0) if not self.normalized else self.data[rank].divide(100, axis=0)
- def get_counts_perc_avg(self, rank):
+ def get_counts_perc_avg_samples(self, rank):
return self.get_counts_perc(rank).sum(axis=0) / len(self.samples)
def get_top(self, rank, n):
- return sorted(self.get_counts_perc_avg(rank).sort_values(ascending=False).index[:n].to_list())
+ return sorted(self.get_counts_perc_avg_samples(rank).sort_values(ascending=False).index[:n].to_list())
def get_subtable(self, rank, samples: list=[], taxids: list=[], keep_shape: bool=False):
subtable = self.data[rank]
@@ -55,7 +75,7 @@ def get_subtable(self, rank, samples: list=[], taxids: list=[], keep_shape: bool
valid_samples.append(s)
if valid_samples:
- subtable = subtable.loc[valid_samples]
+ subtable = subtable.loc[subtable.index.intersection(valid_samples)]
if not keep_shape:
subtable = subtable.loc[:, subtable.sum(axis=0) > 0]
else:
diff --git a/scripts/.Rhistory b/scripts/.Rhistory
deleted file mode 100644
index e69de29..0000000
diff --git a/scripts/mgnify_download.py b/scripts/mgnify_download.py
index 5e21d85..e84269b 100755
--- a/scripts/mgnify_download.py
+++ b/scripts/mgnify_download.py
@@ -1,5 +1,6 @@
#!/usr/bin/env python3
+import argparse
import pandas as pd
import sys
import os
@@ -15,115 +16,137 @@
-Usage: ./mgnify_download.py study_accession [output_folder]
-Example single: ./mgnify_download.py MGYS00000554
-Example dump: seq -f "MGYS%08g" 1 5724 | xargs -P 8 -I {} ./mgnify_download.py {} mgnify_dump_20210408/ > mgnify_dump_20210408.log 2>&1 &
+Usage: ./mgnify_download.py -i study_accession [-o output_prefix] [-g] [-v]
+Example single: ./mgnify_download.py -i MGYS00000554
+Example dump: seq -f "MGYS%08g" 1 5724 | xargs -P 8 -I {} ./mgnify_download.py -i {} -v -g -o mgnify_dump_20210408/ > mgnify_dump_20210408.log 2>&1 &
"""
-API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'
-
-study_accession = sys.argv[1]
-output_folder = sys.argv[2]+"/" if len(sys.argv) == 3 else "./"
-prefix = output_folder+study_accession
-gz = True
-
-md_file = prefix + "_metadata.tsv"
-out_file = prefix + ".pkl"
-
-if gz:
- out_file = out_file + ".gz"
- md_file = md_file + ".gz"
-
-# Check if files exists and skip
-tax_files = glob(prefix + "*_taxonomy_abundances_*")
-if tax_files and os.path.isfile(out_file) and os.path.isfile(md_file):
- print(study_accession, "Files found, skipping")
- sys.exit(1)
-
-with Session(API_BASE) as s:
- # Get main study resource
- try:
- study = s.get('studies', study_accession).resource
- print(study.accession, "SAMPLES:"+str(study.samples_count), sep="\t",end="\t")
- except:
- print(study_accession, "Error: Study not found")
- sys.exit(1)
-
- # Save study info as a dict in a pkl file
- f = gzip.open(out_file, 'wb') if gz else open(out_file, "wb")
- pickle.dump(study.json, file=f)
- f.close()
-
- # Get all taxonomic tables for the highest version of the pipeline
- highest_version = 0
- table_version = {}
- for download in study.downloads:
- label = download.description.label
- #["Taxonomic assignments",
- #"Taxonomic assignments SSU",
- #"Taxonomic assignments LSU"
- #"Taxonomic assignments UNITE",
- #"Taxonomic assignments ITSoneDB"]
- if "Taxonomic assignments" in label:
- version = float(download.pipeline.id)
- if version not in table_version:
- table_version[version] = []
- table_version[version].append(download.url)
- if version > highest_version:
- highest_version = version
-
- if not table_version:
- print("Error: No taxonomic assignments for this study to download")
- sys.exit(1)
+
+def main(argv=sys.argv[1:]):
+
+ API_BASE = 'https://www.ebi.ac.uk/metagenomics/api/latest/'
+
+ parser = argparse.ArgumentParser(description='grimer-download-mgnify')
+ parser.add_argument('-i', '--study-accession', required=True, type=str, help="MGnify study accession (e.g. MGYS00002462)")
+ parser.add_argument('-g', '--gzip', default=False, action='store_true', help="Gzip downloaded files")
+ parser.add_argument('-v', '--verbose', default=False, action='store_true', help="Verbose output")
+ parser.add_argument('-o', '--output-prefix', type=str, help="Output prefix for downloaded files. Default: --study-accession")
+ args = parser.parse_args(argv)
+
+ study_accession = args.study_accession
+ if args.output_prefix:
+ prefix = args.output_prefix
+ else:
+ prefix = study_accession
+ gz = args.gzip
+
+ md_file = prefix + "_metadata.tsv"
+ out_file = prefix + ".pkl"
+ if gz:
+ out_file = out_file + ".gz"
+ md_file = md_file + ".gz"
+
+ # Check if files exist and skip
+ tax_files = glob(prefix + "*_taxonomy_abundances_*")
+ if tax_files and os.path.isfile(out_file) and os.path.isfile(md_file):
+ print(study_accession, "Warning: files already exist ")
+ return
+
+ with Session(API_BASE) as s:
+ # Get main study resource
+ try:
+ study = s.get('studies', study_accession).resource
+ if args.verbose:
+ print(study.accession, "SAMPLES:" + str(study.samples_count), sep="\t", end="\t")
+ except:
+ print(study_accession, "Error: Study accession not found")
+ sys.exit(1)
+
+ # Save study info as a dict in a pkl file
+ f = gzip.open(out_file, 'wb') if gz else open(out_file, "wb")
+ pickle.dump(study.json, file=f)
+ f.close()
+
+ # Get all taxonomic tables for the highest version of the pipeline
+ highest_version = 0
+ table_version = {}
+ for download in study.downloads:
+ label = download.description.label
+ #["Taxonomic assignments",
+ #"Taxonomic assignments SSU",
+ #"Taxonomic assignments LSU"
+ #"Taxonomic assignments UNITE",
+ #"Taxonomic assignments ITSoneDB"]
+ if "Taxonomic assignments" in label:
+ version = float(download.pipeline.id)
+ if version not in table_version:
+ table_version[version] = []
+ table_version[version].append(download.url)
+ if version > highest_version:
+ highest_version = version
+
+ if not table_version:
+ print("Error: No taxonomic assignments for this study to download")
+ sys.exit(1)
+ else:
+ table_urls = table_version[highest_version]
+
+ # Get all available samples in one go and collect metadata
+ params = {
+ 'study_accession': study_accession,
+ 'page_size': study.samples_count,
+ }
+ fltr = Filter(urlencode(params))
+
+ metadata = {}
+ for sample in s.iterate('samples', fltr):
+ # TODO: how to access runs faster, sample.runs is too slow
+ #nruns += len(sample.runs)
+ metadata[sample.accession] = {}
+ for md in sample.sample_metadata:
+ metadata[sample.accession][md["key"]] = md["value"]
+ # Add sample description and name as metadata
+ metadata[sample.accession]['sample-desc'] = sample.sample_desc
+ metadata[sample.accession]['sample-name'] = sample.sample_name
+
+ # Get link sample accession, run accession
+ # TODO treat multiple runs per sample
+ run_sample_accession = {}
+ try:
+ for run in s.iterate('runs', fltr):
+ run_sample_accession[run.sample.id] = run.id
+ except:
+ print("Error: Could not retrieve run accession", sep="\t", end="\t")
+
+ # Write metadata
+ md_df = pd.DataFrame.from_dict(metadata).T
+ if run_sample_accession:
+ mapped_accessions = md_df.index.isin(run_sample_accession.keys())
+ if args.verbose:
+ print("MAPPED:" + str(sum(mapped_accessions)), sep="\t", end="\t")
+ md_df.index = md_df.index.map(lambda x: run_sample_accession[x] if x in run_sample_accession else x)
else:
- table_urls = table_version[highest_version]
-
- # Get all available samples in one go and collect metadata
- params = {
- 'study_accession': study_accession,
- 'page_size': study.samples_count,
- }
- fltr = Filter(urlencode(params))
-
- metadata = {}
- for sample in s.iterate('samples', fltr):
- # TODO: how to access runs faster, sample.runs is too slow
- #nruns += len(sample.runs)
- metadata[sample.accession] = {}
- for md in sample.sample_metadata:
- metadata[sample.accession][md["key"]] = md["value"]
- # Add sample description, name and name as metadata
- metadata[sample.accession]['sample-desc'] = sample.sample_desc
- metadata[sample.accession]['sample-name'] = sample.sample_name
-
- # Get link sample accession, run accession
- # TODO treat multiple runs per sample
- run_sample_accesion = {}
- try:
- for run in s.iterate('runs', fltr):
- run_sample_accesion[run.sample.id] = run.id
- except:
- print("Error: Could not retrieve run accession", sep="\t", end="\t")
-
-# Write metadata
-md_df = pd.DataFrame.from_dict(metadata).T
-if run_sample_accesion:
- mapped_accessions = md_df.index.isin(run_sample_accesion.keys())
- print("MAPPED:" + str(sum(mapped_accessions)), sep="\t", end="\t")
- md_df.index = md_df.index.map(lambda x: run_sample_accesion[x] if x in run_sample_accesion else x)
-else:
- print("Error: No mapping between accessions of samples and metadata", sep="\t", end="\t")
-
-print("METADATA:" + str(md_df.shape[1]), sep="\t", end="\t")
-md_df.to_csv(md_file, compression="gzip" if gz else None, sep="\t")
-
-# Read and write tables
-for table_url in table_urls:
- try:
- t = pd.read_table(table_url)
- print("OK:" + table_url, end=";")
- # Print original
- filename = prefix + "_" + os.path.basename(table_url)
- t.to_csv(filename if not gz else filename+".gz", compression="gzip" if gz else None, sep="\t", index=False)
- except:
- print("INVALID:" + table_url, end=";")
-
-print()
+ if args.verbose:
+ print("Warning: No mapping between accessions of samples and metadata", sep="\t", end="\t")
+
+ if args.verbose:
+ print("METADATA:" + str(md_df.shape[1]), sep="\t", end="\t")
+ md_df.to_csv(md_file, compression="gzip" if gz else None, sep="\t")
+
+ # Read and write tables
+ for table_url in table_urls:
+ try:
+ t = pd.read_table(table_url)
+ if args.verbose:
+ print("OK:" + table_url, end=";")
+ # Save the original table as downloaded
+ filename = prefix + "_" + os.path.basename(table_url)
+ t.to_csv(filename if not gz else filename + ".gz", compression="gzip" if gz else None, sep="\t", index=False)
+ except:
+ if args.verbose:
+ print("INVALID:" + table_url, end=";")
+
+ if args.verbose:
+ print()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/run_decontam.R b/scripts/run_decontam.R
index 7b7a714..ea0e0c2 100755
--- a/scripts/run_decontam.R
+++ b/scripts/run_decontam.R
@@ -95,7 +95,7 @@ count_matrix <- data.matrix(data.frame(count_table[,-1], row.names = rows_table,
# Load concentration table
if(!args$concentrations==""){
concentrations <- read.table(file=args$concentrations, sep='\t', header=FALSE, check.names=FALSE)
- concentrations_list <- concentrations[ , "V2"]
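+    # Keep only concentration rows (sample id in column V1) present in the count table rows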
+ concentrations_list <- concentrations[ (concentrations[, "V1"] %in% rows_table) , "V2"]
}
# Load list of controls
diff --git a/setup.py b/setup.py
index 8d8465f..4c37b11 100755
--- a/setup.py
+++ b/setup.py
@@ -13,7 +13,7 @@ def read(filename):
setup(
name="grimer",
- version="0.0.0",
+ version="1.0.0-alpha1",
url="https://www.github.com/pirovc/grimer",
license='MIT',
@@ -24,7 +24,7 @@ def read(filename):
long_description=read("README.md"),
packages=['grimer'],
- #install_requires=['binpacking==1.4.3'],
+ #install_requires=['bokeh==2.2.3','pandas','numpy','scipy','multitax'],
include_package_data=True,
package_data={
diff --git a/sources/contaminants/cc_eukaryota.yml b/sources/contaminants/cc_eukaryota.yml
deleted file mode 100644
index 854d5f9..0000000
--- a/sources/contaminants/cc_eukaryota.yml
+++ /dev/null
@@ -1,4 +0,0 @@
-"Common Eukaryotic contaminants":
- "PRJNA168":
- url: "https://www.ncbi.nlm.nih.gov/genome/guide/human/"
- ids: [9606]
diff --git a/sources/contaminants/cc_viruses.yml b/sources/contaminants/cc_viruses.yml
deleted file mode 100644
index 95fae88..0000000
--- a/sources/contaminants/cc_viruses.yml
+++ /dev/null
@@ -1,10 +0,0 @@
-"Common Viral contaminants":
- "2019 Asplund, M. et al.":
- url: "http://doi.org/10.1016/j.cmi.2019.04.028"
-    ids: [12071, 742919, 11103, 31647, 12461, 10298, 10376, 10359, 11676, 129951, 10583, 31552, 10798, 11908, 585044, 518981, 1225745, 11620, 1891767, 493803, 11033, 159150, 35306, 68887, 11870, 11958, 11861, 11946, 11864, 363745, 363020, 242521, 11866, 11960, 31668, 31669, 31670, 11867, 11955, 11874, 11876, 11878, 11885, 36381, 11886, 11888, 269447, 269448, 11950, 11948, 1332312, 354090, 11884, 1352534, 1395610, 1395611, 1395612, 1395613, 1395614, 1395615, 1395616, 1395617, 1395618, 1395619, 1395620, 1341019, 11801, 11809, 1511763, 1394983, 697906, 1072204, 1148801, 1574422, 12104, 763552, 10264, 85708, 759804, 28344, 85506, 33747, 10345, 285986, 220638, 1154691, 185638, 1169627, 1045778, 185636, 72201, 345198, 176652, 1301280, 68347, 1618248, 1618254, 10288, 198112, 1454023, 1454024, 1454025, 1278278, 1278246, 1278252, 1278247, 1278248, 1278249, 1278250, 1278251, 399781, 1278255, 346932, 1278261, 1278263, 1278265, 1474867, 1379694, 1521385, 1521387, 1521389, 938081, 938082, 880162, 251749, 455370, 169864, 1379788, 1608440, 642253, 642255, 1224510, 1592207, 1592212, 1592083, 1592085, 1592086, 1592088, 1592093, 1592095, 1592096, 1592081, 1843761, 1519405, 1557033, 1608451, 664785, 1435438, 1170653, 40979, 12235, 12138, 11987, 51680, 12056, 146500, 554168, 212035, 1269028, 693272, 1420594, 1094892, 1128140, 1235314, 1128143, 1128151, 1128131, 1450746, 1461100, 181522, 1424633, 1010698, 1299317, 1450749, 1416631, 1128422, 1034806, 1592112, 1592113, 1592127, 938080, 1074214, 1519385, 1519387, 1519389, 1519390, 1519395, 1519396, 1519397, 186617, 1262072, 1407671, 743583, 340016, 745107, 745102, 745100, 1416009, 1187128, 889876, 760732, 1243183, 1229760, 1481186, 1505225, 1560342, 233894, 115987, 260149, 227470, 926067, 1127514, 1296654, 294382, 1486657, 1084719, 10756, 1486662, 1285382, 1497851, 1127515, 145579, 263375, 764562, 1133292, 1133022, 242527, 260373, 279280, 644524, 242861, 1132026, 1357714, 1197951, 1327981, 1327976, 1327979, 1327992, 1328030, 1327990, 1327980, 1327972, 1327982, 1327995, 1327983, 1327970, 1327971, 756279, 1327977, 1327993, 1328029, 1327975, 1327974, 1327985, 756280, 756282, 1527524, 1540094, 1042123, 541865, 1567016, 765765, 1176422, 1327037, 1162295, 1141135, 1141136, 335924, 536444, 929832, 682650, 1137745, 536473, 749413, 1477406, 1048515, 1048516, 1048517, 1048520, 1048521, 1537091, 1264700, 1609634, 1455074, 414970, 10863, 10864, 1222338, 1147148, 1237364, 1414766, 1977402, 948870, 1524881, 10665, 10760, 1147094, 1429767, 925983, 925984, 1527519, 1527506, 1229753, 1540097, 1540098, 1054461, 1391223, 294631, 1325731, 908819, 1458858, 1458842, 90963, 1536592, 1527515, 551895, 1129191, 139872, 201847, 287412, 1262517, 754044, 1385658, 1176423, 889949, 446529, 1034128, 1056830, 1089119, 1486472, 1034111, 205879, 1340709, 1567475, 1472912, 1204539, 1399915, 1283076, 1283077, 1168479, 1168478, 440250, 400567, 994601, 1465639, 889956, 445700, 444862, 536454, 445688, 444861, 1229794, 1229793, 1229792, 1229791, 1229790, 1229789, 1229786, 1229787, 1229788, 1229784, 1229782, 376758, 1498188, 504501, 504553, 1235647, 1235648, 1235649, 1235650, 1235653, 1235654, 1235655, 1235656, 1235657, 877240, 754052, 1316739, 347326, 1235689, 31535, 757342, 582345, 1462581, 386793, 1204517, 347327, 1335230, 743813, 1348912, 1327964, 270673, 188350, 1541891, 169683, 998086, 1500757, 1458843, 1129146, 1279082, 1114179, 1548900, 1231048, 1548901, 1449437, 1548918, 1476390, 462590, 754048, 948071, 1481785, 1417599, 1131316, 691965, 136084, 754067, 1161935, 1173749, 1173761, 1173759, 1173762, 590739, 1406795, 1141134, 1204529, 1540099, 1168549, 866889, 1458859, 1458860, 1458861, 10761, 754060, 1524882, 1357423, 373126, 1150991, 1195080, 320843, 55510, 1434319, 320850, 369581, 537874, 1208587, 1566990, 10732, 490913, 1526550, 1340810, 756277, 753084, 753085, 756275, 1026955, 1340812, 238854, 555387, 754042, 444860, 981335, 469660, 215796, 1478972, 1385659, 926697, 336724, 278008, 1211417, 271647, 754075, 573173, 573174, 979525, 979534, 1529058, 1283071, 573176, 1589298, 1076759, 1461743, 1150989, 754058, 754051, 929835, 1414739, 754072, 1524880, 194802, 1168281, 1204514, 1188795, 331278]
- "2015 Mukherjee, S. et al.":
- url: "http://doi.org/10.1186/1944-3277-10-18"
- ids: [10847]
- "2015 Kjartansdóttir, K.R. et al.":
- url: "https://doi.org/10.1073/pnas.1423756112"
- ids: [322019]
diff --git a/sources/references/human-related.yml b/sources/references/human-related.yml
deleted file mode 100644
index 5106795..0000000
--- a/sources/references/human-related.yml
+++ /dev/null
@@ -1,13 +0,0 @@
-"Human-related bacterial isolates from BacDive":
- "Limbs":
- url: "https://bacdive.dsmz.de/search?search=taxid:{}"
- ids: [326522, 82380, 1701, 131110, 29388, 84698, 729, 69968, 28264, 200476, 58172, 490, 1280, 1290, 41276, 220685, 28449, 644, 1660, 147645, 1351, 90367, 391, 1717, 1720, 1314, 646, 29391, 732, 1365628, 90245, 495, 192066, 753, 1979962, 82633, 53363, 539, 37637, 37329, 755171, 29466, 291112, 614, 28090, 1402, 217204, 1509, 326523, 2014, 1303, 28038, 676, 105219, 13076, 66228, 504, 28189, 752, 108980, 1747, 1282, 1504, 33028, 303, 672, 28035, 1286, 485, 1311, 28188, 28132, 1328, 1506, 652, 29380, 760, 46124, 1379, 755172, 193461, 158822, 479, 68892, 479117, 33889, 670, 420404, 1305, 1697053, 71254, 310300, 47920, 669, 1245, 38289, 36740, 354351, 48296, 29318, 192, 38313, 180332, 135487, 33007, 287, 754, 29317, 1648, 1713, 1352, 550, 53437, 2054, 38284, 1667, 511, 1015, 40091, 59561, 411577, 587, 370622, 206506, 37326, 90239, 161902, 137732, 52132, 34105, 180588, 33968, 386414, 283734, 1891233, 478, 156979, 28125, 1529, 306, 123899, 220687, 620903, 1239307, 1348, 316, 28091, 178214, 84112, 44737, 487, 1536, 1273, 24, 630, 1034, 322095, 488730, 70348, 650, 43765, 43770, 39791, 115545, 150055, 411570, 196, 131111, 472, 38301, 51671, 292, 146827, 1785, 1977869, 40542, 29432, 28450, 1890675, 47312, 38875, 1710, 739, 47917, 33010, 1292, 169292, 158877, 1781, 400946, 501496, 488, 239, 361500, 470, 1430326, 29354, 82347, 65058, 714, 521520, 38303, 1513, 502790, 747, 1141657, 38304]
- "Ear":
- url: "https://bacdive.dsmz.de/search?search=taxid:{}"
- ids: [1747, 2702, 1869190, 32002, 85698, 131111, 29388, 28037, 44750, 51671, 1353, 28264, 545, 292, 89093, 1872515, 1280, 511, 29379, 68766, 59561, 29321, 480, 1311, 285091, 727, 199591, 43263, 1313, 739, 760, 1661, 52769, 1421, 1314, 156979, 35703, 1898, 585, 87883, 90245, 123899, 306, 1895474, 670, 47770, 319939, 184870, 134375, 72557, 753, 663, 316, 1343, 217203, 267212, 678, 53364, 1014, 1776741, 93220, 1639, 666, 38313, 1652, 105219, 38287, 293, 33007, 287]
- "Eye":
- url: "https://bacdive.dsmz.de/search?search=taxid:{}"
- ids: [40216, 28037, 1280, 490, 147645, 1351, 90367, 1824, 813, 1720, 38290, 29391, 732, 192066, 616, 161879, 753, 1304, 1655, 539, 37329, 28172, 161890, 90241, 504, 752, 253, 457921, 1871047, 1309, 154288, 280147, 485, 760, 46124, 1931, 1379, 29394, 1671023, 68892, 479, 1396, 1544416, 2035, 420404, 735, 47846, 666, 571, 2055, 1401, 1270, 34062, 545, 38284, 247, 498, 40091, 59561, 370622, 37326, 727, 945844, 1313, 180588, 1685, 1671022, 478, 1302, 134375, 477, 726, 47478, 197575, 207340, 38287, 650, 756689, 43765, 69392, 723, 72556, 187491, 472, 51671, 2047, 1177728, 46125, 29432, 480, 47312, 739, 134533, 740, 37330, 488, 1544413, 239, 483, 29354, 41202, 38304]
- "Nose":
- url: "https://bacdive.dsmz.de/search?search=taxid:{}"
- ids: [59823, 72556, 131111, 1282, 28264, 38284, 1280, 520, 43990, 615, 727, 1328, 1313, 90367, 760, 181487, 29394, 478, 732, 40324, 33889, 306, 39950, 1304, 1673725, 65058, 74319, 1591, 90241, 105219, 504, 286802, 195105, 574]
diff --git a/sources/references/skin.yml b/sources/references/skin.yml
deleted file mode 100644
index ccd9467..0000000
--- a/sources/references/skin.yml
+++ /dev/null
@@ -1,14 +0,0 @@
-"Human-related bacterial isolates from BacDive":
- "Skin/Nail/Hair":
- url: "https://bacdive.dsmz.de/search?search=taxid:{}"
- ids: [1747, 106654, 1986155, 59823, 1869190, 1270, 71999, 1283, 1276, 131110, 1648, 1656, 472, 1352, 34062, 729, 29388, 2047, 28264, 1314, 1280, 1290, 672, 59561, 1780, 33918, 37326, 29432, 1286, 1891644, 74703, 90367, 1931, 33010, 1720, 1965292, 181487, 169292, 38290, 29506, 1622, 281920, 1292, 1781, 861, 1698, 1260, 2035, 202789, 521392, 470, 663, 29382, 1659, 1288, 37923, 1655, 45254, 1753, 1261, 38289, 36740, 1273, 1347368, 33034, 1347369, 1282, 66228, 132933, 43765, 287]
-"Top organisms form the human skin microbiome" :
- "Bacteria":
- url: "https://doi.org/10.1038/nrmicro.2017.157"
- ids: [257758, 225324, 169292, 161879, 146827, 43765, 38304, 38287, 38286, 29466, 29388, 28037, 1747, 1305, 1303, 1290, 1282, 1270]
- "Eukarya":
- url: "https://doi.org/10.1038/nrmicro.2017.157"
- ids: [2510778, 1047171, 379413, 119676, 117179, 76777, 76775, 76773, 44058, 41880, 36894, 34391, 31312, 5480, 5068, 3074, 2762]
- "Viruses":
- url: "https://doi.org/10.1038/nrmicro.2017.157"
- ids: [185639, 746832, 10566, 493803, 10279, 746830, 746831, 46771]