No amount of high-end bioinformatics can compensate for poorly prepared samples and it is therefore imperative that careful attention is given to sample preparation and library generation within workflows, especially those involving multiple PCR steps.
- Quote from Murray et al. (2015), "From Benchtop to Desktop: Important Considerations when Designing Amplicon Sequencing Workflows" - https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0124671
Graphics taken from Ben Callahan's lectures at the STAMPS 2024 course at MBL: https://github.com/mblstamps/stamps2024/wiki#17
- Initial data QA/QC - interpretation of Phred scores, choosing a quality cutoff, merging criteria (for paired-end Illumina reads)
- Clustering or assembly of raw reads - what algorithm/pipeline to use? Using custom or default software parameters?
- QA/QC of clusters/contigs - Discard singletons? Discard clusters/contigs not meeting certain criteria (e.g. exclude all Amplicon Sequence Variants with <50 reads)? Assessment of contamination and/or filtering out potential contaminant sequences?
- Assigning taxonomy or gene function - What database to use? How to match sequences with taxonomy/functional information (sequence-based, ontology/hierarchy-based, k-mer or other method, etc.)?
- How to filter, categorize, and interpret taxonomy or functional information - Data mining for a specific species/gene/pathway? Broad exploration of data? Grouping into biologically meaningful categories? Employ statistical analyses, e.g. differential abundance across categories/genes/taxa?
- How to visualize patterns in your data - SO MANY DECISIONS, this is very hard. Use existing software packages with standard visualizations? More freeform exploration of data using Base R or your own scripts?
- How to create a scientific narrative through your choices + figures - this is the culmination of all the above decisions. Are you looking for a specific story (and taking a narrow path of analysis), or do you not know in advance what you should be looking for in your data (and analyzing/visualizing patterns as broadly as possible)? Probably a combination of both, in reality - looking for one story, but keeping your eye out for other patterns.
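
To make two of the decision points above more concrete (interpreting Phred scores, and dropping low-abundance ASVs), here is a minimal Python sketch. It is not taken from the lecture material. The function names are hypothetical, the quality string is assumed to use the Phred+33 ASCII encoding, and the 50-read cutoff simply mirrors the example in the list above rather than a recommended value.

```python
def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    Assumes Phred+33 encoding (ASCII code minus 33); pass a different
    offset if your files use another encoding.
    """
    return [ord(char) - offset for char in qual]


def expected_errors(qual, offset=33):
    """Sum the per-base error probabilities across one read."""
    return sum(phred_to_error_prob(q) for q in decode_quality_string(qual, offset))


def filter_low_abundance(asv_counts, min_reads=50):
    """Keep only ASVs with at least `min_reads` total reads.

    `asv_counts` is a hypothetical dict mapping an ASV identifier to its
    total read count across all samples.
    """
    return {asv: n for asv, n in asv_counts.items() if n >= min_reads}


# Q20 corresponds to a 1-in-100 chance of a wrong base call, Q30 to 1-in-1000.
q20_error = phred_to_error_prob(20)
q30_error = phred_to_error_prob(30)

# Made-up counts, for illustration only.
kept = filter_low_abundance({"ASV1": 120, "ASV2": 49, "ASV3": 50})
```

Summing error probabilities per read is one common way to turn per-base scores into a single number for choosing a cutoff. Whether to filter on that sum, on a minimum score, or on a mean is itself one of the decisions listed above.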
Take 5 minutes and silently brainstorm your most pressing questions on "Bioinformatics Decisions Points" - what are you struggling with in your own analyses, or what is one area where you need to learn more to succeed in your own research?
GDoc Link: https://docs.google.com/document/d/1Y1qFFzBRD6J7SEyZBxLT9_X9HR1bsxNmzCEiHLzBxLU/edit?usp=sharing
Get into pairs, and take 10 minutes to brainstorm a giant list of study metadata that will be relevant to your analysis - this should include things you already have in hand, and things you may need to get from other people (or results you are still waiting on from a lab analysis). This can be anything related to the environment, sample site, time / date / location of sample collection, contextual information, etc.
GDoc Link: https://docs.google.com/document/d/1Y1qFFzBRD6J7SEyZBxLT9_X9HR1bsxNmzCEiHLzBxLU/edit?usp=sharing
Take a peek into "metabarcoding-dataset" mapping and contextual info on GitHub
NOAA Omics data management guide: https://noaa-omics-dmg.readthedocs.io/en/latest/metadata-guidelines.html
Keemei: cloud-based validation of tabular bioinformatics file formats in Google Sheets. Rideout JR, Chase JH, Bolyen E, Ackermann G, González A, Knight R, Caporaso JG. GigaScience. 2016;5:27. http://dx.doi.org/10.1186/s13742-016-0133-6
Keemei Website: https://keemei.qiime2.org/
Keemei Demo metadata file: https://docs.google.com/spreadsheets/d/1_gE_jQcoYGld9aW_dTyE86zdmg1CkNIPHvVJ6CkYvKY/edit?gid=1402180572#gid=1402180572
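
Keemei runs its checks inside Google Sheets. For a quick local pass over a tab-separated metadata file, the sketch below shows the general kind of sanity check involved. The rules here are illustrative assumptions, not Keemei's actual validation rules: unique column names, unique sample IDs in the first column, no empty cells, and no stray whitespace. `validate_metadata` is a hypothetical helper, and the column names in the example are made up.

```python
import csv
import io


def validate_metadata(tsv_text):
    """Return a list of human-readable problems found in a TSV metadata table.

    Assumes the first row is a header and the first column holds sample IDs.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ["file is empty"]

    header, body = rows[0], rows[1:]
    problems = []

    # Column names should be unique so each variable is unambiguous.
    seen_columns = set()
    for name in header:
        if name in seen_columns:
            problems.append(f"duplicate column name: {name!r}")
        seen_columns.add(name)

    seen_ids = set()
    for line_no, row in enumerate(body, start=2):  # line 1 is the header
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue

        # Sample IDs (first column) should not repeat.
        sample_id = row[0]
        if sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate sample ID {sample_id!r}")
        seen_ids.add(sample_id)

        # Flag empty cells and leading/trailing whitespace.
        for name, value in zip(header, row):
            if value == "":
                problems.append(f"line {line_no}: empty value in column {name!r}")
            elif value != value.strip():
                problems.append(
                    f"line {line_no}: leading/trailing whitespace in column {name!r}"
                )

    return problems


# Made-up example table with three deliberate problems.
example = "sample_id\tsite\nS1\tA\nS1\t\nS2\t B\n"
for problem in validate_metadata(example):
    print(problem)
```

Returning a list of problems, rather than stopping at the first one, lets you fix a whole spreadsheet in one pass before handing it to a pipeline.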