Some fixes after testing by Alice
Pigrenok committed Jul 23, 2024
1 parent 5d27851 commit d7da5a6
Showing 3 changed files with 19 additions and 16 deletions.
3 changes: 3 additions & 0 deletions current-year/practicals/population_genetics/pt-4-map-call.md
@@ -237,6 +237,9 @@ reference genome `/shared/data/popgen/reference.fa` and the alignments
from the mapping part of the practical (both `.bam` and `.bai` files) to your
`input` directory.

> **_Note:_**
> When you create links from one directory to another, it is better to use an absolute path for the link targets, such as `~/2023-10-02-mapping/results/*.bam*`, rather than a relative path such as `../../2023-10-02-mapping/results/*.bam`.
> **_Note:_**
> Remember to keep your commands in the `WHATIDID.txt` file.
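
The advice about absolute paths can be illustrated with a small sandbox. This is only a sketch: the temporary directory, the `2023-10-03-calling` project name, the `input_relative` directory and the `sample1` file names are all hypothetical stand-ins, not part of the practical. It shows that a relative target is resolved from the directory that holds the link, not from the directory where `ln` was typed, so a path that tab-completes correctly can still produce a dangling link.

```shell
# Hypothetical sandbox so that nothing in your real home directory is touched.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/2023-10-02-mapping/results" "$sandbox/2023-10-03-calling/input"
touch "$sandbox/2023-10-02-mapping/results/sample1.bam" \
      "$sandbox/2023-10-02-mapping/results/sample1.bam.bai"

cd "$sandbox/2023-10-03-calling"

# Absolute targets: the links resolve wherever they are stored.
ln -s "$sandbox"/2023-10-02-mapping/results/*.bam* input/

# Relative target typed from the project directory: the path is valid here,
# but it is stored as-is and later resolved relative to input_relative/,
# so the resulting link is dangling.
mkdir input_relative
ln -s ../2023-10-02-mapping/results/sample1.bam input_relative/

# Dangling links still appear in the listing; only reading them fails.
ls -l input input_relative
```

In the practical itself, `~` plays the role of `$sandbox`, because the shell expands `~` to the absolute path of your home directory.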
28 changes: 14 additions & 14 deletions current-year/practicals/reference_genome/pt-1-read-cleaning.md
@@ -8,11 +8,11 @@ post_url: pt-1-read-cleaning
# Part 1: Reads to reference genome and gene predictions
## 1. Introduction

[Cheap sequencing](http://www.genome.gov/sequencingcosts/) has created the opportunity to perform molecular-genetic analyses on just about anything. Traditional genetic model organisms benefit from years of efforts by expert genome assemblers, gene predictors, and curators. They have created most of the prerequisites for genomic analyses. In contrast, genomic resources are much more limited for those working on "emerging" model organisms or other species. These new organisms include most crops, animals and plant pest species, many pathogens, and major models for ecology & evolution.
[Cheap sequencing](https://www.genome.gov/sequencingcosts/) has created the opportunity to perform molecular-genetic analyses on just about anything. Traditional genetic model organisms benefit from years of efforts by expert genome assemblers, gene predictors, and curators. They have created most of the prerequisites for genomic analyses. In contrast, genomic resources are much more limited for those working on "emerging" model organisms or other species. These new organisms include most crops, animals and plant pest species, many pathogens, and major models for ecology & evolution.

The steps below are meant to provide some ideas that can help you obtain a reference
genome and a reference geneset of sufficient quality for many analyses. They are based on (and updated from) work we did for
the [fire ant genome](http://www.pnas.org/content/108/14/5679.long "The genome of the fire ant Solenopsis invicta")[1].
the [fire ant genome](https://www.pnas.org/content/108/14/5679.long "The genome of the fire ant Solenopsis invicta")[1].

The dataset that you will use represents ~0.5% of the fire ant genome. This enables us to perform a toy/sandbox version of all analyses within a much shorter amount of time than would normally be required. For real projects, much more sophisticated approaches are needed!

@@ -36,11 +36,11 @@ to the next section. If that one is available and you see it's help screen, we'l

### Set up directory hierarchy to work in

Start by creating a directory to work in. Drawing on ideas from _[Noble (2009)](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424 "A Quick Guide to Organizing Computational Biology Projects")[2]_
Start by creating a directory to work in. Drawing on ideas from _[Noble (2009)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424 "A Quick Guide to Organizing Computational Biology Projects")[2]_
and others, we recommend following a specific directory convention for all your
projects. The details of the convention that we will use in this practical can
be found
[here](http://github.com/wurmlab/templates/blob/master/project_structures.md "Typical multi-day project structure").
[here](https://github.com/wurmlab/templates/blob/master/project_structures.md "Typical multi-day project structure").

For the purpose of these practicals we will use a slightly simplified version of
the directory structure explained above.
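
A simplified project layout of this kind could be created as sketched below. The subdirectory names (`input`, `tmp`, `results`) and the `WHATIDID.txt` log are assumed from the linked convention, and `project_root` is a hypothetical temporary location used here only so that the example is safe to run anywhere; in the practical the project directory would sit in your home directory instead.

```shell
# Hypothetical root; in the practical this would be your home directory (~).
project_root=$(mktemp -d)
project="$project_root/2023-09-26-read_cleaning"

# Assumed simplified layout: raw inputs, intermediate files and final results.
mkdir -p "$project/input" "$project/tmp" "$project/results"

# Plain-text log in which the commands you run are recorded.
touch "$project/WHATIDID.txt"

ls "$project"
```

Keeping inputs, intermediate files and results apart makes it easier to delete and regenerate `tmp` without risking the raw data.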
@@ -102,8 +102,8 @@ In this practical, we will work with paired ends short read sequences from an Il

However, sequencers aren't perfect. Several problems may affect the quality of
the reads. You can find some examples
[here](http://genomecuration.github.io/genometrain/a-experimental-design/curated-collection/Presentations/Sequencing%20Troubleshooting.pptx)
and [here](http://sequencing.qcfail.com/). Also, as you may already know,
[here](https://genomecuration.github.io/genometrain/a-experimental-design/curated-collection/Presentations/Sequencing%20Troubleshooting.pptx)
and [here](https://sequencing.qcfail.com/). Also, as you may already know,
"*garbage in – garbage out*", which means that reads should be cleaned before
performing any analysis.

@@ -147,8 +147,8 @@ The structure of your directory should look like this (use the command `tree`):

Now, you can start evaluating the quality of the reads `reads.pe1.fastq.gz` and
`reads.pe2.fastq.gz`. To do so, we will use
[*FastQC*](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
([documentation](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/)).
[*FastQC*](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
([documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/)).
*FastQC* is a software tool that helps visualise characteristics of a sequencing run.
It can thus inform your read-cleaning strategy.

@@ -179,7 +179,7 @@ command (be aware of your current working directory using the command `pwd`):
tree ~/2023-09-26-read_cleaning
```

Your [resulting directory structure](http://github.com/wurmlab/templates/blob/master/project_structures.md "Typical multi-day project structure")
Your [resulting directory structure](https://github.com/wurmlab/templates/blob/master/project_structures.md "Typical multi-day project structure")
(`~/2023-09-26-read_cleaning`), should look like this:

```bash
@@ -208,7 +208,7 @@ that, click on one of the links corresponding to the reports files.
> **_Question:_**
> What does the *FastQC* report tell you?
> If in doubt, check the documentation
> [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/)
> [here](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/)
> and what the quality scores mean
> [here](https://learn.gencore.bio.nyu.edu/ngs-file-formats/quality-scores/).
@@ -235,9 +235,9 @@ In the following sections, we will perform two cleaning steps:
* K-mer filtering using *kmc3*.
* Removing sequences that are of low quality or too short using cutadapt.

Other tools, including [*fastx_toolkit*](http://github.com/agordon/fastx_toolkit),
Other tools, including [*fastx_toolkit*](https://github.com/agordon/fastx_toolkit),
[*BBTools*](https://jgi.doe.gov/data-and-tools/bbtools/), and
[*Trimmomatic*](http://www.usadellab.org/cms/index.php?page=trimmomatic) can
[*Trimmomatic*](https://www.usadellab.org/cms/index.php?page=trimmomatic) can
also be useful, **but we won't use them now**.

### Read trimming
@@ -319,12 +319,12 @@ coverage (up to 10,000). These could be viruses or other pathogens, or highly re
> algorithms. Eliminating them can reduce subsequent memory, disk space and CPU
> requirements considerably.
Below, we use [*kmc3*](http://github.com/refresh-bio/KMC) to "mask" extremely
Below, we use [*kmc3*](https://github.com/refresh-bio/KMC) to "mask" extremely
rare k-mers (i.e., convert each base in the sequences corresponding to rare
k-mers into **N**). In this way, we will ignore these bases (those called **N**)
because they are not really present in the species. Multiple alternative
approaches for k-mer filtering exist (e.g., using
[*khmer*](http://github.com/ged-lab/khmer)).
[*khmer*](https://github.com/ged-lab/khmer)).

Here, we use *kmc3* to estimate the coverage of k-mers with a size of 21
nucleotides. When the masked k-mers are located at the end of the reads, we trim
4 changes: 2 additions & 2 deletions current-year/practicals/reference_genome/pt-2-assembly.md
@@ -13,7 +13,7 @@ before starting this practical.
# 1. Brief assembly example / concepts

Many different pieces of software exist for genome assembly. We will be using
[*SPAdes*](https://cab.spbu.ru/software/spades/).
[*SPAdes*](https://github.com/ablab/spades).

Following the same procedure described in Section 1.2 of
[Part 1: Read cleaning](pt-1-read-cleaning.html), create a new main directory
@@ -141,7 +141,7 @@ For example:
of each of the "house-keeping" genes found in related species. This is
applied in *BUSCO* (Benchmarking Universal Single-Copy Orthologs).
Note that:
* *BUSCO* is a refined, modernized implementation of [*CEGMA*]("http://korflab.ucdavis.edu/Datasets/cegma/")
* *BUSCO* is a refined, modernized implementation of [*CEGMA*](http://korflab.ucdavis.edu/Datasets/cegma/)
(Core Eukaryotic Genes Mapping Approach). *CEGMA* examines a eukaryotic
genome assembly for presence and completeness of 248 "core eukaryotic genes".
* *Quast* also includes a "quick and dirty" method of finding genes.
