01-fastqc.Rmd

---
title: "Sample input and sequencing report"
subtitle: "Sample input and FASTQ quality control metrics, NCCT - Univesitätsklinikum Tübingen"
output:
  html_document:
    highlight: tango
    theme: cosmo
    toc: no
    #css: reports.css
    #toc_float: yes
params:
  project:
    label: "Project title"
    value: ""
    input: text
  date:
    label: "Date"
    value: !r Sys.Date()
    input: date
  author:
    label: "Author"
    value: ""
    input: text
  samplesheet:
    label: "Select QC sample sheet if available (use provided excel template)"
    value:
    input: file
  fastq_dir:
    label: "Path to folder with fastq files (required, absolute path or relative to current folder)"
    value: ""
    input: text
  fastq_pattern:
    label: "Regex to capture fastq files (and obtain sample names)"
    value: "_R(1|2)_001.fast(q|q.gz)$"
  library_prep:
    label: "Library prep kit"
    choices: ["Zymo-Seq RiboFree Total RNA", "NEBNext Ultra II RNA", "Swift", "Illumina Nextera DNA Flex", "Illumina DNA TruSeq", "Zymo 16S", "Custom amplicon"]
    value: "Illumina Nextera DNA Flex"
    input: select
  sequencer:
    label: "Sequencer"
    choices: ["MiSeq", "NextSeq", "iSeq", "NovaSeq"]
    value: "NextSeq"
    input: select
  fastp_trimming:
    label: "Do you want to produce fastp-trimmed reads?"
    choices: ["Yes", "No"]
    value: "No"
    input: radio
---

<style>
div.blue { background-color:#e6f0ff;}
</style>

<div style="position:absolute;top:0px;right:0px;padding:0px;background-color:white;width:18%;">
```{r, echo=FALSE}

knitr::include_graphics("img/NCCT_Logo_RGB.png")
mybluecolor <- "#e6f0ff"

```
</div>

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = FALSE, 
                      echo = FALSE, 
                      warning = FALSE, 
                      cache = FALSE)

# add the correct bash $PATH to the R session, depending on the system R may not have the correct $PATH
# touch ~/.Renviron 
# R_PATH="PATH=$PATH"  
# echo $R_PATH >  ~/.Renviron

# IMPROVE SOURCING WITH here()
source("bin/check_install.R")
source("bin/process_samplesheet.R")
source("bin/seqkit_stats.R") # for general stats and gc stats
source("bin/seqkit_fx2tab.R")
source("bin/fastp_stats.R") # 

#-----------------------------------------
# Define si_fmt, which uses si_format.sh to print suffixes to large numbers
si_fmt <- function(x) {
  system2("bin/si_format.sh", x, stdout = TRUE)
}

# ----------------------------------------
# Check required packages and install
#-----------------------------------------
# CRAN packages
pckgs <- c("yaml", "dplyr", "purrr", "tidyr", "readr", "stringr", "DT", "parallel",
           "kableExtra", "readxl", "data.table", "heatmaply", "sessioninfo")

# Bioconductor packages
# bc_pckgs <- c("qckitfastq")

check_install(pckgs)
# check_install(bc_pckgs, repo = "Bioconductor")

# ----------------------------------------


options(width = 121)
#important to use normalizePath, so that fastq dir anywhere can be processed
fastqdir <- normalizePath(params$fastq_dir)

header_table <- data.frame(Parameter = c("Project",
                                         "Author",
                                         "Date",
                                         "FASTQ folder",
                                         "Library prep kit",
                                         "Sequencing machine"),
                           Value = c(params$project,
                                     params$author,
                                     as.character(params$date),
                                     fastqdir,
                                     params$library_prep,
                                     params$sequencer))

#---------------------------------------------------
# define output directories and fastq files


fastqfiles <- list.files(fastqdir, pattern = params$fastq_pattern, full.names = TRUE)
# make a named vector, this is where the read names come from later in the mcmapply calls!!!
names(fastqfiles) <- basename(fastqfiles)

# determine if the dataset is PE or SE, get for and rev files

 for_files <- fastqfiles[str_detect(fastqfiles, "_R1")] # if SE, then these are the same as fastqfiles
 rev_files <- fastqfiles[str_detect(fastqfiles, "_R2")]
 
# define results dir, where everything goes
  resultsdir <- file.path(getwd(), "01-fastqc-results")
  if (dir.exists(resultsdir)) {
    unlink(resultsdir, recursive = TRUE, force = TRUE)
  }
  dir.create(resultsdir)

# stop early if no fastq files found
  if (length(fastqfiles) == 0 | length(for_files) == 0) {
        #
        stop("No fastq files found in supplied directory")
  }

# determine plot height for the next figures: 20px per sample?
myfig.height <- if_else(length(fastqfiles) > 17, true = length(fastqfiles)*3, false = 10)

```
*Report generated on `r Sys.time()` by `r Sys.info()[8]` on `r Sys.info()[4]`*


```{r header_table, include=TRUE}

kable(header_table) %>%
  kable_styling(bootstrap_options = c("condensed", "hover"),
                full_width = T,
                position = "left") %>%
  column_spec(2, bold = T) %>%
  column_spec(1:2, background = "#e6f0ff") %>%
  row_spec(0, color = "white")
```


***

### Description 

This report includes data about the sample input QC as well as some sequencing metrics. The fastp report for each fastq file as well as the trimmed reads (`fastp` with default parameters) and all raw data used in the plots, are available in the  `01-fastq-results` folder.   

***

### Sample input quality control


<details>
  <summary>Show table with sample measurements at NCCT</summary>

```{r sample_form, include=TRUE}
# readsamplesheet and formatsamplesheet are defined in bin/process_samplesheet.R

  if (length(params$samplesheet) == 0) {
    paste("No QC sample sheet provided")
  } else {
    samplesheet <- readsamplesheet(params$samplesheet)
    formatsamplesheet(samplesheet)
  }
```
</details>

***

### Number of reads and read quality


```{r seqkit_stats}

seqkit_stats_data <- seqkit_stats(fastqfiles)
total_reads <- sum(seqkit_stats_data$num_seqs,na.rm = T) %>% si_fmt()
total_bases <- sum(seqkit_stats_data$sum_len, na.rm = T) %>% si_fmt()

 
 # determine how fastp is run later, determine dataset type
 if(length(for_files) == length(fastqfiles)) {
   fastp_mode <- "se"
   fastq_dataset <- paste("SE ", "(1 x ", round(mean(seqkit_stats_data$avg_len), digits = 0), ")", sep = "")
 } else if(length(for_files) == length(rev_files)) {
   fastp_mode <- "pe"
   fastq_dataset <- paste("PE ", "(2 x ", round(mean(seqkit_stats_data$avg_len), digits = 0), ")", sep = "")
 } else {
   stop("Data seems PE but different number of R1 and R2 files")
 }
#---------------------------------------------------------------
 

# write this table
readr::write_delim(seqkit_stats_data, 
                   path = file.path(resultsdir, "seqkit_stats.tsv"), 
                   delim = "\t")

```

<div class = "blue">

Dataset type: **`r fastq_dataset`**

FASTQ files: **`r length(fastqfiles)`** 

Total reads: **`r total_reads`**

Total bases: **`r total_bases`**

</div>

A copy of the table (as a tab-delimited file) is also available under ` `r paste(basename(resultsdir), "/basic_stats.tsv", sep ="")` `.


```{r seqkit_stats_table, include=TRUE}
# rewrite this for seqkit stats

# prepare a suitable header and select list depending on the type of data: ont or ilmn
basic_stats_header <- paste("Number of reads and quality metrics")

  
seqkit_stats_data %>% 
  #dplyr::mutate(num_sequences = si_fmt(num_seqs), num_bases = si_fmt(sum_len)) %>%
  dplyr::select(c(file, num_seqs, sum_len, sum_gap, "Q20(%)", "Q30(%)")) %>%
  DT::datatable(filter = "top",
                 caption = basic_stats_header,
                 extensions = c('Scroller', 'Buttons'),
                 options = list(dom= "Btp", deferRender = TRUE,
                                scrollY = 600, scroller = TRUE, buttons = c('copy', 'csv', 'excel')
                                #columnDefs = list(list(visible = FALSE, targets = c(2, 4))) # hide cols in DT
                                ),
                 style = 'bootstrap',
                 class = 'table-hover table-condensed') %>%
  # formatStyle('num_sequences', valueColumns = 'num_seqs', 
  #             background = styleColorBar(data = c(0, max(seqkit_stats_data$num_seqs)), "lightgreen")) %>%
  # formatStyle('num_bases', valueColumns = 'sum_len', 
  #             background = styleColorBar(data = c(0, max(seqkit_stats_data$sum_len)), "lightgreen")) %>%
  formatRound('num_seqs', digits = 0, mark = ",") %>%
  formatStyle('num_seqs', 
              background = styleColorBar(data = c(0, max(seqkit_stats_data$num_seqs)), mybluecolor)
              ) %>%
  formatRound('sum_len', digits = 0, mark = ",") %>%
  formatStyle(c("Q20(%)", "Q30(%)"), color = styleInterval(c(0.8, 0.9), c("red", "orange", "green")))

```

***


### `fastp` filtering
All fastq reads are processed with `fastp`, the individual report files can be found in the ` `r paste(basename(resultsdir), "/fastp_reports", sep ="")` ` directory. The values in the table below are per sample, e.g. for both R1 and R2 together in case of PE data.

```{r fastp_exec}

#----------------------------------------------------------------
# run fast_se (fast_pe) to generate list with all data
# with parallel::mclapply, use named list for fastq filenames
# no plotting here
# the fastp_data list contains all info for duplication, per-cycle content etc

# determine if trimming will be done
if (params$fastp_trimming == "Yes") {
    save_trimmed <- TRUE
} else {
    save_trimmed <- FALSE
}


# SE case, fastq_dataset is defined in the setup section
#IMPROVE MAPPLY FOR MANY PROC

if(fastp_mode == "se") {
    fastp_data <- parallel::mcmapply(fastp_se, fastqfiles, 
                                     SIMPLIFY = FALSE, 
                                     MoreArgs = list(save_trimmed = save_trimmed))

} else if (fastp_mode == "pe") {
    fastp_data <- parallel::mcmapply(fastp_pe, for_files, rev_files, 
                                     SIMPLIFY = FALSE, 
                                     MoreArgs = list(save_trimmed = save_trimmed))

} else {
    stop("something went wrong with fastp, check your fastq files!")
}

# move the data generated from fastp to results
# html files to move, explicit to avoid moving something else
htmlfiles <- paste(basename(fastqfiles), ".html", sep = "")
jsonfiles <- paste(basename(fastqfiles), ".json", sep = "")
trimmedfiles <- paste("trim_", basename(fastqfiles), sep = "")

dir.create("01-fastqc-results/fastp_reports")

file.rename(from = htmlfiles, to = paste("01-fastqc-results/fastp_reports/", htmlfiles, sep = ""))
file.rename(from = jsonfiles, to = paste("01-fastqc-results/fastp_reports/", jsonfiles, sep = ""))
if(isTRUE(save_trimmed)) {
  dir.create("01-fastqc-results/trimmed-reads/")
  file.rename(from = trimmedfiles, to = paste("01-fastqc-results/trimmed-reads/", trimmedfiles, sep = ""))
}

```


```{r fastp_filter_table, include=TRUE}

fastp_filter_stats <- map_df(fastp_data, `[[`, "filt", .id = "sample") %>% # good, eh?
  mutate(sample = basename(str_remove(sample, params$fastq_pattern))) %>%  
  rename_at(vars(ends_with("_reads")), 
              ~str_remove(., "_reads")
              ) 

readr::write_delim(fastp_filter_stats, 
                   path = file.path(resultsdir, "fastp_filter_stats.tsv"), 
                   delim = "\t")

fastp_filter_stats %>%
  #dplyr::mutate(reads_passed_filter = si_fmt(passed_filter)) %>%
  DT::datatable(filter = "top",
                caption = paste("Results of fastp filtering with default parameters"),
                extensions = c('Scroller', 'Buttons'),
                options = list(dom= "Btp", deferRender = TRUE,
                               scrollY = 600,
                               scroller = TRUE, 
                               buttons = c('copy', 'csv', 'excel') 
                               #columnDefs= list(list(visible = FALSE, targets = 2))
                               ),
                style = 'bootstrap',
                class = 'table-hover table-condensed') %>%
  formatRound('passed_filter', digits = 0, mark = ",") %>%
  formatStyle('passed_filter', 
              background = styleColorBar(data = c(0, max(fastp_filter_stats$passed_filter)), mybluecolor)
              )
  

```

### Read duplication

The duplication level is shown as percent reads at each duplication level (the first 20 levels shown).

```{r read_dup, include=TRUE, out.width="100%"}
# get data as matrix from fastp_data, to construct a heatmap
fastp_dup <- map_df(fastp_data, `[[`, "dup") %>% 
  head(20) %>%
  as.matrix() %>%
  prop.table(margin = 2) %>%
  t()

rownames(fastp_dup) <- rownames(fastp_dup) %>% basename() %>% str_remove(params$fastq_pattern)
heatmaply(fastp_dup*100, 
          Rowv = NA, Colv = NA, 
          colors = heat.colors(100, rev = TRUE), 
          grid_gap = FALSE, dynamicTicks = TRUE, hide_colorbar = TRUE)

```


### GC-content of reads

The GC-content of the reads in a sample is calculated by obtaining the GC-content of every read in the sample (`seqkit`), and applying the `density()` function in `R` to obtain the density values for 100 bins (from 1 to 100 °C, 1°C intervals). While some spreading of the GC-content within a sample can be expected, samples from the same organism should have a very similar GC-profile. The raw GC content data used for the plots can be found in the ` `r basename(resultsdir)` ` folder.


```{r gc_content, include=TRUE, out.width="100%"}
# get the list object from seqkit_gc
seqkit_gc_data <- parallel::mclapply(for_files, seqkit_gc)

gc_matrix <- map_df(seqkit_gc_data, `[[`, "y") %>% t()
rownames(gc_matrix) <- rownames(gc_matrix) %>% basename() %>% str_remove(params$fastq_pattern)

gc_matrix %>% 
  heatmaply(Rowv = NA, Colv = NA, 
            colors = heat.colors(100, rev = TRUE), 
            grid_gap = FALSE, dynamicTicks = TRUE, hide_colorbar = TRUE)
```


***

### Software versions

```{r, include=TRUE}

sessioninfo::package_info(pckgs, dependencies = FALSE)

```