Update/locidex #96

Merged
merged 61 commits into dev from update/locidex on Aug 23, 2024
Changes from 16 commits
Commits (61)
c342ef4
starting locidex database integrations
mattheww95 Jul 29, 2024
4eff5ce
updated added new files
mattheww95 Jul 29, 2024
b06371b
updated locidex db identification
mattheww95 Jul 29, 2024
e711e24
updated automated allele selection
mattheww95 Jul 29, 2024
df4467f
updated todos
mattheww95 Jul 29, 2024
caba99c
added test data for locidex databases, tests should still function:
mattheww95 Jul 30, 2024
c5659dd
updated nextflow schema.json
mattheww95 Jul 30, 2024
b172dcb
Removed tests for comparing updates for old and new locidex profiles
mattheww95 Jul 31, 2024
41c1893
removed dead test file
mattheww95 Jul 31, 2024
a32dd11
updated test profile
mattheww95 Jul 31, 2024
6b4d283
restrucutured file staging for locidex select
mattheww95 Aug 1, 2024
fd6a1a6
started creation of real tests for locidex workflow
mattheww95 Aug 1, 2024
120b148
updated locidex test data
mattheww95 Aug 1, 2024
37b7a46
updated locidex workflow and tests
mattheww95 Aug 1, 2024
384fbce
updated docker container
mattheww95 Aug 1, 2024
dccd27f
addressed PR issues, still more work to be done
mattheww95 Aug 2, 2024
2fb4ac9
updated database selection functions to get an optimal match
mattheww95 Aug 2, 2024
1d43ecb
added skipped allele calling to test config
mattheww95 Aug 2, 2024
19fa250
updated test conditions
mattheww95 Aug 6, 2024
21681c2
updated test files
mattheww95 Aug 6, 2024
3cd4a28
updated locidex select paths
mattheww95 Aug 7, 2024
607c1c1
removed trailing whitespace
mattheww95 Aug 7, 2024
dd5c992
updated tests to include passing cases
mattheww95 Aug 7, 2024
0dfe87f
added locidex summary process
mattheww95 Aug 7, 2024
bae1567
updated fixe issues with integer sizing
mattheww95 Aug 7, 2024
4d2c0a4
debugging summary test
mattheww95 Aug 7, 2024
0d08952
Added field for reportable alleles
mattheww95 Aug 8, 2024
b925664
updated reportable loci test
mattheww95 Aug 8, 2024
0542162
updated locidex summary tests
mattheww95 Aug 8, 2024
ca570d4
updated missing allelels JSON type
mattheww95 Aug 8, 2024
f59d1a4
upated not database selected output
mattheww95 Aug 8, 2024
b0e8f7b
updated IRIDANEXT config for locidex values
mattheww95 Aug 8, 2024
b58c7bd
Merge branch 'dev' of github.com:phac-nml/mikrokondo into update/locidex
mattheww95 Aug 8, 2024
0e5daff
updated locidex end-to-end tests
mattheww95 Aug 8, 2024
6497afa
updated locations of locidex test
mattheww95 Aug 8, 2024
c4d45c6
Merge branch 'dev' of github.com:phac-nml/mikrokondo into update/locidex
mattheww95 Aug 8, 2024
dba1ca2
zipped and updated sample datasets
mattheww95 Aug 8, 2024
2704c5e
Merge branch 'dev' of github.com:phac-nml/mikrokondo into update/locidex
mattheww95 Aug 8, 2024
8b5d169
upated locidex end to end tests
mattheww95 Aug 8, 2024
b63cf7e
fixed typos in change log
mattheww95 Aug 8, 2024
d070156
updated locidex tests
mattheww95 Aug 9, 2024
d07257e
cleaned up datebase name parsing
mattheww95 Aug 9, 2024
8669f14
fixed code comment
mattheww95 Aug 12, 2024
4ca519b
updated code comments
mattheww95 Aug 12, 2024
4883b39
made updates on PR comments
mattheww95 Aug 12, 2024
673349a
normalized output for iridanext across allele scheme input options
mattheww95 Aug 12, 2024
e3d3c8a
updated test cases
mattheww95 Aug 13, 2024
d97e6ce
updated locidex_select tests to match updated interface
mattheww95 Aug 13, 2024
a341fc6
updated test to work in github actions
mattheww95 Aug 13, 2024
e938535
removed sneaky todo
mattheww95 Aug 13, 2024
215a70b
updating issues identified and regression in kraken2 header parsing
mattheww95 Aug 19, 2024
b9d1b04
reverted zipped file handling
mattheww95 Aug 19, 2024
9abcd03
fixed regression locidex select
mattheww95 Aug 20, 2024
2a4c9ba
removed repeated variables
mattheww95 Aug 20, 2024
0e60b8d
updated staging of report input to locidex summarize
mattheww95 Aug 20, 2024
dba0190
updated file parsing in locidex summarize
mattheww95 Aug 20, 2024
5d10389
updated changelog
mattheww95 Aug 20, 2024
32327f0
extracted inline code in function
mattheww95 Aug 20, 2024
814e45e
bumped locidex container
mattheww95 Aug 21, 2024
034d9c8
updated final todos
mattheww95 Aug 22, 2024
de0d30e
updated options to locidex_select test to terminate processes on failure
mattheww95 Aug 23, 2024
11 changes: 10 additions & 1 deletion conf/modules.config
@@ -159,6 +159,16 @@ process {
errorStrategy = "ignore"
}

withName: LOCIDEX_SELECT {
executor = 'local'
publishDir = [
mode: params.publish_dir_mode,
path: { ["${task.assembly_subtyping_directory_name}", "Locidex", "Select"].join(File.separator) },
pattern: "*.json",
saveAs: { filename -> filename.equals('versions.yml') ? null : reformat_output(filename, null, "locidex.db", meta) }
]
}

withName: REPORT_AGGREGATE {
ext.parameters = params.python3
cache = 'false' // Resume does not work on module, if enabled a warning is thrown
@@ -178,7 +188,6 @@ process {
]
}


withName: BIN_KRAKEN2 {
ext.parameters = params.python3
maxForks = 20
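For orientation, the LOCIDEX_SELECT publishDir path above is built by joining directory parts with the platform separator. A minimal sketch of how that closure resolves, with the subtyping directory name assumed purely for illustration:

    // Minimal sketch of the path closure above; the directory name is a hypothetical value.
    def assembly_subtyping_directory_name = "SubtypingReport"
    def selectDir = [assembly_subtyping_directory_name, "Locidex", "Select"].join(File.separator)
    assert selectDir == "SubtypingReport/Locidex/Select"   // on POSIX systems, where File.separator is "/"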
2 changes: 1 addition & 1 deletion conf/test.config
@@ -33,7 +33,7 @@ params {
r_contaminants.mega_mm2_idx = dehosting_idx
kraken2_db = "${projectDir}/tests/data/kraken2/test"
kraken.db = kraken2_db

locidex.allele_database = "${projectDir}/tests/data/databases/locidex_dbs"
fastp.args.illumina = "-Q"
min_reads = 100

6 changes: 3 additions & 3 deletions modules/local/locidex_report.nf
@@ -24,13 +24,13 @@ process LOCIDEX_REPORT {
fi
locidex report -i $seq_store_name -o . --name ${meta.id} \\
--mode ${params.locidex.report_mode} \\
--prop ${params.locidex.report_prop} \\
--max_ambig ${params.locidex.report_max_ambig} \\
--max_stop ${params.locidex.report_max_stop} \\
--prop ${params.locidex.report_prop} \\
--force

gzip -c profile.json > $output_name
rm profile.json
gzip -c report.json > $output_name
rm report.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
3 changes: 1 addition & 2 deletions modules/local/locidex_search.nf
@@ -8,7 +8,6 @@ process LOCIDEX_SEARCH {
label "process_medium"
container "${workflow.containerEngine == 'singularity' || workflow.containerEngine == 'apptainer' ? task.ext.parameters.get('singularity') : task.ext.parameters.get('docker')}"


input:
tuple val(meta), path(fasta), path(db)

@@ -43,7 +42,7 @@
--min_aa_match_cov ${params.locidex.min_aa_match_cov} \\
--max_target_seqs ${params.locidex.max_target_seqs}

gzip -c seq_store.json > $output_json && rm seq_store.json
test -f seq_store.json && gzip -c seq_store.json > $output_json && rm seq_store.json
test -f annotations.gbk && gzip -c annotations.gbk > $output_gbk && rm annotations.gbk

cat <<-END_VERSIONS > versions.yml
208 changes: 208 additions & 0 deletions modules/local/locidex_select.nf
@@ -0,0 +1,208 @@
/*
Locidex provides the option to select a database from a group of databases by
using a manifest file. To avoid copying all databases each time an allele database
needs to be selected, this module reads only the "manifest" file of the databases
to pick the correct allele scheme.


The locidex manifest is set up as:
{
"db_name": [
{
"path": "path/to/db/relative/to/manifest",
# DB Config data, the newest db data will be selected as versions are not standardized
"config": {
"db_name": "Locidex Database 1",
"db_version": "1.0.0",
"db_date": "yyyy-MM-dd",
"db_author": "test1",
"db_desc": "test1",
"db_num_seqs": 53,
"is_nucl": true,
"is_prot": true,
"nucleotide_db_name": "nucleotide",
"protein_db_name": "protein"
}
}
]
}
*/

import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
import java.text.SimpleDateFormat


process LOCIDEX_SELECT {
tag "$meta.id"
label "process_single"

input:
tuple val(meta), val(top_hit), val(contigs)
val manifest // This is a json file to be parsed

output:
tuple val(meta), val(contigs), val(scheme), val(paired_p), emit: db_data
tuple val(meta), path(output_config), emit: config_data

exec:
if(params.allele_scheme == null && params.locidex.allele_database == null){
exit 1, "Allele calling is enabled but there is no allele scheme or locidex allele database location present."
}

// Tokenize the "top_hit" or species value to identify all relevant match parts of the string
def species_data = top_hit.split('_|\s')
species_data = species_data*.toLowerCase()

// De-serialize the manifest file from the database location
def jsonSlurper = new JsonSlurper()
String json_data = manifest.text
def allele_db_data = jsonSlurper.parseText(json_data)
def allele_DB_KEYs = allele_db_data.keySet() as String[]

// Tokenize all database keys for lookup of species top hit in the database names
def databases = []
def shortest_entry = Integer.MAX_VALUE
for(allele_db in allele_DB_KEYs){
def db_tokens = allele_db.split('_|\s')
for(token in db_tokens){
def tok_size = token.size()
if(tok_size < shortest_entry){
shortest_entry = tok_size
}
}
databases.add(new Tuple(db_tokens*.toLowerCase(), allele_db))
}

def DB_TOKES_POS = 0
def DB_KEY = 1

// Remove spurious characters from tokenized string
species_data = species_data.findAll { it.size() >= shortest_entry }

// The selected locidex database starts as null, as no default should be set here;
// a default database can be configured, but this process will then be skipped
def db_opt = null


paired_p = false // Sets predicate for db identification as false by default
scheme = null
report_name = "${meta.id}_${params.locidex.db_config_output_name}"
output_config = task.workDir.resolve(report_name)

for(db in databases){
// TODO not getting best matches, currently
def match_size = db[DB_TOKES_POS].size() // Prevent single token matches
def tokens = window_string(species_data, match_size)
def db_found = compare_lists(db[DB_TOKES_POS], tokens)
if(db_found){
def selected_db = select_locidex_db_path(allele_db_data[db[DB_KEY]], db[DB_KEY])
/// Write selected allele database info to a file for the final report
write_config_data(selected_db, output_config)
scheme = join_database_paths(selected_db)
paired_p = db_found
break
}
}

if(!paired_p){
write_config_data(["No database selected."], output_config)
}

}


def write_config_data(db_data, output_name){
/// Config data for db to use
def json_data = new JsonBuilder(db_data).toPrettyString()
def output_file = file(output_name).newWriter()
output_file.write(json_data)
output_file.close()
}

def join_database_paths(db_path){
/// Database paths are relative to the manifest; hopefully this will not cause many issues on cloud executors
def input_dir_path = [params.lx_allele_database, db_path[params.locidex.manifest_db_path]].join(File.separator)
return input_dir_path
}

def select_locidex_db_path(db_values, db_name){
/// Select the optimal locidex database by parsing the date fields for the organism
/// Database entries are labeled by date, so the most recent one is chosen
/// Each db value is an object containing the path field and the config fields
/// db_values: the list of database config information in the manifest


def database_entries = db_values.size()
def default_date = new SimpleDateFormat(params.locidex.date_format_string).parse("0001-01-01")
def max_date = default_date
def max_date_entry = null
def dates = []

// Validate all input fields
for(idx in 0..<db_values.size()){
def db_entry = db_values[idx]
if(!db_entry.containsKey(params.locidex.manifest_db_path)){
exit 1, "Missing path value in locidex config for: ${db_name}"
}
if(!db_entry.containsKey(params.locidex.manifest_config_key)){
exit 1, "Missing config data for locidex database entry: ${db_name}"
}
if(!db_entry[params.locidex.manifest_config_key].containsKey(params.locidex.database_config_value_date)){
exit 1, "Missing date created value for locidex database entry: ${db_name}"
}
def date_value = db_entry[params.locidex.manifest_config_key][params.locidex.database_config_value_date]
def date_check = new SimpleDateFormat(params.locidex.date_format_string).parse(date_value)
dates.add(date_check)
if(date_check > max_date){
max_date = date_check
max_date_entry = db_entry
}
}

def max_date_count = dates.count(max_date)
if(max_date_count > 1){
exit 1, "There are multiple versions of the most recent database for ${db_name}. Mikrokondo could not determine the best database to pick."
}else if (max_date_count == 0){
exit 1, "There are not databases created after the year ${defualt_date}. Please set the allele database parameter, or adjust the date your database was created in the 'config.json'"
}else if (max_date_entry == null){
exit 1, "Could not select a database for locidex sample. ${meta.id}"
}
return max_date_entry
}


def window_string(species, match_size){
/*
Create an array of string windows of a given match size, for comparison against another value later on.

e.g. species is an array of ["1", "2", "3", "4"] and match_size is 2; the output will be:
[
["1", "2"],
["2", "3"],
["3", "4"]
]
*/
def tiles = []
def adj_match_size = match_size - 1
for(int spot = 0; spot < species.size()-adj_match_size; spot = spot + 1){
tiles.add(species[spot..spot + adj_match_size])
}
return tiles
}

def compare_lists(db_string_windows, species_tokens){
/* Compare the various windows until the right db is found.
db_string_windows is an array like [["1", "2"], ["2", "3"], ["3", "4"]] and species_tokens would be ["2", "3"]

TODO need to add a match size
*/

for(window in db_string_windows){
if(window == species_tokens){
return true
}
}
return false
}
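As a condensed, standalone illustration of the date-based choice select_locidex_db_path makes above (the entries are hypothetical, and Groovy's max is used here for brevity rather than the module's explicit loop):

    import java.text.SimpleDateFormat

    // Pick the most recently dated entry for a single database key (illustrative data only).
    def dateFormat = new SimpleDateFormat("yyyy-MM-dd")
    def dbEntries = [
        [path: "salmonella/1.0.0", config: [db_name: "Salmonella enterica", db_date: "2023-05-01"]],
        [path: "salmonella/2.0.0", config: [db_name: "Salmonella enterica", db_date: "2024-08-01"]]
    ]
    def newest = dbEntries.max { dateFormat.parse(it.config.db_date) }
    assert newest.path == "salmonella/2.0.0"

    // Like the module, refuse to guess when two entries share the newest date.
    def newestDate = dateFormat.parse(newest.config.db_date)
    assert dbEntries.count { dateFormat.parse(it.config.db_date) == newestDate } == 1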

3 changes: 2 additions & 1 deletion modules/local/select_pointfinder.nf
@@ -19,7 +19,7 @@ process IDENTIFY_POINTDB {
def species_data = species.split('_|\s') // tokenize string
species_data = species_data*.toLowerCase()

def overly_large_number = 100000
def overly_large_number = Integer.MAX_VALUE
def databases = []
// tokenize database options
def shortest_entry = overly_large_number
@@ -57,6 +57,7 @@ process IDENTIFY_POINTDB {


def tokenize_values(species, match_size){
// Create tiled values to match on, e.g. input is Salmonella enterica enterica -> [Salmonella, Salmonella enterica, Salmonella enterica enterica]
def tokens = []
def adj_match_size = match_size - 1
for(int spot = 0; spot < species.size()-adj_match_size; spot = spot + 1){
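Both IDENTIFY_POINTDB and LOCIDEX_SELECT lean on the same sliding-window token matching. A minimal standalone sketch of that idea, with a hypothetical species string and database key:

    // Build fixed-size windows over the species tokens and look for an exact window match (illustrative only).
    def windowTokens = { List<String> tokens, int matchSize ->
        def tiles = []
        for (int spot = 0; spot < tokens.size() - (matchSize - 1); spot++) {
            tiles.add(tokens[spot..spot + matchSize - 1])
        }
        tiles
    }

    def species = "Salmonella enterica enterica".toLowerCase().split('_| ') as List
    def dbTokens = ["salmonella", "enterica"]            // hypothetical database key, already tokenized
    def windows = windowTokens(species, dbTokens.size())
    assert windows == [["salmonella", "enterica"], ["enterica", "enterica"]]
    assert windows.any { it == dbTokens }                // a window matches, so this database would be selected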
26 changes: 16 additions & 10 deletions nextflow.config
@@ -43,7 +43,7 @@ params {
show_hidden_params = false
validationS3PathCheck = true
validationShowHiddenParams = false
validationSchemaIgnoreParams = 'top_hit_method,abricate,locidex,assembly_status,bakta,bandage,checkm,chopper,contigs_too_short,coreutils,coverage_calc_fields,ectyper,fastp,fastqc,filtered_reads,flye,kat,kleborate,kraken,kraken_bin,kraken_species,lissero,mash,mash_meta,medaka,minimap2,mlst,mobsuite_recon,opt_platforms,pilon,pilon_iterative,pointfinder_db_tag,python3,QCReport,QCReport-fields,QCReportFields,quast,racon,raw_reads,report_aggregate,r_contaminants,samtools,seqkit,seqtk,seqtk_size,shigeifinder,sistr,spades,spatyper,staramr,subtyping_report,top_hit_species,unicycler'
validationSchemaIgnoreParams = 'allele_scheme_selected,top_hit_method,abricate,locidex,assembly_status,bakta,bandage,checkm,chopper,contigs_too_short,coreutils,coverage_calc_fields,ectyper,fastp,fastqc,filtered_reads,flye,kat,kleborate,kraken,kraken_bin,kraken_species,lissero,mash,mash_meta,medaka,minimap2,mlst,mobsuite_recon,opt_platforms,pilon,pilon_iterative,pointfinder_db_tag,python3,QCReport,QCReport-fields,QCReportFields,quast,racon,raw_reads,report_aggregate,r_contaminants,samtools,seqkit,seqtk,seqtk_size,shigeifinder,sistr,spades,spatyper,staramr,subtyping_report,top_hit_species,unicycler'
validationFailUnrecognisedParams = false // for the qcreport fields

// SKIP options
@@ -128,6 +128,7 @@ params {
lx_report_prop = "locus_name"
lx_report_max_ambig = 0
lx_report_max_stop = 0
lx_allele_database = null

// Override an allele calling scheme; this will be applied globally if auto selection is not opted for
allele_scheme = null
@@ -214,9 +215,8 @@ params {

locidex {
// awaiting singularity image build
//singularity = "https://depot.galaxyproject.org/singularity/locidex%3A0.1.1--pyhdfd78af_1"
singularity = "quay.io/biocontainers/locidex:0.1.1--pyhdfd78af_1"
docker = "quay.io/biocontainers/locidex:0.1.1--pyhdfd78af_1"
singularity = "docker.io/mwells14/locidex:0.2.2"
docker = "docker.io/mwells14/locidex:0.2.2"
min_evalue = params.lx_min_evalue
min_dna_len = params.lx_min_dna_len
min_aa_len = params.lx_min_aa_len
@@ -232,17 +232,23 @@
report_prop = params.lx_report_prop
report_max_ambig = params.lx_report_max_ambig
report_max_stop = params.lx_report_max_stop
allele_database = params.lx_allele_database
date_format_string = "yyyy-MM-dd"
manifest_db_path = "path"
manifest_config_key = "config"
manifest_name = "manifest.json"
database_config_value_date = "db_date"
extracted_seqs_suffix = ".extracted.seqs.fasta.gz"
seq_store_suffix = ".seq_store.json.gz"
gbk_suffix = ".gbk.gz"
extraction_dir = "extracted"
report_suffix = ".profile.mlst.json.gz"
schemes {
salmonella {
search = params.QCReport.salmonella
db = null
}
}
db_config_output_name = "SelectedLocidexConfig.json"
report_tag = "LocidexDatabaseInformation"
}

allele_scheme_selected {
report_tag = "AlleleSchemeUsed"
}

// FASTP options
7 changes: 7 additions & 0 deletions nextflow_schema.json
@@ -664,6 +664,13 @@
"default": 0,
"description": "Maximum number of internal stop codons allowed in a sequence.",
"minimum": 0
},
"lx_allele_database": {
"type": "string",
"description": "Folder of locidex databases. The folder should contain a 'manifest.json' file created by locidex",
"pattern": "^\\S+$",
"exists": true,
"format": "directory-path"
}
}
},
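Since lx_allele_database must point at a folder containing a locidex-generated manifest.json, here is a minimal sketch of reading such a manifest with the key names the pipeline configures (manifest_db_path = "path", manifest_config_key = "config", database_config_value_date = "db_date"); the manifest content itself is hypothetical:

    import groovy.json.JsonSlurper
    import java.text.SimpleDateFormat

    // Hypothetical manifest.json content; real manifests are produced by locidex itself.
    def manifestText = '''
    {
      "salmonella_enterica": [
        {
          "path": "salmonella_enterica/1.0.0",
          "config": { "db_name": "Salmonella enterica", "db_version": "1.0.0", "db_date": "2024-08-01" }
        }
      ]
    }
    '''
    def entry = new JsonSlurper().parseText(manifestText)["salmonella_enterica"][0]
    def dbDir = entry["path"]                                                    // manifest_db_path
    def dbDate = new SimpleDateFormat("yyyy-MM-dd").parse(entry["config"]["db_date"])
    assert dbDir == "salmonella_enterica/1.0.0"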