1. Problem statement
Some pipelines require access to pre-configured databases (such as Kraken2, MLST schemes, BLAST databases, etc.). In Nextflow, these can be configured as parameters where a user passes a path to the necessary files (such as --kraken2-db [PATH]). However, in IRIDA Next, this would render as a free-text field, and users would be expected to enter the correct location themselves.
Instead, it would be better if databases were selectable in the IRIDA Next interface with an associated and descriptive label.
Some suggestions for a long-term solution would be:
R0: Users should be able to use a pre-installed database in a pipeline.
R1: Database parameters in a pipeline should be selectable from a pre-configured set of values.
R2: Absolute URIs to databases should not be stored within the pipeline code. Relatedly, a pipeline should remain usable outside of IRIDA Next (useful for testing).
R3: Values in the database parameter should be user-readable labels (like Kraken2 Standard 2024-01-12) but the pipeline should receive the path to the database files on submission.
R4: Adding new databases should not require updating the pipeline code.
R5: Multiple tools/pipelines should be able to re-use the same database locations.
R6: There should be a straightforward way to add/remove databases.
2. Solution 0: Database as string parameter
Implements suggestions:
R0 (use pre-installed databases)
R2 (no absolute URIs in code)
R4 (no code changes to add new databases)
R5 (tools/pipelines can re-use same database)
R6 (easy to add/remove databases)
The first solution is to define a database as a parameter to a pipeline passed as a string to the location of the database. This currently exists in pipelines by defining a parameter that accepts a string pointing to the path of the database like --kraken.db [PATH].
This is already implemented in IRIDA Next.
3. Solution 1: Make database parameter an enum of selectable URIs
Implements suggestions:
R0 (use pre-installed databases)
R1 (databases are selectable)
R5 (tools/pipelines can re-use same database)
The nextflow_schema.json file is used to define parameters to pass to a pipeline. This includes databases. For example, an entry describing a Kraken2 database to pass to a pipeline could look like:
Before:

    "kraken.db": {
        "type": "string",
        "description": "Path to Kraken2 database"
    }

This solution would involve modifying this entry in the JSON schema so that the parameter becomes an enum of database URIs. Users are then constrained to select only from the defined list of database paths.
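As a rough sketch of what the modified entry and its effect might look like (an assumption, not the original listing from this issue): the entry below adds an `enum` to `kraken.db`. The first URI is the one mentioned under the disadvantages of this solution; the second is a made-up placeholder. The `is_allowed` helper is hypothetical and only mimics the `enum` check that a JSON Schema validator would perform. The schema is kept in the same flat shape as the snippets in this issue.

```python
import json

# Sketch of a nextflow_schema.json entry after the change (Solution 1).
# The second URI is a made-up placeholder, for illustration only.
SCHEMA_ENTRY = json.loads("""
{
  "kraken.db": {
    "type": "string",
    "description": "Path to Kraken2 database",
    "enum": [
      "az://db/k2_standard_20220607/",
      "az://db/example_other_kraken2_db/"
    ]
  }
}
""")


def is_allowed(param: str, value: str) -> bool:
    """Mimic the JSON Schema `enum` check: the value must be one of the listed strings."""
    entry = SCHEMA_ENTRY[param]
    return isinstance(value, str) and value in entry["enum"]


if __name__ == "__main__":
    print(is_allowed("kraken.db", "az://db/k2_standard_20220607/"))
    print(is_allowed("kraken.db", "/some/other/path"))
```

Note that the values a user would see here are the raw URIs themselves, which is why this solution does not cover R3 (readable labels).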
3.1. Advantages
No code changes required on IRIDA Next side.
3.2. Disadvantages
Absolute URIs must be used (e.g., az://db/k2_standard_20220607/) in order for JSON Schema validation on the Nextflow side to work.
It is assumed that a system administrator will install the appropriate databases in the defined directories.
Adding/removing databases requires releasing new versions of a pipeline (conflicts with R4).
4. Solution 2: Databases are absolute paths to locations within a tool container
Implements suggestions:
R0 (use pre-installed databases)
R1 (databases are selectable)
R2 (no absolute URIs)
R5 (pipelines can re-use same database); only half-implemented, since pipelines (but not tools) can re-use the same database
Solution 2 is nearly identical to solution 1 except that the enum in the pipeline JSON Schema contains strings that are absolute paths within the container running the particular tool.
In order for this to work it is assumed that database files are either packaged with a container, or mounted inside a container at run-time.
Alternatively, databases could be bundled as Docker containers and made available/mounted by a tool at run-time.
4.1. Advantages
No absolute URIs
Small databases/files (e.g., reference genomes) could be packaged in the tool container itself, requiring no additional administrative work
4.2. Disadvantages
Either tool containers must be modified to include database files, or additional code for mounting directories in a container must be provided.
Some databases are very large (Standard Kraken2 is 50 GB). This would add a lot of data within a tool container.
For bundling in tool containers, we would need to do work to build all these custom containers with databases.
There might be multiple copies of a database for different tools (e.g., Kraken2 and Bracken require the same database, but are currently separate Docker containers).
5. Solution 3: Databases configured in local clone of pipeline code
Implements suggestions:
R0 (use pre-installed databases)
R1 (databases are selectable)
R2 (no absolute URIs)
R4 (no code changes to add new databases); only half-implemented, since code changes are still needed in the local repo
R5 (tools/pipelines can re-use same database)
In this solution, upon installing a pipeline, a clone of the Git repository is made in an internal GitLab instance (or some other internal location). The nextflow_schema.json file in the internal copy of the code is updated to point to locally installed databases, similar to Solution 1.
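nextflow_schema.json of released pipeline:

    "kraken.db": {
        "type": "string",
        "description": "Path to Kraken2 database"
    }

The nextflow_schema.json of the internal pipeline would restrict this entry to the locally installed databases, similar to Solution 1.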
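A minimal sketch of how the internal copy might be derived from the released entry, assuming the flat entry shape used in this issue. The `localize` helper and the local path are hypothetical; a real installation would list its own databases and would likely edit the file in the internal repository directly.

```python
import json

# Released entry, as it would appear in the pipeline's public nextflow_schema.json.
RELEASED = {
    "kraken.db": {
        "type": "string",
        "description": "Path to Kraken2 database",
    }
}


def localize(schema: dict, param: str, local_paths: list) -> dict:
    """Return a copy of `schema` in which `param` is restricted to local databases."""
    patched = json.loads(json.dumps(schema))  # deep copy via a JSON round trip
    patched[param]["enum"] = list(local_paths)
    return patched


if __name__ == "__main__":
    # Hypothetical local path, used only for illustration.
    internal = localize(RELEASED, "kraken.db", ["/data/databases/kraken2/example_db"])
    print(json.dumps(internal, indent=2))
```

Because the released entry is left untouched, the pipeline stays usable outside of IRIDA Next, while the internal clone carries the site-specific paths.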
6. Solution 4: Database parameters marked with custom annotation
Implements suggestions:
R0 (use pre-installed databases)
R1 (databases are selectable)
R2 (no absolute URIs)
R3 (readable labels)
R4 (no code changes to add new databases)
R5 (tools/pipelines can re-use same database)
R6 (easy to add/remove databases)
In Solution 4, the nextflow_schema.json is nearly identical to the default version provided by Nextflow/nf-core, except that any database selection parameters are marked with a custom JSON Schema annotation, x-iridanext-database-type. This value is used within IRIDA Next to look up pre-installed databases within some external resource.
6.1. Pipeline changes
Each parameter that defines a database will need to add an additional JSON Schema annotation, x-iridanext-database-type. The value will be one of a number of pre-determined types of databases (e.g., Kraken2, MLST schemes, reference genomes).
Nextflow will ignore this custom annotation, but it will impact how IRIDA Next renders the user interface.
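To make this concrete, here is a sketch of an annotated entry and of how annotated parameters might be picked out of a schema. It assumes the `x-iridanext-database-type` spelling of the annotation and the flat entry shape used in this issue; the `outdir` parameter and the `database_parameters` helper are hypothetical and included only for illustration.

```python
import json

ANNOTATION = "x-iridanext-database-type"

# Sketch of schema entries: one database parameter carrying the custom
# annotation, and one ordinary parameter (hypothetical) without it.
SCHEMA_ENTRIES = json.loads("""
{
  "kraken.db": {
    "type": "string",
    "description": "Path to Kraken2 database",
    "x-iridanext-database-type": "kraken2"
  },
  "outdir": {
    "type": "string",
    "description": "Output directory"
  }
}
""")


def database_parameters(properties: dict) -> dict:
    """Map each annotated parameter name to its declared database type."""
    return {
        name: spec[ANNOTATION]
        for name, spec in properties.items()
        if ANNOTATION in spec
    }


if __name__ == "__main__":
    print(database_parameters(SCHEMA_ENTRIES))
```

The entry is still a plain string parameter as far as the pipeline is concerned, so a path can be passed directly when running outside of IRIDA Next.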
6.2. IRIDA Next changes
Upon parsing the nextflow_schema.json, IRIDA Next will look for these custom annotations and use them to identify parameters marked as databases. In particular, this will load up a list of pre-installed database selection options based on the value of x-iridanext-database-type (e.g., a value of kraken2 loads Kraken2 databases).
Defining the list of databases to load could be handled by a database management file loaded by IRIDA Next.
6.2.1. Database management file
The database management file is a JSON file that defines database types and, for each database of a given type, a label/URI pair.
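One possible shape for such a file, offered as a sketch under assumptions rather than the format proposed here: each database type maps to a list of `label`/`uri` pairs. The key names, the `labels_for` and `resolve` helpers, the `mlst` type key, and the first URI are all hypothetical. The label "Kraken2 Standard 2024-01-12" comes from R3, and `az://db/k2_standard_20220607/` from Solution 1; the label paired with it is made up.

```python
import json

# Hypothetical database management file content.
MANAGEMENT_FILE = json.loads("""
{
  "kraken2": [
    {"label": "Kraken2 Standard 2024-01-12", "uri": "az://db/example_k2_standard_20240112/"},
    {"label": "Kraken2 Standard 2022-06-07", "uri": "az://db/k2_standard_20220607/"}
  ],
  "mlst": []
}
""")


def labels_for(db_type: str) -> list:
    """Labels to show in the selection list for one database type."""
    return [entry["label"] for entry in MANAGEMENT_FILE.get(db_type, [])]


def resolve(db_type: str, label: str) -> str:
    """Translate the label a user selected into the URI passed to the pipeline."""
    for entry in MANAGEMENT_FILE.get(db_type, []):
        if entry["label"] == label:
            return entry["uri"]
    raise KeyError(f"No {db_type!r} database labelled {label!r}")


if __name__ == "__main__":
    print(labels_for("kraken2"))
    print(resolve("kraken2", "Kraken2 Standard 2022-06-07"))
```

Keeping labels and URIs in a file outside the pipeline is what lets several pipelines share one database entry (R5) and lets databases be added without a pipeline release (R4).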
6.3. Managing databases
In order to add/remove databases, the database management file could be stored in an internal instance of GitLab. Management of installed databases could hence be controlled via pull requests (merge requests in GitLab).
6.4. Advantages
Implements all the defined requirements, making it very flexible
Minimal code changes in a pipeline
6.5. Disadvantages
More code changes in IRIDA Next