Database handling for pipelines in IRIDA Next #15

Open
apetkau opened this issue Jan 12, 2024 · 0 comments

apetkau commented Jan 12, 2024

1. Problem statement

Some pipelines require access to pre-configured databases (such as Kraken2, MLST schemes, BLAST databases, etc). In Nextflow, these can be configured as parameters where a user passes a path to the necessary files (such as --kraken2-db [PATH]). However, in IRIDA Next, this would render as a text field and users would be expected to enter the correct location.

Instead, it would be better if databases were selectable in the IRIDA Next interface with an associated and descriptive label.

Some suggestions for a long-term solution would be:

  • R0: Users should be able to use a pre-installed database in a pipeline.
  • R1: Database parameters in a pipeline should be selectable from a pre-configured set of values
  • R2: Absolute URIs to databases should not be stored within the pipeline code.
    • Relatedly, the pipeline should remain usable outside of IRIDA Next (useful for testing).
  • R3: Values in the database parameter should be user-readable labels (like Kraken2 Standard 2024-01-12) but the pipeline should receive the path to the database files on submission.
  • R4: Adding new databases should not require updating the pipeline code.
  • R5: Multiple tools/pipelines should be able to re-use the same database locations.
  • R6: There should be a straightforward way to add/remove databases.

2. Solution 0: Database as string parameter

Implements suggestions:

  • R0 (use pre-installed databases),
  • R2 (no absolute URIs in code)
  • R4 (no code changes to add new databases)
  • R5 (tools/pipelines can re-use same database)
  • R6 (easy to add/remove databases)

The first solution is to pass the database to a pipeline as a string parameter holding the location of the database. Pipelines already do this by defining a parameter that accepts a path to the database, such as --kraken.db [PATH].

This is already implemented in IRIDA Next.


3. Solution 1: Make database parameter an enum of selectable URIs

Implements suggestions:

  • R0 (use pre-installed databases)
  • R1 (databases are selectable)
  • R5 (tools/pipelines can re-use same database)

The nextflow_schema.json file is used to define parameters to pass to a pipeline. This includes databases. For example, an entry describing a Kraken2 database to pass to a pipeline could look like:

Before

"kraken.db": {
    "type": "string",
    "description": "Path to Kraken2 database"
}

This solution would involve modifying this entry in the JSON schema to something like:

After

"kraken.db": {
    "type": "string",
    "description": "Path to Kraken2 database",
    "enum": ["az://db/k2_standard_20220607/", "az://db/k2_standard_20231009/"]
}

Now, users are constrained to only select from the defined list of database paths.
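
As a rough illustration of what the enum constraint means for a submitted value, the following minimal sketch checks a value against a schema entry. It is not the validation code used by Nextflow or IRIDA Next; the function name validate_enum_param and the constant KRAKEN_DB_ENTRY are hypothetical, and only the "type" and "enum" keywords are considered.

```python
# Hypothetical sketch: checks only the "type" and "enum" keywords of one entry.
KRAKEN_DB_ENTRY = {
    "type": "string",
    "description": "Path to Kraken2 database",
    "enum": ["az://db/k2_standard_20220607/", "az://db/k2_standard_20231009/"],
}


def validate_enum_param(schema_entry, value):
    """Return True if value satisfies the entry's type and enum constraints."""
    if schema_entry.get("type") == "string" and not isinstance(value, str):
        return False
    allowed = schema_entry.get("enum")
    # Without an "enum" keyword any string is accepted (the "Before" entry).
    return allowed is None or value in allowed


if __name__ == "__main__":
    print(validate_enum_param(KRAKEN_DB_ENTRY, "az://db/k2_standard_20231009/"))
    print(validate_enum_param(KRAKEN_DB_ENTRY, "/some/other/path"))
```

With the "After" entry, only the two listed URIs pass; with the "Before" entry, any string does.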

3.1. Advantages

  • No code changes required on IRIDA Next side.

3.2. Disadvantages

  • Absolute URIs must be used (e.g., az://db/k2_standard_20220607/) in order for JSON Schema validation on the Nextflow side to work.
  • It is assumed that a system administrator will install the appropriate databases in the defined directories.
  • Adding/removing databases requires releasing new versions of a pipeline (violates R4).

4. Solution 2: Databases are absolute paths to locations within a tool container

Implements suggestions:

  • R0 (use pre-installed databases)
  • R1 (databases are selectable)
  • R2 (no absolute URIs)
  • R5 (pipelines can re-use same database)
    • Only half-implemented since only pipelines (but not tools) can re-use same database.

Solution 2 is nearly identical to solution 1 except that the enum in the pipeline JSON Schema contains strings that are absolute paths within the container running the particular tool.

"kraken.db": {
    "type": "string",
    "description": "Path to Kraken2 database",
    "enum": ["/opt/db/k2_standard_20220607/", "/opt/db/k2_standard_20231009/"]
}

In order for this to work it is assumed that database files are either packaged with a container, or mounted inside a container at run-time.

Alternatively, databases could be bundled as Docker containers and made available/mounted by a tool at run-time.

4.1. Advantages

  • No absolute URIs
  • For small databases/files (e.g., reference genomes), could package up in the tool container itself and need no additional administrative work

4.2. Disadvantages

  • Either tool containers must be modified to include database files, or additional code for mounting directories in a container must be provided.
  • Some databases are very large (Standard Kraken2 is 50 GB). This would add a lot of data within a tool container.
  • For bundling in tool containers, we would need to do work to build all these custom containers with databases.
    • Might have multiple copies of databases for different tools (e.g., Kraken2 and Bracken require same database, but are separate docker containers now).

5. Solution 3: Databases configured in local clone of pipeline code

Implements suggestions:

  • R0 (use pre-installed databases)
  • R1 (databases are selectable)
  • R2 (no absolute URIs)
  • R4 (no code changes to add new databases)
    • Only half-implemented, since code changes are still needed in the local repository.
  • R5 (tools/pipelines can re-use same database)

In this solution, upon installing a pipeline, the git repository is cloned into an internal GitLab instance (or some other internal location). The nextflow_schema.json file in the internal copy is then updated to point to locally installed databases, similar to Solution 1.

nextflow_schema.json of released pipeline

"kraken.db": {
    "type": "string",
    "description": "Path to Kraken2 database"
}

nextflow_schema.json of internal pipeline

"kraken.db": {
    "type": "string",
    "description": "Path to Kraken2 database",
    "enum": ["az://db/k2_standard_20220607/", "az://db/k2_standard_20231009/"]
}
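
The local modification could in principle be scripted rather than edited by hand. The sketch below is one possible approach under stated assumptions: the helper add_database_enum is hypothetical, and it operates on a plain dictionary of parameter entries like the fragments above. A real nextflow_schema.json may nest parameters more deeply, which this sketch does not handle.

```python
import copy


def add_database_enum(properties, param_name, database_uris):
    """Return a copy of properties with an "enum" added to one parameter.

    Hypothetical helper: the released entries are left untouched and the
    returned copy represents the internal (local) version of the schema.
    """
    if param_name not in properties:
        raise KeyError(f"parameter not found in schema: {param_name}")
    patched = copy.deepcopy(properties)
    patched[param_name]["enum"] = list(database_uris)
    return patched


if __name__ == "__main__":
    released = {
        "kraken.db": {"type": "string", "description": "Path to Kraken2 database"}
    }
    internal = add_database_enum(
        released,
        "kraken.db",
        ["az://db/k2_standard_20220607/", "az://db/k2_standard_20231009/"],
    )
    print(internal["kraken.db"]["enum"])
```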

6. Solution 4: Database parameters marked with custom annotation

Implements suggestions:

  • R0 (use pre-installed databases)
  • R1 (databases are selectable)
  • R2 (no absolute URIs)
  • R3 (readable labels)
  • R4 (no code changes to add new databases)
  • R5 (tools/pipelines can re-use same database)
  • R6 (easy to add/remove databases)

In Solution 4, the nextflow_schema.json is nearly identical to the default version provided by Nextflow/nf-core, except that any database selection parameters are marked with a custom JSON Schema annotation, x-iridanext-database-type. IRIDA Next uses this value to look up pre-installed databases within some external resource.

6.1. Pipeline changes

Each parameter that defines a database will need an additional JSON Schema annotation, x-iridanext-database-type. Its value will be one of a number of pre-determined types of databases (e.g., Kraken, MLST Schemas, reference genomes).

"kraken.db": {
    "type": "string",
    "x-iridanext-database-type": "kraken2",
    "description": "Path to Kraken2 database"
}

Nextflow will ignore this custom annotation, but it will impact how IRIDA Next renders the user interface.

6.2. IRIDA Next changes

Upon parsing the nextflow_schema.json, IRIDA Next will look for these custom annotations and use them to identify parameters marked as databases. In particular, this will load up a list of pre-installed database selection options based on the value of x-iridanext-database-type (e.g., a value of kraken2 loads Kraken2 databases).

Defining the list of databases to load could be handled by a database management file loaded by IRIDA Next.
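
A minimal sketch of this lookup, assuming the schema parameters and the installed databases are already available as dictionaries, might look like the following. The function names find_database_params and selectable_labels are hypothetical, as is the shape of the installed argument (database type mapped to a list of label/uri pairs); this is not IRIDA Next code.

```python
ANNOTATION = "x-iridanext-database-type"


def find_database_params(properties):
    """Map each annotated parameter name to its database type."""
    return {
        name: entry[ANNOTATION]
        for name, entry in properties.items()
        if ANNOTATION in entry
    }


def selectable_labels(properties, installed):
    """Map each database parameter to the labels a user could choose from.

    installed is assumed to map a database type to a list of
    {"label": ..., "uri": ...} dictionaries.
    """
    return {
        name: [db["label"] for db in installed.get(db_type, [])]
        for name, db_type in find_database_params(properties).items()
    }


if __name__ == "__main__":
    properties = {
        "kraken.db": {
            "type": "string",
            "x-iridanext-database-type": "kraken2",
            "description": "Path to Kraken2 database",
        },
        "outdir": {"type": "string", "description": "Output directory"},
    }
    installed = {
        "kraken2": [{"label": "Kraken2 version 1", "uri": "az://db/kraken2_v1"}]
    }
    print(selectable_labels(properties, installed))
```

Parameters without the annotation (such as the illustrative outdir above) are left out, so they would continue to render as ordinary text fields.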

6.2.1. Database management file

The database management file is a JSON file that defines database types and label/uri pairs for each database. The format would be as follows:

{
  "[DATABASE_TYPE]": [
    {"label": "DATABASE_LABEL", "uri": "DATABASE_URI"}
  ]
}

For example:

{
  "kraken2": [
    {"label": "Kraken2 version 1", "uri": "az://db/kraken2_v1"},
    {"label": "Kraken2 version 2", "uri": "az://db/kraken2_v2"}
  ],
  "references": [
    {"label": "Listeria genome 1", "uri": "az://references/listeria_g1.fasta"}
  ]
}
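
To illustrate how such a file could satisfy R3 (the user sees a label, the pipeline receives a path), here is a small sketch that resolves a selected label to its URI. The function resolve_database_uri and the constant MANAGEMENT_JSON are hypothetical names used for illustration; the example content is the file shown above.

```python
import json

# The example management file from above, embedded for a self-contained sketch.
MANAGEMENT_JSON = """
{
  "kraken2": [
    {"label": "Kraken2 version 1", "uri": "az://db/kraken2_v1"},
    {"label": "Kraken2 version 2", "uri": "az://db/kraken2_v2"}
  ],
  "references": [
    {"label": "Listeria genome 1", "uri": "az://references/listeria_g1.fasta"}
  ]
}
"""


def resolve_database_uri(management, database_type, label):
    """Return the URI for a user-selected label of the given database type."""
    for entry in management.get(database_type, []):
        if entry["label"] == label:
            return entry["uri"]
    raise LookupError(f"no {database_type} database labelled {label!r}")


if __name__ == "__main__":
    management = json.loads(MANAGEMENT_JSON)
    print(resolve_database_uri(management, "kraken2", "Kraken2 version 2"))
```

On submission, the resolved URI rather than the label would be passed as the value of the pipeline parameter.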

6.3. Managing databases

In order to add/remove databases, the database management file could be stored in an internal GitLab instance. Management of installed databases could then be controlled via pull/merge requests.

6.4. Advantages

  • Implements all the defined requirements, making it very flexible
  • Minimal code changes in a pipeline

6.5. Disadvantages

  • More code changes in IRIDA Next

apetkau changed the title from "Defining selectable databases for pipelines in IRIDA Next" to "Database handling for pipelines in IRIDA Next" on Jan 15, 2024