This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

Merge pull request #47 from moka-guys/v2.1.0
V2.1.0 (#47)
natashapinto authored Jul 5, 2024
2 parents d4162fd + fb01a12 commit 39371f0
Showing 34 changed files with 671 additions and 578 deletions.
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,4 @@
*.pyc
wscleaner/wscleaner/config.json
*.log
.coverage
.ini
11 changes: 11 additions & 0 deletions LICENSE
@@ -0,0 +1,11 @@
Copyright 2024 Synnovis

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except
in compliance with the License. You may obtain a copy of the License at

[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
or implied. See the License for the specific language governing permissions and limitations under
the License.
91 changes: 47 additions & 44 deletions README.md
@@ -1,70 +1,73 @@
## Workstation Cleaner (wscleaner)

Workstation Cleaner (wscleaner) deletes local directories that have been uploaded to the DNAnexus cloud storage service.
The Synnovis Genome Informatics team uses a Linux workstation to manage sequencing files. These files are uploaded to the DNAnexus service for storage, but clearing the workstation by hand is time-intensive. Workstation Cleaner (wscleaner) automates the deletion of local directories that have been uploaded to the DNAnexus cloud storage service.

When executed, Runfolders in the input (root) directory are deleted based on the following criteria:
A `RunFolderManager` class instantiates an object for each local runfolder, each of which has an associated DNAnexus project object. The manager loops over the runfolders and deletes those that pass all checks. DNAnexus projects are accessed with the dxpy module, a Python wrapper for the DNAnexus API.

## Protocol

When executed, runfolders in the input (root) directory are identified based on:
* Matching the expected runfolder regex pattern
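
A minimal sketch of the name-matching step. The pattern below is an assumption inferred from the test runfolder names in this commit (for example `999999_NB551068_1234_WSCLEANT01`); the regex that wscleaner actually uses is not shown here, and the helper names are hypothetical.

```python
import re

# Hypothetical pattern: 6-digit date, instrument ID, 4-digit run counter, suffix.
RUNFOLDER_PATTERN = re.compile(r"^\d{6}_[A-Za-z0-9]+_\d{4}_[A-Za-z0-9]+$")


def is_runfolder(name: str) -> bool:
    """Return True if a directory name matches the assumed runfolder pattern."""
    return RUNFOLDER_PATTERN.match(name) is not None


def filter_runfolders(names):
    """Keep only the names that look like runfolders."""
    return [name for name in names if is_runfolder(name)]
```

Anything in the root directory that does not match the pattern is never considered for deletion.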

Runfolders are identified for deletion if meeting the following criteria:
* A single DNAnexus project is found matching the runfolder name
* All local FASTQ files are uploaded and in a 'closed' state
* X logfiles are present in the DNA Nexus project /Logfiles directory (NB X can be added as a command line argument - default is 5)
* All local FASTQ files are uploaded and in a 'closed' state (for TSO runfolders, there are no local fastqs so this check automatically passes)
* X logfiles are present in the DNAnexus project `automated_scripts_logfiles` directory (X can be set with the `--logfile-count` command line argument; the default is 6)
* Runfolder's upload runfolder log file contains no errors

or if the run is identified as a TSO500 run, based on:
* the bcl2fastq2_output.log file created by the automated scripts
AND
* Presence of `_TSO` in the human readable DNANexus project name
TSO runfolders must meet the following additional criteria to be identified for deletion:
* Presence of the `bcl2fastq2_output.log` file
* Presence of `TSO run.` in the `bcl2fastq2_output.log` file
* Presence of `_TSO` in the human readable DNAnexus project name
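
The criteria above can be read as a single all-or-nothing decision. The following is a sketch only: `RunfolderChecks` and `ready_for_deletion` are hypothetical names, not part of wscleaner, and the logfile count is treated as a minimum because the text does not say whether the count must match exactly.

```python
from dataclasses import dataclass


@dataclass
class RunfolderChecks:
    """Hypothetical container for the outcome of each check on one runfolder."""

    project_count: int       # DNAnexus projects matching the runfolder name
    fastqs_uploaded: bool    # all local FASTQs uploaded and in a 'closed' state
    logfile_count: int       # logfiles found in the project's logfile directory
    upload_log_clean: bool   # upload runfolder log contains no errors
    is_tso: bool = False
    bcl2fastq_log_present: bool = False
    bcl2fastq_log_has_tso_marker: bool = False  # 'TSO run.' found in the log
    project_name_has_tso: bool = False          # '_TSO' in the project name


def ready_for_deletion(checks: RunfolderChecks, required_logfiles: int = 6) -> bool:
    """Return True only if every applicable criterion passes."""
    # TSO runfolders have no local FASTQs, so that check passes automatically.
    fastqs_ok = True if checks.is_tso else checks.fastqs_uploaded
    results = [
        checks.project_count == 1,
        fastqs_ok,
        checks.logfile_count >= required_logfiles,  # assumption: minimum, not exact
        checks.upload_log_clean,
    ]
    if checks.is_tso:
        # Additional criteria that apply to TSO runfolders only.
        results += [
            checks.bcl2fastq_log_present,
            checks.bcl2fastq_log_has_tso_marker,
            checks.project_name_has_tso,
        ]
    return all(results)
```

A single failed check is enough to keep the runfolder on disk, which is the safe default for a tool that deletes data.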

A DNAnexus API key must be cached locally using the `--set-key` option.
## Usage

## Workstation Environment
The directory `env/` in this repository contains conda environment scripts for the workstation. These remove conflicts in the PYTHONPATH environment variable by editing the variable when conda is activated. The conda documentation describes where to place these scripts under ['saving environment variables'](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux).
The script takes the following arguments and can be run in either dry run mode (does not delete runfolders) or live mode (deletes runfolders). The script has been developed using Python 3.10.6.

## Install
As described above, two environments exist on the workstation: wscleaner and wscleaner_test (for development work).
You need to activate one of these environments before installing with pip (as below).
_**When running on the workstation, the conda environment must be activated prior to running the wscleaner command.**_

```
usage: __main__.py [-h] --auth_token_file AUTH_TOKEN_FILE [--dry-run] --runfolders_dir RUNFOLDERS_DIR --log_dir LOG_DIR [--min-age MIN_AGE]
[--logfile-count LOGFILE_COUNT] [--version]
```bash
git clone https://github.com/moka-guys/workstation_housekeeping.git
pip install workstation_housekeeping/wscleaner
wscleaner --version # Print version number
options:
-h, --help show this help message and exit
--auth_token_file AUTH_TOKEN_FILE
A text file containing the DNANexus authentication token
--dry-run Perform a dry run without deleting files
--runfolders_dir RUNFOLDERS_DIR
A directory containing runfolders to process
--log_dir LOG_DIR Directory to save log file to
--min-age MIN_AGE The age (days) a runfolder must be to be deleted
--logfile-count LOGFILE_COUNT
The number of logfiles a runfolder must have in /Logfiles
--version Print version
```
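
For readers who want to see how such an interface is typically declared, here is an `argparse` sketch that mirrors the help text above. It is not the wscleaner source: the `--min-age` default is left unset because the help text does not state it, the `--logfile-count` default of 6 comes from the Protocol section, and the version string is assumed from the commit title.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same flags as the documented help text."""
    parser = argparse.ArgumentParser(prog="wscleaner")
    parser.add_argument("--auth_token_file", required=True,
                        help="A text file containing the DNANexus authentication token")
    parser.add_argument("--dry-run", action="store_true",
                        help="Perform a dry run without deleting files")
    parser.add_argument("--runfolders_dir", required=True,
                        help="A directory containing runfolders to process")
    parser.add_argument("--log_dir", required=True,
                        help="Directory to save log file to")
    # Default not shown in the help text, so none is assumed here.
    parser.add_argument("--min-age", type=int, default=None,
                        help="The age (days) a runfolder must be to be deleted")
    parser.add_argument("--logfile-count", type=int, default=6,
                        help="The number of logfiles a runfolder must have")
    # Version string is an assumption based on the commit title (v2.1.0).
    parser.add_argument("--version", action="version", version="%(prog)s 2.1.0")
    return parser
```

Note that `argparse` converts hyphens in option names to underscores, so `--dry-run` is read back as `args.dry_run`.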

## Automated usage
The script `wscleaner_command.sh` is called by the crontab. This activates the environment and passes the logfile path (and any other non-default arguments).
A development command script, `wscleaner_command_dev.sh`, can be used to call the test environment and provide testing arguments, e.g. `--dry-run`.

### Dry run mode

## Manual Usage
For example, if running in dry run mode:

```
usage: wscleaner [-h] [--auth AUTH] [--dry-run] [--logfile LOGFILE]
[--min-age MIN_AGE] [--logfile-count LOGFILE_COUNT]
[--version]
root
conda activate python3.10.6 && python3 -m wscleaner --dry-run --runfolders_dir $RUNFOLDERS_DIR --auth_token_file $AUTH_TOKEN_FILEPATH --log_dir $LOG_DIR
```

positional arguments:
root A directory containing runfolders to process
### Live mode

If running in live mode:

optional arguments:
-h, --help show this help message and exit
--auth AUTH A text file containing the DNANexus authentication
token
--dry-run Perform a dry run without deleting files
--logfile LOGFILE A path for the application logfile
--min-age MIN_AGE The age (days) a runfolder must be to be deleted
--logfile-count LOGFILE_COUNT
The number of logfiles a runfolder must have in
/Logfiles
--version Print version
```
conda activate python3.10.6 && python3 -m wscleaner --runfolders_dir $RUNFOLDERS_DIR --auth_token_file $AUTH_TOKEN_FILEPATH --log_dir $LOG_DIR
```

## Testing

## Test
All tests should be run, and should pass, before any new release.

```bash
# Run from the cloned repo directory after installation
pytest . --auth_token DNA_NEXUS_KEY
python3 -m pytest -v --auth_token_file=$FULL_PATH_TO_FILE_CONTAINING_AUTH_TOKEN
```

## License

Developed by Synnovis Genome Informatics
### Developed by Synnovis Genome Informatics
File renamed without changes.
2 changes: 2 additions & 0 deletions pytest.ini
@@ -0,0 +1,2 @@
[pytest]
addopts = -v --cov=. --cov-report term-missing
8 changes: 8 additions & 0 deletions requirements.txt
@@ -0,0 +1,8 @@
argcomplete==3.4.0
certifi==2024.7.4
dxpy==0.378.0
psutil==6.0.0
python-dateutil==2.9.0.post0
six==1.16.0
urllib3==2.1.0
websocket-client==1.7.0
36 changes: 36 additions & 0 deletions settings.json
@@ -0,0 +1,36 @@
{
"python.testing.pytestArgs": [
"."
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"python.envFile": "${workspaceFolder}/.venv",
"python.analysis.extraPaths": [],
"editor.formatOnSaveMode": "file",
"editor.formatOnSave": true,
"editor.codeActionsOnSave": {
"source.organizeImports": "explicit"
},
"[python]": {
"editor.defaultFormatter": "ms-python.black-formatter",
"editor.formatOnSave": true,
"editor.codeActionsOnSave": {
"source.organizeImports": "explicit"
}
},
"isort.args": [
"--profile",
"black"
],
"flake8.args": [
"--max-line-length=120"
],
"pylint.args": [
"--max-line-length=120"
],
"black-formatter.args": [
"--line-length",
"120"
],
"python.analysis.typeCheckingMode": "basic"
}
86 changes: 86 additions & 0 deletions test/conftest.py
@@ -0,0 +1,86 @@
"""conftest.py
Config for pytest.
"""

import os
import pytest
import shutil
import dxpy
from pathlib import Path

PROJECT_DIR = str(Path(__file__).absolute().parent.parent)  # Project working directory
DATA_DIR = os.path.join(PROJECT_DIR, "test/data/")


def pytest_addoption(parser):
    """Add command line options to pytest"""
    parser.addoption(
        "--auth_token_file",
        action="store",
        default=None,
        required=True,
        help="File containing DNANexus authentication key",
    )


@pytest.fixture
def auth_token_file(request):
    """Create pytest fixture to return auth token file from the command line arg"""
    return request.config.getoption("--auth_token_file")


@pytest.fixture(scope="session")
def data_test_runfolders():
    """A fixture that returns a list of tuples containing (runfolder_name, fastq_list_file)."""
    return [
        (
            "999999_NB551068_1234_WSCLEANT01",
            os.path.join(DATA_DIR, "test_dir_1_fastqs.txt"),
        ),
        (
            "999999_NB551068_1234_WSCLEANT02",
            os.path.join(DATA_DIR, "test_dir_2_fastqs.txt"),
        ),
    ]


@pytest.fixture(scope="function", autouse=True)
def create_test_dirs(data_test_runfolders, auth_token_file, request, monkeypatch):
    """Create test data for testing.
    This is an autouse fixture with function scope, meaning it is run once per test
    """
    for runfolder_name, fastq_list_file in data_test_runfolders:
        # Create the runfolder directory as per Illumina spec
        runfolder_path = os.path.join(DATA_DIR, runfolder_name)
        # runfolder_path is already absolute, so it is joined directly
        fastqs_path = os.path.join(runfolder_path, "Data/Intensities/BaseCalls")
        Path(fastqs_path).mkdir(parents=True, exist_ok=True)
        # Create dummy logfile
        # open(upload_runfolder_logfile, 'w').close()
        # Generate empty fastqfiles in runfolder
        with open(fastq_list_file) as f:
            fastq_list = f.read().splitlines()
        for fastq_file in fastq_list:
            # File modes are octal: 0o777, not the decimal literal 777
            Path(fastqs_path, fastq_file).touch(mode=0o777, exist_ok=True)
        open(
            os.path.join(runfolder_path, "RTAComplete.txt"), "w"
        ).close()  # Create RTAComplete file
        open(
            f"{runfolder_path}_upload_runfolder.log", "w"
        ).close()  # Create dummy upload runfolder log file
    with open(
        auth_token_file
    ) as f:  # Setup dxpy authentication token read from command line file
        auth_token = f.read().rstrip()
    dxpy.set_security_context(
        {"auth_token_type": "Bearer", "auth_token": auth_token}
    )

    yield  # Where the testing happens
    # TEARDOWN - cleanup after each test
    for runfolder_name, fastq_list_file in data_test_runfolders:
        runfolder_path = os.path.join(PROJECT_DIR, f"test/data/{runfolder_name}")
        shutil.rmtree(runfolder_path)
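
The fixture above builds an Illumina-style directory tree and fills it with empty placeholder FASTQ files. Below is a standalone sketch of the same pattern that writes to a temporary directory instead of `test/data`; the helper name `make_dummy_runfolder` and the FASTQ names are hypothetical. `Path.touch` takes its permission mode as an integer, so Unix permissions are normally written in octal (`0o777`), and the process umask still applies.

```python
import tempfile
from pathlib import Path


def make_dummy_runfolder(root: Path, runfolder_name: str, fastq_names: list[str]) -> Path:
    """Create a runfolder skeleton containing empty FASTQ placeholders."""
    runfolder = root / runfolder_name
    basecalls = runfolder / "Data" / "Intensities" / "BaseCalls"
    basecalls.mkdir(parents=True, exist_ok=True)
    for name in fastq_names:
        # Octal literal: 0o777 requests rwxrwxrwx before the umask is applied.
        (basecalls / name).touch(mode=0o777, exist_ok=True)
    # RTAComplete.txt marker, as created by the fixture.
    (runfolder / "RTAComplete.txt").touch()
    return runfolder


with tempfile.TemporaryDirectory() as tmp:
    rf = make_dummy_runfolder(
        Path(tmp),
        "999999_NB551068_1234_WSCLEANT01",
        ["a_R1_001.fastq.gz", "a_R2_001.fastq.gz"],
    )
    created = sorted(p.name for p in rf.rglob("*.fastq.gz"))
    marker_exists = (rf / "RTAComplete.txt").exists()
```

Using a temporary directory means cleanup happens automatically when the `with` block exits, so no explicit teardown step is needed.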
8 changes: 8 additions & 0 deletions test/data/test_dir_1_fastqs.txt
@@ -0,0 +1,8 @@
TSTRUN01_01_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN01_01_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
TSTRUN01_02_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN01_02_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
TSTRUN01_03_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN01_03_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
TSTRUN01_04_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN01_04_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
8 changes: 8 additions & 0 deletions test/data/test_dir_2_fastqs.txt
@@ -0,0 +1,8 @@
TSTRUN02_01_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN02_01_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
TSTRUN02_02_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN02_02_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
TSTRUN02_03_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN02_03_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
TSTRUN02_04_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
TSTRUN02_04_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
101 changes: 101 additions & 0 deletions test/test_all.py
@@ -0,0 +1,101 @@
import pytest
from pathlib import Path
import shutil
import wscleaner.wscleaner as wscleaner


test_data_dir = Path(str(Path(__file__).parent), "data")


# AUTH: Set DNAnexus authentication for tests
def test_auth(auth_token_file):
    """Test that an authentication token file is passed to pytest as a command line argument"""
    assert auth_token_file is not None


@pytest.fixture
def rfm(monkeypatch):
    """Return an instance of the runfolder manager with the test/data directory
    Monkeypatch is used to overwrite the upload runfolder logfile to the file created
    in the conftest"""
    monkeypatch.setattr(
        wscleaner,
        "upload_runfolder_logdir",
        test_data_dir,
    )
    return wscleaner.RunFolderManager(str(test_data_dir))


@pytest.fixture
def rfm_dry(monkeypatch):
    """Return an instance of the runfolder manager with the test/data directory
    Monkeypatch is used to overwrite the upload runfolder logfile to the file created
    in the conftest"""
    monkeypatch.setattr(
        wscleaner,
        "upload_runfolder_logdir",
        test_data_dir,
    )
    return wscleaner.RunFolderManager(str(test_data_dir), dry_run=True)


class TestRunFolder:
    def test_runfolders_ready(self, data_test_runfolders, rfm):
        """Test that runfolders in the test directory pass checks for deletion. Est. 20 seconds."""
        for runfolder in rfm.find_runfolders(min_age=0):
            assert all(
                [
                    runfolder.dx_project,
                    rfm.check_fastqs(runfolder),
                    rfm.check_logfiles(runfolder, 6),
                    rfm.check_upload_log(runfolder),
                ]
            )

    def test_find_fastqs(self, data_test_runfolders):
        """Tests the correct number of fastqs are present in local and uploaded directories"""
        for runfolder_name, fastq_list_file in data_test_runfolders:
            rf = wscleaner.RunFolder(Path("test/data", runfolder_name))
            with open(fastq_list_file) as f:
                test_folder_fastqs = len(f.readlines())
            assert len(rf.find_fastqs()) == test_folder_fastqs
            assert len(rf.dx_project.find_fastqs()) == test_folder_fastqs

    def test_min_age(self, rfm):
        """test that the runfolder age function records age"""
        runfolders = rfm.find_runfolders(min_age=10)
        # Assert that every runfolder returned exceeds the minimum age
        assert all([rf.age > 10 for rf in runfolders])


# TODO add a class to test the DxProjectRunFolder class
# class TestDxProjectRunFolder:


class TestRunfolderManager:
    def test_find_runfolders(self, data_test_runfolders, rfm):
        """test the runfolder manager directory finding function"""
        rfm_runfolders = rfm.find_runfolders(min_age=0)
        runfolder_names = [str(folder.path.name) for folder in rfm_runfolders]
        test_runfolder_names = [rf for rf, fastq_list_file in data_test_runfolders]
        runfolders_bools = [item in runfolder_names for item in test_runfolder_names]
        assert all(runfolders_bools)

    def test_validate(self, rfm):
        """test the runfoldermanager _validate function correctly reads the path"""
        assert rfm.runfolder_dir.name == Path(str(Path(__file__).parent), "data").name

    def test_delete(self, monkeypatch, rfm):
        """test that the runfolder manager delete call creates the log of deleted files.
        Here, the pytest monkeypatch fixture is used to overwrite the delete function and persist the test directories.
        """
        test_folder = rfm.find_runfolders(min_age=0)[0]
        monkeypatch.setattr(shutil, "rmtree", lambda x: "TEST_DELETED")
        rfm.delete(test_folder)
        assert test_folder.name in rfm.deleted

    def test_dry_run(self, rfm_dry):
        """test that the dry_run option does not cause the test directory to be deleted"""
        test_folder = rfm_dry.find_runfolders(min_age=0)[0]
        rfm_dry.delete(test_folder)
        assert test_folder.name not in rfm_dry.deleted
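
The last two tests rely on one idea: the delete call is only recorded when it actually runs, and the destructive `shutil.rmtree` can be swapped out so that nothing on disk is touched. A self-contained sketch of that idea follows. `MiniManager` is a hypothetical stand-in rather than the wscleaner `RunFolderManager`, and `unittest.mock` is used here in place of the pytest `monkeypatch` fixture so the example runs without pytest.

```python
import shutil
from unittest import mock


class MiniManager:
    """Hypothetical stand-in for a runfolder manager; not the wscleaner class."""

    def __init__(self, dry_run: bool = False):
        self.dry_run = dry_run
        self.deleted = []  # Names of runfolders that were actually removed

    def delete(self, name: str, path: str) -> None:
        if self.dry_run:
            return  # Dry run: nothing is removed and nothing is recorded
        shutil.rmtree(path)
        self.deleted.append(name)


# Patch shutil.rmtree so the live call cannot remove anything during the demo.
with mock.patch.object(shutil, "rmtree") as fake_rmtree:
    live = MiniManager()
    live.delete("run_a", "/nonexistent/run_a")
    dry = MiniManager(dry_run=True)
    dry.delete("run_b", "/nonexistent/run_b")
    rmtree_calls = fake_rmtree.call_count
```

Because `MiniManager.delete` looks up `shutil.rmtree` at call time, patching the attribute on the module is enough; the patch is undone when the `with` block exits.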