Merge pull request #47 from moka-guys/v2.1.0

V2.1.0 (#47)
moka-guys · Jul 5, 2024 · 39371f0
2 parents d4162fd + fb01a12
commit 39371f0
Showing 34 changed files with 671 additions and 578 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,4 @@
 *.pyc
-wscleaner/wscleaner/config.json
+*.log
+.coverage
+.ini
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,11 @@
+Copyright 2024 Synnovis
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except
+in compliance with the License. You may obtain a copy of the License at
+
+[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)
+
+Unless required by applicable law or agreed to in writing, software distributed under the License
+is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+or implied. See the License for the specific language governing permissions and limitations under
+the License.
diff --git a/README.md b/README.md
@@ -1,70 +1,73 @@
 ## Workstation Cleaner (wscleaner)
 
-Workstation Cleaner (wscleaner) deletes local directories that have been uploaded to the DNAnexus cloud storage service.
+The Synnovis Genome Informatics team use a Linux workstation to manage sequencing files. These files are uploaded to the DNAnexus service for storage; however, clearing the workstation is time-intensive. Workstation Cleaner (wscleaner) automates the deletion of local directories that have been uploaded to the DNAnexus cloud storage service.
 
-When executed, Runfolders in the input (root) directory are deleted based on the following criteria:
+A `RunFolderManager` class instantiates an object for each local runfolder, each of which has an associated DNAnexus project object. The manager loops over the runfolders and deletes those that pass all checks. DNAnexus projects are accessed with the dxpy module, a Python wrapper for the DNAnexus API.
 
+## Protocol
+
+When executed, runfolders in the input directory (`--runfolders_dir`) are identified based on:
+* Matching the expected runfolder regex pattern
+
+Runfolders are identified for deletion if they meet the following criteria:
 * A single DNAnexus project is found matching the runfolder name
-* All local FASTQ files are uploaded and in a 'closed' state
-* X logfiles are present in the DNA Nexus project /Logfiles directory (NB X can be added as a command line argument - default is 5)
+* All local FASTQ files are uploaded and in a 'closed' state (TSO runfolders have no local FASTQ files, so this check passes automatically)
+* X logfiles are present in the DNAnexus project `automated_scripts_logfiles` directory (X can be set with the `--logfile-count` command line argument; the default is 6)
+* The runfolder's upload runfolder log file contains no errors
 
-or if the run is identified as a TSO500 run, based on:
-  * the bcl2fastq2_output.log file created by the automated scripts
-  AND
-  * Presence of `_TSO` in the human readable DNANexus project name
+TSO runfolders must meet the following additional criteria to be identified for deletion:
+* Presence of the `bcl2fastq2_output.log` file
+* Presence of `TSO run.` in the bcl2fastq log file
+* Presence of `_TSO` in the human-readable DNAnexus project name
 
-A DNAnexus API key must be cached locally using the `--set-key` option. 
+## Usage
 
-## Workstation Environment
-The directory `env/` in this repository contains conda environment scripts for the workstation. These remove conflicts in the PYTHONPATH environment variable by editing the variable when conda is activated. The conda documentation describes where to place these scripts under ['saving environment variables'](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux).
+The script takes the arguments listed below and can be run in either dry run mode (runfolders are not deleted) or live mode (runfolders are deleted). The script was developed using Python 3.10.6.
 
-## Install
-As described above, on the workstation 2 environments exist - wscleaner and wscleaner_test (for development work).
-You need to activate these environments before installing with pip (as below).
+_**When running on the workstation, the conda environment must be activated prior to running the wscleaner command.**_
 
+```
+usage: __main__.py [-h] --auth_token_file AUTH_TOKEN_FILE [--dry-run] --runfolders_dir RUNFOLDERS_DIR --log_dir LOG_DIR [--min-age MIN_AGE]
+                   [--logfile-count LOGFILE_COUNT] [--version]
 
-```bash
-git clone https://github.com/moka-guys/workstation_housekeeping.git
-pip install workstation_housekeeping/wscleaner
-wscleaner --version # Print version number
+options:
+  -h, --help            show this help message and exit
+  --auth_token_file AUTH_TOKEN_FILE
+                        A text file containing the DNANexus authentication token
+  --dry-run             Perform a dry run without deleting files
+  --runfolders_dir RUNFOLDERS_DIR
+                        A directory containing runfolders to process
+  --log_dir LOG_DIR     Directory to save log file to
+  --min-age MIN_AGE     The age (days) a runfolder must be to be deleted
+  --logfile-count LOGFILE_COUNT
+                        The number of logfiles a runfolder must have in /Logfiles
+  --version             Print version
 ```
 
-## Automated usage
-The script `wscleaner_command.sh` is called by the crontab. This activates the environment and passes the logfile path (and any other non-default arguments).
-A development command script `wscleaner_command_dev.sh` can be used to call the test environment and provide testing arguments, eg --dry-run
 
+### Dry run mode
 
-## Manual Usage
+For example, if running in dry run mode:
 
 ```
-usage: wscleaner [-h] [--auth AUTH] [--dry-run] [--logfile LOGFILE]
-                 [--min-age MIN_AGE] [--logfile-count LOGFILE_COUNT]
-                 [--version]
-                 root
+conda activate python3.10.6 && python3 -m wscleaner --dry-run --runfolders_dir $RUNFOLDERS_DIR --auth_token_file $AUTH_TOKEN_FILEPATH --log_dir $LOG_DIR
+```
 
-positional arguments:
-  root                  A directory containing runfolders to process
+### Live mode
+
+If running in live mode:
 
-optional arguments:
-  -h, --help            show this help message and exit
-  --auth AUTH           A text file containing the DNANexus authentication
-                        token
-  --dry-run             Perform a dry run without deleting files
-  --logfile LOGFILE     A path for the application logfile
-  --min-age MIN_AGE     The age (days) a runfolder must be to be deleted
-  --logfile-count LOGFILE_COUNT
-                        The number of logfiles a runfolder must have in
-                        /Logfiles
-  --version             Print version
 ```
+conda activate python3.10.6 && python3 -m wscleaner --runfolders_dir $RUNFOLDERS_DIR --auth_token_file $AUTH_TOKEN_FILEPATH --log_dir $LOG_DIR
+```
+
+## Testing
 
-## Test
+All tests should be run, and should pass, before any new release.
 
 ```bash
-# Run from the cloned repo directory after installation
-pytest . --auth_token DNA_NEXUS_KEY
+python3 -m pytest -v --auth_token_file=$FULL_PATH_TO_FILE_CONTAINING_AUTH_TOKEN
 ```
 
-## License
 
-Developed by Synnovis Genome Informatics
+### Developed by Synnovis Genome Informatics
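The deletion criteria listed in the README's Protocol section can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not the wscleaner implementation: `RunfolderChecks` and `ready_for_deletion` are hypothetical names, the real checks query DNAnexus through dxpy, and "X logfiles are present" is read here as "at least X".

```python
from dataclasses import dataclass


@dataclass
class RunfolderChecks:
    """Hypothetical record of the check results for one runfolder."""

    single_dx_project: bool  # exactly one DNAnexus project matches the runfolder name
    fastqs_uploaded: bool  # all local FASTQ files are uploaded and 'closed'
    logfile_count: int  # logfiles found in the DNAnexus project
    upload_log_clean: bool  # upload runfolder log file contains no errors
    is_tso: bool = False  # whether the runfolder is a TSO run
    bcl2fastq_log_present: bool = False  # bcl2fastq2_output.log exists
    bcl2fastq_log_has_tso_line: bool = False  # log contains 'TSO run.'
    project_name_has_tso: bool = False  # '_TSO' in the project name


def ready_for_deletion(checks: RunfolderChecks, required_logfiles: int = 6) -> bool:
    """Return True only if every applicable README criterion is met."""
    common = (
        checks.single_dx_project
        # Assumption: "X logfiles are present" is read as "at least X".
        and checks.logfile_count >= required_logfiles
        and checks.upload_log_clean
    )
    if checks.is_tso:
        # TSO runfolders have no local FASTQs, so the FASTQ check is skipped
        # and the three TSO-specific criteria are applied instead.
        return (
            common
            and checks.bcl2fastq_log_present
            and checks.bcl2fastq_log_has_tso_line
            and checks.project_name_has_tso
        )
    return common and checks.fastqs_uploaded


standard = RunfolderChecks(True, True, 6, True)
tso_missing_log = RunfolderChecks(True, False, 6, True, is_tso=True, project_name_has_tso=True)
print(ready_for_deletion(standard))  # True
print(ready_for_deletion(tso_missing_log))  # False
```

Keeping the decision separate from the code that gathers the check results would allow the criteria to be unit tested without a DNAnexus connection.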
diff --git a/wscleaner/wscleaner/__init__.py → __init__.py b/wscleaner/wscleaner/__init__.py → __init__.py
diff --git a/pytest.ini b/pytest.ini
@@ -0,0 +1,2 @@
+[pytest]
+addopts = -v --cov=. --cov-report term-missing
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,8 @@
+argcomplete==3.4.0
+certifi==2024.7.4
+dxpy==0.378.0
+psutil==6.0.0
+python-dateutil==2.9.0.post0
+six==1.16.0
+urllib3==2.1.0
+websocket-client==1.7.0
diff --git a/settings.json b/settings.json
@@ -0,0 +1,36 @@
+{
+    "python.testing.pytestArgs": [
+        "."
+    ],
+    "python.testing.unittestEnabled": false,
+    "python.testing.pytestEnabled": true,
+    "python.envFile": "${workspaceFolder}/.venv",
+    "python.analysis.extraPaths": [],
+    "editor.formatOnSaveMode": "file",
+    "editor.formatOnSave": true,
+    "editor.codeActionsOnSave": {
+        "source.organizeImports": "explicit"
+    },
+    "[python]": {
+        "editor.defaultFormatter": "ms-python.black-formatter",
+        "editor.formatOnSave": true,
+        "editor.codeActionsOnSave": {
+            "source.organizeImports": "explicit"
+        }
+    },
+    "isort.args": [
+        "--profile",
+        "black"
+    ],
+    "flake8.args": [
+        "--max-line-length=120"
+    ],
+    "pylint.args": [
+        "--max-line-length=120"
+    ],
+    "black-formatter.args": [
+        "--line-length",
+        "120"
+    ],
+    "python.analysis.typeCheckingMode": "basic"
+}
diff --git a/test/conftest.py b/test/conftest.py
@@ -0,0 +1,86 @@
+"""conftest.py
+
+Config for pytest.
+"""
+
+import os
+import pytest
+import shutil
+import dxpy
+from pathlib import Path
+
+PROJECT_DIR = str(Path(__file__).absolute().parent.parent)  # Project working directory
+DATA_DIR = os.path.join(PROJECT_DIR, "test/data/")
+
+
+def pytest_addoption(parser):
+    """Add command line options to pytest"""
+    parser.addoption(
+        "--auth_token_file",
+        action="store",
+        default=None,
+        required=True,
+        help="File containing DNANexus authentication key",
+    )
+
+
+@pytest.fixture
+def auth_token_file(request):
+    """Create pytest fixture to return auth token file from the command line arg"""
+    return request.config.getoption("--auth_token_file")
+
+
+@pytest.fixture(scope="session")
+def data_test_runfolders():
+    """A fixture that returns a list of tuples containing (runfolder_name, fastq_list_file)."""
+    return [
+        (
+            "999999_NB551068_1234_WSCLEANT01",
+            os.path.join(DATA_DIR, "test_dir_1_fastqs.txt"),
+        ),
+        (
+            "999999_NB551068_1234_WSCLEANT02",
+            os.path.join(DATA_DIR, "test_dir_2_fastqs.txt"),
+        ),
+    ]
+
+
+@pytest.fixture(scope="function", autouse=True)
+def create_test_dirs(data_test_runfolders, auth_token_file, request, monkeypatch):
+    """Create test data for testing.
+
+    This is an autouse fixture with function scope, meaning it is run once per test
+    """
+    for runfolder_name, fastq_list_file in data_test_runfolders:
+        # Create the runfolder directory as per Illumina spec
+        runfolder_path = os.path.join(DATA_DIR, runfolder_name)
+        fastqs_path = os.path.join(
+            PROJECT_DIR, f"{runfolder_path}/Data/Intensities/BaseCalls"
+        )
+        Path(fastqs_path).mkdir(parents=True, exist_ok=True)
+        # Create dummy logfile
+        # open(upload_runfolder_logfile, 'w').close()
+        # Generate empty fastqfiles in runfolder
+        with open(fastq_list_file) as f:
+            fastq_list = f.read().splitlines()
+            for fastq_file in fastq_list:
+                Path(fastqs_path, fastq_file).touch(mode=0o777, exist_ok=True)
+        open(
+            os.path.join(runfolder_path, "RTAComplete.txt"), "w"
+        ).close()  # Create RTAComplete file
+        open(
+            f"{runfolder_path}_upload_runfolder.log", "w"
+        ).close()  # Create dummy upload runfolder log file
+        with open(
+            auth_token_file
+        ) as f:  # Setup dxpy authentication token read from command line file
+            auth_token = f.read().rstrip()
+        dxpy.set_security_context(
+            {"auth_token_type": "Bearer", "auth_token": auth_token}
+        )
+
+    yield  # Where the testing happens
+    # TEARDOWN - cleanup after each test
+    for runfolder_name, fastq_list_file in data_test_runfolders:
+        runfolder_path = os.path.join(PROJECT_DIR, f"test/data/{runfolder_name}")
+        shutil.rmtree(runfolder_path)
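The `mode` argument of `Path.touch` in the fixture above is interpreted as permission bits, so it is conventionally written as an octal literal. The short sketch below uses only the standard library `stat` module to show how a decimal `777` differs from an octal `0o777`; `permission_string` is a hypothetical helper introduced for illustration.

```python
import stat


def permission_string(mode: int) -> str:
    """Render the permission bits of `mode` as `ls -l` would for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


print(oct(777))  # 0o1411: decimal 777 is not the same bit pattern as 0o777
print(permission_string(0o777))  # -rwxrwxrwx
print(permission_string(777))  # -r----x--t
```

The permissions actually applied to a newly created file are further restricted by the process umask.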
diff --git a/test/data/test_dir_1_fastqs.txt b/test/data/test_dir_1_fastqs.txt
@@ -0,0 +1,8 @@
+TSTRUN01_01_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN01_01_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
+TSTRUN01_02_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN01_02_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
+TSTRUN01_03_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN01_03_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
+TSTRUN01_04_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN01_04_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
diff --git a/test/data/test_dir_2_fastqs.txt b/test/data/test_dir_2_fastqs.txt
@@ -0,0 +1,8 @@
+TSTRUN02_01_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN02_01_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
+TSTRUN02_02_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN02_02_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
+TSTRUN02_03_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN02_03_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
+TSTRUN02_04_000000_000000_TEST_Pan5180_S1_R1_001.fastq.gz
+TSTRUN02_04_000000_000000_TEST_Pan5180_S1_R2_001.fastq.gz
diff --git a/test/test_all.py b/test/test_all.py
@@ -0,0 +1,101 @@
+import pytest
+from pathlib import Path
+import shutil
+import wscleaner.wscleaner as wscleaner
+
+
+test_data_dir = Path(str(Path(__file__).parent), "data")
+
+
+# AUTH: Set DNAnexus authentication for tests
+def test_auth(auth_token_file):
+    """Test that an authentication token file is passed to pytest as a command line argument"""
+    assert auth_token_file is not None
+
+
+@pytest.fixture
+def rfm(monkeypatch):
+    """Return an instance of the runfolder manager with the test/data directory
+    Monkeypatch is used to overwrite the upload runfolder logfile to the file created
+    in the conftest"""
+    monkeypatch.setattr(
+        wscleaner,
+        "upload_runfolder_logdir",
+        test_data_dir,
+    )
+    return wscleaner.RunFolderManager(str(test_data_dir))
+
+
+@pytest.fixture
+def rfm_dry(monkeypatch):
+    """Return an instance of the runfolder manager with the test/data directory
+    Monkeypatch is used to overwrite the upload runfolder logfile to the file created
+    in the conftest"""
+    monkeypatch.setattr(
+        wscleaner,
+        "upload_runfolder_logdir",
+        test_data_dir,
+    )
+    return wscleaner.RunFolderManager(str(test_data_dir), dry_run=True)
+
+
+class TestRunFolder:
+    def test_runfolders_ready(self, data_test_runfolders, rfm):
+        """Test that runfolders in the test directory pass checks for deletion. Est. 20 seconds."""
+        for runfolder in rfm.find_runfolders(min_age=0):
+            assert all(
+                [
+                    runfolder.dx_project,
+                    rfm.check_fastqs(runfolder),
+                    rfm.check_logfiles(runfolder, 6),
+                    rfm.check_upload_log(runfolder),
+                ]
+            )
+
+    def test_find_fastqs(self, data_test_runfolders):
+        """Tests the correct number of fastqs are present in local and uploaded directories"""
+        for runfolder_name, fastq_list_file in data_test_runfolders:
+            rf = wscleaner.RunFolder(Path("test/data", runfolder_name))
+            with open(fastq_list_file) as f:
+                test_folder_fastqs = len(f.readlines())
+            assert len(rf.find_fastqs()) == test_folder_fastqs
+            assert len(rf.dx_project.find_fastqs()) == test_folder_fastqs
+
+    def test_min_age(self, rfm):
+        """test that the runfolder age function records age"""
+        runfolders = rfm.find_runfolders(min_age=10)
+        # Assert that this test runfolder was recently generated
+        assert all([rf.age > 10 for rf in runfolders])
+
+
+# TODO add a class to test the DxProjectRunFolder class
+# class TestDxProjectRunFolder:
+
+
+class TestRunfolderManager:
+    def test_find_runfolders(self, data_test_runfolders, rfm):
+        """test the runfolder manager directory finding function"""
+        rfm_runfolders = rfm.find_runfolders(min_age=0)
+        runfolder_names = [str(folder.path.name) for folder in rfm_runfolders]
+        test_runfolder_names = [rf for rf, fastq_list_file in data_test_runfolders]
+        runfolders_bools = [item in runfolder_names for item in test_runfolder_names]
+        assert all(runfolders_bools)
+
+    def test_validate(self, rfm):
+        """test the runfoldermanager _validate function correctly reads the path"""
+        assert rfm.runfolder_dir.name == Path(str(Path(__file__).parent), "data").name
+
+    def test_delete(self, monkeypatch, rfm):
+        """test that the runfolder manager delete call creates the log of deleted files.
+        Here, the pytest monkeypatch fixture is used to overwrite the delete function and persist the test directories.
+        """
+        test_folder = rfm.find_runfolders(min_age=0)[0]
+        monkeypatch.setattr(shutil, "rmtree", lambda x: "TEST_DELETED")
+        rfm.delete(test_folder)
+        assert test_folder.name in rfm.deleted
+
+    def test_dry_run(self, rfm_dry):
+        """test that the dry_run option does not cause the test directory to be deleted"""
+        test_folder = rfm_dry.find_runfolders(min_age=0)[0]
+        rfm_dry.delete(test_folder)
+        assert test_folder.name not in rfm_dry.deleted
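The pattern used by `test_delete` and `test_dry_run` above (replacing `shutil.rmtree` so that test directories persist, then inspecting a `deleted` record) can be tried in isolation. The sketch below is self-contained and makes several assumptions: `MiniManager` is a hypothetical stand-in rather than the real `RunFolderManager`, and `unittest.mock` is used in place of the pytest `monkeypatch` fixture so the example runs without pytest.

```python
import shutil
import tempfile
from pathlib import Path
from unittest import mock


class MiniManager:
    """Hypothetical stand-in for RunFolderManager; it only tracks deletions."""

    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.deleted = []

    def delete(self, folder):
        """Delete `folder` and record its name, unless this is a dry run."""
        if self.dry_run:
            return
        shutil.rmtree(folder)
        self.deleted.append(Path(folder).name)


def demo():
    """Return (live deletions, dry-run deletions, whether the folder still exists)."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp, "999999_NB551068_1234_WSCLEANT01")
        folder.mkdir()
        # Replace shutil.rmtree with a no-op so the directory persists.
        with mock.patch.object(shutil, "rmtree", lambda path: "TEST_DELETED"):
            live = MiniManager()
            live.delete(folder)
        dry = MiniManager(dry_run=True)
        dry.delete(folder)
        return live.deleted, dry.deleted, folder.exists()


print(demo())  # (['999999_NB551068_1234_WSCLEANT01'], [], True)
```

Because the patch is looked up on the `shutil` module at call time, it only takes effect if the code under test calls `shutil.rmtree` through the module, as `MiniManager.delete` does here.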