This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

Merge pull request #18 from moka-guys/development
Add wscleaner v1.0
Graeme-Smith authored Oct 28, 2019
2 parents 81c7ec6 + 38f37be commit 3234bd5
Showing 14 changed files with 780 additions and 5 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
*.pyc
*.egg-info
wscleaner/wscleaner/config.json
wscleaner/test/test_dir*.txt
wscleaner/test/data
39 changes: 34 additions & 5 deletions README.md
@@ -1,20 +1,25 @@
# Workstation Housekeeping v1.4
# Workstation Housekeeping v1.5

Scripts to manage data on the NGS workstation

---

## backup_runfolder.py

Uploads an Illumina runfolder to DNANexus.

### Quickstart
```
usage: backup_runfolder.py [-h] -i RUNFOLDER [-a AUTH_TOKEN] [--ignore IGNORE] [-p PROJECT] [--logpath LOGPATH]
### Usage

```bash
backup_runfolder.py [-h] -i RUNFOLDER [-a AUTH_TOKEN] [--ignore IGNORE] [-p PROJECT] [--logpath LOGPATH]
```

### What are the dependencies for this script?

This tool requires the DNAnexus utilities `ua` (upload agent) and `dx` (DNAnexus toolkit) to be available in the system PATH. Python3 is required, and this tool uses packages from the standard library.

### How does this tool work?

* The script parses the input parameters, asserting that the given runfolder exists.
* If the `-p` option is given, the script attempts to find a matching DNAnexus project. Otherwise, it looks for a single project matching the runfolder name. If more or fewer than one project matches, the script logs an error and exits.
* The runfolder is traversed and a list of files in each folder is obtained. If any of the comma-separated strings passed to the `--ignore` argument is present within the filepath or filename, the file is excluded.
@@ -26,14 +31,38 @@
* (If relevant) A count of files in the DNAnexus project containing a pattern to be ignored. NB: this may not be accurate if the ignore term is found in the result of `dx find data` (e.g. present in the project name).
* Logs from this and the script are written to a logfile, named after the runfolder. A destination for this file can be passed to the `--logpath` flag.
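
The `--ignore` filtering described above can be sketched as follows. This is a minimal illustration, not the code in `backup_runfolder.py` (which is not shown in this commit): the function names `split_ignore` and `filter_ignored` are hypothetical, and plain substring matching against the whole path is assumed from the description.

```python
def split_ignore(ignore):
    """Split a comma-separated --ignore value into non-empty terms."""
    if not ignore:
        return []
    return [term.strip() for term in ignore.split(",") if term.strip()]


def filter_ignored(filepaths, ignore):
    """Return (kept, excluded); a path is excluded if any term occurs anywhere in it."""
    terms = split_ignore(ignore)
    kept, excluded = [], []
    for path in filepaths:
        text = str(path)
        # Assumption: a simple substring test on the full path (directory and filename)
        if any(term in text for term in terms):
            excluded.append(path)
        else:
            kept.append(path)
    return kept, excluded


if __name__ == "__main__":
    files = ["run/Data/a.bcl", "run/RunInfo.xml", "run/Thumbnail_Images/x.jpg"]
    kept, excluded = filter_ignored(files, "Thumbnail_Images,.bcl")
    print("upload:", kept)
    print("skip:", excluded)
```

Because the match is a substring test, a short ignore term can exclude more files than intended, which is consistent with the caveat about ignore terms appearing in project names.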

---

## findfastqs.sh

Report the number of gzipped fastq files in an Illumina runfolder.

### Usage
```

```bash
$ findfastqs.sh RUNFOLDER
>>> RUNFOLDER has 156 demultiplexed fastq files with 2 undetermined. Total: 158
```
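
For readers who prefer Python, the same report could be produced roughly as below. This is a sketch only: `findfastqs.sh` itself is not part of this diff, the function names are hypothetical, and it assumes undetermined reads are written to files whose names start with `Undetermined`.

```python
import tempfile
from pathlib import Path


def count_fastqs(runfolder):
    """Return (demultiplexed, undetermined) counts of *.fastq.gz files under runfolder."""
    fastqs = [p for p in Path(runfolder).rglob("*.fastq.gz") if p.is_file()]
    # Assumption: undetermined reads live in files named Undetermined*.fastq.gz
    undetermined = sum(1 for p in fastqs if p.name.startswith("Undetermined"))
    return len(fastqs) - undetermined, undetermined


def report(runfolder):
    """Format the counts in the same style as the findfastqs.sh output shown above."""
    demultiplexed, undetermined = count_fastqs(runfolder)
    total = demultiplexed + undetermined
    return (f"{Path(runfolder).name} has {demultiplexed} demultiplexed fastq files "
            f"with {undetermined} undetermined. Total: {total}")


if __name__ == "__main__":
    # Build a throwaway runfolder so the example runs anywhere
    with tempfile.TemporaryDirectory() as tmp:
        basecalls = Path(tmp, "RUNFOLDER", "Data", "Intensities", "BaseCalls")
        basecalls.mkdir(parents=True)
        for name in ("S1_R1.fastq.gz", "S1_R2.fastq.gz", "Undetermined_S0_R1.fastq.gz"):
            (basecalls / name).touch()
        print(report(Path(tmp, "RUNFOLDER")))
```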

---

## Workstation Cleaner (wscleaner)

Delete local directories that have been uploaded to the DNAnexus cloud storage service.

### Install

```bash
git clone https://github.com/moka-guys/workstation_housekeeping.git
pip install workstation_housekeeping/wscleaner
wscleaner --version # Print version number
```

### Usage

```bash
wscleaner --set-key DNA_NEXUS_KEY # Cache dnanexus api key
wscleaner ROOT_DIRECTORY --logfile LOGFILE_PATH
```

---
32 changes: 32 additions & 0 deletions wscleaner/DESIGN.md
@@ -0,0 +1,32 @@
# Workstation Cleaner Design Document

Owner: Nana Mensah
Date: 30/05/19
Status: Draft

## Brief

The Viapath Genome Informatics team uses a Linux workstation to manage sequencing files. These files are uploaded to the DNAnexus service for storage; however, clearing the workstation afterwards is time-intensive.

## User Story

As a Clinical Bioinformatician, I need to automate the deletion of sequencing folders that have been successfully backed up, so that I can free up time for other duties.

## Functional requirements

FR1. Accurately detect sequencing folders that have been successfully backed up
FR2. Delete old sequencing folders that have been successfully backed up
FR3. Log all activity to a local logfile

## Non-functional requirements

NF1. Run from the Linux command line
NF2. Process runfolders within 24 hours
NF3. Use any available DNAnexus SDKs
NF4. Attempt to process all folders at least once

## Design Summary

A RunFolderManager class will instantiate objects for local runfolders, each of which has an associated DNAnexus project object. The manager loops over the runfolders and deletes each one if all checks pass.

DNAnexus projects are accessed with the dxpy module, a Python wrapper for the DNAnexus API. Credentials are cached locally using the command-line option `--set-key`.
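
The manager loop in the design summary can be sketched as below. This is an illustration under stated assumptions, not the contents of `wscleaner/lib.py` (which this page does not show): the `...Sketch` class names are hypothetical, the checks are injected as plain callables instead of querying DNAnexus through dxpy, and the `deleted` list and `dry_run` flag only mirror names that appear in the tests.

```python
import shutil
import tempfile
from pathlib import Path


class RunFolderSketch:
    """A local runfolder; dx_project stands in for the associated DNAnexus project object."""

    def __init__(self, path, dx_project=None):
        self.path = Path(path)
        self.name = self.path.name
        self.dx_project = dx_project


class RunFolderManagerSketch:
    """Loop over runfolders under root and delete those that pass every check."""

    def __init__(self, root, checks, dry_run=False):
        self.root = Path(root)
        self.checks = checks      # hypothetical: callables taking a runfolder, returning bool
        self.dry_run = dry_run
        self.deleted = []         # names of runfolders actually removed

    def find_runfolders(self):
        return [RunFolderSketch(p) for p in sorted(self.root.iterdir()) if p.is_dir()]

    def delete(self, runfolder):
        if self.dry_run:
            return                # a dry run never touches the filesystem
        shutil.rmtree(runfolder.path)
        self.deleted.append(runfolder.name)

    def run(self):
        for runfolder in self.find_runfolders():
            if all(check(runfolder) for check in self.checks):
                self.delete(runfolder)
        return self.deleted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "uploaded_run").mkdir()
        Path(tmp, "pending_run").mkdir()
        manager = RunFolderManagerSketch(tmp, checks=[lambda rf: rf.name.startswith("uploaded")])
        print(manager.run())
```

Passing the checks in as callables keeps the loop testable without network access; a real implementation would presumably replace them with project, FASTQ and logfile checks against DNAnexus.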
57 changes: 57 additions & 0 deletions wscleaner/README.md
@@ -0,0 +1,57 @@
# Workstation Cleaner

Workstation Cleaner (wscleaner) deletes local directories that have been uploaded to the DNAnexus cloud storage service.

When executed, runfolders in the input (root) directory are deleted only if all of the following criteria are met:

* A single DNAnexus project is found matching the runfolder name
* All local FASTQ files are uploaded and in a 'closed' state
* Six logfiles are present in the DNAnexus project's `/Logfiles` directory

A DNAnexus API key must be cached locally using the `--set-key` option.
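
The three deletion criteria can be expressed as a single predicate. The sketch below is illustrative only: the function name and its arguments are hypothetical, the inputs are plain Python values in place of dxpy query results, and it assumes that exactly six logfiles are required.

```python
def ready_for_deletion(matching_projects, local_fastqs, remote_fastq_states,
                       logfile_count, required_logfiles=6):
    """Return True only if all three deletion criteria hold.

    matching_projects   -- names of DNAnexus projects matching the runfolder name
    local_fastqs        -- FASTQ filenames found in the local runfolder
    remote_fastq_states -- dict mapping uploaded filename to its state, e.g. 'closed'
    logfile_count       -- number of files in the project's /Logfiles directory
    """
    # Criterion 1: exactly one matching project
    if len(matching_projects) != 1:
        return False
    # Criterion 2: every local FASTQ is uploaded and in the 'closed' state
    if not all(remote_fastq_states.get(name) == "closed" for name in local_fastqs):
        return False
    # Criterion 3 (assumption: exactly six logfiles, not "at least six")
    return logfile_count == required_logfiles


if __name__ == "__main__":
    states = {"a_R1.fastq.gz": "closed", "a_R2.fastq.gz": "closing"}
    print(ready_for_deletion(["002_run"], ["a_R1.fastq.gz", "a_R2.fastq.gz"], states, 6))
```

Note that a runfolder with no local FASTQ files passes the second criterion trivially in this sketch; whether that should block deletion is a policy decision the README does not state.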

## Install

```bash
git clone https://github.com/moka-guys/workstation_housekeeping.git
pip install workstation_housekeeping/wscleaner
wscleaner --version # Print version number
```

## Quickstart

```bash
wscleaner --set-key DNA_NEXUS_KEY # Cache dnanexus api key
wscleaner ROOT_DIRECTORY
```

## Usage

```
wscleaner [-h] [--set-key SET_KEY] [--print-key] [--dry-run]
          [--logfile LOGFILE] [--min-age MIN_AGE] [--version]
          root

positional arguments:
  root               A directory containing runfolders to process

optional arguments:
  -h, --help         show this help message and exit
  --set-key SET_KEY  Cache a DNA Nexus API key
  --print-key        Print the cached DNA Nexus API key
  --dry-run          Perform a dry run without deleting files
  --logfile LOGFILE  A path for the application logfile
  --min-age MIN_AGE  The age (days) a runfolder must be to be deleted
  --version          Print version
```
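
An `argparse` parser that accepts the same options might look like the sketch below. It is not the `cli_parser` from `wscleaner.main`, which this page does not show. The integer type for `--min-age`, the absence of defaults, and the version string are assumptions. The tests also suggest that the real `--set-key` uses a custom action (`SetKeyAction`) that caches the key and exits, which this sketch does not reproduce.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage text above (types and defaults are assumptions)."""
    parser = argparse.ArgumentParser(prog="wscleaner")
    parser.add_argument("root", help="A directory containing runfolders to process")
    parser.add_argument("--set-key", help="Cache a DNA Nexus API key")
    parser.add_argument("--print-key", action="store_true",
                        help="Print the cached DNA Nexus API key")
    parser.add_argument("--dry-run", action="store_true",
                        help="Perform a dry run without deleting files")
    parser.add_argument("--logfile", help="A path for the application logfile")
    # Assumption: age is a whole number of days; no default is invented here
    parser.add_argument("--min-age", type=int,
                        help="The age (days) a runfolder must be to be deleted")
    # Assumption: version string format
    parser.add_argument("--version", action="version", version="%(prog)s 1.0",
                        help="Print version")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["/data/runfolders", "--dry-run", "--min-age", "30"])
    print(args.root, args.dry_run, args.min_age)
```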

## Test

```bash
# Run from the cloned repo directory after installation
pytest . --auth_token DNA_NEXUS_KEY
```

## License

Developed by Viapath Genome Informatics
24 changes: 24 additions & 0 deletions wscleaner/setup.py
@@ -0,0 +1,24 @@
from setuptools import setup, find_packages

setup(name='wscleaner',
      version='1.0',
      # Adjacent string literals avoid embedding continuation-line whitespace in the description
      description=('Package to remove uploaded runfolders from '
                   'the Viapath Genome Informatics NGS workstation'),
      url='https://github.com/NMNS93/wscleaner',
      author='Nana Mensah',
      author_email='[email protected]',
      license='MIT',
      packages=find_packages(),
      zip_safe=False,

      python_requires = '>=3.6.8',
      install_requires = ['docutils>=0.3', 'dxpy==0.279.0', 'pytest==4.4.0', 'pytest-cov==2.6.1',
                          'Sphinx==2.0.1', 'psutil==5.6.1'],

      package_data = {},

      entry_points={
          'console_scripts': ['wscleaner = wscleaner.main:main']
      }

      )
39 changes: 39 additions & 0 deletions wscleaner/test/conftest.py
@@ -0,0 +1,39 @@
"""conftest.py
Config for pytest.
"""
import pytest
import pathlib

def pytest_addoption(parser):
    """Add command line options to pytest"""
    parser.addoption("--auth_token", action="store", default=None, help="A DNANexus authentication key")

@pytest.fixture
def auth_token(request):
    """Create pytest fixture from command line argument for authentication token"""
    return request.config.getoption("--auth_token")

@pytest.fixture(scope="session")
def data_test_runfolders():
    """A fixture that returns a list of tuples containing (runfolder_name, fastq_list_file)."""
    return [
        ('190408_NB551068_0234_AHJ7MTAFXY_NGS265B', 'test/test_dir_1_fastqs.txt'),
        ('190410_NB551068_0235_AHKGMGAFXY_NGS265C', 'test/test_dir_2_fastqs.txt')
    ]

@pytest.fixture(scope="session", autouse=True)
def create_test_dirs(request, data_test_runfolders):
    """Create test data for testing.
    This is an autouse fixture with session scope, meaning it runs once per test session, before the first test executes.
    """
    for runfolder_name, fastq_list_file in data_test_runfolders:
        # Create the runfolder directory as per Illumina spec
        test_path = f'test/data/{runfolder_name}/Data/Intensities/BaseCalls'
        pathlib.Path(test_path).mkdir(parents=True, exist_ok=True)
        # Generate empty fastq files in runfolder
        with open(fastq_list_file) as f:
            fastq_list = f.read().splitlines()
        for fastq_file in fastq_list:
            # File modes are octal: 0o777, not the decimal literal 777
            pathlib.Path(test_path, fastq_file).touch(mode=0o777, exist_ok=True)
21 changes: 21 additions & 0 deletions wscleaner/test/coverage.txt
@@ -0,0 +1,21 @@
============================= test session starts ==============================
platform linux -- Python 3.6.8, pytest-4.4.0, py-1.8.0, pluggy-0.9.0
rootdir: /home/nana/Documents/MOKAGUYS/wscleaner
plugins: cov-2.6.1
collected 9 items

test/test_all.py ......... [100%]

----------- coverage: platform linux, python 3.6.8-final-0 -----------
Name                           Stmts   Miss  Cover
--------------------------------------------------
wscleaner/__init__.py              0      0   100%
wscleaner/auth.py                 35     14    60%
wscleaner/lib.py                 101      6    94%
wscleaner/main.py                 43     26    40%
wscleaner/mokaguys_logger.py      10      5    50%
--------------------------------------------------
TOTAL                            189     51    73%


========================== 9 passed in 44.68 seconds ===========================
30 changes: 30 additions & 0 deletions wscleaner/test/generate.py
@@ -0,0 +1,30 @@
"""generate.py
Generates dummy data for testing.
"""

import pathlib

def data_test_runfolders():
    """Return a list of tuples containing (runfolder_name, fastq_list_file)."""
    return [
        ('190408_NB551068_0234_AHJ7MTAFXY_NGS265B', 'test/test_dir_1_fastqs.txt'),
        ('190410_NB551068_0235_AHKGMGAFXY_NGS265C', 'test/test_dir_2_fastqs.txt')
    ]

def create_test_dirs(test_data):
    """Create test data for testing.
    Unlike the conftest.py fixture of the same name, this is a plain function called once when the script runs.
    """
    for runfolder_name, fastq_list_file in test_data:
        # Create the runfolder directory as per Illumina spec
        test_path = f'test/data/{runfolder_name}/Data/Intensities/BaseCalls'
        pathlib.Path(test_path).mkdir(parents=True, exist_ok=True)
        # Generate empty fastq files in runfolder
        with open(fastq_list_file) as f:
            fastq_list = f.read().splitlines()
        for fastq_file in fastq_list:
            # File modes are octal: 0o777, not the decimal literal 777
            pathlib.Path(test_path, fastq_file).touch(mode=0o777, exist_ok=True)

create_test_dirs(data_test_runfolders())
109 changes: 109 additions & 0 deletions wscleaner/test/test_all.py
@@ -0,0 +1,109 @@
import pytest
import dxpy
from pathlib import Path
import argparse
import json
import sys
import shutil

from pkg_resources import resource_filename
from wscleaner.auth import SetKeyAction, dx_set_auth, CONFIG_FILE
from wscleaner.main import cli_parser
from wscleaner.lib import RunFolderManager, RunFolder

# AUTH: Set DNAnexus authentication for tests
def test_auth(auth_token):
    """Test that an authentication token is passed to pytest as a command line argument"""
    assert auth_token is not None

@pytest.fixture(autouse=True)
def set_auth(auth_token):
    """Set the authentication token for all subsequent tests"""
    dx_set_auth(auth_token)


# FIXTURES: Define functions to use in downstream tests
@pytest.fixture
def rfm():
    """Return an instance of the runfolder manager with the test/data directory"""
    test_path = Path(str(Path(__file__).parent), 'data')
    rfm = RunFolderManager(str(test_path))
    return rfm

@pytest.fixture
def rfm_dry():
    """Return a dry-run instance of the runfolder manager with the test/data directory"""
    test_path = Path(str(Path(__file__).parent), 'data')
    rfm_dry = RunFolderManager(str(test_path), dry_run=True)
    return rfm_dry

# TESTS
class TestAuth:
    def test_set_auth(self, auth_token):
        """test that the authentication token is set correctly"""
        authobj = dx_set_auth(auth_token)
        assert dxpy.SECURITY_CONTEXT['auth_token'] == auth_token

    def test_setkey(self, monkeypatch, auth_token):
        """test that the --set-key command-line argument caches the authentication token"""
        # Set setkey cli arguments
        sys.argv = ['python', 'wscleaner', '--set-key', auth_token]
        # Mock Action object
        # Parse args
        with pytest.raises(SystemExit) as err:
            args = cli_parser()
        # Make assertions on created config file
        fn = resource_filename('wscleaner',CONFIG_FILE)
        with open(fn, 'r') as f:
            assert auth_token in f.read()
        # Delete temp config
        Path(fn).unlink()

class TestFolders:
    def test_runfolders_ready(self, data_test_runfolders, rfm):
        """Test that runfolders in the test directory pass checks for deletion. Est. 20 seconds."""
        for runfolder in rfm.find_runfolders(min_age=0):
            assert all([runfolder.dx_project, rfm.check_fastqs(runfolder), rfm.check_logfiles(runfolder)])

    def test_find_fastqs(self, data_test_runfolders):
        """Tests the correct number of fastqs are present in local and uploaded directories"""
        for runfolder_name, fastq_list_file in data_test_runfolders:
            rf = RunFolder(Path('test/data', runfolder_name))
            with open(fastq_list_file) as f:
                test_folder_fastqs = len(f.readlines())
            assert len(rf.find_fastqs()) == test_folder_fastqs
            assert len(rf.dx_project.find_fastqs()) == test_folder_fastqs

    def test_min_age(self, rfm):
        """test that the runfolder age function records age"""
        runfolders = rfm.find_runfolders(min_age=10)
        # Assert that every runfolder returned is older than min_age
        # (recently generated test runfolders are expected to be filtered out, so the list may be empty)
        assert all([ rf.age > 10 for rf in runfolders ])

class TestRFM:
    def test_find_runfolders(self, data_test_runfolders, rfm):
        """test the runfolder manager directory finding function"""
        rfm_runfolders = rfm.find_runfolders(min_age=0)
        runfolder_names = [str(folder.path.name) for folder in rfm_runfolders]
        test_runfolder_names = [ rf for rf, fastq_list_file in data_test_runfolders ]
        runfolders_bools = [ item in runfolder_names for item in test_runfolder_names ]
        assert all(runfolders_bools)

    def test_validate(self, rfm):
        """test the runfoldermanager _validate function correctly reads the path"""
        assert rfm.root.name == Path(str(Path(__file__).parent), 'data').name

    def test_delete(self, monkeypatch, rfm):
        """test that the runfolder manager delete call creates the log of deleted files.
        Here, the pytest monkeypatch fixture is used to overwrite the delete function and persist the test directories.
        """
        test_folder = rfm.find_runfolders(min_age=0)[0]
        monkeypatch.setattr(shutil, 'rmtree', lambda x: 'TEST_DELETED')
        rfm.delete(test_folder)
        assert test_folder.name in rfm.deleted

    def test_dry_run(self, rfm_dry):
        """test that the dry_run option does not cause the test directory to be deleted"""
        test_folder = rfm_dry.find_runfolders(min_age=0)[0]
        rfm_dry.delete(test_folder)
        assert test_folder.name not in rfm_dry.deleted
Empty file added wscleaner/wscleaner/__init__.py
Empty file.