This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

Merge pull request #30 from moka-guys/v1.10
add support for TSO and update some documentation. fix#28 fix#29
natashapinto authored May 27, 2022
2 parents f0f31e3 + 396652b commit 331da59
Showing 12 changed files with 124 additions and 22 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -1,5 +1,4 @@
*.pyc
*.egg-info
wscleaner/wscleaner/config.json
wscleaner/test/test_dir*.txt
wscleaner/test/data
26 changes: 16 additions & 10 deletions wscleaner/README.md
@@ -6,25 +6,35 @@ When executed, Runfolders in the input (root) directory are deleted based on the

* A single DNAnexus project is found matching the runfolder name
* All local FASTQ files are uploaded and in a 'closed' state
* Six logfiles are present in the DNA Nexus project /Logfiles directory
* X logfiles are present in the DNAnexus project /Logfiles directory (NB X can be set with the `--logfile-count` command-line argument; the default is 5)

or if the run is identified as a TSO500 run, based on both of the following:
* The last line of the `bcl2fastq2_output.log` file created by the automated scripts starts with `TSO500 run.`
* The human-readable DNAnexus project name contains `_TSO`

A DNAnexus API key must be cached locally using the `--set-key` option.

## Workstation Environment
The directory `env/` in this repository contains conda environment scripts for the workstation. These remove conflicts in the PYTHONPATH environment variable by editing the variable when conda is activated. The conda documentation describes where to place these scripts under ['saving environment variables'](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux).

## Install
As described above, two conda environments exist on the workstation: wscleaner and wscleaner_test (for development work).
Activate the relevant environment before installing with pip (as below).


```bash
git clone https://github.com/moka-guys/workstation_housekeeping.git
pip install workstation_housekeeping/wscleaner
wscleaner --version # Print version number
```

## Quickstart
## Automated usage
The script `wscleaner_command.sh` is called by the crontab. It activates the environment and passes the logfile path (and any other non-default arguments).
A development command script, `wscleaner_command_dev.sh`, can be used to call the test environment and provide testing arguments, e.g. `--dry-run`.

```bash
wscleaner ROOT_DIRECTORY
```

## Usage
## Manual Usage

```
usage: wscleaner [-h] [--auth AUTH] [--dry-run] [--logfile LOGFILE]
@@ -55,10 +65,6 @@ optional arguments:
pytest . --auth_token DNA_NEXUS_KEY
```

## Workstation Environment
The directory `env/` in this repository contains conda environment scripts for the workstation. These remove conflicts in the PYTHONPATH environment variable by editing the variable when conda is activated. The conda documentation describes where to place these scripts under ['saving environment variables'](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux).


## License

Developed by Viapath Genome Informatics
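
The deletion criteria listed in the README above can be sketched as a single predicate. This is a minimal illustration rather than the package's own code: `RunfolderState`, `is_tso500` and `can_delete` are hypothetical names, the inputs are assumed to have been gathered already (no DNAnexus calls are made), and the README's "or" is read literally, so a TSO500 run is treated as deletable without the FASTQ and logfile checks. The real control flow in `main.py` may differ.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class RunfolderState:
    """Hypothetical snapshot of the facts the README criteria depend on."""
    matching_projects: int      # DNAnexus projects matching the runfolder name
    local_fastqs: Set[str]      # FASTQ filenames found in the local runfolder
    uploaded_fastqs: Set[str]   # FASTQ filenames uploaded and in a 'closed' state
    logfile_count: int          # files in the project's /Logfiles directory
    tso_log_statement: bool     # last line of bcl2fastq2_output.log starts with "TSO500 run."
    project_name: str           # human-readable DNAnexus project name


def is_tso500(state: RunfolderState) -> bool:
    """Both TSO500 indicators must be present."""
    return state.tso_log_statement and "_TSO" in state.project_name


def can_delete(state: RunfolderState, expected_logfiles: int = 5) -> bool:
    """Apply the README criteria; expected_logfiles mirrors the default of 5."""
    if is_tso500(state):
        return True
    return (
        state.matching_projects == 1
        and state.local_fastqs <= state.uploaded_fastqs  # every local FASTQ was uploaded
        and state.logfile_count == expected_logfiles
    )
```

Keeping the decision in a pure function like this makes it straightforward to unit test without network access; the DNAnexus lookups would populate `RunfolderState` separately.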
11 changes: 11 additions & 0 deletions wscleaner/wscleaner.egg-info/PKG-INFO
@@ -0,0 +1,11 @@
Metadata-Version: 1.10
Name: wscleaner
Version: 1.10
Summary: Package to remove uploaded runfolders from the Viapath Genome Informatics NGS workstation
Home-page: https://github.com/moka-guys/workstation_housekeeping
Author: Nana Mensah
Author-email: [email protected]
License: MIT
Description: UNKNOWN
Platform: UNKNOWN
Requires-Python: >=3.6.8
14 changes: 14 additions & 0 deletions wscleaner/wscleaner.egg-info/SOURCES.txt
@@ -0,0 +1,14 @@
README.md
setup.py
test/test_all.py
wscleaner/__init__.py
wscleaner/lib.py
wscleaner/main.py
wscleaner/mokaguys_logger.py
wscleaner.egg-info/PKG-INFO
wscleaner.egg-info/SOURCES.txt
wscleaner.egg-info/dependency_links.txt
wscleaner.egg-info/entry_points.txt
wscleaner.egg-info/not-zip-safe
wscleaner.egg-info/requires.txt
wscleaner.egg-info/top_level.txt
1 change: 1 addition & 0 deletions wscleaner/wscleaner.egg-info/dependency_links.txt
@@ -0,0 +1 @@

3 changes: 3 additions & 0 deletions wscleaner/wscleaner.egg-info/entry_points.txt
@@ -0,0 +1,3 @@
[console_scripts]
wscleaner = wscleaner.main:main

1 change: 1 addition & 0 deletions wscleaner/wscleaner.egg-info/not-zip-safe
@@ -0,0 +1 @@

5 changes: 5 additions & 0 deletions wscleaner/wscleaner.egg-info/requires.txt
@@ -0,0 +1,5 @@
docutils>=0.3
dxpy==0.279.0
pytest==4.4.0
pytest-cov==2.6.1
Sphinx==2.0.1
1 change: 1 addition & 0 deletions wscleaner/wscleaner.egg-info/top_level.txt
@@ -0,0 +1 @@
wscleaner
68 changes: 58 additions & 10 deletions wscleaner/wscleaner/lib.py
@@ -12,6 +12,7 @@
 import shutil
 import time
 from pathlib import Path
+import os

 import dxpy

@@ -34,6 +35,7 @@ class RunFolder():
     def __init__(self, path):
         self.logger = logging.getLogger(__name__ + '.RunFolder')
         self.path = Path(path)
+        self.RTA_complete_exists = os.path.isfile(os.path.join(self.path,"RTAComplete.txt"))
         self.name = self.path.name
         self.logger.debug(f'Initiating RunFolder instance for {self.name}')
         self.dx_project = DxProjectRunFolder(self.name)
@@ -63,6 +65,37 @@ def find_fastqs(self, count=False):
         else:
             self.logger.debug(f'{self.name} contains {len(fastq_filenames)} fastq files: {fastq_filenames}')
             return fastq_filenames
+
+    def TSO500_check(self):
+        """
+        Checks if the run is a TSO500 run. These need to be cleaned up but do not contain fastqs
+        Returns True if TSO run detected.
+        """
+        logfile_check=False
+        project_name=False
+        bcl2fastq_filepath=os.path.join(self.path,"bcl2fastq2_output.log")
+        # ensure not trying to open files that don't exist
+        if os.path.isdir(self.path) and os.path.exists(bcl2fastq_filepath):
+            # open bcl2fastq file - should contain a standard statement from automated scripts
+            with open(bcl2fastq_filepath) as demultiplexing_file:
+                # take last line of the logfile - look for statement produced by automated scripts for TSO runs
+                if demultiplexing_file.readlines()[-1].startswith("TSO500 run."):
+                    logfile_check=True
+                    self.logger.debug(f'bcl2fastq2_output.log for {self.name} contains the string expected for TSO500 runs')
+                else:
+                    self.logger.debug(f'bcl2fastq2_output.log for {self.name} DOES NOT contain expected TSO500 string')
+        # may be an issue identifying the DNAnexus project
+        # get the dnanexus project name to assess if contains "_TSO"
+        if self.dx_project.id:
+            nexus_project_name = dxpy.describe(self.dx_project.id)["name"]
+            if "_TSO" in nexus_project_name:
+                self.logger.debug(f'DNANexus project name {nexus_project_name} contains the string "_TSO"')
+                project_name=True
+            else:
+                self.logger.debug(f'DNANexus project name {nexus_project_name} does NOT contain the string "_TSO"')
+        # if both checks pass return true
+        if project_name and logfile_check:
+            return True


 class DxProjectRunFolder():
@@ -181,29 +214,44 @@ def find_runfolders(self, min_age=None):
         Returns:
             runfolder_objects(list): List of wscleaner.lib.RunFolder objects.
         """
-        subdirectories = self.root.iterdir()
         runfolder_objects = []
-        for directory in subdirectories:
+        # list all directories in the runfolder dir.
+        for directory in [directory for directory in self.root.iterdir() if directory.is_dir()]:
             rf = RunFolder(directory)
-            # Criteria for runfolder: Older than or equal to min_age and contains fastq.gz files
-            if (rf.age >= min_age) and (rf.find_fastqs(count=True) > 0):
-                self.logger.debug(f'{rf.name} IS RUNFOLDER.')
-                runfolder_objects.append(rf)
+            # skip any folders that do not have an RTAComplete.txt file
+            if not rf.RTA_complete_exists:
+                self.logger.debug(f'{rf.name} is not a runfolder, or sequencing has not yet finished.')
             else:
-                self.logger.debug(f'{rf.name} IS NOT RUNFOLDER.')
+                # catch TSO500 runfolders here (do not contain fastqs)
+                if (rf.age >= min_age) and (rf.TSO500_check()):
+                    self.logger.debug(f'{rf.name} is a TSO500 runfolder and is >= {min_age} days old.')
+                    runfolder_objects.append(rf)
+                # Criteria for runfolder: Older than or equal to min_age and contains fastq.gz files
+                elif (rf.age >= min_age) and (rf.find_fastqs(count=True) > 0):
+                    self.logger.debug(f'{rf.name} contains 1 or more fastq and is >= {min_age} days old.')
+                    runfolder_objects.append(rf)
+                # shouldn't get this far anymore - leave in just in case.
+                else:
+                    self.logger.debug(f'{rf.name} has 0 fastqs, is not a TSO runfolder or is < {min_age} days old.')

         return runfolder_objects

     def check_fastqs(self, runfolder):
-        """Returns true if a runfolder's fastq.gz files match those in its DNAnexus project."""
+        """
+        Returns true if a runfolder's fastq.gz files match those in its DNAnexus project.
+        Ensures all fastqs were uploaded.
+        """
         dx_fastqs = runfolder.dx_project.find_fastqs()
         local_fastqs = runfolder.find_fastqs()
         fastq_bool = all([fastq in dx_fastqs for fastq in local_fastqs])
         self.logger.debug(f'{runfolder.name} FASTQ BOOL: {fastq_bool}')
         return fastq_bool

     def check_logfiles(self, runfolder, logfile_count):
-        """Returns true if a runfolder's DNAnexus project contains 6 logfiles in the
-        expected location"""
+        """Returns true if a runfolder's DNAnexus project contains X logfiles in the
+        expected location.
+        X is defined in the --logfile-count argument provided (default = 5)
+        """
         dx_logfiles = runfolder.dx_project.count_logfiles()
         logfile_bool = (dx_logfiles == logfile_count)
         self.logger.debug(f'{runfolder.name} LOGFILE BOOL: {logfile_bool}')
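
One detail of the log check in `TSO500_check` is worth noting: `demultiplexing_file.readlines()[-1]` raises `IndexError` if `bcl2fastq2_output.log` exists but is empty. The sketch below shows one possible way to perform the same last-line test without that failure mode. The function name `log_ends_with_tso_statement` and its `marker` parameter are hypothetical and not part of this commit; only the `"TSO500 run."` prefix is taken from the diff.

```python
import os


def log_ends_with_tso_statement(log_path, marker="TSO500 run."):
    """Return True if the last line of the file at log_path starts with marker.

    Returns False when the file is missing or empty instead of raising.
    """
    if not os.path.isfile(log_path):
        return False
    last_line = ""
    with open(log_path) as log_file:
        # Iterate rather than calling readlines()[-1], so an empty file
        # simply leaves last_line as the empty string.
        for line in log_file:
            last_line = line
    return last_line.startswith(marker)
```

Iterating over the file also avoids holding the whole log in memory, which may matter for long demultiplexing logs.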
3 changes: 2 additions & 1 deletion wscleaner/wscleaner/main.py
@@ -54,8 +54,9 @@ def main():
     # If dry-run CLI flag is given, no directories are deleted by the runfolder manager.
     RFM = RunFolderManager(args.root, dry_run=args.dry_run)
     logger.info(f'Root directory {args.root}')
+    logger.info(f'Identifying local runfolders to consider deleting')
     local_runfolders = RFM.find_runfolders(min_age=args.min_age)
-    logger.debug(f'Found local runfolders: {[rf.name for rf in local_runfolders]}')
+    logger.debug(f'Found local runfolders to consider deleting: {[rf.name for rf in local_runfolders]}')

     for runfolder in local_runfolders:
         logger.info(f'Processing {runfolder.name}')
12 changes: 12 additions & 0 deletions wscleaner/wscleaner_command_dev.sh
@@ -0,0 +1,12 @@
#!/bin/bash

# Activate wscleaner environment
eval "$(/usr/local/bin/miniconda3/bin/conda shell.bash hook)" # Add conda environment to system path
conda activate wscleaner_test

# Set variables
logfile="/usr/local/src/mokaguys/automate_demultiplexing_logfiles/wscleaner_logs/$(date -d now +%y%m%d)_wscleaner.log"
runfolders="/media/data3/share"

# Execute
/usr/local/bin/miniconda3/envs/wscleaner_test/bin/python3 /usr/local/src/mokaguys/development_area/workstation_housekeeping/wscleaner/wscleaner/main.py $runfolders --logfile $logfile --dry-run --min-age=1
