Commit full example, simplify dataset import docs
jimmymathews committed Sep 23, 2024
1 parent d951969 commit 853f7f9
Showing 51 changed files with 113,135 additions and 352 deletions.
52 changes: 23 additions & 29 deletions data_curation/README.md
@@ -1,45 +1,42 @@
The scripts in this directory are used to organize datasets before importing into a PostgreSQL database for the SPT application or analysis.
The scripts here are used to organize datasets before importing into a PostgreSQL database for the SPT application or analysis.

1. [Doing import / upload](#doing-import--upload)
2. [Doing import / upload for just one dataset](#doing-import--upload-for-just-one-dataset)
3. [Import without using the wrapper scripts](#import-without-using-the-wrapper-scripts)
4. [Show all progress](#show-progress)
1. [Curation or pre-processing](#1-curation-or-pre-processing)
2. [Doing import / upload into the database](#2-doing-import--upload-into-the-database)
3. [Finer control of the upload process](#3-finer-control-of-the-upload-process)

Datasets are stored in subdirectories of `datasets/`. The procedure for adding a new dataset is documented in [`datasets/template/README.md`](datasets/template/README.md).
# 1. Curation or pre-processing

## Doing import / upload
A script which does a mass import of all available datasets is provided here as `import_datasets.sh`. It assumes that the [`spatialprofilingtoolbox` Python package](https://pypi.org/project/spatialprofilingtoolbox/) has been installed.
Datasets are stored in subdirectories of `datasets/`. To prepare a new dataset, follow the full example in [`datasets/moldoveanu/README.md`](datasets/moldoveanu/README.md). The example includes files pre-generated in a format ready for import into the database, but you can also re-generate them yourself.

The usage is, for example:
```bash
./import_datasets.sh ~/.spt_db.config.local --drop-first
```
- `~/.spt_db.config.local` is an example name of a [database configuration file](https://github.com/nadeemlab/SPT/blob/main/spatialprofilingtoolbox/workflow/assets/.spt_db.config.template).
- The `--drop-first` option causes dropping/deleting a dataset with the same study name as one which is about to be uploaded. Without this option, upload will only take place if the dataset is not already in the database.
Extraction scripts tend to be dataset-specific, but there are common tasks, such as quantification over segments in images and formulation of standardized representations of channel or clinical metadata.

## Doing import / upload for just one dataset
For example:

```bash
# 2. Doing import / upload into the database
Make sure that
- the [`spatialprofilingtoolbox` Python package](https://pypi.org/project/spatialprofilingtoolbox/) has been installed, and
- PostgreSQL is installed and running on your machine.
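Both prerequisites can be checked quickly from a shell. This is just a convenience sketch, assuming a pip-based installation and a PostgreSQL server listening on the default local port:

```bash
# Sanity checks before importing (assumes a pip-based install and a local PostgreSQL on port 5432).
pip show spatialprofilingtoolbox   # prints package metadata if the Python package is installed
pg_isready -h localhost -p 5432    # reports whether the PostgreSQL server is accepting connections
```

With both in place, import the complete example dataset: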

```sh
./import_datasets.sh ~/.spt_db.config.local --drop-first moldoveanu
```
or
```bash
./import_datasets.sh ~/.spt_db.config.local --no-drop-first moldoveanu
```

## Import without using the wrapper
The `import_datasets.sh` script here is provided for convenience only, as a wrapper around `spt` CLI commands.
- `~/.spt_db.config.local` is an example name of a [database configuration file](https://github.com/nadeemlab/SPT/blob/main/spatialprofilingtoolbox/workflow/assets/.spt_db.config.template).
- The `--drop-first` option causes dropping/deleting a dataset with the same study name as one which is about to be uploaded. Without this option, upload will only take place if the dataset is not already in the database.
- `moldoveanu` is the name of the dataset subdirectory for the complete example.
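
If you would rather leave an existing copy of the study in place, pass the explicit opposite flag; the upload then only takes place when the dataset is not already in the database:

```bash
./import_datasets.sh ~/.spt_db.config.local --no-drop-first moldoveanu
```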


# 3. Finer control of the upload process
The usage above relies on several convenience wrapper functions, and assumes that you have saved dataset artifacts to `datasets/<dataset_name>/generated_artifacts/`.

For one dataset you may prefer to use your own custom script templated on the following:
For finer control, for example when drawing source files from an S3 bucket, use the following:

```bash
mkdir rundir; cd rundir
spt workflow configure --workflow="tabular import" --config-file=workflow.config
./run.sh
```

For the above, the `workflow.config` file should look something like this:
Here the `workflow.config` file should look something like this:
```ini
[general]
db_config_file = /Users/username/.spt_db.config.local
@@ -51,7 +48,7 @@ study_name = Melanoma CyTOF ICI
input_path = datasets/moldoveanu/generated_artifacts
```

If you wish for Nextflow to pull directly from S3, rather than a local directory like `.../generated_artifacts`, `workflow.config` may look like this instead:
If you wish for Nextflow (which runs in `run.sh`) to pull source files directly from S3, `workflow.config` may look like this instead:

```ini
[general]
@@ -71,6 +68,3 @@ You can monitor progress by watching the Nextflow logs:
```bash
tail -f -n1000 work/*/*.command.log
```
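
If several datasets have been prepared this way, the same configure-and-run cycle can simply be repeated per dataset. The sketch below is illustrative only: the dataset names and the `configs/` directory are hypothetical, and each per-dataset config file is assumed to point at that dataset's own `generated_artifacts` directory (or S3 location).

```bash
# Illustrative only: repeat the configure/run cycle once per prepared dataset.
# The dataset names and the configs/ directory are hypothetical.
for dataset in moldoveanu another_dataset; do
    mkdir "rundir_${dataset}"
    cp "configs/${dataset}.config" "rundir_${dataset}/workflow.config"
    (
        cd "rundir_${dataset}" || exit 1
        spt workflow configure --workflow="tabular import" --config-file=workflow.config
        ./run.sh
    )
done
```

The same `tail -f work/*/*.command.log` trick works inside each run directory.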

## Show all progress
By default `import_datasets.sh` is parallelized at the per-dataset level. To see basic progress across datasets, use `./show_progress.sh`.
14 changes: 14 additions & 0 deletions data_curation/datasets/moldoveanu/README.md
@@ -0,0 +1,14 @@
# `download.sh`
This script attempts to download this dataset from Zenodo, and checks file integrity.

# `extract.sh`
This script creates cell-level data files as well as study-level metadata files, and saves them to `generated_artifacts/`.

Use, for example,
```sh
extract.sh --cores=4
```
to speed up this process if your machine has multiple cores.

# `clean.sh`
This script removes extracted archive files and other intermediate files.
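
Taken together, regenerating the artifacts might look like the following; this is a sketch that assumes the three scripts are executable and are run from inside the dataset subdirectory:

```bash
# Possible end-to-end regeneration, run from the dataset subdirectory.
cd datasets/moldoveanu
./download.sh               # fetch the dataset from Zenodo and check file integrity
./extract.sh --cores=4      # write cell-level data and study-level metadata to generated_artifacts/
./clean.sh                  # remove extracted archives and other intermediate files
ls generated_artifacts/     # these files are what the database import step reads
```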
@@ -12,7 +12,7 @@
from _extraction_formats import create_sparse_dataframe # pylint: disable=E0611

sys.path.append('../../convenience_scripts')
from bimodality_assessor import create_bimodal_vector
from bimodality_assessor import BimodalityAssessor

def aggregate_cell(group: DataFrame, channel_name: str) -> float:
with warnings.catch_warnings():
