Commit full example, simplify dataset import docs
jimmymathews committed Sep 23, 2024
1 parent d951969 commit 853f7f9
Showing 51 changed files with 113,135 additions and 352 deletions.
52 changes: 23 additions & 29 deletions data_curation/README.md
@@ -1,45 +1,42 @@
The scripts in this directory are used to organize datasets before importing into a PostgreSQL database for the SPT application or analysis.
The scripts here are used to organize datasets before importing into a PostgreSQL database for the SPT application or analysis.

1. [Doing import / upload](#doing-import--upload)
2. [Doing import / upload for just one dataset](#doing-import--upload-for-just-one-dataset)
3. [Import without using the wrapper scripts](#import-without-using-the-wrapper-scripts)
4. [Show all progress](#show-progress)
1. [Curation or pre-processing](#1-curation-or-pre-processing)
2. [Doing import / upload into the database](#2-doing-import--upload-into-the-database)
3. [Finer control of the upload process](#3-finer-control-of-the-upload-process)

Datasets are stored in subdirectories of `datasets/`. The procedure for adding a new dataset is documented in [`datasets/template/README.md`](datasets/template/README.md).
# 1. Curation or pre-processing

## Doing import / upload
A script which does a mass import of all available datasets is provided here as `import_datasets.sh`. It assumes that the [`spatialprofilingtoolbox` Python package](https://pypi.org/project/spatialprofilingtoolbox/) has been installed.
Datasets are stored in subdirectories of `datasets/`. To prepare a new dataset, follow the full example in [`datasets/moldoveanu/README.md`](datasets/moldoveanu/README.md). The example includes files pre-generated in a format ready for import into the database, but you can also re-generate them yourself.

The usage is, for example:
```bash
./import_datasets.sh ~/.spt_db.config.local --drop-first
```
- `~/.spt_db.config.local` is an example name of a [database configuration file](https://github.com/nadeemlab/SPT/blob/main/spatialprofilingtoolbox/workflow/assets/.spt_db.config.template).
- The `--drop-first` option causes dropping/deleting a dataset with the same study name as one which is about to be uploaded. Without this option, upload will only take place if the dataset is not already in the database.
Extraction scripts tend to be dataset-specific, but there are common tasks, such as quantification over segments in images and formulation of standardized representations of channel or clinical metadata.

## Doing import / upload for just one dataset
For example:

```bash
# 2. Doing import / upload into the database
Make sure that
- the [`spatialprofilingtoolbox` Python package](https://pypi.org/project/spatialprofilingtoolbox/) has been installed, and
- PostgreSQL is installed and running on your machine.
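Both prerequisites can be checked quickly from a shell. This is just a convenience sketch, assuming a pip-based installation and a PostgreSQL server listening on the default local port:

```bash
# Sanity checks before importing (assumes a pip-based install and a local PostgreSQL on port 5432).
pip show spatialprofilingtoolbox   # prints package metadata if the Python package is installed
pg_isready -h localhost -p 5432    # reports whether the PostgreSQL server is accepting connections
```

With both in place, import the complete example dataset: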

```sh
./import_datasets.sh ~/.spt_db.config.local --drop-first moldoveanu
```
or
```bash
./import_datasets.sh ~/.spt_db.config.local --no-drop-first moldoveanu
```

## Import without using the wrapper
The `import_datasets.sh` script here is provided for convenience only, as a wrapper around `spt` CLI commands.
- `~/.spt_db.config.local` is an example name of a [database configuration file](https://github.com/nadeemlab/SPT/blob/main/spatialprofilingtoolbox/workflow/assets/.spt_db.config.template).
- The `--drop-first` option causes dropping/deleting a dataset with the same study name as one which is about to be uploaded. Without this option, upload will only take place if the dataset is not already in the database.
- `moldoveanu` is the name of the dataset subdirectory for the complete example.
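
If you would rather leave an existing copy of the study in place, pass the explicit opposite flag; the upload then only takes place when the dataset is not already in the database:

```bash
./import_datasets.sh ~/.spt_db.config.local --no-drop-first moldoveanu
```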


# 3. Finer control of the upload process
The usage above relies on several convenience wrapper functions, and assumes that you have saved dataset artifacts to `datasets/<dataset_name>/generated_artifacts/`.

For one dataset you may prefer to use your own custom script templated on the following:
For finer control, for example when drawing source files from an S3 bucket, use the following:

```bash
mkdir rundir; cd rundir
spt workflow configure --workflow="tabular import" --config-file=workflow.config
./run.sh
```

For the above, the `workflow.config` file should look something like this:
Here the `workflow.config` file should look something like this:
```ini
[general]
db_config_file = /Users/username/.spt_db.config.local
@@ -51,7 +48,7 @@ study_name = Melanoma CyTOF ICI
input_path = datasets/moldoveanu/generated_artifacts
```

If you wish for Nextflow to pull directly from S3, rather than a local directory like `.../generated_artifacts`, `workflow.config` may look like this instead:
If you wish for Nextflow (which runs in `run.sh`) to pull source files directly from S3, `workflow.config` may look like this instead:

```ini
[general]
@@ -71,6 +68,3 @@ You can monitor progress by watching the Nextflow logs:
```bash
tail -f -n1000 work/*/*.command.log
```
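
If several datasets have been prepared this way, the same configure-and-run cycle can simply be repeated per dataset. The sketch below is illustrative only: the dataset names and the `configs/` directory are hypothetical, and each per-dataset config file is assumed to point at that dataset's own `generated_artifacts` directory (or S3 location).

```bash
# Illustrative only: repeat the configure/run cycle once per prepared dataset.
# The dataset names and the configs/ directory are hypothetical.
for dataset in moldoveanu another_dataset; do
    mkdir "rundir_${dataset}"
    cp "configs/${dataset}.config" "rundir_${dataset}/workflow.config"
    (
        cd "rundir_${dataset}" || exit 1
        spt workflow configure --workflow="tabular import" --config-file=workflow.config
        ./run.sh
    )
done
```

The same `tail -f work/*/*.command.log` trick works inside each run directory.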

## Show all progress
By default `import_datasets.sh` is parallelized at the per-dataset level. To see basic progress across datasets, use `./show_progress.sh`.
14 changes: 14 additions & 0 deletions data_curation/datasets/moldoveanu/README.md
@@ -0,0 +1,14 @@
# `download.sh`
This script attempts to download this dataset from Zenodo, and checks file integrity.

# `extract.sh`
This script creates cell-level data files as well as study-level metadata files, and saves them to `generated_artifacts/`.

Use, for example,
```sh
extract.sh --cores=4
```
to speed up this process if your machine has multiple cores.

# `clean.sh`
This script removes extracted archive files and other intermediate files.
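
Taken together, regenerating the artifacts might look like the following; this is a sketch that assumes the three scripts are executable and are run from inside the dataset subdirectory:

```bash
# Possible end-to-end regeneration, run from the dataset subdirectory.
cd datasets/moldoveanu
./download.sh               # fetch the dataset from Zenodo and check file integrity
./extract.sh --cores=4      # write cell-level data and study-level metadata to generated_artifacts/
./clean.sh                  # remove extracted archives and other intermediate files
ls generated_artifacts/     # these files are what the database import step reads
```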
@@ -12,7 +12,7 @@
from _extraction_formats import create_sparse_dataframe # pylint: disable=E0611

sys.path.append('../../convenience_scripts')
from bimodality_assessor import create_bimodal_vector
from bimodality_assessor import BimodalityAssessor

def aggregate_cell(group: DataFrame, channel_name: str) -> float:
with warnings.catch_warnings():
