diff --git a/README.md b/README.md index da3ffe5..c85ded8 100644 --- a/README.md +++ b/README.md @@ -1,37 +1,518 @@ -# Task Template +# Predict Modality -This repo is a template to create a new task for the OpenProblems v2. This repo contains several example files and components that can be used when updated with the task info. -> [!WARNING] -> This README will be overwritten when performing the `create_task_readme` script. + -## Create a repository from this template +Predicting the profiles of one modality (e.g. protein abundance) from +another (e.g. mRNA expression). -> [!IMPORTANT] -> Before creating a new repository, make sure you are part of the OpenProblems task team. This will be done when you create an issue for the task and you get the go ahead to create the task. -> For more information on how to create a new task, check out the [Create a new task](https://openproblems.bio/documentation/create_task/) documentation. +Repository: +[openproblems-bio/task_predict_modality](https://github.com/openproblems-bio/task_predict_modality) -The instructions below will guide you through creating a new repository from this template ([creating-a-repository-from-a-template](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template#creating-a-repository-from-a-template)). +## Description +Experimental techniques to measure multiple modalities within the same +single cell are increasingly becoming available. The demand for these +measurements is driven by the promise to provide a deeper insight into +the state of a cell. Yet, the modalities are also intrinsically linked. +We know that DNA must be accessible (ATAC data) to produce mRNA +(expression data), and mRNA in turn is used as a template to produce +protein (protein abundance). These processes are regulated often by the +same molecules that they produce: for example, a protein may bind DNA to +prevent the production of more mRNA. 
Understanding these regulatory +processes would be transformative for synthetic biology and drug target +discovery. Any method that can predict a modality from another must have +accounted for these regulatory processes, but the demand for multi-modal +data shows that this is not trivial. -* Click the "Use this template" button on the top right of the repository. -* Use the Owner dropdown menu to select the `openproblems-bio` account. -* Type a name for your repository (task_...), and a description. -* Set the repository visibility to public. -* Click "Create repository from template". +## API -## Clone the repository +``` mermaid +flowchart LR + file_common_dataset_mod1("Raw dataset RNA") + comp_process_datasets[/"Process Dataset"/] + file_test_mod1("Test mod1") + file_test_mod2("Test mod2") + file_train_mod1("Train mod1") + file_train_mod2("Train mod2") + comp_control_method[/"Control method"/] + comp_method_predict[/"Predict"/] + comp_method_train[/"Train"/] + comp_method[/"Method"/] + comp_metric[/"Metric"/] + file_prediction("Prediction") + file_pretrained_model("Pretrained model") + file_score("Score") + file_common_dataset_mod2("Raw dataset mod2") + file_common_dataset_mod1---comp_process_datasets + comp_process_datasets-->file_test_mod1 + comp_process_datasets-->file_test_mod2 + comp_process_datasets-->file_train_mod1 + comp_process_datasets-->file_train_mod2 + file_test_mod1---comp_control_method + file_test_mod1---comp_method_predict + file_test_mod1---comp_method_train + file_test_mod1---comp_method + file_test_mod2---comp_control_method + file_test_mod2---comp_metric + file_train_mod1---comp_control_method + file_train_mod1---comp_method_predict + file_train_mod1---comp_method_train + file_train_mod1---comp_method + file_train_mod2---comp_control_method + file_train_mod2---comp_method_predict + file_train_mod2---comp_method_train + file_train_mod2---comp_method + comp_control_method-->file_prediction + comp_method_predict-->file_prediction + 
comp_method_train-->file_pretrained_model + comp_method-->file_prediction + comp_metric-->file_score + file_prediction---comp_metric + file_pretrained_model---comp_method_predict + file_common_dataset_mod2---comp_process_datasets +``` -To clone the repository with the submodule files, you can use the following command: +## File format: Raw dataset RNA -```bash -git clone --recursive git@github.com:openproblems-bio/.git -``` ->[!NOTE] -> If somehow there are no files visible in the submodule after cloning using the above command. Check the instructions [here](common/README.md). +The RNA modality of the raw dataset. + +Example file: +`resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod1.h5ad` + +Format: + +
+ + AnnData object + obs: 'batch', 'size_factors' + var: 'feature_id', 'feature_name', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' + +
+ +Data structure: + +
+ +| Slot | Type | Description | +|:---|:---|:---| +| `obs["batch"]` | `string` | Batch information. | +| `obs["size_factors"]` | `double` | (*Optional*) The size factors of the cells prior to normalization. | +| `var["feature_id"]` | `string` | Unique identifier for the feature, usually an Ensembl gene ID. | +| `var["feature_name"]` | `string` | (*Optional*) A human-readable name for the feature, usually a gene symbol. | +| `var["hvg"]` | `boolean` | Whether or not the feature is considered to be a ‘highly variable gene’. | +| `var["hvg_score"]` | `double` | A score for the feature indicating how highly variable it is. | +| `obsm["gene_activity"]` | `double` | (*Optional*) ATAC gene activity. | +| `layers["counts"]` | `integer` | Raw counts. | +| `layers["normalized"]` | `double` | Normalized expression values. | +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["dataset_name"]` | `string` | Nicely formatted name. | +| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. | +| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. | +| `uns["dataset_summary"]` | `string` | Short description of the dataset. | +| `uns["dataset_description"]` | `string` | Long description of the dataset. | +| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. | +| `uns["normalization_id"]` | `string` | The unique identifier of the normalization method used. | +| `uns["gene_activity_var_names"]` | `string` | (*Optional*) Names of the gene activity matrix. | + +
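The slot table above can be turned into a quick presence check. The following is a minimal sketch, not part of the task's tooling: `REQUIRED_SLOTS` and `missing_slots` are hypothetical names, and the required set is simply every row of the table that is not marked (*Optional*).

```python
# Hypothetical helper (not part of the repository): required slots are the
# rows of the "Raw dataset RNA" table that are not marked as optional.
REQUIRED_SLOTS = {
    "obs": {"batch"},
    "var": {"feature_id", "hvg", "hvg_score"},
    "layers": {"counts", "normalized"},
    "uns": {
        "dataset_id",
        "dataset_name",
        "dataset_summary",
        "dataset_description",
        "normalization_id",
    },
}


def missing_slots(present):
    """Return 'group["name"]' strings for required slots absent from `present`.

    `present` maps a slot group ("obs", "var", "layers", "uns") to an
    iterable of the names found in a file.
    """
    missing = []
    for group, names in REQUIRED_SLOTS.items():
        have = set(present.get(group, ()))
        missing.extend(f'{group}["{name}"]' for name in sorted(names - have))
    return missing
```

With the `anndata` package, the `present` mapping could be built from `adata.obs.columns`, `adata.var.columns`, `adata.layers.keys()` and `adata.uns.keys()`.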
+ +## Component type: Process Dataset + +A predict modality dataset processor. + +Arguments: + +
+ +| Name | Type | Description | +|:---|:---|:---| +| `--input_mod1` | `file` | The RNA modality of the raw dataset. | +| `--input_mod2` | `file` | The second modality of the raw dataset. Must be an ADT or an ATAC dataset. | +| `--output_train_mod1` | `file` | (*Output*) The mod1 expression values of the train cells. | +| `--output_train_mod2` | `file` | (*Output*) The mod2 expression values of the train cells. | +| `--output_test_mod1` | `file` | (*Output*) The mod1 expression values of the test cells. | +| `--output_test_mod2` | `file` | (*Output*) The mod2 expression values of the test cells. | +| `--seed` | `integer` | (*Optional*) NA. Default: `1`. | + +
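As an illustration of the argument interface listed above, here is a minimal sketch using Python's standard `argparse` module. It only mirrors the flag names and the `--seed` default from the table; `build_parser` is a hypothetical name, treating every file argument as required is an assumption, and the real component may be defined quite differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the Process Dataset arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the Process Dataset argument interface."
    )
    # Inputs: the two modalities of the raw dataset.
    parser.add_argument("--input_mod1", required=True, help="RNA modality of the raw dataset.")
    parser.add_argument("--input_mod2", required=True, help="Second modality (ADT or ATAC).")
    # Outputs: train/test splits for both modalities.
    for split in ("train", "test"):
        for mod in ("mod1", "mod2"):
            parser.add_argument(
                f"--output_{split}_{mod}",
                required=True,
                help=f"Output file with the {mod} values of the {split} cells.",
            )
    # Optional integer seed with the documented default of 1.
    parser.add_argument("--seed", type=int, default=1)
    return parser
```

Calling `build_parser().parse_args([...])` with the six file flags yields a namespace whose `seed` attribute falls back to `1` when `--seed` is omitted.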
+ +## File format: Test mod1 + +The mod1 expression values of the test cells. + +Example file: +`resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod1.h5ad` + +Format: + +
+ + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' + +
+ +Data structure: + +
+ +| Slot | Type | Description | +|:---|:---|:---| +| `obs["batch"]` | `string` | Batch information. | +| `obs["size_factors"]` | `double` | (*Optional*) The size factors of the cells prior to normalization. | +| `var["gene_ids"]` | `string` | (*Optional*) The gene identifiers (if available). | +| `var["hvg"]` | `boolean` | Whether or not the feature is considered to be a ‘highly variable gene’. | +| `var["hvg_score"]` | `double` | A score for the feature indicating how highly variable it is. | +| `obsm["gene_activity"]` | `double` | (*Optional*) ATAC gene activity. | +| `layers["counts"]` | `integer` | Raw counts. | +| `layers["normalized"]` | `double` | Normalized expression values. | +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["common_dataset_id"]` | `string` | (*Optional*) A common identifier for the dataset. | +| `uns["dataset_name"]` | `string` | Nicely formatted name. | +| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. | +| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. | +| `uns["dataset_summary"]` | `string` | Short description of the dataset. | +| `uns["dataset_description"]` | `string` | Long description of the dataset. | +| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. | +| `uns["normalization_id"]` | `string` | The unique identifier of the normalization method used. | +| `uns["gene_activity_var_names"]` | `string` | (*Optional*) Names of the gene activity matrix. | + +
+ +## File format: Test mod2 + +The mod2 expression values of the test cells. + +Example file: +`resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod2.h5ad` + +Format: + +
+ + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'gene_activity_var_names' + +
+ +Data structure: + +
+ +| Slot | Type | Description | +|:---|:---|:---| +| `obs["batch"]` | `string` | Batch information. | +| `obs["size_factors"]` | `double` | (*Optional*) The size factors of the cells prior to normalization. | +| `var["gene_ids"]` | `string` | (*Optional*) The gene identifiers (if available). | +| `var["hvg"]` | `boolean` | Whether or not the feature is considered to be a ‘highly variable gene’. | +| `var["hvg_score"]` | `double` | A score for the feature indicating how highly variable it is. | +| `obsm["gene_activity"]` | `double` | (*Optional*) ATAC gene activity. | +| `layers["counts"]` | `integer` | Raw counts. | +| `layers["normalized"]` | `double` | Normalized expression values. | +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["common_dataset_id"]` | `string` | (*Optional*) A common identifier for the dataset. | +| `uns["dataset_name"]` | `string` | Nicely formatted name. | +| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. | +| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. | +| `uns["dataset_summary"]` | `string` | Short description of the dataset. | +| `uns["dataset_description"]` | `string` | Long description of the dataset. | +| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. | +| `uns["gene_activity_var_names"]` | `string` | (*Optional*) Names of the gene activity matrix. | + +
+ +## File format: Train mod1 + +The mod1 expression values of the train cells. + +Example file: +`resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod1.h5ad` + +Format: + +
+ + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' + +
+ +Data structure: + +
+ +| Slot | Type | Description | +|:---|:---|:---| +| `obs["batch"]` | `string` | Batch information. | +| `obs["size_factors"]` | `double` | (*Optional*) The size factors of the cells prior to normalization. | +| `var["gene_ids"]` | `string` | (*Optional*) The gene identifiers (if available). | +| `var["hvg"]` | `boolean` | Whether or not the feature is considered to be a ‘highly variable gene’. | +| `var["hvg_score"]` | `double` | A score for the feature indicating how highly variable it is. | +| `obsm["gene_activity"]` | `double` | (*Optional*) ATAC gene activity. | +| `layers["counts"]` | `integer` | Raw counts. | +| `layers["normalized"]` | `double` | Normalized expression values. | +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["common_dataset_id"]` | `string` | (*Optional*) A common identifier for the dataset. | +| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. | +| `uns["normalization_id"]` | `string` | The unique identifier of the normalization method used. | +| `uns["gene_activity_var_names"]` | `string` | (*Optional*) Names of the gene activity matrix. | + +
+ +## File format: Train mod2 + +The mod2 expression values of the train cells. + +Example file: +`resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod2.h5ad` + +Format: + +
+ + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' + +
+ +Data structure: + +
+ +| Slot | Type | Description | +|:---|:---|:---| +| `obs["batch"]` | `string` | Batch information. | +| `obs["size_factors"]` | `double` | (*Optional*) The size factors of the cells prior to normalization. | +| `var["gene_ids"]` | `string` | (*Optional*) The gene identifiers (if available). | +| `var["hvg"]` | `boolean` | Whether or not the feature is considered to be a ‘highly variable gene’. | +| `var["hvg_score"]` | `double` | A score for the feature indicating how highly variable it is. | +| `obsm["gene_activity"]` | `double` | (*Optional*) ATAC gene activity. | +| `layers["counts"]` | `integer` | Raw counts. | +| `layers["normalized"]` | `double` | Normalized expression values. | +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["common_dataset_id"]` | `string` | (*Optional*) A common identifier for the dataset. | +| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. | +| `uns["normalization_id"]` | `string` | The unique identifier of the normalization method used. | +| `uns["gene_activity_var_names"]` | `string` | (*Optional*) Names of the gene activity matrix. | + +
+ +## Component type: Control method + +Quality control methods for verifying the pipeline. + +Arguments: + +
+ +| Name | Type | Description | +|:---|:---|:---| +| `--input_train_mod1` | `file` | The mod1 expression values of the train cells. | +| `--input_train_mod2` | `file` | The mod2 expression values of the train cells. | +| `--input_test_mod1` | `file` | The mod1 expression values of the test cells. | +| `--input_test_mod2` | `file` | The mod2 expression values of the test cells. | +| `--output` | `file` | (*Output*) A prediction of the mod2 expression values of the test cells. | + +
+ +## Component type: Predict + +Make predictions using a trained model. + +Arguments: + +
+ +| Name | Type | Description | +|:---|:---|:---| +| `--input_train_mod1` | `file` | The mod1 expression values of the train cells. | +| `--input_train_mod2` | `file` | The mod2 expression values of the train cells. | +| `--input_test_mod1` | `file` | The mod1 expression values of the test cells. | +| `--input_model` | `file` | A pretrained model for predicting the expression of one modality from another. | +| `--output` | `file` | (*Output*) A prediction of the mod2 expression values of the test cells. | + +
+ +## Component type: Train + +Train a model to predict the expression of one modality from another. + +Arguments: + +
+ +| Name | Type | Description | +|:---|:---|:---| +| `--input_train_mod1` | `file` | The mod1 expression values of the train cells. | +| `--input_train_mod2` | `file` | The mod2 expression values of the train cells. | +| `--input_test_mod1` | `file` | (*Optional*) The mod1 expression values of the test cells. | +| `--output` | `file` | (*Output*) A pretrained model for predicting the expression of one modality from another. | + +
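The Train and Predict components split a method in two: Train writes a pretrained model, and Predict reads it back to produce a prediction. The sketch below shows that hand-off with a deliberately trivial model (per-feature means of the train mod2 values, serialized as JSON). The functions `train` and `predict`, the JSON format, and the use of plain nested lists instead of AnnData files are all assumptions made for illustration; this is not one of the task's methods.

```python
import json


def train(train_mod1, train_mod2):
    """Fit a trivial model: the mean of each mod2 feature over the train cells.

    `train_mod1` is accepted only to mirror the component's inputs; this
    toy model ignores it. Returns the model serialized as a JSON string.
    """
    n_cells = len(train_mod2)
    feature_means = [sum(column) / n_cells for column in zip(*train_mod2)]
    return json.dumps({"feature_means": feature_means})


def predict(model_json, test_mod1):
    """Predict mod2 values for each test cell from the serialized model."""
    feature_means = json.loads(model_json)["feature_means"]
    # One row of predicted mod2 values per test cell.
    return [list(feature_means) for _ in test_mod1]
```

Keeping the model as an opaque serialized blob between the two functions mirrors how the pretrained model is passed as a file from one component to the other.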
+ +## Component type: Method + +A regression method. + +Arguments: + +
+ +| Name | Type | Description | +|:---|:---|:---| +| `--input_train_mod1` | `file` | The mod1 expression values of the train cells. | +| `--input_train_mod2` | `file` | The mod2 expression values of the train cells. | +| `--input_test_mod1` | `file` | The mod1 expression values of the test cells. | +| `--output` | `file` | (*Output*) A prediction of the mod2 expression values of the test cells. | + +
+ +## Component type: Metric + +A predict modality metric. + +Arguments: + +
+ +| Name | Type | Description | +|:---|:---|:---| +| `--input_prediction` | `file` | A prediction of the mod2 expression values of the test cells. | +| `--input_test_mod2` | `file` | The mod2 expression values of the test cells. | +| `--output` | `file` | (*Output*) Metric score file. | + +
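A metric compares the prediction with the true mod2 values of the test cells and produces a score. As a minimal sketch, the function below computes a root mean squared error over plain nested lists. The choice of RMSE and the name `rmse` are assumptions for illustration; the text above does not say which metrics the task uses.

```python
import math


def rmse(prediction, truth):
    """Root mean squared error between two equally shaped cell-by-feature matrices."""
    same_shape = len(prediction) == len(truth) and all(
        len(p_row) == len(t_row) for p_row, t_row in zip(prediction, truth)
    )
    if not same_shape:
        raise ValueError("prediction and truth must have the same shape")
    squared_errors = [
        (p - t) ** 2
        for p_row, t_row in zip(prediction, truth)
        for p, t in zip(p_row, t_row)
    ]
    if not squared_errors:
        raise ValueError("cannot score an empty matrix")
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```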
+ +## File format: Prediction + +A prediction of the mod2 expression values of the test cells. + +Example file: +`resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/prediction.h5ad` + +Format: + +
+ + AnnData object + layers: 'normalized' + uns: 'dataset_id', 'method_id' + +
+ +Data structure: + +
+ +| Slot | Type | Description | +|:-----------------------|:---------|:----------------------------------------| +| `layers["normalized"]` | `double` | Predicted normalized expression values. | +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["method_id"]` | `string` | A unique identifier for the method. | + +
+ +## File format: Pretrained model + +A pretrained model for predicting the expression of one modality from +another. + +## File format: Score + +Metric score file. + +Example file: +`resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/score.h5ad` + +Format: + +
+ + AnnData object + uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values' + +
+ +Data structure: + +
+ +| Slot | Type | Description | +|:---|:---|:---| +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["method_id"]` | `string` | A unique identifier for the method. | +| `uns["metric_ids"]` | `string` | One or more unique metric identifiers. | +| `uns["metric_values"]` | `double` | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. | + +
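The table above requires `metric_values` to have the same length as `metric_ids`. A minimal sketch of assembling such a record and enforcing that constraint is shown below; `make_score` is a hypothetical helper, and a plain dictionary stands in for the `uns` slot of the score file.

```python
def make_score(dataset_id, method_id, metric_ids, metric_values):
    """Assemble the `uns` fields of a score file as a plain dictionary."""
    metric_ids = list(metric_ids)
    metric_values = [float(value) for value in metric_values]
    # The format requires one value per metric identifier.
    if len(metric_ids) != len(metric_values):
        raise ValueError("metric_values must be of the same length as metric_ids")
    return {
        "dataset_id": dataset_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": metric_values,
    }
```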
+ +## File format: Raw dataset mod2 + +The second modality of the raw dataset. Must be an ADT or an ATAC +dataset. + +Example file: +`resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod2.h5ad` + +Format: + +
+ + AnnData object + obs: 'batch', 'size_factors' + var: 'feature_id', 'feature_name', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' + +
+ +Data structure: -## What to do next +
-Check out the [instructions](https://github.com/openproblems-bio/common_resources/blob/main/INSTRUCTIONS.md) for more information on how to update the example files and components. These instructions also contain information on how to build out the task and basic commands. +| Slot | Type | Description | +|:---|:---|:---| +| `obs["batch"]` | `string` | Batch information. | +| `obs["size_factors"]` | `double` | (*Optional*) The size factors of the cells prior to normalization. | +| `var["feature_id"]` | `string` | Unique identifier for the feature, usually an Ensembl gene ID. | +| `var["feature_name"]` | `string` | (*Optional*) A human-readable name for the feature, usually a gene symbol. | +| `var["hvg"]` | `boolean` | Whether or not the feature is considered to be a ‘highly variable gene’. | +| `var["hvg_score"]` | `double` | A score for the feature indicating how highly variable it is. | +| `obsm["gene_activity"]` | `double` | (*Optional*) ATAC gene activity. | +| `layers["counts"]` | `integer` | Raw counts. | +| `layers["normalized"]` | `double` | Normalized expression values. | +| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | +| `uns["dataset_name"]` | `string` | Nicely formatted name. | +| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. | +| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. | +| `uns["dataset_summary"]` | `string` | Short description of the dataset. | +| `uns["dataset_description"]` | `string` | Long description of the dataset. | +| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. | +| `uns["normalization_id"]` | `string` | The unique identifier of the normalization method used. | +| `uns["gene_activity_var_names"]` | `string` | (*Optional*) Names of the gene activity matrix. 
| -For more information on the OpenProblems v2, check out the [documentation](https://openproblems.bio/documentation/). \ No newline at end of file +
diff --git a/README.qmd b/README.qmd new file mode 100644 index 0000000..796cd31 --- /dev/null +++ b/README.qmd @@ -0,0 +1,529 @@ +--- +title: "Predict Modality" +format: gfm +--- + + + +Predicting the profiles of one modality (e.g. protein abundance) from another (e.g. mRNA expression). + +Repository: [openproblems-bio/task_predict_modality](https://github.com/openproblems-bio/task_predict_modality) + + + +## Description + +Experimental techniques to measure multiple modalities within the same single cell are increasingly becoming available. +The demand for these measurements is driven by the promise to provide a deeper insight into the state of a cell. +Yet, the modalities are also intrinsically linked. We know that DNA must be accessible (ATAC data) to produce mRNA +(expression data), and mRNA in turn is used as a template to produce protein (protein abundance). These processes +are regulated often by the same molecules that they produce: for example, a protein may bind DNA to prevent the production +of more mRNA. Understanding these regulatory processes would be transformative for synthetic biology and drug target discovery. +Any method that can predict a modality from another must have accounted for these regulatory processes, but the demand for +multi-modal data shows that this is not trivial. 
+ + + + +## API + +```mermaid +flowchart LR + file_common_dataset_mod1("Raw dataset RNA") + comp_process_datasets[/"Process Dataset"/] + file_test_mod1("Test mod1") + file_test_mod2("Test mod2") + file_train_mod1("Train mod1") + file_train_mod2("Train mod2") + comp_control_method[/"Control method"/] + comp_method_predict[/"Predict"/] + comp_method_train[/"Train"/] + comp_method[/"Method"/] + comp_metric[/"Metric"/] + file_prediction("Prediction") + file_pretrained_model("Pretrained model") + file_score("Score") + file_common_dataset_mod2("Raw dataset mod2") + file_common_dataset_mod1---comp_process_datasets + comp_process_datasets-->file_test_mod1 + comp_process_datasets-->file_test_mod2 + comp_process_datasets-->file_train_mod1 + comp_process_datasets-->file_train_mod2 + file_test_mod1---comp_control_method + file_test_mod1---comp_method_predict + file_test_mod1---comp_method_train + file_test_mod1---comp_method + file_test_mod2---comp_control_method + file_test_mod2---comp_metric + file_train_mod1---comp_control_method + file_train_mod1---comp_method_predict + file_train_mod1---comp_method_train + file_train_mod1---comp_method + file_train_mod2---comp_control_method + file_train_mod2---comp_method_predict + file_train_mod2---comp_method_train + file_train_mod2---comp_method + comp_control_method-->file_prediction + comp_method_predict-->file_prediction + comp_method_train-->file_pretrained_model + comp_method-->file_prediction + comp_metric-->file_score + file_prediction---comp_metric + file_pretrained_model---comp_method_predict + file_common_dataset_mod2---comp_process_datasets +``` + + +## File format: Raw dataset RNA + +The RNA modality of the raw dataset. 
+ +Example file: `resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod1.h5ad` + + + +Format: + +:::{.small} + AnnData object + obs: 'batch', 'size_factors' + var: 'feature_id', 'feature_name', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`obs["batch"]` |`string` |Batch information. | +`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. | +`var["feature_id"]` |`string` |Unique identifier for the feature, usually an Ensembl gene ID. | +`var["feature_name"]` |`string` |(_Optional_) A human-readable name for the feature, usually a gene symbol. | +`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. | +`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. | +`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. | +`layers["counts"]` |`integer` |Raw counts. | +`layers["normalized"]` |`double` |Normalized expression values. | +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["dataset_name"]` |`string` |Nicely formatted name. | +`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. | +`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. | +`uns["dataset_summary"]` |`string` |Short description of the dataset. | +`uns["dataset_description"]` |`string` |Long description of the dataset. | +`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. 
| +`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. | +`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. | +::: + + + +## Component type: Process Dataset + + + +A predict modality dataset processor. + +Arguments: + +:::{.small} +Name |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`--input_mod1` |`file` |The RNA modality of the raw dataset. | +`--input_mod2` |`file` |The second modality of the raw dataset. Must be an ADT or an ATAC dataset. | +`--output_train_mod1` |`file` |(_Output_) The mod1 expression values of the train cells. | +`--output_train_mod2` |`file` |(_Output_) The mod2 expression values of the train cells. | +`--output_test_mod1` |`file` |(_Output_) The mod1 expression values of the test cells. | +`--output_test_mod2` |`file` |(_Output_) The mod2 expression values of the test cells. | +`--seed` |`integer` |(_Optional_) NA. Default: `1`. | +::: + + + +## File format: Test mod1 + +The mod1 expression values of the test cells. + +Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod1.h5ad` + + + +Format: + +:::{.small} + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`obs["batch"]` |`string` |Batch information. | +`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. | +`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). 
| +`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. | +`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. | +`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. | +`layers["counts"]` |`integer` |Raw counts. | +`layers["normalized"]` |`double` |Normalized expression values. | +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. | +`uns["dataset_name"]` |`string` |Nicely formatted name. | +`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. | +`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. | +`uns["dataset_summary"]` |`string` |Short description of the dataset. | +`uns["dataset_description"]` |`string` |Long description of the dataset. | +`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. | +`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. | +`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. | +::: + + + +## File format: Test mod2 + +The mod2 expression values of the test cells. 
+ +Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod2.h5ad` + + + +Format: + +:::{.small} + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'gene_activity_var_names' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`obs["batch"]` |`string` |Batch information. | +`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. | +`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). | +`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. | +`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. | +`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. | +`layers["counts"]` |`integer` |Raw counts. | +`layers["normalized"]` |`double` |Normalized expression values. | +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. | +`uns["dataset_name"]` |`string` |Nicely formatted name. | +`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. | +`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. | +`uns["dataset_summary"]` |`string` |Short description of the dataset. | +`uns["dataset_description"]` |`string` |Long description of the dataset. | +`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. 
| +`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. | +::: + + + +## File format: Train mod1 + +The mod1 expression values of the train cells. + +Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod1.h5ad` + + + +Format: + +:::{.small} + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`obs["batch"]` |`string` |Batch information. | +`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. | +`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). | +`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. | +`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. | +`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. | +`layers["counts"]` |`integer` |Raw counts. | +`layers["normalized"]` |`double` |Normalized expression values. | +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. | +`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. | +`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. | +`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. | +::: + + + +## File format: Train mod2 + +The mod2 expression values of the train cells. 
+ +Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod2.h5ad` + + + +Format: + +:::{.small} + AnnData object + obs: 'batch', 'size_factors' + var: 'gene_ids', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`obs["batch"]` |`string` |Batch information. | +`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. | +`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). | +`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. | +`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. | +`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. | +`layers["counts"]` |`integer` |Raw counts. | +`layers["normalized"]` |`double` |Normalized expression values. | +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. | +`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. | +`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. | +`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. | +::: + + + +## Component type: Control method + + + +Quality control methods for verifying the pipeline. + +Arguments: + +:::{.small} +Name |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`--input_train_mod1` |`file` |The mod1 expression values of the train cells. 
| +`--input_train_mod2` |`file` |The mod2 expression values of the train cells. | +`--input_test_mod1` |`file` |The mod1 expression values of the test cells. | +`--input_test_mod2` |`file` |The mod2 expression values of the test cells. | +`--output` |`file` |(_Output_) A prediction of the mod2 expression values of the test cells. | +::: + + + +## Component type: Predict + + + +Make predictions using a trained model. + +Arguments: + +:::{.small} +Name |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`--input_train_mod1` |`file` |The mod1 expression values of the train cells. | +`--input_train_mod2` |`file` |The mod2 expression values of the train cells. | +`--input_test_mod1` |`file` |The mod1 expression values of the test cells. | +`--input_model` |`file` |A pretrained model for predicting the expression of one modality from another. | +`--output` |`file` |(_Output_) A prediction of the mod2 expression values of the test cells. | +::: + + + +## Component type: Train + + + +Train a model to predict the expression of one modality from another. + +Arguments: + +:::{.small} +Name |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`--input_train_mod1` |`file` |The mod1 expression values of the train cells. | +`--input_train_mod2` |`file` |The mod2 expression values of the train cells. | +`--input_test_mod1` |`file` |(_Optional_) The mod1 expression values of the test cells. | +`--output` |`file` |(_Output_) A pretrained model for predicting the expression of one modality from another. | +::: + + + +## Component type: Method + + + +A regression method. + +Arguments: + +:::{.small} +Name |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`--input_train_mod1` |`file` |The mod1 expression values of the train cells. 
| +`--input_train_mod2` |`file` |The mod2 expression values of the train cells. | +`--input_test_mod1` |`file` |The mod1 expression values of the test cells. | +`--output` |`file` |(_Output_) A prediction of the mod2 expression values of the test cells. | +::: + + + +## Component type: Metric + + + +A predict modality metric. + +Arguments: + +:::{.small} +Name |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`--input_prediction` |`file` |A prediction of the mod2 expression values of the test cells. | +`--input_test_mod2` |`file` |The mod2 expression values of the test cells. | +`--output` |`file` |(_Output_) Metric score file. | +::: + + + +## File format: Prediction + +A prediction of the mod2 expression values of the test cells + +Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/prediction.h5ad` + + + +Format: + +:::{.small} + AnnData object + layers: 'normalized' + uns: 'dataset_id', 'method_id' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`layers["normalized"]` |`double` |Predicted normalized expression values. | +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["method_id"]` |`string` |A unique identifier for the method. | +::: + + + +## File format: Pretrained model + +A pretrained model for predicting the expression of one modality from another. 
+ + + + + + + + +## File format: Score + +Metric score file + +Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/score.h5ad` + + + +Format: + +:::{.small} + AnnData object + uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["method_id"]` |`string` |A unique identifier for the method. | +`uns["metric_ids"]` |`string` |One or more unique metric identifiers. | +`uns["metric_values"]` |`double` |The metric values obtained for the given prediction. Must be of same length as 'metric_ids'. | +::: + + + +## File format: Raw dataset mod2 + +The second modality of the raw dataset. Must be an ADT or an ATAC dataset + +Example file: `resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod2.h5ad` + + + +Format: + +:::{.small} + AnnData object + obs: 'batch', 'size_factors' + var: 'feature_id', 'feature_name', 'hvg', 'hvg_score' + obsm: 'gene_activity' + layers: 'counts', 'normalized' + uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names' +::: + +Data structure: + +:::{.small} +Slot |Type |Description | +:-------------------------|:--------|:------------------------------------------------------------| +`obs["batch"]` |`string` |Batch information. | +`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. | +`var["feature_id"]` |`string` |Unique identifier for the feature, usually a ENSEMBL gene id. | +`var["feature_name"]` |`string` |(_Optional_) A human-readable name for the feature, usually a gene symbol. | +`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. 
| +`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. | +`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. | +`layers["counts"]` |`integer` |Raw counts. | +`layers["normalized"]` |`double` |Normalized expression values. | +`uns["dataset_id"]` |`string` |A unique identifier for the dataset. | +`uns["dataset_name"]` |`string` |Nicely formatted name. | +`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. | +`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. | +`uns["dataset_summary"]` |`string` |Short description of the dataset. | +`uns["dataset_description"]` |`string` |Long description of the dataset. | +`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. | +`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. | +`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. | +::: + + + diff --git a/src/api/file_common_dataset_mod1.yaml b/src/api/file_common_dataset_mod1.yaml index 21e0b27..f722ff1 100644 --- a/src/api/file_common_dataset_mod1.yaml +++ b/src/api/file_common_dataset_mod1.yaml @@ -3,7 +3,8 @@ example: "resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod1. label: "Raw dataset RNA" summary: "The RNA modality of the raw dataset." info: - slots: + format: + type: h5ad layers: - type: integer name: counts diff --git a/src/api/file_common_dataset_mod2.yaml b/src/api/file_common_dataset_mod2.yaml index 3f522cd..daacd6b 100644 --- a/src/api/file_common_dataset_mod2.yaml +++ b/src/api/file_common_dataset_mod2.yaml @@ -3,7 +3,8 @@ example: "resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod2. label: "Raw dataset mod2" summary: "The second modality of the raw dataset. 
Must be an ADT or an ATAC dataset" info: - slots: + format: + type: h5ad layers: - type: integer name: counts diff --git a/src/api/file_prediction.yaml b/src/api/file_prediction.yaml index bb37c9f..a92a1a0 100644 --- a/src/api/file_prediction.yaml +++ b/src/api/file_prediction.yaml @@ -3,7 +3,8 @@ example: "resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swa label: "Prediction" summary: "A prediction of the mod2 expression values of the test cells" info: - slots: + format: + type: h5ad layers: - type: double name: normalized diff --git a/src/api/file_score.yaml b/src/api/file_score.yaml index 3bdeff4..6770f76 100644 --- a/src/api/file_score.yaml +++ b/src/api/file_score.yaml @@ -3,7 +3,8 @@ example: "resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swa label: "Score" summary: "Metric score file" info: - slots: + format: + type: h5ad uns: - type: string name: dataset_id diff --git a/src/api/file_test_mod1.yaml b/src/api/file_test_mod1.yaml index 8cbae20..29c3fdd 100644 --- a/src/api/file_test_mod1.yaml +++ b/src/api/file_test_mod1.yaml @@ -3,7 +3,8 @@ example: "resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swa label: "Test mod1" summary: "The mod1 expression values of the test cells." info: - slots: + format: + type: h5ad layers: - type: integer name: counts diff --git a/src/api/file_test_mod2.yaml b/src/api/file_test_mod2.yaml index 162d867..7bccfc7 100644 --- a/src/api/file_test_mod2.yaml +++ b/src/api/file_test_mod2.yaml @@ -3,7 +3,8 @@ example: "resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swa label: "Test mod2" summary: "The mod2 expression values of the test cells." 
info: - slots: + format: + type: h5ad layers: - type: integer name: counts diff --git a/src/api/file_train_mod1.yaml b/src/api/file_train_mod1.yaml index c9e0e17..1b463ce 100644 --- a/src/api/file_train_mod1.yaml +++ b/src/api/file_train_mod1.yaml @@ -3,7 +3,8 @@ example: "resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swa label: "Train mod1" summary: "The mod1 expression values of the train cells." info: - slots: + format: + type: h5ad layers: - type: integer name: counts diff --git a/src/api/file_train_mod2.yaml b/src/api/file_train_mod2.yaml index 0b22a89..b608406 100644 --- a/src/api/file_train_mod2.yaml +++ b/src/api/file_train_mod2.yaml @@ -3,7 +3,8 @@ example: "resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swa label: "Train mod2" summary: "The mod2 expression values of the train cells." info: - slots: + format: + type: h5ad layers: - type: integer name: counts
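Reviewer note: the "File format: Prediction" and "File format: Score" tables added to the README list only required slots, so they lend themselves to a quick sanity check. The following is a minimal sketch, not part of this change. The helper names `missing_slots` and `score_lengths_match`, the metric names, and the `SimpleNamespace` stand-in are all hypothetical. A real object would presumably be loaded with `anndata.read_h5ad`, and the checks rely only on `in` membership tests against `layers` and `uns`.

```python
from types import SimpleNamespace

# Required slots, copied from the "File format" tables in the README diff.
REQUIRED_SLOTS = {
    "prediction": {"layers": ["normalized"], "uns": ["dataset_id", "method_id"]},
    "score": {"uns": ["dataset_id", "method_id", "metric_ids", "metric_values"]},
}


def missing_slots(adata, file_type):
    """Return the required slots (e.g. 'uns["method_id"]') absent from adata."""
    missing = []
    for struct, names in REQUIRED_SLOTS[file_type].items():
        container = getattr(adata, struct, None)
        for name in names:
            if container is None or name not in container:
                missing.append(f'{struct}["{name}"]')
    return missing


def score_lengths_match(adata):
    """Check that 'metric_values' has the same length as 'metric_ids'."""
    ids, values = adata.uns["metric_ids"], adata.uns["metric_values"]
    # Assumption: a single metric may be stored as a bare scalar, not a list.
    ids = [ids] if isinstance(ids, str) else list(ids)
    values = [values] if isinstance(values, (int, float)) else list(values)
    return len(ids) == len(values)


# Hypothetical stand-in for an AnnData object holding a score file.
score = SimpleNamespace(
    uns={
        "dataset_id": "example_dataset",
        "method_id": "example_method",
        "metric_ids": ["metric_a", "metric_b"],  # illustrative metric names
        "metric_values": [0.4, 0.7],
    }
)
print(missing_slots(score, "score"))
print(score_lengths_match(score))
```

The slot lists live in one dictionary so that the train and test formats could be added the same way, as long as their `(_Optional_)` slots are left out of the required set.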