Merge pull request #91 from ArcInstitute/dev

Add low count detection within pair-wise comparisons
ArcInstitute · Aug 12, 2024 · c77a166 · c77a166
2 parents 66c6b15 + 88ed081
commit c77a166
Show file tree

Hide file tree

Showing 5 changed files with 177 additions and 106 deletions.
diff --git a/README.md b/README.md
@@ -36,28 +36,6 @@ ___
 
 ___
 
-<details>
-  <summary>Benchmarking</summary>
-  <br>
-
-  Benchmarking ScreenPro2 with other CRISPR screen analysis tools
-
-  ### More thoughtful NGS read trimming recovers more sgRNA counts
-
-  ### ScreenPro2 statistical analysis is more accurate than ScreenProcessing
-
-  ### ScreenPro2 is more flexible than ScreenProcessing
-
-  Not only does ScreenPro2 have more features than ScreenProcessing, but it is also more flexible. ScreenPro2 can process data from diverse CRISPR screen platforms and is designed to be modular to enable easy extension to custom CRISPR screen platforms or other commonly used platforms in addition to the ones currently implemented.
-
-  ### ScreenPro2 is faster than ScreenProcessing
-
-  Last but not least, ScreenPro2 runs faster than ScreenProcessing (thanks to [biobear](https://github.com/wheretrue/biobear)) for processing FASTQ files.
-
-</details>
-
-___
-
 ## Installation
 ScreenPro2 is available on [PyPI](https://pypi.org/project/ScreenPro2/) and can be installed with pip:
 ```bash
@@ -97,8 +75,6 @@ Data analysis for CRISPR screens with NGS readouts can be broken down into three
 
 The first step in analyzing CRISPR screens with deep sequencing readouts is to process the FASTQ files and generate counts for each guide RNA element in the library. ScreenPro2 has built-in functionalities to process FASTQ files and generate counts for different types of CRISPR screens platforms (see [Supported CRISPR Screen Platforms](#supported-crispr-screen-platforms)).
 
-___
-
 <details>
   <summary>Command Line Interface (CLI)</summary>
   <br>
@@ -133,11 +109,10 @@ ___
     -o <output-directory>
     --write-count-matrix
   ```
+  ___
 
 </details>
 
-___
-
 <details>
   <summary>Python Package Usage</summary>
   <br>
@@ -205,85 +180,123 @@ ___
   ```python
   adata = counter.build_counts_anndata()
   ```
+
+  ___
 
 </details>
 
 ### Step 2: Phenotype calculation
 
 Once you have the counts, you can use ScreenPro2 `phenoscore` and `phenostats` modules to calculate the phenotype scores and statistics between screen arms.
 
-#### Load Data
-First, load your data into an `AnnData` object (see [anndata](https://anndata.readthedocs.io/en/latest/index.html) for more information).
+<details>
+  <summary>Load Data</summary>
+  <br>
 
-The `AnnData` object must have the following contents:
-- `adata.X` – counts matrix (samples x targets) where each value represents the sequencing count from NGS data.
-- `adata.obs` – a pandas dataframe of samples metadata including "condition" and "replicate" columns.
-  - "condition": the condition for each sample in the experiment.
-  - "replicate": the replicate number for each sample in the experiment.
-- `adata.var` – a pandas dataframe of targets in sgRNA library including "target" and "targetType" columns.
-  - "target": the target for each entry in reference sgRNA library. For single sgRNA libraries, this column can be 
-    used to store gene names. For dual or multiple targeting sgRNA libraries, this column can be used to store gene pairs
-    or any other relevant information about the target.
-  - "targetType": the type of target for each entry in reference sgRNA library. Note that this column is used to 
-    distinguish between different types of sgRNAs in the library and negative control sgRNAs can be defined as `"targetType" == "negative_control"`.
-    This is important for the phenotype calculation step.
+  First, load your data into an `AnnData` object (see [anndata](https://anndata.readthedocs.io/en/latest/index.html) for more information).
 
+  The `AnnData` object must have the following contents:
+  - `adata.X` – counts matrix (samples x targets) where each value represents the sequencing count from NGS data.
+  - `adata.obs` – a pandas dataframe of samples metadata including "condition" and "replicate" columns.
+    - "condition": the condition for each sample in the experiment.
+    - "replicate": the replicate number for each sample in the experiment.
+  - `adata.var` – a pandas dataframe of targets in sgRNA library including "target" and "targetType" columns.
+    - "target": the target for each entry in reference sgRNA library. For single sgRNA libraries, this column can be 
+      used to store gene names. For dual or multiple targeting sgRNA libraries, this column can be used to store gene pairs
+      or any other relevant information about the target.
+    - "targetType": the type of target for each entry in reference sgRNA library. Note that this column is used to 
+      distinguish between different types of sgRNAs in the library and negative control sgRNAs can be defined as `"targetType" == "negative_control"`.
+      This is important for the phenotype calculation step.
 
-ScreenPro2 has a built-in class for different types of CRISPR screen assays. Currently, there is a class called `PooledScreens` 
-that can be used to process data from pooled CRISPR screens. To create a `PooledScreens` object from an `AnnData` object, 
-you can use the following example code:
 
-```python
-import pandas as pd
-import anndata as ad
-from screenpro.assays import PooledScreens
+  ScreenPro2 has a built-in class for different types of CRISPR screen assays. Currently, there is a class called `PooledScreens` 
+  that can be used to process data from pooled CRISPR screens. To create a `PooledScreens` object from an `AnnData` object, 
+  you can use the following example code:
 
-adata = ad.AnnData(
-    X   = counts_df, # pandas dataframe of counts (samples x targets)
-    obs = meta_df,   # pandas dataframe of samples metadata including "condition" and "replicate" columns
-    var = target_df  # pandas dataframe of targets metadata including "target" and "targetType" columns
-)
+  ```python
+  import pandas as pd
+  import anndata as ad
+  from screenpro.assays import PooledScreens
+
+  adata = ad.AnnData(
+      X   = counts_df, # pandas dataframe of counts (samples x targets)
+      obs = meta_df,   # pandas dataframe of samples metadata including "condition" and "replicate" columns
+      var = target_df  # pandas dataframe of targets metadata including "target" and "targetType" columns
+  )
 
-screen = PooledScreens(adata)
-```
+  screen = PooledScreens(adata)
+  ```
 
-<img width="600" alt="image" src="https://github.com/ArcInstitute/ScreenPro2/assets/53412130/bb38d119-8f24-44fa-98ab-7ef4457ef8d2">
+  <img width="600" alt="image" src="https://github.com/ArcInstitute/ScreenPro2/assets/53412130/bb38d119-8f24-44fa-98ab-7ef4457ef8d2">
 
-#### Perform Screen Processing Analysis
-Once the screen object is created, you can use several available workflows to calculate the phenotype scores and statisitics by comparing each entry in reference sgRNA library between screen arms. Then, these scores and statistics are used to nominate hits.
+  ___
 
-##### Drug Screen Workflow: calculate `gamma`, `rho`, and `tau` scores
-`.calculateDrugScreen` method can be used to calculate the enrichment of each gene between screen arms for a drug 
-screen experiment. This method calculates `gamma`, `rho`, and `tau` scores for each gene and adds them to the 
-`.phenotypes` attribute of the `PooledScreens` object.
+</details>
 
-Here is an example for running the workflow on a [CRISPRi-dual-sgRNA-screens](#dcas9-crisprai-dual-sgrna-screens) dataset:
+<details>
+  <summary>Run workflows</summary>
+  <br>
 
-```python
-# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
-screen.calculateDrugScreen(
-  t0='T0',
-  untreated='DMSO',  # replace with the label for untreated condition
-  treated='Drug',    # replace with the label for treated condition
-  score_level='compare_reps'
-)
-```
-___
-For example, in a Decitabine CRISPRi drug screen (see Figure 1B-C in [this bioRxiv paper](https://www.biorxiv.org/content/10.1101/2022.12.14.518457v2.full)), each phenotype score represents a comparison between different arms of the screen and `rho` scores shows the main drug phenotype as illustrated here:
-<img width="800" alt="image" src="https://github.com/abearab/ScreenPro2/assets/53412130/b84b3e1f-e049-4da6-b63d-d4c72bc97cda">
+  Once the screen object is created, you can use several available workflows to calculate the phenotype scores and statisitics by comparing each entry in reference sgRNA library between screen arms. Then, these scores and statistics are used to nominate hits.
 
-##### Flow cytometry based screen workflow: calculate phenotype score to compare high and low bins
-`.calculateFlowBasedScreen` method can be used to calculate the enrichment of each target between high bin vs. low bin 
-of a flow cytometry-based screen experiment. This method calculates `PhenoScore` for each target and adds them to the 
-`.phenotypes` attribute of the `PooledScreens` object.
+  ##### Drug Screen Workflow: calculate `gamma`, `rho`, and `tau` scores
+  `.calculateDrugScreen` method can be used to calculate the enrichment of each gene between screen arms for a drug 
+  screen experiment. This method calculates `gamma`, `rho`, and `tau` scores for each gene and adds them to the 
+  `.phenotypes` attribute of the `PooledScreens` object.
+
+  Here is an example for running the workflow on a [CRISPRi-dual-sgRNA-screens](#dcas9-crisprai-dual-sgrna-screens) dataset:
+
+  ```python
+  # Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
+  screen.calculateDrugScreen(
+    t0='T0',
+    untreated='DMSO',  # replace with the label for untreated condition
+    treated='Drug',    # replace with the label for treated condition
+    score_level='compare_reps'
+  )
+  ```
+  ___
+  For example, in a Decitabine CRISPRi drug screen (see Figure 1B-C in [this bioRxiv paper](https://www.biorxiv.org/content/10.1101/2022.12.14.518457v2.full)), each phenotype score represents a comparison between different arms of the screen and `rho` scores shows the main drug phenotype as illustrated here:
+  <img width="800" alt="image" src="https://github.com/abearab/ScreenPro2/assets/53412130/b84b3e1f-e049-4da6-b63d-d4c72bc97cda">
+
+  ##### Flow cytometry based screen workflow: calculate phenotype score to compare high and low bins
+  `.calculateFlowBasedScreen` method can be used to calculate the enrichment of each target between high bin vs. low bin 
+  of a flow cytometry-based screen experiment. This method calculates `PhenoScore` for each target and adds them to the 
+  `.phenotypes` attribute of the `PooledScreens` object.
+
+  ```python
+  # Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
+  screen.calculateFlowBasedScreen(
+    low_bin='low_bin', high_bin='high_bin',
+    score_level='compare_reps'
+  )
+  ```
+  ___
+
+</details>
+
+<details>
+  <summary>Benchmarking ScreenPro2 vs other CRISPR screen processing tools</summary>
+  <br>
+
+  Coming soon...
+
+</details>
+
+  <!-- Benchmarking ScreenPro2 with other CRISPR screen analysis tools
+
+  ### More thoughtful NGS read trimming recovers more sgRNA counts
+
+  ### ScreenPro2 statistical analysis is more accurate than ScreenProcessing
+
+  ### ScreenPro2 is more flexible than ScreenProcessing
+
+  Not only does ScreenPro2 have more features than ScreenProcessing, but it is also more flexible. ScreenPro2 can process data from diverse CRISPR screen platforms and is designed to be modular to enable easy extension to custom CRISPR screen platforms or other commonly used platforms in addition to the ones currently implemented.
+
+  ### ScreenPro2 is faster than ScreenProcessing
+
+  Last but not least, ScreenPro2 runs faster than ScreenProcessing (thanks to [biobear](https://github.com/wheretrue/biobear)) for processing FASTQ files. -->
 
-```python
-# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
-screen.calculateFlowBasedScreen(
-  low_bin='low_bin', high_bin='high_bin',
-  score_level='compare_reps'
-)
-```
 
 ### Step 3: Data visualization
 

diff --git a/screenpro/__init__.py b/screenpro/__init__.py
@@ -31,6 +31,6 @@
 from .dashboard import DrugScreenDashboard
 
 
-__version__ = "0.4.10"
+__version__ = "0.4.11"
 __author__ = "Abe Arab"
 __email__ = '[email protected]' # "[email protected]"
diff --git a/screenpro/assays/__init__.py b/screenpro/assays/__init__.py
@@ -110,7 +110,7 @@ def _auto_run_name(self):
             )
         return run_name
 
-    def filterLowCounts(self, filter_type='all', minimum_reads=50):
+    def filterLowCounts(self, filter_type='all', minimum_reads=1):
         """
         Filter low counts in adata.X
         """

diff --git a/screenpro/phenoscore/__init__.py b/screenpro/phenoscore/__init__.py
@@ -28,7 +28,9 @@
 def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', test='ttest',
                   growth_rate=1, n_reps='auto', keep_top_n = None, collapse_var=False,
                   num_pseudogenes='auto', pseudogene_size='auto',
-                  count_layer=None, ctrl_label='negative_control'):
+                  count_layer=None, count_filter_type='mean', count_filter_threshold=40,
+                  ctrl_label='negative_control'
+                  ):
     """Calculate phenotype score and p-values when comparing `cond_test` vs `cond_ref`.
 
     Args:
@@ -44,6 +46,8 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
         num_pseudogenes (int): number of pseudogenes to generate
         pseudogene_size (int): number of sgRNA elements in each pseudogene
         count_layer (str): count layer to use for calculating score, default is None (use default count layer in adata.X)
+        count_filter_type (str): filter type for counts, default is 'mean'
+        count_filter_threshold (int): filter threshold for counts, default is 40
         ctrl_label (str): control label, default is 'negative_control'
     
     Returns:
@@ -85,7 +89,9 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
             var_names=var_names,
             test=test,
             ctrl_label=ctrl_label,
-            growth_rate=growth_rate
+            growth_rate=growth_rate,
+            filter_type=count_filter_type,
+            filter_threshold=count_filter_threshold
         )
 
     elif score_level in ['compare_guides']:
@@ -114,6 +120,8 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
             test=test,
             ctrl_label=ctrl_label,
             growth_rate=growth_rate,
+            filter_type=count_filter_type,
+            filter_threshold=count_filter_threshold
         )
 
         # get best best transcript as lowest p-value for each target