Skip to content

Commit

Permalink
Merge pull request #91 from ArcInstitute/dev
Browse files Browse the repository at this point in the history
Add low count detection within pair-wise comparisons
  • Loading branch information
abearab authored Aug 12, 2024
2 parents 66c6b15 + 88ed081 commit c77a166
Show file tree
Hide file tree
Showing 5 changed files with 177 additions and 106 deletions.
183 changes: 98 additions & 85 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,28 +36,6 @@ ___

___

<details>
<summary>Benchmarking</summary>
<br>

Benchmarking ScreenPro2 with other CRISPR screen analysis tools

### More thoughtful NGS read trimming recovers more sgRNA counts

### ScreenPro2 statistical analysis is more accurate than ScreenProcessing

### ScreenPro2 is more flexible than ScreenProcessing

Not only does ScreenPro2 have more features than ScreenProcessing, but it is also more flexible. ScreenPro2 can process data from diverse CRISPR screen platforms and is designed to be modular to enable easy extension to custom CRISPR screen platforms or other commonly used platforms in addition to the ones currently implemented.

### ScreenPro2 is faster than ScreenProcessing

Last but not least, ScreenPro2 runs faster than ScreenProcessing (thanks to [biobear](https://github.com/wheretrue/biobear)) for processing FASTQ files.

</details>

___

## Installation
ScreenPro2 is available on [PyPI](https://pypi.org/project/ScreenPro2/) and can be installed with pip:
```bash
Expand Down Expand Up @@ -97,8 +75,6 @@ Data analysis for CRISPR screens with NGS readouts can be broken down into three

The first step in analyzing CRISPR screens with deep sequencing readouts is to process the FASTQ files and generate counts for each guide RNA element in the library. ScreenPro2 has built-in functionalities to process FASTQ files and generate counts for different types of CRISPR screens platforms (see [Supported CRISPR Screen Platforms](#supported-crispr-screen-platforms)).

___

<details>
<summary>Command Line Interface (CLI)</summary>
<br>
Expand Down Expand Up @@ -133,11 +109,10 @@ ___
-o <output-directory>
--write-count-matrix
```
___

</details>

___

<details>
<summary>Python Package Usage</summary>
<br>
Expand Down Expand Up @@ -205,85 +180,123 @@ ___
```python
adata = counter.build_counts_anndata()
```

___

</details>

### Step 2: Phenotype calculation

Once you have the counts, you can use ScreenPro2 `phenoscore` and `phenostats` modules to calculate the phenotype scores and statistics between screen arms.

#### Load Data
First, load your data into an `AnnData` object (see [anndata](https://anndata.readthedocs.io/en/latest/index.html) for more information).
<details>
<summary>Load Data</summary>
<br>

The `AnnData` object must have the following contents:
- `adata.X` – counts matrix (samples x targets) where each value represents the sequencing count from NGS data.
- `adata.obs` – a pandas dataframe of samples metadata including "condition" and "replicate" columns.
- "condition": the condition for each sample in the experiment.
- "replicate": the replicate number for each sample in the experiment.
- `adata.var` – a pandas dataframe of targets in sgRNA library including "target" and "targetType" columns.
- "target": the target for each entry in reference sgRNA library. For single sgRNA libraries, this column can be
used to store gene names. For dual or multiple targeting sgRNA libraries, this column can be used to store gene pairs
or any other relevant information about the target.
- "targetType": the type of target for each entry in reference sgRNA library. Note that this column is used to
distinguish between different types of sgRNAs in the library and negative control sgRNAs can be defined as `"targetType" == "negative_control"`.
This is important for the phenotype calculation step.
First, load your data into an `AnnData` object (see [anndata](https://anndata.readthedocs.io/en/latest/index.html) for more information).

The `AnnData` object must have the following contents:
- `adata.X` – counts matrix (samples x targets) where each value represents the sequencing count from NGS data.
- `adata.obs` – a pandas dataframe of samples metadata including "condition" and "replicate" columns.
- "condition": the condition for each sample in the experiment.
- "replicate": the replicate number for each sample in the experiment.
- `adata.var` – a pandas dataframe of targets in sgRNA library including "target" and "targetType" columns.
- "target": the target for each entry in reference sgRNA library. For single sgRNA libraries, this column can be
used to store gene names. For dual or multiple targeting sgRNA libraries, this column can be used to store gene pairs
or any other relevant information about the target.
- "targetType": the type of target for each entry in reference sgRNA library. Note that this column is used to
distinguish between different types of sgRNAs in the library and negative control sgRNAs can be defined as `"targetType" == "negative_control"`.
This is important for the phenotype calculation step.

ScreenPro2 has a built-in class for different types of CRISPR screen assays. Currently, there is a class called `PooledScreens`
that can be used to process data from pooled CRISPR screens. To create a `PooledScreens` object from an `AnnData` object,
you can use the following example code:

```python
import pandas as pd
import anndata as ad
from screenpro.assays import PooledScreens
ScreenPro2 has a built-in class for different types of CRISPR screen assays. Currently, there is a class called `PooledScreens`
that can be used to process data from pooled CRISPR screens. To create a `PooledScreens` object from an `AnnData` object,
you can use the following example code:

adata = ad.AnnData(
X = counts_df, # pandas dataframe of counts (samples x targets)
obs = meta_df, # pandas dataframe of samples metadata including "condition" and "replicate" columns
var = target_df # pandas dataframe of targets metadata including "target" and "targetType" columns
)
```python
import pandas as pd
import anndata as ad
from screenpro.assays import PooledScreens

adata = ad.AnnData(
X = counts_df, # pandas dataframe of counts (samples x targets)
obs = meta_df, # pandas dataframe of samples metadata including "condition" and "replicate" columns
var = target_df # pandas dataframe of targets metadata including "target" and "targetType" columns
)

screen = PooledScreens(adata)
```
screen = PooledScreens(adata)
```

<img width="600" alt="image" src="https://github.com/ArcInstitute/ScreenPro2/assets/53412130/bb38d119-8f24-44fa-98ab-7ef4457ef8d2">
<img width="600" alt="image" src="https://github.com/ArcInstitute/ScreenPro2/assets/53412130/bb38d119-8f24-44fa-98ab-7ef4457ef8d2">

#### Perform Screen Processing Analysis
Once the screen object is created, you can use several available workflows to calculate the phenotype scores and statisitics by comparing each entry in reference sgRNA library between screen arms. Then, these scores and statistics are used to nominate hits.
___

##### Drug Screen Workflow: calculate `gamma`, `rho`, and `tau` scores
`.calculateDrugScreen` method can be used to calculate the enrichment of each gene between screen arms for a drug
screen experiment. This method calculates `gamma`, `rho`, and `tau` scores for each gene and adds them to the
`.phenotypes` attribute of the `PooledScreens` object.
</details>

Here is an example for running the workflow on a [CRISPRi-dual-sgRNA-screens](#dcas9-crisprai-dual-sgrna-screens) dataset:
<details>
<summary>Run workflows</summary>
<br>

```python
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
screen.calculateDrugScreen(
t0='T0',
untreated='DMSO', # replace with the label for untreated condition
treated='Drug', # replace with the label for treated condition
score_level='compare_reps'
)
```
___
For example, in a Decitabine CRISPRi drug screen (see Figure 1B-C in [this bioRxiv paper](https://www.biorxiv.org/content/10.1101/2022.12.14.518457v2.full)), each phenotype score represents a comparison between different arms of the screen and `rho` scores shows the main drug phenotype as illustrated here:
<img width="800" alt="image" src="https://github.com/abearab/ScreenPro2/assets/53412130/b84b3e1f-e049-4da6-b63d-d4c72bc97cda">
Once the screen object is created, you can use several available workflows to calculate the phenotype scores and statisitics by comparing each entry in reference sgRNA library between screen arms. Then, these scores and statistics are used to nominate hits.

##### Flow cytometry based screen workflow: calculate phenotype score to compare high and low bins
`.calculateFlowBasedScreen` method can be used to calculate the enrichment of each target between high bin vs. low bin
of a flow cytometry-based screen experiment. This method calculates `PhenoScore` for each target and adds them to the
`.phenotypes` attribute of the `PooledScreens` object.
##### Drug Screen Workflow: calculate `gamma`, `rho`, and `tau` scores
`.calculateDrugScreen` method can be used to calculate the enrichment of each gene between screen arms for a drug
screen experiment. This method calculates `gamma`, `rho`, and `tau` scores for each gene and adds them to the
`.phenotypes` attribute of the `PooledScreens` object.

Here is an example for running the workflow on a [CRISPRi-dual-sgRNA-screens](#dcas9-crisprai-dual-sgrna-screens) dataset:

```python
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
screen.calculateDrugScreen(
t0='T0',
untreated='DMSO', # replace with the label for untreated condition
treated='Drug', # replace with the label for treated condition
score_level='compare_reps'
)
```
___
For example, in a Decitabine CRISPRi drug screen (see Figure 1B-C in [this bioRxiv paper](https://www.biorxiv.org/content/10.1101/2022.12.14.518457v2.full)), each phenotype score represents a comparison between different arms of the screen and `rho` scores shows the main drug phenotype as illustrated here:
<img width="800" alt="image" src="https://github.com/abearab/ScreenPro2/assets/53412130/b84b3e1f-e049-4da6-b63d-d4c72bc97cda">

##### Flow cytometry based screen workflow: calculate phenotype score to compare high and low bins
`.calculateFlowBasedScreen` method can be used to calculate the enrichment of each target between high bin vs. low bin
of a flow cytometry-based screen experiment. This method calculates `PhenoScore` for each target and adds them to the
`.phenotypes` attribute of the `PooledScreens` object.

```python
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
screen.calculateFlowBasedScreen(
low_bin='low_bin', high_bin='high_bin',
score_level='compare_reps'
)
```
___

</details>

<details>
<summary>Benchmarking ScreenPro2 vs other CRISPR screen processing tools</summary>
<br>

Coming soon...

</details>

<!-- Benchmarking ScreenPro2 with other CRISPR screen analysis tools
### More thoughtful NGS read trimming recovers more sgRNA counts
### ScreenPro2 statistical analysis is more accurate than ScreenProcessing
### ScreenPro2 is more flexible than ScreenProcessing
Not only does ScreenPro2 have more features than ScreenProcessing, but it is also more flexible. ScreenPro2 can process data from diverse CRISPR screen platforms and is designed to be modular to enable easy extension to custom CRISPR screen platforms or other commonly used platforms in addition to the ones currently implemented.
### ScreenPro2 is faster than ScreenProcessing
Last but not least, ScreenPro2 runs faster than ScreenProcessing (thanks to [biobear](https://github.com/wheretrue/biobear)) for processing FASTQ files. -->

```python
# Run the ScreenPro2 workflow for CRISPRi-dual-sgRNA-screens
screen.calculateFlowBasedScreen(
low_bin='low_bin', high_bin='high_bin',
score_level='compare_reps'
)
```

### Step 3: Data visualization

Expand Down
2 changes: 1 addition & 1 deletion screenpro/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,6 @@
from .dashboard import DrugScreenDashboard


__version__ = "0.4.10"
__version__ = "0.4.11"
__author__ = "Abe Arab"
__email__ = '[email protected]' # "[email protected]"
2 changes: 1 addition & 1 deletion screenpro/assays/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,7 @@ def _auto_run_name(self):
)
return run_name

def filterLowCounts(self, filter_type='all', minimum_reads=50):
def filterLowCounts(self, filter_type='all', minimum_reads=1):
"""
Filter low counts in adata.X
"""
Expand Down
12 changes: 10 additions & 2 deletions screenpro/phenoscore/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,9 @@
def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', test='ttest',
growth_rate=1, n_reps='auto', keep_top_n = None, collapse_var=False,
num_pseudogenes='auto', pseudogene_size='auto',
count_layer=None, ctrl_label='negative_control'):
count_layer=None, count_filter_type='mean', count_filter_threshold=40,
ctrl_label='negative_control'
):
"""Calculate phenotype score and p-values when comparing `cond_test` vs `cond_ref`.
Args:
Expand All @@ -44,6 +46,8 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
num_pseudogenes (int): number of pseudogenes to generate
pseudogene_size (int): number of sgRNA elements in each pseudogene
count_layer (str): count layer to use for calculating score, default is None (use default count layer in adata.X)
count_filter_type (str): filter type for counts, default is 'mean'
count_filter_threshold (int): filter threshold for counts, default is 40
ctrl_label (str): control label, default is 'negative_control'
Returns:
Expand Down Expand Up @@ -85,7 +89,9 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
var_names=var_names,
test=test,
ctrl_label=ctrl_label,
growth_rate=growth_rate
growth_rate=growth_rate,
filter_type=count_filter_type,
filter_threshold=count_filter_threshold
)

elif score_level in ['compare_guides']:
Expand Down Expand Up @@ -114,6 +120,8 @@ def runPhenoScore(adata, cond_ref, cond_test, score_level, var_names='target', t
test=test,
ctrl_label=ctrl_label,
growth_rate=growth_rate,
filter_type=count_filter_type,
filter_threshold=count_filter_threshold
)

# get best best transcript as lowest p-value for each target
Expand Down
Loading

0 comments on commit c77a166

Please sign in to comment.