Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filterSCE() version 1.22.0 #366

Open
83years opened this issue Aug 21, 2023 · 5 comments
Open

filterSCE() version 1.22.0 #366

83years opened this issue Aug 21, 2023 · 5 comments

Comments

@83years
Copy link

83years commented Aug 21, 2023

Hi Helena,

I hope this finds you well.

I've just updated to the latest version and I think I may have spotted a bug. My dataset is made up of 4 conditions and I want to filter out the "Control" condition before running the clustering and DR.

#remove Controls
table(sce$condition)

       Group1       Control Healthy_Donor        Group2
      1332146        300000        600000        305666 

sce1 <- CATALYST::filterSCE(sce, condition != "Control")
table(sce1$condition)

 Group1 
1332146 

No matter what I negatively select for I always get the group1 data. The length of the data also remains the same when using sce1@colData with non-Group1 data being replaced by NAs.

When I positively select condition == "Control" I get table of extent 0 with all the data replaced by NAs.

Is this a bug or am I doing something wrong?

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggcyto_1.26.4               flowWorkspace_4.10.1        ncdfFlow_2.44.0             BH_1.81.0-1                 ComplexHeatmap_2.14.0       dplyr_1.1.2                
 [7] slingshot_2.6.0             TrajectoryUtils_1.6.0       princurve_2.1.6             ggplot2_3.4.3               RColorBrewer_1.1-3          cytofWorkflow_1.23.0       
[13] cowplot_1.1.1               uwot_0.1.16                 Matrix_1.6-1                HDCytoData_1.18.0           flowCore_2.10.0             ExperimentHub_2.6.0        
[19] AnnotationHub_3.6.0         BiocFileCache_2.6.1         dbplyr_2.3.3                diffcyt_1.18.0              CATALYST_1.22.0             SingleCellExperiment_1.20.1
[25] SummarizedExperiment_1.28.0 Biobase_2.58.0              GenomicRanges_1.50.2        GenomeInfoDb_1.34.9         IRanges_2.32.0              S4Vectors_0.36.0           
[31] BiocGenerics_0.44.0         MatrixGenerics_1.10.0       matrixStats_1.0.0           readxl_1.4.3                knitr_1.43                  BiocStyle_2.26.0           

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3                ragg_1.2.5                    tidyr_1.3.0                   bit64_4.0.5                   irlba_2.3.5.1                
  [6] multcomp_1.4-25               DelayedArray_0.24.0           data.table_1.14.8             KEGGREST_1.38.0               RCurl_1.98-1.12              
 [11] doParallel_1.0.17             generics_0.1.3                ScaledMatrix_1.6.0            TH.data_1.1-2                 RSQLite_2.3.1                
 [16] bit_4.0.5                     httpuv_1.6.11                 viridis_0.6.4                 xfun_0.40                     evaluate_0.21                
 [21] promises_1.2.1                fansi_1.0.4                   Rgraphviz_2.42.0              igraph_1.5.1                  DBI_1.1.3                    
 [26] purrr_1.0.2                   ellipsis_0.3.2                ggnewscale_0.4.9              ggpubr_0.6.0                  backports_1.4.1              
 [31] cytolib_2.10.1                ROI_1.0-1                     sparseMatrixStats_1.10.0      vctrs_0.6.3                   abind_1.4-5                  
 [36] cachem_1.0.8                  withr_2.5.0                   ggforce_0.4.1                 checkmate_2.2.0               cluster_2.1.4                
 [41] crayon_1.5.2                  drc_3.0-1                     edgeR_3.40.2                  pkgconfig_2.0.3               slam_0.1-50                  
 [46] labeling_0.4.2                tweenr_2.0.2                  nlme_3.1-160                  vipor_0.4.5                   rlang_1.1.1                  
 [51] lifecycle_1.0.3               sandwich_3.0-2                registry_0.5-1                filelock_1.0.2                rsvd_1.0.5                   
 [56] cellranger_1.1.0              polyclip_1.10-4               graph_1.76.0                  carData_3.0-5                 boot_1.3-28                  
 [61] zoo_1.8-12                    beeswarm_0.4.0                ggridges_0.5.4                GlobalOptions_0.1.2           pheatmap_1.0.12              
 [66] png_0.1-8                     viridisLite_0.4.2             rjson_0.2.21                  bitops_1.0-7                  ConsensusClusterPlus_1.62.0  
 [71] Biostrings_2.66.0             blob_1.2.4                    DelayedMatrixStats_1.20.0     shape_1.4.6                   stringr_1.5.0                
 [76] rstatix_0.7.2                 ggsignif_0.6.4                beachmat_2.14.2               scales_1.2.1                  memoise_2.0.1                
 [81] magrittr_2.0.3                plyr_1.8.8                    hexbin_1.28.3                 zlibbioc_1.44.0               compiler_4.2.2               
 [86] plotrix_3.8-2                 clue_0.3-64                   lme4_1.1-34                   cli_3.6.1                     XVector_0.38.0               
 [91] FlowSOM_2.6.0                 MASS_7.3-58.1                 tidyselect_1.2.0              stringi_1.7.12                RProtoBufLib_2.10.0          
 [96] textshaping_0.3.6             yaml_2.3.7                    BiocSingular_1.14.0           locfit_1.5-9.8                ggrepel_0.9.3                
[101] tools_4.2.2                   parallel_4.2.2                circlize_0.4.15               rstudioapi_0.15.0             foreach_1.5.2                
[106] gridExtra_2.3                 farver_2.1.1                  Rtsne_0.16                    digest_0.6.33                 BiocManager_1.30.22          
[111] shiny_1.7.5                   Rcpp_1.0.11                   car_3.1-2                     broom_1.0.5                   scuttle_1.8.4                
[116] BiocVersion_3.16.0            later_1.3.1                   httr_1.4.7                    AnnotationDbi_1.60.2          colorspace_2.1-0             
[121] XML_3.99-0.14                 splines_4.2.2                 scater_1.26.1                 systemfonts_1.0.4             xtable_1.8-4                 
[126] nloptr_2.0.3                  R6_2.5.1                      pillar_1.9.0                  htmltools_0.5.6               mime_0.12                    
[131] nnls_1.4                      glue_1.6.2                    fastmap_1.1.1                 minqa_1.2.5                   BiocParallel_1.32.6          
[136] BiocNeighbors_1.16.0          interactiveDisplayBase_1.36.0 codetools_0.2-18              mvtnorm_1.2-2                 utf8_1.2.3                   
[141] lattice_0.20-45               tibble_3.2.1                  numDeriv_2016.8-1.1           curl_5.0.2                    ggbeeswarm_0.7.2             
[146] colorRamps_2.3.1              gtools_3.9.4                  survival_3.4-0                limma_3.54.2                  rmarkdown_2.24               
[151] munsell_0.5.0                 GetoptLong_1.0.5              GenomeInfoDbData_1.2.9        iterators_1.0.14              reshape2_1.4.4               
[156] gtable_0.3.3   
@83years
Copy link
Author

83years commented Aug 23, 2023

I have a work-around from the tidySingleCellExperiment package.

> sce <- filter(sce, condition != "Control")
> table(sce$condition)

       Group1       Control Healthy_Donor        Group2
      1332146             0        600000        305666 

@david-priest
Copy link

I have noticed that filterSCE() may drop some data if there are some empty entries in the initial metadata spreadsheet.

@HelenaLC
Copy link
Owner

Can't reproduce this currently... Could you post your metadata table? I.e., metadata(sce)$experiment_info? As opposed to the tidy solution, which only subsets based on the colData, filterSCE is specifically designed to work within the CATALYST framework. E.g., it will also assure metadata are in synch. For example, the experiment_info will be recomputed after filtering (using internal CATALYST:::.get_ei()), and cluster_codes will also be adjusted. The only thing going wrong here, I can image, is something with the sample_ids, which are needed for the experiment_info to be set up correctly.

@david-priest
Copy link

Sorry it's been a while since I encountered the issue. But one thing that seems to work for me is to replace ei() in filterSCE() with this modified function, which still relies on sample_ids().

ei2 <- function(sce, meta = c("patient_id"))
  
{
# Make a data frame of sample_ids
ns <- table(sample_id = sample_ids(sce))
df <- as.data.frame(ns)
m <- match(df$sample_id, sce$sample_id)

# Extract metadata
for (i in meta) df[[i]] <- sce[[i]][m]

return(df)
}

@HelenaLC
Copy link
Owner

Yeah, sure, that works in certain cases- the issue is that match will return the 1st matching entry, and there is no guarantee that sample_ids match uniquely to what's in the SCE's colData. This could result in discrepancies that may go unnoticed. E.g., cells from a given sample_id will have different cluster_ids- that's what's "special" with how .get_ei() constructs the metadata: only variables that match uniquely to sample_ids will be retained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants