filterSCE() version 1.22.0 #366

83years · 2023-08-21T13:05:28Z

Hi Helena,

I hope this finds you well.

I've just updated to the latest version and I think I may have spotted a bug. My dataset is made up of 4 conditions and I want to filter out the "Control" condition before running the clustering and DR.

#remove Controls
table(sce$condition)

       Group1       Control Healthy_Donor        Group2
      1332146        300000        600000        305666 

sce1 <- CATALYST::filterSCE(sce, condition != "Control")
table(sce1$condition)

 Group1 
1332146

No matter what I negatively select for I always get the group1 data. The length of the data also remains the same when using sce1@colData with non-Group1 data being replaced by NAs.

When I positively select condition == "Control" I get table of extent 0 with all the data replaced by NAs.

Is this a bug or am I doing something wrong?

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggcyto_1.26.4               flowWorkspace_4.10.1        ncdfFlow_2.44.0             BH_1.81.0-1                 ComplexHeatmap_2.14.0       dplyr_1.1.2                
 [7] slingshot_2.6.0             TrajectoryUtils_1.6.0       princurve_2.1.6             ggplot2_3.4.3               RColorBrewer_1.1-3          cytofWorkflow_1.23.0       
[13] cowplot_1.1.1               uwot_0.1.16                 Matrix_1.6-1                HDCytoData_1.18.0           flowCore_2.10.0             ExperimentHub_2.6.0        
[19] AnnotationHub_3.6.0         BiocFileCache_2.6.1         dbplyr_2.3.3                diffcyt_1.18.0              CATALYST_1.22.0             SingleCellExperiment_1.20.1
[25] SummarizedExperiment_1.28.0 Biobase_2.58.0              GenomicRanges_1.50.2        GenomeInfoDb_1.34.9         IRanges_2.32.0              S4Vectors_0.36.0           
[31] BiocGenerics_0.44.0         MatrixGenerics_1.10.0       matrixStats_1.0.0           readxl_1.4.3                knitr_1.43                  BiocStyle_2.26.0           

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3                ragg_1.2.5                    tidyr_1.3.0                   bit64_4.0.5                   irlba_2.3.5.1                
  [6] multcomp_1.4-25               DelayedArray_0.24.0           data.table_1.14.8             KEGGREST_1.38.0               RCurl_1.98-1.12              
 [11] doParallel_1.0.17             generics_0.1.3                ScaledMatrix_1.6.0            TH.data_1.1-2                 RSQLite_2.3.1                
 [16] bit_4.0.5                     httpuv_1.6.11                 viridis_0.6.4                 xfun_0.40                     evaluate_0.21                
 [21] promises_1.2.1                fansi_1.0.4                   Rgraphviz_2.42.0              igraph_1.5.1                  DBI_1.1.3                    
 [26] purrr_1.0.2                   ellipsis_0.3.2                ggnewscale_0.4.9              ggpubr_0.6.0                  backports_1.4.1              
 [31] cytolib_2.10.1                ROI_1.0-1                     sparseMatrixStats_1.10.0      vctrs_0.6.3                   abind_1.4-5                  
 [36] cachem_1.0.8                  withr_2.5.0                   ggforce_0.4.1                 checkmate_2.2.0               cluster_2.1.4                
 [41] crayon_1.5.2                  drc_3.0-1                     edgeR_3.40.2                  pkgconfig_2.0.3               slam_0.1-50                  
 [46] labeling_0.4.2                tweenr_2.0.2                  nlme_3.1-160                  vipor_0.4.5                   rlang_1.1.1                  
 [51] lifecycle_1.0.3               sandwich_3.0-2                registry_0.5-1                filelock_1.0.2                rsvd_1.0.5                   
 [56] cellranger_1.1.0              polyclip_1.10-4               graph_1.76.0                  carData_3.0-5                 boot_1.3-28                  
 [61] zoo_1.8-12                    beeswarm_0.4.0                ggridges_0.5.4                GlobalOptions_0.1.2           pheatmap_1.0.12              
 [66] png_0.1-8                     viridisLite_0.4.2             rjson_0.2.21                  bitops_1.0-7                  ConsensusClusterPlus_1.62.0  
 [71] Biostrings_2.66.0             blob_1.2.4                    DelayedMatrixStats_1.20.0     shape_1.4.6                   stringr_1.5.0                
 [76] rstatix_0.7.2                 ggsignif_0.6.4                beachmat_2.14.2               scales_1.2.1                  memoise_2.0.1                
 [81] magrittr_2.0.3                plyr_1.8.8                    hexbin_1.28.3                 zlibbioc_1.44.0               compiler_4.2.2               
 [86] plotrix_3.8-2                 clue_0.3-64                   lme4_1.1-34                   cli_3.6.1                     XVector_0.38.0               
 [91] FlowSOM_2.6.0                 MASS_7.3-58.1                 tidyselect_1.2.0              stringi_1.7.12                RProtoBufLib_2.10.0          
 [96] textshaping_0.3.6             yaml_2.3.7                    BiocSingular_1.14.0           locfit_1.5-9.8                ggrepel_0.9.3                
[101] tools_4.2.2                   parallel_4.2.2                circlize_0.4.15               rstudioapi_0.15.0             foreach_1.5.2                
[106] gridExtra_2.3                 farver_2.1.1                  Rtsne_0.16                    digest_0.6.33                 BiocManager_1.30.22          
[111] shiny_1.7.5                   Rcpp_1.0.11                   car_3.1-2                     broom_1.0.5                   scuttle_1.8.4                
[116] BiocVersion_3.16.0            later_1.3.1                   httr_1.4.7                    AnnotationDbi_1.60.2          colorspace_2.1-0             
[121] XML_3.99-0.14                 splines_4.2.2                 scater_1.26.1                 systemfonts_1.0.4             xtable_1.8-4                 
[126] nloptr_2.0.3                  R6_2.5.1                      pillar_1.9.0                  htmltools_0.5.6               mime_0.12                    
[131] nnls_1.4                      glue_1.6.2                    fastmap_1.1.1                 minqa_1.2.5                   BiocParallel_1.32.6          
[136] BiocNeighbors_1.16.0          interactiveDisplayBase_1.36.0 codetools_0.2-18              mvtnorm_1.2-2                 utf8_1.2.3                   
[141] lattice_0.20-45               tibble_3.2.1                  numDeriv_2016.8-1.1           curl_5.0.2                    ggbeeswarm_0.7.2             
[146] colorRamps_2.3.1              gtools_3.9.4                  survival_3.4-0                limma_3.54.2                  rmarkdown_2.24               
[151] munsell_0.5.0                 GetoptLong_1.0.5              GenomeInfoDbData_1.2.9        iterators_1.0.14              reshape2_1.4.4               
[156] gtable_0.3.3

The text was updated successfully, but these errors were encountered:

83years · 2023-08-23T13:51:03Z

I have a work-around from the tidySingleCellExperiment package.

> sce <- filter(sce, condition != "Control")
> table(sce$condition)

       Group1       Control Healthy_Donor        Group2
      1332146             0        600000        305666

david-priest · 2023-09-05T08:00:25Z

I have noticed that filterSCE() may drop some data if there are some empty entries in the initial metadata spreadsheet.

HelenaLC · 2023-11-13T09:14:59Z

Can't reproduce this currently... Could you post your metadata table? I.e., metadata(sce)$experiment_info? As opposed to the tidy solution, which only subsets based on the colData, filterSCE is specifically designed to work within the CATALYST framework. E.g., it will also assure metadata are in synch. For example, the experiment_info will be recomputed after filtering (using internal CATALYST:::.get_ei()), and cluster_codes will also be adjusted. The only thing going wrong here, I can image, is something with the sample_ids, which are needed for the experiment_info to be set up correctly.

david-priest · 2023-11-14T07:31:50Z

Sorry it's been a while since I encountered the issue. But one thing that seems to work for me is to replace ei() in filterSCE() with this modified function, which still relies on sample_ids().

ei2 <- function(sce, meta = c("patient_id"))
  
{
# Make a data frame of sample_ids
ns <- table(sample_id = sample_ids(sce))
df <- as.data.frame(ns)
m <- match(df$sample_id, sce$sample_id)

# Extract metadata
for (i in meta) df[[i]] <- sce[[i]][m]

return(df)
}

HelenaLC · 2023-11-14T08:19:43Z

Yeah, sure, that works in certain cases- the issue is that match will return the 1st matching entry, and there is no guarantee that sample_ids match uniquely to what's in the SCE's colData. This could result in discrepancies that may go unnoticed. E.g., cells from a given sample_id will have different cluster_ids- that's what's "special" with how .get_ei() constructs the metadata: only variables that match uniquely to sample_ids will be retained.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filterSCE() version 1.22.0 #366

filterSCE() version 1.22.0 #366

83years commented Aug 21, 2023

83years commented Aug 23, 2023

david-priest commented Sep 5, 2023

HelenaLC commented Nov 13, 2023

david-priest commented Nov 14, 2023

HelenaLC commented Nov 14, 2023

filterSCE() version 1.22.0 #366

filterSCE() version 1.22.0 #366

Comments

83years commented Aug 21, 2023

83years commented Aug 23, 2023

david-priest commented Sep 5, 2023

HelenaLC commented Nov 13, 2023

david-priest commented Nov 14, 2023

HelenaLC commented Nov 14, 2023