Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running the same sample multiple times results in different PASS/FAIL result #39

Open
kmavrommatis opened this issue Dec 5, 2024 · 6 comments
Assignees
Labels
help wanted Extra attention is needed

Comments

@kmavrommatis
Copy link

Hi,
I have a very fragmented tumor sample which has been segmented by various methods (FACETS, SCLUST, Battenberg) and i am using CNAqc to check its quality and potentially identify the best approach to analyze this genome.
My thoughts are that given the current segmentation I could use CNAqc to identify the segments that are more likely to be correct vs more likely to be wrong. Furthermore, I am hoping that if a sample fails I can get rerun it with different arguments through the same algorithms e.g. provide a different ploidy estimate etc.

I have noticed though that if I run the same cnaqc object through the analyze_peaks function multiple times I get different results.
For example

Running:

xlist=lapply( 1:10, function(y){analyze_peaks(x, n_bootstrap=10)}) %>% magrittr::set_names( paste0( "rep_",1:10))

I get

sapply(xlist, function(x){
    x$peaks_analysis$QC
})
  rep_1 rep_2 rep_3 rep_4 rep_5 rep_6 rep_7 rep_8 rep_9 rep_10
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> 
FAIL  PASS  FAIL  FAIL  PASS  FAIL  PASS  PASS  FAIL  PASS 

While

xlist2=lapply( 1:10, function(y){analyze_peaks(x)}) %>% magrittr::set_names( paste0( "rep_",1:10))

I get

sapply(xlist2, function(x){
    x$peaks_analysis$QC
})
  rep_1 rep_2 rep_3 rep_4 rep_5 rep_6 rep_7 rep_8 rep_9 rep_10
  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> 
PASS  PASS  PASS  PASS  PASS  FAIL  PASS  PASS  PASS  PASS

These samples fail because their purity correction is high (> 8%). However, in the instances where the samples PASS the purity correction is <2%.
Any advice on this?
Thanks
K

@caravagn caravagn added the help wanted Extra attention is needed label Dec 17, 2024
@caravagn
Copy link
Collaborator

Hi, can you post the basic plots of the genome? segmentation, statistics for the QCed segments, VAFs, peak analysis etc.

It's hard otherwise give you a helpful answer.

G

@caravagn
Copy link
Collaborator

@vvvirgy please provide some support if possible.

@kmavrommatis
Copy link
Author

cnas.rds.gz
mutations.rds.gz

Hi, thanks for your response.
I am not sure what information would be useful so I attach two files, one with mutations and the other with the CN segmentation and here is the code I run:

reps=10
seed=12323

 x=CNAqc::init(
    mutations=mut,  # mutations predicted using GATK Mutect2 on tumor/matched normal 
    cna=cnvs,
    purity=0.76,   #purity is estimated by SCLUST
    sample='test',
    ref='hg38'
  )
set.seed(seed)
  xlist2=rep(list(x) , reps )

xlist2=mclapply( xlist2, function(y){
      CNAqc::analyze_peaks(y)
    }, mc.cores=10)
xlist2=xlist2  %>% magrittr::set_names( paste0( "rep_",1:reps))


qc = sapply(xlist2, function(x){x$peaks_analysis$QC})
qc

returns

 rep_1  rep_2  rep_3  rep_4  rep_5  rep_6  rep_7  rep_8  rep_9 rep_10 
"FAIL" "PASS" "PASS" "FAIL" "PASS" "PASS" "FAIL" "PASS" "PASS" "FAIL" 
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.7.2
CNAqc_1.0.0

@graceooh
Copy link

graceooh commented Jan 9, 2025

Hi, thank you for this great tool. I would also like to back this up and say I had similar experiences where the same input data gave different PASS and FAIL statuses after peaks analysis.

@caravagn
Copy link
Collaborator

require(dplyr)

reps=10
seed=12323

mut = readRDS("~/Downloads/mutations.rds")
cnvs = readRDS("~/Downloads/cnas.rds")

x=CNAqc::init(
  mutations=mut,  # mutations predicted using GATK Mutect2 on tumor/matched normal 
  cna=cnvs,
  purity=0.76,   #purity is estimated by SCLUST
  sample='test',
  ref='hg38'
)

require(CNAqc)

plot_segments(x)

# Check for arm overfragmentation
x = CNAqc::detect_arm_overfragmentation(x)
plot_arm_fragmentation(x)

# Inspect by chr
inspect_segment(x, n = 10)

plot_data_histogram(x, which = 'VAF')

# Your code now
set.seed(seed)

xlist2=rep(list(x) , reps )
xlist2=parallel::mclapply( xlist2, function(y){
  CNAqc::analyze_peaks(y)
}, mc.cores=10)

xlist2=xlist2  %>% magrittr::set_names( paste0( "rep_",1:reps))


qc = sapply(xlist2, function(x){x$peaks_analysis$QC})
qc

# Now,look at one PASS/ FAIL, you realise that peaks are WEAK in your 
# data for both 1:0 and 2:1 
xlist2[[2]] %>% plot_peaks_analysis()
xlist2[[1]] %>% plot_peaks_analysis()

# now, the randomness is due to the Bmix analysis (mixture model) inside
# CNAqc. That algorithm samples initial conditions randomly and is obvisouly
# sensitive to the fact that you barely have peaks in 1:0/2:1.

# nonetheless, the mclapply you use is not required. You should use the
# bootstrap function by CNAqc that is done exactly to do that and compute
# an aggregated score. This is also more generous to work with your data 
# that has these complex peaks
x = CNAqc::analyze_peaks(x, n_bootstrap = 10)
plot_peaks_analysis(x)

# If you wanto to double check I would suggest
x_chr = x %>% 
  split_by_chromosome() %>% 
  lapply(function(w) {analyze_peaks(w)})

Now, this sample is difficult because, for a VAF-peak analysis it basically has NO signal in 1:0 and 2:1 states. I am inclined to think that:

  • your purity estimate is OK
  • your diploid calls are OK
  • your segmentation is OK
  • I would not trust too much 1:0 and 2:1 states because they miss the peak expected at high frequency

The last point: those 1:0 and 2:1 could be true, of course. If true, they would suggest that those CNAs have happened very late in the evolution, and that the sample MRCA collapses (in time) towards that instant. So they would be recent events.

Honestly, if these events are important to you, you need to digthe BAF and DR and double check with those too.

That's what we do usually.

@kmavrommatis
Copy link
Author

Thank you,
that is very helpful.

Implementing your comments generated more questions :) that I will ask in a subsequent issue.
Thanks again for your help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants