"Error: piscem mapping failed with exit status ExitStatus(unix_wait_status(139))" #99
Hi @wmacnair, Thanks so much for the detailed report, and for your kind words regarding simpleaf! So, the error looks to arise from piscem itself; exit status 139 corresponds to a segmentation fault. We'd be happy to take a look, figure out what is going on, and try to fix this. A couple of questions: would it be possible to share one of the failing samples, and does the failure happen reproducibly on the same samples? Thanks!
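For reference, a minimal sketch of how that exit status can be decoded from the shell (only the value 139 comes from the error message above; the rest is generic POSIX convention):
# exit codes above 128 mean the process was killed by a signal (code - 128)
echo $(( 139 - 128 ))   # prints 11
kill -l 11              # prints SEGV, i.e. a segmentation fault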
Hi @rob-p, Developing methods can sometimes be a thankless task, so I try to give some encouragement when I can :) Re your questions:
These samples are not yet published, so sharing is a bit more tricky, but @marusakod has had the same error (again, repeatedly) on BioSample SAMN13262712 in this repository. Hopefully that is enough to start with. I'll also have a look at subsampling my files and see if I can generate a smaller example that reproduces the error. Cheers! Will
Ah, Maruša suggests that sample 10X145-6 here might be easier to work with, as that already has R1 and R2 files labelled: https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X145-6/
Thanks! I'll give this a look and report back when I have some idea what's up. In the meantime, could you also please share the command and files you used to generate the index/reference? Thanks, Rob
I grabbed one of the samples; the directory linked above has several. I will start looking at the others, but if you could point me at a particular problem sample, that would be extra useful! Thanks, Rob
I used the 10x reference genome (refdata-gex-GRCh38-2020-A), available here. These are the commands I used:
#!/bin/bash
#BSUB -J simpleaf_build_index_human
#BSUB -W 24:00
#BSUB -n 24
#BSUB -R "rusage[mem=8192]"
#BSUB -R "span[hosts=1]"
#BSUB -q long
#BSUB -eo /home/macnairw/packages/scProcess/.log/simpleaf_build_index_human.err
#BSUB -oo /home/macnairw/packages/scProcess/.log/simpleaf_build_index_human.out
# how to call:
# bsub < /home/macnairw/packages/scProcess/scripts/simpleaf_build_index_human.sh
# set up environment
ml purge
ml Anaconda3/2021.05
conda activate af
# conda install numpy=1.23
ulimit -n 2048
# simpleaf configuration
export ALEVIN_FRY_HOME="/home/macnairw/packages/scProcess/data/alevin_fry_home"
simpleaf set-paths
# change working directory to somewhere not crazy
cd $ALEVIN_FRY_HOME
# set up this build
PROJ_DIR="/home/macnairw/packages/scprocess/data"
REF_DIR="$PROJ_DIR/reference_genome/refdata-gex-GRCh38-2020-A"
IDX_DIR="$PROJ_DIR/alevin_fry_home/human_2020-A_splici"
# simpleaf index
simpleaf index \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf \
--rlen 91 \
--threads 24 \
--use-piscem
conda deactivate
Ok interesting... What if you try all of the samples there, in one run? I think that's what Maruša did when it failed for her.
I tried the smaller samples in the 10X145-6 directory. --Rob
Hi @rob-p, Will is right; for sample 10X145-6 I ran all of the files together in one run and got the same error. Maruša
Thanks @marusakod, I was unable to reproduce this locally with my native build of piscem. --Rob
So I ran on everything in that directory. One thing worth noting is that I don't seem to have 24 files when I grabbed everything in the directory for that sample. Can you see anything obvious missing from the set of files I pulled down? One other thing worth asking (though I doubt it's an issue) is that these runs where things seem to be failing look to be pretty big (i.e. when using all of the data from the sample). Is there any possibility you're resource constrained in terms of the disk allocated to the output, and that the mapper may be failing because it can't properly write the complete output files? --Rob
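A quick sketch of the kind of checks that would answer the disk question (OUT_DIR is a placeholder for the directory passed to simpleaf --output):
OUT_DIR=/path/to/simpleaf_output   # placeholder
df -h "$OUT_DIR"                   # free space on the filesystem holding the output
df -i "$OUT_DIR"                   # free inodes, which can also cause failed writes
du -sh "$OUT_DIR"                  # how much has been written so far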
Hi @rob-p, sorry, my bad, I was also counting index files. 16 is the correct number of R1 + R2 files for that sample. Maruša
Hi @rob-p, working through your comments and suggestions.
which piscem
# /projects/site/pred/neurogenomics/resources/scprocess_data/conda/af/bin/piscem
piscem --version
# piscem 0.6.0
Thanks @wmacnair and @marusakod, The first order of business on my side is really just to be able to reproduce the error. @marusakod: the samples here are much smaller, but there are many of them; did the failure occur for you when running all of them together? So far, I have not been able to trigger this with either my natively built piscem or the bioconda-installed version. --Rob
I've checked, and I can share a file with you where it fails for me. What do you recommend for sharing massive files? 😅 Will
I often use Box or Google Drive or some such. Whatever works ;P. Once you know where you want to put it, you can e-mail me at [email protected] with the relevant info! Thanks
I'm working on this, but I'm having to configure some things for the first time, so it could be a little while...
@marusakod: I was again not able to reproduce this on the machine I have been using. I'll try again with these samples on another machine. They are easier to work with because they're pretty small. Here was my whole run; let me know if the command looks the same as yours (modulo paths):
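For comparison, a quant invocation along these lines (the paths and file names below are placeholders, not the exact command from this run; the options mirror those posted elsewhere in this thread) would look roughly like:
IDX=/path/to/human_2020-A_splici/index   # placeholder index location
R1=/path/to/sample_R1.fastq.gz           # placeholder read files
R2=/path/to/sample_R2.fastq.gz

simpleaf quant \
  --reads1 $R1 \
  --reads2 $R2 \
  --threads 16 \
  --index $IDX \
  --chemistry 10xv3 --resolution cr-like \
  --expected-ori fw \
  --t2g-map $IDX/t2g_3col.tsv \
  --unfiltered-pl /path/to/3M-february-2018.txt \
  --output ./af_quant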
Ok, on a different machine I have found success (in failing :P). I can reproduce the segfault on this sample. The backtrace is, right now, unfortunately, not super useful. Interestingly, it seems to suggest an illegal instruction! I'm not sure if this is the same cause you are encountering or not.
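As an aside, one way to capture a backtrace like this without running the whole pipeline under a debugger is to enable core dumps and inspect the core afterwards (the core-file path below is a placeholder; where cores land depends on the system's core_pattern):
ulimit -c unlimited                   # allow core dumps in this shell
# re-run the failing simpleaf quant command here and let it crash
cat /proc/sys/kernel/core_pattern     # shows where the core file will be written
gdb -q -batch -ex bt "$(which piscem)" /path/to/core.12345   # print the backtrace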
Great! Nice to be celebrating a failure :D I think we get "SIGSEGV". But at least we're getting something similar on the same sample...
So I get a segfault when I run it normally; the illegal instruction is only via GDB. Interestingly, it is machine dependent, and the machine it fails on is a much older one. I wonder if it may actually be an illegal instruction. Are all of your jobs scheduled on the same node? Can you provide the output of lscpu for the node(s) where your jobs run?
There is a minimal requirement to have bit manipulation instructions, I believe, the way it's compiled. And some other SSE instructions.
machine 1: doesn't work
machine 2: works
machine 3: works
In particular, here is what I get if I take the set of instructions in the intersection of the machines where everything works, minus the instruction set where it doesn't work:
If I look across those, I am fairly certain that we don't enable AVX/AVX2 when we build on conda, but we do enable BMI2. --Rob
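A sketch of how that comparison can be done from the shell, assuming passwordless ssh to each node (the hostnames are placeholders):
# dump each machine's CPU flag set as a sorted word list
for host in node_ok1 node_ok2 node_bad; do
  ssh "$host" "grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 | tr ' ' '\n' | sort -u" > "$host.flags"
done

# flags shared by the two working machines
comm -12 node_ok1.flags node_ok2.flags > works.flags

# flags the working machines have but the failing machine lacks
comm -23 works.flags node_bad.flags

# explicitly check the instruction sets discussed above on the failing node
grep -Ew 'bmi1|bmi2|avx|avx2|sse4_2' node_bad.flags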
Hi @rob-p, this is way too close to the metal for me to contribute anything helpful, so I have asked our internal HPC guys if they can comment. I'll let you know what they feed back :) Will
I've also just shared a gDrive folder with your UoM email, called simpleaf_fastqs. This is an additional sample that didn't work for us. Hopefully you've got this - let me know if not :)
I got the link. I will try this on both machines above and see if the issue is the same. Btw, what's the chemistry on this sample? --Rob
Great :) Chemistry is 10xV3. The call I used had all the same options as in my first comment here:
# simpleaf quantification
simpleaf quant \
--reads1 $R1_fs \
--reads2 $R2_fs \
--threads 16 \
--index $ALEVIN_FRY_HOME/human_2020-A_splici/index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw \
--t2g-map $ALEVIN_FRY_HOME/human_2020-A_splici/index/t2g_3col.tsv \
--unfiltered-pl $CELLRANGER_DIR/3M-february-2018.txt --min-reads 1 \
--output ./Macnair_2022/output/ms01_alevin_fry/af_WM177
Response from an HPC colleague:
Thanks, @wmacnair! By the way, in the spirit of full reproducible transparency, the build I had that worked on the BMI2 node but failed on the non-BMI2 node was installed using bioconda. --Rob
Ok, some more brief updates. Apart from the machine that exhibits the illegal instruction, I've been unable to trigger a segfault. I tried the original dataset you pointed me at, as well as the one @marusakod suggested. I tried on a machine from 2016 (Intel Xeon), one from 2023 (AMD Epyc), as well as 2 MacOS machines (a 2018 Intel Mac and a 2020 Intel Mac). On all of these machines, I was not able to produce the segfault on these data using the piscem builds described above. It would be great if your HPC folks had any extra insight or luck reproducing the issue. Otherwise, the next thing I would ask is if you could provide some extra details about the specific OS distribution and version you are running on. My next step would be to try to more closely reproduce your environment by spinning up a Docker container with that precise OS and then installing these tools via bioconda. --Rob
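A rough sketch of that approach (the OS image here is an assumption for illustration; substitute whatever cat /etc/os-release reports on the cluster, and the tool versions from the environment in use):
# start a container with the (assumed) cluster OS, mounting the data directory
docker run -it --rm -v "$PWD:/work" rockylinux:8 bash

# inside the container: install a minimal conda and the same tool versions
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
/opt/conda/bin/conda create -y -n af -c bioconda -c conda-forge \
    simpleaf=0.14.1 piscem=0.6.0 alevin-fry=0.8.2
/opt/conda/bin/conda run -n af piscem --version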
Thanks for this. Slightly frustrating that we have the error and you don't, but I guess that's reassuring for simpleaf. I've passed on your most recent comment to Erica, and I'll let you know what I hear back. Will
Erica also doesn't find any errors... We are wondering whether it could be a problem with the index (e.g. mine being incomplete, or somehow tied to the machine it was built on). Does it sound like this could be an explanation / part of an explanation? Will
Hi @wmacnair, Thanks for the updates. I also agree that it's frustrating, since it seems that you and @marusakod are able to reproduce the issue deterministically on some samples, so it doesn't seem to be e.g. a stochastic thread-related issue. Though the index should be broadly compatible between machines (I built one index on the old Intel machine and used it on all of those I listed above), it is possible that the index could be at fault. Are you sure the index completed successfully? Also, if there are different data layouts between the systems, that could be problematic, perhaps. What if Erica tries using your index? If you want to put it on Google Drive, I could also try running with your index. Best, Rob
We're currently trying out different combinations - I ran with her conda env and my index. I'm next going to try my conda env and her index. And I thiiiink the index completed successfully, but I'm not completely sure. If this second test works ok, then it looks like it is the index that is at fault.
Hi @wmacnair, please do keep me updated here! I'd be really interested to know if it was related to something with the index either being (a) incomplete or (b) built on a different machine. As I noted, the index should be transferrable between most machines (barring things like big-endian to little-endian transfer). If you suspect (a), then I can take a look at the actual index and see if there is any evidence of that from the files themselves or the log. In the meantime, feel free to share the index via Google Drive as before. Best, Rob
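A sketch of simple checks along those lines (the paths are placeholders based on the index location used earlier in the thread, and the .sums file names are made up for illustration):
IDX_DIR=$ALEVIN_FRY_HOME/human_2020-A_splici   # placeholder

# a complete build should leave no zero-byte or obviously truncated files
ls -lhR "$IDX_DIR"
find "$IDX_DIR" -type f -size 0

# to check whether two indices built from the same reference actually match
( cd "$IDX_DIR/index" && sha256sum * | sort -k2 ) > will_index.sums
# run the same on the second index directory, then:
diff will_index.sums erica_index.sums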
Hi @rob-p, ok that's good to know. Frustratingly for resolving this issue, right now I have other things that I need to work on, and Monday and Tuesday I'm on vacation. But I have put this on my to-do list, and hopefully I'll be able to get to it by the end of next week :) Cheers
Hi @wmacnair, I hope you had a nice vacation. I just wanted to freshen this up on your list so that I know whether I should be investigating further or closing this issue. Just let me know when you have a chance to get around to it. Best, Rob
Hi @rob-p I'm currently having a look through where I had previously got to. There have been multiple different indices / conda envs, and I worry that I have got confused somewhere 😅 So feel free to make concrete suggestions on tests that I can do. When I last looked, some tests we looked at were:
This suggested that the problem was with my index. However, today I rebuilt the index and ran using this, and got the same failure. So could it be that indices created with my conda env are faulty?? These are the relevant lines from the conda environment file:
channels:
- bioconda
- conda-forge
- main
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- alevin-fry=0.8.2=h4ac6f70_0
- boost-cpp=1.78.0=h5adbc97_2
- bzip2=1.0.8=h7f98852_4
- icu=70.1=h27087fc_0
- libgcc-ng=13.1.0=he5830b7_0
- libgomp=13.1.0=he5830b7_0
- libhwloc=2.9.1=hd6dc26d_0
- libiconv=1.17=h166bdaf_0
- libjemalloc=5.3.0=hcb278e6_0
- libstdcxx-ng=13.1.0=hfd8a6a1_0
- libxml2=2.10.3=hca2bb57_4
- libzlib=1.2.13=hd590300_5
- piscem=0.6.0=h09b9a2f_2
- salmon=1.10.2=hecfa306_0
- simpleaf=0.14.1=h4ac6f70_0
- tbb=2021.9.0=hf52228f_0
- xz=5.2.6=h166bdaf_0
- zstd=1.5.2=hfc55251_7
Does anything there look strange? I'll keep looking, so please keep open for now. Hopefully we can resolve by mid next week 💪 Will
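One straightforward way to compare the two conda environments package-by-package ("af" and "af_erica" are placeholder environment names):
conda list -n af --export | sort > will_env.txt
conda list -n af_erica --export | sort > erica_env.txt
diff will_env.txt erica_env.txt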
Hi @wmacnair, I can check the details of my conda environments as well and report back here. However, in the meantime, would it be possible for you to share your actual index? Inspecting that may actually help to point at what could be going wrong at the most fine-grained technical level. Here is what I have in my micromamba environment for the relevant packages:
Thanks!
Thanks for this @rob-p. I've compared my and your packages, and there are a few discrepancies:
Do any of these stand out? I should have shared a file called human_2020-A_splici_buggy.zip with you via gDrive - let me know if that hasn't worked. And let me know if you have any other suggestions! Cheers
I got the file link. I'll take a look at that and report back here! Thanks, Rob
Hi @wmacnair, So we have some real progress now! I can reproduce the segfault on a machine where I previously could not, when using your index. I haven't debugged further yet, but this can be investigated now. One thing I noticed when looking at your index folder (not the cause of the segfault, I believe) is that there seem to be files related to both a piscem index and a salmon index! Was there some attempt to build both of the indices in the same output folder? In terms of the indices themselves, I'll note that your index files differ in size from the ones I get when building the same reference.
This suggests to me that, for some reason, either the index construction was bugged, or the full index did not write to disk properly. It seems that the contig table, in particular, differs in size. Edit: It's not clear the size differences in the files are meaningful on their own, though. Edit again: Ok, so now I'm actually curious ;). Since you are using the latest simpleaf, I'd like to understand exactly how this index was built. Thanks, Rob
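A sketch of the kind of comparison being described, with placeholder directory names for the shared index and a freshly built one:
# per-file sizes (in bytes) for the two index directories
du -b human_2020-A_splici_buggy/index/*
du -b human_2020-A_splici_rebuilt/index/*

# checksums catch content differences even when the sizes happen to match
sha256sum human_2020-A_splici_buggy/index/* human_2020-A_splici_rebuilt/index/*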
Hi Rob, Well at least we've found a culprit! I just shared another index with you, that I freshly built last week. I think we still get errors with this - would you mind checking that too? Thanks
Hi Will, Thanks! I got the new file. At first look, it's certainly cleaner (e.g. there's no evidence of a partial salmon index, just the piscem index). However, it still seems the reference was created the same way as before. Update: @wmacnair: Regarding this specific index, I verified that I also get the segfault using it as well. I also verified that the contig table (the ctab file) looks as expected. Best, Rob
Hi @rob-p Pretty sure this is a PB in terms of github threads for me now 😄 I have tried a fresh conda environment. Now the previous files that used to give segfaults do not, but in running it on other datasets we still find some that do. I have shared a folder with you via gDrive containing one of these failing datasets and details of how I ran it.
(please drop me a line if any of this is unclear) I'm very curious to know whether you also get a segfault here... Best, Will
Thanks @wmacnair! I appreciate all of the help! I got the files you shared. I'll let you know if I get the segfault and if I can also reproduce it with my own index or not. Thanks, Rob
Ok, this new index looks almost identical to the one I built, so I suspect I will see the problems with my index as well. I am moving the files to the server to check and will let you know soon. Edit: Indeed! Success in failure. Your latest reads fail with my index as well. Now we have a fully reproducible example!
Hi @wmacnair, Ok. Good news. I was able to track this down and find the offending read and logic. It resulted from (as expected) an incredibly rare occurrence where it was possible to mistakenly try to extend a failed k-mer search to find the next k-mer. In this particular case that search succeeded, but because the predecessor failed, some critical information that the index needed was missing. Ultimately, this resulted in subtracting 1 from our sentinel value (the marker for an invalid position), producing a bogus offset downstream. The fix is small; I'll let you know when a patched piscem is available. Best, Rob
That sounds great! I will also be unavailable next week, but I think @marusakod will be interested in any bug-fixed version. A comment here - in our experience, I would say that this was rare, but not incredibly rare. Across data from multiple studies, failure rates for us ranged between 2 out of 150, and 5+ out of 34 samples. (As I'm unfamiliar with k-mers, I didn't really get the reason for the failure, so can't really speculate on why this might be.) Thanks so much for all your efforts to look into this :)
Hi @wmacnair, Thanks for the heads up. I just meant incredibly rare in the space of k-mers. If we assume that each dataset comprises billions (or 10s of billions) of k-mers, this corner case occurs very infrequently. Of course, the point is that it was a small but important missing case in the logic of how streaming query was handled, so it was great that you found it and a dataset to reliably reproduce it. I'll ping back here and also tag @marusakod when the new piscem is up on bioconda. Best, Rob
Hi @marusakod, The bioconda process is en route, but it takes a while. In the meantime, if you would like to check with a pre-compiled binary of the patched piscem, just let me know and I can point you to one. Best, Rob
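If going the pre-compiled route, a minimal sketch of how to put such a binary ahead of the conda-installed one (the download location below is a placeholder):
# assuming the pre-compiled binary was saved to ~/bin/piscem (placeholder path)
chmod +x ~/bin/piscem
export PATH="$HOME/bin:$PATH"        # shadow the conda-installed piscem for this shell
which piscem && piscem --version     # should now report the patched build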
Hi @wmacnair and @marusakod, The bioconda process has now completed as well, and you should be able to pull piscem v0.6.1 from bioconda. Please let me know at your earliest convenience if this addresses all of the problem cases on your end. Also, thank you again for your continued interaction and help in reproducing and tracking down this bug. It's users like yourselves that make it possible for us to work at building high-quality and robust software, and without your patience and support, that wouldn't be possible. Edit: I checked with the bioconda version of piscem and confirmed it includes the fix. Best, Rob
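For anyone updating an existing environment, the upgrade amounts to something like the following (the environment name "af" is the one used earlier in this thread):
conda activate af
conda install -c bioconda -c conda-forge piscem=0.6.1
piscem --version                     # should now report piscem 0.6.1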
Hi @rob-p, I pulled the new piscem from bioconda and the samples that previously failed now run without errors. Thank you so much for your help! If we encounter any errors in the future, we'll be happy to reach out. Best, Maruša
Hi @marusakod, Thank you so much for confirming. That's great news. Thank you also for the detailed bug reports and data sharing. It really helped to track down (and ultimately fix) a bug that would have been so much trickier otherwise. Of course, please do reach out in the future if you encounter anything else unexpected or have any questions. Thanks, Rob
Hi @rob-p I've done a clean install of the environment with the new piscem, and everything now runs without errors on our side. We have both very much appreciated your help here - thanks again! Will
Hi simpleaf team,
First off, a thank you: we've been using simpleaf to do lots of mapping for a single cell atlas project, and it has been making everything faster and smoother 🚀 (especially being able to get S/U/A counts super fast). This is awesome 😄
However, there are a small proportion of samples where it doesn't work, and we can't really see any reason why. I've pasted some example output below.
This doesn't appear to be driven by e.g. tiny file sizes due to RNAseq not working:
Right now it's difficult to diagnose what is going on... Most samples in this experiment work, a handful don't, and I don't really see anything I can work with from the error message. Any suggestions??
Thanks!
Will
(Tagging @marusakod as she has had similar problems.)