
IGV hg38 fix #63

Open
akiss-me opened this issue Mar 26, 2024 · 11 comments

akiss-me commented Mar 26, 2024

Dear all,

I fixed the IGV hg38 issue. The solution is not very clean, since I did not rewrite the code but simply replaced b37 with hg38 and the corresponding tracks.
I didn't make changes to the working image or fork it, so for now I suggest just mounting the corrected version of the main script (please unzip it first).

This is how to run it:

docker run --rm \
  --name clinsv \
  --mount type=bind,source=/**your_path**/clinsv_hg38,target=/app/clinsv/bin/clinsv \
  -v /your_path/ \
  --entrypoint "perl" mrbradley2/clinsv:v1.1-dev \
  /app/clinsv/bin/clinsv \
  -r all \
  -p "/**your_path**/clinsv_results/" \
  -i "/**your_path**/*.bam" \
  -ref "/**your_path**/refdata-b38/refdata-b38"

enjoy!

clinsv_hg38.zip

akiss-me changed the title from "IGV fix" to "IGV hg38 fix" on Mar 26, 2024

thedam commented Apr 23, 2024

It seems like almost all paths in the .xml are generated wrongly (that includes the change in the folder structure).

Is the ClinSV project abandoned?
The biggest pain for me is that it needs the "chr1, chr2..." naming convention instead of "1, 2, 3...".

akiss-me (Author) commented

> It seems like almost all paths in the .xml are generated wrongly (that includes the change in the folder structure).
>
> Is the ClinSV project abandoned? The biggest pain for me is that it needs the "chr1, chr2..." naming convention instead of "1, 2, 3...".

The folder structure was made for a kind of IGV server that was supposed to be run from the ClinSV docker image (but was cropped from v1.1-dev). I used to solve this problem by running lscr.io/linuxserver/webtop:ubuntu-kde with desktop IGV installed and creating a mount point with paths matching those in ClinSV's XML, and it works.

J-Bradlee (Collaborator) commented

First off, thank you @akiss-me for sharing your fix. I've been working on merging some changes that should cleanly address the IGV issue. As @akiss-me mentioned, the v1.1-dev version can handle those naming conventions for you. As for the XML files: if you are using the docker images, the non-resource paths (such as the ref genome) are file paths as used inside the docker container.

Therefore, if you want to use the XML outside of the container in an IGV session, you will have to change the paths in the XML file to the corresponding paths on your machine. These are the same paths you mounted onto the docker container with the "-v" flags.

In the future I hope to implement this as a script that you can run outside the container.
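In the meantime, the manual rewrite can be sketched as follows. This is only an illustration, not ClinSV's own tooling: the file name igv_session.xml and the two variables are placeholders for the host paths you mounted with -v.

```shell
# Hypothetical sketch: rewrite container paths in the IGV session XML
# to the host paths that were mounted into the container with -v.
refdata_path=/host/path/to/refdata        # whatever you mounted at /app/ref-data
project_folder=/host/path/to/project      # whatever you mounted at /app/project_folder

sed -e "s|/app/ref-data|$refdata_path|g" \
    -e "s|/app/project_folder|$project_folder|g" \
    igv_session.xml > igv_session.host.xml
```

The rewritten igv_session.host.xml can then be opened directly in a desktop IGV session.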


thedam commented Apr 24, 2024

Hi @J-Bradlee, I'm happy that the project is still alive!
I have some concerns; looking only at the data in the .xml file generated for hg38, we can see:

<Session genome="b37" hasGeneTrack="true" hasSequenceTrack="true" locus="1:197493130-197514036" version="8">
Here it should be genome="hg38", as @akiss-me noted.

Then we have for example:

<Resource path="/app/ref-data/refdata-b38/tracks/GRCh37GenomicSuperDup.bed.gz"/>
I believe it should be /refdata-b38/tracks/GRCh38GenomicSuperDup.bed.gz

<Resource path="/app/ref-data/refdata-b38/tracks/DGV_1kG_2015-07-18.igv.bed.gz"/>
Should it be this file: /refdata-b38/tracks/DGV_GRCh38_hg38_variants_2020-02-25.igv.bed.gz?

<Resource path="/app/ref-data/refdata-b38/tracks/GRCh37_hg19_variants_2015-07-23.igv.bed.gz"/>
no such file for hg38?

<Resource path="/app/ref-data/refdata-b38/annotation/Homo_sapiens.GRCh37.75.gff.gz"/>
This should be: /refdata-b38/annotation/Homo_sapiens.GRCh38.99.gff.gz

Are we sure these files are also not mislabelled:

<Resource path="/app/ref-data/refdata-b38/tracks/MGRB-SV.igv.bed.gz"/>
<Resource path="/app/ref-data/refdata-b38/tracks/popCovStdev.bw"/>

The .xml is easy to fix anyway. However, can we be sure that in the code, and during data processing, the right paths/files are used? Are all the CNVs calculated on the right coordinate system? In the end, that is what matters most...

Anyway, you're doing a great job; I'm impressed by how complex the application is.

P.S. Do you plan to handle chromosomes "1, 2, 3, 4, MT, X, Y" in hg38 (in addition to chr1, chr2, chrM...)? That would be super great!

J-Bradlee (Collaborator) commented

Hi @thedam,

Those resource paths are like that because you ran ClinSV in the docker container; inside the container the resource files are at those locations. On your own machine they will be whatever you set these environment variables to:

refdata_path=$PWD/clinsv/refdata-b38
input_path=$PWD
project_folder=$PWD/test_run

Recall that you attached the volumes to the docker container with the "-v" flag:

docker run \
  -v $refdata_path:/app/ref-data \
  -v $project_folder:/app/project_folder \
  -v $input_path:/app/input \
  --entrypoint "perl" mrbradley2/clinsv:v1.0 \
  /app/clinsv/bin/clinsv \
  -r all \
  -p /app/project_folder/ \
  -i "/app/input/*.bam" \
  -ref /app/ref-data/refdata-b38

A quick fix would be to run a sed command on the resulting XML to remove the "/app/ref-data" prefix.
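For instance, a minimal sketch of that quick fix (igv_session.xml is a placeholder name for the generated session file):

```shell
# Strip the container prefix "/app/ref-data" from every path in the
# session XML so the Resource paths resolve under the host's refdata
# mount. sed -i.bak keeps a .bak backup of the original file.
sed -i.bak 's|/app/ref-data||g' igv_session.xml
```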

These resource files are used accurately throughout ClinSV.

I have made a feature to handle the hg38 / hg19 chromosomes, and it is already dockerized for you to use. However, I am currently trying to fix a bug which has broken the "coverage by chromosome" part of the output report. See here and here.


thedam commented Apr 25, 2024

Yes, I've run the job with that exact Docker command.
What I mean is that b37 paths appear in the b38 result .xml file.

Also, the pipeline doesn't work with the GRCh38 chromosome naming convention (1, 2, 3, 4 instead of chr1, chr2, chr3...).

When the input is a .bam file with chromosomes 1, 2, 3..., this error appears:

### executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e  ...

 ### finished after (hh:mm:ss): 00:00:01
 ### exist status: 256



 ***** error exist status != 0 (256), please check /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e for more information

The error appears because somewhere in the ClinSV code there is a samtools command with chr:
samtools view /app/project_folder/alignments/273858DL/273858DL.bam chr1:1000000-1100000
[main_samview] region "chr1:1000000-1100000" specifies an invalid region or unknown reference. Continue anyway

I guess it could be fixed by providing the reference that the .bam is aligned to; it should be named Homo_sapiens_assembly38.fasta (as it is hardcoded in clinsv.py, i.e. in the code inside the Docker image).

All files in refdata-b38 also use the 'chr1, chr2, chr3' naming convention, so should they be changed as well?
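A quick way to check which convention a given BAM uses is to read the sequence names out of its header (this assumes samtools is installed; sample.bam is a placeholder):

```shell
# List the reference sequence names (SN fields) from the BAM header.
# Output starting with "1", "2", ... instead of "chr1", "chr2", ...
# indicates the non-prefixed convention that ClinSV trips over here.
samtools view -H sample.bam \
  | awk '$1 == "@SQ" { sub("SN:", "", $2); print $2 }'
```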

akiss-me (Author) commented

> All files in refdata-b38 also use the 'chr1, chr2, chr3' naming convention, so should they be changed as well?

This is probably because you are using a DIY genome reference. That's okay, but for cross-compatibility, researchers and medical geneticists usually use the GATK genome bundle, whose reference has "chr" at the beginning of the chromosome names and is a de facto standard. You can add chr to the chromosome names in the bam file (https://www.biostars.org/p/13462/), but it could lead to other problems caused by the naming of unassembled contigs.
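For reference, the Biostars approach can be sketched roughly like this (samtools assumed; in.bam is a placeholder). Note that hg38 calls the mitochondrion chrM, and unplaced contigs such as GL000220.1 are left untouched by this rename, which is exactly the compatibility risk mentioned above:

```shell
# Rewrite the SN fields in the header to add a "chr" prefix, then
# reheader the BAM without touching the alignment records themselves.
# MT is renamed first so it becomes chrM (hg38 style), not chrMT.
samtools view -H in.bam \
  | sed 's/SN:MT/SN:chrM/; s/SN:\([0-9XY]\)/SN:chr\1/' \
  > header.chr.sam
samtools reheader header.chr.sam in.bam > out.chr.bam
```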


thedam commented Apr 26, 2024

Both conventions are used (unfortunately). The main service that doesn't use prefixes is VEP :(

drmjc (Member) commented Apr 27, 2024 via email

J-Bradlee (Collaborator) commented

Hi @thedam,

Can I confirm which Docker image you are using and which version of ClinSV you are running?


thedam commented Apr 30, 2024

Hi,

I use: mrbradley2/clinsv v1.1-dev ad470d0fa5d2 19 months ago 7.41GB

I pulled it from Docker Hub:
docker pull mrbradley2/clinsv:v1.1-dev
However, I see it has the same image id as:
docker pull containerregistrypubliccb.azurecr.io/clinsv:v1.1-dev

containerregistrypubliccb.azurecr.io/clinsv       v1.1-dev   ad470d0fa5d2   19 months ago   7.41GB
mrbradley2/clinsv                                 v1.1-dev   ad470d0fa5d2   19 months ago   7.41GB
