variantconvert 2.0.0 (#32)

* Structural changes for PyPI
* Replace auto config creation with 'variantconvert init' and implement shortened config paths
* Fully implement shortened local paths
* Remove unneeded build libraries
* AnnotSV: deal with empty Samples_ID column cells
* AnnotSV: add default absent genotype if bed-based
* AnnotSV: Overhaul split annots management. Make it optional and change it from monolithic annots to one split annot per line
* Exclude __init__.py
* WIP - annotSV format and info fields
* WIP - annotSV format and info fields
* AnnotSV: numerous INFO field fixes
* Propagate config changes
* Fix 'invalid identifiers' (columns with dashes) missing from df.itertuples() output
* Move helper_functions.py
* Remove POS/END/SVLEN changes on split annot, let users use tx data themselves
* Fix bugs, reimplement keep_info and flags, remove two legacy funcs
* AnnotSV: Change INFO types in config now that it is possible
* Fix #28 and put all REF in caps lock
* Fix rounding bug, update AnnotSV tests
* black
* Remove the need for --inputFormat and --outputFormat args by including them in config
* Update readme regarding config
* Apply markdownlint
* Cleanup
* Major changes incl. AnnotSV updates
* Propagate config changes
* Fix imports
* AnnotSV: use pipes in preexisting lists
* Update README
SamuelNicaise · Apr 18, 2024 · 7942ddf
1 parent b7ff5ab
Showing 105 changed files with 5,449 additions and 2,168 deletions.
diff --git a/Makefile b/Makefile
@@ -0,0 +1,23 @@
+VERSION=$$(grep '__version__ =' src/variantconvert/__init__.py | cut -d '"' -f2)
+
+# use run like this:
+# $(make run) -args -for --variantconvert
+# otherwise makefile thinks args are for itself
+run:
+	@echo python src/variantconvert/__main__.py
+
+#other routines don't need args and can be used normally
+build:
+	python -m build
+
+pypi:
+	python -m twine upload --repository testpypi dist/variantconvert-$(VERSION)*;
+
+install:
+	python -m build
+	pip install dist/variantconvert-$(VERSION).tar.gz
+	variantconvert init
+
+#a current issue I'm trying to fix
+debug:
+	python src/variantconvert/__main__.py convert -i tests/data/DECON.results_all.AnnotSV.tsv -o decon_annotsv_test.vcf -c src/variantconvert/configs/hg19/annotsv3_from_vcf.json -v debug
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
-<img src="./images/variantconvert_large.png" alt="variantconvert logo">
+<img src="https://raw.githubusercontent.com/SamuelNicaise/variantconvert/master/images/variantconvert_large.png" alt="variantconvert logo"/>
 
-The variantconvert module is an extendable command-line tool for converting between different file formats used to store genetic variant data. Currently, the following conversions are supported : 
+The variantconvert module is an extendable command-line tool for converting between different file formats used to store genetic variant data. Currently, the following conversions are supported :
 
 - [AnnotSV](https://lbgi.fr/AnnotSV/) > VCF
 - [STAR-Fusion](https://github.com/STAR-Fusion/STAR-Fusion) > VCF
@@ -17,54 +17,106 @@ The project is still being developed and maintained. If you encounter any bug or
 
 1) Setup an environment with Python >= 3.8. You can use the provided Conda .yml or the Python in the Dockerfile
 2) Do the following commands:
-```
+
+```bash
 git clone https://github.com/SamuelNicaise/variantconvert.git
 cd variantconvert
 pip install -e .
 ```
-3) Change the GENOME["path"] variable in configs/*.json to fit your local system. 
+
+3) Create a config folder
+
+```bash
+# Create a config folder in the default location for your OS.  
+variantconvert init
+
+# To see where the folder is: 
+variantconvert config --show_config_dir
+# All variantconvert commands automatically find this folder without writing the full path (see example in step 4). 
+
+# If you wish, you can move config files to another location and use it instead. 
 ```
+
+4) Change the GENOME["path"] variable in configs/*.json to fit your local system.
+
+```bash
 # Do this for each genome you need to use
-variantconvert config -c configs/GRCh37/* --set GENOME.path=/path/to/your/local/GRCh37.fa --fill_genome_header
-variantconvert config -c configs/GRCh38/* --set GENOME.path=/path/to/your/local/GRCh38.fa --fill_genome_header
-variantconvert config -c configs/hg19/* --set GENOME.path=/path/to/your/local/hg19.fa --fill_genome_header
-variantconvert config -c configs/hs1/* --set GENOME.path=/path/to/your/local/hs1.fa --fill_genome_header
+# GRCh37
+variantconvert config -c GRCh37/* --set GENOME.path=/path/to/your/local/GRCh37.fa --fill_genome_header
+# GRCh38
+variantconvert config -c GRCh38/* --set GENOME.path=/path/to/your/local/GRCh38.fa --fill_genome_header
+# hg19
+variantconvert config -c hg19/* --set GENOME.path=/path/to/your/local/hg19.fa --fill_genome_header
+# hs1 (T2T)
+variantconvert config -c hs1/* --set GENOME.path=/path/to/your/local/hs1.fa --fill_genome_header
 # You're ready to use variantconvert !
+```
 
+```bash
 # You can create your own configs, for example to use other genomes
-# For example, let's create a folder for mm10 (Mus musculus)
-cp -r config/hg19 config/mm10
-variantconvert config -c configs/mm10/* --set GENOME.assembly=mm10 GENOME.path=/path/to/mm10.fa --fill_genome_header
+# Let's create a folder for mm10 (Mus musculus)
+cp -r path/to/config/hg19 path/to/config/mm10
+variantconvert config -c path/to/config/mm10/* --set GENOME.assembly=mm10 GENOME.path=/path/to/mm10.fa --fill_genome_header
 ```
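
To make the `--set GENOME.path=...` syntax concrete, here is a minimal sketch of how a dotted key could be applied to a JSON config loaded as a dictionary. The function name `set_dotted_key` and the sample config content are hypothetical and not taken from variantconvert's code. The `--fill_genome_header` behaviour is not reproduced here.

```python
import json


def set_dotted_key(config: dict, assignment: str) -> dict:
    """Apply an assignment such as 'GENOME.path=/data/hg19.fa' to a nested dict."""
    dotted_key, value = assignment.split("=", 1)
    *parents, last = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate sections if they do not exist yet.
        node = node.setdefault(part, {})
    node[last] = value
    return config


# Hypothetical config content, for illustration only.
config = {"GENOME": {"assembly": "hg19", "path": ""}}
set_dotted_key(config, "GENOME.path=/path/to/your/local/hg19.fa")
print(json.dumps(config, indent=2))
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.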
 
-Indeed, some converters require a reference genome in fasta format. This is to fill in the VCF "REF" column in cases where we only have the position without the reference base. This implies when converting to a VCF file, you should always use the genome on which the variant was called. 
+Some converters require a reference genome in fasta format. It is used to fill in the VCF "REF" column in cases where we only have the position without the reference base. This means that when converting to a VCF file, you should always use the genome on which the variants were called.
 
 You can create your own config files to customize not only the genome, but also output columns (see the Developers section).
 
-If you configured the hg19 genome, you can test that variantconvert is properly installed with the following commands:
-```
+If you installed from source and configured the hg19 genome, you can test that variantconvert is properly installed with the following commands:
+
+```bash
 cd <this_repository>
 pip install -e .[dev]
 pytest
 ```
 
-
 # Usage
-```
+
+```bash
 variantconvert --help 
 ```
+
 Or if you did not use the `pip install` command above:
-```
+
+```bash
 python variantconvert/__main__.py --help
 ```
 
 Example of a common use case: convert a STAR-Fusion output file to a VCF.
+
+```bash
+variantconvert convert -i star-fusions.tsv -o output.vcf -c hg19/starfusion.json
 ```
-variantconvert convert -i star-fusions.tsv -o output.vcf -fi breakpoints -fo vcf -c configs/hg19/starfusion.json
-```
 
+<center>
+
+| Conversion  | Default config |
+|---|---|
+| STAR-Fusion > VCF | starfusion.json  |
+| Arriba > VCF | arriba.json  |
+| DECoN > VCF | decon.json  |
+| BED/CANOES > VCF | canoes_bed.json  |
+| BEDPE > VCF | bedpe.json  |
+| Illumina microarray > VCF | snp.json  |
+|  VaRank > VCF |  varank.json |
+|  AnnotSV from BED > VCF | annotsv3_from_bed.json  |
+| AnnotSV from VCF > VCF | annotsv3_from_vcf.json  |
+
+</center>
+
+<details>
+  <summary>Usage for versions < 2.0.0</summary>
+
+In older versions, input and output format also had to be specified in command line args. Today this is included in config files.
+
+Example of a common use case: convert a STAR-Fusion output file to a VCF.
+
+```txt
+variantconvert convert -i star-fusions.tsv -o output.vcf -fi breakpoints -fo vcf -c hg19/starfusion.json
+```
 
-List of all argument combinations for all the conversions currently implemented:
+<center>
 
 | Conversion  | -fi  (input format) | -fo (output format) | Default config  |
 |---|---|---|---|
@@ -78,86 +130,117 @@ List of all argument combinations for all the conversions currently implemented:
 |  AnnotSV from BED > VCF | annotsv  | vcf  | annotsv3_from_bed.json  |
 | AnnotSV from VCF > VCF  | annotsv  | vcf  | annotsv3_from_vcf.json  |
 
-___
+</center>
+
+</details>
+
+<br/>
+
 # Documentation for AnnotSV users
 
-<details> 
+<details>
   <summary>Click to read documentation</summary>
 
 ### Creation of a VCF output file format with AnnotSV
-To convert the output format from tsv to VCF, AnnotSV relies on the variantconvert tool. 
+
+To convert the output format from tsv to VCF, AnnotSV relies on the variantconvert tool.
 
 The variantconvert module distributed with AnnotSV can be used by setting the `-vcf` option to 1 in the AnnotSV command line.
 
-### Requirements in the AnnotSV command line:
+### Requirements in the AnnotSV command line
+
 Different AnnotSV options are required to access to a VCF output:
--	From a "BED" or a "VCF" SV input file:
-	- The user needs to define the `-SVinputInfo` option to 1 (to report in the tsv output file the 'ID', 'QUAL', 'FILTER'... fields).
--	From a "BED" SV input file:
-	- The user needs to define the `-svtBEDcol` option (to report the SV type)
-	- The `-samplesidBEDcol` option is highly recommended to use (else, the sample colum will be named "NA" (Non Attributed))  
+
+- From a "BED" or a "VCF" SV input file:
+  - The user needs to define the `-SVinputInfo` option to 1 (to report in the tsv output file the 'ID', 'QUAL', 'FILTER'... fields).
+- From a "BED" SV input file:
+  - The user needs to define the `-svtBEDcol` option (to report the SV type)
+  - Using the `-samplesidBEDcol` option is highly recommended (otherwise, the sample column will be named "NA" (Non Attributed))
 
 ### Method
+
 Each SV from an AnnotSV tsv file is represented with 2 types of lines:
-- An annotation on the “full” length of the SV. Every SV are reported, even those not covering a gene. 
-- An annotation of the SV “split” by gene. This type of annotation gives an opportunity to focus on each gene overlapped by the SV. Thus, when a SV spans over several genes, the output will contain as many annotations lines as genes covered.
 
-In the converted VCF, each SV is represented with only 1 line. All the annotations (full & split) are reported in the INFO field.
-For one SV, all values from a same tsv output column are merged with a "|".
+- An annotation on the "full" length of the SV. Every SV is reported, even those not covering a gene.
+- An annotation of the SV "split" by gene. This type of annotation gives an opportunity to focus on each gene overlapped by the SV. Thus, when an SV spans several genes, the output will contain as many annotation lines as genes covered.
 
-Example of a duplication overlapping 1 gene (1 full line + 1 split line in the tsv). The tsv output columns are represented in the INFO field in this way: 
-```
-AnnotSV_ID=21_35722427_35906593_DUP_1|21_35722427_35906593_DUP_1;SV_chrom=21|21;SV_start=35722427|35722427;SV_end=35906593|35906593;SV_lengt
-h=184166|184166;SV_type=DUP|DUP;Annotation_mode=full&split;CytoBand=q22.12|q22.12;Gene_name=PPP1R2P2|PPP1R2P2;...
+#### Example of a duplication overlapping 2 genes (1 full line + 2 split lines in the tsv)
+
+|AnnotSV_ID|SV_chrom|SV_start|SV_end|SV_length|Variant_type|Annotation_mode|Gene_name|DDD_HI_percent|
+|---|---|---|---|---|---|---|---|---|
+|10_46976157_47590995_1|10|46976157|47590995|614838|DUP|full|AGAP9;ANTXRLP1|91.07|
+|10_46976157_47590995_1|10|46976157|47590995|614838|DUP|split|AGAP9|88.1|
+|10_46976157_47590995_1|10|46976157|47590995|614838|DUP|split|ANTXRLP1| |
+
+In the converted VCF, each SV is represented with only 1 line by default (mode: "combined" in JSON config). All the annotations (full & split) are reported in the INFO field.
+For one SV, all values from the same tsv output column are merged into a list separated by ",". Consequently, any "," inside an annotation is replaced with "|". If all values (full and all split lines) are identical, they are merged into one.
+
+The tsv output columns are represented in the VCF in this way:
+```txt
+#mode=combined
+#CHROM	POS	REF	ALT	INFO
+chr10	46976157	G	<DUP>	AnnotSV_ID=10_46976157_47590995_1;SV_start=46976157;END=47590995;SVLEN=614838;Annotation=full,split,split;Gene_name=AGAP9|ANTXRLP1,AGAP9,ANTXRLP1;DDD_HI_percent=91.07,88.1,.
 ```
+
 Warning: the AnnotSV > VCF converter uses VCF 4.2 specification, so spaces are replaced with an "_" in the output VCF.
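
The merging rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the converter's actual code: `merge_column` is a hypothetical name, and two behaviours are inferred from the example record rather than from the text (";" is also turned into "|", and an empty cell becomes ".").

```python
def merge_column(values):
    """Merge the values of one tsv column (full line first, then split lines)."""
    cleaned = []
    for value in values:
        value = value.strip()
        if value == "":
            # Assumption: empty cells are written as the VCF missing value.
            cleaned.append(".")
            continue
        # "," is the list separator, so it cannot stay inside an annotation.
        # Assumption: ";" is replaced too, as it separates INFO fields.
        value = value.replace(",", "|").replace(";", "|")
        # Spaces are replaced with "_" in the output VCF.
        cleaned.append(value.replace(" ", "_"))
    if len(set(cleaned)) == 1:
        # Identical values on every line are merged as one.
        return cleaned[0]
    return ",".join(cleaned)


print(merge_column(["AGAP9;ANTXRLP1", "AGAP9", "ANTXRLP1"]))
print(merge_column(["91.07", "88.1", ""]))
print(merge_column(["614838", "614838", "614838"]))
```

With the three tsv lines from the example table, the calls reproduce the `Gene_name`, `DDD_HI_percent` and `SVLEN` values shown in the combined record.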
 
+#### Alternative modes
+
+If lists are inconvenient for your downstream analysis, other conversion modes are available. Instead of combining all full and split annotations, each annotation can be written on its own line (mode: full&split), or only the "full" annotation can be kept (mode: full). The conversion mode can be changed in the JSON config.
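
As a rough sketch of how the three modes differ, the function below groups the tsv lines of a single SV into the VCF records each mode would produce. The name `group_lines` and the dictionary layout are hypothetical. Only the mode names come from the text above.

```python
def group_lines(lines, mode):
    """Return a list of VCF records, each one a list of the tsv lines it is built from."""
    full = [line for line in lines if line["Annotation_mode"] == "full"]
    split = [line for line in lines if line["Annotation_mode"] == "split"]
    if mode == "combined":
        # One record per SV, built from the full line and all split lines.
        return [full + split]
    if mode == "full&split":
        # One record per tsv line.
        return [[line] for line in full + split]
    if mode == "full":
        # Split lines are dropped.
        return [[line] for line in full]
    raise ValueError(f"Unknown mode: {mode}")


sv_lines = [
    {"Annotation_mode": "full", "Gene_name": "AGAP9;ANTXRLP1"},
    {"Annotation_mode": "split", "Gene_name": "AGAP9"},
    {"Annotation_mode": "split", "Gene_name": "ANTXRLP1"},
]
for mode in ("combined", "full&split", "full"):
    print(mode, len(group_lines(sv_lines, mode)))
```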
+
 ### GT warning
-If the GT is not given in input, the GT is set to "1/." (using the variantconvert distributed by AnnotSV) or "0/1" (using the github variantconvert) for each SV in the VCF output file. Indeed, the considered SV has been called on at least one allele, but we don’t know the status of the second allele. In any case, the user can change this default value in the variantconvert config files. 
+
+If the GT is not given in input, the GT is set to "1/." (using the variantconvert distributed by AnnotSV) or "0/1" (using the github variantconvert) for each SV in the VCF output file. Indeed, the considered SV has been called on at least one allele, but we don’t know the status of the second allele. In any case, the user can change this default value in the variantconvert config files.
 </details>
 
+<br/>
+
 # Documentation for developers
 
-<details> 
+<details>
   <summary>Click to read documentation</summary>
 
 ## Adding new conversion formats
 
-An intended goal of the project is to make it easy to add new formats to the conversion possibilities. 
+An intended goal of the project is to make it easy to add new formats to the conversion possibilities.
+
+Each conversion is described by a JSON config file with the following sections:
 
-Each conversion is described by a JSON config file with the following sections: 
 - [GENERAL]
-	- skip_rows: how many rows to skip before column indexes
-	- unique_variant_id: A list of columns that are needed to uniquely identify a variant. Important for input files where a same variant can be on multiple lines. 
+  - `input_format` and `output_format`: Determine which converter module will be returned by ConverterFactory
+  - `skip_rows`: how many rows to skip before column indexes
+  - `unique_variant_id`: A list of columns that are needed to uniquely identify a variant. Important for input files where the same variant can appear on multiple lines.
 
-- [VCF_COLUMNS] maps input TSV columns to their corresponding VCF fields. 
+- [VCF_COLUMNS] maps input TSV columns to their corresponding VCF fields.
   - Add or remove INFO fields at will to customize your output
   - When the equivalence is more complex than 1 input column = 1 VCF field ; you can create advanced HELPER_FUNCTION (explained below).
-  
+
 - [COLUMNS_DESCRIPTION]
   - Describes the input tsv columns to write the output VCF header. Column types can be inferred but it is usually safer to define them.
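
To illustrate the layout described above, here is a hedged skeleton of such a config, written as a Python dictionary and checked for the three sections. The section names and the `GENERAL` keys come from the list above. Every value, the exact shape of `VCF_COLUMNS` and `COLUMNS_DESCRIPTION`, and the `missing_sections` helper are assumptions for illustration. Refer to the shipped config files for the real structure.

```python
import json

# Hypothetical config skeleton, not copied from a shipped config file.
example_config = {
    "GENERAL": {
        "input_format": "annotsv",
        "output_format": "vcf",
        "skip_rows": 0,
        "unique_variant_id": ["AnnotSV_ID"],
    },
    "VCF_COLUMNS": {
        "#CHROM": "SV_chrom",
        "POS": "SV_start",
        "INFO": {"SVLEN": "SV_length"},
    },
    "COLUMNS_DESCRIPTION": {
        "SV_length": {"Type": "Integer", "Description": "Length of the SV"},
    },
}

REQUIRED_SECTIONS = ("GENERAL", "VCF_COLUMNS", "COLUMNS_DESCRIPTION")


def missing_sections(config):
    """List the documented sections that are absent from a config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


print(json.dumps(example_config, indent=2))
print(missing_sections(example_config))
```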
 
 ## HELPER_FUNCTION
 
-They're defined in variantconvert/helperfunctions.py and called in your converter's config .json file. 
+They're defined in variantconvert/helpers and called in your converter's config .json file.
 
 ### To call a HELPER_FUNCTION
 
-Use the following syntax in your .json: 
-```
-<vcf_field>: ["HELPER_FUNCTION", <function_name>, <tsv column 1>, <tsv column 2>...] # where tsv columns are the TSV fields sent as function input 
+Use the following syntax in your .json:
+
+```bash
+<vcf_field>: ["HELPER_FUNCTION", <function_name>, <tsv column 1>, <tsv column 2>...] 
+# where tsv columns are the TSV fields sent as function input 
 ```
 
 ### To define a HELPER_FUNCTION
 
-1. In HelperFunctions.__init__() , add *<function_name>* to the self.dispatcher dictionary
+1. In HelperFunctions.\_\_init\_\_() , add *<function_name>* to the self.dispatcher dictionary
 2. Add a new method in HelperFunctions class named as *<function_name>*, taking as parameters *<tsv column 1>, <tsv column 2>*... in the same order. Then you can use the full power of Python to do any data transformation you wish.
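
The two steps above can be sketched with a self-contained stand-in. The class below only mimics the dispatcher pattern and is not the real `HelperFunctions` class. The `get_sv_length` helper, the `resolve` function and the column names are hypothetical.

```python
class HelperFunctions:
    """Stand-in showing the dispatcher pattern, not variantconvert's class."""

    def __init__(self):
        # Step 1: register the function name in the dispatcher dictionary.
        self.dispatcher = {"get_sv_length": self.get_sv_length}

    # Step 2: a method with the same name, taking the tsv columns in order.
    def get_sv_length(self, start, end):
        return str(int(end) - int(start))


def resolve(config_value, row, helpers):
    """Turn one config entry into a VCF field value for one tsv row."""
    if isinstance(config_value, list) and config_value[0] == "HELPER_FUNCTION":
        _, function_name, *columns = config_value
        function = helpers.dispatcher[function_name]
        return function(*(row[column] for column in columns))
    # Simple case: 1 input column = 1 VCF field.
    return row[config_value]


row = {"SV_start": "46976157", "SV_end": "47590995"}
entry = ["HELPER_FUNCTION", "get_sv_length", "SV_start", "SV_end"]
print(resolve(entry, row, HelperFunctions()))
```

Keeping the lookup in a dictionary means that a function name in the config which was never registered fails with a `KeyError` instead of calling an arbitrary method.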
 
 ## If customizing a config file is not enough
 
-variantconvert relies on Converter classes that are called by a ConverterFactory depending on the --inputFormat and --outputFormat parameters. 
+variantconvert relies on Converter classes that are called by a ConverterFactory depending on the --inputFormat and --outputFormat parameters (set in the config file as `input_format` and `output_format` for versions >= 2.0.0).
 
 You can create new Converter classes that will apply different transformations than the existing ones in variantconvert/converters/
 
-They should inherit from the AbstractConverter class and be listed in the ConverterFactory class. That will make them automatically accessible from the command line. 
+They should inherit from the AbstractConverter class and be listed in the ConverterFactory class. That will make them automatically accessible from the command line.
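
A minimal sketch of this pattern, under stated assumptions: the classes below are simplified stand-ins that reuse the names `AbstractConverter` and `ConverterFactory` for readability. Their method signatures, the `ToyVcfConverter` class and the `("toy", "vcf")` format pair are hypothetical and do not reflect the real classes in variantconvert/converters/.

```python
from abc import ABC, abstractmethod


class AbstractConverter(ABC):
    """Simplified stand-in for the project's abstract base class."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def convert(self, rows):
        """Turn parsed input rows into output lines."""


class ToyVcfConverter(AbstractConverter):
    """Hypothetical converter writing CHROM and POS only."""

    def convert(self, rows):
        return [f"{row['chrom']}\t{row['pos']}" for row in rows]


class ConverterFactory:
    """Simplified stand-in: maps a format pair to a converter class."""

    def __init__(self):
        # New converters are listed here to become selectable.
        self.converters = {("toy", "vcf"): ToyVcfConverter}

    def get_converter(self, config):
        general = config["GENERAL"]
        key = (general["input_format"], general["output_format"])
        if key not in self.converters:
            raise ValueError(f"No converter for {key}")
        return self.converters[key](config)


config = {"GENERAL": {"input_format": "toy", "output_format": "vcf"}}
converter = ConverterFactory().get_converter(config)
print(converter.convert([{"chrom": "chr10", "pos": 46976157}]))
```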
 </details>