diff --git a/README.pdf b/README.pdf index 391ee6d..0fa2842 100644 Binary files a/README.pdf and b/README.pdf differ diff --git a/RELEASE_NOTES.html b/RELEASE_NOTES.html index 4a3dfc3..a510c22 100644 --- a/RELEASE_NOTES.html +++ b/RELEASE_NOTES.html @@ -205,6 +205,7 @@
Sort genetic variations by contig order. Data can be loaded into +‘variants’ table from various formats (e.g. VCF, TSV, Parquet…). SQL +filter can also use external data within the request, such as Parquet +file(s).
+Usage examples:
+++howard sort --input=tests/data/example.vcf.gz +--output=/tmp/example.sorted.vcf.gz
+
+ ++
++ + ++--input=<input> | required + +Input file path. +File format must be either VCF, Parquet, TSV, CSV, PSV or duckDB. +Files can be compressed (e.g. vcf.gz, tsv.gz).
++ ++--output=<output> | required + +Output file path. +File format must be either VCF, Parquet, TSV, CSV, PSV or duckDB. +Files can be compressed (e.g. vcf.gz, tsv.gz).
++ + ++--include_header + +Include header (in VCF format) in output file. +Only for compatible formats (tab-delimited formats such as TSV or BED).
++ ++--parquet_partitions=<parquet partitions> + +Parquet partitioning using hive (available for any format). +This option enables faster parallel writing, but consumes more memory. +Use 'None' (string) for NO partition but split parquet files into a folder. +Examples: '#CHROM', '#CHROM,REF', 'None'.
Statistics on genetic variations, such as: number of variants, number of samples, statistics by chromosome, genotypes by samples…
Usage examples:
@@ -651,8 +713,8 @@---input=<input> | required @@ -670,8 +732,8 @@
---stats_md=<stats markdown> @@ -686,8 +748,8 @@
Convert genetic variations file to another format. Multiple formats are available, such as the usual and official VCF and BCF formats, but also other formats such as TSV, CSV, PSV and Parquet/duckDB. These formats @@ -722,8 +784,8 @@
---input=<input> | required @@ -750,8 +812,8 @@
---explode_infos @@ -781,8 +843,8 @@
---include_header @@ -812,8 +874,8 @@
HGVS annotation using the HUGO HGVS international Sequence Variant Nomenclature (http://varnomen.hgvs.org/). Annotation refers to refGene and genome to generate HGVS nomenclature for all available transcripts. @@ -834,8 +896,8 @@
---input=<input> | required @@ -881,8 +943,8 @@
---use_gene @@ -952,8 +1014,8 @@
Annotation is mainly based on a built-in Parquet annotation method, and tools such as BCFTOOLS, Annovar and snpEff. It uses available databases (see Annovar and snpEff) and homemade databases. Format of @@ -1018,8 +1080,8 @@
---input=<input> | required @@ -1138,8 +1200,8 @@
---annotations_update @@ -1158,8 +1220,8 @@
Calculation processes variant information to generate new information, such as: identifies variation type (VarType), harmonizes allele frequency (VAF) and calculates statistics (VAF_stats), extracts @@ -1202,8 +1264,8 @@
---input=<input> @@ -1241,8 +1303,8 @@
---calculation_config=<calculation config> @@ -1257,8 +1319,8 @@
---hgvs_field=<HGVS field> (default: hgvs) @@ -1274,8 +1336,8 @@
---trio_pedigree=<trio pedigree> @@ -1286,8 +1348,8 @@
---family_pedigree=<family pedigree> @@ -1298,8 +1360,8 @@
Prioritization algorithm uses profiles to flag variants (as passed or filtered), calculate a prioritization score, and automatically generate a comment for each variant (example: ‘polymorphism identified in dbSNP. @@ -1331,8 +1393,8 @@
---input=<input> | required @@ -1368,8 +1430,8 @@
---default_profile=<default profile> @@ -1401,8 +1463,8 @@
howard process tool manages genetic variations to:
annotates genetic variants with multiple annotation @@ -1442,8 +1504,8 @@
---input=<input> | required @@ -1535,8 +1597,8 @@
---use_gene @@ -1606,8 +1668,8 @@
---annotations_update @@ -1626,8 +1688,8 @@
---calculation_config=<calculation config> @@ -1635,8 +1697,8 @@
---default_profile=<default profile> @@ -1668,8 +1730,8 @@
---query=<query> @@ -1693,8 +1755,8 @@
---explode_infos @@ -1724,8 +1786,8 @@
---include_header @@ -1755,8 +1817,8 @@
Download databases and needed files for howard and associated tools
Usage examples:
@@ -1843,8 +1905,8 @@---assembly=<assembly> (default: hg19) @@ -1875,8 +1937,8 @@
---download-genomes=<genomes> @@ -1903,8 +1965,8 @@
---download-snpeff=<snpEff> @@ -1912,8 +1974,8 @@
---download-annovar=<Annovar> @@ -1940,8 +2002,8 @@
---download-refseq=<refSeq> @@ -2023,8 +2085,8 @@
---download-dbnsfp=<dbNSFP> @@ -2129,8 +2191,8 @@
---download-alphamissense=<AlphaMissense> @@ -2146,8 +2208,8 @@
---download-exomiser=<Exomiser> @@ -2243,8 +2305,8 @@
---download-dbsnp=<dnSNP> @@ -2321,8 +2383,8 @@
---convert-hgmd=<HGMD> @@ -2351,8 +2413,8 @@
---input_annovar=<input annovar> @@ -2400,8 +2462,8 @@
---input_extann=<input extann> @@ -2453,8 +2515,8 @@
---generate-param=<param> @@ -2497,15 +2559,15 @@
Graphical User Interface tools
Usage examples:
-howard gui
Help tools
Usage examples:
@@ -2537,8 +2599,8 @@-
---help_md=<help markdown> @@ -2589,8 +2651,8 @@
Update HOWARD database
Usage examples:
@@ -2600,8 +2662,8 @@-
---param=<param> (default: {}) @@ -2610,8 +2672,8 @@
---databases_folder=<databases_folder> (default: ~/howard/databases) @@ -2640,8 +2702,8 @@
---show=<show> @@ -2656,8 +2718,8 @@
Convert VCF file to Excel ‘.xlsx’ format.
Usage examples:
@@ -2667,8 +2729,8 @@-
---input=<input> | required @@ -2687,8 +2749,8 @@
---add_variants_view @@ -2703,8 +2765,8 @@
Check if a transcript list is present in a generated transcript table from an input VCF file.
Usage examples:
@@ -2719,8 +2781,8 @@---input=<input> | required @@ -2759,8 +2821,8 @@
GeneBe annotation using REST API (see https://genebe.net/).
Usage examples:
@@ -2770,8 +2832,8 @@-
---input=<input> | required @@ -2805,8 +2867,8 @@
---genebe_use_refseq @@ -2828,8 +2890,8 @@
---explode_infos @@ -2859,8 +2921,8 @@
---include_header @@ -2890,8 +2952,8 @@
Minimalizing a VCF file consists of putting a missing value (‘.’) in INFO/Tags, ID, QUAL or FILTER fields. Options can also minimalize samples (keep only GT) or remove all samples. INFO/tags can be exploded @@ -2911,8 +2973,8 @@
---input=<input> | required @@ -2939,8 +3001,8 @@
---minimalize_info @@ -2983,8 +3045,8 @@
---explode_infos @@ -3014,8 +3076,8 @@
---include_header @@ -3045,8 +3107,8 @@
--config=<config> (default: {}) diff --git a/docs/help.md b/docs/help.md index fdd1bdd..26786eb 100644 --- a/docs/help.md +++ b/docs/help.md @@ -15,111 +15,115 @@ title: HOWARD Help options](#main-options-1) - [3.2 Filters](#filters) - [3.3 Export](#export-1) -- [4 STATS tool](#stats-tool) +- [4 SORT tool](#sort-tool) - [4.1 Main options](#main-options-2) - - [4.2 Stats](#stats) -- [5 CONVERT - tool](#convert-tool) + - [4.2 Export](#export-2) +- [5 STATS tool](#stats-tool) - [5.1 Main options](#main-options-3) - - [5.2 Explode](#explode-1) - - [5.3 Export](#export-2) -- [6 HGVS tool](#hgvs-tool) + - [5.2 Stats](#stats) +- [6 CONVERT + tool](#convert-tool) - [6.1 Main options](#main-options-4) - - [6.2 HGVS](#hgvs) -- [7 ANNOTATION - tool](#annotation-tool) + - [6.2 Explode](#explode-1) + - [6.3 Export](#export-3) +- [7 HGVS tool](#hgvs-tool) - [7.1 Main options](#main-options-5) - - [7.2 - Annotation](#annotation) -- [8 CALCULATION - tool](#calculation-tool) + - [7.2 HGVS](#hgvs) +- [8 ANNOTATION + tool](#annotation-tool) - [8.1 Main options](#main-options-6) - [8.2 - Calculation](#calculation) - - [8.3 NOMEN](#nomen) - - [8.4 TRIO](#trio) - - [8.5 - BARCODEFAMILY](#barcodefamily) -- [9 PRIORITIZATION - tool](#prioritization-tool) + Annotation](#annotation) +- [9 CALCULATION + tool](#calculation-tool) - [9.1 Main options](#main-options-7) - [9.2 - Prioritization](#prioritization) -- [10 PROCESS - tool](#process-tool) + Calculation](#calculation) + - [9.3 NOMEN](#nomen) + - [9.4 TRIO](#trio) + - [9.5 + BARCODEFAMILY](#barcodefamily) +- [10 PRIORITIZATION + tool](#prioritization-tool) - [10.1 Main options](#main-options-8) - - [10.2 HGVS](#hgvs-1) - - [10.3 + - [10.2 + Prioritization](#prioritization) +- [11 PROCESS + tool](#process-tool) + - [11.1 Main + options](#main-options-9) + - [11.2 HGVS](#hgvs-1) + - [11.3 Annotation](#annotation-1) - - [10.4 + - [11.4 Calculation](#calculation-1) - - [10.5 + - [11.5 Prioritization](#prioritization-1) - - [10.6 
Query](#query-1) - - [10.7 Explode](#explode-2) - - [10.8 Export](#export-3) -- [11 DATABASES + - [11.6 Query](#query-1) + - [11.7 Explode](#explode-2) + - [11.8 Export](#export-4) +- [12 DATABASES tool](#databases-tool) - - [11.1 Main - options](#main-options-9) - - [11.2 Genomes](#genomes) - - [11.3 snpEff](#snpeff) - - [11.4 Annovar](#annovar) - - [11.5 refSeq](#refseq) - - [11.6 dbNSFP](#dbnsfp) - - [11.7 + - [12.1 Main + options](#main-options-10) + - [12.2 Genomes](#genomes) + - [12.3 snpEff](#snpeff) + - [12.4 Annovar](#annovar) + - [12.5 refSeq](#refseq) + - [12.6 dbNSFP](#dbnsfp) + - [12.7 AlphaMissense](#alphamissense) - - [11.8 Exomiser](#exomiser) - - [11.9 dbSNP](#dbsnp) - - [11.10 HGMD](#hgmd) - - [11.11 + - [12.8 Exomiser](#exomiser) + - [12.9 dbSNP](#dbsnp) + - [12.10 HGMD](#hgmd) + - [12.11 from_Annovar](#from_annovar) - - [11.12 + - [12.12 from_extann](#from_extann) - - [11.13 + - [12.13 Parameters](#parameters) -- [12 GUI tool](#gui-tool) -- [13 HELP tool](#help-tool) - - [13.1 Main - options](#main-options-10) -- [14 UPDATE_DATABASE - tool](#update_database-tool) +- [13 GUI tool](#gui-tool) +- [14 HELP tool](#help-tool) - [14.1 Main options](#main-options-11) - - [14.2 - Update_database](#update_database) - - [14.3 Options](#options) -- [15 TO_EXCEL - tool](#to_excel-tool) +- [15 UPDATE_DATABASE + tool](#update_database-tool) - [15.1 Main options](#main-options-12) - - [15.2 Add](#add) -- [16 TRANSCRIPTS_CHECK - tool](#transcripts_check-tool) + - [15.2 + Update_database](#update_database) + - [15.3 Options](#options) +- [16 TO_EXCEL + tool](#to_excel-tool) - [16.1 Main options](#main-options-13) -- [17 GENEBE tool](#genebe-tool) + - [16.2 Add](#add) +- [17 TRANSCRIPTS_CHECK + tool](#transcripts_check-tool) - [17.1 Main options](#main-options-14) - - [17.2 GeneBe](#genebe) - - [17.3 Explode](#explode-3) - - [17.4 Export](#export-4) -- [18 MINIMALIZE - tool](#minimalize-tool) +- [18 GENEBE tool](#genebe-tool) - [18.1 Main 
options](#main-options-15) - - [18.2 - Minimalize](#minimalize) - - [18.3 Explode](#explode-4) + - [18.2 GeneBe](#genebe) + - [18.3 Explode](#explode-3) - [18.4 Export](#export-5) -- [19 Shared +- [19 MINIMALIZE + tool](#minimalize-tool) + - [19.1 Main + options](#main-options-16) + - [19.2 + Minimalize](#minimalize) + - [19.3 Explode](#explode-4) + - [19.4 Export](#export-6) +- [20 Shared arguments](#shared-arguments) # Introduction @@ -392,6 +396,64 @@ Usage examples: +# SORT tool + +Sort genetic variations by contig order. Data can be loaded into +'variants' table from various formats (e.g. VCF, TSV, Parquet...). SQL +filter can also use external data within the request, such as Parquet +file(s). + +Usage examples: + +> howard sort --input=tests/data/example.vcf.gz +> --output=/tmp/example.sorted.vcf.gz + +> + +## Main options + + + +> --input= | required +> +> Input file path. +> File format must be either VCF, Parquet, TSV, CSV, PSV or duckDB. +> Files can be compressed (e.g. vcf.gz, tsv.gz). + + + + + +> --output= + +## Export + + + +> --include_header +> +> Include header (in VCF format) in output file. +> Only for compatible formats (tab-delimited formats such as TSV or BED). + + + + + +> --parquet_partitions=
+> +> Parquet partitioning using hive (available for any format). +> This option enables faster parallel writing, but consumes more memory. +> Use 'None' (string) for NO partition but split parquet files into a folder. +> Examples: '#CHROM', '#CHROM,REF', 'None'. + + + # STATS tool Statistics on genetic variations, such as: number of variants, number of diff --git a/docs/help.parameters.databases.pdf b/docs/help.parameters.databases.pdf index c0144ce..bca5569 100644 Binary files a/docs/help.parameters.databases.pdf and b/docs/help.parameters.databases.pdf differ diff --git a/docs/help.parameters.pdf b/docs/help.parameters.pdf index 9c50931..89ca3f7 100644 Binary files a/docs/help.parameters.pdf and b/docs/help.parameters.pdf differ diff --git a/docs/help.pdf b/docs/help.pdf index 7d0cb7a..fcd449e 100644 Binary files a/docs/help.pdf and b/docs/help.pdf differ diff --git a/docs/pdoc/howard/functions/commons.html b/docs/pdoc/howard/functions/commons.html index 6034bc3..abf4143 100644 --- a/docs/pdoc/howard/functions/commons.html +++ b/docs/pdoc/howard/functions/commons.html @@ -429,6 +429,9 @@API Documentation
- docker_automount
+- + sort_contigs +
+def sort_contigs(vcf_reader):
+    """
+    Function that sorts contigs in VCF header
+
+    Args:
+        vcf_reader (vcf): VCF object from VCF package
+
+    Returns:
+        vcf: VCF object from VCF package
+    """
+
+    # 're' is used by contig_sort_key below; imported locally so the listing stands alone
+    import re
+    from collections import OrderedDict
+
+    # Large sentinel rank: places special and non-numeric contigs after numeric ones
+    inf = 100000000
+
+    # Extract contigs from header
+    contigs = list(vcf_reader.contigs.keys())
+
+    # Sort function
+    def contig_sort_key(contig):
+
+        # Remove 'chr' prefix from contig
+        contig_clean = re.sub(r"^chr", "", contig)
+
+        # Special cases: X, Y, M/MT
+        if contig_clean == "X":
+            return (float(inf) - 3, contig)
+        elif contig_clean == "Y":
+            return (float(inf) - 2, contig)
+        elif contig_clean in ["M", "MT"]:
+            return (float(inf) - 1, contig)
+
+        # Contig as integer
+        try:
+            return (int(contig_clean), contig)
+        except ValueError:
+            # Contig as non-numeric
+            return (float(inf), contig_clean)
+
+    # Sort contigs
+    sorted_contigs = sorted(contigs, key=contig_sort_key)
+
+    # Create new contigs OrderedDict
+    ordered_contigs = OrderedDict()
+
+    # Add contigs
+    for contig in sorted_contigs:
+        ordered_contigs[contig] = vcf_reader.contigs[contig]
+
+    # Replace contigs
+    vcf_reader.contigs = ordered_contigs
+
+    # Return
+    return vcf_reader
+
Function that sorts contigs in VCF header
+ +Args: + vcf_reader (vcf): VCF object from VCF package
+ +Returns: + vcf: VCF object from VCF package
+
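The ordering applied by `sort_contigs` can be illustrated without a VCF reader. The sketch below re-implements only the sort key from the listing above as a standalone function. The contig names are hypothetical, and integer ranks replace the `float(inf)` arithmetic of the source, which should not change the resulting order.

```python
import re

# Sentinel rank above any realistic numeric contig name (mirrors `inf` in the listing)
INF = 100_000_000


def contig_sort_key(contig):
    """Rank numeric contigs first, then X, Y, M/MT, then any other name."""
    # Drop a leading 'chr' prefix before classifying the contig
    contig_clean = re.sub(r"^chr", "", contig)
    if contig_clean == "X":
        return (INF - 3, contig)
    if contig_clean == "Y":
        return (INF - 2, contig)
    if contig_clean in ("M", "MT"):
        return (INF - 1, contig)
    try:
        # Numeric contigs sort by integer value, so 'chr2' precedes 'chr10'
        return (int(contig_clean), contig)
    except ValueError:
        # Remaining contigs share the top rank and fall back to name order
        return (INF, contig_clean)


# Hypothetical contig names, for illustration only
contigs = ["chrX", "chr10", "chrM", "chr2", "chrUn_gl000220", "chrY", "chr1"]
sorted_contigs = sorted(contigs, key=contig_sort_key)
print(sorted_contigs)
# ['chr1', 'chr2', 'chr10', 'chrX', 'chrY', 'chrM', 'chrUn_gl000220']
```

Returning a tuple keeps the comparison numeric on the first element and uses the contig name only to break ties, which is why `chr10` lands after `chr2` rather than before it as plain string sorting would give.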