When loading multiple files, the `INFO` and `FORMAT` fields from all VCF files are merged together to generate a "representative header", which is then used to generate a single BigQuery schema. If the same key is defined in multiple files, then its definition must be compatible across files. The compatibility rules are as follows:
- Fields are compatible if they have the same `Number` and `Type` fields. Annotation fields (i.e. those specified by `--annotation_fields`) must also have the same `Description`.
- Fields with different `Type` values are compatible in the following cases:
  - `Integer` and `Float` fields are compatible and are converted to `Float`.
  - You must run the pipeline with `--allow_incompatible_records` to automatically resolve conflicts between incompatible fields (e.g. `String` and `Integer`). This is to ensure incompatible types are not silently ignored. See below for more details.
- Fields with different `Number` values are compatible in the following cases:
  - "Repeated" numbers are compatible with each other. They include:
    - `Number=.` (unknown number).
    - Any `Number` greater than 1.
    - `Number=G` (one value per genotype) and `Number=R` (one value for each alternate and reference).
    - `Number=A` (one value for each alternate) only if running with `--split_alternate_allele_info_fields False`.
  - You must run the pipeline with `--allow_incompatible_records` to automatically resolve conflicts between incompatible fields (e.g. `Number=1` and `Number=.`). This is to ensure incompatible types are not silently ignored. See below for more details.
- You can run the preprocessing tool to get a summary of malformed/incompatible records. Please refer to the VCF files preprocessor documentation for more details.
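The rules above can be sketched as a small compatibility check. This is only an illustration of the rules as written here, not the pipeline's actual implementation: `FieldDefinition`, `is_repeated`, `types_compatible`, `numbers_compatible` and `is_compatible` are hypothetical names, and the extra `Description` check for annotation fields is left out.

```python
from typing import NamedTuple


class FieldDefinition(NamedTuple):
    """Hypothetical container for the Number and Type of one header field."""
    number: str  # e.g. "1", "3", ".", "A", "G", "R"
    type: str    # e.g. "Integer", "Float", "String"


def is_repeated(number: str, split_alternate_allele_info_fields: bool) -> bool:
    """Returns True if `number` is a "repeated" number under the rules above."""
    if number in (".", "G", "R"):
        return True
    if number == "A":
        # Number=A only counts as repeated when alternate alleles are not split.
        return not split_alternate_allele_info_fields
    return number.isdigit() and int(number) > 1


def types_compatible(first: str, second: str) -> bool:
    """Same Type, or an Integer/Float pair (which is converted to Float)."""
    return first == second or {first, second} == {"Integer", "Float"}


def numbers_compatible(first: str, second: str,
                       split_alternate_allele_info_fields: bool) -> bool:
    """Same Number, or both Numbers are "repeated"."""
    if first == second:
        return True
    return (is_repeated(first, split_alternate_allele_info_fields)
            and is_repeated(second, split_alternate_allele_info_fields))


def is_compatible(first: FieldDefinition, second: FieldDefinition,
                  split_alternate_allele_info_fields: bool) -> bool:
    """Two definitions are compatible if both Type and Number are compatible."""
    return (types_compatible(first.type, second.type)
            and numbers_compatible(first.number, second.number,
                                   split_alternate_allele_info_fields))


# Integer and Float with the same Number are compatible.
print(is_compatible(FieldDefinition("1", "Integer"),
                    FieldDefinition("1", "Float"),
                    split_alternate_allele_info_fields=True))  # True
```

Any pair for which this check returns `False` (e.g. `Number=1` and `Number=.`) corresponds to a conflict that needs `--allow_incompatible_records`.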
The headers in the `--representative_header_file <path_to_file>` essentially specify the merged headers from all files being loaded to BigQuery. This file is used to directly generate the BigQuery schema. Note that we only read the header info from the file and ignore VCF records, so the `representative_header_file` can either be a file containing just the header fields or can point to an actual VCF file. Providing this file can be useful for:
- Speeding up the pipeline especially if a large number of files are provided. The pipeline will use the provided file to generate the BigQuery schema and will skip merging headers across files. This is particularly useful if all files have identical VCF headers.
- Providing definitions for missing header fields. See the troubleshooting page for more details.
- Resolving incompatible field definitions across files. See below for an alternative.
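If you want a header-only file rather than pointing at a full VCF file, one way to produce it is to copy the leading lines that start with `#` (the `##` meta-information lines and the `#CHROM` column line). The sketch below assumes an uncompressed, plain-text VCF file; `extract_header_lines` and `write_header_file` are hypothetical helper names and are not part of the pipeline.

```python
from typing import Iterable, List


def extract_header_lines(lines: Iterable[str]) -> List[str]:
    """Returns the leading header lines (those starting with '#') of a VCF."""
    header = []
    for line in lines:
        if not line.startswith("#"):
            break  # First record reached; everything after is variant data.
        header.append(line)
    return header


def write_header_file(vcf_path: str, header_path: str) -> None:
    """Copies only the header lines of `vcf_path` into `header_path`."""
    with open(vcf_path) as vcf, open(header_path, "w") as out:
        out.writelines(extract_header_lines(vcf))
```

The resulting file can then be edited by hand, for example to add a missing field definition, before it is passed to `--representative_header_file`.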
If this flag is set, the pipeline will infer `TYPE` and `NUMBER` for undefined fields based on field values seen in VCF files. It will also output a representative header that contains inferred definitions as well as definitions from headers. Use this flag if there are fields with missing definitions, or if the pipeline should ignore header definitions that are incompatible with field values and instead infer the correct header definitions for the corresponding fields.
Note that this will make the pipeline do two passes on the data, which results in ~30% more compute.
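To give a rough idea of what inferring a definition from values can look like, here is a minimal sketch. It is an assumption made for illustration and not the pipeline's actual inference logic: `infer_definition` is a hypothetical function, it only distinguishes `Integer`, `Float` and `String`, and it only chooses between `Number=1` and `Number=.`.

```python
from typing import List, Tuple


def _parses_as(cast, value: str) -> bool:
    """Returns True if `cast(value)` succeeds."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


def infer_definition(observations: List[List[str]]) -> Tuple[str, str]:
    """Infers (Number, Type) from the values seen for one field.

    Each element of `observations` is the list of raw string values that
    one record carried for the field.
    """
    values = [value for record in observations for value in record]
    if not values:
        raise ValueError("Cannot infer a definition without any values.")

    # Pick the narrowest type that every observed value fits into.
    if all(_parses_as(int, v) for v in values):
        inferred_type = "Integer"
    elif all(_parses_as(float, v) for v in values):
        inferred_type = "Float"
    else:
        inferred_type = "String"

    # One value in every record -> Number=1, otherwise unknown number.
    inferred_number = "1" if all(len(r) == 1 for r in observations) else "."
    return inferred_number, inferred_type


print(infer_definition([["1", "2.5"], ["3"]]))  # ('.', 'Float')
```

Inference of this kind has to see the field values before a schema can be generated, which is consistent with the extra pass over the data noted above.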
By default, the pipeline will fail if there is a mismatch between a field definition and the actual values, or if a field has two inconsistent definitions in two different VCF files.
By specifying `--allow_incompatible_records`, the pipeline will resolve conflicts in header definitions. It will also cast field values to match the BigQuery schema if there is a mismatch between field definition and field value (e.g. an `Integer` field value is cast to `String` to match a field schema of type `String`).
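The casting described above can be pictured with the following sketch. It is illustrative only: `cast_to_schema_type` is a hypothetical helper, and what the pipeline does when a value cannot be cast is not described here, so the sketch simply lets the error propagate.

```python
def cast_to_schema_type(value, schema_type: str):
    """Casts `value` to the Python equivalent of the schema's field type."""
    casters = {"String": str, "Float": float, "Integer": int}
    if schema_type not in casters:
        raise ValueError("Unsupported schema type: %s" % schema_type)
    return casters[schema_type](value)


# An Integer value loaded into a field whose schema type is String.
print(cast_to_schema_type(7, "String"))  # 7 (now the string "7")
```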