package learning.genomics.deepvariant


Config parameters for "alignment (aln)" phase.

int32 match = 1
Match score (expected to be a non-negative score).
int32 mismatch = 2
Mismatch score (expected to be a non-positive score).
int32 gap_open = 3
Gap open score (expected to be a non-positive score). Score for a gap of length g is (gap_open + (g - 1) * gap_extend).
int32 gap_extend = 4
Gap extend score (expected to be a non-positive score). Score for a gap of length g is (gap_open + (g - 1) * gap_extend).
int32 k = 5
k-mer size used to index target sequence. TODO This parameter is not used in fast_pass_aligner. Since we no longer use the python realigner, this parameter is obsolete.
float error_rate = 6
Estimated sequencing error rate. TODO This parameter is not used in fast_pass_aligner. We need to remove it.
int32 read_size = 8
Average read size. This parameter is used to calculate ssw_alignment_score_threshold_, the threshold used to filter out reads aligned with the SSW library. Reads may not all be the same size, so this parameter needs to be set to a value close to the average read size.
int32 kmer_size = 9
K-mer size in read index used in Fast Pass Aligner.
int32 max_num_of_mismatches = 10
Maximum number of mismatches allowed for quick read-to-haplotype alignment.
double realignment_similarity_threshold = 11
Similarity threshold used to filter out bad read alignments made with Smith-Waterman alignment. Alignment is discarded if read is aligned to a haplotype with too many mismatches.
bool force_alignment = 12
Force realignment so the original alignment is never returned, defaulting instead to computing a new SSW alignment against the reference. This is used for alt-aligned pileups where reads are aligned to a new "reference", making the original read alignments invalid.
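
The gap scoring formula quoted for gap_open and gap_extend above can be checked with a small worked example. This is only an illustrative sketch: the function name `gap_score` and the penalty values are hypothetical and are not defaults taken from the proto.

```python
def gap_score(gap_open: int, gap_extend: int, g: int) -> int:
    """Score for a gap of length g: gap_open + (g - 1) * gap_extend."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (g - 1) * gap_extend


# Assumed example penalties; both are expected to be non-positive.
print(gap_score(gap_open=-8, gap_extend=-2, g=1))  # -8
print(gap_score(gap_open=-8, gap_extend=-2, g=3))  # -8 + 2 * (-2) = -12
```

With these assumed values, opening a gap costs 8 and each additional gapped base costs 2 more, so one long gap is penalized less than several short ones of the same total length.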

An Allele observed in some type of NGS read data. Conceptually, an Allele is a sequence of bases that represent a type of change relative to a reference genome sequence, along with a discrete count of the number of times that allele was observed in the NGS data.

Used in: AlleleCount, AlleleCount.Alleles

string bases = 1
The string of bases that make up this Allele. Should not be empty. A simple reference allele might have a single base "A", while a complex insertion of the bases "CTG" following that base "A" would have a bases sequence of "ACTG".
AlleleType type = 2
The type of this allele.
int32 count = 3
The number of times this Allele was seen in the NGS data. The count should be >= 0, where 0 indicates that the allele was never observed (which can happen if you want to record that you checked for some allele in the data and never saw any evidence for it).
bool is_low_quality = 4
Set to true if allele contains low quality bases.
int32 mapping_quality = 5
int32 avg_base_quality = 6
bool is_reverse_strand = 7

An AlleleCount summarizes the NGS data observed at a position in the genome.

An AlleleCount proto is a key intermediate data structure in DeepVariant summarizing the NGS read data covering a site in the genome. It is intended to be relatively simple but keep track of the key pieces of information about the observed reads and their associated alleles at this position so that downstream tools can reconstruct the read coverage, do variant calling, and compute reference confidence. It is conceptually similar to a samtools read pileup (http://samtools.sourceforge.net/pileup.shtml) but without detailed information about bases or their qualities.

The AlleleCount at its core tracks the Alleles observed in reads that overlap this position in the genome. Consider a read that has a base, X, aligned to the position of interest. If X is a non-reference allele, the AlleleCount proto adds a new read name ==> Allele key-value entry to the read_alleles map field. If X is the reference allele, the AlleleCount proto increments either the ref_supporting_read_count or the ref_nonconfident_read_count counter field, depending on read alignment confidence as defined by pipeline-specific parameters.

The complexity here is introduced by following the VCF convention of representing indel and complex substitution alleles as occurring at the preceding base in the genome. So if our base X is in fact followed by a 3 bp insertion of acg, then we would not have a count for X at all but would see an allele Xacg with a count of 1 (or more if other reads have the same allele). The primary contract here is that each aligned base at this site goes into a +1 for exactly one allele.

A concrete example might clarify this logic. Consider the following alignment of two reads to the reference genome:

    Position: 123   4567
    Ref:      ATT---TGCT
    Read1:    ATT---TGCT
    Read2:    ATTCCCTG-T

The 'T' base in position 3 of Read 1 matches the reference and so the AlleleCount's ref_supporting_read_count is incremented. The 'T' base in position 3 of Read 2 also matches the reference, but it is the anchor base preceding the 'CCC' insertion. The 'TCCC' INSERTION Allele is therefore added to the AlleleCount proto for this position. A deletion occurs in read 2 at position 6 as well, which produces a DELETION allele 'GT' at position 5. Additionally, because only read 1 has a C base at position 6, the AlleleCount at 6 would have no entries in its read_alleles and ref_supporting_read_count would be 1.

Another design choice is that spanning deletions don't count as coverage under bases, so there's an actual drop in coverage under regions of the genome with spanning deletions. This is classically the difference between physical coverage and sequence coverage (https://en.wikipedia.org/wiki/Shotgun_sequencing#Coverage), so it's safe to think of an AlleleCount as representing the sequence coverage of a position, not its physical coverage.

What this means is that it's very straightforward to inspect the reference counters and read_alleles in an AlleleCount and determine the corresponding alleles for a Variant as well as compute the depth of coverage. This enables us to write algorithms to call SNPs, indels, and CNVs, as well as identify regions for assembly, using a series of AlleleCount objects rather than the underlying read data.

An AlleleCount is a lossy transformation of the raw read data. Fundamentally, the digestion of a read into its corresponding AlleleCount components loses some of the contiguity information provided by reads spanning multiple positions on the genome. None of the specifics of base quality, mapping quality, read names, etc. are preserved. Furthermore, an AlleleCount can be constructed using only a subset of all of the raw reads (e.g., those that pass minimum quality criteria) and even only parts of each read (e.g., if the read contains Ns or low quality bases). The data used to compute an AlleleCount isn't specified as part of the proto, but is left up to the implementation details and runtime parameters of the generating program.

For usability and performance reasons we track reference and alternate allele supporting reads in slightly different ways. The number of reads that confidently carry the reference allele at this position is stored in ref_supporting_read_count. Confidence here means that the read's alignment to the reference is reliable. See the Base Alignment Quality (BAQ) paper for background and motivation: http://bioinformatics.oxfordjournals.org/content/early/2011/02/13/bioinformatics.btr076.full.pdf

Reads that would have been counted as reference supporting but don't have a reliable alignment are instead tallied in ref_nonconfident_read_count. Finally, reads that don't have the reference allele are stored in a map from the string "fragment_name/read_number" to the allele the read supports, which by construction will always have a count of 1. This allows for more detailed downstream analyses of the reads containing alt alleles.

Consequently, the total (usable) coverage at this location is:

    coverage =
        // reads supporting an observed alternate allele.
        sum(read_allele_i.count)  // [also equal to read_alleles_size()]
        // All of the reads confidently asserting reference.
        + ref_supporting_read_count
        // All of the reads supporting ref without a confident alignment.
        + ref_nonconfident_read_count
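
The coverage formula above can be sketched in a few lines of Python. The dataclasses below are hypothetical stand-ins that mirror only the relevant fields; they are not the generated proto classes, and the read key "read2/1" is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AlleleSketch:
    """Hypothetical stand-in for the Allele message."""
    bases: str
    count: int = 1  # each entry in read_alleles has a count of 1 by construction


@dataclass
class AlleleCountSketch:
    """Hypothetical stand-in for the AlleleCount message."""
    ref_supporting_read_count: int = 0
    ref_nonconfident_read_count: int = 0
    read_alleles: Dict[str, AlleleSketch] = field(default_factory=dict)


def usable_coverage(ac: AlleleCountSketch) -> int:
    """Alt-supporting reads + confident ref reads + non-confident ref reads."""
    alt_reads = sum(allele.count for allele in ac.read_alleles.values())
    return alt_reads + ac.ref_supporting_read_count + ac.ref_nonconfident_read_count


# Position 3 of the two-read example: Read1 supports the reference and
# Read2 carries the 'TCCC' insertion allele.
site3 = AlleleCountSketch(
    ref_supporting_read_count=1,
    read_alleles={"read2/1": AlleleSketch(bases="TCCC")},
)
print(usable_coverage(site3))  # 2
```

Because every entry in the map has a count of 1, the first term is the same as the number of entries in read_alleles.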

optional nucleus.genomics.v1.Position position = 1
The position on the genome of this AlleleCount.
string ref_base = 2
The reference bases of this AlleleCount. Since AlleleCount currently only represents a single location, this field should always be a single base.
int32 ref_supporting_read_count = 3
The number of reads that confidently carry the reference allele at this position.
map<string, Allele> read_alleles = 4
A map from a read's key to an Allele message containing information about the allele supported by that read at this position. There will be one binding for each usable read spanning this position that supports a non-reference allele. The read's key is a unique string that identifies the read, currently "fragment_name/read_number".
int32 ref_nonconfident_read_count = 5
A count of the number of reads that supported the reference allele but whose alignment to the reference genome isn't 100% certain.
map<string, AlleleCount.Alleles> sample_alleles = 6
bool track_ref_reads = 7
If true reads supporting ref are tracked (read ids are saved).

A map where key is a sample name and value is a list of alt alleles that are supported by reads from this sample.

Used in: AlleleCount

repeated Allele alleles = 1

A lighter-weight version of AlleleCount. The only material difference with this proto is that we don't store the map from read names to Alleles, but instead have the total number of reads we've seen at this position.

string reference_name = 1
The Position field of AlleleCount with values inlined here.
int64 position = 2
string ref_base = 3
Same as in AlleleCount.
int32 ref_supporting_read_count = 4
Same as in AlleleCount.
int32 total_read_count = 5
This is the total number of reads observed at position.
int32 ref_nonconfident_read_count = 6
Same as in AlleleCount.

Options to control how our AlleleCounter code works.

Used in: MakeExamplesOptions

int32 partition_size = 1
The number of basepairs to include in each partition of the reference genome. This determines how many map/reduce jobs are used to compute the AlleleCounts. Using too small a value (below 10000, for example) produces a very large number of intervals to process, which may be a performance problem for the tool. Using too large a value makes the computation difficult to parallelize, as there will be too few work units and each unit will use a lot of memory.
optional nucleus.genomics.v1.ReadRequirements read_requirements = 2
The requirements for reads to be used when counting alleles.
bool track_ref_reads = 3
Determines how the allele counter keeps track of ref reads. If True, allele_counter stores the read IDs of ref reads; otherwise just a counter is used for ref reads. Default value is False.
bool normalize_reads = 4
Option to left align INDELs for each read.
bool keep_legacy_behavior = 5
If True, the behavior in this commit is reverted: https://github.com/google/deepvariant/commit/fbde0674639a28cb9e8004c7a01bbe25240c7d46
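
To make the partition_size trade-off concrete, here is a minimal sketch of splitting a region into work units. The function name `partition` and the half-open, 0-based interval convention are assumptions for illustration, not the tool's actual implementation.

```python
from typing import Iterator, Tuple


def partition(start: int, end: int, partition_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [s, e) intervals of at most partition_size bases."""
    if partition_size < 1:
        raise ValueError("partition_size must be >= 1")
    for s in range(start, end, partition_size):
        yield (s, min(s + partition_size, end))


parts = list(partition(0, 2500, 1000))
print(parts)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Halving partition_size roughly doubles the number of intervals, which is why very small values create many work units while very large values leave few units to parallelize.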

The type of an Allele. An allele type indicates what kind of event would have produced this allele. An allele can be the reference sequence, a substitution of bases, insertion of bases, or deletion of bases. Allele types need not be real genetic variants: for example, the SOFT_CLIP type indicates that a read contained bases SOFT_CLIPPED away (similar to an insertion), which is often indicative of some large event near the start or end of the read.

Used in: Allele

UNSPECIFIED = 0
Default should be unspecified: https://docs.google.com/document/d/1oavZD9XB_147ti93MCBoR5HrFKoBh1xkZcTxjInYf0M/edit#heading=h.8ylxmf942vui
REFERENCE = 1
The allele corresponding to that found in the genome sequence.
SUBSTITUTION = 2
A substitution of bases that are different from the genome sequence.
INSERTION = 3
An insertion of bases w.r.t. the reference genome.
DELETION = 4
A deletion of bases w.r.t. the reference genome.
SOFT_CLIP = 5
An allele type produced by a SOFT_CLIP operation during alignment. May be indicative of a real genetic event occurring at this position, or may be a data quality / alignment artifact.

Variant call for a single site, in a pseudo-biallelic manner. This is an intermediate format for call_variants.py that needs to be merged if there are multiallelics. The `variant` here likely doesn't have fully filled information for output to a VCF file yet.

optional nucleus.genomics.v1.Variant variant = 1
optional CallVariantsOutput.AltAlleleIndices alt_allele_indices = 2
repeated double genotype_probabilities = 3
optional CallVariantsOutput.DebugInfo debug_info = 4

The alt allele indices are represented as a sub-message so that they are easier to re-use as a standalone proto for encoding and decoding.

Used in: CallVariantsOutput

repeated int32 indices = 1

Next ID: 11

Used in: CallVariantsOutput

int32 predicted_label = 1
bool has_insertion = 2
bool has_deletion = 3
bool is_snp = 4
int32 true_label = 5
repeated double logits = 6
repeated double prelogits = 7
bytes image_encoded = 8
The encoded image used for inference.
map<string, bytes> layer_output_encoded = 9
Key-value pairs of layer names of call variant models and the encoded layers' outputs.
optional DebugInfo.PileupCuration pileup_curation = 10

Used in: DebugInfo

int32 diff_category = 1
The following enums are defined in nucleus/util/vis.py.
int32 base_quality = 2
int32 mapping_quality = 3
int32 strand_bias = 4
int32 read_support = 5

Encapsulates a list of candidate haplotype sequences for a genomic region.

optional nucleus.genomics.v1.Range span = 1
The genomic region containing the candidate haplotypes.
repeated string haplotypes = 2
The list of candidate haplotype sequences. Each individual haplotype is represented by its nucleotide sequence.

Config parameters for "de-Bruijn graph (dbg)" phase.

Used in: RealignerOptions

int32 min_k = 1
Initial k-mer size to build the graph.
int32 max_k = 2
Maximum k-mer size. Larger k-mer size is used to resolve graph cycles.
int32 step_k = 3
Increment size for k to try in resolving graph cycles.
int32 min_mapq = 4
Minimum read alignment quality to consider in building the graph.
int32 min_base_quality = 5
Minimum base quality in a k-mer sequence to consider in building the graph.
int32 min_edge_weight = 6
Minimum number of supporting reads to keep an edge.
int32 max_num_paths = 7
Maximum number of paths within a graph to consider for realignment. Set max_num_paths to 0 to have unlimited number of paths.
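
The roles of min_k, max_k and step_k can be sketched as a search for the smallest k that yields a cycle-free graph. This is a deliberately simplified illustration: it uses a single sequence, treats k-mers as nodes (so a repeated k-mer implies a cycle), and assumes max_k is inclusive. The function names are hypothetical, and the real realigner also builds the graph from reads and applies the quality and edge-weight filters listed above.

```python
from typing import Optional


def has_cycle(sequence: str, k: int) -> bool:
    """With k-mers as nodes, a repeated k-mer means the path revisits a node."""
    seen = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def first_acyclic_k(sequence: str, min_k: int, max_k: int, step_k: int) -> Optional[int]:
    """Try k = min_k, min_k + step_k, ... up to max_k (assumed inclusive)."""
    for k in range(min_k, max_k + 1, step_k):
        if not has_cycle(sequence, k):
            return k
    return None  # no k in range resolves the cycles


# "AC" repeats at k=2 and "ACG" repeats at k=3; all 4-mers are distinct.
print(first_acyclic_k("ACGACGT", min_k=2, max_k=6, step_k=1))  # 4
```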

A message encapsulating all of the information about a Variant call site for consumption by further stages of the DeepVariant data processing workflow.

optional nucleus.genomics.v1.Variant variant = 1
A Variant call based on the information in allele_count. Will always be a non-reference variant call (no gVCF or reference records).
map<string, DeepVariantCall.SupportingReads> allele_support = 2
map<string, float> allele_frequency = 3
repeated string ref_support = 4
List of read keys supporting the ref allele.
map<string, DeepVariantCall.SupportingReadsExt> allele_support_ext = 5
This is to replace allele_support.
optional DeepVariantCall.SupportingReadsExt ref_support_ext = 6
map<int32, int32> allele_frequency_at_position = 7
Map of VAF at a given +/- position relative to the variant.

Used in: SupportingReadsExt

string read_name = 1
bool is_low_quality = 2
int32 mapping_quality = 3
int32 average_base_quality = 4
bool is_reverse_strand = 5

A map from an alt allele in Variant to the read keys that support that allele. Every alt allele in the variant will have an entry. Reference supporting reads aren't listed. There may be a special key "UNCALLED_ALLELE" for reads that don't support either the reference allele or any alt allele in the variant. This can happen when the read supports an allele that didn't pass our calling thresholds. The read's key is a unique string that identifies the read, constructed as "fragment_name/read_number".

Used in: DeepVariantCall

repeated string read_names = 1

A map from an alt allele in Variant to a ReadSupport structure. This structure is intended to replace SupportingReads, but the old one is kept for backward compatibility.

Used in: DeepVariantCall

repeated ReadSupport read_infos = 1

Next ID: 23.

Used in: SampleOptions

CH_UNSPECIFIED = 0
Default should be unspecified.
CH_READ_BASE = 1
6 channels that exist in all DeepVariant production models.
CH_BASE_QUALITY = 2
CH_MAPPING_QUALITY = 3
CH_STRAND = 4
CH_READ_SUPPORTS_VARIANT = 5
CH_BASE_DIFFERS_FROM_REF = 6
CH_HAPLOTYPE_TAG = 7
"Improving Variant Calling using Haplotype Information" https://google.github.io/deepvariant/posts/2021-02-08-the-haplotype-channel/
CH_ALLELE_FREQUENCY = 8
"Improving variant calling using population data and deep learning" https://doi.org/10.1101/2021.01.06.425550
CH_DIFF_CHANNELS_ALTERNATE_ALLELE_1 = 9
Two extra channels for diff_channels:
CH_DIFF_CHANNELS_ALTERNATE_ALLELE_2 = 10
CH_BASE_CHANNELS_ALTERNATE_ALLELE_1 = 20
Two extra channels for base_channels:
CH_BASE_CHANNELS_ALTERNATE_ALLELE_2 = 21
CH_READ_MAPPING_PERCENT = 11
The following channels correspond to the "Opt Channels" defined in deepvariant/pileup_channel_lib.h:
CH_AVG_BASE_QUALITY = 12
CH_IDENTITY = 13
CH_GAP_COMPRESSED_IDENTITY = 14
CH_GC_CONTENT = 15
CH_IS_HOMOPOLYMER = 16
CH_HOMOPOLYMER_WEIGHTED = 17
CH_BLANK = 18
CH_INSERT_SIZE = 19
CH_MEAN_COVERAGE = 22

Config describing the information needed for a dataset that can be used for training, validation, or testing.

string name = 1
A human-readable name of the dataset.
string tfrecord_path = 2
Full path of the tensorflow.Example TFRecord file.
uint32 num_examples = 3
Number of examples for this dataset. Right now this needs to be filled in manually in order to compute how the learning rate decays, and it is also used in make_training_batches.

Config parameters for the realigner's diagnostic outputs.

Used in: RealignerOptions

bool enabled = 1
Enable runtime diagnostic outputs.
string output_root = 2
The root where we'll put our diagnostic outputs.
bool emit_realigned_reads = 3
True if we should also emit the realigned reads themselves.

Metrics on the labeling of candidate / truth variants when running DeepVariant's make_examples in training mode. Next ID: 17.

Notes on counting by site or by allele:

Throughout this proto we often measure the same quantity (e.g., false positives) by site and by alleles. This reflects two different ways of counting errors in genomes where the number of chromosomes > 1. We can give a concrete example:

    Candidate: chr20:10 with A/C
    Truth:     chr20:10 with A/C alleles and with genotype (0, 1)

Since we have the same variant with the same alleles at the same position in both the candidates and the truth, the matching is trivial. For this variant, we'd update our counts as follows:

    # We have only a single site, so we +1 to each truth and candidates.
    n_truth_variant_sites += 1
    n_candidate_variant_sites += 1

    # Both the candidate and truth variants only have 1 alternative allele,
    # so we increment the allele counts by 1 as well.
    n_truth_variant_alleles += 1
    n_candidate_variant_alleles += 1

A similar logic would apply to counting true positives (+1), false negatives (+0), and false positives (+0) for both sites and alleles.

Now let's take a more complex example where candidates and truth differ in their alleles:

    Candidate: chr20:20 with A/C/T
    Truth:     chr20:20 with A/C/G alleles and with genotype (1, 2)

Here we have a candidate and truth at the same position but they can only be partially matched, since the truth variant includes a G allele (i.e., genotype == 2) that isn't even present in the candidate. And the candidate has an extra allele T that isn't real. Matching these variants produces a genotype of (0, 1) for the candidate, since we have one copy of the C allele (i.e., genotype == 1) but we cannot match the true G allele, so we are forced to say one allele is reference (i.e., genotype == 0). Now let's update our counts:

    # We have only a single site, so we +1 to each truth and candidates, even
    # though they both have multiple alt alleles, as this is the sites-level
    # metric.
    n_truth_variant_sites += 1
    n_candidate_variant_sites += 1

    # Both variants have 2 alternative alleles, so we +2 each of the allele
    # counts.
    n_truth_variant_alleles += 2
    n_candidate_variant_alleles += 2

    # Now for the complex TP/FN/FP counts.
    # We found a candidate for this true site, even though we didn't get all
    # of the alleles, so we increment the n_true_positive_sites by 1 and get
    # +0 for each of the FN and FP sites metrics:
    n_true_positive_sites += 1
    n_false_negative_sites += 0
    n_false_positive_sites += 0

    # However, we only got one of the two true positive alleles, so we get
    # one TP allele (the C), one FN allele (the missed G), and one FP allele
    # (the bad T allele in the candidate).
    n_true_positive_alleles += 1
    n_false_negative_alleles += 1
    n_false_positive_alleles += 1
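
The two worked examples above can be reproduced with a short script. This is a simplified sketch, not the labeler's matching algorithm: it assumes the candidate and truth share a position, that every truth alt allele appears in the truth genotype, and that alleles match by exact string equality. The function name `count_metrics` is hypothetical; the counter names follow the fields of this message.

```python
from collections import Counter
from typing import Sequence


def count_metrics(candidate_alts: Sequence[str], truth_alts: Sequence[str]) -> Counter:
    """Count site-level and allele-level metrics for one co-located site."""
    m = Counter()
    # Sites: one record each, regardless of the number of alleles.
    m["n_truth_variant_sites"] += 1
    m["n_candidate_variant_sites"] += 1
    # Alleles: +N where N is the number of alternative alleles.
    m["n_truth_variant_alleles"] += len(truth_alts)
    m["n_candidate_variant_alleles"] += len(candidate_alts)

    matched = set(candidate_alts) & set(truth_alts)
    found_site = bool(matched)  # at least one true alt allele was recovered
    m["n_true_positive_sites"] += int(found_site)
    m["n_false_negative_sites"] += int(not found_site)
    m["n_false_positive_sites"] += int(not found_site)

    m["n_true_positive_alleles"] += len(matched)
    m["n_false_negative_alleles"] += len(set(truth_alts) - matched)
    m["n_false_positive_alleles"] += len(set(candidate_alts) - matched)
    return m


simple = count_metrics(candidate_alts=["C"], truth_alts=["C"])
complex_site = count_metrics(candidate_alts=["C", "T"], truth_alts=["C", "G"])
print(complex_site["n_true_positive_alleles"],
      complex_site["n_false_negative_alleles"],
      complex_site["n_false_positive_alleles"])  # 1 1 1
```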

Used in: MakeExamplesRunInfo

int32 n_truth_variant_sites = 1
Counts of candidate and truth variants by site and by allele. The sites metrics are essentially the number of records seen of each type, after removing non-PASS truth variants, regardless of the number of alleles in the records. The allele metrics are like the sites metrics, but instead of getting a +1 for each record we get +N for each site where N is the number of alternative alleles. See above for more information on the difference between sites and alleles counting.
int32 n_truth_variant_alleles = 2
int32 n_candidate_variant_sites = 3
int32 n_candidate_variant_alleles = 4
int32 n_non_confident_candidate_variant_sites = 5
int32 n_true_positive_sites = 6
TPs: The number of true variants assigned a non-ref genotype in the match for the sites calculation. For the allele calculation, the number of true variant alleles with a gt > 0 in the assignment.
int32 n_true_positive_alleles = 7
int32 n_false_negative_sites = 8
FNs: The number of true variants assigned 0/0 genotypes for the sites calculation. For the allele calculation, the number of true variant alleles without a gt > 0 in the assignment.
int32 n_false_negative_alleles = 9
int32 n_false_positive_sites = 10
FPs: The number of candidate variants assigned a 0/0 genotype for the sites calculation. For the allele calculation, the number of candidate variant alternative alleles without a genotype > 0 in the assignment.
int32 n_false_positive_alleles = 11
int32 n_inexact_position_matches = 12
The number of candidate variants (counted by site) that are assigned a non-reference genotype (e.g., != 0/0) but that don't have an exact position match in truth.
int32 n_exact_position_matches = 13
A count of the number of sites where the candidate and truth variants occur at the same position, with increasingly strict additional matching criteria. These metrics are all computed over sites, not alleles. The number of sites where candidate and truth have the same start position.
int32 n_exact_position_and_allele_matches = 14
Same criteria as above but with the additional requirement that all of the alleles be exactly the same between candidate and truth.
int32 n_exact_position_and_allele_and_genotype_matches = 15
Same criteria as above but with the additional requirement that the matched genotypes be identical as well.
int32 n_truth_multiallelics_sites_with_missed_alleles = 16
Number of truth variants (counted by site) with more than one alternative allele where at least one alternative allele was missed (i.e., was assigned a genotype of 0 in the match).

High-level options that encapsulate all of the parameters needed to run DeepVariant end-to-end. Next ID: 82.

Used in: MakeExamplesRunInfo

repeated string exclude_contigs = 1
A list of contig names we never want to call variants on. For example, chrM in humans is the mitochondrial genome and the caller isn't trained to call variants on that genome.
repeated string calling_regions = 2
List of regions where we want to call variants. If missing, we will call variants throughout the entire genome.
uint32 random_seed = 3
Fixed random seed to use for DeepVariant itself.
int32 n_cores = 4
The number of cores to use when running DeepVariant. Must be >= 1.
optional AlleleCounterOptions allele_counter_options = 5
Options to control how we run the AlleleCounter.
optional VariantCallerOptions deprecated_variant_caller_options = 6
Deprecated. Use sample_options instead.
optional PileupImageOptions pic_options = 7
Options to control how we generate pileup images.
optional VariantLabelerOptions labeler_options = 8
Options to control how we label our examples.
optional nucleus.genomics.v1.ReadRequirements read_requirements = 9
Only reads satisfying these requirements will be used in DeepVariant. These parameters are propagated as appropriate to the read_requirements fields in our tool-specific options.
string reference_filename = 10
Options to control our input data sources and output data sinks. Path to our genome reference.
repeated string deprecated_reads_filenames = 32
Deprecated. Use sample_options instead.
string deprecated_reads_filename = 11
Deprecated.
string candidates_filename = 12
Path where we'll write out our candidate variants.
string examples_filename = 13
Path to examples.
string confident_regions_filename = 14
Path to a list of regions we are confident in, for determining which candidate variants get labels.
string truth_variants_filename = 15
Path to the truth variants, for use in labeling our examples.
string deprecated_proposed_variants_filename = 33
Path to the variants for vcf_candidate_importer.
string gvcf_filename = 16
Path where we should put our gVCF records.
bool include_med_dp = 43
Whether to generate MED_DP in gVCF records or not.
string model_name = 17
The name of the deep learning model to use with DeepVariant.
MakeExamplesOptions.Mode mode = 18
float min_shared_contigs_basepairs = 19
The minimum fraction of basepairs that must be shared by all contigs common to DeepVariant inputs and the reference contigs alone. If the common contigs cover less than min_shared_contigs_basepairs of the reference genome contigs, DeepVariant will signal an error that the input datasets aren't from compatible genomes.
int32 task_id = 20
The task identifier, as an integer, of this task. If we are running with multiple tasks processing the same inputs into sharded outputs, this id should be set to a number from 0 (master) to N - 1 to indicate which of the tasks we are currently processing.
int32 num_shards = 21
When running in sharded output mode (i.e., writing outputs to foo@N), this field captures the number of sharded outputs (i.e., N). When not running in sharded output mode, this field should be 0.
bool realigner_enabled = 22
Whether the realigner should be enabled.
bool joint_realignment = 54
If True, realign reads from all samples together. If False, realign per sample.
optional RealignerOptions realigner_options = 23
Settings for the realigner module.
int32 max_reads_per_partition = 24
The maximum number of reads per partition that we consider before subsequent processing such as sampling and realignment.
int32 max_reads_for_dynamic_bases_per_region = 57
Similar to `max_reads_per_partition`, this field constrains how many reads are kept when downsampling. Even with `max_reads_per_partition`, memory usage can sometimes still be too large, especially when the reads are very long. When this field is set, we multiply it by the length of the region we're in, and then sample reads only up to the point where the number of bases in the region exceeds (max_reads_for_dynamic_bases_per_region * region length).
float deprecated_downsample_fraction = 25
Deprecated. Use sample_options instead.
repeated string exclude_calling_regions = 26
List of regions where we DON'T want to call variants. If missing, no regions will be excluded from calling.
MakeExamplesOptions.LabelerAlgorithm labeler_algorithm = 27
The labeling algorithm we are using in this DeepVariant run. Only needed when in CALLING mode.
string run_info_filename = 28
bool use_original_quality_scores = 29
By default aligned_quality field is read from QUAL in SAM. If flag is set, aligned_quality field is read from OQ tag in SAM.
repeated string select_variant_types = 30
A list of variant types that we want to restrict our examples to. E.g., select_variant_types = ['snps'] would indicate that we only want to generate SNP candidate variants.
MakeExamplesOptions.VariantCaller variant_caller = 31
bool use_allele_frequency = 34
If flag is set, consider allele frequency.
repeated string population_vcf_filenames = 35
A list of VCF or VCF.gz files that specify allele frequency information.
string exclude_variants_vcf_filename = 70
float exclude_variants_af_threshold = 71
string runtime_by_region = 36
Path to output optional runtime profiling by region.
bool use_ref_for_cram = 37
Use --ref argument as the reference file for the CRAM.
bool parse_sam_aux_fields = 38
Parse aux fields from BAM -- needed for some features like using HP tags.
repeated string aux_fields_to_keep = 50
This field is used to pass into SamReaderOptions. By default, this field is empty. If empty, we keep all aux fields if they are parsed. If set, we only keep the aux fields with the names in this list.
int32 hts_block_size = 39
Size of blocks to read from BAM.
int32 logging_every_n_candidates = 40
How often to show log messages.
string customized_classes_labeler_classes_list = 41
string customized_classes_labeler_info_field_name = 42
int32 main_sample_index = 44
The index of the sample to focus on within the list of samples.
string bam_fname = 45
per-sample statistics for training examples.
repeated SampleOptions sample_options = 46
Samples, e.g. a list of 3 for DeepTrio, 2 for DeepSomatic or 1 for DeepVariant.
string sample_role_to_train = 47
Sample role to focus on for training. This can be different from the sample indicated by the main_sample_index.
bool phase_reads = 51
DirectPhasing related options.
int32 phase_reads_region_padding_pct = 52
int32 phase_max_candidates = 53
string read_phases_output = 55
bool discard_non_dna_regions = 56
bool output_sitelist = 58
string denovo_regions_filename = 59
Related to de novo variants labeling
bool deterministic_serialization = 60
Useful for creating deterministic testdata.
string channel_list = 61
bool trim_reads_for_pileup = 62
bool write_small_model_examples = 63
Write small model training examples to .tsv files.
bool skip_pileup_image_generation = 73
When writing small model examples, skip pileup images altogether.
bool call_small_model_examples = 64
Call small model examples directly instead of passing to CNN.
int32 small_model_snp_gq_threshold = 65
GQ threshold for small model SNPs.
int32 small_model_indel_gq_threshold = 78
GQ threshold for small model INDELs.
string trained_small_model_path = 66
Path to the pickled small model.
int32 small_model_inference_batch_size = 76
Batch size to use during inference
int32 small_model_vaf_context_window_size = 77
Small model context window size
bool stream_examples = 67
Stream examples to shared memory buffer instead of writing them to disk.
string shm_prefix = 68
Shared memory objects name prefix.
int32 shm_buffer_size = 69
Size of the shared memory buffer per each shard.
repeated float downsample_classes = 72
Proportion of examples to keep when in training mode.
bool sample_mean_coverage_on_calling_regions = 75
If true, the mean coverage is calculated on the calling regions, rather than on the whole genome. This is useful in the case of WES where the regions that have reads are 2 percent of the genome.
string ref_name_pangenome = 79
The name of the reference genome in the pangenome gbz file. This reference should match the reference used for the reads. This attribute is added since the exact name assigned to the pangenome reference can be different from the name of the reference fasta used for the reads.
string ref_chrom_prefix = 80
The prefix to add to the chromosome name in the pangenome gbz file. It is empty by default. However, sometimes a prefix (like "GRCh38.") must be added to the chromosome name in the pangenome gbz file to match the chromosome name in the reads.
bool output_phase_info = 81
bool use_loaded_gbz_shared_memory = 83
If true, the sequences of the gbz file are already loaded into shared memory and the SamReader reads the sequences from the shared memory.
string gbz_shared_memory_name = 84
The name of the shared memory segment that contains the sequences of the gbz file.

An enumeration of all of the labeler algorithms we support in DeepVariant.

Used in: MakeExamplesOptions

UNSPECIFIED_LABELER_ALGORITHM = 0
POSITIONAL_LABELER = 1
The labeling algorithm used with DeepVariant 0.4-0.5, which does position matching to find truth variant to label our candidates.
HAPLOTYPE_LABELER = 2
A haplotype-aware labeling algorithm, similar to hap.py xcmp, that looks for genotypes for candidate variants that produce haplotypes that match those implied by the genotypes of our truth variants. Produces more accurate labels than the POSITIONAL_LABELER labeling algorithm.
CUSTOMIZED_CLASSES_LABELER = 3
The labeling algorithm that labels variants into customized classes, taken from a specified INFO field in the VCF file.

Used in: MakeExamplesOptions

UNSPECIFIED = 0
CALLING = 1
TRAINING = 2
CANDIDATE_SWEEP = 3

Used in: MakeExamplesOptions

UNSPECIFIED_CALLER = 0
VERY_SENSITIVE_CALLER = 1
The default very sensitive caller.
VCF_CANDIDATE_IMPORTER = 2
An advanced caller that uses an input VCF to call variants.

Configuration and runtime information about a MakeExamples run in DeepVariant. Next ID: 5.

optional MakeExamplesOptions options = 1
optional LabelingMetrics labeling_metrics = 2
optional ResourceMetrics resource_metrics = 3
optional MakeExamplesStats stats = 4

Statistics about MakeExamples. Next ID: 9.

Used in: MakeExamplesRunInfo

int32 num_examples = 1
int32 num_indels = 2
int32 num_snps = 3
int32 num_class_0 = 4
int32 num_class_1 = 5
int32 num_class_2 = 6
int32 num_denovo = 7
int32 num_nondenovo = 8

Options to control how we construct pileup images. Next ID: 41.

Used in: MakeExamplesOptions

int32 height = 1
The height, in pixels, of the pileup image we'll construct.
int32 width = 2
The width, in pixels, of the pileup image we'll construct.
int32 reference_band_height = 3
We include at the top of each image a band of reference pixels with this specified height.
int32 base_color_offset_a_and_g = 4
Controls how bases are encoded as red pixel values:
- A is base_color_offset_a_and_g + base_color_stride * 3
- G is base_color_offset_a_and_g + base_color_stride * 2
- T is base_color_offset_t_and_c + base_color_stride * 1
- C is base_color_offset_t_and_c + base_color_stride * 0
The offset in red color space for A and G bases.
int32 base_color_offset_t_and_c = 5
The offset in red color space for T and C bases.
int32 base_color_stride = 6
Each base color is offset from each other by this stride.
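The red-channel encoding given under base_color_offset_a_and_g can be sketched as below. This is an illustration only: base_to_red is a hypothetical helper, and the numeric option values are placeholders chosen for the example rather than values stated in this reference.

```python
def base_to_red(base, offset_a_and_g, offset_t_and_c, stride):
    """Return the red pixel value for a base, following the documented formulas."""
    table = {
        "A": offset_a_and_g + stride * 3,
        "G": offset_a_and_g + stride * 2,
        "T": offset_t_and_c + stride * 1,
        "C": offset_t_and_c + stride * 0,
    }
    return table[base.upper()]


# Placeholder option values (assumptions made for this example).
red_values = {b: base_to_red(b, offset_a_and_g=40, offset_t_and_c=30, stride=70)
              for b in "AGTC"}
```

With these placeholder values the four bases land on distinct, evenly spaced intensities, which is the purpose of using a stride.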
float reference_alpha = 7
The alpha value applied to pixels in the reference genome band.
int32 reference_base_quality = 8
The base quality we assume for the reference genome bases.
float allele_supporting_read_alpha = 9
The alpha to apply to reads that support our alt alleles.
float other_allele_supporting_read_alpha = 32
The alpha to apply to reads that support the other alt allele.
float allele_unsupporting_read_alpha = 10
The alpha to apply to reads that do not support our alt alleles.
float reference_matching_read_alpha = 11
The alpha to apply to a base that matches the reference sequence.
float reference_mismatching_read_alpha = 12
The alpha to apply to a base that doesn't match the reference sequence.
string indel_anchoring_base_char = 13
The character we'll use when encoding insertion/deletion anchor bases.
int32 positive_strand_color = 14
The color value to use for reads on the positive strand.
int32 negative_strand_color = 15
The color value to use for reads on the negative strand.
int32 base_quality_cap = 16
The maximum base quality we'll allow in PIC. Base qualities above this value are treated as being base_quality_cap.
int32 read_overlap_buffer_bp = 17
Extend read windows by a small amount when calculating overlap with calls. This is important to include all the reads involved in deletions in a pileup image.
optional nucleus.genomics.v1.ReadRequirements read_requirements = 18
The requirements for reads to be used when creating pileup images.
PileupImageOptions.MultiAllelicMode multi_allelic_mode = 19
int32 mapping_quality_cap = 20
The maximum mapping quality we'll allow in PIC. Mapping qualities above this value are treated as being mapping_quality_cap.
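Both base_quality_cap and mapping_quality_cap describe the same clamping rule: values above the cap are treated as the cap. A minimal sketch, assuming qualities are plain integers (cap_quality is a hypothetical helper name):

```python
def cap_quality(quality, cap):
    """Treat any quality above `cap` as being exactly `cap`."""
    return min(quality, cap)


# Example: with a cap of 40, a quality of 60 is treated as 40,
# while a quality of 20 is left unchanged.
capped = [cap_quality(q, 40) for q in (20, 40, 60)]
```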
uint32 random_seed = 21
The random seed to use in our Pileup Image Creation.
int32 num_channels = 22
The number of data channels in our pileup images.
string unused_insert_base_char = 23
(Experimental feature that was removed.) The character we'll use when encoding insertion anchor bases.
string unused_delete_base_char = 24
(Experimental feature that was removed.) The character we'll use when encoding deletion anchor bases.
bool unused_custom_pileup_image = 25
(Experimental feature that was removed.) Include custom pileup image feature.
bool unused_sequencing_type_image = 26
(Experimental feature that was removed.) Include sequencing type image feature.
PileupImageOptions.SequencingType sequencing_type = 27
string alt_aligned_pileup = 30
Whether and how to include alt-aligned pileup images (experimental).
bool sort_by_haplotypes = 31
If set, reads are sorted by haplotype tag (HP tag) and then by alignment position.
bool reverse_haplotypes = 40
float min_non_zero_allele_frequency = 33
Minimal non-zero allele frequency. This is used when normalizing color intensities for the allele frequency channel.
bool use_allele_frequency = 34
Whether to consider allele frequencies.
string types_to_alt_align = 36
Which variant types to use alt-align on.
bool add_hp_channel = 37
If true, add an additional channel where the color information per-read indicates the HP value.
int32 hp_tag_for_assembly_polishing = 38
For assembly polishing, specifies the HP tag we're calling for.
repeated string channels = 39
The set of channels to collect.
int32 sort_by_haplotypes_sample_hp_tag = 35
Deprecated fields.

Used in: PileupImageOptions

UNSPECIFIED = 0
ADD_HET_ALT_IMAGES = 1
NO_HET_ALT_IMAGES = 2

Sequencing type of input bam file.

Used in: PileupImageOptions

UNSPECIFIED_SEQ_TYPE = 0
WGS = 1
WES = 2
TRIO = 3

Used in: MakeExamplesOptions

optional WindowSelectorOptions ws_config = 1
Config parameters for "window selector (ws)" phase.
optional DeBruijnGraphOptions dbg_config = 2
Config parameters for "de-Bruijn graph (dbg)" phase.
optional AlignerOptions aln_config = 3
Config parameters for "alignment (aln)" phase.
optional Diagnostics diagnostics = 4
Diagnostics options.
bool split_skip_reads = 5
Split reads with large SKIP regions (e.g. RNA-seq).
bool normalize_reads = 6
This value should be the same as the one in AlleleCounterOptions, both come from --normalize_reads flag. Realigner might act differently based on whether normalize_reads is set.

This proto encodes basic runtime performance metrics for the execution of a command---the command that was executed, the start/stop times, CPU, memory, and disk utilization, etc.

Information about the runtime environment.

Used in: MakeExamplesRunInfo

string host_name = 1
Fully qualified host name.
int32 physical_core_count = 2
The count of physical CPU cores. The default value (0) indicates that the value couldn't be determined.
double cpu_frequency_mhz = 3
Nominal (maximum) CPU frequency in MHz. The default value (0.0) indicates that the value couldn't be determined.
int32 total_memory_mb = 4
Total physical memory, in megabytes.
double wall_time_seconds = 5
Total wall clock time in seconds.
double cpu_user_time_seconds = 6
CPU time in seconds in user mode. The default value (0.0) indicates that the value couldn't be determined.
double cpu_system_time_seconds = 7
CPU time in seconds in system mode. The default value (0.0) indicates that the value couldn't be determined.
int32 memory_peak_rss_mb = 8
Peak memory usage in megabytes (RSS). The default value (0) indicates that the value couldn't be determined.
int64 read_bytes = 9
See https://psutil.readthedocs.io/en/latest/#psutil.Process.io_counters for more details. The number of bytes read (cumulative).
int64 write_bytes = 10
The number of bytes written (cumulative).

Options that may differ by sample. Next ID: 18.

Used in: MakeExamplesOptions

string role = 6
A string to identify the role of this sample in the analysis, e.g. in trios 'child', 'parent1', or 'parent2'. Importantly, `role` strings should not be checked inside make_examples_core.py. For example, instead of checking whether sample.role == "child" to set pileup height, instead add the pileup height to the sample, adding new properties to this proto if needed. This keeps make_examples_core.py functioning for multiple samples without it having to reason about sample roles that belong to each application. This role is used to keep track of sample identities throughout the analysis, and for debugging.
string name = 7
Sample name, e.g. HG002. Often given by a --sample_name flag or inferred from input files.
repeated string reads_filenames = 1
Paths to files with read alignments, e.g. BAM or CRAM files.
float downsample_fraction = 2
Fraction of reads to keep when downsampling. If 0.0 (default), no downsampling occurs. Otherwise the value must be between 0.0 and 1.0 and gives the probability that a read will be kept (randomly) when read from the input. This option makes it easy to simulate lower-coverage data.
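The semantics of downsample_fraction can be sketched as a per-read coin flip. This is an illustrative sketch, not the DeepVariant implementation: keep_read is a hypothetical helper, and the random generator is passed in explicitly so the behavior is reproducible.

```python
import random


def keep_read(downsample_fraction, rng):
    """Decide whether to keep one read under the documented semantics."""
    if downsample_fraction == 0.0:
        # Default: no downsampling, every read is kept.
        return True
    if not 0.0 < downsample_fraction <= 1.0:
        raise ValueError("downsample_fraction must be between 0.0 and 1.0")
    # Keep the read with probability downsample_fraction.
    return rng.random() < downsample_fraction


# Roughly half of 1000 simulated reads survive at a fraction of 0.5.
rng = random.Random(42)
kept = sum(keep_read(0.5, rng) for _ in range(1000))
```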
optional VariantCallerOptions variant_caller_options = 3
Options for finding candidate variants.
int32 pileup_height = 4
Height of the pileup image for this sample.
repeated int32 order = 5
A list of integers indicating the order in which samples should be shown in the pileup image when calling on this sample. The indices refer to the list of samples in the regionprocessor.
string proposed_variants_filename = 8
Path to the variants for vcf_candidate_importer.
string candidate_positions = 9
Path to binary file containing candidate positions.
bool skip_output_generation = 10
If true, skip any output generation for this sample.
bool keep_only_window_spanning_reads = 12
If True, trim reads to fit in the example window. This is experimental for pangenome integration.
float mean_coverage = 14
The mean number of reads aligned to any given position or base in the genome per sample. Used in CH_MEAN_COVERAGE channel.
repeated DeepVariantChannelEnum channels_enum_to_blank = 13
The channels to blank out in the pileup image for this sample.
repeated SampleOptions.VariantType variant_types_to_blank = 15
bool skip_phasing = 16
If true, skip phasing for this sample.
bool skip_normalization = 17

Blank out all channels for these variant types if set.

Used in: SampleOptions

VARIANT_TYPE_UNSPECIFIED = 0
VARIANT_TYPE_SNP = 1
VARIANT_TYPE_INDEL = 2

Options to control how our candidate VariantCaller works. Next ID: 21

Used in: MakeExamplesOptions, SampleOptions

int32 min_count_snps = 1
Alleles occurring at least this many times in our AlleleCount are considered candidate variants.
int32 min_count_indels = 2
float min_fraction_snps = 3
Alleles whose counts are at least this fraction of all counts in an AlleleCount are considered candidate variants.
float min_fraction_indels = 4
float min_fraction_multiplier = 12
In candidate generation, this multiplier is applied to the minimum allele fraction thresholds (vsc_min_fraction_snps and vsc_min_fraction_indels) to adapt thresholds for multi-sample calling.
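The count and fraction thresholds above can be read as a simple filter on each allele. The sketch below is hedged: it assumes that both thresholds must be satisfied together and that min_fraction_multiplier scales only the fraction threshold; is_candidate is a hypothetical helper, not a DeepVariant function.

```python
def is_candidate(allele_count, total_count, min_count, min_fraction,
                 min_fraction_multiplier=1.0):
    """Return True if an allele passes both the count and fraction thresholds.

    Assumption: the two conditions are combined with AND, and the multiplier
    is applied to the minimum fraction only.
    """
    if total_count <= 0:
        return False
    fraction = allele_count / total_count
    return (allele_count >= min_count
            and fraction >= min_fraction * min_fraction_multiplier)


# 3 supporting reads out of 20 (fraction 0.15) with hypothetical thresholds.
single_sample = is_candidate(3, 20, min_count=2, min_fraction=0.12)
multi_sample = is_candidate(3, 20, min_count=2, min_fraction=0.12,
                            min_fraction_multiplier=2.0)
```

In this example the allele passes with the default multiplier but fails once the fraction threshold is doubled.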
float max_fraction_snps_for_non_target_sample = 16
In candidate generation, this threshold is used to exclude a variant when the allele frequency is above this threshold from a non-target sample. This is designed for the somatic case - where we want to avoid generating a candidate if the AF is high in any of the non-target samples.
float max_fraction_indels_for_non_target_sample = 17
float fraction_reference_sites_to_emit = 5
If provided, we will emit "candidate" variant records at a random fraction of otherwise non-candidate sites. Useful for training.
uint32 random_seed = 6
The random seed to use in our variant caller. If not provided, a truly random seed will be used.
string sample_name = 7
The name of the sample we will put in our VariantCall field of constructed variants.
float p_error = 8
The probability that a non-reference allele is actually an error.
int32 max_gq = 9
The maximum genotype quality we'll emit for a reference site.
int32 gq_resolution = 10
The width of a GQ bin used to quantize the raw double GQ values into coarser-grained bins than just 1 integer unit. See QuantizeGQ for more information about the quantization process.
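To illustrate what quantizing GQ into bins of width gq_resolution means, here is one plausible binning scheme. It is an assumption made for illustration and may differ from the actual QuantizeGQ routine: the raw value is first capped at max_gq and then mapped to the lower edge of its bin.

```python
def quantize_gq(raw_gq, gq_resolution, max_gq):
    """Cap a raw GQ at max_gq, then round down to a multiple of gq_resolution.

    Assumption: bins start at 0 and each value maps to its bin's lower edge.
    """
    capped = min(raw_gq, max_gq)
    return int(capped // gq_resolution) * gq_resolution


# Hypothetical settings: bins of width 5 and a maximum GQ of 50.
binned = [quantize_gq(gq, 5, 50) for gq in (3.2, 37.6, 80.0)]
```

With gq_resolution set to 1 this reduces to truncating the raw value to an integer.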
int32 ploidy = 11
The ploidy of this sample. For humans, this is 2 (diploid). Currently the code makes implicit assumptions that the ploidy is 2, but this value is used in calculations directly involving ploidy so when we generalize the caller to handle other ploidy values we don't have to update all of those constants.
bool skip_uncalled_genotypes = 13
Skip uncalled genotypes. This is used during training so that uncalled ./. genotypes are not used to generate and label examples.
bool track_ref_reads = 14
int32 phase_reads_region_padding_pct = 15
int32 small_model_vaf_context_window_size = 18
Small model context window size
repeated string haploid_contigs = 19
string par_regions_bed = 20

Options to control how we label variant calls.

Currently there are no options for VariantLabeler.

Used in: MakeExamplesOptions

(message has no fields)

Config parameters for the selection of candidate location in the "window selector (ws)" phase.

Used in: WindowSelectorOptions

WindowSelectorModel.ModelType model_type = 1
Window selection algorithm to be used.
oneof model
Configuration associated with the selected algorithm.
- WindowSelectorModel.VariantReadsThresholdModel variant_reads_model = 2
- WindowSelectorModel.AlleleCountLinearModel allele_count_linear_model = 3

Linear model based on the type of reads at each locus.

Used in: WindowSelectorModel

float bias = 1
float coeff_soft_clip = 2
float coeff_substitution = 3
float coeff_insertion = 4
float coeff_deletion = 5
float coeff_reference = 6
float decision_boundary = 7
Threshold for realignment. The higher it is, the lower the recall.
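A linear model over these fields might be evaluated as in the following sketch. The exact feature definitions and the direction of the comparison are assumptions: here each coefficient multiplies a per-locus read count of the matching type, and a locus is selected when the score exceeds decision_boundary. The function names are hypothetical.

```python
def linear_score(counts, coeffs, bias):
    """Compute bias + sum(coeff * count) over the read types in `coeffs`."""
    return bias + sum(coeffs[name] * counts.get(name, 0) for name in coeffs)


def select_for_realignment(counts, coeffs, bias, decision_boundary):
    """Assumption: a locus is selected when its score is above the boundary."""
    return linear_score(counts, coeffs, bias) > decision_boundary


# Hypothetical coefficients and read counts for a single locus.
coeffs = {"soft_clip": 0.0, "substitution": 0.5, "insertion": 1.0,
          "deletion": 1.0, "reference": 0.0}
counts = {"substitution": 4, "insertion": 1, "reference": 20}
score = linear_score(counts, coeffs, bias=-1.0)
```

Raising decision_boundary selects fewer loci, which matches the documented trade-off with recall.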

Two models are currently supported:
- VARIANT_READS: based on the number of SNPs, INDELs and SOFT_CLIPs at a location.
- ALLELE_COUNT_LINEAR: linear model based on the AlleleCount at each location.

Used in: WindowSelectorModel

UNDEFINED = 0
VARIANT_READS = 1
ALLELE_COUNT_LINEAR = 2

Model requiring #reads > min_num_supporting_reads and #reads < max_num_supporting_reads.

Used in: WindowSelectorModel

int32 min_num_supporting_reads = 1
Minimum number of supporting reads to call a reference position for local assembly.
int32 max_num_supporting_reads = 2
Maximum number of supporting reads to call a reference position for local assembly.
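The threshold rule stated for this model (#reads > min_num_supporting_reads and #reads < max_num_supporting_reads) can be written directly. The helper name below is hypothetical, and the threshold values are placeholders.

```python
def passes_read_thresholds(num_supporting_reads, min_num_supporting_reads,
                           max_num_supporting_reads):
    """Return True when the read count lies strictly between both bounds."""
    return (min_num_supporting_reads
            < num_supporting_reads
            < max_num_supporting_reads)


# Placeholder thresholds: more than 2 and fewer than 300 supporting reads.
results = [passes_read_thresholds(n, 2, 300) for n in (2, 3, 299, 300)]
```

Both bounds are exclusive, so a position with exactly the minimum or maximum number of reads is not selected.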

Config parameters for "window selector (ws)" phase. Next ID: 10.

Used in: RealignerOptions

int32 min_num_supporting_reads = 1
Minimum number of supporting reads to call a reference position for local assembly. DEPRECATED: Use VariantReadsThresholdModel.min_num_supporting_reads instead.
int32 max_num_supporting_reads = 2
Maximum number of supporting reads to call a reference position for local assembly. DEPRECATED: Use VariantReadsThresholdModel.max_num_supporting_reads instead.
int32 min_mapq = 3
Minimum read alignment quality to consider in calling a reference position for local assembly.
int32 min_base_quality = 4
Minimum base quality to consider in calling a reference position for local assembly.
int32 min_windows_distance = 5
Minimum distance between candidate windows for local assembly.
int32 max_window_size = 6
Maximum window size to consider for local assembly. Large noisy regions are skipped for realignment.
int32 region_expansion_in_bp = 7
How far to expand the region over which we compute candidate positions. This is needed because we want variants near, but not within, our actual window region to contribute evidence towards our window sites. Larger values allow larger events (e.g., a 50 bp deletion 49 bp away from the region) to contribute. However, larger values also mean greater computational overhead, as we process extra positions that aren't themselves directly used.
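Region expansion of this kind can be sketched as padding both ends of an interval and clipping to the contig. The sketch assumes 0-based, half-open coordinates and a known contig length; expand_region is a hypothetical helper.

```python
def expand_region(start, end, region_expansion_in_bp, contig_length):
    """Pad [start, end) by the expansion on each side, clipped to the contig."""
    new_start = max(0, start - region_expansion_in_bp)
    new_end = min(contig_length, end + region_expansion_in_bp)
    return new_start, new_end


# Hypothetical 250 bp contig with a 20 bp expansion.
inner = expand_region(100, 200, 20, 250)
clipped = expand_region(10, 240, 20, 250)
```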
optional WindowSelectorModel window_selector_model = 8
Config for the '_candidates_from_reads' phase.
bool keep_legacy_behavior = 9
If True, the behavior in this commit is reverted: https://github.com/google/deepvariant/commit/fbde0674639a28cb9e8004c7a01bbe25240c7d46

package learning.genomics.deepvariant

message AlignerOptions

int32 match = 1

int32 mismatch = 2

int32 gap_open = 3

int32 gap_extend = 4

int32 k = 5

float error_rate = 6

int32 read_size = 8

int32 kmer_size = 9

int32 max_num_of_mismatches = 10

double realignment_similarity_threshold = 11

bool force_alignment = 12

message Allele

string bases = 1

AlleleType type = 2

int32 count = 3

bool is_low_quality = 4

int32 mapping_quality = 5

int32 avg_base_quality = 6

bool is_reverse_strand = 7

message AlleleCount

optional nucleus.genomics.v1.Position position = 1

string ref_base = 2

int32 ref_supporting_read_count = 3

map<string, Allele> read_alleles = 4

int32 ref_nonconfident_read_count = 5

map<string, AlleleCount.Alleles> sample_alleles = 6

bool track_ref_reads = 7

message AlleleCount.Alleles

repeated Allele alleles = 1

message AlleleCountSummary

string reference_name = 1

int64 position = 2

string ref_base = 3

int32 ref_supporting_read_count = 4

int32 total_read_count = 5

int32 ref_nonconfident_read_count = 6

message AlleleCounterOptions

int32 partition_size = 1

optional nucleus.genomics.v1.ReadRequirements read_requirements = 2

bool track_ref_reads = 3

bool normalize_reads = 4

bool keep_legacy_behavior = 5

enum AlleleType

UNSPECIFIED = 0

REFERENCE = 1

SUBSTITUTION = 2

INSERTION = 3

DELETION = 4

SOFT_CLIP = 5

message CallVariantsOutput

optional nucleus.genomics.v1.Variant variant = 1

optional CallVariantsOutput.AltAlleleIndices alt_allele_indices = 2

repeated double genotype_probabilities = 3

optional CallVariantsOutput.DebugInfo debug_info = 4

message CallVariantsOutput.AltAlleleIndices

repeated int32 indices = 1

message CallVariantsOutput.DebugInfo

int32 predicted_label = 1

bool has_insertion = 2

bool has_deletion = 3

bool is_snp = 4

int32 true_label = 5

repeated double logits = 6

repeated double prelogits = 7

bytes image_encoded = 8

map<string, bytes> layer_output_encoded = 9

optional DebugInfo.PileupCuration pileup_curation = 10

message CallVariantsOutput.DebugInfo.PileupCuration

int32 diff_category = 1

int32 base_quality = 2

int32 mapping_quality = 3

int32 strand_bias = 4

int32 read_support = 5

message CandidateHaplotypes

optional nucleus.genomics.v1.Range span = 1

repeated string haplotypes = 2

message DeBruijnGraphOptions

int32 min_k = 1