Config parameters for "alignment (aln)" phase.
Used in:
Match score (expected to be a non-negative score).
Mismatch score (expected to be a non-positive score).
Gap open score (expected to be a non-positive score). Score for a gap of length g is (gap_open + (g - 1) * gap_extend).
Gap extend score (expected to be a non-positive score). Score for a gap of length g is (gap_open + (g - 1) * gap_extend).
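The affine gap formula quoted in the two fields above can be checked with a small sketch. The function name and the penalty values are illustrative assumptions, not DeepVariant defaults.

```python
def gap_score(gap_length: int, gap_open: int, gap_extend: int) -> int:
    """Score of a gap of `gap_length` bases: gap_open + (g - 1) * gap_extend."""
    if gap_length < 1:
        raise ValueError("gap_length must be >= 1")
    return gap_open + (gap_length - 1) * gap_extend


# Illustrative penalties only (both non-positive, as the fields expect).
print(gap_score(1, gap_open=-8, gap_extend=-1))  # -8
print(gap_score(4, gap_open=-8, gap_extend=-1))  # -11
```

A gap of length 1 pays only the open penalty; each additional base adds one extend penalty.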
k-mer size used to index the target sequence. TODO: This parameter is not used in fast_pass_aligner. Since we no longer use the Python realigner, this parameter is obsolete.
Estimated sequencing error rate. TODO This parameter is not used in fast_pass_aligner. We need to remove it.
Average read size. This parameter is used to calculate a ssw_alignment_score_threshold_ - the threshold to filter out reads aligned with SSW library. Not all the reads may be the same size. This parameter needs to be set to a value close enough to the average read size.
K-mer size in read index used in Fast Pass Aligner.
Maximum number of mismatches allowed for quick read-to-haplotype alignment.
Similarity threshold used to filter out bad read alignments made with Smith-Waterman alignment. Alignment is discarded if read is aligned to a haplotype with too many mismatches.
Force realignment so the original alignment is never returned, defaulting instead to computing a new SSW alignment against the reference. This is used for alt-aligned pileups where reads are aligned to a new "reference", making the original read alignments invalid.
An Allele observed in some type of NGS read data. Conceptually, an Allele is a sequence of bases that represent a type of change relative to a reference genome sequence, along with a discrete count of the number of times that allele was observed in the NGS data.
Used in:
The string of bases that make up this Allele. Should not be empty. A simple reference allele might have a single base "A", while a complex insertion of the bases "CTG" following that base "A" would have a bases sequence of "ACTG".
The type of this allele.
The number of times this Allele was seen in the NGS data. The count should be >= 0, where 0 indicates that the allele was never observed (which can happen if you want to record that you checked for some allele in the data and never saw any evidence for it).
Set to true if allele contains low quality bases.
An AlleleCount summarizes the NGS data observed at a position in the genome.

An AlleleCount proto is a key intermediate data structure in DeepVariant summarizing the NGS read data covering a site in the genome. It is intended to be relatively simple but keep track of the key pieces of information about the observed reads and their associated alleles at this position so that downstream tools can reconstruct the read coverage, do variant calling, and compute reference confidence. It is conceptually similar to a samtools read pileup (http://samtools.sourceforge.net/pileup.shtml) but without detailed information about bases or their qualities.

The AlleleCount at its core tracks the Alleles observed in reads that overlap this position in the genome. Consider a read that has a base, X, aligned to the position of interest. If X is a non-reference allele, the AlleleCount proto adds a new read name ==> Allele key-value entry to the read_alleles map field. If X is the reference allele, the AlleleCount proto increments either the ref_supporting_read_count or the ref_nonconfident_read_count counter field, depending on read alignment confidence as defined by pipeline-specific parameters.

The complexity here is introduced by following the VCF convention of representing indel and complex substitution alleles as occurring at the preceding base in the genome. So if in fact our base X is followed by a 3 bp insertion of acg, then we would in fact not have a count for X at all but would see an allele Xacg with a count of 1 (or more if other reads have the same allele). The primary contract here is that each aligned base at this site goes into a +1 for exactly one allele.

A concrete example might clarify this logic. Consider the following alignment of two reads to the reference genome:

    Position: 123   4567
    Ref:      ATT---TGCT
    Read1:    ATT---TGCT
    Read2:    ATTCCCTG-T

The 'T' base in position 3 of Read 1 matches the reference and so the AlleleCount's ref_supporting_read_count is incremented. The 'T' base in position 3 of Read 2 also matches the reference, but it is the anchor base preceding the 'CCC' insertion. The 'TCCC' INSERTION Allele is therefore added to the AlleleCount proto for this position.

A deletion occurs in read 2 at position 6 as well, which produces a DELETION allele 'GT' at position 5. Additionally, because only read 1 has a C base at position 6, the AlleleCount at 6 would have no entries in its read_alleles and ref_supporting_read_count would be 1.

Another design choice is that spanning deletions don't count as coverage under bases, so there's an actual drop in coverage under regions of the genome with spanning deletions. This is classically the difference between physical coverage and sequence coverage (https://en.wikipedia.org/wiki/Shotgun_sequencing#Coverage), so it's safe to think of an AlleleCount as representing the sequence coverage of a position, not its physical coverage.

What this means is that it's very straightforward to inspect the reference counters and read_alleles in an AlleleCount and determine the corresponding alleles for a Variant as well as compute the depth of coverage. This enables us to write algorithms to call SNPs, indels, and CNVs, as well as identify regions for assembly, using a series of AlleleCount objects rather than the underlying read data.

An AlleleCount is a lossy transformation of the raw read data. Fundamentally, the digestion of a read into its corresponding AlleleCount components loses some of the contiguity information provided by reads spanning across multiple positions on the genome. None of the specifics of base quality, mapping quality, read names, etc. are preserved. Furthermore, an AlleleCount can be constructed using only a subset of all of the raw reads (e.g., those that pass minimum quality criteria) and even only parts of each read (e.g., if the read contains Ns or low quality bases). The data used to compute an AlleleCount isn't specified as part of the proto, but is left up to the implementation details and runtime parameters of the generating program.

For usability and performance reasons we track reference and alternate allele supporting reads in slightly different ways. The number of reads that confidently carry the reference allele at this position is stored in ref_supporting_read_count. Confidence here means that the read's alignment to the reference is reliable. For background and motivation, see the Base Alignment Quality (BAQ) paper: http://bioinformatics.oxfordjournals.org/content/early/2011/02/13/bioinformatics.btr076.full.pdf

For reads that would have been counted as reference supporting but don't have a reliable alignment we instead tally those in ref_nonconfident_read_count. Finally, reads that don't have the reference allele are stored in a map from the string "fragment_name/read_number" to the allele it supports, which by construction will always have a count of 1. This allows for more detailed downstream analyses of the alt allele containing reads.

Consequently, the total (usable) coverage at this location is:

    coverage =
        // Reads supporting an observed alternate allele.
        sum(read_allele_i.count)  // [also equal to read_alleles_size()]
        // All of the reads confidently asserting reference.
        + ref_supporting_read_count
        // All of the reads supporting ref without a confident alignment.
        + ref_nonconfident_read_count
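The coverage formula above can be sketched in plain Python. The dataclasses below are hypothetical stand-ins for the Allele and AlleleCount protos (only the fields named in this description are modeled), and the read key "read2/1" is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Allele:
    bases: str
    type: str       # e.g. "INSERTION"; a plain string stands in for AlleleType
    count: int = 1  # always 1 for entries stored in read_alleles


@dataclass
class AlleleCount:
    ref_supporting_read_count: int = 0
    ref_nonconfident_read_count: int = 0
    read_alleles: Dict[str, Allele] = field(default_factory=dict)


def total_coverage(ac: AlleleCount) -> int:
    """Usable coverage: alt-supporting reads plus both kinds of ref reads."""
    alt_reads = sum(allele.count for allele in ac.read_alleles.values())
    return alt_reads + ac.ref_supporting_read_count + ac.ref_nonconfident_read_count


# Position 3 of the alignment example: Read1 supports ref, Read2 carries 'TCCC'.
pos3 = AlleleCount(
    ref_supporting_read_count=1,
    read_alleles={"read2/1": Allele(bases="TCCC", type="INSERTION")},
)
# Position 6: only Read1 covers the site, since Read2 has a spanning deletion.
pos6 = AlleleCount(ref_supporting_read_count=1)

print(total_coverage(pos3))  # 2
print(total_coverage(pos6))  # 1
```

The drop from 2 to 1 at position 6 reflects the design choice that spanning deletions do not count as coverage.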
The position on the genome of this AlleleCount.
The reference bases of this AlleleCount. Since AlleleCount currently only represents a single location, this field should always be a single base.
The number of reads that confidently carry the reference allele at this position.
A map from a read's key to an Allele message containing information about the allele supported by that read at this position. There will be one binding for each usable read spanning this position that supports a non-reference allele. The read's key is a unique string that identifies the read, currently "fragment_name/read_number".
A count of the number of reads that supported the reference allele but whose alignment to the reference genome isn't 100% certain.
If true, reads supporting ref are tracked (read IDs are saved).
A map where key is a sample name and value is a list of alt alleles that are supported by reads from this sample.
Used in:
A lighter-weight version of AlleleCount. The only material difference with this proto is that we don't store the map from read names to Alleles, but instead have the total number of reads we've seen at this position.
The Position field of AlleleCount with values inlined here.
Same as in AlleleCount.
Same as in AlleleCount.
This is the total number of reads observed at position.
Same as in AlleleCount.
Options to control how our AlleleCounter code works.
Used in:
The number of basepairs to include in each partition of the reference genome. This determines how many map/reduce jobs are used to compute the AlleleCounts. Using too small a value (below 10000, for example) produces a very large number of intervals to process, which may be a performance problem for the tool. Using too large a value makes the computation hard to parallelize, as there will be too few work units and each unit will use a lot of memory.
The requirements for reads to be used when counting alleles.
Determines how the allele counter keeps track of ref reads. If True, allele_counter stores the read IDs of ref reads; otherwise just a counter is used for ref reads. Default value is False.
Option to left align INDELs for each read.
If True, the behavior in this commit is reverted: https://github.com/google/deepvariant/commit/fbde0674639a28cb9e8004c7a01bbe25240c7d46
The type of an Allele. An allele type indicates what kind of event would have produced this allele. An allele can be the reference sequence, a substitution of bases, insertion of bases, or deletion of bases. Allele types need not be real genetic variants: for example, the SOFT_CLIP type indicates that a read contained bases SOFT_CLIPPED away (similar to an insertion), which is often indicative of some large event near the start or end of the read.
Used in:
Default should be unspecified: https://docs.google.com/document/d/1oavZD9XB_147ti93MCBoR5HrFKoBh1xkZcTxjInYf0M/edit#heading=h.8ylxmf942vui
The allele corresponding to that found in the genome sequence.
A substitution of bases that differ from the genome sequence.
An insertion of bases w.r.t. the reference genome.
A deletion of bases w.r.t. the reference genome.
An allele type produced by a SOFT_CLIP operation during alignment. May be indicative of a real genetic event occurring at this position, or may be a data quality / alignment artifact.
Variant call for a single site, in a pseudo-biallelic manner. This is an intermediate format for call_variants.py that needs to be merged if there are multiallelics. The `variant` here likely doesn't have fully filled information for output to a VCF file yet.
The alt allele indices are represented as a sub-message so that it's easier to re-use as a standalone proto for encoding+decoding.
Used in:
Next ID: 11
Used in:
The encoded image used for inference.
Key-value pairs of layer names of call variant models and the encoded layers' outputs.
Used in:
The following enums are defined in nucleus/util/vis.py.
Encapsulates a list of candidate haplotype sequences for a genomic region.
The genomic region containing the candidate haplotypes.
The list of candidate haplotype sequences. Each individual haplotype is represented by its nucleotide sequence.
Config parameters for "de-Bruijn graph (dbg)" phase.
Used in:
Initial k-mer size to build the graph.
Maximum k-mer size. Larger k-mer size is used to resolve graph cycles.
Increment size for k to try in resolving graph cycles.
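How the three k-mer fields above interact can be sketched as a search loop: start at the initial k, and step up until the graph has no cycle or the maximum is exceeded. This is a minimal sketch, not the DeepVariant implementation: it builds the graph from a single sequence only, and the names `min_k`, `max_k`, and `step_k` are assumed for illustration.

```python
def kmer_edges(seq, k):
    """Map each k-mer to the set of k-mers that directly follow it in `seq`."""
    edges = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        edges.setdefault(kmer, set())
        if i + k < len(seq):
            edges[kmer].add(seq[i + 1:i + k + 1])
    return edges


def has_cycle(edges):
    """Iterative depth-first search with white/gray/black coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}
    for start in edges:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(edges[start]))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:
                color[node] = BLACK  # all successors explored
                stack.pop()
            elif color[nxt] == GRAY:
                return True  # back edge to a node on the current path
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(edges[nxt])))
    return False


def choose_k(seq, min_k, max_k, step_k):
    """Smallest k in [min_k, max_k] (stepping by step_k) giving an acyclic graph."""
    for k in range(min_k, max_k + 1, step_k):
        if not has_cycle(kmer_edges(seq, k)):
            return k
    return None  # no acyclic graph within the allowed range


print(choose_k("ACGTACGGA", min_k=3, max_k=7, step_k=1))  # 4
```

With k = 3 the 3-mer ACG occurs twice in the example sequence, which closes a cycle; at k = 4 every k-mer is unique and the graph is a simple path.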
Minimum read alignment quality to consider in building the graph.
Minimum base quality in a k-mer sequence to consider in building the graph.
Minimum number of supporting reads to keep an edge.
Maximum number of paths within a graph to consider for realignment. Set max_num_paths to 0 to have unlimited number of paths.
A message encapsulating all of the information about a Variant call site for consumption by further stages of the DeepVariant data processing workflow.
A Variant call based on the information in allele_count. Will always be a non-reference variant call (no gVCF or reference records).
List of Read keys supporting the ref allele.
This is to replace allele_support.
Map of VAF at a given +/- position relative to the variant.
Used in:
A map from an alt allele in Variant to the Read keys that support that allele. Every alt allele in the variant will have an entry. Reference supporting reads aren't listed. There may be a special key "UNCALLED_ALLELE" for reads that don't support either the reference allele or any alt allele in the variant. This can happen when the read supports an allele that didn't pass our calling thresholds. The read's key is a unique string that identifies the read, constructed as "fragment_read/read_number".
Used in:
A map from an alt allele in Variant to a ReadSupport structure. This structure is meant to replace SupportingReads, but the old one is kept for backward compatibility.
Used in:
Next ID: 23.
Used in:
Default should be unspecified.
6 channels that exist in all DeepVariant production models.
"Improving Variant Calling using Haplotype Information" https://google.github.io/deepvariant/posts/2021-02-08-the-haplotype-channel/
"Improving variant calling using population data and deep learning" https://doi.org/10.1101/2021.01.06.425550
Two extra channels for diff_channels:
Two extra channels for base_channels:
The following channels correspond to the "Opt Channels" defined in deepvariant/pileup_channel_lib.h:
Config describing the information needed for a dataset that can be used for training, validation, or testing.
A human-readable name of the dataset.
Full path of the tensorflow.Example TFRecord file.
Number of examples for this dataset. Right now this needs to be filled in manually. It is used to compute how the learning rate decays, and is also used in make_training_batches.
Config parameters for diagnostic outputs.
Used in:
Enable runtime diagnostic outputs.
The root where we'll put our diagnostic outputs.
True if we should also emit the realigned reads themselves.
Metrics on the labeling of candidate / truth variants when running DeepVariant's make_examples in training mode. Next ID: 17.
Notes on counting by site or by allele:

Throughout this proto we often measure the same quantity (e.g., false positives) by site and by alleles. This reflects two different ways of counting errors in genomes where the number of chromosomes > 1. We can give a concrete example:

    Candidate: chr20:10 with A/C
    Truth:     chr20:10 with A/C alleles and with genotype (0, 1)

Since we have the same variant with the same alleles at the same position in both the candidates and the truth, the matching is trivial. For this variant, we'd update our counts as follows:

    # We have only a single site, so we +1 to each truth and candidates.
    n_truth_variant_sites += 1
    n_candidate_variant_sites += 1
    # Both the candidate and truth variants only have 1 alternative allele,
    # so we increment the allele counts by 1 as well.
    n_truth_variant_alleles += 1
    n_candidate_variant_alleles += 1

A similar logic would apply to counting true positives (+1), false negatives (+0), and false positives (+0) for both sites and alleles.

Now let's take a more complex example where candidates and truth differ in their alleles:

    Candidate: chr20:20 with A/C/T
    Truth:     chr20:20 with A/C/G alleles and with genotype (1, 2)

Here we have a candidate and truth at the same position but they can only be partially matched, since the truth variant includes a G allele (e.g., genotype == 2) that isn't even present in the candidate. And the candidate has an extra allele T that isn't real. Matching these variants produces a genotype of (0, 1) for the candidate, since we have one copy of the C allele (e.g., genotype == 1) but we cannot match the true G allele, so we are forced to say one allele is reference (e.g., genotype == 0). Now let's update our counts:

    # We have only a single site, so we +1 to each truth and candidates, even
    # though they both have multiple alt alleles, as this is the sites-level
    # metric.
    n_truth_variant_sites += 1
    n_candidate_variant_sites += 1
    # Both variants have 2 alternative alleles, so we +2 to each of the allele
    # counts.
    n_truth_variant_alleles += 2
    n_candidate_variant_alleles += 2
    # Now for the complex TP/FN/FP counts.
    # We found a candidate for this true site, even though we didn't get all
    # of the alleles, so we increment the n_true_positive_sites by 1 and get
    # +0 for each of the FN and FP sites metrics:
    n_true_positive_sites += 1
    n_false_negative_sites += 0
    n_false_positive_sites += 0
    # However, we only got one of the two true positive alleles, so we get
    # one TP allele (the C), one FN allele (the missed G), and one FP allele
    # (the bad T allele in the candidate).
    n_true_positive_alleles += 1
    n_false_negative_alleles += 1
    n_false_positive_alleles += 1
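The two worked examples above can be reproduced with a short sketch. It assumes the candidate and truth variants are already paired by position and matches alleles by exact string equality, which is a simplification of what a real labeler does; the function name `label_counts` is hypothetical.

```python
def label_counts(ref, candidate_alts, truth_alts, truth_genotype):
    """Site- and allele-level counts for one position-matched candidate/truth pair."""
    candidate_alleles = [ref] + list(candidate_alts)
    truth_alleles = [ref] + list(truth_alts)
    # Translate each truth genotype index into a candidate allele index,
    # falling back to 0 (reference) when the candidate lacks that allele.
    matched = sorted(
        candidate_alleles.index(truth_alleles[g])
        if truth_alleles[g] in candidate_alleles else 0
        for g in truth_genotype
    )
    called = {g for g in matched if g > 0}  # candidate alts given a gt > 0
    site_hit = bool(called)
    return {
        "matched_genotype": tuple(matched),
        "n_truth_variant_sites": 1,
        "n_candidate_variant_sites": 1,
        "n_truth_variant_alleles": len(truth_alts),
        "n_candidate_variant_alleles": len(candidate_alts),
        "n_true_positive_sites": int(site_hit),
        "n_false_negative_sites": int(not site_hit),
        "n_false_positive_sites": int(not site_hit),
        "n_true_positive_alleles": len(called),
        "n_false_negative_alleles": len(truth_alts) - len(called),
        "n_false_positive_alleles": len(candidate_alts) - len(called),
    }


simple = label_counts("A", ["C"], ["C"], (0, 1))
partial = label_counts("A", ["C", "T"], ["C", "G"], (1, 2))
print(partial["matched_genotype"])  # (0, 1)
```

For the chr20:20 example this yields one TP site, and one TP, one FN, and one FP allele, matching the counts listed above.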
Used in:
Counts of candidate and truth variants by site and by allele. The sites metrics are essentially the number of records seen of each type, after removing non-PASS truth variants, regardless of the number of alleles in the records. The allele metrics are like the sites metrics, but instead of getting a +1 for each record we get +N for each site where N is the number of alternative alleles. See above for more information on the difference between sites and alleles counting.
TPs: The number of true variants assigned a non-ref genotype in the match for the sites calculation. For the allele calculation, the number of true variant alleles with a gt > 0 in the assignment.
FNs: The number of true variants assigned 0/0 genotypes for the sites calculation. For the allele calculation, the number of true variant alleles without a gt > 0 in the assignment.
FPs: The number of candidate variants assigned a 0/0 genotype for the sites calculation. For the allele calculation, the number of candidate variant alternative alleles without a genotype > 0 in the assignment.
The number of candidate variants (counted by site) that are assigned a non-reference genotype (e.g., != 0/0) but that don't have an exact position match in truth.
A count of the number of sites where the candidate and truth variants occur at the same position, with increasingly strict additional matching criteria. These metrics are all computed over sites, not alleles.
The number of sites where candidate and truth have the same start position.
Same criteria as above but with the additional requirement that all of the alleles be exactly the same between candidate and truth.
Same criteria as above but with the additional requirement that the matched genotypes be identical as well.
Number of truth variants (counted by site) with more than one alternative allele where at least one alternative allele was missed (i.e., was assigned a genotype of 0 in the match).
High-level options that encapsulate all of the parameters needed to run DeepVariant end-to-end. Next ID: 82.
Used in:
A list of contig names we never want to call variants on. For example, chrM in humans is the mitochondrial genome and the caller isn't trained to call variants on that genome.
List of regions where we want to call variants. If missing, we will call variants throughout the entire genome.
Fixed random seed to use for DeepVariant itself.
The number of cores to use when running DeepVariant. Must be >= 1.
Options to control how we run the AlleleCounter.
Deprecated. Use sample_options instead.
Options to control how we generate pileup images.
Options to control how we label our examples.
Only reads satisfying these requirements will be used in DeepVariant. These parameters are propagated as appropriate to read_requirement fields in our tool-specific options.
Options to control our input data sources and output data sinks.
Path to our genome reference.
Deprecated. Use sample_options instead.
Deprecated.
Path where we'll write out our candidate variants.
Path to examples.
Path to a list of regions we are confident in, for determining which candidate variants get labels.
Path to the truth variants, for use in labeling our examples.
Path to the variants for vcf_candidate_importer.
Path where we should put our gVCF records.
Whether to generate MED_DP in gVCF records or not.
The name of the deep learning model to use with DeepVariant.
The minimum fraction of basepairs that must be shared by all contigs common to DeepVariant inputs and the reference contigs alone. If the common contigs cover less than min_shared_contig_basepairs of the reference genome contigs DeepVariant will signal an error that the input datasets aren't from compatible genomes.
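A minimal sketch of the compatibility check described above, assuming the fraction is measured against the total length of the reference contigs. The function name, the contig lengths, and the 0.9 threshold are illustrative assumptions only.

```python
def shared_basepair_fraction(ref_contigs, input_contig_names):
    """Fraction of reference basepairs on contigs present in every input.

    ref_contigs: dict mapping contig name -> length in basepairs.
    input_contig_names: iterable of contig-name collections, one per input.
    """
    common = set(ref_contigs)
    for names in input_contig_names:
        common &= set(names)
    total = sum(ref_contigs.values())
    return sum(ref_contigs[name] for name in common) / total


# Toy lengths, not real chromosome sizes.
ref = {"chr1": 100, "chr2": 50, "chrM": 10}
inputs = [{"chr1", "chr2"}, {"chr1", "chr2", "chrM"}]

fraction = shared_basepair_fraction(ref, inputs)
print(fraction)  # 0.9375

min_shared_contig_basepairs = 0.9  # hypothetical threshold
if fraction < min_shared_contig_basepairs:
    raise ValueError("Input datasets do not appear to come from compatible genomes")
```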
The task identifier, as an integer, of this task. If we are running with multiple tasks processing the same inputs into sharded outputs, this id should be set to a number from 0 (master) to N - 1 to indicate which of the tasks we are currently processing.
When running in sharded output mode (i.e., writing outputs to foo@N), this field captures the number of sharded outputs (i.e., N). When not running in sharded output mode, this field should be 0.
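As an illustration of the foo@N notation, the sketch below expands a sharded spec into per-shard paths. The `-00000-of-0000N` naming pattern is an assumption based on a common sharded-file convention, and `sharded_paths` is a hypothetical helper, not a DeepVariant function.

```python
def sharded_paths(spec):
    """Expand 'foo@N' into N per-shard paths; return [spec] when unsharded."""
    base, sep, count = spec.rpartition("@")
    if not sep or not count.isdigit():
        return [spec]
    n = int(count)
    # Task i (0 <= i <= N - 1) would write the i-th path.
    return [f"{base}-{i:05d}-of-{n:05d}" for i in range(n)]


print(sharded_paths("foo@3"))
# ['foo-00000-of-00003', 'foo-00001-of-00003', 'foo-00002-of-00003']
print(sharded_paths("foo"))
# ['foo']
```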
Whether the realigner should be enabled.
If True, realign reads from all samples together. If False, realign per sample.
Settings for the realigner module.
The maximum number of reads per partition that we consider before subsequent processing such as sampling and realignment.
Similar to `max_reads_per_partition`, this field constrains the number of reads kept when downsampling. Even with `max_reads_per_partition`, the memory usage can sometimes still be too large, especially when the reads are very long. When this field is set, we multiply it by the length of the region we're in, and then sample reads only up to the point where the number of bases in the region exceeds (max_reads_for_dynamic_bases_per_region * region length).
Deprecated. Use sample_options instead.
List of regions where we DON'T want to call variants. If missing, no regions will be excluded from calling.
The labeling algorithm we are using in this DeepVariant run. Only needed when in CALLING mode.
By default aligned_quality field is read from QUAL in SAM. If flag is set, aligned_quality field is read from OQ tag in SAM.
A list of variant types that we want to restrict our examples to. E.g., select_variant_types = ['snps'] would indicate that we only want to generate SNP candidate variants.
If flag is set, consider allele frequency.
A list of VCF or VCF.gz files that specify allele frequency information.
Path to output optional runtime profiling by region.
Use --ref argument as the reference file for the CRAM.
Parse aux fields from BAM -- needed for some features like using HP tags.
This field is used to pass into SamReaderOptions. By default, this field is empty. If empty, we keep all aux fields if they are parsed. If set, we only keep the aux fields with the names in this list.
Size of blocks to read from BAM.
How often to show log messages.
The index of the sample to focus on within the list of samples.
Per-sample statistics for training examples.
Samples, e.g. a list of 3 for DeepTrio, 2 for DeepSomatic or 1 for DeepVariant.
Sample role to focus on for training. This can be different from the sample indicated by the main_sample_index.
DirectPhasing related options.
Related to de novo variants labeling
Useful for creating deterministic testdata.
Write small model training examples to .tsv files.
When writing small model examples, skip pileup images altogether.
Call small model examples directly instead of passing to CNN.
GQ threshold for small model SNPs.
GQ threshold for small model INDELs.
Path to the pickled small model.
Batch size to use during inference
Small model context window size
Stream examples to shared memory buffer instead of writing them to disk.
Shared memory objects name prefix.
Size of the shared memory buffer per each shard.
Proportion of examples to keep when in training mode.
If true, the mean coverage is calculated on the calling regions, rather than on the whole genome. This is useful in the case of WES where the regions that have reads are 2 percent of the genome.
The name of the reference genome in the pangenome gbz file. This reference should match the reference used for the reads. This attribute is added since the exact name assigned to the pangenome reference can be different from the name of the reference fasta used for the reads.
The prefix to add to the chromosome name in the pangenome gbz file. It is empty by default. However sometimes we need to add a prefix (like "GRCh38.") to the chromosome name in the pangenome gbz file to match the chromosome name in the reads.
If true, the sequences of the gbz file are already loaded into shared memory and the SamReader reads the sequences from the shared memory.
The name of the shared memory segment that contains the sequences of the gbz file.
An enumeration of all of the labeler algorithms we support in DeepVariant.
Used in:
The labeling algorithm used with DeepVariant 0.4-0.5, which does position matching to find truth variant to label our candidates.
A haplotype-aware labeling algorithm, similar to hap.py xcmp, that looks for genotypes for candidate variants that produce haplotypes that match those implied by the genotypes of our truth variants. Produces more accurate labels than the POSITIONAL_LABELER labeling algorithm.
The labeling algorithm that labels the variants into customized classes defined by a given INFO field in the VCF file.
Used in:
Used in:
The default very sensitive caller.
An advanced caller that uses an input VCF to call variants.
Configuration and runtime information about a MakeExamples run in DeepVariant. Next ID: 5.
Statistics about MakeExamples. Next ID: 9.
Used in:
Options to control how we construct pileup images. Next ID: 41.
Used in:
The height, in pixels, of the pileup image we'll construct.
The width, in pixels, of the pileup image we'll construct.
We include at the top of each image a band of reference pixels with this specified height.
Controls how bases are encoded as red pixel values:
    A is base_color_offset_a_and_g + base_color_stride * 3
    G is base_color_offset_a_and_g + base_color_stride * 2
    T is base_color_offset_t_and_c + base_color_stride * 1
    C is base_color_offset_t_and_c + base_color_stride * 0
The offset in red color space for A and G bases.
The offset in red color space for T and C bases.
Each base color is offset from each other by this stride.
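The base-to-red mapping defined by the two offsets and the stride can be written out as a small sketch. The numeric values passed in below are illustrative, and returning 0 for any other character is an assumption of this sketch.

```python
def base_to_red(base, offset_a_and_g, offset_t_and_c, stride):
    """Red-channel value for a base, following the four formulas above."""
    table = {
        "A": offset_a_and_g + stride * 3,
        "G": offset_a_and_g + stride * 2,
        "T": offset_t_and_c + stride * 1,
        "C": offset_t_and_c + stride * 0,
    }
    # Assumption: bases other than A, C, G, T (e.g. N) map to 0.
    return table.get(base.upper(), 0)


# Illustrative parameter values.
params = dict(offset_a_and_g=40, offset_t_and_c=30, stride=70)
print({b: base_to_red(b, **params) for b in "AGTC"})
# {'A': 250, 'G': 180, 'T': 100, 'C': 30}
```

With these values the four bases are spread across the 0-255 range, so each base gets a clearly distinct intensity.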
The alpha value applied to pixels in the reference genome band.
The base quality we assume for the reference genome bases.
The alpha to apply to reads that support our alt alleles.
The alpha to apply to reads that support the other alt allele.
The alpha to apply to reads that do not support our alt alleles.
The alpha to apply to a base that matches the reference sequence.
The alpha to apply to a base that doesn't match the reference sequence.
The character we'll use when encoding insertion/deletion anchor bases.
The color value to use for reads on the positive strand.
The color value to use for reads on the negative strand.
The maximum base quality we'll allow in PIC. Base qualities above this value are treated as being base_quality_cap.
Extend read windows by a small amount when calculating overlap with calls. This is important to include all the reads involved in deletions in a pileup image.
The requirements for reads to be used when creating pileup images.
The maximum mapping quality we'll allow in PIC. Mapping qualities above this value are treated as being mapping_quality_cap.
The random seed to use in our Pileup Image Creation.
The number of data channels in our pileup images.
(Experimental feature that was removed.) The character we'll use when encoding insertion anchor bases.
(Experimental feature that was removed.) The character we'll use when encoding deletion anchor bases.
(Experimental feature that was removed.) Include custom pileup image feature.
(Experimental feature that was removed.) Include sequencing type image feature.
Whether and how to include alt-aligned pileup images (experimental).
If set, reads are sorted by haplotype tag (HP tag) and then by alignment position.
Minimal non-zero allele frequency. This is used when normalizing color intensities for the allele frequency channel.
Whether to consider allele frequencies.
Which variant types to use alt-align on.
If true, add an additional channel where the color information per-read indicates the HP value.
For assembly polishing, specifies the HP tag we're calling for.
The set of channels to collect
Deprecated fields.
Used in:
Sequencing type of input bam file.
Used in:
Used in:
Config parameters for "window selector (ws)" phase.
Config parameters for "de-Bruijn graph (dbg)" phase.
Config parameters for "alignment (aln)" phase.
Diagnostics options.
Split reads with large SKIP regions (e.g., RNA-seq).
This value should be the same as the one in AlleleCounterOptions; both come from the --normalize_reads flag. The realigner might act differently based on whether normalize_reads is set.
This proto encodes basic runtime performance metrics for the execution of a command: the command that was executed, the start/stop times, CPU, memory, and disk utilization, etc.
Information about the runtime environment.
Used in:
Fully qualified host name.
The count of physical CPU cores. The default value (0) indicates that the value couldn't be determined.
Nominal (maximum) CPU frequency in MHz. The default value (0.0) indicates that the value couldn't be determined.
Total physical memory, in megabytes.
Total wall clock time in seconds.
CPU time in seconds in user mode. The default value (0.0) indicates that the value couldn't be determined.
CPU time in seconds in system mode. The default value (0.0) indicates that the value couldn't be determined.
Peak memory usage in megabytes (RSS). The default value (0.0) indicates that the value couldn't be determined.
See https://psutil.readthedocs.io/en/latest/#psutil.Process.io_counters for more details. The number of bytes read (cumulative).
The number of bytes written (cumulative).
Options that may differ by sample. Next ID: 18.
Used in:
A string to identify the role of this sample in the analysis, e.g. in trios 'child', 'parent1', or 'parent2'. Importantly, `role` strings should not be checked inside make_examples_core.py. For example, rather than checking whether sample.role == "child" to set pileup height, add the pileup height to the sample, adding new properties to this proto if needed. This keeps make_examples_core.py functioning for multiple samples without it having to reason about sample roles that belong to each application. This role is used to keep track of sample identities throughout the analysis, and for debugging.
Sample name, e.g. HG002. Often given by a --sample_name flag or inferred from input files.
Paths to files with read alignments, e.g. BAM or CRAM files.
Should we downsample our reads and if so, by how much? If == 0.0 (default), no downsampling occurs. But if set, must be between 0.0 and 1.0 and indicates the probability that a read will be kept (randomly) when read from the input. This option makes it easy to simulate lower coverage data.
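The keep-probability semantics above can be sketched as follows. The function name and the optional `seed` argument are assumptions for illustration; the special case for 0.0 mirrors the "no downsampling" default described in the field.

```python
import random


def downsample(reads, fraction, seed=None):
    """Keep each read independently with probability `fraction`.

    A fraction of 0.0 means downsampling is disabled and all reads are kept.
    """
    if fraction == 0.0:
        return list(reads)
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


reads = [f"read{i}" for i in range(1000)]
kept = downsample(reads, fraction=0.25, seed=42)
print(len(kept))  # roughly a quarter of the reads; exact count depends on the seed
```

Because each read is kept independently, the retained count fluctuates around fraction * len(reads) rather than hitting it exactly.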
Options for finding candidate variants.
Height of the pileup image for this sample.
A list of integers indicating the order in which samples should be shown in the pileup image when calling on this sample. The indices refer to the list of samples in the regionprocessor.
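As a small illustration of how such an index list could be applied, the sketch below reorders a list of samples. The helper `order_samples` and the sample labels are hypothetical; the real region processor may apply the order differently.

```python
def order_samples(samples, order):
    """Return `samples` rearranged by the indices in `order` (hypothetical helper)."""
    return [samples[index] for index in order]


# Hypothetical trio: show parent1 first, then the child, then parent2.
trio = ["child", "parent1", "parent2"]
ordered = order_samples(trio, [1, 0, 2])
```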
Path to the variants for vcf_candidate_importer.
Path to binary file containing candidate positions.
If true, skip any output generation for this sample.
If true, trim reads to fit in the example window. This is experimental for pangenome integration.
The mean number of reads aligned to any given position or base in the genome per sample. Used in CH_MEAN_COVERAGE channel.
The channels to blank out in the pileup image for this sample.
If true, skip phasing for this sample.
Blank out all channels for these variant types if set.
Used in:
Options to control how our candidate VariantCaller works. Next ID: 21
Used in:
Alleles occurring at least this many times in our AlleleCount are considered candidate variants.
Alleles that have counts at least this fraction of all counts in an AlleleCount are considered candidate variants.
In candidate generation, this multiplier is applied to the minimum allele fraction thresholds (vsc_min_fraction_snps and vsc_min_fraction_indels) to adapt thresholds for multi-sample calling.
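The three thresholds above (minimum count, minimum fraction, and the fraction multiplier) could combine as in the sketch below. The function and parameter names are hypothetical, and requiring both the count test and the fraction test to pass is an assumption; the field descriptions do not say how the tests are combined.

```python
def is_candidate(allele_count, total_count, min_count, min_fraction, multiplier=1.0):
    """Hypothetical check of an allele against count and fraction thresholds."""
    if total_count <= 0:
        return False
    # Assumption: the multiplier scales the minimum fraction threshold.
    effective_min_fraction = min_fraction * multiplier
    enough_reads = allele_count >= min_count
    enough_fraction = allele_count / total_count >= effective_min_fraction
    # Assumption: both tests must pass for the allele to be a candidate.
    return enough_reads and enough_fraction
```

With 3 supporting reads out of 20, a minimum count of 2 and a minimum fraction of 0.12, the allele passes. With a multiplier of 2.0 the effective fraction becomes 0.24 and the same allele is rejected.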
In candidate generation, a variant is excluded when its allele frequency in a non-target sample is above this threshold. This is designed for the somatic case, where we want to avoid generating a candidate if the AF is high in any of the non-target samples.
If provided, we will emit "candidate" variant records at a random fraction of otherwise non-candidate sites. Useful for training.
The random seed to use in our variant caller. If not provided, a truly random seed will be used.
The name of the sample we will put in our VariantCall field of constructed variants.
The probability that a non-reference allele is actually an error.
The maximum genotype quality we'll emit for a reference site.
The width of a GQ bin used to quantize the raw double GQ values into coarser-grained bins than just 1 integer unit. See QuantizeGQ for more information about the quantization process.
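To make the idea of quantizing GQ into bins concrete, here is one plausible binning scheme. It is only a sketch: the function name, the truncation of the raw value to an integer, and the choice to report the lowest GQ of each bin are assumptions, and the actual QuantizeGQ routine may differ.

```python
def quantize_gq(raw_gq, bin_size):
    """Map a raw GQ to the lowest value of its bin (hypothetical scheme)."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    if raw_gq < 1:
        return 0
    # Bins start at GQ 1, so GQ values 1..bin_size share the first bin.
    bin_index = (int(raw_gq) - 1) // bin_size
    return bin_index * bin_size + 1
```

Under this scheme a bin width of 5 maps raw values 1 through 5 to 1 and 6 through 10 to 6, while a bin width of 1 leaves integer GQ values unchanged.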
The ploidy of this sample. For humans, this is 2 (diploid). Currently the code makes implicit assumptions that the ploidy is 2, but this value is used in calculations directly involving ploidy so when we generalize the caller to handle other ploidy values we don't have to update all of those constants.
Skip uncalled genotypes. This is used during training so that uncalled ./. genotypes are not used to generate and label examples.
Small model context window size.
Options to control how we label variant calls.
Currently there are no options for VariantLabeler.
Used in:
(message has no fields)
Config parameters for the selection of candidate location in the "window selector (ws)" phase.
Used in:
Window selection algorithm to be used.
Configuration associated with the selected algorithm.
Linear model based on the type of reads at each locus.
Used in:
Threshold for realignment. The higher it is, the lower the recall.
Two models are currently supported:
- VARIANT_READS: based on the number of SNPs, INDELs and SOFT_CLIPs at a location.
- ALLELE_COUNT_LINEAR: linear model based on the AlleleCount at each location.
Used in:
Model requiring #reads > min_num_supporting_reads and #reads < max_num_supporting_reads.
Used in:
Minimum number of supporting reads to call a reference position for local assembly.
Maximum number of supporting reads to call a reference position for local assembly.
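The threshold model above reduces to a strict range check. The sketch below follows the stated condition (#reads > min and #reads < max). The function name is hypothetical and the example values are illustrative only.

```python
def passes_read_thresholds(num_reads, min_num_supporting_reads, max_num_supporting_reads):
    """Return True when the read count lies strictly between the two thresholds."""
    return min_num_supporting_reads < num_reads < max_num_supporting_reads


# Illustrative values only; these are not defaults taken from the proto.
selected = passes_read_thresholds(num_reads=3, min_num_supporting_reads=2,
                                  max_num_supporting_reads=300)
```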
Config parameters for "window selector (ws)" phase. Next ID: 10.
Used in:
Minimum number of supporting reads to call a reference position for local assembly. DEPRECATED: Use VariantReadsThresholdModel.min_num_supporting_reads instead.
Maximum number of supporting reads to call a reference position for local assembly. DEPRECATED: Use VariantReadsThresholdModel.max_num_supporting_reads instead.
Minimum read alignment quality to consider in calling a reference position for local assembly.
Minimum base quality to consider in calling a reference position for local assembly.
Minimum distance between candidate windows for local assembly.
Maximum window size to consider for local assembly. Large noisy regions are skipped for realignment.
By how much should we expand the region in which we compute the candidate positions? This is needed because we want variants near, but not within, our actual window region to contribute evidence towards our window sites. Larger values allow larger events (e.g., a 50 bp deletion) 49 bp away from the region to contribute. However, larger values also mean greater computation overhead, as we are processing extra positions that aren't themselves directly used.
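Region expansion of this kind can be sketched as padding both ends of an interval. The function name, the 0-based half-open coordinates, and the clamping to contig bounds are assumptions made for illustration, not details taken from the proto.

```python
def expand_region(start, end, expansion_bp, contig_length):
    """Pad [start, end) by `expansion_bp` on each side, clamped to the contig."""
    if expansion_bp < 0:
        raise ValueError("expansion_bp must be non-negative")
    new_start = max(0, start - expansion_bp)
    new_end = min(contig_length, end + expansion_bp)
    return new_start, new_end
```

Candidate positions would then be computed over the expanded interval, while only positions inside the original `[start, end)` are used directly as window sites.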
Config for the '_candidates_from_reads' phase.
If True, the behavior in this commit is reverted: https://github.com/google/deepvariant/commit/fbde0674639a28cb9e8004c7a01bbe25240c7d46