package learning.genomics.deepvariant

message AlignerOptions

realigner.proto:167

Config parameters for "alignment (aln)" phase.

Used in: RealignerOptions

message Allele

deepvariant.proto:76

An Allele observed in some type of NGS read data. Conceptually, an Allele is a sequence of bases that represent a type of change relative to a reference genome sequence, along with a discrete count of the number of times that allele was observed in the NGS data.

Used in: AlleleCount, AlleleCount.Alleles

message AlleleCount

deepvariant.proto:201

An AlleleCount summarizes the NGS data observed at a position in the genome.

An AlleleCount proto is a key intermediate data structure in DeepVariant summarizing the NGS read data covering a site in the genome. It is intended to be relatively simple but keep track of the key pieces of information about the observed reads and their associated alleles at this position so that downstream tools can reconstruct the read coverage, do variant calling, and compute reference confidence. It is conceptually similar to a samtools read pileup (http://samtools.sourceforge.net/pileup.shtml) but without detailed information about bases or their qualities.

The AlleleCount at its core tracks the Alleles observed in reads that overlap this position in the genome. Consider a read that has a base, X, aligned to the position of interest. If X is a non-reference allele, the AlleleCount proto adds a new read name ==> Allele key-value entry to the read_alleles map field. If X is the reference allele, the AlleleCount proto increments either the ref_supporting_read_count or the ref_nonconfident_read_count counter field, depending on read alignment confidence as defined by pipeline-specific parameters.

The complexity here is introduced by following the VCF convention of representing indel and complex substitution alleles as occurring at the preceding base in the genome. So if in fact our base X is followed by a 3 bp insertion of acg, then we would not have a count for X at all but would see an allele Xacg with a count of 1 (or more if other reads have the same allele). The primary contract here is that each aligned base at this site goes into a +1 for exactly one allele.

A concrete example might clarify this logic. Consider the following alignment of two reads to the reference genome:

    Position: 123   4567
    Ref:      ATT---TGCT
    Read1:    ATT---TGCT
    Read2:    ATTCCCTG-T

The 'T' base in position 3 of Read 1 matches the reference and so the AlleleCount's ref_supporting_read_count is incremented.

The 'T' base in position 3 of Read 2 also matches the reference, but it is the anchor base preceding the 'CCC' insertion. The 'TCCC' INSERTION Allele is therefore added to the AlleleCount proto for this position.

A deletion occurs in read 2 at position 6 as well, which produces a DELETION allele 'GT' at position 5. Additionally, because only read 1 has a C base at position 6, the AlleleCount at 6 would have no entries in its read_alleles and ref_supporting_read_count would be 1.

Another design choice is that spanning deletions don't count as coverage under bases, so there's an actual drop in coverage under regions of the genome with spanning deletions. This is classically the difference between physical coverage and sequence coverage (https://en.wikipedia.org/wiki/Shotgun_sequencing#Coverage), so it's safe to think of an AlleleCount as representing the sequence coverage of a position, not its physical coverage.

What this means is that it's very straightforward to inspect the reference counters and read_alleles in an AlleleCount and determine the corresponding alleles for a Variant as well as compute the depth of coverage. This enables us to write algorithms to call SNPs, indels, and CNVs, as well as identify regions for assembly, using a series of AlleleCount objects rather than the underlying read data.

An AlleleCount is a lossy transformation of the raw read data. Fundamentally, the digestion of a read into its corresponding AlleleCount components loses some of the contiguity information provided by reads spanning across multiple positions on the genome. None of the specifics of base quality, mapping quality, read names, etc. are preserved. Furthermore, an AlleleCount can be constructed using only a subset of all of the raw reads (e.g., those that pass minimum quality criteria) and even only parts of each read (e.g., if the read contains Ns or low quality bases).

The data used to compute an AlleleCount isn't specified as part of the proto, but is left up to the implementation details and runtime parameters of the generating program.

For usability and performance reasons we track reference and alternate allele supporting reads in slightly different ways. The number of reads that confidently carry the reference allele at this position is stored in ref_supporting_read_count. Confidence here means that the read's alignment to the reference is reliable. For background and motivation, see the Base Alignment Quality (BAQ) paper: http://bioinformatics.oxfordjournals.org/content/early/2011/02/13/bioinformatics.btr076.full.pdf

For reads that would have been counted as reference supporting but don't have a reliable alignment, we instead tally those in ref_nonconfident_read_count.

Finally, reads that don't have the reference allele are stored in a map from the string "fragment_name/read_number" to the allele it supports, which by construction will always have a count of 1. This allows for more detailed downstream analyses of the alt allele containing reads.

Consequently, the total (usable) coverage at this location is:

    coverage = // reads supporting an observed alternate allele.
               sum(read_allele_i.count) // [also equal to read_alleles_size()]
               // All of the reads confidently asserting reference.
               + ref_supporting_read_count
               // All of the reads supporting ref without a confident alignment.
               + ref_nonconfident_read_count
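The coverage formula above can be sketched in Python. This is only an illustration under stated assumptions: the `Allele` and `AlleleCount` dataclasses below are hypothetical stand-ins that mirror the field names mentioned in the description (`read_alleles`, `ref_supporting_read_count`, `ref_nonconfident_read_count`), not the generated protobuf classes, and the read key `"read2/1"` is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Allele:
    """Hypothetical stand-in for the Allele message: bases plus a count."""
    bases: str
    allele_type: str  # e.g. "INSERTION" or "DELETION"
    count: int = 1    # always 1 for an entry in read_alleles


@dataclass
class AlleleCount:
    """Hypothetical stand-in for the AlleleCount message."""
    ref_supporting_read_count: int = 0
    ref_nonconfident_read_count: int = 0
    # Key is "fragment_name/read_number"; each read supports one allele.
    read_alleles: Dict[str, Allele] = field(default_factory=dict)


def usable_coverage(ac: AlleleCount) -> int:
    """Total usable coverage, following the formula in the description."""
    alt_reads = sum(allele.count for allele in ac.read_alleles.values())
    return (alt_reads
            + ac.ref_supporting_read_count
            + ac.ref_nonconfident_read_count)


# Position 3 of the worked example: Read1 supports the reference,
# Read2 carries the 'TCCC' insertion anchored at this base.
pos3 = AlleleCount(
    ref_supporting_read_count=1,
    read_alleles={"read2/1": Allele(bases="TCCC", allele_type="INSERTION")},
)

# Position 6: only Read1 has a base here; the deletion in Read2 does not
# count as coverage.
pos6 = AlleleCount(ref_supporting_read_count=1)

coverage_pos3 = usable_coverage(pos3)
coverage_pos6 = usable_coverage(pos6)
```

Because each entry in `read_alleles` has a count of 1, the alternate-allele term is the same as the number of map entries, which matches the `read_alleles_size()` remark in the formula.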

message AlleleCount.Alleles

deepvariant.proto:227

A map where key is a sample name and value is a list of alt alleles that are supported by reads from this sample.

Used in: AlleleCount

message AlleleCountSummary

deepvariant.proto:241

A lighter-weight version of AlleleCount. The only material difference with this proto is that we don't store the map from read names to Alleles, but instead have the total number of reads we've seen at this position.

message AlleleCounterOptions

deepvariant.proto:302

Options to control how our AlleleCounter code works.

Used in: MakeExamplesOptions

enum AlleleType

deepvariant.proto:48

The type of an Allele. An allele type indicates what kind of event would have produced this allele. An allele can be the reference sequence, a substitution of bases, insertion of bases, or deletion of bases. Allele types need not be real genetic variants: for example, the SOFT_CLIP type indicates that a read contained bases SOFT_CLIPPED away (similar to an insertion), which is often indicative of some large event near the start or end of the read.

Used in: Allele

message CallVariantsOutput

deepvariant.proto:333

Variant call for a single site, in a pseudo-biallelic manner. This is an intermediate format for call_variants.py that needs to be merged if there are multiallelics. The `variant` here likely doesn't have fully filled information for output to a VCF file yet.

message CallVariantsOutput.AltAlleleIndices

deepvariant.proto:338

The alt allele indices are represented as a sub-message so that they are easier to re-use as a standalone proto for encoding and decoding.

Used in: CallVariantsOutput

message CallVariantsOutput.DebugInfo

deepvariant.proto:346

Next ID: 11

Used in: CallVariantsOutput

message CallVariantsOutput.DebugInfo.PileupCuration

deepvariant.proto:360

Used in: DebugInfo

message CandidateHaplotypes

realigner.proto:37

Encapsulates a list of candidate haplotype sequences for a genomic region.

message DeBruijnGraphOptions

realigner.proto:141

Config parameters for "de-Bruijn graph (dbg)" phase.

Used in: RealignerOptions

message DeepVariantCall

deepvariant.proto:257

A message encapsulating all of the information about a Variant call site for consumption by further stages of the DeepVariant data processing workflow.

message DeepVariantCall.ReadSupport

deepvariant.proto:278

Used in: SupportingReadsExt

message DeepVariantCall.SupportingReads

deepvariant.proto:269

A map from an alt allele in Variant to the keys of the reads that support that allele. Every alt allele in the variant will have an entry. Reference supporting reads aren't listed. There may be a special key "UNCALLED_ALLELE" for reads that don't support either the reference allele or any alt allele in the variant. This can happen when the read supports an allele that didn't pass our calling thresholds. The read's key is a unique string that identifies the read, constructed as "fragment_name/read_number".
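A minimal sketch of how such a map could be assembled, using a plain Python dict in place of the proto map. The function names and the shape of the `observations` input are assumptions made for illustration; only the "UNCALLED_ALLELE" key and the slash-separated read key come from the description.

```python
from typing import Dict, Iterable, List, Optional, Tuple

UNCALLED_ALLELE = "UNCALLED_ALLELE"

# (fragment name, read number, observed allele bases or None for reference)
Observation = Tuple[str, int, Optional[str]]


def read_key(fragment: str, read_number: int) -> str:
    """Builds the unique per-read key, fragment and read number joined by '/'."""
    return f"{fragment}/{read_number}"


def build_supporting_reads(
    alt_alleles: Iterable[str],
    observations: Iterable[Observation],
) -> Dict[str, List[str]]:
    """Maps each alt allele to the keys of the reads supporting it."""
    # Every alt allele in the variant gets an entry, even if empty.
    supporting: Dict[str, List[str]] = {alt: [] for alt in alt_alleles}
    for fragment, read_number, allele in observations:
        if allele is None:
            continue  # reference supporting reads aren't listed
        if allele in supporting:
            supporting[allele].append(read_key(fragment, read_number))
        else:
            # The read supports an allele that is not part of the variant.
            supporting.setdefault(UNCALLED_ALLELE, []).append(
                read_key(fragment, read_number))
    return supporting


example = build_supporting_reads(
    alt_alleles=["C", "T"],
    observations=[("fragA", 1, "C"), ("fragB", 2, None), ("fragC", 1, "G")],
)
```

In this example the `T` allele keeps an empty entry, the reference read from `fragB` is dropped, and the read carrying `G` lands under the special key.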

Used in: DeepVariantCall

message DeepVariantCall.SupportingReadsExt

deepvariant.proto:288

A map from an alt allele in Variant to a ReadSupport structure. This structure is intended to replace SupportingReads, but the old one is kept for backward compatibility.

Used in: DeepVariantCall

enum DeepVariantChannelEnum

deepvariant.proto:1152

Next ID: 23.

Used in: SampleOptions

message DeepVariantDatasetConfig

deepvariant.proto:945

Config describing the information needed for a dataset that can be used for training, validation, or testing.

message Diagnostics

realigner.proto:218

Config parameters for "alignment (aln)" phase.

Used in: RealignerOptions

message LabelingMetrics

deepvariant.proto:962

Metrics on the labeling of candidate / truth variants when running DeepVariant's make_examples in training mode. Next ID: 17.

Notes on counting by site or by allele:

Throughout this proto we often measure the same quantity (e.g., false positives) by site and by alleles. This reflects two different ways of counting errors in genomes where the number of chromosomes > 1. We can give a concrete example:

    Candidate: chr20:10 with A/C
    Truth: chr20:10 with A/C alleles and with genotype (0, 1)

Since we have the same variant with the same alleles at the same position in both the candidates and the truth, the matching is trivial. For this variant, we'd update our counts as follows:

    # We have only a single site, so we +1 to each truth and candidates.
    n_truth_variant_sites += 1
    n_candidate_variant_sites += 1

    # Both the candidate and truth variants only have 1 alternative allele,
    # so we increment the allele counts by 1 as well.
    n_truth_variant_alleles += 1
    n_candidate_variant_alleles += 1

A similar logic would apply to counting true positives (+1), false negatives (+0), and false positives (+0) for both sites and alleles.

Now let's take a more complex example where candidates and truth differ in their alleles:

    Candidate: chr20:20 with A/C/T
    Truth: chr20:20 with A/C/G alleles and with genotype (1, 2)

Here we have a candidate and truth at the same position but they can only be partially matched, since the truth variant includes a G allele (i.e., genotype == 2) that isn't even present in the candidate. And the candidate has an extra allele T that isn't real. Matching these variants produces a genotype of (0, 1) for the candidate, since we have one copy of the C allele (i.e., genotype == 1) but we cannot match the true G allele, so we are forced to say one allele is reference (i.e., genotype == 0). Now let's update our counts:

    # We have only a single site, so we +1 to each truth and candidates, even
    # though they both have multiple alt alleles, as this is the sites-level
    # metric.
    n_truth_variant_sites += 1
    n_candidate_variant_sites += 1

    # Both variants have 2 alternative alleles, so we +2 each of the alleles
    # counts.
    n_truth_variant_alleles += 2
    n_candidate_variant_alleles += 2

    # Now for the complex TP/FN/FP counts.
    # We found a candidate for this true site, even though we didn't get all
    # of the alleles, so we increment the n_true_positive_sites by 1 and get
    # +0 for each of the FN and FP sites metrics:
    n_true_positive_sites += 1
    n_false_negative_sites += 0
    n_false_positive_sites += 0

    # However, we only got one of the two true positive alleles, so we get
    # one TP allele (the C), one FN allele (the missed G), and one FP allele
    # (the bad T allele in the candidate).
    n_true_positive_alleles += 1
    n_false_negative_alleles += 1
    n_false_positive_alleles += 1
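The two worked examples can be reproduced with a small Python sketch. It rests on simplifying assumptions: it only handles the case described in the notes, where a candidate and a truth variant sit at the same position and every truth alt allele is present in the truth genotype, and it compares alt alleles as plain sets. The function name `update_counts` is hypothetical; the counter names are the ones used in the notes.

```python
from collections import Counter
from typing import Iterable


def update_counts(metrics: Counter,
                  candidate_alts: Iterable[str],
                  truth_alts: Iterable[str]) -> None:
    """Updates site- and allele-level counts for one matched site."""
    cand, truth = set(candidate_alts), set(truth_alts)

    # Site-level counts: one site each, however many alt alleles there are.
    metrics["n_truth_variant_sites"] += 1
    metrics["n_candidate_variant_sites"] += 1
    metrics["n_true_positive_sites"] += 1
    metrics["n_false_negative_sites"] += 0
    metrics["n_false_positive_sites"] += 0

    # Allele-level counts: one per alternative allele.
    metrics["n_truth_variant_alleles"] += len(truth)
    metrics["n_candidate_variant_alleles"] += len(cand)
    metrics["n_true_positive_alleles"] += len(cand & truth)   # found
    metrics["n_false_negative_alleles"] += len(truth - cand)  # missed
    metrics["n_false_positive_alleles"] += len(cand - truth)  # spurious


metrics = Counter()
# chr20:10, candidate A/C vs. truth A/C.
update_counts(metrics, candidate_alts=["C"], truth_alts=["C"])
# chr20:20, candidate A/C/T vs. truth A/C/G.
update_counts(metrics, candidate_alts=["C", "T"], truth_alts=["C", "G"])
```

After both calls the site-level counters have advanced by 2, while the allele-level counters reflect one missed allele (G) and one spurious allele (T) from the second site.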

Used in: MakeExamplesRunInfo

message MakeExamplesOptions

deepvariant.proto:659

High-level options that encapsulate all of the parameters needed to run DeepVariant end-to-end. Next ID: 82.

Used in: MakeExamplesRunInfo

enum MakeExamplesOptions.LabelerAlgorithm

deepvariant.proto:777

An enumeration of all of the labeler algorithms we support in DeepVariant.

Used in: MakeExamplesOptions

enum MakeExamplesOptions.Mode

deepvariant.proto:720

Used in: MakeExamplesOptions

enum MakeExamplesOptions.VariantCaller

deepvariant.proto:805

Used in: MakeExamplesOptions

message MakeExamplesRunInfo

deepvariant.proto:1144

Configuration and runtime information about a MakeExamples run in DeepVariant. Next ID: 5.

message MakeExamplesStats

deepvariant.proto:1128

Statistics about MakeExamples. Next ID: 9.

Used in: MakeExamplesRunInfo

message PileupImageOptions

deepvariant.proto:449

Options to control how we construct pileup images. Next ID: 41.

Used in: MakeExamplesOptions

enum PileupImageOptions.MultiAllelicMode

deepvariant.proto:510

Used in: PileupImageOptions

enum PileupImageOptions.SequencingType

deepvariant.proto:544

Sequencing type of the input BAM file.

Used in: PileupImageOptions

message RealignerOptions

realigner.proto:229

Used in: MakeExamplesOptions

message ResourceMetrics

resources.proto:39

This proto encodes basic runtime performance metrics for the execution of a command: the command that was executed, the start/stop times, CPU, memory, and disk utilization, etc.

Information about the runtime environment

Used in: MakeExamplesRunInfo

message SampleOptions

deepvariant.proto:587

Options that may differ by sample. Next ID: 18.

Used in: MakeExamplesOptions

enum SampleOptions.VariantType

deepvariant.proto:643

Blank out all channels for these variant types if set.

Used in: SampleOptions

message VariantCallerOptions

deepvariant.proto:375

Options to control how our candidate VariantCaller works. Next ID: 21

Used in: MakeExamplesOptions, SampleOptions

message VariantLabelerOptions

deepvariant.proto:443

Options to control how we label variant calls.

Currently there are no options for VariantLabeler.

Used in: MakeExamplesOptions

(message has no fields)

message WindowSelectorModel

realigner.proto:48

Config parameters for the selection of candidate locations in the "window selector (ws)" phase.

Used in: WindowSelectorOptions

message WindowSelectorModel.AlleleCountLinearModel

realigner.proto:72

Linear model based on the type of reads at each locus.

Used in: WindowSelectorModel

enum WindowSelectorModel.ModelType

realigner.proto:54

Two models are currently supported:
- VARIANT_READS: based on the number of SNPs, INDELs and SOFT_CLIPs at a location.
- ALLELE_COUNT_LINEAR: linear model based on the AlleleCount at each location.

Used in: WindowSelectorModel

message WindowSelectorModel.VariantReadsThresholdModel

realigner.proto:62

Model requiring #reads > min_num_supporting_reads and #reads < max_num_supporting_reads.
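As a rough illustration, the threshold condition can be written as a predicate in Python. The strict inequalities are taken from the description above; the function names, the per-position input dict, and the threshold values used in the example are hypothetical, and how the number of reads at a position is obtained is left out.

```python
from typing import Dict, List


def passes_variant_reads_threshold(num_reads: int,
                                   min_num_supporting_reads: int,
                                   max_num_supporting_reads: int) -> bool:
    """True when min_num_supporting_reads < num_reads < max_num_supporting_reads."""
    return min_num_supporting_reads < num_reads < max_num_supporting_reads


def select_candidate_positions(reads_per_position: Dict[int, int],
                               min_num_supporting_reads: int,
                               max_num_supporting_reads: int) -> List[int]:
    """Returns the sorted positions whose read count passes the threshold."""
    return sorted(
        position
        for position, num_reads in reads_per_position.items()
        if passes_variant_reads_threshold(
            num_reads, min_num_supporting_reads, max_num_supporting_reads)
    )


# Hypothetical per-position read counts and thresholds.
counts = {100: 1, 101: 4, 102: 250}
selected = select_candidate_positions(
    counts, min_num_supporting_reads=2, max_num_supporting_reads=200)
```

Note that with strict inequalities a position whose count equals either bound is not selected.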

Used in: WindowSelectorModel

message WindowSelectorOptions

realigner.proto:95

Config parameters for "window selector (ws)" phase. Next ID: 10.

Used in: RealignerOptions