Represents one line of a BedGraph file. See https://genome.ucsc.edu/goldenPath/help/bedgraph.html for details on the format.
The reference sequence name, for example `chr1`, `1`, or `chrX`.
The start position of the range on the reference, 0-based inclusive.
The end position of the range on the reference, 0-based exclusive.
The data value, which can be any positive or negative real number.
The number of fields in the BED file.
Options for reading BED files.
Optional. The number of fields to read from the BED file. If this is unset, or set to more fields than are present in the BED file, all fields are read.
This message represents a single BED record. See https://genome.ucsc.edu/FAQ/FAQformat.html#format1 for details.
The reference on which this region occurs. Corresponds to the `chrom` field in the UCSC BED format.
The position at which this region occurs (0-based inclusive).
The position at which this region ends (0-based exclusive).
The name of the record.
As described in https://genome.ucsc.edu/FAQ/FAQformat.html#format1, score should be an integer in [0, 1000]. However, many non-integer values are seen in BED records in the wild, so we represent this as a double.
The strand on the genome that the record is on.
For visualization purposes, the position at which the feature starts to be drawn thickly. In gene structures this corresponds to the start codon. This is zero-based inclusive numbering, like the `start` field.
For visualization purposes, the position at which the feature stops being drawn thickly. In gene structures this corresponds to the stop codon. This is zero-based exclusive numbering, like the `end` field.
A comma-separated RGB value R,G,B for visualization.
The number of distinct blocks in the BED line (e.g. exon count in gene structures).
Comma-separated list of block sizes. The number of items in the list should be equal to `block_count`.
Comma-separated list of block start positions. The number of items in the list should be equal to `block_count`. This is zero-based inclusive numbering, like the `start` field.
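The field descriptions above can be illustrated with a minimal parsing sketch. It is not the reader's actual implementation: the function `parse_bed_line`, the dictionary keys, and the handling of a trailing comma in the block lists are assumptions made for illustration. It shows the documented `num_fields` behaviour (unset, or larger than the number of columns present, means all fields are read), the score stored as a double, and the `block_count` consistency check.

```python
from typing import Optional

# Hypothetical key names, in BED column order.
BED_FIELDS = [
    "reference_name", "start", "end", "name", "score", "strand",
    "thick_start", "thick_end", "item_rgb",
    "block_count", "block_sizes", "block_starts",
]
INT_FIELDS = {"start", "end", "thick_start", "thick_end", "block_count"}


def parse_bed_line(line: str, num_fields: Optional[int] = None) -> dict:
    """Parse one tab-separated BED line into a dict (illustrative only)."""
    tokens = line.rstrip("\n").split("\t")
    # Unset (None or 0), or more fields than are present: read everything.
    if num_fields and num_fields < len(tokens):
        tokens = tokens[:num_fields]

    record = {}
    for field, token in zip(BED_FIELDS, tokens):
        if field in INT_FIELDS:
            record[field] = int(token)
        elif field == "score":
            # Stored as a float because non-integer scores occur in practice.
            record[field] = float(token)
        else:
            record[field] = token

    # The block lists should each contain block_count items.
    if "block_count" in record:
        for key in ("block_sizes", "block_starts"):
            if key in record:
                # Assumption: tolerate a trailing comma such as "567,488,".
                items = [x for x in record[key].split(",") if x]
                if len(items) != record["block_count"]:
                    raise ValueError(
                        f"{key} has {len(items)} items, "
                        f"expected {record['block_count']}"
                    )
    return record
```

Calling `parse_bed_line(line, num_fields=3)` on a 12-column line keeps only the reference name, start, and end.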
The strand is unspecified, unknown, or not meaningful.
Options for writing BED files. Currently this is a placeholder message.
(message has no fields)
A single CIGAR operation.
The number of genomic bases that the operation runs for. Required.
`referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) and deletions (`DELETE`). Filling this field replaces SAM's MD tag. If the relevant information is not available, this field is unset.
Describes the different types of CIGAR alignment operations that exist. Used wherever CIGAR alignments are used.
An alignment match indicates that a sequence can be aligned to the reference without evidence of an INDEL. Unlike the `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` operator does not indicate whether the reference and read sequences are an exact match. This operator is equivalent to SAM's `M`.
The insert operator indicates that the read contains evidence of bases being inserted into the reference. This operator is equivalent to SAM's `I`.
The delete operator indicates that the read contains evidence of bases being deleted from the reference. This operator is equivalent to SAM's `D`.
The skip operator indicates that this read skips a long segment of the reference, but the bases have not been deleted. This operator is commonly used when working with RNA-seq data, where reads may skip long segments of the reference between exons. This operator is equivalent to SAM's `N`.
The soft clip operator indicates that bases at the start/end of a read have not been considered during alignment. This may occur if the majority of a read maps, except for low quality bases at the start/end of a read. This operator is equivalent to SAM's `S`. Bases that are soft clipped will still be stored in the read.
The hard clip operator indicates that bases at the start/end of a read have been omitted from this alignment. This may occur if this linear alignment is part of a chimeric alignment, or if the read has been trimmed (for example, during error correction or to trim poly-A tails for RNA-seq). This operator is equivalent to SAM's `H`.
The pad operator indicates that there is padding in an alignment. This operator is equivalent to SAM's `P`.
This operator indicates that this portion of the aligned sequence exactly matches the reference. This operator is equivalent to SAM's `=`.
This operator indicates that this portion of the aligned sequence is an alignment match to the reference, but a sequence mismatch. This can indicate a SNP or a read error. This operator is equivalent to SAM's `X`.
This record type records information about a contig. This is used both in VCF header parsing and by GenomeReference objects for querying references. Due to its generality, this message is also used by the FastaReader to provide detailed information on the description line of a FASTA record even in cases where the record does not correspond to a reference genome contig.
Required. The name of the contig. Canonically this is the first non-whitespace-containing string after the `>` marker in a FASTA file. For example, the line `>chr1 more info here` has a name of "chr1" and a description of "more info here".
Ideally this record is filled in as described above, but not all FASTA readers capture the description information after the name. Since a description is not required by the FASTA spec, we cannot distinguish cases where a description was not present and where a parser ignored it.
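The name/description split described above can be sketched as follows. The helper `parse_defline` is a hypothetical name used only for illustration, not part of any reader API; it assumes the name is everything up to the first whitespace and the description is whatever remains.

```python
from typing import Tuple


def parse_defline(line: str) -> Tuple[str, str]:
    """Split a FASTA description line into (name, description)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA description line must start with '>'")
    # Drop the marker and surrounding whitespace, then split once on
    # the first run of whitespace.
    parts = line[1:].strip().split(None, 1)
    name = parts[0] if parts else ""
    # An absent description is returned as the empty string, which is why
    # "no description" and "description ignored" cannot be told apart.
    description = parts[1] if len(parts) > 1 else ""
    return name, description
```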
The length of this contig in basepairs.
Additional information used when reading and writing VCF headers. An example map of key-value extra fields would transform an input line containing 'assembly=B36,taxonomy=x,species="Homo sapiens"' to a map with "assembly" -> "B36", "taxonomy" -> "x", "species" -> "Homo sapiens". We never use this information internally, other than reading it in so we can write the contig out again.
The position of this contig in the src_fasta file. The first contig would have position 0. TODO: rename to something more generic.
If false, casts all bases to uppercase before returning them.
If set, all sequences are verified to contain only characters present in the input alphabet defined here.
If true, the `region` field is populated in each FastaRecord.
No parsing is performed, and the `defline` field holds the raw string of the line.
Parses the description line of each record into a ContigInfo object in the `contig` field.
This message represents a single FASTA record. This can be any FASTA file, representing DNA, RNA, protein, or other sequence.
If the FastaReaderOptions.parse_header field is false, this field is populated with the raw text of the description line, stripping the leading '>' and any trailing whitespace and the newline. Otherwise this field is empty.
If the FastaReaderOptions.parse_header field is true, this message is populated based on the contents of the description and sequence lines. Otherwise this field is empty. NOTE: the "contig" info provided here is solely based on the record itself, and provides a mechanism to separate the sequence name from its description and includes the number of basepairs in the sequence.
Iff the FastaReaderOptions.include_range field is true, this message is populated with the location of the sequence within the contig. `region.end - region.start` should thus equal the length of the sequence. This could differ from the range [0, len(sequence)) in the case of a query operation for a particular region of a FASTA sequence.
The raw sequence letters. Depending on the `FastaReaderOptions.keep_true_case` field, these are either uppercased or kept in their original case.
Options for writing FASTA files. Currently this is a placeholder message but could be used to support different choices on output like the number of columns per line.
(message has no fields)
If true, simply drop invalid records. Otherwise, raise an error on invalid records.
This message represents a single FASTQ record.
The first line of a FASTQ record begins with '@' and is followed by a sequence identifier (up to the first whitespace character) and then an optional description. This line is parsed into its constituent id and description. The sequence identifier.
Optional. The description provided in the header line.
The raw sequence letters.
The quality values for the sequence. Its length must be the same as the sequence length, and is encoded in ASCII. The meaning of each base quality may vary: it is usually a Phred-scaled score (-10 * log_10(Pr{call is incorrect})) but differs for some older versions of FASTQs.
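As a hedged illustration of the Phred scaling mentioned above, the sketch below decodes an ASCII quality string into integer scores and converts a score back into an error probability. The ASCII offset of 33 is an assumption (it is the common modern convention, but the text above does not state it, and older FASTQ variants differ); the function names are hypothetical.

```python
from typing import List


def decode_qualities(quality: str, sequence: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into per-base scores.

    `offset` is an assumption: 33 is typical for current FASTQ files.
    """
    # The quality string must have the same length as the sequence.
    if len(quality) != len(sequence):
        raise ValueError("quality and sequence lengths differ")
    return [ord(ch) - offset for ch in quality]


def error_probability(phred: float) -> float:
    """Invert Q = -10 * log10(p) to recover p."""
    return 10 ** (-phred / 10)
```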
Options for writing FASTQ files. Currently this is a placeholder message but could be used to support different choices on output like whether the pad line should include the header or not.
(message has no fields)
A message encoding the directives contained in a GFF3 file header. Consult the file format reference for detailed descriptions of these directives. Note that we do NOT handle the FASTA directive (a rarely-used method to bundle reference sequences within a GFF file.)
`gff_version`, a string of the format "gff-version 3.#.#" encoding the exact GFF spec version used.
`sequence_regions` is a list of the sequence regions that are referenced by GFF records.
A string name for the biological species analyzed. "This directive indicates the species that the annotations apply to. The preferred format is a NCBI URL that points to the relevant species page in either of the following formats: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6239 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?name=Caenorhabditis+elegans"
An OntologyDirective holds the URI to a sequence ontology database, reflecting the ontology over the entities in the `type`, `source`, and `attributes` fields of a GffRecord.
(message has no fields)
This message represents a single GFF3 record. See https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for details on the file format; that document is quoted below. TODO: deal with %-encoding.
`range` field, reflecting the genomic location of the feature. NOTE: this is a 0-based, end-exclusive interval, in contrast to the 1-based encoding used natively in the GFF text format. For reference, consult the documentation for `Range`. This field is required.
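The coordinate difference noted above is easy to get wrong, so here is a small sketch of the conversion between the 1-based, end-inclusive interval of the GFF text format and the 0-based, end-exclusive interval stored in `range`. The function names are hypothetical.

```python
from typing import Tuple


def gff_to_range(gff_start: int, gff_end: int) -> Tuple[int, int]:
    """1-based inclusive [start, end] -> 0-based half-open [start, end)."""
    # Only the start shifts: the inclusive 1-based end and the exclusive
    # 0-based end are the same number.
    return gff_start - 1, gff_end


def range_to_gff(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open [start, end) -> 1-based inclusive [start, end]."""
    return start + 1, end
```

For example, a GFF feature spanning columns 4 and 5 of `1000` and `2000` covers 1001 bases and becomes the range `[999, 2000)`.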
`source` designation, as defined by GFF spec: "The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature. Typically this is the name of a piece of software, such as "Genescan" or a database name, such as "Genbank." In effect, the source is used to extend the feature ontology by adding a qualifier to the type creating a new composite type that is a subclass of the type in the type column." Missingness of this field is encoded as "".
`type` designation, as defined by GFF spec: "The type of the feature (previously called the "method"). This is constrained to be either a term from the Sequence Ontology or an SO accession number. The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it." Missingness of this field is encoded as "".
`score` designation, as defined by GFF spec: "The score of the feature, a floating point number. As in earlier versions of the format, the semantics of the score are ill-defined. It is strongly recommended that E-values be used for sequence similarity features, and that P-values be used for ab initio gene prediction features." Missingness of this field is encoded by -infinity.
The strand of the feature, if relevant.
`phase` designation, as defined by the GFF spec: "For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3. For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field. The phase is REQUIRED for all CDS features." Missingness of this field is encoded by -1.
`attributes`, a free-form map of keys to string values, corresponding to the semi-colon separated attributes field in the GFF text format.
TODO: factor this out (here and BED, at least)
The strand is unspecified, unknown, or not meaningful.
(message has no fields)
A linear alignment can be represented by one CIGAR string. Describes the mapped position and local alignment of the read to the reference.
The position of this alignment.
The mapping quality of this alignment. Represents how likely the read maps to this position as opposed to other locations. Specifically, this is -10 log10 Pr(mapping position is wrong), rounded to the nearest integer.
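The mapping quality formula above can be written out as a short sketch. The function names are hypothetical; the arithmetic follows the stated definition, -10 log10 Pr(mapping position is wrong), rounded to the nearest integer.

```python
import math


def mapq_from_error_probability(p_wrong: float) -> int:
    """Phred-scale the probability that the mapping position is wrong."""
    if not 0 < p_wrong <= 1:
        raise ValueError("probability must be in (0, 1]")
    return int(round(-10 * math.log10(p_wrong)))


def error_probability_from_mapq(mapq: int) -> float:
    """Approximate inverse; rounding in the forward direction loses precision."""
    return 10 ** (-mapq / 10)
```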
Represents the local alignment of this sequence (alignment matches, indels, etc) against the reference.
`ListValue` is a wrapper around a repeated field of values. The JSON representation for `ListValue` is JSON array.
Repeated field of dynamically typed values.
`NullValue` is a singleton enumeration to represent the null value for the `Value` type union. The JSON representation for `NullValue` is JSON `null`.
Null value.
An abstraction for referring to a genomic position, in relation to some already known reference. For now, represents a genomic position as a reference name, a base number on that reference (0-based), and a determination of forward or reverse strand.
The name of the reference in whatever reference set is being used.
The 0-based offset from the start of the forward strand for that reference.
Whether this position is on the reverse strand, as opposed to the forward strand.
A Program is used in the SAM header to track how alignment data is generated. This is a sub-message of SamHeader, at the same scope to reduce verbosity.
@PG ID field in SAM spec. The locally unique ID of the program. Used along with `prev_program_id` to define an ordering between programs.
@PG PN field in SAM spec. The display name of the program. This is typically the colloquial name of the tool used, for example 'bwa' or 'picard'.
@PG CL field in SAM spec. The command line used to run this program.
@PG PP field in SAM spec. The ID of the program run before this one.
@PG DS field in SAM spec. The description of the program.
@PG VN field in SAM spec. The version of the program run.
A 0-based half-open genomic coordinate range for search requests.
The reference sequence name, for example `chr1`, `1`, or `chrX`.
The start position of the range on the reference, 0-based inclusive.
The end position of the range on the reference, 0-based exclusive.
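To make the 0-based half-open convention concrete, the following sketch models a range as a small Python dataclass. The class `HalfOpenRange` and its methods are hypothetical stand-ins for illustration, not the generated proto class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfOpenRange:
    reference_name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    def length(self) -> int:
        # With half-open coordinates the length is a plain difference.
        return self.end - self.start

    def overlaps(self, other: "HalfOpenRange") -> bool:
        # Adjacent ranges such as [0, 10) and [10, 20) do not overlap.
        return (
            self.reference_name == other.reference_name
            and self.start < other.end
            and other.start < self.end
        )
```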
A read alignment describes a linear alignment of a string of DNA to a [reference sequence][learning.genomics.v1.Reference], in addition to metadata about the fragment (the molecule of DNA sequenced) and the read (the bases which were read by the sequencer). A read is equivalent to a line in a SAM file. A read belongs to exactly one read group and exactly one [read group set][learning.genomics.v1.ReadGroupSet].

For more genomics resource definitions, see [Fundamentals of Google Genomics](https://cloud.google.com/genomics/fundamentals-of-google-genomics).

### Reverse-stranded reads

Mapped reads (reads having a non-null `alignment`) can be aligned to either the forward or the reverse strand of their associated reference. Strandedness of a mapped read is encoded by `alignment.position.reverseStrand`.

If we consider the reference to be a forward-stranded coordinate space of `[0, reference.length)` with `0` as the left-most position and `reference.length` as the right-most position, reads are always aligned left to right. That is, `alignment.position.position` always refers to the left-most reference coordinate and `alignment.cigar` describes the alignment of this read to the reference from left to right. All per-base fields such as `alignedSequence` and `alignedQuality` share this same left-to-right orientation; this is true of reads which are aligned to either strand. For reverse-stranded reads, this means that `alignedSequence` is the reverse complement of the bases that were originally reported by the sequencing machine.

### Generating a reference-aligned sequence string

When interacting with mapped reads, it's often useful to produce a string representing the local alignment of the read to the reference. The following pseudocode demonstrates one way of doing this:

    out = ""
    offset = 0
    for c in read.alignment.cigar {
      switch c.operation {
      case "ALIGNMENT_MATCH", "SEQUENCE_MATCH", "SEQUENCE_MISMATCH":
        out += read.alignedSequence[offset:offset+c.operationLength]
        offset += c.operationLength
        break
      case "CLIP_SOFT", "INSERT":
        offset += c.operationLength
        break
      case "PAD":
        out += repeat("*", c.operationLength)
        break
      case "DELETE":
        out += repeat("-", c.operationLength)
        break
      case "SKIP":
        out += repeat(" ", c.operationLength)
        break
      case "CLIP_HARD":
        break
      }
    }
    return out

### Converting to SAM's CIGAR string

The following pseudocode generates a SAM CIGAR string from the `cigar` field. Note that this is a lossy conversion (`cigar.referenceSequence` is lost).

    cigarMap = {
      "ALIGNMENT_MATCH": "M",
      "INSERT": "I",
      "DELETE": "D",
      "SKIP": "N",
      "CLIP_SOFT": "S",
      "CLIP_HARD": "H",
      "PAD": "P",
      "SEQUENCE_MATCH": "=",
      "SEQUENCE_MISMATCH": "X",
    }
    cigarStr = ""
    for c in read.alignment.cigar {
      cigarStr += c.operationLength + cigarMap[c.operation]
    }
    return cigarStr
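For readers who want something runnable, the two pseudocode routines above translate fairly directly into Python. This is a sketch under one stated assumption: the CIGAR is represented as a plain list of `(operation, length)` tuples rather than as proto messages. The function names are hypothetical.

```python
from typing import List, Tuple

Cigar = List[Tuple[str, int]]  # assumed representation: (operation, length)

CIGAR_MAP = {
    "ALIGNMENT_MATCH": "M", "INSERT": "I", "DELETE": "D", "SKIP": "N",
    "CLIP_SOFT": "S", "CLIP_HARD": "H", "PAD": "P",
    "SEQUENCE_MATCH": "=", "SEQUENCE_MISMATCH": "X",
}


def reference_aligned_string(aligned_sequence: str, cigar: Cigar) -> str:
    """Render the read in reference coordinates, as in the pseudocode."""
    out = []
    offset = 0  # index into aligned_sequence
    for op, length in cigar:
        if op in ("ALIGNMENT_MATCH", "SEQUENCE_MATCH", "SEQUENCE_MISMATCH"):
            out.append(aligned_sequence[offset:offset + length])
            offset += length
        elif op in ("CLIP_SOFT", "INSERT"):
            offset += length  # consumes read bases, emits nothing
        elif op == "PAD":
            out.append("*" * length)
        elif op == "DELETE":
            out.append("-" * length)
        elif op == "SKIP":
            out.append(" " * length)
        elif op == "CLIP_HARD":
            pass  # hard-clipped bases are not stored in the read
        else:
            raise ValueError(f"unknown CIGAR operation: {op}")
    return "".join(out)


def to_sam_cigar(cigar: Cigar) -> str:
    """Lossy conversion to a SAM CIGAR string."""
    return "".join(f"{length}{CIGAR_MAP[op]}" for op, length in cigar)
```

Unlike the pseudocode, the length is formatted explicitly as a string, since adding an integer to a string is an error in Python.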
The server-generated read ID, unique across all reads. This is different from the `fragmentName`.
The ID of the read group this read belongs to. A read belongs to exactly one read group. This is a server-generated ID which is distinct from SAM's RG tag (for that value, see [ReadGroup.name][learning.genomics.v1.ReadGroup.name]).
The ID of the read group set this read belongs to. A read belongs to exactly one read group set.
The fragment name. Equivalent to QNAME (query template name) in SAM.
The orientation and the distance between reads from the fragment are consistent with the sequencing protocol (SAM flag 0x2).
The fragment is a PCR or optical duplicate (SAM flag 0x400).
The observed length of the fragment, equivalent to TLEN in SAM.
The read number in sequencing. 0-based and less than numberReads. This field replaces SAM flag 0x40 and 0x80.
The number of reads in the fragment (extension to SAM flag 0x1).
Whether this read did not pass filters, such as platform or vendor quality controls (SAM flag 0x200).
The linear alignment for this alignment record. This field is null for unmapped reads.
Whether this alignment is secondary. Equivalent to SAM flag 0x100. A secondary alignment represents an alternative to the primary alignment for this read. Aligners may return secondary alignments if a read can map ambiguously to multiple coordinates in the genome. By convention, each read has one and only one alignment where both `secondaryAlignment` and `supplementaryAlignment` are false.
Whether this alignment is supplementary. Equivalent to SAM flag 0x800. Supplementary alignments are used in the representation of a chimeric alignment. In a chimeric alignment, a read is split into multiple linear alignments that map to different reference contigs. The first linear alignment in the read will be designated as the representative alignment; the remaining linear alignments will be designated as supplementary alignments. These alignments may have different mapping quality scores. In each linear alignment in a chimeric alignment, the read will be hard clipped. The `alignedSequence` and `alignedQuality` fields in the alignment record will only represent the bases for its respective linear alignment.
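The SAM flag bits referenced in the field descriptions above can be unpacked as in the sketch below. The dictionary keys other than `readNumber`, `numberReads`, `secondaryAlignment`, and `supplementaryAlignment` are hypothetical names, and the mapping of 0x1 to a read count of 2 and of 0x40/0x80 to read numbers 0 and 1 is an assumption covering the ordinary paired-end case only.

```python
def decode_sam_flag(flag: int) -> dict:
    """Map a SAM FLAG integer onto the boolean/count fields of a read."""
    paired = bool(flag & 0x1)
    # Assumption: 0x40 marks the first read (0) and 0x80 the last read (1).
    read_number = 1 if (flag & 0x80) and not (flag & 0x40) else 0
    return {
        "numberReads": 2 if paired else 1,
        "properPlacement": bool(flag & 0x2),
        "readNumber": read_number,
        "secondaryAlignment": bool(flag & 0x100),
        "failedVendorQualityChecks": bool(flag & 0x200),
        "duplicateFragment": bool(flag & 0x400),
        "supplementaryAlignment": bool(flag & 0x800),
    }
```

For instance, flag 163 (0x1 + 0x2 + 0x20 + 0x80) decodes to the second read of a properly placed pair.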
The bases of the read sequence contained in this alignment record, **without CIGAR operations applied** (equivalent to SEQ in SAM). `alignedSequence` and `alignedQuality` may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
The quality of the read sequence contained in this alignment record (equivalent to QUAL in SAM). Optionally can be read from OQ tag. See `SamReaderOptions` proto for more details. `alignedSequence` and `alignedQuality` may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
The mapping of the primary alignment of the `(readNumber+1)%numberReads` read in the fragment. It replaces mate position and mate strand in SAM.
A map of additional read alignment information. This must be of the form map<string, string[]> (string key mapping to a list of string values).
A read group is all the data that's processed the same way by the sequencer. This is a sub-message of SamHeader, at the same scope to reduce verbosity.
@RG ID field in SAM spec. The read group name.
@RG CN field in SAM spec. The name of the sequencing center producing the read.
@RG DS field in SAM spec. A free-form text description of this read group.
@RG DT field in SAM spec.
@RG FO field in SAM spec.
@RG KS field in SAM spec.
@RG LB field in SAM spec. A library is a collection of DNA fragments which have been prepared for sequencing from a sample. This field is important for quality control as error or bias can be introduced during sample preparation.
@RG PG field in SAM spec.
@RG PI field in SAM spec. The predicted insert size of this read group. The insert size is the length of the sequenced DNA fragment from end-to-end, not including the adapters.
@RG PL field in SAM spec. The platform/technology used to produce the reads.
@RG PM field in SAM spec. The platform model used as part of this run.
@RG PU field in SAM spec. The platform unit used as part of this experiment, for example flowcell-barcode.lane for Illumina or slide for SOLiD. A unique identifier.
@RG SM field in SAM spec. A client-supplied sample identifier for the reads in this read group.
Describes requirements for a read for it to be returned by a SamReader.
By default, duplicate reads will not be kept. Set this flag to keep them.
By default, reads that failed the vendor quality checks will not be kept. Set this flag to keep them.
By default, reads that are marked as secondary alignments will not be kept. Set this flag to keep them.
By default, reads that are marked as supplementary alignments will not be kept. Set this flag to keep them.
By default, reads that aren't aligned are not kept. Set this flag to keep them.
Paired (or greater) reads that are improperly placed are not kept by default. Set this flag to keep them. We define improperly placed to mean reads whose (next) mate is mapped to a different contig.
By default, reads with any mapping quality are kept. Setting this field to a positive integer i will only keep reads that have a MAPQ >= i. Note this only applies to aligned reads. If keep_unaligned is set, unaligned reads, which by definition do not have a mapping quality, will still be kept.
Minimum base quality. This field indicates that we are enforcing a minimum base quality score for a read to be used. How this field is enforced, though, depends on the enum field min_base_quality_mode, as there are multiple ways for this requirement to be interpreted.
How should we enforce the min_base_quality requirement?
If UNSPECIFIED, there are no guarantees on whether and how min_base_quality would be enforced. By default we recommend implementations ignore min_base_quality if this is set to UNSPECIFIED.
The min_base_quality requirement is being enforced not by the reader but by the client itself. This is commonly used when the algorithm for computing whether a read satisfying the min_base_quality requirement is too complex or too specific for the reader.
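The keep/discard rules above can be summarised in a short sketch. Only `keep_unaligned`, `min_base_quality`, and `min_base_quality_mode` are named in the text; every other field name here, the dict-based read representation, and the order in which checks are applied are assumptions for illustration. Base-quality enforcement is left out because it depends on the mode.

```python
from dataclasses import dataclass


@dataclass
class ReadRequirements:
    # Hypothetical field names, except keep_unaligned.
    keep_duplicates: bool = False
    keep_failed_vendor_quality: bool = False
    keep_secondary_alignments: bool = False
    keep_supplementary_alignments: bool = False
    keep_unaligned: bool = False
    keep_improperly_placed: bool = False
    min_mapping_quality: int = 0


def satisfies(read: dict, req: ReadRequirements) -> bool:
    """Return True if `read` (a plain dict, by assumption) should be kept."""
    if read.get("duplicate") and not req.keep_duplicates:
        return False
    if read.get("failed_qc") and not req.keep_failed_vendor_quality:
        return False
    if read.get("secondary") and not req.keep_secondary_alignments:
        return False
    if read.get("supplementary") and not req.keep_supplementary_alignments:
        return False
    if not read.get("aligned", True):
        # Unaligned reads have no MAPQ, so the MAPQ threshold is skipped.
        return req.keep_unaligned
    # "Improperly placed" means the mate maps to a different contig.
    if read.get("mate_on_other_contig") and not req.keep_improperly_placed:
        return False
    return read.get("mapq", 0) >= req.min_mapping_quality
```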
A full, or partial, sequence of bases from a contig in a reference genome.
The location on the genome this sequence of bases comes from.
The bases of this part of the reference genome.
The SamHeader message represents the metadata present in the header of a SAM/BAM file.
The VN field from the HD line. Empty if not present (valid formats will match /^[0-9]+\.[0-9]+$/).
@SQ header field in SAM spec. The order of the contigs defines the sorting order.
@RG header field in SAM spec. Read groups.
@PG header field in SAM spec. A program run to generate the alignment data.
@CO header field in SAM spec. One-line text comments.
The GO field from the HD line.
The SO field from the HD line.
The SamReaderOptions message is used to alter the properties of a SamReader. It enables reads to be omitted from parsing based on their attributes, as well as more fine-grained handling of particular fields within the SAM records. Next ID: 12.
Read requirements that must be satisfied before our reader will return a read to use.
Block size to use in htslib, in reading the SAM/BAM. Value <=0 will use the default htslib block size.
Controls if, and at what rate, we discard reads from the input stream. This option allows the user to efficiently remove a random fraction of reads from the source SAM/BAM file. The reads are discarded on the fly before being parsed into protos, so the downsampling is reasonably efficient. If 0.0 (the default protobuf value), this field is ignored. If != 0.0, this must be a value in (0.0, 1.0] giving the probability p that a read is kept, or equivalently the probability (1 - p) that a read is discarded. For example, if downsample_fraction is 0.25, then each read has a 25% chance of being included in the output reads.
Random seed to use with downsampling fraction.
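The downsampling semantics can be sketched with a seeded generator. This is not how the reader itself is implemented (the text says reads are dropped before being parsed into protos); it only illustrates the contract: 0.0 disables downsampling, any other value must lie in (0.0, 1.0], and a fixed seed makes the selection reproducible. The function name is hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def downsample(reads: Iterable[T], fraction: float, seed: int) -> Iterator[T]:
    """Yield each read with probability `fraction`."""
    if fraction == 0.0:
        # Default protobuf value: the option is ignored, keep everything.
        yield from reads
        return
    if not 0.0 < fraction <= 1.0:
        raise ValueError("downsample fraction must be in (0.0, 1.0]")
    rng = random.Random(seed)
    for read in reads:
        # rng.random() is in [0.0, 1.0), so fraction == 1.0 keeps all reads.
        if rng.random() < fraction:
            yield read
```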
By default aligned_quality field is read from QUAL in SAM. If flag is set, aligned_quality field is read from OQ tag in SAM.
By default, this field is empty. If empty, we keep all aux fields if they are parsed. If set, we only keep the aux fields with the names in this list.
How should we handle the aux fields in the SAM record?
`Struct` represents a structured data value, consisting of fields which map to dynamically typed values. In some languages, `Struct` might be supported by a native representation. For example, in scripting languages like JS a struct is represented as an object. The details of that representation are described together with the proto support for the language. The JSON representation for `Struct` is JSON object.
Unordered map of dynamically typed values.
`Value` represents a dynamically typed value which can be either null, a number, a string, a boolean, a recursive struct value, or a list of values. A producer of a value is expected to set one of these variants; the absence of any variant indicates an error. The JSON representation for `Value` is JSON value.
The kind of value.
Represents a null value.
Represents a double value.
Represents an integer value.
Represents a string value.
Represents a boolean value.
Represents a structured value.
Represents a repeated `Value`.
A variant represents a change in DNA sequence relative to a reference sequence. For example, a variant could represent a SNP or an insertion. The definition of the Variant message closely follows the common VCF variant representation. Each of the calls on a variant represents a determination of genotype with respect to that variant. For example, a call might assign a probability of 0.32 to the occurrence of a SNP named rs1234 in a sample named NA12345. NextID: 17
The reference on which this variant occurs (such as `chr20` or `X`). Corresponds to the "CHROM" field of VCF 4.3.
The position at which this variant occurs (0-based inclusive). This corresponds to the first base of the string of reference bases.
The end position (0-based exclusive) of this variant. This corresponds to the first base after the last base in the reference allele. So, the length of the reference allele is (end - start). This is useful for variants that don't explicitly give alternate bases, for example large deletions.
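Because VCF text uses a 1-based POS while `start` and `end` here are 0-based and half-open, a conversion step is needed when reading a VCF line. A minimal sketch, with a hypothetical function name, for variants whose reference allele is given explicitly:

```python
from typing import Tuple


def vcf_pos_to_start_end(pos: int, reference_bases: str) -> Tuple[int, int]:
    """Convert a 1-based VCF POS and REF allele to (start, end)."""
    start = pos - 1                     # 0-based inclusive
    end = start + len(reference_bases)  # 0-based exclusive
    return start, end
```

With POS 100 and REF `ACG`, the result is `(99, 102)`, and `end - start` equals the reference allele length of 3.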
Names for the variant, for example a dbSNP ID. Corresponds to the "ID" field of VCF 4.3.
The reference bases for this variant. They start at the given position.
The bases that appear instead of the reference bases.
A measure of how likely this variant is to be real. A higher value is better. Since this is a Phred-scaled probability (i.e. is -10 * log_10(p) for some p, which depends on whether this is a variant or non-variant site) it is guaranteed to be non-negative. We use -1 to represent the `unset` value.
A list of filters (normally quality filters) this variant has failed. `PASS` indicates this variant has passed all filters.
A map of additional variant information. This must be of the form map<string, string[]> (string key mapping to a list of string values).
The variant calls for this particular variant. Each one represents the determination of genotype with respect to this variant.
DEPRECATED or unused fields of the Variant proto below. These are relics of the Google Genomics API and/or are used to support GA4GH specs.
The ID of the variant set this variant belongs to. DEPRECATED.
The server-generated variant ID, unique across all variants. DEPRECATED.
The date this variant was created, in milliseconds from the epoch. (-- GA4GH also specifies an "updated" timestamp. --) DEPRECATED.
A call represents the determination of genotype with respect to a particular variant. It may include associated information such as quality and phasing. For example, a call might assign a probability of 0.32 to the occurrence of a SNP named rs1234 in a call set with the name NA12345. NextID: 10
The name of the call set this variant call belongs to. Also known as "sample".
The genotype of this variant call. Each value represents either the value of the `referenceBases` field or a 1-based index into `alternateBases`. If a variant had a `referenceBases` value of `T` and an `alternateBases` value of `["A", "C"]`, and the `genotype` was `[2, 1]`, that would mean the call represented the heterozygous value `CA` for this variant. If the `genotype` was instead `[0, 1]`, the represented value would be `TA`. Ordering of the genotype values is important if the `phaseset` is present ('PS' field in the call.info map). Uncalled genotypes (represented as `.` in the GT string) are represented by -1 in this array.
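The genotype indexing rule above can be checked with a short sketch. The function name is hypothetical; index 0 selects the reference bases, positive indices are 1-based indices into the alternate bases, and -1 marks an uncalled allele.

```python
from typing import List, Sequence


def genotype_alleles(
    reference_bases: str,
    alternate_bases: Sequence[str],
    genotype: Sequence[int],
) -> List[str]:
    """Resolve genotype indices to allele strings ('.' when uncalled)."""
    # Prepending the reference makes 1-based alt indices line up directly.
    alleles = [reference_bases] + list(alternate_bases)
    return ["." if g == -1 else alleles[g] for g in genotype]
```

Using the example from the description, `genotype_alleles("T", ["A", "C"], [2, 1])` yields `["C", "A"]`, and `[0, 1]` yields `["T", "A"]`.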
If true, this variant call's genotype ordering implies the phase of the bases and is consistent with any other variant calls in the same reference sequence which have the same phaseset value (the integer 'PS' field in the call.info map). If this is true but the 'PS' field is not set, the call is assumed to be phased with all other calls for which the same state applies.
DEPRECATED. This previously was used as a special-cased field for capturing phasing information. This field should no longer be set, in favor of using the 'PS' field in the call.info map and the `is_phased` boolean attribute.
DEPRECATED.
The genotype log10-likelihoods for this variant call. Each array entry represents how likely a specific genotype is for this call. The value ordering is defined by the GL tag in the VCF spec. If Phred-scaled genotype likelihood scores (PL) are available and log10(P) genotype likelihood scores (GL) are not, PL scores are converted to GL scores. If both are available, the GL scores are stored here and PL scores are omitted (as they are just a lower-resolution representation of GL scores).
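The PL-to-GL conversion mentioned above follows from the definitions: PL is a Phred-scaled likelihood (-10 times the log10 likelihood, normalised so the best genotype scores 0 and rounded to an integer), so dividing by -10 recovers a log10 value. The sketch below uses hypothetical function names and shows why PL is the lower-resolution form.

```python
from typing import List, Sequence


def pl_to_gl(pl: Sequence[int]) -> List[float]:
    """Convert Phred-scaled likelihoods to log10 likelihoods."""
    return [-p / 10.0 for p in pl]


def gl_to_pl(gl: Sequence[float]) -> List[int]:
    """Convert log10 likelihoods to PL, normalised so the best is 0."""
    best = max(gl)
    # Rounding to integers is where resolution is lost.
    return [int(round(-10 * (g - best))) for g in gl]
```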
A map of additional variant call information. This must be of the form map<string, string[]> (string key mapping to a list of string values).
DEPRECATED or unused fields of the VariantCall proto below. These are relics of the Google Genomics API and/or are used to support GA4GH specs.
The ID of the call set this variant call belongs to.
This record type is a catch-all for other types of headers. For example, ##pedigreeDB=http://url_of_pedigrees The VcfExtra message would represent this with key="pedigreeDB", value="http://url_of_pedigrees".
Required by VCF. The key of the extra header field. Note that this key does not have to be unique within a VcfHeader.
Required by VCF. The value of the extra header field.
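The key/value split described for VcfExtra can be sketched as follows. This is a minimal illustration on raw header text; `parse_extra_header` is a hypothetical helper and returns a plain tuple rather than a VcfExtra message.

```python
def parse_extra_header(line):
    """Split an unstructured '##key=value' header line into (key, value)."""
    if not line.startswith("##") or "=" not in line:
        raise ValueError(f"not a key=value header line: {line!r}")
    # Split on the first '=' only, since values such as URLs may contain '='.
    key, value = line[2:].split("=", 1)
    return key, value


print(parse_extra_header("##pedigreeDB=http://url_of_pedigrees"))
```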
The messages below are sub-messages of the VCF header. They are not nested within VcfHeader simply to avoid verbosity. We comment fields in one of three states:
"Required": Required by both the VCF file format and for downstream users of Variant and VariantCall protos.
"Required by VCF": Required by the VCF file format, unused otherwise.
"Optional": Optional within the VCF file format, unused otherwise.
This record type mirrors a VCF "FILTER" header.
Required. The unique ID of the filter. Examples include "PASS", "RefCall".
Required by VCF. The description of the filter.
This record type mirrors a VCF "FORMAT" header.
Required. The unique ID of the FORMAT field. Examples include "GT", "PL".
Required. The number of entries expected. See the description of the corresponding field in the VcfInfo message.
Required. The type of the field. Valid values are "Integer", "Float", "Character", and "String" (same as INFO except "Flag" is not supported).
Required by VCF. The description of the field.
This record type mirrors a VCF header. See https://samtools.github.io/hts-specs/VCFv4.3.pdf for details on the spec.
The required first line of the VCF. Values e.g. "VCFv4.3".
The list of contigs used to produce this VCF. All variants within the VCF must lie on a contig represented here. All contigs must have distinct IDs.
A list of all filters used to produce this VCF. All variants within the VCF must only use filters represented here. All filters must have distinct IDs.
A list of all info tags used to annotate variants within the VCF. All info fields present in Variants must only use infos with IDs represented here. All infos must have distinct IDs.
A list of all format fields used to annotate genotypes within the VCF. All fields present in VariantCalls must only use formats with IDs represented here. All formats must have distinct IDs.
An ordered list of all the sample names present in the VCF. All Variants within the VCF must contain `len(sample_names)` VariantCalls and must be in the same order. I.e. for any Variant v, v.calls[i].call_set_name == sample_names[i] for all i.
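The ordering invariant above can be checked with a short sketch. The function name `calls_match_samples` is hypothetical, and the example uses `SimpleNamespace` stand-ins rather than real Variant and VariantCall protos; only the attribute names `calls` and `call_set_name` come from the description above.

```python
from types import SimpleNamespace


def calls_match_samples(sample_names, variant):
    """Return True if variant.calls line up one-to-one with sample_names."""
    call_names = [call.call_set_name for call in variant.calls]
    # Equal lists imply both the same length and the same order.
    return call_names == list(sample_names)


# Stand-in objects that mimic the attributes described above.
variant = SimpleNamespace(calls=[
    SimpleNamespace(call_set_name="NA12878"),
    SimpleNamespace(call_set_name="NA12891"),
])
print(calls_match_samples(["NA12878", "NA12891"], variant))
```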
A list of all header lines not represented by one of the fields above, stored in key=value format. Keys within the extras may be duplicated.
A list of all header lines not represented by one of the fields above, stored in an unstructured format. Keys within the extras may be duplicated.
This message type mirrors a VCF "INFO" header.
Required. The unique ID of the INFO field. Examples include "MQ0" or "END".
Required. The number of values included with the info field. This should be the string representation of the number, e.g. "1" for a single entry, "2" for a pair of entries, etc. Special cases arise when the number of entries depends on attributes of the Variant or is unknown in advance, and include:
"A": The field has one value per alternate allele.
"R": The field has one value per allele (including the reference).
"G": The field has one value for each possible genotype.
".": The number of values varies, is unknown, or is unbounded.
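To make the special codes concrete, the sketch below resolves a number string to the count of values expected for a given variant. `expected_value_count` is a hypothetical helper, and the `ploidy` parameter is an assumption added for illustration: the count for "G" depends on ploidy, which the description above does not mention.

```python
import math


def expected_value_count(number, num_alternate_alleles, ploidy=2):
    """Return how many values a field should carry, or None if unbounded."""
    if number == ".":
        return None  # varies, unknown, or unbounded
    if number == "A":
        return num_alternate_alleles
    if number == "R":
        return num_alternate_alleles + 1  # alternates plus the reference
    if number == "G":
        # Unordered genotypes: multisets of size `ploidy` drawn from all alleles.
        num_alleles = num_alternate_alleles + 1
        return math.comb(num_alleles + ploidy - 1, ploidy)
    return int(number)  # a fixed count such as "1" or "2"


for code in ["1", "A", "R", "G", "."]:
    print(code, expected_value_count(code, num_alternate_alleles=2))
```

For a diploid call at a site with two alternate alleles, "G" resolves to six genotypes (0/0, 0/1, 1/1, 0/2, 1/2, 2/2).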
Required. The type of the INFO field. Valid values are "Integer", "Float", "Flag", "Character", and "String".
Required by VCF. The description of the field.
Optional. The annotation source used to generate the field.
Optional. The version of the annotation source used to generate the field.
The Vcf{Reader,Writer}Options messages are used to alter the properties of reading and writing variants. They enable certain fields to be omitted from parsing.
A list of all INFO field IDs that should be excluded from parsing.
A list of all FORMAT field IDs that should be excluded from parsing.
If true, the GL and PL format tags are stored in the VariantCall.info map with the type and number as specified in the VCF header, similar to other FORMAT fields. Otherwise, the GL and PL tags are special-cased and available in the VariantCall.genotype_likelihood field, with the enforcement that each is of type=Float and Number=G.
This record type is a catch-all for other headers containing multiple key-value pairs. For example, headers may have META lines that provide metadata about the VCF as a whole, e.g. ##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]> The VcfStructuredExtra message would represent this with key="META", and fields mapping "ID" -> "Assay", "Type" -> "String", etc.
Required by VCF. The key of the extra header field. Note that this key does not have to be unique within a VcfHeader.
Required by VCF. The key=value pairs contained in the structure.
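The mapping from a structured header line to a key plus key=value fields can be sketched as below. `parse_structured_header` is a hypothetical helper that returns a plain dict rather than a VcfStructuredExtra message. It is a simplified parser that only tracks square brackets and double quotes, and it does not handle escaped quotes.

```python
def parse_structured_header(line):
    """Parse '##KEY=<a=b,c=d>' into (KEY, {'a': 'b', 'c': 'd'})."""
    key, _, body = line[2:].partition("=")
    if not (line.startswith("##") and body.startswith("<") and body.endswith(">")):
        raise ValueError(f"not a structured header line: {line!r}")
    fields = {}
    part, depth, in_quotes = [], 0, False
    # A trailing comma flushes the final field through the same code path.
    for ch in body[1:-1] + ",":
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "[" and not in_quotes:
            depth += 1
        elif ch == "]" and not in_quotes:
            depth -= 1
        if ch == "," and depth == 0 and not in_quotes:
            name, _, value = "".join(part).partition("=")
            fields[name.strip()] = value.strip()
            part = []
        else:
            part.append(ch)  # commas inside [] or quotes stay in the value
    return key, fields


line = "##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>"
print(parse_structured_header(line))
```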
A list of all INFO field IDs that should be excluded from writing.
A list of all FORMAT field IDs that should be excluded from writing.
If true, QUAL field values are rounded to one decimal place.
If true, the GL and PL format tags are written from the VariantCall.info map with the type and number as specified in the VCF header. In this case, any values set in the VariantCall.genotype_likelihood field are ignored. Otherwise, the GL and PL tags are retrieved from the VariantCall.genotype_likelihood field, with the enforcement that each is of type=Float and Number=G, and neither GL nor PL should be present in the VariantCall.info map.
If true, the writer will skip writing the VcfHeader.