package nucleus.genomics.v1

Represents one line of a BedGraph file. See https://genome.ucsc.edu/goldenPath/help/bedgraph.html for details on the format.

string reference_name = 1
The reference sequence name, for example `chr1`, `1`, or `chrX`.
int64 start = 2
The start position of the range on the reference, 0-based inclusive.
int64 end = 3
The end position of the range on the reference, 0-based exclusive.
double data_value = 4
The data value, which can be any positive or negative real number.
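As a minimal sketch of how one BedGraph data line maps onto the four fields above, the following uses a plain-Python `NamedTuple` as a stand-in for the generated message class. The names `BedGraphRecord` (as a Python type) and `parse_bedgraph_line` are hypothetical, and the sketch assumes the line is a whitespace-delimited data line, not a `track` or comment line.

```python
from typing import NamedTuple


class BedGraphRecord(NamedTuple):
    """Plain-Python stand-in (hypothetical) for the message fields above."""
    reference_name: str
    start: int         # 0-based inclusive
    end: int           # 0-based exclusive
    data_value: float  # any real number, positive or negative


def parse_bedgraph_line(line: str) -> BedGraphRecord:
    """Split one whitespace-delimited BedGraph data line into four fields."""
    chrom, start, end, value = line.split()
    return BedGraphRecord(chrom, int(start), int(end), float(value))


record = parse_bedgraph_line("chr1\t100\t200\t-0.5")
# With a half-open interval, the covered length is simply end - start.
covered_bases = record.end - record.start
```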

int32 num_fields = 1
The number of fields in the BED file.

Options for reading BED files.

int32 num_fields = 2
Optional. The number of fields to read from the BED file. If this is unset, or set to more fields than are present in the BED file, all fields are read.

This message represents a single BED record. See https://genome.ucsc.edu/FAQ/FAQformat.html#format1 for details.

string reference_name = 1
The reference on which this region occurs. Corresponds to the `chrom` field in the UCSC BED format.
int64 start = 2
The position at which this region occurs (0-based inclusive).
int64 end = 3
The position at which this region ends (0-based exclusive).
string name = 4
The name of the record.
double score = 5
As described in https://genome.ucsc.edu/FAQ/FAQformat.html#format1, score should be an integer in [0, 1000]. However, many non-integer values are seen in BED records in the wild, so we represent this as a double.
BedRecord.Strand strand = 6
The strand on the genome that the record is on.
int64 thick_start = 7
For visualization purposes, the position at which the feature starts to be drawn thickly. In gene structures this corresponds to the start codon. This is zero-based inclusive numbering, like the `start` field.
int64 thick_end = 8
For visualization purposes, the position at which the feature stops being drawn thickly. In gene structures this corresponds to the stop codon. This is zero-based exclusive numbering, like the `end` field.
string item_rgb = 9
A comma-separated RGB value R,G,B for visualization.
int32 block_count = 10
The number of distinct blocks in the BED line (e.g. exon count in gene structures).
string block_sizes = 11
Comma-separated list of block sizes. The number of items in the list should be equal to `block_count`.
string block_starts = 12
Comma-separated list of block start positions. The number of items in the list should be equal to `block_count`. This is zero-based inclusive numbering, like the `start` field.
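The requirement that `block_sizes` and `block_starts` each contain `block_count` items can be checked with a few lines of code. This is only a sketch: `parse_blocks` is a hypothetical helper, and dropping empty items is an assumption made to tolerate the trailing comma that some BED files carry.

```python
from typing import List, Tuple


def parse_blocks(block_count: int, block_sizes: str,
                 block_starts: str) -> List[Tuple[int, int]]:
    """Return (block_start, block_size) pairs after checking list lengths."""
    # Assumption: empty items (e.g. from a trailing comma) are ignored.
    sizes = [int(item) for item in block_sizes.split(",") if item]
    starts = [int(item) for item in block_starts.split(",") if item]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError(
            f"expected {block_count} blocks, got {len(sizes)} sizes "
            f"and {len(starts)} starts")
    return list(zip(starts, sizes))


blocks = parse_blocks(2, "567,488,", "0,3512")
```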

Used in: BedRecord

NO_STRAND = 0
The strand is unspecified, unknown, or not meaningful.
FORWARD_STRAND = 1
REVERSE_STRAND = 2

Options for writing BED files. Currently this is a placeholder message.

(message has no fields)

A single CIGAR operation.

Used in: LinearAlignment

CigarUnit.Operation operation = 1
int64 operation_length = 2
The number of genomic bases that the operation runs for. Required.
string reference_sequence = 3
`referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) and deletions (`DELETE`). Filling this field replaces SAM's MD tag. If the relevant information is not available, this field is unset.

Describes the different types of CIGAR alignment operations that exist. Used wherever CIGAR alignments are used.

Used in: CigarUnit

OPERATION_UNSPECIFIED = 0
ALIGNMENT_MATCH = 1
An alignment match indicates that a sequence can be aligned to the reference without evidence of an INDEL. Unlike the `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` operator does not indicate whether the reference and read sequences are an exact match. This operator is equivalent to SAM's `M`.
INSERT = 2
The insert operator indicates that the read contains evidence of bases being inserted into the reference. This operator is equivalent to SAM's `I`.
DELETE = 3
The delete operator indicates that the read contains evidence of bases being deleted from the reference. This operator is equivalent to SAM's `D`.
SKIP = 4
The skip operator indicates that this read skips a long segment of the reference, but the bases have not been deleted. This operator is commonly used when working with RNA-seq data, where reads may skip long segments of the reference between exons. This operator is equivalent to SAM's `N`.
CLIP_SOFT = 5
The soft clip operator indicates that bases at the start/end of a read have not been considered during alignment. This may occur if the majority of a read maps, except for low quality bases at the start/end of a read. This operator is equivalent to SAM's `S`. Bases that are soft clipped will still be stored in the read.
CLIP_HARD = 6
The hard clip operator indicates that bases at the start/end of a read have been omitted from this alignment. This may occur if this linear alignment is part of a chimeric alignment, or if the read has been trimmed (for example, during error correction or to trim poly-A tails for RNA-seq). This operator is equivalent to SAM's `H`.
PAD = 7
The pad operator indicates that there is padding in an alignment. This operator is equivalent to SAM's `P`.
SEQUENCE_MATCH = 8
This operator indicates that this portion of the aligned sequence exactly matches the reference. This operator is equivalent to SAM's `=`.
SEQUENCE_MISMATCH = 9
This operator indicates that this portion of the aligned sequence is an alignment match to the reference, but a sequence mismatch. This can indicate a SNP or a read error. This operator is equivalent to SAM's `X`.
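One common use of these operations is computing how many reference bases an alignment spans. The sketch below represents a CIGAR as a list of `(operation_name, length)` tuples, which is a simplification of `CigarUnit` and not the generated class. Following the SAM specification, it treats `M`, `D`, `N`, `=`, and `X` as the operations that consume reference bases. The helper name `reference_span` is hypothetical.

```python
from typing import List, Tuple

# Operations that advance the position on the reference (SAM: M, D, N, =, X).
REFERENCE_CONSUMING = {
    "ALIGNMENT_MATCH",
    "DELETE",
    "SKIP",
    "SEQUENCE_MATCH",
    "SEQUENCE_MISMATCH",
}


def reference_span(cigar: List[Tuple[str, int]]) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(length for op, length in cigar if op in REFERENCE_CONSUMING)


cigar = [
    ("CLIP_SOFT", 5),
    ("ALIGNMENT_MATCH", 10),
    ("INSERT", 2),
    ("DELETE", 3),
    ("ALIGNMENT_MATCH", 20),
]
span = reference_span(cigar)  # 10 + 3 + 20 reference bases
```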

This record type records information about a contig. This is used both in VCF header parsing and by GenomeReference objects for querying references. Due to its generality, this message is also used by the FastaReader to provide detailed information on the description line of a FASTA record even in cases where the record does not correspond to a reference genome contig.

Used in: FastaRecord, SamHeader, VcfHeader

string name = 1
Required. The name of the contig. Canonically this is the first non-whitespace-containing string after the `>` marker in a FASTA file. For example, the line `>chr1 more info here` has a name of "chr1" and a description of "more info here".
string description = 2
Ideally this field is filled in as described above, but not all FASTA readers capture the description information after the name. Since a description is not required by the FASTA spec, we cannot distinguish cases where a description was not present from cases where a parser ignored it.
int64 n_bases = 3
The length of this contig in basepairs.
map<string, string> extra = 5
Additional information used when reading and writing VCF headers. An example map of key-value extra fields would transform an input line containing 'assembly=B36,taxonomy=x,species="Homo sapiens"' to a map with "assembly" -> "B36", "taxonomy" -> "x", "species" -> "Homo sapiens". We never use this information internally, other than reading it in so we can write the contig out again.
int32 pos_in_fasta = 4
The position of this contig in the src_fasta file. The first contig would have position 0. TODO: rename to something more generic.
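The name/description split described for `name` and `description` can be sketched as follows. `split_defline` is a hypothetical helper written for illustration; it is not claimed to match what any particular FASTA reader does.

```python
from typing import Tuple


def split_defline(line: str) -> Tuple[str, str]:
    """Split a FASTA description line into (name, description)."""
    body = line.rstrip()
    if body.startswith(">"):
        body = body[1:]
    # The name is the first whitespace-free token; the rest is the description.
    parts = body.split(None, 1)
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return name, description


name, description = split_defline(">chr1 more info here")
```

An empty description here can mean either that none was present or that it was not captured, which is the ambiguity the `description` field comment points out.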

bool keep_true_case = 1
If false, casts all bases to uppercase before returning them.
string alphabet = 2
If set, all sequences are verified to contain only characters present in the input alphabet defined here.
FastaReaderOptions.DeflineParsing defline_parsing = 3
bool include_range_in_records = 4
If true, the `region` field is populated in each FastaRecord.

Used in: FastaReaderOptions

NONE = 0
No parsing is performed, and the `defline` field holds the raw string of the line.
CONTIG_INFO = 1
Parses the description line of each record into a ContigInfo object in the `contig` field.

This message represents a single FASTA record. This can be any FASTA file, representing DNA, RNA, protein, or other sequence.

string defline = 1
If `FastaReaderOptions.defline_parsing` is `NONE`, this field is populated with the raw text of the description line, with the leading '>', any trailing whitespace, and the newline stripped. Otherwise this field is empty.
optional ContigInfo contig = 2
If `FastaReaderOptions.defline_parsing` is `CONTIG_INFO`, this message is populated based on the contents of the description and sequence lines. Otherwise this field is empty. NOTE: the "contig" info provided here is based solely on the record itself. It provides a mechanism to separate the sequence name from its description and includes the number of basepairs in the sequence.
optional Range region = 3
Iff the `FastaReaderOptions.include_range_in_records` field is true, this message is populated with the location of the sequence within the contig. `region.end - region.start` should thus equal the length of the sequence. This could differ from the range [0, len(sequence)) in the case of a query operation for a particular region of a FASTA sequence.
string sequence = 4
The raw sequence letters. Depending on the `FastaReaderOptions.keep_true_case` field, these are either uppercased or kept in their original case.

Options for writing FASTA files. Currently this is a placeholder message but could be used to support different choices on output like the number of columns per line.

(message has no fields)

bool skip_invalid_records = 2
If true, simply drop invalid records. Otherwise, raise an error on invalid records.

This message represents a single FASTQ record.

string id = 1
The sequence identifier. The first line of a FASTQ record begins with '@' and is followed by a sequence identifier (up to the first whitespace character) and then an optional description. This line is parsed into its constituent id and description.
string description = 2
Optional. The description provided in the header line.
string sequence = 3
The raw sequence letters.
string quality = 4
The quality values for the sequence, encoded in ASCII. The string must be the same length as the sequence. The meaning of each base quality may vary: it is usually a Phred-scaled score (-10 * log_10(Pr{call is incorrect})) but differs for some older versions of FASTQs.
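To make the header split and the Phred scaling concrete, here is a small sketch. The helper names are hypothetical. The ASCII offset of 33 is an assumption (it is the common Sanger-style encoding); as noted above, some older FASTQ variants use a different encoding.

```python
from typing import List, Tuple


def parse_fastq_header(line: str) -> Tuple[str, str]:
    """Split the '@' header line into (id, description)."""
    body = line.rstrip("\n")
    if not body.startswith("@"):
        raise ValueError("FASTQ header must begin with '@'")
    parts = body[1:].split(None, 1)
    read_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return read_id, description


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert ASCII quality characters to integer scores (offset assumed)."""
    return [ord(ch) - offset for ch in quality]


def phred_to_error_probability(score: float) -> float:
    """Invert Q = -10 * log10(P): P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


read_id, description = parse_fastq_header("@read1 lane=3")
scores = decode_quality("I5!")
```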

Options for writing FASTQ files. Currently this is a placeholder message but could be used to support different choices on output like whether the pad line should include the header or not.

(message has no fields)

A message encoding the directives contained in a GFF3 file header. Consult the file format reference for detailed descriptions of these directives. Note that we do NOT handle the FASTA directive (a rarely-used method to bundle reference sequences within a GFF file.)

string gff_version = 1
`gff_version`, a string of the format "gff-version 3.#.#" encoding the exact GFF spec version used.
repeated Range sequence_regions = 2
`sequence_regions` is a list of the sequence regions that are referenced by GFF records.
repeated GffHeader.OntologyDirective feature_ontologies = 3
repeated GffHeader.OntologyDirective attribute_ontologies = 4
repeated GffHeader.OntologyDirective source_ontologies = 5
string species = 6
A string name for the biological species analyzed. "This directive indicates the species that the annotations apply to. The preferred format is a NCBI URL that points to the relevant species page in either of the following formats: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6239 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?name=Caenorhabditis+elegans"
optional GffHeader.GenomeBuildDirective genome_build = 7

Used in: GffHeader

string source = 1
string name = 2

An OntologyDirective holds the URI to a sequence ontology database, reflecting the ontology over the entities in the `type`, `source`, and `attributes` fields of a GffRecord.

Used in: GffHeader

string uri = 1

(message has no fields)

This message represents a single GFF3 record. See https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for details on the file format; that document is quoted below. TODO: deal with %-encoding.

optional Range range = 1
`range` field, reflecting the genomic location of the feature. NOTE: this is a 0-based, end-exclusive interval, in contrast to the 1-based encoding used natively in the GFF text format. For reference, consult the documentation for `Range`. This field is required.
string source = 2
`source` designation, as defined by GFF spec: "The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature. Typically this is the name of a piece of software, such as "Genescan" or a database name, such as "Genbank." In effect, the source is used to extend the feature ontology by adding a qualifier to the type creating a new composite type that is a subclass of the type in the type column." Missingness of this field is encoded as "".
string type = 3
`type` designation, as defined by GFF spec: "The type of the feature (previously called the "method"). This is constrained to be either a term from the Sequence Ontology or an SO accession number. The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it." Missingness of this field is encoded as "".
double score = 4
`score` designation, as defined by GFF spec: "The score of the feature, a floating point number. As in earlier versions of the format, the semantics of the score are ill-defined. It is strongly recommended that E-values be used for sequence similarity features, and that P-values be used for ab initio gene prediction features." Missingness of this field is encoded by -infinity.
GffRecord.Strand strand = 5
The strand of the feature, if relevant.
int32 phase = 6
`phase` designation, as defined by the GFF spec: "For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3. For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field. The phase is REQUIRED for all CDS features." Missingness of this field is encoded by -1.
map<string, string> attributes = 7
`attributes`, a free-form map of keys to string values, corresponding to the semi-colon separated attributes field in the GFF text format.
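Two conventions in this message are easy to get wrong when converting from GFF text: the shift from 1-based closed coordinates to the 0-based, end-exclusive `range`, and the sentinel values used for missing `score` and `phase`. The sketch below illustrates both. All function names are hypothetical, and it assumes the GFF text column uses "." for an undefined value, as the GFF3 specification describes.

```python
from typing import Tuple


def gff_to_range(seqid: str, start: int, end: int) -> Tuple[str, int, int]:
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return seqid, start - 1, end


def range_to_gff(reference_name: str, start: int, end: int) -> Tuple[str, int, int]:
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return reference_name, start + 1, end


def parse_score(text: str) -> float:
    """Missing score is encoded as -infinity."""
    return float("-inf") if text == "." else float(text)


def parse_phase(text: str) -> int:
    """Missing phase is encoded as -1."""
    return -1 if text == "." else int(text)


converted = gff_to_range("ctg123", 1000, 9000)
```

The feature length is unchanged by the conversion: `9000 - 1000 + 1` in GFF coordinates equals `9000 - 999` in `Range` coordinates.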

TODO: factor this out (here and BED, at least)

Used in: GffRecord

UNSPECIFIED_STRAND = 0
The strand is unspecified, unknown, or not meaningful.
FORWARD_STRAND = 1
REVERSE_STRAND = 2

(message has no fields)

A linear alignment can be represented by one CIGAR string. Describes the mapped position and local alignment of the read to the reference.

Used in: Read

optional Position position = 1
The position of this alignment.
int32 mapping_quality = 2
The mapping quality of this alignment. Represents how likely the read maps to this position as opposed to other locations. Specifically, this is -10 log10 Pr(mapping position is wrong), rounded to the nearest integer.
repeated CigarUnit cigar = 3
Represents the local alignment of this sequence (alignment matches, indels, etc) against the reference.
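The `mapping_quality` formula can be worked through in code. This is an illustrative sketch with hypothetical function names. It uses Python's built-in `round`, whose behavior at exact .5 ties may differ from what a given aligner does, and it does not handle a probability of exactly 0.

```python
import math


def mapping_quality(p_wrong: float) -> int:
    """MAPQ = -10 * log10(Pr(mapping position is wrong)), rounded."""
    return int(round(-10 * math.log10(p_wrong)))


def probability_wrong(mapq: int) -> float:
    """Invert the Phred scaling: Pr(wrong) = 10 ** (-MAPQ / 10)."""
    return 10 ** (-mapq / 10)


q_confident = mapping_quality(0.001)  # a 1-in-1000 chance of being wrong
q_ambiguous = mapping_quality(0.5)    # two equally good placements
```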

`ListValue` is a wrapper around a repeated field of values. The JSON representation for `ListValue` is JSON array.

Used in: Read, Value, Variant, VariantCall

repeated Value values = 1
Repeated field of dynamically typed values.

`NullValue` is a singleton enumeration to represent the null value for the `Value` type union. The JSON representation for `NullValue` is JSON `null`.

Used in: Value

NULL_VALUE = 0
Null value.

An abstraction for referring to a genomic position, in relation to some already known reference. For now, represents a genomic position as a reference name, a base number on that reference (0-based), and a determination of forward or reverse strand.

Used in: learning.genomics.deepvariant.AlleleCount, LinearAlignment, Read

string reference_name = 1
The name of the reference in whatever reference set is being used.
int64 position = 2
The 0-based offset from the start of the forward strand for that reference.
bool reverse_strand = 3
Whether this position is on the reverse strand, as opposed to the forward strand.

A Program is used in the SAM header to track how alignment data is generated. This is a sub-message of SamHeader, at the same scope to reduce verbosity.

Used in: SamHeader

string id = 2
@PG ID field in SAM spec. The locally unique ID of the program. Used along with `prev_program_id` to define an ordering between programs.
string name = 3
@PG PN field in SAM spec. The display name of the program. This is typically the colloquial name of the tool used, for example 'bwa' or 'picard'.
string command_line = 1
@PG CL field in SAM spec. The command line used to run this program.
string prev_program_id = 4
@PG PP field in SAM spec. The ID of the program run before this one.
string description = 6
@PG DS field in SAM spec. The description of the program.
string version = 5
@PG VN field in SAM spec. The version of the program run.

A 0-based half-open genomic coordinate range for search requests.

Used in: learning.genomics.deepvariant.CandidateHaplotypes, FastaRecord, GffHeader, GffRecord, ReferenceSequence

string reference_name = 1
The reference sequence name, for example `chr1`, `1`, or `chrX`.
int64 start = 2
The start position of the range on the reference, 0-based inclusive.
int64 end = 3
The end position of the range on the reference, 0-based exclusive.
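Half-open ranges make length and overlap computations simple, as the following sketch shows. Ranges are modeled as `(reference_name, start, end)` tuples, a simplification and not the generated `Range` class. Both helper names are hypothetical.

```python
from typing import Tuple

GenomicRange = Tuple[str, int, int]  # (reference_name, start, end)


def range_length(r: GenomicRange) -> int:
    """Number of bases in a 0-based half-open range."""
    return r[2] - r[1]


def ranges_overlap(a: GenomicRange, b: GenomicRange) -> bool:
    """True if both ranges are on the same reference and share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


adjacent = ranges_overlap(("chr1", 0, 10), ("chr1", 10, 20))
sharing_one_base = ranges_overlap(("chr1", 0, 10), ("chr1", 9, 20))
```

Because `end` is exclusive, two ranges that merely touch (one ends at 10, the other starts at 10) do not overlap.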

A read alignment describes a linear alignment of a string of DNA to a [reference sequence][learning.genomics.v1.Reference], in addition to metadata about the fragment (the molecule of DNA sequenced) and the read (the bases which were read by the sequencer). A read is equivalent to a line in a SAM file. A read belongs to exactly one read group and exactly one [read group set][learning.genomics.v1.ReadGroupSet].

For more genomics resource definitions, see [Fundamentals of Google Genomics](https://cloud.google.com/genomics/fundamentals-of-google-genomics).

### Reverse-stranded reads

Mapped reads (reads having a non-null `alignment`) can be aligned to either the forward or the reverse strand of their associated reference. Strandedness of a mapped read is encoded by `alignment.position.reverseStrand`.

If we consider the reference to be a forward-stranded coordinate space of `[0, reference.length)` with `0` as the left-most position and `reference.length` as the right-most position, reads are always aligned left to right. That is, `alignment.position.position` always refers to the left-most reference coordinate and `alignment.cigar` describes the alignment of this read to the reference from left to right. All per-base fields such as `alignedSequence` and `alignedQuality` share this same left-to-right orientation; this is true of reads which are aligned to either strand. For reverse-stranded reads, this means that `alignedSequence` is the reverse complement of the bases that were originally reported by the sequencing machine.

### Generating a reference-aligned sequence string

When interacting with mapped reads, it's often useful to produce a string representing the local alignment of the read to reference. The following pseudocode demonstrates one way of doing this:

    out = ""
    offset = 0
    for c in read.alignment.cigar {
      switch c.operation {
      case "ALIGNMENT_MATCH", "SEQUENCE_MATCH", "SEQUENCE_MISMATCH":
        out += read.alignedSequence[offset:offset+c.operationLength]
        offset += c.operationLength
        break
      case "CLIP_SOFT", "INSERT":
        offset += c.operationLength
        break
      case "PAD":
        out += repeat("*", c.operationLength)
        break
      case "DELETE":
        out += repeat("-", c.operationLength)
        break
      case "SKIP":
        out += repeat(" ", c.operationLength)
        break
      case "CLIP_HARD":
        break
      }
    }
    return out

### Converting to SAM's CIGAR string

The following pseudocode generates a SAM CIGAR string from the `cigar` field. Note that this is a lossy conversion (`cigar.referenceSequence` is lost).

    cigarMap = {
      "ALIGNMENT_MATCH": "M",
      "INSERT": "I",
      "DELETE": "D",
      "SKIP": "N",
      "CLIP_SOFT": "S",
      "CLIP_HARD": "H",
      "PAD": "P",
      "SEQUENCE_MATCH": "=",
      "SEQUENCE_MISMATCH": "X",
    }
    cigarStr = ""
    for c in read.alignment.cigar {
      cigarStr += c.operationLength + cigarMap[c.operation]
    }
    return cigarStr

(== resource_for v1.reads ==)
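The two pieces of pseudocode above can be transcribed into runnable Python roughly as follows. This is a sketch, not the library's implementation: the CIGAR is modeled as a list of `(operation_name, length)` tuples and the read sequence as a plain string, and the function names `aligned_string` and `to_sam_cigar` are hypothetical.

```python
from typing import List, Tuple

Cigar = List[Tuple[str, int]]  # (operation name, operation length)

CIGAR_MAP = {
    "ALIGNMENT_MATCH": "M", "INSERT": "I", "DELETE": "D", "SKIP": "N",
    "CLIP_SOFT": "S", "CLIP_HARD": "H", "PAD": "P",
    "SEQUENCE_MATCH": "=", "SEQUENCE_MISMATCH": "X",
}


def aligned_string(aligned_sequence: str, cigar: Cigar) -> str:
    """Render the read's local alignment against the reference."""
    out = []
    offset = 0  # index into aligned_sequence
    for operation, length in cigar:
        if operation in ("ALIGNMENT_MATCH", "SEQUENCE_MATCH",
                         "SEQUENCE_MISMATCH"):
            out.append(aligned_sequence[offset:offset + length])
            offset += length
        elif operation in ("CLIP_SOFT", "INSERT"):
            offset += length  # bases are in the read but not emitted
        elif operation == "PAD":
            out.append("*" * length)
        elif operation == "DELETE":
            out.append("-" * length)
        elif operation == "SKIP":
            out.append(" " * length)
        # CLIP_HARD: bases are absent from aligned_sequence; nothing to do.
    return "".join(out)


def to_sam_cigar(cigar: Cigar) -> str:
    """Lossy conversion to a SAM CIGAR string."""
    return "".join(f"{length}{CIGAR_MAP[op]}" for op, length in cigar)


example_cigar = [("CLIP_SOFT", 2), ("ALIGNMENT_MATCH", 3), ("DELETE", 2),
                 ("INSERT", 1), ("SEQUENCE_MATCH", 2)]
rendered = aligned_string("NNACGTTA", example_cigar)
sam_cigar = to_sam_cigar(example_cigar)
```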

string id = 1
The server-generated read ID, unique across all reads. This is different from the `fragmentName`.
string read_group_id = 2
The ID of the read group this read belongs to. A read belongs to exactly one read group. This is a server-generated ID which is distinct from SAM's RG tag (for that value, see [ReadGroup.name][learning.genomics.v1.ReadGroup.name]).
string read_group_set_id = 3
The ID of the read group set this read belongs to. A read belongs to exactly one read group set.
string fragment_name = 4
The fragment name. Equivalent to QNAME (query template name) in SAM.
bool proper_placement = 5
The orientation and the distance between reads from the fragment are consistent with the sequencing protocol (SAM flag 0x2).
bool duplicate_fragment = 6
The fragment is a PCR or optical duplicate (SAM flag 0x400).
int32 fragment_length = 7
The observed length of the fragment, equivalent to TLEN in SAM.
int32 read_number = 8
The read number in sequencing. 0-based and less than numberReads. This field replaces SAM flag 0x40 and 0x80.
int32 number_reads = 9
The number of reads in the fragment (extension to SAM flag 0x1).
bool failed_vendor_quality_checks = 10
Whether this read did not pass filters, such as platform or vendor quality controls (SAM flag 0x200).
optional LinearAlignment alignment = 11
The linear alignment for this alignment record. This field is null for unmapped reads.
bool secondary_alignment = 12
Whether this alignment is secondary. Equivalent to SAM flag 0x100. A secondary alignment represents an alternative to the primary alignment for this read. Aligners may return secondary alignments if a read can map ambiguously to multiple coordinates in the genome. By convention, each read has one and only one alignment where both `secondaryAlignment` and `supplementaryAlignment` are false.
bool supplementary_alignment = 13
Whether this alignment is supplementary. Equivalent to SAM flag 0x800. Supplementary alignments are used in the representation of a chimeric alignment. In a chimeric alignment, a read is split into multiple linear alignments that map to different reference contigs. The first linear alignment in the read will be designated as the representative alignment; the remaining linear alignments will be designated as supplementary alignments. These alignments may have different mapping quality scores. In each linear alignment in a chimeric alignment, the read will be hard clipped. The `alignedSequence` and `alignedQuality` fields in the alignment record will only represent the bases for its respective linear alignment.
string aligned_sequence = 14
The bases of the read sequence contained in this alignment record, **without CIGAR operations applied** (equivalent to SEQ in SAM). `alignedSequence` and `alignedQuality` may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
repeated int32 aligned_quality = 15
The quality of the read sequence contained in this alignment record (equivalent to QUAL in SAM). Optionally can be read from OQ tag. See `SamReaderOptions` proto for more details. `alignedSequence` and `alignedQuality` may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
optional Position next_mate_position = 16
The mapping of the primary alignment of the `(readNumber+1)%numberReads` read in the fragment. It replaces mate position and mate strand in SAM.
map<string, ListValue> info = 17
A map of additional read alignment information. This must be of the form map<string, string[]> (string key mapping to a list of string values).
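Several fields above are described in terms of SAM flag bits. The sketch below reassembles only the bits that those descriptions name (0x1, 0x2, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800) from a read modeled as a plain dict. It is illustrative only: `sam_flag_bits` is a hypothetical helper, other SAM bits are deliberately omitted, and mapping `read_number == 0` to 0x40 and the last read to 0x80 is an assumption based on the SAM meaning of "first" and "last" segment.

```python
def sam_flag_bits(read: dict) -> int:
    """Rebuild the SAM flag bits named in the field descriptions."""
    flag = 0
    number_reads = read.get("number_reads", 0)
    if number_reads > 1:
        flag |= 0x1  # template has multiple segments
        read_number = read.get("read_number", 0)
        if read_number == 0:
            flag |= 0x40  # assumed: first segment in the template
        if read_number == number_reads - 1:
            flag |= 0x80  # assumed: last segment in the template
    if read.get("proper_placement"):
        flag |= 0x2
    if read.get("secondary_alignment"):
        flag |= 0x100
    if read.get("failed_vendor_quality_checks"):
        flag |= 0x200
    if read.get("duplicate_fragment"):
        flag |= 0x400
    if read.get("supplementary_alignment"):
        flag |= 0x800
    return flag


flag = sam_flag_bits({
    "number_reads": 2,
    "read_number": 1,
    "proper_placement": True,
    "duplicate_fragment": True,
})
```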

A read group is all the data that's processed the same way by the sequencer. This is a sub-message of SamHeader, at the same scope to reduce verbosity.

Used in: SamHeader

string name = 1
@RG ID field in SAM spec. The read group name.
string sequencing_center = 2
@RG CN field in SAM spec. The name of the sequencing center producing the read.
string description = 3
@RG DS field in SAM spec. A free-form text description of this read group.
string date = 4
@RG DT field in SAM spec.
string flow_order = 5
@RG FO field in SAM spec.
string key_sequence = 6
@RG KS field in SAM spec.
string library_id = 7
@RG LB field in SAM spec. A library is a collection of DNA fragments which have been prepared for sequencing from a sample. This field is important for quality control as error or bias can be introduced during sample preparation.
repeated string program_ids = 8
@RG PG field in SAM spec.
int32 predicted_insert_size = 9
@RG PI field in SAM spec. The predicted insert size of this read group. The insert size is the length of the sequenced DNA fragment from end-to-end, not including the adapters.
string platform = 10
@RG PL field in SAM spec. The platform/technology used to produce the reads.
string platform_model = 11
@RG PM field in SAM spec. The platform model used as part of this run.
string platform_unit = 12
@RG PU field in SAM spec. The platform unit used as part of this experiment, for example flowcell-barcode.lane for Illumina or slide for SOLiD. A unique identifier.
string sample_id = 13
@RG SM field in SAM spec. A client-supplied sample identifier for the reads in this read group.

Describes requirements for a read for it to be returned by a SamReader.

Used in: learning.genomics.deepvariant.AlleleCounterOptions, learning.genomics.deepvariant.MakeExamplesOptions, learning.genomics.deepvariant.PileupImageOptions, SamReaderOptions

bool keep_duplicates = 1
By default, duplicate reads will not be kept. Set this flag to keep them.
bool keep_failed_vendor_quality_checks = 2
By default, reads that failed the vendor quality checks will not be kept. Set this flag to keep them.
bool keep_secondary_alignments = 3
By default, reads that are marked as secondary alignments will not be kept. Set this flag to keep them.
bool keep_supplementary_alignments = 4
By default, reads that are marked as supplementary alignments will not be kept. Set this flag to keep them.
bool keep_unaligned = 5
By default, reads that aren't aligned are not kept. Set this flag to keep them.
bool keep_improperly_placed = 6
Paired (or greater) reads that are improperly placed are not kept by default. Set this flag to keep them. We define improperly placed to mean reads whose (next) mate is mapped to a different contig.
int32 min_mapping_quality = 7
By default, reads with any mapping quality are kept. Setting this field to a positive integer i will only keep reads that have a MAPQ >= i. Note this only applies to aligned reads. If keep_unaligned is set, unaligned reads, which by definition do not have a mapping quality, will still be kept.
int32 min_base_quality = 8
Minimum base quality. This field indicates that we are enforcing a minimum base quality score for a read to be used. How this field is enforced, though, depends on the enum field min_base_quality_mode, as there are multiple ways for this requirement to be interpreted.
ReadRequirements.MinBaseQualityMode min_base_quality_mode = 9
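The default-exclude behavior of the `keep_*` flags and `min_mapping_quality` can be sketched as a predicate over plain dicts. This is an assumption-laden illustration with a hypothetical function name, not the reader's actual filter. It skips `keep_improperly_placed` and `min_base_quality`, since their enforcement depends on details described separately.

```python
def read_is_kept(read: dict, requirements: dict) -> bool:
    """Apply a subset of the requirements to a read modeled as a dict."""
    exclusions = [
        ("duplicate_fragment", "keep_duplicates"),
        ("failed_vendor_quality_checks", "keep_failed_vendor_quality_checks"),
        ("secondary_alignment", "keep_secondary_alignments"),
        ("supplementary_alignment", "keep_supplementary_alignments"),
    ]
    for read_field, keep_flag in exclusions:
        # Each category is dropped unless its keep_* flag is set.
        if read.get(read_field) and not requirements.get(keep_flag):
            return False

    alignment = read.get("alignment")
    if alignment is None:
        # Unaligned reads have no mapping quality, so only keep_unaligned applies.
        return bool(requirements.get("keep_unaligned"))

    min_mapq = requirements.get("min_mapping_quality", 0)
    return alignment.get("mapping_quality", 0) >= min_mapq


aligned_read = {"alignment": {"mapping_quality": 20}}
kept_by_default = read_is_kept(aligned_read, {})
kept_with_threshold = read_is_kept(aligned_read, {"min_mapping_quality": 30})
```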

How should we enforce the min_base_quality requirement?

Used in: ReadRequirements

UNSPECIFIED = 0
If UNSPECIFIED, there are no guarantees on whether and how min_base_quality would be enforced. By default we recommend implementations ignore min_base_quality if this is set to UNSPECIFIED.
ENFORCED_BY_CLIENT = 1
The min_base_quality requirement is enforced not by the reader but by the client itself. This is commonly used when the algorithm for computing whether a read satisfies the min_base_quality requirement is too complex or too specific for the reader.

A full, or partial, sequence of bases from a contig in a reference genome.

optional Range region = 1
The location on the genome this sequence of bases comes from.
string bases = 2
The bases of this part of the reference genome.

The SamHeader message represents the metadata present in the header of a SAM/BAM file.

string format_version = 1
The VN field from the HD line. Empty if not present (valid formats will match /^[0-9]+\.[0-9]+$/).
SamHeader.SortingOrder sorting_order = 2
SamHeader.AlignmentGrouping alignment_grouping = 3
repeated ContigInfo contigs = 4
@SQ header field in SAM spec. The order of the contigs defines the sorting order.
repeated ReadGroup read_groups = 5
@RG header field in SAM spec. Read groups.
repeated Program programs = 6
@PG header field in SAM spec. A program run to generate the alignment data.
repeated string comments = 7
@CO header field in SAM spec. One-line text comments.

The GO field from the HD line.

Used in: SamHeader

NONE = 0
QUERY = 1
REFERENCE = 2

The SO field from the HD line.

Used in: SamHeader

UNKNOWN = 0
UNSORTED = 1
QUERYNAME = 2
COORDINATE = 3

The SamReaderOptions message is used to alter the properties of a SamReader. It enables reads to be omitted from parsing based on their attributes, as well as more fine-grained handling of particular fields within the SAM records. Next ID: 12.

optional ReadRequirements read_requirements = 1
Read requirements that must be satisfied before our reader will return a read to use.
SamReaderOptions.AuxFieldHandling aux_field_handling = 3
int64 hts_block_size = 4
Block size to use in htslib, in reading the SAM/BAM. Value <=0 will use the default htslib block size.
float downsample_fraction = 5
Controls if, and at what rate, we discard reads from the input stream. This option allows the user to efficiently remove a random fraction of reads from the source SAM/BAM file. The reads are discarded on the fly before being parsed into protos, so the downsampling is reasonably efficient. If 0.0 (the default protobuf value), this field is ignored. If != 0.0, then this must be a value in (0.0, 1.0] indicating the probability p that a read is kept, or equivalently the probability (1 - p) that a read is discarded. For example, if downsample_fraction is 0.25, then each read has a 25% chance of being included in the output reads.
int64 random_seed = 6
Random seed to use with downsampling fraction.
bool use_original_base_quality_scores = 10
By default aligned_quality field is read from QUAL in SAM. If flag is set, aligned_quality field is read from OQ tag in SAM.
repeated string aux_fields_to_keep = 11
By default, this field is empty. If empty, we keep all aux fields if they are parsed. If set, we only keep the aux fields with the names in this list.
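The semantics of `downsample_fraction` and `random_seed` described above can be sketched as a seeded, per-read coin flip. This uses Python's `random` module purely for illustration; it is an assumption, not the random number generator or code path the actual reader uses, and `downsample` is a hypothetical name.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def downsample(reads: Iterable[T], downsample_fraction: float = 0.0,
               random_seed: int = 0) -> Iterator[T]:
    """Keep each read independently with probability downsample_fraction."""
    if downsample_fraction == 0.0:
        # The protobuf default: the option is ignored and all reads pass.
        yield from reads
        return
    if not 0.0 < downsample_fraction <= 1.0:
        raise ValueError("downsample_fraction must be in (0.0, 1.0]")
    rng = random.Random(random_seed)  # seeded, so results are reproducible
    for read in reads:
        if rng.random() < downsample_fraction:
            yield read


sample_a = list(downsample(range(1000), 0.25, random_seed=42))
sample_b = list(downsample(range(1000), 0.25, random_seed=42))
```

With the same seed the two samples are identical, and each holds roughly a quarter of the input.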

How should we handle the aux fields in the SAM record?

Used in: SamReaderOptions

UNSPECIFIED = 0
SKIP_AUX_FIELDS = 1
PARSE_ALL_AUX_FIELDS = 2

`Struct` represents a structured data value, consisting of fields which map to dynamically typed values. In some languages, `Struct` might be supported by a native representation. For example, in scripting languages like JS a struct is represented as an object. The details of that representation are described together with the proto support for the language. The JSON representation for `Struct` is JSON object.

Used in: Value

map<string, Value> fields = 1
Unordered map of dynamically typed values.

`Value` represents a dynamically typed value which can be either null, a number, a string, a boolean, a recursive struct value, or a list of values. A producer of a value is expected to set one of these variants; the absence of any variant indicates an error. The JSON representation for `Value` is a JSON value.

Used in: ListValue, Struct

oneof kind
The kind of value.
- NullValue null_value = 1
  Represents a null value.
- double number_value = 2
  Represents a double value.
- int32 int_value = 7
  Represents an integer value.
- string string_value = 3
  Represents a string value.
- bool bool_value = 4
  Represents a boolean value.
- Struct struct_value = 5
  Represents a structured value.
- ListValue list_value = 6
  Represents a repeated `Value`.
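
As a rough illustration of how native values line up with the `kind` oneof, the sketch below maps a Python object to the name of the field that could hold it. The `value_kind` function is hypothetical, and sending integers outside the int32 range to `number_value` is an assumption of this sketch, not documented behaviour.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def value_kind(obj):
    """Return the name of the `kind` field that could hold `obj`."""
    if obj is None:
        return "null_value"
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(obj, bool):
        return "bool_value"
    if isinstance(obj, int):
        # int_value is an int32; larger integers fall back to the double
        # field (an assumption of this sketch).
        if INT32_MIN <= obj <= INT32_MAX:
            return "int_value"
        return "number_value"
    if isinstance(obj, float):
        return "number_value"
    if isinstance(obj, str):
        return "string_value"
    if isinstance(obj, dict):
        return "struct_value"
    if isinstance(obj, (list, tuple)):
        return "list_value"
    raise TypeError(f"no Value kind for {type(obj).__name__}")
```

Exactly one branch is taken for any supported object, mirroring the requirement that a producer sets exactly one variant.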

A variant represents a change in DNA sequence relative to a reference sequence. For example, a variant could represent a SNP or an insertion. The definition of the Variant message closely follows the common VCF variant representation. Each of the calls on a variant represents a determination of genotype with respect to that variant. For example, a call might assign a probability of 0.32 to the occurrence of a SNP named rs1234 in a sample named NA12345. NextID: 17

Used in: learning.genomics.deepvariant.CallVariantsOutput, learning.genomics.deepvariant.DeepVariantCall

string reference_name = 14
The reference on which this variant occurs (such as `chr20` or `X`). Corresponds to the "CHROM" field of VCF 4.3.
int64 start = 16
The position at which this variant occurs (0-based inclusive). This corresponds to the first base of the string of reference bases.
int64 end = 13
The end position (0-based exclusive) of this variant. This corresponds to the first base after the last base in the reference allele. So, the length of the reference allele is (end - start). This is useful for variants that don't explicitly give alternate bases, for example large deletions.
repeated string names = 3
Names for the variant, for example a dbSNP ID. Corresponds to the "ID" field of VCF 4.3.
string reference_bases = 6
The reference bases for this variant. They start at the given position.
repeated string alternate_bases = 7
The bases that appear instead of the reference bases.
double quality = 8
A measure of how likely this variant is to be real. A higher value is better. Since this is a Phred-scaled probability (i.e. is -10 * log_10(p) for some p, which depends on whether this is a variant or non-variant site) it is guaranteed to be non-negative. We use -1 to represent the `unset` value.
repeated string filter = 9
A list of filters (normally quality filters) this variant has failed. `PASS` indicates this variant has passed all filters.
map<string, ListValue> info = 10
A map of additional variant information. This must be of the form map<string, string[]> (string key mapping to a list of string values).
repeated VariantCall calls = 11
The variant calls for this particular variant. Each one represents the determination of genotype with respect to this variant.
string variant_set_id = 15
The ID of the variant set this variant belongs to. DEPRECATED. This field and the fields below are deprecated or unused fields of the Variant proto. These are relics of the Google Genomics API and/or are used to support GA4GH specs.
string id = 2
The server-generated variant ID, unique across all variants. DEPRECATED.
int64 created = 12
The date this variant was created, in milliseconds from the epoch. (GA4GH also specifies an "updated" timestamp.) DEPRECATED.
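
The coordinate and quality conventions of the Variant fields can be checked with a short Python sketch. The function names are hypothetical; the arithmetic follows the field descriptions: `end` is exclusive, so the reference allele length is `end - start`, and `quality` is `-10 * log_10(p)`.

```python
import math


def reference_allele_length(start, end):
    """Length of the reference allele for 0-based [start, end) coordinates."""
    return end - start


def end_from_reference_bases(start, reference_bases):
    """Exclusive end position implied by an explicit reference allele."""
    return start + len(reference_bases)


def phred_quality(p):
    """Phred-scaled value of a probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)
```

For instance, a variant with `start = 100` and `reference_bases = "ACGT"` has `end = 104`, and a probability of 0.001 corresponds to a quality of about 30.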

A call represents the determination of genotype with respect to a particular variant. It may include associated information such as quality and phasing. For example, a call might assign a probability of 0.32 to the occurrence of a SNP named rs1234 in a call set with the name NA12345. NextID: 10

Used in: Variant

string call_set_name = 9
The name of the call set this variant call belongs to. Also known as "sample".
repeated int32 genotype = 7
The genotype of this variant call. Each value represents either the value of the `reference_bases` field (0) or a 1-based index into `alternate_bases`. If a variant had a `reference_bases` value of `T` and an `alternate_bases` value of `["A", "C"]`, and the `genotype` was `[2, 1]`, that would mean the call represented the heterozygous value `CA` for this variant. If the `genotype` was instead `[0, 1]`, the represented value would be `TA`. Ordering of the genotype values is important if the phase set is present ('PS' field in the call.info map). Uncalled genotypes (represented as `.` in the GT string) are represented by -1 in this array.
bool is_phased = 10
If true, this variant call's genotype ordering implies the phase of the bases and is consistent with any other variant calls in the same reference sequence which have the same phaseset value (the integer 'PS' field in the call.info map). If this is true but the 'PS' field is not set, the call is assumed to be phased with all other calls for which the same state applies.
string phaseset = 5
DEPRECATED. This previously was used as a special-cased field for capturing phasing information. This field should no longer be set, in favor of using the 'PS' field in the call.info map and the `is_phased` boolean attribute.
repeated double genotype_likelihood = 6
The genotype log10-likelihoods for this variant call. Each array entry represents how likely a specific genotype is for this call. The value ordering is defined by the GL tag in the VCF spec. If Phred-scaled genotype likelihood scores (PL) are available and log10(P) genotype likelihood scores (GL) are not, PL scores are converted to GL scores. If both are available, the GL scores are stored here and PL scores are omitted (as they are just a lower-resolution representation of GL scores).
map<string, ListValue> info = 2
A map of additional variant call information. This must be of the form map<string, string[]> (string key mapping to a list of string values).
string call_set_id = 8
The ID of the call set this variant call belongs to. This is one of the deprecated or unused fields of the VariantCall proto. These are relics of the Google Genomics API and/or are used to support GA4GH specs.
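
The `genotype` encoding and the PL-to-GL conversion described above can be sketched in a few lines of Python. Both helpers are hypothetical and work on plain lists rather than on VariantCall protos. The conversion uses the definitions PL = -10 * log10(P) and GL = log10(P), hence GL = -PL / 10.

```python
def genotype_alleles(reference_bases, alternate_bases, genotype):
    """Translate genotype indices into allele strings.

    Index 0 is the reference allele, i >= 1 is alternate_bases[i - 1],
    and -1 (an uncalled allele) is returned as None.
    """
    alleles = [reference_bases] + list(alternate_bases)
    return [None if g < 0 else alleles[g] for g in genotype]


def pl_to_gl(pl_values):
    """Convert Phred-scaled likelihoods (PL) to log10 likelihoods (GL)."""
    return [-pl / 10.0 for pl in pl_values]
```

With `reference_bases = "T"` and `alternate_bases = ["A", "C"]`, the genotype `[2, 1]` decodes to `["C", "A"]` and `[0, 1]` decodes to `["T", "A"]`, matching the example in the `genotype` description.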

This record type is a catch-all for other types of headers. For example, given the header line `##pedigreeDB=http://url_of_pedigrees`, the VcfExtra message would represent this with key="pedigreeDB", value="http://url_of_pedigrees".

Used in: VcfHeader, VcfStructuredExtra

string key = 1
Required by VCF. The key of the extra header field. Note that this key does not have to be unique within a VcfHeader.
string value = 2
Required by VCF. The value of the extra header field.
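
A minimal Python sketch of how such a header line splits into `key` and `value`. The `parse_extra_header` function is hypothetical and returns a plain dict rather than a VcfExtra proto. It splits on the first `=` only, so values that themselves contain `=` survive intact.

```python
def parse_extra_header(line):
    """Split a '##key=value' header line into its key and value."""
    if not line.startswith("##"):
        raise ValueError("header lines must start with '##'")
    # partition() splits on the first '=' only.
    key, sep, value = line[2:].partition("=")
    if not sep:
        raise ValueError("expected a key=value header line")
    return {"key": key, "value": value}
```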

This record type mirrors a VCF "FILTER" header.

This message and the other sub-messages of the VCF header are not nested within VcfHeader simply to avoid verbosity. We comment fields in one of three states:
- "Required": Required by both the VCF file format and for downstream users of Variant and VariantCall protos.
- "Required by VCF": Required by the VCF file format, unused otherwise.
- "Optional": Optional within the VCF file format, unused otherwise.

Used in: VcfHeader

string id = 1
Required. The unique ID of the filter. Examples include "PASS", "RefCall".
string description = 2
Required by VCF. The description of the filter.

This record type mirrors a VCF "FORMAT" header.

Used in: VcfHeader

string id = 1
Required. The unique ID of the FORMAT field. Examples include "GT", "PL".
string number = 2
Required. The number of entries expected. See the description of the `number` field in the VcfInfo message.
string type = 3
Required. The type of the field. Valid values are "Integer", "Float", "Character", and "String" (same as INFO except "Flag" is not supported).
string description = 4
Required by VCF. The description of the field.

This record type mirrors a VCF header. See https://samtools.github.io/hts-specs/VCFv4.3.pdf for details on the spec.

string fileformat = 1
The required first line of the VCF. Values e.g. "VCFv4.3".
repeated ContigInfo contigs = 2
The list of contigs used to produce this VCF. All variants within the VCF must lie on a contig represented here. All contigs must have distinct IDs.
repeated VcfFilterInfo filters = 3
A list of all filters used to produce this VCF. All variants within the VCF must only use filters represented here. All filters must have distinct IDs.
repeated VcfInfo infos = 4
A list of all info tags used to annotate variants within the VCF. All info fields present in Variants must only use infos with IDs represented here. All infos must have distinct IDs.
repeated VcfFormatInfo formats = 5
A list of all format fields used to annotate genotypes within the VCF. All fields present in VariantCalls must only use formats with IDs represented here. All formats must have distinct IDs.
repeated string sample_names = 6
An ordered list of all the sample names present in the VCF. All Variants within the VCF must contain `len(sample_names)` VariantCalls and must be in the same order. I.e. for any Variant v, v.calls[i].call_set_name == sample_names[i] for all i.
repeated VcfStructuredExtra structured_extras = 8
A list of all header lines that are not among those represented above, stored in a key=value format. The key within the extras may be duplicated.
repeated VcfExtra extras = 7
A list of all header lines that are not among those represented above, stored in an unstructured format. The key within the extras may be duplicated.
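
The ordering requirement on `sample_names` can be expressed as a small check. This is a sketch using plain dicts as stand-ins for the Variant and VariantCall protos; `calls_match_sample_names` is a hypothetical helper.

```python
def calls_match_sample_names(sample_names, variant):
    """Check that v.calls[i].call_set_name == sample_names[i] for all i.

    `variant` is a dict stand-in: {"calls": [{"call_set_name": ...}, ...]}.
    Comparing the two lists also enforces len(calls) == len(sample_names).
    """
    names = [call["call_set_name"] for call in variant["calls"]]
    return names == list(sample_names)
```

A variant whose calls are listed in a different order from the header, or that is missing a sample, fails the check.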

This message type mirrors a VCF "INFO" header.

Used in: VcfHeader

string id = 1
Required. The unique ID of the INFO field. Examples include "MQ0" or "END".
string number = 2
Required. The number of values included with the info field. This should be the string representation of the number, e.g. "1" for a single entry, "2" for a pair of entries, etc. Special cases arise when the number of entries depends on attributes of the Variant or is unknown in advance, and include: "A": The field has one value per alternate allele. "R": The field has one value per allele (including the reference). "G": The field has one value for each possible genotype. ".": The number of values varies, is unknown, or is unbounded.
string type = 3
Required. The type of the INFO field. Valid values are "Integer", "Float", "Flag", "Character", and "String".
string description = 4
Required by VCF. The description of the field.
string source = 5
Optional. The annotation source used to generate the field.
string version = 6
Optional. The version of the annotation source used to generate the field.
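
To make the `number` codes concrete, the sketch below computes how many values a field should carry for a given variant. The `expected_value_count` function is hypothetical. The count for "G" uses the combinatorial formula from the VCF specification (the number of unordered genotypes over all alleles at the given ploidy), which this page does not spell out, and the default ploidy of 2 is an assumption.

```python
from math import comb


def expected_value_count(number, num_alternate_alleles, ploidy=2):
    """Expected number of values for a field, or None if unbounded."""
    if number == ".":
        return None  # varies, unknown, or unbounded
    if number == "A":
        return num_alternate_alleles
    if number == "R":
        return num_alternate_alleles + 1  # alternates plus the reference
    if number == "G":
        num_alleles = num_alternate_alleles + 1
        # Unordered selections of `ploidy` alleles, repetition allowed.
        return comb(num_alleles + ploidy - 1, ploidy)
    return int(number)  # plain counts such as "1" or "2"
```

For a diploid sample at a site with two alternate alleles, "A" gives 2, "R" gives 3, and "G" gives 6.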

The Vcf{Reader,Writer}Options messages are used to alter the properties of reading and writing variants. They enable certain fields to be omitted from parsing.

repeated string excluded_info_fields = 3
A list of all INFO field IDs that should be excluded from parsing.
repeated string excluded_format_fields = 4
A list of all FORMAT field IDs that should be excluded from parsing.
bool store_gl_and_pl_in_info_map = 5
If true, the GL and PL format tags are stored in the VariantCall.info map with the type and number as specified in the VCF header, similar to other FORMAT fields. Otherwise, the GL and PL tags are special-cased and available in the VariantCall.genotype_likelihood field, with the enforcement that each is of type=Float and Number=G.

This record type is a catch-all for other headers containing multiple key-value pairs. For example, headers may have META lines that provide metadata about the VCF as a whole, e.g. `##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>`. The VcfStructuredExtra message would represent this with key="META", and fields mapping "ID" -> "Assay", "Type" -> "String", etc.

Used in: VcfHeader

string key = 1
Required by VCF. The key of the extra header field. Note that this key does not have to be unique within a VcfHeader.
repeated VcfExtra fields = 2
Required by VCF. The key=value pairs contained in the structure.
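
A rough Python sketch of how a structured header line could be broken into a key and a list of key=value fields. Both functions are hypothetical and return plain dicts. The splitter only separates on commas outside brackets and double quotes, so `Values=[WholeGenome, Exome]` stays in one piece; escaped quotes are not handled.

```python
def split_top_level(body):
    """Split on commas that are not inside [...], <...>, or double quotes."""
    parts, current, depth, in_quotes = [], [], 0, False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch in "[<":
            depth += 1
        elif not in_quotes and ch in "]>":
            depth -= 1
        if ch == "," and depth == 0 and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_structured_extra(line):
    """Parse '##KEY=<k1=v1,k2=v2,...>' into a key and its fields."""
    if not line.startswith("##"):
        raise ValueError("header lines must start with '##'")
    key, sep, rest = line[2:].partition("=")
    if not sep or not (rest.startswith("<") and rest.endswith(">")):
        raise ValueError("expected '##KEY=<...>'")
    fields = []
    for part in split_top_level(rest[1:-1]):
        name, _, value = part.partition("=")
        fields.append({"key": name, "value": value})
    return {"key": key, "fields": fields}
```

Keeping `fields` as an ordered list of key/value pairs, rather than a dict, matches the `repeated VcfExtra fields` layout and preserves the order in which the pairs appear in the header line.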

repeated string excluded_info_fields = 7
A list of all INFO field IDs that should be excluded from writing.
repeated string excluded_format_fields = 8
A list of all FORMAT field IDs that should be excluded from writing.
bool round_qual_values = 6
If true, QUAL field values are rounded to one decimal place.
bool retrieve_gl_and_pl_from_info_map = 9
If true, the GL and PL format tags are written from the VariantCall.info map with the type and number as specified in the VCF header. In this case, any values set in the VariantCall.genotype_likelihood field are ignored. Otherwise, the GL and PL tags are retrieved from the VariantCall.genotype_likelihood field, with the enforcement that each is of type=Float and Number=G, and neither GL nor PL should be present in the VariantCall.info map.
bool exclude_header = 10
If true, the writer will skip writing the VcfHeader.

package nucleus.genomics.v1

message BedGraphRecord

string reference_name = 1

int64 start = 2

int64 end = 3

double data_value = 4

message BedHeader

int32 num_fields = 1

message BedReaderOptions

int32 num_fields = 2

message BedRecord

string reference_name = 1

int64 start = 2

int64 end = 3

string name = 4

double score = 5

BedRecord.Strand strand = 6

int64 thick_start = 7

int64 thick_end = 8

string item_rgb = 9

int32 block_count = 10

string block_sizes = 11

string block_starts = 12

enum BedRecord.Strand

NO_STRAND = 0

FORWARD_STRAND = 1

REVERSE_STRAND = 2

message BedWriterOptions

message CigarUnit

CigarUnit.Operation operation = 1

int64 operation_length = 2

string reference_sequence = 3

enum CigarUnit.Operation

OPERATION_UNSPECIFIED = 0

ALIGNMENT_MATCH = 1

INSERT = 2

DELETE = 3

SKIP = 4

CLIP_SOFT = 5

CLIP_HARD = 6

PAD = 7

SEQUENCE_MATCH = 8

SEQUENCE_MISMATCH = 9

message ContigInfo

string name = 1

string description = 2

int64 n_bases = 3

map<string, string> extra = 5

int32 pos_in_fasta = 4

message FastaReaderOptions

bool keep_true_case = 1

string alphabet = 2

FastaReaderOptions.DeflineParsing defline_parsing = 3

bool include_range_in_records = 4

enum FastaReaderOptions.DeflineParsing

NONE = 0

CONTIG_INFO = 1

message FastaRecord

string defline = 1

optional ContigInfo contig = 2

optional Range region = 3

string sequence = 4

message FastaWriterOptions

message FastqReaderOptions

bool skip_invalid_records = 2

message FastqRecord

string id = 1

string description = 2

string sequence = 3

string quality = 4

message FastqWriterOptions

message GffHeader

string gff_version = 1

repeated Range sequence_regions = 2

repeated GffHeader.OntologyDirective feature_ontologies = 3

repeated GffHeader.OntologyDirective attribute_ontologies = 4

repeated GffHeader.OntologyDirective source_ontologies = 5

string species = 6

optional GffHeader.GenomeBuildDirective genome_build = 7

message GffHeader.GenomeBuildDirective