package vg


Alignments link query strings, such as other genomes or reads, to Paths.

string sequence = 1
The sequence that has been aligned.
optional Path path = 2
The Path that the sequence follows in the graph it has been aligned to, containing the `Edit`s that modify the graph to produce the sequence.
string name = 3
The name of the sequence that has been aligned. Similar to read name in BAM.
bytes quality = 4
The quality scores for the sequence, as values on a 0-255 scale.
int32 mapping_quality = 5
The mapping quality score for the alignment, in Phreds.
int32 score = 6
The score for the alignment, in points.
int32 query_position = 7
The offset in the query at which this Alignment occurs.
string sample_name = 9
The name of the sample that produced the aligned read.
string read_group = 10
The name of the read group to which the aligned read belongs.
optional Alignment fragment_prev = 11
The previous Alignment in the fragment. Contains just enough information to locate the full Alignment; e.g. contains an Alignment with only a name, or only a graph mapping position.
optional Alignment fragment_next = 12
Similarly, the next Alignment in the fragment.
bool is_secondary = 15
Flag marking the Alignment as secondary. All but one maximal-scoring alignment of a given read in a GAM file must be secondary.
double identity = 16
Portion of aligned bases that are perfect matches, or 0 if no bases are aligned.
repeated Path fragment = 17
An estimate of the length of the fragment, if this Alignment is paired.
repeated Locus locus = 18
The loci that this alignment supports. TODO: get rid of this, we have annotations in our data model again.
repeated Position refpos = 19
Position of the alignment in reference paths embedded in the graph.
bool read_paired = 20
SAMTools-style flags
bool read_mapped = 21
bool mate_unmapped = 22
bool read_on_reverse_strand = 23
bool mate_on_reverse_strand = 24
bool soft_clipped = 25
bool discordant_insert_size = 26
double uniqueness = 27
The fraction of bases in the alignment that are covered by MEMs with <=1 total hits in the graph
double correct = 28
Correctness metric: 1 = perfectly aligned to truth, 0 = not overlapping the true alignment.
repeated int32 secondary_score = 29
The ordered list of scores of secondary mappings
double fragment_score = 30
Score under the given fragment model; assume higher is better.
bool mate_mapped_to_disjoint_subgraph = 31
string fragment_length_distribution = 32
The fragment length distribution under which a paired-end alignment was aligned.
bool haplotype_scored = 33
True if this alignment's score is adjusted for haplotype consistency, and false otherwise.
double haplotype_logprob = 34
Actual log probability haplotype consistency likelihood
double time_used = 35
The time this alignment took
optional Position to_correct = 36
A path/offset/orientation pair specifying the distance to the correct alignment
bool correctly_mapped = 37
This can be set to true to annotate the Alignment as having been mapped correctly.
optional google.protobuf.Struct annotation = 100
Annotations carried along with the Alignment.
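
The `identity` field can be illustrated with a small Python sketch. It is not vg's implementation: the function name is hypothetical, edits are plain `(from_length, to_length, sequence)` tuples rather than generated protobuf objects, and it assumes that "aligned bases" are counted in read bases (`to_length`) and that a perfect match is an edit with equal lengths and an empty sequence.

```python
def identity(edits):
    """Fraction of aligned read bases that sit in perfect-match edits.

    `edits` is an iterable of (from_length, to_length, sequence) tuples,
    a hypothetical stand-in for the Edit messages of an Alignment's Path.
    """
    matched = 0
    total = 0
    for from_length, to_length, sequence in edits:
        total += to_length
        # Assumed definition of a perfect match: same length, no replacement.
        if from_length == to_length and not sequence:
            matched += to_length
    # Return 0 when no bases are aligned, as the field description requires.
    return matched / total if total else 0.0
```

For example, three matched bases followed by a one-base substitution give three perfect matches out of four aligned bases.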

Summarizes reads that map to a single position in the graph. This structure is pretty much identical to a line in Samtools pileup format. If qualities are set, they must have size = num_bases.

Used in: NodePileup

int32 ref_base = 1
int32 num_bases = 2
string bases = 3
bytes qualities = 4

*Edges* describe linkages between nodes. They are bidirected, connecting the end (default) or start of the "from" node to the start (default) or end of the "to" node.

Used in: EdgePileup, Graph, LocationSupport

int64 from = 1
ID of upstream node.
int64 to = 2
ID of downstream node.
bool from_start = 3
If the edge leaves from the 5' (start) of a node.
bool to_end = 4
If the edge goes to the 3' (end) of a node.
int32 overlap = 5
Length of overlap between the connected `Node`s.
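
The two flags can be read as selecting which side of each node the edge attaches to. The following Python sketch spells that out; `edge_sides` and `same_linkage` are hypothetical helper names, not part of vg, and the arguments mirror the fields listed above.

```python
def edge_sides(from_id, to_id, from_start=False, to_end=False):
    """Return the two node sides an Edge attaches to as (node_id, side) pairs."""
    # By default the edge leaves the end of "from" and enters the start of "to".
    from_side = "start" if from_start else "end"
    to_side = "end" if to_end else "start"
    return (from_id, from_side), (to_id, to_side)


def same_linkage(edge_a, edge_b):
    """True if two edges (tuples of edge_sides arguments) join the same two sides."""
    return set(edge_sides(*edge_a)) == set(edge_sides(*edge_b))
```

Because edges are bidirected, an edge from node 1 to node 2 with both flags set joins the same two sides as a plain edge from node 2 to node 1, which `same_linkage((1, 2, True, True), (2, 1))` reports.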

Keeps a pileup-like record for reads that span edges.

Used in: Pileup

optional Edge edge = 1
int32 num_reads = 2
total reads mapped
int32 num_forward_reads = 3
number of reads mapped on forward strand
bytes qualities = 4

Edits describe how to generate a new string from elements in the graph. To determine the new string, just walk the series of edits, stepping from_length distance in the basis node, and to_length in the novel element, replacing from_length in the basis node with the sequence. There are several types of Edit:
- *matches*: from_length == to_length; sequence is empty
- *snps*: from_length == to_length; sequence = alt
- *deletions*: to_length == 0 && from_length > to_length; sequence is empty
- *insertions*: from_length < to_length; sequence = alt

Used in: Mapping

int32 from_length = 1
Length in the target/ref sequence that is removed.
int32 to_length = 2
Length in read/alt of the sequence it is replaced with.
string sequence = 3
The replacement sequence, if different from the original sequence.
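
The walk over a series of Edits can be sketched in Python as follows. This is a minimal illustration rather than vg's implementation: the `Edit` tuple and the `classify` and `apply_edits` helpers are hypothetical names, and the basis is assumed to be a single forward-strand node sequence.

```python
from typing import List, NamedTuple


class Edit(NamedTuple):
    """Hypothetical stand-in for the Edit message (not the generated class)."""
    from_length: int
    to_length: int
    sequence: str = ""


def classify(edit: Edit) -> str:
    """Name the kind of Edit using the rules in the Edit description."""
    if edit.from_length == edit.to_length:
        return "match" if not edit.sequence else "snp"
    if edit.to_length == 0:
        return "deletion"
    if edit.from_length < edit.to_length:
        return "insertion"
    return "other"  # a case the description does not name


def apply_edits(basis: str, edits: List[Edit]) -> str:
    """Walk the edits over `basis` and return the new string."""
    out = []
    pos = 0  # current offset in the basis sequence
    for edit in edits:
        if edit.sequence:
            out.append(edit.sequence)
        else:
            # An empty sequence copies to_length bases from the basis;
            # for a deletion to_length is 0, so nothing is copied.
            out.append(basis[pos:pos + edit.to_length])
        pos += edit.from_length
    return "".join(out)
```

Applied to the basis `ACGTACGT`, a three-base match, a substitution to `A`, a two-base deletion, an insertion of `GG` and a final two-base match consume all eight basis bases and produce an eight-base string.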

Describes a genotype at a particular locus.

Used in: Locus

repeated int32 allele = 1
These refer to the offsets of the alleles in the Locus object.
bool is_phased = 2
double likelihood = 3
double log_likelihood = 4
Likelihood natural logged.
double log_prior = 5
Prior natural logged.
double log_posterior = 6
Posterior natural logged (unnormalized).
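
Because the posterior is stored unnormalized and in natural-log space, comparing genotypes as probabilities needs a normalization step. The Python sketch below shows one way to do it, assuming Bayes' rule with the evidence term dropped (so the unnormalized log posterior is the log likelihood plus the log prior). The function names are hypothetical and this is not taken from vg.

```python
import math


def unnormalized_log_posterior(log_likelihood, log_prior):
    """Assumed relationship: log posterior = log likelihood + log prior (up to a constant)."""
    return log_likelihood + log_prior


def normalize(log_posteriors):
    """Turn unnormalized natural-log posteriors into probabilities that sum to 1."""
    if not log_posteriors:
        return []
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_posteriors)
    weights = [math.exp(lp - peak) for lp in log_posteriors]
    total = sum(weights)
    return [w / total for w in weights]
```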

*Graphs* are collections of nodes and edges. They can represent subgraphs of larger graphs or be wholly self-sufficient. Protobuf memory limits of 67108864 bytes mean we typically keep them small, generating graphs as collections of smaller subgraphs.

repeated Node node = 1
The `Node`s that make up the graph.
repeated Edge edge = 2
The `Edge`s that connect the `Node`s in the graph.
repeated Path path = 3
A set of named `Path`s that visit sequences of oriented `Node`s.
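
One way to picture "collections of smaller subgraphs" is to batch nodes into chunks whose estimated size stays under a budget. The Python sketch below is only an illustration under stated assumptions: `chunk_by_size` is a hypothetical helper, size is estimated by a caller-supplied function (sequence length in the tests), and edges or paths that cross chunk boundaries are not handled.

```python
# 64 MiB, which equals the 67108864-byte limit quoted in the Graph description.
PROTOBUF_LIMIT_BYTES = 64 * 1024 * 1024


def chunk_by_size(items, size_of, budget):
    """Group items into consecutive chunks whose estimated size stays within budget.

    An item larger than the budget still gets a chunk of its own.
    """
    chunks, current, used = [], [], 0
    for item in items:
        size = size_of(item)
        if current and used + size > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(item)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

In practice the budget would be set well below `PROTOBUF_LIMIT_BYTES`, since the serialized size of a message is larger than the sum of its sequence lengths.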

Used to serialize kmer matches.

string sequence = 1
int64 node_id = 2
sint32 position = 3
bool backward = 4
If true, this kmer is backwards relative to its node, and position counts from the end of the node.

Support pinned to a location, which can be either a node or an edge

optional Support support = 1
The support
oneof oneof_location
The location
- Edge edge = 2
- int64 node_id = 3

Describes a genetic locus with multiple possible alleles, a genotype, and observational support.

Used in: Alignment

string name = 1
A locus may have an identifying name.
repeated Path allele = 2
These are all the alleles at the locus, not just the called ones. Note that a primary reference allele may or may not appear.
repeated Support support = 3
These supports are per-allele, matching the alleles above
repeated Genotype genotype = 4
Sorted by likelihood or posterior; the first one is the "call".
optional Support overall_support = 5
We also have a Support for the locus overall, because reads may have supported multiple alleles and we want to know how many total there were.
repeated double allele_log_likelihood = 6
We track the likelihood of each allele individually, in addition to genotype likelihoods. Stores the likelihood natural logged.

A Mapping defines the relationship between a node in the system and another entity. An empty edit list implies a complete match; however, it is preferred to specify the full edit structure, as it is more complex to handle special cases.

Used in: Path

optional Position position = 1
The position at which the first Edit, if any, in the Mapping starts. Inclusive.
repeated Edit edit = 2
The series of `Edit`s that transform the mapped region into the read/alt sequence.
int64 rank = 5
The 1-based rank of the mapping in its containing path.
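
Since an empty edit list is a special case, code that consumes Mappings may prefer to expand it first. The Python sketch below does that, assuming a "complete match" runs from the Mapping's offset to the end of the node. `explicit_edits` is a hypothetical helper, and edits are plain `(from_length, to_length, sequence)` tuples rather than protobuf objects.

```python
def explicit_edits(edits, node_length, offset=0):
    """Return the edits unchanged, or a single full-length match if the list is empty."""
    if edits:
        return list(edits)
    # Assumption: the implied match covers the rest of the node after `offset`.
    remaining = node_length - offset
    return [(remaining, remaining, "")]
```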

A subgraph of the unrolled Graph in which each non-branching path is associated with an alignment of part of the read and part of the graph, such that any path through the MultipathAlignment indicates a valid alignment of a read to the graph.

string sequence = 1
bytes quality = 2
string name = 3
string sample_name = 4
string read_group = 5
repeated Subpath subpath = 6
Non-branching paths of the multipath alignment, each containing an alignment of part of the sequence to a Graph. IMPORTANT: downstream applications will assume these are stored in topological order.
int32 mapping_quality = 7
-10 * log_10(probability of mismapping)
repeated uint32 start = 8
optional: indices of Subpaths that align the beginning of the read (i.e. source nodes)
string paired_read_name = 9
optional google.protobuf.Struct annotation = 100
Annotations carried along with the Alignment.
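
The `mapping_quality` formula, -10 * log_10(probability of mismapping), is the usual Phred scaling. A short Python sketch of the conversion in both directions follows. The function names are hypothetical, and rounding to the stored int32 is left to the caller.

```python
import math


def phred_from_prob(p_error: float) -> float:
    """Phred-scaled quality for an error probability in (0, 1]."""
    # math.log10 raises ValueError for p_error <= 0, so a caller
    # would need to cap certain mappings at some maximum quality.
    return -10.0 * math.log10(p_error)


def prob_from_phred(quality: float) -> float:
    """Error probability implied by a Phred-scaled quality."""
    return 10.0 ** (-quality / 10.0)
```

A mismapping probability of 0.001 corresponds to a quality of about 30, and a quality of 20 corresponds to a mismapping probability of about 0.01.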

*Nodes* store sequence data.

Used in: Graph

string sequence = 1
Sequence of DNA bases represented by the Node.
string name = 2
A name provides an identifier.
int64 id = 3
Each Node has a unique positive nonzero ID within its Graph.

Collects pileup records by node. Saves some space and hashing over storing them individually, assuming the pileups are not too sparse and the average node length is more than a couple of bases. The ith BasePileup in the array corresponds to the position at offset i.

Used in: Pileup

int64 node_id = 1
repeated BasePileup base_pileup = 2
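
The offset-indexed layout, together with the BasePileup rule that qualities, when set, must have one entry per base, can be sketched in Python as follows. The `BasePileup` tuple and both helpers are hypothetical stand-ins for illustration, not the generated protobuf classes.

```python
from typing import NamedTuple, Optional, Sequence


class BasePileup(NamedTuple):
    """Hypothetical stand-in mirroring the BasePileup fields."""
    ref_base: int
    num_bases: int
    bases: str
    qualities: bytes = b""


def pileup_at(base_pileups: Sequence[BasePileup], offset: int) -> Optional[BasePileup]:
    """Return the BasePileup for a node offset, or None if the offset is out of range."""
    if 0 <= offset < len(base_pileups):
        return base_pileups[offset]
    return None


def qualities_ok(pileup: BasePileup) -> bool:
    """Qualities are either unset or have exactly num_bases entries."""
    return not pileup.qualities or len(pileup.qualities) == pileup.num_bases
```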

Paths are walks through nodes defined by a series of `Edit`s. They can be used to represent:
- haplotypes
- mappings of reads, or alignments, by including edits
- relationships between nodes
- annotations from other data sources, such as: genes, exons, motifs, transcripts, peaks

Used in: Alignment, Graph, Locus, Subpath, Translation

string name = 1
The name of the path. Path names starting with underscore (_) are reserved for internal VG use.
repeated Mapping mapping = 2
The `Mapping`s which describe the order and orientation in which the Path visits `Node`s.
bool is_circular = 3
Set to true if the path is circular.
int64 length = 4
Optional length annotation for the Path.

Bundle up Node and Edge pileups

repeated NodePileup node_pileups = 1
repeated EdgePileup edge_pileups = 2

Used in: Alignment, Mapping

int64 node_id = 1
The Node on which the Position is.
int64 offset = 2
The offset into that node's sequence at which the Position occurs.
bool is_reverse = 4
True if we obtain the original sequence of the path by reverse complementing the mappings.
string name = 5
The name of the reference path, if the position is used to represent a position against a reference path.

Describes a subgraph that is connected to the rest of the graph by two nodes.

Used in: Visit

SnarlType type = 1
What type of snarl is this?
optional Visit start = 2
Visits that connect the Snarl to the rest of the graph
points *INTO* the snarl
optional Visit end = 3
points *OUT OF* the snarl
optional Snarl parent = 4
If this Snarl is nested in another, this field should be filled in with a Snarl that has the start and end visits filled in (other information is optional/extraneous)
string name = 5
Allows snarls to be named, e.g. by the hash of the VCF variant they come from.
bool start_self_reachable = 6
Indicate whether there is a reversing path contained in the Snarl from either the start to itself or the end to itself
bool end_self_reachable = 7
bool start_end_reachable = 8
Indicate whether the start of the Snarl is connected through to the end.
bool directed_acyclic_net_graph = 9
Indicate whether the snarl's net graph is free of directed cycles

Describes a walk through a Snarl where each step is given as either a node or a child Snarl (leaving the walk through the child Snarl to another SnarlTraversal)

repeated Visit visit = 1
Steps of the walk through a Snarl, including the start and end nodes. If the traversal includes a Visit that represents a Snarl, both the node entering the Snarl and the node leaving the Snarl should be included in the traversal.
string name = 2
The name of the traversal can be used for a variant allele id (e.g. <parentSnarlHash>_0, <parentSnarlHash>_1... or by some other arbitrary annotation, unique or non-unique, e.g. deleterious, gain_of_function, etc., though these will be lost in any indices).

Enumeration of the classifications of snarls

Used in: Snarl

UNCLASSIFIED = 0
ULTRABUBBLE = 1
UNARY = 2

A non-branching path of a MultipathAlignment

Used in: MultipathAlignment

optional Path path = 1
describes node sequence and edits to the graph sequences
repeated uint32 next = 2
the indices of subpaths in the multipath alignment that are to the right of this path, where right is in the direction of the end of the read sequence
int32 score = 3
score of this subpath's alignment
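
The `next` indices make the subpaths of a MultipathAlignment a directed graph, so the topological-order requirement and the optional `start` list can both be checked from them. The Python sketch below uses hypothetical helper names and represents each subpath only by its list of `next` indices.

```python
def is_topologically_ordered(next_lists):
    """True if every `next` index points to a subpath stored later in the list."""
    return all(j > i for i, nexts in enumerate(next_lists) for j in nexts)


def source_subpaths(next_lists):
    """Indices of subpaths that no other subpath lists in `next` (source nodes)."""
    has_predecessor = set()
    for nexts in next_lists:
        has_predecessor.update(nexts)
    return [i for i in range(len(next_lists)) if i not in has_predecessor]
```

For a diamond-shaped alignment with `next` lists `[[1, 2], [3], [3], []]`, the stored order is topological and subpath 0 is the only source.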

Aggregates information about the reads supporting an allele.

Used in: LocationSupport, Locus

double quality = 1
The overall quality of all the support, as -10 * log10(P(all support is wrong))
double forward = 2
The number of supporting reads on the forward strand (which may be fractional)
double reverse = 3
The number of supporting reads on the reverse strand (which may be fractional)
double left = 4
TODO: what is this?
double right = 5
TODO: What is this?

Translations map from one graph to another. A collection of these provides a covering mapping between a from and to graph. If each "from" path through the base graph corresponds to a "to" path in an updated graph, then we can use these translations to project positions, mappings, and paths in the new graph into the old one using the Translator interface.

optional Path from = 1
optional Path to = 2

Describes a step of a walk through a Snarl either on a node or through a child Snarl

Used in: Snarl, SnarlTraversal

int64 node_id = 1
The node ID or snarl of this step (only one should be given)
optional Snarl snarl = 2
only needs to contain the start and end Visits
bool backward = 3
Indicates:
- if node_id is specified: reverse complement of the node
- if snarl is specified: traversal of a child snarl, entering backwards through end and leaving backwards through start
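
For a Visit that names a node, `backward` means the step reads the node's sequence as its reverse complement. A minimal Python sketch of that case follows. The helper names are hypothetical, and only the bases A, C, G, T and N (in either case) are translated.

```python
# Translation table for complementing DNA bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def visit_sequence(node_sequence: str, backward: bool = False) -> str:
    """Sequence read when a Visit steps onto a node, honouring `backward`."""
    return reverse_complement(node_sequence) if backward else node_sequence
```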

package vg

message Alignment

string sequence = 1

optional Path path = 2

string name = 3

bytes quality = 4

int32 mapping_quality = 5

int32 score = 6

int32 query_position = 7

string sample_name = 9

string read_group = 10

optional Alignment fragment_prev = 11

optional Alignment fragment_next = 12

bool is_secondary = 15

double identity = 16

repeated Path fragment = 17

repeated Locus locus = 18

repeated Position refpos = 19

bool read_paired = 20

bool read_mapped = 21

bool mate_unmapped = 22

bool read_on_reverse_strand = 23

bool mate_on_reverse_strand = 24

bool soft_clipped = 25

bool discordant_insert_size = 26

double uniqueness = 27

double correct = 28

repeated int32 secondary_score = 29

double fragment_score = 30

bool mate_mapped_to_disjoint_subgraph = 31

string fragment_length_distribution = 32

bool haplotype_scored = 33

double haplotype_logprob = 34

double time_used = 35

optional Position to_correct = 36

bool correctly_mapped = 37

optional google.protobuf.Struct annotation = 100

message BasePileup

int32 ref_base = 1

int32 num_bases = 2

string bases = 3

bytes qualities = 4

message Edge

int64 from = 1

int64 to = 2

bool from_start = 3

bool to_end = 4

int32 overlap = 5

message EdgePileup

optional Edge edge = 1

int32 num_reads = 2

int32 num_forward_reads = 3

bytes qualities = 4

message Edit

int32 from_length = 1

int32 to_length = 2

string sequence = 3

message Genotype

repeated int32 allele = 1

bool is_phased = 2

double likelihood = 3

double log_likelihood = 4

double log_prior = 5

double log_posterior = 6

message Graph

repeated Node node = 1

repeated Edge edge = 2

repeated Path path = 3

message KmerMatch

string sequence = 1

int64 node_id = 2

sint32 position = 3

bool backward = 4

message LocationSupport

optional Support support = 1

oneof oneof_location

Edge edge = 2

int64 node_id = 3

message Locus

string name = 1