A layout used for pages where all values are null. There may be buffers of repetition and definition information if these are required to interpret what kind of nulls are present.
The meaning of each repdef layer, used to interpret repdef buffers correctly
A layout where large binary data is encoded externally and only the descriptions (position + size) are placed in the page. Repdef information is stored in the descriptions. A description with a size of 0 and a position of 0 is an empty value. A description with a size of 0 and a non-zero position is a null value and the position is the repdef value.
The inner layout used to store the descriptions
The meaning of each repdef layer, used to interpret repdef buffers correctly. The inner layout's repdef layers will always be a single all-valid item layer.
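The rules for interpreting a blob description (empty, null, or a real value) can be sketched as a small classifier. This is a minimal illustration only; the function name `classify_description` and the returned tuple shape are assumptions made for this example and are not part of the format.

```python
from typing import Optional, Tuple


def classify_description(position: int, size: int) -> Tuple[str, Optional[int]]:
    """Classify one (position, size) description.

    Returns a (kind, repdef) pair, where repdef is only set for nulls.
    """
    if size != 0:
        # A non-zero size points at real externally stored data.
        return ("value", None)
    if position == 0:
        # size == 0 and position == 0 means an empty value.
        return ("empty", None)
    # size == 0 and position != 0 means null; the position carries the repdef value.
    return ("null", position)
```

For example, `classify_description(7, 0)` is treated as a null whose repdef value is 7.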
Compression applied to a single buffer of data. A buffer is the leaf of the compression tree. Unlike data blocks, which can be further compressed with a variety of techniques, a buffer has no structure that an encoding can interpret. A general compression scheme, such as zstd or lz4, may be applied to a buffer, in which case the entire buffer is compressed as a single unit. If this happens then any parent encoding becomes opaque, even if it would normally be transparent. This is a leaf: no further compression is applied to the data.
A general compression scheme to apply to the buffer
The compression level. Optional; if not present, a scheme-specific default value will be used. Interpretation of this value depends on the compression scheme. Generally, larger values indicate more compression at the expense of more CPU time.
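The optional-level behavior described above can be sketched with Python's standard `zlib` module. Note that `zlib` is only a stand-in here (zstd and lz4 are not in the standard library), and the helper names `compress_buffer` and `decompress_buffer` are hypothetical.

```python
import zlib
from typing import Optional


def compress_buffer(buf: bytes, level: Optional[int] = None) -> bytes:
    """Compress an entire buffer as a single unit.

    When no level is given, the scheme's own default is used.
    """
    if level is None:
        return zlib.compress(buf)
    return zlib.compress(buf, level)


def decompress_buffer(buf: bytes) -> bytes:
    """Inverse of compress_buffer."""
    return zlib.decompress(buf)
```

Because the whole buffer is compressed at once, a reader has to decompress all of it before any single value can be accessed, which is why the parent encoding becomes opaque.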
A compression scheme where fixed-width values are transposed into a series of byte streams. This is commonly used for floating point values, where the upper bits (the sign and exponent) have a significantly different meaning than the lower bits (the mantissa). By splitting the values into byte streams we group the exponent bits together and the mantissa bits together. The end result is typically more compressible. Note that this encoding is mostly useful when combined with other encodings. It does not do any compression on its own. This is an opaque encoding. The input is a fixed-width data block. The output is a single fixed-width data block.
The compression used to store the values
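The byte stream transposition can be illustrated with a short sketch. It assumes whole-byte value widths and a simple "stream 0, then stream 1, ..." output order; the function names are hypothetical and the exact on-disk ordering used by the format is not stated in this document.

```python
def byte_stream_split(data: bytes, width: int) -> bytes:
    """Transpose fixed-width values into `width` byte streams.

    Stream i holds byte i of every value; streams are concatenated in order.
    """
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    return b"".join(data[i::width] for i in range(width))


def byte_stream_merge(split: bytes, width: int) -> bytes:
    """Inverse of byte_stream_split."""
    if len(split) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    n = len(split) // width  # number of values
    out = bytearray(len(split))
    for i in range(width):
        # Scatter stream i back into byte position i of every value.
        out[i::width] = split[i * n:(i + 1) * n]
    return bytes(out)
```

With little-endian 32-bit floats, the last stream collects the sign and high exponent bits of every value, which tend to repeat and so compress well under a later general-purpose scheme.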
An encoding that compresses a data block into buffers
Compression algorithm where all values have a constant value (encoded in the description). This is a leaf encoding; no compression is applied to the data. The input can be any kind of data block. There is no output.
The value (TODO: define encoding for literals?)
A compression scheme where common values are stored in a dictionary and the values are encoded as indices into the dictionary. This is an opaque encoding unless the dictionary is considered metadata. The input is any kind of data block. There are two outputs:
- A data block of the same kind as the input (the dictionary)
- A fixed-width data block containing the indices into the dictionary
The compression used to store the indices data block
The compression used to store the dictionary items data block
The number of items in the dictionary
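A minimal sketch of dictionary encoding, producing the two outputs described above (the dictionary and the indices). It operates on plain Python lists rather than data blocks, assigns indices in order of first appearance, and uses hypothetical function names.

```python
from typing import Hashable, List, Tuple


def dictionary_encode(values: List[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Return (dictionary, indices) for the given values."""
    dictionary: List[Hashable] = []
    index_of = {}
    indices: List[int] = []
    for v in values:
        if v not in index_of:
            # First time we see this value: add it to the dictionary.
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def dictionary_decode(dictionary: List[Hashable], indices: List[int]) -> List[Hashable]:
    """Rebuild the original values by looking up each index."""
    return [dictionary[i] for i in indices]
```

The indices are small integers, so in practice they are good candidates for further compression such as bitpacking.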
Converts a fixed-size-list of values into a flattened list of values. This encoding does not actually compress the data; it just flattens out the FSL layers. This is a transparent encoding. The input is a single block of fixed-width data (with a wide width and few items). The output is a single block of fixed-width data (with a narrow width and many items).
The number of items in this layer of FSL
Whether or not there is a validity buffer
The compression used to store the flattened values data block
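The "wide width, few items" to "narrow width, many items" conversion can be sketched as a pure metadata change. The `FixedWidthBlock` type and `flatten_fsl` function are hypothetical names for this illustration, and the sketch assumes there is no validity buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedWidthBlock:
    data: bytes
    bits_per_value: int
    num_values: int


def flatten_fsl(block: FixedWidthBlock, dimension: int) -> FixedWidthBlock:
    """Flatten one fixed-size-list layer with `dimension` items per list.

    The bytes are untouched; only the width and item count change.
    """
    if dimension <= 0 or block.bits_per_value % dimension != 0:
        raise ValueError("list width must be a positive multiple of the item width")
    return FixedWidthBlock(
        data=block.data,
        bits_per_value=block.bits_per_value // dimension,
        num_values=block.num_values * dimension,
    )
```

Two lists of three 32-bit items (96 bits per list) become six 32-bit items over the same 24 bytes.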
Fixed width items placed contiguously in a single buffer. This is a leaf encoding; no compression is applied to the data. This is a transparent encoding by definition. The input is a fixed-width data block. The output is a single buffer.
The number of bits per value. Must be greater than 0; does not need to be a multiple of 8.
The compression applied to the data
A compression scheme for variable-width data. A small dictionary (referred to as a "symbol table") is used to compress the values. In this scheme there is a single symbol table for the entire page and it is stored in the encoding description itself. This is a transparent encoding. The input is a variable-width data block. The output is a single variable-width data block.
The FSST symbol table
The compression used to store the compressed values data block
A layout used for pages where the data is large. In this case the cost of transposing the data is relatively small (compared to the cost of writing the data), so we just zip the buffers together.
The number of bits of repetition info (0 if there is no repetition)
The number of bits of definition info (0 if there is no definition)
The number of bits of value info. Note: we use bits here (and not bytes) for consistency with other encodings. However, in practice, there is never a reason to use a bits per value that is not a multiple of 8. The complexity is not worth the small savings in space since this encoding is typically used with large values already.
If this is a fixed width block then we need to have a fixed number of bits per value
If this is a variable width block then we need to have a fixed number of bits per offset
The number of items in the page
The number of visible items in the page
Description of the compression of values
The meaning of each repdef layer, used to interpret repdef buffers correctly
A compression scheme that wraps the underlying data with general compression.
Note: The application of wrapped compression will depend on the layout of the data. If we apply it to mini-block data then we compress entire mini-blocks. If we apply it to full-zip data then we compress each value individually.
Note: Wrapped compression is somewhat unique at the moment as it is applied to the output of the inner encoding and not the input like all other compressive encodings.
Note: General compression can usually be applied in two spots. We can apply it to individual buffers or we can apply it here, to the entire array. For example, let's say we are storing mini-blocks of strings and we are using FSST and bitpacking the offsets. We have something like this:
WRAPPED(†3) -> FSST -> VARIABLE -(offsets)-> INLINE_BITPACKING -(data)-> FLAT -> BUFFER (†1)
                                -(data)-> BUFFER (†2)
General compression can be applied at †1, †2, or †3 (or any combination of these):
- If we apply it at †1 then we apply it just to the bitpacked offsets
- If we apply it at †2 then we apply it just to the FSST compressed data
- If we apply it at †3 then we apply it to the entire mini-block (both offsets and data)
The input is a single data block of any kind. The output is a single data block of the same kind as the input.
The compression to apply to the values
The compression used to store the output data block
Bitpacking variant where the bits per value are stored inline in the chunks themselves. This variation of bitpacking allows for the number of bits per value to change throughout the buffer, which makes the compression more robust to outliers. This is an opaque encoding. The input is a fixed-width data block. The output is a single buffer.
The number of bits of the uncompressed value, e.g. 32 for a u32.
The compression applied to the values
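To show why inline bit widths make bitpacking robust to outliers, here is a rough sketch in which every chunk starts with a one-byte header holding that chunk's bit width. The chunk size, the one-byte header, the LSB-first bit order, and the function names are all assumptions made for illustration; the actual chunk layout is not specified in this document.

```python
from typing import List


def inline_bitpack(values: List[int], chunk_size: int) -> bytes:
    """Pack unsigned ints chunk by chunk, each chunk using its own bit width."""
    out = bytearray()
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        bits = max(v.bit_length() for v in chunk)
        out.append(bits)  # inline header: bit width for this chunk
        acc = 0
        for i, v in enumerate(chunk):
            acc |= v << (i * bits)
        out += acc.to_bytes((len(chunk) * bits + 7) // 8, "little")
    return bytes(out)


def inline_bitunpack(data: bytes, count: int, chunk_size: int) -> List[int]:
    """Inverse of inline_bitpack for `count` values."""
    values: List[int] = []
    pos = 0
    while len(values) < count:
        n = min(chunk_size, count - len(values))
        bits = data[pos]
        pos += 1
        nbytes = (n * bits + 7) // 8
        acc = int.from_bytes(data[pos:pos + nbytes], "little")
        pos += nbytes
        mask = (1 << bits) - 1
        values.extend((acc >> (i * bits)) & mask for i in range(n))
    return values
```

In the input `[1, 2, 3, 0, 1000, 2, 1, 0]` with a chunk size of 4, only the second chunk pays for the outlier 1000: the first chunk uses 2 bits per value and the second uses 10.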
A layout used for pages where the data is small. In this case we can fit many values into a single disk sector and transposing buffers is expensive. As a result, we do not transpose the buffers but compress the data into small chunks (called mini-blocks) which are roughly the size of a disk sector. The end result is a small amount of read amplification (since we must read an entire page at a time) but we have more flexibility in compression and do less work per value when compressing and decompressing in bulk.
Description of the compression of repetition levels (e.g. how many bits per rep). Optional; if there is no repetition then this field is not present.
Description of the compression of definition levels (e.g. how many bits per def). Optional; if there is no definition then this field is not present.
Description of the compression of values
Description of the compression of the dictionary data. Optional; if there is no dictionary then this field is not present.
Number of items in the dictionary
The meaning of each repdef layer, used to interpret repdef buffers correctly
The number of buffers in each mini-block, this is determined by the compression and does NOT include the repetition or definition buffers (the presence of these buffers can be determined by looking at the rep_compression and def_compression fields)
The depth of the repetition index. If there is repetition then the depth must be at least 1. If there are many layers of repetition then deeper repetition indices will support deeper nested random access. For example, given 5 layers of repetition then the repetition index depth must be at least 3 to support access like rows[50][17][3]. We require `repetition_index_depth + 1` u64 values per mini-block to store the repetition index if the `repetition_index_depth` is greater than 0. The +1 is because we need to store the number of "leftover items" at the end of the chunk. Otherwise, we wouldn't have any way to know if the final item in a chunk is valid or not.
The page already records how many rows are in the page. For mini-block we also need to know how many "items" are in the page. A row and an item are the same thing unless the page has lists.
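The repetition index sizing rule described above (`repetition_index_depth + 1` u64 values per mini-block when the depth is greater than 0) works out as follows. This is a small arithmetic sketch; the function names are hypothetical.

```python
U64_BYTES = 8


def repetition_index_values_per_block(depth: int) -> int:
    """Number of u64 values stored per mini-block for the repetition index."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    # depth == 0 means there is no repetition index at all.
    # Otherwise one value per indexed level, plus one for the leftover items.
    return 0 if depth == 0 else depth + 1


def repetition_index_bytes(num_mini_blocks: int, depth: int) -> int:
    """Total bytes needed for the repetition index of a page."""
    return num_mini_blocks * repetition_index_values_per_block(depth) * U64_BYTES
```

For instance, a page with 10 mini-blocks and a depth of 1 needs 10 * 2 * 8 = 160 bytes of repetition index.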
A compression scheme in which a single fixed-width block is "packed" into a smaller fixed-width block where each value has fewer bits. This is typically done by throwing away the most significant bits of each value when those bits are all the same. In this scheme the number of bits per value is fixed across the entire buffer and stored in this message. This is a transparent encoding. The input is a fixed-width data block. The output is a single fixed-width data block.
The number of bits of the uncompressed value, e.g. 32 for a u32.
The compression used to store the bitpacked values data block
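A minimal sketch of fixed-width bitpacking for unsigned values whose high bits are all zero. The LSB-first bit order and the function names are assumptions for this example; the document does not state the exact bit layout.

```python
from typing import List


def bitpack(values: List[int], bits: int) -> bytes:
    """Pack unsigned ints into `bits` bits each, least significant bits first."""
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v >> bits:
            raise ValueError(f"value {v} does not fit in {bits} bits")
        acc |= v << (i * bits)
    return acc.to_bytes((len(values) * bits + 7) // 8, "little")


def bitunpack(data: bytes, bits: int, count: int) -> List[int]:
    """Inverse of bitpack for `count` values."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]
```

Four values that each fit in 3 bits occupy 2 bytes after packing, compared with 16 bytes if stored as u32.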
Packs a struct containing only fixed-width children into a single fixed-width data block. The children are concatenated row by row and stored as a single fixed-width buffer. This is the legacy packed struct representation and remains available for backwards compatibility.
The number of bits contributed by each child field in the packed row
The compression used to store the packed fixed-width values
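Row-by-row concatenation of fixed-width children can be sketched as below. For simplicity the sketch takes child widths in whole bytes (the field above is expressed in bits), and the function name `pack_struct_rows` is hypothetical.

```python
from typing import List


def pack_struct_rows(children: List[bytes], bytes_per_child: List[int]) -> bytes:
    """Interleave column buffers so that each row's fields are contiguous."""
    if len(children) != len(bytes_per_child) or not children:
        raise ValueError("need one width per child")
    num_rows = len(children[0]) // bytes_per_child[0]
    for data, width in zip(children, bytes_per_child):
        if len(data) != num_rows * width:
            raise ValueError("all children must have the same number of rows")
    out = bytearray()
    for row in range(num_rows):
        for data, width in zip(children, bytes_per_child):
            # Append this row's slice of each child in struct order.
            out += data[row * width:(row + 1) * width]
    return bytes(out)
```

A struct with a 2-byte field and a 4-byte field therefore produces a single fixed-width block with 6 bytes (48 bits) per row.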
Describes the structural encoding of a page
A layout used for pages where the data is small
A layout used for pages where all values are null
A layout used for pages where the data is large
A layout where large binary data is encoded externally and only the descriptions are put in the page
Repetition and definition levels are described in more detail elsewhere. As we peel through the structure of an array we will encounter layers of struct and list. Each of these layers potentially adds a new level to the repetition and definition levels. This message describes the meaning of each layer.
Should never be used; included for debugging purposes and general protobuf best practice.
All values are valid (can be primitive or struct)
All list values are valid
There are one or more null items (can be primitive or struct)
A list layer with null lists but no empty lists
A list layer with empty lists but no null lists
A list layer with both empty lists and null lists
A compression scheme where runs of common values are encoded as a single value and a count. This is an opaque encoding unless the run lengths are considered metadata. The input is a single data block of any kind. There are two outputs:
- A data block of the same kind as the input (the run values)
- A fixed-width data block containing the lengths of the runs
The compression used to store the run values data block
The compression used to store the run lengths data block
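A minimal run-length encoding sketch that produces the two outputs described above (run values and run lengths). It works on plain Python lists rather than data blocks, and the function names are hypothetical.

```python
from typing import Any, List, Tuple


def rle_encode(values: List[Any]) -> Tuple[List[Any], List[int]]:
    """Return (run_values, run_lengths) for consecutive equal values."""
    run_values: List[Any] = []
    run_lengths: List[int] = []
    for v in values:
        if run_values and run_values[-1] == v:
            # Same value as the current run: extend it.
            run_lengths[-1] += 1
        else:
            # A new run starts here.
            run_values.append(v)
            run_lengths.append(1)
    return run_values, run_lengths


def rle_decode(run_values: List[Any], run_lengths: List[int]) -> List[Any]:
    """Expand each run back into repeated values."""
    out: List[Any] = []
    for v, n in zip(run_values, run_lengths):
        out.extend([v] * n)
    return out
```

Random access into the encoded form requires scanning or prefix-summing the run lengths, which is why the encoding is opaque unless those lengths are treated as metadata.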
Variable width items have the values stored in one buffer and the offsets are output as a data block that may be further compressed. This is a partial leaf encoding. Values are not compressed but the offsets may be further compressed. This is a transparent encoding by definition. The input is a variable-width data block. The output is a single fixed-width data block (the offsets) and a single buffer (the values)
Describes how the offsets data block is compressed
The compression applied to the values
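The split of variable-width data into an offsets block and a values buffer can be sketched as follows. The sketch assumes Arrow-style offsets (N + 1 entries starting at 0), which is an assumption and not something this document states; the function names are hypothetical.

```python
from typing import List, Tuple


def to_variable_block(items: List[bytes]) -> Tuple[List[int], bytes]:
    """Return (offsets, values) where item i is values[offsets[i]:offsets[i + 1]]."""
    offsets = [0]
    values = bytearray()
    for item in items:
        values += item
        offsets.append(len(values))
    return offsets, bytes(values)


def get_item(offsets: List[int], values: bytes, i: int) -> bytes:
    """Random access to a single item using the offsets."""
    return values[offsets[i]:offsets[i + 1]]
```

The offsets are monotonically increasing integers, which is why they form a fixed-width data block that can be compressed further (for example with bitpacking), while the values stay in a plain buffer.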
Variable-width packed struct encoding (2.2 extension). Each child value is compressed independently before being transposed into a row-major layout. This preserves per-field compression boundaries at the cost of disabling mini-block compression. Readers must prefer this field when present and fall back to the legacy encoding otherwise.
Per-field encoding metadata in struct order
Encoding description for a single child field
Compression applied to individual field values before transposition
Bit width of each compressed value (when fixed width)
Bit width of the length prefix for variable-width compressed values