package lance.encodings21


A layout used for pages where all values are null. There may be buffers of repetition and definition information if they are required in order to interpret what kind of nulls are present.

Used in: PageLayout

repeated RepDefLayer layers = 5
The meaning of each repdef layer, used to interpret repdef buffers correctly

A layout where large binary data is encoded externally and only the descriptions (position + size) are placed in the page. Repdef information is stored in the descriptions. A description with a size of 0 and a position of 0 is an empty value. A description with a size of 0 and a non-zero position is a null value, and the position is the repdef value.

Used in: PageLayout

optional PageLayout inner_layout = 1
The inner layout used to store the descriptions
repeated RepDefLayer layers = 2
The meaning of each repdef layer, used to interpret repdef buffers correctly. The inner layout's repdef layers will always be a single all-valid item layer.
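
The rules above for reading a blob description can be sketched as a small classifier. This is a minimal illustration only: `BlobDescription` and `classify_description` are hypothetical names, and the way descriptions are physically stored by the inner layout is not modelled here.

```python
from typing import NamedTuple


class BlobDescription(NamedTuple):
    """Hypothetical in-memory form of one (position, size) description."""
    position: int
    size: int


def classify_description(desc: BlobDescription):
    """Interpret a description following the rules stated in the text.

    Returns a (kind, payload) tuple:
      ("empty", None)             size == 0 and position == 0
      ("null", repdef_value)      size == 0 and position != 0
      ("value", (position, size)) anything else
    """
    if desc.size == 0:
        if desc.position == 0:
            return ("empty", None)
        # For nulls the position field carries the repdef value.
        return ("null", desc.position)
    return ("value", (desc.position, desc.size))
```

For example, `classify_description(BlobDescription(3, 0))` is treated as a null whose repdef value is 3.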

Compression applied to a single buffer of data. A buffer is the leaf of the compression tree. Unlike data blocks, which can be further compressed with a variety of techniques, a buffer cannot be understood in any particular way. A general compression scheme may be applied to a buffer. This is something like zstd, lz4, etc. The entire buffer is compressed as a single unit. If this happens then any parent encoding becomes opaque, even if it would normally be transparent. This is a leaf; no further compression is applied to the data.

Used in: Flat, General, InlineBitpacking, Variable

CompressionScheme scheme = 1
A general compression scheme to apply to the buffer
optional int32 level = 2
The compression level. Optional; if not present, a scheme-specific default value will be used. Interpretation of this value depends on the compression scheme. Generally, larger values indicate more compression at the expense of more CPU time.

A compression scheme where fixed-width values are transposed into a series of byte streams. This is commonly used for floating point values where the upper bits (the sign and exponent) have a significantly different meaning than the lower bits (the mantissa). By splitting the values into byte streams we group the exponent bits together and the mantissa bits together. The end result is typically more compressible. Note that this encoding is mostly useful when combined with other encodings. It does not do any compression on its own. This is an opaque encoding. The input is a fixed-width data block. The output is a single fixed-width data block.

Used in: CompressiveEncoding

optional CompressiveEncoding values = 1
The compression used to store the values
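
To make the transposition concrete, here is a minimal sketch of a byte stream split over a buffer of fixed-width values. The function names and the ordering of the streams (byte 0 of every value first, then byte 1, and so on) are assumptions for illustration, not a statement of the on-disk layout.

```python
def byte_stream_split(values: bytes, width: int) -> bytes:
    """Transpose `width`-byte values into `width` concatenated byte streams."""
    if len(values) % width != 0:
        raise ValueError("buffer length must be a multiple of the value width")
    # Stream i holds byte i of every value.
    return b"".join(values[i::width] for i in range(width))


def byte_stream_join(streams: bytes, width: int) -> bytes:
    """Inverse of byte_stream_split."""
    if len(streams) % width != 0:
        raise ValueError("buffer length must be a multiple of the value width")
    count = len(streams) // width
    out = bytearray(len(streams))
    for i in range(width):
        out[i::width] = streams[i * count:(i + 1) * count]
    return bytes(out)
```

The output has exactly the same size as the input, which matches the note that this encoding does no compression on its own; the benefit only appears once a later encoding compresses the grouped streams.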

Used in: BufferCompression

COMPRESSION_ALGORITHM_UNSPECIFIED = 0
COMPRESSION_ALGORITHM_LZ4 = 1
COMPRESSION_ALGORITHM_ZSTD = 2

An encoding that compresses a data block into buffers

Used in: ByteStreamSplit, Dictionary, FixedSizeList, Fsst, FullZipLayout, General, MiniBlockLayout, OutOfLineBitpacking, PackedStruct, Rle, Variable, VariablePackedStruct.FieldEncoding

oneof compression
- Flat flat = 1
- Variable variable = 2
- Constant constant = 3
- OutOfLineBitpacking out_of_line_bitpacking = 4
- InlineBitpacking inline_bitpacking = 5
- Fsst fsst = 6
- Dictionary dictionary = 7
- Rle rle = 8
- ByteStreamSplit byte_stream_split = 9
- General general = 10
- FixedSizeList fixed_size_list = 11
- PackedStruct packed_struct = 12
- VariablePackedStruct variable_packed_struct = 13

Compression algorithm where all values have a constant value (encoded in the description). This is a leaf encoding; there is no compression applied to the data. The input can be any kind of data block. There is no output.

Used in: CompressiveEncoding

optional bytes value = 1
The value (TODO: define encoding for literals?)

A compression scheme where common values are stored in a dictionary and the values are encoded as indices into the dictionary. This is an opaque encoding unless the dictionary is considered metadata. The input is any kind of data block. There are two outputs: a data block of the same kind as the input (the dictionary) and a fixed-width data block containing the indices into the dictionary.

Used in: CompressiveEncoding

optional CompressiveEncoding indices = 1
The compression used to store the indices data block
optional CompressiveEncoding items = 2
The compression used to store the dictionary items data block
uint32 num_dictionary_items = 3
The number of items in the dictionary
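
The two outputs described above can be illustrated with a short sketch. The helper names are hypothetical, and plain Python lists stand in for data blocks; how the indices and items are subsequently compressed is left to the `indices` and `items` encodings.

```python
def dictionary_encode(values):
    """Split values into (dictionary, indices), assigning indices in first-seen order."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the two outputs."""
    return [dictionary[i] for i in indices]
```

In this sketch `len(dictionary)` plays the role of `num_dictionary_items`.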

Converts a fixed-size-list of values into a flattened list of values. This encoding does not actually compress the data; it just flattens out the FSL layers. This is a transparent encoding. The input is a single block of fixed-width data (with a wide width and few items). The output is a single block of fixed-width data (with a narrow width and many items).

Used in: CompressiveEncoding

uint64 items_per_value = 1
The number of items in this layer of FSL
bool has_validity = 3
Whether or not there is a validity buffer
optional CompressiveEncoding values = 2
The compression used to store the flattened values data block
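
The "wide and few" to "narrow and many" change can be shown as simple arithmetic on the block shape. This sketch ignores validity buffers entirely, and `flatten_fsl_shape` is a hypothetical helper, not part of any real API.

```python
def flatten_fsl_shape(bits_per_value: int, num_values: int, items_per_value: int):
    """Return (bits_per_item, num_items) after flattening one FSL layer."""
    if items_per_value <= 0 or bits_per_value % items_per_value != 0:
        raise ValueError("value width must divide evenly into items")
    return bits_per_value // items_per_value, num_values * items_per_value
```

For instance, 10 values that are each a list of four 32-bit floats (128 bits wide) become 40 items of 32 bits. Nested lists can be handled by applying the helper once per layer.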

Fixed-width items placed contiguously in a single buffer. This is a leaf encoding; there is no compression applied to the data. This is a transparent encoding by definition. The input is a fixed-width data block. The output is a single buffer.

Used in: CompressiveEncoding

uint64 bits_per_value = 1
The number of bits per value. Must be greater than 0; does not need to be a multiple of 8.
optional BufferCompression data = 2
The compression applied to the data

A compression scheme for variable-width data. A small dictionary (referred to as a "symbol table") is used to compress the values. In this scheme there is a single symbol table for the entire page and it is stored in the encoding description itself. This is a transparent encoding. The input is a variable-width data block. The output is a single variable-width data block.

Used in: CompressiveEncoding

bytes symbol_table = 1
The FSST symbol table
optional CompressiveEncoding values = 2
The compression used to store the compressed values data block

A layout used for pages where the data is large. In this case the cost of transposing the data is relatively small (compared to the cost of writing the data) and so we just zip the buffers together.

Used in: PageLayout

uint32 bits_rep = 1
The number of bits of repetition info (0 if there is no repetition)
uint32 bits_def = 2
The number of bits of definition info (0 if there is no definition)
oneof details
The number of bits of value info. Note: we use bits here (and not bytes) for consistency with other encodings. However, in practice, there is never a reason to use a bits per value that is not a multiple of 8. The complexity is not worth the small savings in space since this encoding is typically used with large values already.
- uint32 bits_per_value = 3
  If this is a fixed width block then we need to have a fixed number of bits per value
- uint32 bits_per_offset = 4
  If this is a variable width block then we need to have a fixed number of bits per offset
uint32 num_items = 5
The number of items in the page
uint32 num_visible_items = 6
The number of visible items in the page
optional CompressiveEncoding value_compression = 7
Description of the compression of values
repeated RepDefLayer layers = 8
The meaning of each repdef layer, used to interpret repdef buffers correctly

A compression scheme that wraps the underlying data with general compression.
Note: The application of wrapped compression will depend on the layout of the data. If we apply it to mini-block data then we compress entire mini-blocks. If we apply it to full-zip data then we compress each value individually.
Note: Wrapped compression is somewhat unique at the moment as it is applied to the output of the inner encoding and not the input like all other compressive encodings.
Note: General compression can usually be applied in two spots. We can apply it to individual buffers or we can apply it here, to the entire array. For example, let's say we are storing mini-blocks of strings and we are using FSST and bitpacking the offsets. We have something like this...
WRAPPED(†3) -> FSST -> VARIABLE -(offsets)-> INLINE_BITPACKING -(data)-> FLAT -> BUFFER (†1)
                                -(data)-> BUFFER (†2)
General compression can be applied at †1, †2, or †3 (or any combination of these).
- If we apply it at †1 then we apply it just to the bitpacked offsets.
- If we apply it at †2 then we apply it just to the FSST compressed data.
- If we apply it at †3 then we apply it to the entire mini-block (both offsets and data).
The input is a single data block of any kind. The output is a single data block of the same kind as the input.

Used in: CompressiveEncoding

optional BufferCompression compression = 1
The compression to apply to the values
optional CompressiveEncoding values = 3
The compression used to store the output data block
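
The difference between compressing individual buffers (†1, †2) and wrapping the whole mini-block (†3) can be sketched as follows. Python's standard-library `zlib` is used purely as a stand-in for zstd or lz4, the `compress_buffer` helper is hypothetical, and the sample buffers are made up for illustration.

```python
import zlib


def compress_buffer(buf: bytes, level=None) -> bytes:
    """Compress one buffer as a single unit (zlib stands in for zstd / lz4).

    When `level` is None the library default is used, mirroring the
    "scheme-specific default" behaviour of an absent level.
    """
    if level is None:
        return zlib.compress(buf)
    return zlib.compress(buf, level)


# Made-up stand-ins for the two buffers of a string mini-block.
offsets = bytes(range(64)) * 4   # plays the role of the bitpacked offsets
data = b"lance" * 200            # plays the role of the FSST-compressed bytes

# †1 and †2: each buffer is compressed independently.
per_buffer = [compress_buffer(offsets), compress_buffer(data, level=9)]

# †3: the whole mini-block (offsets followed by data) is compressed at once.
wrapped = compress_buffer(offsets + data)
```

Either placement round-trips the same bytes; which one compresses better depends on the data, so no particular outcome is claimed here.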

Bitpacking variant where the bits per value are stored inline in the chunks themselves. This variation of bitpacking allows for the number of bits per value to change throughout the buffer, which makes the compression more robust to outliers. This is an opaque encoding. The input is a fixed-width data block. The output is a single buffer.

Used in: CompressiveEncoding

uint64 uncompressed_bits_per_value = 1
The number of bits of the uncompressed value, e.g. for a u32, this will be 32.
optional BufferCompression values = 2
The compression applied to the values
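
The robustness to outliers comes from choosing a bit width per chunk rather than one width for the whole buffer. The sketch below only computes those widths; the chunk size is an arbitrary assumption and the actual chunk framing is not modelled.

```python
def chunk_bit_widths(values, chunk_size: int):
    """Return the bit width each chunk needs for its largest non-negative value."""
    widths = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        # int.bit_length() is the number of bits needed to represent the value.
        widths.append(max(v.bit_length() for v in chunk))
    return widths
```

With `[1, 2, 3, 4, 1000, 2, 3, 1]` and a chunk size of 4, the first chunk needs only 3 bits per value while the chunk containing the outlier needs 10. A single buffer-wide width would have to be 10 bits for every value.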

A layout used for pages where the data is small. In this case we can fit many values into a single disk sector and transposing buffers is expensive. As a result, we do not transpose the buffers but compress the data into small chunks (called mini blocks) which are roughly the size of a disk sector. The end result is a small amount of read amplification (since we must read an entire page at a time) but we have more flexibility in compression and do less work per value when compressing and decompressing in bulk.

Used in: PageLayout

optional CompressiveEncoding rep_compression = 1
Description of the compression of repetition levels (e.g. how many bits per rep). Optional; if there is no repetition then this field is not present.
optional CompressiveEncoding def_compression = 2
Description of the compression of definition levels (e.g. how many bits per def). Optional; if there is no definition then this field is not present.
optional CompressiveEncoding value_compression = 3
Description of the compression of values
optional CompressiveEncoding dictionary = 4
Description of the compression of the dictionary data. Optional; if there is no dictionary then this field is not present.
uint64 num_dictionary_items = 5
Number of items in the dictionary
repeated RepDefLayer layers = 6
The meaning of each repdef layer, used to interpret repdef buffers correctly
uint64 num_buffers = 7
The number of buffers in each mini-block. This is determined by the compression and does NOT include the repetition or definition buffers (the presence of these buffers can be determined by looking at the rep_compression and def_compression fields).
uint32 repetition_index_depth = 8
The depth of the repetition index. If there is repetition then the depth must be at least 1. If there are many layers of repetition then deeper repetition indices will support deeper nested random access. For example, given 5 layers of repetition then the repetition index depth must be at least 3 to support access like rows[50][17][3]. We require `repetition_index_depth + 1` u64 values per mini-block to store the repetition index if the `repetition_index_depth` is greater than 0. The +1 is because we need to store the number of "leftover items" at the end of the chunk. Otherwise, we wouldn't have any way to know if the final item in a chunk is valid or not.
uint64 num_items = 9
The page already records how many rows are in the page. For mini-block we also need to know how many "items" are in the page. A row and an item are the same thing unless the page has lists.
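
The sizing rule for the repetition index stated under `repetition_index_depth` can be written out directly. The helper names below are hypothetical; only the `depth + 1` u64 values per mini-block rule comes from the text.

```python
U64_BYTES = 8


def repetition_index_words_per_block(repetition_index_depth: int) -> int:
    """u64 values stored per mini-block: depth + 1 when depth > 0, otherwise none."""
    if repetition_index_depth > 0:
        # The extra word records the number of leftover items at the end of the chunk.
        return repetition_index_depth + 1
    return 0


def repetition_index_bytes(num_mini_blocks: int, repetition_index_depth: int) -> int:
    """Total size in bytes of the repetition index for a page."""
    return num_mini_blocks * repetition_index_words_per_block(repetition_index_depth) * U64_BYTES
```

For example, a page with 10 mini-blocks and a depth of 3 would need 10 * 4 * 8 = 320 bytes of repetition index under this rule.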

A compression scheme in which a single fixed-width block is "packed" into a smaller fixed-width block where each value has fewer bits. This is typically done by throwing away the most significant bits of each value when those bits are all the same. In this scheme the number of bits per value is fixed across the entire buffer and stored in this message. This is a transparent encoding. The input is a fixed-width data block. The output is a single fixed-width data block.

Used in: CompressiveEncoding

uint64 uncompressed_bits_per_value = 1
The number of bits of the uncompressed value, e.g. for a u32, this will be 32.
optional CompressiveEncoding values = 3
The compression used to store the bitpacked values data block
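
A fixed-width bitpacking round trip can be sketched in a few lines. The bit order used here (least significant bits first, little-endian bytes) and the function names are assumptions made for the example and may not match the actual layout.

```python
def bitpack(values, bits: int) -> bytes:
    """Pack non-negative integers into `bits` bits each."""
    acc = 0
    for i, value in enumerate(values):
        if value < 0 or value >> bits:
            raise ValueError(f"{value} does not fit in {bits} bits")
        acc |= value << (i * bits)
    num_bytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(num_bytes, "little")


def bitunpack(buf: bytes, bits: int, count: int):
    """Recover `count` values of `bits` bits each from a packed buffer."""
    acc = int.from_bytes(buf, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]
```

Five values packed at 3 bits each occupy 2 bytes instead of the 20 bytes they would need as u32 values.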

Packs a struct containing only fixed-width children into a single fixed-width data block. The children are concatenated row by row and stored as a single fixed-width buffer. This is the legacy packed struct representation and remains available for backwards compatibility.

Used in: CompressiveEncoding

repeated uint64 bits_per_value = 1
The number of bits contributed by each child field in the packed row
optional CompressiveEncoding values = 2
The compression used to store the packed fixed-width values
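
Concatenating children row by row amounts to interleaving the child columns. The sketch below assumes every child width is a whole number of bytes, which is a simplification; `pack_struct_rows` is a hypothetical helper.

```python
def pack_struct_rows(columns, bits_per_value):
    """Interleave fixed-width child columns into one row-major buffer.

    `columns` holds one bytes object per child; `bits_per_value` lists the
    width each child contributes to a packed row.
    """
    if any(bits % 8 != 0 for bits in bits_per_value):
        raise ValueError("this sketch only handles byte-aligned children")
    widths = [bits // 8 for bits in bits_per_value]
    num_rows = len(columns[0]) // widths[0]
    out = bytearray()
    for row in range(num_rows):
        for column, width in zip(columns, widths):
            out += column[row * width:(row + 1) * width]
    return bytes(out)
```

Each packed row is `sum(bits_per_value)` bits wide, so the result is itself a fixed-width block.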

Describes the structural encoding of a page

Used in: BlobLayout

oneof layout
- MiniBlockLayout mini_block_layout = 1
  A layout used for pages where the data is small
- AllNullLayout all_null_layout = 2
  A layout used for pages where all values are null
- FullZipLayout full_zip_layout = 3
  A layout used for pages where the data is large
- BlobLayout blob_layout = 4
  A layout where large binary data is encoded externally and only the descriptions are put in the page

Repetition and definition levels are described in more detail elsewhere. As we peel through the structure of an array we will encounter layers of struct and list. Each of these layers potentially adds a new level to the repetition and definition levels. This message describes the meaning of each layer.

Used in: AllNullLayout, BlobLayout, FullZipLayout, MiniBlockLayout

REPDEF_UNSPECIFIED = 0
Should never be used; included for debugging purposes and general protobuf best practice
REPDEF_ALL_VALID_ITEM = 1
All values are valid (can be primitive or struct)
REPDEF_ALL_VALID_LIST = 2
All list values are valid
REPDEF_NULLABLE_ITEM = 3
There are one or more null items (can be primitive or struct)
REPDEF_NULLABLE_LIST = 4
A list layer with null lists but no empty lists
REPDEF_EMPTYABLE_LIST = 5
A list layer with empty lists but no null lists
REPDEF_NULL_AND_EMPTY_LIST = 6
A list layer with both empty lists and null lists
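
When working with these layers outside of generated protobuf code, the enum can be mirrored as a Python `IntEnum`. The numeric values below are copied from the listing above; the two helper predicates are hypothetical conveniences derived from the value descriptions.

```python
from enum import IntEnum


class RepDefLayer(IntEnum):
    REPDEF_UNSPECIFIED = 0
    REPDEF_ALL_VALID_ITEM = 1
    REPDEF_ALL_VALID_LIST = 2
    REPDEF_NULLABLE_ITEM = 3
    REPDEF_NULLABLE_LIST = 4
    REPDEF_EMPTYABLE_LIST = 5
    REPDEF_NULL_AND_EMPTY_LIST = 6


_LIST_LAYERS = {
    RepDefLayer.REPDEF_ALL_VALID_LIST,
    RepDefLayer.REPDEF_NULLABLE_LIST,
    RepDefLayer.REPDEF_EMPTYABLE_LIST,
    RepDefLayer.REPDEF_NULL_AND_EMPTY_LIST,
}

_NULLABLE_LAYERS = {
    RepDefLayer.REPDEF_NULLABLE_ITEM,
    RepDefLayer.REPDEF_NULLABLE_LIST,
    RepDefLayer.REPDEF_NULL_AND_EMPTY_LIST,
}


def is_list_layer(layer: RepDefLayer) -> bool:
    """True when the layer describes a list rather than an item."""
    return layer in _LIST_LAYERS


def may_contain_nulls(layer: RepDefLayer) -> bool:
    """True when the layer description allows null entries."""
    return layer in _NULLABLE_LAYERS
```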

A compression scheme where runs of common values are encoded as a single value and a count. This is an opaque encoding unless the run lengths are considered metadata. The input is a single data block of any kind. There are two outputs: a data block of the same kind as the input (the run values) and a fixed-width data block containing the lengths of the runs.

Used in: CompressiveEncoding

optional CompressiveEncoding values = 1
The compression used to store the run values data block
optional CompressiveEncoding run_lengths = 2
The compression used to store the run lengths data block
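
The two outputs of run-length encoding can be sketched with plain lists standing in for data blocks. The function names are hypothetical.

```python
def rle_encode(values):
    """Collapse consecutive equal values into (run_values, run_lengths)."""
    run_values = []
    run_lengths = []
    for value in values:
        if run_values and run_values[-1] == value:
            run_lengths[-1] += 1
        else:
            run_values.append(value)
            run_lengths.append(1)
    return run_values, run_lengths


def rle_decode(run_values, run_lengths):
    """Expand the runs back into the original sequence."""
    out = []
    for value, length in zip(run_values, run_lengths):
        out.extend([value] * length)
    return out
```

The run values keep the kind of the input, while the run lengths are always integers, which is why they form a fixed-width block that can be compressed separately.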

Variable-width items have the values stored in one buffer and the offsets are output as a data block that may be further compressed. This is a partial leaf encoding. Values are not compressed but the offsets may be further compressed. This is a transparent encoding by definition. The input is a variable-width data block. The output is a single fixed-width data block (the offsets) and a single buffer (the values).

Used in: CompressiveEncoding

optional CompressiveEncoding offsets = 1
Describes how the offsets data block is compressed
optional BufferCompression values = 2
The compression applied to the values
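
The split into an offsets block and a values buffer can be illustrated as below. The convention of N+1 offsets starting at 0 is an assumption borrowed from common columnar formats, not something the text specifies, and the helper names are hypothetical.

```python
def to_variable_block(items):
    """Split byte strings into (offsets, values) using N+1 offsets starting at 0."""
    offsets = [0]
    values = bytearray()
    for item in items:
        values += item
        offsets.append(len(values))
    return offsets, bytes(values)


def item_at(offsets, values: bytes, index: int) -> bytes:
    """Slice one item back out of the values buffer."""
    return values[offsets[index]:offsets[index + 1]]
```

Because the offsets are monotonically increasing integers, they are a natural candidate for a further encoding such as bitpacking, while the values buffer can only receive general buffer compression.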

Variable-width packed struct encoding (2.2 extension). Each child value is compressed independently before being transposed into a row-major layout. This preserves per-field compression boundaries at the cost of disabling mini-block compression. Readers must prefer this field when present and fall back to the legacy encoding otherwise.

Used in: CompressiveEncoding

repeated VariablePackedStruct.FieldEncoding fields = 1
Per-field encoding metadata in struct order

Encoding description for a single child field

Used in: VariablePackedStruct

optional CompressiveEncoding value = 1
Compression applied to individual field values before transposition
oneof layout
- uint64 bits_per_value = 2
  Bit width of each compressed value (when fixed width)
- uint64 bits_per_length = 3
  Bit width of the length prefix for variable-width compressed values

package lance.encodings21

message AllNullLayout

repeated RepDefLayer layers = 5

message BlobLayout

optional PageLayout inner_layout = 1

repeated RepDefLayer layers = 2

message BufferCompression

CompressionScheme scheme = 1

optional int32 level = 2

message ByteStreamSplit

optional CompressiveEncoding values = 1

enum CompressionScheme

COMPRESSION_ALGORITHM_UNSPECIFIED = 0

COMPRESSION_ALGORITHM_LZ4 = 1

COMPRESSION_ALGORITHM_ZSTD = 2

message CompressiveEncoding

oneof compression

Flat flat = 1

Variable variable = 2

Constant constant = 3

OutOfLineBitpacking out_of_line_bitpacking = 4

InlineBitpacking inline_bitpacking = 5

Fsst fsst = 6

Dictionary dictionary = 7

Rle rle = 8

ByteStreamSplit byte_stream_split = 9

General general = 10

FixedSizeList fixed_size_list = 11

PackedStruct packed_struct = 12

VariablePackedStruct variable_packed_struct = 13

message Constant

optional bytes value = 1

message Dictionary

optional CompressiveEncoding indices = 1

optional CompressiveEncoding items = 2

uint32 num_dictionary_items = 3

message FixedSizeList

uint64 items_per_value = 1

bool has_validity = 3

optional CompressiveEncoding values = 2

message Flat

uint64 bits_per_value = 1

optional BufferCompression data = 2

message Fsst

bytes symbol_table = 1

optional CompressiveEncoding values = 2

message FullZipLayout

uint32 bits_rep = 1

uint32 bits_def = 2

oneof details

uint32 bits_per_value = 3

uint32 bits_per_offset = 4

uint32 num_items = 5

uint32 num_visible_items = 6

optional CompressiveEncoding value_compression = 7

repeated RepDefLayer layers = 8

message General

optional BufferCompression compression = 1

optional CompressiveEncoding values = 3

message InlineBitpacking

uint64 uncompressed_bits_per_value = 1

optional BufferCompression values = 2

message MiniBlockLayout

optional CompressiveEncoding rep_compression = 1

optional CompressiveEncoding def_compression = 2

optional CompressiveEncoding value_compression = 3

optional CompressiveEncoding dictionary = 4

uint64 num_dictionary_items = 5

repeated RepDefLayer layers = 6

uint64 num_buffers = 7

uint32 repetition_index_depth = 8

uint64 num_items = 9

message OutOfLineBitpacking

uint64 uncompressed_bits_per_value = 1

optional CompressiveEncoding values = 3

message PackedStruct

repeated uint64 bits_per_value = 1

optional CompressiveEncoding values = 2

message PageLayout

oneof layout