package lance.encodings


message AllNullLayout

A layout used for pages where all values are null. There may be buffers of repetition and definition information if required in order to interpret what kind of nulls are present.

Used in: PageLayout

repeated RepDefLayer layers = 5
The meaning of each repdef layer, used to interpret repdef buffers correctly

message ArrayEncoding

Encodings that decode into an Arrow array

Used in: Binary, Dictionary, FixedSizeBinary, FixedSizeList, Fsst, FullZipLayout, List, MiniBlockLayout, Nullable.NoNull, Nullable.SomeNull, PackedStruct, PackedStructFixedWidthMiniBlock, encodings_datafusion.ZoneMaps

oneof array_encoding
- Flat flat = 1
- Nullable nullable = 2
- FixedSizeList fixed_size_list = 3
- List list = 4
- SimpleStruct struct = 5
- Binary binary = 6
- Dictionary dictionary = 7
- Fsst fsst = 8
- PackedStruct packed_struct = 9
- Bitpacked bitpacked = 10
- FixedSizeBinary fixed_size_binary = 11
- BitpackedForNonNeg bitpacked_for_non_neg = 12
- Constant constant = 13
- InlineBitpacking inline_bitpacking = 14
- OutOfLineBitpacking out_of_line_bitpacking = 15
- Variable variable = 16
- PackedStructFixedWidthMiniBlock packed_struct_fixed_width_mini_block = 17
- Block block = 18

message Binary

An array encoding for binary fields

Used in: ArrayEncoding

optional ArrayEncoding indices = 1
optional ArrayEncoding bytes = 2
uint64 null_adjustment = 3

message Bitpacked

Items are bitpacked in a buffer

Used in: ArrayEncoding

uint64 compressed_bits_per_value = 1
the number of bits used for a value in the buffer
uint64 uncompressed_bits_per_value = 2
the number of bits of the uncompressed value. e.g. for a u32, this will be 32
optional Buffer buffer = 3
The buffer holding the bitpacked items
bool signed = 4
Whether or not a sign bit is included in the bitpacked value
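The relationship between compressed_bits_per_value and uncompressed_bits_per_value can be illustrated with a minimal Python sketch. The function names `pack_bits` and `unpack_bits` are hypothetical, and the least-significant-bit-first layout is an assumption made for illustration; this page does not specify the bit order used inside the buffer.

```python
def pack_bits(values, compressed_bits_per_value):
    """Pack non-negative ints into bytes, LSB-first (assumed bit order)."""
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v >= (1 << compressed_bits_per_value):
            raise ValueError(f"{v} does not fit in {compressed_bits_per_value} bits")
        # Place value i directly after value i-1 in the bit stream.
        acc |= v << (i * compressed_bits_per_value)
    total_bits = len(values) * compressed_bits_per_value
    return acc.to_bytes((total_bits + 7) // 8, "little")


def unpack_bits(buf, compressed_bits_per_value, count):
    """Recover `count` values from a buffer written by pack_bits."""
    acc = int.from_bytes(buf, "little")
    mask = (1 << compressed_bits_per_value) - 1
    return [(acc >> (i * compressed_bits_per_value)) & mask for i in range(count)]


# Four u32 values (uncompressed_bits_per_value = 32) that all fit in 3 bits.
values = [3, 0, 7, 5]
packed = pack_bits(values, 3)
restored = unpack_bits(packed, 3, len(values))
```

In this sketch the four values occupy 12 bits (two bytes) instead of the 16 bytes that four uncompressed u32 values would need.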

message BitpackedForNonNeg

Items are bitpacked in a buffer

Used in: ArrayEncoding

uint64 compressed_bits_per_value = 1
the number of bits used for a value in the buffer
uint64 uncompressed_bits_per_value = 2
the number of bits of the uncompressed value. e.g. for a u32, this will be 32
optional Buffer buffer = 3
The buffer holding the bitpacked items

message Blob

Marks a column as blob data. It will contain a packed struct with fields position and size (u64)

Used in: ColumnEncoding

optional ColumnEncoding inner = 1

message Block

Used in: ArrayEncoding

string scheme = 1

message Buffer

A pointer to a buffer in a Lance file. A writer can place a buffer in three different locations. The buffer can go in the data page, in the column metadata, or in the file metadata. The writer is free to choose whatever is most appropriate (for example, a dictionary that is shared across all pages in a column will probably go in the column metadata). This specification does not dictate where the buffer should go.

Used in: Bitpacked, BitpackedForNonNeg, Flat, PackedStruct, ZoneIndex

uint32 buffer_index = 1
The index of the buffer in the collection of buffers
Buffer.BufferType buffer_type = 2

enum Buffer.BufferType

The collection holding the buffer

Used in: Buffer

page = 0
The buffer is stored in the data page itself
column = 1
The buffer is stored in the column metadata
file = 2
The buffer is stored in the file metadata
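A reader resolves a Buffer by first picking the collection named by buffer_type and then indexing into it with buffer_index. The following Python sketch shows that lookup. The `resolve_buffer` function and the three list arguments are hypothetical stand-ins for however a real reader holds its page, column, and file buffers.

```python
from enum import IntEnum


class BufferType(IntEnum):
    # Values mirror the Buffer.BufferType enum documented above.
    PAGE = 0
    COLUMN = 1
    FILE = 2


def resolve_buffer(buffer_index, buffer_type, page_buffers, column_buffers, file_buffers):
    """Return the bytes a Buffer message points at (hypothetical helper)."""
    collections = {
        BufferType.PAGE: page_buffers,
        BufferType.COLUMN: column_buffers,
        BufferType.FILE: file_buffers,
    }
    collection = collections[BufferType(buffer_type)]
    if not 0 <= buffer_index < len(collection):
        raise IndexError(f"buffer_index {buffer_index} out of range")
    return collection[buffer_index]


# A dictionary shared by all pages of a column might live in the column metadata.
shared = resolve_buffer(0, BufferType.COLUMN, [b"page0", b"page1"], [b"dict"], [b"global"])
```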

message ColumnEncoding

Encodings that describe a column of values

Used in: Blob, ZoneIndex

oneof column_encoding
- google.protobuf.Empty values = 1
  No special encoding, just column values
- ZoneIndex zone_index = 2
- Blob blob = 3

message Compression

Used in: Flat

string scheme = 1
optional int32 level = 2

message Constant

Compression algorithm where all values have a constant value

Used in: ArrayEncoding

bytes value = 1
The value (TODO: define encoding for literals?)

message Dictionary

An array encoding for dictionary-encoded fields

Used in: ArrayEncoding

optional ArrayEncoding indices = 1
optional ArrayEncoding items = 2
uint32 num_dictionary_items = 3
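As a rough illustration of how the three Dictionary fields relate, the Python sketch below expands already-decoded indices against already-decoded items. The `decode_dictionary` helper is hypothetical, and the validation against num_dictionary_items is an assumption about how a reader might use that field.

```python
def decode_dictionary(indices, items, num_dictionary_items):
    """Replace each index with the dictionary item it refers to."""
    if len(items) != num_dictionary_items:
        raise ValueError("items length does not match num_dictionary_items")
    out = []
    for i in indices:
        # Reject out-of-range indices explicitly (Python would otherwise
        # silently accept negative ones).
        if not 0 <= i < num_dictionary_items:
            raise IndexError(f"dictionary index {i} out of range")
        out.append(items[i])
    return out


decoded = decode_dictionary([0, 2, 2, 1], ["red", "green", "blue"], 3)
```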

message FixedSizeBinary

Used in: ArrayEncoding

optional ArrayEncoding bytes = 1
uint32 byte_width = 2

message FixedSizeList

An array encoding for fixed-size list fields

Used in: ArrayEncoding

uint32 dimension = 1
The number of items in each list
bool has_validity = 3
True if the list is nullable
optional ArrayEncoding items = 2
The items in the list
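Because every list has exactly `dimension` items, the flat items array can be cut into lists without any offsets. The Python sketch below shows this. The `split_fixed_size_lists` helper is hypothetical, and passing validity as a list of booleans is an assumption: this page does not describe how validity is stored when has_validity is true.

```python
def split_fixed_size_lists(items, dimension, validity=None):
    """Cut a flat items array into lists of `dimension` items each."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    if len(items) % dimension != 0:
        raise ValueError("items length is not a multiple of dimension")
    num_lists = len(items) // dimension
    lists = [items[i * dimension:(i + 1) * dimension] for i in range(num_lists)]
    if validity is not None:
        # Assumed representation: one boolean per list, False meaning null.
        lists = [lst if ok else None for lst, ok in zip(lists, validity)]
    return lists


vectors = split_fixed_size_lists([1, 2, 3, 4, 5, 6], dimension=3)
with_nulls = split_fixed_size_lists([1, 2, 0, 0], dimension=2, validity=[True, False])
```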

message Flat

Fixed width items placed contiguously in a buffer

Used in: ArrayEncoding

uint64 bits_per_value = 1
The number of bits per value. Must be greater than 0, but does not need to be a multiple of 8.
optional Buffer buffer = 2
the buffer of values
optional Compression compression = 3
The Compression message can specify the compression scheme (e.g. zstd) and any other information that is needed for decompression. If this array is compressed then the bits_per_value refers to the uncompressed data.
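Since bits_per_value describes the uncompressed data, a reader can check the size of the buffer only after decompression. The Python sketch below illustrates that ordering. It uses `zlib` from the standard library purely as a stand-in for a scheme such as zstd, the `"zlib"` scheme string and both function names are hypothetical, and rounding the total up to whole bytes is an assumption.

```python
import zlib


def flat_byte_len(num_values, bits_per_value):
    """Bytes needed for num_values fixed-width values (assumes byte padding)."""
    if bits_per_value <= 0:
        raise ValueError("bits_per_value must be greater than 0")
    return (num_values * bits_per_value + 7) // 8


def read_flat(buf, num_values, bits_per_value, scheme=None):
    """Decompress if needed, then validate the size of the uncompressed data."""
    if scheme == "zlib":  # stand-in scheme name, not taken from the spec
        buf = zlib.decompress(buf)
    elif scheme is not None:
        raise ValueError(f"unsupported scheme: {scheme}")
    if len(buf) != flat_byte_len(num_values, bits_per_value):
        raise ValueError("buffer length does not match bits_per_value")
    return buf


raw = bytes(range(8))  # two 32-bit values
round_tripped = read_flat(zlib.compress(raw), num_values=2, bits_per_value=32, scheme="zlib")
```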

message Fsst

Used in: ArrayEncoding

optional ArrayEncoding binary = 1
bytes symbol_table = 2

message FullZipLayout

A layout used for pages where the data is large. In this case the cost of transposing the data is relatively small (compared to the cost of writing the data) and so we just zip the buffers together.

Used in: PageLayout

uint32 bits_rep = 1
The number of bits of repetition info (0 if there is no repetition)
uint32 bits_def = 2
The number of bits of definition info (0 if there is no definition)
oneof details
The number of bits of value info. Note: we use bits here (and not bytes) for consistency with other encodings. However, in practice, there is never a reason to use a bits per value that is not a multiple of 8. The complexity is not worth the small savings in space since this encoding is typically used with large values already.
- uint32 bits_per_value = 3
  If this is a fixed width block then we need to have a fixed number of bits per value
- uint32 bits_per_offset = 4
  If this is a variable width block then we need to have a fixed number of bits per offset
uint32 num_items = 5
The number of items in the page
uint32 num_visible_items = 6
The number of visible items in the page
optional ArrayEncoding value_compression = 7
Description of the compression of values
repeated RepDefLayer layers = 8
The meaning of each repdef layer, used to interpret repdef buffers correctly

message InlineBitpacking

Opaque bitpacking variant where the bits per value are stored inline in the chunks themselves

Used in: ArrayEncoding

uint64 uncompressed_bits_per_value = 2
the number of bits of the uncompressed value. e.g. for a u32, this will be 32

message List

An array encoding for variable-length list fields

Used in: ArrayEncoding

optional ArrayEncoding offsets = 1
An array containing the offsets into an items array. This array will have num_rows items and will never have nulls.
If the list at index i is not null then offsets[i] will contain `base + len(list)` where `base` is defined as:
  i == 0: 0
  i > 0: (offsets[i-1] % null_offset_adjustment)
To help understand we can consider the following example list: [ [A, B], null, [], [C, D, E] ]
The offsets will be [2, ?, 2, 5]
If the incoming list at index i IS null then offsets[i] will contain `base + len(list) + null_offset_adjustment` where `base` is defined the same as above.
To complete the above example let's assume that `null_offset_adjustment` is 7. Then the offsets will be [2, 9, 2, 5]
If there are no nulls then the offsets we write here are exactly the same as the offsets in an Arrow list array (except we omit the leading 0, which is redundant).
The reason we do this is so that reading a single list at index i only requires us to load the indices at i and i-1. If the offset at index i is greater than or equal to `null_offset_adjustment` then the list at index i is null (a null list whose base is 0 stores exactly `null_offset_adjustment`). Otherwise the length of the list is `offsets[i] - base` where base is defined the same as above.
Let's consider our example offsets: [2, 9, 2, 5]
We can take any range of lists and determine how many list items are referenced by the sublist.
  0..3: [_, 5] -> items 0..5 (base = 0* and end is 5)
  0..2: [_, 2] -> items 0..2 (base = 0* and end is 2)
  0..1: [_, 9] -> items 0..2 (base = 0* and end is 9 % 7)
  1..3: [2, 5] -> items 2..5 (base = 2 and end is 5)
  1..2: [2, 2] -> items 2..2 (base = 2 and end is 2)
  2..3: [9, 5] -> items 2..5 (base = 9 % 7 and end is 5)
* When the start of our range is the 0th item the base is always 0 and we only need to load a single index from disk to determine the range.
The data type of the offsets array is flexible and does not need to match the data type of the destination array. Please note that the offsets array is very likely to be efficiently encoded by bit packing deltas.
uint64 null_offset_adjustment = 2
If a list is null then we add this value to the offset. This value must be greater than the length of the items so that (offset + null_offset_adjustment) is never used by a non-null list. Note that this value cannot be equal to the length of the items because then a page with a single list would store [ X ] and we couldn't know if that is a null list or a list with X items. Therefore, the best choice for this value is 1 + # of items. Choosing this will maximize the bit packing that we can apply to the offsets.
uint64 num_items = 3
How many items are referenced by these offsets. This is needed in order to determine which items pages map to this offsets page.
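The offsets scheme described for this message can be reproduced with a short Python sketch. The helper names (`encode_offsets`, `list_item_range`, `items_for_range`) are hypothetical and the code works on plain Python lists rather than encoded buffers. It treats any stored offset at or above `null_offset_adjustment` as null, since a null list whose base is 0 stores exactly that value.

```python
def encode_offsets(lists, null_offset_adjustment):
    """Write one offset per list; null lists get the adjustment added."""
    offsets = []
    end = 0  # running count of items, i.e. the `base` of the next list
    for lst in lists:
        if lst is None:
            offsets.append(end + null_offset_adjustment)
        else:
            end += len(lst)
            offsets.append(end)
    return offsets


def list_item_range(offsets, i, null_offset_adjustment):
    """Return (start, end) into the items array for list i, or None if null."""
    base = 0 if i == 0 else offsets[i - 1] % null_offset_adjustment
    if offsets[i] >= null_offset_adjustment:
        return None
    return (base, offsets[i])


def items_for_range(offsets, first, last, null_offset_adjustment):
    """Items referenced by lists first..last (inclusive), as in the examples."""
    base = 0 if first == 0 else offsets[first - 1] % null_offset_adjustment
    return (base, offsets[last] % null_offset_adjustment)


example = [["A", "B"], None, [], ["C", "D", "E"]]
offsets = encode_offsets(example, null_offset_adjustment=7)
```

With the example list and an adjustment of 7 this sketch produces the offsets [2, 9, 2, 5] used in the field description, and `items_for_range(offsets, 2, 3, 7)` needs only offsets[1] and offsets[3] to find items 2..5.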

message MiniBlockLayout

A layout used for pages where the data is small. In this case we can fit many values into a single disk sector and transposing buffers is expensive. As a result, we do not transpose the buffers but compress the data into small chunks (called mini blocks) which are roughly the size of a disk sector.

Used in: PageLayout

optional ArrayEncoding rep_compression = 1
Description of the compression of repetition levels (e.g. how many bits per rep). Optional: if there is no repetition then this field is not present.
optional ArrayEncoding def_compression = 2
Description of the compression of definition levels (e.g. how many bits per def). Optional: if there is no definition then this field is not present.
optional ArrayEncoding value_compression = 3
Description of the compression of values
optional ArrayEncoding dictionary = 4
Dictionary data
uint64 num_dictionary_items = 5
Number of items in the dictionary
repeated RepDefLayer layers = 6
The meaning of each repdef layer, used to interpret repdef buffers correctly
uint64 num_buffers = 7
The number of buffers in each mini-block. This is determined by the compression and does NOT include the repetition or definition buffers (the presence of these buffers can be determined by looking at the rep_compression and def_compression fields).
uint32 repetition_index_depth = 8
The depth of the repetition index. If there is repetition then the depth must be at least 1. If there are many layers of repetition then deeper repetition indices will support deeper nested random access. For example, given 5 layers of repetition then the repetition index depth must be at least 3 to support access like rows[50][17][3]. We require `repetition_index_depth + 1` u64 values per mini-block to store the repetition index if the `repetition_index_depth` is greater than 0. The +1 is because we need to store the number of "leftover items" at the end of the chunk. Otherwise, we wouldn't have any way to know if the final item in a chunk is valid or not.
uint64 num_items = 9
The page already records how many rows are in the page. For mini-block we also need to know how many "items" are in the page. A row and an item are the same thing unless the page has lists.
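The size rule given for repetition_index_depth is simple arithmetic, sketched below in Python. The function name and the `num_mini_blocks` parameter are hypothetical; only the `repetition_index_depth + 1` u64 values per mini-block comes from the field description.

```python
def repetition_index_u64s(num_mini_blocks, repetition_index_depth):
    """Number of u64 values needed for a page's repetition index."""
    if repetition_index_depth == 0:
        # No repetition index is stored when the depth is 0.
        return 0
    # One value per indexed level plus one for the "leftover items" count.
    return num_mini_blocks * (repetition_index_depth + 1)


# Depth 3 (enough for rows[50][17][3]) across 4 mini-blocks.
u64_count = repetition_index_u64s(4, 3)
index_bytes = u64_count * 8
```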

message Nullable

An encoding that adds nullability to another array encoding. This can wrap any array encoding and add nullability information.

Used in: ArrayEncoding

oneof nullability
- Nullable.NoNull no_nulls = 1
  The array has no nulls and there is a single buffer needed
- Nullable.SomeNull some_nulls = 2
  The array may have nulls and we need two buffers
- Nullable.AllNull all_nulls = 3
  All values are null (no buffers needed)
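The three nullability cases differ only in which buffers a reader has to consult. The Python sketch below makes that explicit. The `apply_nullability` helper is hypothetical, it operates on already-decoded lists, and it assumes the values array keeps a placeholder slot for every null row.

```python
def apply_nullability(kind, num_rows, values=None, validity=None):
    """Combine decoded values and validity according to the oneof case."""
    if kind == "no_nulls":
        # Single buffer: the values are used as-is.
        return list(values)
    if kind == "some_nulls":
        # Two buffers: validity marks which value slots are real.
        return [v if ok else None for v, ok in zip(values, validity)]
    if kind == "all_nulls":
        # No buffers: only the row count is needed.
        return [None] * num_rows
    raise ValueError(f"unknown nullability case: {kind}")


dense = apply_nullability("no_nulls", 3, values=[1, 2, 3])
sparse = apply_nullability("some_nulls", 3, values=[1, 0, 3], validity=[True, False, True])
empty = apply_nullability("all_nulls", 3)
```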

message Nullable.AllNull

Used in: Nullable

(message has no fields)

message Nullable.NoNull

Used in: Nullable

optional ArrayEncoding values = 1

message Nullable.SomeNull

Used in: Nullable

optional ArrayEncoding validity = 1
optional ArrayEncoding values = 2

message OutOfLineBitpacking

Transparent bitpacking variant where the number of bits per value is fixed through the whole buffer

Used in: ArrayEncoding

uint64 uncompressed_bits_per_value = 2
the number of bits of the uncompressed value. e.g. for a u32, this will be 32
uint64 compressed_bits_per_value = 3
The number of compressed bits per value, fixed across the entire buffer

message PackedStruct

Used in: ArrayEncoding

repeated ArrayEncoding inner = 1
optional Buffer buffer = 2

message PackedStructFixedWidthMiniBlock

Used in: ArrayEncoding

optional ArrayEncoding Flat = 1
repeated uint32 bits_per_values = 2

message PageLayout

oneof layout
- MiniBlockLayout mini_block_layout = 1
- AllNullLayout all_null_layout = 2
- FullZipLayout full_zip_layout = 3

enum RepDefLayer

Describes the meaning of each repdef layer in a mini-block layout

Used in: AllNullLayout, FullZipLayout, MiniBlockLayout

REPDEF_UNSPECIFIED = 0
Should never be used; included for debugging purposes and general protobuf best practice
REPDEF_ALL_VALID_ITEM = 1
All values are valid (can be primitive or struct)
REPDEF_ALL_VALID_LIST = 2
All list values are valid
REPDEF_NULLABLE_ITEM = 3
There are one or more null items (can be primitive or struct)
REPDEF_NULLABLE_LIST = 4
A list layer with null lists but no empty lists
REPDEF_EMPTYABLE_LIST = 5
A list layer with empty lists but no null lists
REPDEF_NULL_AND_EMPTY_LIST = 6
A list layer with both empty lists and null lists
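The four list-layer values are distinguished by whether null lists and empty lists occur. The Python sketch below picks the layer value for one level of lists. The `classify_list_layer` helper is hypothetical and is meant only to illustrate the definitions above, not how a writer actually chooses layers.

```python
from enum import IntEnum


class RepDefLayer(IntEnum):
    # Values mirror the enum documented above.
    REPDEF_UNSPECIFIED = 0
    REPDEF_ALL_VALID_ITEM = 1
    REPDEF_ALL_VALID_LIST = 2
    REPDEF_NULLABLE_ITEM = 3
    REPDEF_NULLABLE_LIST = 4
    REPDEF_EMPTYABLE_LIST = 5
    REPDEF_NULL_AND_EMPTY_LIST = 6


def classify_list_layer(lists):
    """Pick the list-layer value for a sequence of lists (None means null)."""
    has_null = any(lst is None for lst in lists)
    has_empty = any(lst is not None and len(lst) == 0 for lst in lists)
    if has_null and has_empty:
        return RepDefLayer.REPDEF_NULL_AND_EMPTY_LIST
    if has_null:
        return RepDefLayer.REPDEF_NULLABLE_LIST
    if has_empty:
        return RepDefLayer.REPDEF_EMPTYABLE_LIST
    return RepDefLayer.REPDEF_ALL_VALID_LIST


layer = classify_list_layer([["A", "B"], None, [], ["C", "D", "E"]])
```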

message SimpleStruct

An array encoding for shredded structs that will never be null. There is no actual data in this column. TODO: Struct validity bitmaps will be placed here.

Used in: ArrayEncoding

(message has no fields)

message Variable

Used in: ArrayEncoding

uint32 bits_per_offset = 1

message ZoneIndex

Wraps a column with a zone map index that can be used to apply pushdown filters

Used in: ColumnEncoding

uint32 rows_per_zone = 1
optional Buffer zone_map_buffer = 2
optional ColumnEncoding inner = 3
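To show how a zone map supports pushdown filtering in general, here is a minimal Python sketch that keeps a (min, max) pair per zone of rows_per_zone rows and skips zones that cannot match a `value > threshold` filter. The function names are hypothetical, and the (min, max) representation is an assumption: this page does not describe the contents of zone_map_buffer.

```python
def build_zone_maps(values, rows_per_zone):
    """Compute an assumed (min, max) summary for each zone of rows."""
    if rows_per_zone <= 0:
        raise ValueError("rows_per_zone must be positive")
    zones = []
    for start in range(0, len(values), rows_per_zone):
        zone = values[start:start + rows_per_zone]
        zones.append((min(zone), max(zone)))
    return zones


def zones_matching_gt(zone_maps, threshold):
    """Indices of zones that may contain a value greater than threshold."""
    return [i for i, (_, hi) in enumerate(zone_maps) if hi > threshold]


values = [1, 5, 3, 10, 12, 11, 2, 2, 20]
zone_maps = build_zone_maps(values, rows_per_zone=3)
candidates = zones_matching_gt(zone_maps, threshold=9)
```

Only the candidate zones need to be read and decoded; the first zone is skipped because its maximum is below the threshold.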

package lance.encodings

message AllNullLayout

repeated RepDefLayer layers = 5

message ArrayEncoding

oneof array_encoding

Flat flat = 1

Nullable nullable = 2

FixedSizeList fixed_size_list = 3

List list = 4

SimpleStruct struct = 5

Binary binary = 6

Dictionary dictionary = 7

Fsst fsst = 8

PackedStruct packed_struct = 9

Bitpacked bitpacked = 10

FixedSizeBinary fixed_size_binary = 11

BitpackedForNonNeg bitpacked_for_non_neg = 12

Constant constant = 13

InlineBitpacking inline_bitpacking = 14

OutOfLineBitpacking out_of_line_bitpacking = 15

Variable variable = 16

PackedStructFixedWidthMiniBlock packed_struct_fixed_width_mini_block = 17

Block block = 18

message Binary

optional ArrayEncoding indices = 1

optional ArrayEncoding bytes = 2

uint64 null_adjustment = 3

message Bitpacked

uint64 compressed_bits_per_value = 1

uint64 uncompressed_bits_per_value = 2

optional Buffer buffer = 3

bool signed = 4

message BitpackedForNonNeg

uint64 compressed_bits_per_value = 1

uint64 uncompressed_bits_per_value = 2

optional Buffer buffer = 3

message Blob

optional ColumnEncoding inner = 1

message Block

string scheme = 1

message Buffer

uint32 buffer_index = 1

Buffer.BufferType buffer_type = 2

enum Buffer.BufferType

page = 0

column = 1

file = 2

message ColumnEncoding

oneof column_encoding

google.protobuf.Empty values = 1

ZoneIndex zone_index = 2

Blob blob = 3

message Compression

string scheme = 1

optional int32 level = 2

message Constant

bytes value = 1

message Dictionary

optional ArrayEncoding indices = 1

optional ArrayEncoding items = 2

uint32 num_dictionary_items = 3

message FixedSizeBinary

optional ArrayEncoding bytes = 1

uint32 byte_width = 2

message FixedSizeList

uint32 dimension = 1

bool has_validity = 3

optional ArrayEncoding items = 2

message Flat

uint64 bits_per_value = 1

optional Buffer buffer = 2

optional Compression compression = 3

message Fsst

optional ArrayEncoding binary = 1

bytes symbol_table = 2

message FullZipLayout

uint32 bits_rep = 1

uint32 bits_def = 2

oneof details

uint32 bits_per_value = 3