A layout used for pages where all values are null. There may be buffers of repetition and definition information if required in order to interpret what kind of nulls are present.
Used in:
The meaning of each repdef layer, used to interpret repdef buffers correctly
Encodings that decode into an Arrow array
Used in:
An array encoding for binary fields
Used in:
Items are bitpacked in a buffer
Used in:
the number of bits used for a value in the buffer
the number of bits of the uncompressed value. e.g. for a u32, this will be 32
The items in the list
Whether or not a sign bit is included in the bitpacked value
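As a rough illustration of how the compressed and uncompressed bit widths relate, the sketch below packs small unsigned integers into a byte buffer. The function names and the least-significant-bit-first layout are assumptions made for this example only; the schema above does not specify the bit order, and the sign-bit option is not handled here.

```python
def bitpack(values, bits_per_value):
    """Pack non-negative ints into bytes (assumed layout: LSB first)."""
    mask = (1 << bits_per_value) - 1
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for v in values:
        if v < 0 or v > mask:
            raise ValueError(f"{v} does not fit in {bits_per_value} bits")
        acc |= v << nbits
        nbits += bits_per_value
        # Flush whole bytes as soon as they are available.
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial byte, zero padded
    return bytes(out)


def bitunpack(data, bits_per_value, count):
    """Inverse of bitpack: recover `count` values from the packed bytes."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits_per_value) - 1
    return [(acc >> (i * bits_per_value)) & mask for i in range(count)]
```

With four values that each fit in 3 bits, the packed buffer takes 2 bytes, whereas the same values stored as u32 (32 uncompressed bits each) would take 16 bytes.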
Items are bitpacked in a buffer
Used in:
the number of bits used for a value in the buffer
the number of bits of the uncompressed value. e.g. for a u32, this will be 32
The items in the list
Marks a column as blob data. It will contain a packed struct with fields position and size (u64)
Used in:
Used in:
A pointer to a buffer in a Lance file. A writer can place a buffer in three different locations. The buffer can go in the data page, in the column metadata, or in the file metadata. The writer is free to choose whatever is most appropriate (for example, a dictionary that is shared across all pages in a column will probably go in the column metadata). This specification does not dictate where the buffer should go.
Used in:
The index of the buffer in the collection of buffers
The collection holding the buffer
Used in:
The buffer is stored in the data page itself
The buffer is stored in the column metadata
The buffer is stored in the file metadata
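A buffer reference is therefore a pair: which collection to look in, and an index into that collection. A minimal sketch of resolving such a reference follows. The enum name, its member names and numeric values, and the `resolve_buffer` helper are all hypothetical and chosen for illustration; they are not taken from the schema.

```python
from enum import Enum


class BufferType(Enum):
    # Hypothetical names and values for the three locations described above.
    PAGE = 0
    COLUMN = 1
    FILE = 2


def resolve_buffer(buffer_index, buffer_type,
                   page_buffers, column_buffers, file_buffers):
    """Return the buffer that a (buffer_index, buffer_type) pair points at."""
    collections = {
        BufferType.PAGE: page_buffers,
        BufferType.COLUMN: column_buffers,
        BufferType.FILE: file_buffers,
    }
    return collections[buffer_type][buffer_index]
```

For example, a dictionary shared by every page of a column could be referenced from each page as index 0 of the column collection.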
Encodings that describe a column of values
Used in:
No special encoding, just column values
Used in:
Compression algorithm where all values have a constant value
Used in:
The value (TODO: define encoding for literals?)
An array encoding for dictionary-encoded fields
Used in:
Used in:
An array encoding for fixed-size list fields
Used in:
The number of items in each list
True if the list is nullable
The items in the list
Fixed width items placed contiguously in a buffer
Used in:
the number of bits per value, must be greater than 0, does not need to be a multiple of 8
the buffer of values
The Compression message can specify the compression scheme (e.g. zstd) and any other information that is needed for decompression. If this array is compressed then the bits_per_value refers to the uncompressed data.
Used in:
A layout used for pages where the data is large. In this case the cost of transposing the data is relatively small (compared to the cost of writing the data), so we just zip the buffers together.
Used in:
The number of bits of repetition info (0 if there is no repetition)
The number of bits of definition info (0 if there is no definition)
The number of bits of value info. Note: we use bits here (and not bytes) for consistency with other encodings. However, in practice, there is never a reason to use a bits per value that is not a multiple of 8. The complexity is not worth the small savings in space since this encoding is typically used with large values already.
If this is a fixed width block then we need to have a fixed number of bits per value
If this is a variable width block then we need to have a fixed number of bits per offset
The number of items in the page
The number of visible items in the page
Description of the compression of values
The meaning of each repdef layer, used to interpret repdef buffers correctly
Opaque bitpacking variant where the bits per value are stored inline in the chunks themselves
Used in:
the number of bits of the uncompressed value. e.g. for a u32, this will be 32
An array encoding for variable-length list fields
Used in:
An array containing the offsets into an items array.

This array will have num_rows items and will never have nulls.

If the list at index i is not null then offsets[i] will contain `base + len(list)` where `base` is defined as:

  i == 0: 0
  i > 0: (offsets[i-1] % null_offset_adjustment)

To help understand we can consider the following example list:

  [ [A, B], null, [], [C, D, E] ]

The offsets will be [2, ?, 2, 5]

If the incoming list at index i IS null then offsets[i] will contain `base + len(list) + null_offset_adjustment` where `base` is defined the same as above. To complete the above example let's assume that `null_offset_adjustment` is 7. Then the offsets will be [2, 9, 2, 5]

If there are no nulls then the offsets we write here are exactly the same as the offsets in an Arrow list array (except we omit the leading 0, which is redundant).

The reason we do this is so that reading a single list at index i only requires us to load the indices at i and i-1.

If the offset at index i is greater than or equal to `null_offset_adjustment` then the list at index i is null (a null list whose base is 0 is stored as exactly `null_offset_adjustment`). Otherwise the length of the list is `offsets[i] - base` where base is defined the same as above.

Let's consider our example offsets: [2, 9, 2, 5]

We can take any range of lists and determine how many list items are referenced by the sublist.

  0..3: [_, 5] -> items 0..5 (base = 0* and end is 5)
  0..2: [_, 2] -> items 0..2 (base = 0* and end is 2)
  0..1: [_, 9] -> items 0..2 (base = 0* and end is 9 % 7)
  1..3: [2, 5] -> items 2..5 (base = 2 and end is 5)
  1..2: [2, 2] -> items 2..2 (base = 2 and end is 2)
  2..3: [9, 5] -> items 2..5 (base = 9 % 7 and end is 5)

* When the start of our range is the 0th item the base is always 0 and we only need to load a single index from disk to determine the range.

The data type of the offsets array is flexible and does not need to match the data type of the destination array. Please note that the offsets array is very likely to be efficiently encoded by bit packing deltas.
If a list is null then we add this value to the offset. This value must be greater than the length of the items so that (offset + null_offset_adjustment) is never used by a non-null list. Note that this value cannot be equal to the length of the items because then a page with a single list would store [ X ] and we couldn't know if that is a null list or a list with X items. Therefore, the best choice for this value is 1 + # of items. Choosing this will maximize the bit packing that we can apply to the offsets.
How many items are referenced by these offsets. This is needed in order to determine which items pages map to this offsets page.
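The offset scheme described for this encoding can be sketched in a few lines. The function names below are hypothetical, and `None` stands in for a null list. The null test uses `>=` rather than a strict `>`, because a null list whose base is 0 is stored as exactly `null_offset_adjustment`. When no adjustment is passed, the sketch uses the recommended value of 1 + number of items.

```python
def encode_offsets(lists, null_offset_adjustment=None):
    """Return (offsets, null_offset_adjustment, num_items) for a page."""
    num_items = sum(len(l) for l in lists if l is not None)
    adj = num_items + 1 if null_offset_adjustment is None else null_offset_adjustment
    if adj <= num_items:
        raise ValueError("null_offset_adjustment must exceed the item count")
    offsets = []
    base = 0
    for l in lists:
        end = base + (0 if l is None else len(l))
        # Null lists are flagged by adding the adjustment to the offset.
        offsets.append(end + adj if l is None else end)
        base = end
    return offsets, adj, num_items


def decode_list(offsets, i, adj):
    """Return None for a null list, else the (start, end) item range of list i."""
    base = 0 if i == 0 else offsets[i - 1] % adj
    if offsets[i] >= adj:
        return None
    return (base, offsets[i])


def item_range(offsets, first, last, adj):
    """Item range covered by lists first..last (inclusive list indices)."""
    base = 0 if first == 0 else offsets[first - 1] % adj
    return (base, offsets[last] % adj)
```

Calling `encode_offsets([["A", "B"], None, [], ["C", "D", "E"]], 7)` reproduces the offsets `[2, 9, 2, 5]` from the example, and `item_range` reproduces the range table, for instance lists 2..3 map to items 2..5.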
A layout used for pages where the data is small. In this case we can fit many values into a single disk sector and transposing buffers is expensive. As a result, we do not transpose the buffers but compress the data into small chunks (called mini blocks) which are roughly the size of a disk sector.
Used in:
Description of the compression of repetition levels (e.g. how many bits per rep). Optional: if there is no repetition then this field is not present.
Description of the compression of definition levels (e.g. how many bits per def). Optional: if there is no definition then this field is not present.
Description of the compression of values
Dictionary data
Number of items in the dictionary
The meaning of each repdef layer, used to interpret repdef buffers correctly
The number of buffers in each mini-block. This is determined by the compression and does NOT include the repetition or definition buffers (the presence of these buffers can be determined by looking at the rep_compression and def_compression fields).
The depth of the repetition index. If there is repetition then the depth must be at least 1. If there are many layers of repetition then deeper repetition indices will support deeper nested random access. For example, given 5 layers of repetition then the repetition index depth must be at least 3 to support access like rows[50][17][3]. We require `repetition_index_depth + 1` u64 values per mini-block to store the repetition index if the `repetition_index_depth` is greater than 0. The +1 is because we need to store the number of "leftover items" at the end of the chunk. Otherwise, we wouldn't have any way to know if the final item in a chunk is valid or not.
The page already records how many rows are in the page. For mini-block we also need to know how many "items" are in the page. A row and an item are the same thing unless the page has lists.
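The storage cost of the repetition index follows directly from the rule above: `repetition_index_depth + 1` u64 values per mini-block when the depth is greater than 0, and nothing otherwise. The helper below is a hypothetical illustration of that arithmetic, not part of any Lance API.

```python
U64_BYTES = 8


def repetition_index_bytes(repetition_index_depth, num_mini_blocks):
    """Bytes needed to store the repetition index for a page."""
    if repetition_index_depth < 0 or num_mini_blocks < 0:
        raise ValueError("depth and mini-block count must be non-negative")
    # One u64 per indexed layer plus one for the leftover items at the end
    # of the chunk; no index at all when the depth is 0.
    per_block = repetition_index_depth + 1 if repetition_index_depth > 0 else 0
    return per_block * U64_BYTES * num_mini_blocks
```

For instance, a depth of 3 (enough for access like rows[50][17][3]) over 10 mini-blocks needs 4 u64 values per mini-block, or 320 bytes in total.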
An encoding that adds nullability to another array encoding. This can wrap any array encoding and add nullability information.
Used in:
The array has no nulls and there is a single buffer needed
The array may have nulls and we need two buffers
All values are null (no buffers needed)
Used in:
(message has no fields)
Used in:
Used in:
Transparent bitpacking variant where the number of bits per value is fixed through the whole buffer
Used in:
the number of bits of the uncompressed value. e.g. for a u32, this will be 32
The number of compressed bits per value, fixed across the entire buffer
Used in:
Used in:
Describes the meaning of each repdef layer in a mini-block layout
Used in:
Should never be used; included for debugging purposes and general protobuf best practice
All values are valid (can be primitive or struct)
All list values are valid
There are one or more null items (can be primitive or struct)
A list layer with null lists but no empty lists
A list layer with empty lists but no null lists
A list layer with both empty lists and null lists
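To make the distinction between these layer kinds concrete, the sketch below picks a kind for a single layer of values. The enum member names, their numeric values, and the `classify_layer` helper are hypothetical and chosen only to mirror the descriptions above; `None` stands in for a null.

```python
from enum import Enum


class RepDefLayer(Enum):
    # Hypothetical names and values mirroring the descriptions above.
    UNSPECIFIED = 0
    ALL_VALID_ITEM = 1
    ALL_VALID_LIST = 2
    NULLABLE_ITEM = 3
    NULLABLE_LIST = 4
    EMPTYABLE_LIST = 5
    NULL_AND_EMPTY_LIST = 6


def classify_layer(values, is_list):
    """Choose the layer kind for one layer of values (None means null)."""
    has_null = any(v is None for v in values)
    if not is_list:
        # Primitive or struct layer: only validity matters.
        return RepDefLayer.NULLABLE_ITEM if has_null else RepDefLayer.ALL_VALID_ITEM
    has_empty = any(v is not None and len(v) == 0 for v in values)
    if has_null and has_empty:
        return RepDefLayer.NULL_AND_EMPTY_LIST
    if has_null:
        return RepDefLayer.NULLABLE_LIST
    if has_empty:
        return RepDefLayer.EMPTYABLE_LIST
    return RepDefLayer.ALL_VALID_LIST
```

The caller states whether the layer is a list layer, since a layer made up entirely of nulls cannot be told apart from its values alone.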
An array encoding for shredded structs that will never be null. There is no actual data in this column. TODO: Struct validity bitmaps will be placed here.
Used in:
(message has no fields)
Used in:
Wraps a column with a zone map index that can be used to apply pushdown filters
Used in: