package lance.table

Currently these are all empty messages because all needed details are either hard-coded (e.g. filenames) or stored in the index itself. However, we may want to add more details in the future. In particular, we can add details that may be useful for planning queries (e.g. so that we are not forced to load the index until we know we need it).

(message has no fields)

Lance Data File

Used in: DataFragment, Transaction.DataReplacementGroup

string path = 1
Relative path to the root.
repeated int32 fields = 2
The ids of the fields/columns in this file.
-1 is used for "unassigned" while in memory. It is not meant to be written to disk. -2 is used for "tombstoned", meaning a field that is no longer in use. This is often because the original field id was reassigned to a different data file.
In Lance v1, IDs are assigned based on position in the file, offset by the max existing field id in the table (if any already). So when a fragment is first created with one file of N columns, the field ids will be 1, 2, ..., N. If a second fragment is created with M columns, the field ids will be N+1, N+2, ..., N+M.
In Lance v1 there is one field for each field in the input schema; this includes nested fields (both struct and list). Fixed size list fields have only a single field id (these are not considered nested fields in Lance v1). This allows column indices to be calculated from field IDs and the input schema.
In Lance v2 the field IDs generally follow the same pattern but there is no way to calculate the column index from the field ID. This is because a given field could be encoded in many different ways, some of which occupy a different number of columns. For example, a struct field could be encoded into N + 1 columns or it could be encoded into a single packed column. To determine column indices the column_indices property should be used instead.
In Lance v1 these ids must be sorted but might not always be contiguous.
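The Lance v1 numbering rule described for `fields` can be illustrated with a minimal sketch. The helper name `assign_v1_field_ids` is hypothetical and not part of Lance; it only reproduces the arithmetic stated in the description (ids assigned by position, offset by the max existing field id).

```python
def assign_v1_field_ids(num_columns: int, max_existing_id: int = 0) -> list[int]:
    """Return the ids max_existing_id + 1 .. max_existing_id + num_columns.

    Hypothetical helper: mirrors the rule in the field description only.
    """
    return [max_existing_id + position for position in range(1, num_columns + 1)]


# A first file with N = 3 columns, then a second one with M = 2 columns.
first = assign_v1_field_ids(3)
second = assign_v1_field_ids(2, max_existing_id=max(first))
print(first, second)
```

The second call starts numbering right after the largest id already in use, which is why ids stay sorted within a file even when they are not contiguous across the table.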
repeated int32 column_indices = 3
The top-level column indices for each field in the file.
If the data file is version 1 then this property will be empty. Otherwise there must be one entry for each field in `fields`.
Some fields may not correspond to a top-level column in the file. In these cases the index will be -1.
For example, consider the schema:
- dimension: packed-struct (0):
  - x: u32 (1)
  - y: u32 (2)
- path: list<u32> (3)
- embedding: fsl<768> (4)
  - fp64
- borders: fsl<4> (5)
  - simple-struct (6)
    - margin: fp64 (7)
    - padding: fp64 (8)
One possible column indices array could be: [0, -1, -1, 1, 3, 4, 5, 6, 7]
This reflects quite a few phenomena:
- The packed struct is encoded into a single column and there is no top-level column for the x or y fields
- The variable sized list is encoded into two columns
- The embedding is encoded into a single column (common for FSL of primitive) and there is no "FSL column"
- The borders field actually does have an "FSL column"
The column indices table may not have duplicates (other than -1).
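The pairing between `fields` and `column_indices` can be sketched as a lookup table. This is an illustrative helper only (`column_index_by_field` is a hypothetical name, not a Lance API); it enforces the two rules stated in the description: one entry per field, and no duplicates other than -1.

```python
def column_index_by_field(fields: list[int], column_indices: list[int]) -> dict[int, int]:
    """Map each field id to its top-level column index, skipping -1 entries."""
    if len(fields) != len(column_indices):
        raise ValueError("column_indices must have one entry per field")
    mapping: dict[int, int] = {}
    seen: set[int] = set()
    for field_id, column in zip(fields, column_indices):
        if column == -1:
            continue  # field has no top-level column (e.g. child of a packed struct)
        if column in seen:
            raise ValueError(f"duplicate column index {column}")
        seen.add(column)
        mapping[field_id] = column
    return mapping


# The example schema from the description: field ids 0..8.
fields = list(range(9))
column_indices = [0, -1, -1, 1, 3, 4, 5, 6, 7]
mapping = column_index_by_field(fields, column_indices)
print(mapping)
```

Note that column 2 never appears in the mapping: it is the second column of the variable-sized list `path`, which has no field id of its own.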
uint32 file_major_version = 4
The major file version used to create the file
uint32 file_minor_version = 5
The minor file version used to create the file. If both `file_major_version` and `file_minor_version` are set to 0, then this is a version 0.1 or version 0.2 file.
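The zero/zero convention for the version fields can be expressed as a small check. This is a sketch; the function name and the returned strings are hypothetical and chosen only for illustration.

```python
def describe_file_version(major: int, minor: int) -> str:
    """Interpret file_major_version / file_minor_version.

    Both fields being 0 means the file is a legacy version 0.1 or 0.2 file;
    the two fields alone cannot tell which of the two it is.
    """
    if major == 0 and minor == 0:
        return "legacy (0.1 or 0.2)"
    return f"{major}.{minor}"


print(describe_file_version(0, 0))
print(describe_file_version(2, 1))
```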

Data fragment. A fragment is a set of files which represent the different columns of the same rows. If a column exists in the schema, but the related file does not exist, treat this column as nulls.

Used in: Manifest, Transaction.Append, Transaction.Delete, Transaction.Merge, Transaction.Overwrite, Transaction.Rewrite, Transaction.Rewrite.RewriteGroup, Transaction.Update

uint64 id = 1
Unique ID of each DataFragment
repeated DataFile files = 2
optional DeletionFile deletion_file = 3
File that indicates which rows, if any, should be considered deleted.
oneof row_id_sequence
A serialized RowIdSequence message (see rowids.proto). These are the row ids for the fragment, in order of the rows as they appear. That is, if a fragment has 3 rows, and the row ids are [1, 42, 3], then the first row is row 1, the second row is row 42, and the third row is row 3.
- bytes inline_row_ids = 5
  If small (< 200KB), the row ids are stored inline.
- ExternalFile external_row_ids = 6
  Otherwise, stored as part of a file.
uint64 physical_rows = 4
Number of original rows in the fragment. This includes rows that are now marked with deletion tombstones. To compute the current number of rows, subtract `deletion_file.num_deleted_rows` from this value.
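The row-count arithmetic for `physical_rows` can be sketched as below. The function name is hypothetical, and treating a missing `deletion_file` as zero deleted rows is an assumption of this sketch rather than something the description states.

```python
from typing import Optional


def live_row_count(physical_rows: int, num_deleted_rows: Optional[int]) -> int:
    """Current rows = physical_rows - deletion_file.num_deleted_rows.

    `num_deleted_rows` is None when the fragment has no deletion file
    (assumed here to mean that nothing was deleted).
    """
    deleted = num_deleted_rows or 0
    if deleted > physical_rows:
        raise ValueError("more deleted rows than physical rows")
    return physical_rows - deleted


print(live_row_count(100, 7))
print(live_row_count(100, None))
```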

Deletion File. The path of the deletion file is constructed as: {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension} where {extension} is `.arrow` or `.bin` depending on the type of deletion.

Used in: DataFragment

DeletionFile.DeletionFileType file_type = 1
Type of deletion file. If it is unspecified, then the remaining fields will be missing.
uint64 read_version = 2
The version of the dataset this deletion file was built from.
uint64 id = 3
An opaque id used to differentiate this file from others written by concurrent writers.
uint64 num_deleted_rows = 4
The number of rows that are marked as deleted.

Type of deletion file, which varies depending on the most efficient way to store the deleted row offsets. If there are no deletions, the type will be unspecified. If there are sparsely deleted rows, then ARROW_ARRAY is the most efficient. If there are densely deleted rows, then BITMAP is the most efficient.

Used in: DeletionFile

ARROW_ARRAY = 0
Deletion file is a single Int32Array of deleted row offsets. This is stored as an Arrow IPC file with one batch and one column. Has a .arrow extension.
BITMAP = 1
Deletion file is a Roaring Bitmap of deleted row offsets. Has a .bin extension.
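The deletion file path template given for DeletionFile, combined with the two extensions listed for the enum values, can be sketched as follows. The helper name and the dictionary are hypothetical, and the sketch assumes the template produces a single dot before the extension.

```python
# Extension per DeletionFileType value, as listed in the enum descriptions.
EXTENSION_BY_TYPE = {"ARROW_ARRAY": "arrow", "BITMAP": "bin"}


def deletion_file_path(root: str, fragment_id: int, read_version: int,
                       file_id: int, file_type: str) -> str:
    """Build {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}."""
    extension = EXTENSION_BY_TYPE[file_type]
    return f"{root}/_deletions/{fragment_id}-{read_version}-{file_id}.{extension}"


path = deletion_file_path("s3://bucket/table", 3, 12, 7, "BITMAP")
print(path)
```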

A basic bitpacked array of u64 values.

Used in: U64Segment, U64Segment.RangeWithHoles

oneof array
- EncodedU64Array.U16Array u16_array = 1
- EncodedU64Array.U32Array u32_array = 2
- EncodedU64Array.U64Array u64_array = 3

Used in: EncodedU64Array

uint64 base = 1
bytes offsets = 2
The deltas are stored as 16-bit unsigned integers (protobuf doesn't support 16-bit integers, so we use bytes instead).

Used in: EncodedU64Array

uint64 base = 1
bytes offsets = 2
The deltas are stored as 32-bit unsigned integers (we use bytes instead of uint32 to avoid overhead of varint encoding).

Used in: EncodedU64Array

bytes values = 2
We use bytes instead of uint64 to avoid overhead of varint encoding.
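Decoding a U16Array or U32Array could look like the sketch below. Two things here are assumptions not stated in the descriptions: that the packed integers are little-endian, and that each value is reconstructed as `base + delta`. The function name is hypothetical.

```python
import struct


def decode_offsets(base: int, offsets: bytes, width: int) -> list[int]:
    """Decode packed unsigned deltas and add them to `base`.

    width is 2 for U16Array and 4 for U32Array.
    ASSUMPTIONS: little-endian byte order; value = base + delta.
    """
    format_char = {2: "H", 4: "I"}[width]
    if len(offsets) % width != 0:
        raise ValueError("offsets length is not a multiple of the element width")
    count = len(offsets) // width
    deltas = struct.unpack(f"<{count}{format_char}", offsets)
    return [base + delta for delta in deltas]


packed = struct.pack("<3H", 0, 5, 65535)
values = decode_offsets(1000, packed, width=2)
print(values)
```

Storing deltas against a shared base is what lets a run of large but closely spaced u64 values fit into 2 or 4 bytes each.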

Used in: DataFragment

string path = 1
Path to the file, relative to the root of the table.
uint64 offset = 2
The offset in the file where the data starts.
uint64 size = 3
The size of the data in the file.
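Reading the region described by `path`, `offset`, and `size` might be sketched as below. This assumes a local filesystem and that `size` is a byte count starting at `offset`; the function name and the demo file are hypothetical.

```python
import os
import tempfile


def read_external_file(root: str, path: str, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at `offset` from `path`, relative to `root`."""
    with open(os.path.join(root, path), "rb") as handle:
        handle.seek(offset)
        data = handle.read(size)
    if len(data) != size:
        raise EOFError(f"expected {size} bytes, got {len(data)}")
    return data


# Demo with a throwaway file: 6 header bytes, then a 7-byte payload.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "rowids.bin"), "wb") as handle:
        handle.write(b"HEADERpayloadTRAILER")
    payload = read_external_file(root, "rowids.bin", offset=6, size=7)

print(payload)
```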

Metadata describing the index.

Used in: IndexSection, Transaction.CreateIndex

optional UUID uuid = 1
Unique ID of an index. It is unique across all the dataset versions.
repeated int32 fields = 2
The columns to build the index.
string name = 3
Index name. Must be unique within one dataset version.
uint64 dataset_version = 4
The version of the dataset this index was built from.
bytes fragment_bitmap = 5
A bitmap of the included fragment ids.
This may be used to determine how much of the dataset is covered by the index. This information can be retrieved from the dataset by looking at the dataset at `dataset_version`. However, since the old version may be deleted while the index is still in use, this information is also stored in the index.
The bitmap is stored as a 32-bit Roaring bitmap.
optional google.protobuf.Any index_details = 6
Details, specific to the index type, which are needed to load / interpret the index.
Indices should avoid putting large amounts of information in this field, as it will bloat the manifest.

Index Section, containing a list of index metadata for one dataset version.

repeated IndexMetadata indices = 1

(message has no fields)

Manifest is a global section shared between all the files.

repeated file.Field fields = 1
All fields of the dataset, including the nested fields.
repeated DataFragment fragments = 2
Fragments of the dataset.
uint64 version = 3
Snapshot version number.
uint64 version_aux_data = 4
The file position of the version auxiliary data.
* It is not inheritable between versions.
* It is not loaded by default during query.
map<string, bytes> metadata = 5
Schema metadata.
optional Manifest.WriterVersion writer_version = 13
The version of the writer that created this file. This information may be used to detect whether the file may have known bugs associated with that writer.
optional uint64 index_section = 6
If present, the file position of the index metadata.
optional google.protobuf.Timestamp timestamp = 7
Version creation Timestamp, UTC timezone
string tag = 8
Optional version tag
uint64 reader_feature_flags = 9
Feature flags for readers. A bitmap of flags that indicate which features are required to be able to read the table. If a reader does not recognize a flag that is set, it should not attempt to read the dataset.
Known flags:
* 1: deletion files are present
* 2: move_stable_row_ids: row IDs are tracked and stable after move operations (such as compaction), but not updates.
* 4: use v2 format (deprecated)
* 8: table config is present
uint64 writer_feature_flags = 10
Feature flags for writers. A bitmap of flags that indicate which features are required to be able to write to the dataset. If a writer does not recognize a flag that is set, it should not attempt to write to the dataset. The flags are the same as for reader_feature_flags, although they will not always apply to both.
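The rule for `reader_feature_flags` and `writer_feature_flags` (refuse to proceed when an unrecognized bit is set) can be sketched with plain bit masks. The constant and function names are hypothetical; the numeric values are the known flags listed for `reader_feature_flags`.

```python
FLAG_DELETION_FILES = 1
FLAG_MOVE_STABLE_ROW_IDS = 2
FLAG_V2_FORMAT = 4  # deprecated
FLAG_TABLE_CONFIG = 8

KNOWN_FLAGS = (FLAG_DELETION_FILES | FLAG_MOVE_STABLE_ROW_IDS
               | FLAG_V2_FORMAT | FLAG_TABLE_CONFIG)


def unknown_flags(feature_flags: int, supported: int = KNOWN_FLAGS) -> int:
    """Return the bits set in `feature_flags` that `supported` does not cover."""
    return feature_flags & ~supported


def can_proceed(feature_flags: int, supported: int = KNOWN_FLAGS) -> bool:
    """A reader (or writer) should only proceed if it recognizes every set flag."""
    return unknown_flags(feature_flags, supported) == 0


print(can_proceed(FLAG_DELETION_FILES | FLAG_TABLE_CONFIG))
print(can_proceed(16))
```

A reader that implements only some features would pass a narrower `supported` mask, so that a dataset requiring the others is rejected instead of being misread.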
uint32 max_fragment_id = 11
The highest fragment ID that has been used so far. This ID is not guaranteed to be present in the current version, but it may have been used in previous versions. For a single file, will be zero.
string transaction_file = 12
Path to the transaction file, relative to `{root}/_transactions`. This contains a serialized Transaction message representing the transaction that created this version. May be empty if no transaction file was written. The path format is "{read_version}-{uuid}.txn" where {read_version} is the version of the table the transaction read from, and {uuid} is a hyphen-separated UUID.
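The "{read_version}-{uuid}.txn" naming scheme can be built and parsed as sketched below. Both helper names are hypothetical. Because the UUID itself contains hyphens, the parser splits only on the first hyphen.

```python
import uuid


def transaction_file_name(read_version: int, transaction_uuid: uuid.UUID) -> str:
    """Format "{read_version}-{uuid}.txn" with a hyphen-separated UUID."""
    return f"{read_version}-{transaction_uuid}.txn"


def parse_transaction_file_name(name: str) -> tuple[int, uuid.UUID]:
    """Split a transaction file name back into (read_version, uuid)."""
    if not name.endswith(".txn"):
        raise ValueError(f"not a transaction file: {name!r}")
    version_text, separator, uuid_text = name[: -len(".txn")].partition("-")
    if not separator:
        raise ValueError(f"missing '-' separator: {name!r}")
    return int(version_text), uuid.UUID(uuid_text)


example_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")
name = transaction_file_name(5, example_uuid)
print(name)
print(parse_transaction_file_name(name))
```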
uint64 next_row_id = 14
The next unused row id. If zero, then the table does not have any rows. This is only used if the "move_stable_row_ids" feature flag is set.
optional Manifest.DataStorageFormat data_format = 15
The data storage format. This specifies what format is used to store the data files.
map<string, string> config = 16
Table config. Keys with the prefix "lance." are reserved for the Lance library. Other libraries may wish to similarly prefix their configuration keys appropriately.
uint64 blob_dataset_version = 17
The version of the blob dataset associated with this table. Changes to blob fields will modify the blob dataset and update this version in the parent table. If this value is 0 then there are no blob fields.

Used in: Manifest

string file_format = 1
The format of the data files (e.g. "lance")
string version = 2
The max format version of the data files. This is the maximum version of the file format that the dataset will create. This may be lower than the maximum version that can be written in order to allow older readers to read the dataset.

Used in: Manifest

string library = 1
The name of the library that created this file.
string version = 2
The version of the library that created this file. Because we cannot assume that the library is semantically versioned, this is a string. However, if it is semantically versioned, it should be a valid semver string without any 'v' prefix. For example: `2.0.0`, `2.0.0-rc.1`.

(message has no fields)

A sequence of row IDs. This is split up into one or more segments, each of which can be encoded in different ways. The encodings are optimized for values that are sorted, which will often be the case with row ids. They also have optimized forms depending on how sparse the values are.

repeated U64Segment segments = 1

A transaction represents the changes to a dataset. This has two purposes:
1. When retrying a commit, the transaction can be used to re-build an updated manifest.
2. When there's a conflict, this can be used to determine whether the other transaction is compatible with this one.

uint64 read_version = 1
The version of the dataset this transaction was built from. For example, for a delete transaction this means the version of the dataset that was read from while evaluating the deletion predicate.
string uuid = 2
The UUID that uniquely identifies a transaction.
string tag = 3
Optional version tag.
oneof operation
The operation of this transaction.
- Transaction.Append append = 100
- Transaction.Delete delete = 101
- Transaction.Overwrite overwrite = 102
- Transaction.CreateIndex create_index = 103
- Transaction.Rewrite rewrite = 104
- Transaction.Merge merge = 105
- Transaction.Restore restore = 106
- Transaction.ReserveFragments reserve_fragments = 107
- Transaction.Update update = 108
- Transaction.Project project = 109
- Transaction.UpdateConfig update_config = 110
- Transaction.DataReplacement data_replacement = 111
oneof blob_operation
An operation to apply to the blob dataset
- Transaction.Append blob_append = 200
- Transaction.Overwrite blob_overwrite = 202

Add new rows to the dataset.

Used in: Transaction

repeated DataFragment fragments = 1
The new fragments to append. Fragment IDs are not yet assigned.

Add or replace a new secondary index. This is also used to remove an index (we are replacing it with nothing).
- new_indices: the modified indices, empty if dropping indices only
- removed_indices: the indices that are being replaced

Used in: Transaction

repeated IndexMetadata new_indices = 1
repeated IndexMetadata removed_indices = 2

An operation that replaces the data in a region of the table with new data.

Used in: Transaction

repeated DataReplacementGroup replacements = 1

Used in: DataReplacement

uint64 fragment_id = 1
optional DataFile new_file = 2

Mark rows as deleted.

Used in: Transaction

repeated DataFragment updated_fragments = 1
The fragments to update. The fragment IDs will match existing fragments in the dataset.
repeated uint64 deleted_fragment_ids = 2
The fragments to delete entirely.
string predicate = 3
The predicate that was evaluated. This may be used to determine whether the delete would have affected files written by a concurrent transaction.

An operation that merges in a new column, altering the schema.

Used in: Transaction

repeated DataFragment fragments = 1
The updated fragments. These should all have existing fragment IDs.
repeated file.Field schema = 2
The new schema
map<string, bytes> schema_metadata = 3
Schema metadata.

Create or overwrite the entire dataset.

Used in: Transaction

repeated DataFragment fragments = 1
The new fragments. Fragment IDs are not yet assigned.
repeated file.Field schema = 2
The new schema
map<string, bytes> schema_metadata = 3
Schema metadata.
map<string, string> config_upsert_values = 4
Key-value pairs to merge with existing config.

An operation that projects a subset of columns, altering the schema.

Used in: Transaction

repeated file.Field schema = 1
The new schema

An operation that reserves fragment ids for future use in a rewrite operation.

Used in: Transaction

uint32 num_fragments = 1

An operation that restores a dataset to a previous version.

Used in: Transaction

uint64 version = 1
The version to restore to

An operation that rewrites but does not change the data in the table. These kinds of operations just rearrange data.

Used in: Transaction

repeated DataFragment old_fragments = 1
The old fragments that are being replaced. DEPRECATED: use groups instead. These should all have existing fragment IDs.
repeated DataFragment new_fragments = 2
The new fragments. DEPRECATED: use groups instead. These fragment IDs are not yet assigned.
repeated Rewrite.RewriteGroup groups = 3
Groups of files that have been rewritten
repeated Rewrite.RewrittenIndex rewritten_indices = 4
Indices that have been rewritten

A group of rewrite files that are all part of the same rewrite.

Used in: Rewrite

repeated DataFragment old_fragments = 1
The old fragment that is being replaced. This should have an existing fragment ID.
repeated DataFragment new_fragments = 2
The new fragment. The ID should have been reserved by an earlier reserve operation.

During a rewrite an index may be rewritten. We only serialize the UUID since a rewrite should not change the other index parameters.

Used in: Rewrite

optional UUID old_id = 1
The id of the index that will be replaced
optional UUID new_id = 2
The id of the new index

An operation that updates rows but does not add or remove rows.

Used in: Transaction

repeated uint64 removed_fragment_ids = 1
The fragments that have been removed. These are fragments where all rows have been updated and moved to a new fragment.
repeated DataFragment updated_fragments = 2
The fragments that have been updated.
repeated DataFragment new_fragments = 3
The new fragments where updated rows have been moved to.

An operation that updates the table config.

Used in: Transaction

map<string, string> upsert_values = 1
repeated string delete_keys = 2
map<string, string> schema_metadata = 3
map<uint32, UpdateConfig.FieldMetadataUpdate> field_metadata = 4

Used in: UpdateConfig

map<string, string> metadata = 5

Different ways to encode a sequence of u64 values.

Used in: RowIdSequence

oneof segment
- U64Segment.Range range = 1
  When the values are sorted and contiguous.
- U64Segment.RangeWithHoles range_with_holes = 2
  When the values are sorted but have a few gaps.
- U64Segment.RangeWithBitmap range_with_bitmap = 3
  When the values are sorted but have many gaps.
- EncodedU64Array sorted_array = 4
  When the values are sorted but are sparse.
- EncodedU64Array array = 5
  A general array of values, which is not sorted.

A range of u64 values.

Used in: U64Segment

uint64 start = 1
The start of the range, inclusive.
uint64 end = 2
The end of the range, exclusive.

A range of u64 values with a bitmap.

Used in: U64Segment

uint64 start = 1
The start of the range, inclusive.
uint64 end = 2
The end of the range, exclusive.
bytes bitmap = 3
A bitmap of the values in the range. The bitmap is a sequence of bytes, where each byte represents 8 values. The first byte represents values start to start + 7, the second byte represents values start + 8 to start + 15, and so on. The most significant bit of each byte represents the first value in the range, and the least significant bit represents the last value in the range. If the bit is set, the value is in the range; if it is not set, the value is not in the range.
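The bit layout described for `bitmap` (eight values per byte, most significant bit first) can be decoded with the sketch below. The function name is hypothetical; the layout follows the field description.

```python
def decode_range_with_bitmap(start: int, end: int, bitmap: bytes) -> list[int]:
    """Return the values in [start, end) whose bit is set in `bitmap`.

    Byte k covers values start + 8k .. start + 8k + 7, and the most
    significant bit of each byte stands for the first of those values.
    """
    length = end - start
    if len(bitmap) < (length + 7) // 8:
        raise ValueError("bitmap is too short for the range")
    values = []
    for position in range(length):
        byte = bitmap[position // 8]
        if byte & (0x80 >> (position % 8)):
            values.append(start + position)
    return values


# Range [10, 20): bits set for positions 0, 2 (first byte) and 9 (second byte).
present = decode_range_with_bitmap(10, 20, bytes([0b10100000, 0b01000000]))
print(present)
```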

A range of u64 values with holes.

Used in: U64Segment

uint64 start = 1
The start of the range, inclusive.
uint64 end = 2
The end of the range, exclusive.
optional EncodedU64Array holes = 3
The holes in the range, as a sorted array of values. Binary search can be used to check whether a value is a hole and should be skipped. This can also be used to count the number of holes before a given value, if you need to find the logical offset of a value in the segment.
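The two uses of binary search mentioned for `holes` (membership test and counting holes before a value) can be sketched with the standard `bisect` module. The helper names are hypothetical, and the sketch assumes the `holes` array has already been decoded into a sorted Python list.

```python
from bisect import bisect_left


def is_hole(holes: list[int], value: int) -> bool:
    """Binary search: is `value` one of the sorted holes?"""
    index = bisect_left(holes, value)
    return index < len(holes) and holes[index] == value


def logical_offset(start: int, end: int, holes: list[int], value: int) -> int:
    """Position of `value` within the segment once holes are skipped."""
    if not start <= value < end or is_hole(holes, value):
        raise KeyError(f"{value} is not in the segment")
    holes_before = bisect_left(holes, value)  # number of holes smaller than value
    return (value - start) - holes_before


# Range [0, 10) with holes at 2 and 5 holds: 0, 1, 3, 4, 6, 7, 8, 9.
holes = [2, 5]
print(logical_offset(0, 10, holes, 6))
print(is_hole(holes, 5))
```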

UUID type, encoded as 16 bytes.

Used in: IndexMetadata, Transaction.Rewrite.RewrittenIndex

bytes uuid = 1

(message has no fields)

Auxiliary Data attached to a version. Only load on-demand.

map<string, bytes> metadata = 3
Key-value metadata.

package lance.table

message BTreeIndexDetails

message BitmapIndexDetails

message DataFile

string path = 1

repeated int32 fields = 2

repeated int32 column_indices = 3

uint32 file_major_version = 4

uint32 file_minor_version = 5

message DataFragment

uint64 id = 1

repeated DataFile files = 2

optional DeletionFile deletion_file = 3

oneof row_id_sequence

bytes inline_row_ids = 5

ExternalFile external_row_ids = 6

uint64 physical_rows = 4

message DeletionFile

DeletionFile.DeletionFileType file_type = 1

uint64 read_version = 2

uint64 id = 3

uint64 num_deleted_rows = 4

enum DeletionFile.DeletionFileType

ARROW_ARRAY = 0

BITMAP = 1

message EncodedU64Array

oneof array

EncodedU64Array.U16Array u16_array = 1

EncodedU64Array.U32Array u32_array = 2

EncodedU64Array.U64Array u64_array = 3

message EncodedU64Array.U16Array

uint64 base = 1

bytes offsets = 2

message EncodedU64Array.U32Array

uint64 base = 1

bytes offsets = 2

message EncodedU64Array.U64Array

bytes values = 2

message ExternalFile

string path = 1

uint64 offset = 2

uint64 size = 3

message IndexMetadata

optional UUID uuid = 1

repeated int32 fields = 2

string name = 3

uint64 dataset_version = 4

bytes fragment_bitmap = 5

optional google.protobuf.Any index_details = 6

message IndexSection

repeated IndexMetadata indices = 1

message InvertedIndexDetails

message LabelListIndexDetails

message Manifest

repeated file.Field fields = 1

repeated DataFragment fragments = 2

uint64 version = 3

uint64 version_aux_data = 4

map<string, bytes> metadata = 5

optional Manifest.WriterVersion writer_version = 13

optional uint64 index_section = 6

optional google.protobuf.Timestamp timestamp = 7

string tag = 8

uint64 reader_feature_flags = 9

uint64 writer_feature_flags = 10

uint32 max_fragment_id = 11

string transaction_file = 12

uint64 next_row_id = 14

optional Manifest.DataStorageFormat data_format = 15

map<string, string> config = 16

uint64 blob_dataset_version = 17

message Manifest.DataStorageFormat

string file_format = 1

string version = 2

message Manifest.WriterVersion

string library = 1

string version = 2

message NGramIndexDetails

message RowIdSequence

repeated U64Segment segments = 1