Currently these are all empty messages because all needed details are either hard-coded (e.g. filenames) or stored in the index itself. However, we may want to add more details in the future, in particular details that may be useful for planning queries (e.g. so that we are not forced to load the index until we know we need it).
(message has no fields)
(message has no fields)
Lance Data File
Used in:
Relative path to the root.
The ids of the fields/columns in this file.

-1 is used for "unassigned" while in memory. It is not meant to be written to disk. -2 is used for "tombstoned", meaning a field that is no longer in use. This is often because the original field id was reassigned to a different data file.

In Lance v1 IDs are assigned based on position in the file, offset by the max existing field id in the table (if any already). So when a fragment is first created with one file of N columns, the field ids will be 1, 2, ..., N. If a second fragment is created with M columns, the field ids will be N+1, N+2, ..., N+M.

In Lance v1 there is one field for each field in the input schema; this includes nested fields (both struct and list). Fixed size list fields have only a single field id (these are not considered nested fields in Lance v1). This allows column indices to be calculated from field IDs and the input schema.

In Lance v2 the field IDs generally follow the same pattern but there is no way to calculate the column index from the field ID. This is because a given field could be encoded in many different ways, some of which occupy a different number of columns. For example, a struct field could be encoded into N + 1 columns or it could be encoded into a single packed column. To determine column indices the column_indices property should be used instead.

In Lance v1 these ids must be sorted but might not always be contiguous.
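The Lance v1 assignment rule can be sketched in a few lines. This is only an illustration of the arithmetic described above: the function name `assign_v1_field_ids` and the constants are hypothetical, and an empty table is assumed to have a max existing field id of 0 so that the first file gets ids 1 through N.

```python
UNASSIGNED = -1   # in-memory only, not meant to be written to disk
TOMBSTONED = -2   # field no longer in use


def assign_v1_field_ids(num_columns, max_existing_id=0):
    """Return the field ids for a new data file with `num_columns` columns.

    Ids are assigned by position, offset by the largest id already in the
    table (assumed to be 0 when the table has no fields yet).
    """
    first_id = max_existing_id + 1
    return list(range(first_id, first_id + num_columns))


# First fragment: one file with N = 3 columns.
first = assign_v1_field_ids(3)
# Second fragment: M = 2 columns, offset by the current max id.
second = assign_v1_field_ids(2, max_existing_id=max(first))
```

Here `first` covers ids 1 through N and `second` continues at N+1, matching the pattern in the description.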
The top-level column indices for each field in the file.

If the data file is version 1 then this property will be empty. Otherwise there must be one entry for each field in `fields`.

Some fields may not correspond to a top-level column in the file. In these cases the index will be -1.

For example, consider the schema:

- dimension: packed-struct (0):
  - x: u32 (1)
  - y: u32 (2)
- path: list<u32> (3)
- embedding: fsl<768> (4)
  - fp64
- borders: fsl<4> (5)
  - simple-struct (6)
    - margin: fp64 (7)
    - padding: fp64 (8)

One possible column indices array could be: [0, -1, -1, 1, 3, 4, 5, 6, 7]

This reflects quite a few phenomena:
- The packed struct is encoded into a single column and there is no top-level column for the x or y fields
- The variable sized list is encoded into two columns
- The embedding is encoded into a single column (common for FSL of primitive) and there is no "FSL column"
- The borders field actually does have an "FSL column"

The column indices table may not have duplicates (other than -1).
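As a rough illustration of how a reader might consume this property, the sketch below validates a column indices array against the rules stated above (one entry per field, no duplicates other than -1) and looks up the top-level column for a field id. The helper names are hypothetical and are not part of any Lance API.

```python
def validate_column_indices(fields, column_indices):
    """Check the invariants stated for `column_indices` (non-v1 files)."""
    if len(fields) != len(column_indices):
        raise ValueError("expected one column index per field")
    seen = set()
    for idx in column_indices:
        if idx == -1:
            continue  # field has no top-level column
        if idx < 0:
            raise ValueError(f"invalid column index {idx}")
        if idx in seen:
            raise ValueError(f"duplicate column index {idx}")
        seen.add(idx)


def column_for_field(fields, column_indices, field_id):
    """Return the top-level column index for `field_id`, or None if it has none."""
    idx = column_indices[fields.index(field_id)]
    return None if idx == -1 else idx


# The example schema: field ids 0..8 and one possible indices array.
fields = [0, 1, 2, 3, 4, 5, 6, 7, 8]
column_indices = [0, -1, -1, 1, 3, 4, 5, 6, 7]
validate_column_indices(fields, column_indices)

path_column = column_for_field(fields, column_indices, 3)       # the list field
x_column = column_for_field(fields, column_indices, 1)          # packed away
embedding_column = column_for_field(fields, column_indices, 4)
```

Column 2 never appears in the array because the variable sized list occupies two columns (1 and 2) while only its first column is recorded.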
The major file version used to create the file
The minor file version used to create the file. If both `file_major_version` and `file_minor_version` are set to 0, then this is a version 0.1 or version 0.2 file.
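A reader could interpret the two version fields roughly as follows. This is a minimal sketch; the function name and the returned strings are illustrative assumptions.

```python
def describe_file_version(file_major_version, file_minor_version):
    """Map the (major, minor) pair to a human-readable description.

    A pair of zeros marks a legacy file (version 0.1 or 0.2); the two cannot
    be told apart from these fields alone.
    """
    if file_major_version == 0 and file_minor_version == 0:
        return "legacy (0.1 or 0.2)"
    return f"{file_major_version}.{file_minor_version}"
```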
Data fragment. A fragment is a set of files which represent the different columns of the same rows. If a column exists in the schema, but the related file does not exist, treat this column as nulls.
Used in:
Unique ID of each DataFragment
File that indicates which rows, if any, should be considered deleted.
A serialized RowIdSequence message (see rowids.proto). These are the row ids for the fragment, in order of the rows as they appear. That is, if a fragment has 3 rows, and the row ids are [1, 42, 3], then the first row is row 1, the second row is row 42, and the third row is row 3.
If small (< 200KB), the row ids are stored inline.
Otherwise, stored as part of a file.
Number of original rows in the fragment, this includes rows that are now marked with deletion tombstones. To compute the current number of rows, subtract `deletion_file.num_deleted_rows` from this value.
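The subtraction described above, as a small sketch. The function name is hypothetical, and treating a fragment without a deletion file as having zero deleted rows is an assumption.

```python
def live_row_count(physical_rows, num_deleted_rows=0):
    """Current number of rows in a fragment.

    `num_deleted_rows` is assumed to default to 0 when the fragment has
    no deletion file.
    """
    if not 0 <= num_deleted_rows <= physical_rows:
        raise ValueError("num_deleted_rows must be between 0 and physical_rows")
    return physical_rows - num_deleted_rows
```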
Deletion File. The path of the deletion file is constructed as: {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}, where {extension} is `arrow` or `bin` depending on the type of deletion.
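The path template can be sketched as below. The enum and function names are hypothetical; only the template itself and the two extensions come from the description.

```python
from enum import Enum


class DeletionFileType(Enum):
    # The enum values double as the file extensions.
    ARROW_ARRAY = "arrow"
    BIT_MAP = "bin"


def deletion_file_path(root, fragment_id, read_version, file_id, file_type):
    """Build {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}."""
    return (
        f"{root}/_deletions/"
        f"{fragment_id}-{read_version}-{file_id}.{file_type.value}"
    )


example = deletion_file_path("my_table", 7, 3, 42, DeletionFileType.ARROW_ARRAY)
```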
Used in:
Type of deletion file. If it is unspecified, then the remaining fields will be missing.
The version of the dataset this deletion file was built from.
An opaque id used to differentiate this file from others written by concurrent writers.
The number of rows that are marked as deleted.
Type of deletion file, which varies depending on what is the most efficient way to store the deleted row offsets. If there are no deletions, the type will be unspecified. If there are sparsely deleted rows, then ARROW_ARRAY is the most efficient. If there are densely deleted rows, then BIT_MAP is the most efficient.
Used in:
Deletion file is a single Int32Array of deleted row offsets. This is stored as an Arrow IPC file with one batch and one column. Has a .arrow extension.
Deletion file is a Roaring Bitmap of deleted row offsets. Has a .bin extension.
A basic bitpacked array of u64 values.
Used in:
Used in:
The deltas are stored as 16-bit unsigned integers (protobuf doesn't support 16-bit integers, so we use bytes instead).
Used in:
The deltas are stored as 32-bit unsigned integers (we use bytes instead of uint32 to avoid overhead of varint encoding).
Used in:
(We use bytes instead of uint64 to avoid overhead of varint encoding)
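To illustrate why a `bytes` field can hold fixed-width integers, the sketch below packs a list of deltas into a byte string at 2, 4, or 8 bytes per value and unpacks it again. The little-endian byte order and the helper names are assumptions made for this example; the description does not specify the on-disk layout.

```python
import struct

# struct format characters for each fixed width (standard sizes with "<").
_FORMATS = {2: "H", 4: "I", 8: "Q"}


def pack_deltas(deltas, width_bytes):
    """Pack unsigned integers into a bytes value, `width_bytes` per entry.

    Byte order is ASSUMED little-endian for this illustration.
    """
    fmt = _FORMATS[width_bytes]
    return struct.pack(f"<{len(deltas)}{fmt}", *deltas)


def unpack_deltas(data, width_bytes):
    """Inverse of pack_deltas."""
    fmt = _FORMATS[width_bytes]
    count = len(data) // width_bytes
    return list(struct.unpack(f"<{count}{fmt}", data))


packed = pack_deltas([1, 2, 300], width_bytes=2)
restored = unpack_deltas(packed, width_bytes=2)
```

Every value costs exactly `width_bytes` regardless of its magnitude, whereas a varint-encoded integer field uses a variable number of bytes per value.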
Used in:
Path to the file, relative to the root of the table.
The offset in the file where the data starts.
The size of the data in the file.
Metadata describing the index.
Used in:
Unique ID of an index. It is unique across all the dataset versions.
The columns to build the index.
Index name. Must be unique within one dataset version.
The version of the dataset this index was built from.
A bitmap of the included fragment ids. This may be used to determine how much of the dataset is covered by the index. This information can be retrieved from the dataset by looking at the dataset at `dataset_version`. However, since the old version may be deleted while the index is still in use, this information is also stored in the index. The bitmap is stored as a 32-bit Roaring bitmap.
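A coverage check along these lines could look like the sketch below. A plain Python `set` stands in for the Roaring bitmap so the example has no dependencies; the function names and the choice to report full coverage for an empty dataset are assumptions.

```python
def index_coverage(indexed_fragment_ids, dataset_fragment_ids):
    """Fraction of the dataset's current fragments covered by the index."""
    if not dataset_fragment_ids:
        return 1.0  # assumption: an empty dataset counts as fully covered
    covered = indexed_fragment_ids & dataset_fragment_ids
    return len(covered) / len(dataset_fragment_ids)


def unindexed_fragments(indexed_fragment_ids, dataset_fragment_ids):
    """Fragments that a query would need to scan without the index."""
    return dataset_fragment_ids - indexed_fragment_ids


coverage = index_coverage({0, 1, 2}, {0, 1, 2, 3})
missing = unindexed_fragments({0, 1, 2}, {0, 1, 2, 3})
```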
Details, specific to the index type, which are needed to load / interpret the index. Indices should avoid putting large amounts of information in this field, as it will bloat the manifest.
Index Section, containing a list of index metadata for one dataset version.
(message has no fields)
(message has no fields)
Manifest is a global section shared between all the files.
All fields of the dataset, including the nested fields.
Fragments of the dataset.
Snapshot version number.
The file position of the version auxiliary data.
* It is not inheritable between versions.
* It is not loaded by default during query.
Schema metadata.
The version of the writer that created this file. This information may be used to detect whether the file may have known bugs associated with that writer.
If present, the file position of the index metadata.
Version creation timestamp, in the UTC timezone.
Optional version tag
Feature flags for readers.

A bitmap of flags that indicate which features are required to be able to read the table. If a reader does not recognize a flag that is set, it should not attempt to read the dataset.

Known flags:
* 1: deletion files are present
* 2: move_stable_row_ids: row IDs are tracked and stable after move operations (such as compaction), but not updates.
* 4: use v2 format (deprecated)
* 8: table config is present
Feature flags for writers. A bitmap of flags that indicate which features are required to be able to write to the dataset. If a writer does not recognize a flag that is set, it should not attempt to write to the dataset. The flags are the same as for reader_feature_flags, although they will not always apply to both.
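The "refuse on unknown flag" rule for reader and writer feature flags can be sketched with a bit mask. The constant and function names are hypothetical; the flag values are the known flags listed for readers.

```python
FLAG_DELETION_FILES = 1
FLAG_MOVE_STABLE_ROW_IDS = 2
FLAG_USE_V2_FORMAT = 4  # deprecated
FLAG_TABLE_CONFIG = 8


def unsupported_flags(feature_flags, supported_mask):
    """Return the bits set in `feature_flags` that this client does not know."""
    return feature_flags & ~supported_mask


def can_proceed(feature_flags, supported_mask):
    """A client should only read (or write) when no unknown flag is set."""
    return unsupported_flags(feature_flags, supported_mask) == 0


# A hypothetical client that understands every flag listed above.
SUPPORTED = (
    FLAG_DELETION_FILES
    | FLAG_MOVE_STABLE_ROW_IDS
    | FLAG_USE_V2_FORMAT
    | FLAG_TABLE_CONFIG
)
```

The same check applies to `writer_feature_flags` before attempting a write, using the set of flags the writer supports.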
The highest fragment ID that has been used so far. This ID is not guaranteed to be present in the current version, but it may have been used in previous versions. For a single file, will be zero.
Path to the transaction file, relative to `{root}/_transactions`. This contains a serialized Transaction message representing the transaction that created this version. May be empty if no transaction file was written. The path format is "{read_version}-{uuid}.txn" where {read_version} is the version of the table the transaction read from, and {uuid} is a hyphen-separated UUID.
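The file name format can be built and parsed as in this sketch. The helper names are hypothetical; Python's `uuid.UUID` is used because its string form is the hyphen-separated representation.

```python
import uuid

_SUFFIX = ".txn"


def transaction_file_name(read_version, txn_uuid):
    """Build "{read_version}-{uuid}.txn"."""
    return f"{read_version}-{txn_uuid}{_SUFFIX}"


def parse_transaction_file_name(name):
    """Split a transaction file name into (read_version, uuid)."""
    if not name.endswith(_SUFFIX):
        raise ValueError(f"not a transaction file: {name!r}")
    stem = name[: -len(_SUFFIX)]
    # The UUID itself contains hyphens, so split only on the first one.
    version_str, _, uuid_str = stem.partition("-")
    return int(version_str), uuid.UUID(uuid_str)


example_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")
name = transaction_file_name(5, example_uuid)
```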
The next unused row id. If zero, then the table does not have any rows. This is only used if the "move_stable_row_ids" feature flag is set.
The data storage format This specifies what format is used to store the data files.
Table config. Keys with the prefix "lance." are reserved for the Lance library. Other libraries may wish to similarly prefix their configuration keys appropriately.
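One way an application might respect the reserved prefix when merging its own settings into the table config is sketched below. Rejecting reserved keys from non-Lance callers is a policy assumed for this example, not something the format mandates, and the function names are hypothetical.

```python
RESERVED_PREFIX = "lance."


def reserved_keys(updates):
    """Return the keys in `updates` that fall under the reserved prefix."""
    return sorted(k for k in updates if k.startswith(RESERVED_PREFIX))


def merge_user_config(existing, updates):
    """Merge application-level key-value pairs into an existing config.

    ASSUMED policy: callers other than the Lance library may not set
    keys starting with "lance.".
    """
    bad = reserved_keys(updates)
    if bad:
        raise ValueError(f"reserved config keys: {bad}")
    merged = dict(existing)
    merged.update(updates)
    return merged


merged = merge_user_config({"lance.example": "1"}, {"myapp.owner": "data-team"})
```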
The version of the blob dataset associated with this table. Changes to blob fields will modify the blob dataset and update this version in the parent table. If this value is 0 then there are no blob fields.
Used in:
The format of the data files (e.g. "lance")
The max format version of the data files. This is the maximum version of the file format that the dataset will create. This may be lower than the maximum version that can be written in order to allow older readers to read the dataset.
Used in:
The name of the library that created this file.
The version of the library that created this file. Because we cannot assume that the library is semantically versioned, this is a string. However, if it is semantically versioned, it should be a valid semver string without any 'v' prefix. For example: `2.0.0`, `2.0.0-rc.1`.
(message has no fields)
A sequence of row IDs. This is split up into one or more segments, each of which can be encoded in different ways. The encodings are optimized for values that are sorted, which will often be the case with row ids. They also have optimized forms depending on how sparse the values are.
A transaction represents the changes to a dataset. This has two purposes:
1. When retrying a commit, the transaction can be used to re-build an updated manifest.
2. When there's a conflict, this can be used to determine whether the other transaction is compatible with this one.
The version of the dataset this transaction was built from. For example, for a delete transaction this means the version of the dataset that was read from while evaluating the deletion predicate.
The UUID that uniquely identifies a transaction.
Optional version tag.
The operation of this transaction.
An operation to apply to the blob dataset
Add new rows to the dataset.
Used in:
The new fragments to append. Fragment IDs are not yet assigned.
Add or replace a new secondary index. This is also used to remove an index (we are replacing it with nothing).
- new_indices: the modified indices, empty if dropping indices only
- removed_indices: the indices that are being replaced
Used in:
An operation that replaces the data in a region of the table with new data.
Used in:
Used in:
Mark rows as deleted.
Used in:
The fragments to update. The fragment IDs will match existing fragments in the dataset.
The fragments to delete entirely.
The predicate that was evaluated. This may be used to determine whether the delete would have affected files written by a concurrent transaction.
An operation that merges in a new column, altering the schema.
Used in:
The updated fragments. These should all have existing fragment IDs.
The new schema
Schema metadata.
Create or overwrite the entire dataset.
Used in:
The new fragments. Fragment IDs are not yet assigned.
The new schema
Schema metadata.
Key-value pairs to merge with existing config.
An operation that projects a subset of columns, altering the schema.
Used in:
The new schema
An operation that reserves fragment ids for future use in a rewrite operation.
Used in:
An operation that restores a dataset to a previous version.
Used in:
The version to restore to
An operation that rewrites but does not change the data in the table. These kinds of operations just rearrange data.
Used in:
The old fragments that are being replaced. DEPRECATED: use groups instead. These should all have existing fragment IDs.
The new fragments. DEPRECATED: use groups instead. These fragment IDs are not yet assigned.
Groups of files that have been rewritten
Indices that have been rewritten
A group of rewrite files that are all part of the same rewrite.
Used in:
The old fragment that is being replaced. This should have an existing fragment ID.
The new fragment. The ID should have been reserved by an earlier reserve operation.
During a rewrite an index may be rewritten. We only serialize the UUID since a rewrite should not change the other index parameters.
Used in:
The id of the index that will be replaced
The id of the new index
An operation that updates rows but does not add or remove rows.
Used in:
The fragments that have been removed. These are fragments where all rows have been updated and moved to a new fragment.
The fragments that have been updated.
The new fragments where updated rows have been moved to.
An operation that updates the table config.
Used in:
Used in:
Different ways to encode a sequence of u64 values.
Used in:
When the values are sorted and contiguous.
When the values are sorted but have a few gaps.
When the values are sorted but have many gaps.
When the values are sorted but are sparse.
A general array of values, which is not sorted.
A range of u64 values.
Used in:
The start of the range, inclusive.
The end of the range, exclusive.
A range of u64 values with a bitmap.
Used in:
The start of the range, inclusive.
The end of the range, exclusive.
A bitmap of the values in the range. The bitmap is a sequence of bytes, where each byte represents 8 values. The first byte represents values start to start + 7, the second byte represents values start + 8 to start + 15, and so on. The most significant bit of each byte represents the first value in the range, and the least significant bit represents the last value in the range. If the bit is set, the value is in the range; if it is not set, the value is not in the range.
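The bit layout described for the bitmap (most significant bit first within each byte) can be sketched as follows. The function names are hypothetical; the bit ordering follows the field description.

```python
def encode_range_bitmap(start, end, values):
    """Build the bitmap for `values`, all of which must lie in [start, end)."""
    bitmap = bytearray((end - start + 7) // 8)
    for value in values:
        if not start <= value < end:
            raise ValueError(f"{value} is outside [{start}, {end})")
        offset = value - start
        # The most significant bit of each byte is the first value it covers.
        bitmap[offset // 8] |= 1 << (7 - offset % 8)
    return bytes(bitmap)


def decode_range_bitmap(start, end, bitmap):
    """Return the values in [start, end) whose bit is set."""
    values = []
    for offset in range(end - start):
        byte = bitmap[offset // 8]
        if (byte >> (7 - offset % 8)) & 1:
            values.append(start + offset)
    return values


# Range [10, 18) containing 10, 12 and 17: bits 0, 2 and 7 of the first byte.
encoded = encode_range_bitmap(10, 18, [10, 12, 17])
decoded = decode_range_bitmap(10, 18, encoded)
```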
A range of u64 values with holes.
Used in:
The start of the range, inclusive.
The end of the range, exclusive.
The holes in the range, as a sorted array of values. Binary search can be used to check whether a value is a hole and should be skipped. This can also be used to count the number of holes before a given value, if you need to find the logical offset of a value in the segment.
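Both uses of binary search mentioned for the holes array can be sketched with Python's `bisect` module. The function names are hypothetical, and the holes are assumed to all lie within [start, end).

```python
from bisect import bisect_left


def is_hole(holes, value):
    """Binary search the sorted `holes` array for `value`."""
    i = bisect_left(holes, value)
    return i < len(holes) and holes[i] == value


def logical_offset(start, end, holes, value):
    """Position of `value` within the segment, or None if it is not present.

    The offset is the distance from `start` minus the number of holes that
    come before `value`.
    """
    if not start <= value < end or is_hole(holes, value):
        return None
    holes_before = bisect_left(holes, value)
    return (value - start) - holes_before


# Range [0, 10) with holes 3 and 7 holds the values 0, 1, 2, 4, 5, 6, 8, 9.
holes = [3, 7]
offset_of_8 = logical_offset(0, 10, holes, 8)
```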
UUID type. Encoded as 16 bytes.
Used in:
(message has no fields)
Auxiliary Data attached to a version. Only load on-demand.
Key-value metadata.