Proto commits in lancedb/lance

These are the commits in which Protocol Buffers files changed (only the last 100 relevant commits are shown):

Commit:1f7b270
Author:Will Jones
Committer:GitHub

fix: revert IO optimizations for now until we can test more (#3763)

The documentation is generated from this commit.

Commit:5d288c3
Author:Will Jones
Committer:GitHub

feat: store data file size in manifest (#3750) Part of #3751. * When we write data files, we store the size in bytes in the manifest. * We use this size to skip the `HEAD` request needed to find the file footer byte range. * When writing a dataset, we can read any missing sizes and fill them in.
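
To make the optimization above concrete, here is a minimal sketch (with a hypothetical fixed footer length and helper names, not the actual Lance reader code) of how a size recorded in the manifest lets a reader compute the footer byte range directly instead of issuing a `HEAD` request.

```rust
/// Hypothetical sketch: choosing the footer read range for a data file.
/// If the manifest already records the file size, the extra size lookup
/// (a HEAD request on object stores) can be skipped entirely.
const FOOTER_LEN: u64 = 40; // assumed fixed-size footer, for illustration only

fn footer_range(known_size: Option<u64>, fetch_size: impl Fn() -> u64) -> (u64, u64) {
    // Use the size stored in the manifest when present; otherwise fall
    // back to asking the storage layer (the expensive path).
    let size = known_size.unwrap_or_else(fetch_size);
    (size.saturating_sub(FOOTER_LEN), size)
}

fn main() {
    // Size known from the manifest: no storage round trip needed.
    assert_eq!(footer_range(Some(1_000), || panic!("no HEAD needed")), (960, 1_000));
    // Size missing (older manifest): fall back to a simulated HEAD request.
    assert_eq!(footer_range(None, || 500), (460, 500));
}
```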

Commit:fd8cc15
Author:Weston Pace
Committer:GitHub

feat: various 2.1 fixes and performance improvements (#3488) * Some bitpacking routines were rewritten from macros to generic functions * Fix a nested list bug that could happen when an inner list is null and the containing outer list is valid * Start cleaning up the encodings protos for 2.1 * Change the default compression to lz4 * Change miniblock from 1 data buffer to N data buffers * Change the FSL encoding so that primitive FSLs do not count in the rep/def levels.

Commit:ff2ab10
Author:BubbleCal
Committer:GitHub

feat: support retrain index and incremental kmeans (#3489) Today we provide 2 ways for users to maintain the vector index: - `create_index`: create a new index on the entire dataset - `optimize`: incrementally index the unindexed rows. Users are encouraged to call optimize for shorter indexing time, but the index may become less accurate as more rows are inserted. This PR introduces: - record the loss of each delta index - add a `retrain` flag to `OptimizeOptions`, which retrains the whole index if set - add `loss` to the vector index stats, so that users can decide whether to retrain the index - support training KMeans from existing centroids to significantly improve indexing performance. After this, users no longer need to call `create_index` to replace the existing index; `optimize` detects the average loss and retrains the index in a more efficient way. --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
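
As a rough illustration of the loss-based retrain idea described above, the following sketch (hypothetical function and thresholds, not the actual `OptimizeOptions` logic) shows how recorded delta-index losses could drive a retrain decision.

```rust
/// Hypothetical sketch: each delta index records a training loss, and an
/// optimize pass could trigger a retrain when the average delta loss drifts
/// too far above a baseline from the last full training run.
fn should_retrain(delta_losses: &[f64], baseline_loss: f64, tolerance: f64) -> bool {
    if delta_losses.is_empty() {
        return false;
    }
    let avg = delta_losses.iter().sum::<f64>() / delta_losses.len() as f64;
    avg > baseline_loss * (1.0 + tolerance)
}

fn main() {
    // Average delta loss is 50% above the baseline, beyond the 20% tolerance.
    assert!(should_retrain(&[1.4, 1.6], 1.0, 0.2));
    // Still within tolerance: keep incrementally indexing.
    assert!(!should_retrain(&[1.0, 1.1], 1.0, 0.2));
}
```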

Commit:59d6596
Author:Weston Pace
Committer:GitHub

feat: add support for ngram indices (#3468) Ngram indices are indices that can speed up various string filters. To start with they will be able to speed up `contains(col, 'substr')` filters. They work by creating a bitmap for each ngram (short sequence of characters) in a value. For example, consider an index of 1-grams. This would create a bitmap for each letter of the alphabet. Then, at query time, we can use this to narrow down which strings could potentially satisfy the query. This is the first scalar index that requires a "recheck" step. It doesn't tell us exactly which rows satisfy the query. It only narrows down the list. Other indices that might behave like this are bloom filters and zone maps. This means that we need to still apply the filter on the results of the index search. A good portion of this PR is adding support for this concept into the scanner. --------- Co-authored-by: Will Jones <willjones127@gmail.com>
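
The following sketch illustrates the general ngram-index idea described above, using plain hash sets in place of real bitmaps and a hypothetical trigram helper; it is not the Lance implementation, but it shows the narrow-then-recheck flow for a `contains` filter.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch of an ngram index: a posting set per trigram is used
/// to narrow candidates for `contains(col, 'substr')`, followed by a recheck
/// of the actual values (the index alone is not exact).
fn trigrams(s: &str) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

fn main() {
    let rows = vec!["lance format", "dancing", "binary data"];

    // Build: map each trigram to the set of row ids whose value contains it.
    let mut index: HashMap<String, HashSet<usize>> = HashMap::new();
    for (row_id, value) in rows.iter().enumerate() {
        for gram in trigrams(value) {
            index.entry(gram).or_default().insert(row_id);
        }
    }

    // Query `contains(col, "ance")`: intersect posting sets for the query's trigrams...
    let query = "ance";
    let mut candidates: Option<HashSet<usize>> = None;
    for gram in trigrams(query) {
        let posting = index.get(&gram).cloned().unwrap_or_default();
        candidates = Some(match candidates {
            None => posting,
            Some(current) => current.intersection(&posting).copied().collect(),
        });
    }

    // ...then recheck the actual values, because the index only narrows the list.
    let matches: Vec<usize> = candidates
        .unwrap_or_default()
        .into_iter()
        .filter(|&row_id| rows[row_id].contains(query))
        .collect();
    println!("rows matching contains({query:?}): {matches:?}");
}
```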

Commit:42722fb
Author:Rob Meng
Committer:GitHub

feat: allow replacement of entire datafile when the schema lines up correctly (#3408) For the internal design doc, see Notion or ping me. What this PR implements is illustrated in a diagram in the PR description. Plan of attack: * This PR: basic functionality, i.e. when there is no conflict, calling this tx should just work * Next PR: implement more fine-grained conflict resolution * Potential future PR (when time permits): allow partial replacement of a datafile. This can be done by "dropping" column indices in a datafile, thereby dropping the column in favor of another TODO: - [x] proto definition of the new transaction - [x] simple rust tests - [x] test error handling - [x] PR desc - [x] python tests - [x] implement conflict detection

Commit:5a92d31
Author:Weston Pace
Committer:GitHub

feat: finish up variable-length encodings in the full-zip path (#3344) This adds the last structural path for 2.1, full zip encoding of variable length data. Scheduling this turned out to be a little trickier than I had planned. There is no easy way to know where to slice the fully-zipped buffer when doing decoding. Currently we settle this problem by unzipping in the indirect scheduling task. There are some alternative possibilities that I have documented but for now I think this will be good enough and we can iterate on this going forwards.

Commit:8b8b8c8
Author:Weston Pace
Committer:GitHub

feat: add drop_index (#3382) Helpful for when an index is created by mistake or no longer needed.

Commit:b1ab748
Author:Weston Pace
Committer:GitHub

feat: add replace_schema_metadata and replace_field_metadata (#3263)

Commit:64fcfcc
Author:Weston Pace
Committer:GitHub

feat: adds list decode support for mini-block encoded data (#3241) Lists are encoded using rep/def levels and a repetition index. At decode time we take all this information to be able to fetch individual ranges of lists.

Commit:c4cb87a
Author:broccoliSpicy
Committer:GitHub

feat: packed struct encoding (#3186) This PR tries to add packed struct encoding. During encoding, it packs a struct with fixed-width fields, producing a row-oriented `FixedWidthDataBlock`, then uses `ValueCompressor` to compress it into a `MiniBlock` layout. During decoding, it first uses `ValueDecompressor` to get the row-oriented `FixedWidthDataBlock`, then constructs a `StructDataBlock` for output. #3173 #2601
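
A minimal sketch of the row-oriented packing idea (hypothetical field layout, not the actual `FixedWidthDataBlock` code): fixed-width fields are laid out contiguously per row, which is what makes single-row random access cheap.

```rust
/// Hypothetical sketch: pack a struct with two fixed-width fields into a
/// single row-oriented buffer so that one row can be fetched with a single
/// contiguous read.
fn pack_rows(a: &[i32], b: &[f64]) -> Vec<u8> {
    let mut packed = Vec::with_capacity(a.len() * (4 + 8));
    for (x, y) in a.iter().zip(b) {
        packed.extend_from_slice(&x.to_le_bytes()); // field `a`: 4 bytes
        packed.extend_from_slice(&y.to_le_bytes()); // field `b`: 8 bytes
    }
    packed
}

fn main() {
    let packed = pack_rows(&[1, 2], &[0.5, 1.5]);
    // Row width is 12 bytes, so random access to row i is packed[i * 12..(i + 1) * 12].
    assert_eq!(packed.len(), 24);
    let row1 = &packed[12..24];
    assert_eq!(i32::from_le_bytes(row1[0..4].try_into().unwrap()), 2);
}
```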

Commit:10c31b3
Author:Weston Pace
Committer:GitHub

feat: add the repetition index to the miniblock write path (#3208) The repetition index is what will give us random access support when we have list data. At a high level it stores the number of top-level rows in each mini-block chunk. We can use this later to figure out which chunks we need to read. In reality things are a little more complicated because we don't mandate that each chunk starts with a brand new row (e.g. a row can span multiple mini-block chunks). This is useful because we eventually want to support arbitrarily deep nested access. If we create not-so-mini blocks in the presence of large lists then we introduce read amplification we'd like to avoid.
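
A toy sketch of how a repetition index of per-chunk top-level row counts can be used to select chunks for a row range; it ignores the rows-spanning-chunks complication noted above, and the names are hypothetical.

```rust
/// Hypothetical sketch: given the number of top-level rows stored in each
/// mini-block chunk, find which chunks overlap a requested row range.
fn chunks_for_rows(rows_per_chunk: &[u64], start: u64, end: u64) -> Vec<usize> {
    let mut first_row = 0u64;
    let mut hits = Vec::new();
    for (chunk, &n) in rows_per_chunk.iter().enumerate() {
        let last_row = first_row + n;
        // Keep the chunk if its row interval overlaps [start, end).
        if first_row < end && last_row > start {
            hits.push(chunk);
        }
        first_row = last_row;
    }
    hits
}

fn main() {
    // Chunks holding 100, 80, and 120 top-level rows respectively.
    let rep_index = [100, 80, 120];
    // Rows 90..150 touch chunk 0 (rows 0..100) and chunk 1 (rows 100..180).
    assert_eq!(chunks_for_rows(&rep_index, 90, 150), vec![0, 1]);
}
```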

Commit:f21397d
Author:Weston Pace
Committer:GitHub

feat: enhance repdef utilities to handle empty / null lists (#3200) Empty & null lists are interesting. If you have them then your final repetition & definition buffers will have more items than you have in your flattened array. This fact required a considerable reworking of how we build and unravel rep/def buffers. When building, we record the position of the specials and then, when we serialize into rep/def buffers, we insert these special values. When unraveling, we need to deal with the fact that certain rep/def values are "invisible" to the current context in which we are unraveling. In addition, we now need to start keeping track of the structure of each layer of repetition in the page metadata. This helps us understand the meaning behind different definition levels later when we are unraveling. This PR adds the changes to the rep/def utilities. We still aren't actually using repetition levels at all yet. That will come in future PRs.

Commit:e32f393
Author:broccoliSpicy
Committer:GitHub

feat: add dictionary encoding (#3134) This PR tries to support dictionary encoding by integrating it with the `MiniBlock PageLayout`. The general approach: in a `MiniBlock PageLayout`, there is an optional `dictionary` field that stores a dictionary encoding if this miniblock has a dictionary.

```
/// A layout used for pages where the data is small
///
/// In this case we can fit many values into a single disk sector and transposing buffers is
/// expensive. As a result, we do not transpose the buffers but compress the data into small
/// chunks (called mini blocks) which are roughly the size of a disk sector.
message MiniBlockLayout {
  // Description of the compression of repetition levels (e.g. how many bits per rep)
  ArrayEncoding rep_compression = 1;
  // Description of the compression of definition levels (e.g. how many bits per def)
  ArrayEncoding def_compression = 2;
  // Description of the compression of values
  ArrayEncoding value_compression = 3;
  ArrayEncoding dictionary = 4;
}
```

The rationale is that if we dictionary-encode something, its indices will definitely fit into a `MiniBlockLayout`. By doing this, we don't need a specific `DictionaryEncoding`; it can be any `ArrayEncoding`. The `Dictionary` and the `indices` are cascaded into another encoding automatically. #3123

Commit:a212395
Author:Weston Pace
Committer:GitHub

feat: start recording index details in the manifest, cache index type lookup (#3131) This addresses a specific problem. When a dataset had a scalar index on a string column we would perform I/O during the planning phase on every query that contained a filter. This added considerable latency (especially against S3) to query times. We now cache that lookup. It also starts to tackle a more central problem. Right now our manifest stores very little information about indices (pretty much just the UUID). Any further information must be obtained by loading the index. This PR introduces the concept of "index details", which is a spot where an index can put index-specific (e.g. specific to btree or specific to bitmap) information that can be accessed during planning (by just looking at the manifest). At the moment this concept is still fairly bare bones but I think, as scalar indices become more sophisticated, this information can be useful. If we decide we don't want it then I can pull it out and dial this PR back to just the caching component.
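
A minimal sketch of the caching idea (hypothetical types, not the actual Lance cache): resolve the index type once per UUID and answer later planning-time lookups from memory instead of re-reading the index.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: cache the resolved index type keyed by index UUID so
/// that planning does not need to load the index (an I/O, often against S3)
/// on every filtered query.
struct IndexTypeCache {
    by_uuid: HashMap<String, String>,
}

impl IndexTypeCache {
    fn get_or_load(&mut self, uuid: &str, load: impl FnOnce() -> String) -> String {
        self.by_uuid
            .entry(uuid.to_string())
            .or_insert_with(load)
            .clone()
    }
}

fn main() {
    let mut cache = IndexTypeCache { by_uuid: HashMap::new() };
    // The first lookup pays the load cost; later lookups are answered from memory.
    let first = cache.get_or_load("idx-123", || "BTREE".to_string());
    let second = cache.get_or_load("idx-123", || unreachable!("cache hit expected"));
    assert_eq!(first, second);
}
```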

Commit:3ac6d4a
Author:broccoliSpicy
Committer:GitHub

feat: fsst compression with mini-block (#3121) This PR tries to integrate the mini-block page layout with FSST compression. During compression, it first FSST-compresses the input data and then writes out the data using `BinaryMiniBlockEncoder`. During decompression, it first uses `BinaryMiniBlockDecompressor` to decode the raw data read from disk, then applies FSST decompression.

Commit:ec76db4
Author:Weston Pace
Committer:GitHub

feat: add full zip encoding for wide data types (#3114) The encoding is only tested on tensors for now. It should encode variable-width data but, without a repetition index, we are not yet able to schedule / decode variable width data. In addition, I've created a few todos for follow-up.

Commit:3f2faf2
Author:broccoliSpicy
Committer:GitHub

feat: support miniblock with binary data (#3099) This PR enables miniblock encoding with binary data type. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:0abf7d4
Author:Weston Pace
Committer:GitHub

chore: various fixes to get the random data tests working on 2.1 (#3103)

Commit:c9f8a49
Author:Weston Pace
Committer:GitHub

feat: introduce concept of "storage class" with separate dataset for "blob" storage data (#3064) I'm open to terms other than "storage class" as well. I also think we might want to use something other than "blob" as it can be easy to confuse the concepts of "blob encoding" (e.g. low page overhead, file-like API) and "blob storage class" (fewer rows per file). In fact, the thresholds can even be different, with the blob encoding threshold being 8-16MB but the blob storage class threshold being closer to 128KB. Closes #3029

Commit:54053e6
Author:broccoliSpicy
Committer:GitHub

feat: bitpack with miniblock (#3067) This PR tries to add bit-packing encoding in the mini-block encoding path. In this PR, each chunk (1024 values) has its own bit-width parameter, which is stored with the chunk. I found that the current implementation for getting the `bit_width` of every 1024 values is very slow and hurts the write speed significantly; more investigation is needed, and I will deal with it by filing a separate issue and PR. #3052
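
As a small illustration of the per-chunk bit-width parameter mentioned above, this sketch (hypothetical helper, not the actual encoder) computes the bit width needed by a chunk's largest value.

```rust
/// Hypothetical sketch: each 1024-value chunk records its own bit width,
/// i.e. the number of bits needed to hold the chunk's largest value.
fn chunk_bit_width(chunk: &[u32]) -> u32 {
    let max = chunk.iter().copied().max().unwrap_or(0);
    // Treat an all-zero chunk as needing 1 bit so every value occupies at least one bit.
    32 - max.max(1).leading_zeros()
}

fn main() {
    assert_eq!(chunk_bit_width(&[0, 1, 2, 3]), 2); // values fit in 2 bits
    assert_eq!(chunk_bit_width(&[0, 0, 0]), 1);    // all zeros still use 1 bit
    assert_eq!(chunk_bit_width(&[1 << 20]), 21);   // 2^20 needs 21 bits
}
```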

Commit:8339b7a
Author:Yue
Committer:GitHub

feat: expose compression-level configuration for general compression (#3034) This PR aims to address https://github.com/lancedb/lance/issues/3033, and expose the compression level for general compression. Co-authored-by: broccoliSpicy <93440049+broccoliSpicy@users.noreply.github.com>

Commit:b1abfff
Author:Weston Pace
Committer:GitHub

feat: add 2.1 read path (#2968) Unlike the write path we were not able to get away with subtle changes to the existing traits. Most of the read traits needed to be duplicated. On the bright side, there is very little impact to the existing reader code though :)

Commit:27b919f
Author:broccoliSpicy
Committer:GitHub

chore: adds crate-ci/typos to check repository's spelling (#3022) This PR introduces the spelling check workflow from [typos](https://github.com/crate-ci/typos) to ensure we have correct spelling in our repository. To exempt words or files from the typo check, we can add them to `lance_repo/.typos.toml` like this:

```
[default.extend-words]
DNE = "DNE"
arange = "arange"
nd = "nd"
terrestial = "terrestial"
abd = "abd"
afe = "afe"

[files]
extend-exclude = ["notebooks/*.ipynb"]
```

Commit:f60a6ce
Author:Weston Pace
Committer:GitHub

feat: add the basic encode path for 2.1 (#3002) Adds a mini-block encoder, a "structural encoder" (a 2.1 concept) for struct and primitive, and compressor impls for value compression.

Commit:5c8f565
Author:dsgibbons
Committer:GitHub

feat: add table config (#2820) Closes #2200. #2200 references the concept of "table metadata". This PR uses the name "config" to avoid potential confusion with other uses of the word "metadata" throughout the Lance format. This PR introduces: - A new `table.proto` field called `config`. - New `Dataset` methods: `update_config` and `delete_config_keys`. - A `config` field in `Manifest` with public methods for updating and deleting. - A new transaction operation `UpdateConfig`, along with conflict logic that returns an error if an operation mutates a key that is being upserted by another operation. - A new writer feature flag called `FLAG_TABLE_CONFIG`. - Unit tests for new `Dataset` methods, concurrent config updaters, and conflict resolution logic. --------- Co-authored-by: Will Jones <willjones127@gmail.com>
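
A toy sketch of the `UpdateConfig` conflict rule described above (hypothetical function, not the actual transaction code): two concurrent operations conflict when one touches a key the other is upserting.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch: operation A's upserted keys conflict with operation B
/// if B also touches (upserts or deletes) any of those keys.
fn conflicts(upserted_by_a: &HashMap<String, String>, touched_by_b: &HashSet<String>) -> bool {
    upserted_by_a.keys().any(|key| touched_by_b.contains(key))
}

fn main() {
    let mut a = HashMap::new();
    a.insert("owner".to_string(), "team-a".to_string());
    let b: HashSet<String> = ["owner".to_string()].into_iter().collect();
    // Both operations touch the "owner" key, so one of them must fail.
    assert!(conflicts(&a, &b));
}
```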

Commit:73ab2b5
Author:Weston Pace
Committer:GitHub

feat: add a blob encoding for large binary values (#2868) Compared to the regular large binary encoding this will have considerably less metadata (since there will be fewer pages). In the future it will also be possible to read just the descriptions (allowing for different ways of reading the data, such as file objects). The scheduling algorithm is also more conservative. We don't fetch descriptions until we need them and we only read in one batch of blob data at a time (instead of an entire page). This means the reader will probably be slightly less performant (though, when reading values this large, it shouldn't be too noticeable) but will use considerably less RAM.

Commit:681db8c
Author:broccoliSpicy
Committer:GitHub

feat: support fastlanes bitpacking (#2886) This PR uses [fastlanes algorithm](https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf) for bit-pack encoding. The bit-packing routine is migrated from [SpiralDB's fastlanes implementation](https://github.com/spiraldb/fastlanes), the migrated code is modified to allow rust stable build. #2865 --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:f0fc411
Author:Weston Pace
Committer:GitHub

refactor: convert encoders to use data blocks (#2817) This is the last of the refactors needed to convert the encoders to working with data blocks.

Commit:6809358
Author:Raunak Shah
Committer:GitHub

feat: add fixed size binary encoding to lance (#2707) Addresses #2717 - Supports nullable, binary, string, large string, and large binary types that are fixed-size. Whether the type is fixed-size or not is detected automatically. - Arrays with empty strings, as well as empty arrays, use the old binary encoder - The new protobuf contains the byte width (int) and bytes (encoded using the value encoder). Offsets are created in memory at decode time (which should save an IOP since we don't encode the offsets). Initial benchmarking results (full scan, fixed-size string of length 8, 1M rows) are shown in a screenshot in the PR description. To reproduce: `pytest python/benchmarks/test_fixed_size_binary.py`.
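
A minimal sketch of the offsets-rebuilt-at-decode-time idea (hypothetical helper, not the actual decoder): with a known byte width, the offsets buffer is pure arithmetic and never needs to be stored or read.

```rust
/// Hypothetical sketch: for fixed-size binary data only the byte width and the
/// packed bytes are stored; the offsets a variable-width Arrow array expects
/// can be rebuilt in memory at decode time.
fn rebuild_offsets(byte_width: i32, num_values: usize) -> Vec<i32> {
    (0..=num_values as i32).map(|i| i * byte_width).collect()
}

fn main() {
    // Three 8-byte values => offsets 0, 8, 16, 24.
    assert_eq!(rebuild_offsets(8, 3), vec![0, 8, 16, 24]);
}
```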

Commit:70a75f3
Author:Weston Pace
Committer:GitHub

feat: add data file format / version information to manifest (#2673) Add a new "data storage format" property which allows a dataset to specify what file format (only lance) and version to use when writing data. Introduce a configurable version to the lance writer and change FSST and bitpacking to be guarded by a 2_1 version instead of env variables. Change compression to be based on field metadata instead of environment variables. Migrate some tests to use v2.

Commit:d141cc8
Author:Bert
Committer:GitHub

feat: support bitpacking for signed types (#2662) Adds support for bitpacking for the signed types Int8, Int16, Int32 and Int64. In this case, we use one extra bit in the encoding as the sign bit, and when decoding we pad with 1 instead of 0 when expanding the bitpacked buffer.
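
A small sketch of the sign-bit handling described above (hypothetical helper, not the actual decoder): when the stored sign bit is set, the high bits are filled with 1s while widening, i.e. the value is sign-extended.

```rust
/// Hypothetical sketch of decoding a sign-bit-aware bitpacked value: if the
/// stored sign bit is set, the bits above `bit_width` are padded with 1s
/// (sign extension) instead of 0s when widening back to i32.
fn unpack_signed(packed: u32, bit_width: u32) -> i32 {
    let sign_bit = 1u32 << (bit_width - 1);
    if packed & sign_bit != 0 {
        // Fill everything above `bit_width` with 1s.
        (packed | !(sign_bit | (sign_bit - 1))) as i32
    } else {
        packed as i32
    }
}

fn main() {
    // With a 4-bit width, 0b0111 is +7 and 0b1001 is -7 in two's complement.
    assert_eq!(unpack_signed(0b0111, 4), 7);
    assert_eq!(unpack_signed(0b1001, 4), -7);
}
```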

Commit:1098bb3
Author:albertlockett
Committer:albertlockett

feat: support bitpacking for signed types

Commit:e82b2a8
Author:albertlockett

encoding might work

Commit:6ebeaa0
Author:Will Jones
Committer:GitHub

feat: merge_insert update subcolumns (#2639) Closes #2610 * Supports subschemas in `merge_insert` for updates only * Inserts and deletes left as TODO * Field id `-2` is now reserved as a field "tombstone". These tombstones are fields that are no longer in the schema, usually because those fields are now in a different data file. * Fixed a bug in `Merger` where statistics were reset on each batch.

Commit:310950d
Author:Raunak Shah
Committer:GitHub

feat: add a packed struct encoding to lance (#2593) Introduces a new `PackedStruct` encoding, which should speed up random access for struct data, ref #2601 - Can currently support non-nullable, primitive fixed-length types (including fixed size list) - Implemented as a physical type array encoder - The user can select whether they want to use this encoding by specifying the field `"packed"` as `true` or `false` in the metadata. The default will use the old `StructFieldEncoder` - Python benchmarks for reading/writing a table in case of both (i) full scans and (ii) random access are added in `test_packed_struct.py`. The expectation is that this encoding will perform better for random access, and worse in the case of full scans. Benchmarking results (10M rows, 5 struct fields, retrieving 100 rows via random access) for read and write perf are shown in screenshots in the PR description. To reproduce run `pytest python/benchmarks/test_packed_struct.py -k <group>`, where `group` can be `"read"` or `"write"` --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:782350e
Author:Bert
Committer:GitHub

feat: add bitpack encoding for LanceV2 (#2333) Work in progress TODO - [x] improve tests - [ ] support signed types - [x] handle case where buffer is all 0s - [x] handle case where num compressed bits = num uncompressed bits

Commit:e5e6fd5
Author:broccoliSpicy
Committer:GitHub

feat: add FSST string compression (#2470) #2415 This is a naive implementation of [FSST](https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf). It is fairly well tested, so it can serve as a correctness reference for future optimization and a baseline for future performance engineering. This is not complete, but it is enough for one git commit; I am going through something and a bit debilitated, and I hope pushing a PR and making the progress public may motivate me more. Question 1: why not use the original C++ implementation from the paper with some Rust/C++ FFI? The original C++ implementation is not tuned for our workload and doesn't work directly with the arrow string_array interface. Question 2: how to use this FSST implementation? This is the designed interface:

```rust
pub fn fsst_compress(
    _input_buf: &[u8],
    _input_offsets_buf: &[i32],
    _output_buf: &mut Vec<u8>,
    _output_offsets_buf: &mut Vec<i32>,
) -> Result<()> {
    Ok(())
}

pub fn fsst_decompress(
    _input_buf: &[u8],
    _input_offsets_buf: &[i32],
    _output_buf: &mut Vec<u8>,
    _output_offsets_buf: &mut Vec<i32>,
) -> Result<()> {
    Ok(())
}
```

TODO: 1: more comments and documentation 2: benchmark this implementation (how can I get test data for our intended workload?) 3: push string encoding from lance logical encoding to lance physical encoding 4: lance string compression trait in lance physical encoding 5: implement FSST for the lance string compression trait 6: implement the SIMD optimization in the paper 7: maybe we could make FSST a separate rust crate 8: predicate push down to FSST decompression A few questions: 1: does lance intend to support big-endian machines? 2: what kind of string dataset is preferable for our use? --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:7ec434d
Author:Raunak Shah
Committer:GitHub

feat: add dictionary encoding to lance (#2409) Ref #2347 Adds dictionary encoding for strings to Lance - It is implemented as physical (primitive) array encoder/decoder - Currently we require a maximum of 100 unique strings in a page to use dictionary encoding on that page. - The dictionary is decoded during scheduling itself, and is shared among decode tasks. Indices are decoded later on in the decode step. - Includes code for rust bench/flamegraph to identify any bottlenecks in performance (more details in comments)
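
As a rough illustration of the 100-unique-strings threshold mentioned above, a page-level heuristic could look like this sketch (hypothetical function, not the actual encoder selection logic).

```rust
use std::collections::HashSet;

/// Hypothetical sketch: count distinct strings in a page and choose
/// dictionary encoding only when the page has at most a small number of
/// unique values (100 in the commit message above).
fn should_dictionary_encode(page: &[&str], max_unique: usize) -> bool {
    let unique: HashSet<&str> = page.iter().copied().collect();
    unique.len() <= max_unique
}

fn main() {
    let page = ["red", "green", "red", "blue", "green"];
    assert!(should_dictionary_encode(&page, 100)); // only 3 unique values
}
```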

Commit:4a114b2
Author:Weston Pace
Committer:GitHub

feat: enhance binary array encoding, make it the default (#2521) This adds support for the following things to the binary encoding: * Nulls * Large offsets * Different types (e.g. String vs Binary) In addition, I have changed the row limit on pages from `u32` to `u64`. Partly, this is because string arrays larger than 2GB are not unheard of and it may be nice in some cases to store them in one page. Also, pages are infrequent enough that an extra 4 bytes is trivial (actually, protobuf already stores u32 and u64 the same). Another reason is that, with some encodings (e.g. constant / compressed bitmap / etc.) it's possible to have 2^64 rows without occupying a large amount of space. Note: this is technically a `.proto` change but in protobuf `u32 -> u64` is backwards compatible so it is not a huge issue.

Commit:2febf2d
Author:Raunak Shah
Committer:GitHub

feat: convert binary field encoding/decoding to array encoding/page decoding (#2426) Previously strings were encoded using the binary field encoder, which used the List encoder and decoder. Rewrote string encoding and decoding so that string encoding is now another primitive encoding. Now both encoding/decoding is done directly at an array/page level and does not use the list encoder. Functionality guarded behind an env variable for now. We want this to be the default encoding for strings once null support is added. This should enable adding (i) a dictionary array encoder and page decoder to enable dictionary encoding functionality - ref #2409 and (ii) FSST compression on strings

Commit:db7c6dc
Author:Will Jones
Committer:GitHub

feat: scan with stable row id (#2441) Modifies the scanner to now support two options: `with_row_id` and `with_row_address`. Row addresses are what we used to call `_rowid`. They are returned as the `_rowaddr` column. For backwards compatibility, the `_rowid` column will be a copy of `_rowaddr` for datasets that aren't using stable row ids. For those that have activated the stable row id feature, they will have the row ids in the `_rowid` column.

Commit:2d8e414
Author:Weston Pace
Committer:GitHub

feat: add pushdown to the read path (#2444) This PR adds quite a few pre-requisites as well: * We add the concept of a "filter" to the read path * We make the selection of decoder extensible with the `FieldDecoderStrategy` and `DecoderMiddlewareChain` * Various changes to the decode path so that we can handle the case where the number of rows scheduled is less than the number of rows originally asked for * We add Lance deserialization to `EncodedBatch` (serialization was added in an earlier PR) * We fix some bugs in the serialization and deserialization of `EncodedBatch` The basic flow is this: * On write, we store statistics for each leaf column. These are a record batch of min/max/null_count with a row for every X rows (X is configurable and we should experiment with it). * This batch is encoded into a single buffer (as a mini lance file) and written at the end of the data section of the file. Each leaf column's encoding is wrapped with a header that points to this buffer * When creating decoders, every time we create a leaf column, we unwrap the stats location and collect these. * When creating the root decoder, after the decoder is created, we wrap the root decoder with a filtering decoder that is aware of all of these zone maps. At this time we go out and collect all the zone maps from the file and decode them back into record batches. We now have a map from field to zone map * At schedule time, this outer decoder uses the user provided filter and the zone maps to refine the range that the user asked for. This is not the final PR for pushdown, there are still a few things that need to be done: * We need to use `LanceDfFieldDecoderStrategy` in fragment and the python bindings * The way we are mapping fields to DF columns is not quite right * Ideally we can move the zone map "initialization" to the beginning of the scheduling process (instead of opening the file) so that we can only inflate statistics for columns involved in the filter * We need to cache the inflated statistics in the cached metadata * Ideally we can add a "context" to the decoder middleware chain so that `LanceDfFieldDecoderStrategy` doesn't have to use a mutex Right now pushdown is implemented as a "best effort" filter so a `FilterExec` is still needed in any plan doing pushdown. In the future we might want to move the `FilterExec` into the decoder. This would close https://github.com/lancedb/lance/issues/2398
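
A toy sketch of the zone-map pruning step described above (hypothetical types, equality filter only): per-zone min/max statistics keep only the zones that could contain matches, and the real filter still runs afterwards because pruning is best-effort.

```rust
/// Hypothetical sketch of zone-map pruning: each zone records min/max for a
/// column, and a simple `col = value` filter skips zones whose range cannot
/// contain the value. Surviving zones may still hold false positives, so the
/// filter is re-applied to the decoded rows.
struct Zone {
    rows: std::ops::Range<u64>,
    min: i64,
    max: i64,
}

fn zones_to_scan(zones: &[Zone], value: i64) -> Vec<std::ops::Range<u64>> {
    zones
        .iter()
        .filter(|zone| zone.min <= value && value <= zone.max)
        .map(|zone| zone.rows.clone())
        .collect()
}

fn main() {
    let zones = vec![
        Zone { rows: 0..1000, min: 0, max: 99 },
        Zone { rows: 1000..2000, min: 100, max: 199 },
    ];
    // Only the second zone can contain the value 150, so the first is skipped.
    assert_eq!(zones_to_scan(&zones, 150), vec![1000..2000]);
}
```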

Commit:2d68062
Author:Will Jones
Committer:GitHub

feat: stable row id manifest changes (#2363) * Adds stable row ids to manifest * Support writing row id sequences for append, overwrite, and update. Epic: #2307

Commit:fad3108
Author:albertlockett
Committer:albertlockett

feat: Add bitpacking encoding

Commit:d8da445
Author:Weston Pace
Committer:GitHub

feat: add encoder utilities for pushdown (#2388) This adds a new field encoder (ZoneMapFieldEncoder) that calculates pushdown statistics and places them in the metadata. It also changes the encoder so that the choice of encoding is configurable. This makes it possible for extensions to register custom encodings. The zone maps encoder is an example of this as it is placed in a special crate for "encodings that rely on datafusion". It also adds some utilities for converting an `EncodedBatch` to `Bytes` according to the lance file format. This makes it possible to go from `RecordBatch` to `Bytes` using the lance file format. There is not much testing for the zone maps encoder. More will come when we add support for reading zone maps but I want to keep this PR simple for now.

Commit:da2b295
Author:Weston Pace
Committer:GitHub

refactor: shuffle around v2 metadata sections to allow read-on-demand statistics (#2400) Previously the v2 file format stored the column metadatas and column metadata buffers in the same part. This is problematic because there is no way to know where the proto stops and the metadata begins. It also makes it more difficult to do "read on demand" column statistics (and dictionaries). This PR gets rid of column metadata buffers. If a column encoding wants to refer to a data buffer of some kind, that data buffer can be written to the data section. This helps account for things like dictionaries which might need to be periodically flushed to disk. We also move the global buffers here for consistency. Now a reader can quickly read all column metadatas and the schema in a single IOP and then follow that up by loading column-specific buffers on an as-needed basis. This will allow us to only load statistics / dictionaries for columns that we are interested in. --------- Co-authored-by: Bert <albert.lockett@gmail.com>

Commit:68b45c3
Author:Yue
Committer:GitHub

feat: general compression for value page buffer (#2368) This PR introduces general compression for value page buffers, starting with zstd, to reduce the on-disk size of all types of value arrays. Here are the key details: 1. After some code exploration, I implemented this as a buffer encoder for `ValueEncoder` instead of as an independent physical encoder. Please let me know if this approach is suitable. 2. Enhancements to `ValuePageScheduler.schedule_ranges` allow it to read the entire buffer range for compressed buffers. To support this, I added a new `buffer_size` metadata to several buffer structs, populating these variables using metadata from Lance. 3. Currently, only zstd compression with the default level (level 3) is implemented. If this approach is deemed suitable, we can consider adding more general compression methods in the future. 4. In a specific test case, the original Lance file was 13MB. After applying zstd compression, its size was reduced to 7.4MB. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>
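
A minimal sketch of why the new `buffer_size` metadata matters (hypothetical struct and names, not the actual `ValuePageScheduler`): a compressed value buffer cannot be range-read by value index, so the scheduler must fetch the whole stored buffer and slice after decompression.

```rust
/// Hypothetical sketch: when a value buffer is compressed, the reader needs
/// the stored on-disk size to fetch the whole buffer; when it is plain, it
/// can compute an exact byte range for the requested values.
struct ValueBuffer {
    offset: u64,
    buffer_size: u64, // on-disk (possibly compressed) size recorded in metadata
    compressed: bool,
    value_width: u64,
}

fn read_range(buf: &ValueBuffer, first_value: u64, num_values: u64) -> (u64, u64) {
    if buf.compressed {
        // Must fetch the entire compressed buffer and slice after decompressing.
        (buf.offset, buf.offset + buf.buffer_size)
    } else {
        // Uncompressed: fetch exactly the requested values.
        let start = buf.offset + first_value * buf.value_width;
        (start, start + num_values * buf.value_width)
    }
}

fn main() {
    let plain = ValueBuffer { offset: 0, buffer_size: 4000, compressed: false, value_width: 4 };
    let zstd = ValueBuffer { offset: 0, buffer_size: 1700, compressed: true, value_width: 4 };
    assert_eq!(read_range(&plain, 10, 5), (40, 60));
    assert_eq!(read_range(&zstd, 10, 5), (0, 1700));
}
```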

Commit:e310ab4
Author:Will Jones
Committer:GitHub

feat: row id index structures (experimental) (#2303) These are experimental indices to map from stable row ids to row addresses. It's possible there are some improvements to serialization format or performance we will make before stabilizing, but I'd like to defer that work so we can unblock work with the stable row ids. These row id indices are optimized for storage size (in-memory and on-disk) and access speed. Closes: #2308

Commit:711ac03
Author:broccoliSpicy
Committer:GitHub

docs: clarify comments in table.proto -> message DataFragment -> physical_rows (#2346) #2345

Commit:1e9414b
Author:Lei Xu
Committer:GitHub

feat: hamming distance (#2110)

Commit:098f730
Author:Weston Pace
Committer:GitHub

feat: add nullability and u64 support to list codec (#2255) This adds support for nulls in lists and adds support for lists that contain more than i32::MAX items (note: this is a lot more common since we write offset pages much slower than item pages)

Commit:d1582a5
Author:Will Jones
Committer:GitHub

feat: support complex schemas in append (#2209) Fixes two bugs, both associated with schemas that have holes in the field ids: 1. `write_fragments()` and `LanceFragment.create()` assume they can derive the field ids from the Arrow schema. This is not the case if there are holes in the schema. Therefore, when the mode is `Append`, we check the existing schema of the dataset and use its field ids. Fixes #2179 2. `PageTable` assumed that there were no holes in the field ids. It was parametrized as `field_id_offset` and `num_fields`, assuming the field ids were `field_id_offset..(field_id_offset + num_fields)`. This is changed to be parametrized by the min and max field id. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:aa23b00
Author:Weston Pace
Committer:GitHub

refactor: rework fragment API in preparation for lance v2 (#2194) Refactor the file opening code in fragment to use common paths. Add validation logic for data files. Fix alter_columns so that it drops data files that are no longer referenced. Put in groundwork for new fragment info for v2. --------- Co-authored-by: Will Jones <willjones127@gmail.com>

Commit:aa4eb66
Author:Weston Pace
Committer:GitHub

feat: add v2 nullability for primitive / fsl (#2169) Struct and list will come later. This PR was getting large enough as it was. There is some refactoring happening as part of this PR also. The basic encoder was previously an array encoder that took in two buffer encodings. However, this meant we couldn't use the basic encoder to encode FSL since FSL is an array encoder. So I changed the basic encoder to be an encoder that takes an encoder for validity and an array encoder for values. I then also created the "value encoder" as an array encoder and renamed the existing value encoder (a buffer encoder) to the flat encoder. In addition there was some minor refactoring needed to keep track of buffer indices. Previously array encodings always mapped to 2 buffers. Now they might map to more than 2 buffers (e.g. FSL with its own nullability is 3 buffers).

Commit:3508701
Author:Weston Pace
Committer:GitHub

chore: fix license headers (#2177) This changes all license headers to SPDX (for brevity) and adds CI to verify the license headers (previously we only had checking on Rust, via ruff, and it was not checking the license text) I also dropped the copyright year because it's unnecessary, I didn't want to have to try and fix it correctly, and [other](https://github.com/facebook/react/pull/13593) [big](https://twitter.com/ajorg/status/1228369968963604480) [projects](https://hynek.me/til/copyright-years/) have done so as well. I also changed "Lance Developers" to "The Lance Authors" per the [Linux Foundation guidelines](https://www.linuxfoundation.org/blog/blog/copyright-notices-in-open-source-software-projects).

Commit:9e9a4e0
Author:Weston Pace
Committer:Weston Pace

Added tests for nullability and implementations for nullability for all data types except struct/list

Commit:3ac0074
Author:Weston Pace
Committer:GitHub

feat: mvp for lance version 0.2 reader / writer (#1965) The motivation and bigger picture are covered in more detail in https://github.com/lancedb/lance/issues/1929 This PR builds on top of https://github.com/lancedb/lance/pull/1918 and https://github.com/lancedb/lance/pull/1964 to create a new version of the Lance file format. There is still much to do, but this end-to-end MVP should provide the overall structure for the work. It can currently read and write primitive columns and list columns and supports some very basic encodings.

Commit:8e69dde
Author:Weston Pace
Committer:GitHub

feat: initial reader/writer for the v2 format (#2153) Prerequisites: - [x] #2142

Commit:b4ec40e
Author:Weston Pace
Committer:GitHub

feat: add a protobuf file describing encodings (#2137)

Commit:8fce93f
Author:BubbleCal
Committer:GitHub

chore: building `IVF_HNSW` index (#2066)

Commit:d0ab48e
Author:BubbleCal
Committer:GitHub

chore: write HNSW partitions (#2056)

Commit:a8def71
Author:Weston Pace
Committer:GitHub

refactor: split lance-core into lance-core, lance-io, lance-file, and lance-table (#1852) There is very little code change in this PR. It is primarily moving files around. The only substantial change should be a change to the `FileWriter`. It is now either `FileWriter<NotSelfDescribing>` (for testing in `lance-file`) or `FileWriter<ManifestDescribing>` for use everywhere else. This is because a lance file writes the schema as a manifest and the manifest is part of the table format. In the future we could fix this by creating a schema message in the protobuf in lance-file and changing the `FileWriter` and `FileReader` to read this. However, this would be a breaking change and would require a bump in the lance file version (or feature flags). I'd prefer to save that change for later.

Commit:8cc6dab
Author:Rok Mihevc
Committer:GitHub

feat: support drop column api in Dataset (#1695) See #1674.

Commit:33d68e9
Author:Will Jones
Committer:GitHub

fix: prevent stats meta from breaking old readers (#1699) Older readers have a bug where they will read past the end of the file when reading the statistics metadata. Those older readers were before statistics were even used, so we don't care about them working. This commit moves the proto field so that only newer readers will pick it up. Closes #1697

Commit:7b18791
Author:Will Jones
Committer:GitHub

feat: add support for update queries (#1585) TODO: * [x] Tests * [x] Expose in Python * [x] Document in user guide Closes: #1558 Future follow ups tracked in #1589 --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:c53c63a
Author:Lei Xu
Committer:GitHub

chore: provide a trait to dynamically dispatch different pq based on different vector data type (#1555) Prepare to support bf16/f16 in PQ

Commit:c901cbe
Author:Will Jones
Committer:GitHub

fix: add versioning and bypass broken row counts (#1534) Adds a new feature: `WriterVersion` in the manifest. Also fixes two bugs: * #1531 Adds bypass logic to read correct `physical_rows` when old version is detected. Will update the value on write as needed. * When `migrate_fragments` runs, we do it on the new fragments and not the old, fixing a data loss issue. Fixes #1531 Fixes #1535

Commit:c2ad65f
Author:Lei Xu
Committer:GitHub

feat: store a separate tensor blob for IVF centroids (#1446) * Allow separate I/Os in the future, i.e., centroids can store externally from the IVF protobuf. * Support multiple vector data types (bf16, f16, f32) * Can support compression in the future.

Commit:c2a604d
Author:Lei Xu
Committer:GitHub

feat: add removed_indices to CreateIndex transaction operation (#1408)

Commit:7a645b2
Author:Weston Pace
Committer:GitHub

feat: remap indices on compaction (#1403) Currently, after compaction runs, any fragments that were covered by an index, and compacted, will no longer be covered by that index. This PR fixes that. During compaction we calculate the input and output row ids and then rewrite existing indices so that they cover the newly compacted fragments. Closes #1378 --------- Co-authored-by: Will Jones <willjones127@gmail.com>

Commit:a5d4a8c
Author:Will Jones
Committer:GitHub

feat: support storing page-level stats (#1316) This PR adds page-level statistics to the Lance format. Computing the statistics and using them in queries is left for a future PR. Changes: * Define the statistics format in the format documentation. * Added statistics metadata to the metadata * Added reading and writing of the statistic page table. Refactored array read and write functions to take the page table as a parameter. Closes #1318 --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:e62f235
Author:Will Jones
Committer:GitHub

fix: make PageTable handle multiple files (#1332) Currently, `PageTable` assumes there are fields `0..num_columns`, where `num_columns` is the total number of columns in the dataset. This works for new files added, because we always write `0..(max_field_id + 1)` entries in the page table, but it breaks for old data files that have fewer fields. This is why after merge we can read the new columns (in the new files), but not the old columns. Also, because the PageTable is cached, we could read while using the same instance of `LanceDataset`, but as soon as we loaded in a new session, the PageTables would be re-read and break. This PR modifies the `PageTable` format slightly so that the fields start at the minimum field id in that file's schema. This is backwards compatible for all tables that only have one data file per fragment, but breaks tables that have more than one file. However, seeing as the latter case is broken already, I think this breaking change is fine. Closes #1331.

Commit:c648578
Author:rmeng

save

Commit:27814c5
Author:Will Jones
Committer:GitHub

feat: track fragment size and fragments in indices (#1290) To make tables easier to manage, this PR adds new metadata to the manifest: 1. `Index` now tracks the `fragment_ids` it contains in a bitmap. This is useful to see how much of a dataset is covered by an old index. 2. `Fragment` now tracks the number of rows and number of deletions. This avoids the need to make additional IO to count the number of rows in a dataset. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>

Commit:816d85c
Author:Will Jones
Committer:GitHub

feat: add file compaction (#1095) Closes #934

Commit:c76bc85
Author:Weston Pace
Committer:GitHub

feat: add support for field metadata (#1173) * Got rid of the extension_name field in favor of the HashMap field metadata. A flat extension name is not sophisticated enough for complex parameterized extension types such as fixed shape tensors * Add the tensor array type to the roundtrip types test * Removed leftover todo message --------- Co-authored-by: Will Jones <willjones127@gmail.com>

Commit:69c5ebf
Author:Weston Pace
Committer:GitHub

feat: rebase restore feature on top of transactions feature (#1143) * Revert "Revert "feat: quick restore command (#1138)" (#1148)" This reverts commit 2e32c14e1ec45f71cd15c856e1ca111352dd95b2. * Update the restore behavior to use an 'overwrite transaction' instead of writing the manifest directly * Clippy * Use a brand new restore operation instead of an overwrite operation * Changed the restore operation to only serialize the version and then load the old manifest from file later * Tweak imports after cherry-pick * Clippy * Remove an un-needed proto change

Commit:0a12b97
Author:Will Jones
Committer:GitHub

feat: transaction files and commit conflict resolution (#1127) * wip: outline conflict resolution impl * wip: make transaction serialize * wip: make transaction a separate file * docs: write basic format docs for transactions * get existing tests passing * cleanup * minor pr feedback * fix: handle indices properly * more fixes and tests * merge changes * handle params better

Commit:4cf6fae
Author:Will Jones

pr feedback

Commit:ecb19e4
Author:Will Jones

revert small change

Commit:2c0a345
Author:Will Jones

refactor: make collect_statistics part of schema

Commit:51b807a
Author:Will Jones
Committer:Will Jones

more progress

Commit:cab618c
Author:Will Jones
Committer:Will Jones

cleanup

Commit:8cf4f37
Author:Will Jones
Committer:Will Jones

more format changes

Commit:748d914
Author:Will Jones
Committer:Will Jones

start rearranging format docs for external manifest

Commit:d98665f
Author:Will Jones
Committer:GitHub

feat: track the max used fragment id (#1092) * feat: track the max used fragment id * handle old tables * feat: handle empty and unset

Commit:2f99f7b
Author:Will Jones
Committer:GitHub

docs: add clarifications to format about fields (#1052) * docs: add clarifications to format * docs: add fragment diagram * docs: clarify field ids are positional

Commit:313dba3
Author:Lei Xu
Committer:GitHub

[Doc, Rust] Improve docstring about how to use Dataset write in Rust (#1060)

Commit:8de3ce0
Author:Will Jones
Committer:GitHub

docs: update format docs with feature flags and row indices (#1037)

Commit:6497c0e
Author:Lei Xu
Committer:GitHub

Support Dot product as metric type (#1021)

Commit:551d676
Author:Will Jones
Committer:GitHub

feat: add feature flags to the dataset manifest (#979) * wip: implement feature flags * test it * format better * fix test

Commit:a410200
Author:Will Jones
Committer:GitHub

WIP: add deletion files to Lance format (#920) * wip: define deletion files * wip: rough outline of deletion IO utils * draft IO parts * passing tests * cleanup * better error handling

Commit:ee738e2
Author:Lei Xu
Committer:GitHub

Persist simple diskann index (#787)

Commit:220dc8f
Author:Lei Xu
Committer:GitHub

[Rust] Build DiskANN index (#763)

Commit:02ec8e7
Author:Lei Xu
Committer:GitHub

[Rust] Persist graph using lance file format. (#756)

Commit:25eb36f
Author:Chang She
Committer:GitHub

Vector search should support appending new rows (#593)

Commit:299af4a
Author:Lei Xu
Committer:GitHub

Streaming PQ (#689) --------- Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>

Commit:35cf6f5
Author:Lei Xu
Committer:GitHub

Use MetricType to specify the metric / distance compute function (#600)

Commit:7ba2637
Author:Chang She
Committer:GitHub

Fix version timestamp issue (#582) * failing unit test for #581 * add timestamp and tag to manifest proto * fix version timestamp issue * fmt * fix CI * add timestamp accessor * fmt * set timestamp just before writing manifest