Proto commits in man-group/ArcticDB

The Protocol Buffers files changed in the following 91 commits:

Commit:ecb7bdb
Author:Ognyan Stoimenov

Temp, dont merge

Commit:c792174
Author:William Dealtry
Committer:William Dealtry

Add some lightweight encodings

Commit:dc93ca6
Author:William Dealtry
Committer:William Dealtry

Add some lightweight encodings

Commit:5958781
Author:Alex Seaton
Committer:GitHub

GCP Support over the S3 XML API (#2176)

This adds support for GCP over the S3-compatible XML API. GCP support did not work out of the box because Google does not support any of the S3 "batch" APIs. We used batch delete for all our deletion operations, which broke. Instead I use the unary delete operation here, using the AWS SDK's async API to reduce the performance impact of sending single object deletes.

Users connect to GCP like:

```
ac = Arctic("gcpxml://storage.googleapis.com:aseaton_bucket?access=...&secret=...")
ac = Arctic("gcpxmls://storage.googleapis.com:aseaton_bucket?access=...&secret=...")
# Kept aws_auth naming as this will use ~/.aws/credentials
ac = Arctic("gcpxml://storage.googleapis.com:aseaton_bucket?aws_auth=true&aws_profile=blah")
ac = Arctic("gcpxml://storage.googleapis.com:aseaton_bucket?aws_auth=true&prefix=my_prefix")
```

We use a `gcpxml` URL scheme so that, if we do a full GCP implementation later, users can get it with `gcp://`. I've added a GCP storage proto object so that users on the `gcpxml` API will later be able to move to a `gcp://` API without updating any library configs. The proto is tiny, storing just the library's prefix, which is the only information we need to serialize.

This is implemented as a small subclass of our existing S3 implementation, with its deletion functionality overridden. The Python library adapter creates the `NativeVersionStore` with a `GCPXMLSettings` in-memory object (in the `NativeVariantStorage`), which makes us create a `GCPXMLStorage` C++ adapter rather than the normal `S3Storage`.

The interesting part of this PR is the threading model in `do_remove_no_batching_impl`. This is called from an IO executor as part of the `RemoveTask`, so it is not safe (due to the risk of deadlocks) to submit any work to the IO executor from it. Instead:

- The storage client `s3_client_impl.cpp` submits delete operations with the S3 API. The S3 SDK has its own event loop to handle these. When they complete, they resolve a `folly::Promise`, and the client returns a `folly::Future` out of that promise.
- Our storage adapter (`detail-inl.hpp`) collects these futures on an inline executor. This is fine because there is no work for this executor to do, other than waiting to be notified that promises have been resolved by the AWS SDK.

For testing, I created a `moto`-based test fixture that returns errors when batch delete operations are invoked. Also tested manually against a real GCP backend.

There's some duplication between the `gcpxml` and the `s3` Python adapters, but since the `gcpxml` adapter will hopefully be temporary until a full GCP implementation, I didn't see the value in refactoring it away.

Monday: 8450684276

The documentation is generated from this commit.

Commit:b798d13
Author:Alex Seaton
Committer:Alex Seaton

WIP GCP XML support - dedicated proto type for upgrade path to JSON API later

The documentation is generated from this commit.

Commit:8a3dfa1
Author:William Dealtry
Committer:William Dealtry

Adaptive encoding

Commit:c652cde
Author:William Dealtry
Committer:William Dealtry

Add some lightweight encodings

Commit:f92a5ac
Author:William Dealtry
Committer:William Dealtry

Refactor aggregator set data, add statistics

Commit:72c5cc4
Author:Alex Owens
Committer:GitHub

Bugfix/1841/maintain empty series names (#1983)

#### Reference Issues/PRs
Fixes #1841

#### What does this implement or fix?
Before this change, if a `Series` had an empty-string as a name, this would be roundtripped as a `None`. This introduces a `has_name` bool to the normalization metadata protobuf, as a backwards-compatible way of effectively making the `name` field optional. The behaviour (which has been verified) can be summarised as follows:

| Writer version | Series name | Protobuf name field | Protobuf has_name field | Series name read by <=5.0.0 | Series name read by this branch |
|----------------|-------------|---------------------|-------------------------|-----------------------------|---------------------------------|
| <=5.0.0        | "hello"     | "hello"             | Not present             | "hello"                     | "hello"                         |
| <=5.0.0        | ""          | ""                  | Not present             | None                        | None                            |
| <=5.0.0        | None        | ""                  | Not present             | None                        | None                            |
| This branch    | "hello"     | "hello"             | True                    | "hello"                     | "hello"                         |
| This branch    | ""          | ""                  | True                    | None                        | ""                              |
| This branch    | None        | ""                  | False                   | None                        | None                            |
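As an illustration of the behaviour above (an editor's sketch, not code from the PR; the LMDB path, library name and symbol name are made up, and the final assertion assumes both writer and reader include this fix):

```python
import pandas as pd
from arcticdb import Arctic

# Sketch only: throwaway LMDB-backed library for illustration.
ac = Arctic("lmdb:///tmp/arcticdb_demo")
lib = ac.get_library("demo", create_if_missing=True)

series = pd.Series([1.0, 2.0, 3.0], name="")  # empty-string name

lib.write("empty_name_sym", series)
roundtripped = lib.read("empty_name_sym").data

# With this fix on both sides, the empty-string name survives the roundtrip;
# readers <=5.0.0 would still see None, as in the table above.
assert roundtripped.name == ""
```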

Commit:0c4949a
Author:Alex Owens
Committer:GitHub

Bugfix/1841/maintain empty series names (#1983)

#### Reference Issues/PRs
Fixes #1841

#### What does this implement or fix?
Before this change, if a `Series` had an empty-string as a name, this would be roundtripped as a `None`. This introduces a `has_name` bool to the normalization metadata protobuf, as a backwards-compatible way of effectively making the `name` field optional. The behaviour (which has been verified) can be summarised as follows:

| Writer version | Series name | Protobuf name field | Protobuf has_name field | Series name read by <=5.0.0 | Series name read by this branch |
|----------------|-------------|---------------------|-------------------------|-----------------------------|---------------------------------|
| <=5.0.0        | "hello"     | "hello"             | Not present             | "hello"                     | "hello"                         |
| <=5.0.0        | ""          | ""                  | Not present             | None                        | None                            |
| <=5.0.0        | None        | ""                  | Not present             | None                        | None                            |
| This branch    | "hello"     | "hello"             | True                    | "hello"                     | "hello"                         |
| This branch    | ""          | ""                  | True                    | None                        | ""                              |
| This branch    | None        | ""                  | False                   | None                        | None                            |

Commit:303b2e0
Author:phoebusm
Committer:phoebusm

snapshot

Commit:e12654e
Author:phoebusm
Committer:phoebusm

non-protobuf new s3 settings snapshot

Commit:664c92c
Author:phoebusm
Committer:phoebusm

- Introduce aws_auth to proto (_RBAC_ is not replaced yet)
- Add STS auth method to codebase (But no test yet)
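For context, an illustrative sketch (not from this commit) of what `aws_auth` enables at the user level: credentials come from the standard AWS chain (e.g. `~/.aws/credentials`, or an STS-assumed role profile) instead of being embedded in the URI. The endpoint, bucket and profile names below are made up; treat the URI options as a sketch of the documented S3 storage options.

```python
from arcticdb import Arctic

# Credentials resolved from the AWS credentials chain rather than the URI.
ac = Arctic("s3s://s3.eu-west-2.amazonaws.com:my-bucket?aws_auth=true")

# With an STS role profile configured in ~/.aws/config, select it explicitly.
ac_sts = Arctic(
    "s3s://s3.eu-west-2.amazonaws.com:my-bucket?aws_auth=true&aws_profile=my_role_profile"
)
```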

Commit:44300ea
Author:phoebusm
Committer:phoebusm

Address PR comments

Commit:35a1e07
Author:Alex Owens
Committer:GitHub

Revert "Bugfix 1841: Correctly roundtrip None and empty string pd.Series names (#1878)" (#1953) #### Reference Issues/PRs This reverts commit 2c8d74b579290f595bf1f39b6a489e4ccb9616b2. The fix was broken, new clients incorrectly read Series with non-empty strings as names written with older clients as `None`

Commit:6af6a2a
Author:Alex Owens
Committer:GitHub

Revert "Bugfix 1841: Correctly roundtrip None and empty string pd.Series names (#1878)" (#1953) #### Reference Issues/PRs This reverts commit d6bdd48f0f3e4c971588b0680b4089409646eaa9. The fix was broken, new clients incorrectly read Series with non-empty strings as names written with older clients as `None`

Commit:8fe8a30
Author:phoebusm
Committer:phoebusm

- Introduce aws_auth to proto (_RBAC_ is not replaced yet)
- Add STS auth method to codebase (But no test yet)

Commit:5d23aee
Author:phoebusm
Committer:phoebusm

- Introduce aws_auth to proto (_RBAC_ is not replaced yet)
- Add STS auth method to codebase (But no test yet)

Add auto test More detail readiness check Fast test quicker test Format fix print more log Shorten the test Always print Change test More specific test More test Fix test Remove pytest mark Complete branch test Quick test fix xdist test Try fix python selection Fix test Update venv Update fix More echo Test Test More echo Default to py310 Update Update python ver print py version 310 default Remove version

Commit:d6bdd48
Author:Alex Owens
Committer:GitHub

Bugfix 1841: Correctly roundtrip None and empty string pd.Series names (#1878) #### Reference Issues/PRs Fixes #1841

Commit:2c8d74b
Author:Alex Owens
Committer:GitHub

Bugfix 1841: Correctly roundtrip None and empty string pd.Series names (#1878) #### Reference Issues/PRs Fixes #1841

Commit:e5a5b61
Author:Alex Seaton
Committer:Alex Seaton

Support reading from prefixes that include a dot

Commit:8a8d293
Author:Alex Seaton
Committer:Alex Seaton

Support reading from prefixes that include a dot

Commit:102b1c0
Author:Alex Seaton
Committer:Alex Seaton

Remove rocksdb

Commit:af3f7c4
Author:Alex Seaton
Committer:Alex Seaton

Remove rocksdb

Commit:3384d76
Author:Alex Seaton
Committer:GitHub

Fix Segment use-after-move when replicating to NFS (#1756)

## Motivation

(This section is copied from a previous (closed) attempt - https://github.com/man-group/ArcticDB/pull/1746)

The motivation for the change is to allow `arcticdb-enterprise` to copy blocks to NFS storages without a use-after-move. I explained this in https://github.com/man-group/arcticdb-enterprise/pull/139 but to have an open record: CopyCompressedInterStoreTask has:

```
// Don't bother copying the key segment pair when writing to the final target
if (it == std::prev(target_stores_.end())) {
    (*it)->write_compressed_sync(std::move(key_segment_pair));
} else {
    auto key_segment_pair_copy = key_segment_pair;
    (*it)->write_compressed_sync(std::move(key_segment_pair_copy));
}
```

KeySegmentPair has a shared_ptr to a KeySegmentPair, which we can think of here as just a `Segment`. Therefore the old `key_segment_pair_copy` is shallow; the underlying Segment is the same. But the segment eventually gets passed as an rvalue reference further down the stack. In `do_write_impl` we call `put_object`, which calls `serialize_header`. This modifies the segment in place and passes that buffer to the AWS SDK. In the `NfsBackedStorage` we have:

```
void NfsBackedStorage::do_write(Composite<KeySegmentPair>&& kvs) {
    auto enc = kvs.transform([] (auto&& key_seg) {
        return KeySegmentPair{encode_object_id(key_seg.variant_key()), std::move(key_seg.segment())};
    });
    s3::detail::do_write_impl(std::move(enc), root_folder_, bucket_name_, *s3_client_, NfsBucketizer{});
}
```

where the segment gets moved from. Subsequent attempts to use the segment (e.g. copying on to the next store) then fail. https://github.com/man-group/arcticdb-enterprise/pull/139 fixed this issue by cloning the segment, but this approach avoids the (expensive) clone.

## Logical Change

Copy the `KeySegmentPair`'s pointer to the `Segment` in `nfs_backed_storage.cpp` rather than moving from the segment.

## Refactor and Testing

### Copy Task

Move the CopyCompressedInterStoreTask down to ArcticDB from arcticdb-enterprise. Add a test for it on NFS storage. I've verified that the tests in this commit fail without the refactor in the HEAD~1 commit. The only changes to `CopyCompressedInterstoreTask` from enterprise are:

- Pass the `KeySegmentPair` by value in to `write_compressed{_sync}`. The `KeySegmentPair` is cheap to copy (especially considering we are about to copy an object across storages, likely with a network hop).
- We have adopted the new `set_key` API of `KeySegmentPair`:

```
if (key_to_write_.has_value()) {
    key_segment_pair.set_key(*key_to_write_);
}
```

- We have namespaced the `ProcessingResult` struct in to the task.

### KeySegmentPair

- Replace methods returning mutable lvalue references to keys with a `set_key` method.
- Remove the `release_segment` method, as it dangerously leaves the `KeySegmentPair` pointing at a `Segment` object that has been moved from, and it is not actually necessary.

## Follow up work

The non-const `Segment& KeySegmentPair#segment()` API is still dangerous and error prone. I have a follow up change to remove it, but that API change affects very many files and will be best raised separately so that it doesn't block this fix for replication. A draft PR showing a proposal for that change is here - https://github.com/man-group/ArcticDB/pull/1757.

Commit:498c331
Author:Alex Seaton
Committer:GitHub

Fix Segment use-after-move when replicating to NFS (#1756)

## Motivation

(This section is copied from a previous (closed) attempt - https://github.com/man-group/ArcticDB/pull/1746)

The motivation for the change is to allow `arcticdb-enterprise` to copy blocks to NFS storages without a use-after-move. I explained this in https://github.com/man-group/arcticdb-enterprise/pull/139 but to have an open record: CopyCompressedInterStoreTask has:

```
// Don't bother copying the key segment pair when writing to the final target
if (it == std::prev(target_stores_.end())) {
    (*it)->write_compressed_sync(std::move(key_segment_pair));
} else {
    auto key_segment_pair_copy = key_segment_pair;
    (*it)->write_compressed_sync(std::move(key_segment_pair_copy));
}
```

KeySegmentPair has a shared_ptr to a KeySegmentPair, which we can think of here as just a `Segment`. Therefore the old `key_segment_pair_copy` is shallow; the underlying Segment is the same. But the segment eventually gets passed as an rvalue reference further down the stack. In `do_write_impl` we call `put_object`, which calls `serialize_header`. This modifies the segment in place and passes that buffer to the AWS SDK. In the `NfsBackedStorage` we have:

```
void NfsBackedStorage::do_write(Composite<KeySegmentPair>&& kvs) {
    auto enc = kvs.transform([] (auto&& key_seg) {
        return KeySegmentPair{encode_object_id(key_seg.variant_key()), std::move(key_seg.segment())};
    });
    s3::detail::do_write_impl(std::move(enc), root_folder_, bucket_name_, *s3_client_, NfsBucketizer{});
}
```

where the segment gets moved from. Subsequent attempts to use the segment (e.g. copying on to the next store) then fail. https://github.com/man-group/arcticdb-enterprise/pull/139 fixed this issue by cloning the segment, but this approach avoids the (expensive) clone.

## Logical Change

Copy the `KeySegmentPair`'s pointer to the `Segment` in `nfs_backed_storage.cpp` rather than moving from the segment.

## Refactor and Testing

### Copy Task

Move the CopyCompressedInterStoreTask down to ArcticDB from arcticdb-enterprise. Add a test for it on NFS storage. I've verified that the tests in this commit fail without the refactor in the HEAD~1 commit. The only changes to `CopyCompressedInterstoreTask` from enterprise are:

- Pass the `KeySegmentPair` by value in to `write_compressed{_sync}`. The `KeySegmentPair` is cheap to copy (especially considering we are about to copy an object across storages, likely with a network hop).
- We have adopted the new `set_key` API of `KeySegmentPair`:

```
if (key_to_write_.has_value()) {
    key_segment_pair.set_key(*key_to_write_);
}
```

- We have namespaced the `ProcessingResult` struct in to the task.

### KeySegmentPair

- Replace methods returning mutable lvalue references to keys with a `set_key` method.
- Remove the `release_segment` method, as it dangerously leaves the `KeySegmentPair` pointing at a `Segment` object that has been moved from, and it is not actually necessary.

## Follow up work

The non-const `Segment& KeySegmentPair#segment()` API is still dangerous and error prone. I have a follow up change to remove it, but that API change affects very many files and will be best raised separately so that it doesn't block this fix for replication. A draft PR showing a proposal for that change is here - https://github.com/man-group/ArcticDB/pull/1757.

Commit:f80a12f
Author:Alex Seaton
Committer:Alex Seaton

Move CopyCompressedInterStoreTask down to ArcticDB for testability

Commit:78d3152
Author:Alex Seaton
Committer:Alex Seaton

Move CopyCompressedInterStoreTask down to ArcticDB

Commit:5fcc1b4
Author:Vasil Danielov Pashov
Committer:GitHub

Read index API (#1568)

#### Reference Issues/PRs
Resolves #1150

#### What does this implement or fix?
Read only the index column of a DataFrame stored in ArcticDB. Implemented only in the V2 Library API, as part of the `read` call.

* Passing `columns=None` will return all columns in the DataFrame.
* Passing `columns=[]` will return a DataFrame containing only the index columns.

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>
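As a usage illustration (an editor's sketch, not code from the PR; the LMDB path, library name and symbol name are made up):

```python
import pandas as pd
from arcticdb import Arctic

# Sketch only: throwaway LMDB-backed library for illustration.
ac = Arctic("lmdb:///tmp/arcticdb_demo")
lib = ac.get_library("demo", create_if_missing=True)

df = pd.DataFrame({"price": [1.0, 2.0]},
                  index=pd.date_range("2024-01-01", periods=2))
lib.write("sym", df)

full = lib.read("sym", columns=None).data       # all columns
index_only = lib.read("sym", columns=[]).data   # only the index columns

assert list(index_only.columns) == []
assert (index_only.index == full.index).all()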

Commit:1acfa3a
Author:Vasil Danielov Pashov
Committer:GitHub

Read index API (#1568)

#### Reference Issues/PRs
Resolves #1150

#### What does this implement or fix?
Read only the index column of a DataFrame stored in ArcticDB. Implemented only in the V2 Library API, as part of the `read` call.

* Passing `columns=None` will return all columns in the DataFrame.
* Passing `columns=[]` will return a DataFrame containing only the index columns.

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>

Commit:488f44a
Author:William Dealtry
Committer:William Dealtry

Add binary encoding format

Commit:516d169
Author:William Dealtry
Committer:William Dealtry

Add binary encoding format

Commit:26c23f9
Author:William Dealtry
Committer:William Dealtry

Add binary encoding format

Commit:94817d8
Author:Vasil Pashov
Committer:Vasil Pashov

Comment bucketize_dynamic in the proto file

Commit:0823b57
Author:Phoebus Mak
Committer:Phoebus Mak

- Set ca path and directory automatically
- Revoke removing assert
- Remove useless ca cert path in non ssl enabled testing environment
- Address PR comment
- Better test
- Address PR comments
- Update docs/mkdocs/docs/api/arctic_uri.md

Co-authored-by: Alex Seaton <alexbseaton@gmail.com>

Commit:5dd6a64
Author:Phoebus Mak
Committer:Phoebus Mak

- Set ca path and directory automatically
- Revoke removing assert
- Remove useless ca cert path in non ssl enabled testing environment
- Address PR comment
- Better test
- Address PR comments
- Update docs/mkdocs/docs/api/arctic_uri.md

Co-authored-by: Alex Seaton <alexbseaton@gmail.com>

Commit:a5c1741
Author:Vasil Danielov Pashov
Committer:GitHub

Add empty index type and feature flag for it (#1524)

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>

Commit:86350d6
Author:Vasil Danielov Pashov
Committer:GitHub

Add empty index type and feature flag for it (#1524)

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>

Commit:dfb32de
Author:Phoebus Mak
Committer:Phoebus Mak

Add S3 ca path support

Commit:d3a98eb
Author:Phoebus Mak
Committer:Phoebus Mak

Add S3 ca path support

Commit:a14eb48
Author:Vasil Danielov Pashov
Committer:GitHub

Implement empty index for 0-rowed columns (#1429)

#### Reference Issues/PRs
Closes: #1428

#### What does this implement or fix?
Create an empty-index type. This required changes in both the Python and the C++ layer.

* In the C++ layer a new index type was added (IndexDescriptor::EMPTY). It does not allocate a field in the storage (similar to how a row-range index does not allocate a field). The checks for index compatibility are relaxed: the empty index is compatible with all other index types, and it gets overridden the first time a non-empty index is written (either through update or append). On write we check if the dataframe contains 0 rows and, if so, it gets assigned an empty index.
* The logic in the Python layer is dodgy and needs discussion. In the current state the normalization metadata and the index descriptor are stored separately. There is one proto message describing both the DateTime index and the Ranged index. The current change made it so that, in the case of 0 rows, the Python layer passes a RowRange index to the C++ layer, which checks if there are any rows in the DF. If there are rows, the row-range index is used; otherwise the empty index is used. Note the `is_not_range_index` proto field: IMO it needs some refactoring in further PRs. It's used in the Python layer to check whether the first column is an index or not.

#### Any other comments?
Merge this after: https://github.com/man-group/ArcticDB/pull/1436

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>
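For illustration only (an editor's sketch, not part of the PR), the behaviour this enables: a 0-rowed write gets the empty index, which is then overridden when real rows arrive. Library and symbol names are made up, and the behaviour depends on the empty-type/empty-index feature flag being enabled for the library (it is gated off by default in a later PR).

```python
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb:///tmp/arcticdb_demo")  # sketch only
lib = ac.get_library("demo", create_if_missing=True)

# A 0-rowed frame is written with the empty index type...
lib.write("sym", pd.DataFrame({"col": pd.Series([], dtype="float64")}))

# ...and the empty index is overridden the first time non-empty rows are appended.
update = pd.DataFrame({"col": [1.0, 2.0]},
                      index=pd.date_range("2024-01-01", periods=2))
lib.append("sym", update)
```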

Commit:33bba20
Author:William Dealtry
Committer:GitHub

Feature flag empty (#1440) Feature flag off the empty type behaviour by default, allow it to be re-enabled at the library level --------- Co-authored-by: Nick Clarke <nclarke@live.co.uk>

Commit:2e72ede
Author:William Dealtry
Committer:GitHub

Feature flag empty (#1440) Feature flag off the empty type behaviour by default, allow it to be re-enabled at the library level --------- Co-authored-by: Nick Clarke <nclarke@live.co.uk>

Commit:882e73e
Author:Muhammad Hamza Sajjad
Committer:GitHub

LMDB exception normalization with mock client (#1414)

This is the final PR for exception normalization. Previous PRs are #1411, #1360, #1344, #1304, #1297 and #1285.

#### Reference Issues/PRs
Previously, #1285 only normalized `KeyNotFoundException` and `DuplicateKeyException` correctly. As mentioned in [this comment](https://github.com/man-group/ArcticDB/pull/1285#discussion_r1474179896), we need to normalize the other LMDB-specific errors too. This PR does that. A mock client is also created to simulate the LMDB errors which aren't easily producible with real LMDB but can occur.

The ErrorCode list has been updated in `error_code.hpp`. Previously, all the storage error codes were simply sequential. This meant that each time a new error was added for a specific storage, all the other error codes needed to change, because we want to keep the codes for each storage sequential. We now leave 10 error codes for each storage, which lets us add new error codes for a storage without having to change all the others.

`::lmdb::map_full_error` is now normalized. When this error occurs, LMDB throws `LMDBMapFullException`, which is a child of `StorageException`.
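Purely to illustrate the numbering scheme described above (the names and values here are hypothetical, not the actual contents of `error_code.hpp`): reserving a block of 10 codes per storage backend lets a backend gain new codes without renumbering the others.

```python
# Hypothetical illustration of "reserve 10 error codes per storage backend".
LMDB_ERROR_BASE = 5000
MONGO_ERROR_BASE = 5010
S3_ERROR_BASE = 5020   # each backend owns its own block of 10 codes

LMDB_KEY_NOT_FOUND = LMDB_ERROR_BASE + 0
LMDB_DUPLICATE_KEY = LMDB_ERROR_BASE + 1
LMDB_MAP_FULL = LMDB_ERROR_BASE + 2  # added later without shifting MONGO_*/S3_* codes
```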

Commit:ab3caf8
Author:Muhammad Hamza Sajjad
Committer:GitHub

LMDB exception normalization with mock client (#1414)

This is the final PR for exception normalization. Previous PRs are #1411, #1360, #1344, #1304, #1297 and #1285.

#### Reference Issues/PRs
Previously, #1285 only normalized `KeyNotFoundException` and `DuplicateKeyException` correctly. As mentioned in [this comment](https://github.com/man-group/ArcticDB/pull/1285#discussion_r1474179896), we need to normalize the other LMDB-specific errors too. This PR does that. A mock client is also created to simulate the LMDB errors which aren't easily producible with real LMDB but can occur.

The ErrorCode list has been updated in `error_code.hpp`. Previously, all the storage error codes were simply sequential. This meant that each time a new error was added for a specific storage, all the other error codes needed to change, because we want to keep the codes for each storage sequential. We now leave 10 error codes for each storage, which lets us add new error codes for a storage without having to change all the others.

`::lmdb::map_full_error` is now normalized. When this error occurs, LMDB throws `LMDBMapFullException`, which is a child of `StorageException`.

Commit:3aac66c
Author:Muhammad Hamza Sajjad
Committer:GitHub

#447 Add a `MockMongoClient` which can simulate mongo failures (#1395) Similar to #1331. The mock client will then be used to test exception normalization. Wrap up common things like `StorageOperation` and `StorageFailure` into `storage_mock_client.hpp`

Commit:33368c0
Author:Muhammad Hamza Sajjad
Committer:GitHub

#447 Add a `MockMongoClient` which can simulate mongo failures (#1395) Similar to #1331. The mock client will then be used to test exception normalization. Wrap up common things like `StorageOperation` and `StorageFailure` into `storage_mock_client.hpp`

Commit:9977c79
Author:William Dealtry
Committer:William Dealtry

Metadata cache work

Commit:54a62ca
Author:William Dealtry
Committer:William Dealtry

Refactor to allow multiple segments in the same physical object

Commit:066512c
Author:William Dealtry
Committer:William Dealtry

Refactor to allow multiple segments in the same physical object

Commit:6ed0148
Author:Vasil Danielov Pashov
Committer:GitHub

Bugfix: Empty type (#1227)

#### Reference Issues/PRs
Closes #1107

#### What does this implement or fix?
Fixes how the empty type interacts with other types. In general we should be able to:

* Append other types to columns which were initially empty
* Append empty columns to columns of any other type
* Update values with the empty type, preserving the type of the column

### Changes:
* Each type handler now has a function to report the byte size of its elements.
* Each type handler now has a function to default-initialize some memory.
* The empty type handler now backfills the "empty" elements:
  * integer types -> 0 (not perfect, but in future we're planning to add a default value argument)
  * float types -> NaN
  * string types -> None
  * bool -> False
  * Nullable boolean -> None
  * Date -> NaT
* The function which does default initialization was up to now used only in dynamic schema. Now the empty handler calls it as well to do the backfill. The function first checks the POD types and, if any of them matches, uses the basic default initialization; otherwise it checks if there is a type handler and, if so, uses its default_initialize functionality.
* Refactor how updating works. `Column::truncate` is used instead of copying segment rows one by one. This should improve the performance of update.
* Add a new python fixture `lmdb_version_store_static_and_dynamic` to cover all combinations of {V1, V2} encoding x {Static, Dynamic} schema.
* Empty-typed columns are now reported as dense columns. They don't have a sparse map, and both physical and logical rows are left uninitialized (value `-1`).

**DISCUSS**

- [x] Should we add an option to support updating non-empty stuff with empty? (Conclusion reached in a Slack thread: yes, we should allow updating with None as long as the type of the output is the same as the type of the column.)
- [x] What should be the output for the following example (a column of None vs a column of 0)? (Conclusion reached in a Slack thread: the result should be [0,0], i.e. the output should have the same type as the type of the column.)

```python
lib.write("sym", pd.DataFrame({"col": [1,2,3]}))
lib.append("sym", pd.DataFrame({"col": [None, None]}))
lib.read("sym", row_range=(3, 5)).data
```

- [ ] Do we need hypothesis testing for random appends of empty and non-empty stuff to the same column?

**Dev TODO**

- [x] Verify the following throws:

```python
lib.write("sym", pd.DataFrame({"col": [None, None]}))
lib.append("sym", pd.DataFrame({"col": [1, 2, 3]}))
lib.append("sym", pd.DataFrame({"col": ["some", "string"]}))
```

- [x] Appending to empty for dynamic schema
- [x] Fix appending empty to other types with static schema
- [x] Fix appending empty to other types with dynamic schema
- [x] Create a single function to handle backfilling of data and use it both in the empty handler and in reduce_and_fix
- [x] Change the name of the PYBOOL type
- [x] Fix update, e.g.

```python
lmdb_version_store_v2.write('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
lmdb_version_store_v2.update('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
```

- [ ] Add tests for starting with an empty column list, e.g. `pd.DataFrame({"col": []})`. Potentially mark it as xfail and fix with a later PR.
- [ ] Add tests for update when the update range is not entirely contained in the dataframe range index

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>

Commit:c8a1ac3
Author:Vasil Danielov Pashov
Committer:GitHub

Bugfix: Empty type (#1227)

#### Reference Issues/PRs
Closes #1107

#### What does this implement or fix?
Fixes how the empty type interacts with other types. In general we should be able to:

* Append other types to columns which were initially empty
* Append empty columns to columns of any other type
* Update values with the empty type, preserving the type of the column

### Changes:
* Each type handler now has a function to report the byte size of its elements.
* Each type handler now has a function to default-initialize some memory.
* The empty type handler now backfills the "empty" elements:
  * integer types -> 0 (not perfect, but in future we're planning to add a default value argument)
  * float types -> NaN
  * string types -> None
  * bool -> False
  * Nullable boolean -> None
  * Date -> NaT
* The function which does default initialization was up to now used only in dynamic schema. Now the empty handler calls it as well to do the backfill. The function first checks the POD types and, if any of them matches, uses the basic default initialization; otherwise it checks if there is a type handler and, if so, uses its default_initialize functionality.
* Refactor how updating works. `Column::truncate` is used instead of copying segment rows one by one. This should improve the performance of update.
* Add a new python fixture `lmdb_version_store_static_and_dynamic` to cover all combinations of {V1, V2} encoding x {Static, Dynamic} schema.
* Empty-typed columns are now reported as dense columns. They don't have a sparse map, and both physical and logical rows are left uninitialized (value `-1`).

**DISCUSS**

- [x] Should we add an option to support updating non-empty stuff with empty? (Conclusion reached in a Slack thread: yes, we should allow updating with None as long as the type of the output is the same as the type of the column.)
- [x] What should be the output for the following example (a column of None vs a column of 0)? (Conclusion reached in a Slack thread: the result should be [0,0], i.e. the output should have the same type as the type of the column.)

```python
lib.write("sym", pd.DataFrame({"col": [1,2,3]}))
lib.append("sym", pd.DataFrame({"col": [None, None]}))
lib.read("sym", row_range=(3, 5)).data
```

- [ ] Do we need hypothesis testing for random appends of empty and non-empty stuff to the same column?

**Dev TODO**

- [x] Verify the following throws:

```python
lib.write("sym", pd.DataFrame({"col": [None, None]}))
lib.append("sym", pd.DataFrame({"col": [1, 2, 3]}))
lib.append("sym", pd.DataFrame({"col": ["some", "string"]}))
```

- [x] Appending to empty for dynamic schema
- [x] Fix appending empty to other types with static schema
- [x] Fix appending empty to other types with dynamic schema
- [x] Create a single function to handle backfilling of data and use it both in the empty handler and in reduce_and_fix
- [x] Change the name of the PYBOOL type
- [x] Fix update, e.g.

```python
lmdb_version_store_v2.write('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
lmdb_version_store_v2.update('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
```

- [ ] Add tests for starting with an empty column list, e.g. `pd.DataFrame({"col": []})`. Potentially mark it as xfail and fix with a later PR.
- [ ] Add tests for update when the update range is not entirely contained in the dataframe range index

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>

Commit:74b8192
Author:Muhammad Hamza Sajjad
Committer:GitHub

Add a `MockAzureClient` which can simulate azure failures (#1331)

This is similar to #1281.

- Adds a config option to use the `MockAzureClient` instead of the `RealAzureClient`.
- If a `MockAzureClient` is created, one can simulate failures by passing the appropriate symbol name.
- Creates tests to ensure that `MockAzureClient` works properly for the `read`, `write`, `update`, `delete` and `list` functions.

To do: test exceptions thrown by `MockAzureClient`. This will be done in the next PR, where we normalize Azure exceptions.

Commit:3b7199e
Author:Muhammad Hamza Sajjad
Committer:GitHub

Add a `MockAzureClient` which can simulate azure failures (#1331)

This is similar to #1281.

- Adds a config option to use the `MockAzureClient` instead of the `RealAzureClient`.
- If a `MockAzureClient` is created, one can simulate failures by passing the appropriate symbol name.
- Creates tests to ensure that `MockAzureClient` works properly for the `read`, `write`, `update`, `delete` and `list` functions.

To do: test exceptions thrown by `MockAzureClient`. This will be done in the next PR, where we normalize Azure exceptions.

Commit:a48f3b0
Author:William Dealtry

Saving stuff before build server decom

Commit:1bf36d5
Author:Ivo Dilov
Committer:IvoDD

Adds a MockS3Client which can simulate s3 failures

- Adds a config option to use the MockS3Client instead of the RealS3Client
- If a MockS3Client is created, one can simulate failures by passing the appropriate symbol name
- Adds tests which run various s3 storage failure scenarios
- The tests expose two issues with s3 which will be fixed in a follow up commit

Commit:d68ff12
Author:Ivo Dilov
Committer:IvoDD

Adds a MockS3Client which can simulate s3 failures

- Adds a config option to use the MockS3Client instead of the RealS3Client
- If a MockS3Client is created, one can simulate failures by passing the appropriate symbol name
- Adds tests which run various s3 storage failure scenarios
- The tests expose two issues with s3 which will be fixed in a follow up commit

Commit:7b62f5c
Author:William Dealtry

Compiles

Commit:7c00e61
Author:William Dealtry

Experiments with multi-segment

Commit:be197f2
Author:willdealtry
Committer:William Dealtry

Multisegment WIP

Commit:4910d3e
Author:Georgi Petrov
Committer:GitHub

Add a way to handle non-string values for index names (#1170)

#### Reference Issues/PRs
Fixes #1119

#### What does this implement or fix?
Add a way to handle non-string index names.
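To make "non-string index names" concrete, an editor's sketch (not code from the PR; library and symbol names are made up, and the exact roundtripped value depends on this fix being present):

```python
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb:///tmp/arcticdb_demo")  # sketch only
lib = ac.get_library("demo", create_if_missing=True)

# Index whose name is an integer rather than a string.
df = pd.DataFrame({"col": [1, 2, 3]},
                  index=pd.date_range("2024-01-01", periods=3, name=42))
lib.write("int_index_name", df)

print(lib.read("int_index_name").data.index.name)  # expected: 42 with this fix
```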

Commit:53507ad
Author:Georgi Petrov
Committer:GitHub

Add a way to handle non-string values for index names (#1170)

#### Reference Issues/PRs
Fixes #1119

#### What does this implement or fix?
Add a way to handle non-string index names.

Commit:f87bf2a
Author:Vasil Danielov Pashov
Committer:GitHub

Dev/vasil.pashov/add nullablebools pyarray (#819)

#### Reference Issues/PRs
Closes #267
Closes #268

#### What does this implement/fix? How does it work (high level)? Highlight notable design decisions.
Extend the C++ code so that it can store on disc, and read back, columns containing NumPy arrays and columns containing nullable booleans. Arrays stored in columns can only be of type int or float.

How things work:

1. Nullable booleans are represented as a sparse column. The bitmagic library is used to track which rows have a value and which rows are null. This works with regular Python arrays, NumPy arrays of dtype bool and pandas arrays of dtype bool. The encoding end of the code is placed in `cpp/arcticdb/pipeline/frame_utils.hpp::aggregator_set_data`; the decoding end is placed in `cpp/arcticdb/python/python_handlers.cpp::BoolHandler::handle_type`. **Note** `pandas` and `numpy` arrays with `dtype=bool` interpret `None` as `False`.
2. Array columns support only numeric types (i.e. types for which `types.hpp::is_numeric_type` returns true). TypeDescriptors of dimensionality 1 denote an array. EMPTYVAL of dimensionality 1 denotes a segment full of empty arrays. EMPTYVAL DIM1 can be appended to any other Dim1 column, and any other Dim1 column can be appended to a column of type EMPTYVAL Dim1; when the latter happens, the type of the column is changed. A column of type EMPTYVAL Dim1 will allocate space only for its bitmap. A column can contain arrays of different sizes, but the type of each array must be the same. The entries of the column arrays are stored as contiguous data in the blocks of values. In order to keep track of the sizes of each row, additional data called shapes is stored: the i-th element of the shapes array stores the number of elements in the array at the i-th row of an array-typed column. The encoding end of the code can be found in `cpp/arcticdb/pipeline/frame_utils.hpp::aggregator_set_data`; the decoding end can be found in `cpp/arcticdb/python/python_handlers.cpp::ArrayHandler::handle_type`.

Other improvements:

1. Enable ASAN, TSAN and UBSAN (tested only on Linux with Clang and LLD).
2. EMPTYVAL used to store one byte because the compression library needed it to denote that there is nothing encoded. Now both EMPTYVAL Dim0 and EMPTYVAL Dim1 store nothing.
3. Make the magic word print the word as a string (it used to print only a number, which was hard to understand).

#### Any other comments?
*Problems found*

1. Some of the data we store is stored as multidimensional data: a column of 1 row and n consecutive bytes of data (Dim1 uint8). This includes the metadata and the encoded fields. In future we can change the way we store arbitrary data, but we should always be able to read the data stored using this (weird) format. The class `BytesEncoder` in `codec.cpp` is made to abstract that, because several places in the code had the logic copy-pasted.
2. Encoding V1 and V2 behave differently when there are multiple blocks of multidimensional data.
   * Encoding V1 has a number of shape blocks equal to the number of value blocks. Each value block is preceded by a shape block storing the shapes for that particular block. In addition, the shapes are encoded with the same encoding options as the values. However, there is no obvious correlation between the shapes and the values.
   * Encoding V2 assumes that only one block of shapes will be present for multidimensional data, and that that block will be placed before all value blocks. The code, however, did not do that. In order to fix this, the following was done:
     * There is a new column encoder `ColumnEncoder_v2` which gathers all the shapes and encodes them in a single block placed before all value blocks. `ColumnEncoder` now takes the encoding template parameter and uses either `ColumnEncoderV1` or `ColumnEncoderV2` based on it.
     * There is a new class `GenericBlockEncoderV2` used for V2 encoding. The old class did too many things, as it encoded both the shapes and the values in a single function call. The new class does not care whether it encodes shapes or values; it is more generic and just encodes one block of memory.
     * `TypedBlockEncoderImpl` was extended with a template parameter (the encoding version). Everything stays the same for V1 encoding. V2 encoding, however, requires one call to `encode_shapes` (in the case of multidimensional data) and several calls to encode the values.
     * The encoding version from the segment header is propagated to all functions which do decoding. Since the multidimensional data is not compatible between the two encoding versions, we need to know how things were encoded.
     * Most of the functions in `codec.cpp` now have a template parameter `EncodingVersion`, because `max_compressed_size` and the encoding are different between the different versions.

*Things that work, but can be done better*

1. `TypedBlockEncoderImpl` has a different interface for V1 and V2 encodings. If V2 encoding is used you _can't_ just use this:

```c++
using Encoder = TypedBlockEncoderImpl<TDT, EncodingVersion::V2>;
Encoder::encode_shapes
Encoder::encode_values
```

because values and shapes are of different types. The way it works with specific encoder classes and passing them as template parameters to `GenericBlockEncoder` also feels off.

2. `GenericBlockEncoder` (both 1 and 2) do much more than encode the data. They compute hashes and set a bunch of parameters on the encoded field. With V1 it's somewhat better, because they set most of the members. With V2, however, the `items_count` member cannot be set (because if a shape block is passed it will mess it up; items_count denotes how many "rows" of data there are in the block). I would make both encoders return the data as a struct and leave it to the caller to set everything.
3. `BytesEncoder` can be changed. It's weird to have the data as a multidimensional type. It would be easier if it were of Dim0 and uint8 type.

Co-authored-by: willdealtry <william.dealtry@gmail.com>
Co-authored-by: Vasil Pashov <Vasil.Pashov@man.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Vasil Pashov <vasil@alfie.local>
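To make the user-facing effect concrete, an editor's sketch (not code from the PR) of writing a column of NumPy arrays and a nullable-boolean column; library and symbol names are made up, and this assumes a build where these features are available and enabled:

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb:///tmp/arcticdb_demo")  # sketch only
lib = ac.get_library("demo", create_if_missing=True)

df = pd.DataFrame({
    # Ragged arrays, all of a single numeric dtype per column.
    "arrays": [np.array([1.0, 2.0]), np.array([3.0]), np.array([])],
    # Nullable booleans: None is tracked via the sparse bitmap described above.
    "maybe_flag": pd.array([True, None, False], dtype="boolean"),
})
lib.write("arrays_and_nullable_bools", df)
print(lib.read("arrays_and_nullable_bools").data)
```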

Commit:904b0e5
Author:Vasil Danielov Pashov
Committer:GitHub

Dev/vasil.pashov/add nullablebools pyarray (#819) <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> #### Reference Issues/PRs Closes #267 Closes #268 <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue --> #### What does this implement/fix? How does it work (high level)? Highlight notable design decisions. Extend the C++ code so that it can store on disc and read back columns containing NumPy arrays and columns containing nullable bolleans. Arrays stored in columns can be only of type int or float. How things work: 1. Nullable booleans are represented as sparse column. The bitmagic library is used to track which row has value and which row is null. This works with regular python arrays, numpy arrays of dtype bool and pandas arrays of dtype bool. The encoding end of the code is placed in `cpp/arcticdb/pipeline/frame_utils.hpp::aggregator_set_data` the decoding end is placed in `cpp/arcticdb/python/python_handlers.cpp::BoolHandler::handle_type`. **Note** `pandas` and `numpy` arrays with `dtype=bool` interpret `None` as `False`. 2. Array column support only numeric types (i.e. types for which `types.hpp::is_numeric_type` returns true). TypeDescriptors of dimensionality 1 denote an array. EMPTYVAL of dimensionality 1 denotes a segments full of empty arrays. EMPTYVAL DIM1 can be appended to any other Dim1 columns and any other Dim1 column can be appended to a column of type Dim1. When the later happens the type of the column will be changed. Column of type EMPTYVAL Dim1 will allocate space only for its bitmap. A column can contain arrays of different sizes, but the type of each array must be the same. The entries of the column arrays are stored as contiguous data in the blocks of values. In order to keep track of the sizes of each row additional data called shapes is stored. The i-th element of the shapes array stores the number of elements in the array at the i-th row of an array typed column. The encoding end of the code can be found in: `cpp/arcticdb/pipeline/frame_utils.hpp::aggregator_set_data` the decoding end can be found in `cpp/arcticdb/python/python_handlers.cpp::ArrayHandler::handle_type` Other improvements: 1. Enable ASAN, TSAN and UBSAN (tested only on Linux with Clang and LLD) 2. EMPTYVAL used to store one byte because the compression library needed it to denote the is nothing encoded. Now both EMPTYVAL Dim 0 and EMPTYVAL Dim1 store nothing. 3. Make magic word print the word as a string (it used to print only a number which has hard to understand) #### Any other comments? *Problems found* 1. Some of the data we store is stored as multidimensional data, a column of 1 row and n consecutive bytes of data (Dim1 uint8). This includes the metadata and the encoded fields. In future we can change the way we store arbitrary data, but we should always be able to read the data stored using this (weird) format. 
The class `BytesEncoder` in `codec.cpp` is made to abstract that, because several places in the code had the logic copy-pasted. 5. Encoding V1 and V2 behave differently when there are multiple blocks of multidimensional data. * Encoding V1 has number of shape blocks equal to the number of value blocks. Each value block is preceded by a shape block storing the shapes for that particular block. In addition to that the shapes are encoded with the same encoding options as the values. However there is no obvious correlation between shapes and the values. * Encoding V2 assumes that only one block of shapes will be present for multidimensional data and that block will be placed before all values block. The code however did not do that. In order to fix this the following was done: * There is a new column encoder `ColumnEncoder_v2` which gathers all the shapes and encodes them in a single block placed before all value blocks. `ColumnEncoder` now takes the encoding template parameter and uses either `ColumnEncoderV1` or `ColumnEncoderV2` based on it. * There is a new class `GenericBlockEncoderV2` used for V2 encoding. The old class did too much things as it encoded both the shapes and the values in a single function call. The new class does not care whether it encodes shapes or values. It is more generic and just encodes one block of memory. * `TypedBlockEncoderImpl` was extended with a template parameter (the encoding version). Everything stays the same for V1 encoding. V2 encoding however requires one call to `encode_shapes` (in case of multidimensional data) and several calls to encode the values. * The encoding version from the segment header is propagated to all functions which do decoding. Since the multidimensional data is not compatible between the two encoding version, we need to know how were things encoded. * Most of the functions in `codec.cpp` now have a template parameter `EncodingVersion`, because `max_compressed_size` and the encoding are different between the different versions. *Things that work, but can be done better* 1. `TypedBlockEncoderImpl` has different interface for V1 and V2 encodings. If V2 encoding is used you _can't_ just use this: ```c++ using Encoder = TypedBlockEncoderImpl<TDT, EncodingVersion::V2>; Encoder::encode_shapes Encoder::encode_values ``` because values and shapes are of different types. The way it works with specific encoder classes and passing them as template parameters to `GenericBlockEncoder` also feels off. 2. `GenericBlockEncoder` (both 1 and 2) do much more than encode the data. They compute hashes and set a bunch of parameters to the encoded field. With V1 it''s somewhat better because they set most of the member. With V2, however the `items_count` member cannot be set (because if shape block is passed it will mess it up, items_count denotes how many "rows" of data are there in the block). I would make both encoders return the data as a struct and leave it to the caller to set everything. 6. `BytesEncoder` can be changed. It's weird to have the data as multidimensional type. It would be easier if it was of Dim0 and uint8 type. #### Checklist <details> <summary> Checklist for code changes... </summary> - [ ] Have you updated the relevant docstrings and documentation? - [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [ ] Are API changes highlighted in the PR description? 
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details> --------- Co-authored-by: willdealtry <william.dealtry@gmail.com> Co-authored-by: Vasil Pashov <Vasil.Pashov@man.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Vasil Pashov <vasil@alfie.local>
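The values-plus-shapes layout for array columns, and the `None`-to-`False` coercion noted for `dtype=bool`, can be sketched with plain NumPy/pandas. This is an illustrative sketch only, not ArcticDB code; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# A column of arrays with differing lengths ("ragged" rows).
rows = [np.array([1, 2, 3]), np.array([4]), np.array([5, 6])]

# Entries are stored contiguously; the i-th shape entry records how many
# elements belong to the array at the i-th row.
values = np.concatenate(rows)               # [1 2 3 4 5 6]
shapes = np.array([len(r) for r in rows])   # [3 1 2]

# Rows can be reconstructed from values + shapes alone.
offsets = np.cumsum(shapes)[:-1]
restored = np.split(values, offsets)
assert all(np.array_equal(a, b) for a, b in zip(rows, restored))

# The PR's note on nullable booleans: with dtype=bool, None collapses to False,
# whereas pandas' nullable "boolean" dtype keeps the missing value.
print(np.array([True, None, False], dtype=bool))       # [ True False False]
print(pd.array([True, None, False], dtype="boolean"))  # [True, <NA>, False]
```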

Commit:aa6ffa4
Author:Georgi Petrov

Add demo support for OTEL

Commit:d2549dd
Author:Phoebus Mak
Committer:Phoebus Mak

Add preliminary change for slowdown error test

Commit:d757df9
Author:Phoebus Mak
Committer:Phoebus Mak

Add preliminary change for slowdown error test

Commit:ae6f8fe
Author:Phoebus Mak
Committer:Phoebus Mak

Add preliminary change for slowdown error test

Commit:7e23fdb
Author:Joe Iddon
Committer:GitHub

Initial rocksdb work (#945) <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> #### Reference Issues/PRs Closes #961 (development PR) > [!WARNING] > ~~Whoever creates the next release must add `rocksdb` to the feedstock `meta.yaml`: https://github.com/conda-forge/arcticdb-feedstock/blob/main/recipe/meta.yaml#L58~~ > We have decided not to release on conda, just PyPI, so nothing needs to change with the feedstock. ``` - azure-identity-cpp - azure-storage-blobs-cpp - rocksdb // add this line ``` <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue --> #### What does this implement/fix? How does it work (high level)? Highlight notable design decisions. Implements a first cut of the new RocksDB backend. This means some C++ tests to verify the backend is working correctly, but no user-facing Python API code so that we can get the dependencies merged before working on the interface. Notable design points: - Key/value pairs are grouped by key type into what RocksDB calls Column Families. - Column families are stored as pointers to handling objects. These need to be deleted before the database is deleted, in a particular order, and in the case of the column family handles via the function `DestroyColumnFamilyHandle`. For these reasons raw pointers are used over unique ptrs, and these are deleted explicitly in the `RocksDBStorage` destructor. It's possible that, with careful ordering of the definitions of unique ptrs, this could be engineered to work, especially since `DestroyColumnFamilyHandle` really just calls `delete` internally. The raw pointer method matches the intended API though, which is neat. - Unfortunately the interface for writing to the DB requires a copy. We must pass a pre-populated `Slice` object which already points to the memory. - For reading, [it is possible to use](https://github.com/facebook/rocksdb/wiki/Basic-Operations) a `PinnableSlice` which guarantees a portion of memory's existence until the `PinnableSlice` object goes out of scope, so that our `Segment::from_bytes` function could operate directly on that and avoid a copy. **This has not been implemented yet.** Instead we ask `rocksdb` to write to a `std::string` which it internally must do a copy for, and then we read from that. - Instead of replicating the C++ tests for lmdb to run on rocksdb too (and really also the in-memory backend), this PR improved the `lmdb` tests to run across all three backends. Specifically, parameterised backend generator objects are used to generate a fresh backend instance for each unit test. At the end of the tests a `TearDown` function deletes any left-over files from either `lmdb` or `rocksdb`. - Modifies the `build.yml`, `build_steps.yml` and `setup.py` scripts to make the `cpp/vcpkg/packages` directory be symlinked to the `C:\` drive so that there is sufficient space for the build.
It is believed that unzipping the `.zip` from NuGet is done into that directory, and then copied to the `cpp/out/*-build/vcpkg_installed/x64-windows-static/lib` subdirectories. Originally it was thought that vcpkg could be configured to instead put `vcpkg_installed` on the `C:\` drive, but this was not sufficient. The settings for enabling this are also left in place, but set to `""` to default to the `D:\` drive as it was before. - Conforms to the recent changes to the backend file formats: https://github.com/man-group/ArcticDB/pull/949 #### Any other comments? The difficulties in adding the `rocksdb` dependency are summarised below: Operating System | vcpkg / pypi | conda --- | --- | --- Linux | Worked fine 👍 | Worked fine 👍 Windows | Works fine locally 👍, but not on GitHub CI. Passes on CI if the `D:\` drive is allocated 100GB on a random spot basis rather than the usual 14GB, so potentially a space issue*. The CMake error is simply that `cmake.exe` just fails when building the `rocksdb` dependency in the directory: `D:/a/ArcticDB/ArcticDB/cpp/vcpkg/buildtrees/rocksdb/x64-windows-static-dbg`. @qc00 suggested first to increase `nugettimeout` **(see below). This did not work. So instead we put the `vcpkg_installed` directory onto the much larger `C:\` drive, which was implemented with an env variable which is passed through the CI into `setup.py`. This still did not fix the error, so I also symlinked the `cpp/vcpkg/packages` directory to the `C:` drive, which worked. Update: a new `ccache.exe` issue also needed fixing. ***(see below). | Not supported Mac | Not supported | Created PR #961 to work on this, but have just made the two edits in this branch manually. In summary: The issue arises from `rocksdb` depending on `lz4`, which of course `arcticdb` also depends on. This would be fine if `lz4` used `lz4Config.cmake` scripts, but instead both `arcticdb` and `rocksdb` define `FindLZ4.cmake` and `Findlz4.cmake` scripts, respectively, for resolving their dependencies. For Linux there are no problems, but on Mac, `rocksdb` calls `arcticdb`'s `FindLZ4.cmake` script which only sets `LZ4_FOUND`, and not `lz4_FOUND`. In the linked PR I propose changes to our script to set this too, and create an alias linking target for the lowercase namespace `lz4::lz4` as well as `LZ4::LZ4`. This worked. *This is confirmed from the logs. Unfortunately the error message saying that the `D:\` drive is running out of space is hidden from the normal logs page. This would've helped to debug the Windows build much faster if I had spotted it, and wouldn't have needed @qc00 to guess at the cause. The normal logs don't show anything: ``` -- Building x64-windows-static-dbg -- Building x64-windows-static-rel CMake Error at vcpkg/scripts/buildsystems/vcpkg.cmake:893 (message): vcpkg install failed. See logs for more information: D:\a\ArcticDB\ArcticDB\cpp\out\windows-cl-release-build\vcpkg-manifest-install.log ``` But the GitHub "raw logs" do show the helpful warning: ``` 2023-10-15T17:40:53.5569564Z -- Building x64-windows-static-dbg 2023-10-15T17:48:28.3694197Z -- Building x64-windows-static-rel 2023-10-15T17:53:58.7684829Z ##[warning]You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 99 MB 2023-10-15T17:54:43.7209503Z CMake Error at vcpkg/scripts/buildsystems/vcpkg.cmake:893 (message): 2023-10-15T17:54:43.7210612Z vcpkg install failed.
See logs for more information: 2023-10-15T17:54:43.7231992Z D:\a\ArcticDB\ArcticDB\cpp\out\windows-cl-release-build\vcpkg-manifest-install.log ``` The `##[warning]` message also shows in the Summary section of the CI: ![image](https://github.com/man-group/ArcticDB/assets/144031717/7c3c1571-4177-4043-ae7b-db24b5c4a9a2) Note that it might be interesting to see if this warning would've shown in any of the logs: ``` D:\a\ArcticDB\ArcticDB\cpp\vcpkg\buildtrees\rocksdb\install-x64-windows-static-dbg-out.log D:\a\ArcticDB\ArcticDB\cpp\vcpkg\buildtrees\rocksdb\install-x64-windows-static-dbg-err.log D:\a\ArcticDB\ArcticDB\cpp\out\windows-cl-release-build\vcpkg-manifest-install.log ``` but manually `print()`ing these in the `finally:` clause of `setup.py` did not show anything which would've helped to debug this problem faster. **The size of the rocksdb `.zip` archive, which is either stored locally on Windows or uploaded to GitHub Packages via NuGet, is around 380 MB. When this is unzipped and installed it creates a `lib` of size 980 MB. For reference, `arrow.lib` is the next largest at 430 MB. If it is unable to download the `.zip` from NuGet (which we tried to help it to do by increasing the `nugettimeout` parameter!), then it will try to build it locally under `cpp/vcpkg/buildtrees/rocksdb`. The first step to doing this is to download the source code from `https://github.com/facebook/rocksdb/archive/v8.0.0.tar.gz` which is easy enough. Locally though, the build is as large as 4.7 GB, so this will also fill up the D:\ drive. Trying to symlink this to somewhere on the C:\ drive on the GitHub CI is not an option because it will make building rocksdb from source super slow as C:\ is a network drive. ***`ccache.exe` issue: When [GitHub updated the `windows-latest` virtual env image](https://github.com/actions/runner-images/commit/32c795e64ca31dded26ab79c86bd9e3a6d542d74), this led to `ccache.exe` being added at `C:/Strawberry/c/bin/ccache.exe`, which caused RocksDB to try to use this when building from scratch. (Incidentally, it did this rather than retrieving the package previously cached to NuGet because the compiler hash changed. @qc00 is working on a change to stop adding the compiler hash to [the artifact](https://github.com/man-group/ArcticDB/pkgs/nuget/rocksdb_x64-windows-static).) Anyway we don't need to use `ccache`, and moreover this was breaking with `CreateProcess failed` since, for some reason, although `ccache.exe` was in the PATH, CMake could not then manage to use it. The fix was to add a step to the workflow to delete `ccache.exe` since we don't use it. (We instead use sccache.) #### Checklist <details> <summary> Checklist for code changes... </summary> - [x] Have you updated the relevant docstrings, documentation and copyright notice? - [x] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [x] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [x] Are API changes highlighted in the PR description? - [x] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details>

Commit:1c00c3d
Author:Joe Iddon
Committer:GitHub

Initial rocksdb work (#945) <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> #### Reference Issues/PRs Closes #961 (development PR) > [!WARNING] > ~~Whoever creates the next release must add `rocksdb` to the feedstock `meta.yaml`: https://github.com/conda-forge/arcticdb-feedstock/blob/main/recipe/meta.yaml#L58~~ > We have decided not to release on conda, just PyPI, so nothing needs to change with the feedstock. ``` - azure-identity-cpp - azure-storage-blobs-cpp - rocksdb // add this line ``` <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue --> #### What does this implement/fix? How does it work (high level)? Highlight notable design decisions. Implements a first cut of the new RocksDB backend. This means some C++ tests to verify the backend is working correctly, but no user-facing Python API code so that we can get the dependencies merged before working on the interface. Notable design points: - Key/value pairs are grouped by key type into what RocksDB calls Column Families. - Column families are stored as pointers to handling objects. These need to be deleted before the database is deleted, in a particular order, and in the case of the column family handles via the function `DestroyColumnFamilyHandle`. For these reasons raw pointers are used over unique ptrs, and these are deleted explicitly in the `RocksDBStorage` destructor. It's possible that, with careful ordering of the definitions of unique ptrs, this could be engineered to work, especially since `DestroyColumnFamilyHandle` really just calls `delete` internally. The raw pointer method matches the intended API though, which is neat. - Unfortunately the interface for writing to the DB requires a copy. We must pass a pre-populated `Slice` object which already points to the memory. - For reading, [it is possible to use](https://github.com/facebook/rocksdb/wiki/Basic-Operations) a `PinnableSlice` which guarantees a portion of memory's existence until the `PinnableSlice` object goes out of scope, so that our `Segment::from_bytes` function could operate directly on that and avoid a copy. **This has not been implemented yet.** Instead we ask `rocksdb` to write to a `std::string` which it internally must do a copy for, and then we read from that. - Instead of replicating the C++ tests for lmdb to run on rocksdb too (and really also the in-memory backend), this PR improved the `lmdb` tests to run across all three backends. Specifically, parameterised backend generator objects are used to generate a fresh backend instance for each unit test. At the end of the tests a `TearDown` function deletes any left-over files from either `lmdb` or `rocksdb`. - Modifies the `build.yml`, `build_steps.yml` and `setup.py` scripts to make the `cpp/vcpkg/packages` directory be symlinked to the `C:\` drive so that there is sufficient space for the build.
It is believed that unzipping the `.zip` from NuGet is done into that directory, and then copied to the `cpp/out/*-build/vcpkg_installed/x64-windows-static/lib` subdirectories. Originally it was thought that vcpkg could be configured to instead put `vcpkg_installed` on the `C:\` drive, but this was not sufficient. The settings for enabling this are also left in place, but set to `""` to default to the `D:\` drive as it was before. - Conforms to the recent changes to the backend file formats: https://github.com/man-group/ArcticDB/pull/949 #### Any other comments? The difficulties in adding the `rocksdb` dependency are summarised below: Operating System | vcpkg / pypi | conda --- | --- | --- Linux | Worked fine 👍 | Worked fine 👍 Windows | Works fine locally 👍, but not on GitHub CI. Passes on CI if the `D:\` drive is allocated 100GB on a random spot basis rather than the usual 14GB, so potentially a space issue*. The CMake error is simply that `cmake.exe` just fails when building the `rocksdb` dependency in the directory: `D:/a/ArcticDB/ArcticDB/cpp/vcpkg/buildtrees/rocksdb/x64-windows-static-dbg`. @qc00 suggested first to increase `nugettimeout` **(see below). This did not work. So instead we put the `vcpkg_installed` directory onto the much larger `C:\` drive, which was implemented with an env variable which is passed through the CI into `setup.py`. This still did not fix the error, so I also symlinked the `cpp/vcpkg/packages` directory to the `C:` drive, which worked. Update: a new `ccache.exe` issue also needed fixing. ***(see below). | Not supported Mac | Not supported | Created PR #961 to work on this, but have just made the two edits in this branch manually. In summary: The issue arises from `rocksdb` depending on `lz4`, which of course `arcticdb` also depends on. This would be fine if `lz4` used `lz4Config.cmake` scripts, but instead both `arcticdb` and `rocksdb` define `FindLZ4.cmake` and `Findlz4.cmake` scripts, respectively, for resolving their dependencies. For Linux there are no problems, but on Mac, `rocksdb` calls `arcticdb`'s `FindLZ4.cmake` script which only sets `LZ4_FOUND`, and not `lz4_FOUND`. In the linked PR I propose changes to our script to set this too, and create an alias linking target for the lowercase namespace `lz4::lz4` as well as `LZ4::LZ4`. This worked. *This is confirmed from the logs. Unfortunately the error message saying that the `D:\` drive is running out of space is hidden from the normal logs page. This would've helped to debug the Windows build much faster if I had spotted it, and wouldn't have needed @qc00 to guess at the cause. The normal logs don't show anything: ``` -- Building x64-windows-static-dbg -- Building x64-windows-static-rel CMake Error at vcpkg/scripts/buildsystems/vcpkg.cmake:893 (message): vcpkg install failed. See logs for more information: D:\a\ArcticDB\ArcticDB\cpp\out\windows-cl-release-build\vcpkg-manifest-install.log ``` But the GitHub "raw logs" do show the helpful warning: ``` 2023-10-15T17:40:53.5569564Z -- Building x64-windows-static-dbg 2023-10-15T17:48:28.3694197Z -- Building x64-windows-static-rel 2023-10-15T17:53:58.7684829Z ##[warning]You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 99 MB 2023-10-15T17:54:43.7209503Z CMake Error at vcpkg/scripts/buildsystems/vcpkg.cmake:893 (message): 2023-10-15T17:54:43.7210612Z vcpkg install failed.
See logs for more information: 2023-10-15T17:54:43.7231992Z D:\a\ArcticDB\ArcticDB\cpp\out\windows-cl-release-build\vcpkg-manifest-install.log ``` The `##[warning]` message also shows in the Summary section of the CI: ![image](https://github.com/man-group/ArcticDB/assets/144031717/7c3c1571-4177-4043-ae7b-db24b5c4a9a2) Note that it might be interesting to see if this warning would've shown in any of the logs: ``` D:\a\ArcticDB\ArcticDB\cpp\vcpkg\buildtrees\rocksdb\install-x64-windows-static-dbg-out.log D:\a\ArcticDB\ArcticDB\cpp\vcpkg\buildtrees\rocksdb\install-x64-windows-static-dbg-err.log D:\a\ArcticDB\ArcticDB\cpp\out\windows-cl-release-build\vcpkg-manifest-install.log ``` but manually `print()`ing these in the `finally:` clause of `setup.py` did not show anything which would've helped to debug this problem faster. **The size of the rocksdb `.zip` archive, which is either stored locally on Windows or uploaded to GitHub Packages via NuGet, is around 380 MB. When this is unzipped and installed it creates a `lib` of size 980 MB. For reference, `arrow.lib` is the next largest at 430 MB. If it is unable to download the `.zip` from NuGet (which we tried to help it to do by increasing the `nugettimeout` parameter!), then it will try to build it locally under `cpp/vcpkg/buildtrees/rocksdb`. The first step to doing this is to download the source code from `https://github.com/facebook/rocksdb/archive/v8.0.0.tar.gz` which is easy enough. Locally though, the build is as large as 4.7 GB, so this will also fill up the D:\ drive. Trying to symlink this to somewhere on the C:\ drive on the GitHub CI is not an option because it will make building rocksdb from source super slow as C:\ is a network drive. ***`ccache.exe` issue: When [GitHub updated the `windows-latest` virtual env image](https://github.com/actions/runner-images/commit/32c795e64ca31dded26ab79c86bd9e3a6d542d74), this led to `ccache.exe` being added at `C:/Strawberry/c/bin/ccache.exe`, which caused RocksDB to try to use this when building from scratch. (Incidentally, it did this rather than retrieving the package previously cached to NuGet because the compiler hash changed. @qc00 is working on a change to stop adding the compiler hash to [the artifact](https://github.com/man-group/ArcticDB/pkgs/nuget/rocksdb_x64-windows-static).) Anyway we don't need to use `ccache`, and moreover this was breaking with `CreateProcess failed` since, for some reason, although `ccache.exe` was in the PATH, CMake could not then manage to use it. The fix was to add a step to the workflow to delete `ccache.exe` since we don't use it. (We instead use sccache.) #### Checklist <details> <summary> Checklist for code changes... </summary> - [x] Have you updated the relevant docstrings, documentation and copyright notice? - [x] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [x] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [x] Are API changes highlighted in the PR description? - [x] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details>

Commit:2cfabfc
Author:Vasil Pashov
Committer:Vasil Pashov

Add py arrays

Commit:0196449
Author:Vasil Pashov
Committer:Vasil Pashov

Add handler for numpyarray

Commit:a794a41
Author:Vasil Pashov
Committer:Vasil Pashov

Add nullable boolean

Commit:be056ac
Author:Joe Iddon
Committer:GitHub

Memory backed api (#860) <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> #### Reference Issues/PRs <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue --> #### What does this implement/fix? How does it work (high level)? Highlight notable design decisions. Implements a memory-backed API for the top-level `Arctic()` instance. Intended usage: ```python >>> import pandas as pd >>> ac = Arctic("mem://memory_db") # this name is passed through to the C++ storage, but ignored there >>> ac.create_library("lib") ... ``` Test definition files which now also verify `mem://`: - `test_arctic_batch.py` - added via extending `arctic_library`/`arctic_client` - `test_arctic.py` - added again via `arctic_library`/`arctic_client` - `test_basic_version_store.py` - parameterised all `lmdb_version_store` fixtures to test the memory adapter too* - `test_stress_generated.py` - parameterised `lmdb_version_store_tiny_segment` to test the memory adapter too. \* I did not extend the more specific test fixtures like `lmdb_version_store_tombstone_and_sync_passive`. These add an extra 15 seconds on top of times around 3 mins for the large test files. #### Any other comments? The timings for the stress test (`test_stress_small_row` in `test_stress_generated.py`) are: Adapter type | Write / sec | Read / sec --- | --- | --- lmdb | 97 | 5 mem | 11 | 10 The test is for 1MB of data. The results for the high entropy and low entropy times were within a few seconds of each other. (High entropy has completely random data, low entropy has blocks of repeated rows: aaacccbbb.) At the moment the name given is unused. #### Checklist <details> <summary> Checklist for code changes... </summary> - [x ] Have you updated the relevant docstrings and documentation? - [x ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [x ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [x ] Are API changes highlighted in the PR description? - [x ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details>
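A minimal sketch of the backend parameterisation described above, assuming the public `arcticdb` API (`Arctic`, `create_library`, `write`, `read`); the fixture and test names are hypothetical, not the ones used in the repository's test suite.

```python
import pandas as pd
import pytest
from arcticdb import Arctic


# One fixture, two backends: the same test body runs against both the LMDB
# and the in-memory ("mem://") adapters.
@pytest.fixture(params=["lmdb://{tmp}", "mem://test_db"])
def arctic_client(request, tmp_path):
    uri = request.param.format(tmp=tmp_path / "arctic")
    return Arctic(uri)


def test_round_trip(arctic_client):
    lib = arctic_client.create_library("lib")
    df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
    lib.write("sym", df)
    pd.testing.assert_frame_equal(lib.read("sym").data, df)
```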

Commit:907eabe
Author:Joe Iddon
Committer:GitHub

Memory backed api (#860) <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> #### Reference Issues/PRs <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue --> #### What does this implement/fix? How does it work (high level)? Highlight notable design decisions. Implements a memory-backed API for the top-level `Arctic()` instance. Intended usage: ```python >>> import pandas as pd >>> ac = Arctic("mem://memory_db") # this name is passed through to the C++ storage, but ignored there >>> ac.create_library("lib") ... ``` Test definition files which now also verify `mem://`: - `test_arctic_batch.py` - added via extending `arctic_library`/`arctic_client` - `test_arctic.py` - added again via `arctic_library`/`arctic_client` - `test_basic_version_store.py` - parameterised all `lmdb_version_store` fixtures to test the memory adapter too* - `test_stress_generated.py` - parameterised `lmdb_version_store_tiny_segment` to test the memory adapter too. \* I did not extend the more specific test fixtures like `lmdb_version_store_tombstone_and_sync_passive`. These add an extra 15 seconds on top of times around 3 mins for the large test files. #### Any other comments? The timings for the stress test (`test_stress_small_row` in `test_stress_generated.py`) are: Adapter type | Write / sec | Read / sec --- | --- | --- lmdb | 97 | 5 mem | 11 | 10 The test is for 1MB of data. The results for the high entropy and low entropy times were within a few seconds of each other. (High entropy has completely random data, low entropy has blocks of repeated rows: aaacccbbb.) At the moment the name given is unused. #### Checklist <details> <summary> Checklist for code changes... </summary> - [x ] Have you updated the relevant docstrings and documentation? - [x ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [x ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [x ] Are API changes highlighted in the PR description? - [x ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details>

Commit:9bd0d83
Author:Julien Jerphanion
Committer:GitHub

maint: Remove unused protocol buffers' definition (#856)

Commit:81333f5
Author:Julien Jerphanion
Committer:GitHub

maint: Remove unused protocol buffers' definition (#856)

Commit:f2d1e77
Author:Vasil Danielov Pashov
Committer:GitHub

Add none type (#646) <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> #### Reference Issues/PRs Resolves #639 Relates to: #540 #224 <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue --> #### What does this implement/fix? Explain your changes. Change the C++ code to support a special undefined type that can be changed into any other type in a subsequent data key. #### Any other comments? Only the C++ layer is changed; it might need additional Python-layer integration. --------- Co-authored-by: willdealtry <william.dealtry@gmail.com> Co-authored-by: Vasil Pashov <Vasil.Pashov@man.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
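As a rough illustration of the scenario this targets, shown with the public Python API: a column that starts out entirely `None` (so no concrete type can be inferred) is later appended to with a concrete type. This is a hedged sketch only; since only the C++ layer changes here, whether it round-trips end-to-end depends on the Python-layer integration the author notes may still be needed.

```python
import pandas as pd
from arcticdb import Arctic

ac = Arctic("mem://none_type_demo")
lib = ac.create_library("lib")

# First write: column "b" holds only None, so it has no concrete type yet.
lib.write("sym", pd.DataFrame({"a": [1, 2], "b": [None, None]}))

# Later append: column "b" now carries floats; the undefined type recorded in
# the earlier data key can be promoted to the concrete float type.
lib.append("sym", pd.DataFrame({"a": [3], "b": [1.5]}, index=pd.RangeIndex(2, 3)))
```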

Commit:579c4f2
Author:Vasil Danielov Pashov
Committer:GitHub

Add none type (#646) <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> #### Reference Issues/PRs Resolves #639 Relates to: #540 #224 <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue --> #### What does this implement/fix? Explain your changes. Change the C++ code to support a special undefined type that can be changed into any other type in a subsequent data key. #### Any other comments? Only the C++ layer is changed; it might need additional Python-layer integration. --------- Co-authored-by: willdealtry <william.dealtry@gmail.com> Co-authored-by: Vasil Pashov <Vasil.Pashov@man.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Commit:9fe2c9e
Author:Julien Jerphanion
Committer:GitHub

maint: Rename `datetime64[ns]`-related fields and datatypes (#592)

Commit:462815f
Author:Julien Jerphanion
Committer:GitHub

maint: Rename `datetime64[ns]`-related fields and datatypes (#592)

Commit:142fa82
Author:phoebusm
Committer:GitHub

Add support for Azure blob storage (Linux and Windows) (#427) All pytest test cases that test against S3 storage have had an Azure storage fixture added as well. All tests against the emulator (Azurite) pass. However, until an Azurite container is added in GitHub Actions, those test cases won't pass in GitHub Actions. A separate PR will be raised for it. TODO: - [x] Add Azurite in GitHub Actions - [x] Python layer's config and pytest fixtures still require some work - [ ] document --------- Co-authored-by: mhertz <Matthew.Hertz@man.com> Co-authored-by: Matthew Hertz <m@mehertz.com>

Commit:f9caaef
Author:phoebusm
Committer:GitHub

Add support for Azure blob storage (Linux and Windows) (#427) All pytest test cases that test against S3 storage have had an Azure storage fixture added as well. All tests against the emulator (Azurite) pass. However, until an Azurite container is added in GitHub Actions, those test cases won't pass in GitHub Actions. A separate PR will be raised for it. TODO: - [x] Add Azurite in GitHub Actions - [x] Python layer's config and pytest fixtures still require some work - [ ] document --------- Co-authored-by: mhertz <Matthew.Hertz@man.com> Co-authored-by: Matthew Hertz <m@mehertz.com>

Commit:3bc665d
Author:willdealtry
Committer:William Dealtry

Move encoded field and descriptors from protobuf

Commit:7782515
Author:willdealtry
Committer:William Dealtry

Move encoded field and descriptors from protobuf

Commit:00c7571
Author:willdealtry
Committer:willdealtry

Move encoded field and descriptors from protobuf

Commit:655900b
Author:Gem Dot Artigas
Committer:GemDot

Prevent out-of-order data being written unless explicitly requested

Commit:dea353f
Author:Gem Dot Artigas
Committer:GemDot

Prevent out-of-order data being written unless explicitly requested

Commit:4dee19d
Author:aseaton
Committer:Alex Seaton

Fix Windows unit test failures relating to LMDB mapsize.

Commit:db592b1
Author:aseaton
Committer:Alex Seaton

Fix Windows unit test failures relating to LMDB mapsize.

Commit:387978a
Author:Dealtry, William (London)

Initial Commit