Proto commits in outcaste-io/badger

These are the 53 commits in which the Protocol Buffers files changed:

Commit:bccf8a2
Author:Manish R Jain
Committer:GitHub

feat(simplify): Remove 25% of Badger code (#12)

In this PR, I only keep the features that are needed by Outserv. This results in a 25% reduction of the Go codebase, from 15K LOC to 11K LOC, excluding protos, flat buffers, and tests.
- Managed mode is the only mode available.
- Write transactions and Oracle are gone.
- Memtable and WAL are gone.
- SyncWrites option is gone. We no longer need it because we only have SSTables and we always sync them.
- Backup and Load are gone.
- WriteBatch now uses Skiplists.
- A separate memtable held by the DB object is gone. It only holds immutable memtables now.
- Sequence struct is gone.
- TTL for keys is gone.
- Various tools are gone:
  - bank.go
  - backup.go
  - restore.go
- For the sake of speed, I commented out lots of tests from:
  - db_test.go
  - managed_db_test.go
  - stream_writer_test.go
  - txn_test.go
  - rotate_test.go

I'll fix them later.

The documentation is generated from this commit.

Commit:7689a23
Author:Manish R Jain

Expel to Outcaste

Commit:73c1ce3
Author:Naman Jain
Committer:GitHub

pb: avoid protobuf warning due to common filename (#1519) (#1731)

If a Go program imports both badger (v1) and badger/v2, a warning will be produced at init time:

    WARNING: proto: file "pb.proto" is already registered
    A future release will panic on registration conflicts. See:
    https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

The problem is the "pb.proto" filename; it's registered globally, which makes it very likely to cause conflicts with other protobuf-generated packages in a Go binary. Coincidentally, this is a problem with badger's pb package in the v1 module, since that too uses the name "pb.proto".

Instead, call the file "badgerpb2.proto", which should be unique enough, and it's also the name we use for the Protobuf package.

Finally, update gen.sh to work out of the box via just "bash gen.sh" without needing extra tweaks, thanks to the "paths=source_relative" option. It forces output files to be produced next to the input files, which is what we want.

(cherry picked from commit 3e6a4b7ca9637d3e2b7522c4ef04052ef31ec6d9)
Co-authored-by: Daniel Martí <mvdan@mvdan.cc>
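
As a quick illustration of the effect of this fix (not part of the commit itself), the protobuf runtime's global registry can be queried for the filename a descriptor was registered under. FindFileByPath is a real google.golang.org/protobuf API; the surrounding function is hypothetical and assumes the badger pb package is blank-imported elsewhere so its init runs:

```go
package example

import (
	"fmt"

	"google.golang.org/protobuf/reflect/protoregistry"
)

// checkRegistration prints which filename the badger proto descriptor is
// registered under: after this commit it should be the unique
// "badgerpb2.proto" rather than the conflict-prone "pb.proto".
func checkRegistration() {
	if _, err := protoregistry.GlobalFiles.FindFileByPath("pb.proto"); err != nil {
		fmt.Println(`no conflict-prone "pb.proto" registered:`, err)
	}
	if fd, err := protoregistry.GlobalFiles.FindFileByPath("badgerpb2.proto"); err == nil {
		fmt.Println("registered as:", fd.Path())
	}
}
```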

Commit:74ade98
Author:Manish R Jain
Committer:GitHub

opt(stream): add option to directly copy over tables from lower levels (#1700)

This PR adds a FullCopy option to Stream, which allows sending tables to the writer in their entirety. If this option is set to true, we directly copy over the tables from the last 2 levels. This option increases the stream speed while also lowering the memory consumption on the DB that is streaming the KVs. For a 71GB compressed and encrypted DB, we observed a 3x improvement in speed. The DB contained ~65GB in the last 2 levels, with the remainder in the levels above.

To use this option, the following must be set in Stream:
- stream.KeyToList = nil
- stream.ChooseKey = nil
- stream.SinceTs = 0
- db.managedTxns = true

If we use the stream writer for receiving the KVs, the encryption mode has to be the same in sender and receiver. This restricts db.StreamDB() to using the same encryption mode in both input and output DB. Added a TODO for allowing different encryption modes.
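
A hedged sketch of the constraint list above. The module path and Send signature follow the dgraph-io badger v3 API of this era; FullCopy is the field this commit introduces, and `send` is a hypothetical stand-in for the real receiver:

```go
package example

import (
	"context"

	badger "github.com/dgraph-io/badger/v3"
	"github.com/dgraph-io/ristretto/z"
)

// streamFullCopy streams the whole DB, letting Stream ship tables from the
// last two levels verbatim instead of iterating over their keys.
func streamFullCopy(ctx context.Context, db *badger.DB, send func(*z.Buffer) error) error {
	stream := db.NewStream()
	stream.KeyToList = nil // required: no per-key transformation
	stream.ChooseKey = nil // required: no key filtering
	stream.SinceTs = 0     // required: stream all versions
	stream.FullCopy = true // copy tables from the last two levels directly
	stream.Send = send     // receives serialized KV batches (and whole tables)
	return stream.Orchestrate(ctx)
}
```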

Commit:1bebc26
Author:Manish R Jain
Committer:GitHub

feat(Trie): Working prefix match with holes (#1654)

This PR adds a way to match a prefix while ignoring certain portions of the key (aka holes). This is useful for implementing multi-tenancy in Dgraph, where the namespace is stored at byte indices 3-11 of the prefix key, and those bytes need to be ignored in order to subscribe to updates across all namespaces.
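
A sketch of the use case described above: subscribe to a key prefix while treating the namespace bytes (indices 3-11) as holes, so the subscription fires for every namespace. pb.Match and its IgnoreBytes field are from the badger v3 API this Trie change feeds into; the 11-byte prefix here is hypothetical:

```go
package example

import (
	"context"

	badger "github.com/dgraph-io/badger/v3"
	"github.com/dgraph-io/badger/v3/pb"
)

// subscribeAllNamespaces watches one logical prefix across all namespaces.
func subscribeAllNamespaces(ctx context.Context, db *badger.DB) error {
	prefix := make([]byte, 11)
	copy(prefix, "key") // hypothetical 3-byte type marker; bytes 3-11 are the namespace
	m := pb.Match{
		Prefix:      prefix,
		IgnoreBytes: "3-11", // byte positions to skip while matching
	}
	return db.Subscribe(ctx, func(kvs *pb.KVList) error {
		// updates from every namespace arrive here
		return nil
	}, []pb.Match{m})
}
```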

Commit:6266a4e
Author:aman-bansal

changing metrics to v3 and badgerpb2 to badgerpb3

Commit:b69163b
Author:aman bansal
Committer:GitHub

changing badger module path to v3 (#1636)
* changing badger module path to v3
* running go mod tidy
* updating metric keys + proto package

Commit:54688d8
Author:aman-bansal

changing badger module path to v3

Commit:8d26d52
Author:Manish R Jain
Committer:GitHub

opt(memory): Use z.Calloc for allocating KVList (#1563)

KVs can take up a lot of memory in the stream framework. With this change, we allocate them using z.Allocator, and allow callers of KeyToList to use the allocator to generate KVs as well. After we call Send, we release them. Also change the Stream framework to spit out StreamDone markers.
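
A hedged sketch of a KeyToList callback built on the stream's allocator, as this commit enables. The itr.Alloc field, y.NewKV helper, and Allocator.Copy method are assumptions based on the badger v3 and ristretto z APIs:

```go
package example

import (
	badger "github.com/dgraph-io/badger/v3"
	"github.com/dgraph-io/badger/v3/pb"
	"github.com/dgraph-io/badger/v3/y"
)

// allocKeyToList builds the outgoing KV through the allocator owned by the
// stream; the allocator is released after Send returns.
func allocKeyToList(key []byte, itr *badger.Iterator) (*pb.KVList, error) {
	a := itr.Alloc
	item := itr.Item()
	val, err := item.ValueCopy(nil)
	if err != nil {
		return nil, err
	}
	kv := y.NewKV(a)      // assumed helper: allocates the KV via the allocator
	kv.Key = a.Copy(key)  // copies into allocator-owned memory
	kv.Value = a.Copy(val)
	return &pb.KVList{Kv: []*pb.KV{kv}}, nil
}
```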

Commit:599363b
Author:Ibrahim Jarif
Committer:GitHub

[BREAKING] feat(index): Use flatbuffers instead of protobuf (#1546)

This PR
- Uses flatbuffers instead of protobufs for the table index and directly stores byte slices in the cache.
- Uses MaxVersion to pick the oldest tables for compaction first.
- Uses leveldb/bloom so that we can test it without unmarshalling.
- Adds uncompressed size and key count to the table index.
- Updates the write bench tool to use managed mode.

Commit:58460bb
Author:Daniel Martí
Committer:Ibrahim Jarif

pb: avoid protobuf warning due to common filename (#1519)

If a Go program imports both badger (v1) and badger/v2, a warning will be produced at init time:

    WARNING: proto: file "pb.proto" is already registered
    A future release will panic on registration conflicts. See:
    https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

The problem is the "pb.proto" filename; it's registered globally, which makes it very likely to cause conflicts with other protobuf-generated packages in a Go binary. Coincidentally, this is a problem with badger's pb package in the v1 module, since that too uses the name "pb.proto".

Instead, call the file "badgerpb2.proto", which should be unique enough, and it's also the name we use for the Protobuf package.

Finally, update gen.sh to work out of the box via just "bash gen.sh" without needing extra tweaks, thanks to the "paths=source_relative" option. It forces output files to be produced next to the input files, which is what we want.

Commit:7d91a85
Author:Ibrahim Jarif

feat(index): Use flatbuffers instead of protobuf

Use flatbuffers instead of protobuf for storing the table index.

Commit:3e6a4b7
Author:Daniel Martí
Committer:GitHub

pb: avoid protobuf warning due to common filename (#1519)

If a Go program imports both badger (v1) and badger/v2, a warning will be produced at init time:

    WARNING: proto: file "pb.proto" is already registered
    A future release will panic on registration conflicts. See:
    https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

The problem is the "pb.proto" filename; it's registered globally, which makes it very likely to cause conflicts with other protobuf-generated packages in a Go binary. Coincidentally, this is a problem with badger's pb package in the v1 module, since that too uses the name "pb.proto".

Instead, call the file "badgerpb2.proto", which should be unique enough, and it's also the name we use for the Protobuf package.

Finally, update gen.sh to work out of the box via just "bash gen.sh" without needing extra tweaks, thanks to the "paths=source_relative" option. It forces output files to be produced next to the input files, which is what we want.

Commit:cddf7c0
Author:Ibrahim Jarif
Committer:GitHub

Proto: Rename dgraph.badger.v2.pb to badgerpb2 (#1314)

This PR renames badger protobuf package from `dgraph.badger.v2.pb` to `badgerpb2`. The `pb.pb.go` file has been regenerated using the `pb/gen.sh` script.

Commit:8097259
Author:Daniel Martí
Committer:GitHub

Proto: make badger/v2 compatible with v1 (#1293)

There were two instances of init-time work being incompatible with v1. That is, if one imported both v1 and v2 as part of a Go build, the resulting binary would end up panicking before main could run. The examples below are done with the latest versions of v1 and v2, and a main.go as follows:

    package main

    import (
        _ "github.com/dgraph-io/badger"
        _ "github.com/dgraph-io/badger/v2"
    )

    func main() {}

First, the protobuf package used "pb" as its proto package name. This is a problem, because types are registered globally with their fully qualified names, like "pb.Foo". Since both badger/pb and badger/v2/pb tried to globally register types with the same qualified names, we'd get a panic:

    $ go run .
    panic: proto: duplicate enum registered: pb.ManifestChange_Operation

    goroutine 1 [running]:
    github.com/golang/protobuf/proto.RegisterEnum(...)
        .../go/pkg/mod/github.com/golang/protobuf@v1.3.1/proto/properties.go:459
    github.com/dgraph-io/badger/v2/pb.init.0()
        .../badger/pb/pb.pb.go:638 +0x459

To fix this, make v2's proto package fully qualified. Since the namespace is global, just "v2.pb" wouldn't suffice; it's not unique enough. "dgraph.badger.v2.pb" seems good, since it follows the Go module path pretty closely.

The second issue was with expvar, which too uses globally registered names:

    $ go run .
    2020/04/08 22:59:20 Reuse of exported var name: badger_disk_reads_total
    panic: Reuse of exported var name: badger_disk_reads_total

    goroutine 1 [running]:
    log.Panicln(0xc00010de48, 0x2, 0x2)
        .../src/log/log.go:365 +0xac
    expvar.Publish(0x906fcc, 0x17, 0x9946a0, 0xc0000b0318)
        .../src/expvar/expvar.go:278 +0x267
    expvar.NewInt(...)
        .../src/expvar/expvar.go:298
    github.com/dgraph-io/badger/v2/y.init.1()
        .../badger/y/metrics.go:55 +0x65
    exit status 2

This time, replacing the "badger_" var prefix with "badger_v2_" seems like it's simple enough as a fix.

Fixes #1208.
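
For the expvar half of the fix, a minimal sketch of the renamed registration; the metric name mirrors the panic trace above, and the variable itself is hypothetical:

```go
package example

import "expvar"

// With the "badger_v2_" prefix, this registration no longer collides with
// v1's "badger_disk_reads_total" expvar.
var numReads = expvar.NewInt("badger_v2_disk_reads_total")
```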

Commit:69f35b3
Author:michele meloni
Committer:GitHub

Add go_package in .proto (#1282)

Commit:ea19351
Author:Ashish Goswami
Committer:Ibrahim Jarif

Introduce StreamDone in Stream Writer (#1061)

This PR introduces a way to tell StreamWriter to close a stream. Previously, streams were always open until Flush was called on StreamWriter. This resulted in high memory utilisation because of the TableBuilder underlying each sortedWriter, and closing all sorted writers in a single call resulted in more memory allocation (during Flush()). Closing streams early can be useful in some cases, such as the bulk loader in Dgraph, where only one stream is active at a time.

(cherry picked from commit 385da9100e3534a5f82b3c37e0990305f8008a3a)
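
A hedged sketch of the mechanism: a KV carrying only StreamId and StreamDone tells the StreamWriter that the stream is finished, so its sortedWriter (and TableBuilder memory) can be released before Flush. Field and method names are assumed from the badger v2 API of this era; the helper is hypothetical:

```go
package example

import (
	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/pb"
)

// closeStream signals that stream `id` has no more data.
func closeStream(w *badger.StreamWriter, id uint32) error {
	done := &pb.KVList{Kv: []*pb.KV{{
		StreamId:   id,
		StreamDone: true, // marker this commit introduces
	}}}
	return w.Write(done)
}
```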

Commit:f46f8ea
Author:Ibrahim Jarif
Committer:GitHub

Store total key-value size in table footer (#1137)

This PR stores the total key-value size of a table in the table footer. The key-value size can be accessed via the db.Tables(..) call, which returns the list of all tables along with the total size of their key-values.

Commit:385da91
Author:Ashish Goswami
Committer:GitHub

Introduce StreamDone in Stream Writer (#1061)

This PR introduces a way to tell StreamWriter to close a stream. Previously, streams were always open until Flush was called on StreamWriter. This resulted in high memory utilisation because of the TableBuilder underlying each sortedWriter, and closing all sorted writers in a single call resulted in more memory allocation (during Flush()). Closing streams early can be useful in some cases, such as the bulk loader in Dgraph, where only one stream is active at a time.

Commit:e7d0a7b
Author:Ibrahim Jarif
Committer:GitHub

Support compression in Badger (#1013)

This commit adds support for compression in badger. Two compression algorithms, Snappy and ZSTD, are supported for now. The compression algorithm information is stored in the manifest file. We compress blocks (typically of size 4KB) stored in the SST. The compression algorithm can be specified via the CompressionType option to badger.
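
A minimal sketch of enabling block compression, assuming the badger v2 options API (WithCompression and the constants in the options package):

```go
package example

import (
	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

// openCompressed opens a DB with ZSTD block compression; the chosen
// algorithm is recorded in the manifest file.
func openCompressed(dir string) (*badger.DB, error) {
	opts := badger.DefaultOptions(dir).
		WithCompression(options.ZSTD) // or options.Snappy / options.None
	return badger.Open(opts)
}
```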

Commit:a425b0e
Author:balaji
Committer:GitHub

Support encryption at rest (#1042)

This PR adds support for encrypting data that goes to disk. Two components are encrypted: the sst and the vlog. In the sst, each block is encrypted with a separate IV using AES CTR mode. In the vlog, each entry is encrypted. Each vlog has a base IV of 12 bytes, and the IV for each entry is generated by merging the base IV with the entry offset. Data is encrypted using a data key, which is generated by badger. The data key is further encrypted using the user-provided key and stored on disk, so that the user can change keys. In order to change keys, the user has to provide the old key and the new key; we decrypt using the old key and store the data key back to disk encrypted with the new key. This mechanism simplifies key changes.
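
A sketch of the per-entry IV scheme described above, using only the Go standard library. The byte order used to fold the offset into the IV is an assumption here, not necessarily what badger does:

```go
package example

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
)

// xorEntry combines a 12-byte base IV with the entry offset to form the
// 16-byte AES-CTR IV, then encrypts (or decrypts) src with the data key.
func xorEntry(dataKey, baseIV []byte, offset uint32, src []byte) ([]byte, error) {
	iv := make([]byte, aes.BlockSize) // 16 bytes
	copy(iv, baseIV)                  // 12 random bytes per vlog file
	binary.BigEndian.PutUint32(iv[12:], offset) // unique per entry
	block, err := aes.NewCipher(dataKey)
	if err != nil {
		return nil, err
	}
	dst := make([]byte, len(src))
	cipher.NewCTR(block, iv).XORKeyStream(dst, src)
	return dst, nil
}
```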

Commit:d9b02b7
Author:பாலாஜி ஜின்னா

add encryption details to table manifest changes

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:e9fab80
Author:balaji
Committer:GitHub

integrate encryption to db (#1000)
* add encryption to the table

Commit:dd79bfa
Author:பாலாஜி ஜின்னா

add support for encryption to the db

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:8868b65
Author:balaji
Committer:GitHub

encryption registry (#975)

For encryption at rest support, we'll be using two keys for encryption. One key is provided by the user; the other is generated by badger (the data key). The key generated by badger will be used for encryption and decryption. It will be encrypted by the user-provided key and persisted to disk. The main reason behind this implementation is that it makes it easy to rotate keys: we just need to decrypt the data key with the old key and encrypt it back with the new key. As a first step, this commit adds the Key Registry, which is responsible for maintaining all the badger-generated keys and doing key rotation. We generate a new key every 10 days, which can be changed via an option.
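
A sketch of opening a DB under this two-key scheme, assuming the option names that later shipped in badger v2 (WithEncryptionKey and WithEncryptionKeyRotationDuration):

```go
package example

import (
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// openEncrypted wires in the user key that encrypts badger's generated data
// keys; the rotation duration mirrors the 10-day default mentioned above.
func openEncrypted(dir string, userKey []byte) (*badger.DB, error) {
	opts := badger.DefaultOptions(dir).
		WithEncryptionKey(userKey). // 16, 24, or 32 bytes for AES-128/192/256
		WithEncryptionKeyRotationDuration(10 * 24 * time.Hour)
	return badger.Open(opts)
}
```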

Commit:9f39780
Author:பாலாஜி ஜின்னா
Committer:பாலாஜி ஜின்னா

encryption registry

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:9ce7439
Author:பாலாஜி ஜின்னா

clean up

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:adfb181
Author:பாலாஜி ஜின்னா

new change

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:b05fa0a
Author:பாலாஜி ஜின்னா

hook key registry

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:4c9b1b4
Author:பாலாஜி ஜின்னா

add key registry

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:4603b3a
Author:பாலாஜி ஜின்னா

wip

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

Commit:61c492d
Author:Ibrahim Jarif
Committer:GitHub

[breaking/format] Add key-offset index to the end of SST table (#881)

* Add key-offset index to the end of SST table

This commit adds the key-offset index, used for searching the blocks in a table, to the end of the SST file. The length of the index is stored in the last 4 bytes of the file.
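
A sketch of reading that layout back, per the description above; the big-endian byte order is an assumption:

```go
package example

import "encoding/binary"

// readIndex extracts the key-offset index from an SST: the last 4 bytes
// hold the index length, and the index sits immediately before them.
func readIndex(sst []byte) []byte {
	n := len(sst)
	indexLen := int(binary.BigEndian.Uint32(sst[n-4:]))
	return sst[n-4-indexLen : n-4]
}
```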

Commit:069bc6b
Author:ashish
Committer:ashish

Remove proto, store plain offsets for binary search in block

Commit:58b24a8
Author:Ibrahim Jarif
Committer:Ibrahim Jarif

fixup

Commit:5061241
Author:ashish
Committer:ashish

Add Checksum at block level

Commit:d0ded42
Author:ashish
Committer:ashish

Rebase with ibrahim/footer-protobuf

Commit:6751285
Author:ashish
Committer:ashish

Change Checksum proto

Commit:222923c
Author:ashish
Committer:ashish

Add checksum implementation
* Add checksum in protos
* Add CRC32C and xxHash checksum implementation
* Change blockMeta to blockIndex in protos
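
Sketches of the two algorithms named above. CRC32C (Castagnoli) comes from the standard library; the xxhash dependency shown is an assumption about which library is wrapped:

```go
package example

import (
	"hash/crc32"

	"github.com/cespare/xxhash"
)

// Castagnoli polynomial table, the "C" in CRC32C.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// crc32c computes the block checksum using CRC32C.
func crc32c(block []byte) uint32 {
	return crc32.Checksum(block, castagnoli)
}

// xxh64 computes the block checksum using xxHash64.
func xxh64(block []byte) uint64 {
	return xxhash.Sum64(block)
}
```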

Commit:db8bb57
Author:Ibrahim Jarif

Change pb.checksum format and add checksum.go file

Commit:362779c
Author:Ibrahim Jarif

Remove checksum from manifest file

Commit:aca502d
Author:Ibrahim Jarif

Address review comments

Commit:413ce9e
Author:Ibrahim Jarif
Committer:Ibrahim Jarif

fixup

Commit:fc950be
Author:Ibrahim Jarif

address review comments

Commit:e711436
Author:Ibrahim Jarif
Committer:Ibrahim Jarif

Add key-offset index to the end of SST table

This commit adds the key-offset index, used for searching the blocks in a table, to the end of the SST file. The length of the index is stored in the last 4 bytes of the file.

Commit:7116e16
Author:Ashish Goswami
Committer:Manish R Jain

Add StreamWriter for fast sorted stream writes (#802)

Changes:
* Start work on a sorted stream writer which can avoid paying the cost of compactions.
* Logic for sorted stream writer is all written.
* Add a stream id to each key in the output from the Stream framework.
* Make StreamWriter work. Performance is incredible, able to write at 100MBps.
* Add tests for StreamWriter.
* Update Oracle after StreamWriter is done.
* Mods in the stream framework so we can decrease the number of key ranges.
* Add awareness of stream id in sortedWriter.
* Sync the directories when StreamWriter is done.
* Add licenses.
* Moving builder.Finish within the goroutine improves throughput by 15%. Getting 112MBps write speed.
* Change StreamWriter tests to cover managed mode as well.
* Rename files to stream_writer.go and the corresponding test file.
* Fix a bug caused by marking an index done below the current done-until, causing a panic. Now we just recreate the oracle.
* Update protos.

Commit:2017987
Author:Manish R Jain
Committer:GitHub

Introduce SSTable sha256 checksums (#689)

Add SHA256 checksums for SSTables in the MANIFEST. If a table no longer matches its checksum, that table is skipped over with an error. Tested that it works with previous badger directories. As new tables get created, Badger stores their checksums in the MANIFEST. Modified `badger info` to show the checksums stored in the MANIFEST, so the user can manually compare against the output of `sha256sum <filename>` if needed. Fixes #680.
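
The manual check the commit suggests, sketched with the standard library; the hex digest should match what `badger info` prints from the MANIFEST (equivalent to running `sha256sum <filename>`):

```go
package example

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// tableSum returns the hex-encoded SHA256 of an SST file.
func tableSum(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}
```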

Commit:599bc29
Author:Manish R Jain
Committer:GitHub

Renaming and Refactoring (#655)

- Flatten function can now be called from live code path, i.e. while Badger is running.
- Refactor stop and start compactions code, so it can be used in Flatten, DropAll, and other places.
- Rename protos package to pb. Consolidate both the proto files into one pb.proto.
- Rename KeyToKVList to just KeyToList in Stream.
- Rename KVPair to just KV.

Changes:
* Rename protos to pb.
* Rename KeyToKVList to KeyToList.
* Refactor stop and start compactions, so we can use it in various places.
* Rename KVPair to KV
* Defer startCompactions right next to stopCompactions.
* Catch all usages of KVPair. Rename to KV.

Commit:14cbd89
Author:Manish R Jain
Committer:GitHub

Port Stream framework from Dgraph to Badger (#653)

This PR ports over the Stream framework from Dgraph to Badger. This framework allows users to concurrently iterate over Badger, converting data to key-value lists, which then get sent out serially. In Dgraph we use this framework for shipping snapshots from leader to followers, doing periodic delta merging of updates, moving predicates from one Alpha group to another, and so on. However, the framework is general enough that it could live within Badger, and that's what this PR achieves. In addition, the Backup API of Badger now uses this framework to make taking backups faster.

Changes:
* Port Stream from Dgraph to Badger.
* Switch Backup to use the new Stream framework.
* Update godocs.
* Remove a t.Logf data race.
* Self-review
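
A minimal sketch of driving the framework: several goroutines iterate and build key-value lists, and Send receives the batches serially. Field names follow the badger Stream API of this era; module paths may differ in later versions:

```go
package example

import (
	"context"

	badger "github.com/dgraph-io/badger"
	"github.com/dgraph-io/badger/pb"
)

// streamPrefix streams every key under prefix through the Stream framework.
func streamPrefix(ctx context.Context, db *badger.DB, prefix []byte) error {
	stream := db.NewStream()
	stream.NumGo = 16      // concurrent iterators
	stream.Prefix = prefix // only keys under this prefix
	stream.Send = func(list *pb.KVList) error {
		// called serially; ship the batch to a backup file, a follower, etc.
		return nil
	}
	return stream.Orchestrate(ctx)
}
```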

Commit:ba13ac7
Author:Gus
Committer:Manish R Jain

Exclude deleted and expired items from backup (#624)

* Added check to exclude deleted and expired items from backup; also added check to exclude expired items from restore.
* Changed delete and expire checks to keep a copy of the data, but save a marker indicating state. Delete state is detected during restore using a key prefix.
* Added meta field to protos.KVPair to store meta state during backup. Regenerated protos. Changed Backup and Load to store and use meta values. Added backup and restore tests to complement the existing ones.
* Fixed test.
* Handle DiscardEarlierVersions and IsDeletedOrExpired correctly.
* Removed unneeded test funcs. Cleaned up tests. Added incremental backup test.
* Fixed small data race. Updated tests.
* Cosmetic changes.

Commit:a5499e5
Author:Deepak Jois
Committer:GitHub

Add support for TTL

We add an additional field to the LSM key structure to capture a Unix timestamp, beyond which the key is considered expired and is treated as if it were deleted. This changes the on-disk format, so we also need to increment the manifest version number. Fixes #298
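
The expiry rule described above, as a one-line check (a sketch of the semantics, not badger's exact code):

```go
package example

import "time"

// isExpired reports whether a key with the given expiry timestamp should be
// treated as deleted: a non-zero ExpiresAt at or before the current Unix time.
func isExpired(expiresAt uint64) bool {
	return expiresAt != 0 && expiresAt <= uint64(time.Now().Unix())
}
```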

Commit:671c20e
Author:Deepak Jois
Committer:Deepak Jois

Add DB.Backup() and DB.Load() for backup purposes.

We write a backup of the latest snapshot in the DB (including all previous versions) to a protobuf-encoded file. Fixes #135
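
A sketch of taking a full backup with the API this commit adds; passing since=0 requests everything, and the returned version can be passed as since for the next incremental backup:

```go
package example

import (
	"os"

	badger "github.com/dgraph-io/badger"
)

// backupTo writes a protobuf-encoded backup of the latest snapshot to path.
func backupTo(db *badger.DB, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := db.Backup(f, 0); err != nil { // since=0 → full backup
		return err
	}
	return f.Sync()
}
```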

Commit:ff3547c
Author:Deepak Jois
Committer:Deepak Jois

Add KV.Backup() to backup data to a file.

Commit:f3d5152
Author:Sam Hughes
Committer:Sam Hughes

Replace compaction log with manifest

What this does do:
- appends to manifest and syncs before deleting old files from disk, to ensure safety amidst kill -9 or system crash

What this could do but does not:
- include file size, smallest/biggest keys in the manifest
- keep track of value log files

It might be nice to have file size and include all files, to verify that a `cp -R` or file-based backup/restore has included everything. Right now this just gets us crash safety.