DatenLord

Join the chat at https://gitter.im/datenlord/datenlord

DatenLord is a next-generation cloud-native distributed storage platform that aims to meet the performance-critical storage needs of next-generation cloud-native applications, such as microservices, serverless, and AI. On one hand, DatenLord is designed as a cloud-native storage system that is distributed, fault-tolerant, and supports graceful upgrades. These cloud-native features make DatenLord easy to use and easy to maintain. On the other hand, DatenLord is designed as an application-oriented storage system, in that it is optimized for many performance-critical scenarios, such as databases, AI machine learning, and big data. Meanwhile, DatenLord provides high-performance storage services for containers, which facilitates stateful applications running on top of Kubernetes (K8S). The high performance of DatenLord is achieved by leveraging recent technology advances in hardware and software, such as NVMe, non-volatile memory, asynchronous programming, and native Linux asynchronous I/O support.

Why DatenLord?

Why did we build DatenLord? The reason is two-fold:

Target scenarios

The main scenario for DatenLord is to facilitate high availability across multi-cloud, hybrid-cloud, and multi-data-center deployments. Concretely, many online business providers run services too important to afford any downtime. To achieve high availability, these providers deploy applications and services across multiple clouds or data centers, hoping to avoid a single point of failure in any one cloud or data center. It is relatively easy to deploy applications and services to multiple clouds and data centers, but it is much harder to replicate all data to every cloud or data center in a timely manner, due to the huge data size. If data is not equally available across clouds and data centers, the online business may still suffer from a single point of failure, because a cloud or data center outage makes the data hosted there unavailable.

DatenLord alleviates the data unavailability caused by cloud or data center failures by caching data in multiple layers, such as a local cache, neighbor caches, and remote caches. Although the total data size is huge, the hot data involved in online business is usually of limited size, a property known as data locality. DatenLord leverages data locality and builds a set of large-scale, distributed, automatic cache layers to buffer hot data in a smart manner. The benefit of DatenLord is two-fold:

Architecture

Single Data Center

DatenLord Single Data Center

Multiple Data Centers and Hybrid Cloud

DatenLord Multiple Data Centers and Hybrid Cloud

DatenLord provides three kinds of user interfaces: a KV interface, an S3 interface, and a file interface. The backend storage is supported by the underlying distributed cache layer, which is strongly consistent. The strong consistency is guaranteed by the metadata management module, which is built on a high-performance consensus protocol. The persistent storage layer can be local disks or S3 storage. For the network, RDMA is used to provide high-throughput, low-latency networking; if RDMA is not supported, TCP is an alternative. In the multi-data-center and hybrid-cloud scenario, a dedicated metadata server handles metadata requests within each data center, while in the single-data-center scenario the metadata module can run on the same machine as the cache node. The network between data centers and public clouds is a managed private network, to guarantee high-quality data transfer.

Quick Start

Currently, DatenLord is built as Docker images and can be deployed via K8S.

To deploy DatenLord via K8S, simply run:
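A minimal sketch of such a deployment, assuming the repository ships a K8S manifest (the manifest name and namespace below are placeholders, not the actual file names):

```shell
# Hypothetical manifest name -- check the DatenLord repository for the
# actual deployment manifest or setup script.
kubectl apply -f datenlord-deploy.yaml

# Wait until the DatenLord pods are ready (namespace name is an assumption).
kubectl wait --for=condition=Ready pods --all -n datenlord --timeout=300s
```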

To use DatenLord, define a PVC using the DatenLord StorageClass, and then deploy a Pod using this PVC:

cat <<EOF >datenlord-demo.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-datenlord-test
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Mi
  storageClassName: csi-datenlord-sc

---
apiVersion: v1
kind: Pod
metadata:
  name: mysql-datenlord-test
spec:
  containers:
  - name: mysql
    image: mysql
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "rootpasswd"
    volumeMounts:
    - mountPath: /var/lib/mysql
      name: data
      subPath: mysql
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-datenlord-test
EOF

kubectl apply -f datenlord-demo.yaml
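After applying the manifest, the PVC and the pod can be checked with standard kubectl commands (resource names are taken from the example above):

```shell
# The PVC should report STATUS Bound once provisioning succeeds.
kubectl get pvc pvc-datenlord-test

# The pod should reach Running once the volume is mounted.
kubectl get pod mysql-datenlord-test

# Inspect mount details if the pod stays in Pending or ContainerCreating.
kubectl describe pod mysql-datenlord-test
```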

DatenLord provides a customized scheduler that implements the K8S scheduler extender. The scheduler tries to schedule a pod to the node that holds the volume the pod requests. To use the scheduler, add schedulerName: datenlord-scheduler to the spec of your pod. Caveat: a dangling Docker image may cause a failed to parse request error; running docker image prune on each K8S node is a way to fix it.
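For example, the demo pod above would opt into the DatenLord scheduler as follows (abridged; only the schedulerName line is new relative to the demo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-datenlord-test
spec:
  schedulerName: datenlord-scheduler   # ask the DatenLord scheduler to place this pod
  containers:
  - name: mysql
    image: mysql
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-datenlord-test
```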

You may need to install the snapshot CRD and controller on K8S if you use the K8S CSI snapshot feature:
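One common way to do this is via the kubernetes-csi external-snapshotter project. The paths below follow that repository's layout but are illustrative; pin a release that matches your K8S version rather than using master:

```shell
# Install the VolumeSnapshot CRDs (illustrative; pin a tagged release
# matching your cluster instead of master).
SNAPSHOTTER=https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master
kubectl apply -f $SNAPSHOTTER/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f $SNAPSHOTTER/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f $SNAPSHOTTER/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml

# Install the common snapshot controller.
kubectl apply -f $SNAPSHOTTER/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f $SNAPSHOTTER/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
```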

Monitoring

The DatenLord monitoring guideline is in datenlord monitoring. We provide both a YAML method and a Helm method to deploy the monitoring system.

To use the YAML method, run:

sh ./scripts/setup/datenlord-monitor-deploy.sh

To use the Helm method, run:

sh ./scripts/setup/datenlord-monitor-deploy.sh helm

Performance Test

The performance test is done with fio, and fio-plot is used to plot performance histograms.

To run the performance test:

sudo apt-get update
sudo apt-get install -y fio python3-pip
sudo pip3 install matplotlib numpy fio-plot
sh ./scripts/perf/fio-perf-test.sh TEST_DIR

Four histograms will be generated.
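The script drives fio under the hood. As a rough illustration (the script's actual job parameters may differ), a standalone fio run of the same flavor looks like:

```shell
# Illustrative 4KiB random-read job against a DatenLord mount point;
# /path/to/mountpoint is a placeholder for your test directory.
fio --name=randread --directory=/path/to/mountpoint \
    --rw=randread --bs=4k --size=100M --numjobs=4 \
    --ioengine=libaio --direct=1 --group_reporting
```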

The performance test is added to GitHub Actions (cron.yml), and a performance report is generated and archived as artifacts (Example) every four hours.

How to Contribute

Anyone interested in DatenLord is welcome to contribute.

Coding Style

Please follow the code style. Meanwhile, DatenLord adopts very strict clippy linting; please fix every clippy warning before submitting your PR. Also, please make sure all CI tests pass.

Continuous Integration (CI)

The CI of DatenLord leverages GitHub Actions. There are two CI flows for DatenLord: one for Rust cargo tests, clippy lints, and standard filesystem E2E checks; the other for CSI-related tests, such as the CSI sanity test and the CSI E2E test.
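As a rough sketch (not the actual workflow files; see .github/workflows in the repository for the real definitions), the cargo-test flow boils down to a GitHub Actions job like:

```yaml
# Illustrative only -- job names and steps are assumptions.
name: rust-ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - run: cargo clippy --all-targets -- -D warnings
    - run: cargo test
```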

The CSI E2E test setup is a bit complex and its action script cron.yml is quite long, so let's explain it in detail:

Sub-Projects

DatenLord has several related sub-projects, mostly works in progress, listed alphabetically:

Road Map