Tsumugi Spark

UNDER ACTIVE DEVELOPMENT

NOTE: Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of the Gauna with human intelligence and empathy. Like the original character from the manga "Knights of Sidonia", this project aims to combine the very powerful but hard-to-learn Deequ Scala library with the usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.).

About

The project's goal is to create a modern, SparkConnect-native wrapper for the elegant and high-performance Data Quality library, AWS Deequ. The PySpark Connect API is expected to be the primary focus, but the PySpark Classic API will also be maintained. Additionally, other thin clients such as connect-go or connect-rs will be supported.

Why another wrapper?

While Amazon Deequ itself is well-maintained, the existing PyDeequ wrapper faces three main challenges:

  1. It relies on direct py4j calls to the underlying Deequ Scala library. This approach presents two issues: a) It cannot be used in any SparkConnect environment. b) py4j is not well-suited for working with Scala code. For example, try creating an Option[Long] from Python using py4j to understand the complexity involved.
  2. It suffers from a lack of maintenance, likely due to the issues mentioned in point 1. This can be seen in this GitHub issue.
  3. The current python-deequ implementation makes it impossible to retrieve row-level results because py4j.reflection.PythonProxyHandler is not serializable. This problem is documented in this GitHub issue.

Goals of the project

Non-goals of the project

Architecture overview

From a high-level perspective, Tsumugi implements three main components:

  1. Messages for Deequ Core's main data structures
  2. SparkConnect Plugin and utilities
  3. PySpark Connect and PySpark Classic thin client

The diagram below provides an overview of this concept:

Project structure

Protobuf messages

The tsumugi-server/src/main/protobuf/ directory contains messages that define the main structures of the Deequ Scala library:

SparkConnect Plugin

The file tsumugi-server/src/main/scala/org/apache/spark/sql/tsumugi/DeequConnectPlugin.scala contains the plugin code itself. It is designed to be very simple, consisting of approximately 50 lines of code. The plugin's functionality is straightforward: it checks if the message is a VerificationSuite, passes it into DeequSuiteBuilder, and then packages the result back into a Relation.
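
The check-and-dispatch flow described above can be modelled in a few lines of plain Python. This is only a conceptual sketch, not the Scala plugin itself: `Envelope`, `VERIFICATION_SUITE_TYPE`, `handle_relation`, and the `run_suite` callback are hypothetical names introduced for illustration, and returning `None` for unrecognised messages is an assumption of this model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type tag; the real one comes from the project's protobuf definitions.
VERIFICATION_SUITE_TYPE = "example/VerificationSuite"


@dataclass(frozen=True)
class Envelope:
    """Stand-in for a typed wire message: a type tag plus an opaque payload."""

    type_url: str
    payload: bytes


def handle_relation(
    message: Envelope, run_suite: Callable[[bytes], str]
) -> Optional[str]:
    """Dispatch a message the way the plugin is described to.

    Returns the suite result for VerificationSuite messages and None for
    anything else, which models "this plugin does not handle the message".
    """
    if message.type_url != VERIFICATION_SUITE_TYPE:
        return None
    # In the real plugin this step is delegated to DeequSuiteBuilder and the
    # result is wrapped back into a Relation; here it is just a callback.
    return run_suite(message.payload)
```

Keeping the dispatch this thin means that all Deequ-specific logic lives in the builder, which is consistent with the plugin being only about 50 lines long.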

Deequ Suite Builder

The file tsumugi-server/src/main/scala/io/mrpowers/tsumugi/DeequSuiteBuilder.scala contains code that creates Deequ objects from protobuf messages. It maps enums and constants to their corresponding Deequ counterparts, and generates com.amazon.deequ objects from the respective protobuf messages. The code ultimately returns a ready-to-use Deequ top-level structure.
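
To illustrate the kind of enum-to-object mapping described above, here is a minimal Python sketch. It assumes that a message carries a comparison operator as an enum plus a numeric threshold, and that the builder turns the pair into a predicate over a metric value. The `Sign` enum and `build_assertion` function are hypothetical and do not mirror the actual protobuf schema or Scala code.

```python
import operator
from enum import Enum
from typing import Callable


class Sign(Enum):
    """Hypothetical wire-level enum for a comparison operator."""

    LT = "lt"
    LE = "le"
    EQ = "eq"
    GE = "ge"
    GT = "gt"


# Lookup table from the wire-level enum to an executable comparison.
_SIGN_TO_OPERATOR: dict[Sign, Callable[[float, float], bool]] = {
    Sign.LT: operator.lt,
    Sign.LE: operator.le,
    Sign.EQ: operator.eq,
    Sign.GE: operator.ge,
    Sign.GT: operator.gt,
}


def build_assertion(sign: Sign, threshold: float) -> Callable[[float], bool]:
    """Turn (enum, constant) into a predicate applied to a computed metric."""
    compare = _SIGN_TO_OPERATOR[sign]
    return lambda metric: compare(metric, threshold)
```

For example, `build_assertion(Sign.GE, 0.9)` yields a function that accepts a metric such as a completeness ratio and reports whether it is at least 0.9. A message cannot carry a lambda, so encoding the operator as an enum and rebuilding the function on the server is one plausible way to bridge that gap.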

Getting Started

At the moment there are no packaged distributions of the server, and there are no pre-built PyPI packages for the clients. The only way to try the project is to build it from source.

Quick start

There is a simple Python script that performs the following tasks:

  1. Builds the server plugin;
  2. Downloads the required Spark version and all missing JAR files;
  3. Combines everything together;
  4. Runs the local Spark Connect Server with the Tsumugi plugin.

python dev/run-connect.py

Building the server component requires Maven and Java 11; installation instructions for both are in their official documentation: Maven and Java 11. The script itself requires Python 3.10 or higher. Once the server is running, you can connect to it and test it from a Python virtual environment. Setting up that environment requires the Poetry build tool; installation instructions are on the official Poetry website.

cd tsumugi-python
poetry env use python3.10 # any Python version from 3.10 up should work
poetry install --with dev # installs tsumugi as well as Jupyter and pyspark[connect]

Now you can run Jupyter and try the example notebook (tsumugi-python/examples/basic_example.ipynb).
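
Before opening the notebook, it can be useful to check that the server is reachable from Python. The sketch below assumes the server started by dev/run-connect.py listens on localhost at Spark Connect's default port 15002, which this README does not state. `connect_url` and `get_session` are illustrative helpers, not part of Tsumugi.

```python
def connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect URL; 15002 is Spark Connect's default port."""
    return f"sc://{host}:{port}"


def get_session(url: str):
    """Open a Spark Connect session against the given URL.

    pyspark is imported lazily so this module can be loaded even where
    pyspark[connect] is not installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()
```

With the server running, `get_session(connect_url()).range(3).show()` should print a small DataFrame if the connection works.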

Server

Building the server part requires Maven.

cd tsumugi-server
mvn clean package

Client

Installing the PySpark client requires poetry.

cd tsumugi-python
poetry env use python3.10 # 3.10+
poetry install

References

Tsumugi is built on top of the Deequ data quality tool: