UNDER ACTIVE DEVELOPMENT

NOTE: Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of Gauna with a Human intillegence and empathia. Like an original character of the Manga "Knights of Sidonia", this project aims to make a hybrid of very powerful but hard to learn and use Deequ Scala Library with a usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.).
The project's goal is to create a modern, SparkConnect-native wrapper for the elegant and high-performance Data Quality library, AWS Deequ. The PySpark Connect API is expected to be the primary focus, but the PySpark Classic API will also be maintained. Additionally, other thin clients such as connect-go or connect-rs will be supported.
While Amazon Deequ itself is well-maintained, the existing PyDeequ wrapper faces two main challenges:
py4j calls to the underlying Deequ Scala library. This approach presents two issues: a) It cannot be used in any SparkConnect environment. b) py4j is not well-suited for working with Scala code. For example, try creating an Option[Long] from Python using py4j to understand the complexity involved.py4j.reflection.PythonProxyHandler is not serializable. This problem is documented in this GitHub issue.proto3 definitions of basic Deequ Scala structures (such as Check, Analyzer, AnomalyDetectionStrategy, VerificationSuite, Constraint, etc.);protoc;From a high-level perspective, Tsumugi implements three main components:
The diagram below provides an overview of this concept:

The tsumugi-server/src/main/protobuf/ directory contains messages that define the main structures of the Deequ Scala library:
VerificationSuite: This is the top-level Deequ object. For more details, refer to suite.proto.Analyzer: This object is defined using oneof from a list of analyzers (including CountDistinct, Size, Compliance, etc.). For implementation details, see analyzers.proto.AnomalyDetection and its associated strategies. For more information, consult strategies.proto.Check: This is defined using Constraint, CheckLevel, and a description.Constraint: This is defined as an Analyzer (which computes a metric), a reference value, and a comparison sign.The file tsumugi-server/src/main/scala/org/apache/spark/sql/tsumugi/DeequConnectPlugin.scala contains the plugin code itself. It is designed to be very simple, consisting of approximately 50 lines of code. The plugin's functionality is straightforward: it checks if the message is a VerificationSuite, passes it into DeequSuiteBuilder, and then packages the result back into a Relation.
The file tsumugi-server/src/main/scala/io/mrpowers/tsumugi/DeequSuiteBuilder.scala contains code that creates Deequ objects from protobuf messages. It maps enums and constants to their corresponding Deequ counterparts, and generates com.amazon.deequ objects from the respective protobuf messages. The code ultimately returns a ready-to-use Deequ top-level structure.
At the moment there are no package distributions of the server part as well there is no pre-built PyPi packages for clients. The only way to play with the project at the moment is to build it from the source code.
There is a simple Python script that performs the following tasks:
python dev/run-connect.py
Building the server component requires Maven and Java 11. You can find installation instructions for both in their official documentation: Maven and Java 11. This script also requires Python 3.10 or higher. After installation, you can connect to the server and test it by creating a Python virtual environment. This process requires the poetry build tool. You can find instructions on how to install Poetry on their official website.
cd tsumugi-python
poetry env use python3.10 # any version bigger than 3.10 should work
poetry install --with dev # that install tsumugi as well as jupyter notebooks and pyspark[connect]
Now you can run jupyter and try the example notebook (tsumugi-python/examples/basic_example.ipynb): Notebook
Building the server part requires Maven.
cd tsumugi-server
mvn clean package
Installing the PySpark client requires poetry.
cd tsumugi-python
poetry env use python3.10 # 3.10+
poetry install
Tsumugi is built on top of Deequ Data Quality tool: