Xorq is a multi‑engine batch transformation framework built on Ibis, DataFusion and Arrow. It ships a compute catalog and a multi-engine manifest you can run across DuckDB, Snowflake, DataFusion, and more.
- Multi-engine manifest: A single, typed plan captured as a YAML artifact that can execute in DuckDB, Snowflake, DataFusion, etc.
- Deterministic builds & caching: Content hashes of the plan power reproducible runs and cheap replays.
- Lineage & Schemas: Compile-time schema checks and end-to-end column-level lineage.
- Compute catalog: Versioned registry that stores and operates on manifests (run, cache, diff, serve-unbound).
- Portable UDxFs: Arbitrary Python logic with schema-in/out contracts, portable via Arrow Flight.
- Scikit-learn integration: The model-fitting pipeline is captured in the predict-pipeline manifest, enabling portable batch scoring and model-training lineage.
Not an orchestrator. Use Xorq from Airflow, Dagster, GitHub Actions, etc.
Not streaming/online. Xorq focuses on batch, out-of-core transformations.
pip install xorq[examples]
xorq init -t penguins
Then follow the Quickstart Tutorial for a full walk-through using the Penguins dataset.
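If you want a feel for the API first: expressions are deferred, Ibis-style, and can hop between engines mid-pipeline. Here is a minimal sketch; it assumes the connect helpers, deferred_read_csv, and into_backend behave as in the current docs, and the CSV path is a placeholder, so check the tutorial for exact signatures.

import xorq.api as xo

con = xo.connect()                 # embedded DataFusion-backed engine (assumed default)
ddb = xo.duckdb.connect()          # in-process DuckDB engine

# deferred source read: nothing is scanned yet
t = xo.deferred_read_csv(path="penguins.csv", con=con)

expr = (
    t.filter(t.species == "Adelie")
     .select("species", "island", "body_mass_g")
     .into_backend(ddb)            # continue the same pipeline on DuckDB
)

df = expr.execute()                # work happens only here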
The manifest is a collection of YAML files that captures the expression graph, along with supporting files such as memtables serialized to disk.
Once you xorq build your pipeline, you get:
- expr.yaml: a reproducible expression graph
- deferred_reads.yaml: source metadata
- SQL and metadata files for inspection and CI
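For example, a typical loop looks roughly like this (the script name is a placeholder and flag spellings may vary by version):

❯ xorq build pipeline.py --expr-name expr    # writes the manifest to builds/{build-hash}
❯ xorq run builds/{build-hash}               # executes the built manifest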
Xorq makes it easy to bring your scikit-learn Pipeline and automatically convert it into a deferred Xorq expression.
import xorq.api as xo
from sklearn.pipeline import make_pipeline
from xorq.expr.ml.pipeline_lib import Pipeline

(train, test) = xo.test_train_splits(...)
sklearn_pipeline = make_pipeline(...)
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)

# still no work done: deferred fit expression
fitted_pipeline = xorq_pipeline.fit(train, features=features, target=target)
expr = fitted_pipeline.predict(test[features])  # predict is deferred too
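Fit and predict are both still deferred at this point; work happens only when the expression is materialized (or when the built manifest is run later):

predictions = expr.execute()   # triggers the deferred fit, then predict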
Here's a commented snippet from a YAML manifest:
predicted:
  op: ExprScalarUDF          # predict(...)
  kwargs:
    bill_length_mm: ...      # features
    bill_depth_mm: ...
    flipper_length_mm: ...
    body_mass_g: ...
  meta:
    __config__:
      computed_kwargs_expr:
        op: AggUDF           # fit(...)
        kwargs:
          bill_length_mm: ...
          bill_depth_mm: ...
          flipper_length_mm: ...
          body_mass_g: ...
          species: ...       # target
The YAML format serializes the expression graph and all of its nodes, including UDFs, which are stored as pickled entries.
Once an expression is built, we can catalog it and share it across teams.
The compute catalog is a versioned registry of compute manifests. It can be stored in Git, S3, GCS, or a database.
❯ xorq catalog add builds/{build-hash} --alias penguins-model
❯ xorq catalog ls
Aliases:
mortgage-test-predicted dbf90860-88b3-4b6c-830a-8518b3296e7c r1
Entries:
dbf90860-88b3-4b6c-830a-8518b3296e7c r1 52f987594254
You can then run, serve, or cache the catalog entry, including unbinding nodes that depend on external state (e.g. source tables). This is useful for serving a trained pipeline on new data.
We can rerun an expression with new inputs by replacing an arbitrary node in the expression, identified by its node-hash.
xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir penguins_example b2370a29c19df8e1e639c63252dacd0e
builds/7061dd65ff3c
: Your built expression manifest

--host localhost --port 8001
: Where to serve the UDxF from

--cache-dir penguins_example
: Directory for caching results

b2370a29c19df8e1e639c63252dacd0e
: The node-hash that represents the expression input to replace
To learn more about how to find the node hash, check out the Serve Unbound docs.
import xorq.api as xo
client = xo.flight.connect("localhost", 8001)
f = client.get_exchange("default") # currently all expressions get the default name in addition to their hash
new_expr = expr.pipe(f)  # pipe the new input data through the served pipeline
new_expr.execute()
Xorq uses Apache Arrow Flight RPC for zero-copy data transfer and leverages Ibis and DataFusion under the hood for efficient computation.
Xorq provides a generic catalog that can be used to build new workloads:
- Lineage‑preserving, multi-engine feature stores (offline, reproducible)
- Composable data products (ship datasets as compute artifacts)
- Governed sharing of compute (catalog entries as the contract between teams)
- ML/data pipeline development (deterministic builds)
Also great for:
- Generating SQL from high-level DSLs (e.g. Semantic Layers)
- Batch model scoring across engines (same expr, different backends)
- Cross‑warehouse migrations (portability via Ibis + UDxFs)
- Data CI (compile‑time schema/lineage checks in PRs)
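As a minimal sketch of the data-CI idea, assuming expr is imported from a hypothetical pipeline module and using the Ibis-level table API that xorq expressions inherit (the column name is illustrative):

from my_pipeline import expr   # hypothetical module defining the deferred expression

# the schema is known at build time, so CI can check it without running the pipeline
assert "predicted" in expr.columns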
Xorq is pre-1.0 and evolving fast. Expect breaking changes.