
Xorq Logo


Xorq is a multi‑engine batch transformation framework built on Ibis, DataFusion and Arrow. It ships a compute catalog and a multi-engine manifest you can run across DuckDB, Snowflake, DataFusion, and more.


What Xorq gives you

  • Multi-engine manifest: A single, typed plan captured as a YAML artifact that can execute in DuckDB, Snowflake, DataFusion, etc.
  • Deterministic builds & caching: Content hashes of the plan power reproducible runs and cheap replays.
  • Lineage & schemas: Compile-time schema checks and end-to-end column-level lineage.
  • Compute catalog: Versioned registry that stores and operates on manifests (run, cache, diff, serve-unbound).
  • Portable UDxFs: Arbitrary python logic with schema-in/out contracts portable via Arrow Flight.
  • Scikit-learn integration: The model-fitting step is captured in the predict pipeline's manifest, for portable batch scoring and model-training lineage.
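
For a feel of the multi-engine piece, here is a minimal sketch that defines work against DuckDB and hands the result to the default embedded engine. It assumes the xo.duckdb.connect, read_parquet, and into_backend APIs behave as in the current docs; the parquet path, table, and column names are illustrative.

import xorq.api as xo

con = xo.connect()         # default embedded engine
ddb = xo.duckdb.connect()  # DuckDB engine

# define the query against DuckDB...
t = ddb.read_parquet("penguins.parquet", table_name="penguins")
adelie = t.filter(t.species == "Adelie")

# ...then hand the result off to the embedded engine
expr = (
    adelie.group_by("island")
    .agg(avg_mass=adelie.body_mass_g.mean())
    .into_backend(con, "adelie_by_island")
)

expr.execute()  # nothing runs until here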

Not an orchestrator. Use Xorq from Airflow, Dagster, GitHub Actions, etc.

Not streaming/online. Xorq focuses on batch, out-of-core transformations.

Quickstart

pip install "xorq[examples]"
xorq init -t penguins

Then follow the Quickstart Tutorial for a full walk-through using the Penguins dataset.

From scikit-learn to multi-engine manifest

The manifest is a collection of YAML files that capture the expression graph, together with supporting files such as memtables serialized to disk.

Once you xorq build your pipeline, you get:

  • expr.yaml: a reproducible expression graph
  • deferred_reads.yaml: source metadata
  • SQL and metadata files for inspection and CI
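
Assuming your pipeline lives in a script such as pipeline.py and exposes an expression named expr (both names illustrative, with the --expr-name flag as in the current CLI help), the build step looks roughly like:

xorq build pipeline.py --expr-name expr
ls builds/{build-hash}   # expr.yaml, deferred_reads.yaml, SQL and metadata files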

Xorq makes it easy to bring your scikit-learn Pipeline and automatically convert it into a deferred Xorq expression.

import xorq.api as xo
from sklearn.pipeline import make_pipeline  # scikit-learn pipeline factory
from xorq.expr.ml.pipeline_lib import Pipeline


(train, test) = xo.test_train_splits(...)
sklearn_pipeline = make_pipeline(...)
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
# still no work done: deferred fit expression
fitted_pipeline = xorq_pipeline.fit(train, features=features, target=target)
expr = fitted_pipeline.predict(test[features])
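
Nothing has executed yet; both fit and predict stay deferred until you ask for results or build the expression into a manifest. A minimal sketch, continuing from the snippet above:

predictions = expr.execute()  # triggers the deferred fit + predict and returns a DataFrame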

Here's a commented snippet from a YAML manifest:

predicted:
  op: ExprScalarUDF            # predict(...)
  kwargs:
    bill_length_mm: ...        # features
    bill_depth_mm: ...
    flipper_length_mm: ...
    body_mass_g: ...
  meta:
    __config__:
      computed_kwargs_expr:
        op: AggUDF             # fit(...)
        kwargs:
          bill_length_mm: ...
          bill_depth_mm: ...
          flipper_length_mm: ...
          body_mass_g: ...
          species: ...         # target

The YAML format serializes the expression graph and all of its nodes, including UDFs, which are stored as pickled entries.

From manifest to catalog

Once an expression is built, we can catalog it and share it across teams.

The compute catalog is a versioned registry of compute manifests. It can be stored in Git, S3, GCS, or a database.

❯ xorq catalog add builds/{build-hash} --alias penguins-model
❯ xorq catalog ls
Aliases:
penguins-model dbf90860-88b3-4b6c-830a-8518b3296e7c    r1
Entries:
dbf90860-88b3-4b6c-830a-8518b3296e7c    r1      52f987594254

You can then run, serve, or cache the catalog entry, including unbinding nodes that depend on external state (e.g. source tables). This is useful for serving a trained pipeline on new data.

Serve the same expression with new inputs (serve-unbound)

We can rerun an expression with new inputs by replacing an arbitrary node in the expression, identified by its node-hash.

xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir penguins_example b2370a29c19df8e1e639c63252dacd0e
  • builds/7061dd65ff3c: Your built expression manifest
  • --host localhost --port 8001: Where to serve the UDxF from
  • --cache-dir penguins_example: Directory for caching results
  • b2370a29c19df8e1e639c63252dacd0e: The node-hash that represents the expression input to replace

To learn more about how to find the node-hash, check out the Serve Unbound guide.

Compose with the served expression:

import xorq.api as xo

client = xo.flight.connect("localhost", 8001)
# currently all expressions get the "default" name in addition to their hash
f = client.get_exchange("default")

# `expr` is any locally built expression whose output feeds the served pipeline
new_expr = expr.pipe(f)

new_expr.execute()

How Xorq works

Xorq uses Apache Arrow Flight RPC for zero-copy data transfer and leverages Ibis and DataFusion under the hood for efficient computation.
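
Because results stay in Arrow end to end, you can also stream record batches instead of materializing a pandas DataFrame. A minimal sketch, assuming the ibis-style to_pyarrow_batches method on xorq expressions and an illustrative parquet source:

import xorq.api as xo

con = xo.connect()
t = con.read_parquet("penguins.parquet", table_name="penguins")

# iterate Arrow RecordBatches directly, no pandas conversion
for batch in t.to_pyarrow_batches():
    print(batch.num_rows)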

Xorq Architecture

Use cases

A generic catalog that can be used to build new workloads:

  • Lineage‑preserving, multi-engine feature stores (offline, reproducible)
  • Composable data products (ship datasets as compute artifacts)
  • Governed sharing of compute (catalog entries as the contract between teams)
  • ML/data pipeline development (deterministic builds)

Also great for:

  • Generating SQL from high-level DSLs (e.g. Semantic Layers)
  • Batch model scoring across engines (same expr, different backends)
  • Cross‑warehouse migrations (portability via Ibis + UDxFs)
  • Data CI (compile‑time schema/lineage checks in PRs)

Learn More

Status

Xorq is pre-1.0 and evolving fast. Expect breaking changes.

Get Involved