
Conversation

@robtandy (Contributor) commented Feb 4, 2025

@andygrove, per our collaboration around this, here is the requested PR against main. Tagging @alamb here as well for his additional insight and perspective on query execution strategies, and to follow up on the presentation given at the DataFusion Community Meeting.

TL;DR

This PR contains a pull-based stage execution model where stages are connected using Arrow Flight. It's simpler and more performant than previous iterations of the streaming rewrite. It works on TPCH SF100 and below; larger scale factors have not been tested, though I think it should parallelize well at the expense of some execution overhead.

This may obsolete issue #55, address issue #46, and move toward #2

Evolution of this work

This represents the third iteration of attempts to stream data between stages. A brief accounting of those efforts might be useful to capture here:

  1. Try to use the Ray Object Store to stream batches between stages.
    This was a challenge for two reasons. The first was that, under high throughput of potentially small items, the object store added too much latency to query processing. The second, and the reason this approach was abandoned, was that creating a shuffle writer exec plan node that jumps from Rust to Python to interact with the object store, and potentially calls back into Rust, proved difficult to manage and reason about.

  2. The second attempt, which I have discussed on Discord, was to adopt Arrow Flight for streaming between stages and flip the execution of the stages from pull to push. The thinking was to have each stage eagerly execute and stream batches to an Exchange Actor, which would hold a set of channels (num stages x num partitions per stage) and allow subsequent stages to consume from them.

    The problems here were that the Exchange Actor was difficult to tune and created an additional Arrow Flight hop. Another challenge was that DataFusion is inherently a pull-based architecture, one that is very easy to compose and reason about. Flipping this was like swimming upstream and resulted in a lot of complications that DataFusion already elegantly manages.

    While it was interesting to consider push execution, and it may inform future work to consume from streams and materialize query results, ultimately it meant reimplementing a lot of things that DataFusion just makes easy.

  3. The third attempt, this iteration, is purely pull based and uses Arrow Flight to stream between stages. This turned out to produce the smallest amount of code, and code that was easy to work with and debug. It's as if you are executing DataFusion locally, but some of the execution nodes are connected with Arrow Flight instead of channels.

There is more that can be improved, both performance- and ergonomics-wise, but this is quite usable as it is and will allow others to review and collaborate.

For examples, see the main README, examples/tips.py, and tpch/tpc.py.

Execution Strategy

DataFusion Ray will optimize a query plan and then break it into stages. Those stages will be scheduled as Ray actors and make their partition streams available over Arrow Flight.

Connecting one stage to another means adding a RayStageReaderExec node within the stage where a connection is required; at execution time it fetches the stream using a FlightClient.
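To make the reader side concrete, here is a minimal, illustrative sketch of pulling one upstream partition over Arrow Flight with pyarrow. It is not the PR's RayStageReaderExec (which lives in Rust), and the ticket encoding and address below are assumptions:

```python
import json
import pyarrow.flight as flight

def read_partition(addr: str, stage_id: int, partition: int):
    # Illustrative only: fetch one upstream partition's record batches.
    client = flight.connect(addr)  # e.g. "grpc://worker-host:50051" (hypothetical address)
    # Hypothetical ticket encoding identifying the (stage, partition) to read.
    ticket = flight.Ticket(json.dumps({"stage_id": stage_id, "partition": partition}).encode())
    for chunk in client.do_get(ticket):
        yield chunk.data  # each chunk.data is a pyarrow.RecordBatch
```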

Tunables:

  • --isolate By default, DataFusion Ray will host each stage as its own actor. This flag (in the examples, and a parameter to the RayContext) tells DataFusion Ray to host each partition of each stage as its own actor instead. This dramatically increases parallelism, but it is a blunt instrument; a more finely tuned choice (like splitting a stage into x parts) would be more desirable and can be added in a future update. (See the sketch after this list for how this interacts with --concurrency.)
  • --concurrency controls the partition count for all stages and is planned using DataFusion before submitting control to Ray. This interacts with --isolate.
  • --batch-size controls the target (and also maximum) batch size exchanged between stages. Currently 8192 works for all queries in TPCH SF100. Going higher can produce Flight errors as we exceed the batch payload size.
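To make the interaction between --concurrency and --isolate concrete, here is a small illustrative helper (hypothetical, not code from this PR) that computes how many actors a query would use under the scheduling described above:

```python
def actors_for_query(num_stages: int, concurrency: int, isolate: bool) -> int:
    # Without --isolate: one actor per stage.
    # With --isolate: one actor per (stage, partition), i.e. stages * concurrency.
    return num_stages * concurrency if isolate else num_stages
```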

@robtandy (Contributor, Author) commented Feb 5, 2025

Note that the CI is broken, and I suggest we address it in subsequent work. There are plenty of housekeeping activities to do after this PR to get to a 0.1.0 release. We can file issues for those after this PR is reviewed and/or accepted.

@milenkovicm

One small suggestion related to --isolate: would it make sense to rename it to --actor-per-task or --actor-per-task-context, or something along those lines?
I believe it would align with common naming (Spark/Ballista), where partition handling within a stage is called a task; it would also need less explanation.

@robtandy (Contributor, Author) commented Feb 7, 2025

@milenkovicm , That's a good suggestion. While I wait for review, I'm revisiting this functionality to add finer grained control.

Something desirable is to be able to specify the number of workers for the query. If we did this, maybe --workers? This way you can have a predictable resource allocation, worker wise, and having more concurrent queries on the cluster might be more manageable.

I'm not sure yet of the best way to do this, because after you determine the number of stages, how would you best decide which ones are advantageous to split?

@robtandy (Contributor, Author) commented Feb 7, 2025

Update to this PR: added proper logging output, configured (for both Rust and Python) with the DATAFUSION_RAY_LOG_LEVEL environment variable. The env var and logging settings are propagated to Ray workers.
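For example, the variable can be set before starting the job; the accepted level names are an assumption based on typical Rust/Python logging conventions, not spelled out in this thread:

```python
import os

# Hypothetical usage: enable verbose logging for both the Rust and Python sides;
# per the comment above, the setting is propagated to the Ray workers.
os.environ["DATAFUSION_RAY_LOG_LEVEL"] = "debug"
```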

@milenkovicm

Something desirable is to be able to specify the number of workers for the query. If we did this, maybe --workers? This way you can have a predictable resource allocation, worker wise, and having more concurrent queries on the cluster might be more manageable.

Would --workers be set per session or for the overall Ray cluster?
I apologise if I give too many references to Ballista, but my brain is wired to that concept at the moment.
Overall cluster parallelism in Ballista is tied to the sum of executor parallelism. Specific session-context parallelism can be set as a session configuration parameter.

Would it make sense to use datafusion.execution.target_partitions to control --workers? Ballista had a ballista.shuffle.partitions session configuration option, which would set datafusion.execution.target_partitions to the desired task parallelism.

self.isolate_partitions = isolate_parititions
self.prefetch_buffer_size = prefetch_buffer_size

def stages(self):
Contributor

I have two comments here:

  • for properties we should use the @property decorator, see suggestion above
  • we can factor out the function __init_stages(self)

i.e.:

    @property
    def stages(self):
        if not self._stages:
            self._stages = self._init_stages()
        return self._stages


def optimized_logical_plan(self):
    return self.df.optimized_logical_plan()

Contributor

Suggested change
    @property
    def execution_plan(self):
        return self.df.execution_plan()

    @property
    def logical_plan(self):
        return self.df.logical_plan()

    @property
    def optimized_logical_plan(self):
        return self.df.optimized_logical_plan()

self.run_stages()

addrs = ray.get(self.coord.get_stage_addrs.remote())

Contributor

Suggested change
    if len(addrs[last_stage]) != 1:
        raise ValueError("Unexpected condition: more than one final stage")


def show(self) -> None:
    batches = self.collect()
    print(prettify(batches))
Contributor

Should this remain a print or be an info?

self.ctx.set(option, value)


@ray.remote(num_cpus=0)
Contributor

Why num_cpus=0 ?

Contributor Author

All stages need to be running in order for results to stream through the distributed plan. Setting num_cpus=0 ensures all actors will be scheduled by Ray. If we had a different value, Ray might wait for available resources, and at the moment we do not have a way of knowing that a stage is waiting to be scheduled.

I think future PRs will include better specification of the resources required per query.
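As a concrete illustration of this point (the class name is hypothetical, not the PR's actual actor), a zero-CPU actor declaration looks like this:

```python
import ray

@ray.remote(num_cpus=0)
class StageActor:
    # Requesting zero CPUs means Ray will always schedule the actor immediately,
    # so every stage of the distributed plan can be up before results stream.
    def __init__(self, stage_id: int):
        self.stage_id = stage_id
```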

shadow_partition,
)
except Exception as e:
error(
Contributor

How about if we use a custom type here?

    raise StageServiceError(self.stage_id, shadow) from e
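For illustration, such a custom exception might look like the following sketch (the name and fields are assumptions based on the suggestion above):

```python
class StageServiceError(Exception):
    """Hypothetical error type that carries stage context for easier debugging."""

    def __init__(self, stage_id: int, shadow_partition=None):
        super().__init__(
            f"StageService error in stage {stage_id}"
            f" (shadow partition {shadow_partition})"
        )
        self.stage_id = stage_id
        self.shadow_partition = shadow_partition
```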

from datafusion_ray._datafusion_ray_internal import StageService

self.shadow_partition = shadow_partition
shadow = (
Contributor

We can probably compute this in the exception handler, since it's only used for error handling?

@robtandy (Contributor, Author)

I've refactored the control of parallelizing execution to be more fine-grained. --partitions-per-worker controls the number of partitions hosted by an actor (which gets an entire Ray worker). So if the stage has 10 partitions, concurrency=10, and partitions-per-worker=4, we'll spin up 40 actors to satisfy the query.

Latest TPCH SF100 results, compared with local DataFusion using all cores on a 32-CPU machine with an NVMe drive:

{
    "engine": "datafusion-ray",
    "benchmark": "tpch",
    "settings": {
        "concurrency": 16,
        "batch_size": 8192,
        "prefetch_buffer_size": 0,
        "partitions_per_worker": 4
    },
    "data_path": "file:///data2/sf100/",
    "queries": {
        "1": 14.63571572303772,
        "2": 16.11984419822693,
        "3": 20.260254621505737,
        "4": 16.40132737159729,
        "5": 35.4002046585083,
        "6": 7.793532609939575,
        "7": 49.56708884239197,
        "8": 37.00137710571289,
        "9": 60.13660907745361,
        "10": 38.18756365776062,
        "11": 13.499444484710693,
        "12": 18.93906331062317,
        "13": 14.921503782272339,
        "14": 7.416260004043579,
        "15": 2.373532295227051,
        "16": 8.229618549346924,
        "17": 52.57597255706787,
        "18": 85.48271942138672,
        "19": 10.138697862625122,
        "20": 15.182426929473877,
        "21": 78.81208372116089,
        "22": 8.711960792541504
    },
    "local_queries": {
        "1": 14.912381172180176,
        "2": 10.478784322738647,
        "3": 8.960041284561157,
        "4": 3.8824241161346436,
        "5": 15.605360507965088,
        "6": 1.672469139099121,
        "7": 28.076196432113647,
        "8": 14.546991348266602,
        "9": 26.64270520210266,
        "10": 11.699812173843384,
        "11": 4.682126522064209,
        "12": 4.03217339515686,
        "13": 8.454285621643066,
        "14": 2.875070095062256,
        "15": 0.0013363361358642578,
        "16": 2.2461774349212646,
        "17": 26.58483576774597,
        "18": 61.40281629562378,
        "19": 5.444426774978638,
        "20": 7.112048625946045,
        "21": 36.257577657699585,
        "22": 2.517507314682007
    },
    "validated": {
        "1": true,
        "2": true,
        "3": true,
        "4": true,
        "5": true,
        "6": true,
        "7": true,
        "8": true,
        "9": true,
        "10": true,
        "11": true,
        "12": true,
        "13": true,
        "14": true,
        "15": true,
        "16": true,
        "17": true,
        "18": true,
        "19": true,
        "20": true,
        "21": true,
        "22": true
    }
}

@edmondop (Contributor)
Wow, these results are impressive. I am not sure I understand the consequence of the changes; are you allocating 4x the actors compared to before?

@robtandy (Contributor, Author)

Hey, thanks! The number of actors created per query is number_of_stages_in_query * concurrency / partitions_per_worker. Or, if partitions_per_worker is not set, the number of actors created is number_of_stages_in_query.
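Restating that as a small illustrative helper (hypothetical, not code from this PR), rounding up so that a stage whose partition count is not an exact multiple of partitions_per_worker still gets enough actors:

```python
import math

def actors_created(num_stages: int, concurrency: int, partitions_per_worker: int | None) -> int:
    if partitions_per_worker is None:
        # One actor per stage when --partitions-per-worker is not set.
        return num_stages
    # Otherwise each stage is split across ceil(concurrency / partitions_per_worker) actors.
    return num_stages * math.ceil(concurrency / partitions_per_worker)
```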

@robtandy (Contributor, Author)

Update to the PR to accommodate Python's changes to asyncio.wait that occurred in version 3.11 and above. Tested with 3.10 and 3.12.

We'll sort out more version testing when we sort out CI.
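For context, asyncio.wait() stopped accepting bare coroutines in Python 3.11, so code has to hand it Tasks or Futures explicitly. A hedged sketch of that kind of adjustment (not the exact code in this PR):

```python
import asyncio

async def wait_all(coros):
    # Wrap coroutines in tasks: required on Python 3.11+, harmless on 3.10.
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, _pending = await asyncio.wait(tasks)
    return [t.result() for t in done]
```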

@andygrove (Member)

Thanks @robtandy. There has been a lot of progress and I agree that it would be good to merge this.

The Kubernetes CI tests are failing, which is not surprising. Do we want to disable these tests?

 buildx failed with: ERROR: resolve : lstat k8s: no such file or directory

@milenkovicm @edmondop Do you have any objections to merging this PR as a checkpoint on progress?

@milenkovicm

Please do, @andygrove. My comment is minor; it should not be considered a blocker in any sense.

@robtandy (Contributor, Author) commented Feb 12, 2025

I don't know much yet about setting up CI; if you (@andygrove) or we can disable it, please do. My preference would be to land this, then sort out the housekeeping tasks we need to do to get the repo into a solid state with CI and more user-facing docs, and then release a version.

@andygrove merged commit 071802f into apache:main on Feb 13, 2025
0 of 2 checks passed
@alamb commented Feb 13, 2025

woohoo!

@alamb left a comment

This is very cool -- thank you @robtandy @edmondop and @andygrove ❤️

}

fn ipc_to_batch_helper(bytes: &[u8]) -> Result<RecordBatch, ArrowError> {
    let mut stream_reader = StreamReader::try_new_buffered(Cursor::new(bytes), None)?;

FYI, since you are reading from memory here anyway, I don't think buffering adds much extra value.


Also, once this is available
