Implement a way to preserve partitioning through `UnionExec` without losing ordering · Issue #10314 · apache/datafusion

Is your feature request related to a problem or challenge?

In some cases, the EnforceDistribution physical optimizer pass in DataFusion introduces an InterleaveExec to increase partitioning when data passes through a UnionExec:

datafusion/datafusion/core/src/physical_optimizer/enforce_distribution.rs

Lines 1196 to 1226 in 2231183

    plan = if plan.as_any().is::<UnionExec>()
        && !config.optimizer.prefer_existing_union
        && can_interleave(children_plans.iter())
    {
        // Add a special case for [`UnionExec`] since we want to "bubble up"
        // hash-partitioned data. So instead of
        //
        // Agg:
        //   Repartition (hash):
        //     Union:
        //       - Agg:
        //           Repartition (hash):
        //             Data
        //       - Agg:
        //           Repartition (hash):
        //             Data
        //
        // we can use:
        //
        // Agg:
        //   Interleave:
        //     - Agg:
        //         Repartition (hash):
        //           Data
        //     - Agg:
        //         Repartition (hash):
        //           Data
        Arc::new(InterleaveExec::try_new(children_plans)?)
    } else {
        plan.with_new_children(children_plans)?
    };

Here is what InterleaveExec does:

datafusion/datafusion/physical-plan/src/union.rs

Lines 286 to 317 in 4edbdd7

    /// Combines multiple input streams by interleaving them.
    ///
    /// This only works if all inputs have the same hash-partitioning.
    ///
    /// # Data Flow
    /// ```text
    /// +---------+
    /// |         |---+
    /// | Input 1 |   |
    /// |         |-------------+
    /// +---------+   |         |
    ///               |         |         +---------+
    ///               +------------------>|         |
    ///                 +---------------->| Combine |-->
    ///                 | +-------------->|         |
    ///                 | |     |         +---------+
    /// +---------+     | |     |
    /// |         |-----+ |     |
    /// | Input 2 |       |     |
    /// |         |---------------+
    /// +---------+       |     | |
    ///                   |     | |       +---------+
    ///                   |     +-------->|         |
    ///                   |       +------>| Combine |-->
    ///                   |         +---->|         |
    ///                   |         |     +---------+
    /// +---------+       |         |
    /// |         |-------+         |
    /// | Input 3 |                 |
    /// |         |-----------------+
    /// +---------+
    /// ```
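
To make the difference in partition shape concrete, here is a minimal standalone sketch. It does not use DataFusion at all: `Partitions`, `union_like`, and `interleave_like` are hypothetical names that model an operator's output as plain vectors. It assumes a union exposes every input partition as its own output partition, while an interleave combines the i-th partition of every input into the i-th output partition.

```rust
/// Toy stand-in for one operator's output: a list of partitions, each a list of rows.
/// (Hypothetical model, not a DataFusion type.)
type Partitions = Vec<Vec<i64>>;

/// Union-style combination: every input partition becomes its own output
/// partition, so the output partition count is the sum over all inputs.
fn union_like(inputs: &[Partitions]) -> Partitions {
    inputs.iter().flatten().cloned().collect()
}

/// Interleave-style combination: output partition `i` receives the rows of
/// partition `i` from every input. Returns `None` unless all inputs have the
/// same partition count (the real operator also needs identical hash partitioning).
fn interleave_like(inputs: &[Partitions]) -> Option<Partitions> {
    let n = inputs.first()?.len();
    if inputs.iter().any(|p| p.len() != n) {
        return None;
    }
    Some(
        (0..n)
            .map(|i| inputs.iter().flat_map(|p| p[i].iter().copied()).collect())
            .collect(),
    )
}

fn main() {
    // Two inputs, each split into two partitions by `value % 2`.
    let a: Partitions = vec![vec![2, 4], vec![1, 3]];
    let b: Partitions = vec![vec![6, 8], vec![5, 7]];
    let inputs = [a, b];

    // Four partitions: the "same key, same partition" property is gone.
    println!("union-like:      {:?}", union_like(&inputs));
    // Two partitions: even values stay together, odd values stay together.
    println!("interleave-like: {:?}", interleave_like(&inputs));
}
```

In this model the interleave keeps the partition count and the key-to-partition mapping, which is why no extra hash repartition is needed above it.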

However, this has the potential downside of destroying any pre-existing ordering, which is sometimes preferable to increasing / improving partitioning (e.g. see #10257 and the datafusion.optimizer.prefer_existing_sort setting).
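
The ordering problem can be reproduced with a small standalone sketch. It is a simplified model, not the operator's actual scheduling: `round_robin` is a hypothetical stand-in for "take whichever input has data next", and `is_ascending` is a helper introduced only for this illustration.

```rust
/// Returns true when `rows` is in non-decreasing order.
fn is_ascending(rows: &[i64]) -> bool {
    rows.windows(2).all(|w| w[0] <= w[1])
}

/// Combine streams by taking one row from each in turn. This is an assumed,
/// simplified model of interleaving in arrival order.
fn round_robin(streams: &[Vec<i64>]) -> Vec<i64> {
    let longest = streams.iter().map(Vec::len).max().unwrap_or(0);
    let mut out = Vec::new();
    for i in 0..longest {
        for s in streams {
            if let Some(v) = s.get(i) {
                out.push(*v);
            }
        }
    }
    out
}

fn main() {
    // Each input is sorted on its own.
    let left = vec![1, 2, 3];
    let right = vec![10, 20, 30];
    println!("left sorted:  {}", is_ascending(&left));
    println!("right sorted: {}", is_ascending(&right));

    // The combined stream mixes the two and is no longer sorted.
    let combined = round_robin(&[left, right]);
    println!("combined: {:?}, sorted: {}", combined, is_ascending(&combined));
}
```

Any downstream operator that relied on the inputs being sorted would then need a new sort.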

Describe the solution you'd like

I would like there to be some way to preserve the partitioning after a UnionExec without losing the ordering information, and then remove the prefer_existing_union flag.

Describe alternatives you've considered

One possibility is to add a preserve_order flag to InterleaveExec, the same way RepartitionExec has a preserve_order flag:

datafusion/datafusion/physical-plan/src/repartition/mod.rs

Lines 328 to 417 in 4edbdd7

    /// Maps `N` input partitions to `M` output partitions based on a
    /// [`Partitioning`] scheme.
    ///
    /// # Background
    ///
    /// DataFusion, like most other commercial systems, with the
    /// notable exception of DuckDB, uses the "Exchange Operator" based
    /// approach to parallelism which works well in practice given
    /// sufficient care in implementation.
    ///
    /// DataFusion's planner picks the target number of partitions and
    /// then `RepartionExec` redistributes [`RecordBatch`]es to that number
    /// of output partitions.
    ///
    /// For example, given `target_partitions=3` (trying to use 3 cores)
    /// but scanning an input with 2 partitions, `RepartitionExec` can be
    /// used to get 3 even streams of `RecordBatch`es
    ///
    ///
    ///```text
    ///        ▲                  ▲                  ▲
    ///        │                  │                  │
    ///        │                  │                  │
    ///        │                  │                  │
    ///┌───────────────┐  ┌───────────────┐  ┌───────────────┐
    ///│    GroupBy    │  │    GroupBy    │  │    GroupBy    │
    ///│   (Partial)   │  │   (Partial)   │  │   (Partial)   │
    ///└───────────────┘  └───────────────┘  └───────────────┘
    ///        ▲                  ▲                  ▲
    ///        └──────────────────┼──────────────────┘
    ///                           │
    ///              ┌─────────────────────────┐
    ///              │     RepartitionExec     │
    ///              │   (hash/round robin)    │
    ///              └─────────────────────────┘
    ///                         ▲   ▲
    ///             ┌───────────┘   └───────────┐
    ///             │                           │
    ///             │                           │
    ///        .─────────.                 .─────────.
    ///     ,─'           '─.           ,─'           '─.
    ///    ;      Input      :         ;      Input      :
    ///    :   Partition 0   ;         :   Partition 1   ;
    ///     ╲               ╱           ╲               ╱
    ///      '─.         ,─'             '─.         ,─'
    ///         `───────'                   `───────'
    ///```
    ///
    /// # Output Ordering
    ///
    /// If more than one stream is being repartitioned, the output will be some
    /// arbitrary interleaving (and thus unordered) unless
    /// [`Self::with_preserve_order`] specifies otherwise.
    ///
    /// # Footnote
    ///
    /// The "Exchange Operator" was first described in the 1989 paper
    /// [Encapsulation of parallelism in the Volcano query processing
    /// system
    /// Paper](https://w6113.github.io/files/papers/volcanoparallelism-89.pdf)
    /// which uses the term "Exchange" for the concept of repartitioning
    /// data across threads.
    #[derive(Debug)]
    pub struct RepartitionExec {
        /// Input execution plan
        input: Arc<dyn ExecutionPlan>,
        /// Partitioning scheme to use
        partitioning: Partitioning,
        /// Inner state that is initialized when the first output stream is created.
        state: LazyState,
        /// Execution metrics
        metrics: ExecutionPlanMetricsSet,
        /// Boolean flag to decide whether to preserve ordering. If true means
        /// `SortPreservingRepartitionExec`, false means `RepartitionExec`.
        preserve_order: bool,
        /// Cache holding plan properties like equivalences, output partitioning etc.
        cache: PlanProperties,
    }

    #[derive(Debug, Clone)]
    struct RepartitionMetrics {
        /// Time in nanos to execute child operator and fetch batches
        fetch_time: metrics::Time,
        /// Time in nanos to perform repartitioning
        repartition_time: metrics::Time,
        /// Time in nanos for sending resulting batches to channels.
        ///
        /// One metric per output partition.
        send_time: Vec<metrics::Time>,
    }
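
As a rough illustration of what such a flag could mean, the sketch below merges the i-th partition of every input by sort key instead of taking rows in arrival order. It is only a standalone model under stated assumptions: rows are plain `i64` values sorted ascending, and `merge_sorted` and `interleave_preserving_order` are hypothetical names, not DataFusion APIs.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of already-sorted streams into one sorted stream.
fn merge_sorted(streams: &[Vec<i64>]) -> Vec<i64> {
    // Min-heap of (next value, stream index, position within that stream).
    let mut heap = BinaryHeap::new();
    for (idx, s) in streams.iter().enumerate() {
        if let Some(&v) = s.first() {
            heap.push(Reverse((v, idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(streams.iter().map(Vec::len).sum());
    while let Some(Reverse((v, idx, pos))) = heap.pop() {
        out.push(v);
        // Refill the heap from the stream the popped value came from.
        if let Some(&next) = streams[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    out
}

/// Hypothetical order-preserving interleave: output partition `i` is the
/// sorted merge of partition `i` from every input. Returns `None` unless all
/// inputs have the same partition count.
fn interleave_preserving_order(inputs: &[Vec<Vec<i64>>]) -> Option<Vec<Vec<i64>>> {
    let n = inputs.first()?.len();
    if inputs.iter().any(|p| p.len() != n) {
        return None;
    }
    Some(
        (0..n)
            .map(|i| {
                let same_partition: Vec<Vec<i64>> =
                    inputs.iter().map(|p| p[i].clone()).collect();
                merge_sorted(&same_partition)
            })
            .collect(),
    )
}

fn main() {
    // Two inputs, each with two partitions (even / odd), each partition sorted.
    let a = vec![vec![2, 4], vec![1, 3]];
    let b = vec![vec![0, 6], vec![5, 7]];
    let merged = interleave_preserving_order(&[a, b]);
    println!("{:?}", merged);
}
```

The trade-off in this model is that a merge must wait for the next row from every input before it can emit, whereas a plain interleave can forward data as soon as any input produces it.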

Additional context

We encountered this while working on #10259. @mustafasrepo and @phillipleblanc pointed out that the config flag prefer_existing_union was effectively the same as prefer_existing_sort.
