
[Design] Recoverable Grouped Execution #12124

Closed
wenleix opened this issue Dec 21, 2018 · 3 comments
Labels: Roadmap (A top level roadmap item), stale

Comments

wenleix (Contributor) commented Dec 21, 2018

(A comment-friendly version of this design doc can be found at https://docs.google.com/document/d/1YhibgfzxtkjeJoYtty7R_AdBjTQdqf2nTtwQHk3kGLA/edit?usp=sharing)

Introduction

Grouped execution was introduced to Presto in #8951 to support the huge joins and aggregations that arise in ETL pipelines.

When the input tables are already partitioned on the join key or aggregation key (e.g., bucketed tables in Hive), Presto can process a subset (group) of the partitions at a time. This reduces the amount of memory needed to hold the hash table. Implementation-wise, a stage with grouped execution enabled is further split into many “lifespans”, where each lifespan corresponds to a table partition (e.g., a bucket in a Hive table). Only a subset of lifespans is processed simultaneously during stage execution, configured by concurrent-lifespans-per-task.

Besides lifting the memory limitation, grouped execution also enables partial query failure recovery -- each lifespan can be retried independently when the output of the stage is written to persistent storage. Note that output from failed tasks needs to be cleaned up, as discussed later.

Preliminaries

Consider the following query, where A and B are already bucketed on custkey:

SELECT ...
FROM A JOIN B
    USING (custkey)

Without grouped execution, the workers load all the data on the build side (table B):

[Figure 1: without grouped execution, every worker holds the entire build side in memory]

However, since A and B are already bucketed, Presto can schedule the query execution in a smarter way to reduce peak memory consumption: for each bucket i, joining bucket i of table A with bucket i of table B can be done independently. In the Presto engine, we call this computation unit a “lifespan”:

[Figure 2: grouped execution joins each bucket of A and B independently as a separate lifespan]

Besides scaling Presto to more memory-intensive queries, grouped execution also opens an opportunity for partial query failure recovery, since each lifespan in the query is independent and can be retried on its own, as illustrated in the following figure:

[Figure 3: a failed lifespan can be retried independently while the output of the other lifespans is kept]

Design

A prototype can be found at https://github.com/wenleix/presto/commits/tankbbb

1. Dynamic Lifespan Scheduling

This is done in #11693. Before this change, lifespans were pre-allocated to tasks in a fixed way, which doesn’t work for restarted lifespans since they need to be allocated to a different task.

Note that dynamic lifespan scheduling only works when there is no remote source in the stage. In the future, we also want to add support for remote sources with replicated distribution (i.e., for broadcast joins).
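
To make the mechanism concrete, here is a minimal, hypothetical sketch of dynamic lifespan scheduling: lifespans (buckets) sit in a shared queue and are handed to whichever task has free capacity, and a failed lifespan can simply be re-queued for a different task. The class and method names are illustrative only, not the actual Presto APIs.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of dynamic lifespan scheduling: buckets are pulled from
// a shared queue by whichever task has free capacity, instead of being
// statically pre-assigned to a fixed task.
public class DynamicLifespanScheduler
{
    private final Queue<Integer> pendingBuckets = new ArrayDeque<>();
    private final int concurrentLifespansPerTask;

    public DynamicLifespanScheduler(List<Integer> buckets, int concurrentLifespansPerTask)
    {
        this.pendingBuckets.addAll(buckets);
        this.concurrentLifespansPerTask = concurrentLifespansPerTask;
    }

    // Called whenever a task starts or frees up capacity.
    public synchronized void scheduleMoreLifespans(TaskHandle task)
    {
        while (task.runningLifespans() < concurrentLifespansPerTask && !pendingBuckets.isEmpty()) {
            task.startLifespan(pendingBuckets.poll());
        }
    }

    // Called when a lifespan fails: return its bucket to the queue so a
    // different (or restarted) task can pick it up later.
    public synchronized void rescheduleLifespan(int bucket)
    {
        pendingBuckets.add(bucket);
    }

    public interface TaskHandle
    {
        int runningLifespans();

        void startLifespan(int bucket);
    }
}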

2. Track Persistent Lifespans

In the current code, SqlTaskExecution reports a lifespan as completed when all of its drivers are done and there are no more inputs. However, it doesn’t check whether the lifespan’s output has been delivered. In the failure recovery scenario, we want to track such “persistent” lifespans, which don’t need to be restarted after a task failure.

There are two options to track whether a lifespan’s output is delivered:

  • Track on the sender side: make OutputBuffer aware of lifespans.
  • Track on the receiver side: make SqlStageExecution/SqlQueryExecution aware of when TableFinishOperator gets data from TableWriterOperator.

Another problem is how to clean up output from failed lifespans, which will be discussed in the next section.

A POC based on the first option can be found at wenleix@a6e89a1.
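
As an illustration of the first option, here is a minimal, hypothetical sketch of sender-side tracking: the output buffer counts pages enqueued versus acknowledged per lifespan, and a lifespan is considered persistent only when it has finished and all of its pages have been acknowledged by the receiver. The class and method names are made up for illustration and do not match the actual OutputBuffer API.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of sender-side delivery tracking for lifespans.
public class LifespanOutputTracker
{
    private final Map<Integer, AtomicLong> enqueuedPages = new ConcurrentHashMap<>();
    private final Map<Integer, AtomicLong> acknowledgedPages = new ConcurrentHashMap<>();
    private final Set<Integer> finishedLifespans = ConcurrentHashMap.newKeySet();

    public void recordEnqueue(int lifespan)
    {
        enqueuedPages.computeIfAbsent(lifespan, key -> new AtomicLong()).incrementAndGet();
    }

    public void recordAcknowledge(int lifespan)
    {
        acknowledgedPages.computeIfAbsent(lifespan, key -> new AtomicLong()).incrementAndGet();
    }

    public void recordLifespanFinished(int lifespan)
    {
        finishedLifespans.add(lifespan);
    }

    // A lifespan is "persistent" (needs no restart after a task failure) only
    // if it finished and every page it produced was acknowledged downstream.
    public boolean isPersistent(int lifespan)
    {
        if (!finishedLifespans.contains(lifespan)) {
            return false;
        }
        long enqueued = enqueuedPages.getOrDefault(lifespan, new AtomicLong()).get();
        long acknowledged = acknowledgedPages.getOrDefault(lifespan, new AtomicLong()).get();
        return acknowledged >= enqueued;
    }
}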

3. Support Lifespan Granularity Commit

Failed lifespans may generate temporary output, which cannot be included in the final output. Thus the ConnectorPageSink has to support partial commit.

We can support partial commit in HiveConnector in the following way:

  • The writer initially writes to files prefixed with .tmp.presto, which are ignored by compute engines (e.g., Hive/Spark/Presto).
  • When ConnectorPageSink.finish() gets called, the worker commits the partial output by removing the .tmp.presto prefix.
    • The final file name is determined by the stage id and lifespan -- so different attempts commit to the same file name.

This protocol should work with the STAGE_AND_MOVE_TO_TARGET_DIRECTORY write mode.

This commit protocol only requires the underlying filesystem to implement atomic rename. This is also the approach used by MapReduce/Spark.

Note that in case more than one task tries to do the rename (i.e., a task considered failed by the coordinator can still talk to the filesystem), we cannot determine which task finally wins the rename race. Thus, stats cannot be updated with recoverable grouped execution for now.
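
The essence of this commit protocol can be sketched as follows, assuming a filesystem with atomic rename. java.nio is used here purely for illustration; the Hive connector would go through its own filesystem abstraction, and the helper names are hypothetical. Only the prefix-and-rename idea comes from the design above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of lifespan-granularity commit: write to a ".tmp.presto"
// file, then commit by atomically renaming it to its final name. Because the
// final name is derived only from the stage id and lifespan, every attempt of
// the same lifespan commits to the same target file.
public final class LifespanCommitter
{
    private LifespanCommitter() {}

    public static Path temporaryFile(Path targetDirectory, int stageId, int lifespan)
    {
        return targetDirectory.resolve(".tmp.presto_" + stageId + "_" + lifespan);
    }

    public static Path targetFile(Path targetDirectory, int stageId, int lifespan)
    {
        return targetDirectory.resolve(stageId + "_" + lifespan);
    }

    // Called when ConnectorPageSink.finish() is invoked: commit the partial
    // output by removing the temporary prefix via an atomic rename. If two
    // attempts race, both rename to the same target, so the committed data is
    // the same regardless of which attempt wins; only per-task stats become
    // unreliable, as noted above. (Whether an existing target is replaced
    // under ATOMIC_MOVE is filesystem-specific.)
    public static void commit(Path targetDirectory, int stageId, int lifespan)
            throws IOException
    {
        Files.move(
                temporaryFile(targetDirectory, stageId, lifespan),
                targetFile(targetDirectory, stageId, lifespan),
                StandardCopyOption.ATOMIC_MOVE);
    }
}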

4. Allow Removing Remote Source from ExchangeClient

For a failed task, its receiver stage (i.e., the TableFinishOperator stage) needs to stop waiting for output from it.

Note that adding this support to n-to-n exchange is inherently hard, since sending the cancellation requests to every receiver task introduces too many coordinator-to-worker HTTP requests in a bursty manner (see the discussion in #11065). However, this is acceptable for the purpose of supporting recovery, since only n-to-1 exchange needs to be supported.

POC: wenleix@79363c1
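
The idea can be illustrated with a stripped-down, hypothetical exchange client that tracks the set of locations it still expects data from; removing a failed task's location lets the n-to-1 receiver finish without waiting for that task. The real ExchangeClient in Presto is asynchronous and considerably more involved; the names below are illustrative.

import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an exchange client that can drop a remote source
// (a failed task's output location) so the receiver stops waiting for it.
public class SimpleExchangeClient
{
    private final Set<URI> activeLocations = ConcurrentHashMap.newKeySet();
    private volatile boolean noMoreLocations;

    public void addLocation(URI location)
    {
        activeLocations.add(location);
    }

    // The new capability needed for recovery: forget about a failed task's
    // output. Any pages already buffered from that location would also have
    // to be discarded, which is omitted here.
    public void removeLocation(URI location)
    {
        activeLocations.remove(location);
    }

    public void noMoreLocations()
    {
        noMoreLocations = true;
    }

    // Finished when no new locations will be added and every remaining
    // (non-removed) location has been fully consumed.
    public boolean isFinished(Set<URI> fullyConsumedLocations)
    {
        return noMoreLocations && fullyConsumedLocations.containsAll(activeLocations);
    }
}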

5. Reschedule Splits for Failed Tasks

For restarted tasks, splits need to be rescheduled. A rewind() API will be added to the SplitSource interface. This is trivial for FixedSplitSource, since all splits are pre-loaded. It’s more involved for HiveSplitSource, since query execution starts while splits are still being discovered.

For now, we decided to keep all splits in memory for HiveSplitSource when running in recovery mode. Note that even with grouped execution alone, we are likely already buffering all the InternalHiveSplits, since split discovery in bucketed mode does not block “offer”:

@Override
public ListenableFuture<?> offer(OptionalInt bucketNumber, InternalHiveSplit connectorSplit)
{
    AsyncQueue<InternalHiveSplit> queue = queueFor(bucketNumber);
    queue.offer(connectorSplit);
    // Do not block "offer" when running split discovery in bucketed mode.
    // A limit is enforced on estimatedSplitSizeInBytes.
    return immediateFuture(null);
}

We have two implementation options:

  • Load all the InternalHiveSplits prior to executing the query.
  • Execute the query while discovering splits, but don’t drop scheduled splits.

POC based on the first option: wenleix@37ef4a4
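
For illustration, here is a hypothetical sketch of the rewind semantics for a split source that keeps all discovered splits in memory, grouped by bucket (lifespan), as the design proposes for recovery mode. The real SplitSource API is batched and asynchronous; only the rewind idea is taken from the design.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a rewindable split source for recovery mode: splits
// are kept in memory per bucket, so a failed lifespan's splits can be handed
// out again from the beginning to a restarted task.
public class RewindableSplitSource<T>
{
    private final Map<Integer, List<T>> splitsByBucket = new HashMap<>();
    private final Map<Integer, Integer> positionByBucket = new HashMap<>();

    public void addSplit(int bucket, T split)
    {
        splitsByBucket.computeIfAbsent(bucket, key -> new ArrayList<>()).add(split);
    }

    public List<T> getNextBatch(int bucket, int maxSize)
    {
        List<T> splits = splitsByBucket.getOrDefault(bucket, List.of());
        int position = positionByBucket.getOrDefault(bucket, 0);
        int end = Math.min(position + maxSize, splits.size());
        positionByBucket.put(bucket, end);
        return splits.subList(position, end);
    }

    // rewind(bucket): scheduled splits are not dropped, so the restarted task
    // receives exactly the same splits again.
    public void rewind(int bucket)
    {
        positionByBucket.put(bucket, 0);
    }
}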

6. Restart Failed Tasks

SqlStageExecution should coordinate the task restart, for example asking the StageScheduler to restart the task, removing the failed task as a source from the TableFinishOperator stage, etc.

POC: wenleix@0af6291
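
Putting the pieces together, the sketch below shows, using the hypothetical classes from the earlier sections, roughly what the coordination on a task failure could look like: drop the failed task's output from the exchange, then rewind and reschedule only the lifespans whose output was never fully delivered. This is an assumption-laden illustration, not the actual SqlStageExecution logic.

import java.net.URI;
import java.util.Set;

// Hypothetical sketch of restart coordination on task failure, reusing the
// illustrative classes from the earlier sections.
public class RecoveryCoordinator
{
    private final SimpleExchangeClient exchangeClient;       // section 4
    private final RewindableSplitSource<Object> splitSource; // section 5
    private final DynamicLifespanScheduler scheduler;        // section 1

    public RecoveryCoordinator(
            SimpleExchangeClient exchangeClient,
            RewindableSplitSource<Object> splitSource,
            DynamicLifespanScheduler scheduler)
    {
        this.exchangeClient = exchangeClient;
        this.splitSource = splitSource;
        this.scheduler = scheduler;
    }

    public void onTaskFailed(URI failedTaskOutputLocation, Set<Integer> lifespansOnTask, LifespanOutputTracker tracker)
    {
        // Stop the TableFinish stage from waiting for the failed task's output.
        exchangeClient.removeLocation(failedTaskOutputLocation);

        for (int lifespan : lifespansOnTask) {
            // Persistent lifespans already delivered their output; skip them.
            if (tracker.isPersistent(lifespan)) {
                continue;
            }
            // Hand out the same splits again and let a healthy task pick the
            // lifespan up; its output commits to the same file names, as
            // described in section 3.
            splitSource.rewind(lifespan);
            scheduler.rescheduleLifespan(lifespan);
        }
    }
}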

Loose Ends

  • The planner will mark a fragment as dynamically bucket-scheduled / recoverable.
    • That way, the query plan tells us whether a stage supports recovery.
    • Currently, dynamic bucket scheduling is decided at execution time.
  • Report “wasted CPU” due to partial retries.
    • Report CPU time per lifespan.
  • Add a batch mode (session property).
wenleix (Contributor, Author) commented Dec 21, 2018

cc @dain, @kokosing, @findepi

sopel39 (Contributor) commented Dec 27, 2018

Facebook and S3 currently use the DIRECT_TO_TARGET_NEW_DIRECTORY write mode, which doesn’t fit well into this model. We might want to introduce another write mode for this.

I was thinking that we could use some kind of "sub-transactions" for the failure cleanup. This would delegate the cleanup to the connector (e.g., we could use Hive 3 MVCC for S3). Then the coordinator calls either:

TransactionManager#asyncCommit
or
TransactionManager#asyncAbort

If either fails, then the whole query fails. This would also remove the need for cleanup in ConnectorMetadata.finishInsert/finishCreateTable.

@wenleix wenleix changed the title [Proposal] Recoverable Grouped Execution [Design] Recoverable Grouped Execution Feb 7, 2019
shixuan-fan added a commit to shixuan-fan/presto that referenced this issue Feb 21, 2019: "This is part of the effort of recoverable grouped execution (prestodb#12124) to prepare for rescheduling splits for failed tasks."
shixuan-fan added a commit that referenced this issue Feb 22, 2019: "This is part of the effort of recoverable grouped execution (#12124) to prepare for rescheduling splits for failed tasks."
@wenleix wenleix added the Roadmap A top level roadmap item label Jul 22, 2019

stale bot commented Mar 3, 2022

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

@stale stale bot added the stale label Mar 3, 2022