Commit 9bb9cfb

[SDP] PipelinesHandler and Pipeline Commands
1 parent dbc9c67 commit 9bb9cfb

5 files changed (+42 −25 lines)


docs/declarative-pipelines/GraphRegistrationContext.md

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ toDataflowGraph: DataflowGraph
 
 `toDataflowGraph` is used when:
 
-* `PipelinesHandler` ([Spark Connect]({{ book.spark_connect }})) is requested to [startRun](PipelinesHandler.md#startRun)
+* `PipelinesHandler` ([Spark Connect]({{ book.spark_connect }})) is requested to [start a pipeline run](PipelinesHandler.md#startRun)
 
 ## Tables { #tables }
 
docs/declarative-pipelines/PipelineExecution.md

Lines changed: 2 additions & 2 deletions
@@ -18,13 +18,13 @@
 runPipeline(): Unit
 ```
 
-`runPipeline` [starts the pipeline](#startPipeline) and requests the [PipelineExecution](PipelineUpdateContext.md#pipelineExecution) (of this [PipelineUpdateContext](#context)) to [wait for the execution to complete](#awaitCompletion).
+`runPipeline` [starts this pipeline](#startPipeline) and requests the [PipelineExecution](PipelineUpdateContext.md#pipelineExecution) (of this [PipelineUpdateContext](#context)) to [wait for the execution to complete](#awaitCompletion).
 
 ---
 
 `runPipeline` is used when:
 
-* `PipelinesHandler` is requested to [startRun](PipelinesHandler.md#startRun) (for [Spark Connect]({{ book.spark_connect }}))
+* `PipelinesHandler` is requested to [start a pipeline run](PipelinesHandler.md#startRun)
 
 ## Start Pipeline { #startPipeline }
 
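The start-then-await contract described above is easy to picture with a toy model. The following Scala sketch uses made-up names (`ToyPipelineExecution`, the `running` future) and is only an illustration of the contract, not Spark's actual `PipelineExecution`:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Toy stand-in for the start-then-await contract: startPipeline kicks off an
// asynchronous update, awaitCompletion blocks on it, runPipeline does both.
class ToyPipelineExecution {
  @volatile private var running: Option[Future[Unit]] = None

  def startPipeline(): Unit =
    running = Some(Future {
      println("resolving the dataflow graph and executing flows...")
    })

  def awaitCompletion(): Unit =
    running.foreach(Await.result(_, Duration.Inf))

  def runPipeline(): Unit = {
    startPipeline()    // kick off the update
    awaitCompletion()  // block until it completes
  }
}
```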
docs/declarative-pipelines/PipelinesHandler.md

Lines changed: 35 additions & 22 deletions
@@ -2,6 +2,8 @@
 
 `PipelinesHandler` is used to [handle pipeline commands](#handlePipelinesCommand) in [Spark Connect]({{ book.spark_connect }}) ([SparkConnectPlanner]({{ book.spark_connect }}/server/SparkConnectPlanner), precisely).
 
+`PipelinesHandler` acts as a bridge between the Python and SQL "frontends" and the Spark Connect server (where pipeline execution happens).
+
 ## Handle Pipelines Command { #handlePipelinesCommand }
 
 ```scala
@@ -14,14 +16,14 @@ handlePipelinesCommand(
 
 `handlePipelinesCommand` handles the given pipeline `cmd` command.
 
-| PipelineCommand | Description |
-|-----------------|-------------|
-| `CREATE_DATAFLOW_GRAPH` | [Creates a new Dataflow Graph](#createDataflowGraph) |
-| `DROP_DATAFLOW_GRAPH` | [Drops a pipeline](#DROP_DATAFLOW_GRAPH) |
-| `DEFINE_DATASET` | [Defines a dataset](#DEFINE_DATASET) |
-| `DEFINE_FLOW` | [Defines a flow](#DEFINE_FLOW) |
-| `START_RUN` | [Starts a pipeline](#START_RUN) |
-| `DEFINE_SQL_GRAPH_ELEMENTS` | [DEFINE_SQL_GRAPH_ELEMENTS](#DEFINE_SQL_GRAPH_ELEMENTS) |
+| PipelineCommand | Description | Initiator |
+|-----------------|-------------|-----------|
+| `CREATE_DATAFLOW_GRAPH` | [Creates a new dataflow graph](#CREATE_DATAFLOW_GRAPH) | [pyspark.pipelines.spark_connect_pipeline](#create_dataflow_graph) |
+| `DROP_DATAFLOW_GRAPH` | [Drops a pipeline](#DROP_DATAFLOW_GRAPH) | |
+| `DEFINE_DATASET` | [Defines a dataset](#DEFINE_DATASET) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_dataset) |
+| `DEFINE_FLOW` | [Defines a flow](#DEFINE_FLOW) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_flow) |
+| `START_RUN` | [Starts a pipeline run](#START_RUN) | [pyspark.pipelines.spark_connect_pipeline](#start_run) |
+| `DEFINE_SQL_GRAPH_ELEMENTS` | [DEFINE_SQL_GRAPH_ELEMENTS](#DEFINE_SQL_GRAPH_ELEMENTS) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_sql) |
 
 `handlePipelinesCommand` reports an `UnsupportedOperationException` for incorrect commands:
 
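Taken together, the table and the `UnsupportedOperationException` note describe one big match over the command type. Below is a toy model of that dispatch, with a simplified command ADT standing in for the real `proto.PipelineCommand` messages; all of it is an illustrative sketch, not the actual handler:

```scala
// Toy model of the dispatch in the table above. The ADT stands in for the
// real proto.PipelineCommand; handler bodies are print stubs.
sealed trait PipelineCommand
case class CreateDataflowGraph(defaultCatalog: String) extends PipelineCommand
case class DropDataflowGraph(graphId: String) extends PipelineCommand
case class DefineDataset(graphId: String, name: String) extends PipelineCommand
case class DefineFlow(graphId: String, name: String) extends PipelineCommand
case class DefineSqlGraphElements(graphId: String, sqlText: String) extends PipelineCommand
case class Unsupported(name: String) extends PipelineCommand
case class StartRun(graphId: String, dryRun: Boolean) extends PipelineCommand

def handlePipelinesCommand(cmd: PipelineCommand): Unit = cmd match {
  case CreateDataflowGraph(catalog) =>
    println(s"created a dataflow graph in $catalog (graph ID sent back)")
  case DropDataflowGraph(graphId) =>
    println(s"dropped pipeline $graphId")
  case DefineDataset(_, name) =>
    println(s"Define pipelines dataset cmd received: $name")
  case DefineFlow(_, name) =>
    println(s"Define pipelines flow cmd received: $name")
  case DefineSqlGraphElements(graphId, _) =>
    println(s"registered SQL graph elements with graph $graphId")
  case StartRun(graphId, dryRun) =>
    println(s"Start pipeline cmd received: graph=$graphId dryRun=$dryRun")
  case Unsupported(name) =>
    throw new UnsupportedOperationException(s"$name not supported")
}
```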
@@ -33,9 +35,13 @@ handlePipelinesCommand(
 
 `handlePipelinesCommand` is used when:
 
-* `SparkConnectPlanner` is requested to `handlePipelineCommand` (for `PIPELINE_COMMAND` command)
+* `SparkConnectPlanner` ([Spark Connect]({{ book.spark_connect }}/server/SparkConnectPlanner)) is requested to `handlePipelineCommand` (for `PIPELINE_COMMAND` command)
+
+### CREATE_DATAFLOW_GRAPH { #CREATE_DATAFLOW_GRAPH }
 
-### Define Dataset Command { #DEFINE_DATASET }
+`handlePipelinesCommand` [creates a dataflow graph](#createDataflowGraph) and sends the graph ID back.
+
+### DEFINE_DATASET { #DEFINE_DATASET }
 
 `handlePipelinesCommand` prints out the following INFO message to the logs:
 
@@ -45,7 +51,7 @@ Define pipelines dataset cmd received: [cmd]
 
 `handlePipelinesCommand` [defines a dataset](#defineDataset).
 
-### Define Flow Command { #DEFINE_FLOW }
+### DEFINE_FLOW { #DEFINE_FLOW }
 
 `handlePipelinesCommand` prints out the following INFO message to the logs:
 
@@ -55,7 +61,17 @@ Define pipelines flow cmd received: [cmd]
 
 `handlePipelinesCommand` [defines a flow](#defineFlow).
 
-### Start Pipeline { #startRun }
+### START_RUN { #START_RUN }
+
+`handlePipelinesCommand` prints out the following INFO message to the logs:
+
+```text
+Start pipeline cmd received: [cmd]
+```
+
+`handlePipelinesCommand` [starts a pipeline run](#startRun).
+
+## Start Pipeline Run { #startRun }
 
 ```scala
 startRun(
@@ -64,21 +80,18 @@ startRun(
 sessionHolder: SessionHolder): Unit
 ```
 
-`startRun` prints out the following INFO message to the logs:
-
-```text
-Start pipeline cmd received: [cmd]
-```
+??? note "`START_RUN` Pipeline Command"
+    `startRun` is used when `PipelinesHandler` is requested to handle the [proto.PipelineCommand.CommandTypeCase.START_RUN](#START_RUN) command.
 
 `startRun` finds the [GraphRegistrationContext](GraphRegistrationContext.md) by `dataflowGraphId` in the [DataflowGraphRegistry](DataflowGraphRegistry.md) (in the given `SessionHolder`).
 
 `startRun` creates a `PipelineEventSender` to send pipeline events back to the Spark Connect client (_Python pipeline runtime_).
 
 `startRun` creates a [PipelineUpdateContextImpl](PipelineUpdateContextImpl.md) (with the `PipelineEventSender`).
 
-In the end, `startRun` requests the `PipelineUpdateContextImpl` for the [PipelineExecution](PipelineExecution.md) to [runPipeline](PipelineExecution.md#runPipeline) or [dryRunPipeline](PipelineExecution.md#dryRunPipeline) for `dry-run` or `run` command, respectively.
+In the end, `startRun` requests the `PipelineUpdateContextImpl` for the [PipelineExecution](PipelineUpdateContext.md#pipelineExecution) to [run a pipeline](PipelineExecution.md#runPipeline) or [dry-run a pipeline](PipelineExecution.md#dryRunPipeline) for the `run` or `dry-run` command, respectively.
 
-### Create Dataflow Graph { #createDataflowGraph }
+## Create Dataflow Graph { #createDataflowGraph }
 
 ```scala
 createDataflowGraph(
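The `startRun` steps documented in the hunk above (look up the graph, wire an event sender into an update context, then run or dry-run) can be sketched as follows. Every type here is a simplified stand-in for the real Spark Connect class of the same name:

```scala
// Toy walkthrough of the startRun steps; all types are illustrative stubs.
case class GraphRegistrationContext(dataflowGraphId: String)

class PipelineEventSender {
  // The real sender ships pipeline events back to the Spark Connect client
  // (the Python pipeline runtime); this stub just prints them.
  def send(event: String): Unit = println(event)
}

class PipelineExecution(sender: PipelineEventSender) {
  def runPipeline(): Unit = sender.send("pipeline run complete")
  def dryRunPipeline(): Unit = sender.send("dry run: graph validated, nothing executed")
}

def startRun(
    dataflowGraphId: String,
    dryRun: Boolean,
    registry: Map[String, GraphRegistrationContext]): Unit = {
  // Step 1: find the GraphRegistrationContext by its dataflow graph ID.
  val graph = registry.getOrElse(
    dataflowGraphId, sys.error(s"Dataflow graph not found: $dataflowGraphId"))
  // Steps 2 and 3: create the event sender and the update context around it.
  val sender = new PipelineEventSender
  val execution = new PipelineExecution(sender)
  // Step 4: dry-run or run, depending on the command.
  sender.send(s"starting run for graph ${graph.dataflowGraphId}")
  if (dryRun) execution.dryRunPipeline() else execution.runPipeline()
}
```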
@@ -90,7 +103,7 @@ createDataflowGraph(
 
 `createDataflowGraph` returns the ID of the created dataflow graph.
 
-### defineSqlGraphElements { #defineSqlGraphElements }
+## defineSqlGraphElements { #defineSqlGraphElements }
 
 ```scala
 defineSqlGraphElements(
@@ -100,7 +113,7 @@ defineSqlGraphElements(
 
 `defineSqlGraphElements`...FIXME
 
-### Define Dataset (Table or View) { #defineDataset }
+## Define Dataset (Table or View) { #defineDataset }
 
 ```scala
 defineDataset(
@@ -123,7 +136,7 @@ For unknown types, `defineDataset` reports an `IllegalArgumentException`:
 Unknown dataset type: [type]
 ```
 
-### Define Flow { #defineFlow }
+## Define Flow { #defineFlow }
 
 ```scala
 defineFlow(
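The "Define Dataset (Table or View)" section above ends with an `IllegalArgumentException` for unknown dataset types. A toy sketch of that type dispatch follows; the three dataset-type names are assumptions, while the error message is taken verbatim from the docs:

```scala
// Toy sketch of defineDataset's dispatch on the dataset type.
sealed trait DatasetType
case object MaterializedView extends DatasetType
case object Table extends DatasetType
case object TemporaryView extends DatasetType
case class Unknown(tpe: String) extends DatasetType

def defineDataset(name: String, datasetType: DatasetType): Unit =
  datasetType match {
    case MaterializedView | Table =>
      println(s"registered table $name with the dataflow graph")
    case TemporaryView =>
      println(s"registered view $name with the dataflow graph")
    case Unknown(tpe) =>
      throw new IllegalArgumentException(s"Unknown dataset type: $tpe")
  }
```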
docs/declarative-pipelines/UnresolvedFlow.md

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+# UnresolvedFlow
+
+`UnresolvedFlow` is...FIXME

docs/declarative-pipelines/index.md

Lines changed: 1 addition & 0 deletions
@@ -152,6 +152,7 @@ Pipelines elements are defined in SQL files included as `definitions` in a [pipe
 Supported SQL statements:
 
 * [CREATE FLOW AS INSERT INTO BY NAME](../sql/SparkSqlAstBuilder.md#visitCreatePipelineInsertIntoFlow)
+* ...
 
 ## Demo: Create Virtual Environment for Python Client
 
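For the one statement listed above, a hypothetical example of the shape a `CREATE FLOW ... AS INSERT INTO ... BY NAME` statement could take when submitted through a Spark session. The exact grammar lives in `SparkSqlAstBuilder#visitCreatePipelineInsertIntoFlow`; the flow, table, and source names here are made up, and `spark` is assumed to be a `SparkSession` in scope:

```scala
// Hypothetical CREATE FLOW statement; names and query are illustrative only.
spark.sql(
  """CREATE FLOW raw_to_bronze AS
    |INSERT INTO bronze_events BY NAME
    |SELECT * FROM raw_events
    |""".stripMargin)
```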