docs/declarative-pipelines/SparkConnectGraphElementRegistry.md

`SparkConnectGraphElementRegistry` is a [GraphElementRegistry](GraphElementRegistry.md).

`SparkConnectGraphElementRegistry` acts as a communication bridge between Spark Declarative Pipelines' Python execution environment and Spark Connect Server (with [PipelinesHandler](PipelinesHandler.md)).

## Creating Instance

`SparkConnectGraphElementRegistry` takes the following to be created:

## register_dataset { #register_dataset }

`register_dataset` makes sure that the given `Dataset` is either a `MaterializedView`, a `StreamingTable` or a `TemporaryView`.

`register_dataset` requests this [SparkConnectClient](#spark) to [execute](#execute_command) a `PipelineCommand.DefineDataset` command.

!!! note "PipelinesHandler"
    `DefineDataset` commands are handled by [PipelinesHandler](PipelinesHandler.md#defineDataset) on Spark Connect Server.
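
The exchange is a plain Spark Connect round trip: the registry wraps the dataset metadata in a `PipelineCommand` protobuf message and hands it to the client. The sketch below is illustrative only, not the pyspark source; the field and enum names (`define_dataset`, `dataset_name`, `dataset_type`, `DatasetType`) and the graph id value are assumptions.

```python
# Illustrative sketch of what register_dataset sends; the proto field and enum
# names below are assumptions, not the exact pyspark API.
import pyspark.sql.connect.proto as pb2
from pyspark.sql.connect.session import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

command = pb2.Command()
define = command.pipeline_command.define_dataset           # assumed field path
define.dataflow_graph_id = "<dataflow-graph-id>"           # id of the graph being defined
define.dataset_name = "orders_by_day"                      # assumed field name
define.dataset_type = pb2.DatasetType.MATERIALIZED_VIEW    # assumed enum value

# execute_command ships the protobuf to Spark Connect Server, where
# PipelinesHandler turns it into an entry in the dataflow graph.
spark.client.execute_command(command)
```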

## register_flow { #register_flow }

??? note "GraphElementRegistry"

    ```py
    register_flow(
      self,
      flow: Flow
    ) -> None
    ```

    `register_flow` is part of the [GraphElementRegistry](GraphElementRegistry.md#register_flow) abstraction.

`register_flow` requests this [SparkConnectClient](#spark) to [execute](#execute_command) a `PipelineCommand.DefineFlow` command.

!!! note "PipelinesHandler"
    `DefineFlow` commands are handled by [PipelinesHandler](PipelinesHandler.md#defineFlow) on Spark Connect Server.

[CreateFlowCommand](../logical-operators/CreateFlowCommand.md) logical commands are handled by `CreateFlowHandler`.

A flow name must be a single-part name (that is resolved against the current pipelines catalog and database).

The [flowOperation](../logical-operators/CreateFlowCommand.md#flowOperation) of a [CreateFlowCommand](../logical-operators/CreateFlowCommand.md) command must be an [InsertIntoStatement](../logical-operators/InsertIntoStatement.md).

!!! note
    Only `INSERT INTO ... BY NAME` flows are supported in [Spark Declarative Pipelines](index.md).

    `INSERT OVERWRITE` flows are not supported.

    `IF NOT EXISTS` is not supported for flows.

    Neither a partition spec nor a user-specified schema can be specified.

In the end, `CreateFlowHandler` requests this [GraphRegistrationContext](#graphRegistrationContext) to [register](GraphRegistrationContext.md#registerFlow) an [UnresolvedFlow](UnresolvedFlow.md).

docs/declarative-pipelines/index.md

# Declarative Pipelines

**Spark Declarative Pipelines (SDP)** is a declarative framework for building ETL pipelines on Apache Spark using [Python](#python) or [SQL](#sql).

??? warning "Apache Spark 4.1.0-SNAPSHOT"
    The Declarative Pipelines framework is only available in the development branch of Apache Spark 4.1.0-SNAPSHOT.

A Declarative Pipelines project is configured using a [pipeline specification file](#pipeline-specification-file) and executed with the [spark-pipelines](#spark-pipelines) shell script.

In the pipeline specification file, Declarative Pipelines developers include definitions of tables, views and flows (transformations) in Python and SQL. An SDP project can use both languages simultaneously.

Declarative Pipelines uses [Python decorators](#python-decorators) to describe tables, views and flows, declaratively.

Streaming flows are backed by streaming sources, and batch flows are backed by batch sources.

[DataflowGraph](DataflowGraph.md) is the core graph structure in Declarative Pipelines.

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

## Pipeline Specification File

The heart of a Declarative Pipelines project is a pipeline specification file (in YAML format).

The `spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).

## Dataset Types

Declarative Pipelines supports the following dataset types:

* **Materialized Views** that are published to a catalog
* **Tables** that are published to a catalog
* **Views** that are not published to a catalog
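
Which bucket a dataset falls into is decided by how it is defined. Below is a minimal, illustrative sketch using the [Python decorators](#python) covered later on this page; the decorator arguments are omitted, and the table names, query bodies and the active `spark` session are assumptions, not part of the original docs.

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()

# Materialized view: published to a catalog, recomputed from its defining query.
@dp.materialized_view
def orders_by_day() -> DataFrame:
    return spark.read.table("orders").groupBy("order_date").count()

# (Streaming) table: published to a catalog, backed here by a streaming source.
@dp.table
def orders() -> DataFrame:
    return spark.readStream.table("raw_orders")
```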

## Spark Connect Only { #spark-connect }

Declarative Pipelines currently only supports Spark Connect.

```console
$ ./bin/spark-pipelines --conf spark.api.mode=xxx
...
25/08/03 12:33:57 INFO SparkPipelines: --spark.api.mode must be 'connect'. Declarative Pipelines currently only supports Spark Connect.
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.SparkPipelines$$anon$1.handle(SparkPipelines.scala:73)
    at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:169)
    at org.apache.spark.deploy.SparkPipelines$$anon$1.<init>(SparkPipelines.scala:58)
    at org.apache.spark.deploy.SparkPipelines$.splitArgs(SparkPipelines.scala:57)
    at org.apache.spark.deploy.SparkPipelines$.constructSparkSubmitArgs(SparkPipelines.scala:43)
    at org.apache.spark.deploy.SparkPipelines$.main(SparkPipelines.scala:37)
    at org.apache.spark.deploy.SparkPipelines.main(SparkPipelines.scala)
```

## Python

### Python Import Alias Convention

As of this [Commit 6ab0df9]({{ spark.commit }}/6ab0df9287c5a9ce49769612c2bb0a1daab83bee), the convention to alias the import of Declarative Pipelines in Python is `dp` (from `sdp`).
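
That is, in Python pipeline definition files:

```python
from pyspark import pipelines as dp
```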

### @dp.temporary_view { #temporary_view }

[Registers](GraphElementRegistry.md#register_dataset) a `TemporaryView` dataset and a [Flow](Flow.md) in the [GraphElementRegistry](GraphElementRegistry.md#register_flow).
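
A minimal, illustrative sketch of a temporary view definition; the names and the query are assumptions, and the decorator is shown without arguments.

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()

# A temporary view is not published to a catalog; registering it also
# registers the flow that computes it.
@dp.temporary_view
def recent_orders() -> DataFrame:
    return spark.read.table("orders").where("order_date >= current_date() - 7")
```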

## SQL

Spark Declarative Pipelines supports the SQL language for defining pipelines.

Pipeline elements are defined in SQL files included as `definitions` in a [pipeline specification file](#pipeline-specification-file).

[SqlGraphRegistrationContext](SqlGraphRegistrationContext.md) is used on Spark Connect Server to handle SQL statements (from SQL definition files and [Python decorators](#python-decorators)).

Supported SQL statements:

* [CREATE FLOW AS INSERT INTO BY NAME](../sql/SparkSqlAstBuilder.md#visitCreatePipelineInsertIntoFlow)

## Demo: Create Virtual Environment for Python Client

`CreateFlowCommand` is a `BinaryCommand` logical operator that represents [CREATE FLOW ... AS INSERT INTO ... BY NAME](../sql/SparkSqlAstBuilder.md#visitCreatePipelineInsertIntoFlow) SQL statements in [Spark Declarative Pipelines](../declarative-pipelines/index.md).

`CreateFlowCommand` is handled by [SqlGraphRegistrationContext](../declarative-pipelines/SqlGraphRegistrationContext.md#CreateFlowCommand).

The `Pipelines` execution planning strategy is used to prevent direct execution of Spark Declarative Pipelines' SQL statements.

## Creating Instance

`CreateFlowCommand` takes the following to be created:

* <span id="name"> Name (`UnresolvedIdentifier` leaf logical operator)

`CreateFlowCommand` is created when:

* `SparkSqlAstBuilder` is requested to [parse a CREATE FLOW AS INSERT INTO BY NAME SQL statement](../sql/SparkSqlAstBuilder.md#visitCreatePipelineInsertIntoFlow)

Creates a [CreateTempViewUsing](../logical-operators/CreateTempViewUsing.md) logical operator for `CREATE TEMPORARY VIEW USING` or falls back to [AstBuilder](AstBuilder.md#visitCreateTable) (to create either a [CreateTableAsSelect](../logical-operators/CreateTableAsSelect.md) or a [CreateTable](../logical-operators/CreateTable.md)).