docs/declarative-pipelines/SparkConnectGraphElementRegistry.md

`SparkConnectGraphElementRegistry` is a [GraphElementRegistry](GraphElementRegistry.md).

`SparkConnectGraphElementRegistry` acts as a communication bridge between Spark Declarative Pipelines' Python execution environment and Spark Connect Server (with [PipelinesHandler](PipelinesHandler.md)).

## Creating Instance

`SparkConnectGraphElementRegistry` takes the following to be created:

## register_dataset { #register_dataset }

`register_dataset` makes sure that the given `Dataset` is either a `MaterializedView`, a `StreamingTable` or a `TemporaryView`.

`register_dataset` requests this [SparkConnectClient](#spark) to [execute](#execute_command) a `PipelineCommand.DefineDataset` command.

!!! note "PipelinesHandler"
    `DefineDataset` commands are handled by [PipelinesHandler](PipelinesHandler.md#defineDataset) on Spark Connect Server.
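
The exchange is a plain Spark Connect round trip: the registry wraps the dataset metadata in a `PipelineCommand` protobuf message and hands it to the client. The sketch below is illustrative only, not the pyspark source; the field and enum names (`define_dataset`, `dataset_name`, `dataset_type`, `DatasetType`) and the graph id value are assumptions.

```python
# Illustrative sketch of what register_dataset sends; the proto field and enum
# names below are assumptions, not the exact pyspark API.
import pyspark.sql.connect.proto as pb2
from pyspark.sql.connect.session import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

command = pb2.Command()
define = command.pipeline_command.define_dataset           # assumed field path
define.dataflow_graph_id = "<dataflow-graph-id>"           # id of the graph being defined
define.dataset_name = "orders_by_day"                      # assumed field name
define.dataset_type = pb2.DatasetType.MATERIALIZED_VIEW    # assumed enum value

# execute_command ships the protobuf to Spark Connect Server, where
# PipelinesHandler turns it into an entry in the dataflow graph.
spark.client.execute_command(command)
```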

## register_flow { #register_flow }

??? note "GraphElementRegistry"

    ```py
    register_flow(
      self,
      flow: Flow
    ) -> None
    ```

    `register_flow` is part of the [GraphElementRegistry](GraphElementRegistry.md#register_flow) abstraction.

`register_flow` requests this [SparkConnectClient](#spark) to [execute](#execute_command) a `PipelineCommand.DefineFlow` command.

!!! note "PipelinesHandler"
    `DefineFlow` commands are handled by [PipelinesHandler](PipelinesHandler.md#defineFlow) on Spark Connect Server.

[CreateFlowCommand](../logical-operators/CreateFlowCommand.md) logical commands are handled by `CreateFlowHandler`.

A flow name must be a single-part name (that is resolved against the current pipelines catalog and database).

The [flowOperation](../logical-operators/CreateFlowCommand.md#flowOperation) of a [CreateFlowCommand](../logical-operators/CreateFlowCommand.md) command must be an [InsertIntoStatement](../logical-operators/InsertIntoStatement.md).

!!! note
    Only `INSERT INTO ... BY NAME` flows are supported in [Spark Declarative Pipelines](index.md).

    `INSERT OVERWRITE` flows are not supported.

    `IF NOT EXISTS` is not supported for flows.

    Neither a partition spec nor a user-specified schema can be specified.

In the end, `CreateFlowHandler` requests this [GraphRegistrationContext](#graphRegistrationContext) to [register](GraphRegistrationContext.md#registerFlow) an [UnresolvedFlow](UnresolvedFlow.md).

docs/declarative-pipelines/index.md

# Declarative Pipelines

**Spark Declarative Pipelines (SDP)** is a declarative framework for building ETL pipelines on Apache Spark using [Python](#python) or [SQL](#sql).

??? warning "Apache Spark 4.1.0-SNAPSHOT"
    The Declarative Pipelines framework is only available in the development branch of Apache Spark 4.1.0-SNAPSHOT.

A Declarative Pipelines project is configured using a [pipeline specification file](#pipeline-specification-file) and executed with the [spark-pipelines](#spark-pipelines) shell script.

In the pipeline specification file, Declarative Pipelines developers include definitions of tables, views and flows (transformations) in Python and SQL. An SDP project can use both languages simultaneously.

Declarative Pipelines uses [Python decorators](#python-decorators) to describe tables, views and flows, declaratively.

Streaming flows are backed by streaming sources, and batch flows are backed by batch sources.

[DataflowGraph](DataflowGraph.md) is the core graph structure in Declarative Pipelines.

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

## Pipeline Specification File

The heart of a Declarative Pipelines project is a pipeline specification file (in YAML format).

The `spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).

## Dataset Types

Declarative Pipelines supports the following dataset types:

* **Materialized Views** that are published to a catalog
* **Tables** that are published to a catalog
* **Views** that are not published to a catalog
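
Which bucket a dataset falls into is decided by how it is defined. Below is a minimal, illustrative sketch using the [Python decorators](#python) covered later on this page; the decorator arguments are omitted, and the table names, query bodies and the active `spark` session are assumptions, not part of the original docs.

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()

# Materialized view: published to a catalog, recomputed from its defining query.
@dp.materialized_view
def orders_by_day() -> DataFrame:
    return spark.read.table("orders").groupBy("order_date").count()

# (Streaming) table: published to a catalog, backed here by a streaming source.
@dp.table
def orders() -> DataFrame:
    return spark.readStream.table("raw_orders")
```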

## Spark Connect Only { #spark-connect }

Declarative Pipelines currently only supports Spark Connect.

```console
$ ./bin/spark-pipelines --conf spark.api.mode=xxx
...
25/08/03 12:33:57 INFO SparkPipelines: --spark.api.mode must be 'connect'. Declarative Pipelines currently only supports Spark Connect.
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.SparkPipelines$$anon$1.handle(SparkPipelines.scala:73)
    at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:169)
    at org.apache.spark.deploy.SparkPipelines$$anon$1.<init>(SparkPipelines.scala:58)
    at org.apache.spark.deploy.SparkPipelines$.splitArgs(SparkPipelines.scala:57)
    at org.apache.spark.deploy.SparkPipelines$.constructSparkSubmitArgs(SparkPipelines.scala:43)
    at org.apache.spark.deploy.SparkPipelines$.main(SparkPipelines.scala:37)
    at org.apache.spark.deploy.SparkPipelines.main(SparkPipelines.scala)
```

## Python

### Python Import Alias Convention

As of this [Commit 6ab0df9]({{ spark.commit }}/6ab0df9287c5a9ce49769612c2bb0a1daab83bee), the convention to alias the import of Declarative Pipelines in Python is `dp` (from `sdp`).
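
That is, in Python pipeline definition files:

```python
from pyspark import pipelines as dp
```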

### @dp.temporary_view { #temporary_view }

[Registers](GraphElementRegistry.md#register_dataset) a `TemporaryView` dataset and a [Flow](Flow.md) in the [GraphElementRegistry](GraphElementRegistry.md#register_flow).
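
A minimal, illustrative sketch of a temporary view definition; the names and the query are assumptions, and the decorator is shown without arguments.

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()

# A temporary view is not published to a catalog; registering it also
# registers the flow that computes it.
@dp.temporary_view
def recent_orders() -> DataFrame:
    return spark.read.table("orders").where("order_date >= current_date() - 7")
```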

## SQL

Spark Declarative Pipelines supports the SQL language for defining pipelines.

Pipeline elements are defined in SQL files included as `definitions` in a [pipeline specification file](#pipeline-specification-file).

[SqlGraphRegistrationContext](SqlGraphRegistrationContext.md) is used on Spark Connect Server to handle SQL statements (from SQL definition files and [Python decorators](#python-decorators)).

Supported SQL statements:

* [CREATE FLOW AS INSERT INTO BY NAME](../sql/SparkSqlAstBuilder.md#visitCreatePipelineInsertIntoFlow)

## Demo: Create Virtual Environment for Python Client

`CreateFlowCommand` is a `BinaryCommand` logical operator that represents [CREATE FLOW ... AS INSERT INTO ... BY NAME](../sql/SparkSqlAstBuilder.md#visitCreatePipelineInsertIntoFlow) SQL statements in [Spark Declarative Pipelines](../declarative-pipelines/index.md).

`CreateFlowCommand` is handled by [SqlGraphRegistrationContext](../declarative-pipelines/SqlGraphRegistrationContext.md#CreateFlowCommand).

The `Pipelines` execution planning strategy is used to prevent direct execution of Spark Declarative Pipelines' SQL statements.

## Creating Instance

`CreateFlowCommand` takes the following to be created:

* <span id="name"> Name (`UnresolvedIdentifier` leaf logical operator)

`CreateFlowCommand` is created when:

* `SparkSqlAstBuilder` is requested to [parse a CREATE FLOW AS INSERT INTO BY NAME SQL statement](../sql/SparkSqlAstBuilder.md#visitCreatePipelineInsertIntoFlow)

Creates a [CreateTempViewUsing](../logical-operators/CreateTempViewUsing.md) logical operator for `CREATE TEMPORARY VIEW USING` or falls back to [AstBuilder](AstBuilder.md#visitCreateTable) (to create either a [CreateTableAsSelect](../logical-operators/CreateTableAsSelect.md) or a [CreateTable](../logical-operators/CreateTable.md)).