# Data transformation and preparation (ETL/ELT)

Data preparation for analysis is a key stage in building a data warehouse. {{ydb-short-name}} supports all standard data transformation approaches, allowing you to choose the most suitable tool for a specific task: from pure SQL to complex pipelines on Apache Spark.

## ELT

Data transformations using SQL are often the most performant, since all processing occurs directly within the {{ydb-short-name}} engine without moving data to and from external systems. The logic is described in SQL and executed by the distributed MPP engine, which is optimized for analytical operations.

### Performance in the TPC-H benchmark

The performance of ELT operations directly depends on the execution speed of complex analytical queries. The industry-standard benchmark for evaluating such queries is [TPC-H](https://www.tpc.org/tpch/).

A comparison with another distributed analytical DBMS on the TPC-H query set shows that {{ydb-short-name}} demonstrates more stable performance, especially when executing queries that contain:

* joins (`JOIN`) of a large number of tables (five or more);
* nested subqueries used for filtering;
* aggregations (`GROUP BY`) followed by complex filtering of the results.

This stability indicates the high efficiency of the {{ydb-short-name}} cost-based query optimizer in building execution plans for complex SQL patterns typical of real-world ELT processes. For a data warehouse (DWH) platform, this means predictable data update times and a reduced risk of uncontrolled performance degradation in the production environment.

### Key use cases

* Building data marts: use the familiar [`INSERT INTO ... SELECT FROM ...`](../../../yql/reference/syntax/insert_into.md) syntax to create aggregated tables (data marts) from raw data (see the sketch after this list).
* Joining OLTP and OLAP data: {{ydb-short-name}} allows you to join data from both transactional (row-based) and analytical (column-based) tables in a single query. This enables you to enrich "cold" analytical data with up-to-date information from the OLTP system without the need for duplication.
* Bulk updates: for "blind" writes of large volumes of data without existence checks, you can use the [`UPSERT INTO`](../../../yql/reference/syntax/upsert_into.md) statement.
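A minimal sketch of the first two patterns, assuming hypothetical `orders`, `daily_sales`, `daily_sales_enriched`, and `region_dict` tables: an aggregated data mart is built with `INSERT INTO ... SELECT`, then enriched with reference data from a row-based table via `UPSERT INTO ... SELECT`.

```yql
-- Build an aggregated data mart from raw order data.
-- All table and column names here are hypothetical.
INSERT INTO daily_sales
SELECT
    order_date,
    region,
    COUNT(*) AS orders_count,
    SUM(total_amount) AS revenue
FROM orders
GROUP BY order_date, region;

-- Enrich the mart with reference data from a transactional (row-based) table.
UPSERT INTO daily_sales_enriched
SELECT
    s.order_date,
    s.region,
    r.region_name,
    s.revenue
FROM daily_sales AS s
JOIN region_dict AS r ON s.region = r.region_id;
```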

### Managing SQL pipelines with dbt {#dbt}

To manage complex SQL pipelines, use the [dbt plugin](../../../integrations/migration/dbt.md). This plugin allows data engineers to describe data models as `SELECT` queries, and dbt automatically builds a dependency graph between models and executes them in the correct order. This approach helps implement software engineering principles (testing, documentation, versioning) when working with SQL code.
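For illustration, a dbt model is simply a `SELECT` statement stored in a `.sql` file; dbt materializes it and resolves dependencies between models through `ref()`. The model, table, and column names below are hypothetical:

```sql
-- models/daily_sales.sql — a hypothetical dbt model.
-- dbt materializes this SELECT as a table and infers from ref()
-- that it must be built after the stg_orders model.
{{ config(materialized='table') }}

SELECT
    order_date,
    region,
    COUNT(*) AS orders_count,
    SUM(total_amount) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY order_date, region
```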

## ETL

### Complex transformations using external frameworks {#external-etl}

For tasks that require complex logic in programming languages (Python, Scala, Java), integration with ML pipelines, or processing large volumes of data, it is convenient to use external frameworks for distributed processing.

Apache Spark is one of the most popular tools for such tasks, and a [dedicated connector](../../../integrations/ingestion/spark.md) to {{ydb-short-name}} has been developed for it. If your company uses other similar solutions (e.g., Apache Flink), they can also be used to build ETL processes using the [JDBC driver](../../../reference/languages-and-apis/jdbc-driver/index.md).

A key advantage of {{ydb-short-name}} when working with such systems is its architecture, which allows for parallel data reading. {{ydb-short-name}} has no dedicated master node for exports, so external tools can read information directly from all storage nodes. This ensures high-speed reads and linear scalability.

## Pipeline orchestration

Orchestrators are used to run pipelines on a schedule and manage dependencies.

* Apache Airflow: an [Apache Airflow provider](../../../integrations/orchestration/airflow.md) is supported for orchestrating data pipelines involving {{ydb-short-name}}. It can be used to create DAGs that run `dbt run`, execute YQL scripts, or initiate Spark jobs.
* Built-in mechanisms: for some tasks, an external orchestrator is not required. {{ydb-short-name}} can perform some operations automatically:

  * TTL-based data expiration: automatically removes expired data after a specified time (see the example after this list);
  * automatic compaction: data merging and optimization processes for the LSM tree run in the background, eliminating the need to regularly run commands like `VACUUM`.
* Other orchestrators: if your company uses a different tool (e.g., Dagster, Prefect) or a custom scheduler, you can use it to run the same commands. Most orchestrators can execute shell scripts, allowing you to call the YDB CLI, [dbt](#dbt), and other utilities.
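For instance, TTL is configured declaratively on the table itself rather than through an external cleanup job. A minimal sketch, assuming a hypothetical `events` table with a `created_at` timestamp column:

```yql
-- Hypothetical table and column names.
-- Rows whose created_at value is older than 30 days are removed automatically in the background.
ALTER TABLE events SET (TTL = Interval("P30D") ON created_at);
```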

## Integration with other ETL tools via JDBC

{{ydb-short-name}} provides a [JDBC driver](../../../reference/languages-and-apis/jdbc-driver/index.md), enabling the use of a wide range of existing ETL tools, such as [Apache NiFi](https://nifi.apache.org/) and other JDBC-compliant systems.