# Data transformation and preparation (ETL/ELT)

Data preparation for analysis is a key stage in building a data warehouse. {{ydb-short-name}} supports all standard data transformation approaches, allowing you to choose the most suitable tool for a specific task: from pure SQL to complex pipelines on Apache Spark.

## ELT

Data transformations using SQL are often the most performant, since all processing occurs directly within the {{ydb-short-name}} engine without moving data to and from external systems. The logic is described in SQL and executed by the distributed MPP engine, which is optimized for analytical operations.

### Performance in the TPC-H benchmark

The performance of ELT operations directly depends on the execution speed of complex analytical queries. The industry-standard benchmark for evaluating such queries is [TPC-H](https://www.tpc.org/tpch/).

A comparison with another distributed analytical DBMS on the TPC-H query set shows that {{ydb-short-name}} demonstrates more stable performance, especially when executing queries that contain:

* joins (`JOIN`) of a large number of tables (five or more);
* nested subqueries used for filtering;
* aggregations (`GROUP BY`) followed by complex filtering of the results.

This stability indicates the high efficiency of the {{ydb-short-name}} cost-based query optimizer in building execution plans for complex SQL patterns typical of real-world ELT processes. For a data warehouse (DWH) platform, this means predictable data update times and a reduced risk of uncontrolled performance degradation in the production environment.

### Key use cases

* Building data marts: use the familiar [`INSERT INTO ... SELECT FROM ...`](../../../yql/reference/syntax/insert_into.md) syntax to create aggregated tables (data marts) from raw data (see the sketch after this list).
* Joining OLTP and OLAP data: {{ydb-short-name}} allows you to join data from both transactional (row-based) and analytical (column-based) tables in a single query. This enables you to enrich "cold" analytical data with up-to-date information from the OLTP system without the need for duplication.
* Bulk updates: for "blind" writes of large volumes of data without existence checks, you can use the [`UPSERT INTO`](../../../yql/reference/syntax/upsert_into.md) statement.
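A minimal sketch of the first two patterns, assuming hypothetical `orders`, `daily_sales`, `daily_sales_enriched`, and `region_dict` tables: an aggregated data mart is built with `INSERT INTO ... SELECT`, then enriched with reference data from a row-based table via `UPSERT INTO ... SELECT`.

```yql
-- Build an aggregated data mart from raw order data.
-- All table and column names here are hypothetical.
INSERT INTO daily_sales
SELECT
    order_date,
    region,
    COUNT(*) AS orders_count,
    SUM(total_amount) AS revenue
FROM orders
GROUP BY order_date, region;

-- Enrich the mart with reference data from a transactional (row-based) table.
UPSERT INTO daily_sales_enriched
SELECT
    s.order_date,
    s.region,
    r.region_name,
    s.revenue
FROM daily_sales AS s
JOIN region_dict AS r ON s.region = r.region_id;
```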

### Managing SQL pipelines with dbt {#dbt}

To manage complex SQL pipelines, use the [dbt plugin](../../../integrations/migration/dbt.md). This plugin allows data engineers to describe data models as `SELECT` queries, and dbt automatically builds a dependency graph between models and executes them in the correct order. This approach helps implement software engineering principles (testing, documentation, versioning) when working with SQL code.
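For illustration, a dbt model is simply a `SELECT` statement stored in a `.sql` file; dbt materializes it and resolves dependencies between models through `ref()`. The model, table, and column names below are hypothetical:

```sql
-- models/daily_sales.sql — a hypothetical dbt model.
-- dbt materializes this SELECT as a table and infers from ref()
-- that it must be built after the stg_orders model.
{{ config(materialized='table') }}

SELECT
    order_date,
    region,
    COUNT(*) AS orders_count,
    SUM(total_amount) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY order_date, region
```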

## ETL

### Complex transformations using external frameworks {#external-etl}

For tasks that require complex logic in programming languages (Python, Scala, Java), integration with ML pipelines, or processing large volumes of data, it is convenient to use external frameworks for distributed processing.

Apache Spark is one of the most popular tools for such tasks, and a [dedicated connector](../../../integrations/ingestion/spark.md) to {{ydb-short-name}} has been developed for it. If your company uses other similar solutions (e.g., Apache Flink), they can also be used to build ETL processes using the [JDBC driver](../../../reference/languages-and-apis/jdbc-driver/index.md).

A key advantage of {{ydb-short-name}} when working with such systems is its architecture, which allows for parallel data reading. {{ydb-short-name}} has no dedicated master node for exports, so external tools can read information directly from all storage nodes. This ensures high-speed reads and linear scalability.

## Pipeline orchestration

Orchestrators are used to run pipelines on a schedule and manage dependencies.

* Apache Airflow: an [Apache Airflow provider](../../../integrations/orchestration/airflow.md) is supported for orchestrating data pipelines involving {{ydb-short-name}}. It can be used to create DAGs that run `dbt run`, execute YQL scripts, or initiate Spark jobs.
* Built-in mechanisms: for some tasks, an external orchestrator is not required. {{ydb-short-name}} can perform some operations automatically:

  * TTL-based data expiration: automatically removes expired data after a specified time (see the example after this list);
  * automatic compaction: data merging and optimization processes for the LSM tree run in the background, eliminating the need to regularly run commands like `VACUUM`.
* Other orchestrators: if your company uses a different tool (e.g., Dagster, Prefect) or a custom scheduler, you can use it to run the same commands. Most orchestrators can execute shell scripts, allowing you to call the YDB CLI, [dbt](#dbt), and other utilities.
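For instance, TTL is configured declaratively on the table itself rather than through an external cleanup job. A minimal sketch, assuming a hypothetical `events` table with a `created_at` timestamp column:

```yql
-- Hypothetical table and column names.
-- Rows whose created_at value is older than 30 days are removed automatically in the background.
ALTER TABLE events SET (TTL = Interval("P30D") ON created_at);
```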

## Integration with other ETL tools via JDBC

{{ydb-short-name}} provides a [JDBC driver](../../../reference/languages-and-apis/jdbc-driver/index.md), enabling the use of a wide range of existing ETL tools, such as [Apache NiFi](https://nifi.apache.org/) and other JDBC-compliant systems.