@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Getting Started {#getting-started}\n",
+"## Getting Started\n",
 "\n",
 "First, install PySpark:\n",
 "\n",
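
The install cell itself sits outside this hunk; as a rough sketch (assuming a plain PyPI install run from the notebook), it would be something like:

```python
# Install PySpark from PyPI inside the notebook environment
%pip install pyspark
```
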
@@ -68,7 +68,7 @@
 "source": [
 "The SparkSession is your entry point to all PySpark functionality.\n",
 "\n",
-"## Creating DataFrames {#creating-dataframes}\n",
+"## Creating DataFrames\n",
 "\n",
 "PySpark supports creating DataFrames from multiple sources including Python objects, pandas DataFrames, files, and databases.\n",
 "\n",
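
The notebook's code cells for this section are not part of the diff; a minimal sketch of what the text describes, assuming a local session and made-up names (`spark`, `df`, `people.csv`), could look like:

```python
import pandas as pd
from pyspark.sql import SparkSession

# The SparkSession is the entry point to DataFrame and SQL functionality
spark = SparkSession.builder.appName("pyspark-tutorial").getOrCreate()

# From a list of Python tuples plus a column list
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# From a pandas DataFrame
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 45]})
df_from_pandas = spark.createDataFrame(pdf)

# From a file (the path is hypothetical)
df_from_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)
```
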
@@ -168,7 +168,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Understanding Lazy Evaluation {#understanding-lazy-evaluation}\n",
+"## Understanding Lazy Evaluation\n",
 "\n",
 "PySpark's execution model differs fundamentally from pandas. Operations are divided into two types.\n",
 "\n",
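
The two operation types the cell refers to are transformations (lazy) and actions (eager). A small illustration, reusing the hypothetical `df` sketched above:

```python
from pyspark.sql import functions as F

# Transformations only build a logical plan; nothing is computed here
adults = df.filter(F.col("age") >= 18).select("name")

# Actions such as count() or show() trigger actual execution on the cluster
print(adults.count())
```
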
@@ -215,7 +215,7 @@
 "\n",
 "This lazy evaluation enables Spark's [Catalyst optimizer](https://www.databricks.com/glossary/catalyst-optimizer) to analyze your complete workflow and apply optimizations like predicate pushdown and column pruning before execution.\n",
 "\n",
-"## Data Exploration {#data-exploration}\n",
+"## Data Exploration\n",
 "\n",
 "Data exploration in PySpark works similarly to pandas, but with methods designed for distributed computing. Instead of pandas' `df.info()` and `df.head()`, PySpark uses `printSchema()` and `show()` to inspect schemas and preview records across the cluster.\n",
 "\n",
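
The exploration cell is not shown here; the pandas-to-PySpark mapping the text describes boils down to something like the following sketch (still using the hypothetical `df`):

```python
df.printSchema()   # roughly pandas' df.info(): column names and types
df.show(5)         # roughly pandas' df.head(): first rows rendered as text
print(df.count())  # roughly len(df); an action, so it runs a Spark job
```
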
@@ -335,7 +335,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Selection & Filtering {#selection-filtering}\n",
+"## Selection & Filtering\n",
 "\n",
 "When selecting and filtering data, PySpark uses explicit methods like `select()` and `filter()` that build distributed execution plans.\n",
 "\n",
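
A minimal sketch of the `select()`/`filter()` pattern the cell describes, assuming the hypothetical `df` with `name` and `age` columns:

```python
from pyspark.sql import functions as F

# Column projection and row filtering both return new, lazily evaluated DataFrames
subset = (
    df.select("name", "age")
      .filter(F.col("age") > 30)              # filter() and where() are synonyms
      .where(F.col("name").startswith("A"))
)
subset.show()
```
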
@@ -411,7 +411,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Column Operations {#column-operations}\n",
+"## Column Operations\n",
 "\n",
 "Unlike pandas' mutable operations where `df['new_col']` modifies the DataFrame in place, PySpark's `withColumn()` and `withColumnRenamed()` return new DataFrames, maintaining the distributed computing model.\n",
 "\n",
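
The immutability the cell points out looks roughly like this in practice (again with the hypothetical `df`):

```python
from pyspark.sql import functions as F

# Each call returns a new DataFrame; the original df is left untouched
df2 = (
    df.withColumn("age_plus_one", F.col("age") + 1)
      .withColumn("is_adult", F.col("age") >= 18)
      .withColumnRenamed("name", "full_name")
)
df2.show()
```
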
@@ -487,7 +487,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Aggregation Functions {#aggregation-functions}\n",
+"## Aggregation Functions\n",
 "\n",
 "Unlike pandas' in-memory aggregations, PySpark's `groupBy()` and aggregation functions distribute calculations across cluster nodes, using the same conceptual model as pandas but with lazy evaluation.\n",
 "\n",
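
As a sketch of the `groupBy()` pattern, assuming a hypothetical `sales` DataFrame with `category` and `amount` columns:

```python
from pyspark.sql import functions as F

# Hypothetical sales DataFrame; aggregation stays lazy until an action runs
summary = (
    sales.groupBy("category")
         .agg(
             F.count("*").alias("n_rows"),
             F.sum("amount").alias("total"),
             F.avg("amount").alias("avg_amount"),
         )
)
summary.show()  # the action that triggers the distributed aggregation
```
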
@@ -561,7 +561,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## String Functions {#string-functions}\n",
+"## String Functions\n",
 "\n",
 "Unlike pandas' vectorized string methods accessed via `.str`, PySpark provides importable functions like `concat()`, `split()`, and `regexp_replace()` that transform entire columns across distributed partitions.\n",
 "\n",
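
A sketch of the three functions named in the cell, assuming a hypothetical `customers` DataFrame with `first_name`, `last_name`, and `phone` columns:

```python
from pyspark.sql import functions as F

# Hypothetical customers DataFrame
cleaned = (
    customers
    .withColumn("full_name",
                F.concat(F.col("first_name"), F.lit(" "), F.col("last_name")))
    .withColumn("name_parts", F.split(F.col("full_name"), " "))
    .withColumn("phone_digits", F.regexp_replace(F.col("phone"), r"[^0-9]", ""))
)
cleaned.show(truncate=False)
```
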
@@ -642,7 +642,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Date/Time Functions {#datetime-functions}\n",
+"## Date/Time Functions\n",
 "\n",
 "Working with dates and timestamps is essential for time-based analysis. PySpark offers comprehensive functions to extract date components, format timestamps, and perform temporal operations.\n",
 "\n",
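
A sketch of the date-component and formatting functions the cell alludes to, assuming a hypothetical `events` DataFrame with a timestamp column `event_ts`:

```python
from pyspark.sql import functions as F

# Hypothetical events DataFrame with an event_ts timestamp column
events_enriched = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .withColumn("year", F.year("event_ts"))
    .withColumn("month", F.month("event_ts"))
    .withColumn("day_of_week", F.date_format("event_ts", "E"))
    .withColumn("days_since", F.datediff(F.current_date(), F.col("event_ts")))
)
events_enriched.show()
```
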
@@ -747,7 +747,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Working with Time Series {#working-with-time-series}\n",
+"## Working with Time Series\n",
 "\n",
 "Time series analysis often requires comparing values across different time periods. PySpark's window functions with lag and lead operations enable calculations of changes and trends over time.\n",
 "\n",
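
The lag/lead pattern the cell mentions, sketched against a hypothetical `prices` DataFrame with `ticker`, `trade_date`, and `close` columns:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window ordered by date within each ticker
w = Window.partitionBy("ticker").orderBy("trade_date")

changes = (
    prices
    .withColumn("prev_close", F.lag("close", 1).over(w))    # value one row back
    .withColumn("next_close", F.lead("close", 1).over(w))   # value one row ahead
    .withColumn("daily_change", F.col("close") - F.col("prev_close"))
)
changes.show()
```
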
@@ -823,7 +823,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Window Analytics {#window-analytics}\n",
+"## Window Analytics\n",
 "\n",
 "Complex analytics operations like rankings, running totals, and moving averages require window functions that operate within data partitions. These functions enable sophisticated analytical queries without self-joins.\n",
 "\n",
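
Ranking, running totals, and moving averages each come down to a window specification plus an aggregate; a sketch, reusing the hypothetical `sales` data with `region`, `sale_date`, and `amount` columns:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

by_amount = Window.partitionBy("region").orderBy(F.col("amount").desc())
by_date = Window.partitionBy("region").orderBy("sale_date")
moving = by_date.rowsBetween(-6, 0)  # current row plus the 6 preceding rows

analytics = (
    sales
    .withColumn("rank_in_region", F.rank().over(by_amount))
    .withColumn("running_total", F.sum("amount").over(by_date))
    .withColumn("moving_avg_7", F.avg("amount").over(moving))
)
analytics.show()
```
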
@@ -927,7 +927,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Join Operations {#join-operations}\n",
+"## Join Operations\n",
 "\n",
 "Combining data from multiple tables is a core operation in data analysis. PySpark supports various join types including inner, left, and broadcast joins, with automatic optimization for performance.\n",
 "\n",
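
The join types the cell lists, sketched with hypothetical `orders` and `customers` DataFrames sharing a `customer_id` column:

```python
from pyspark.sql import functions as F

inner = orders.join(customers, on="customer_id", how="inner")
left = orders.join(customers, on="customer_id", how="left")

# Broadcast hint: ship the small dimension table to every executor
# instead of shuffling the large fact table
small_join = orders.join(F.broadcast(customers), on="customer_id")
```
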
@@ -1029,7 +1029,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## SQL Integration {#sql-integration}\n",
+"## SQL Integration\n",
 "\n",
 "PySpark supports standard SQL syntax for querying data. You can write SQL queries using familiar SELECT, JOIN, and WHERE clauses alongside PySpark operations.\n",
 "\n",
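
Running SQL requires registering a DataFrame as a view first; a sketch using the hypothetical `df` and a view name of our choosing:

```python
# Register the DataFrame as a temporary view so SQL can see it
df.createOrReplaceTempView("people")

result = spark.sql("""
    SELECT name, age
    FROM people
    WHERE age > 30
    ORDER BY age DESC
""")
result.show()
```
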
@@ -1130,7 +1130,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Custom Functions {#custom-functions}\n",
+"## Custom Functions\n",
 "\n",
 "When built-in functions aren't sufficient, custom logic can be implemented using pandas UDFs. These user-defined functions provide vectorized performance through Apache Arrow and support both scalar operations and grouped transformations.\n",
 "\n",
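
A sketch of a scalar pandas UDF of the kind the cell describes; the `weather` DataFrame, the `temp_f` column, and the conversion itself are illustrative assumptions:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# Scalar pandas UDF: receives and returns pandas Series, batched via Arrow
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

# Hypothetical weather DataFrame with a temp_f column
weather.withColumn("temp_c", fahrenheit_to_celsius(F.col("temp_f"))).show()
```
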
@@ -1195,7 +1195,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## SQL Expressions {#sql-expressions}\n",
+"## SQL Expressions\n",
 "\n",
 "SQL expressions can be embedded directly within DataFrame operations for complex transformations. The `expr()` and `selectExpr()` functions allow SQL syntax to be used alongside DataFrame methods, providing flexibility in query construction.\n",
 "\n",
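
A sketch of `expr()` and `selectExpr()`, again against the hypothetical `df` with `name` and `age` columns:

```python
from pyspark.sql.functions import expr

# expr() embeds a SQL fragment as a Column inside DataFrame code
flagged = df.withColumn(
    "age_band",
    expr("CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END"),
)

# selectExpr() accepts one SQL snippet per output column
summary = df.selectExpr("upper(name) AS name_upper", "age * 12 AS age_months")
summary.show()
```
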