@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Getting Started {#getting-started}\n",
+"## Getting Started\n",
 "\n",
 "First, install PySpark:\n",
 "\n",
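
The install cell itself sits outside this hunk; as a rough sketch (assuming a plain PyPI install run from the notebook), it would be something like:

```python
# Install PySpark from PyPI inside the notebook environment
%pip install pyspark
```
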
@@ -68,7 +68,7 @@
 "source": [
 "The SparkSession is your entry point to all PySpark functionality.\n",
 "\n",
-"## Creating DataFrames {#creating-dataframes}\n",
+"## Creating DataFrames\n",
 "\n",
 "PySpark supports creating DataFrames from multiple sources including Python objects, pandas DataFrames, files, and databases.\n",
 "\n",
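
The notebook's code cells for this section are not part of the diff; a minimal sketch of what the text describes, assuming a local session and made-up names (`spark`, `df`, `people.csv`), could look like:

```python
import pandas as pd
from pyspark.sql import SparkSession

# The SparkSession is the entry point to DataFrame and SQL functionality
spark = SparkSession.builder.appName("pyspark-tutorial").getOrCreate()

# From a list of Python tuples plus a column list
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# From a pandas DataFrame
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 45]})
df_from_pandas = spark.createDataFrame(pdf)

# From a file (the path is hypothetical)
df_from_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)
```
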
@@ -168,7 +168,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Understanding Lazy Evaluation {#understanding-lazy-evaluation}\n",
+"## Understanding Lazy Evaluation\n",
 "\n",
 "PySpark's execution model differs fundamentally from pandas. Operations are divided into two types.\n",
 "\n",
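
The two operation types the cell refers to are transformations (lazy) and actions (eager). A small illustration, reusing the hypothetical `df` sketched above:

```python
from pyspark.sql import functions as F

# Transformations only build a logical plan; nothing is computed here
adults = df.filter(F.col("age") >= 18).select("name")

# Actions such as count() or show() trigger actual execution on the cluster
print(adults.count())
```
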
@@ -215,7 +215,7 @@
 "\n",
 "This lazy evaluation enables Spark's [Catalyst optimizer](https://www.databricks.com/glossary/catalyst-optimizer) to analyze your complete workflow and apply optimizations like predicate pushdown and column pruning before execution.\n",
 "\n",
-"## Data Exploration {#data-exploration}\n",
+"## Data Exploration\n",
 "\n",
 "Data exploration in PySpark works similarly to pandas, but with methods designed for distributed computing. Instead of pandas' `df.info()` and `df.head()`, PySpark uses `printSchema()` and `show()` to inspect schemas and preview records across the cluster.\n",
 "\n",
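
The exploration cell is not shown here; the pandas-to-PySpark mapping the text describes boils down to something like the following sketch (still using the hypothetical `df`):

```python
df.printSchema()   # roughly pandas' df.info(): column names and types
df.show(5)         # roughly pandas' df.head(): first rows rendered as text
print(df.count())  # roughly len(df); an action, so it runs a Spark job
```
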
@@ -335,7 +335,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Selection & Filtering {#selection-filtering}\n",
+"## Selection & Filtering\n",
 "\n",
 "When selecting and filtering data, PySpark uses explicit methods like `select()` and `filter()` that build distributed execution plans.\n",
 "\n",
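
A minimal sketch of the `select()`/`filter()` pattern the cell describes, assuming the hypothetical `df` with `name` and `age` columns:

```python
from pyspark.sql import functions as F

# Column projection and row filtering both return new, lazily evaluated DataFrames
subset = (
    df.select("name", "age")
      .filter(F.col("age") > 30)              # filter() and where() are synonyms
      .where(F.col("name").startswith("A"))
)
subset.show()
```
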
@@ -411,7 +411,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Column Operations {#column-operations}\n",
+"## Column Operations\n",
 "\n",
 "Unlike pandas' mutable operations where `df['new_col']` modifies the DataFrame in place, PySpark's `withColumn()` and `withColumnRenamed()` return new DataFrames, maintaining the distributed computing model.\n",
 "\n",
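
The immutability the cell points out looks roughly like this in practice (again with the hypothetical `df`):

```python
from pyspark.sql import functions as F

# Each call returns a new DataFrame; the original df is left untouched
df2 = (
    df.withColumn("age_plus_one", F.col("age") + 1)
      .withColumn("is_adult", F.col("age") >= 18)
      .withColumnRenamed("name", "full_name")
)
df2.show()
```
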
@@ -487,7 +487,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Aggregation Functions {#aggregation-functions}\n",
+"## Aggregation Functions\n",
 "\n",
 "Unlike pandas' in-memory aggregations, PySpark's `groupBy()` and aggregation functions distribute calculations across cluster nodes, using the same conceptual model as pandas but with lazy evaluation.\n",
 "\n",
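
As a sketch of the `groupBy()` pattern, assuming a hypothetical `sales` DataFrame with `category` and `amount` columns:

```python
from pyspark.sql import functions as F

# Hypothetical sales DataFrame; aggregation stays lazy until an action runs
summary = (
    sales.groupBy("category")
         .agg(
             F.count("*").alias("n_rows"),
             F.sum("amount").alias("total"),
             F.avg("amount").alias("avg_amount"),
         )
)
summary.show()  # the action that triggers the distributed aggregation
```
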
@@ -561,7 +561,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## String Functions {#string-functions}\n",
+"## String Functions\n",
 "\n",
 "Unlike pandas' vectorized string methods accessed via `.str`, PySpark provides importable functions like `concat()`, `split()`, and `regexp_replace()` that transform entire columns across distributed partitions.\n",
 "\n",
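
A sketch of the three functions named in the cell, assuming a hypothetical `customers` DataFrame with `first_name`, `last_name`, and `phone` columns:

```python
from pyspark.sql import functions as F

# Hypothetical customers DataFrame
cleaned = (
    customers
    .withColumn("full_name",
                F.concat(F.col("first_name"), F.lit(" "), F.col("last_name")))
    .withColumn("name_parts", F.split(F.col("full_name"), " "))
    .withColumn("phone_digits", F.regexp_replace(F.col("phone"), r"[^0-9]", ""))
)
cleaned.show(truncate=False)
```
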
@@ -642,7 +642,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Date/Time Functions {#datetime-functions}\n",
+"## Date/Time Functions\n",
 "\n",
 "Working with dates and timestamps is essential for time-based analysis. PySpark offers comprehensive functions to extract date components, format timestamps, and perform temporal operations.\n",
 "\n",
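
A sketch of the date-component and formatting functions the cell alludes to, assuming a hypothetical `events` DataFrame with a timestamp column `event_ts`:

```python
from pyspark.sql import functions as F

# Hypothetical events DataFrame with an event_ts timestamp column
events_enriched = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .withColumn("year", F.year("event_ts"))
    .withColumn("month", F.month("event_ts"))
    .withColumn("day_of_week", F.date_format("event_ts", "E"))
    .withColumn("days_since", F.datediff(F.current_date(), F.col("event_ts")))
)
events_enriched.show()
```
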
@@ -747,7 +747,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Working with Time Series {#working-with-time-series}\n",
+"## Working with Time Series\n",
 "\n",
 "Time series analysis often requires comparing values across different time periods. PySpark's window functions with lag and lead operations enable calculations of changes and trends over time.\n",
 "\n",
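
The lag/lead pattern the cell mentions, sketched against a hypothetical `prices` DataFrame with `ticker`, `trade_date`, and `close` columns:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window ordered by date within each ticker
w = Window.partitionBy("ticker").orderBy("trade_date")

changes = (
    prices
    .withColumn("prev_close", F.lag("close", 1).over(w))    # value one row back
    .withColumn("next_close", F.lead("close", 1).over(w))   # value one row ahead
    .withColumn("daily_change", F.col("close") - F.col("prev_close"))
)
changes.show()
```
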
@@ -823,7 +823,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Window Analytics {#window-analytics}\n",
+"## Window Analytics\n",
 "\n",
 "Complex analytics operations like rankings, running totals, and moving averages require window functions that operate within data partitions. These functions enable sophisticated analytical queries without self-joins.\n",
 "\n",
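
Ranking, running totals, and moving averages each come down to a window specification plus an aggregate; a sketch, reusing the hypothetical `sales` data with `region`, `sale_date`, and `amount` columns:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

by_amount = Window.partitionBy("region").orderBy(F.col("amount").desc())
by_date = Window.partitionBy("region").orderBy("sale_date")
moving = by_date.rowsBetween(-6, 0)  # current row plus the 6 preceding rows

analytics = (
    sales
    .withColumn("rank_in_region", F.rank().over(by_amount))
    .withColumn("running_total", F.sum("amount").over(by_date))
    .withColumn("moving_avg_7", F.avg("amount").over(moving))
)
analytics.show()
```
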
@@ -927,7 +927,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Join Operations {#join-operations}\n",
+"## Join Operations\n",
 "\n",
 "Combining data from multiple tables is a core operation in data analysis. PySpark supports various join types including inner, left, and broadcast joins, with automatic optimization for performance.\n",
 "\n",
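
The join types the cell lists, sketched with hypothetical `orders` and `customers` DataFrames sharing a `customer_id` column:

```python
from pyspark.sql import functions as F

inner = orders.join(customers, on="customer_id", how="inner")
left = orders.join(customers, on="customer_id", how="left")

# Broadcast hint: ship the small dimension table to every executor
# instead of shuffling the large fact table
small_join = orders.join(F.broadcast(customers), on="customer_id")
```
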
@@ -1029,7 +1029,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## SQL Integration {#sql-integration}\n",
+"## SQL Integration\n",
 "\n",
 "PySpark supports standard SQL syntax for querying data. You can write SQL queries using familiar SELECT, JOIN, and WHERE clauses alongside PySpark operations.\n",
 "\n",
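
Running SQL requires registering a DataFrame as a view first; a sketch using the hypothetical `df` and a view name of our choosing:

```python
# Register the DataFrame as a temporary view so SQL can see it
df.createOrReplaceTempView("people")

result = spark.sql("""
    SELECT name, age
    FROM people
    WHERE age > 30
    ORDER BY age DESC
""")
result.show()
```
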
@@ -1130,7 +1130,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Custom Functions {#custom-functions}\n",
+"## Custom Functions\n",
 "\n",
 "When built-in functions aren't sufficient, custom logic can be implemented using pandas UDFs. These user-defined functions provide vectorized performance through Apache Arrow and support both scalar operations and grouped transformations.\n",
 "\n",
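
A sketch of a scalar pandas UDF of the kind the cell describes; the `weather` DataFrame, the `temp_f` column, and the conversion itself are illustrative assumptions:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# Scalar pandas UDF: receives and returns pandas Series, batched via Arrow
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

# Hypothetical weather DataFrame with a temp_f column
weather.withColumn("temp_c", fahrenheit_to_celsius(F.col("temp_f"))).show()
```
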
@@ -1195,7 +1195,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## SQL Expressions {#sql-expressions}\n",
+"## SQL Expressions\n",
 "\n",
 "SQL expressions can be embedded directly within DataFrame operations for complex transformations. The `expr()` and `selectExpr()` functions allow SQL syntax to be used alongside DataFrame methods, providing flexibility in query construction.\n",
 "\n",
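
A sketch of `expr()` and `selectExpr()`, again against the hypothetical `df` with `name` and `age` columns:

```python
from pyspark.sql.functions import expr

# expr() embeds a SQL fragment as a Column inside DataFrame code
flagged = df.withColumn(
    "age_band",
    expr("CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END"),
)

# selectExpr() accepts one SQL snippet per output column
summary = df.selectExpr("upper(name) AS name_upper", "age * 12 AS age_months")
summary.show()
```
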