Commit 1f0e559

Bunch0fAtoms, Copilot, and alexott authored
Add DBR EOS monitor, signed commit (#578)
Added signature to my commit. Deleted dbr_migration_dash and replaced it with dbr_eos.

The DBR monitoring tool uses system tables and a custom DBR look-up table to identify outdated DBR versions and flag those nearing end-of-service. The results are visualized in an AI/BI Dashboard, with Alerts_V2 sent to notify when clusters are running on a DBR that is end-of-service. The DAB deploys a job and pipeline that refresh the look-up table and dashboard; the alerts are scheduled to run shortly afterwards.

This is from a conversation I (Morgan Williams) had with Diego Gomez and Francisco Vargas.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Alex Ott <alexott@gmail.com>
1 parent 7d13b9f commit 1f0e559

File tree

14 files changed

+5264
-1
lines changed

dbsql/dbr_eos/README.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# DBR End-of-Service Monitor

This repository contains two main components:

## 1. `dbr-eos-monitor-dabs/`: Databricks Asset Bundle (DABs)
The **`dbr-eos-monitor-dabs/`** folder contains a [Databricks Asset Bundle](https://docs.databricks.com/en/dev-tools/bundles/index.html) that automates:

- **Refreshing the DBR lookup table**: updates data about cluster DBR versions and their end-of-service dates.
- **Refreshing the Lakeview dashboard**: ensures the "DBR Monitor Dashboard" always displays the latest DBR cluster status.

### Key components
- **Jobs:** `DBR_Monitor_Refresh`, scheduled to run daily at 6 AM.
  - **Task 1:** Runs the `DBR_Lookup_Table.ipynb` notebook to update DBR metadata.
  - **Task 2:** Refreshes the `DBR Monitor Dashboard`, which is bound to a Serverless SQL Warehouse.
- **Serverless SQL Warehouse:** `DBR Monitor Serverless`, created automatically for dashboard queries.
- **Dashboard:** `DBR Monitor Dashboard`, which shows clusters and the days remaining until their DBR reaches end-of-service.

### How to deploy
1. Clone this repo into a **Git folder** in your Databricks workspace.
2. In your Databricks workspace, go to **Create → Git Folder**, select **Sparse checkout mode**, and paste this Git URL: https://github.com/databrickslabs/sandbox.git
3. Enter **dbsql/dbr_eos** in the **Cone patterns** box.
4. Give the Git folder a descriptive name, such as "DBR_End_of_Service".
5. Click **"Open in asset bundle editor"**.
6. In the top right, click **"deploy bundle"**.
7. After deployment, open the created dashboard (DBR Monitor Dashboard) and job (DBR Monitor Refresh).

---

## 2. `alerts/`: DBR EOS Alerts (not DABs-managed)
The **`alerts/`** folder contains definitions and scripts for **Alerts v2** that **notify recipients when job or interactive clusters are running a DBR that is past end-of-service**.

### Update notifications and custom template
- **Please update** the alert notification fields to include everyone who should be notified about clusters running an EOS DBR.
- **Please update** the dashboard `href` in the custom template to point to your `DBR Monitor Dashboard` URL.

### Why not in DABs?
Databricks Asset Bundles do **not** currently support managing Alerts v2 directly.
Instead, the `alerts/` folder stores each alert configuration as JSON, and the alerts are pulled into your workspace via Git. Each alert:
- Runs on a Serverless SQL Warehouse, scheduled for 6:20 AM daily.
- Queries for clusters where `days_until_eos <= threshold`.
- Notifies the recipients you configured when it triggers.
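
The end-of-service check that the README describes can be sketched in a few lines of Python. This is an illustration only, not code from the commit: the `LOOKUP` dictionary, its dates, and the `eos_status` helper are hypothetical, and the rule follows the `is_supported` and `days_till_eos` expressions in the alert queries (a version missing from the lookup table counts as unsupported).

```python
from datetime import date

# Hypothetical lookup rows: clean_dbr_version -> end_of_support_date.
# These dates are placeholders for illustration, not real Databricks EOS dates.
LOOKUP = {
    "1.0": date(2020, 1, 31),
    "2.0": date(2030, 1, 31),
}


def eos_status(dbr_version, today, lookup=LOOKUP):
    """Return (days_until_eos, is_supported) for one cluster's DBR version."""
    eos = lookup.get(dbr_version)
    if eos is None:
        # Same fallback as the alert SQL: no lookup row means 0 days, unsupported.
        return 0, False
    return (eos - today).days, today <= eos


today = date(2025, 1, 1)
status_old = eos_status("1.0", today)      # past end-of-service
status_new = eos_status("2.0", today)      # still supported
status_unknown = eos_status("9.9", today)  # not in the lookup table
```

A cluster is flagged when `is_supported` is `False`; a threshold on `days_until_eos` would add an early warning on top of that check.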
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
{
  "custom_summary": "Interactive Clusters EOS Alert",
  "evaluation": {
    "source": {
      "name": "is_supported",
      "display": "is_supported",
      "aggregation": "COUNT"
    },
    "comparison_operator": "GREATER_THAN",
    "threshold": {
      "value": {
        "double_value": 0.0
      }
    },
    "empty_result_state": "OK",
    "notification": {
      "retrigger_seconds": 1,
      "notify_on_ok": false
    }
  },
  "schedule": {
    "quartz_cron_schedule": "29 20 6 * * ?",
    "timezone_id": "America/New_York"
  },
  "query_lines": [
    "-- Get clusters and latest change time",
    "WITH ",
    "latest_cluster_versions AS (",
    "    SELECT",
    "        cluster_id,",
    "        MAX(change_time) AS latest_change_time",
    "    FROM",
    "        system.compute.clusters",
    "    GROUP BY",
    "        cluster_id",
    "),",
    "",
    "filtered_usage AS (",
    "  SELECT",
    "    u.workspace_id,",
    "    u.usage_metadata.cluster_id,",
    "    u.usage_quantity,",
    "    u.sku_name,",
    "    u.usage_date",
    "  FROM",
    "    system.billing.usage u",
    "  WHERE",
    "    u.billing_origin_product = 'ALL_PURPOSE'",
    "    AND u.usage_metadata.node_type != 'SERVERLESS'",
    "    AND u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS",
    "    --AND (:workspace_ids = Array('All') OR array_contains(:workspace_ids, u.workspace_id))",
    "),",
    "",
    "latest_cluster_info AS (",
    "    SELECT",
    "        c.cluster_id,",
    "        c.cluster_name,",
    "        c.owned_by,",
    "        c.dbr_version,",
    "        c.change_time",
    "    FROM",
    "        system.compute.clusters c",
    "    JOIN latest_cluster_versions lcv",
    "        ON c.cluster_id = lcv.cluster_id",
    "        AND c.change_time = lcv.latest_change_time",
    "),",
    "",
    "",
    "combined AS (",
    "  SELECT",
    "    u.workspace_id,",
    "    u.cluster_id,",
    "    c.cluster_name,",
    "    c.owned_by,",
    "    u.usage_date,",
    "    NULLIF(regexp_extract(dbr_version, '\\\\d+\\\\.\\\\d', 0), '') AS dbr_version",
    "  FROM",
    "    filtered_usage u",
    "    JOIN latest_cluster_info c ON u.cluster_id = c.cluster_id",
    "    WHERE dbr_version IS NOT NULL",
    ")",
    "",
    "SELECT",
    "workspace_id,",
    "cluster_id,",
    "cluster_name,",
    "owned_by,",
    "coalesce(wn.dbr_version, combined.dbr_version) as dbr_version_lts_ind,",
    "max(usage_date) as latest_usage_date,",
    "MAX(CASE",
    "    WHEN wn.end_of_support_date IS NOT NULL AND CURRENT_DATE <= wn.end_of_support_date THEN TRUE",
    "    ELSE FALSE",
    "END) AS is_supported",
    "FROM combined",
    "LEFT JOIN dbdemos.lookup.dbr_version wn",
    "ON combined.dbr_version = wn.clean_dbr_version",
    "GROUP BY workspace_id, cluster_id, cluster_name, owned_by, dbr_version_lts_ind",
    "HAVING is_supported = false"
  ],
  "custom_description_lines": [
    "Alert \"{{ALERT_NAME}}\" changed status to {{ALERT_STATUS}}.<br>",
    "We recommend upgrading all clusters running DBR that are nearing or past the end of service to serverless. <br>",
    "Serverless clusters automatically manage DBR for you, ensuring they are supported.<br>",
    "Query result value: {{QUERY_RESULT_VALUE}}<br>",
    "<a href=\"{{ALERT_URL}}\">Go to alert</a> | ",
    "<a href=\"{{DASHBOARD_URL}}\">Go to dashboard</a>"
  ]
}
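
The alert definition stores its SQL as a `query_lines` array, one string per line. The sketch below shows one way to reassemble the query for inspection or local testing. It assumes the lines are meant to be joined with newlines, which this commit does not state, and it uses a shortened, hypothetical definition rather than the full file.

```python
import json

# Shortened, hypothetical alert definition that reuses the field names above.
alert_json = """
{
  "custom_summary": "Interactive Clusters EOS Alert",
  "query_lines": [
    "SELECT",
    "workspace_id,",
    "cluster_id",
    "FROM combined"
  ]
}
"""

alert = json.loads(alert_json)

# Assumption: the stored lines are joined with newlines to rebuild the SQL text.
query_text = "\n".join(alert["query_lines"])
print(query_text)
```

Keeping the query as an array of lines makes diffs of the JSON file readable, at the cost of needing a join step like this before the SQL can be run elsewhere.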
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
{
  "custom_summary": "Job Clusters EOS Alert",
  "evaluation": {
    "source": {
      "name": "is_supported",
      "display": "is_supported",
      "aggregation": "COUNT"
    },
    "comparison_operator": "GREATER_THAN",
    "threshold": {
      "value": {
        "double_value": 0.0
      }
    },
    "empty_result_state": "OK",
    "notification": {
      "retrigger_seconds": 1,
      "notify_on_ok": false
    }
  },
  "schedule": {
    "quartz_cron_schedule": "29 20 6 * * ?",
    "timezone_id": "America/New_York"
  },
  "query_lines": [
    "WITH",
    "filtered_usage AS (",
    "  SELECT",
    "    u.workspace_id,",
    "    u.usage_metadata.cluster_id,",
    "    u.usage_metadata.job_id,",
    "    u.usage_quantity,",
    "    u.identity_metadata.run_as,",
    "    u.sku_name,",
    "    u.usage_date",
    "  FROM",
    "    system.billing.usage u",
    "  WHERE",
    "    u.billing_origin_product = 'JOBS'",
    "    AND u.usage_metadata.node_type != 'SERVERLESS'",
    "    AND u.usage_metadata.job_id IS NOT NULL",
    "    AND u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS",
    "    --AND (:workspace_ids = ARRAY('All') OR array_contains(:workspace_ids, u.workspace_id))",
    "    GROUP BY ALL",
    "),",
    "",
    "-- Price lookup",
    "price_lookup AS (",
    "  SELECT ",
    "    sku_name,",
    "    pricing.effective_list.default AS price_per_dbu",
    "  FROM system.billing.list_prices",
    "  WHERE price_end_time IS NULL",
    "),",
    "",
    "-- Join with cluster and parse DBR version",
    "usage_with_dbr AS (",
    "    SELECT",
    "      fu.workspace_id,",
    "      fu.cluster_id,",
    "      fu.job_id,",
    "      fu.usage_date,",
    "      fu.usage_quantity,",
    "      fu.run_as,",
    "      pl.price_per_dbu,",
    "      NULLIF(regexp_extract(c.dbr_version, '\\\\d+\\\\.\\\\d', 0), '') AS dbr_version",
    "    FROM",
    "      filtered_usage fu",
    "      JOIN system.compute.clusters c",
    "        ON fu.cluster_id = c.cluster_id",
    "      JOIN price_lookup pl",
    "        ON fu.sku_name = pl.sku_name",
    "    WHERE dbr_version IS NOT NULL",
    ")",
    "",
    "-- Aggregate per job",
    "",
    "  SELECT",
    "    u.workspace_id,",
    "    u.job_id,",
    "    u.dbr_version,",
    "    coalesce(wn.dbr_version, u.dbr_version) as dbr_version_lts_ind,",
    "    CASE WHEN wn.end_of_support_date IS NOT NULL THEN datediff(wn.end_of_support_date, current_date) ELSE 0 END AS days_till_eos, --days till end of support",
    "    max(u.usage_date) as latest_usage_date,",
    "    ARRAY_AGG(DISTINCT u.run_as) AS run_identities,",
    "    MAX(CASE",
    "      WHEN wn.end_of_support_date IS NOT NULL AND CURRENT_DATE <= wn.end_of_support_date THEN TRUE",
    "      ELSE FALSE",
    "    END) AS is_supported",
    "  FROM",
    "    usage_with_dbr u",
    "  LEFT JOIN dbdemos.lookup.dbr_version wn ",
    "    ON u.dbr_version = wn.clean_dbr_version",
    "  GROUP BY",
    "    u.workspace_id, u.job_id, u.dbr_version, dbr_version_lts_ind, days_till_eos",
    "HAVING is_supported = false"
  ],
  "custom_description_lines": [
    "Alert \"{{ALERT_NAME}}\" changed status to {{ALERT_STATUS}}.<br>",
    "We recommend upgrading all clusters running DBR that are nearing or past the end of service to serverless. <br>",
    "Serverless clusters automatically manage DBR for you, ensuring they are supported.<br>",
    "Query result value: {{QUERY_RESULT_VALUE}}<br>",
    "<a href=\"{{ALERT_URL}}\">Go to alert</a> | ",
    "<a href=\"{{DASHBOARD_URL}}\">Go to dashboard</a>"
  ]
}
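
The version pattern in `regexp_extract` is escaped twice: once for JSON and once for the SQL string literal. The sketch below peels both layers in Python to show the regular expression that finally runs. It assumes the default Spark SQL behaviour in which a backslash inside a string literal is an escape character, and the `extract_dbr_version` helper and sample version strings are illustrative, not taken from the commit.

```python
import json
import re

# The pattern exactly as written inside the JSON file (four backslashes per escape).
json_fragment = r'"\\\\d+\\\\.\\\\d"'

# Layer 1: JSON decoding halves the backslashes, giving the SQL literal text.
sql_literal = json.loads(json_fragment)

# Layer 2 (assumed Spark SQL default): the SQL parser unescapes the literal again.
pattern = sql_literal.replace("\\\\", "\\")


def extract_dbr_version(dbr_version):
    """Mimic NULLIF(regexp_extract(...), ''): return None when nothing matches."""
    match = re.search(pattern, dbr_version)
    return match.group(0) if match else None


lts_example = extract_dbr_version("15.4.x-scala2.12")
no_match = extract_dbr_version("custom-image")
# Hypothetical two-digit minor version: only the first minor digit is captured.
two_digit_minor = extract_dbr_version("11.10.x-scala2.12")
```

Because the pattern ends in a single `\d`, a version with a two-digit minor component would be truncated, which matters only if the lookup table's `clean_dbr_version` values ever contain such versions.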

dbsql/dbr_eos/alerts/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
## 2. `alerts/`: DBR EOS Alerts (not DABs-managed)
The **`alerts/`** folder contains definitions and scripts for **Alerts v2** that **notify recipients when job or interactive clusters are running a DBR that is past end-of-service**.

### Update notifications and custom template
- **Please update** the alert notification fields to include everyone who should be notified about clusters running an EOS DBR.
- **Please update** the dashboard `href` in the custom template to point to your `DBR Monitor Dashboard` URL.

### Why not in DABs?
Databricks Asset Bundles do **not** currently support managing Alerts v2 directly.
Instead, the `alerts/` folder stores each alert configuration as JSON, and the alerts are pulled into your workspace via Git. Each alert:
- Runs on a Serverless SQL Warehouse, scheduled for 6:20 AM daily.
- Queries for clusters where `days_until_eos <= threshold`.
- Notifies the recipients you configured when it triggers.
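
The 6:20 AM schedule comes from the `quartz_cron_schedule` value `29 20 6 * * ?` in the alert definitions, evaluated in the `America/New_York` time zone. Quartz expressions start with a seconds field, followed by minutes and hours. The helper below is a small, hypothetical sketch that reads only those first three fields; it is not a full cron parser.

```python
def quartz_daily_time(expression):
    """Return HH:MM:SS for a Quartz expression with fixed seconds, minutes, hours."""
    fields = expression.split()
    if len(fields) < 6:
        raise ValueError("a Quartz expression needs at least six fields")
    # Quartz field order: seconds, minutes, hours, day-of-month, month, day-of-week.
    seconds, minutes, hours = (int(field) for field in fields[:3])
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


alert_time = quartz_daily_time("29 20 6 * * ?")
print(alert_time)
```

Running the alerts 20 minutes after the 6 AM refresh job leaves time for the lookup table to be updated first.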
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
.databricks/
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# Databricks Runtime EOS Monitor - Asset Bundle
Monitor and alert on DBR versions that are at or nearing end of service.

This bundle creates:
- A notebook-based refresh job (`DBR_Lookup_Table.ipynb`)
- A dashboard (`DBR Days Until End Of Service Dashboard.lvdash.json`) that is created in your workspace on deploy

## How to Use

## Quickstart (Workspace UI)
1. Clone this repo into a **Git folder** in your Databricks workspace.
2. In your Databricks workspace, go to **Create → Git Folder**, select **Sparse checkout mode**, and paste this Git URL: https://github.com/databrickslabs/sandbox.git
3. Enter **dbsql/dbr_eos** in the **Cone patterns** box.
4. Give the Git folder a descriptive name, such as "DBR_End_of_Service".
5. Click **"Open in asset bundle editor"**.
6. In the top right, click **"deploy bundle"**.
7. After deployment, open the created dashboard (DBR Monitor Dashboard) and job (DBR Monitor Refresh).

## Quickstart (CLI)
```bash
# Use your workspace profile
databricks bundle validate -t dev
databricks bundle deploy -t dev
```
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
bundle:
  name: dbr-eos-monitor-dabs

# Pull in all modular resource files (e.g., resources/dashboard.yml, resources/jobs/*.yml)
include:
  - resources/*.yml
  - resources/*/*.yml

# Vars are used only for prod/CI; dev stays zero-config so the Workspace UI can deploy immediately
variables:
  workspace_host:
    description: "Workspace URL for prod/CI (e.g., https://adb-1234567890123.4.azuredatabricks.net)"
    default: ""
  root_path_base:
    description: "Base workspace path for prod/CI (e.g., /Workspace/Users/<you> or /Workspace/Shared)"
    default: ""
  run_as_user:
    description: "User or SPN for prod run_as"
    default: ""
  wn_catalog:
    description: "UC catalog for DBR_Lookup_Table"
    default: "dbdemos"
  wn_schema:
    description: "UC schema for DBR_Lookup_Table"
    default: "lookup"
  wn_table:
    description: "UC tablename for DBR_Lookup_Table"
    default: "dbr_version"

targets:
  # Works in the Workspace UI without edits; host omitted so UI uses the current workspace
  dev:
    mode: development
    default: true
    workspace:
      # Use a safe deterministic root to avoid unresolved vars
      root_path: "~/.bundle/${bundle.name}/dev"
    # run_as optional in dev (defaults to deployer)

  # Parameterized target for CI/CD or cross-workspace deploys
  prod:
    mode: production
    workspace:
      host: ${var.workspace_host}
      root_path: ${var.root_path_base}/.bundle/${bundle.name}/${target.name}
    run_as:
      user_name: ${var.run_as_user} # or service_principal_name: <app-id>
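
To see what the `prod` target's `root_path` expands to, the `${...}` placeholders can be substituted by hand. The sketch below is a toy substitution written for illustration; it is not the Databricks CLI's resolver, and the `resolve` helper and the `/Workspace/Shared` value (taken from the variable's description) are assumptions.

```python
import re


def resolve(template, values):
    """Replace each ${dotted.name} in template with values['dotted.name']."""

    def lookup(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"unresolved variable: {key}")
        return values[key]

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


root_path_template = "${var.root_path_base}/.bundle/${bundle.name}/${target.name}"

values = {
    "var.root_path_base": "/Workspace/Shared",  # example value, set per deployment
    "bundle.name": "dbr-eos-monitor-dabs",
    "target.name": "prod",
}
prod_root = resolve(root_path_template, values)

# With the empty default for root_path_base, the path starts at the root.
default_root = resolve(root_path_template, {**values, "var.root_path_base": ""})
```

The second call shows why `root_path_base` should be set explicitly for `prod`: with its empty default, the resulting path would begin with `/.bundle/`.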
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
resources:
  sql_warehouses:
    dbr_monitor_wh:
      name: "DBR Monitor Serverless"
      cluster_size: "2X-Small"
      min_num_clusters: 1
      max_num_clusters: 1
      auto_stop_mins: 5
      enable_serverless_compute: true
