
Commit 453a793

Committing progress:

- Tested out pytask for building pipelines
- Used the pytask data catalog to create sets of tasks as parameters to functions, using namedtuples
- Used the pytask data catalog to manage the parallelization of tasks
- Created a pytask logger to log the progress of tasks
- Implemented the download step of querying the ERA5 dataset in pytask
- Began implementation of the aggregation step in pytask:
  - Used the astral library to find the time of sunrise and sunset for each data point in a query
  - Assigned a diurnal class to each data point based on the time of day
  - Aggregation of data points by date and diurnal class is in progress
1 parent 8764e48 commit 453a793
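As a rough illustration of the in-progress aggregation step described above, here is a minimal sketch of the sunrise/sunset lookup and diurnal classification with the astral library. The function name `diurnal_class` and the two-class day/night scheme are assumptions for illustration, not code from this commit:

```python
# Hypothetical sketch of the diurnal classification described in the
# commit message; assumes astral >= 2.x and a two-class day/night scheme.
from datetime import datetime, timezone

from astral import Observer
from astral.sun import sun


def diurnal_class(timestamp: datetime, lat: float, lon: float) -> str:
    """Label a tz-aware UTC timestamp as 'day' or 'night' at the given point."""
    s = sun(Observer(latitude=lat, longitude=lon),
            date=timestamp.date(), tzinfo=timezone.utc)
    return "day" if s["sunrise"] <= timestamp < s["sunset"] else "night"


# Example: noon UTC over Kathmandu should fall in daylight.
print(diurnal_class(datetime(2024, 11, 1, 12, tzinfo=timezone.utc), 27.7, 85.3))
```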

33 files changed: +105951 −4 lines

_proc/02_aggregate.ipynb

Lines changed: 2998 additions & 0 deletions
Large diffs are not rendered by default.

_proc/03_publish.qmd

Lines changed: 519 additions & 0 deletions
Large diffs are not rendered by default.

_proc/10_pytask_demo.ipynb

Lines changed: 889 additions & 0 deletions
Large diffs are not rendered by default.

_proc/20_pytask_logger.ipynb

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "description: A simple logger module for the pytask tasks\n",
    "output-file: pytask_logger.html\n",
    "title: logger\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "has_sd": true,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "---\n",
       "\n",
       "[source](https://github.com/TinasheMTapera/era5_sandbox/blob/main/era5_sandbox/pytask_logger.py#L15){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n",
       "\n",
       "### setup_logger\n",
       "\n",
       "> setup_logger (name:str, log_file:pathlib.Path, level=20)"
      ],
      "text/plain": [
       "---\n",
       "\n",
       "[source](https://github.com/TinasheMTapera/era5_sandbox/blob/main/era5_sandbox/pytask_logger.py#L15){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n",
       "\n",
       "### setup_logger\n",
       "\n",
       "> setup_logger (name:str, log_file:pathlib.Path, level=20)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#| echo: false\n",
    "#| output: asis\n",
    "show_doc(setup_logger)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "era5_sandbox",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
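The generated docs above show only the signature `setup_logger (name:str, log_file:pathlib.Path, level=20)` (20 is `logging.INFO`). Below is a minimal sketch of a logger factory matching that signature, assuming the standard-library `logging` module; the actual implementation in `era5_sandbox/pytask_logger.py` may differ:

```python
# A sketch matching the documented signature; the real module may differ.
import logging
from pathlib import Path


def setup_logger(name: str, log_file: Path, level=logging.INFO) -> logging.Logger:
    """Return a named logger that appends timestamped records to log_file."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated setup
        logger.addHandler(handler)
    return logger
```

The format string here mirrors the `[timestamp][name][LEVEL] - message` records visible in the log fragment further down this diff.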

_proc/21_pytask_download.ipynb

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "description: This module downloads the raw era5 data from the CDS API. It is similar\n",
    "  to the original script, refactored for `pytask`.\n",
    "output-file: pytask_download.html\n",
    "title: task_download\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a2a901b",
   "metadata": {},
   "source": [
    "We're going to quickly refactor the pipeline to use pytask instead of Hydra and Snakemake. This will hopefully demonstrate a simpler and more flexible way to manage data pipelines in Python.\n",
    "\n",
    "To start off, we need to create a function that queries the CDS API with one job. This function will be used to download the data for each query in the range specified in the data catalog in the config file.\n",
    "\n",
    "Let's take a look at the data catalog we created in the config module:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa59bff2",
   "metadata": {},
   "source": [
    "You can see the queries entry we created in the data catalog. Each query is a namedtuple that contains the parameters for the CDS API query. The `query` namedtuple has the following variable fields (other fields are singletons): `year`, `month`, `geography`, and `variables`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a498f62a",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Query(year='2024', month='11', day=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'], time=['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'], geography={'name': 'nepal', 'shapefile': 'https://data.humdata.org/dataset/07db728a-4f0f-4e98-8eb0-8fa9df61f01c/resource/2eb4c47f-fd6e-425d-b623-d35be1a7640e/download/npl_adm_nd_20240314_ab_shp.zip'}, product_type='reanalysis', variables=['2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'volumetric_soil_water_layer_1']),\n",
       " Query(year='2024', month='12', day=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'], time=['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'], geography={'name': 'madagascar', 'shapefile': 'https://data.humdata.org/dataset/26fa506b-0727-4d9d-a590-d2abee21ee22/resource/ed94d52e-349e-41be-80cb-62dc0435bd34/download/mdg_adm_bngrc_ocha_20181031_shp.zip'}, product_type='reanalysis', variables=['2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'volumetric_soil_water_layer_1']),\n",
       " Query(year='2024', month='12', day=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'], time=['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'], geography={'name': 'nepal', 'shapefile': 'https://data.humdata.org/dataset/07db728a-4f0f-4e98-8eb0-8fa9df61f01c/resource/2eb4c47f-fd6e-425d-b623-d35be1a7640e/download/npl_adm_nd_20240314_ab_shp.zip'}, product_type='reanalysis', variables=['2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'volumetric_soil_water_layer_1'])]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "queries = data_catalog['queries'].load()\n",
    "queries[-3:]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "088d8ffd",
   "metadata": {},
   "source": [
    "We can test this query like we did in the original work:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "161dc5e5",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[np.float64(-11.5), np.float64(42.7), np.float64(-26.1), np.float64(50.9)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_query = queries[0]\n",
    "\n",
    "create_bounding_box(example_query.geography['shapefile'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3514da1",
   "metadata": {},
   "source": [
    "In this way, we have a similar approach to Hydra configs, but, using the `pytask` data catalog, we can more easily gather the data for a specific task in a structured manner, entirely in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "226bdd13",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-07-29 13:35:42,386 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.\n",
      "2025-07-29 13:35:48,821 INFO Request ID is 48a71608-be3e-41fd-acf7-c0542542bc1e\n",
      "2025-07-29 13:35:48,965 INFO status has been updated to accepted\n",
      "2025-07-29 13:35:57,620 INFO status has been updated to running\n",
      "2025-07-29 13:42:09,192 INFO status has been updated to successful\n",
      " \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'2009-1_madagascar.nc'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "client = cdsapi.Client()\n",
    "\n",
    "ex_bounding_box = create_bounding_box(example_query.geography['shapefile'])\n",
    "\n",
    "request = {\n",
    "    \"product_type\": example_query.product_type,\n",
    "    \"variable\": example_query.variables,\n",
    "    \"year\": example_query.year,\n",
    "    \"month\": example_query.month,\n",
    "    \"day\": example_query.day,\n",
    "    \"time\": example_query.time,\n",
    "    \"data_format\": \"netcdf\",\n",
    "    \"download_format\": \"unarchived\",\n",
    "    \"area\": ex_bounding_box\n",
    "}\n",
    "\n",
    "target = f\"{example_query.name()}.nc\"\n",
    "\n",
    "client.retrieve(\"reanalysis-era5-land\", request).download(target)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36cfe542",
   "metadata": {},
   "source": [
    "This works! So now we just need to create a `task_` function that pytask will recognise, so that it can parallelise the downloads across queries:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7bbbc04",
   "metadata": {},
   "source": [
    "Because we defined this task in a function and loop, we can easily debug it by simply calling it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "97270a2d",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2009-1_nepal\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-07-29 14:18:21,553 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.\n",
      "2025-07-29 14:18:23,913 INFO Request ID is 4f6ccab4-3236-4850-bce2-38d0abb73c8b\n",
      "2025-07-29 14:18:24,220 INFO status has been updated to accepted\n",
      "2025-07-29 14:18:32,862 INFO status has been updated to running\n",
      "2025-07-29 14:18:38,061 INFO status has been updated to successful\n",
      " \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Path('/net/rcstorenfs02/ifs/rc_labs/dominici_lab/lab/data_processing/csph-era5_sandbox/bld/2009-1_nepal.nc')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "task_download_raw_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57588941",
   "metadata": {
    "language": "python"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "era5_sandbox",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
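The notebook calls `task_download_raw_data()` without showing its definition. Based on the surrounding cells, here is a hedged sketch of what the loop-generated task might look like, assuming pytask >= 0.4 (the `@task` decorator and `Product` annotation); `data_catalog`, `create_bounding_box`, the `Query.name()` method, and the `bld` build directory come from the project's config module and outputs above, and are taken on faith here:

```python
# Sketch only: data_catalog, create_bounding_box, and Query.name() are
# assumed to be importable from the project's config module.
from pathlib import Path
from typing import Annotated

import cdsapi
from pytask import Product, task

BLD = Path("bld")  # the build directory visible in the notebook's output path

for query in data_catalog["queries"].load():

    @task(id=query.name())
    def task_download_raw_data(
        query=query,  # default-arg binding freezes this loop iteration's query
        produces: Annotated[Path, Product] = BLD / f"{query.name()}.nc",
    ) -> Path:
        """Download one ERA5-Land query from the CDS API as NetCDF."""
        print(query.name())  # the debug call in the notebook shows this on stdout
        request = {
            "product_type": query.product_type,
            "variable": query.variables,
            "year": query.year,
            "month": query.month,
            "day": query.day,
            "time": query.time,
            "data_format": "netcdf",
            "download_format": "unarchived",
            "area": create_bounding_box(query.geography["shapefile"]),
        }
        cdsapi.Client().retrieve("reanalysis-era5-land", request).download(str(produces))
        return produces
```

Binding `query=query` as a default argument gives each generated task its own query, and it is also what makes the bare call `task_download_raw_data()` in the last cell work for debugging: it simply invokes the most recently defined task.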
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
[2025-03-17 12:59:37,230][datapi.legacy_api_client][INFO] - [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
[2025-03-17 12:59:37,232][datapi.legacy_api_client][WARNING] - [2024-06-16T00:00:00] CDS API syntax is changed and some keys or parameter names may have also changed. To avoid requests failing, please use the "Show API request code" tool on the dataset Download Form to check you are using the correct syntax for your API request.
[2025-03-17 12:59:37,541][datapi.legacy_api_client][INFO] - Request ID is 94401c1f-cc22-4d58-acea-0cca463df9ab
[2025-03-17 12:59:37,676][datapi.legacy_api_client][INFO] - status has been updated to accepted
