
Commit 453a793

Committing progress:

- Tested out pytask for building pipelines
- Used the pytask data catalog to create sets of tasks as parameters to functions, using namedtuples
- Used the pytask data catalog to manage the parallelization of tasks
- Created a pytask logger to log the progress of tasks
- Implemented the download step of querying the ERA5 dataset in pytask
- Began implementation of the aggregation step in pytask:
  - Used the astral library to find the time of sunrise and sunset for each data point in a query
  - Assigned a diurnal class to each data point based on the time of day
  - Aggregation of data points by date and diurnal class is in progress
1 parent 8764e48 commit 453a793
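As a rough illustration of the in-progress aggregation step described above, here is a minimal sketch of the sunrise/sunset lookup and diurnal classification with the astral library. The function name `diurnal_class` and the two-class day/night scheme are assumptions for illustration, not code from this commit:

```python
# Hypothetical sketch of the diurnal classification described in the
# commit message; assumes astral >= 2.x and a two-class day/night scheme.
from datetime import datetime, timezone

from astral import Observer
from astral.sun import sun


def diurnal_class(timestamp: datetime, lat: float, lon: float) -> str:
    """Label a tz-aware UTC timestamp as 'day' or 'night' at the given point."""
    s = sun(Observer(latitude=lat, longitude=lon),
            date=timestamp.date(), tzinfo=timezone.utc)
    return "day" if s["sunrise"] <= timestamp < s["sunset"] else "night"


# Example: noon UTC over Kathmandu should fall in daylight.
print(diurnal_class(datetime(2024, 11, 1, 12, tzinfo=timezone.utc), 27.7, 85.3))
```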

33 files changed: +105951 −4 lines

_proc/02_aggregate.ipynb

Lines changed: 2998 additions & 0 deletions
Large diffs are not rendered by default.

_proc/03_publish.qmd

Lines changed: 519 additions & 0 deletions
Large diffs are not rendered by default.

_proc/10_pytask_demo.ipynb

Lines changed: 889 additions & 0 deletions
Large diffs are not rendered by default.

_proc/20_pytask_logger.ipynb

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "description: A simple logger module for the pytask tasks\n",
    "output-file: pytask_logger.html\n",
    "title: logger\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "has_sd": true,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "---\n",
       "\n",
       "[source](https://github.com/TinasheMTapera/era5_sandbox/blob/main/era5_sandbox/pytask_logger.py#L15){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n",
       "\n",
       "### setup_logger\n",
       "\n",
       "> setup_logger (name:str, log_file:pathlib.Path, level=20)"
      ],
      "text/plain": [
       "---\n",
       "\n",
       "[source](https://github.com/TinasheMTapera/era5_sandbox/blob/main/era5_sandbox/pytask_logger.py#L15){target=\"_blank\" style=\"float:right; font-size:smaller\"}\n",
       "\n",
       "### setup_logger\n",
       "\n",
       "> setup_logger (name:str, log_file:pathlib.Path, level=20)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#| echo: false\n",
    "#| output: asis\n",
    "show_doc(setup_logger)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "era5_sandbox",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
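The generated docs above show only the signature `setup_logger (name:str, log_file:pathlib.Path, level=20)` (20 is `logging.INFO`). Below is a minimal sketch of a logger factory matching that signature, assuming the standard-library `logging` module; the actual implementation in `era5_sandbox/pytask_logger.py` may differ:

```python
# A sketch matching the documented signature; the real module may differ.
import logging
from pathlib import Path


def setup_logger(name: str, log_file: Path, level=logging.INFO) -> logging.Logger:
    """Return a named logger that appends timestamped records to log_file."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated setup
        logger.addHandler(handler)
    return logger
```

The format string here mirrors the `[timestamp][name][LEVEL] - message` records visible in the log fragment further down this diff.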

_proc/21_pytask_download.ipynb

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "description: This module downloads the raw era5 data from the CDS API. It is similar\n",
    "  to the original script, refactored for `pytask`.\n",
    "output-file: pytask_download.html\n",
    "title: task_download\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a2a901b",
   "metadata": {},
   "source": [
    "We're going to quickly refactor the pipeline to use pytask instead of Hydra and Snakemake. This will hopefully demonstrate a simpler and more flexible way to manage data pipelines in Python.\n",
    "\n",
    "To start off, we need to create a function that queries the CDS API with one job. This function will be used to download the data for each query in the range specified in the data catalog in the config file.\n",
    "\n",
    "Let's take a look at the data catalog we created in the config module:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa59bff2",
   "metadata": {},
   "source": [
    "You can see the queries entry we created in the data catalog. Each query is a namedtuple that contains the parameters for the CDS API query. The `query` namedtuple has the following variable fields (other fields are singletons): `year`, `month`, `geography`, and `variables`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a498f62a",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Query(year='2024', month='11', day=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'], time=['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'], geography={'name': 'nepal', 'shapefile': 'https://data.humdata.org/dataset/07db728a-4f0f-4e98-8eb0-8fa9df61f01c/resource/2eb4c47f-fd6e-425d-b623-d35be1a7640e/download/npl_adm_nd_20240314_ab_shp.zip'}, product_type='reanalysis', variables=['2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'volumetric_soil_water_layer_1']),\n",
       " Query(year='2024', month='12', day=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'], time=['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'], geography={'name': 'madagascar', 'shapefile': 'https://data.humdata.org/dataset/26fa506b-0727-4d9d-a590-d2abee21ee22/resource/ed94d52e-349e-41be-80cb-62dc0435bd34/download/mdg_adm_bngrc_ocha_20181031_shp.zip'}, product_type='reanalysis', variables=['2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'volumetric_soil_water_layer_1']),\n",
       " Query(year='2024', month='12', day=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'], time=['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'], geography={'name': 'nepal', 'shapefile': 'https://data.humdata.org/dataset/07db728a-4f0f-4e98-8eb0-8fa9df61f01c/resource/2eb4c47f-fd6e-425d-b623-d35be1a7640e/download/npl_adm_nd_20240314_ab_shp.zip'}, product_type='reanalysis', variables=['2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'volumetric_soil_water_layer_1'])]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "queries = data_catalog['queries'].load()\n",
    "queries[-3:]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "088d8ffd",
   "metadata": {},
   "source": [
    "We can test this query like we did in the original work:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "161dc5e5",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[np.float64(-11.5), np.float64(42.7), np.float64(-26.1), np.float64(50.9)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_query = queries[0]\n",
    "\n",
    "create_bounding_box(example_query.geography['shapefile'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3514da1",
   "metadata": {},
   "source": [
    "In this way, we have a similar approach to Hydra configs, but, using the `pytask` data catalog, we can more easily gather the data for a specific task in a structured manner, entirely in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "226bdd13",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-07-29 13:35:42,386 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.\n",
      "2025-07-29 13:35:48,821 INFO Request ID is 48a71608-be3e-41fd-acf7-c0542542bc1e\n",
      "2025-07-29 13:35:48,965 INFO status has been updated to accepted\n",
      "2025-07-29 13:35:57,620 INFO status has been updated to running\n",
      "2025-07-29 13:42:09,192 INFO status has been updated to successful\n",
      " \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'2009-1_madagascar.nc'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "client = cdsapi.Client()\n",
    "\n",
    "ex_bounding_box = create_bounding_box(example_query.geography['shapefile'])\n",
    "\n",
    "request = {\n",
    "    \"product_type\": example_query.product_type,\n",
    "    \"variable\": example_query.variables,\n",
    "    \"year\": example_query.year,\n",
    "    \"month\": example_query.month,\n",
    "    \"day\": example_query.day,\n",
    "    \"time\": example_query.time,\n",
    "    \"data_format\": \"netcdf\",\n",
    "    \"download_format\": \"unarchived\",\n",
    "    \"area\": ex_bounding_box\n",
    "}\n",
    "\n",
    "target = f\"{example_query.name()}.nc\"\n",
    "\n",
    "client.retrieve(\"reanalysis-era5-land\", request).download(target)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36cfe542",
   "metadata": {},
   "source": [
    "This works! So now we just need to create a `task_` function that pytask will recognise, so that it can parallelise the downloads across queries:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7bbbc04",
   "metadata": {},
   "source": [
    "Because we defined this task in a function and loop, we can easily debug it by simply calling it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "97270a2d",
   "metadata": {
    "language": "python"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2009-1_nepal\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-07-29 14:18:21,553 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.\n",
      "2025-07-29 14:18:23,913 INFO Request ID is 4f6ccab4-3236-4850-bce2-38d0abb73c8b\n",
      "2025-07-29 14:18:24,220 INFO status has been updated to accepted\n",
      "2025-07-29 14:18:32,862 INFO status has been updated to running\n",
      "2025-07-29 14:18:38,061 INFO status has been updated to successful\n",
      " \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Path('/net/rcstorenfs02/ifs/rc_labs/dominici_lab/lab/data_processing/csph-era5_sandbox/bld/2009-1_nepal.nc')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "task_download_raw_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57588941",
   "metadata": {
    "language": "python"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "era5_sandbox",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
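The notebook calls `task_download_raw_data()` without showing its definition. Based on the surrounding cells, here is a hedged sketch of what the loop-generated task might look like, assuming pytask >= 0.4 (the `@task` decorator and `Product` annotation); `data_catalog`, `create_bounding_box`, the `Query.name()` method, and the `bld` build directory come from the project's config module and outputs above, and are taken on faith here:

```python
# Sketch only: data_catalog, create_bounding_box, and Query.name() are
# assumed to be importable from the project's config module.
from pathlib import Path
from typing import Annotated

import cdsapi
from pytask import Product, task

BLD = Path("bld")  # the build directory visible in the notebook's output path

for query in data_catalog["queries"].load():

    @task(id=query.name())
    def task_download_raw_data(
        query=query,  # default-arg binding freezes this loop iteration's query
        produces: Annotated[Path, Product] = BLD / f"{query.name()}.nc",
    ) -> Path:
        """Download one ERA5-Land query from the CDS API as NetCDF."""
        print(query.name())  # the debug call in the notebook shows this on stdout
        request = {
            "product_type": query.product_type,
            "variable": query.variables,
            "year": query.year,
            "month": query.month,
            "day": query.day,
            "time": query.time,
            "data_format": "netcdf",
            "download_format": "unarchived",
            "area": create_bounding_box(query.geography["shapefile"]),
        }
        cdsapi.Client().retrieve("reanalysis-era5-land", request).download(str(produces))
        return produces
```

Binding `query=query` as a default argument gives each generated task its own query, and it is also what makes the bare call `task_download_raw_data()` in the last cell work for debugging: it simply invokes the most recently defined task.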
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
[2025-03-17 12:59:37,230][datapi.legacy_api_client][INFO] - [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
[2025-03-17 12:59:37,232][datapi.legacy_api_client][WARNING] - [2024-06-16T00:00:00] CDS API syntax is changed and some keys or parameter names may have also changed. To avoid requests failing, please use the "Show API request code" tool on the dataset Download Form to check you are using the correct syntax for your API request.
[2025-03-17 12:59:37,541][datapi.legacy_api_client][INFO] - Request ID is 94401c1f-cc22-4d58-acea-0cca463df9ab
[2025-03-17 12:59:37,676][datapi.legacy_api_client][INFO] - status has been updated to accepted
