NSAPH-Data-Processing · audiracmichelle · Sep 13, 2025
diff --git a/.gitignore b/.gitignore
@@ -10,6 +10,9 @@ logs/
 ## slurm ----
 **/slurm*
 
+## data files ----
+data/
+
 ## Python ----
 
 # Byte-compiled / optimized / DLL files

diff --git a/Dockerfile b/Dockerfile
@@ -13,8 +13,8 @@ RUN mamba env update -n base -f requirements.yaml
 #&& mamba clean -a
 
 # Create paths to data placeholders
-RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.annual=null datapaths.input.satellite_pm25.monthly=null
+RUN python src/create_datapaths.py
 
-# snakemake --configfile conf/config.yaml --cores 4 -C temporal_freq=annual
+# snakemake --configfile conf/config.yaml --cores 4 -C temporal_freq=yearly
 ENTRYPOINT ["snakemake", "--configfile", "conf/config.yaml"]
-CMD ["--cores", "4", "-C", "polygon_name=county", "temporal_freq=annual"]
+CMD ["--cores", "4", "-C", "polygon_name=county", "temporal_freq=yearly"]
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# pm25_washu_raster2polygon
+# pm25_randall_raster2polygon
 
 Code to produce spatial aggregations of pm25 estimates as generated by the [Atmospheric Composition Analysis Group](https://sites.wustl.edu/acag/datasets/surface-pm2-5/). The spatial aggregation are performed for satellite pm25 from grid/raster (NetCDF) to polygons (shp).
 
@@ -10,7 +10,7 @@ The [Atmospheric Composition Analysis Group](https://sites.wustl.edu/acag/datase
 
 The version [V5.GL.04](https://sites.wustl.edu/acag/datasets/surface-pm2-5/#V5.GL.04) consists of mean PM2.5 (ug/m3) available at:
 
-*  Temporal frequency: Annual and monthly  
+*  Temporal frequency: Yearly and monthly  
 *  Grid resolutions: (0.1° × 0.1°) and (0.01° × 0.01°)  
 *  Geographic regions: North America, Europe, Asia, and Global
 
@@ -29,26 +29,89 @@ Aaron van Donkelaar, Melanie S. Hammer, Liam Bindle, Michael Brauer, Jeffery R.
 
 # Codebook
 
-## Dataset Columns:
+## Dataset Output Structure
 
-* county aggregations:
+The pipeline produces parquet files with PM2.5 aggregations at the polygon level.
 
-* zcta aggregations:
+### Yearly aggregations (`temporal_freq=yearly`):
+
+Output path: `data/V5GL/output/{polygon_name}_yearly/pm25__randall__{polygon_name}_yearly__{year}.parquet`
+
+Columns:
+* `{polygon_id}` (index): County FIPS code, ZCTA code, or Census Tract GEOID depending on polygon type
+* `pm25` (float64): Mean PM2.5 concentration (µg/m³) for the year
+* `year` (int64): Year of observation
+
+### Monthly aggregations (`temporal_freq=monthly`):
+
+Output path: `data/V5GL/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__{year}.parquet`
+
+Columns:
+* `{polygon_id}` (index): County FIPS code, ZCTA code, or Census Tract GEOID depending on polygon type
+* `pm25` (float64): Mean PM2.5 concentration (µg/m³) for the month
+* `year` (int64): Year of observation
+* `month` (object/int): Month of observation
+
+### Polygon ID Variables
+
+The index column name varies by polygon type and is defined in `conf/shapefiles/shapefiles.yaml`:
+* **County**: `county` (5-digit FIPS code, e.g., "01001" for Autauga County, Alabama)
+* **ZCTA**: `zcta` (5-digit ZIP Code Tabulation Area, e.g., "00601")
+* **Census Tract**: `GEOID` (11-digit census tract identifier)
+
+### Intermediate Files (Monthly only)
+
+During monthly processing, intermediate files are created:
+* Path: `data/V5GL/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__{year}_{month}.parquet`
+* These are concatenated into yearly files in the output directory
 
 ---
 
 # Configuration files
 
-The configuration structure withing the `/conf` folder allow you to modify the input parameters for the following steps:
+The configuration structure within the `/conf` folder allows you to modify the input parameters for the following steps:
 
-* create directory paths: `utils/create_dir_paths.py`
+* create directory paths: `src/create_datapaths.py`
 * download pm25: `src/download_pm25.py`
 * download shapefiles: `src/download_shapefile.py`
 * aggregate pm25: `src/aggregate_pm25.py`
+* concatenate monthly files: `src/concat_monthly.py`
 
 The key parameters are:
-* `temporal_freq` which determines whether the original annual or monthly pm25 files will be aggregated. The options are: `annual` and `monthly`.
-* `polygon_name` which determines into which polygons the pm25 grid will the aggregated. The options are: `zcta` and `county`.
+* `temporal_freq` which determines whether the original yearly or monthly pm25 files will be aggregated. The options are: `yearly` and `monthly`.
+* `polygon_name` which determines into which polygons the pm25 grid will be aggregated. The options are: `zcta`, `county`, and `census_tract`.
+
+## Configuration Structure
+
+The configuration system uses Hydra with the following structure:
+
+* `/conf/config.yaml` - Main configuration file with default settings
+* `/conf/datapaths/` - Data path configurations for different environments:
+  * `cannon_v5gl.yaml` - Paths for V5GL data on Cannon cluster
+  * `cannon_v6gl.yaml` - Paths for V6GL data on Cannon cluster
+  * `datapaths.yaml` - Template configuration
+* `/conf/shapefiles/shapefiles.yaml` - Shapefile metadata including:
+  * Available years for each polygon type
+  * ID column names (`idvar`)
+  * File naming prefixes
+  * Download URLs (optional, via `url_map`)
+* `/conf/satellite_pm25/` - PM2.5 dataset configurations for different versions
+* `/conf/snakemake.yaml` - Default parameters for Snakemake workflow
+
+## Shapefile Configuration
+
+Shapefiles can be obtained in two ways:
+
+1. **Symlinks to existing Lab shapefiles** (recommended for Cannon cluster):
+   ```bash
+   python src/create_datapaths.py
+   ```
+   This creates symbolic links from the project's `data/` directory to the Lab's existing shapefile repository at `/n/dominici_lab/lab/lego/geoboundaries/`.
+
+2. **Direct download** from Census Bureau (optional):
+   Shapefiles can be downloaded automatically if URLs are configured in `conf/shapefiles/shapefiles.yaml`. The download script will use the `url_map` for the specified year.
+
+The pipeline falls back to earlier shapefiles: if the PM2.5 data year has no exact shapefile match, it automatically selects the most recent prior shapefile year available.
 
 ---
 
@@ -75,17 +138,37 @@ mamba activate <env_name>
 
 ## Input and output paths
 
-Run
+The pipeline requires setting up directory paths and symbolic links to data sources. Run:
+
+```bash
+python src/create_datapaths.py
+```
+
+This script:
+* Creates the base directory structure under `data/V5GL/` (or `data/V6GL/` depending on configuration)
+* Creates symbolic links to:
+  * PM2.5 raw data at `/n/dominici_lab/lab/lego/environmental/pm25__randall/`
+  * Shapefiles at `/n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/`
+  * Output directories for aggregated results
+
+To use a different configuration, specify the datapaths config file:
 
 ```bash
-python utils/create_dir_paths.py 
+python src/create_datapaths.py datapaths=cannon_v6gl
 ```
 
 ## Pipeline
 
-You can run the pipeline steps manually or run the snakemake pipeline described in the Snakefile.
+The pipeline consists of four main steps:
+
+1. **Download/link shapefiles**: Obtain or link to US Census shapefiles (counties, ZCTAs, or census tracts)
+2. **Download PM2.5 data**: Download satellite PM2.5 NetCDF files from Washington University
+3. **Aggregate PM2.5**: Perform spatial aggregation from raster grid to polygons
+4. **Concatenate monthly files** (monthly frequency only): Combine monthly parquet files into yearly files
+
+You can run the pipeline steps manually or use the Snakemake workflow.
 
-**run pipeline steps manually**
+### Run pipeline steps manually
 
 ```bash
 python src/download_shapefile.py
@@ -98,7 +181,7 @@ python src/aggregate_pm25.py
 or run the pipeline:
 
 ```bash
-snakemake --cores 4 -C polygon_name=county temporal_freq=annual 
+snakemake --cores 4 -C polygon_name=county temporal_freq=yearly 
 ```
 
 Modify `cores`, `polygon_name` and `temporal_freq` as you find convenient.
@@ -115,7 +198,7 @@ mkdir <path>/satellite_pm25_raster2polygon
 
 ```bash
 docker pull nsaph/satellite_pm25_raster2polygon
-docker run -v <path>:/app/data/input/satellite_pm25/annual <path>/satellite_pm25_raster2polygon/:/app/data/output/satellite_pm25_raster2polygon nsaph/satellite_pm25_raster2polygon
+docker run -v <path>:/app/data/input/satellite_pm25/yearly -v <path>/satellite_pm25_raster2polygon/:/app/data/output/satellite_pm25_raster2polygon nsaph/satellite_pm25_raster2polygon
 ```  
 
 If you are interested in storing the input raw and intermediate data run

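Review note on the README change: it states that when a PM2.5 year has no exact shapefile match, the most recent prior shapefile year is selected. A minimal Python sketch of that rule follows. The function name `pick_shapefile_year`, the example years, and the error raised when no earlier year exists are assumptions for illustration; the repository's own helper (`available_shapefile_year`, called in the Snakefile) is not shown in this diff and may behave differently.

```python
def pick_shapefile_year(year, shapefile_years):
    """Return the latest shapefile year that is <= `year` (hypothetical helper)."""
    candidates = [int(y) for y in shapefile_years if int(y) <= year]
    if not candidates:
        # Assumption: fail loudly when no earlier shapefile exists.
        raise ValueError(f"no shapefile available on or before {year}")
    return max(candidates)


# Illustrative years only; the real list lives in conf/shapefiles/shapefiles.yaml.
example_years = [2000, 2010, 2020]
print(pick_shapefile_year(2015, example_years))  # 2010
print(pick_shapefile_year(2020, example_years))  # 2020
```

It may be worth documenting in the README what happens for a PM2.5 year that precedes every available shapefile year, since the fallback rule as written does not cover that case.
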
diff --git a/Snakefile b/Snakefile
@@ -17,37 +17,32 @@ temporal_freq = config['temporal_freq']
 polygon_name = config['polygon_name']
 
 with initialize(version_base=None, config_path="conf"):
-    hydra_cfg = compose(config_name="config", overrides=[f"temporal_freq={temporal_freq}", f"polygon_name={polygon_name}"])
+    cfg = compose(config_name="config", overrides=[f"temporal_freq={temporal_freq}", f"polygon_name={polygon_name}"])
 
-satellite_pm25_cfg = hydra_cfg.satellite_pm25
-shapefiles_cfg = hydra_cfg.shapefiles
+satellite_pm25_cfg = cfg.satellite_pm25
+shapefiles_cfg = cfg.shapefiles
 
-shapefile_years_list = list(shapefiles_cfg[polygon_name].keys())
+shapefile_years_list = shapefiles_cfg[polygon_name].years
 
 months_list = "01" if temporal_freq == 'yearly' else [str(i).zfill(2) for i in range(1, 12 + 1)]
-years_list = list(range(1998, 2022 + 1))
+years_list = list(range(1998, 2023 + 1))
 
 # == Define rules ==
 rule all:
     input:
         expand(
-            f"data/output/pm25__washu/{polygon_name}_{temporal_freq}/pm25__washu__{polygon_name}_{temporal_freq}__" +  
-                ("{year}.parquet" if temporal_freq == 'yearly' else "{year}_{month}.parquet"), 
-            year=years_list,
-            month=months_list
+            f"{cfg.datapaths.base_path}/output/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__" +  
+                "{year}.parquet", 
+            year=years_list
+        ) if temporal_freq == 'yearly' else expand(
+            f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet",
+            year=years_list
         )
 
-# remove and use symlink to the us census geoboundaries 
-rule download_shapefiles:
-    output:
-        f"data/input/shapefiles/shapefile_{polygon_name}_" + "{shapefile_year}/shapefile.shp" 
-    shell:
-        f"python src/download_shapefile.py polygon_name={polygon_name} " + "shapefile_year={wildcards.shapefile_year}"
-
 rule download_satellite_pm25:
     output:
         expand(
-            f"data/input/pm25__washu__raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + 
+            f"{cfg.datapaths.base_path}/input/raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + 
             ("{year}01-{year}12.nc" if temporal_freq == 'yearly' else "{year}{month}-{year}{month}.nc"), 
             year=years_list,
             month=months_list)
@@ -58,23 +53,24 @@ rule download_satellite_pm25:
 
 def get_shapefile_input(wildcards):
     shapefile_year = available_shapefile_year(int(wildcards.year), shapefile_years_list)
-    return f"data/input/shapefiles/shapefile_{polygon_name}_{shapefile_year}/shapefile.shp"
+    shapefile_prefix = shapefiles_cfg[polygon_name].prefix
+    shapefile_name = f"{shapefile_prefix}{shapefile_year}"
+    return f"{cfg.datapaths.base_path}/input/shapefiles/{polygon_name}_yearly/{shapefile_name}/{shapefile_name}.shp"
 
 rule aggregate_pm25:
     input:
         get_shapefile_input,
         expand(
-            f"data/input/pm25__washu__raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + 
+            f"{cfg.datapaths.base_path}/input/raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + 
             ("{{year}}01-{{year}}12.nc" if temporal_freq == 'yearly' else "{{year}}{month}-{{year}}{month}.nc"), 
             month=months_list
         )
 
     output:
-        expand(
-            f"data/output/pm25__washu/{polygon_name}_{temporal_freq}/pm25__washu__{polygon_name}_{temporal_freq}__" + 
-            ("{{year}}.parquet" if temporal_freq == 'yearly' else "{{year}}_{month}.parquet"), 
-            month=months_list  # we only want to expand months_list and keep year as wildcard
-        )
+        [f"{cfg.datapaths.base_path}/output/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__{{year}}.parquet"] if temporal_freq == 'yearly' else [
+            f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__{{year}}_{month}.parquet"
+            for month in months_list
+        ]
     log:
         f"logs/satellite_pm25_{polygon_name}_{{year}}.log"
     shell:
@@ -83,3 +79,22 @@ rule aggregate_pm25:
             ("year={wildcards.year}" if temporal_freq == 'yearly' else "year={wildcards.year}") +
             " &> {log}"
         )
+
+rule concat_monthly:
+    # This rule is only needed when temporal_freq is 'monthly' to create yearly files
+    # Combines monthly parquet files from intermediate directory into a single yearly parquet file
+    input:
+        lambda wildcards: expand(
+            f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + wildcards.year + "_{month}.parquet",
+            month=[str(i).zfill(2) for i in range(1, 12 + 1)]
+        )
+    output:
+        yearly_file=f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet"
+    log:
+        f"logs/concat_monthly_{polygon_name}_" + "{year}.log"
+    shell:
+        f"PYTHONPATH=. python src/concat_monthly.py polygon_name={polygon_name} " + "year={wildcards.year} &> {log}"
+
+
+
+
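
Review note on the Snakefile change: after this diff, both frequencies end in one parquet file per year under `output/`, and the monthly run additionally writes twelve per-month files under `intermediate/` that `concat_monthly` combines. The sketch below rebuilds those paths with plain string formatting so the naming scheme can be checked outside Snakemake. The function names are hypothetical; the path templates are copied from the rules above.

```python
def month_codes():
    """Zero-padded month strings "01".."12", as in the Snakefile's months_list."""
    return [str(i).zfill(2) for i in range(1, 12 + 1)]


def final_target(base_path, polygon_name, temporal_freq, year):
    """One parquet file per year in the output directory, for either frequency."""
    stem = f"pm25__randall__{polygon_name}_{temporal_freq}__{year}"
    return f"{base_path}/output/{polygon_name}_{temporal_freq}/{stem}.parquet"


def intermediate_files(base_path, polygon_name, year):
    """Twelve per-month parquet files that concat_monthly reads for one year."""
    folder = f"{base_path}/intermediate/{polygon_name}_monthly"
    return [
        f"{folder}/pm25__randall__{polygon_name}_monthly__{year}_{m}.parquet"
        for m in month_codes()
    ]


print(final_target("data/V5GL", "county", "yearly", 2020))
print(intermediate_files("data/V5GL", "county", 2020)[0])
```

Because the yearly and monthly branches of `rule all` produce the same path pattern once `temporal_freq` is substituted, a single helper like `final_target` covers both cases.
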
diff --git a/conf/config.yaml b/conf/config.yaml
@@ -1,15 +1,15 @@
 defaults:
   - _self_
-  - datapaths: cannon_datapaths
+  - datapaths: cannon_v5gl
   - shapefiles: shapefiles
-  - satellite_pm25: us_pm25
+  - satellite_pm25: V5GL0502.HybridPM25c_0p05.NorthAmerica
 
 # == aggregation args
 temporal_freq: yearly # yearly, monthly to be matched with cfg.satellite_pm25
 year: 2020
 
 # == shapefile download args
-polygon_name: zcta # zcta, county to be matched with cfg.shapefiles
+polygon_name: county # zcta, county to be matched with cfg.shapefiles
 shapefile_year: 2020 #to be matched with cfg.shapefiles
 
 show_progress: false

diff --git a/conf/datapaths/cannon_datapaths.yaml b/conf/datapaths/cannon_datapaths.yaml
diff --git a/conf/datapaths/cannon_v5gl.yaml b/conf/datapaths/cannon_v5gl.yaml
@@ -0,0 +1,20 @@
+base_path: data/V5GL
+
+dirs:
+  input:
+    raw: 
+      yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/raw/yearly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/yearly 
+      monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/raw/monthly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/monthly
+    shapefiles: 
+      county_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/county_yearly
+      zcta_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/zcta_yearly
+
+  intermediate:
+    zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/intermediate/zcta_monthly
+    county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/intermediate/county_monthly
+
+  output:
+    zcta_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/zcta_yearly
+    zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/zcta_monthly
+    county_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/county_yearly
+    county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/county_monthly
diff --git a/conf/datapaths/cannon_v6gl.yaml b/conf/datapaths/cannon_v6gl.yaml
@@ -0,0 +1,18 @@
+base_path: data/V6GL
+
+dirs:
+  input:
+    raw: 
+      yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/raw/yearly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/yearly 
+      monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/raw/monthly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/monthly
+    shapefiles: 
+      county_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/county_yearly
+      zcta_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/zcta_yearly
+  intermediate:
+    zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/intermediate/zcta_monthly
+    county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/intermediate/county_monthly
+  output:
+    zcta_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/zcta_yearly
+    zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/zcta_monthly
+    county_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/county_yearly
+    county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/county_monthly
diff --git a/conf/datapaths/datapaths.yaml b/conf/datapaths/datapaths.yaml
@@ -1,12 +1,16 @@
-# if files are stored within the local copy of the repository, then use null:
-input:
-  pm25__washu__raw: 
-    yearly: null
-    monthly: null
-  shapefiles: null
+base_path: data/V6GL
 
-output:
-  pm25__washu: 
+dirs:
+  input:
+    raw: 
+      yearly: null
+      monthly: null
+    shapefiles: null
+
+  intermediate:
+    zcta_monthly: null
+    county_monthly: null
+  output:
     zcta_yearly: null
     zcta_monthly: null
     county_yearly: null

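Review note on the datapaths configs: each file pairs a `base_path` with a nested `dirs` mapping whose leaves are either an external directory or `null`. The sketch below flattens such a mapping into (local path, target) pairs, which is one way a script like `src/create_datapaths.py` could decide where to create symlinks. This is an assumption about the approach, not the script's actual code: `plan_links` is a hypothetical name, the example targets are placeholders, and treating `null` as "keep a plain local directory" follows the comment removed from `datapaths.yaml`.

```python
def plan_links(base_path, dirs, prefix=()):
    """Flatten a nested `dirs` mapping into (local_path, target) pairs.

    `target` is None when the config value is null, read here as "data lives
    inside the repository copy, no symlink needed" (assumption).
    """
    pairs = []
    for key, value in dirs.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Recurse into nested sections such as input -> raw -> yearly.
            pairs.extend(plan_links(base_path, value, path))
        else:
            pairs.append(("/".join((base_path,) + path), value))
    return pairs


# Placeholder targets for illustration only.
example_dirs = {
    "input": {"raw": {"yearly": "/lab/raw/yearly", "monthly": None}},
    "output": {"county_yearly": "/lab/county_yearly"},
}
for local, target in plan_links("data/V5GL", example_dirs):
    print(local, "->", target or "(local directory)")
```

The local paths this produces (`data/V5GL/input/raw/yearly`, `data/V5GL/output/county_yearly`, and so on) match the `{cfg.datapaths.base_path}/...` patterns used in the Snakefile rules.
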
diff --git a/conf/satellite_pm25/us_pm25.yaml → ...V5GL04.HybridPM25c_0p10.NorthAmerica.yaml b/conf/satellite_pm25/us_pm25.yaml → ...V5GL04.HybridPM25c_0p10.NorthAmerica.yaml