Changes from 28 commits
29 commits
f1edea4
modify annual -> yearly washu -> randall
audiracmichelle Sep 13, 2025
486da59
updated env
audiracmichelle Sep 13, 2025
a5f0cb1
renamed file
audiracmichelle Sep 13, 2025
cd7a5cd
fixed typo
audiracmichelle Sep 13, 2025
cc423d4
distinguished between v5 versions
audiracmichelle Sep 13, 2025
b670581
updated create_dir_paths
audiracmichelle Sep 13, 2025
b9f0298
utilized folder_cfg.base_path
audiracmichelle Sep 13, 2025
f526036
modify structure
audiracmichelle Sep 13, 2025
9d7a5ac
moved file
audiracmichelle Sep 13, 2025
2a75e31
updated confs for v5 and v6
audiracmichelle Sep 13, 2025
26d017c
tested basepaths
audiracmichelle Sep 14, 2025
3e2c4ba
modified paths
audiracmichelle Sep 14, 2025
2c02695
tested basepaths
audiracmichelle Sep 14, 2025
7d318cc
tested base paths
audiracmichelle Sep 14, 2025
453fa67
tested base paths
audiracmichelle Sep 14, 2025
b05e2ee
renamed files
audiracmichelle Sep 14, 2025
1c36ecb
updated jobs
audiracmichelle Sep 14, 2025
20e9a3b
Trying to flatten monthly download directory so that there aren't any…
shreyanalluri Nov 6, 2025
2c49563
layer name for V6GL is "PM25" not "GWRPM25"
shreyanalluri Nov 6, 2025
4f1ae9d
starter script to concatenate monthly files into one yearly file
shreyanalluri Nov 6, 2025
3036cad
Updating snakefile to include monthly file concatenation
shreyanalluri Nov 6, 2025
877d8ca
Updating concatenation script to only run for a single year, updating…
shreyanalluri Nov 12, 2025
d110a26
output monthly aggregations to intermediate directory
shreyanalluri Nov 13, 2025
4c371b3
updating where intermediate monthly aggregations are stored
shreyanalluri Nov 13, 2025
5fb24d0
Updating snakefile inputs and outputs for intermediate directory
shreyanalluri Nov 13, 2025
99df6e7
Updating pipeline to work with lego shapefiles instead of shapefile d…
shreyanalluri Nov 20, 2025
76bc6f2
update readme to reflect changes to pipeline
shreyanalluri Nov 21, 2025
12b4129
remove shapefile download entirely
shreyanalluri Nov 21, 2025
5799233
remove shapefile download
shreyanalluri Nov 21, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,6 +10,9 @@ logs/
## slurm ----
**/slurm*

## data files ----
data/

## Python ----

# Byte-compiled / optimized / DLL files
6 changes: 3 additions & 3 deletions Dockerfile
@@ -13,8 +13,8 @@ RUN mamba env update -n base -f requirements.yaml
#&& mamba clean -a

# Create paths to data placeholders
RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.annual=null datapaths.input.satellite_pm25.monthly=null
RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.yearly=null datapaths.input.satellite_pm25.monthly=null

Copilot AI Nov 21, 2025

The script path and argument structure have changed. This should be updated to:

RUN python src/create_datapaths.py

The old command line arguments (datapaths.input.satellite_pm25.yearly=null) are no longer compatible with the new configuration structure that uses datapaths.base_path and datapaths.dirs.

Suggested change
RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.yearly=null datapaths.input.satellite_pm25.monthly=null
RUN python src/create_datapaths.py

# snakemake --configfile conf/config.yaml --cores 4 -C temporal_freq=annual
# snakemake --configfile conf/config.yaml --cores 4 -C temporal_freq=yearly
ENTRYPOINT ["snakemake", "--configfile", "conf/config.yaml"]
CMD ["--cores", "4", "-C", "polygon_name=county", "temporal_freq=annual"]
CMD ["--cores", "4", "-C", "polygon_name=county", "temporal_freq=yearly"]
113 changes: 98 additions & 15 deletions README.md
@@ -1,4 +1,4 @@
# pm25_washu_raster2polygon
# pm25_randall_raster2polygon

Code to produce spatial aggregations of pm25 estimates as generated by the [Atmospheric Composition Analysis Group](https://sites.wustl.edu/acag/datasets/surface-pm2-5/). The spatial aggregations are performed for satellite pm25 from grid/raster (NetCDF) to polygons (shp).

@@ -10,7 +10,7 @@ The [Atmospheric Composition Analysis Group](https://sites.wustl.edu/acag/datase

The version [V5.GL.04](https://sites.wustl.edu/acag/datasets/surface-pm2-5/#V5.GL.04) consists of mean PM2.5 (ug/m3) available at:

* Temporal frequency: Annual and monthly
* Temporal frequency: yearly and monthly
* Grid resolutions: (0.1° × 0.1°) and (0.01° × 0.01°)
* Geographic regions: North America, Europe, Asia, and Global

@@ -29,26 +29,89 @@ Aaron van Donkelaar, Melanie S. Hammer, Liam Bindle, Michael Brauer, Jeffery R.

# Codebook

## Dataset Columns:
## Dataset Output Structure

* county aggregations:
The pipeline produces parquet files with PM2.5 aggregations at the polygon level.

* zcta aggregations:
### Yearly aggregations (`temporal_freq=yearly`):

Output path: `data/V5GL/output/{polygon_name}_yearly/pm25__randall__{polygon_name}_yearly__{year}.parquet`

Columns:
* `{polygon_id}` (index): County FIPS code, ZCTA code, or Census Tract GEOID depending on polygon type
* `pm25` (float64): Mean PM2.5 concentration (µg/m³) for the year
* `year` (int64): Year of observation

### Monthly aggregations (`temporal_freq=monthly`):

Output path: `data/V5GL/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__{year}.parquet`

Columns:
* `{polygon_id}` (index): County FIPS code, ZCTA code, or Census Tract GEOID depending on polygon type
* `pm25` (float64): Mean PM2.5 concentration (µg/m³) for the month
* `year` (int64): Year of observation
* `month` (object/int): Month of observation

### Polygon ID Variables

The index column name varies by polygon type and is defined in `conf/shapefiles/shapefiles.yaml`:
* **County**: `county` (5-digit FIPS code, e.g., "01001" for Autauga County, Alabama)
* **ZCTA**: `zcta` (5-digit ZIP Code Tabulation Area, e.g., "00601")
* **Census Tract**: `GEOID` (11-digit census tract identifier)

### Intermediate Files (Monthly only)

During monthly processing, intermediate files are created:
* Path: `data/V5GL/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__{year}_{month}.parquet`
* These are concatenated into yearly files in the output directory

---

# Configuration files

The configuration structure withing the `/conf` folder allow you to modify the input parameters for the following steps:
The configuration structure within the `/conf` folder allows you to modify the input parameters for the following steps:

* create directory paths: `utils/create_dir_paths.py`
* create directory paths: `src/create_datapaths.py`
* download pm25: `src/download_pm25.py`
* download shapefiles: `src/download_shapefile.py`
* aggregate pm25: `src/aggregate_pm25.py`
* concatenate monthly files: `src/concat_monthly.py`

The key parameters are:
* `temporal_freq` which determines whether the original annual or monthly pm25 files will be aggregated. The options are: `annual` and `monthly`.
* `polygon_name` which determines into which polygons the pm25 grid will the aggregated. The options are: `zcta` and `county`.
* `temporal_freq` which determines whether the original yearly or monthly pm25 files will be aggregated. The options are: `yearly` and `monthly`.
* `polygon_name` which determines into which polygons the pm25 grid will be aggregated. The options are: `zcta`, `county`, and `census_tract`.

Copilot AI Nov 21, 2025

Typo: "will the aggregated" should be "will be aggregated".

## Configuration Structure

The configuration system uses Hydra with the following structure:

* `/conf/config.yaml` - Main configuration file with default settings
* `/conf/datapaths/` - Data path configurations for different environments:
* `cannon_v5gl.yaml` - Paths for V5GL data on Cannon cluster
* `cannon_v6gl.yaml` - Paths for V6GL data on Cannon cluster
* `datapaths.yaml` - Template configuration
* `/conf/shapefiles/shapefiles.yaml` - Shapefile metadata including:
* Available years for each polygon type
* ID column names (`idvar`)
* File naming prefixes
* Download URLs (optional, via `url_map`)
* `/conf/satellite_pm25/` - PM2.5 dataset configurations for different versions
* `/conf/snakemake.yaml` - Default parameters for Snakemake workflow

## Shapefile Configuration

Shapefiles can be obtained in two ways:

1. **Symlinks to existing Lab shapefiles** (recommended for Cannon cluster):
```bash
python src/create_datapaths.py
```
This creates symbolic links from the project's `data/` directory to the Lab's existing shapefile repository at `/n/dominici_lab/lab/lego/geoboundaries/`.

2. **Direct download** from Census Bureau (optional):
Shapefiles can be downloaded automatically if URLs are configured in `conf/shapefiles/shapefiles.yaml`. The download script will use the `url_map` for the specified year.

The pipeline falls back to earlier shapefiles when needed: if the PM2.5 data is from a year with no exact shapefile match, it automatically selects the most recent prior shapefile year available.
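
The fallback rule described above can be sketched in a few lines of Python. This is only an illustration: `pick_shapefile_year` is a hypothetical name, the year list is made up, and the pipeline's own helper (the Snakefile calls `available_shapefile_year`) may behave differently, particularly when no earlier shapefile year exists. Raising an error in that case is an assumption of this sketch.

```python
from typing import Iterable


def pick_shapefile_year(data_year: int, shapefile_years: Iterable[int]) -> int:
    """Return the most recent shapefile year that is <= data_year.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    candidates = [int(y) for y in shapefile_years if int(y) <= data_year]
    if not candidates:
        # Assumption: fail loudly instead of silently picking a later year.
        raise ValueError(f"no shapefile year at or before {data_year}")
    return max(candidates)


if __name__ == "__main__":
    years = [2000, 2010, 2020]  # illustrative values, not read from shapefiles.yaml
    for data_year in (2005, 2010, 2023):
        print(data_year, "->", pick_shapefile_year(data_year, years))
```

With a year list like this, PM2.5 data for any year between two shapefile years would be aggregated over the earlier of the two.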

---

@@ -75,17 +138,37 @@ mamba activate <env_name>

## Input and output paths

Run
The pipeline requires setting up directory paths and symbolic links to data sources. Run:

```bash
python src/create_datapaths.py
```

This script:
* Creates the base directory structure under `data/V5GL/` (or `data/V6GL/` depending on configuration)
* Creates symbolic links to:
* PM2.5 raw data at `/n/dominici_lab/lab/lego/environmental/pm25__randall/`
* Shapefiles at `/n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/`
* Output directories for aggregated results
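
As a rough sketch of what such a script might do, the snippet below walks a nested `dirs` mapping shaped like the one in `conf/datapaths/*.yaml` and, under `base_path`, creates a local directory for each `null` leaf and a symbolic link for each leaf that names an external location. The function name `create_datapaths` and the plain-dict input are assumptions made for illustration; the actual `src/create_datapaths.py` may be organized differently.

```python
import os
from pathlib import Path
from typing import Any, List, Mapping


def create_datapaths(base_path: str, dirs: Mapping[str, Any]) -> List[Path]:
    """Materialize each leaf of a nested `dirs` mapping under base_path.

    Leaf value None -> create a real directory inside the repository copy.
    Leaf value str  -> create a symlink pointing at that external location.
    """
    created: List[Path] = []

    def walk(node: Mapping[str, Any], prefix: Path) -> None:
        for name, value in node.items():
            path = prefix / name
            if isinstance(value, Mapping):
                walk(value, path)  # descend into input/, intermediate/, output/
            elif value is None:
                path.mkdir(parents=True, exist_ok=True)
                created.append(path)
            else:
                path.parent.mkdir(parents=True, exist_ok=True)
                if not path.is_symlink() and not path.exists():
                    os.symlink(value, path, target_is_directory=True)
                created.append(path)

    walk(dirs, Path(base_path))
    return created


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a shared storage location such as the lab's lego folder.
        external = Path(tmp) / "lab_storage" / "county_yearly"
        external.mkdir(parents=True)
        dirs = {
            "input": {"raw": {"yearly": None}},
            "output": {"county_yearly": str(external)},
        }
        for p in create_datapaths(str(Path(tmp) / "data" / "V5GL"), dirs):
            print(p, "->", os.readlink(p) if p.is_symlink() else "(local dir)")
```

Keeping the `null` convention means the same code path serves both a local checkout and a cluster setup where data lives elsewhere.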

To use a different configuration, specify the datapaths config file:

```bash
python utils/create_dir_paths.py
python src/create_datapaths.py datapaths=cannon_v6gl
```

## Pipeline

You can run the pipeline steps manually or run the snakemake pipeline described in the Snakefile.
The pipeline consists of four main steps:

1. **Download/link shapefiles**: Obtain or link to US Census shapefiles (counties, ZCTAs, or census tracts)
2. **Download PM2.5 data**: Download satellite PM2.5 NetCDF files from Washington University
3. **Aggregate PM2.5**: Perform spatial aggregation from raster grid to polygons
4. **Concatenate monthly files** (monthly frequency only): Combine monthly parquet files into yearly files

You can run the pipeline steps manually or use the Snakemake workflow.

**run pipeline steps manually**
### Run pipeline steps manually

```bash
python src/download_shapefile.py
@@ -98,7 +181,7 @@ python src/aggregate_pm25.py
or run the pipeline:

```bash
snakemake --cores 4 -C polygon_name=county temporal_freq=annual
snakemake --cores 4 -C polygon_name=county temporal_freq=yearly
```

Modify `cores`, `polygon_name` and `temporal_freq` as you find convenient.
@@ -115,7 +198,7 @@ mkdir <path>/satellite_pm25_raster2polygon

```bash
docker pull nsaph/satellite_pm25_raster2polygon
docker run -v <path>:/app/data/input/satellite_pm25/annual <path>/satellite_pm25_raster2polygon/:/app/data/output/satellite_pm25_raster2polygon nsaph/satellite_pm25_raster2polygon
docker run -v <path>:/app/data/input/satellite_pm25/yearly -v <path>/satellite_pm25_raster2polygon/:/app/data/output/satellite_pm25_raster2polygon nsaph/satellite_pm25_raster2polygon
```

If you are interested in storing the input raw and intermediate data run
63 changes: 39 additions & 24 deletions Snakefile
@@ -17,37 +17,32 @@ temporal_freq = config['temporal_freq']
polygon_name = config['polygon_name']

with initialize(version_base=None, config_path="conf"):
hydra_cfg = compose(config_name="config", overrides=[f"temporal_freq={temporal_freq}", f"polygon_name={polygon_name}"])
cfg = compose(config_name="config", overrides=[f"temporal_freq={temporal_freq}", f"polygon_name={polygon_name}"])

satellite_pm25_cfg = hydra_cfg.satellite_pm25
shapefiles_cfg = hydra_cfg.shapefiles
satellite_pm25_cfg = cfg.satellite_pm25
shapefiles_cfg = cfg.shapefiles

shapefile_years_list = list(shapefiles_cfg[polygon_name].keys())
shapefile_years_list = shapefiles_cfg[polygon_name].years

months_list = "01" if temporal_freq == 'yearly' else [str(i).zfill(2) for i in range(1, 12 + 1)]
years_list = list(range(1998, 2022 + 1))
years_list = list(range(1998, 2023 + 1))

# == Define rules ==
rule all:
input:
expand(
f"data/output/pm25__washu/{polygon_name}_{temporal_freq}/pm25__washu__{polygon_name}_{temporal_freq}__" +
("{year}.parquet" if temporal_freq == 'yearly' else "{year}_{month}.parquet"),
year=years_list,
month=months_list
f"{cfg.datapaths.base_path}/output/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__" +
"{year}.parquet",
year=years_list
) if temporal_freq == 'yearly' else expand(
f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet",
year=years_list
)

# remove and use symlink to the us census geoboundaries
rule download_shapefiles:
output:
f"data/input/shapefiles/shapefile_{polygon_name}_" + "{shapefile_year}/shapefile.shp"
shell:
f"python src/download_shapefile.py polygon_name={polygon_name} " + "shapefile_year={wildcards.shapefile_year}"

rule download_satellite_pm25:
output:
expand(
f"data/input/pm25__washu__raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." +
f"{cfg.datapaths.base_path}/input/raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." +
("{year}01-{year}12.nc" if temporal_freq == 'yearly' else "{year}{month}-{year}{month}.nc"),
year=years_list,
month=months_list)
@@ -58,23 +53,24 @@ rule download_satellite_pm25:

def get_shapefile_input(wildcards):
shapefile_year = available_shapefile_year(int(wildcards.year), shapefile_years_list)
return f"data/input/shapefiles/shapefile_{polygon_name}_{shapefile_year}/shapefile.shp"
shapefile_prefix = shapefiles_cfg[polygon_name].prefix
shapefile_name = f"{shapefile_prefix}{shapefile_year}"
return f"{cfg.datapaths.base_path}/input/shapefiles/{polygon_name}_yearly/{shapefile_name}/{shapefile_name}.shp"

rule aggregate_pm25:
input:
get_shapefile_input,
expand(
f"data/input/pm25__washu__raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." +
f"{cfg.datapaths.base_path}/input/raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." +
("{{year}}01-{{year}}12.nc" if temporal_freq == 'yearly' else "{{year}}{month}-{{year}}{month}.nc"),
month=months_list
)

output:
expand(
f"data/output/pm25__washu/{polygon_name}_{temporal_freq}/pm25__washu__{polygon_name}_{temporal_freq}__" +
("{{year}}.parquet" if temporal_freq == 'yearly' else "{{year}}_{month}.parquet"),
month=months_list # we only want to expand months_list and keep year as wildcard
)
[f"{cfg.datapaths.base_path}/output/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__{{year}}.parquet"] if temporal_freq == 'yearly' else [
f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__{{year}}_{month}.parquet"
for month in months_list
]
log:
f"logs/satellite_pm25_{polygon_name}_{{year}}.log"
shell:
@@ -83,3 +79,22 @@ rule aggregate_pm25:
("year={wildcards.year}" if temporal_freq == 'yearly' else "year={wildcards.year}") +
" &> {log}"
)

rule concat_monthly:
# This rule is only needed when temporal_freq is 'monthly' to create yearly files
# Combines monthly parquet files from intermediate directory into a single yearly parquet file
input:
lambda wildcards: expand(
f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + wildcards.year + "_{month}.parquet",
month=[str(i).zfill(2) for i in range(1, 12 + 1)]
)
output:
yearly_file=f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet"
log:
f"logs/concat_monthly_{polygon_name}_" + "{year}.log"
shell:
f"PYTHONPATH=. python src/concat_monthly.py polygon_name={polygon_name} " + "year={wildcards.year} &> {log}"
Comment on lines +83 to +96

Copilot AI Nov 21, 2025

The concat_monthly rule is defined but never called when temporal_freq='monthly'. Looking at the rule all input, when temporal_freq is monthly, it expects files in the output directory, but aggregate_pm25 saves to the intermediate directory. The concat_monthly rule should be a dependency to create these output files, but it's not connected to the workflow. Consider updating rule all to ensure concat_monthly outputs are the target when temporal_freq is monthly.
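
One way to check this concern is to compare the path patterns directly. The sketch below copies the monthly patterns from the Snakefile diff and tests whether the targets requested by `rule all` match the output of `concat_monthly`, and whether the inputs of `concat_monthly` match the monthly outputs of `aggregate_pm25`. The values of `base` and `poly` are assumed for illustration. Since Snakemake links rules by matching requested files against output patterns, matching patterns would suggest the rules are already chained; a dry run (`snakemake -n`) with `temporal_freq=monthly` would be the way to confirm it.

```python
base = "data/V5GL"   # assumed value of cfg.datapaths.base_path
poly = "county"      # assumed polygon_name
temporal_freq = "monthly"

# Target pattern requested by `rule all` in the monthly branch.
all_target = f"{base}/output/{poly}_monthly/pm25__randall__{poly}_monthly__" + "{year}.parquet"

# Output and input patterns of `concat_monthly`.
concat_out = f"{base}/output/{poly}_monthly/pm25__randall__{poly}_monthly__" + "{year}.parquet"
concat_in = f"{base}/intermediate/{poly}_monthly/pm25__randall__{poly}_monthly__" + "{year}_{month}.parquet"

# Output pattern of `aggregate_pm25` when temporal_freq is monthly.
agg_out = (
    f"{base}/intermediate/{poly}_{temporal_freq}/"
    f"pm25__randall__{poly}_{temporal_freq}__" + "{year}_{month}.parquet"
)


def chain_is_connected() -> bool:
    """True if rule all -> concat_monthly -> aggregate_pm25 line up by path."""
    return all_target == concat_out and concat_in == agg_out


if __name__ == "__main__":
    print("rule all target matches concat_monthly output:", all_target == concat_out)
    print("concat_monthly input matches aggregate_pm25 output:", concat_in == agg_out)
    print("example target:", all_target.format(year=2020))
```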

6 changes: 3 additions & 3 deletions conf/config.yaml
@@ -1,15 +1,15 @@
defaults:
- _self_
- datapaths: cannon_datapaths
- datapaths: cannon_v5gl
- shapefiles: shapefiles
- satellite_pm25: us_pm25
- satellite_pm25: V5GL0502.HybridPM25c_0p05.NorthAmerica

# == aggregation args
temporal_freq: yearly # yearly, monthly to be matched with cfg.satellite_pm25
year: 2020

# == shapefile download args
polygon_name: zcta # zcta, county to be matched with cfg.shapefiles
polygon_name: county # zcta, county to be matched with cfg.shapefiles
shapefile_year: 2020 #to be matched with cfg.shapefiles

show_progress: false
13 changes: 0 additions & 13 deletions conf/datapaths/cannon_datapaths.yaml

This file was deleted.

20 changes: 20 additions & 0 deletions conf/datapaths/cannon_v5gl.yaml
@@ -0,0 +1,20 @@
base_path: data/V5GL

dirs:
input:
raw:
yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/raw/yearly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/yearly
monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/raw/monthly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/monthly
shapefiles:
county_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/county_yearly
zcta_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/zcta_yearly

intermediate:
zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/intermediate/zcta_monthly
county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/intermediate/county_monthly

output:
zcta_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/zcta_yearly
zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/zcta_monthly
county_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/county_yearly
county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/county_monthly
18 changes: 18 additions & 0 deletions conf/datapaths/cannon_v6gl.yaml
@@ -0,0 +1,18 @@
base_path: data/V6GL

dirs:
input:
raw:
yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/raw/yearly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/yearly
monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/raw/monthly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/monthly
shapefiles:
county_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/county_yearly
zcta_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/zcta_yearly
intermediate:
zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/intermediate/zcta_monthly
county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/intermediate/county_monthly
output:
zcta_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/zcta_yearly
zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/zcta_monthly
county_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/county_yearly
county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/county_monthly
20 changes: 12 additions & 8 deletions conf/datapaths/datapaths.yaml
@@ -1,12 +1,16 @@
# if files are stored within the local copy of the repository, then use null:
input:
pm25__washu__raw:
yearly: null
monthly: null
shapefiles: null
base_path: data/V6GL

output:
pm25__washu:
dirs:
input:
raw:
yearly: null
monthly: null
shapefiles: null

intermediate:
zcta_monthly: null
county_monthly: null
output:
zcta_yearly: null
zcta_monthly: null
county_yearly: null