Audiracmichelle/issue42 #46
base: main
Changes from 28 commits
f1edea4
486da59
a5f0cb1
cd7a5cd
cc423d4
b670581
b9f0298
f526036
9d7a5ac
2a75e31
26d017c
3e2c4ba
2c02695
7d318cc
453fa67
b05e2ee
1c36ecb
20e9a3b
2c49563
4f1ae9d
3036cad
877d8ca
d110a26
4c371b3
5fb24d0
99df6e7
76bc6f2
12b4129
5799233
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,4 @@ | ||
| # pm25_washu_raster2polygon | ||
| # pm25_randall_raster2polygon | ||
|
|
||
| Code to produce spatial aggregations of pm25 estimates as generated by the [Atmospheric Composition Analysis Group](https://sites.wustl.edu/acag/datasets/surface-pm2-5/). The spatial aggregations are performed for satellite pm25 from grid/raster (NetCDF) to polygons (shp). | ||
|
|
||
|
|
@@ -10,7 +10,7 @@ The [Atmospheric Composition Analysis Group](https://sites.wustl.edu/acag/datase | |
|
|
||
| The version [V5.GL.04](https://sites.wustl.edu/acag/datasets/surface-pm2-5/#V5.GL.04) consists of mean PM2.5 (ug/m3) available at: | ||
|
|
||
| * Temporal frequency: Annual and monthly | ||
| * Temporal frequency: yearly and monthly | ||
| * Grid resolutions: (0.1° × 0.1°) and (0.01° × 0.01°) | ||
| * Geographic regions: North America, Europe, Asia, and Global | ||
|
|
||
|
|
@@ -29,26 +29,89 @@ Aaron van Donkelaar, Melanie S. Hammer, Liam Bindle, Michael Brauer, Jeffery R. | |
|
|
||
| # Codebook | ||
|
|
||
| ## Dataset Columns: | ||
| ## Dataset Output Structure | ||
|
|
||
| * county aggregations: | ||
| The pipeline produces parquet files with PM2.5 aggregations at the polygon level. | ||
|
|
||
| * zcta aggregations: | ||
| ### Yearly aggregations (`temporal_freq=yearly`): | ||
|
|
||
| Output path: `data/V5GL/output/{polygon_name}_yearly/pm25__randall__{polygon_name}_yearly__{year}.parquet` | ||
|
|
||
| Columns: | ||
| * `{polygon_id}` (index): County FIPS code, ZCTA code, or Census Tract GEOID depending on polygon type | ||
| * `pm25` (float64): Mean PM2.5 concentration (µg/m³) for the year | ||
| * `year` (int64): Year of observation | ||
|
|
||
| ### Monthly aggregations (`temporal_freq=monthly`): | ||
|
|
||
| Output path: `data/V5GL/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__{year}.parquet` | ||
|
|
||
| Columns: | ||
| * `{polygon_id}` (index): County FIPS code, ZCTA code, or Census Tract GEOID depending on polygon type | ||
| * `pm25` (float64): Mean PM2.5 concentration (µg/m³) for the month | ||
| * `year` (int64): Year of observation | ||
| * `month` (object/int): Month of observation | ||
|
|
||
| ### Polygon ID Variables | ||
|
|
||
| The index column name varies by polygon type and is defined in `conf/shapefiles/shapefiles.yaml`: | ||
| * **County**: `county` (5-digit FIPS code, e.g., "01001" for Autauga County, Alabama) | ||
| * **ZCTA**: `zcta` (5-digit ZIP Code Tabulation Area, e.g., "00601") | ||
| * **Census Tract**: `GEOID` (11-digit census tract identifier) | ||
|
|
||
| ### Intermediate Files (Monthly only) | ||
|
|
||
| During monthly processing, intermediate files are created: | ||
| * Path: `data/V5GL/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__{year}_{month}.parquet` | ||
| * These are concatenated into yearly files in the output directory | ||
|
|
||
| --- | ||
|
|
||
| # Configuration files | ||
|
|
||
| The configuration structure withing the `/conf` folder allow you to modify the input parameters for the following steps: | ||
| The configuration structure within the `/conf` folder allows you to modify the input parameters for the following steps: | ||
|
|
||
| * create directory paths: `utils/create_dir_paths.py` | ||
| * create directory paths: `src/create_datapaths.py` | ||
| * download pm25: `src/download_pm25.py` | ||
| * download shapefiles: `src/download_shapefile.py` | ||
| * aggregate pm25: `src/aggregate_pm25.py` | ||
| * concatenate monthly files: `src/concat_monthly.py` | ||
|
|
||
| The key parameters are: | ||
| * `temporal_freq` which determines whether the original annual or monthly pm25 files will be aggregated. The options are: `annual` and `monthly`. | ||
| * `polygon_name` which determines into which polygons the pm25 grid will the aggregated. The options are: `zcta` and `county`. | ||
| * `temporal_freq` which determines whether the original yearly or monthly pm25 files will be aggregated. The options are: `yearly` and `monthly`. | ||
| * `polygon_name` which determines into which polygons the pm25 grid will be aggregated. The options are: `zcta`, `county`, and `census_tract`. | ||
|
||
|
|
||
| ## Configuration Structure | ||
|
|
||
| The configuration system uses Hydra with the following structure: | ||
|
|
||
| * `/conf/config.yaml` - Main configuration file with default settings | ||
| * `/conf/datapaths/` - Data path configurations for different environments: | ||
| * `cannon_v5gl.yaml` - Paths for V5GL data on Cannon cluster | ||
| * `cannon_v6gl.yaml` - Paths for V6GL data on Cannon cluster | ||
| * `datapaths.yaml` - Template configuration | ||
| * `/conf/shapefiles/shapefiles.yaml` - Shapefile metadata including: | ||
| * Available years for each polygon type | ||
| * ID column names (`idvar`) | ||
| * File naming prefixes | ||
| * Download URLs (optional, via `url_map`) | ||
| * `/conf/satellite_pm25/` - PM2.5 dataset configurations for different versions | ||
| * `/conf/snakemake.yaml` - Default parameters for Snakemake workflow | ||
|
|
||
| ## Shapefile Configuration | ||
|
|
||
| Shapefiles can be obtained in two ways: | ||
|
|
||
| 1. **Symlinks to existing Lab shapefiles** (recommended for Cannon cluster): | ||
| ```bash | ||
| python src/create_datapaths.py | ||
| ``` | ||
| This creates symbolic links from the project's `data/` directory to the Lab's existing shapefile repository at `/n/dominici_lab/lab/lego/geoboundaries/`. | ||
|
|
||
| 2. **Direct download** from Census Bureau (optional): | ||
| Shapefiles can be downloaded automatically if URLs are configured in `conf/shapefiles/shapefiles.yaml`. The download script will use the `url_map` for the specified year. | ||
|
|
||
| The pipeline falls back to earlier shapefiles: if the PM2.5 data is from a year with no exact shapefile match, it automatically selects the most recent prior shapefile year available. | ||
|
|
||
| --- | ||
|
|
||
|
|
@@ -75,17 +138,37 @@ mamba activate <env_name> | |
|
|
||
| ## Input and output paths | ||
|
|
||
| Run | ||
| The pipeline requires setting up directory paths and symbolic links to data sources. Run: | ||
|
|
||
| ```bash | ||
| python src/create_datapaths.py | ||
| ``` | ||
|
|
||
| This script: | ||
| * Creates the base directory structure under `data/V5GL/` (or `data/V6GL/` depending on configuration) | ||
| * Creates symbolic links to: | ||
| * PM2.5 raw data at `/n/dominici_lab/lab/lego/environmental/pm25__randall/` | ||
| * Shapefiles at `/n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/` | ||
| * Output directories for aggregated results | ||
|
|
||
| To use a different configuration, specify the datapaths config file: | ||
|
|
||
| ```bash | ||
| python utils/create_dir_paths.py | ||
| python src/create_datapaths.py datapaths=cannon_v6gl | ||
| ``` | ||
|
|
||
| ## Pipeline | ||
|
|
||
| You can run the pipeline steps manually or run the snakemake pipeline described in the Snakefile. | ||
| The pipeline consists of four main steps: | ||
|
|
||
| 1. **Download/link shapefiles**: Obtain or link to US Census shapefiles (counties, ZCTAs, or census tracts) | ||
| 2. **Download PM2.5 data**: Download satellite PM2.5 NetCDF files from Washington University | ||
| 3. **Aggregate PM2.5**: Perform spatial aggregation from raster grid to polygons | ||
| 4. **Concatenate monthly files** (monthly frequency only): Combine monthly parquet files into yearly files | ||
|
|
||
| You can run the pipeline steps manually or use the Snakemake workflow. | ||
|
|
||
| **run pipeline steps manually** | ||
| ### Run pipeline steps manually | ||
|
|
||
| ```bash | ||
| python src/download_shapefile.py | ||
|
|
@@ -98,7 +181,7 @@ python src/aggregate_pm25.py | |
| or run the pipeline: | ||
|
|
||
| ```bash | ||
| snakemake --cores 4 -C polygon_name=county temporal_freq=annual | ||
| snakemake --cores 4 -C polygon_name=county temporal_freq=yearly | ||
| ``` | ||
|
|
||
| Modify `cores`, `polygon_name` and `temporal_freq` as you find convenient. | ||
|
|
@@ -115,7 +198,7 @@ mkdir <path>/satellite_pm25_raster2polygon | |
|
|
||
| ```bash | ||
| docker pull nsaph/satellite_pm25_raster2polygon | ||
| docker run -v <path>:/app/data/input/satellite_pm25/annual <path>/satellite_pm25_raster2polygon/:/app/data/output/satellite_pm25_raster2polygon nsaph/satellite_pm25_raster2polygon | ||
| docker run -v <path>:/app/data/input/satellite_pm25/yearly -v <path>/satellite_pm25_raster2polygon/:/app/data/output/satellite_pm25_raster2polygon nsaph/satellite_pm25_raster2polygon | ||
| ``` | ||
|
|
||
| If you are interested in storing the input raw and intermediate data run | ||
|
|
||
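
The README's shapefile rule (use the exact year if it exists, otherwise the most recent prior year) can be sketched as below. This is a minimal illustration, not the repository's implementation: the name `pick_shapefile_year` is hypothetical, and raising an error when no earlier shapefile exists is an assumption, since the diff does not show how that case is handled.

```python
from bisect import bisect_right


def pick_shapefile_year(data_year, available_years):
    """Return the exact year if present, else the most recent earlier year."""
    years = sorted(available_years)
    # bisect_right counts how many shapefile years are <= data_year
    idx = bisect_right(years, data_year)
    if idx == 0:
        # Assumption: with no earlier shapefile, fail loudly rather than guess
        raise ValueError(f"no shapefile year at or before {data_year}")
    return years[idx - 1]


print(pick_shapefile_year(2015, [2000, 2010, 2020]))  # 2010
print(pick_shapefile_year(2020, [2000, 2010, 2020]))  # 2020
```

Sorting inside the function keeps the sketch independent of the order in which years are listed in `conf/shapefiles/shapefiles.yaml`.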
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,37 +17,32 @@ temporal_freq = config['temporal_freq'] | |
| polygon_name = config['polygon_name'] | ||
|
|
||
| with initialize(version_base=None, config_path="conf"): | ||
| hydra_cfg = compose(config_name="config", overrides=[f"temporal_freq={temporal_freq}", f"polygon_name={polygon_name}"]) | ||
| cfg = compose(config_name="config", overrides=[f"temporal_freq={temporal_freq}", f"polygon_name={polygon_name}"]) | ||
|
|
||
| satellite_pm25_cfg = hydra_cfg.satellite_pm25 | ||
| shapefiles_cfg = hydra_cfg.shapefiles | ||
| satellite_pm25_cfg = cfg.satellite_pm25 | ||
| shapefiles_cfg = cfg.shapefiles | ||
|
|
||
| shapefile_years_list = list(shapefiles_cfg[polygon_name].keys()) | ||
| shapefile_years_list = shapefiles_cfg[polygon_name].years | ||
|
|
||
| months_list = "01" if temporal_freq == 'yearly' else [str(i).zfill(2) for i in range(1, 12 + 1)] | ||
| years_list = list(range(1998, 2022 + 1)) | ||
| years_list = list(range(1998, 2023 + 1)) | ||
|
|
||
| # == Define rules == | ||
| rule all: | ||
| input: | ||
| expand( | ||
| f"data/output/pm25__washu/{polygon_name}_{temporal_freq}/pm25__washu__{polygon_name}_{temporal_freq}__" + | ||
| ("{year}.parquet" if temporal_freq == 'yearly' else "{year}_{month}.parquet"), | ||
| year=years_list, | ||
| month=months_list | ||
| f"{cfg.datapaths.base_path}/output/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__" + | ||
| "{year}.parquet", | ||
| year=years_list | ||
| ) if temporal_freq == 'yearly' else expand( | ||
| f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet", | ||
| year=years_list | ||
| ) | ||
|
|
||
| # remove and use symlink to the us census geoboundaries | ||
| rule download_shapefiles: | ||
| output: | ||
| f"data/input/shapefiles/shapefile_{polygon_name}_" + "{shapefile_year}/shapefile.shp" | ||
| shell: | ||
| f"python src/download_shapefile.py polygon_name={polygon_name} " + "shapefile_year={wildcards.shapefile_year}" | ||
|
|
||
| rule download_satellite_pm25: | ||
| output: | ||
| expand( | ||
| f"data/input/pm25__washu__raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + | ||
| f"{cfg.datapaths.base_path}/input/raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + | ||
| ("{year}01-{year}12.nc" if temporal_freq == 'yearly' else "{year}{month}-{year}{month}.nc"), | ||
| year=years_list, | ||
| month=months_list) | ||
|
|
@@ -58,23 +53,24 @@ rule download_satellite_pm25: | |
|
|
||
| def get_shapefile_input(wildcards): | ||
| shapefile_year = available_shapefile_year(int(wildcards.year), shapefile_years_list) | ||
| return f"data/input/shapefiles/shapefile_{polygon_name}_{shapefile_year}/shapefile.shp" | ||
| shapefile_prefix = shapefiles_cfg[polygon_name].prefix | ||
| shapefile_name = f"{shapefile_prefix}{shapefile_year}" | ||
| return f"{cfg.datapaths.base_path}/input/shapefiles/{polygon_name}_yearly/{shapefile_name}/{shapefile_name}.shp" | ||
|
|
||
| rule aggregate_pm25: | ||
| input: | ||
| get_shapefile_input, | ||
| expand( | ||
| f"data/input/pm25__washu__raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + | ||
| f"{cfg.datapaths.base_path}/input/raw/{temporal_freq}/{satellite_pm25_cfg[temporal_freq]['file_prefix']}." + | ||
| ("{{year}}01-{{year}}12.nc" if temporal_freq == 'yearly' else "{{year}}{month}-{{year}}{month}.nc"), | ||
| month=months_list | ||
| ) | ||
|
|
||
| output: | ||
| expand( | ||
| f"data/output/pm25__washu/{polygon_name}_{temporal_freq}/pm25__washu__{polygon_name}_{temporal_freq}__" + | ||
| ("{{year}}.parquet" if temporal_freq == 'yearly' else "{{year}}_{month}.parquet"), | ||
| month=months_list # we only want to expand months_list and keep year as wildcard | ||
| ) | ||
| [f"{cfg.datapaths.base_path}/output/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__{{year}}.parquet"] if temporal_freq == 'yearly' else [ | ||
| f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_{temporal_freq}/pm25__randall__{polygon_name}_{temporal_freq}__{{year}}_{month}.parquet" | ||
| for month in months_list | ||
| ] | ||
| log: | ||
| f"logs/satellite_pm25_{polygon_name}_{{year}}.log" | ||
| shell: | ||
|
|
@@ -83,3 +79,22 @@ rule aggregate_pm25: | |
| ("year={wildcards.year}" if temporal_freq == 'yearly' else "year={wildcards.year}") + | ||
| " &> {log}" | ||
| ) | ||
|
|
||
| rule concat_monthly: | ||
| # This rule is only needed when temporal_freq is 'monthly' to create yearly files | ||
| # Combines monthly parquet files from intermediate directory into a single yearly parquet file | ||
| input: | ||
| lambda wildcards: expand( | ||
| f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + wildcards.year + "_{month}.parquet", | ||
| month=[str(i).zfill(2) for i in range(1, 12 + 1)] | ||
| ) | ||
| output: | ||
| yearly_file=f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet" | ||
| log: | ||
| f"logs/concat_monthly_{polygon_name}_" + "{year}.log" | ||
| shell: | ||
| f"PYTHONPATH=. python src/concat_monthly.py polygon_name={polygon_name} " + "year={wildcards.year} &> {log}" | ||
|
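
For reference, the file names that `concat_monthly` consumes and produces can be reproduced outside Snakemake with plain Python. This is a sketch that mirrors the naming in the rule above; the helper name `monthly_concat_paths` is hypothetical and the example arguments are only illustrative values taken from the configs in this PR.

```python
def monthly_concat_paths(base_path, polygon_name, year):
    """Return (inputs, output) paths following the concat_monthly rule's naming."""
    stem = f"pm25__randall__{polygon_name}_monthly__{year}"
    # Twelve intermediate files, one per zero-padded month
    inputs = [
        f"{base_path}/intermediate/{polygon_name}_monthly/{stem}_{month:02d}.parquet"
        for month in range(1, 13)
    ]
    # Single yearly file written to the output directory
    output = f"{base_path}/output/{polygon_name}_monthly/{stem}.parquet"
    return inputs, output


inputs, output = monthly_concat_paths("data/V5GL", "county", 2020)
print(inputs[0])
print(output)
```

A helper like this can be handy for checking that all twelve intermediate files exist before triggering the concatenation step.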
Comment on lines +83 to +96
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| base_path: data/V5GL | ||
|
|
||
| dirs: | ||
| input: | ||
| raw: | ||
| yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/raw/yearly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/yearly | ||
| monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/raw/monthly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/monthly | ||
| shapefiles: | ||
| county_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/county_yearly | ||
| zcta_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/zcta_yearly | ||
|
|
||
| intermediate: | ||
| zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/intermediate/zcta_monthly | ||
| county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/intermediate/county_monthly | ||
|
|
||
| output: | ||
| zcta_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/zcta_yearly | ||
| zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/zcta_monthly | ||
| county_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/county_yearly | ||
| county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V5GL/county_monthly |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| base_path: data/V6GL | ||
|
|
||
| dirs: | ||
| input: | ||
| raw: | ||
| yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/raw/yearly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/yearly | ||
| monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/raw/monthly #/n/netscratch/dominici_lab/Lab/pm25__randall__raw/monthly | ||
| shapefiles: | ||
| county_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/county_yearly | ||
| zcta_yearly: /n/dominici_lab/lab/lego/geoboundaries/us_geoboundaries__census/us_shapefile__census/zcta_yearly | ||
| intermediate: | ||
| zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/intermediate/zcta_monthly | ||
| county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/intermediate/county_monthly | ||
| output: | ||
| zcta_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/zcta_yearly | ||
| zcta_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/zcta_monthly | ||
| county_yearly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/county_yearly | ||
| county_monthly: /n/dominici_lab/lab/lego/environmental/pm25__randall/V6GL/county_monthly |
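
One way to read the two datapaths files is that each leaf under `dirs` names a target on the cluster, and the nested keys give the location under `base_path` that the Snakefile expects (for example `{base_path}/output/{polygon_name}_{temporal_freq}`). The sketch below flattens such a mapping into planned link paths. It is an assumption about how the layout is meant to be used: `src/create_datapaths.py` is not shown in this diff, and `planned_links` is a hypothetical helper that only prints the pairs rather than creating symlinks.

```python
def planned_links(base_path, dirs, prefix=""):
    """Yield (link_path, target) pairs for every leaf of a nested `dirs` mapping."""
    for key, value in dirs.items():
        rel = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested sections such as input/raw
            yield from planned_links(base_path, value, rel)
        else:
            yield f"{base_path}/{rel}", value


# Trimmed copy of the V6GL config, written as a plain dict for illustration
lab = "/n/dominici_lab/lab/lego"
cfg = {
    "base_path": "data/V6GL",
    "dirs": {
        "input": {"raw": {"yearly": f"{lab}/environmental/pm25__randall/V6GL/raw/yearly"}},
        "output": {"county_yearly": f"{lab}/environmental/pm25__randall/V6GL/county_yearly"},
    },
}

for link, target in planned_links(cfg["base_path"], cfg["dirs"]):
    print(f"{link} -> {target}")
```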
The script path and argument structure have changed. This should be updated to:
`RUN python src/create_datapaths.py`
The old command-line arguments (`datapaths.input.satellite_pm25.yearly=null`) are no longer compatible with the new configuration structure, which uses `datapaths.base_path` and `datapaths.dirs`.