
Conversation

@audiracmichelle (Contributor)

Hey Shreya, this is my progress. Can you continue working on this branch?

@shreyanalluri self-assigned this Oct 31, 2025
@audiracmichelle (Contributor, Author)

Issues that we're trying to resolve here:
#43 #45 #44 #42
@shreyanalluri

@shreyanalluri

@audiracmichelle This is ready to review! Here is a summary of my changes:

1. Shapefile Integration and Configuration

  • Pipeline now uses cannon shapefiles instead of downloading them
  • Removed shapefile download functionality (deleted src/download_shapefile.py and removed download_shapefiles rule from Snakefile)

2. Data Path Configuration

  • Added intermediate and cannon shapefile paths

3. Pipeline Improvements

Monthly aggregation workflow:

  • Added src/concat_monthly.py to combine monthly parquet files into yearly outputs (see the sketch after this list)
  • Modified aggregate_pm25.py to save monthly outputs to intermediate/ directory instead of final output/
  • Enhanced Snakefile with proper concat_monthly rule and dependency tracking
  • Monthly files now properly concatenated: intermediate/{polygon}_monthly/*.parquet → output/{polygon}_monthly/{year}.parquet
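
A rough sketch of what that consolidation step can look like, assuming pandas parquet I/O and the pm25__randall__{polygon_name}_monthly__{year}_{month}.parquet naming used by the concat_monthly Snakemake rule; the actual src/concat_monthly.py may handle Hydra config, logging, and columns differently:

```python
# Hypothetical sketch of the monthly -> yearly consolidation; the real
# src/concat_monthly.py may differ in config handling and column logic.
from pathlib import Path

import pandas as pd


def consolidate_year(intermediate_dir, output_dir, year, polygon_name):
    """Concatenate one year's monthly parquet files into a single yearly parquet."""
    monthly_files = sorted(
        Path(intermediate_dir).glob(
            f"pm25__randall__{polygon_name}_monthly__{year}_*.parquet"
        )
    )
    if not monthly_files:
        raise FileNotFoundError(
            f"No monthly files for {polygon_name} {year} in {intermediate_dir}"
        )

    yearly_df = pd.concat(pd.read_parquet(f) for f in monthly_files)

    out_path = Path(output_dir) / f"pm25__randall__{polygon_name}_monthly__{year}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    yearly_df.to_parquet(out_path)
    return monthly_files
```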

Path structure fixes:

  • Updated Snakefile to use {polygon_name}_yearly/ directory structure
  • Updated src/aggregate_pm25.py to match new shapefile path pattern
  • Fixed download_pm25.py to flatten directory structure when extracting zip files
  • Fixed inconsistencies between shapefile input paths and aggregate rule expectations

Years coverage:

  • Extended processing range to 1998-2023

4. Code Organization and Cleanup

Reorganized utilities:

  • Removed unused global PM2.5 configuration (conf/satellite_pm25/global_pm25.yaml)
  • Removed shapefile download URLs from conf/shapefiles/shapefiles.yaml (simplified to metadata only)

Closes #42 Closes #43 Closes #44 Closes #45 Closes #47

Copilot AI left a comment

Pull request overview

This PR refactors the PM2.5 raster-to-polygon aggregation pipeline to support multiple PM2.5 dataset versions (V5GL and V6GL) and improve configuration flexibility. The changes include restructuring the configuration system, renaming from "washu" to "randall" throughout the codebase, and adding support for intermediate monthly file processing.

  • Configuration restructured to support multiple PM2.5 dataset versions with version-specific datapaths
  • Added new concat_monthly script to consolidate monthly aggregation files into yearly outputs
  • Updated file naming conventions from "pm25__washu" to "pm25__randall" and "annual" to "yearly"

Reviewed changes

Copilot reviewed 25 out of 28 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| src/download_shapefile.py | Updated to use new configuration base_path structure |
| src/download_pm25.py | Changed download directory path to use base_path; improved file handling with directory flattening |
| src/create_datapaths.py | Added init_folder function to initialize base data paths; fixed spelling error |
| src/concat_monthly.py | New script to consolidate monthly parquet files into yearly files |
| src/aggregate_pm25.py | Updated shapefile loading to use new config structure; separated monthly output to intermediate directory |
| notes/eda_output.ipynb | Updated file paths to use cfg.datapaths.base_path (contains errors) |
| notes/eda_input.ipynb | Updated file paths to use cfg.datapaths.base_path |
| jobs/*.sbatch | New SLURM job templates for different polygon/frequency combinations |
| fasrc_jobs/*.sbatch | Removed old FASRC-specific job files |
| environment.yaml | Renamed environment to pm25_randall; removed unused torch dependencies |
| conf/shapefiles/shapefiles.yaml | Restructured from nested year dictionaries to flat structure with years list and url_map |
| conf/satellite_pm25/*.yaml | Added configuration files for multiple PM2.5 dataset versions |
| conf/datapaths/*.yaml | Restructured to include base_path and nested dirs structure for multiple versions |
| conf/config.yaml | Updated default configuration to use V5GL dataset and new datapaths structure |
| Snakefile | Updated to use new configuration structure; added concat_monthly rule for monthly processing |
| README.md | Comprehensive documentation updates reflecting new structure and workflow |
| Dockerfile | Updated commands to use "yearly" instead of "annual" terminology |
| .gitignore | Added data/ directory to ignore list |
Comments suppressed due to low confidence (2)

src/create_datapaths.py:23

  • Missing documentation for the new init_folder function. Consider adding a docstring that explains:
      • the purpose of this function
      • the parameters (folder_cfg structure and expected fields)
      • the relationship with create_subfolders_and_links
      • the return value (if any)
    (A sample docstring is sketched below.)

src/create_datapaths.py:30

  • Fixed spelling: corrected "does not exists" to "does not exist".
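
For reference, one possible shape for such a docstring (a hypothetical sketch; the real folder_cfg fields and behavior should be taken from src/create_datapaths.py and conf/datapaths/*.yaml):

```python
def init_folder(folder_cfg):
    """Initialize the base data directory described by the datapaths config.

    Hypothetical docstring sketch; adjust to the actual signature and fields.

    Args:
        folder_cfg: Hydra/OmegaConf node, assumed to carry at least
            ``base_path`` and an optional nested ``dirs`` mapping of
            subdirectories (per conf/datapaths/*.yaml).

    Notes:
        Creates ``base_path`` if it does not exist; subfolders and symlinks
        are then handled by ``create_subfolders_and_links``.

    Returns:
        None.
    """
```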



#singularity exec $HOME/singularity_images/satellite_pm25_raster2polygon_latest.sif snakemake --cores 16 -C polygon_name=zcta temporal_freq=monthly

snakemake --cores 32 -C polygon_name=zcta temporal_freq=yearly
Copilot AI Nov 21, 2025

The sbatch job is configured to run temporal_freq=yearly, but the filename and comments suggest this should be for monthly processing. The command should likely be:

snakemake --cores 32 -C polygon_name=zcta temporal_freq=monthly
Suggested change
snakemake --cores 32 -C polygon_name=zcta temporal_freq=yearly
snakemake --cores 32 -C polygon_name=zcta temporal_freq=monthly

# == setup chrome driver
# Expand the tilde to the user's home directory
download_dir = f"data/input/pm25__washu__raw/"
download_dir = f"{cfg.datapaths.base_path}/input/raw/"
Copilot AI Nov 21, 2025

The download directory path is incomplete. Based on the pattern used elsewhere in the codebase (e.g., line 30), this should include the temporal frequency subdirectory. Consider:

download_dir = f"{cfg.datapaths.base_path}/input/raw/{cfg.temporal_freq}/"

Otherwise, all temporal frequencies will download to the same directory, potentially causing file conflicts.

Suggested change
download_dir = f"{cfg.datapaths.base_path}/input/raw/"
download_dir = f"{cfg.datapaths.base_path}/input/raw/{cfg.temporal_freq}/"


@hydra.main(config_path="../conf", config_name="config", version_base=None)
def main(cfg):
url = cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url
Copilot AI Nov 21, 2025

The shapefile configuration structure has changed from nested year dictionaries to a flat structure with years, idvar, prefix, and url_map fields. This line tries to access cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url but should now access cfg.shapefiles[cfg.polygon_name].url_map[cfg.shapefile_year] to match the new configuration structure defined in conf/shapefiles/shapefiles.yaml.

Suggested change
url = cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url
url = cfg.shapefiles[cfg.polygon_name].url_map[cfg.shapefile_year]


# Create paths to data placeholders
RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.annual=null datapaths.input.satellite_pm25.monthly=null
RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.yearly=null datapaths.input.satellite_pm25.monthly=null
Copilot AI Nov 21, 2025

The script path and argument structure have changed. This should be updated to:

RUN python src/create_datapaths.py

The old command line arguments (datapaths.input.satellite_pm25.yearly=null) are no longer compatible with the new configuration structure that uses datapaths.base_path and datapaths.dirs.

Suggested change
RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.yearly=null datapaths.input.satellite_pm25.monthly=null
RUN python src/create_datapaths.py

* `temporal_freq` which determines whether the original annual or monthly pm25 files will be aggregated. The options are: `annual` and `monthly`.
* `polygon_name` which determines into which polygons the pm25 grid will the aggregated. The options are: `zcta` and `county`.
* `temporal_freq` which determines whether the original yearly or monthly pm25 files will be aggregated. The options are: `yearly` and `monthly`.
* `polygon_name` which determines into which polygons the pm25 grid will be aggregated. The options are: `zcta`, `county`, and `census_tract`.
Copilot AI Nov 21, 2025

Typo: "will the aggregated" should be "will be aggregated".

url = cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url

tgt = f"data/input/shapefiles/shapefile_{cfg.polygon_name}_{cfg.shapefile_year}"
tgt = f"{cfg.datapaths.base_path}/input/shapefiles/shapefile_{cfg.polygon_name}_{cfg.shapefile_year}"
Copilot AI Nov 21, 2025

The output path pattern doesn't match the expected shapefile structure used elsewhere in the codebase. According to src/aggregate_pm25.py:44 and Snakefile:58, shapefiles should be stored as {base_path}/input/shapefiles/{polygon_name}_yearly/{shapefile_name}/{shapefile_name}.shp. This line creates a path like shapefile_{polygon_name}_{shapefile_year} which is inconsistent with the new naming convention using prefixes.
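
As a rough illustration of making both sides agree, the download target could be built from the same pattern; the prefix field and the {prefix}_{year} naming below are assumptions to verify against shapefiles.yaml and aggregate_pm25.py:

```python
def shapefile_target(cfg):
    """Hypothetical helper: build the .shp path in the layout expected by
    aggregate_pm25.py and the Snakefile:
    {base_path}/input/shapefiles/{polygon_name}_yearly/{name}/{name}.shp
    """
    prefix = cfg.shapefiles[cfg.polygon_name].prefix  # field assumed from the new flat config
    name = f"{prefix}_{cfg.shapefile_year}"           # naming scheme assumed
    return (
        f"{cfg.datapaths.base_path}/input/shapefiles/"
        f"{cfg.polygon_name}_yearly/{name}/{name}.shp"
    )
```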

"\n",
"# Read parquet file with pm25 at county level for 2015\n",
"pm25_data = pq.read_table(\"data/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet\").to_pandas()"
"pm25_data = pq.read_table(f\"{cfg.datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet\").to_pandas()"
Copilot AI Nov 21, 2025

This file path contains duplicated datapaths.base_path} strings. It should be:

pm25_data = pq.read_table(f"{cfg.datapaths.base_path}/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet").to_pandas()
Suggested change
"pm25_data = pq.read_table(f\"{cfg.datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet\").to_pandas()"
"pm25_data = pq.read_table(f\"{cfg.datapaths.base_path}/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet\").to_pandas()"

"\n",
"# Read the shapefile\n",
"shapefile = gpd.read_file('data/shapefile_cb_county_2015/shapefile.shp')\n",
"shapefile = gpd.read_file('{cfg.datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/shapefile_cb_county_2015/shapefile.shp')\n",
Copilot AI Nov 21, 2025

This file path is malformed with duplicated datapaths.base_path} and incorrect f-string formatting (missing f prefix). It should be:

shapefile = gpd.read_file(f'{cfg.datapaths.base_path}/shapefile_cb_county_2015/shapefile.shp')
Suggested change
"shapefile = gpd.read_file('{cfg.datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/shapefile_cb_county_2015/shapefile.shp')\n",
"shapefile = gpd.read_file(f'{cfg.datapaths.base_path}/shapefile_cb_county_2015/shapefile.shp')\n",

import os
import hydra
import logging
from pathlib import Path
Copilot AI Nov 21, 2025

Import of 'Path' is not used.

Suggested change
from pathlib import Path

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 29 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (2)

environment.yaml:21

  • [nitpick] The torch, torchaudio, and torchvision dependencies were removed. Verify that these packages are not required by any code in the repository. If they were removed because they're unused, this is good cleanup. However, if any scripts depend on these packages, this will cause import errors.

src/create_datapaths.py:30

  • The spelling error "does not exists" should be "does not exist".


Comment on lines +83 to +96
rule concat_monthly:
    # This rule is only needed when temporal_freq is 'monthly' to create yearly files
    # Combines monthly parquet files from intermediate directory into a single yearly parquet file
    input:
        lambda wildcards: expand(
            f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + wildcards.year + "_{month}.parquet",
            month=[str(i).zfill(2) for i in range(1, 12 + 1)]
        )
    output:
        yearly_file=f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet"
    log:
        f"logs/concat_monthly_{polygon_name}_" + "{year}.log"
    shell:
        f"PYTHONPATH=. python src/concat_monthly.py polygon_name={polygon_name} " + "year={wildcards.year} &> {log}"
Copilot AI Nov 21, 2025

The concat_monthly rule is defined but never called when temporal_freq='monthly'. Looking at the rule all input, when temporal_freq is monthly, it expects files in the output directory, but aggregate_pm25 saves to the intermediate directory. The concat_monthly rule should be a dependency to create these output files, but it's not connected to the workflow. Consider updating rule all to ensure concat_monthly outputs are the target when temporal_freq is monthly.
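
One way to close that gap is for rule all to select its targets based on temporal_freq, for example via a plain-Python helper like the sketch below (the yearly filename pattern is assumed by analogy with the monthly rule above; this is illustrative, not the project's actual Snakefile):

```python
# Illustrative helper a Snakefile could call from `rule all: input:` so that
# temporal_freq='monthly' targets the yearly files produced by concat_monthly.
def final_targets(base_path, polygon_name, temporal_freq, years):
    if temporal_freq == "monthly":
        return [
            f"{base_path}/output/{polygon_name}_monthly/"
            f"pm25__randall__{polygon_name}_monthly__{year}.parquet"
            for year in years
        ]
    # yearly aggregation (pattern assumed) writes directly to output/
    return [
        f"{base_path}/output/{polygon_name}_yearly/"
        f"pm25__randall__{polygon_name}_yearly__{year}.parquet"
        for year in years
    ]
```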

    monthly_files = consolidate_year(intermediate_dir, monthly_dir, year, polygon_name)
    LOGGER.info(f"Successfully consolidated {len(monthly_files)} monthly files for year {year}")

except Exception as e:
Copilot AI Nov 21, 2025

[nitpick] The error handling in the try-except block catches all exceptions with a generic Exception, which could mask unexpected errors. Consider catching specific expected exceptions (e.g., FileNotFoundError, pd.errors.ParserError) and letting unexpected exceptions propagate with their full traceback for better debugging.

Suggested change
except Exception as e:
except (FileNotFoundError, pd.errors.ParserError) as e:

Comment on lines +83 to +88
for root, dirs, files in os.walk(src_dir):
    for file in files:
        src_file = os.path.join(root, file)
        dest_file = os.path.join(dest_dir, file)
        shutil.move(src_file, dest_file)
        logger.info(f"Moved {file} to {dest_dir}")
Copilot AI Nov 21, 2025

The file flattening logic using os.walk() could overwrite files if there are duplicate filenames in different subdirectories. The destination path dest_file is based only on the filename without preserving subdirectory structure. Consider checking for file name conflicts before moving, or add a warning/error if duplicate filenames are detected.

Suggested change
for root, dirs, files in os.walk(src_dir):
    for file in files:
        src_file = os.path.join(root, file)
        dest_file = os.path.join(dest_dir, file)
        shutil.move(src_file, dest_file)
        logger.info(f"Moved {file} to {dest_dir}")
# Check for duplicate filenames before moving
from collections import defaultdict
filename_map = defaultdict(list)
for root, dirs, files in os.walk(src_dir):
    for file in files:
        filename_map[file].append(os.path.join(root, file))
duplicates = [fname for fname, paths in filename_map.items() if len(paths) > 1]
if duplicates:
    logger.error(f"Duplicate filenames detected in the extracted archive: {duplicates}. Aborting move to prevent overwriting files.")
    raise RuntimeError(f"Duplicate filenames detected: {duplicates}")
# No duplicates, safe to move files
for file, paths in filename_map.items():
    src_file = paths[0]
    dest_file = os.path.join(dest_dir, file)
    shutil.move(src_file, dest_file)
    logger.info(f"Moved {file} to {dest_dir}")

shutil.rmtree(src_dir)

# Remove anyfile starging with Unconfirmed (this might be a Chrome bug/artifact)
# Remove any file starting with Unconfirmed (this might be a Chrome bug/artifact)
Copilot AI Nov 21, 2025

The comment mentions "starging" which is a typo. This comment was updated but the typo was not fixed.


# Sort by polygon ID and month for consistent ordering
if 'month' in yearly_df.columns:
    yearly_df = yearly_df.sort_values(['month'])
Copilot AI Nov 21, 2025

[nitpick] The function only sorts by 'month' column if it exists, but doesn't include the polygon ID in the sort. For consistent ordering and better data organization, consider sorting by both polygon ID and month:

if 'month' in yearly_df.columns:
    yearly_df = yearly_df.sort_values([yearly_df.index.name, 'month'])

This ensures the data is ordered first by polygon, then by month within each polygon.

Suggested change
yearly_df = yearly_df.sort_values(['month'])
# Sort by polygon ID (index) and month for consistent ordering
yearly_df = yearly_df.reset_index().sort_values([yearly_df.index.name, 'month']).set_index(yearly_df.index.name)

- 2000
- 2010
- 2020
idvar: "zcta"
Copilot AI Nov 21, 2025

The idvar for zcta was changed from the specific year-based column names (e.g., ZCTA5CE10, ZCTA5CE20) to a generic "zcta". This is likely incorrect - ZCTA shapefiles have different column names based on the census year (ZCTA5, ZCTA5CE10, ZCTA5CE20). This would cause the aggregation to fail if the column doesn't exist in the shapefile.

Suggested change
idvar: "zcta"
idvar:
  2000: "ZCTA5"
  2010: "ZCTA5CE10"
  2020: "ZCTA5CE20"

- 2020
- 2021
- 2022
- 2023
Copilot AI Nov 21, 2025

Year 2023 was added to the county years list, but there is no corresponding configuration for accessing/downloading the 2023 shapefile. The old configuration had URL mappings for each year. Verify that the shapefile for 2023 is available and accessible through the new configuration structure.
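
A small sanity check along these lines could surface the gap before a run; the years and url_map field names follow the restructured shapefiles.yaml described in the review summary, and whether url_map is keyed by integer or string years is an assumption:

```python
# Illustrative check: every configured shapefile year should have a download URL.
def check_url_map(shapefiles_cfg, polygon_name):
    poly = shapefiles_cfg[polygon_name]
    missing = [y for y in poly.years if y not in poly.url_map]
    if missing:
        raise KeyError(f"No url_map entry for {polygon_name} years: {missing}")
```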

- 2021
- 2022
- 2023
idvar: "county"
Copilot AI Nov 21, 2025

The idvar for county was changed from "GEOID" to "county". This is likely incorrect - Census county shapefiles typically use GEOID as the identifier column. Verify that the actual shapefile column name matches this configuration, as this would cause the aggregation to fail if the column doesn't exist.

Suggested change
idvar: "county"
idvar: "GEOID"
