Audiracmichelle/issue42 #46
Conversation
Commits:
- Issues that we're trying to resolve here
- … rule all to reflect output structure of concat_monthly.py
@audiracmichelle This is ready to review! Here is a summary of my changes:
1. Shapefile integration and configuration
2. Data path configuration
3. Pipeline improvements
   - Monthly aggregation workflow
   - Path structure fixes
   - Years coverage
4. Code organization and cleanup
   - Reorganized utilities
Pull request overview
This PR refactors the PM2.5 raster-to-polygon aggregation pipeline to support multiple PM2.5 dataset versions (V5GL and V6GL) and improve configuration flexibility. The changes include restructuring the configuration system, renaming from "washu" to "randall" throughout the codebase, and adding support for intermediate monthly file processing.
- Configuration restructured to support multiple PM2.5 dataset versions with version-specific datapaths
- Added new concat_monthly script to consolidate monthly aggregation files into yearly outputs
- Updated file naming conventions from "pm25__washu" to "pm25__randall" and "annual" to "yearly"
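The renaming described above implies a consistent output naming scheme. As an illustration only (this helper does not exist in the repository; the pattern is inferred from the Snakefile paths quoted later in this review), the new convention can be sketched as:

```python
def output_name(polygon_name, temporal_freq, year, month=None):
    """Build an output filename under the renamed convention.

    Illustrative sketch: the pm25__randall__{polygon}_{freq}__{year}
    pattern is inferred from the Snakefile paths shown in this PR; this
    helper itself is not part of the repository.
    """
    stem = f"pm25__randall__{polygon_name}_{temporal_freq}__{year}"
    if month is not None:
        # Monthly files carry a zero-padded month suffix, e.g. _01
        stem += f"_{int(month):02d}"
    return stem + ".parquet"
```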
Reviewed changes
Copilot reviewed 25 out of 28 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| src/download_shapefile.py | Updated to use new configuration base_path structure |
| src/download_pm25.py | Changed download directory path to use base_path; improved file handling with directory flattening |
| src/create_datapaths.py | Added init_folder function to initialize base data paths; fixed spelling error |
| src/concat_monthly.py | New script to consolidate monthly parquet files into yearly files |
| src/aggregate_pm25.py | Updated shapefile loading to use new config structure; separated monthly output to intermediate directory |
| notes/eda_output.ipynb | Updated file paths to use cfg.datapaths.base_path (contains errors) |
| notes/eda_input.ipynb | Updated file paths to use cfg.datapaths.base_path |
| jobs/*.sbatch | New SLURM job templates for different polygon/frequency combinations |
| fasrc_jobs/*.sbatch | Removed old FASRC-specific job files |
| environment.yaml | Renamed environment to pm25_randall; removed unused torch dependencies |
| conf/shapefiles/shapefiles.yaml | Restructured from nested year dictionaries to flat structure with years list and url_map |
| conf/satellite_pm25/*.yaml | Added configuration files for multiple PM2.5 dataset versions |
| conf/datapaths/*.yaml | Restructured to include base_path and nested dirs structure for multiple versions |
| conf/config.yaml | Updated default configuration to use V5GL dataset and new datapaths structure |
| Snakefile | Updated to use new configuration structure; added concat_monthly rule for monthly processing |
| README.md | Comprehensive documentation updates reflecting new structure and workflow |
| Dockerfile | Updated commands to use "yearly" instead of "annual" terminology |
| .gitignore | Added data/ directory to ignore list |
Comments suppressed due to low confidence (2)

src/create_datapaths.py:23
- Missing documentation for the new `init_folder` function. Consider adding a docstring that explains:
  - The purpose of this function
  - The parameters (`folder_cfg` structure and expected fields)
  - The relationship with `create_subfolders_and_links`
  - Return value (if any)

src/create_datapaths.py:30
- Fixed spelling: corrected "does not exists" to "does not exist".
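Since the diff does not show `init_folder`'s actual signature, a docstring along the requested lines might look like the following sketch (the `folder_cfg` fields `base_path` and `dirs`, and the return value, are assumptions based on the new config structure described in this PR):

```python
import os

def init_folder(folder_cfg):
    """Create the base data directory and its subdirectories.

    Args:
        folder_cfg: mapping with a ``base_path`` string and an optional
            ``dirs`` list of subdirectory names to create under it.
            (Field names are assumptions for this sketch.)

    Returns:
        List of directory paths that were newly created.
    """
    created = []
    base = folder_cfg["base_path"]
    # Create the base path first, then each configured subdirectory
    for sub in [""] + list(folder_cfg.get("dirs", [])):
        path = os.path.join(base, sub) if sub else base
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```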
```diff
  #singularity exec $HOME/singularity_images/satellite_pm25_raster2polygon_latest.sif snakemake --cores 16 -C polygon_name=zcta temporal_freq=monthly

  snakemake --cores 32 -C polygon_name=zcta temporal_freq=yearly
```
Copilot (AI) · Nov 21, 2025
The sbatch job is configured to run temporal_freq=yearly, but the filename and comments suggest this should be for monthly processing. The command should likely be:

```diff
- snakemake --cores 32 -C polygon_name=zcta temporal_freq=yearly
+ snakemake --cores 32 -C polygon_name=zcta temporal_freq=monthly
```
```diff
  # == setup chrome driver
  # Expand the tilde to the user's home directory
- download_dir = f"data/input/pm25__washu__raw/"
+ download_dir = f"{cfg.datapaths.base_path}/input/raw/"
```
The download directory path is incomplete. Based on the pattern used elsewhere in the codebase (e.g., line 30), this should include the temporal frequency subdirectory. Consider:

```python
download_dir = f"{cfg.datapaths.base_path}/input/raw/{cfg.temporal_freq}/"
```

Otherwise, all temporal frequencies will download to the same directory, potentially causing file conflicts.

```diff
- download_dir = f"{cfg.datapaths.base_path}/input/raw/"
+ download_dir = f"{cfg.datapaths.base_path}/input/raw/{cfg.temporal_freq}/"
```
src/download_shapefile.py (outdated)

```python
@hydra.main(config_path="../conf", config_name="config", version_base=None)
def main(cfg):
    url = cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url
```
The shapefile configuration structure has changed from nested year dictionaries to a flat structure with years, idvar, prefix, and url_map fields. This line tries to access `cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url`, but it should now access `cfg.shapefiles[cfg.polygon_name].url_map[cfg.shapefile_year]` to match the new configuration structure defined in conf/shapefiles/shapefiles.yaml.

```diff
- url = cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url
+ url = cfg.shapefiles[cfg.polygon_name].url_map[cfg.shapefile_year]
```
```diff
  # Create paths to data placeholders
- RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.annual=null datapaths.input.satellite_pm25.monthly=null
+ RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.yearly=null datapaths.input.satellite_pm25.monthly=null
```
The script path and argument structure have changed. This should be updated to:

```dockerfile
RUN python src/create_datapaths.py
```

The old command line arguments (`datapaths.input.satellite_pm25.yearly=null`) are no longer compatible with the new configuration structure that uses `datapaths.base_path` and `datapaths.dirs`.

```diff
- RUN python utils/create_dir_paths.py datapaths.input.satellite_pm25.yearly=null datapaths.input.satellite_pm25.monthly=null
+ RUN python src/create_datapaths.py
```
```diff
- * `temporal_freq` which determines whether the original annual or monthly pm25 files will be aggregated. The options are: `annual` and `monthly`.
- * `polygon_name` which determines into which polygons the pm25 grid will the aggregated. The options are: `zcta` and `county`.
+ * `temporal_freq` which determines whether the original yearly or monthly pm25 files will be aggregated. The options are: `yearly` and `monthly`.
+ * `polygon_name` which determines into which polygons the pm25 grid will be aggregated. The options are: `zcta`, `county`, and `census_tract`.
```
Typo: "will the aggregated" should be "will be aggregated".
src/download_shapefile.py (outdated)

```diff
  url = cfg.shapefiles[cfg.polygon_name][cfg.shapefile_year].url

- tgt = f"data/input/shapefiles/shapefile_{cfg.polygon_name}_{cfg.shapefile_year}"
+ tgt = f"{cfg.datapaths.base_path}/input/shapefiles/shapefile_{cfg.polygon_name}_{cfg.shapefile_year}"
```
The output path pattern doesn't match the expected shapefile structure used elsewhere in the codebase. According to src/aggregate_pm25.py:44 and Snakefile:58, shapefiles should be stored as {base_path}/input/shapefiles/{polygon_name}_yearly/{shapefile_name}/{shapefile_name}.shp. This line creates a path like shapefile_{polygon_name}_{shapefile_year} which is inconsistent with the new naming convention using prefixes.
```diff
  "\n",
  "# Read parquet file with pm25 at county level for 2015\n",
- "pm25_data = pq.read_table(\"data/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet\").to_pandas()"
+ "pm25_data = pq.read_table(f\"{cfg.datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet\").to_pandas()"
```
This file path contains duplicated `datapaths.base_path}` strings. It should be:

```python
pm25_data = pq.read_table(f"{cfg.datapaths.base_path}/output/satellite_pm25_raster2polygon/monthly/satellite_pm25_zcta_2015_01.parquet").to_pandas()
```
```diff
  "\n",
  "# Read the shapefile\n",
- "shapefile = gpd.read_file('data/shapefile_cb_county_2015/shapefile.shp')\n",
+ "shapefile = gpd.read_file('{cfg.datapaths.base_path}/datapaths.base_path}/datapaths.base_path}/shapefile_cb_county_2015/shapefile.shp')\n",
```
This file path is malformed with duplicated `datapaths.base_path}` and incorrect f-string formatting (missing `f` prefix). It should be:

```python
shapefile = gpd.read_file(f'{cfg.datapaths.base_path}/shapefile_cb_county_2015/shapefile.shp')
```
```python
import os
import hydra
import logging
from pathlib import Path
```
Import of `Path` is not used.

```diff
- from pathlib import Path
```
Pull request overview
Copilot reviewed 26 out of 29 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (2)
environment.yaml:21
- [nitpick] The torch, torchaudio, and torchvision dependencies were removed. Verify that these packages are not required by any code in the repository. If they were removed because they're unused, this is good cleanup. However, if any scripts depend on these packages, this will cause import errors.

src/create_datapaths.py:30
- The spelling error "does not exists" should be "does not exist".
```python
rule concat_monthly:
    # This rule is only needed when temporal_freq is 'monthly' to create yearly files
    # Combines monthly parquet files from intermediate directory into a single yearly parquet file
    input:
        lambda wildcards: expand(
            f"{cfg.datapaths.base_path}/intermediate/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + wildcards.year + "_{month}.parquet",
            month=[str(i).zfill(2) for i in range(1, 12 + 1)]
        )
    output:
        yearly_file=f"{cfg.datapaths.base_path}/output/{polygon_name}_monthly/pm25__randall__{polygon_name}_monthly__" + "{year}.parquet"
    log:
        f"logs/concat_monthly_{polygon_name}_" + "{year}.log"
    shell:
        f"PYTHONPATH=. python src/concat_monthly.py polygon_name={polygon_name} " + "year={wildcards.year} &> {log}"
```
The concat_monthly rule is defined but never called when temporal_freq='monthly'. Looking at the rule all input, when temporal_freq is monthly, it expects files in the output directory, but aggregate_pm25 saves to the intermediate directory. The concat_monthly rule should be a dependency to create these output files, but it's not connected to the workflow. Consider updating rule all to ensure concat_monthly outputs are the target when temporal_freq is monthly.
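One way to wire this up (a sketch only; the repository's actual `rule all` is not shown in this diff, and the helper name is hypothetical) is to compute the requested targets from the consolidated yearly outputs, so Snakemake pulls concat_monthly into the DAG whenever those files are missing:

```python
def final_targets(base_path, polygon_name, temporal_freq, years):
    """Return the files that ``rule all`` should request.

    Requesting the yearly consolidated outputs (rather than the
    intermediate monthly files) makes Snakemake schedule concat_monthly
    when temporal_freq is 'monthly'. Path pattern mirrors the
    concat_monthly output shown above.
    """
    out_dir = f"{base_path}/output/{polygon_name}_{temporal_freq}"
    return [
        f"{out_dir}/pm25__randall__{polygon_name}_{temporal_freq}__{year}.parquet"
        for year in years
    ]
```

`rule all` could then declare `input: final_targets(cfg.datapaths.base_path, polygon_name, cfg.temporal_freq, years)`, assuming those config values are in scope in the Snakefile.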
```python
        monthly_files = consolidate_year(intermediate_dir, monthly_dir, year, polygon_name)
        LOGGER.info(f"Successfully consolidated {len(monthly_files)} monthly files for year {year}")

    except Exception as e:
```
[nitpick] The error handling in the try-except block catches all exceptions with a generic `Exception`, which could mask unexpected errors. Consider catching specific expected exceptions (e.g., `FileNotFoundError`, `pd.errors.ParserError`) and letting unexpected exceptions propagate with their full traceback for better debugging.

```diff
- except Exception as e:
+ except (FileNotFoundError, pd.errors.ParserError) as e:
```
```python
for root, dirs, files in os.walk(src_dir):
    for file in files:
        src_file = os.path.join(root, file)
        dest_file = os.path.join(dest_dir, file)
        shutil.move(src_file, dest_file)
        logger.info(f"Moved {file} to {dest_dir}")
```
The file flattening logic using os.walk() could overwrite files if there are duplicate filenames in different subdirectories. The destination path `dest_file` is based only on the filename without preserving subdirectory structure. Consider checking for file name conflicts before moving, or add a warning/error if duplicate filenames are detected:

```python
# Check for duplicate filenames before moving
from collections import defaultdict

filename_map = defaultdict(list)
for root, dirs, files in os.walk(src_dir):
    for file in files:
        filename_map[file].append(os.path.join(root, file))

duplicates = [fname for fname, paths in filename_map.items() if len(paths) > 1]
if duplicates:
    logger.error(f"Duplicate filenames detected in the extracted archive: {duplicates}. Aborting move to prevent overwriting files.")
    raise RuntimeError(f"Duplicate filenames detected: {duplicates}")

# No duplicates, safe to move files
for file, paths in filename_map.items():
    src_file = paths[0]
    dest_file = os.path.join(dest_dir, file)
    shutil.move(src_file, dest_file)
    logger.info(f"Moved {file} to {dest_dir}")
```
```diff
  shutil.rmtree(src_dir)

- # Remove anyfile starging with Unconfirmed (this might be a Chrome bug/artifact)
+ # Remove any file starting with Unconfirmed (this might be a Chrome bug/artifact)
```
The comment mentions "starging" which is a typo. This comment was updated but the typo was not fixed.
```python
# Sort by polygon ID and month for consistent ordering
if 'month' in yearly_df.columns:
    yearly_df = yearly_df.sort_values(['month'])
```
[nitpick] The function only sorts by the 'month' column if it exists, but doesn't include the polygon ID in the sort. For consistent ordering and better data organization, consider sorting by both polygon ID and month:

```python
if 'month' in yearly_df.columns:
    yearly_df = yearly_df.sort_values([yearly_df.index.name, 'month'])
```

This ensures the data is ordered first by polygon, then by month within each polygon.

```diff
- yearly_df = yearly_df.sort_values(['month'])
+ # Sort by polygon ID (index) and month for consistent ordering
+ yearly_df = yearly_df.reset_index().sort_values([yearly_df.index.name, 'month']).set_index(yearly_df.index.name)
```
```yaml
    - 2000
    - 2010
    - 2020
  idvar: "zcta"
```
The idvar for zcta was changed from the specific year-based column names (e.g., ZCTA5CE10, ZCTA5CE20) to a generic "zcta". This is likely incorrect - ZCTA shapefiles have different column names based on the census year (ZCTA5, ZCTA5CE10, ZCTA5CE20). This would cause the aggregation to fail if the column doesn't exist in the shapefile.

```diff
- idvar: "zcta"
+ idvar:
+   2000: "ZCTA5"
+   2010: "ZCTA5CE10"
+   2020: "ZCTA5CE20"
```
```yaml
    - 2020
    - 2021
    - 2022
    - 2023
```
Year 2023 was added to the county years list, but there is no corresponding configuration for accessing/downloading the 2023 shapefile. The old configuration had URL mappings for each year. Verify that the shapefile for 2023 is available and accessible through the new configuration structure.
```yaml
    - 2021
    - 2022
    - 2023
  idvar: "county"
```
The idvar for county was changed from "GEOID" to "county". This is likely incorrect - Census county shapefiles typically use GEOID as the identifier column. Verify that the actual shapefile column name matches this configuration, as this would cause the aggregation to fail if the column doesn't exist.

```diff
- idvar: "county"
+ idvar: "GEOID"
```
Hey Shreya, this is my progress. Can you continue working on this branch?