-
Notifications
You must be signed in to change notification settings - Fork 0
Feature: output graph data as JSON #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add ability to export complete networkx graph data structure to file formats that other programs can consume programmatically. New Features: - Add --format option to choose between 'html' (default) and 'graph' output - Add --graph-format option to select serialization (json/pickle/graphml) - Support for JSON export (human-readable, widely compatible) - Support for Pickle export (Python-native, exact data preservation) - Support for GraphML export (standard graph format for analysis tools) New Files: - diffgraph/graph_export.py: Core export/import functionality - test_graph_export.py: Comprehensive test suite - example_usage.py: Example script for using exported data - test_cli_manual.sh: Automated testing script - GRAPH_EXPORT_FEATURE.md: Feature documentation - TESTING_GUIDE.md: Testing instructions Modified Files: - diffgraph/cli.py: Added new CLI options and conditional output logic - diffgraph/graph_manager.py: Added export_to_dict() method - README.md: Updated documentation with new features - CHANGELOG.md: Added v1.1.0 release notes - setup.py: Bumped version to 1.1.0 - diffgraph/__init__.py: Updated version to 1.1.0 Technical Details: - Exported data includes file nodes, component nodes, graphs, and metadata - All formats support round-trip (export → load) with data integrity - NetworkX graphs serialized using node-link format for compatibility - Backward compatible: existing HTML functionality unchanged Testing: - All unit tests pass (test_graph_export.py) - Automated test suite validates all formats - Example usage script demonstrates practical use cases
Add comprehensive design documents for new integration-friendly JSON output format to docs/planning/ folder to keep root directory clean. Documents: - STRUCTURED_OUTPUT_DESIGN.md: Complete schema specification - File categorization (source/docs/config/auto-generated) - Rich component metadata (complexity, impact radius) - External dependency nodes (APIs, databases, services) - Comprehensive relationship types (REST, RPC, pub/sub) - Cross-references between docs and code - Phase-based implementation strategy - PHASE1_IMPLEMENTATION_NOTES.md: Implementation guide - Implementation decisions and rationale - Phase 1 scope (basic restructuring with existing data) - Phase 2/3 future enhancements - Testing strategy and success criteria - Next session pickup instructions These documents preserve design rationale and guide future implementation without cluttering the codebase root.
Add integration-friendly structured JSON format optimized for VSCode extensions, web UIs, and other tool integrations. This is Phase 1 using existing analysis data without requiring new AI analysis. New Features: - Structured JSON output with clean categorization and rich metadata - Automatic file classification into 4 categories: * auto_generated: Lock files, build artifacts (excluded from review) * documentation: Docs with potential cross-references to code * configuration: Config files with change tracking * source_code: Source files with full dependency graphs - Git diff stats (additions/deletions) extracted per file - Impact radius calculation from dependency graph - Complete graph structure (all edge targets exist as nodes) - Pattern-based classification with 40+ common file patterns New Files: - diffgraph/structured_export.py: Core transformation logic (400+ lines) * File categorization and classification * Component and edge transformation * Git stats extraction * Structured format generation - test_structured_export.py: Comprehensive test suite * File classification tests * Component transformation tests * Full export validation * Graph completeness tests * Empty graph handling Modified Files: - diffgraph/cli.py: Route JSON format to structured export - README.md: Document structured format with examples - CHANGELOG.md: Added v1.2.0 release notes - setup.py: Bumped version to 1.2.0 - diffgraph/__init__.py: Updated version to 1.2.0 Breaking Changes: - JSON format now outputs structured format (v2.0 schema) - NetworkX format still available via pickle/graphml Technical Details: - Phase 1: Uses existing data, leaves advanced fields as null - Relationship types: imports, calls, extends, implements - Change types: added, deleted, modified, unchanged - Clean separation of files graph and components graph - All tests pass (5/5 test suites) Phase 2 Planned: - Full codebase impact analysis - External dependency nodes (APIs, databases) - Advanced relationships (REST, RPC, pub/sub) - Complexity metrics and line numbers - Cross-reference detection See docs/planning/STRUCTURED_OUTPUT_DESIGN.md for complete specification.
…id', but just 'parent'. refactor: Clean up whitespace in structured_export.py This commit removes unnecessary whitespace throughout the structured_export.py file, improving code readability without altering functionality. The changes include adjustments to spacing in function definitions, docstrings, and return statements, ensuring a more consistent coding style.
WalkthroughThis PR introduces comprehensive graph export capabilities for DiffGraph, adding support for multiple serialization formats (JSON, Pickle, GraphML), a new structured JSON output format for integration with external tools, enhanced CLI options for format selection, and a complete test suite with documentation and planning guides. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI
participant GraphManager
participant Exporter as graph_export/<br/>structured_export
participant FileSystem
User->>CLI: diffgraph --format graph --graph-format json
CLI->>GraphManager: access graph data
alt Structured JSON
CLI->>Exporter: export_structured_json()
Exporter->>GraphManager: query nodes/edges
Exporter->>Exporter: classify files, transform nodes
Exporter->>FileSystem: write JSON
else Standard Export
CLI->>Exporter: export_graph_to_json()
Exporter->>GraphManager: export_to_dict()
Exporter->>FileSystem: write JSON
end
FileSystem-->>User: export file path
Note over Exporter: Supports JSON, Pickle, GraphML<br/>with load_graph functions for reconstruction
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Possibly related PRs
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
♻️ Duplicate comments (1)
diffgraph/graph_export.py (1)
109-166: Code duplication noted.This function duplicates the logic in
GraphManager.export_to_dict()(lines 305-360 in graph_manager.py). See the comment on that file for refactoring suggestions.
🧹 Nitpick comments (6)
CHANGELOG.md (1)
8-36: Well-documented release notes!The changelog clearly documents the Phase 1 structured JSON output format and breaking changes. The phased approach and technical details provide good context for users.
Optional style improvement: Consider hyphenating "backwards-compatible" on line 23 per standard English compound adjective usage.
- - **JSON format now uses structured output by default** (breaking change for JSON, but backwards compatible overall) + - **JSON format now uses structured output by default** (breaking change for JSON, but backwards-compatible overall)docs/planning/STRUCTURED_OUTPUT_DESIGN.md (1)
1-643: Excellent design document!This comprehensive design document provides a clear roadmap for the structured JSON output format, with well-defined phases and implementation strategy. The distinction between Phase 1 (basic restructuring) and Phase 2 (enhanced analysis) sets realistic expectations.
Optional markdown improvements for better rendering:
- Add language specifier to the code block at line 93
- Remove spaces inside emphasis markers (lines 126, 155)
These are minor formatting issues that don't affect the excellent content.
test_graph_export.py (1)
1-1: Shebang without executable permission.The file has a shebang line but is not marked as executable. Either make the file executable or remove the shebang since this is typically imported as a module.
# Option 1: Make it executable chmod +x test_graph_export.py # Option 2: Remove the shebang if not neededAs per static analysis.
diffgraph/graph_manager.py (1)
324-331: Simplify component serialization logic.The component list comprehension uses nested conditionals with
hasattrchecks that may not be necessary. TheFileNode.componentsfield is typed asList[Dict], suggesting components should already be dictionaries.If
componentsare always dictionaries (as the type hint suggests), simplify to:-'components': [ - { - 'name': c.name if hasattr(c, 'name') else str(c), - 'change_type': c.change_type if hasattr(c, 'change_type') else 'unknown', - 'summary': c.summary if hasattr(c, 'summary') else None - } if hasattr(c, '__dict__') else str(c) - for c in (node.components or []) -] +'components': node.components or []If components can be mixed types, add a comment explaining why and consider validating the type earlier in the flow.
example_usage.py (1)
47-47: Rename unused loop variable.The loop variable
component_idis not used in the loop body.Apply this diff:
- for component_id, component_node in graph_manager.component_nodes.items(): + for _, component_node in graph_manager.component_nodes.items():Similar issue at line 64.
docs/planning/PHASE1_IMPLEMENTATION_NOTES.md (1)
300-314: Add language identifier to fenced code block.The commit message template code block (line 300) is missing a language identifier.
Apply this diff:
-``` +```text feat: Add structured JSON output format (Phase 1) ...</blockquote></details> </blockquote></details> <details> <summary>📜 Review details</summary> **Configuration used**: CodeRabbit UI **Review profile**: CHILL **Plan**: Pro <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between a43abac1e9bb30104aab181cc68e1b306abe4a70 and d43d668f05e8b913c665d7b312c000c4e9875033. </details> <details> <summary>📒 Files selected for processing (16)</summary> * `CHANGELOG.md` (1 hunks) * `GRAPH_EXPORT_FEATURE.md` (1 hunks) * `README.md` (2 hunks) * `TESTING_GUIDE.md` (1 hunks) * `diffgraph/__init__.py` (1 hunks) * `diffgraph/cli.py` (3 hunks) * `diffgraph/graph_export.py` (1 hunks) * `diffgraph/graph_manager.py` (1 hunks) * `diffgraph/structured_export.py` (1 hunks) * `docs/planning/PHASE1_IMPLEMENTATION_NOTES.md` (1 hunks) * `docs/planning/STRUCTURED_OUTPUT_DESIGN.md` (1 hunks) * `example_usage.py` (1 hunks) * `setup.py` (1 hunks) * `test_cli_manual.sh` (1 hunks) * `test_graph_export.py` (1 hunks) * `test_structured_export.py` (1 hunks) </details> <details> <summary>🧰 Additional context used</summary> <details> <summary>🧬 Code graph analysis (6)</summary> <details> <summary>test_graph_export.py (2)</summary><blockquote> <details> <summary>diffgraph/graph_manager.py (8)</summary> * `GraphManager` (54-360) * `ChangeType` (8-13) * `FileStatus` (15-20) * `add_file` (85-94) * `add_component` (96-122) * `add_component_dependency` (124-135) * `mark_processed` (150-156) * `export_to_dict` (305-360) </details> <details> <summary>diffgraph/graph_export.py (3)</summary> * `export_graph` (249-273) * `load_graph_from_json` (169-182) * `load_graph_from_pickle` (185-198) </details> </blockquote></details> <details> <summary>example_usage.py (1)</summary><blockquote> <details> <summary>diffgraph/graph_export.py (1)</summary> * `load_graph_from_json` (169-182) </details> </blockquote></details> <details> <summary>diffgraph/cli.py (2)</summary><blockquote> <details> <summary>diffgraph/graph_export.py (1)</summary> * `export_graph` (249-273) </details> <details> <summary>diffgraph/structured_export.py (1)</summary> * `export_structured_json` (403-433) </details> </blockquote></details> <details> <summary>diffgraph/graph_export.py (1)</summary><blockquote> <details> <summary>diffgraph/graph_manager.py (5)</summary> * `GraphManager` (54-360) * `FileNode` (23-34) * `ComponentNode` (37-52) * `ChangeType` (8-13) * `FileStatus` (15-20) </details> </blockquote></details> <details> <summary>test_structured_export.py (2)</summary><blockquote> <details> <summary>diffgraph/graph_manager.py (6)</summary> * `GraphManager` (54-360) * `ChangeType` (8-13) * `add_file` (85-94) * `add_component` (96-122) * `add_component_dependency` (124-135) * `mark_processed` (150-156) </details> <details> <summary>diffgraph/structured_export.py (5)</summary> * `classify_file` (53-75) * `export_structured_json` (403-433) * `transform_component_node` (175-202) * `transform_file_node` (205-224) * `transform_to_structured_format` (312-400) </details> </blockquote></details> <details> <summary>diffgraph/structured_export.py (1)</summary><blockquote> <details> <summary>diffgraph/graph_manager.py (4)</summary> * `GraphManager` (54-360) * `FileNode` (23-34) * `ComponentNode` (37-52) * `ChangeType` (8-13) </details> </blockquote></details> </details><details> <summary>🪛 LanguageTool</summary> <details> <summary>CHANGELOG.md</summary> [uncategorized] ~23-~23: If this is a compound adjective that modifies the following noun, use a hyphen. Context: ...efault** (breaking change for JSON, but backwards compatible overall) - `--format graph --graph-fo... (EN_COMPOUND_ADJECTIVE_INTERNAL) </details> </details> <details> <summary>🪛 markdownlint-cli2 (0.18.1)</summary> <details> <summary>docs/planning/PHASE1_IMPLEMENTATION_NOTES.md</summary> 300-300: Fenced code blocks should have a language specified (MD040, fenced-code-language) </details> <details> <summary>docs/planning/STRUCTURED_OUTPUT_DESIGN.md</summary> 93-93: Fenced code blocks should have a language specified (MD040, fenced-code-language) --- 126-126: Spaces inside emphasis markers (MD037, no-space-in-emphasis) --- 155-155: Spaces inside emphasis markers (MD037, no-space-in-emphasis) </details> </details> <details> <summary>🪛 Ruff (0.14.2)</summary> <details> <summary>test_graph_export.py</summary> 1-1: Shebang is present but file is not executable (EXE001) </details> <details> <summary>example_usage.py</summary> 1-1: Shebang is present but file is not executable (EXE001) --- 47-47: Loop control variable `component_id` not used within loop body Rename unused `component_id` to `_component_id` (B007) --- 64-64: Loop control variable `component_id` not used within loop body Rename unused `component_id` to `_component_id` (B007) </details> <details> <summary>diffgraph/cli.py</summary> 238-238: `subprocess` call: check for execution of untrusted input (S603) --- 238-238: Starting a process with a partial executable path (S607) --- 240-240: Starting a process without a shell (S606) --- 242-242: `subprocess` call: check for execution of untrusted input (S603) --- 242-242: Starting a process with a partial executable path (S607) </details> <details> <summary>diffgraph/graph_export.py</summary> 196-196: `pickle` and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue (S301) --- 273-273: Avoid specifying long messages outside the exception class (TRY003) </details> <details> <summary>test_structured_export.py</summary> 1-1: Shebang is present but file is not executable (EXE001) </details> <details> <summary>diffgraph/structured_export.py</summary> 90-90: Consider iterable unpacking instead of concatenation Replace with iterable unpacking (RUF005) --- 91-91: `subprocess` call: check for execution of untrusted input (S603) --- 100-100: Do not catch blind exception: `Exception` (BLE001) --- 177-177: Unused function argument: `graph_manager` (ARG001) --- 248-248: Unused function argument: `graph_manager` (ARG001) --- 405-405: PEP 484 prohibits implicit `Optional` Convert to `T | None` (RUF013) </details> </details> </details> <details> <summary>🔇 Additional comments (11)</summary><blockquote> <details> <summary>setup.py (1)</summary><blockquote> `5-5`: **LGTM!** The version bump to 1.2.0 appropriately reflects the new graph export features being added in this PR. </blockquote></details> <details> <summary>TESTING_GUIDE.md (1)</summary><blockquote> `1-302`: **Excellent testing documentation!** This comprehensive testing guide provides clear instructions for verifying the new graph export feature across multiple scenarios. The step-by-step examples and troubleshooting section will be valuable for both users and contributors. </blockquote></details> <details> <summary>GRAPH_EXPORT_FEATURE.md (1)</summary><blockquote> `1-187`: **Comprehensive feature documentation!** This document provides excellent coverage of the graph export feature, including clear usage examples, data structure descriptions, and implementation details. The backward compatibility note is particularly helpful. </blockquote></details> <details> <summary>diffgraph/__init__.py (1)</summary><blockquote> `5-5`: **LGTM!** Version bump to 1.2.0 is consistent with setup.py and appropriately reflects the new features. </blockquote></details> <details> <summary>test_graph_export.py (1)</summary><blockquote> `13-111`: **Well-structured test coverage!** The test function comprehensively validates all export formats (JSON, pickle, GraphML) with proper assertions for data integrity and file creation. The cleanup step is a good practice. </blockquote></details> <details> <summary>README.md (2)</summary><blockquote> `9-79`: **Clear documentation of new features!** The updated usage section effectively communicates the new graph export capabilities with clear examples for all supported formats (JSON, Pickle, GraphML). --- `81-184`: **Excellent output format documentation!** The new "Output Formats" section provides comprehensive coverage of both HTML reports and graph data export. The structured JSON format explanation with categorization and code examples is particularly helpful for users integrating with the tool. </blockquote></details> <details> <summary>diffgraph/cli.py (1)</summary><blockquote> `211-242`: **LGTM! Clean format-based routing.** The implementation correctly branches between graph export and HTML report generation based on the `--format` flag. The use of structured JSON for `--graph-format json` and NetworkX format for pickle/graphml is well-aligned with the Phase 1 design. Note: The static analysis warnings about subprocess calls (lines 238, 240, 242) are false positives—these commands open the browser with system utilities and don't execute untrusted input. </blockquote></details> <details> <summary>test_structured_export.py (1)</summary><blockquote> `19-284`: **Excellent test coverage for Phase 1!** The test suite comprehensively validates: - File classification across all categories (auto-generated, docs, config, source) - Component transformation with Phase 2 null fields - End-to-end structured export with proper categorization - Graph completeness (all edges reference valid nodes) - Empty graph handling This provides strong validation of the Phase 1 structured export implementation. Minor: Consider making the file executable since it has a shebang (line 1). </blockquote></details> <details> <summary>diffgraph/graph_export.py (1)</summary><blockquote> `185-198`: **Document pickle security considerations.** Pickle can be unsafe when deserializing untrusted data. Since this function loads graph data that may come from external sources, consider documenting the security implications. Add a security warning to the docstring: ```diff def load_graph_from_pickle(pickle_path: str) -> GraphManager: """ Load graph data from a pickle file. + + ⚠️ Security Warning: Only load pickle files from trusted sources. + Pickle can execute arbitrary code during deserialization. Args: pickle_path: Path to the pickle fileAlternatively, if pickle files are only used internally or in trusted environments, document that assumption.
diffgraph/structured_export.py (1)
175-202: Unused parameter is acceptable for Phase 1.The
graph_managerparameter is currently unused but is likely needed for Phase 2 enhancements (relationship type detection based on component analysis). Keeping it maintains a consistent API.Similar unused parameter at line 248 in
transform_file_edgeis also acceptable for Phase 2 extensibility.
| def export_to_dict(self) -> dict: | ||
| """ | ||
| Export the graph manager state to a dictionary. | ||
| This is useful for serialization and can be passed to graph_export functions. | ||
| Returns: | ||
| Dictionary containing all graph data and metadata | ||
| """ | ||
| from networkx.readwrite import json_graph | ||
|
|
||
| # Convert file nodes to serializable format | ||
| file_nodes_data = {} | ||
| for file_path, node in self.file_nodes.items(): | ||
| file_nodes_data[file_path] = { | ||
| 'path': node.path, | ||
| 'status': node.status.value, | ||
| 'change_type': node.change_type.value, | ||
| 'summary': node.summary, | ||
| 'error': node.error, | ||
| 'components': [ | ||
| { | ||
| 'name': c.name if hasattr(c, 'name') else str(c), | ||
| 'change_type': c.change_type if hasattr(c, 'change_type') else 'unknown', | ||
| 'summary': c.summary if hasattr(c, 'summary') else None | ||
| } if hasattr(c, '__dict__') else str(c) | ||
| for c in (node.components or []) | ||
| ] | ||
| } | ||
|
|
||
| # Convert component nodes to serializable format | ||
| component_nodes_data = {} | ||
| for component_id, node in self.component_nodes.items(): | ||
| component_nodes_data[component_id] = { | ||
| 'name': node.name, | ||
| 'file_path': node.file_path, | ||
| 'change_type': node.change_type.value, | ||
| 'component_type': node.component_type, | ||
| 'parent': node.parent, | ||
| 'summary': node.summary, | ||
| 'dependencies': node.dependencies, | ||
| 'dependents': node.dependents | ||
| } | ||
|
|
||
| # Convert graphs to node-link format | ||
| file_graph_data = json_graph.node_link_data(self.file_graph, edges="links") | ||
| component_graph_data = json_graph.node_link_data(self.component_graph, edges="links") | ||
|
|
||
| # Combine all data | ||
| return { | ||
| 'version': '1.0', | ||
| 'file_nodes': file_nodes_data, | ||
| 'component_nodes': component_nodes_data, | ||
| 'file_graph': file_graph_data, | ||
| 'component_graph': component_graph_data, | ||
| 'processed_files': list(self.processed_files) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code duplication with graph_export.py.
The export_to_dict() method duplicates the serialization logic in diffgraph/graph_export.py's _prepare_graph_data() function (lines 109-166). Both functions perform identical transformations of file_nodes, component_nodes, and graphs.
Consider one of these approaches:
Option 1 (Recommended): Make _prepare_graph_data call export_to_dict
# In diffgraph/graph_export.py
def _prepare_graph_data(graph_manager: GraphManager) -> Dict[str, Any]:
- # Convert file nodes to serializable format
- file_nodes_data = {}
- for file_path, node in graph_manager.file_nodes.items():
- ...
- # ... rest of duplication
- return data
+ """Prepare graph data for serialization by delegating to GraphManager."""
+ return graph_manager.export_to_dict()Option 2: Remove export_to_dict and keep logic in graph_export.py
If export_to_dict isn't used elsewhere, consider removing it entirely and keeping serialization logic centralized in the graph_export module.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In diffgraph/graph_manager.py around lines 305-360, export_to_dict duplicates
the serialization logic already implemented in
diffgraph/graph_export.py::_prepare_graph_data (lines ~109-166); replace the
duplicated implementation by delegating to that centralized function (import
_prepare_graph_data from diffgraph.graph_export and return its result) or, if
export_to_dict is unused elsewhere, remove this method and update callers to use
graph_export._prepare_graph_data directly; ensure the function signatures match,
import paths are correct, update any unit tests or call sites, and keep returned
structure/version identical to avoid breaking callers.
| try: | ||
| cmd = ['git', 'diff', '--numstat'] + diff_args + ['--', file_path] | ||
| result = subprocess.run(cmd, capture_output=True, text=True, check=False) | ||
|
|
||
| if result.returncode == 0 and result.stdout.strip(): | ||
| # Output format: "additions\tdeletions\tfilename" | ||
| parts = result.stdout.strip().split('\t') | ||
| if len(parts) >= 2: | ||
| additions = int(parts[0]) if parts[0] != '-' else 0 | ||
| deletions = int(parts[1]) if parts[1] != '-' else 0 | ||
| return {'additions': additions, 'deletions': deletions} | ||
| except Exception as e: | ||
| print(f"Warning: Could not get stats for {file_path}: {e}") | ||
|
|
||
| return {'additions': 0, 'deletions': 0} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify git diff arguments are sanitized.
The get_file_stats function passes diff_args directly to a subprocess command. If these arguments come from user input, this could be a security risk.
Verify that diff_args are sanitized before reaching this function. Based on diffgraph/cli.py (lines 43-46), it appears sanitize_diff_args is used in the CLI flow. Confirm this sanitization also applies to the structured export path:
🏁 Script executed:
#!/bin/bash
# Check if diff_args are sanitized before reaching get_file_stats
rg -n "export_structured_json" -A 10 -B 5 --type=pyLength of output: 7168
I need to see more context around the export_structured_json call in cli.py to verify if diff_args are sanitized before reaching that function.
I need to see the actual implementation context around line 216 in cli.py to verify if diff_args are sanitized before passing to export_structured_json.
Based on my research, I can confirm that git commands can be exploited through argument injection options, and it's possible to execute arbitrary commands even when using safe APIs. The recommended approach is to use '--' to separate options from paths/arguments.
The security concern in the original review comment is valid. However, I need to examine the actual code implementation in cli.py to verify whether sanitize_diff_args is properly applied before the export_structured_json call. Let me check the code directly to complete the verification.
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the full context around line 216 of cli.py to see diff_args preparation
rg -n "export_structured_json" --type=py -B 30 diffgraph/cli.py | head -60
Length of output: 2235
I need to find where diff_args is defined in the CLI function to verify if it's sanitized before reaching line 216.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find the function definition that contains line 216 and trace diff_args origin
rg -n "def " --type=py diffgraph/cli.py | grep -B1 "^[0-9]*:def"
Length of output: 336
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the implementation of main function starting at line 143
sed -n '143,250p' diffgraph/cli.py
Length of output: 4849
Now I need to check if get_changed_files or load_file_contents sanitize the diff_args before using them.
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the implementation of get_changed_files starting at line 31
sed -n '31,90p' diffgraph/cli.py
Length of output: 2093
Excellent - I found that get_changed_files does sanitize diff_args at line 44. However, the unsanitized diff_args are also passed directly to export_structured_json at line 216. Let me verify if export_structured_json sanitizes them before calling get_file_stats.
---
🏁 Script executed:
```shell
#!/bin/bash
# Get export_structured_json implementation to see if it sanitizes diff_args
sed -n '403,433p' diffgraph/structured_export.py
Length of output: 1081
---
🏁 Script executed:
```shell
#!/bin/bash
# Find transform_to_structured_format function and check if it sanitizes diff_args
rg -n "def transform_to_structured_format" --type=py -A 50 diffgraph/structured_export.py | head -70
Length of output: 2018
Unsanitized diff_args create a git command injection vulnerability in the structured export path.
The verification confirms the review comment's concern is valid. The diff_args flow from the CLI is: main() (line 159, raw args) → export_structured_json() (line 216) → transform_to_structured_format() (line 426) → get_file_stats() (lines 333-336) → subprocess (line 91).
Unlike get_changed_files() which calls sanitize_diff_args() for its own use, the structured export path does not sanitize diff_args before passing them to get_file_stats(). At line 91 of structured_export.py, the unsanitized arguments are directly concatenated into the git subprocess command:
cmd = ['git', 'diff', '--numstat'] + diff_args + ['--', file_path]
This is a security vulnerability that should be fixed by calling sanitize_diff_args() either:
- In
export_structured_json()before passing totransform_to_structured_format(), or - At the start of
transform_to_structured_format()before usingdiff_args
🧰 Tools
🪛 Ruff (0.14.2)
90-90: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
91-91: subprocess call: check for execution of untrusted input
(S603)
100-100: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In diffgraph/structured_export.py around lines 89 to 103, diff_args is
concatenated directly into the git subprocess command causing a command
injection vulnerability; fix by sanitizing diff_args before it's used — call the
existing sanitize_diff_args() on diff_args either in export_structured_json()
before passing diff_args into transform_to_structured_format(), or at the start
of transform_to_structured_format(); ensure the sanitized list replaces the
original diff_args variable so the subprocess command becomes
['git','diff','--numstat'] + sanitized_diff_args + ['--', file_path].
| 'total_files_changed': total_files, | ||
| 'total_additions': total_additions, | ||
| 'total_deletions': total_deletions, | ||
| 'analyzer_version': '1.2.0' # TODO: Get from package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Replace TODO with actual version.
The analyzer version is hardcoded with a TODO comment.
Import the version from the package:
+from . import __version__
+
# ...
'metadata': {
'analyzed_at': datetime.now(timezone.utc).isoformat(),
'diff_base': diff_base,
'diff_target': diff_target,
'total_files_changed': total_files,
'total_additions': total_additions,
'total_deletions': total_deletions,
- 'analyzer_version': '1.2.0' # TODO: Get from package
+ 'analyzer_version': __version__
},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| 'analyzer_version': '1.2.0' # TODO: Get from package | |
| from . import __version__ | |
| # ... other code ... | |
| 'metadata': { | |
| 'analyzed_at': datetime.now(timezone.utc).isoformat(), | |
| 'diff_base': diff_base, | |
| 'diff_target': diff_target, | |
| 'total_files_changed': total_files, | |
| 'total_additions': total_additions, | |
| 'total_deletions': total_deletions, | |
| 'analyzer_version': __version__ | |
| }, |
🤖 Prompt for AI Agents
In diffgraph/structured_export.py around line 394, the analyzer_version is
hardcoded as '1.2.0' with a TODO; replace the hardcoded value by importing the
package version (e.g., from your package's __init__ or version module) and use
that imported variable for 'analyzer_version'. Add the appropriate import at the
top of the file, handle import errors if needed (fallback or raise), and remove
the TODO comment.
| @@ -0,0 +1,88 @@ | |||
| #!/usr/bin/env python | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Make the script executable.
The shebang is present, but the file lacks executable permissions.
Run:
chmod +x example_usage.py🧰 Tools
🪛 Ruff (0.14.2)
1-1: Shebang is present but file is not executable
(EXE001)
🤖 Prompt for AI Agents
In example_usage.py at line 1, the file has a shebang but lacks executable
permission; make it executable by setting the POSIX execute bit for the intended
users (e.g., run chmod +x example_usage.py) so the script can be run directly
from the shell while retaining current owner/group and other permissions.
| @@ -0,0 +1,145 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Make the script executable.
The shebang is present, but the file lacks executable permissions.
Run:
chmod +x test_cli_manual.sh🤖 Prompt for AI Agents
In test_cli_manual.sh around line 1, the script has a shebang but lacks
executable permissions; fix by setting the file mode to include the execute bit
for the appropriate user(s) (e.g., give the owner execute permission) so the
script can be run directly.
Summary by CodeRabbit
Release Notes - Version 1.2.0
New Features
--format(html|graph),--graph-format(json|pickle|graphml), and--outputfor custom export paths.Documentation
Tests