Skip to content

Conversation

@avikalpg
Copy link
Contributor

@avikalpg avikalpg commented Oct 29, 2025

Summary by CodeRabbit

Release Notes - Version 1.2.0

  • New Features

    • Added graph data export in multiple formats: JSON, Pickle, and GraphML.
    • Introduced structured JSON output format with enhanced metadata and categorization (Phase 1).
    • Added CLI options: --format (html|graph), --graph-format (json|pickle|graphml), and --output for custom export paths.
  • Documentation

    • Expanded README with export usage examples and output format specifications.
    • Added comprehensive testing guide for export functionality.
  • Tests

    • New test suites validating export/import workflows and data integrity across all formats.

Add ability to export complete networkx graph data structure to file formats
that other programs can consume programmatically.

New Features:
- Add --format option to choose between 'html' (default) and 'graph' output
- Add --graph-format option to select serialization (json/pickle/graphml)
- Support for JSON export (human-readable, widely compatible)
- Support for Pickle export (Python-native, exact data preservation)
- Support for GraphML export (standard graph format for analysis tools)

New Files:
- diffgraph/graph_export.py: Core export/import functionality
- test_graph_export.py: Comprehensive test suite
- example_usage.py: Example script for using exported data
- test_cli_manual.sh: Automated testing script
- GRAPH_EXPORT_FEATURE.md: Feature documentation
- TESTING_GUIDE.md: Testing instructions

Modified Files:
- diffgraph/cli.py: Added new CLI options and conditional output logic
- diffgraph/graph_manager.py: Added export_to_dict() method
- README.md: Updated documentation with new features
- CHANGELOG.md: Added v1.1.0 release notes
- setup.py: Bumped version to 1.1.0
- diffgraph/__init__.py: Updated version to 1.1.0

Technical Details:
- Exported data includes file nodes, component nodes, graphs, and metadata
- All formats support round-trip (export → load) with data integrity
- NetworkX graphs serialized using node-link format for compatibility
- Backward compatible: existing HTML functionality unchanged

Testing:
- All unit tests pass (test_graph_export.py)
- Automated test suite validates all formats
- Example usage script demonstrates practical use cases
Add comprehensive design documents for new integration-friendly JSON output
format to docs/planning/ folder to keep root directory clean.

Documents:
- STRUCTURED_OUTPUT_DESIGN.md: Complete schema specification
  - File categorization (source/docs/config/auto-generated)
  - Rich component metadata (complexity, impact radius)
  - External dependency nodes (APIs, databases, services)
  - Comprehensive relationship types (REST, RPC, pub/sub)
  - Cross-references between docs and code
  - Phase-based implementation strategy

- PHASE1_IMPLEMENTATION_NOTES.md: Implementation guide
  - Implementation decisions and rationale
  - Phase 1 scope (basic restructuring with existing data)
  - Phase 2/3 future enhancements
  - Testing strategy and success criteria
  - Next session pickup instructions

These documents preserve design rationale and guide future implementation
without cluttering the codebase root.
Add integration-friendly structured JSON format optimized for VSCode extensions,
web UIs, and other tool integrations. This is Phase 1 using existing analysis
data without requiring new AI analysis.

New Features:
- Structured JSON output with clean categorization and rich metadata
- Automatic file classification into 4 categories:
  * auto_generated: Lock files, build artifacts (excluded from review)
  * documentation: Docs with potential cross-references to code
  * configuration: Config files with change tracking
  * source_code: Source files with full dependency graphs
- Git diff stats (additions/deletions) extracted per file
- Impact radius calculation from dependency graph
- Complete graph structure (all edge targets exist as nodes)
- Pattern-based classification with 40+ common file patterns

New Files:
- diffgraph/structured_export.py: Core transformation logic (400+ lines)
  * File categorization and classification
  * Component and edge transformation
  * Git stats extraction
  * Structured format generation
- test_structured_export.py: Comprehensive test suite
  * File classification tests
  * Component transformation tests
  * Full export validation
  * Graph completeness tests
  * Empty graph handling

Modified Files:
- diffgraph/cli.py: Route JSON format to structured export
- README.md: Document structured format with examples
- CHANGELOG.md: Added v1.2.0 release notes
- setup.py: Bumped version to 1.2.0
- diffgraph/__init__.py: Updated version to 1.2.0

Breaking Changes:
- JSON format now outputs structured format (v2.0 schema)
- NetworkX format still available via pickle/graphml

Technical Details:
- Phase 1: Uses existing data, leaves advanced fields as null
- Relationship types: imports, calls, extends, implements
- Change types: added, deleted, modified, unchanged
- Clean separation of files graph and components graph
- All tests pass (5/5 test suites)

Phase 2 Planned:
- Full codebase impact analysis
- External dependency nodes (APIs, databases)
- Advanced relationships (REST, RPC, pub/sub)
- Complexity metrics and line numbers
- Cross-reference detection

See docs/planning/STRUCTURED_OUTPUT_DESIGN.md for complete specification.
…id', but just 'parent'.

refactor: Clean up whitespace in structured_export.py

This commit removes unnecessary whitespace throughout the structured_export.py file, improving code readability without altering functionality. The changes include adjustments to spacing in function definitions, docstrings, and return statements, ensuring a more consistent coding style.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 29, 2025

Walkthrough

This PR introduces comprehensive graph export capabilities for DiffGraph, adding support for multiple serialization formats (JSON, Pickle, GraphML), a new structured JSON output format for integration with external tools, enhanced CLI options for format selection, and a complete test suite with documentation and planning guides.

Changes

Cohort / File(s) Summary
Version & Package Updates
diffgraph/__init__.py, setup.py
Bumps version from 0.1.0/1.0.0 to 1.2.0
Core Export Functionality
diffgraph/graph_export.py
New module with export/load functions for JSON, Pickle, GraphML formats; includes graph data preparation and reconstruction helpers
Structured JSON Export
diffgraph/structured_export.py
New module for transforming graph data into structured JSON format with file classification, language detection, relationship inference, and metadata aggregation
GraphManager Enhancement
diffgraph/graph_manager.py
Adds export_to_dict() method to serialize graph structures into JSON-compatible dictionaries
CLI Enhancements
diffgraph/cli.py
Adds --format (html|graph), --graph-format (json|pickle|graphml), --output options; implements conditional export logic for graph formats
Core Documentation
README.md, CHANGELOG.md
Updates feature list, adds export capability documentation, includes usage examples for JSON/Pickle/GraphML/HTML outputs, and release notes for 1.2.0
Planning & Design Docs
docs/planning/PHASE1_IMPLEMENTATION_NOTES.md, docs/planning/STRUCTURED_OUTPUT_DESIGN.md, GRAPH_EXPORT_FEATURE.md, TESTING_GUIDE.md
Comprehensive documentation of Phase 1 implementation strategy, structured output schema design, feature specifications, and testing procedures
Test Suite
test_graph_export.py, test_structured_export.py, test_cli_manual.sh
Unit tests for export/import workflows, structured export validation, manual CLI testing with round-trip integrity checks
Example & Usage
example_usage.py
Demo script for loading and analyzing exported graph data with statistics and component connectivity analysis

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant GraphManager
    participant Exporter as graph_export/<br/>structured_export
    participant FileSystem

    User->>CLI: diffgraph --format graph --graph-format json
    CLI->>GraphManager: access graph data
    alt Structured JSON
        CLI->>Exporter: export_structured_json()
        Exporter->>GraphManager: query nodes/edges
        Exporter->>Exporter: classify files, transform nodes
        Exporter->>FileSystem: write JSON
    else Standard Export
        CLI->>Exporter: export_graph_to_json()
        Exporter->>GraphManager: export_to_dict()
        Exporter->>FileSystem: write JSON
    end
    FileSystem-->>User: export file path
    
    Note over Exporter: Supports JSON, Pickle, GraphML<br/>with load_graph functions for reconstruction
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • diffgraph/structured_export.py: New module with extensive transformation logic, pattern matching, and Git integration; validate correctness of file classification, relationship inference, and schema conformance
  • diffgraph/graph_export.py: Graph serialization/deserialization across multiple formats; verify data integrity in round-trip conversions and compatibility with GraphML standard
  • diffgraph/cli.py: Parameter routing logic and conditional export paths; ensure backward compatibility with HTML output and correct error handling for invalid format combinations
  • test_structured_export.py & test_graph_export.py: Comprehensive test coverage but requires validation that edge cases (empty graphs, missing files, malformed data) are properly handled
  • Planning documents (STRUCTURED_OUTPUT_DESIGN.md, PHASE1_IMPLEMENTATION_NOTES.md): Validate alignment between design intent and implementation, especially Phase 1/2 boundaries and placeholder fields

Possibly related PRs

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "Feature: output graph data as JSON" is partially related to the changeset. The PR does introduce JSON output capabilities, which is a legitimate and significant part of the changes. However, the title does not capture the full scope of the implementation, which includes multiple export formats (JSON, pickle, GraphML), a new structured JSON format specifically designed for Phase 1 with enhanced metadata and categorization, multiple new modules, and comprehensive CLI enhancements. The title focuses on one output format while omitting the multi-format nature and the "structured" aspect of the JSON output mentioned in the CHANGELOG as the primary feature.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/output-graph-data

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
diffgraph/graph_export.py (1)

109-166: Code duplication noted.

This function duplicates the logic in GraphManager.export_to_dict() (lines 305-360 in graph_manager.py). See the comment on that file for refactoring suggestions.

🧹 Nitpick comments (6)
CHANGELOG.md (1)

8-36: Well-documented release notes!

The changelog clearly documents the Phase 1 structured JSON output format and breaking changes. The phased approach and technical details provide good context for users.

Optional style improvement: Consider hyphenating "backwards-compatible" on line 23 per standard English compound adjective usage.

-  - **JSON format now uses structured output by default** (breaking change for JSON, but backwards compatible overall)
+  - **JSON format now uses structured output by default** (breaking change for JSON, but backwards-compatible overall)
docs/planning/STRUCTURED_OUTPUT_DESIGN.md (1)

1-643: Excellent design document!

This comprehensive design document provides a clear roadmap for the structured JSON output format, with well-defined phases and implementation strategy. The distinction between Phase 1 (basic restructuring) and Phase 2 (enhanced analysis) sets realistic expectations.

Optional markdown improvements for better rendering:

  1. Add language specifier to the code block at line 93
  2. Remove spaces inside emphasis markers (lines 126, 155)

These are minor formatting issues that don't affect the excellent content.

test_graph_export.py (1)

1-1: Shebang without executable permission.

The file has a shebang line but is not marked as executable. Either make the file executable or remove the shebang since this is typically imported as a module.

# Option 1: Make it executable
chmod +x test_graph_export.py

# Option 2: Remove the shebang if not needed

As per static analysis.

diffgraph/graph_manager.py (1)

324-331: Simplify component serialization logic.

The component list comprehension uses nested conditionals with hasattr checks that may not be necessary. The FileNode.components field is typed as List[Dict], suggesting components should already be dictionaries.

If components are always dictionaries (as the type hint suggests), simplify to:

-'components': [
-    {
-        'name': c.name if hasattr(c, 'name') else str(c),
-        'change_type': c.change_type if hasattr(c, 'change_type') else 'unknown',
-        'summary': c.summary if hasattr(c, 'summary') else None
-    } if hasattr(c, '__dict__') else str(c)
-    for c in (node.components or [])
-]
+'components': node.components or []

If components can be mixed types, add a comment explaining why and consider validating the type earlier in the flow.

example_usage.py (1)

47-47: Rename unused loop variable.

The loop variable component_id is not used in the loop body.

Apply this diff:

-    for component_id, component_node in graph_manager.component_nodes.items():
+    for _, component_node in graph_manager.component_nodes.items():

Similar issue at line 64.

docs/planning/PHASE1_IMPLEMENTATION_NOTES.md (1)

300-314: Add language identifier to fenced code block.

The commit message template code block (line 300) is missing a language identifier.

Apply this diff:

-```
+```text
 feat: Add structured JSON output format (Phase 1)
 ...

</blockquote></details>

</blockquote></details>

<details>
<summary>📜 Review details</summary>

**Configuration used**: CodeRabbit UI

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between a43abac1e9bb30104aab181cc68e1b306abe4a70 and d43d668f05e8b913c665d7b312c000c4e9875033.

</details>

<details>
<summary>📒 Files selected for processing (16)</summary>

* `CHANGELOG.md` (1 hunks)
* `GRAPH_EXPORT_FEATURE.md` (1 hunks)
* `README.md` (2 hunks)
* `TESTING_GUIDE.md` (1 hunks)
* `diffgraph/__init__.py` (1 hunks)
* `diffgraph/cli.py` (3 hunks)
* `diffgraph/graph_export.py` (1 hunks)
* `diffgraph/graph_manager.py` (1 hunks)
* `diffgraph/structured_export.py` (1 hunks)
* `docs/planning/PHASE1_IMPLEMENTATION_NOTES.md` (1 hunks)
* `docs/planning/STRUCTURED_OUTPUT_DESIGN.md` (1 hunks)
* `example_usage.py` (1 hunks)
* `setup.py` (1 hunks)
* `test_cli_manual.sh` (1 hunks)
* `test_graph_export.py` (1 hunks)
* `test_structured_export.py` (1 hunks)

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧬 Code graph analysis (6)</summary>

<details>
<summary>test_graph_export.py (2)</summary><blockquote>

<details>
<summary>diffgraph/graph_manager.py (8)</summary>

* `GraphManager` (54-360)
* `ChangeType` (8-13)
* `FileStatus` (15-20)
* `add_file` (85-94)
* `add_component` (96-122)
* `add_component_dependency` (124-135)
* `mark_processed` (150-156)
* `export_to_dict` (305-360)

</details>
<details>
<summary>diffgraph/graph_export.py (3)</summary>

* `export_graph` (249-273)
* `load_graph_from_json` (169-182)
* `load_graph_from_pickle` (185-198)

</details>

</blockquote></details>
<details>
<summary>example_usage.py (1)</summary><blockquote>

<details>
<summary>diffgraph/graph_export.py (1)</summary>

* `load_graph_from_json` (169-182)

</details>

</blockquote></details>
<details>
<summary>diffgraph/cli.py (2)</summary><blockquote>

<details>
<summary>diffgraph/graph_export.py (1)</summary>

* `export_graph` (249-273)

</details>
<details>
<summary>diffgraph/structured_export.py (1)</summary>

* `export_structured_json` (403-433)

</details>

</blockquote></details>
<details>
<summary>diffgraph/graph_export.py (1)</summary><blockquote>

<details>
<summary>diffgraph/graph_manager.py (5)</summary>

* `GraphManager` (54-360)
* `FileNode` (23-34)
* `ComponentNode` (37-52)
* `ChangeType` (8-13)
* `FileStatus` (15-20)

</details>

</blockquote></details>
<details>
<summary>test_structured_export.py (2)</summary><blockquote>

<details>
<summary>diffgraph/graph_manager.py (6)</summary>

* `GraphManager` (54-360)
* `ChangeType` (8-13)
* `add_file` (85-94)
* `add_component` (96-122)
* `add_component_dependency` (124-135)
* `mark_processed` (150-156)

</details>
<details>
<summary>diffgraph/structured_export.py (5)</summary>

* `classify_file` (53-75)
* `export_structured_json` (403-433)
* `transform_component_node` (175-202)
* `transform_file_node` (205-224)
* `transform_to_structured_format` (312-400)

</details>

</blockquote></details>
<details>
<summary>diffgraph/structured_export.py (1)</summary><blockquote>

<details>
<summary>diffgraph/graph_manager.py (4)</summary>

* `GraphManager` (54-360)
* `FileNode` (23-34)
* `ComponentNode` (37-52)
* `ChangeType` (8-13)

</details>

</blockquote></details>

</details><details>
<summary>🪛 LanguageTool</summary>

<details>
<summary>CHANGELOG.md</summary>

[uncategorized] ~23-~23: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...efault** (breaking change for JSON, but backwards compatible overall)   - `--format graph --graph-fo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>docs/planning/PHASE1_IMPLEMENTATION_NOTES.md</summary>

300-300: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>
<details>
<summary>docs/planning/STRUCTURED_OUTPUT_DESIGN.md</summary>

93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

126-126: Spaces inside emphasis markers

(MD037, no-space-in-emphasis)

---

155-155: Spaces inside emphasis markers

(MD037, no-space-in-emphasis)

</details>

</details>
<details>
<summary>🪛 Ruff (0.14.2)</summary>

<details>
<summary>test_graph_export.py</summary>

1-1: Shebang is present but file is not executable

(EXE001)

</details>
<details>
<summary>example_usage.py</summary>

1-1: Shebang is present but file is not executable

(EXE001)

---

47-47: Loop control variable `component_id` not used within loop body

Rename unused `component_id` to `_component_id`

(B007)

---

64-64: Loop control variable `component_id` not used within loop body

Rename unused `component_id` to `_component_id`

(B007)

</details>
<details>
<summary>diffgraph/cli.py</summary>

238-238: `subprocess` call: check for execution of untrusted input

(S603)

---

238-238: Starting a process with a partial executable path

(S607)

---

240-240: Starting a process without a shell

(S606)

---

242-242: `subprocess` call: check for execution of untrusted input

(S603)

---

242-242: Starting a process with a partial executable path

(S607)

</details>
<details>
<summary>diffgraph/graph_export.py</summary>

196-196: `pickle` and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

---

273-273: Avoid specifying long messages outside the exception class

(TRY003)

</details>
<details>
<summary>test_structured_export.py</summary>

1-1: Shebang is present but file is not executable

(EXE001)

</details>
<details>
<summary>diffgraph/structured_export.py</summary>

90-90: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

---

91-91: `subprocess` call: check for execution of untrusted input

(S603)

---

100-100: Do not catch blind exception: `Exception`

(BLE001)

---

177-177: Unused function argument: `graph_manager`

(ARG001)

---

248-248: Unused function argument: `graph_manager`

(ARG001)

---

405-405: PEP 484 prohibits implicit `Optional`

Convert to `T | None`

(RUF013)

</details>

</details>

</details>

<details>
<summary>🔇 Additional comments (11)</summary><blockquote>

<details>
<summary>setup.py (1)</summary><blockquote>

`5-5`: **LGTM!**

The version bump to 1.2.0 appropriately reflects the new graph export features being added in this PR.

</blockquote></details>
<details>
<summary>TESTING_GUIDE.md (1)</summary><blockquote>

`1-302`: **Excellent testing documentation!**

This comprehensive testing guide provides clear instructions for verifying the new graph export feature across multiple scenarios. The step-by-step examples and troubleshooting section will be valuable for both users and contributors.

</blockquote></details>
<details>
<summary>GRAPH_EXPORT_FEATURE.md (1)</summary><blockquote>

`1-187`: **Comprehensive feature documentation!**

This document provides excellent coverage of the graph export feature, including clear usage examples, data structure descriptions, and implementation details. The backward compatibility note is particularly helpful.

</blockquote></details>
<details>
<summary>diffgraph/__init__.py (1)</summary><blockquote>

`5-5`: **LGTM!**

Version bump to 1.2.0 is consistent with setup.py and appropriately reflects the new features.

</blockquote></details>
<details>
<summary>test_graph_export.py (1)</summary><blockquote>

`13-111`: **Well-structured test coverage!**

The test function comprehensively validates all export formats (JSON, pickle, GraphML) with proper assertions for data integrity and file creation. The cleanup step is a good practice.

</blockquote></details>
<details>
<summary>README.md (2)</summary><blockquote>

`9-79`: **Clear documentation of new features!**

The updated usage section effectively communicates the new graph export capabilities with clear examples for all supported formats (JSON, Pickle, GraphML).

---

`81-184`: **Excellent output format documentation!**

The new "Output Formats" section provides comprehensive coverage of both HTML reports and graph data export. The structured JSON format explanation with categorization and code examples is particularly helpful for users integrating with the tool.

</blockquote></details>
<details>
<summary>diffgraph/cli.py (1)</summary><blockquote>

`211-242`: **LGTM! Clean format-based routing.**

The implementation correctly branches between graph export and HTML report generation based on the `--format` flag. The use of structured JSON for `--graph-format json` and NetworkX format for pickle/graphml is well-aligned with the Phase 1 design.



Note: The static analysis warnings about subprocess calls (lines 238, 240, 242) are false positives—these commands open the browser with system utilities and don't execute untrusted input.

</blockquote></details>
<details>
<summary>test_structured_export.py (1)</summary><blockquote>

`19-284`: **Excellent test coverage for Phase 1!**

The test suite comprehensively validates:
- File classification across all categories (auto-generated, docs, config, source)
- Component transformation with Phase 2 null fields
- End-to-end structured export with proper categorization
- Graph completeness (all edges reference valid nodes)
- Empty graph handling

This provides strong validation of the Phase 1 structured export implementation.



Minor: Consider making the file executable since it has a shebang (line 1).

</blockquote></details>
<details>
<summary>diffgraph/graph_export.py (1)</summary><blockquote>

`185-198`: **Document pickle security considerations.**

Pickle can be unsafe when deserializing untrusted data. Since this function loads graph data that may come from external sources, consider documenting the security implications.



Add a security warning to the docstring:
```diff
 def load_graph_from_pickle(pickle_path: str) -> GraphManager:
     """
     Load graph data from a pickle file.
+    
+    ⚠️  Security Warning: Only load pickle files from trusted sources.
+    Pickle can execute arbitrary code during deserialization.
     
     Args:
         pickle_path: Path to the pickle file

Alternatively, if pickle files are only used internally or in trusted environments, document that assumption.

diffgraph/structured_export.py (1)

175-202: Unused parameter is acceptable for Phase 1.

The graph_manager parameter is currently unused but is likely needed for Phase 2 enhancements (relationship type detection based on component analysis). Keeping it maintains a consistent API.

Similar unused parameter at line 248 in transform_file_edge is also acceptable for Phase 2 extensibility.

Comment on lines +305 to +360
def export_to_dict(self) -> dict:
"""
Export the graph manager state to a dictionary.
This is useful for serialization and can be passed to graph_export functions.
Returns:
Dictionary containing all graph data and metadata
"""
from networkx.readwrite import json_graph

# Convert file nodes to serializable format
file_nodes_data = {}
for file_path, node in self.file_nodes.items():
file_nodes_data[file_path] = {
'path': node.path,
'status': node.status.value,
'change_type': node.change_type.value,
'summary': node.summary,
'error': node.error,
'components': [
{
'name': c.name if hasattr(c, 'name') else str(c),
'change_type': c.change_type if hasattr(c, 'change_type') else 'unknown',
'summary': c.summary if hasattr(c, 'summary') else None
} if hasattr(c, '__dict__') else str(c)
for c in (node.components or [])
]
}

# Convert component nodes to serializable format
component_nodes_data = {}
for component_id, node in self.component_nodes.items():
component_nodes_data[component_id] = {
'name': node.name,
'file_path': node.file_path,
'change_type': node.change_type.value,
'component_type': node.component_type,
'parent': node.parent,
'summary': node.summary,
'dependencies': node.dependencies,
'dependents': node.dependents
}

# Convert graphs to node-link format
file_graph_data = json_graph.node_link_data(self.file_graph, edges="links")
component_graph_data = json_graph.node_link_data(self.component_graph, edges="links")

# Combine all data
return {
'version': '1.0',
'file_nodes': file_nodes_data,
'component_nodes': component_nodes_data,
'file_graph': file_graph_data,
'component_graph': component_graph_data,
'processed_files': list(self.processed_files)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Code duplication with graph_export.py.

The export_to_dict() method duplicates the serialization logic in diffgraph/graph_export.py's _prepare_graph_data() function (lines 109-166). Both functions perform identical transformations of file_nodes, component_nodes, and graphs.

Consider one of these approaches:

Option 1 (Recommended): Make _prepare_graph_data call export_to_dict

# In diffgraph/graph_export.py
def _prepare_graph_data(graph_manager: GraphManager) -> Dict[str, Any]:
-    # Convert file nodes to serializable format
-    file_nodes_data = {}
-    for file_path, node in graph_manager.file_nodes.items():
-        ...
-    # ... rest of duplication
-    return data
+    """Prepare graph data for serialization by delegating to GraphManager."""
+    return graph_manager.export_to_dict()

Option 2: Remove export_to_dict and keep logic in graph_export.py

If export_to_dict isn't used elsewhere, consider removing it entirely and keeping serialization logic centralized in the graph_export module.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In diffgraph/graph_manager.py around lines 305-360, export_to_dict duplicates
the serialization logic already implemented in
diffgraph/graph_export.py::_prepare_graph_data (lines ~109-166); replace the
duplicated implementation by delegating to that centralized function (import
_prepare_graph_data from diffgraph.graph_export and return its result) or, if
export_to_dict is unused elsewhere, remove this method and update callers to use
graph_export._prepare_graph_data directly; ensure the function signatures match,
import paths are correct, update any unit tests or call sites, and keep returned
structure/version identical to avoid breaking callers.

Comment on lines +89 to +103
try:
cmd = ['git', 'diff', '--numstat'] + diff_args + ['--', file_path]
result = subprocess.run(cmd, capture_output=True, text=True, check=False)

if result.returncode == 0 and result.stdout.strip():
# Output format: "additions\tdeletions\tfilename"
parts = result.stdout.strip().split('\t')
if len(parts) >= 2:
additions = int(parts[0]) if parts[0] != '-' else 0
deletions = int(parts[1]) if parts[1] != '-' else 0
return {'additions': additions, 'deletions': deletions}
except Exception as e:
print(f"Warning: Could not get stats for {file_path}: {e}")

return {'additions': 0, 'deletions': 0}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify git diff arguments are sanitized.

The get_file_stats function passes diff_args directly to a subprocess command. If these arguments come from user input, this could be a security risk.

Verify that diff_args are sanitized before reaching this function. Based on diffgraph/cli.py (lines 43-46), it appears sanitize_diff_args is used in the CLI flow. Confirm this sanitization also applies to the structured export path:


🏁 Script executed:

#!/bin/bash
# Check if diff_args are sanitized before reaching get_file_stats
rg -n "export_structured_json" -A 10 -B 5 --type=py

Length of output: 7168


I need to see more context around the export_structured_json call in cli.py to verify if diff_args are sanitized before reaching that function.

I need to see the actual implementation context around line 216 in cli.py to verify if diff_args are sanitized before passing to export_structured_json.

Based on my research, I can confirm that git commands can be exploited through argument injection options, and it's possible to execute arbitrary commands even when using safe APIs. The recommended approach is to use '--' to separate options from paths/arguments.

The security concern in the original review comment is valid. However, I need to examine the actual code implementation in cli.py to verify whether sanitize_diff_args is properly applied before the export_structured_json call. Let me check the code directly to complete the verification.


---

🏁 Script executed:

```shell
#!/bin/bash
# Get the full context around line 216 of cli.py to see diff_args preparation
rg -n "export_structured_json" --type=py -B 30 diffgraph/cli.py | head -60

Length of output: 2235


I need to find where diff_args is defined in the CLI function to verify if it's sanitized before reaching line 216.


---

🏁 Script executed:

```shell
#!/bin/bash
# Find the function definition that contains line 216 and trace diff_args origin
rg -n "def " --type=py diffgraph/cli.py | grep -B1 "^[0-9]*:def"

Length of output: 336



---

🏁 Script executed:

```shell
#!/bin/bash
# Get the implementation of main function starting at line 143
sed -n '143,250p' diffgraph/cli.py

Length of output: 4849


Now I need to check if get_changed_files or load_file_contents sanitize the diff_args before using them.


---

🏁 Script executed:

```shell
#!/bin/bash
# Get the implementation of get_changed_files starting at line 31
sed -n '31,90p' diffgraph/cli.py

Length of output: 2093


Excellent - I found that get_changed_files does sanitize diff_args at line 44. However, the unsanitized diff_args are also passed directly to export_structured_json at line 216. Let me verify if export_structured_json sanitizes them before calling get_file_stats.


---

🏁 Script executed:

```shell
#!/bin/bash
# Get export_structured_json implementation to see if it sanitizes diff_args
sed -n '403,433p' diffgraph/structured_export.py

Length of output: 1081



---

🏁 Script executed:

```shell
#!/bin/bash
# Find transform_to_structured_format function and check if it sanitizes diff_args
rg -n "def transform_to_structured_format" --type=py -A 50 diffgraph/structured_export.py | head -70

Length of output: 2018


Unsanitized diff_args create a git command injection vulnerability in the structured export path.

The verification confirms the review comment's concern is valid. The diff_args flow from the CLI is: main() (line 159, raw args) → export_structured_json() (line 216) → transform_to_structured_format() (line 426) → get_file_stats() (lines 333-336) → subprocess (line 91).

Unlike get_changed_files() which calls sanitize_diff_args() for its own use, the structured export path does not sanitize diff_args before passing them to get_file_stats(). At line 91 of structured_export.py, the unsanitized arguments are directly concatenated into the git subprocess command:

cmd = ['git', 'diff', '--numstat'] + diff_args + ['--', file_path]

This is a security vulnerability that should be fixed by calling sanitize_diff_args() either:

  • In export_structured_json() before passing to transform_to_structured_format(), or
  • At the start of transform_to_structured_format() before using diff_args
🧰 Tools
🪛 Ruff (0.14.2)

90-90: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


91-91: subprocess call: check for execution of untrusted input

(S603)


100-100: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In diffgraph/structured_export.py around lines 89 to 103, diff_args is
concatenated directly into the git subprocess command causing a command
injection vulnerability; fix by sanitizing diff_args before it's used — call the
existing sanitize_diff_args() on diff_args either in export_structured_json()
before passing diff_args into transform_to_structured_format(), or at the start
of transform_to_structured_format(); ensure the sanitized list replaces the
original diff_args variable so the subprocess command becomes
['git','diff','--numstat'] + sanitized_diff_args + ['--', file_path].

'total_files_changed': total_files,
'total_additions': total_additions,
'total_deletions': total_deletions,
'analyzer_version': '1.2.0' # TODO: Get from package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Replace TODO with actual version.

The analyzer version is hardcoded with a TODO comment.

Import the version from the package:

+from . import __version__
+
 # ...
         'metadata': {
             'analyzed_at': datetime.now(timezone.utc).isoformat(),
             'diff_base': diff_base,
             'diff_target': diff_target,
             'total_files_changed': total_files,
             'total_additions': total_additions,
             'total_deletions': total_deletions,
-            'analyzer_version': '1.2.0'  # TODO: Get from package
+            'analyzer_version': __version__
         },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
'analyzer_version': '1.2.0' # TODO: Get from package
from . import __version__
# ... other code ...
'metadata': {
'analyzed_at': datetime.now(timezone.utc).isoformat(),
'diff_base': diff_base,
'diff_target': diff_target,
'total_files_changed': total_files,
'total_additions': total_additions,
'total_deletions': total_deletions,
'analyzer_version': __version__
},
🤖 Prompt for AI Agents
In diffgraph/structured_export.py around line 394, the analyzer_version is
hardcoded as '1.2.0' with a TODO; replace the hardcoded value by importing the
package version (e.g., from your package's __init__ or version module) and use
that imported variable for 'analyzer_version'. Add the appropriate import at the
top of the file, handle import errors if needed (fallback or raise), and remove
the TODO comment.

@@ -0,0 +1,88 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Make the script executable.

The shebang is present, but the file lacks executable permissions.

Run:

chmod +x example_usage.py
🧰 Tools
🪛 Ruff (0.14.2)

1-1: Shebang is present but file is not executable

(EXE001)

🤖 Prompt for AI Agents
In example_usage.py at line 1, the file has a shebang but lacks executable
permission; make it executable by setting the POSIX execute bit for the intended
users (e.g., run chmod +x example_usage.py) so the script can be run directly
from the shell while retaining current owner/group and other permissions.

@@ -0,0 +1,145 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Make the script executable.

The shebang is present, but the file lacks executable permissions.

Run:

chmod +x test_cli_manual.sh
🤖 Prompt for AI Agents
In test_cli_manual.sh around line 1, the script has a shebang but lacks
executable permissions; fix by setting the file mode to include the execute bit
for the appropriate user(s) (e.g., give the owner execute permission) so the
script can be run directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants