Conversation


@sciapan sciapan commented Oct 15, 2025

Summary

  • switch the consultant request payload to populate the new names field instead of dataSources
  • simplify format_data_source_names to return plain name strings and refresh related unit tests

Testing

  • uv run pytest (fails: unable to reach pypi.org/fastmcp)

https://chatgpt.com/codex/tasks/task_e_68eff26921a0832ba5f07cdb3d1e3f9a
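
A minimal sketch of the payload change described in the summary. Everything here except the names/dataSources field names is an assumption for illustration, not code from this PR:

# Hypothetical request builder showing the shape of the payload change.
def build_consultant_request(question: str, names: list[str]) -> dict:
    return {
        "question": question,
        "names": names,        # new field: plain data source name strings
        # "dataSources": ids,  # old field: no longer populated by the client
    }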

@gemini-code-assist

Summary of Changes

Hello @sciapan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors how data sources are handled and communicated within the chat consultant and search tools. The core change involves transitioning from using internal data source IDs to more user-friendly data source names in API requests and function parameters. This update streamlines the interface for specifying data sources, improves clarity in the codebase, and prepares for future enhancements, while also ensuring backward compatibility for existing integrations.

Highlights

  • API Payload Update: The codebase_consultant tool now sends data source information using a names field in the request payload instead of the deprecated dataSources field.
  • Parameter Renaming and Clarification: The data_source_ids parameter in both codebase_consultant and codebase_search functions has been renamed to data_sources to reflect that it now expects human-readable names rather than internal IDs. Documentation and examples have been updated accordingly.
  • Simplified Data Source Formatting: The utility functions format_data_source_names and normalize_data_source_names have been simplified. They now primarily extract and return plain string names from various input formats, including dictionaries that might contain name or id fields.
  • Backward Compatibility: Aliases have been introduced (normalize_data_source_ids = normalize_data_source_names, format_data_source_ids = format_data_source_names) to ensure that existing code relying on the old _ids function names continues to work without immediate breakage.
  • Test Suite Refinement: All relevant unit tests have been updated to reflect the new parameter names, payload structure, and simplified data source formatting logic, ensuring the changes behave as expected.
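
For illustration, a hedged sketch of what the simplified formatter and one of the compatibility aliases might look like, pieced together from the PR description and the snippets quoted later in this thread; the real utils code may differ in signature and module layout:

from typing import Any, Iterable, List


def format_data_source_names(data_sources: Iterable[Any]) -> List[str]:
    # Return plain name strings; dict entries may carry a "name" or legacy "id" key.
    formatted: List[str] = []
    for ds in data_sources or []:
        if isinstance(ds, str):
            name = ds.strip()
        elif isinstance(ds, dict):
            value = ds.get("name") or ds.get("id")
            name = str(value).strip() if value is not None else ""
        else:
            name = str(ds).strip()
        if name:
            formatted.append(name)
    return formatted


# Backward compatibility alias for legacy imports
format_data_source_ids = format_data_source_names

Call sites can then populate the request's names field with these plain strings instead of sending the old dataSources IDs.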
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively refactors the codebase to use data source names instead of IDs when communicating with the backend, simplifying client-side logic. The changes are well-tested and cover all relevant tool implementations and their documentation. I've identified one high-severity issue regarding an inconsistency in how data source names are processed in the search tool compared to the chat tool, which could lead to bugs. I've also included a couple of medium-severity suggestions to improve code maintainability in the utility functions. Overall, this is a good change that improves the tool's interface.

- # Normalize data source IDs (handles Claude Desktop serialization issues)
- data_source_ids = normalize_data_source_ids(data_source_ids)
+ # Normalize data source names (handles Claude Desktop serialization issues)
+ data_source_names = normalize_data_source_names(data_sources)


high

There's an inconsistency here compared to tools/chat.py. The codebase_consultant function uses both normalize_data_source_names and format_data_source_names to process the data_sources. This ensures names are cleaned (e.g., whitespace stripped and empty values removed). This function only normalizes but doesn't format, which could lead to sending un-trimmed or empty names to the API.

To ensure consistent behavior, you should also apply format_data_source_names here. Note that you'll need to import format_data_source_names from utils for this change to work.

Suggested change
- data_source_names = normalize_data_source_names(data_sources)
+ data_source_names = normalize_data_source_names(data_sources)
+ data_source_names = format_data_source_names(data_source_names)

Comment on lines +100 to +107
elif isinstance(ds, dict):
    name = ds.get("name") or ds.get("id")
    if isinstance(name, str):
        name = name.strip()
        if name:
            formatted.append(name)
    elif name is not None:
        formatted.append(str(name))


medium

The logic inside this elif block can be simplified to improve readability and handle both string and non-string names more consistently.

Suggested change
- elif isinstance(ds, dict):
-     name = ds.get("name") or ds.get("id")
-     if isinstance(name, str):
-         name = name.strip()
-         if name:
-             formatted.append(name)
-     elif name is not None:
-         formatted.append(str(name))
+ elif isinstance(ds, dict):
+     name = ds.get("name") or ds.get("id")
+     if name is not None:
+         name_str = str(name).strip()
+         if name_str:
+             formatted.append(name_str)

return formatted


# Backward compatibility aliases for legacy imports


medium

It's good practice to add a TODO comment here to remind developers to remove these backward compatibility aliases in the future, once all call sites have been updated to use the new _names functions.

Suggested change
- # Backward compatibility aliases for legacy imports
+ # TODO: Remove these backward compatibility aliases once all call sites are updated.
+ # Backward compatibility aliases for legacy imports

@codealive-github-integration

🔍 Code Review

Hello sciapan (@sciapan),

Thanks for your work on sending chat data source names! This change switches chat requests to identify data sources by human-readable names rather than internal IDs. Overall, the implementation takes a flexible approach to data source naming, but it also highlights critical areas of robustness and security that need attention.

🎯 Key Changes & Highlights

✨ Flexible Data Source Naming

The introduction of data_source_names and the normalize_data_source_names function provides a highly flexible way to handle various input types for data source identification. This adaptability can simplify client-side integration by gracefully coercing different formats into a consistent list of strings, improving developer experience.
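
As a concrete illustration, here is a minimal sketch of such a normalization helper, based on the snippets quoted elsewhere in this review (it already includes the JSON error handling that the next issue flags as missing); the real function's exact behavior may differ:

import json
from typing import Any, List, Optional, Union


def normalize_data_source_names(value: Union[str, List[Any], None]) -> Optional[List[Any]]:
    # Coerce common client inputs into a list; sketch only, not the real implementation.
    if value is None or isinstance(value, list):
        return value
    if isinstance(value, str):
        stripped = value.strip()
        if stripped.startswith("["):
            try:
                parsed = json.loads(stripped)
                if isinstance(parsed, list):
                    return parsed
            except json.JSONDecodeError:
                pass  # fall back to treating the input as a single plain name
        return [stripped] if stripped else []
    return [value]


# Examples of the flexibility described above:
normalize_data_source_names('["enterprise-platform", "payments"]')  # JSON string -> list
normalize_data_source_names("enterprise-platform")                  # bare string -> ["enterprise-platform"]
normalize_data_source_names([{"name": "payments"}, "platform"])     # lists pass through unchanged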

🚨 Issues & Suggestions

🐛 Missing JSON Error Handling

The `normalize_data_source_names` function lacks error handling for invalid JSON strings, potentially causing application crashes.

Problem:
The normalize_data_source_names function in src/utils/errors.py attempts to parse string inputs as JSON if they start with [. However, this json.loads() call is not wrapped in a try-except json.JSONDecodeError block. This omission is a regression from previous similar logic and means that an invalid JSON string will cause a json.JSONDecodeError, leading to an application crash instead of graceful fallback.

Fix:
Reintroduce a try-except json.JSONDecodeError block around the json.loads(stripped) call within normalize_data_source_names. If a JSONDecodeError occurs, catch it and proceed as if the input was a plain string. This restores the function's robustness and prevents unexpected crashes.

# Example of fix:
if stripped.startswith('['):
    try:
        parsed = json.loads(stripped)
        if isinstance(parsed, list):
            return parsed
    except json.JSONDecodeError:
        # Fallback to treating as a plain string if JSON parsing fails
        pass

🔐 Backend Security Burden

Client-side flexibility in data source naming places a significant security burden on the backend, requiring robust validation.

Problem:
The normalize_data_source_names and format_data_source_names functions are designed with extreme flexibility, coercing various input types into a list of strings. While convenient, this client-side permissiveness shifts a substantial security burden to the backend. If the backend's resolution logic for data_source_names is vulnerable to injection (e.g., SQL, command) or allows access control bypass through malicious names, this design could contribute to an exploit. This is a critical architectural boundary concern.

Fix:
Ensure the backend service responsible for resolving data_source_names to internal IDs implements strict input validation, sanitization, and authorization checks. Specifically:

  1. Input Validation: Validate the format and content of data_source_names against expected patterns (e.g., regex for valid names) to reject malformed or suspicious inputs early.
  2. Sanitization/Escaping: Properly sanitize or escape data_source_names before using them in any backend operations (e.g., database queries, file system operations) to prevent injection attacks.
  3. Authorization: After resolving a name to an ID, always perform explicit authorization checks to ensure the requesting user has permission to access the resolved data source.
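
A hedged sketch of these backend-side checks; the name pattern, database schema, and exception choices below are hypothetical and only illustrate the three steps above:

import re
from typing import List

# Hypothetical allow-list pattern for data source names.
VALID_NAME = re.compile(r"^[A-Za-z0-9._ -]{1,128}$")


def resolve_data_source_names(names: List[str], user_id: str, db) -> List[str]:
    resolved_ids: List[str] = []
    for name in names:
        # 1. Input validation: reject malformed or suspicious names early.
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"Invalid data source name: {name!r}")
        # 2. Parameterized query (no string interpolation) prevents SQL injection.
        row = db.execute(
            "SELECT id, owner_id FROM data_sources WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise LookupError(f"Unknown data source: {name!r}")
        # 3. Authorization: confirm the requesting user may access the resolved source.
        if row["owner_id"] != user_id:
            raise PermissionError(f"Not authorized for data source: {name!r}")
        resolved_ids.append(row["id"])
    return resolved_ids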

🏗️ Deprecate _ids Aliases

Aliasing `_ids` functions to `_names` functions introduces architectural ambiguity and reduces maintainability.

Why:
The aliasing of normalize_data_source_ids = normalize_data_source_names and format_data_source_ids = format_data_source_names blurs the conceptual distinction between 'IDs' and 'Names'. This violates the principle of least astonishment, as functions named _ids now operate on 'names', potentially leading to incorrect assumptions, subtle bugs, and increased cognitive load for developers. This creates technical debt by obscuring the true behavior of the system's data source handling and impacts future-proofing.

How:
Plan for a phased migration to explicitly use the _names functions throughout the codebase:

  1. Deprecate and Document: Clearly mark normalize_data_source_ids and format_data_source_ids as deprecated, indicating their new behavior and recommending the _names alternatives.
  2. Migrate Call Sites: Systematically update all call sites that currently use the _ids aliases to use the new _names functions.
  3. Remove Aliases: Once all call sites are migrated, remove the aliasing to restore conceptual clarity and eliminate technical debt.
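
One way to make step 1 explicit during the migration, sketched with Python's warnings module; the wrapper is illustrative only, since the current code simply assigns the aliases directly, and it assumes the _names functions are importable from the project's utils module:

import functools
import warnings


def _deprecated_alias(func, old_name: str):
    # Wrap a function so calls through a legacy name emit a DeprecationWarning.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {func.__name__} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return func(*args, **kwargs)

    return wrapper


# Legacy names keep working but now announce the migration path.
normalize_data_source_ids = _deprecated_alias(normalize_data_source_names, "normalize_data_source_ids")
format_data_source_ids = _deprecated_alias(format_data_source_names, "format_data_source_ids")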

✨ Stricter Input Validation

Implement stricter client-side input validation for `data_source_names` to improve robustness and clarity.

Why:
The current extreme flexibility of normalize_data_source_names and format_data_source_names makes their API contract loose, increasing the likelihood of unexpected or malformed data being passed downstream. Stricter client-side validation makes the code more defensive, reduces the burden on the backend, and makes the functions' behavior more predictable for future developers.

How:
Implement stricter input validation at the entry point of normalize_data_source_names and format_data_source_names. Instead of attempting to coerce any input, these functions should explicitly define and validate against expected input types (e.g., str, list[str], list[dict]). Inputs that do not conform to these expected types should either raise a clear error or be explicitly filtered out with a warning, rather than silently attempting a best-effort conversion.
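
A sketch of what a stricter entry point could look like, assuming the accepted shapes are str, list[str], and list[dict] as suggested; the warn-and-drop policy shown here is one possible choice, not the project's current behavior:

import logging
from typing import Any, Dict, List, Union

logger = logging.getLogger(__name__)


def normalize_data_source_names_strict(
    value: Union[str, List[Union[str, Dict[str, Any]]], None],
) -> List[str]:
    # Accept only the documented input shapes; warn on (and drop) anything else.
    if value is None:
        return []
    if isinstance(value, str):
        name = value.strip()
        return [name] if name else []
    if isinstance(value, list):
        names: List[str] = []
        for item in value:
            if isinstance(item, str):
                name = item.strip()
            elif isinstance(item, dict) and isinstance(item.get("name"), str):
                name = item["name"].strip()
            else:
                logger.warning("Ignoring unsupported data source entry: %r", item)
                continue
            if name:
                names.append(name)
        return names
    raise TypeError(f"data_sources must be a str or list, got {type(value).__name__}")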


🚦 Merge Recommendation

🚫 Request Changes - Critical issues require resolution before merging to ensure stability and security.

Reasoning

The changes introduce valuable functionality, but the identified missing JSON error handling is a critical regression that could lead to application crashes. Additionally, the architectural implications of client-side flexibility on backend security require immediate attention. Addressing these blocking issues will ensure the stability, security, and long-term maintainability of the codebase.


🤖 This review was generated by CodeAlive AI

AI can make mistakes, so please verify all suggestions and use with caution. We value your feedback about this review - please contact us at support@codealive.ai.

💡 Tip: Comment /codealive review to retrigger reviews. You can also manage reviews at https://app.codealive.ai/pull-requests.

codebase_consultant(
    question="How do the microservices communicate?",
-    data_sources=["workspace_123", "repo_456"]
+    data_sources=["workspace:platform", "repo:payments"]
Collaborator

These examples are incorrect: the model will most likely assume these are special prefixes. You could do it like this instead:
data_sources=["repo or workspace name"]

data_sources: List of data source names to search in (required).
Can be workspace names (search all repositories in the workspace)
or individual repository names for targeted searches.
Example: ["enterprise-platform", "workspace:payments-team"]
Collaborator

Again an unnecessary prefix (workspace:).

3. Workspace-wide question:
-    codebase_search(query="How do microservices talk to the billing API?", data_source_ids=["workspace456"])
+    codebase_search(query="How do microservices talk to the billing API?", data_sources=["workspace:backend-team"])
Collaborator

The prefix again.

4. Mixed query with a known identifier:
-    codebase_search(query="Where do we validate JWTs (AuthService)?", data_source_ids=["repo123"])
+    codebase_search(query="Where do we validate JWTs (AuthService)?", data_sources=["repo-main"])
Collaborator

"main" лучше убрать из текста, чтобы не сбивать ллмку с толку о том, что это ветка

@sciapanCA sciapanCA closed this Oct 15, 2025
@sciapanCA sciapanCA deleted the codex/update-mcp-definition-for-data-sources branch October 15, 2025 21:20