From a76c8cd3d50052f58683f3da38501bf907d78bdf Mon Sep 17 00:00:00 2001
From: Ranjana Babu
Date: Sat, 21 Jun 2025 16:18:02 +0530
Subject: [PATCH 1/2] add contribution_plan.md file

---
 contribution_plan.md | 67 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 contribution_plan.md

diff --git a/contribution_plan.md b/contribution_plan.md
new file mode 100644
index 0000000000000..843c1b0a11203
--- /dev/null
+++ b/contribution_plan.md
@@ -0,0 +1,67 @@
+## 1. Basic Information
+
+- **Project Name:** pandas
+- **GitHub URL:** [https://github.com/pandas-dev/pandas](https://github.com/pandas-dev/pandas)
+- **Primary Language(s):**
+  - Python (core language)
+  - C / Cython (for performance-critical components)
+
+- **What is the project used for?**
+  pandas is a powerful, open-source library used for:
+  - Data manipulation and analysis
+  - Working with structured data (like CSV, Excel, SQL, JSON)
+  - Offering key data structures: `DataFrame` and `Series`
+  - Enabling fast, flexible operations for data cleaning, filtering, grouping, merging, and more
+  It's widely used in data science, machine learning, finance, and research.
+
+---
+
+## 2. Contribution Guidelines
+
+- **Are there clear steps in a CONTRIBUTING.md file?**
+  No
+
+- **Is there a Code of Conduct?**
+  ✅ Yes, the project follows a [Code of Conduct](https://github.com/pandas-dev/pandas/blob/main/.github/CODE_OF_CONDUCT.md) based on the Contributor Covenant to ensure a welcoming and respectful community.
+
+- **Is a CLA (Contributor License Agreement) needed?**
+  ❌ No Contributor License Agreement is required for contributing to pandas.
+
+- **Are first-time contributors welcomed?**
+  ✅ Yes, very much! The project:
+  - Labels beginner-friendly issues (`good first issue`)
+  - Offers clear contribution steps
+  - Encourages community interaction on GitHub discussions and issues
+
+---
+
+## 3. Environment Setup
+
+- **How do you set up the project locally?**
+
+  1. **Install Anaconda**
+     - Download from [https://www.anaconda.com](https://www.anaconda.com)
+
+  2. **Create a conda environment**
+     conda create -n pandas-dev python=3.10 -y
+     conda activate pandas-dev
+
+
+  3. **Clone the GitHub repository**
+     git clone https://github.com/pandas-dev/pandas.git
+     cd pandas
+
+  4. **Install dependencies**
+     pip install -r requirements-dev.txt
+
+  5. **(Optional) Build pandas from source**
+     python setup.py build_ext --inplace
+
+  6. **(Optional) Run tests**
+     pytest pandas
+
+- **Any dependencies or setup steps?**
+  Yes, dependencies are managed through `requirements-dev.txt` and include:
+  - `numpy`, `cython`
+  - `pytest`, `mypy`, `black`, `flake8`
+  - `isort`, `versioneer`, and others required for linting, testing, and building

From db61986b1eff8fc6565a0135483590decb259f01 Mon Sep 17 00:00:00 2001
From: Ranjana Babu
Date: Mon, 23 Jun 2025 16:12:26 +0530
Subject: [PATCH 2/2] add contribution_plan.md file

---
 contribution_plan.md | 129 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 98 insertions(+), 31 deletions(-)

diff --git a/contribution_plan.md b/contribution_plan.md
index 843c1b0a11203..b95a8927a7d28 100644
--- a/contribution_plan.md
+++ b/contribution_plan.md
@@ -1,3 +1,4 @@
+
 ## 1. Basic Information
 
 - **Project Name:** pandas
@@ -19,7 +20,12 @@
 ## 2. Contribution Guidelines
 
 - **Are there clear steps in a CONTRIBUTING.md file?**
-  No
+  ❌ No. The project uses a `contributing.rst` file instead of `CONTRIBUTING.md`. This file provides comprehensive guidelines for contributing to pandas, including:
+  1. Accepted contribution types such as bug fixes, documentation updates, feature enhancements, and suggestions.
+  2. Steps to identify suitable tasks by selecting issues labeled as "good first issue" or "Docs".
+  3. A version control workflow that involves: Forking the repository → Cloning it locally → Creating a new branch → Making changes → Submitting a Pull Request (PR).
+  4. Instructions for setting up the development environment using conda and regularly syncing with the upstream main branch.
+  5. Best practices for writing meaningful commit messages, referencing related issues, and ensuring all tests pass before submission.
 
 - **Is there a Code of Conduct?**
   ✅ Yes, the project follows a [Code of Conduct](https://github.com/pandas-dev/pandas/blob/main/.github/CODE_OF_CONDUCT.md) based on the Contributor Covenant to ensure a welcoming and respectful community.
@@ -33,35 +39,96 @@
   - Offers clear contribution steps
   - Encourages community interaction on GitHub discussions and issues
 
----
-
 ## 3. Environment Setup
 
-- **How do you set up the project locally?**
-
-  1. **Install Anaconda**
-     - Download from [https://www.anaconda.com](https://www.anaconda.com)
-
-  2. **Create a conda environment**
-     conda create -n pandas-dev python=3.10 -y
-     conda activate pandas-dev
-
-
-  3. **Clone the GitHub repository**
-     git clone https://github.com/pandas-dev/pandas.git
-     cd pandas
-
-  4. **Install dependencies**
-     pip install -r requirements-dev.txt
-
-  5. **(Optional) Build pandas from source**
-     python setup.py build_ext --inplace
-
-  6. **(Optional) Run tests**
-     pytest pandas
-
-- **Any dependencies or setup steps?**
-  Yes, dependencies are managed through `requirements-dev.txt` and include:
-  - `numpy`, `cython`
-  - `pytest`, `mypy`, `black`, `flake8`
-  - `isort`, `versioneer`, and others required for linting, testing, and building
+### Steps to Set Up Locally:
+
+1. Fork the repository on GitHub to your account.
+2. Clone the repository locally:
+   ```bash
+   git clone https://github.com//pandas.git
+   cd pandas
+   ```
+3. Create and activate a development environment using conda:
+   ```bash
+   conda create -n devenv python=3.10
+   conda activate devenv
+   ```
+4. Install development dependencies:
+   ```bash
+   pip install -r requirements-dev.txt
+   ```
+5. Build the C extensions required by pandas:
+   ```bash
+   python setup.py build_ext --inplace
+   ```
+6. (Optional but recommended) Run the test suite to validate your environment:
+   ```bash
+   pytest pandas/tests/
+   ```
+
+## 4. Making a Contribution
+
+- **Open Issue Chosen:**
+  [BUG: Groupby aggregate coercion of outputs inconsistency for pyarrow dtypes #61636](https://github.com/pandas-dev/pandas/issues/61636)
+
+- **Issue Summary:**
+  When using `groupby(...).agg()` on PyArrow-backed DataFrames, the output types are sometimes inconsistently coerced to pandas-native dtypes like `float64`, rather than preserving the original PyArrow dtypes. This leads to unexpected results and breaks downstream workflows that rely on dtype stability.
+
+### Steps to Resolve the Issue:
+
+1. Reproduce the issue locally by creating a DataFrame backed by PyArrow dtypes and performing `groupby(...).agg()` with functions like `'sum'`, `'first'`, etc.
+2. Implement a fix to ensure aggregation on PyArrow-backed DataFrames:
+   - Maintain the original PyArrow dtypes wherever applicable.
+   - Avoid coercion unless required by the aggregation operation.
+3. Modify the logic in the relevant pandas core modules (likely `core/groupby/aggregation.py` or `core/groupby/groupby.py`).
+4. Add targeted unit tests under `pandas/tests/groupby/` to cover aggregation behavior on PyArrow-backed data.
+5. Run all tests to ensure that the issue is resolved and no regressions are introduced.
+
+## 5. Create a Pull Request Plan
+
+### Pull Request Workflow:
+
+1. Create a new feature branch:
+   ```bash
+   git checkout -b fix-groupby-coercion-pyarrow
+   ```
+
+2. Make the required code changes in the appropriate files.
+
+3. Add and commit the changes:
+   ```bash
+   git add .
+   git commit -m "BUG: Fix aggregate dtype coercion on pyarrow-backed GroupBy (#61636)"
+   ```
+
+4. Push the changes to your fork:
+   ```bash
+   git push origin fix-groupby-coercion-pyarrow
+   ```
+
+5. Open a Pull Request in GitHub from your branch to `pandas-dev/pandas:main`.
+
+### Example PR Title:
+```
+BUG: Fix aggregate dtype coercion on pyarrow-backed GroupBy (#61636)
+```
+
+### PR Description:
+```
+This PR addresses [#61636](https://github.com/pandas-dev/pandas/issues/61636), which reports inconsistent dtype coercion during groupby aggregation on PyArrow-backed DataFrames. Specifically, aggregations like 'sum' or 'first' on columns with Arrow dtypes (e.g., int32, uint64) may return outputs with unexpected pandas-native dtypes like float64.
+
+The fix ensures that aggregation operations on Arrow-backed columns preserve the original Arrow dtypes wherever possible, improving consistency and reliability for downstream workflows.
+
+New unit tests have been added to validate aggregation outputs for PyArrow-backed DataFrames and confirm dtype stability.
+
+Closes #61636.
+```
+
+### Testing the Fix:
+
+- Run the test suite using:
+  ```bash
+  pytest pandas/tests/groupby/
+  ```
+- Confirm that all tests pass and that the new tests adequately cover the issue scenario involving Arrow dtypes.
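
A possible starting point for the reproduction step in section 4 and for the new tests mentioned above is a small dtype report. This is a minimal sketch, not code from pandas or from the issue: the helper names `agg_dtype_report` and `make_arrow_frame`, the column names, and the sample values are all hypothetical, and it assumes pandas 2.0 or later with `pyarrow` installed. It only records input and output dtypes per aggregation; it does not assert what the fixed behaviour should be.

```python
import pandas as pd


def agg_dtype_report(df, key, funcs):
    """Map each aggregation name to {column: (input dtype, output dtype)}."""
    report = {}
    for func in funcs:
        result = df.groupby(key).agg(func)
        report[func] = {
            col: (str(df[col].dtype), str(result[col].dtype))
            for col in result.columns
        }
    return report


def make_arrow_frame():
    """Build a small PyArrow-backed frame (hypothetical sample data)."""
    import pyarrow as pa  # optional dependency, imported lazily

    return pd.DataFrame(
        {
            "key": pd.Series(["a", "a", "b"], dtype=pd.ArrowDtype(pa.string())),
            "small": pd.Series([1, 2, 3], dtype=pd.ArrowDtype(pa.int32())),
            "big": pd.Series([10, 20, 30], dtype=pd.ArrowDtype(pa.uint64())),
        }
    )


if __name__ == "__main__":
    try:
        frame = make_arrow_frame()
    except ImportError:
        print("pyarrow is not installed; skipping the Arrow-backed example")
    else:
        # Print one line per (aggregation, column) so coercions are easy to spot.
        for func, columns in agg_dtype_report(frame, "key", ["sum", "first"]).items():
            for col, (before, after) in columns.items():
                print(f"{func:>5} {col:>5}: {before} -> {after}")
```

Any line where the dtype on the right differs from the one on the left is a candidate for a regression test under `pandas/tests/groupby/`. Keeping the helper independent of the dtype backend means the same report can be run on a NumPy-backed frame for comparison.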