@codeflash-ai codeflash-ai bot commented Nov 6, 2025

⚡️ This pull request contains optimizations for PR #4112

If you approve this dependent PR, these changes will be merged into the original PR branch feat/track-text-source.

This PR will be automatically closed if the original PR is merged.


📄 13% (0.13x) speedup for boxes_iou in unstructured/partition/pdf_image/pdfminer_processing.py

⏱️ Runtime : 49.9 milliseconds → 44.2 milliseconds (best of 69 runs)

📝 Explanation and details

The optimization achieves a 13% speedup through three key improvements that target the most time-consuming operations:

1. Eliminated Python-level iteration in get_coords_from_bboxes:

  • Replaced the expensive for i, bbox in enumerate(bboxes): coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2] loop (50.3% of function time) with np.fromiter() using a generator expression
  • This avoids building a temporary list and performing an indexed row assignment for every box; NumPy consumes the generator and fills the output buffer directly
  • The profiler shows this reduces the coordinate extraction overhead significantly
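
The change described in item 1 can be sketched as follows. This is a minimal illustration, not the code from the PR: the `Box` dataclass, both helper names, and the `float32` dtype are assumptions made for the example.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    # Hypothetical stand-in for any object exposing x1, y1, x2, y2
    x1: float
    y1: float
    x2: float
    y2: float


def coords_loop(bboxes):
    # Baseline: one temporary list and one row assignment per box
    coords = np.zeros((len(bboxes), 4), dtype=np.float32)
    for i, bbox in enumerate(bboxes):
        coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]
    return coords


def coords_fromiter(bboxes):
    # Stream all coordinates through one generator, then reshape to (N, 4)
    flat = np.fromiter(
        (v for b in bboxes for v in (b.x1, b.y1, b.x2, b.y2)),
        dtype=np.float32,
        count=4 * len(bboxes),
    )
    return flat.reshape(len(bboxes), 4)


boxes = [Box(0, 0, 10, 10), Box(5, 5, 15, 15)]
```

Passing `count` lets NumPy allocate the output once instead of growing it while the generator is consumed.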

2. Replaced np.split() with direct column slicing:

  • Changed np.split(coords1, 4, axis=1) (7% of function time) to coords1[:, 0:1], coords1[:, 1:2], coords1[:, 2:3], coords1[:, 3:4]
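  • np.split() also returns views, but it goes through Python-level helper code to compute the section boundaries and build a list of sub-arrays; direct slicing produces the same lightweight views without that overhead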
  • This optimization is particularly effective since areas_of_boxes_and_intersection_area accounts for 73.9% of the total boxes_iou runtime
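
A small sketch of the equivalence behind item 2, using a made-up `coords` array. Both forms yield four `(N, 1)` columns with identical values, and the sliced columns share memory with `coords`.

```python
import numpy as np

# Illustrative (N, 4) coordinate array; the values are arbitrary
coords = np.array([[0, 0, 10, 10], [5, 5, 15, 15]], dtype=np.float32)

# np.split returns a list of four (N, 1) sub-arrays
x1_s, y1_s, x2_s, y2_s = np.split(coords, 4, axis=1)

# Direct slicing yields the same (N, 1) columns without the helper call
x1, y1, x2, y2 = coords[:, 0:1], coords[:, 1:2], coords[:, 2:3], coords[:, 3:4]

same_values = all(
    np.array_equal(a, b)
    for a, b in [(x1_s, x1), (y1_s, y1), (x2_s, x2), (y2_s, y2)]
)
shares_memory = np.shares_memory(x1, coords)
```

Using `0:1` rather than `0` as the column index keeps the trailing axis, so downstream broadcasting sees the same shapes as with `np.split`.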

3. Improved broadcasting mechanics:

  • Replaced implicit .T transposes with explicit broadcasting using [:, None] and [None, :] in the denominator calculation
  • This makes NumPy's memory access patterns more predictable and reduces temporary array creation
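
The two ways of forming the pairwise denominator in item 3 can be compared with a short sketch. It assumes the inclusive (+1) area convention implied by the 11x11=121 areas in the generated test comments; the `areas` helper and variable names are hypothetical and the real function may differ in detail.

```python
import numpy as np

boxes1 = np.array([[0, 0, 10, 10]], dtype=np.float64)
boxes2 = np.array([[0, 0, 10, 10], [5, 5, 15, 15]], dtype=np.float64)


def areas(b):
    # Inclusive-pixel convention (+1): a 0..10 box counts as 11 x 11 = 121
    return (b[:, 2] - b[:, 0] + 1) * (b[:, 3] - b[:, 1] + 1)


a1, a2 = areas(boxes1), areas(boxes2)  # shapes (N,) and (M,)

# Pairwise intersection via broadcasting: (N, 1) against (1, M)
w = (np.minimum(boxes1[:, None, 2], boxes2[None, :, 2])
     - np.maximum(boxes1[:, None, 0], boxes2[None, :, 0]) + 1)
h = (np.minimum(boxes1[:, None, 3], boxes2[None, :, 3])
     - np.maximum(boxes1[:, None, 1], boxes2[None, :, 1]) + 1)
inter = np.clip(w, 0, None) * np.clip(h, 0, None)  # shape (N, M)

# Transpose form: column vector plus a transposed column vector
denom_t = a1.reshape(-1, 1) + a2.reshape(-1, 1).T - inter
# Explicit-axis form: the intended (N, M) layout is visible at the call site
denom_b = a1[:, None] + a2[None, :] - inter

iou = inter / denom_b
```

Both denominators have shape `(N, M)` and hold the same values; the explicit form only changes how the broadcast is spelled.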

Test case performance patterns:

  • Highest gains (roughly 45-80% faster) on small NumPy-array inputs, where per-call overhead dominates; small lists of Box objects gain roughly 32-44%
  • Moderate gains (roughly 6-20% faster) on larger NumPy-array tests (100 to 1000 boxes), where the intersection computation becomes the bottleneck
  • Consistent improvements across all input types (Box objects, numpy arrays, edge cases)

The optimizations are particularly valuable for PDF processing workloads where bounding box IoU calculations are performed frequently across many document elements.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 42 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import numpy as np
# imports
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import boxes_iou

EPSILON_AREA = 0.01
DEFAULT_ROUND = 15


# Helper class for box representation
class Box:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2

# ----------------------------
# Basic Test Cases
# ----------------------------

def test_identical_boxes_numpy():
    # Two identical boxes, IoU = 1, should return True for threshold < 1
    box = np.array([[0, 0, 10, 10]])
    codeflash_output = boxes_iou(box, box, threshold=0.5); result = codeflash_output # 85.8μs -> 54.3μs (58.1% faster)

def test_identical_boxes_list():
    # Two identical boxes as list of Box objects
    box = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box], [box], threshold=0.5); result = codeflash_output # 114μs -> 86.6μs (32.4% faster)

def test_no_overlap():
    # Boxes do not overlap at all
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[20, 20, 30, 30]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.1); result = codeflash_output # 76.6μs -> 47.0μs (62.8% faster)

def test_partial_overlap():
    # Boxes partially overlap, IoU < 1
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[5, 5, 15, 15]])
    # Calculate IoU for these boxes
    # intersection: (6x6)=36, union: 121+121-36=206, IoU=36/206~0.174
    codeflash_output = boxes_iou(box1, box2, threshold=0.1); result = codeflash_output # 74.4μs -> 45.8μs (62.4% faster)
    codeflash_output = boxes_iou(box1, box2, threshold=0.2); result = codeflash_output # 60.5μs -> 33.3μs (81.8% faster)

def test_multiple_boxes():
    # Multiple boxes, pairwise IoU
    boxes1 = np.array([[0, 0, 10, 10], [20, 20, 30, 30]])
    boxes2 = np.array([[0, 0, 10, 10], [5, 5, 15, 15]])
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); result = codeflash_output # 85.4μs -> 56.2μs (52.0% faster)

# ----------------------------
# Edge Test Cases
# ----------------------------

def test_zero_area_box():
    # Box with zero area (x1==x2, y1==y2)
    box1 = np.array([[10, 10, 10, 10]])
    box2 = np.array([[10, 10, 20, 20]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.01); result = codeflash_output # 74.0μs -> 44.7μs (65.7% faster)

def test_negative_coordinates():
    # Boxes with negative coordinates
    box1 = np.array([[-10, -10, 0, 0]])
    box2 = np.array([[-5, -5, 5, 5]])
    # Overlap area is 6x6=36, box1 area is 121, box2 area is 121, union=206
    # IoU = 36/206 ~ 0.174
    codeflash_output = boxes_iou(box1, box2, threshold=0.1); result = codeflash_output # 73.0μs -> 44.6μs (63.9% faster)

def test_touching_boxes():
    # Boxes one unit apart; with the inclusive (+1) convention used in the other comments, the intersection width is 10 - 11 + 1 = 0
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[11, 0, 20, 10]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.0); result = codeflash_output # 72.9μs -> 44.3μs (64.4% faster)

def test_empty_inputs():
    # One or both box arrays are empty
    box1 = np.empty((0, 4))
    box2 = np.array([[0, 0, 10, 10]])
    codeflash_output = boxes_iou(box1, box2); result = codeflash_output # 84.7μs -> 55.9μs (51.7% faster)
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.empty((0, 4))
    codeflash_output = boxes_iou(box1, box2); result = codeflash_output # 67.0μs -> 41.0μs (63.4% faster)
    box1 = np.empty((0, 4))
    box2 = np.empty((0, 4))
    codeflash_output = boxes_iou(box1, box2); result = codeflash_output # 65.7μs -> 39.1μs (67.9% faster)

def test_threshold_extremes():
    # Threshold at 0 (any overlap passes) and at 1 (nothing passes, since the comparison is a strict >)
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[0, 0, 10, 10], [5, 5, 15, 15]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.0); result = codeflash_output # 81.2μs -> 52.3μs (55.3% faster)
    codeflash_output = boxes_iou(box1, box2, threshold=1.0); result = codeflash_output # 66.0μs -> 38.4μs (71.9% faster)

def test_float_precision():
    # Boxes that overlap by a tiny amount, check rounding/epsilon
    box1 = np.array([[0, 0, 1e-8, 1e-8]])
    box2 = np.array([[0, 0, 1, 1]])
    # Intersection area is almost zero, union is almost area(box2)
    codeflash_output = boxes_iou(box1, box2, threshold=0.0); result = codeflash_output # 96.3μs -> 66.4μs (45.0% faster)
    codeflash_output = boxes_iou(box1, box2, threshold=0.01); result = codeflash_output # 75.3μs -> 48.5μs (55.2% faster)

def test_non_integer_boxes():
    # Boxes with float coordinates
    box1 = np.array([[0.5, 0.5, 10.5, 10.5]])
    box2 = np.array([[5.5, 5.5, 15.5, 15.5]])
    # Overlap: (6x6)=36, box1: 11x11=121, box2: 11x11=121, union=206, IoU~0.174
    codeflash_output = boxes_iou(box1, box2, threshold=0.15); result = codeflash_output # 93.8μs -> 63.1μs (48.7% faster)

def test_input_as_list_of_boxes():
    # Accepts list of Box objects as input
    box1 = [Box(0, 0, 10, 10)]
    box2 = [Box(0, 0, 10, 10), Box(5, 5, 15, 15)]
    codeflash_output = boxes_iou(box1, box2, threshold=0.5); result = codeflash_output # 113μs -> 85.2μs (33.3% faster)

# ----------------------------
# Large Scale Test Cases
# ----------------------------

def test_large_number_of_boxes():
    # Test with 1000 boxes, all identical, should all overlap perfectly
    N = 1000
    boxes = np.tile(np.array([[0, 0, 10, 10]]), (N, 1))
    codeflash_output = boxes_iou(boxes, boxes, threshold=0.99); result = codeflash_output # 20.2ms -> 18.2ms (11.0% faster)

def test_large_number_of_boxes_partial_overlap():
    # 500 boxes, each shifted by 1; a box exceeds the 0.5 threshold only with itself and neighbors up to 3 positions away
    N = 500
    boxes1 = np.array([[i, 0, i+10, 10] for i in range(N)])
    boxes2 = np.array([[i, 0, i+10, 10] for i in range(N)])
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); result = codeflash_output # 5.22ms -> 4.70ms (11.0% faster)
    # Diagonal should be True (identical), off-diagonal mostly False except for some neighbors
    for i in range(N):
        pass

def test_large_sparse_overlap():
    # 100 boxes, only first overlaps with first, rest are far apart
    N = 100
    boxes1 = np.array([[0, 0, 10, 10]] + [[i*100, 0, i*100+10, 10] for i in range(1, N)])
    boxes2 = np.array([[0, 0, 10, 10]] + [[i*100, 0, i*100+10, 10] for i in range(1, N)])
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); result = codeflash_output # 194μs -> 161μs (20.1% faster)
    # Only diagonal is True
    for i in range(N):
        for j in range(N):
            if i == j:
                pass
            else:
                pass

def test_large_rectangular_grid():
    # 100x10 grid of boxes, all non-overlapping
    rows, cols = 100, 10
    boxes1 = []
    for i in range(rows):
        for j in range(cols):
            boxes1.append([i*20, j*20, i*20+10, j*20+10])
    boxes1 = np.array(boxes1)
    codeflash_output = boxes_iou(boxes1, boxes1, threshold=0.5); result = codeflash_output # 19.0ms -> 17.1ms (10.9% faster)
    # Only diagonal is True
    for i in range(rows*cols):
        if i < rows*cols-1:
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

import numpy as np
# imports
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import boxes_iou

EPSILON_AREA = 0.01
DEFAULT_ROUND = 15


# Helper class for box objects
class Box:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2

# --------- Basic Test Cases ---------
def test_single_identical_box():
    # Test IOU for two identical boxes (should be 1, so always True for threshold < 1)
    box = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box], [box], threshold=0.5); arr = codeflash_output # 146μs -> 105μs (38.3% faster)

def test_single_non_overlapping_boxes():
    # Test IOU for two non-overlapping boxes (IOU=0, should be False)
    box1 = Box(0, 0, 10, 10)
    box2 = Box(20, 20, 30, 30)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 110μs -> 81.7μs (34.7% faster)

def test_partial_overlap_boxes():
    # Test IOU for two partially overlapping boxes
    box1 = Box(0, 0, 10, 10)
    box2 = Box(5, 5, 15, 15)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 108μs -> 79.7μs (35.9% faster)
    # Calculate IOU manually
    inter_area = (10 - 5 + 1) * (10 - 5 + 1) # 6*6=36
    area1 = (10 - 0 + 1) * (10 - 0 + 1) # 11*11=121
    area2 = (15 - 5 + 1) * (15 - 5 + 1) # 11*11=121
    union = area1 + area2 - inter_area
    iou = inter_area / union

def test_multiple_boxes():
    # Test IOU for multiple boxes
    boxes1 = [Box(0, 0, 10, 10), Box(20, 20, 30, 30)]
    boxes2 = [Box(0, 0, 10, 10), Box(25, 25, 35, 35)]
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.1); arr = codeflash_output # 114μs -> 86.2μs (32.9% faster)

def test_numpy_array_input():
    # Test IOU for numpy array input
    arr1 = np.array([[0, 0, 10, 10], [20, 20, 30, 30]])
    arr2 = np.array([[0, 0, 10, 10], [25, 25, 35, 35]])
    codeflash_output = boxes_iou(arr1, arr2, threshold=0.1); arr = codeflash_output # 89.4μs -> 59.0μs (51.6% faster)

# --------- Edge Test Cases ---------
def test_empty_lists():
    # Test IOU with empty lists
    codeflash_output = boxes_iou([], [], threshold=0.5); arr = codeflash_output # 95.4μs -> 64.3μs (48.4% faster)

def test_one_empty_list():
    # Test IOU with one empty list
    box = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box], [], threshold=0.5); arr1 = codeflash_output # 104μs -> 75.0μs (38.8% faster)
    codeflash_output = boxes_iou([], [box], threshold=0.5); arr2 = codeflash_output # 83.8μs -> 58.1μs (44.2% faster)

def test_zero_area_boxes():
    # Test IOU for boxes with zero area (x1==x2 and/or y1==y2)
    box1 = Box(5, 5, 5, 5)
    box2 = Box(5, 5, 5, 5)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 105μs -> 78.3μs (34.4% faster)

def test_negative_coordinates():
    # Test IOU for boxes with negative coordinates
    box1 = Box(-10, -10, 0, 0)
    box2 = Box(-5, -5, 5, 5)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 104μs -> 77.8μs (34.6% faster)

def test_threshold_behavior():
    # Test IOU for threshold at edge values
    box1 = Box(0, 0, 10, 10)
    box2 = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.0); arr_low = codeflash_output # 104μs -> 77.6μs (34.6% faster)
    codeflash_output = boxes_iou([box1], [box2], threshold=1.0); arr_high = codeflash_output # 84.7μs -> 60.6μs (39.9% faster)

def test_float_coordinates():
    # Test IOU for boxes with float coordinates
    box1 = Box(0.0, 0.0, 10.5, 10.5)
    box2 = Box(5.5, 5.5, 15.5, 15.5)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 103μs -> 77.2μs (34.3% faster)

def test_rounding_precision():
    # Test IOU with different rounding precision
    box1 = Box(0, 0, 10.123456789, 10.987654321)
    box2 = Box(0, 0, 10.123456789, 10.987654321)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.5, round_to=2); arr1 = codeflash_output # 103μs -> 76.3μs (36.2% faster)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.5, round_to=15); arr2 = codeflash_output # 84.8μs -> 59.8μs (41.9% faster)

# --------- Large Scale Test Cases ---------
def test_many_boxes():
    # Test IOU for large number of boxes
    N = 100
    # Create 100 boxes, all overlapping with themselves
    boxes = [Box(i, i, i+10, i+10) for i in range(N)]
    codeflash_output = boxes_iou(boxes, boxes, threshold=0.5); arr = codeflash_output # 305μs -> 224μs (36.3% faster)
    # Diagonal should be True (identical boxes); nearby off-diagonal entries are also True, since a one-unit diagonal shift still gives an IoU of about 0.7
    for i in range(N):
        for j in range(N):
            if i == j:
                pass
            else:
                pass

def test_large_numpy_array():
    # Test IOU for large numpy arrays
    N = 200
    arr1 = np.zeros((N, 4))
    arr2 = np.zeros((N, 4))
    # All boxes are at (0,0,10,10)
    arr1[:] = [0, 0, 10, 10]
    arr2[:] = [0, 0, 10, 10]
    codeflash_output = boxes_iou(arr1, arr2, threshold=0.5); arr = codeflash_output # 743μs -> 703μs (5.69% faster)

def test_sparse_overlap_large():
    # Test IOU for large arrays with sparse overlap
    N = 100
    boxes1 = [Box(i*10, i*10, i*10+5, i*10+5) for i in range(N)]
    boxes2 = [Box(i*10+2, i*10+2, i*10+7, i*10+7) for i in range(N)]
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.1); arr = codeflash_output # 308μs -> 225μs (36.9% faster)
    # Only diagonal should be True
    for i in range(N):
        for j in range(N):
            if i == j:
                pass
            else:
                pass

def test_performance_large_scale():
    # Test IOU for performance with max allowed size
    N = 250
    boxes1 = [Box(i, i, i+5, i+5) for i in range(N)]
    boxes2 = [Box(i, i, i+5, i+5) for i in range(N)]
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); arr = codeflash_output # 822μs -> 658μs (24.9% faster)

# --------- Mutation-sensitive test ---------
def test_mutation_sensitive_behavior():
    # This test will fail if the IOU logic is changed to >= instead of >
    box1 = Box(0, 0, 10, 10)
    box2 = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box1], [box2], threshold=1.0); arr = codeflash_output # 109μs -> 80.8μs (35.1% faster)

# --------- Input Type Robustness ---------

To edit these changes git checkout codeflash/optimize-pr4112-2025-11-06T21.15.51 and push.


qued and others added 14 commits November 4, 2025 10:05
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.
…same` by 24% in PR #4112 (`feat/track-text-source`) (#4114)

## ⚡️ This pull request contains optimizations for PR #4112
If you approve this dependent PR, these changes will be merged into the
original PR branch `feat/track-text-source`.
>This PR will be automatically closed if the original PR is merged.
----
#### 📄 24% (0.24x) speedup for
***`_merge_extracted_into_inferred_when_almost_the_same` in
`unstructured/partition/pdf_image/pdfminer_processing.py`***

⏱️ Runtime : **`40.6 milliseconds`** **→** **`32.6 milliseconds`** (best
of `18` runs)

#### 📝 Explanation and details


The optimized code achieves a **24% speedup** through two key
optimizations and one minor one:

**1. Improved `_minimum_containing_coords` function:**
- **What**: Replaced `np.vstack` with separate array creation followed
by `np.column_stack`
- **Why**: The original code created list comprehensions multiple times
within `np.vstack`, causing redundant temporary arrays and inefficient
memory access patterns. The optimized version pre-computes each
coordinate array once, then combines them efficiently
- **Impact**: Reduces function time from 1.88ms to 1.41ms (25% faster).
Line profiler shows the costly list comprehensions in the original
(lines with 27%, 14%, 13%, 12% of time) are replaced with more efficient
array operations
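
A minimal sketch of the two construction styles, assuming each input is an `(N, 4)` array of `x1, y1, x2, y2` rows. Both function names and signatures here are hypothetical and are not the actual `_minimum_containing_coords` implementation.

```python
import numpy as np


def containing_coords_vstack(*coord_sets):
    # Inline list comprehensions inside np.vstack, transposed to (N, 4)
    return np.vstack((
        np.min([c[:, 0] for c in coord_sets], axis=0),
        np.min([c[:, 1] for c in coord_sets], axis=0),
        np.max([c[:, 2] for c in coord_sets], axis=0),
        np.max([c[:, 3] for c in coord_sets], axis=0),
    )).T


def containing_coords_column_stack(*coord_sets):
    # Compute each coordinate array once, then combine them column-wise
    x1 = np.min([c[:, 0] for c in coord_sets], axis=0)
    y1 = np.min([c[:, 1] for c in coord_sets], axis=0)
    x2 = np.max([c[:, 2] for c in coord_sets], axis=0)
    y2 = np.max([c[:, 3] for c in coord_sets], axis=0)
    return np.column_stack((x1, y1, x2, y2))


# Same boxes as the "minimum containing" regression test
extracted = np.array([[1, 2, 9, 10]])
inferred = np.array([[0, 0, 10, 10]])
merged = containing_coords_column_stack(extracted, inferred)
```

`np.column_stack` places the 1-D arrays directly as columns, so no transpose is needed to reach the `(N, 4)` layout.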

**2. Optimized comparison in `boxes_iou` function:**
- **What**: Changed `(inter_area / denom) > threshold` to `inter_area >
(threshold * denom)`
- **Why**: Avoids the element-wise division by algebraically rearranging
the inequality; the two forms are equivalent as long as `denom` is
positive. Division is typically slower than multiplication, and the
difference adds up on large arrays
- **Impact**: Reduces the final comparison from 19% to 5.8% of function
time, while the intermediate denominator calculation takes 11.8%
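
The rearrangement can be checked on a small example. The arrays below are made-up values (areas of 121 and an intersection of 36, as in the test comments), and the sketch assumes every entry of `denom` is positive, which is what makes the two comparisons equivalent.

```python
import numpy as np

# Illustrative pairwise intersection areas and denominators
inter_area = np.array([[121.0, 36.0], [0.0, 36.0]])
denom = np.array([[121.0, 206.0], [242.0, 206.0]])
threshold = 0.5

# Original form: one division per pair, then compare
by_division = (inter_area / denom) > threshold

# Rearranged form: one multiplication per pair, then compare
by_multiplication = inter_area > (threshold * denom)
```

For values that sit exactly on the threshold, floating-point rounding could in principle make the two forms disagree, so a boundary-sensitive test is worth keeping alongside this kind of change.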

**3. Minor optimization in boolean mask creation:**
- **What**: Replaced `boxes_almost_same.sum(axis=1).astype(bool)` with
`np.any(boxes_almost_same, axis=1)`
- **Why**: `np.any` reduces straight to a boolean array, skipping the
intermediate integer sums and the `astype` cast, and is semantically
clearer, though the performance gain is minimal
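
A short sketch showing that the two mask constructions agree, using an illustrative boolean matrix in which rows stand for extracted elements and columns for inferred ones.

```python
import numpy as np

# Illustrative match matrix: rows are extracted elements, columns inferred
boxes_almost_same = np.array([
    [True, False],
    [False, False],
    [True, True],
])

# Original form: count matches per row, then cast the counts to bool
via_sum = boxes_almost_same.sum(axis=1).astype(bool)

# Replacement: ask directly whether any match exists in each row
via_any = np.any(boxes_almost_same, axis=1)
```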

**Test case analysis shows the optimizations are particularly effective
for:**
- Large-scale scenarios (1000 extracted elements): 16.6-74.6% speedup
depending on match patterns
- Cases with no matches benefit most (74.6% faster) due to avoiding
expensive division operations
- The remaining non-empty cases improve by roughly 4-9%, and the
empty-input cases, which return within a few microseconds, range from
0.475% slower to 9.29% faster

The optimizations maintain identical functionality while reducing
computational overhead through better NumPy usage patterns and
mathematical rearrangement.



✅ **Correctness verification report:**

| Test                        | Status            |
| --------------------------- | ----------------- |
| ⏪ Replay Tests | 🔘 **None Found** |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **18 Passed** |
|📊 Tests Coverage       | 100.0% |
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
import numpy as np
# imports
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import \
    _merge_extracted_into_inferred_when_almost_the_same

# --- Minimal class stubs and helpers to support the function under test ---

class DummyLayoutElements:
    """
    Minimal implementation of LayoutElements to support testing.
    - element_coords: np.ndarray of shape (N, 4) for bounding boxes.
    - texts: np.ndarray of shape (N,) for text strings.
    - is_extracted_array: np.ndarray of shape (N,) for boolean flags.
    """
    def __init__(self, element_coords, texts=None, is_extracted_array=None):
        self.element_coords = np.array(element_coords, dtype=np.float32)
        self.texts = np.array(texts if texts is not None else [''] * len(element_coords), dtype=object)
        self.is_extracted_array = np.array(is_extracted_array if is_extracted_array is not None else [False] * len(element_coords), dtype=bool)

    def __len__(self):
        return len(self.element_coords)

    def slice(self, mask):
        # mask can be a boolean array, integer indices, or a single index
        idx = np.atleast_1d(np.asarray(mask))
        if idx.dtype == bool:
            # np.bool_ entries are not instances of Python bool, so test the dtype
            idx = np.flatnonzero(idx)
        else:
            # also covers an empty list, which np.asarray turns into float64
            idx = idx.astype(np.intp)
        return DummyLayoutElements(
            self.element_coords[idx],
            self.texts[idx],
            self.is_extracted_array[idx]
        )

# --- Unit Tests ---

# ----------- BASIC TEST CASES -----------

def test_no_inferred_elements_returns_false_mask():
    # No inferred elements: all extracted should not be merged
    extracted = DummyLayoutElements([[0, 0, 1, 1], [1, 1, 2, 2]], texts=["a", "b"])
    inferred = DummyLayoutElements([])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.9); mask = codeflash_output # 3.50μs -> 3.30μs (6.10% faster)

def test_no_extracted_elements_returns_empty_mask():
    # No extracted elements: should return empty mask
    extracted = DummyLayoutElements([])
    inferred = DummyLayoutElements([[0, 0, 1, 1]])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.9); mask = codeflash_output # 2.30μs -> 2.31μs (0.475% slower)

#------------------------------------------------
import numpy as np
# imports
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import \
    _merge_extracted_into_inferred_when_almost_the_same


# Minimal stubs for TextRegions and LayoutElements to enable testing
class TextRegions:
    def __init__(self, coords, texts=None, is_extracted_array=None):
        self.x1 = coords[:, 0]
        self.y1 = coords[:, 1]
        self.x2 = coords[:, 2]
        self.y2 = coords[:, 3]
        self.texts = np.array(texts) if texts is not None else np.array([""] * len(coords))
        self.is_extracted_array = np.array(is_extracted_array) if is_extracted_array is not None else np.zeros(len(coords), dtype=bool)
        self.element_coords = coords

    def __len__(self):
        return len(self.element_coords)

    def slice(self, mask):
        # mask can be bool array or indices
        if isinstance(mask, (np.ndarray, list)):
            if isinstance(mask, np.ndarray) and mask.dtype == bool:
                idx = np.where(mask)[0]
            else:
                idx = mask
        else:
            idx = [mask]
        coords = self.element_coords[idx]
        texts = self.texts[idx]
        is_extracted_array = self.is_extracted_array[idx]
        return TextRegions(coords, texts, is_extracted_array)

class LayoutElements(TextRegions):
    pass

# ===========================
# Unit Tests
# ===========================

# ----------- BASIC TEST CASES -----------

def test_basic_exact_match():
    # One extracted, one inferred, same box
    coords = np.array([[0, 0, 10, 10]])
    extracted = LayoutElements(coords, texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(coords, texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 207μs -> 192μs (7.74% faster)

def test_basic_no_match():
    # Boxes do not overlap
    extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[20, 20, 30, 30]]), texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 163μs -> 151μs (7.85% faster)

def test_basic_partial_overlap_below_threshold():
    # Overlap, but below threshold
    extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[5, 5, 15, 15]]), texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 158μs -> 148μs (6.53% faster)

def test_basic_partial_overlap_above_threshold():
    # Overlap, above threshold
    extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[0, 0, 10, 10.1]]), texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 191μs -> 176μs (8.22% faster)

def test_basic_multiple_elements_some_match():
    # Multiple extracted/inferred, some matches
    extracted = LayoutElements(
        np.array([[0, 0, 10, 10], [20, 20, 30, 30]]),
        texts=["extracted1", "extracted2"],
        is_extracted_array=[True, True]
    )
    inferred = LayoutElements(
        np.array([[0, 0, 10, 10], [100, 100, 110, 110]]),
        texts=["inferred1", "inferred2"],
        is_extracted_array=[False, False]
    )
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 172μs -> 162μs (5.98% faster)

# ----------- EDGE TEST CASES -----------

def test_edge_empty_extracted():
    # No extracted elements
    extracted = LayoutElements(np.zeros((0, 4)), texts=[], is_extracted_array=[])
    inferred = LayoutElements(np.array([[0,0,1,1]]), texts=["foo"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.08μs -> 2.06μs (0.969% faster)

def test_edge_empty_inferred():
    # No inferred elements
    extracted = LayoutElements(np.array([[0,0,1,1]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.zeros((0, 4)), texts=[], is_extracted_array=[])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.71μs -> 2.48μs (9.29% faster)

def test_edge_all_elements_match():
    # All extracted match inferred
    coords = np.array([[0,0,10,10], [20,20,30,30]])
    extracted = LayoutElements(coords, texts=["A", "B"], is_extracted_array=[True, True])
    inferred = LayoutElements(coords, texts=["X", "Y"], is_extracted_array=[False, False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 174μs -> 162μs (7.69% faster)

def test_edge_threshold_zero():
    # Threshold zero means any overlap counts
    extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[5,5,15,15]]), texts=["bar"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.0); mask = codeflash_output # 159μs -> 150μs (5.94% faster)

def test_edge_threshold_one():
    # Threshold one: with the strict > comparison, even a perfect overlap (IoU = 1) does not count
    extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[0,0,10,10]]), texts=["bar"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 1.0); mask = codeflash_output # 155μs -> 145μs (7.01% faster)

def test_edge_multiple_matches_first_match_wins():
    # Extracted overlaps with multiple inferred, but only first match is updated
    extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(
        np.array([[0,0,10,10], [0,0,10,10]]), texts=["bar1", "bar2"], is_extracted_array=[False, False]
    )
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 168μs -> 156μs (7.25% faster)

def test_edge_coords_are_updated_to_minimum_containing():
    # Bounding boxes are updated to minimum containing box
    extracted = LayoutElements(np.array([[1,2,9,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[0,0,10,10]]), texts=["bar"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 156μs -> 144μs (8.56% faster)
    # The new coords should be the minimum containing both
    expected = np.array([0,0,10,10])

# ----------- LARGE SCALE TEST CASES -----------

def test_large_scale_many_elements():
    # 500 extracted, 500 inferred, all match
    N = 500
    coords = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    extracted = LayoutElements(coords, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords.copy(), texts=[f"in{i}" for i in range(N)], is_extracted_array=[False]*N)
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.90ms -> 2.79ms (3.78% faster)

def test_large_scale_some_elements_match():
    # 1000 extracted, 500 inferred, only first 500 match
    N = 1000
    M = 500
    coords_extracted = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    coords_inferred = coords_extracted[:M]
    extracted = LayoutElements(coords_extracted, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords_inferred.copy(), texts=[f"in{i}" for i in range(M)], is_extracted_array=[False]*M)
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 6.49ms -> 5.56ms (16.6% faster)
    # First 500 should be merged, rest not
    expected_mask = np.zeros(N, dtype=bool)
    expected_mask[:M] = True

def test_large_scale_no_elements_match():
    # 1000 extracted, 500 inferred, none match
    N = 1000
    M = 500
    coords_extracted = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    coords_inferred = coords_extracted[:M] + 10000  # Far away
    extracted = LayoutElements(coords_extracted, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords_inferred, texts=[f"in{i}" for i in range(M)], is_extracted_array=[False]*M)
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 8.91ms -> 5.11ms (74.6% faster)

def test_large_scale_performance():
    # Test that the function runs efficiently for 1000 elements
    N = 1000
    coords = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    extracted = LayoutElements(coords, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords.copy(), texts=[f"in{i}" for i in range(N)], is_extracted_array=[False]*N)
    import time
    start = time.time()
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 20.6ms -> 17.6ms (17.1% faster)
    elapsed = time.time() - start
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>


To edit these changes `git checkout codeflash/optimize-pr4112-2025-11-05T21.03.01` and push.


[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNDgwIiBoZWlnaHQ9ImF1dG8iIHZpZXdCb3g9IjAgMCA0ODAgMjgwIiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgo8cGF0aCBmaWxsLXJ1bGU9ImV2ZW5vZGQiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZD0iTTI4Ni43IDAuMzc4NDE4SDIwMS43NTFMNTAuOTAxIDE0OC45MTFIMTM1Ljg1MUwwLjk2MDkzOCAyODEuOTk5SDk1LjQzNTJMMjgyLjMyNCA4OS45NjE2SDE5Ni4zNDVMMjg2LjcgMC4zNzg0MThaIiBmaWxsPSIjRkZDMDQzIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzExLjYwNyAwLjM3ODkwNkwyNTguNTc4IDU0Ljk1MjZIMzc5LjU2N0w0MzIuMzM5IDAuMzc4OTA2SDMxMS42MDdaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzA5LjU0NyA4OS45NjAxTDI1Ni41MTggMTQ0LjI3NkgzNzcuNTA2TDQzMC4wMjEgODkuNzAyNkgzMDkuNTQ3Vjg5Ljk2MDFaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMjQyLjg3MyAxNjQuNjZMMTg5Ljg0NCAyMTkuMjM0SDMxMC44MzNMMzYzLjM0NyAxNjQuNjZIMjQyLjg3M1oiIGZpbGw9IiMwQjBBMEEiLz4KPC9zdmc+Cg==)](https://codeflash.ai)
![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Speeds up layout merging by optimizing bounding-box aggregation, boolean mask creation, and IOU comparison to avoid divisions.
>
> - **Performance optimizations in `unstructured/partition/pdf_image/pdfminer_processing.py`**:
>   - `/_minimum_containing_coords`:
>     - Precomputes `x1/y1/x2/y2` arrays and uses `np.column_stack` to build output; removes extra transpose.
>   - `/_merge_extracted_into_inferred_when_almost_the_same`:
>     - Replaces `sum(...).astype(bool)` with `np.any(..., axis=1)` for match mask.
>   - `/boxes_iou`:
>     - Computes denominator once and replaces division `(x/y) > t` with `x > t*y` to avoid divisions.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 8a0335f. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
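
The division-free comparison and the `np.any` match mask described in the summary can be sketched roughly as follows. This is an illustration only, not the code in `pdfminer_processing.py`: the helper name `iou_exceeds_threshold`, its signature, and the sample areas are assumptions, and it relies on every box having positive area so that the union is positive.

```python
import numpy as np


def iou_exceeds_threshold(intersection, area1, area2, threshold):
    """Hypothetical helper: True where the IoU of box i and box j exceeds threshold.

    intersection: (N, M) pairwise intersection areas
    area1: (N,) areas of the first box set
    area2: (M,) areas of the second box set
    """
    # Union area, computed once for all pairs.
    union = area1[:, None] + area2[None, :] - intersection
    # For union > 0, intersection / union > t is equivalent to intersection > t * union.
    return intersection > threshold * union


# Made-up data: three boxes compared against two boxes.
intersection = np.array([[100.0, 0.0], [0.0, 100.0], [0.0, 0.0]])
area1 = np.array([100.0, 100.0, 100.0])
area2 = np.array([100.0, 100.0])

matches = iou_exceeds_threshold(intersection, area1, area2, 0.99)
# A row counts as matched if any column is True; the first two rows match, the third does not.
mask = np.any(matches, axis=1)
```

Multiplying the threshold into the denominator avoids materializing a full IoU matrix, at the cost of assuming a strictly positive union.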

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
The optimization achieves a **13% speedup** through three key improvements that target the most time-consuming operations:

**1. Eliminated Python-level iteration in `get_coords_from_bboxes`:**
- Replaced the expensive `for i, bbox in enumerate(bboxes): coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]` loop (50.3% of function time) with `np.fromiter()` using a generator expression
- This avoids repeated Python object access and list creation, leveraging NumPy's efficient C-level iteration instead
- The profiler shows this reduces the coordinate extraction overhead significantly
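
As a rough sketch of the change described in item 1, the two approaches can be compared side by side. The `Box` dataclass, the function names, and the `float64` dtype are stand-ins chosen for illustration; the real `get_coords_from_bboxes` may differ in all three.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """Hypothetical stand-in for a bounding-box object with corner attributes."""
    x1: float
    y1: float
    x2: float
    y2: float


def coords_loop(bboxes):
    # Original style: one temporary Python list and one row assignment per box.
    coords = np.zeros((len(bboxes), 4), dtype=np.float64)
    for i, bbox in enumerate(bboxes):
        coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]
    return coords


def coords_fromiter(bboxes):
    # Optimized style: stream all coordinates into one flat array, then reshape.
    flat = np.fromiter(
        (v for bbox in bboxes for v in (bbox.x1, bbox.y1, bbox.x2, bbox.y2)),
        dtype=np.float64,
        count=4 * len(bboxes),  # known length lets NumPy preallocate the buffer
    )
    return flat.reshape(len(bboxes), 4)


boxes = [Box(0, 0, 10, 10), Box(5, 5, 15, 20)]
loop_result = coords_loop(boxes)
fromiter_result = coords_fromiter(boxes)
```

Both functions return the same `(N, 4)` array; the second avoids building a four-element list for each box.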

**2. Replaced `np.split()` with direct column slicing:**
- Changed `np.split(coords1, 4, axis=1)` (7% of function time) to `coords1[:, 0:1], coords1[:, 1:2], coords1[:, 2:3], coords1[:, 3:4]`
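- `np.split()` also returns views, but it does so through extra Python-level bookkeeping (computing section boundaries and building a list); direct slicing produces the same views without that overhead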
- This optimization is particularly effective since `areas_of_boxes_and_intersection_area` accounts for 73.9% of the total `boxes_iou` runtime
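
A small sketch of item 2, using made-up coordinates, shows that the slices yield the same `(N, 1)` columns as `np.split`. The variable names are illustrative and are not taken from `areas_of_boxes_and_intersection_area`.

```python
import numpy as np

# Two boxes as rows of (x1, y1, x2, y2); values are made up for illustration.
coords = np.array([[0.0, 0.0, 10.0, 10.0],
                   [5.0, 5.0, 15.0, 20.0]])

# Original style: split into four (N, 1) column arrays.
x1_s, y1_s, x2_s, y2_s = np.split(coords, 4, axis=1)

# Optimized style: slice each column directly, keeping the trailing axis.
x1, y1, x2, y2 = coords[:, 0:1], coords[:, 1:2], coords[:, 2:3], coords[:, 3:4]

# The slices are views onto `coords`, so no coordinate data is copied.
slices_share_memory = np.shares_memory(x1, coords)

# Box areas keep the (N, 1) shape, which is convenient for later broadcasting.
areas = (x2 - x1) * (y2 - y1)
```

Slicing with `0:1` rather than indexing with `0` is what preserves the two-dimensional column shape.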

**3. Improved broadcasting mechanics:**
- Replaced implicit `.T` transposes with explicit broadcasting using `[:, None]` and `[None, :]` in the denominator calculation
- This makes NumPy's memory access patterns more predictable and reduces temporary array creation
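
The two ways of forming the pairwise denominator in item 3 can be sketched as below. The arrays and names are assumptions chosen for illustration; the point is only that both forms broadcast to the same `(N, M)` result.

```python
import numpy as np

area1 = np.array([100.0, 150.0, 50.0])   # areas of N = 3 boxes
area2 = np.array([100.0, 25.0])          # areas of M = 2 boxes
intersection = np.zeros((3, 2))          # pairwise intersection areas (all zero here)

# Transpose style: build (N, 1) and (M, 1) columns, then transpose the second.
denom_transpose = area1.reshape(-1, 1) + area2.reshape(-1, 1).T - intersection

# Explicit broadcasting style: add the new axes directly.
denom_broadcast = area1[:, None] + area2[None, :] - intersection
```

The explicit form states the intended output shape at the point of use, which tends to make shape errors easier to spot.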

**Test case performance patterns:**
- **Highest gains (60-80% faster)** on simple cases with few boxes, where coordinate extraction overhead dominates
- **Moderate gains (10-20% faster)** on large-scale tests (1000+ boxes) where the intersection computation becomes the bottleneck
- **Consistent improvements** across all input types (Box objects, numpy arrays, edge cases)

The optimizations are particularly valuable for PDF processing workloads where bounding box IoU calculations are performed frequently across many document elements.
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to codeflash labels Nov 6, 2025
@codeflash-ai codeflash-ai bot mentioned this pull request Nov 6, 2025
Base automatically changed from feat/track-text-source to main November 11, 2025 19:45
@codeflash-ai codeflash-ai bot closed this Nov 12, 2025
codeflash-ai bot commented Nov 12, 2025

This PR has been automatically closed because the original PR #4120 by vhsakpal was closed.

@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr4112-2025-11-06T21.15.51 branch November 12, 2025 10:56