⚡️ Speed up function `boxes_iou` by 13% in PR #4112 (`feat/track-text-source`) #4116

codeflash-ai · 2025-11-06T21:15:57Z

⚡️ This pull request contains optimizations for PR #4112

If you approve this dependent PR, these changes will be merged into the original PR branch feat/track-text-source.

This PR will be automatically closed if the original PR is merged.

📄 13% (0.13x) speedup for `boxes_iou` in `unstructured/partition/pdf_image/pdfminer_processing.py`

⏱️ Runtime : 49.9 milliseconds → 44.2 milliseconds (best of 69 runs)

📝 Explanation and details

The optimization achieves a 13% speedup through three key improvements that target the most time-consuming operations:

**1. Eliminated Python-level iteration in `get_coords_from_bboxes`:**

- Replaced the expensive `for i, bbox in enumerate(bboxes): coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]` loop (50.3% of function time) with `np.fromiter()` using a generator expression.
- This avoids repeated Python object access and list creation, leveraging NumPy's efficient C-level iteration instead.
- The profiler shows this reduces the coordinate extraction overhead significantly.
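The rewrite described in step 1 could look roughly like the sketch below. It is an illustration rather than the code from this PR: `Box`, `coords_loop`, and `coords_fromiter` are hypothetical names, the `float64` dtype is an assumption, and only the `x1`/`y1`/`x2`/`y2` attributes come from the description above.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """Hypothetical stand-in for the real bounding-box type."""
    x1: float
    y1: float
    x2: float
    y2: float


def coords_loop(bboxes):
    """Loop-based form: one list build and one row assignment per box."""
    coords = np.zeros((len(bboxes), 4), dtype=np.float64)  # dtype is an assumption
    for i, bbox in enumerate(bboxes):
        coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]
    return coords


def coords_fromiter(bboxes):
    """np.fromiter form: fill one flat array in a single call, then reshape."""
    flat = np.fromiter(
        (c for bbox in bboxes for c in (bbox.x1, bbox.y1, bbox.x2, bbox.y2)),
        dtype=np.float64,
        count=4 * len(bboxes),  # lets NumPy allocate the buffer once
    )
    return flat.reshape(-1, 4)


boxes = [Box(0, 0, 10, 10), Box(5, 5, 15, 15)]
assert np.array_equal(coords_loop(boxes), coords_fromiter(boxes))
```

Passing `count` is optional, but it tells `np.fromiter` the final size up front so the buffer does not have to grow during iteration.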

**2. Replaced `np.split()` with direct column slicing:**

- Changed `np.split(coords1, 4, axis=1)` (7% of function time) to `coords1[:, 0:1], coords1[:, 1:2], coords1[:, 2:3], coords1[:, 3:4]`.
- `np.split()` also returns views, but it reaches them through Python-level bookkeeping (computing split points, building a list); direct slicing produces the same `(N, 1)` views without that call overhead.
- This optimization is particularly effective since `areas_of_boxes_and_intersection_area` accounts for 73.9% of the total `boxes_iou` runtime.
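A small, self-contained check of step 2, assuming a plain `(N, 4)` coordinate array (the variable names here are illustrative, not taken from the PR): the sliced columns have the same shape and values as the `np.split` result and are views of the original array.

```python
import numpy as np

coords = np.array([[0, 0, 10, 10], [5, 5, 15, 15]], dtype=np.float64)

# Form used before the change: four (N, 1) column arrays from np.split.
x1_s, y1_s, x2_s, y2_s = np.split(coords, 4, axis=1)

# Form used after the change: slice each column, keeping the (N, 1) shape.
x1, y1, x2, y2 = coords[:, 0:1], coords[:, 1:2], coords[:, 2:3], coords[:, 3:4]

# Same shapes and values either way.
assert x1.shape == x1_s.shape == (2, 1)
assert np.array_equal(x2, x2_s) and np.array_equal(y2, y2_s)

# The slices are views: no coordinate data is copied.
assert np.shares_memory(x1, coords)
```

Using `0:1` rather than `0` keeps the trailing axis, so downstream broadcasting code that expects column vectors keeps working unchanged.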

**3. Improved broadcasting mechanics:**

- Replaced implicit `.T` transposes with explicit broadcasting using `[:, None]` and `[None, :]` in the denominator calculation.
- This makes NumPy's memory access patterns more predictable and reduces temporary array creation.
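To make the broadcasting in step 3 concrete, here is a minimal pairwise-IoU sketch. It is not the `boxes_iou` implementation: `pairwise_iou` is a hypothetical helper, the inclusive `+1` width/height convention is taken from the manual arithmetic in the generated tests, and the real function may additionally round values or guard the denominator with a small epsilon.

```python
import numpy as np


def pairwise_iou(coords1, coords2):
    """Return an (N, M) IoU matrix for (N, 4) and (M, 4) box arrays."""
    x11, y11, x12, y12 = (coords1[:, k] for k in range(4))  # each of shape (N,)
    x21, y21, x22, y22 = (coords2[:, k] for k in range(4))  # each of shape (M,)

    # (N, 1) combined with (1, M) broadcasts to (N, M); no transpose needed.
    inter_w = np.minimum(x12[:, None], x22[None, :]) - np.maximum(x11[:, None], x21[None, :]) + 1
    inter_h = np.minimum(y12[:, None], y22[None, :]) - np.maximum(y11[:, None], y21[None, :]) + 1
    inter = np.maximum(inter_w, 0) * np.maximum(inter_h, 0)

    area1 = (x12 - x11 + 1) * (y12 - y11 + 1)  # shape (N,)
    area2 = (x22 - x21 + 1) * (y22 - y21 + 1)  # shape (M,)

    # Union via explicit broadcasting of the two area vectors.
    denom = area1[:, None] + area2[None, :] - inter
    return inter / denom


iou = pairwise_iou(
    np.array([[0, 0, 10, 10]]),
    np.array([[5, 5, 15, 15], [20, 20, 30, 30]]),
)
assert iou.shape == (1, 2)
assert np.isclose(iou[0, 0], 36 / 206)  # matches the hand calculation in the tests
```

Writing the axes out with `[:, None]` and `[None, :]` states the intended `(N, M)` result shape directly at each operation.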

Test case performance patterns:

- Highest gains (60-80% faster) on simple cases with few boxes, where coordinate extraction overhead dominates
- Moderate gains (10-20% faster) on large-scale tests (1000+ boxes) where the intersection computation becomes the bottleneck
- Consistent improvements across all input types (Box objects, numpy arrays, edge cases)

The optimizations are particularly valuable for PDF processing workloads where bounding box IoU calculations are performed frequently across many document elements.

✅ Correctness verification report:

| Test | Status |
| --------------------------- | ----------------- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 42 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

from __future__ import annotations

import numpy as np
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import boxes_iou

EPSILON_AREA = 0.01
DEFAULT_ROUND = 15


# Helper class for box representation
class Box:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2

# ----------------------------
# Basic Test Cases
# ----------------------------

def test_identical_boxes_numpy():
    # Two identical boxes, IoU = 1, should return True for threshold < 1
    box = np.array([[0, 0, 10, 10]])
    codeflash_output = boxes_iou(box, box, threshold=0.5); result = codeflash_output # 85.8μs -> 54.3μs (58.1% faster)

def test_identical_boxes_list():
    # Two identical boxes as list of Box objects
    box = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box], [box], threshold=0.5); result = codeflash_output # 114μs -> 86.6μs (32.4% faster)

def test_no_overlap():
    # Boxes do not overlap at all
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[20, 20, 30, 30]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.1); result = codeflash_output # 76.6μs -> 47.0μs (62.8% faster)

def test_partial_overlap():
    # Boxes partially overlap, IoU < 1
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[5, 5, 15, 15]])
    # Calculate IoU for these boxes
    # intersection: (6x6)=36, union: 121+121-36=206, IoU=36/206~0.174
    codeflash_output = boxes_iou(box1, box2, threshold=0.1); result = codeflash_output # 74.4μs -> 45.8μs (62.4% faster)
    codeflash_output = boxes_iou(box1, box2, threshold=0.2); result = codeflash_output # 60.5μs -> 33.3μs (81.8% faster)

def test_multiple_boxes():
    # Multiple boxes, pairwise IoU
    boxes1 = np.array([[0, 0, 10, 10], [20, 20, 30, 30]])
    boxes2 = np.array([[0, 0, 10, 10], [5, 5, 15, 15]])
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); result = codeflash_output # 85.4μs -> 56.2μs (52.0% faster)

# ----------------------------
# Edge Test Cases
# ----------------------------

def test_zero_area_box():
    # Degenerate box (x1==x2, y1==y2); under the inclusive +1 convention used in these tests it still counts as area 1
    box1 = np.array([[10, 10, 10, 10]])
    box2 = np.array([[10, 10, 20, 20]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.01); result = codeflash_output # 74.0μs -> 44.7μs (65.7% faster)

def test_negative_coordinates():
    # Boxes with negative coordinates
    box1 = np.array([[-10, -10, 0, 0]])
    box2 = np.array([[-5, -5, 5, 5]])
    # Overlap area is 6x6=36, box1 area is 121, box2 area is 121, union=206
    # IoU = 36/206 ~ 0.174
    codeflash_output = boxes_iou(box1, box2, threshold=0.1); result = codeflash_output # 73.0μs -> 44.6μs (63.9% faster)

def test_touching_boxes():
    # Adjacent boxes (x ranges 0-10 and 11-20): with the inclusive +1 convention the overlap width is 10 - 11 + 1 = 0, so the intersection area is 0
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[11, 0, 20, 10]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.0); result = codeflash_output # 72.9μs -> 44.3μs (64.4% faster)

def test_empty_inputs():
    # One or both box arrays are empty
    box1 = np.empty((0, 4))
    box2 = np.array([[0, 0, 10, 10]])
    codeflash_output = boxes_iou(box1, box2); result = codeflash_output # 84.7μs -> 55.9μs (51.7% faster)
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.empty((0, 4))
    codeflash_output = boxes_iou(box1, box2); result = codeflash_output # 67.0μs -> 41.0μs (63.4% faster)
    box1 = np.empty((0, 4))
    box2 = np.empty((0, 4))
    codeflash_output = boxes_iou(box1, box2); result = codeflash_output # 65.7μs -> 39.1μs (67.9% faster)

def test_threshold_extremes():
    # Threshold at 0 (any overlap passes); at 1.0 nothing passes, because the comparison is a strict > and IoU never exceeds 1
    box1 = np.array([[0, 0, 10, 10]])
    box2 = np.array([[0, 0, 10, 10], [5, 5, 15, 15]])
    codeflash_output = boxes_iou(box1, box2, threshold=0.0); result = codeflash_output # 81.2μs -> 52.3μs (55.3% faster)
    codeflash_output = boxes_iou(box1, box2, threshold=1.0); result = codeflash_output # 66.0μs -> 38.4μs (71.9% faster)

def test_float_precision():
    # Boxes that overlap by a tiny amount, check rounding/epsilon
    box1 = np.array([[0, 0, 1e-8, 1e-8]])
    box2 = np.array([[0, 0, 1, 1]])
    # With the inclusive +1 convention, box1 has area ~1 and box2 has area 4, so the intersection is ~1 and IoU is roughly 0.25
    codeflash_output = boxes_iou(box1, box2, threshold=0.0); result = codeflash_output # 96.3μs -> 66.4μs (45.0% faster)
    codeflash_output = boxes_iou(box1, box2, threshold=0.01); result = codeflash_output # 75.3μs -> 48.5μs (55.2% faster)

def test_non_integer_boxes():
    # Boxes with float coordinates
    box1 = np.array([[0.5, 0.5, 10.5, 10.5]])
    box2 = np.array([[5.5, 5.5, 15.5, 15.5]])
    # Overlap: (6x6)=36, box1: 11x11=121, box2: 11x11=121, union=206, IoU~0.174
    codeflash_output = boxes_iou(box1, box2, threshold=0.15); result = codeflash_output # 93.8μs -> 63.1μs (48.7% faster)

def test_input_as_list_of_boxes():
    # Accepts list of Box objects as input
    box1 = [Box(0, 0, 10, 10)]
    box2 = [Box(0, 0, 10, 10), Box(5, 5, 15, 15)]
    codeflash_output = boxes_iou(box1, box2, threshold=0.5); result = codeflash_output # 113μs -> 85.2μs (33.3% faster)

# ----------------------------
# Large Scale Test Cases
# ----------------------------

def test_large_number_of_boxes():
    # Test with 1000 boxes, all identical, should all overlap perfectly
    N = 1000
    boxes = np.tile(np.array([[0, 0, 10, 10]]), (N, 1))
    codeflash_output = boxes_iou(boxes, boxes, threshold=0.99); result = codeflash_output # 20.2ms -> 18.2ms (11.0% faster)

def test_large_number_of_boxes_partial_overlap():
    # 500 boxes, each shifted by 1, overlap only with themselves and immediate neighbors
    N = 500
    boxes1 = np.array([[i, 0, i+10, 10] for i in range(N)])
    boxes2 = np.array([[i, 0, i+10, 10] for i in range(N)])
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); result = codeflash_output # 5.22ms -> 4.70ms (11.0% faster)
    # Diagonal should be True (identical), off-diagonal mostly False except for some neighbors
    for i in range(N):
        pass

def test_large_sparse_overlap():
    # 100 boxes spaced 100 apart; each box overlaps only its counterpart at the same index
    N = 100
    boxes1 = np.array([[0, 0, 10, 10]] + [[i*100, 0, i*100+10, 10] for i in range(1, N)])
    boxes2 = np.array([[0, 0, 10, 10]] + [[i*100, 0, i*100+10, 10] for i in range(1, N)])
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); result = codeflash_output # 194μs -> 161μs (20.1% faster)
    # Only diagonal is True
    for i in range(N):
        for j in range(N):
            if i == j:
                pass
            else:
                pass

def test_large_rectangular_grid():
    # 100x10 grid of boxes, all non-overlapping
    rows, cols = 100, 10
    boxes1 = []
    for i in range(rows):
        for j in range(cols):
            boxes1.append([i*20, j*20, i*20+10, j*20+10])
    boxes1 = np.array(boxes1)
    codeflash_output = boxes_iou(boxes1, boxes1, threshold=0.5); result = codeflash_output # 19.0ms -> 17.1ms (10.9% faster)
    # Only diagonal is True
    for i in range(rows*cols):
        if i < rows*cols-1:
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

import numpy as np
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import boxes_iou

EPSILON_AREA = 0.01
DEFAULT_ROUND = 15


# Helper class for box objects
class Box:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2

# --------- Basic Test Cases ---------
def test_single_identical_box():
    # Test IOU for two identical boxes (should be 1, so always True for threshold < 1)
    box = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box], [box], threshold=0.5); arr = codeflash_output # 146μs -> 105μs (38.3% faster)

def test_single_non_overlapping_boxes():
    # Test IOU for two non-overlapping boxes (IOU=0, should be False)
    box1 = Box(0, 0, 10, 10)
    box2 = Box(20, 20, 30, 30)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 110μs -> 81.7μs (34.7% faster)

def test_partial_overlap_boxes():
    # Test IOU for two partially overlapping boxes
    box1 = Box(0, 0, 10, 10)
    box2 = Box(5, 5, 15, 15)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 108μs -> 79.7μs (35.9% faster)
    # Calculate IOU manually
    inter_area = (10 - 5 + 1) * (10 - 5 + 1) # 6*6=36
    area1 = (10 - 0 + 1) * (10 - 0 + 1) # 11*11=121
    area2 = (15 - 5 + 1) * (15 - 5 + 1) # 11*11=121
    union = area1 + area2 - inter_area
    iou = inter_area / union

def test_multiple_boxes():
    # Test IOU for multiple boxes
    boxes1 = [Box(0, 0, 10, 10), Box(20, 20, 30, 30)]
    boxes2 = [Box(0, 0, 10, 10), Box(25, 25, 35, 35)]
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.1); arr = codeflash_output # 114μs -> 86.2μs (32.9% faster)

def test_numpy_array_input():
    # Test IOU for numpy array input
    arr1 = np.array([[0, 0, 10, 10], [20, 20, 30, 30]])
    arr2 = np.array([[0, 0, 10, 10], [25, 25, 35, 35]])
    codeflash_output = boxes_iou(arr1, arr2, threshold=0.1); arr = codeflash_output # 89.4μs -> 59.0μs (51.6% faster)

# --------- Edge Test Cases ---------
def test_empty_lists():
    # Test IOU with empty lists
    codeflash_output = boxes_iou([], [], threshold=0.5); arr = codeflash_output # 95.4μs -> 64.3μs (48.4% faster)

def test_one_empty_list():
    # Test IOU with one empty list
    box = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box], [], threshold=0.5); arr1 = codeflash_output # 104μs -> 75.0μs (38.8% faster)
    codeflash_output = boxes_iou([], [box], threshold=0.5); arr2 = codeflash_output # 83.8μs -> 58.1μs (44.2% faster)

def test_zero_area_boxes():
    # Test IOU for degenerate boxes (x1==x2 and/or y1==y2); under the inclusive +1 convention each still counts as area 1
    box1 = Box(5, 5, 5, 5)
    box2 = Box(5, 5, 5, 5)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 105μs -> 78.3μs (34.4% faster)

def test_negative_coordinates():
    # Test IOU for boxes with negative coordinates
    box1 = Box(-10, -10, 0, 0)
    box2 = Box(-5, -5, 5, 5)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 104μs -> 77.8μs (34.6% faster)

def test_threshold_behavior():
    # Test IOU for threshold at edge values
    box1 = Box(0, 0, 10, 10)
    box2 = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.0); arr_low = codeflash_output # 104μs -> 77.6μs (34.6% faster)
    codeflash_output = boxes_iou([box1], [box2], threshold=1.0); arr_high = codeflash_output # 84.7μs -> 60.6μs (39.9% faster)

def test_float_coordinates():
    # Test IOU for boxes with float coordinates
    box1 = Box(0.0, 0.0, 10.5, 10.5)
    box2 = Box(5.5, 5.5, 15.5, 15.5)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.1); arr = codeflash_output # 103μs -> 77.2μs (34.3% faster)

def test_rounding_precision():
    # Test IOU with different rounding precision
    box1 = Box(0, 0, 10.123456789, 10.987654321)
    box2 = Box(0, 0, 10.123456789, 10.987654321)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.5, round_to=2); arr1 = codeflash_output # 103μs -> 76.3μs (36.2% faster)
    codeflash_output = boxes_iou([box1], [box2], threshold=0.5, round_to=15); arr2 = codeflash_output # 84.8μs -> 59.8μs (41.9% faster)

# --------- Large Scale Test Cases ---------
def test_many_boxes():
    # Test IOU for large number of boxes
    N = 100
    # Create 100 boxes, all overlapping with themselves
    boxes = [Box(i, i, i+10, i+10) for i in range(N)]
    codeflash_output = boxes_iou(boxes, boxes, threshold=0.5); arr = codeflash_output # 305μs -> 224μs (36.3% faster)
    # Diagonal should be True (identical boxes); boxes shifted by only 1 or 2 also exceed IoU 0.5, so off-diagonal entries are not all False
    for i in range(N):
        for j in range(N):
            if i == j:
                pass
            else:
                pass

def test_large_numpy_array():
    # Test IOU for large numpy arrays
    N = 200
    arr1 = np.zeros((N, 4))
    arr2 = np.zeros((N, 4))
    # All boxes are at (0,0,10,10)
    arr1[:] = [0, 0, 10, 10]
    arr2[:] = [0, 0, 10, 10]
    codeflash_output = boxes_iou(arr1, arr2, threshold=0.5); arr = codeflash_output # 743μs -> 703μs (5.69% faster)

def test_sparse_overlap_large():
    # Test IOU for large arrays with sparse overlap
    N = 100
    boxes1 = [Box(i*10, i*10, i*10+5, i*10+5) for i in range(N)]
    boxes2 = [Box(i*10+2, i*10+2, i*10+7, i*10+7) for i in range(N)]
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.1); arr = codeflash_output # 308μs -> 225μs (36.9% faster)
    # Only diagonal should be True
    for i in range(N):
        for j in range(N):
            if i == j:
                pass
            else:
                pass

def test_performance_large_scale():
    # Test IOU for performance with max allowed size
    N = 250
    boxes1 = [Box(i, i, i+5, i+5) for i in range(N)]
    boxes2 = [Box(i, i, i+5, i+5) for i in range(N)]
    codeflash_output = boxes_iou(boxes1, boxes2, threshold=0.5); arr = codeflash_output # 822μs -> 658μs (24.9% faster)

# --------- Mutation-sensitive test ---------
def test_mutation_sensitive_behavior():
    # This test will fail if the IOU logic is changed to >= instead of >
    box1 = Box(0, 0, 10, 10)
    box2 = Box(0, 0, 10, 10)
    codeflash_output = boxes_iou([box1], [box2], threshold=1.0); arr = codeflash_output # 109μs -> 80.8μs (35.1% faster)

# --------- Input Type Robustness ---------

To edit these changes, run `git checkout codeflash/optimize-pr4112-2025-11-06T21.15.51` and push.

… extracted

This pull request includes updated ingest test fixtures. Please review and merge if appropriate.

…same` by 24% in PR #4112 (`feat/track-text-source`) (#4114) ## ⚡️ This pull request contains optimizations for PR #4112 If you approve this dependent PR, these changes will be merged into the original PR branch `feat/track-text-source`. >This PR will be automatically closed if the original PR is merged. ---- #### 📄 24% (0.24x) speedup for ***`_merge_extracted_into_inferred_when_almost_the_same` in `unstructured/partition/pdf_image/pdfminer_processing.py`*** ⏱️ Runtime : **`40.6 milliseconds`** **→** **`32.6 milliseconds`** (best of `18` runs) #### 📝 Explanation and details The optimized code achieves a **24% speedup** through two key optimizations: **1. Improved `_minimum_containing_coords` function:** - **What**: Replaced `np.vstack` with separate array creation followed by `np.column_stack` - **Why**: The original code created list comprehensions multiple times within `np.vstack`, causing redundant temporary arrays and inefficient memory access patterns. The optimized version pre-computes each coordinate array once, then combines them efficiently - **Impact**: Reduces function time from 1.88ms to 1.41ms (25% faster). Line profiler shows the costly list comprehensions in the original (lines with 27%, 14%, 13%, 12% of time) are replaced with more efficient array operations **2. Optimized comparison in `boxes_iou` function:** - **What**: Changed `(inter_area / denom) > threshold` to `inter_area > (threshold * denom)` - **Why**: Avoids expensive division operations by algebraically rearranging the inequality. Division is significantly slower than multiplication in NumPy, especially for large arrays - **Impact**: Reduces the final comparison from 19% to 5.8% of function time, while the intermediate denominator calculation takes 11.8% **3. 
Minor optimization in boolean mask creation:** - **What**: Replaced `boxes_almost_same.sum(axis=1).astype(bool)` with `np.any(boxes_almost_same, axis=1)` - **Why**: `np.any` short-circuits on the first True value and is semantically clearer, though the performance gain is minimal **Test case analysis shows the optimizations are particularly effective for:** - Large-scale scenarios (1000+ elements): 17-75% speedup depending on match patterns - Cases with no matches benefit most (74.6% faster) due to avoiding expensive division operations - All test cases show consistent 6-17% improvements, indicating robust optimization across different workloads The optimizations maintain identical functionality while reducing computational overhead through better NumPy usage patterns and mathematical rearrangement. ✅ **Correctness verification report:** | Test | Status | | --------------------------- | ----------------- | | ⏪ Replay Tests | 🔘 **None Found** | | ⚙️ Existing Unit Tests | 🔘 **None Found** | | 🔎 Concolic Coverage Tests | 🔘 **None Found** | | 🌀 Generated Regression Tests | ✅ **18 Passed** | |📊 Tests Coverage | 100.0% | <details> <summary>🌀 Generated Regression Tests and Runtime</summary> ```python import numpy as np # imports import pytest from unstructured.partition.pdf_image.pdfminer_processing import \ _merge_extracted_into_inferred_when_almost_the_same # --- Minimal class stubs and helpers to support the function under test --- class DummyLayoutElements: """ Minimal implementation of LayoutElements to support testing. - element_coords: np.ndarray of shape (N, 4) for bounding boxes. - texts: np.ndarray of shape (N,) for text strings. - is_extracted_array: np.ndarray of shape (N,) for boolean flags. 
""" def __init__(self, element_coords, texts=None, is_extracted_array=None): self.element_coords = np.array(element_coords, dtype=np.float32) self.texts = np.array(texts if texts is not None else [''] * len(element_coords), dtype=object) self.is_extracted_array = np.array(is_extracted_array if is_extracted_array is not None else [False] * len(element_coords), dtype=bool) def __len__(self): return len(self.element_coords) def slice(self, mask): # mask can be a boolean array or integer indices if isinstance(mask, (np.ndarray, list)): if isinstance(mask[0], bool): idx = np.where(mask)[0] else: idx = np.array(mask) else: idx = np.array([mask]) return DummyLayoutElements( self.element_coords[idx], self.texts[idx], self.is_extracted_array[idx] ) from unstructured.partition.pdf_image.pdfminer_processing import \ _merge_extracted_into_inferred_when_almost_the_same # --- Unit Tests --- # ----------- BASIC TEST CASES ----------- def test_no_inferred_elements_returns_false_mask(): # No inferred elements: all extracted should not be merged extracted = DummyLayoutElements([[0, 0, 1, 1], [1, 1, 2, 2]], texts=["a", "b"]) inferred = DummyLayoutElements([]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.9); mask = codeflash_output # 3.50μs -> 3.30μs (6.10% faster) def test_no_extracted_elements_returns_empty_mask(): # No extracted elements: should return empty mask extracted = DummyLayoutElements([]) inferred = DummyLayoutElements([[0, 0, 1, 1]]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.9); mask = codeflash_output # 2.30μs -> 2.31μs (0.475% slower) #------------------------------------------------ import numpy as np # imports import pytest from unstructured.partition.pdf_image.pdfminer_processing import \ _merge_extracted_into_inferred_when_almost_the_same # Minimal stubs for TextRegions and LayoutElements to enable testing class TextRegions: def __init__(self, coords, texts=None, 
is_extracted_array=None): self.x1 = coords[:, 0] self.y1 = coords[:, 1] self.x2 = coords[:, 2] self.y2 = coords[:, 3] self.texts = np.array(texts) if texts is not None else np.array([""] * len(coords)) self.is_extracted_array = np.array(is_extracted_array) if is_extracted_array is not None else np.zeros(len(coords), dtype=bool) self.element_coords = coords def __len__(self): return len(self.element_coords) def slice(self, mask): # mask can be bool array or indices if isinstance(mask, (np.ndarray, list)): if isinstance(mask, np.ndarray) and mask.dtype == bool: idx = np.where(mask)[0] else: idx = mask else: idx = [mask] coords = self.element_coords[idx] texts = self.texts[idx] is_extracted_array = self.is_extracted_array[idx] return TextRegions(coords, texts, is_extracted_array) class LayoutElements(TextRegions): pass from unstructured.partition.pdf_image.pdfminer_processing import \ _merge_extracted_into_inferred_when_almost_the_same # =========================== # Unit Tests # =========================== # ----------- BASIC TEST CASES ----------- def test_basic_exact_match(): # One extracted, one inferred, same box coords = np.array([[0, 0, 10, 10]]) extracted = LayoutElements(coords, texts=["extracted"], is_extracted_array=[True]) inferred = LayoutElements(coords, texts=["inferred"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 207μs -> 192μs (7.74% faster) def test_basic_no_match(): # Boxes do not overlap extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True]) inferred = LayoutElements(np.array([[20, 20, 30, 30]]), texts=["inferred"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 163μs -> 151μs (7.85% faster) def test_basic_partial_overlap_below_threshold(): # Overlap, but below threshold extracted = 
LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True]) inferred = LayoutElements(np.array([[5, 5, 15, 15]]), texts=["inferred"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 158μs -> 148μs (6.53% faster) def test_basic_partial_overlap_above_threshold(): # Overlap, above threshold extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True]) inferred = LayoutElements(np.array([[0, 0, 10, 10.1]]), texts=["inferred"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 191μs -> 176μs (8.22% faster) def test_basic_multiple_elements_some_match(): # Multiple extracted/inferred, some matches extracted = LayoutElements( np.array([[0, 0, 10, 10], [20, 20, 30, 30]]), texts=["extracted1", "extracted2"], is_extracted_array=[True, True] ) inferred = LayoutElements( np.array([[0, 0, 10, 10], [100, 100, 110, 110]]), texts=["inferred1", "inferred2"], is_extracted_array=[False, False] ) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 172μs -> 162μs (5.98% faster) # ----------- EDGE TEST CASES ----------- def test_edge_empty_extracted(): # No extracted elements extracted = LayoutElements(np.zeros((0, 4)), texts=[], is_extracted_array=[]) inferred = LayoutElements(np.array([[0,0,1,1]]), texts=["foo"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.08μs -> 2.06μs (0.969% faster) def test_edge_empty_inferred(): # No inferred elements extracted = LayoutElements(np.array([[0,0,1,1]]), texts=["foo"], is_extracted_array=[True]) inferred = LayoutElements(np.zeros((0, 4)), texts=[], is_extracted_array=[]) codeflash_output = 
_merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.71μs -> 2.48μs (9.29% faster) def test_edge_all_elements_match(): # All extracted match inferred coords = np.array([[0,0,10,10], [20,20,30,30]]) extracted = LayoutElements(coords, texts=["A", "B"], is_extracted_array=[True, True]) inferred = LayoutElements(coords, texts=["X", "Y"], is_extracted_array=[False, False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 174μs -> 162μs (7.69% faster) def test_edge_threshold_zero(): # Threshold zero means all overlap counts extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True]) inferred = LayoutElements(np.array([[5,5,15,15]]), texts=["bar"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.0); mask = codeflash_output # 159μs -> 150μs (5.94% faster) def test_edge_threshold_one(): # Threshold one means only perfect overlap counts extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True]) inferred = LayoutElements(np.array([[0,0,10,10]]), texts=["bar"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 1.0); mask = codeflash_output # 155μs -> 145μs (7.01% faster) def test_edge_multiple_matches_first_match_wins(): # Extracted overlaps with multiple inferred, but only first match is updated extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True]) inferred = LayoutElements( np.array([[0,0,10,10], [0,0,10,10]]), texts=["bar1", "bar2"], is_extracted_array=[False, False] ) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 168μs -> 156μs (7.25% faster) def test_edge_coords_are_updated_to_minimum_containing(): # Bounding boxes 
are updated to minimum containing box extracted = LayoutElements(np.array([[1,2,9,10]]), texts=["foo"], is_extracted_array=[True]) inferred = LayoutElements(np.array([[0,0,10,10]]), texts=["bar"], is_extracted_array=[False]) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 156μs -> 144μs (8.56% faster) # The new coords should be the minimum containing both expected = np.array([0,0,10,10]) # ----------- LARGE SCALE TEST CASES ----------- def test_large_scale_many_elements(): # 500 extracted, 500 inferred, all match N = 500 coords = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1) extracted = LayoutElements(coords, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N) inferred = LayoutElements(coords.copy(), texts=[f"in{i}" for i in range(N)], is_extracted_array=[False]*N) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.90ms -> 2.79ms (3.78% faster) def test_large_scale_some_elements_match(): # 1000 extracted, 500 inferred, only first 500 match N = 1000 M = 500 coords_extracted = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1) coords_inferred = coords_extracted[:M] extracted = LayoutElements(coords_extracted, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N) inferred = LayoutElements(coords_inferred.copy(), texts=[f"in{i}" for i in range(M)], is_extracted_array=[False]*M) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 6.49ms -> 5.56ms (16.6% faster) # First 500 should be merged, rest not expected_mask = np.zeros(N, dtype=bool) expected_mask[:M] = True def test_large_scale_no_elements_match(): # 1000 extracted, 500 inferred, none match N = 1000 M = 500 coords_extracted = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1) 
coords_inferred = coords_extracted[:M] + 10000 # Far away extracted = LayoutElements(coords_extracted, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N) inferred = LayoutElements(coords_inferred, texts=[f"in{i}" for i in range(M)], is_extracted_array=[False]*M) codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 8.91ms -> 5.11ms (74.6% faster) def test_large_scale_performance(): # Test that the function runs efficiently for 1000 elements N = 1000 coords = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1) extracted = LayoutElements(coords, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N) inferred = LayoutElements(coords.copy(), texts=[f"in{i}" for i in range(N)], is_extracted_array=[False]*N) import time start = time.time() codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 20.6ms -> 17.6ms (17.1% faster) elapsed = time.time() - start # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` </details> To edit these changes `git checkout codeflash/optimize-pr4112-2025-11-05T21.03.01` and push. 
---

> [!NOTE]
> Speeds up layout merging by optimizing bounding-box aggregation, boolean mask creation, and IOU comparison to avoid divisions.
>
> - **Performance optimizations in `unstructured/partition/pdf_image/pdfminer_processing.py`**:
>   - `/_minimum_containing_coords`:
>     - Precomputes `x1/y1/x2/y2` arrays and uses `np.column_stack` to build output; removes extra transpose.
>   - `/_merge_extracted_into_inferred_when_almost_the_same`:
>     - Replaces `sum(...).astype(bool)` with `np.any(..., axis=1)` for match mask.
>   - `/boxes_iou`:
>     - Computes denominator once and replaces division `(x/y) > t` with `x > t*y` to avoid divisions.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 8a0335f. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
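The three changes listed in the note can be illustrated with a small, self-contained sketch. This is not the code from `pdfminer_processing.py`: the function names, argument shapes, and sample values below are assumptions chosen only to show each technique in isolation.

```python
import numpy as np


def minimum_containing_coords(*coord_arrays):
    """Hypothetical helper: smallest box containing the row-wise boxes.

    Each input is an (N, 4) array of [x1, y1, x2, y2]. The four output
    columns are computed once and assembled with np.column_stack.
    """
    x1 = np.min([c[:, 0] for c in coord_arrays], axis=0)
    y1 = np.min([c[:, 1] for c in coord_arrays], axis=0)
    x2 = np.max([c[:, 2] for c in coord_arrays], axis=0)
    y2 = np.max([c[:, 3] for c in coord_arrays], axis=0)
    return np.column_stack((x1, y1, x2, y2))


def rows_with_any_match(matches):
    """Boolean mask of rows with at least one True entry.

    Same result as matches.sum(axis=1).astype(bool), without building
    an intermediate integer array.
    """
    return np.any(matches, axis=1)


def iou_above_threshold(intersection, union, threshold):
    """Compare IoU to a threshold without dividing.

    For union > 0, (intersection / union) > threshold is equivalent to
    intersection > threshold * union, up to floating-point rounding
    exactly at the boundary.
    """
    return intersection > threshold * union
```

The multiplication form also sidesteps division-by-zero warnings when a union area is zero, although in that case the two forms are no longer equivalent, so callers that care about degenerate boxes would still need to handle them explicitly.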

The optimization achieves a **13% speedup** through three key improvements that target the most time-consuming operations:

**1. Eliminated Python-level iteration in `get_coords_from_bboxes`:**
- Replaced the expensive `for i, bbox in enumerate(bboxes): coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]` loop (50.3% of function time) with `np.fromiter()` using a generator expression
- This avoids building a temporary list and performing a row assignment for every box; NumPy fills the output buffer directly from the generator
- The profiler shows this reduces the coordinate extraction overhead significantly

**2. Replaced `np.split()` with direct column slicing:**
- Changed `np.split(coords1, 4, axis=1)` (7% of function time) to `coords1[:, 0:1], coords1[:, 1:2], coords1[:, 2:3], coords1[:, 3:4]`
- Both forms return views of the existing data, but `np.split()` goes through generic Python-level splitting logic, while direct slicing produces the same views with less overhead
- This optimization is particularly effective since `areas_of_boxes_and_intersection_area` accounts for 73.9% of the total `boxes_iou` runtime

**3. Improved broadcasting mechanics:**
- Replaced implicit `.T` transposes with explicit broadcasting using `[:, None]` and `[None, :]` in the denominator calculation
- This makes the intended broadcast shapes explicit and reduces temporary array creation

**Test case performance patterns:**
- **Highest gains (60-80% faster)** on simple cases with few boxes, where coordinate extraction overhead dominates
- **Moderate gains (10-20% faster)** on large-scale tests (1000+ boxes) where the intersection computation becomes the bottleneck
- **Consistent improvements** across all input types (Box objects, numpy arrays, edge cases)

The optimizations are particularly valuable for PDF processing workloads where bounding box IoU calculations are performed frequently across many document elements.
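The three techniques described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `Box` dataclass, the function names, and the plain `(x2 - x1) * (y2 - y1)` area convention are assumptions made so the example is self-contained.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """Hypothetical stand-in for the bounding-box objects in the PR."""
    x1: float
    y1: float
    x2: float
    y2: float


def coords_loop(bboxes):
    """Baseline: one Python-level row assignment per box."""
    coords = np.zeros((len(bboxes), 4), dtype=np.float32)
    for i, bbox in enumerate(bboxes):
        coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]
    return coords


def coords_fromiter(bboxes):
    """Fill a flat buffer from a generator, then reshape to (N, 4)."""
    flat = np.fromiter(
        (v for b in bboxes for v in (b.x1, b.y1, b.x2, b.y2)),
        dtype=np.float32,
        count=4 * len(bboxes),
    )
    return flat.reshape(len(bboxes), 4)


def split_columns(coords):
    """Column views of shape (N, 1), matching np.split(coords, 4, axis=1)."""
    return coords[:, 0:1], coords[:, 1:2], coords[:, 2:3], coords[:, 3:4]


def pairwise_iou(coords1, coords2):
    """IoU matrix of shape (N, M) using explicit broadcasting."""
    x11, y11, x12, y12 = split_columns(coords1)        # each (N, 1)
    x21, y21, x22, y22 = (coords2[:, k] for k in range(4))  # each (M,)

    # (N, 1) against (1, M) broadcasts to (N, M).
    inter_w = np.clip(np.minimum(x12, x22[None, :]) - np.maximum(x11, x21[None, :]), 0, None)
    inter_h = np.clip(np.minimum(y12, y22[None, :]) - np.maximum(y11, y21[None, :]), 0, None)
    inter = inter_w * inter_h

    area1 = (coords1[:, 2] - coords1[:, 0]) * (coords1[:, 3] - coords1[:, 1])  # (N,)
    area2 = (coords2[:, 2] - coords2[:, 0]) * (coords2[:, 3] - coords2[:, 1])  # (M,)
    union = area1[:, None] + area2[None, :] - inter

    # Guard against zero-area pairs (an assumption, not taken from the PR).
    return inter / np.maximum(union, np.finfo(np.float32).eps)
```

Passing `count` to `np.fromiter` lets NumPy allocate the output once instead of growing it, which is part of why this form tends to beat the per-row loop for small inputs.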

codeflash-ai · 2025-11-12T10:56:14Z

This PR has been automatically closed because the original PR #4120 by vhsakpal was closed.

qued and others added 14 commits November 4, 2025 10:05

Add test to check behavior of is_extracted metadata during normalization

89d6fed

test element merge behavior for extracted text metadata

130c867

support is_extracted metadata for elements

1a78d06

Add merge logic for is_extracted

abcc4f3

Add test that pdfminer processed file layouelements are recognized as extracted

7e159c4

merge array elements while retaining extracted status

ae8f1a1

formatting

d7fc5a0

update deps

22fc9b3

format

2f59dc0

Update changelog and version

9b96a95

feat: track text source <- Ingest test fixtures update (#4113)

bb5ff8b

This pull request includes updated ingest test fixtures. Please review and merge if appropriate.

reduce comment length for linting

ea47d20

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to codeflash) labels Nov 6, 2025

codeflash-ai bot mentioned this pull request Nov 6, 2025

feat: track text source #4112

Merged

Base automatically changed from feat/track-text-source to main November 11, 2025 19:45

codeflash-ai bot closed this Nov 12, 2025

codeflash-ai bot deleted the codeflash/optimize-pr4112-2025-11-06T21.15.51 branch November 12, 2025 10:56
