Explicitly vectorize i32x8 to u8x8 conversion for storing into YCbCr buffers #22

okaneco · 2025-11-05T21:05:40Z

Remove unsafe transmute in AVX2 YCbCr conversion

Prior to this, the conversion was scalarized and used a bswap.
Rewriting the code to avoid reversing the array resulted in worse
codegen that extracted the bytes and manually re-inserted them
back into the SIMD register to store 8 bytes at once.

This is stacked on top of #17; only the last commit is relevant.
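
For readers without the diff at hand, here is a minimal sketch of what an explicitly vectorized i32x8 to u8x8 narrowing can look like with AVX2 intrinsics. It is not the code from this PR: the function names (`i32x8_to_u8x8`, `i32x8_to_u8x8_avx2`, `i32x8_to_u8x8_scalar`) are hypothetical, and the shuffle-then-permute sequence is just one plausible way to do it. It assumes each 32-bit element already holds a value in 0..=255, so keeping only the low byte loses nothing.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference (hypothetical helper): keep the low byte of each element.
fn i32x8_to_u8x8_scalar(v: [i32; 8]) -> [u8; 8] {
    v.map(|x| x as u8)
}

/// AVX2 sketch (hypothetical helper): narrow eight i32 values to eight bytes
/// without leaving the SIMD registers until the final 64-bit extract.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn i32x8_to_u8x8_avx2(v: [i32; 8]) -> [u8; 8] {
    unsafe {
        let v = _mm256_loadu_si256(v.as_ptr() as *const __m256i);
        // Within each 128-bit lane, move byte 0 of every 32-bit element into
        // bytes 0..4 of that lane. An index of -1 (high bit set) writes zero.
        let mask = _mm256_setr_epi8(
            0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        );
        let packed = _mm256_shuffle_epi8(v, mask);
        // Place dword 0 (from the low lane) and dword 4 (from the high lane)
        // next to each other in the lowest 64 bits.
        let idx = _mm256_setr_epi32(0, 4, 0, 0, 0, 0, 0, 0);
        let joined = _mm256_permutevar8x32_epi32(packed, idx);
        // Extract the low 64 bits; x86 is little-endian, so to_le_bytes
        // preserves the element order.
        let low = _mm_cvtsi128_si64(_mm256_castsi256_si128(joined));
        low.to_le_bytes()
    }
}

/// Dispatch to the AVX2 path when the CPU supports it, else fall back to scalar.
fn i32x8_to_u8x8(v: [i32; 8]) -> [u8; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            return unsafe { i32x8_to_u8x8_avx2(v) };
        }
    }
    i32x8_to_u8x8_scalar(v)
}
```

Calling `i32x8_to_u8x8([0, 1, 2, 3, 252, 253, 254, 255])` should give the same eight values back as bytes, matching the scalar helper. The sketch still uses `unsafe` for the load and the feature-gated call; whether those can be expressed safely depends on the Rust version and on how the surrounding code holds its vectors.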

Shnatsel

The tests comparing against the scalar implementation pass. The benchmarks show up to 4.5% improvement on Zen 4 and no regressions:

Benchmarks on Zen 4

encode rgb/encode rgb 100
                        time:   [56.018 ms 56.023 ms 56.028 ms]
                        change: [+0.0905% +0.1598% +0.2284%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
Benchmarking encode rgb/encode rgb 4x1: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 174.6s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb 4x1
                        time:   [34.628 ms 34.630 ms 34.631 ms]
                        change: [-1.1374% -1.0220% -0.9103%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe
Benchmarking encode rgb/encode rgb progressive: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 175.3s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb progressive
                        time:   [34.762 ms 34.763 ms 34.765 ms]
                        change: [-1.9688% -1.7338% -1.4983%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
encode rgb/encode rgb optimized
                        time:   [116.74 ms 116.76 ms 116.78 ms]
                        change: [-2.6467% -2.5590% -2.4677%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
encode rgb/encode rgb optimized progressive
                        time:   [118.35 ms 118.38 ms 118.41 ms]
                        change: [-4.6270% -4.5451% -4.4616%] (p = 0.00 < 0.05)
                        Performance has improved.
encode rgb/encode rgb mixed
                        time:   [245.04 ms 245.09 ms 245.15 ms]
                        change: [-3.1994% -3.1084% -3.0202%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  12 (12.00%) high mild

     Running benches/fdct.rs (target/release/deps/fdct-3a322e62af936f7b)
fdct/default fdct       time:   [39.240 ns 39.271 ns 39.303 ns]
                        change: [-2.0367% -1.9016% -1.7679%] (p = 0.00 < 0.05)
                        Performance has improved.
fdct/fdct avx2          time:   [19.228 ns 19.229 ns 19.229 ns]
                        change: [-2.3674% -2.2783% -2.1903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

This may understate the improvements on other systems: Zen 4 tends to deal well with just about any sequence of instructions and does not benefit as much from SIMD as older Intel chips.

okaneco · 2025-11-29T17:06:14Z

This should be ready for review now.

Here's the assembly difference for the __m256i-to-array conversion function before and after this PR:
https://rust.godbolt.org/z/59n6nP6n5

I reported the codegen issue from the safe YCbCr conversion upstream to LLVM:
llvm/llvm-project#167138

okaneco mentioned this pull request Nov 5, 2025

Safe AVX YCbCr #17

Merged

okaneco force-pushed the remove_transmute_ycbcr branch from ae27b16 to 97c3575 Compare November 5, 2025 23:17

Shnatsel approved these changes Nov 7, 2025

okaneco force-pushed the remove_transmute_ycbcr branch from 97c3575 to b972b53 Compare November 29, 2025 16:55

okaneco marked this pull request as ready for review November 29, 2025 16:55
