@ethche ethche commented Nov 11, 2025

Adds the LFBO Pattern Search autotuner, which modifies PatternSearch to search through configs using a learned acquisition function (a RandomForestClassifier), following the likelihood-free Bayesian optimization (LFBO) framework [1].

  • As in PatternSearch, we generate neighbors from the search copies, but we sample random neighbors instead of enumerating the exhaustive set.
  • A fitted RandomForestClassifier then filters these neighbors down to a fraction that is actually evaluated. The classifier is trained to predict which configs are among the best x% of configs, so it learns which configs are likely to improve on the best config seen so far.
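The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PR's implementation: configs are modeled as plain integer tuples, `random_neighbors`, `label_top_fraction`, and `select_candidates` are hypothetical helper names, and a tiny nearest-neighbor vote stands in for the RandomForestClassifier so the sketch needs only the standard library. Any classifier that returns a probability of "good" could take its place.

```python
import random


def label_top_fraction(latencies, quantile=0.2):
    """Label 1 for latencies in the best `quantile` fraction (lower is better), else 0."""
    ranked = sorted(latencies)
    k = max(1, int(len(ranked) * quantile))
    threshold = ranked[k - 1]
    return [1 if lat <= threshold else 0 for lat in latencies]


class KNNClassifier:
    """Stand-in for RandomForestClassifier (assumption made for this sketch only)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, configs, labels):
        self.configs = list(configs)
        self.labels = list(labels)
        return self

    def prob_good(self, config):
        # Fraction of positive labels among the k closest measured configs.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(config, seen)), label)
            for seen, label in zip(self.configs, self.labels)
        )
        nearest = dists[: self.k]
        return sum(label for _, label in nearest) / len(nearest)


def random_neighbors(config, num_neighbors, rng, step=1):
    """Sample neighbors by perturbing one random coordinate (hypothetical neighbor rule)."""
    neighbors = []
    for _ in range(num_neighbors):
        candidate = list(config)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.choice([-step, step])
        neighbors.append(tuple(candidate))
    return neighbors


def select_candidates(history, candidates, keep_fraction=0.25, quantile=0.2):
    """Keep the candidates the classifier scores as most likely to be top configs.

    `history` is a list of (config, measured_latency) pairs.
    """
    configs = [config for config, _ in history]
    labels = label_top_fraction([lat for _, lat in history], quantile)
    clf = KNNClassifier(k=3).fit(configs, labels)
    ranked = sorted(candidates, key=clf.prob_good, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

In a search loop, one would call `random_neighbors` on the current best config, pass the result through `select_candidates`, benchmark only the survivors, and append the new measurements to `history` before refitting.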

This improves on PatternSearch in both kernel latency and autotuning wall-clock time on B200 for a set of benchmark kernels. DifferentialEvolution can improve further on this, but takes substantially longer. DESurrogate, in #1096, has comparable performance. We also compare against UCBPatternSearch, a previous proposal that uses a Gaussian process with a UCB acquisition function.

Kernel latency:
geomean_latency_ratio_vs_pattern

Autotuning Wall-clock Speedup:
geomean_wallclock_speedup_vs_pattern

Autotuning Convergence Time:
geomean_convergence_speedup_vs_pattern
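For readers unfamiliar with the metric, the figure names suggest each plot reports a geometric mean of per-kernel ratios against PatternSearch. A minimal sketch of that aggregation follows, assuming one measurement per kernel per tuner; the function name `geomean_ratio` and the dictionary layout are hypothetical, and the exact aggregation used for the plots is not stated in this thread.

```python
from statistics import geometric_mean


def geomean_ratio(candidate, baseline):
    """Geometric mean of candidate/baseline ratios over kernels present in both.

    `candidate` and `baseline` map kernel name -> measurement (e.g. latency).
    For latency, a value below 1.0 means the candidate tuner is faster on
    average than the baseline.
    """
    shared = sorted(candidate.keys() & baseline.keys())
    return geometric_mean([candidate[k] / baseline[k] for k in shared])
```

The geometric mean is the usual choice for averaging ratios because it treats a 2x slowdown and a 2x speedup symmetrically, which an arithmetic mean does not.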

Some example convergence plots:

int4_gemm_1_1_7168_8192 softmax_4096_640

[1] J. Song et al., "A General Recipe for Likelihood-free Bayesian Optimization."

@meta-cla meta-cla bot added the CLA Signed label Nov 11, 2025
@ethche ethche requested a review from jansel November 14, 2025 01:03
@ethche ethche requested a review from jansel November 16, 2025 19:41

jansel commented Nov 20, 2025

If you update the benchmarking CI job to install the extra deps we can also do a benchmark run for this.

ethche commented Nov 21, 2025

Hi @jansel, I ran the benchmark CI job for LFBO Pattern Search: HUD.

Overall, kernel performance is equivalent to the default tuner, while autotuning wall-clock time is substantially reduced (0.8x to 0.5x of the default tuner's time across the board).

We even see slight performance improvements on AMD.

@ethche ethche requested a review from jansel November 22, 2025 02:28
@jansel jansel merged commit 7581998 into pytorch:main Nov 22, 2025
15 of 18 checks passed