
cloudera/CAI_AMP_Inference_Scaling_Optimization


Inference Scaling - Multiobjective Optimization



Efficient AI Inference

Efficient AI inference scaling is essential for practical deployment. Instead of relying on static scaling heuristics or simple bivariate trade-offs between performance and compute, we should account for multiple factors such as cost, latency, and accuracy. This project models inference scaling as a multi-objective optimization (MOO) problem and simulates it in both 3D and 2D spaces. A sample output is shown below:

A feasible space shaped by 3D constraints captures values that a 2D space fails to account for.
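The idea can be made concrete with a small sketch (illustrative only, not the repository's code): a configuration is a (cost, time, accuracy) point, and a 2D check that ignores one axis can accept points the full 3D constraint set rejects. The function names and numbers below are hypothetical.

```python
def feasible_3d(point, c_max, t_max, acc_min):
    """Return True if (cost, time, accuracy) satisfies all three constraints."""
    cost, time, acc = point
    return cost <= c_max and time <= t_max and acc >= acc_min

def feasible_2d(point, c_max, acc_min):
    """A 2D cost/accuracy check that ignores latency entirely."""
    cost, _, acc = point
    return cost <= c_max and acc >= acc_min

# A point can pass the 2D check yet violate the time budget:
p = (0.8, 120.0, 0.92)  # $0.80, 120 s, 92% accuracy
print(feasible_2d(p, c_max=1.0, acc_min=0.9))              # → True
print(feasible_3d(p, c_max=1.0, t_max=60.0, acc_min=0.9))  # → False
```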

Project Structure

This repository contains a Jupyter notebook that implements Monte Carlo simulations for optimizing inference scaling in AI models. It explores trade-offs between cost, time, and accuracy across various pre-configured models.

inference-optimization-moo/
├── 01_Installer/
│   ├── install.sh
│   └── requirements.txt
├── 02_MultiobjectiveOptimization/
│   ├── __init__.py
│   ├── Inference_scaling_MOO.ipynb
│   ├── inference_scaling.py
│   └── __pycache__/
├── .project-metadata.yaml
├── pyproject.toml
└── README.md

Features

  • Model Configurations: Pre-defined settings for multiple AI models (e.g., GPT-5 variants, Nvidia Nemotron, Qwen3 series), including cost per token, latency, and accuracy distributions.
  • Monte Carlo Simulations: Statistical estimation of performance metrics with configurable trial counts and parallelization factors.
  • Optimization Methods:
      1. Maximal Accuracy selection
      2. Maximal Cube selection
      3. Pareto frontier with utopia-closest point selection
      4. Pareto frontier with knee-point detection
  • Interactive Visualizations: 3D feasible space plots
  • Constraints: Feasibility checks based on total cost and time budgets, and minimal accuracy requirements.
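Two of the selection methods above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not the repository's API; the point set and function names are invented. Each row is (cost, time, accuracy), where cost and time are minimized and accuracy is maximized.

```python
import numpy as np

def pareto_frontier(points):
    """Return indices of non-dominated (cost, time, accuracy) rows."""
    pts = np.asarray(points, dtype=float)
    # Negate accuracy so all three objectives are minimized.
    obj = pts * np.array([1.0, 1.0, -1.0])
    keep = []
    for i, p in enumerate(obj):
        dominated = any(
            np.all(q <= p) and np.any(q < p)
            for j, q in enumerate(obj) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def utopia_closest(points):
    """Pick the Pareto point nearest (in normalized objective space) to the
    utopia point: minimal cost, minimal time, maximal accuracy."""
    pts = np.asarray(points, dtype=float)
    front = pareto_frontier(pts)
    obj = pts[front] * np.array([1.0, 1.0, -1.0])
    lo, hi = obj.min(axis=0), obj.max(axis=0)
    norm = (obj - lo) / np.where(hi > lo, hi - lo, 1.0)
    # After normalization the utopia point is the origin.
    dists = np.linalg.norm(norm, axis=1)
    return front[int(np.argmin(dists))]

points = [(0.10, 5.0, 0.70), (0.30, 8.0, 0.85),
          (0.90, 20.0, 0.86), (0.50, 9.0, 0.84)]
print(pareto_frontier(points))  # → [0, 1, 2]  (last point is dominated)
print(utopia_closest(points))   # → 1
```

Note that the last configuration is dominated by the second (higher cost, higher time, lower accuracy), so it never appears on the frontier.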

Requirements

  • Python 3.9+
  • Jupyter Notebook
  • Libraries: numpy, matplotlib (which provides mpl_toolkits.mplot3d), ipywidgets, pyyaml

Repository Organization

  • 01_Installer/: Contains the installation script (install.sh) and dependency list (requirements.txt); the packaging configuration (pyproject.toml) lives at the repository root.
  • 02_MultiobjectiveOptimization/: Core Python module (inference_scaling.py) and the interactive Jupyter notebook (Inference_scaling_MOO.ipynb).

Installation

Option 1: Manual Installation with Virtual Environment

  1. Create and activate a virtual environment:

    # Create virtual environment
    python -m venv inference-optimization-moo_env
    
    # Activate virtual environment (Windows)
    inference-optimization-moo_env\Scripts\activate
    
    # Activate virtual environment (macOS/Linux)
    source inference-optimization-moo_env/bin/activate
  2. Install dependencies (includes Jupyter):

    pip install -r 01_Installer/requirements.txt
  3. Install Jupyter kernel for the virtual environment:

    python -m ipykernel install --user --name=inference-optimization-moo_env --display-name="Inference Optimization"

Option 2: Shell

Run the provided shell script to set up dependencies automatically:

./01_Installer/install.sh

Usage

Launch from project root

# From the project root directory
jupyter notebook 02_MultiobjectiveOptimization/Inference_scaling_MOO.ipynb

Using VS Code

  1. Open the file 02_MultiobjectiveOptimization/Inference_scaling_MOO.ipynb in VS Code
  2. Select the "Inference Optimization" kernel (if using virtual environment)
  3. Run cells interactively

Running the Notebook

  1. Run the cells sequentially to load model configurations and functions.

  2. Use the interactive widget at the end to select a model, adjust budget constraints (max-cost, max-time, min-accuracy), and visualize results.

  3. Key parameters:

    • selected_model: Choose from available sample models (e.g., 'gpt5', 'nvidia-nemotron-ultra-253b').
    • C_max_total: Maximum total cost in dollars.
    • T_max_total: Maximum total time in seconds.
    • acc_min: Minimum acceptable accuracy.
    • k_max: Maximum number of inferences to test.
    • mc_trials: Number of Monte Carlo trials for statistical robustness.
    • parallel_factor: Degree of parallelism (P).
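To show how these parameters interact, here is a minimal Monte Carlo sketch. It assumes a deliberately simple model (best-of-k success, cost linear in k, time in batches of P parallel calls), which is not the notebook's actual model in inference_scaling.py; all names and numbers are illustrative.

```python
import math
import random

def simulate(k, cost_per_call, latency, p_correct, c_max, t_max, acc_min,
             parallel_factor=4, mc_trials=2000, rng=None):
    """Estimate accuracy, cost, and wall-clock time for k inference calls,
    then check the result against all three budget constraints."""
    rng = rng or random.Random(0)
    # Best-of-k: a trial succeeds if any of the k calls is correct.
    hits = sum(
        any(rng.random() < p_correct for _ in range(k))
        for _ in range(mc_trials)
    )
    acc = hits / mc_trials
    cost = k * cost_per_call                         # cost scales with k
    time = math.ceil(k / parallel_factor) * latency  # batches of P calls
    feasible = cost <= c_max and time <= t_max and acc >= acc_min
    return acc, cost, time, feasible

for k in (1, 4, 8, 16):
    acc, cost, time, ok = simulate(
        k, cost_per_call=0.02, latency=3.0, p_correct=0.6,
        c_max=0.25, t_max=12.0, acc_min=0.9)
    print(f"k={k:2d}  acc≈{acc:.3f}  cost=${cost:.2f}  "
          f"time={time:.0f}s  feasible={ok}")
```

In this toy model, small k fails the accuracy floor, large k blows the cost budget, and the feasible region sits in between, which is exactly the trade-off the notebook explores.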

Note: If you encounter import errors, make sure to launch Jupyter from the 02_MultiobjectiveOptimization directory where the inference_scaling.py file is located.

Output

  • 3D Feasible Cube Plot: Visualizes the trade-offs in 3D space with constraint planes, Monte Carlo trajectories, and the optimal points (which point is optimal depends on the chosen priority).
  • Accuracy vs. k Plot: Shows how accuracy improves with more inferences.
  • Total Cost vs. k Plot: Displays cost scaling with k.
  • Text Summary: Prints optimal k values and metrics for each method.
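The knee of the accuracy-vs-k curve can be found with a common heuristic: the point farthest from the chord joining the curve's endpoints. This is a generic sketch of that technique; the repository's detection method may differ.

```python
import numpy as np

def knee_point(ks, accs):
    """Return the k whose point on the accuracy-vs-k curve lies farthest
    from the straight line between the first and last points."""
    ks = np.asarray(ks, dtype=float)
    accs = np.asarray(accs, dtype=float)
    # Normalize both axes so distances are comparable (assumes the curve
    # is non-constant, i.e. accs[-1] != accs[0]).
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (accs - accs[0]) / (accs[-1] - accs[0])
    # Distance from the line through (0,0) and (1,1) is |x - y| / sqrt(2),
    # so the argmax of |x - y| suffices.
    return int(ks[np.argmax(np.abs(x - y))])

ks = list(range(1, 9))
accs = [1 - 0.5 ** k for k in ks]   # saturating accuracy curve
print(knee_point(ks, accs))         # → 3
```

After k = 3 the curve flattens, so additional inferences buy little accuracy per unit of cost and time, which is what the knee captures.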

Contributors

Thanks to Nashua Springberry (Cloudera) and Michael Schuler (Cloudera) for their constructive comments on the design and implementation of the simulation.

Citation

@misc{jung2025optimizeinference,
      title={3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency}, 
      author={Minseok Jung and Abhas Ricky and Muhammad Rameez Chatni},
      year={2025},
      eprint={2510.18905},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.18905}, 
}

License

This project is licensed under the Apache License 2.0 — see the LICENSE file for details.
