Efficient AI inference scaling is essential for practical deployment. Instead of relying on static scaling heuristics or simple bivariate trade-offs between performance and compute, inference scaling should account for several factors at once, such as cost, latency, and accuracy. This project models inference scaling as a multi-objective optimization (MOO) problem and simulates it in both 3D and 2D objective spaces. Sample output is shown below:
[Figure: six sample output plots; images not preserved]
A feasible space shaped by 3D constraints captures values that a 2D space fails to account for.
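
To make the 2D-versus-3D distinction concrete, the sketch below checks the same candidate configurations against a two-constraint box (cost, accuracy) and a three-constraint box (cost, time, accuracy). The candidate values, budgets, and the names `Candidate`, `feasible_2d`, and `feasible_3d` are illustrative assumptions and are not taken from the notebook.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One hypothetical operating point (all values are made up for illustration)."""
    k: int           # number of inferences
    cost: float      # total cost in dollars
    time: float      # total time in seconds
    accuracy: float  # expected accuracy in [0, 1]


def feasible_2d(c: Candidate, max_cost: float, min_acc: float) -> bool:
    """Check only the cost and accuracy constraints."""
    return c.cost <= max_cost and c.accuracy >= min_acc


def feasible_3d(c: Candidate, max_cost: float, max_time: float, min_acc: float) -> bool:
    """Check the cost, time, and accuracy constraints together."""
    return feasible_2d(c, max_cost, min_acc) and c.time <= max_time


candidates = [
    Candidate(k=1, cost=0.02, time=2.0, accuracy=0.70),
    Candidate(k=4, cost=0.08, time=8.0, accuracy=0.85),
    Candidate(k=8, cost=0.16, time=16.0, accuracy=0.90),
]

ok_2d = [c.k for c in candidates if feasible_2d(c, max_cost=0.20, min_acc=0.80)]
ok_3d = [c.k for c in candidates
         if feasible_3d(c, max_cost=0.20, max_time=10.0, min_acc=0.80)]

print("feasible in 2D:", ok_2d)
print("feasible in 3D:", ok_3d)
```

In this toy data the 2D check accepts `k=8`, while the 3D check rejects it because it exceeds the time budget. A cost-accuracy view alone cannot reveal that kind of violation.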

This repository contains a Jupyter notebook that implements Monte Carlo simulations for optimizing inference scaling in AI models. It explores trade-offs between cost, time, and accuracy across various pre-configured models.
inference-optimization-moo/
├── 01_Installer/
│ ├── install.sh
│ └── requirements.txt
├── 02_MultiobjectiveOptimization/
│ ├── __init__.py
│ ├── Inference_scaling_MOO.ipynb
│ ├── inference_scaling.py
│ └── __pycache__/
├── .project-metadata.yaml
├── pyproject.toml
└── README.md
- Model Configurations: Pre-defined settings for multiple AI models (e.g., GPT-5 variants, Nvidia Nemotron, Qwen3 series), including cost per token, latency, and accuracy distributions.
- Monte Carlo Simulations: Statistical estimation of performance metrics with configurable trial counts and parallelization factors.
- Optimization Methods:
  - Maximal Accuracy selection
  - Maximal Cube selection
  - Pareto frontiers and utopia-closest point
  - Pareto frontiers and knee point detection
- Interactive Visualizations: 3D feasible space plots
- Constraints: Feasibility checks based on total cost and time budgets, and minimal accuracy requirements.
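
As a rough illustration of the Pareto frontier and utopia-closest selection listed above, the following sketch filters non-dominated (cost, time, accuracy) points and picks the one nearest the ideal corner. The data, the min-max normalization, the Euclidean distance, and the function names are assumptions made for this example; the notebook's own implementation may differ.

```python
import math


def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    Points are (cost, time, accuracy): lower cost/time and higher accuracy are better.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def utopia_closest(front):
    """Pick the front point nearest the utopia point after min-max normalization."""
    lo = [min(p[i] for p in front) for i in range(3)]
    hi = [max(p[i] for p in front) for i in range(3)]

    def norm(p, i):
        span = hi[i] - lo[i]
        return 0.0 if span == 0 else (p[i] - lo[i]) / span

    def dist(p):
        # Utopia in normalized space: cost 0, time 0, accuracy 1.
        return math.sqrt(norm(p, 0) ** 2 + norm(p, 1) ** 2 + (1.0 - norm(p, 2)) ** 2)

    return min(front, key=dist)


# Hypothetical (cost in $, time in s, accuracy) points, one per configuration.
points = [
    (0.02, 2.0, 0.70),
    (0.04, 4.0, 0.80),
    (0.08, 8.0, 0.86),
    (0.16, 16.0, 0.88),
    (0.10, 9.0, 0.82),  # dominated by (0.08, 8.0, 0.86)
]

front = pareto_front(points)
best = utopia_closest(front)
print("Pareto front:", front)
print("utopia-closest point:", best)
```

Normalizing each axis before measuring distance keeps one objective's units (dollars versus seconds) from dominating the choice. A weighted distance would be a natural variation when one objective has priority.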
- Python 3.9+
- Jupyter Notebook
- Libraries: numpy, matplotlib, ipywidgets, mpl_toolkits.mplot3d, pyyaml
- 01_Installer/: Contains installation scripts (install.sh), dependency list (requirements.txt), and packaging configuration (pyproject.toml).
- 02_MultiobjectiveOptimization/: Core Python module (inference_scaling.py) and the interactive Jupyter notebook (Inference_scaling_MOO.ipynb).
- Create and activate a virtual environment:

      # Create virtual environment
      python -m venv inference-optimization-moo_env

      # Activate virtual environment (Windows)
      inference-optimization-moo_env\Scripts\activate

      # Activate virtual environment (macOS/Linux)
      source inference-optimization-moo_env/bin/activate

- Install dependencies (includes Jupyter):

      pip install -r 01_Installer/requirements.txt

- Install Jupyter kernel for the virtual environment:

      python -m ipykernel install --user --name=inference-optimization-moo_env --display-name="Inference Optimization"

Alternatively, run the provided shell script to set up dependencies automatically:

    ./01_Installer/install.sh

To open the notebook with Jupyter:

    # From the project root directory
    jupyter notebook 02_MultiobjectiveOptimization/Inference_scaling_MOO.ipynb

To open the notebook in VS Code:

- Open the file 02_MultiobjectiveOptimization/Inference_scaling_MOO.ipynb in VS Code
- Select the "Inference Optimization" kernel (if using virtual environment)
- Run cells interactively

- Run the cells sequentially to load model configurations and functions.
- Use the interactive widget at the end to select a model, adjust budget constraints (max-cost, max-time, min-accuracy), and visualize results.
- Key parameters:
  - selected_model: Choose from available sample models (e.g., 'gpt5', 'nvidia-nemotron-ultra-253b').
  - C_max_total: Maximum total cost in dollars.
  - T_max_total: Maximum total time in seconds.
  - acc_min: Minimum acceptable accuracy.
  - k_max: Maximum number of inferences to test.
  - mc_trials: Number of Monte Carlo trials for statistical robustness.
  - parallel_factor: Degree of parallelism (P).

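
The sketch below shows one way the parameters above could interact in a Monte Carlo feasibility sweep over k. It is not the API of inference_scaling.py: the functions `simulate` and `feasible_ks`, the per-call inputs `p_single`, `cost_per_call`, and `latency_s`, the best-of-k success rule, and the batch-based timing model are all assumptions chosen to keep the example small.

```python
import math
import random


def simulate(k, p_single, cost_per_call, latency_s, parallel_factor, mc_trials, seed=0):
    """Estimate (accuracy, total_cost, total_time) when k inferences are run per query.

    Assumption: a query counts as solved if at least one of the k independent
    samples is correct (a best-of-k rule with a perfect verifier).
    """
    rng = random.Random(seed)
    solved = 0
    for _ in range(mc_trials):
        if any(rng.random() < p_single for _ in range(k)):
            solved += 1
    accuracy = solved / mc_trials
    total_cost = k * cost_per_call
    # Assumption: calls run in batches of size parallel_factor, one latency per batch.
    total_time = math.ceil(k / parallel_factor) * latency_s
    return accuracy, total_cost, total_time


def feasible_ks(k_max, C_max_total, T_max_total, acc_min, **model):
    """Return every k in 1..k_max that satisfies all three constraints."""
    result = []
    for k in range(1, k_max + 1):
        acc, cost, time_s = simulate(k, **model)
        if cost <= C_max_total and time_s <= T_max_total and acc >= acc_min:
            result.append(k)
    return result


# Hypothetical model settings; real values would come from the model configurations.
model = dict(p_single=0.5, cost_per_call=0.01, latency_s=2.0,
             parallel_factor=2, mc_trials=5000)

feasible = feasible_ks(k_max=10, C_max_total=0.065, T_max_total=6.0,
                       acc_min=0.8, **model)
print("feasible k values:", feasible)
```

Under these assumptions, raising `parallel_factor` relaxes only the time constraint, while cost still grows linearly with k.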
Note: If you encounter import errors, make sure to launch Jupyter from the 02_MultiobjectiveOptimization directory where the inference_scaling.py file is located.
- 3D Feasible Cube Plot: Visualizes the trade-off in 3D space with constraint planes, MC trajectories, and optimal points (the optimal point can differ depending on which objective is prioritized).
- Accuracy vs. k Plot: Shows how accuracy improves with more inferences.
- Total Cost vs. k Plot: Displays cost scaling with k.
- Text Summary: Prints optimal k values and metrics for each method.
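
For intuition about how a knee point might be read off an accuracy-versus-k curve like the one plotted above, here is a minimal sketch. It uses a common heuristic: normalize both axes, then pick the point farthest from the straight line joining the curve's endpoints. This heuristic, the function name `knee_index`, and the synthetic curve `1 - 0.5**k` are assumptions for illustration; the notebook may detect the knee differently.

```python
def knee_index(xs, ys):
    """Return the index of the point farthest from the chord joining the endpoints.

    Both axes are rescaled to [0, 1] using the first and last points, so the
    curve is assumed to be monotone between its endpoints.
    """
    x_span = xs[-1] - xs[0]
    y_span = ys[-1] - ys[0]
    if x_span == 0 or y_span == 0:
        raise ValueError("curve must vary along both axes")

    def gap(i):
        xn = (xs[i] - xs[0]) / x_span
        yn = (ys[i] - ys[0]) / y_span
        # After normalization the chord is y = x, so |yn - xn| is
        # proportional to the perpendicular distance from the chord.
        return abs(yn - xn)

    return max(range(len(xs)), key=gap)


# Synthetic accuracy curve with diminishing returns (illustrative only).
ks = list(range(1, 9))
accs = [1 - 0.5 ** k for k in ks]

knee_k = ks[knee_index(ks, accs)]
print("knee at k =", knee_k)
```

Past the knee, each extra inference still adds cost at the same rate but buys progressively less accuracy, which is why the knee is a reasonable default operating point when no objective has explicit priority.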
Thanks to Nashua Springberry (Cloudera) and Michael Schuler (Cloudera) for constructive comments on the design and programming for the simulation.
@misc{jung2025optimizeinference,
title={3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency},
author={Minseok Jung and Abhas Ricky and Muhammad Rameez Chatni},
year={2025},
eprint={2510.18905},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.18905},
}
This project is licensed under the Apache License 2.0 — see the LICENSE file for details.






