Snuffy2 (Contributor) commented Nov 22, 2025

Leaving this as a draft because it depends on #96. Once #96 is merged, I will rebase and mark it ready to merge.


This pull request introduces support for just-in-time (JIT) model loading and automatic model unloading after periods of inactivity, enhancing VRAM management and server responsiveness. It refactors the API endpoint logic to be aware of JIT and auto-unload states, improves error handling, and adds stricter type checks for handler selection. Additionally, it updates the CLI and documentation to expose and explain the new features.

JIT Loading & Auto-Unload Support

  • Added JIT loading and idle auto-unload, with documentation and CLI flags (--jit, --auto-unload-minutes) that defer model initialization until first use and reclaim VRAM when the server is idle. The /health endpoint now reports model status as "unloaded" when the model is not loaded. [1] [2] [3] [4]
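
For readers unfamiliar with the pattern, here is a minimal sketch of JIT loading with idle auto-unload. It is not the code in this PR: `LazyModelManager`, `unload_if_idle`, and the "loaded" status string are hypothetical names chosen for illustration; only the "unloaded" status comes from the description above.

```python
import threading
import time
from typing import Any, Callable, Optional


class LazyModelManager:
    """Hypothetical manager: load on first use, unload after an idle period."""

    def __init__(
        self,
        loader: Callable[[], Any],
        idle_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._idle_seconds = idle_seconds  # None disables auto-unload
        self._clock = clock
        self._lock = threading.Lock()
        self._model: Any = None
        self._loaded = False
        self._last_used = clock()

    def get(self) -> Any:
        """Return the model, loading it just in time if necessary."""
        with self._lock:
            if not self._loaded:
                self._model = self._loader()
                self._loaded = True
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been idle long enough; return True if unloaded."""
        with self._lock:
            if not self._loaded or self._idle_seconds is None:
                return False
            if self._clock() - self._last_used < self._idle_seconds:
                return False
            self._model = None  # release the reference so memory can be reclaimed
            self._loaded = False
            return True

    def status(self) -> str:
        """Status string of the kind a health endpoint could report."""
        with self._lock:
            return "loaded" if self._loaded else "unloaded"
```

In a sketch like this, a background task would call `unload_if_idle()` periodically, and a value passed via --auto-unload-minutes would be converted to seconds before being handed to the manager.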

API Endpoint Refactoring

  • Introduced _get_handler_or_error helper for consistent handler retrieval and error reporting, making endpoints aware of JIT and auto-unload states. All major endpoints now use this helper for improved reliability. [1] [2] [3] [4] [5] [6] [7] [8]
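
The idea behind a handler-or-error helper can be sketched as follows. The real signature of _get_handler_or_error is not shown in this description, so everything below (`AppState`, `ErrorResponse`, `ensure_loaded`, the 503 status code) is an assumption used only to illustrate the shape of such a helper.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class ErrorResponse:
    """Hypothetical error payload an endpoint would return to the client."""
    status_code: int
    message: str


@dataclass
class AppState:
    """Hypothetical server state: an eager handler or a lazy manager."""
    handler: Optional[Any] = None
    manager: Optional[Any] = None  # any object exposing ensure_loaded() -> handler


def get_handler_or_error(
    state: AppState, reason: str
) -> Tuple[Optional[Any], Optional[ErrorResponse]]:
    """Return (handler, None) on success or (None, error) on failure."""
    if state.manager is not None:
        # JIT / auto-unload path: the model may need to be (re)loaded first.
        try:
            return state.manager.ensure_loaded(), None
        except Exception as exc:
            return None, ErrorResponse(503, f"Failed to load model for {reason}: {exc}")
    if state.handler is None:
        return None, ErrorResponse(503, "Model handler not initialized")
    return state.handler, None
```

Each endpoint would then call the helper once and return the error immediately when one is present, which keeps the loading and error-reporting logic in a single place.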

Type Safety & Error Handling

  • Added stricter type checks for handler selection in embeddings and audio_transcriptions endpoints, returning clear errors if the wrong model type is used. [1] [2]
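
A type check of this kind might look like the sketch below. The handler class names and `check_handler_type` are placeholders, not the classes used in this repository.

```python
from typing import Any, Optional, Type


class EmbeddingsHandler:
    """Placeholder for a handler that serves embeddings requests."""


class TranscriptionHandler:
    """Placeholder for a handler that serves audio transcription requests."""


def check_handler_type(handler: Any, expected: Type[Any], endpoint: str) -> Optional[str]:
    """Return None if the handler matches, otherwise a clear error message."""
    if isinstance(handler, expected):
        return None
    return (
        f"The {endpoint} endpoint requires a {expected.__name__}, "
        f"but the loaded model uses {type(handler).__name__}"
    )
```

For example, `check_handler_type(TranscriptionHandler(), EmbeddingsHandler, "embeddings")` returns a message naming both the expected and the actual handler type, which an endpoint could send back as a client error.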

Streaming Response Improvements

  • Improved tool call chunk indexing and ID assignment for streaming chat completions, ensuring correct association and handling of tool calls in streamed responses. [1] [2] [3] [4]
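
To illustrate what correct indexing and ID assignment mean for streamed tool calls, here is a small sketch following the OpenAI-style chunk layout, where each tool call keeps a stable index and only its first chunk carries the id, type, and function name. `ToolCallChunker` and its `key` argument are hypothetical and are not taken from this PR.

```python
import uuid
from typing import Any, Dict, Optional


class ToolCallChunker:
    """Hypothetical helper that builds tool_calls delta entries for streaming."""

    def __init__(self) -> None:
        self._index_by_key: Dict[str, int] = {}
        self._id_by_key: Dict[str, str] = {}

    def delta(self, key: str, name: Optional[str] = None, arguments: str = "") -> Dict[str, Any]:
        """Build one delta entry; `key` identifies the tool call on the parser side."""
        first = key not in self._index_by_key
        if first:
            # A new tool call gets the next free index and a fresh ID.
            self._index_by_key[key] = len(self._index_by_key)
            self._id_by_key[key] = f"call_{uuid.uuid4().hex}"
        entry: Dict[str, Any] = {
            "index": self._index_by_key[key],
            "function": {"arguments": arguments},
        }
        if first:
            # Only the first chunk of a tool call carries id, type, and name.
            entry["id"] = self._id_by_key[key]
            entry["type"] = "function"
            entry["function"]["name"] = name
        return entry
```

A client can then concatenate the `arguments` fragments that share an index to rebuild each call, even when several tool calls are interleaved in one stream.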

CLI & Documentation Enhancements

  • Refined UpperChoice class for canonical option normalization and updated related CLI help text for clarity. [1] [2]
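
As a rough illustration of canonical option normalization, the sketch below uses the standard library's argparse rather than the project's UpperChoice class, whose implementation is not shown here. The --jit and --auto-unload-minutes flags come from this description; their types and defaults, and the --log-level option and its choices, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI sketch; not the project's actual command-line interface."""
    parser = argparse.ArgumentParser(description="Server launch options (sketch)")
    parser.add_argument(
        "--jit",
        action="store_true",
        help="Defer model loading until the first request.",
    )
    parser.add_argument(
        "--auto-unload-minutes",
        type=int,
        default=None,
        help="Unload the model after this many idle minutes.",
    )
    parser.add_argument(
        "--log-level",
        type=str.upper,  # normalize input before it is compared with choices
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",
        help="Log level; accepted in any letter case.",
    )
    return parser
```

Because argparse applies `type` before validating `choices`, a value such as `debug` is stored in its canonical form `DEBUG`.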
