NVIDIA · JunyiXu-nv · Nov 27, 2025 · Nov 28, 2025 · LinPoly · Dec 1, 2025
@@ -0,0 +1,75 @@
+# OpenAI API Compatibility Examples
+
+This directory contains individual, self-contained examples demonstrating TensorRT-LLM's OpenAI API compatibility. Examples are organized by API endpoint.
+
+## Prerequisites
+
+1. **Start the trtllm-serve server:**
+   ```bash
+   trtllm-serve meta-llama/Llama-3.1-8B-Instruct
+   ```
+
+   For reasoning models or models with tool-calling ability, specify `--reasoning_parser` and `--tool_parser`, e.g.:
+
+   ```bash
+   trtllm-serve Qwen/Qwen3-8B --reasoning_parser "qwen3" --tool_parser "qwen3"
+   ```
+
+
+## Running Examples
+
+Each example is a standalone Python script. Run from the example's directory:
+
+```bash
+# From chat_completions directory
+cd chat_completions
+python example_01_basic_chat.py
+```
+
+Or run with full path from the repository root:
+
+```bash
+python examples/serve/compatibility/chat_completions/example_01_basic_chat.py
+```
+
+### 📋 Complete Example List
+
+All examples demonstrate the `/v1/chat/completions` endpoint:
+
+| Example | File | Description |
+|---------|------|-------------|
+| **01** | `example_01_basic_chat.py` | Basic non-streaming chat completion |
+| **02** | `example_02_streaming_chat.py` | Streaming responses with real-time delivery |
+| **03** | `example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
+| **04** | `example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
+| **05** | `example_05_json_mode.py` | Structured output with JSON schema |
+| **06** | `example_06_tool_calling.py` | Function/tool calling with tool definitions |
+| **07** | `example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+
+## Configuration
+
+All examples use these default settings:
+
+```python
+base_url = "http://localhost:8000/v1"
+api_key = "tensorrt_llm"  # Can be any string
+```
+
+To use a different server:
+
+```python
+client = OpenAI(
+    base_url="http://YOUR_SERVER:PORT/v1",
+    api_key="your_key",
+)
+```
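One way to point the examples at a different server without editing each script is to read the endpoint from the environment. This is a minimal sketch, not part of the examples: the variable names `TRTLLM_BASE_URL` and `TRTLLM_API_KEY` are hypothetical and are not read by `trtllm-serve` or by the scripts in this directory.

```python
import os


def client_settings(env=None):
    """Return keyword arguments for OpenAI(...), falling back to the README defaults.

    TRTLLM_BASE_URL and TRTLLM_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("TRTLLM_BASE_URL", "http://localhost:8000/v1"),
        "api_key": env.get("TRTLLM_API_KEY", "tensorrt_llm"),
    }


# Usage (requires the openai package and a running server):
#   from openai import OpenAI
#   client = OpenAI(**client_settings())
print(client_settings({}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.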
+
+## Model Requirements
+
+Some examples require specific model capabilities:
+
+| Example | Model Requirement |
+|---------|------------------|
+| 05 (JSON Mode) | xgrammar support |
+| 06 (Tool Calling) | Tool-capable model served with a matching `--tool_parser` (e.g. Qwen3, gpt_oss) |
+| Others | Any model |
@@ -0,0 +1,100 @@
+# Chat Completions API Examples
+
+Examples for the `/v1/chat/completions` endpoint - the most versatile API for conversational AI.
+
+## Quick Start
+
+```bash
+# Run the basic example
+python example_01_basic_chat.py
+```
+
+## Examples Overview
+
+### Basic Examples
+
+1. **`example_01_basic_chat.py`** - Start here!
+   - Simple request/response
+   - Shows token usage
+   - Non-streaming mode
+
+2. **`example_02_streaming_chat.py`** - Real-time responses
+   - Stream tokens as generated
+   - Better UX for long responses
+   - Server-Sent Events (SSE)
+
+3. **`example_03_multi_turn_conversation.py`** - Context management
+   - Multiple conversation turns
+   - Conversation history
+   - Follow-up questions
+
+4. **`example_04_streaming_with_usage.py`** - Streaming + metrics
+   - Continuous token counts
+   - `stream_options` parameter
+   - Monitor resource usage
+
+### Advanced Examples
+
+5. **`example_05_json_mode.py`** - Structured output
+   - JSON schema validation
+   - Structured data extraction
+   - Requires xgrammar
+
+6. **`example_06_tool_calling.py`** - Function calling
+   - External tool integration
+   - Function definitions
+   - Requires compatible model (Qwen3, gpt_oss)
+
+7. **`example_07_advanced_sampling.py`** - Fine-tuned control
+   - `top_k`, `repetition_penalty`
+   - Custom stop sequences
+   - TensorRT-LLM extensions
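The OpenAI Python client only accepts the standard parameters as keyword arguments, so non-standard fields such as `top_k` and `repetition_penalty` are usually sent through its `extra_body` argument. The sketch below only assembles the request; the specific values are illustrative assumptions, and which extension fields the server accepts depends on the TensorRT-LLM version.

```python
def build_sampling_request(model, prompt):
    """Assemble keyword arguments for client.chat.completions.create(...)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
        # Standard OpenAI parameter: generation stops at the first match
        "stop": ["\n\n"],
        # Non-standard fields are merged into the JSON request body by the client
        "extra_body": {"top_k": 50, "repetition_penalty": 1.1},
    }


# Usage (requires the openai package and a running server):
#   response = client.chat.completions.create(**build_sampling_request(model, "Hi"))
request = build_sampling_request("my-model", "Tell me a story.")
print(sorted(request))
```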
+
+## Key Concepts
+
+### Non-Streaming vs Streaming
+
+**Non-Streaming** (`stream=False`):
+- Wait for complete response
+- Single response object
+- Simple to use
+
+**Streaming** (`stream=True`):
+- Tokens delivered as generated
+- Better perceived latency
+- Server-Sent Events (SSE)
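With streaming, the client has to reassemble the reply from the per-chunk deltas. The sketch below shows one way to do that; `fake_chunk` is a hypothetical stand-in that mimics the shape of a streamed chunk so the logic can run without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed deltas into the full text and return it with the finish reason."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        if not chunk.choices:  # for example, a usage-only chunk
            continue
        choice = chunk.choices[0]
        text = getattr(choice.delta, "content", None)
        if text:
            parts.append(text)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def fake_chunk(text=None, finish_reason=None):
    """Build a stand-in for one streamed chunk (hypothetical helper for this sketch)."""
    delta = SimpleNamespace(content=text)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


chunks = [
    fake_chunk("Hel"),
    fake_chunk("lo"),
    SimpleNamespace(choices=[]),  # chunk without choices is skipped
    fake_chunk(None, "stop"),
]
print(collect_stream(chunks))  # ('Hello', 'stop')
```

With a real server, the iterable returned by `client.chat.completions.create(..., stream=True)` would take the place of `chunks`.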
+
+### Conversation Context
+
+Messages accumulate in the `messages` array:
+```python
+messages = [
+    {"role": "system", "content": "You are helpful."},
+    {"role": "user", "content": "Hello"},
+    {"role": "assistant", "content": "Hi there!"},
+    {"role": "user", "content": "How are you?"},  # Next turn
+]
+```
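The accumulation pattern above can be sketched as a small loop that appends each user turn and each reply to the same list. `get_reply` is a hypothetical callable standing in for the call to `client.chat.completions.create`; here it is replaced by a stub so the sketch runs without a server.

```python
def run_conversation(user_turns, get_reply, system_prompt="You are helpful."):
    """Feed user turns one at a time, keeping the full history in `messages`."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text in user_turns:
        messages.append({"role": "user", "content": user_text})
        # The model sees the whole history, including earlier replies
        reply = get_reply(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages


def echo_reply(messages):
    """Stand-in for the model: reports how many messages it was sent."""
    return f"I received {len(messages)} messages."


history = run_conversation(["Hello", "How are you?"], echo_reply)
for message in history:
    print(f"{message['role']}: {message['content']}")
```

Because the history grows with every turn, the prompt token count grows too, which is worth keeping in mind for long conversations.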
+
+### Tool Calling
+
+Define functions the model can call:
+```python
+tools = [{
+    "type": "function",
+    "function": {
+        "name": "get_weather",
+        "parameters": {...}
+    }
+}]
+```
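A fuller version of the definition above, together with the client-side step of executing a tool call, might look like the sketch below. In the OpenAI format, the arguments of a tool call arrive as a JSON string and the result is sent back as a `tool` message. The `get_weather` implementation, its `city` parameter, and `TOOL_REGISTRY` are hypothetical and chosen only for illustration.

```python
import json


def get_weather(city):
    """Hypothetical tool implementation returning canned data."""
    return {"city": city, "forecast": "sunny"}


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the weather forecast for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names from the definition to local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call_id, name, arguments_json):
    """Execute one tool call and build the `tool` message to append to `messages`."""
    arguments = json.loads(arguments_json)
    result = TOOL_REGISTRY[name](**arguments)
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}


print(run_tool_call("call_0", "get_weather", '{"city": "Paris"}'))
```

With a real response, `tool_call_id`, `name`, and `arguments_json` would come from `tool_call.id`, `tool_call.function.name`, and `tool_call.function.arguments` on each entry of `response.choices[0].message.tool_calls`.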
+
+## Model Requirements
+
+| Feature | Requirement |
+|---------|-------------|
+| Basic chat | Any model |
+| Streaming | Any model |
+| Multi-turn | Any model |
+| JSON mode | xgrammar support |
+| Tool calling | Compatible model (Qwen3, gpt_oss) |
@@ -0,0 +1,58 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 1: Basic Non-Streaming Chat Completion.
+
+Demonstrates a simple chat completion request with the OpenAI-compatible API.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 1: Basic Non-Streaming Chat Completion")
+print("=" * 80)
+print()
+
+# Create a simple chat completion
+response = client.chat.completions.create(
+    model=model,
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is the capital of France?"},
+    ],
+    max_tokens=4096,
+    temperature=0.7,
+)
+
+# Print the response
+print("Response:")
+print(f"Content: {response.choices[0].message.content}")
+print(f"Finish reason: {response.choices[0].finish_reason}")
+print(
+    f"Tokens used: {response.usage.total_tokens} "
+    f"(prompt: {response.usage.prompt_tokens}, "
+    f"completion: {response.usage.completion_tokens})"
+)
@@ -0,0 +1,76 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 2: Streaming Chat Completion.
+
+Demonstrates streaming responses with real-time token delivery.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 2: Streaming Chat Completion")
+print("=" * 80)
+print()
+
+print("Prompt: Write a haiku about artificial intelligence\n")
+
+# Create a streaming chat completion
+stream = client.chat.completions.create(
+    model=model,
+    messages=[{"role": "user", "content": "Write a haiku about artificial intelligence"}],
+    max_tokens=4096,
+    temperature=0.8,
+    stream=True,
+)
+
+# Print tokens as they arrive
+print("Response (streaming):")
+print("Assistant: ", end="", flush=True)
+
+current_state = "none"
+finish_reason = None
+for chunk in stream:
+    # Skip chunks that carry no choices (for example, a usage-only chunk)
+    if not chunk.choices:
+        continue
+    choice = chunk.choices[0]
+    # `reasoning_content` is a non-standard field and may be absent from the delta
+    content = getattr(choice.delta, "content", None)
+    reasoning_content = getattr(choice.delta, "reasoning_content", None)
+    if content:
+        if current_state != "content":
+            print("Content: ", end="", flush=True)
+            current_state = "content"
+
+        print(content, end="", flush=True)
+
+    if reasoning_content:
+        if current_state != "reasoning_content":
+            print("Reasoning: ", end="", flush=True)
+            current_state = "reasoning_content"
+
+        print(reasoning_content, end="", flush=True)
+
+    # Remember the finish reason instead of relying on the last chunk seen
+    if choice.finish_reason is not None:
+        finish_reason = choice.finish_reason
+print("\n")
+
+print("Stop reason: ", finish_reason)
@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 3: Multi-turn Conversation.
+
+Demonstrates maintaining conversation context across multiple turns.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 3: Multi-turn Conversation")
+print("=" * 80)
+print()
+
+# Start a conversation with system message
+messages = [
+    {"role": "system", "content": "You are an expert mathematician."},
+]
+
+# First turn: User asks a question
+messages.append({"role": "user", "content": "What is 15 multiplied by 23?"})
+print("USER: What is 15 multiplied by 23?")
+
+response1 = client.chat.completions.create(
+    model=model,
+    messages=messages,
+    max_tokens=4096,
+    temperature=0,
+)
+
+assistant_reply_1 = response1.choices[0].message.content
+print(f"ASSISTANT: {assistant_reply_1}\n")
+
+# Add assistant's response to conversation history
+messages.append({"role": "assistant", "content": assistant_reply_1})
+
+# Second turn: User asks a follow-up question
+messages.append({"role": "user", "content": "Now divide that result by 5"})
+print("USER: Now divide that result by 5")
+
+response2 = client.chat.completions.create(
+    model=model,
+    messages=messages,
+    max_tokens=4096,
+    temperature=0,
+)
+
+assistant_reply_2 = response2.choices[0].message.content
+print(f"ASSISTANT: {assistant_reply_2}")