diff --git a/examples/serve/compatibility/README.md b/examples/serve/compatibility/README.md
new file mode 100644
index 00000000000..c21ec79afe5
--- /dev/null
+++ b/examples/serve/compatibility/README.md
@@ -0,0 +1,75 @@
+# OpenAI API Compatibility Examples
+
+This directory contains individual, self-contained examples demonstrating TensorRT-LLM's OpenAI API compatibility. Examples are organized by API endpoint.
+
+## Prerequisites
+
+1. **Start the trtllm-serve server:**
+
+   ```bash
+   trtllm-serve meta-llama/Llama-3.1-8B-Instruct
+   ```
+
+   For reasoning models or models with tool-calling ability, also specify `--reasoning_parser` and `--tool_parser`, e.g.:
+
+   ```bash
+   trtllm-serve Qwen/Qwen3-8B --reasoning_parser "qwen3" --tool_parser "qwen3"
+   ```
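+
+2. **Optionally, verify the server is reachable.** A minimal sketch that lists the served models, using the same client settings as the examples:
+
+   ```python
+   from openai import OpenAI
+
+   # Same defaults as the examples below; adjust if your server differs.
+   client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
+   print([m.id for m in client.models.list().data])
+   ```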
+
+## Running Examples
+
+Each example is a standalone Python script. Run it from the example's directory:
+
+```bash
+# From the chat_completions directory
+cd chat_completions
+python example_01_basic_chat.py
+```
+
+Or run it with the full path from the repository root:
+
+```bash
+python examples/serve/compatibility/chat_completions/example_01_basic_chat.py
+```
+
+### 📋 Complete Example List
+
+All examples demonstrate the `/v1/chat/completions` endpoint:
+
+| Example | File | Description |
+|---------|------|-------------|
+| **01** | `example_01_basic_chat.py` | Basic non-streaming chat completion |
+| **02** | `example_02_streaming_chat.py` | Streaming responses with real-time delivery |
+| **03** | `example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
+| **04** | `example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
+| **05** | `example_05_json_mode.py` | Structured output with a JSON schema |
+| **06** | `example_06_tool_calling.py` | Function/tool calling |
+| **07** | `example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+
+## Configuration
+
+All examples use these default settings:
+
+```python
+base_url = "http://localhost:8000/v1"
+api_key = "tensorrt_llm"  # Can be any string
+```
+
+To use a different server:
+
+```python
+client = OpenAI(
+    base_url="http://YOUR_SERVER:PORT/v1",
+    api_key="your_key",
+)
+```
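+
+If you switch between servers often, one option is to read these settings from environment variables instead of editing each script; a small sketch (the variable names are illustrative and are not read by the example scripts):
+
+```python
+import os
+
+from openai import OpenAI
+
+client = OpenAI(
+    base_url=os.environ.get("TRTLLM_BASE_URL", "http://localhost:8000/v1"),
+    api_key=os.environ.get("TRTLLM_API_KEY", "tensorrt_llm"),
+)
+```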
+
+## Model Requirements
+
+Some examples require specific model capabilities:
+
+| Example | Model Requirement |
+|---------|-------------------|
+| 05 (JSON Mode) | xgrammar support |
+| 06 (Tool Calling) | Tool-capable model (e.g., Llama 3.1+, Mistral Instruct) |
+| Others | Any model |
diff --git a/examples/serve/compatibility/chat_completions/README.md b/examples/serve/compatibility/chat_completions/README.md
new file mode 100644
index 00000000000..58695dceca3
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/README.md
@@ -0,0 +1,100 @@
+# Chat Completions API Examples
+
+Examples for the `/v1/chat/completions` endpoint - the most versatile API for conversational AI.
+
+## Quick Start
+
+```bash
+# Run the basic example
+python example_01_basic_chat.py
+```
+
+## Examples Overview
+
+### Basic Examples
+
+1. **`example_01_basic_chat.py`** - Start here!
+   - Simple request/response
+   - Shows token usage
+   - Non-streaming mode
+
+2. **`example_02_streaming_chat.py`** - Real-time responses
+   - Streams tokens as they are generated
+   - Better UX for long responses
+   - Server-Sent Events (SSE)
+
+3. **`example_03_multi_turn_conversation.py`** - Context management
+   - Multiple conversation turns
+   - Conversation history
+   - Follow-up questions
+
+4. **`example_04_streaming_with_usage.py`** - Streaming + metrics
+   - Continuous token counts
+   - `stream_options` parameter
+   - Monitors resource usage
+
+### Advanced Examples
+
+5. **`example_05_json_mode.py`** - Structured output
+   - JSON schema validation
+   - Structured data extraction
+   - Requires xgrammar
+
+6. **`example_06_tool_calling.py`** - Function calling
+   - External tool integration
+   - Function definitions
+   - Requires a compatible model (e.g., Qwen3, gpt_oss)
+
+7. **`example_07_advanced_sampling.py`** - Fine-grained control
+   - `top_k`, `repetition_penalty`
+   - Custom stop sequences
+   - TensorRT-LLM extensions
+
+## Key Concepts
+
+### Non-Streaming vs Streaming
+
+**Non-Streaming** (`stream=False`):
+- Waits for the complete response
+- Single response object
+- Simple to use
+
+**Streaming** (`stream=True`):
+- Tokens delivered as they are generated
+- Better perceived latency
+- Server-Sent Events (SSE)
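+
+A minimal sketch of both modes, assuming a `client` and `model` set up as in the examples:
+
+```python
+# Non-streaming: block until the full response is ready.
+response = client.chat.completions.create(
+    model=model,
+    messages=[{"role": "user", "content": "Hello"}],
+)
+print(response.choices[0].message.content)
+
+# Streaming: iterate over chunks as tokens are generated.
+stream = client.chat.completions.create(
+    model=model,
+    messages=[{"role": "user", "content": "Hello"}],
+    stream=True,
+)
+for chunk in stream:
+    if chunk.choices and chunk.choices[0].delta.content:
+        print(chunk.choices[0].delta.content, end="", flush=True)
+```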
+
+### Conversation Context
+
+Messages accumulate in the `messages` array:
+
+```python
+messages = [
+    {"role": "system", "content": "You are helpful."},
+    {"role": "user", "content": "Hello"},
+    {"role": "assistant", "content": "Hi there!"},
+    {"role": "user", "content": "How are you?"},  # Next turn
+]
+```
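+
+To carry context across turns, append the assistant's reply before sending the next user message; a sketch of the resulting loop (same `client`/`model` assumptions as above):
+
+```python
+messages = [{"role": "system", "content": "You are helpful."}]
+for user_input in ["Hello", "How are you?"]:
+    messages.append({"role": "user", "content": user_input})
+    response = client.chat.completions.create(model=model, messages=messages)
+    reply = response.choices[0].message.content
+    messages.append({"role": "assistant", "content": reply})
+    print(reply)
+```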
+
+### Tool Calling
+
+Define functions the model can call (see `example_06_tool_calling.py` for a complete definition and round trip):
+
+```python
+tools = [{
+    "type": "function",
+    "function": {
+        "name": "get_weather",
+        "parameters": {...}
+    }
+}]
+```
+
+## Model Requirements
+
+| Feature | Requirement |
+|---------|-------------|
+| Basic chat | Any model |
+| Streaming | Any model |
+| Multi-turn | Any model |
+| JSON mode | xgrammar support |
+| Tool calling | Compatible model (e.g., Qwen3, gpt_oss) |
diff --git a/examples/serve/compatibility/chat_completions/example_01_basic_chat.py b/examples/serve/compatibility/chat_completions/example_01_basic_chat.py
new file mode 100644
index 00000000000..64d3f8ab400
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/example_01_basic_chat.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 1: Basic Non-Streaming Chat Completion.
+
+Demonstrates a simple chat completion request with the OpenAI-compatible API.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 1: Basic Non-Streaming Chat Completion")
+print("=" * 80)
+print()
+
+# Create a simple chat completion
+response = client.chat.completions.create(
+    model=model,
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is the capital of France?"},
+    ],
+    max_tokens=4096,
+    temperature=0.7,
+)
+
+# Print the response
+print("Response:")
+print(f"Content: {response.choices[0].message.content}")
+print(f"Finish reason: {response.choices[0].finish_reason}")
+print(
+    f"Tokens used: {response.usage.total_tokens} "
+    f"(prompt: {response.usage.prompt_tokens}, "
+    f"completion: {response.usage.completion_tokens})"
+)
diff --git a/examples/serve/compatibility/chat_completions/example_02_streaming_chat.py b/examples/serve/compatibility/chat_completions/example_02_streaming_chat.py
new file mode 100644
index 00000000000..71343821088
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/example_02_streaming_chat.py
@@ -0,0 +1,76 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 2: Streaming Chat Completion.
+
+Demonstrates streaming responses with real-time token delivery.
+Reasoning content, when the server provides it, is printed separately.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 2: Streaming Chat Completion")
+print("=" * 80)
+print()
+
+print("Prompt: Write a haiku about artificial intelligence\n")
+
+# Create a streaming chat completion
+stream = client.chat.completions.create(
+    model=model,
+    messages=[{"role": "user", "content": "Write a haiku about artificial intelligence"}],
+    max_tokens=4096,
+    temperature=0.8,
+    stream=True,
+)
+
+# Print tokens as they arrive
+print("Response (streaming):")
+print("Assistant: ", end="", flush=True)
+
+current_state = "none"
+for chunk in stream:
+    has_content = hasattr(chunk.choices[0].delta, "content") and chunk.choices[0].delta.content
+    has_reasoning_content = (
+        hasattr(chunk.choices[0].delta, "reasoning_content")
+        and chunk.choices[0].delta.reasoning_content
+    )
+    if has_content:
+        # Label the transition from reasoning tokens to answer tokens.
+        if current_state != "content":
+            print("Content: ", end="", flush=True)
+            current_state = "content"
+
+        print(chunk.choices[0].delta.content, end="", flush=True)
+
+    if has_reasoning_content:
+        if current_state != "reasoning_content":
+            print("Reasoning: ", end="", flush=True)
+            current_state = "reasoning_content"
+
+        print(chunk.choices[0].delta.reasoning_content, end="", flush=True)
+print("\n")
+
+print(f"Finish reason: {chunk.choices[0].finish_reason}")
diff --git a/examples/serve/compatibility/chat_completions/example_03_multi_turn_conversation.py b/examples/serve/compatibility/chat_completions/example_03_multi_turn_conversation.py
new file mode 100644
index 00000000000..9edbc28faa0
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/example_03_multi_turn_conversation.py
@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 3: Multi-turn Conversation.
+
+Demonstrates maintaining conversation context across multiple turns.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 3: Multi-turn Conversation")
+print("=" * 80)
+print()
+
+# Start a conversation with a system message
+messages = [
+    {"role": "system", "content": "You are an expert mathematician."},
+]
+
+# First turn: the user asks a question
+messages.append({"role": "user", "content": "What is 15 multiplied by 23?"})
+print("USER: What is 15 multiplied by 23?")
+
+response1 = client.chat.completions.create(
+    model=model,
+    messages=messages,
+    max_tokens=4096,
+    temperature=0,
+)
+
+assistant_reply_1 = response1.choices[0].message.content
+print(f"ASSISTANT: {assistant_reply_1}\n")
+
+# Add the assistant's response to the conversation history
+messages.append({"role": "assistant", "content": assistant_reply_1})
+
+# Second turn: the user asks a follow-up question
+messages.append({"role": "user", "content": "Now divide that result by 5"})
+print("USER: Now divide that result by 5")
+
+response2 = client.chat.completions.create(
+    model=model,
+    messages=messages,
+    max_tokens=4096,
+    temperature=0,
+)
+
+assistant_reply_2 = response2.choices[0].message.content
+print(f"ASSISTANT: {assistant_reply_2}")
diff --git a/examples/serve/compatibility/chat_completions/example_04_streaming_with_usage.py b/examples/serve/compatibility/chat_completions/example_04_streaming_with_usage.py
new file mode 100644
index 00000000000..30dc20b7241
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/example_04_streaming_with_usage.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 4: Streaming with Usage Statistics.
+
+Demonstrates streaming responses with continuous token usage updates.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 4: Streaming with Usage Statistics")
+print("=" * 80)
+print()
+
+print("Request: Streaming with continuous usage stats enabled\n")
+
+# Create a streaming request with usage statistics
+stream = client.chat.completions.create(
+    model=model,
+    messages=[{"role": "user", "content": "Count from 1 to 5"}],
+    max_tokens=4096,
+    stream=True,
+    stream_options={"include_usage": True, "continuous_usage_stats": True},
+)
+
+print("Response with token counts and reasoning (if available):")
+chunk = None
+current_state = "none"
+for chunk in stream:
+    # Usage-only chunks (e.g., the final one) carry no choices.
+    if len(chunk.choices) == 0:
+        continue
+
+    has_content = hasattr(chunk.choices[0].delta, "content") and chunk.choices[0].delta.content
+    has_reasoning_content = (
+        hasattr(chunk.choices[0].delta, "reasoning_content")
+        and chunk.choices[0].delta.reasoning_content
+    )
+    if has_content:
+        if current_state != "content":
+            print("Content: ", end="", flush=True)
+            current_state = "content"
+
+        print(chunk.choices[0].delta.content, end="", flush=True)
+
+    if has_reasoning_content:
+        if current_state != "reasoning_content":
+            print("Reasoning: ", end="", flush=True)
+            current_state = "reasoning_content"
+
+        print(chunk.choices[0].delta.reasoning_content, end="", flush=True)
+print()
+
+# The last chunk seen in the loop carries the final usage statistics.
+print(
+    f"Tokens used: {chunk.usage.total_tokens} "
+    f"(prompt: {chunk.usage.prompt_tokens}, "
+    f"completion: {chunk.usage.completion_tokens})"
+)
diff --git a/examples/serve/compatibility/chat_completions/example_05_json_mode.py b/examples/serve/compatibility/chat_completions/example_05_json_mode.py
new file mode 100644
index 00000000000..6d4430e6d72
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/example_05_json_mode.py
@@ -0,0 +1,80 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 5: JSON Mode with Schema.
+
+Demonstrates structured output generation with JSON schema validation.
+
+Note: This requires xgrammar support and compatible model configuration.
+"""
+
+import json
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 5: JSON Mode with Schema")
+print("=" * 80)
+print()
+
+# Define the JSON schema
+schema = {
+    "name": "city_info",
+    "schema": {
+        "type": "object",
+        "properties": {
+            "name": {"type": "string"},
+            "country": {"type": "string"},
+            "population": {"type": "integer"},
+            "famous_for": {"type": "array", "items": {"type": "string"}},
+        },
+        "required": ["name", "country", "population"],
+    },
+}
+
+print("Request with JSON schema:")
+print(json.dumps(schema, indent=2))
+print()
+print("Note: JSON schema support requires xgrammar and compatible model configuration.\n")
+
+try:
+    # Create a chat completion constrained by the JSON schema
+    response = client.chat.completions.create(
+        model=model,
+        messages=[
+            {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
+            {"role": "user", "content": "Give me information about Tokyo."},
+        ],
+        response_format={"type": "json_schema", "json_schema": schema},
+        max_tokens=4096,
+    )
+
+    print("JSON Response:")
+    result = json.loads(response.choices[0].message.content)
+    print(json.dumps(result, indent=2))
+except Exception as e:
+    print("JSON schema support requires xgrammar and proper configuration.")
+    print(f"Error: {e}")
diff --git a/examples/serve/compatibility/chat_completions/example_06_tool_calling.py b/examples/serve/compatibility/chat_completions/example_06_tool_calling.py
new file mode 100644
index 00000000000..5f0b0289685
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/example_06_tool_calling.py
@@ -0,0 +1,120 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 6: Tool/Function Calling.
+
+Demonstrates tool calling with function definitions and responses.
+
+Note: This requires a compatible model (e.g., Llama 3.1+, Mistral Instruct).
+"""
+
+import json
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 6: Tool/Function Calling")
+print("=" * 80)
+print()
+print("Note: Tool calling requires compatible models (e.g., Llama 3.1+)\n")
+
+# Define the available tools
+tools = [
+    {
+        "type": "function",
+        "function": {
+            "name": "get_weather",
+            "description": "Get the current weather in a location",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "location": {
+                        "type": "string",
+                        "description": "City and state, e.g. San Francisco, CA",
+                    },
+                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
+                },
+                "required": ["location"],
+            },
+        },
+    }
+]
+
+
+def get_weather(location: str, unit: str = "fahrenheit") -> dict:
+    """Simulate a weather lookup; a real implementation would call an external API."""
+    return {"location": location, "temperature": 68, "unit": unit, "conditions": "sunny"}
+
+
+print("Available tools:")
+print(json.dumps(tools, indent=2))
+print("\nUser query: What is the weather in San Francisco?\n")
+
+try:
+    # Initial request with tools
+    response = client.chat.completions.create(
+        model=model,
+        messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
+        tools=tools,
+        tool_choice="auto",
+        max_tokens=4096,
+    )
+
+    message = response.choices[0].message
+
+    if message.tool_calls:
+        print("Tool calls requested:")
+        for tool_call in message.tool_calls:
+            print(f"  Function: {tool_call.function.name}")
+            print(f"  Arguments: {tool_call.function.arguments}")
+
+            # Simulate function execution
+            print("\nSimulating function execution...")
+            function_response = get_weather(**json.loads(tool_call.function.arguments))
+            print(f"Function result: {json.dumps(function_response, indent=2)}")
+
+        # Send the function result back to get the final response
+        messages = [
+            {"role": "user", "content": "What is the weather in San Francisco?"},
+            message,
+            {
+                "role": "tool",
+                "tool_call_id": message.tool_calls[0].id,
+                "content": json.dumps(function_response),
+            },
+        ]
+
+        final_response = client.chat.completions.create(
+            model=model,
+            messages=messages,
+            max_tokens=4096,
+        )
+
+        print(f"\nFinal response: {final_response.choices[0].message.content}")
+    else:
+        print(f"Direct response: {message.content}")
+except Exception as e:
+    print("Note: Tool calling requires model support (e.g., Llama 3.1+ models)")
+    print(f"Error: {e}")
diff --git a/examples/serve/compatibility/chat_completions/example_07_advanced_sampling.py b/examples/serve/compatibility/chat_completions/example_07_advanced_sampling.py
new file mode 100644
index 00000000000..ae0899b3449
--- /dev/null
+++ b/examples/serve/compatibility/chat_completions/example_07_advanced_sampling.py
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Example 7: Advanced Sampling Parameters.
+
+Demonstrates TensorRT-LLM-specific sampling parameters for fine-grained control.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 7: Advanced Sampling Parameters")
+print("=" * 80)
+print()
+
+print("Using TensorRT-LLM extended parameters:")
+print(" - top_k: 50")
+print(" - repetition_penalty: 1.1")
+print(" - min_tokens: 20")
+print(" - stop sequences: ['The End', '\\n\\n\\n']")
+print()
+
+# Create a completion with advanced sampling parameters.
+# Parameters outside the OpenAI spec are passed via extra_body.
+response = client.chat.completions.create(
+    model=model,
+    messages=[{"role": "user", "content": "Write a very short story about a robot."}],
+    max_tokens=4096,
+    temperature=0.8,
+    top_p=0.95,
+    extra_body={
+        "top_k": 50,
+        "repetition_penalty": 1.1,
+        "min_tokens": 20,
+        "stop": ["The End", "\n\n\n"],
+    },
+)
+
+print("Story:")
+print(response.choices[0].message.content)
+print(f"\nFinish reason: {response.choices[0].finish_reason}")