diff --git a/apps/site/docs/en/api.mdx b/apps/site/docs/en/api.mdx
index 945f17163..04743b9b8 100644
--- a/apps/site/docs/en/api.mdx
+++ b/apps/site/docs/en/api.mdx
@@ -27,9 +27,14 @@ In Playwright and Puppeteer, there are some common parameters:
 
 These Agents also support the following advanced configuration parameters:
 
-- `modelConfig: () => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as VQA, planning, grounding, etc.).
+- `modelConfig: (params: { intent: TIntent }) => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models in code instead of through environment variables. This is particularly useful when you need to use different models for different AI tasks (such as Insight and Planning).
 
-  **Example:**
+  The function receives a parameter object with an `intent` field indicating the current task type:
+  - `'insight'`: Visual understanding and element location tasks (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
+  - `'planning'`: Automatic planning tasks (such as `aiAct`)
+  - `'default'`: Other uncategorized tasks
+
+  **Basic Example:**
   ```typescript
   const agent = new PuppeteerAgent(page, {
     modelConfig: () => ({
@@ -41,6 +46,41 @@ These Agents also support the following advanced configuration parameters:
     });
   ```
 
+  **Configure different models for different task types:**
+  ```typescript
+  const agent = new PuppeteerAgent(page, {
+    modelConfig: ({ intent }) => {
+      // Use the Qwen-VL model for Insight tasks (visual understanding and location)
+      if (intent === 'insight') {
+        return {
+          MIDSCENE_INSIGHT_MODEL_NAME: 'qwen-vl-plus',
+          MIDSCENE_INSIGHT_MODEL_API_KEY: 'sk-insight-key',
+          MIDSCENE_INSIGHT_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
+          MIDSCENE_INSIGHT_LOCATOR_MODE: 'qwen3-vl'
+        };
+      }
+
+      // Use the GPT-4o model for Planning tasks (task planning)
+      if (intent === 'planning') {
+        return {
+          MIDSCENE_PLANNING_MODEL_NAME: 'gpt-4o',
+          MIDSCENE_PLANNING_MODEL_API_KEY: 'sk-planning-key',
+          MIDSCENE_PLANNING_MODEL_BASE_URL: 'https://api.openai.com/v1',
+          MIDSCENE_PLANNING_LOCATOR_MODE: 'qwen3-vl'
+        };
+      }
+
+      // Default configuration
+      return {
+        MIDSCENE_MODEL_NAME: 'gpt-4o',
+        MIDSCENE_MODEL_API_KEY: 'sk-default-key',
+      };
+    }
+  });
+  ```
+
+  For more information about configuring models by task type, refer to the [Configure model and provider](./model-provider#configure-models-by-task-type-advanced) documentation.
+
 - `createOpenAIClient: (openai, options) => Promise`: Optional. Custom OpenAI client wrapper function. Allows you to wrap the OpenAI client instance for integrating observability tools (such as LangSmith, LangFuse) or applying custom middleware.
 
   **Parameter Description:**
diff --git a/apps/site/docs/en/automate-with-scripts-in-yaml.mdx b/apps/site/docs/en/automate-with-scripts-in-yaml.mdx
index ac4751b8d..af849d67b 100644
--- a/apps/site/docs/en/automate-with-scripts-in-yaml.mdx
+++ b/apps/site/docs/en/automate-with-scripts-in-yaml.mdx
@@ -439,7 +439,7 @@ tasks:
         convertHttpImage2Base64: true
 ```
 
-For VQA steps like `aiAsk`, `aiQuery`, `aiBoolean`, `aiNumber`, `aiString`, and `aiAssert`, you can set the `prompt` and `images` fields directly.
+For insight steps like `aiAsk`, `aiQuery`, `aiBoolean`, `aiNumber`, `aiString`, and `aiAssert`, you can set the `prompt` and `images` fields directly.
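+
+Below is a minimal sketch of such a step (the task name, prompt, and image URL are illustrative placeholders; the `images` entry shape is assumed to follow the example above):
+
+```yaml
+# Sketch only: values are placeholders, not part of the original docs
+tasks:
+  - name: extract-price
+    flow:
+      - aiQuery:
+          prompt: 'Extract the price shown in the reference image'
+          images:
+            - name: reference
+              url: https://example.com/reference.png
+```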
```yaml tasks: diff --git a/apps/site/docs/en/choose-a-model.mdx b/apps/site/docs/en/choose-a-model.mdx index 640b1a71e..a56f3c9c8 100644 --- a/apps/site/docs/en/choose-a-model.mdx +++ b/apps/site/docs/en/choose-a-model.mdx @@ -42,6 +42,22 @@ You need to configure the following environment variables before use: - `MIDSCENE_MODEL_API_KEY` - API key - `MIDSCENE_MODEL_NAME` - Model name +### Configure Models by Task Type (Advanced) + +Midscene supports configuring different models for different task types: + +- **Insight tasks**: Visual understanding and element location (such as `aiQuery`, `aiLocate`, `aiTap`, etc.) +- **Planning tasks**: Automatic planning tasks (such as `aiAct`) +- **Default tasks**: Other uncategorized tasks + +You can use the following environment variable prefixes to configure models for different task types: + +- `MIDSCENE_INSIGHT_MODEL_*` - For visual understanding and element location tasks +- `MIDSCENE_PLANNING_MODEL_*` - For automatic planning tasks +- `MIDSCENE_MODEL_*` - Default configuration, used as fallback for other tasks + +For more details, refer to the [Configure model and provider](./model-provider#configure-models-by-task-type-advanced) documentation. + ## Supported Vision Models diff --git a/apps/site/docs/en/model-provider.mdx b/apps/site/docs/en/model-provider.mdx index 663e49387..cbfdc2043 100644 --- a/apps/site/docs/en/model-provider.mdx +++ b/apps/site/docs/en/model-provider.mdx @@ -39,6 +39,60 @@ Extra configs to use `Gemini 2.5 Pro` model: For more information about the models, see [Choose a model](./choose-a-model). +### Configure Models by Task Type (Advanced) + +Midscene internally categorizes AI tasks into different intent types. You can configure different models for different intents: + +- **Insight tasks**: Visual Question Answering (VQA) and Visual Grounding, such as `aiQuery`, `aiLocate`, `aiTap`, etc. +- **Planning tasks**: Automatic planning tasks, such as `aiAct` +- **Default tasks**: Other uncategorized tasks + +Each task type can have independent model configurations: + +| Task Type | Environment Variable Prefix | Description | +|-----------|---------------------------|-------------| +| Insight | `MIDSCENE_INSIGHT_MODEL_*` | For visual understanding and element location tasks | +| Planning | `MIDSCENE_PLANNING_MODEL_*` | For automatic planning tasks | +| Default | `MIDSCENE_MODEL_*` | Default configuration, used as fallback for other tasks | + +Complete configuration options supported by each prefix: + +| Configuration | Description | +|--------------|-------------| +| `*_MODEL_NAME` | Model name | +| `*_MODEL_API_KEY` | API key | +| `*_MODEL_BASE_URL` | API endpoint URL | +| `*_MODEL_HTTP_PROXY` | HTTP/HTTPS proxy | +| `*_MODEL_SOCKS_PROXY` | SOCKS proxy | +| `*_MODEL_INIT_CONFIG_JSON` | OpenAI SDK initialization config JSON | +| `*_LOCATOR_MODE` | Locator mode (e.g. `qwen3-vl`, `vlm-ui-tars`, etc.) 
|
+**Example: Configure different models for Insight and Planning tasks**
+
+```bash
+# Insight tasks use the Qwen-VL model (for visual understanding and location)
+export MIDSCENE_INSIGHT_MODEL_NAME="qwen-vl-plus"
+export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-insight-key"
+export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
+export MIDSCENE_INSIGHT_LOCATOR_MODE="qwen3-vl"
+
+# Planning tasks use the GPT-4o model (for task planning)
+export MIDSCENE_PLANNING_MODEL_NAME="gpt-4o"
+export MIDSCENE_PLANNING_MODEL_API_KEY="sk-planning-key"
+export MIDSCENE_PLANNING_MODEL_BASE_URL="https://api.openai.com/v1"
+export MIDSCENE_PLANNING_LOCATOR_MODE="qwen3-vl"
+
+# Default configuration (used as a fallback)
+export MIDSCENE_MODEL_NAME="gpt-4o"
+export MIDSCENE_MODEL_API_KEY="sk-default-key"
+```
+
+:::tip
+
+If a task type's configuration is not set, Midscene automatically falls back to the default `MIDSCENE_MODEL_*` configuration. In most cases, you only need to configure the default `MIDSCENE_MODEL_*` variables.
+
+:::
+
 ### Advanced configs
 
 Some advanced configs are also supported. Usually you don't need to use them.
diff --git a/apps/site/docs/zh/api.mdx b/apps/site/docs/zh/api.mdx
index 211942ecf..ad31a0577 100644
--- a/apps/site/docs/zh/api.mdx
+++ b/apps/site/docs/zh/api.mdx
@@ -27,9 +27,14 @@ Midscene 中每个 Agent 都有自己的构造函数。
 
 这些 Agent 还支持以下高级配置参数:
 
-- `modelConfig: () => IModelConfig`: 可选。自定义模型配置函数。允许你通过代码动态配置不同的模型,而不是通过环境变量。这在需要为不同的 AI 任务(如 VQA、规划、定位等)使用不同模型时特别有用。
+- `modelConfig: (params: { intent: TIntent }) => IModelConfig`: 可选。自定义模型配置函数。允许你通过代码动态配置不同的模型,而不是通过环境变量。这在需要为不同的 AI 任务(如 Insight、Planning 等)使用不同模型时特别有用。
 
-  **示例:**
+  函数接收一个参数对象,包含 `intent` 字段,表示当前任务类型:
+  - `'insight'`: 视觉理解和元素定位任务(如 `aiQuery`、`aiLocate`、`aiTap` 等)
+  - `'planning'`: 自动规划任务(如 `aiAct`)
+  - `'default'`: 其他未分类任务
+
+  **基础示例:**
   ```typescript
   const agent = new PuppeteerAgent(page, {
     modelConfig: () => ({
@@ -41,6 +46,41 @@ Midscene 中每个 Agent 都有自己的构造函数。
     });
   ```
 
+  **为不同任务类型配置不同模型:**
+  ```typescript
+  const agent = new PuppeteerAgent(page, {
+    modelConfig: ({ intent }) => {
+      // 为 Insight 任务使用 Qwen-VL 模型(用于视觉理解和定位)
+      if (intent === 'insight') {
+        return {
+          MIDSCENE_INSIGHT_MODEL_NAME: 'qwen-vl-plus',
+          MIDSCENE_INSIGHT_MODEL_API_KEY: 'sk-insight-key',
+          MIDSCENE_INSIGHT_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
+          MIDSCENE_INSIGHT_LOCATOR_MODE: 'qwen3-vl'
+        };
+      }
+
+      // 为 Planning 任务使用 GPT-4o 模型(用于任务规划)
+      if (intent === 'planning') {
+        return {
+          MIDSCENE_PLANNING_MODEL_NAME: 'gpt-4o',
+          MIDSCENE_PLANNING_MODEL_API_KEY: 'sk-planning-key',
+          MIDSCENE_PLANNING_MODEL_BASE_URL: 'https://api.openai.com/v1',
+          MIDSCENE_PLANNING_LOCATOR_MODE: 'qwen3-vl'
+        };
+      }
+
+      // 默认配置
+      return {
+        MIDSCENE_MODEL_NAME: 'gpt-4o',
+        MIDSCENE_MODEL_API_KEY: 'sk-default-key',
+      };
+    }
+  });
+  ```
+
+  更多关于按任务类型配置模型的信息,请参考 [配置模型和服务商](./model-provider#按任务类型配置模型高级) 文档。
+
 - `createOpenAIClient: (openai, options) => Promise`: 可选。自定义 OpenAI 客户端包装函数。允许你包装 OpenAI 客户端实例,用于集成可观测性工具(如 LangSmith、LangFuse)或应用自定义中间件。
 
   **参数说明:**
diff --git a/apps/site/docs/zh/choose-a-model.mdx b/apps/site/docs/zh/choose-a-model.mdx
index 372adc3fd..ca807bdcc 100644
--- a/apps/site/docs/zh/choose-a-model.mdx
+++ b/apps/site/docs/zh/choose-a-model.mdx
@@ -42,6 +42,22 @@ Midscene 要求模型服务商提供兼容 OpenAI 风格的接口。
 
 - `MIDSCENE_MODEL_API_KEY` - API 密钥
 - `MIDSCENE_MODEL_NAME` - 模型名称
 
+### 按任务类型配置模型(高级)
+
+Midscene 支持为不同的任务类型配置不同的模型:
+
+- **Insight 任务**:视觉理解和元素定位(如 `aiQuery`、`aiLocate`、`aiTap` 等)
+- **Planning 任务**:自动规划任务(如 `aiAct`)
+- **Default 任务**:其他未分类任务
+
+你可以使用以下环境变量前缀来配置不同任务类型的模型:
+
+- 
`MIDSCENE_INSIGHT_MODEL_*` - 用于视觉理解和元素定位任务 +- `MIDSCENE_PLANNING_MODEL_*` - 用于自动规划任务 +- `MIDSCENE_MODEL_*` - 默认配置,作为其他任务的后备选项 + +更多详细信息,请参考 [配置模型和服务商](./model-provider#按任务类型配置模型高级) 文档。 + ## 已支持的视觉模型 diff --git a/apps/site/docs/zh/model-provider.mdx b/apps/site/docs/zh/model-provider.mdx index 3747cca42..9b6d5c2f7 100644 --- a/apps/site/docs/zh/model-provider.mdx +++ b/apps/site/docs/zh/model-provider.mdx @@ -42,6 +42,60 @@ Midscene 默认集成了 OpenAI SDK 调用 AI 服务。使用这个 SDK 限定 关于模型的更多信息,请参阅 [选择 AI 模型](./choose-a-model)。 +### 按任务类型配置模型(高级) + +Midscene 内部将 AI 任务分为不同的意图(Intent)类型。你可以为不同的意图配置不同的模型: + +- **Insight 任务**:包括视觉问答(VQA)和视觉定位(Grounding),如 `aiQuery`、`aiLocate`、`aiTap` 等方法 +- **Planning 任务**:自动规划相关的任务,如 `aiAct` 方法 +- **Default 任务**:其他未分类的任务 + +每种任务类型都可以配置独立的模型参数: + +| 任务类型 | 环境变量前缀 | 说明 | +|---------|-------------|------| +| Insight | `MIDSCENE_INSIGHT_MODEL_*` | 用于视觉理解和元素定位任务 | +| Planning | `MIDSCENE_PLANNING_MODEL_*` | 用于自动规划任务 | +| Default | `MIDSCENE_MODEL_*` | 默认配置,作为其他任务的后备选项 | + +每个前缀支持的完整配置项: + +| 配置项 | 说明 | +|-------|------| +| `*_MODEL_NAME` | 模型名称 | +| `*_MODEL_API_KEY` | API 密钥 | +| `*_MODEL_BASE_URL` | API 接入地址 | +| `*_MODEL_HTTP_PROXY` | HTTP/HTTPS 代理 | +| `*_MODEL_SOCKS_PROXY` | SOCKS 代理 | +| `*_MODEL_INIT_CONFIG_JSON` | OpenAI SDK 初始化配置 JSON | +| `*_LOCATOR_MODE` | 定位模式(如 `qwen3-vl`、`vlm-ui-tars` 等) | + +**示例:为 Insight 和 Planning 任务配置不同的模型** + +```bash +# Insight 任务使用 Qwen-VL 模型(用于视觉理解和定位) +export MIDSCENE_INSIGHT_MODEL_NAME="qwen-vl-plus" +export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-insight-key" +export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1" +export MIDSCENE_INSIGHT_LOCATOR_MODE="qwen3-vl" + +# Planning 任务使用 GPT-4o 模型(用于任务规划) +export MIDSCENE_PLANNING_MODEL_NAME="gpt-4o" +export MIDSCENE_PLANNING_MODEL_API_KEY="sk-planning-key" +export MIDSCENE_PLANNING_MODEL_BASE_URL="https://api.openai.com/v1" +export MIDSCENE_PLANNING_LOCATOR_MODE="qwen3-vl" + +# 默认配置(用作后备) +export MIDSCENE_MODEL_NAME="gpt-4o" +export MIDSCENE_MODEL_API_KEY="sk-default-key" +``` + +:::tip + +如果某个任务类型的配置未设置,Midscene 会自动使用 `MIDSCENE_MODEL_*` 的默认配置。大多数情况下,你只需要配置默认的 `MIDSCENE_MODEL_*` 变量即可。 + +::: + ### 高级配置 还有一些高级配置项,通常不需要使用。 diff --git a/packages/core/src/agent/agent.ts b/packages/core/src/agent/agent.ts index 5c8a9e6b9..5f9e1aee5 100644 --- a/packages/core/src/agent/agent.ts +++ b/packages/core/src/agent/agent.ts @@ -471,7 +471,7 @@ export class Agent< ); // assume all operation in action space is related to locating - const modelConfig = this.modelConfigManager.getModelConfig('grounding'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const { output, runner } = await this.taskExecutor.runPlans( title, @@ -796,7 +796,7 @@ export class Agent< demand: ServiceExtractParam, opt: ServiceExtractOption = defaultServiceExtractOption, ): Promise { - const modelConfig = this.modelConfigManager.getModelConfig('VQA'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const { output } = await this.taskExecutor.createTypeQueryExecution( 'Query', demand, @@ -810,7 +810,7 @@ export class Agent< prompt: TUserPrompt, opt: ServiceExtractOption = defaultServiceExtractOption, ): Promise { - const modelConfig = this.modelConfigManager.getModelConfig('VQA'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const { textPrompt, multimodalPrompt } = parsePrompt(prompt); const { output } = await this.taskExecutor.createTypeQueryExecution( @@ -827,7 +827,7 @@ export class Agent< prompt: TUserPrompt, opt: 
ServiceExtractOption = defaultServiceExtractOption, ): Promise { - const modelConfig = this.modelConfigManager.getModelConfig('VQA'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const { textPrompt, multimodalPrompt } = parsePrompt(prompt); const { output } = await this.taskExecutor.createTypeQueryExecution( @@ -844,7 +844,7 @@ export class Agent< prompt: TUserPrompt, opt: ServiceExtractOption = defaultServiceExtractOption, ): Promise { - const modelConfig = this.modelConfigManager.getModelConfig('VQA'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const { textPrompt, multimodalPrompt } = parsePrompt(prompt); const { output } = await this.taskExecutor.createTypeQueryExecution( @@ -895,7 +895,7 @@ export class Agent< deepThink, ); // use same intent as aiLocate - const modelConfig = this.modelConfigManager.getModelConfig('grounding'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const text = await this.service.describe(center, modelConfig, { deepThink, @@ -956,7 +956,7 @@ export class Agent< assert(locateParam, 'cannot get locate param for aiLocate'); const locatePlan = locatePlanForLocate(locateParam); const plans = [locatePlan]; - const modelConfig = this.modelConfigManager.getModelConfig('grounding'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const { output } = await this.taskExecutor.runPlans( taskTitleStr('Locate', locateParamStr(locateParam)), @@ -986,7 +986,7 @@ export class Agent< msg?: string, opt?: AgentAssertOpt & ServiceExtractOption, ) { - const modelConfig = this.modelConfigManager.getModelConfig('VQA'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); const serviceOpt: ServiceExtractOption = { domIncluded: opt?.domIncluded ?? 
defaultServiceExtractOption.domIncluded, @@ -1058,7 +1058,7 @@ export class Agent< } async aiWaitFor(assertion: TUserPrompt, opt?: AgentWaitForOpt) { - const modelConfig = this.modelConfigManager.getModelConfig('VQA'); + const modelConfig = this.modelConfigManager.getModelConfig('insight'); await this.taskExecutor.waitFor( assertion, { diff --git a/packages/core/tests/ai/service/service.test.ts b/packages/core/tests/ai/service/service.test.ts index d5bcb0032..ce670f015 100644 --- a/packages/core/tests/ai/service/service.test.ts +++ b/packages/core/tests/ai/service/service.test.ts @@ -12,7 +12,7 @@ vi.setConfig({ testTimeout: 60 * 1000, }); -const modelConfig = globalModelConfigManager.getModelConfig('grounding'); +const modelConfig = globalModelConfigManager.getModelConfig('insight'); describe.skipIf(!modelConfig.vlMode)('service locate with deep think', () => { test('service locate with search area', async () => { diff --git a/packages/core/tests/unit-test/proxy-integration.test.ts b/packages/core/tests/unit-test/proxy-integration.test.ts index e3adf9320..cf4894eb6 100644 --- a/packages/core/tests/unit-test/proxy-integration.test.ts +++ b/packages/core/tests/unit-test/proxy-integration.test.ts @@ -349,20 +349,20 @@ describe('proxy integration', () => { expect(mockModelConfig.socksProxy).toBe(proxyUrl); }); - it('should support intent-specific proxy configuration for VQA', () => { + it('should support intent-specific proxy configuration for insight', () => { const proxyUrl = 'http://127.0.0.1:8080'; const mockModelConfig: IModelConfig = { modelName: 'gpt-4o', openaiApiKey: 'test-key', openaiBaseURL: 'https://api.openai.com/v1', - httpProxy: proxyUrl, // Would be populated from MIDSCENE_VQA_MODEL_HTTP_PROXY + httpProxy: proxyUrl, // Would be populated from MIDSCENE_INSIGHT_MODEL_HTTP_PROXY modelDescription: 'test', - intent: 'VQA', + intent: 'insight', from: 'env', }; - expect(mockModelConfig.intent).toBe('VQA'); + expect(mockModelConfig.intent).toBe('insight'); expect(mockModelConfig.httpProxy).toBe(proxyUrl); }); @@ -383,20 +383,20 @@ describe('proxy integration', () => { expect(mockModelConfig.socksProxy).toBe(proxyUrl); }); - it('should support intent-specific proxy configuration for grounding', () => { + it('should support intent-specific proxy configuration for insight with http', () => { const proxyUrl = 'http://127.0.0.1:8080'; const mockModelConfig: IModelConfig = { modelName: 'gpt-4o', openaiApiKey: 'test-key', openaiBaseURL: 'https://api.openai.com/v1', - httpProxy: proxyUrl, // Would be populated from MIDSCENE_GROUNDING_MODEL_HTTP_PROXY + httpProxy: proxyUrl, // Would be populated from MIDSCENE_INSIGHT_MODEL_HTTP_PROXY modelDescription: 'test', - intent: 'grounding', + intent: 'insight', from: 'env', }; - expect(mockModelConfig.intent).toBe('grounding'); + expect(mockModelConfig.intent).toBe('insight'); expect(mockModelConfig.httpProxy).toBe(proxyUrl); }); }); diff --git a/packages/evaluation/src/test-analyzer.ts b/packages/evaluation/src/test-analyzer.ts index 713194b07..f88e0f7aa 100644 --- a/packages/evaluation/src/test-analyzer.ts +++ b/packages/evaluation/src/test-analyzer.ts @@ -232,7 +232,7 @@ ${errorMsg ? 
`Error: ${errorMsg}` : ''} // compare coordinates if ( testCase.response_rect && - globalModelConfigManager.getModelConfig('grounding').vlMode + globalModelConfigManager.getModelConfig('insight').vlMode ) { const resultRect = (result as LocateResult).rect; if (!resultRect) { diff --git a/packages/evaluation/tests/llm-locator.test.ts b/packages/evaluation/tests/llm-locator.test.ts index 0f734397b..9087b0376 100644 --- a/packages/evaluation/tests/llm-locator.test.ts +++ b/packages/evaluation/tests/llm-locator.test.ts @@ -28,14 +28,13 @@ let resultCollector: TestResultCollector; let failCaseThreshold = 2; if (process.env.CI) { - failCaseThreshold = globalModelConfigManager.getModelConfig('grounding') - .vlMode + failCaseThreshold = globalModelConfigManager.getModelConfig('insight').vlMode ? 2 : 3; } beforeAll(async () => { - const modelConfig = globalModelConfigManager.getModelConfig('grounding'); + const modelConfig = globalModelConfigManager.getModelConfig('insight'); const { vlMode, modelName } = modelConfig; diff --git a/packages/shared/src/env/constants.ts b/packages/shared/src/env/constants.ts index 0277045d5..450395afc 100644 --- a/packages/shared/src/env/constants.ts +++ b/packages/shared/src/env/constants.ts @@ -1,11 +1,11 @@ import { - MIDSCENE_GROUNDING_LOCATOR_MODE, - MIDSCENE_GROUNDING_MODEL_API_KEY, - MIDSCENE_GROUNDING_MODEL_BASE_URL, - MIDSCENE_GROUNDING_MODEL_HTTP_PROXY, - MIDSCENE_GROUNDING_MODEL_INIT_CONFIG_JSON, - MIDSCENE_GROUNDING_MODEL_NAME, - MIDSCENE_GROUNDING_MODEL_SOCKS_PROXY, + MIDSCENE_INSIGHT_LOCATOR_MODE, + MIDSCENE_INSIGHT_MODEL_API_KEY, + MIDSCENE_INSIGHT_MODEL_BASE_URL, + MIDSCENE_INSIGHT_MODEL_HTTP_PROXY, + MIDSCENE_INSIGHT_MODEL_INIT_CONFIG_JSON, + MIDSCENE_INSIGHT_MODEL_NAME, + MIDSCENE_INSIGHT_MODEL_SOCKS_PROXY, MIDSCENE_LOCATOR_MODE, MIDSCENE_MODEL_API_KEY, MIDSCENE_MODEL_BASE_URL, @@ -23,14 +23,6 @@ import { MIDSCENE_PLANNING_MODEL_INIT_CONFIG_JSON, MIDSCENE_PLANNING_MODEL_NAME, MIDSCENE_PLANNING_MODEL_SOCKS_PROXY, - MIDSCENE_VQA_LOCATOR_MODE, - MIDSCENE_VQA_MODEL_API_KEY, - MIDSCENE_VQA_MODEL_BASE_URL, - MIDSCENE_VQA_MODEL_HTTP_PROXY, - MIDSCENE_VQA_MODEL_INIT_CONFIG_JSON, - // VQA - MIDSCENE_VQA_MODEL_NAME, - MIDSCENE_VQA_MODEL_SOCKS_PROXY, OPENAI_API_KEY, OPENAI_BASE_URL, } from './types'; @@ -54,42 +46,23 @@ interface IModelConfigKeys { vlMode: string; } -export const VQA_MODEL_CONFIG_KEYS: IModelConfigKeys = { - modelName: MIDSCENE_VQA_MODEL_NAME, +export const INSIGHT_MODEL_CONFIG_KEYS: IModelConfigKeys = { + modelName: MIDSCENE_INSIGHT_MODEL_NAME, /** * proxy */ - socksProxy: MIDSCENE_VQA_MODEL_SOCKS_PROXY, - httpProxy: MIDSCENE_VQA_MODEL_HTTP_PROXY, + socksProxy: MIDSCENE_INSIGHT_MODEL_SOCKS_PROXY, + httpProxy: MIDSCENE_INSIGHT_MODEL_HTTP_PROXY, /** * OpenAI */ - openaiBaseURL: MIDSCENE_VQA_MODEL_BASE_URL, - openaiApiKey: MIDSCENE_VQA_MODEL_API_KEY, - openaiExtraConfig: MIDSCENE_VQA_MODEL_INIT_CONFIG_JSON, + openaiBaseURL: MIDSCENE_INSIGHT_MODEL_BASE_URL, + openaiApiKey: MIDSCENE_INSIGHT_MODEL_API_KEY, + openaiExtraConfig: MIDSCENE_INSIGHT_MODEL_INIT_CONFIG_JSON, /** * Extra */ - vlMode: MIDSCENE_VQA_LOCATOR_MODE, -} as const; - -export const GROUNDING_MODEL_CONFIG_KEYS: IModelConfigKeys = { - modelName: MIDSCENE_GROUNDING_MODEL_NAME, - /** - * proxy - */ - socksProxy: MIDSCENE_GROUNDING_MODEL_SOCKS_PROXY, - httpProxy: MIDSCENE_GROUNDING_MODEL_HTTP_PROXY, - /** - * OpenAI - */ - openaiBaseURL: MIDSCENE_GROUNDING_MODEL_BASE_URL, - openaiApiKey: MIDSCENE_GROUNDING_MODEL_API_KEY, - openaiExtraConfig: MIDSCENE_GROUNDING_MODEL_INIT_CONFIG_JSON, - /** 
- * Extra - */ - vlMode: MIDSCENE_GROUNDING_LOCATOR_MODE, + vlMode: MIDSCENE_INSIGHT_LOCATOR_MODE, } as const; export const PLANNING_MODEL_CONFIG_KEYS: IModelConfigKeys = { diff --git a/packages/shared/src/env/decide-model-config.ts b/packages/shared/src/env/decide-model-config.ts index 8608e4d62..c357df8bf 100644 --- a/packages/shared/src/env/decide-model-config.ts +++ b/packages/shared/src/env/decide-model-config.ts @@ -8,9 +8,8 @@ import type { import { DEFAULT_MODEL_CONFIG_KEYS, DEFAULT_MODEL_CONFIG_KEYS_LEGACY, - GROUNDING_MODEL_CONFIG_KEYS, + INSIGHT_MODEL_CONFIG_KEYS, PLANNING_MODEL_CONFIG_KEYS, - VQA_MODEL_CONFIG_KEYS, } from './constants'; import { MIDSCENE_MODEL_API_KEY, @@ -37,15 +36,13 @@ import { } from './parse'; type TModelConfigKeys = - | typeof VQA_MODEL_CONFIG_KEYS - | typeof GROUNDING_MODEL_CONFIG_KEYS + | typeof INSIGHT_MODEL_CONFIG_KEYS | typeof PLANNING_MODEL_CONFIG_KEYS | typeof DEFAULT_MODEL_CONFIG_KEYS | typeof DEFAULT_MODEL_CONFIG_KEYS_LEGACY; const KEYS_MAP: Record = { - VQA: VQA_MODEL_CONFIG_KEYS, - grounding: GROUNDING_MODEL_CONFIG_KEYS, + insight: INSIGHT_MODEL_CONFIG_KEYS, planning: PLANNING_MODEL_CONFIG_KEYS, default: DEFAULT_MODEL_CONFIG_KEYS, } as const; diff --git a/packages/shared/src/env/model-config-manager.ts b/packages/shared/src/env/model-config-manager.ts index 3bba49488..00cb43d2b 100644 --- a/packages/shared/src/env/model-config-manager.ts +++ b/packages/shared/src/env/model-config-manager.ts @@ -13,7 +13,7 @@ import type { } from './types'; import { VL_MODE_RAW_VALID_VALUES as VL_MODES } from './types'; -const ALL_INTENTS: TIntent[] = ['VQA', 'default', 'grounding', 'planning']; +const ALL_INTENTS: TIntent[] = ['insight', 'default', 'planning']; export type TIntentConfigMap = Record< TIntent, @@ -51,9 +51,8 @@ export class ModelConfigManager { modelConfigFn: TModelConfigFnInternal, ): TIntentConfigMap { const intentConfigMap: TIntentConfigMap = { - VQA: undefined, + insight: undefined, default: undefined, - grounding: undefined, planning: undefined, }; @@ -71,9 +70,8 @@ export class ModelConfigManager { private calcModelConfigMapBaseOnIntent(intentConfigMap: TIntentConfigMap) { const modelConfigMap: Record = { - VQA: undefined, + insight: undefined, default: undefined, - grounding: undefined, planning: undefined, }; for (const i of ALL_INTENTS) { @@ -93,9 +91,8 @@ export class ModelConfigManager { allEnvConfig: Record, ) { const modelConfigMap: Record = { - VQA: undefined, + insight: undefined, default: undefined, - grounding: undefined, planning: undefined, }; for (const i of ALL_INTENTS) { @@ -177,7 +174,7 @@ Learn more: https://midscenejs.com/choose-a-model`, this.globalConfigManager = globalConfigManager; } - throwErrorIfNonVLModel(intent: TIntent = 'grounding') { + throwErrorIfNonVLModel(intent: TIntent = 'insight') { const modelConfig = this.getModelConfig(intent); if (!modelConfig.vlMode) { diff --git a/packages/shared/src/env/types.ts b/packages/shared/src/env/types.ts index 146268249..fec6f3f08 100644 --- a/packages/shared/src/env/types.ts +++ b/packages/shared/src/env/types.ts @@ -88,15 +88,18 @@ export const MIDSCENE_RUN_DIR = 'MIDSCENE_RUN_DIR'; // default new export const MIDSCENE_LOCATOR_MODE = 'MIDSCENE_LOCATOR_MODE'; -// VQA -export const MIDSCENE_VQA_MODEL_NAME = 'MIDSCENE_VQA_MODEL_NAME'; -export const MIDSCENE_VQA_MODEL_SOCKS_PROXY = 'MIDSCENE_VQA_MODEL_SOCKS_PROXY'; -export const MIDSCENE_VQA_MODEL_HTTP_PROXY = 'MIDSCENE_VQA_MODEL_HTTP_PROXY'; -export const MIDSCENE_VQA_MODEL_BASE_URL = 'MIDSCENE_VQA_MODEL_BASE_URL'; -export 
const MIDSCENE_VQA_MODEL_API_KEY = 'MIDSCENE_VQA_MODEL_API_KEY'; -export const MIDSCENE_VQA_MODEL_INIT_CONFIG_JSON = - 'MIDSCENE_VQA_MODEL_INIT_CONFIG_JSON'; -export const MIDSCENE_VQA_LOCATOR_MODE = 'MIDSCENE_VQA_LOCATOR_MODE'; +// INSIGHT (unified VQA and Grounding) +export const MIDSCENE_INSIGHT_MODEL_NAME = 'MIDSCENE_INSIGHT_MODEL_NAME'; +export const MIDSCENE_INSIGHT_MODEL_SOCKS_PROXY = + 'MIDSCENE_INSIGHT_MODEL_SOCKS_PROXY'; +export const MIDSCENE_INSIGHT_MODEL_HTTP_PROXY = + 'MIDSCENE_INSIGHT_MODEL_HTTP_PROXY'; +export const MIDSCENE_INSIGHT_MODEL_BASE_URL = + 'MIDSCENE_INSIGHT_MODEL_BASE_URL'; +export const MIDSCENE_INSIGHT_MODEL_API_KEY = 'MIDSCENE_INSIGHT_MODEL_API_KEY'; +export const MIDSCENE_INSIGHT_MODEL_INIT_CONFIG_JSON = + 'MIDSCENE_INSIGHT_MODEL_INIT_CONFIG_JSON'; +export const MIDSCENE_INSIGHT_LOCATOR_MODE = 'MIDSCENE_INSIGHT_LOCATOR_MODE'; // PLANNING export const MIDSCENE_PLANNING_MODEL_NAME = 'MIDSCENE_PLANNING_MODEL_NAME'; @@ -112,21 +115,6 @@ export const MIDSCENE_PLANNING_MODEL_INIT_CONFIG_JSON = 'MIDSCENE_PLANNING_MODEL_INIT_CONFIG_JSON'; export const MIDSCENE_PLANNING_LOCATOR_MODE = 'MIDSCENE_PLANNING_LOCATOR_MODE'; -// GROUNDING -export const MIDSCENE_GROUNDING_MODEL_NAME = 'MIDSCENE_GROUNDING_MODEL_NAME'; -export const MIDSCENE_GROUNDING_MODEL_SOCKS_PROXY = - 'MIDSCENE_GROUNDING_MODEL_SOCKS_PROXY'; -export const MIDSCENE_GROUNDING_MODEL_HTTP_PROXY = - 'MIDSCENE_GROUNDING_MODEL_HTTP_PROXY'; -export const MIDSCENE_GROUNDING_MODEL_BASE_URL = - 'MIDSCENE_GROUNDING_MODEL_BASE_URL'; -export const MIDSCENE_GROUNDING_MODEL_API_KEY = - 'MIDSCENE_GROUNDING_MODEL_API_KEY'; -export const MIDSCENE_GROUNDING_MODEL_INIT_CONFIG_JSON = - 'MIDSCENE_GROUNDING_MODEL_INIT_CONFIG_JSON'; -export const MIDSCENE_GROUNDING_LOCATOR_MODE = - 'MIDSCENE_GROUNDING_LOCATOR_MODE'; - /** * env keys declared but unused */ @@ -210,14 +198,14 @@ export const MODEL_ENV_KEYS = [ MIDSCENE_OPENAI_SOCKS_PROXY, MODEL_API_KEY, MODEL_BASE_URL, - // VQA - MIDSCENE_VQA_MODEL_NAME, - MIDSCENE_VQA_MODEL_SOCKS_PROXY, - MIDSCENE_VQA_MODEL_HTTP_PROXY, - MIDSCENE_VQA_MODEL_BASE_URL, - MIDSCENE_VQA_MODEL_API_KEY, - MIDSCENE_VQA_MODEL_INIT_CONFIG_JSON, - MIDSCENE_VQA_LOCATOR_MODE, + // INSIGHT (unified VQA and Grounding) + MIDSCENE_INSIGHT_MODEL_NAME, + MIDSCENE_INSIGHT_MODEL_SOCKS_PROXY, + MIDSCENE_INSIGHT_MODEL_HTTP_PROXY, + MIDSCENE_INSIGHT_MODEL_BASE_URL, + MIDSCENE_INSIGHT_MODEL_API_KEY, + MIDSCENE_INSIGHT_MODEL_INIT_CONFIG_JSON, + MIDSCENE_INSIGHT_LOCATOR_MODE, // PLANNING MIDSCENE_PLANNING_MODEL_NAME, MIDSCENE_PLANNING_MODEL_SOCKS_PROXY, @@ -226,14 +214,6 @@ export const MODEL_ENV_KEYS = [ MIDSCENE_PLANNING_MODEL_API_KEY, MIDSCENE_PLANNING_MODEL_INIT_CONFIG_JSON, MIDSCENE_PLANNING_LOCATOR_MODE, - // GROUNDING - MIDSCENE_GROUNDING_MODEL_NAME, - MIDSCENE_GROUNDING_MODEL_SOCKS_PROXY, - MIDSCENE_GROUNDING_MODEL_HTTP_PROXY, - MIDSCENE_GROUNDING_MODEL_BASE_URL, - MIDSCENE_GROUNDING_MODEL_API_KEY, - MIDSCENE_GROUNDING_MODEL_INIT_CONFIG_JSON, - MIDSCENE_GROUNDING_LOCATOR_MODE, ] as const; export const ALL_ENV_KEYS = [ @@ -262,18 +242,18 @@ export type TVlModeTypes = | 'gemini' | 'vlm-ui-tars'; -export interface IModelConfigForVQA { +export interface IModelConfigForInsight { // model name - [MIDSCENE_VQA_MODEL_NAME]: string; + [MIDSCENE_INSIGHT_MODEL_NAME]: string; // proxy - [MIDSCENE_VQA_MODEL_SOCKS_PROXY]?: string; - [MIDSCENE_VQA_MODEL_HTTP_PROXY]?: string; + [MIDSCENE_INSIGHT_MODEL_SOCKS_PROXY]?: string; + [MIDSCENE_INSIGHT_MODEL_HTTP_PROXY]?: string; // OpenAI - [MIDSCENE_VQA_MODEL_BASE_URL]?: string; - 
[MIDSCENE_VQA_MODEL_API_KEY]?: string; - [MIDSCENE_VQA_MODEL_INIT_CONFIG_JSON]?: string; + [MIDSCENE_INSIGHT_MODEL_BASE_URL]?: string; + [MIDSCENE_INSIGHT_MODEL_API_KEY]?: string; + [MIDSCENE_INSIGHT_MODEL_INIT_CONFIG_JSON]?: string; // extra - [MIDSCENE_VQA_LOCATOR_MODE]?: TVlModeValues; + [MIDSCENE_INSIGHT_LOCATOR_MODE]?: TVlModeValues; } /** @@ -305,20 +285,6 @@ export interface IModelConfigForPlanning { [MIDSCENE_PLANNING_LOCATOR_MODE]?: TVlModeValues; } -export interface IModeConfigForGrounding { - // model name - [MIDSCENE_GROUNDING_MODEL_NAME]: string; - // proxy - [MIDSCENE_GROUNDING_MODEL_SOCKS_PROXY]?: string; - [MIDSCENE_GROUNDING_MODEL_HTTP_PROXY]?: string; - // OpenAI - [MIDSCENE_GROUNDING_MODEL_BASE_URL]?: string; - [MIDSCENE_GROUNDING_MODEL_API_KEY]?: string; - [MIDSCENE_GROUNDING_MODEL_INIT_CONFIG_JSON]?: string; - // extra - [MIDSCENE_GROUNDING_LOCATOR_MODE]?: TVlModeValues; -} - export interface IModelConfigForDefault { // model name [MIDSCENE_MODEL_NAME]: string; @@ -348,12 +314,11 @@ export interface IModelConfigForDefaultLegacy { } /** - * - VQA: Visual Question Answering - * - grounding:short for Visual Grounding + * - insight: Visual Question Answering and Visual Grounding (unified) * - planning: planning - * - default: all except VQA、grounding、planning + * - default: all except insight、planning */ -export type TIntent = 'VQA' | 'planning' | 'grounding' | 'default'; +export type TIntent = 'insight' | 'planning' | 'default'; /** * Internal type with intent parameter for ModelConfigManager @@ -361,20 +326,15 @@ export type TIntent = 'VQA' | 'planning' | 'grounding' | 'default'; */ export type TModelConfigFnInternal = (options: { intent: TIntent; -}) => - | IModelConfigForVQA - | IModelConfigForPlanning - | IModeConfigForGrounding - | IModelConfigForDefault; +}) => IModelConfigForInsight | IModelConfigForPlanning | IModelConfigForDefault; /** * User-facing model config function type * Users return config objects without needing to know about intent parameter */ export type TModelConfigFn = () => - | IModelConfigForVQA + | IModelConfigForInsight | IModelConfigForPlanning - | IModeConfigForGrounding | IModelConfigForDefault; export enum UITarsModelVersion { diff --git a/packages/shared/tests/unit-test/env/decide-model.test.ts b/packages/shared/tests/unit-test/env/decide-model.test.ts index 133eb03ce..cdbcafdb3 100644 --- a/packages/shared/tests/unit-test/env/decide-model.test.ts +++ b/packages/shared/tests/unit-test/env/decide-model.test.ts @@ -11,27 +11,27 @@ import { } from '../../../src/env/types'; describe('decideModelConfig from modelConfig fn', () => { - it('return lacking config for VQA', () => { + it('return lacking config for insight', () => { expect(() => - decideModelConfigFromIntentConfig('VQA', {}), + decideModelConfigFromIntentConfig('insight', {}), ).toThrowErrorMatchingInlineSnapshot( '[Error: The return value of agent.modelConfig do not have a valid value with key MIDSCENE_MODEL_NAME.]', ); }); - it('return full config for VQA', () => { - const result = decideModelConfigFromIntentConfig('VQA', { - MIDSCENE_VQA_MODEL_NAME: 'vqa-model', - MIDSCENE_VQA_MODEL_BASE_URL: 'mock-url', - MIDSCENE_VQA_MODEL_API_KEY: 'mock-key', + it('return full config for insight', () => { + const result = decideModelConfigFromIntentConfig('insight', { + MIDSCENE_INSIGHT_MODEL_NAME: 'insight-model', + MIDSCENE_INSIGHT_MODEL_BASE_URL: 'mock-url', + MIDSCENE_INSIGHT_MODEL_API_KEY: 'mock-key', }); expect(result).toMatchInlineSnapshot(` { "from": "modelConfig", "httpProxy": undefined, 
- "intent": "VQA", + "intent": "insight", "modelDescription": "", - "modelName": "vqa-model", + "modelName": "insight-model", "openaiApiKey": "mock-key", "openaiBaseURL": "mock-url", "openaiExtraConfig": undefined, @@ -44,7 +44,7 @@ describe('decideModelConfig from modelConfig fn', () => { }); it('return default config', () => { - const result = decideModelConfigFromIntentConfig('VQA', { + const result = decideModelConfigFromIntentConfig('insight', { MIDSCENE_MODEL_NAME: 'default-model', MIDSCENE_MODEL_BASE_URL: 'mock-url', MIDSCENE_MODEL_API_KEY: 'mock-key', @@ -53,7 +53,7 @@ describe('decideModelConfig from modelConfig fn', () => { { "from": "modelConfig", "httpProxy": undefined, - "intent": "VQA", + "intent": "insight", "modelDescription": "", "modelName": "default-model", "openaiApiKey": "mock-key", diff --git a/packages/shared/tests/unit-test/env/modle-config-manager.test.ts b/packages/shared/tests/unit-test/env/modle-config-manager.test.ts index 9e705570e..d30551335 100644 --- a/packages/shared/tests/unit-test/env/modle-config-manager.test.ts +++ b/packages/shared/tests/unit-test/env/modle-config-manager.test.ts @@ -3,9 +3,9 @@ import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; import { ModelConfigManager } from '../../../src/env/model-config-manager'; import type { TIntent, TModelConfigFn } from '../../../src/env/types'; import { - MIDSCENE_GROUNDING_MODEL_API_KEY, - MIDSCENE_GROUNDING_MODEL_BASE_URL, - MIDSCENE_GROUNDING_MODEL_NAME, + MIDSCENE_INSIGHT_MODEL_API_KEY, + MIDSCENE_INSIGHT_MODEL_BASE_URL, + MIDSCENE_INSIGHT_MODEL_NAME, MIDSCENE_MODEL_API_KEY, MIDSCENE_MODEL_BASE_URL, MIDSCENE_MODEL_INIT_CONFIG_JSON, @@ -14,9 +14,6 @@ import { MIDSCENE_PLANNING_MODEL_API_KEY, MIDSCENE_PLANNING_MODEL_BASE_URL, MIDSCENE_PLANNING_MODEL_NAME, - MIDSCENE_VQA_MODEL_API_KEY, - MIDSCENE_VQA_MODEL_BASE_URL, - MIDSCENE_VQA_MODEL_NAME, OPENAI_API_KEY, OPENAI_BASE_URL, } from '../../../src/env/types'; @@ -41,11 +38,11 @@ describe('ModelConfigManager', () => { }; switch (intent) { - case 'VQA': + case 'insight': return { - [MIDSCENE_VQA_MODEL_NAME]: 'gpt-4-vision', - [MIDSCENE_VQA_MODEL_API_KEY]: 'test-vqa-key', - [MIDSCENE_VQA_MODEL_BASE_URL]: 'https://api.openai.com/v1', + [MIDSCENE_INSIGHT_MODEL_NAME]: 'gpt-4-vision', + [MIDSCENE_INSIGHT_MODEL_API_KEY]: 'test-insight-key', + [MIDSCENE_INSIGHT_MODEL_BASE_URL]: 'https://api.openai.com/v1', }; case 'planning': return { @@ -54,12 +51,6 @@ describe('ModelConfigManager', () => { [MIDSCENE_PLANNING_MODEL_BASE_URL]: 'https://api.openai.com/v1', [MIDSCENE_PLANNING_LOCATOR_MODE]: 'qwen-vl' as const, }; - case 'grounding': - return { - [MIDSCENE_GROUNDING_MODEL_NAME]: 'gpt-4-vision', - [MIDSCENE_GROUNDING_MODEL_API_KEY]: 'test-grounding-key', - [MIDSCENE_GROUNDING_MODEL_BASE_URL]: 'https://api.openai.com/v1', - }; case 'default': return baseConfig; default: @@ -73,7 +64,7 @@ describe('ModelConfigManager', () => { it('should throw error when modelConfigFn returns undefined for any intent', () => { const modelConfigFn: TModelConfigFn = ({ intent }) => { - if (intent === 'VQA') { + if (intent === 'insight') { return undefined as any; } return { @@ -84,7 +75,7 @@ describe('ModelConfigManager', () => { }; expect(() => new ModelConfigManager(modelConfigFn)).toThrow( - 'The agent has an option named modelConfig is a function, but it return undefined when call with intent VQA, which should be a object.', + 'The agent has an option named modelConfig is a function, but it return undefined when call with intent insight, which should be a 
object.', ); }); }); @@ -99,11 +90,11 @@ describe('ModelConfigManager', () => { }; switch (intent) { - case 'VQA': + case 'insight': return { - [MIDSCENE_VQA_MODEL_NAME]: 'gpt-4-vision', - [MIDSCENE_VQA_MODEL_API_KEY]: 'test-vqa-key', - [MIDSCENE_VQA_MODEL_BASE_URL]: 'https://api.openai.com/v1', + [MIDSCENE_INSIGHT_MODEL_NAME]: 'gpt-4-vision', + [MIDSCENE_INSIGHT_MODEL_API_KEY]: 'test-insight-key', + [MIDSCENE_INSIGHT_MODEL_BASE_URL]: 'https://api.openai.com/v1', }; case 'planning': return { @@ -112,12 +103,6 @@ describe('ModelConfigManager', () => { [MIDSCENE_PLANNING_MODEL_BASE_URL]: 'https://api.openai.com/v1', [MIDSCENE_PLANNING_LOCATOR_MODE]: 'qwen-vl', }; - case 'grounding': - return { - [MIDSCENE_GROUNDING_MODEL_NAME]: 'gpt-4-vision', - [MIDSCENE_GROUNDING_MODEL_API_KEY]: 'test-grounding-key', - [MIDSCENE_GROUNDING_MODEL_BASE_URL]: 'https://api.openai.com/v1', - }; case 'default': return baseConfig; default: @@ -127,11 +112,11 @@ describe('ModelConfigManager', () => { const manager = new ModelConfigManager(modelConfigFn); - const vqaConfig = manager.getModelConfig('VQA'); - expect(vqaConfig.modelName).toBe('gpt-4-vision'); - expect(vqaConfig.openaiApiKey).toBe('test-vqa-key'); - expect(vqaConfig.intent).toBe('VQA'); - expect(vqaConfig.from).toBe('modelConfig'); + const insightConfig = manager.getModelConfig('insight'); + expect(insightConfig.modelName).toBe('gpt-4-vision'); + expect(insightConfig.openaiApiKey).toBe('test-insight-key'); + expect(insightConfig.intent).toBe('insight'); + expect(insightConfig.from).toBe('modelConfig'); const planningConfig = manager.getModelConfig('planning'); expect(planningConfig.modelName).toBe('qwen-vl-plus'); @@ -140,12 +125,6 @@ describe('ModelConfigManager', () => { expect(planningConfig.from).toBe('modelConfig'); expect(planningConfig.vlMode).toBe('qwen-vl'); - const groundingConfig = manager.getModelConfig('grounding'); - expect(groundingConfig.modelName).toBe('gpt-4-vision'); - expect(groundingConfig.openaiApiKey).toBe('test-grounding-key'); - expect(groundingConfig.intent).toBe('grounding'); - expect(groundingConfig.from).toBe('modelConfig'); - const defaultConfig = manager.getModelConfig('default'); expect(defaultConfig.modelName).toBe('gpt-4'); expect(defaultConfig.openaiApiKey).toBe('test-key'); @@ -380,8 +359,7 @@ describe('ModelConfigManager', () => { // Other intents should succeed expect(() => manager.getModelConfig('default')).not.toThrow(); - expect(() => manager.getModelConfig('VQA')).not.toThrow(); - expect(() => manager.getModelConfig('grounding')).not.toThrow(); + expect(() => manager.getModelConfig('insight')).not.toThrow(); }); it('should accept all valid VL modes for planning', () => { @@ -464,11 +442,11 @@ describe('ModelConfigManager', () => { const mockCreateClient = vi.fn(); const modelConfigFn: TModelConfigFn = ({ intent }) => { switch (intent) { - case 'VQA': + case 'insight': return { - [MIDSCENE_VQA_MODEL_NAME]: 'gpt-4-vision', - [MIDSCENE_VQA_MODEL_API_KEY]: 'test-vqa-key', - [MIDSCENE_VQA_MODEL_BASE_URL]: 'https://api.openai.com/v1', + [MIDSCENE_INSIGHT_MODEL_NAME]: 'gpt-4-vision', + [MIDSCENE_INSIGHT_MODEL_API_KEY]: 'test-insight-key', + [MIDSCENE_INSIGHT_MODEL_BASE_URL]: 'https://api.openai.com/v1', }; case 'planning': return { @@ -477,12 +455,6 @@ describe('ModelConfigManager', () => { [MIDSCENE_PLANNING_MODEL_BASE_URL]: 'https://api.openai.com/v1', [MIDSCENE_PLANNING_LOCATOR_MODE]: 'qwen-vl' as const, }; - case 'grounding': - return { - [MIDSCENE_GROUNDING_MODEL_NAME]: 'gpt-4-vision', - 
[MIDSCENE_GROUNDING_MODEL_API_KEY]: 'test-grounding-key', - [MIDSCENE_GROUNDING_MODEL_BASE_URL]: 'https://api.openai.com/v1', - }; default: return { [MIDSCENE_MODEL_NAME]: 'gpt-4', @@ -494,36 +466,26 @@ describe('ModelConfigManager', () => { const manager = new ModelConfigManager(modelConfigFn, mockCreateClient); - const vqaConfig = manager.getModelConfig('VQA'); - expect(vqaConfig.createOpenAIClient).toBe(mockCreateClient); + const insightConfig = manager.getModelConfig('insight'); + expect(insightConfig.createOpenAIClient).toBe(mockCreateClient); const planningConfig = manager.getModelConfig('planning'); expect(planningConfig.createOpenAIClient).toBe(mockCreateClient); - const groundingConfig = manager.getModelConfig('grounding'); - expect(groundingConfig.createOpenAIClient).toBe(mockCreateClient); - const defaultConfig = manager.getModelConfig('default'); expect(defaultConfig.createOpenAIClient).toBe(mockCreateClient); }); it('should inject createOpenAIClient into all intent configs in normal mode', () => { - vi.stubEnv(MIDSCENE_VQA_MODEL_NAME, 'gpt-4-vision'); - vi.stubEnv(MIDSCENE_VQA_MODEL_API_KEY, 'test-vqa-key'); - vi.stubEnv(MIDSCENE_VQA_MODEL_BASE_URL, 'https://api.openai.com/v1'); + vi.stubEnv(MIDSCENE_INSIGHT_MODEL_NAME, 'gpt-4-vision'); + vi.stubEnv(MIDSCENE_INSIGHT_MODEL_API_KEY, 'test-insight-key'); + vi.stubEnv(MIDSCENE_INSIGHT_MODEL_BASE_URL, 'https://api.openai.com/v1'); vi.stubEnv(MIDSCENE_PLANNING_MODEL_NAME, 'qwen-vl-plus'); vi.stubEnv(MIDSCENE_PLANNING_MODEL_API_KEY, 'test-planning-key'); vi.stubEnv(MIDSCENE_PLANNING_MODEL_BASE_URL, 'https://api.openai.com/v1'); vi.stubEnv(MIDSCENE_PLANNING_LOCATOR_MODE, 'qwen-vl'); - vi.stubEnv(MIDSCENE_GROUNDING_MODEL_NAME, 'gpt-4-vision'); - vi.stubEnv(MIDSCENE_GROUNDING_MODEL_API_KEY, 'test-grounding-key'); - vi.stubEnv( - MIDSCENE_GROUNDING_MODEL_BASE_URL, - 'https://api.openai.com/v1', - ); - vi.stubEnv(MIDSCENE_MODEL_NAME, 'gpt-4'); vi.stubEnv(OPENAI_API_KEY, 'test-key'); vi.stubEnv(OPENAI_BASE_URL, 'https://api.openai.com/v1'); @@ -532,15 +494,12 @@ describe('ModelConfigManager', () => { const manager = new ModelConfigManager(undefined, mockCreateClient); manager.registerGlobalConfigManager(new GlobalConfigManager()); - const vqaConfig = manager.getModelConfig('VQA'); - expect(vqaConfig.createOpenAIClient).toBe(mockCreateClient); + const insightConfig = manager.getModelConfig('insight'); + expect(insightConfig.createOpenAIClient).toBe(mockCreateClient); const planningConfig = manager.getModelConfig('planning'); expect(planningConfig.createOpenAIClient).toBe(mockCreateClient); - const groundingConfig = manager.getModelConfig('grounding'); - expect(groundingConfig.createOpenAIClient).toBe(mockCreateClient); - const defaultConfig = manager.getModelConfig('default'); expect(defaultConfig.createOpenAIClient).toBe(mockCreateClient); }); diff --git a/packages/web-integration/tests/unit-test/agent.test.ts b/packages/web-integration/tests/unit-test/agent.test.ts index 207033f77..5fdd1be15 100644 --- a/packages/web-integration/tests/unit-test/agent.test.ts +++ b/packages/web-integration/tests/unit-test/agent.test.ts @@ -65,7 +65,7 @@ const mockedModelConfigFnResult = { const modelConfigCalcByMockedModelConfigFnResult = { from: 'modelConfig', httpProxy: undefined, - intent: 'VQA', + intent: 'insight', modelDescription: '', modelName: 'mock-model', openaiApiKey: 'mock-api-key',