
Commit 803cf00

quanru and claude committed
feat(shared): unify VQA and grounding models into insight model
Unified MIDSCENE_VQA_MODEL_* and MIDSCENE_GROUNDING_MODEL_* environment variables into a single MIDSCENE_INSIGHT_MODEL_* configuration.

Changes:
- Updated type definitions to use 'insight' intent instead of 'VQA' and 'grounding'
- Unified 12 environment variables into 6 INSIGHT variables
- Updated all agent code to use 'insight' intent
- Fixed all test cases (140/140 passing)
- Added comprehensive documentation for intent-based model configuration
- Fixed duplicate case clause warnings in test files

Breaking changes:
- Replaced TIntent type: 'VQA' | 'grounding' -> 'insight'
- Environment variables MIDSCENE_VQA_MODEL_* and MIDSCENE_GROUNDING_MODEL_* are no longer supported

Documentation updates:
- Added detailed intent-based configuration guide in model-provider.mdx (EN/ZH)
- Updated API documentation with modelConfig examples (EN/ZH)
- Updated choose-a-model.mdx with task type configuration section (EN/ZH)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 8378dae commit 803cf00
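For orientation, here is a minimal sketch of the breaking type change described in the commit message. Only the intent literals and environment-variable prefixes are quoted from this commit's message and docs; the surrounding names are illustrative, not copied from the source.

```typescript
// Illustrative sketch only: the intent literals and variable prefixes come
// from this commit's message and docs; everything else is hypothetical.

// Before: type TIntent included 'VQA' | 'grounding' (per the commit message).
// After: both former intents are served by a single 'insight' intent.
type TIntent = 'insight' | 'planning' | 'default';

// Each intent now maps to one environment-variable prefix.
const prefixForIntent: Record<TIntent, string> = {
  insight: 'MIDSCENE_INSIGHT_MODEL_', // replaces MIDSCENE_VQA_MODEL_* and MIDSCENE_GROUNDING_MODEL_*
  planning: 'MIDSCENE_PLANNING_MODEL_',
  default: 'MIDSCENE_MODEL_',
};
```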

19 files changed (+346, −238 lines)

apps/site/docs/en/api.mdx

Lines changed: 41 additions & 2 deletions
@@ -27,9 +27,14 @@ In Playwright and Puppeteer, there are some common parameters:
 
 These Agents also support the following advanced configuration parameters:
 
-- `modelConfig: () => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as VQA, planning, grounding, etc.).
+- `modelConfig: (params: { intent: TIntent }) => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as Insight, Planning, etc.).
 
-  **Example:**
+  The function receives a parameter object with an `intent` field indicating the current task type:
+  - `'insight'`: visual understanding and element location tasks (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
+  - `'planning'`: automatic planning tasks (such as `aiAct`)
+  - `'default'`: other uncategorized tasks
+
+  **Basic Example:**
 
   ```typescript
   const agent = new PuppeteerAgent(page, {
     modelConfig: () => ({
@@ -41,6 +46,40 @@ These Agents also support the following advanced configuration parameters:
   });
   ```
 
+  **Configure different models for different task types:**
+  ```typescript
+  const agent = new PuppeteerAgent(page, {
+    modelConfig: ({ intent }) => {
+      // Use Qwen-VL model for Insight tasks (for visual understanding and location)
+      if (intent === 'insight') {
+        return {
+          MIDSCENE_INSIGHT_MODEL_NAME: 'qwen-vl-plus',
+          MIDSCENE_INSIGHT_MODEL_API_KEY: 'sk-insight-key',
+          MIDSCENE_INSIGHT_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
+          MIDSCENE_INSIGHT_LOCATOR_MODE: 'qwen3-vl'
+        };
+      }
+
+      // Use GPT-4o model for Planning tasks (for task planning)
+      if (intent === 'planning') {
+        return {
+          MIDSCENE_PLANNING_MODEL_NAME: 'gpt-4o',
+          MIDSCENE_PLANNING_MODEL_API_KEY: 'sk-planning-key',
+          MIDSCENE_PLANNING_MODEL_BASE_URL: 'https://api.openai.com/v1'
+        };
+      }
+
+      // Default configuration
+      return {
+        MIDSCENE_MODEL_NAME: 'gpt-4o',
+        MIDSCENE_MODEL_API_KEY: 'sk-default-key',
+      };
+    }
+  });
+  ```
+
+  For more information about configuring models by task type, refer to the [Configure model and provider](./model-provider#configure-models-by-task-type-advanced) documentation.
+
 - `createOpenAIClient: (openai, options) => Promise<OpenAI | undefined>`: Optional. Custom OpenAI client wrapper function. Allows you to wrap the OpenAI client instance for integrating observability tools (such as LangSmith, LangFuse) or applying custom middleware.
 
   **Parameter Description:**

apps/site/docs/en/automate-with-scripts-in-yaml.mdx

Lines changed: 1 addition & 1 deletion
@@ -439,7 +439,7 @@ tasks:
     convertHttpImage2Base64: true
 ```
 
-For VQA steps like `aiAsk`, `aiQuery`, `aiBoolean`, `aiNumber`, `aiString`, and `aiAssert`, you can set the `prompt` and `images` fields directly.
+For insight steps like `aiAsk`, `aiQuery`, `aiBoolean`, `aiNumber`, `aiString`, and `aiAssert`, you can set the `prompt` and `images` fields directly.
 
 ```yaml
 tasks:

apps/site/docs/en/choose-a-model.mdx

Lines changed: 16 additions & 0 deletions
@@ -42,6 +42,22 @@ You need to configure the following environment variables before use:
 - `MIDSCENE_MODEL_API_KEY` - API key
 - `MIDSCENE_MODEL_NAME` - Model name
 
+### Configure Models by Task Type (Advanced)
+
+Midscene supports configuring different models for different task types:
+
+- **Insight tasks**: Visual understanding and element location (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
+- **Planning tasks**: Automatic planning tasks (such as `aiAct`)
+- **Default tasks**: Other uncategorized tasks
+
+You can use the following environment variable prefixes to configure models for different task types:
+
+- `MIDSCENE_INSIGHT_MODEL_*` - For visual understanding and element location tasks
+- `MIDSCENE_PLANNING_MODEL_*` - For automatic planning tasks
+- `MIDSCENE_MODEL_*` - Default configuration, used as fallback for other tasks
+
+For more details, refer to the [Configure model and provider](./model-provider#configure-models-by-task-type-advanced) documentation.
+
 
 ## Supported Vision Models
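For readers who prefer setting these from Node rather than the shell, a rough sketch follows; the values are placeholders, and it assumes the variables are assigned before Midscene reads the environment (the docs above do not specify this timing):

```typescript
// Placeholder values; assumes Midscene reads process.env after these run.
process.env.MIDSCENE_INSIGHT_MODEL_NAME = 'qwen-vl-plus'; // insight: visual understanding & location
process.env.MIDSCENE_PLANNING_MODEL_NAME = 'gpt-4o';      // planning: aiAct
process.env.MIDSCENE_MODEL_NAME = 'gpt-4o';               // default: fallback for everything else
process.env.MIDSCENE_MODEL_API_KEY = 'sk-default-key';
```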

apps/site/docs/en/model-provider.mdx

Lines changed: 54 additions & 0 deletions
@@ -39,6 +39,60 @@ Extra configs to use `Gemini 2.5 Pro` model:
 
 For more information about the models, see [Choose a model](./choose-a-model).
 
+### Configure Models by Task Type (Advanced)
+
+Midscene internally categorizes AI tasks into different intent types. You can configure different models for different intents:
+
+- **Insight tasks**: Visual Question Answering (VQA) and Visual Grounding, such as `aiQuery`, `aiLocate`, `aiTap`, etc.
+- **Planning tasks**: Automatic planning tasks, such as `aiAct`
+- **Default tasks**: Other uncategorized tasks
+
+Each task type can have independent model configurations:
+
+| Task Type | Environment Variable Prefix | Description |
+|-----------|----------------------------|-------------|
+| Insight | `MIDSCENE_INSIGHT_MODEL_*` | For visual understanding and element location tasks |
+| Planning | `MIDSCENE_PLANNING_MODEL_*` | For automatic planning tasks |
+| Default | `MIDSCENE_MODEL_*` | Default configuration, used as fallback for other tasks |
+
+Complete configuration options supported by each prefix:
+
+| Configuration | Description |
+|--------------|-------------|
+| `*_MODEL_NAME` | Model name |
+| `*_MODEL_API_KEY` | API key |
+| `*_MODEL_BASE_URL` | API endpoint URL |
+| `*_MODEL_HTTP_PROXY` | HTTP/HTTPS proxy |
+| `*_MODEL_SOCKS_PROXY` | SOCKS proxy |
+| `*_MODEL_INIT_CONFIG_JSON` | OpenAI SDK initialization config JSON |
+| `*_LOCATOR_MODE` | Locator mode (e.g. `qwen3-vl`, `vlm-ui-tars`, etc.) |
+
+**Example: Configure different models for Insight and Planning tasks**
+
+```bash
+# Insight tasks use Qwen-VL model (for visual understanding and location)
+export MIDSCENE_INSIGHT_MODEL_NAME="qwen-vl-plus"
+export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-insight-key"
+export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
+export MIDSCENE_INSIGHT_LOCATOR_MODE="qwen3-vl"
+
+# Planning tasks use GPT-4o model (for task planning)
+export MIDSCENE_PLANNING_MODEL_NAME="gpt-4o"
+export MIDSCENE_PLANNING_MODEL_API_KEY="sk-planning-key"
+export MIDSCENE_PLANNING_MODEL_BASE_URL="https://api.openai.com/v1"
+export MIDSCENE_PLANNING_LOCATOR_MODE="qwen3-vl"
+
+# Default configuration (used as fallback)
+export MIDSCENE_MODEL_NAME="gpt-4o"
+export MIDSCENE_MODEL_API_KEY="sk-default-key"
+```
+
+:::tip
+If a task type's configuration is not set, Midscene will automatically use the default `MIDSCENE_MODEL_*` configuration. In most cases, you only need to configure the default `MIDSCENE_MODEL_*` variables.
+:::
+
 ### Advanced configs
 
 Some advanced configs are also supported. Usually you don't need to use them.
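The tip in this diff implies a simple resolution rule. Below is a minimal TypeScript sketch of that rule; the `resolve` helper and `PREFIX` table are hypothetical illustrations, not Midscene's actual ModelConfigManager internals.

```typescript
// Hypothetical illustration of the fallback described in the tip above;
// not Midscene's actual implementation.
type TIntent = 'insight' | 'planning' | 'default';

const PREFIX: Record<TIntent, string> = {
  insight: 'MIDSCENE_INSIGHT',
  planning: 'MIDSCENE_PLANNING',
  default: 'MIDSCENE',
};

function resolve(intent: TIntent, suffix: string, env = process.env): string | undefined {
  // Prefer the intent-scoped variable, e.g. MIDSCENE_INSIGHT_MODEL_NAME...
  const scoped = env[`${PREFIX[intent]}_${suffix}`];
  // ...and fall back to the default MIDSCENE_* variable when it is unset.
  return scoped ?? env[`MIDSCENE_${suffix}`];
}

// resolve('insight', 'MODEL_NAME') reads MIDSCENE_INSIGHT_MODEL_NAME,
// falling back to MIDSCENE_MODEL_NAME.
```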

apps/site/docs/zh/api.mdx

Lines changed: 41 additions & 2 deletions
@@ -27,9 +27,14 @@ Each Agent in Midscene has its own constructor.
 
 These Agents also support the following advanced configuration parameters:
 
-- `modelConfig: () => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as VQA, planning, grounding, etc.).
+- `modelConfig: (params: { intent: TIntent }) => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as Insight, Planning, etc.).
 
-  **Example:**
+  The function receives a parameter object with an `intent` field indicating the current task type:
+  - `'insight'`: visual understanding and element location tasks (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
+  - `'planning'`: automatic planning tasks (such as `aiAct`)
+  - `'default'`: other uncategorized tasks
+
+  **Basic Example:**
 
   ```typescript
   const agent = new PuppeteerAgent(page, {
     modelConfig: () => ({
@@ -41,6 +46,40 @@ Each Agent in Midscene has its own constructor.
   });
   ```
 
+  **Configure different models for different task types:**
+  ```typescript
+  const agent = new PuppeteerAgent(page, {
+    modelConfig: ({ intent }) => {
+      // Use the Qwen-VL model for Insight tasks (visual understanding and location)
+      if (intent === 'insight') {
+        return {
+          MIDSCENE_INSIGHT_MODEL_NAME: 'qwen-vl-plus',
+          MIDSCENE_INSIGHT_MODEL_API_KEY: 'sk-insight-key',
+          MIDSCENE_INSIGHT_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
+          MIDSCENE_INSIGHT_LOCATOR_MODE: 'qwen3-vl'
+        };
+      }
+
+      // Use the GPT-4o model for Planning tasks (task planning)
+      if (intent === 'planning') {
+        return {
+          MIDSCENE_PLANNING_MODEL_NAME: 'gpt-4o',
+          MIDSCENE_PLANNING_MODEL_API_KEY: 'sk-planning-key',
+          MIDSCENE_PLANNING_MODEL_BASE_URL: 'https://api.openai.com/v1'
+        };
+      }
+
+      // Default configuration
+      return {
+        MIDSCENE_MODEL_NAME: 'gpt-4o',
+        MIDSCENE_MODEL_API_KEY: 'sk-default-key',
+      };
+    }
+  });
+  ```
+
+  For more information about configuring models by task type, refer to the [Configure model and provider](./model-provider#按任务类型配置模型高级) documentation.
+
 - `createOpenAIClient: (openai, options) => Promise<OpenAI | undefined>`: Optional. Custom OpenAI client wrapper function. Allows you to wrap the OpenAI client instance for integrating observability tools (such as LangSmith, LangFuse) or applying custom middleware.
 
   **Parameter Description:**

apps/site/docs/zh/choose-a-model.mdx

Lines changed: 16 additions & 0 deletions
@@ -42,6 +42,22 @@ Midscene requires the model provider to offer an OpenAI-compatible interface.
 - `MIDSCENE_MODEL_API_KEY` - API key
 - `MIDSCENE_MODEL_NAME` - Model name
 
+### Configure Models by Task Type (Advanced)
+
+Midscene supports configuring different models for different task types:
+
+- **Insight tasks**: Visual understanding and element location (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
+- **Planning tasks**: Automatic planning tasks (such as `aiAct`)
+- **Default tasks**: Other uncategorized tasks
+
+You can use the following environment variable prefixes to configure models for different task types:
+
+- `MIDSCENE_INSIGHT_MODEL_*` - For visual understanding and element location tasks
+- `MIDSCENE_PLANNING_MODEL_*` - For automatic planning tasks
+- `MIDSCENE_MODEL_*` - Default configuration, used as fallback for other tasks
+
+For more details, refer to the [Configure model and provider](./model-provider#按任务类型配置模型高级) documentation.
+
 
 ## Supported Vision Models

apps/site/docs/zh/model-provider.mdx

Lines changed: 54 additions & 0 deletions
@@ -42,6 +42,60 @@ Midscene integrates the OpenAI SDK by default to call AI services. Using this SDK limits
 
 For more information about the models, see [Choose an AI model](./choose-a-model).
 
+### Configure Models by Task Type (Advanced)
+
+Midscene internally categorizes AI tasks into different intent types. You can configure different models for different intents:
+
+- **Insight tasks**: Visual Question Answering (VQA) and Visual Grounding, such as the `aiQuery`, `aiLocate`, and `aiTap` methods
+- **Planning tasks**: Automatic planning tasks, such as the `aiAct` method
+- **Default tasks**: Other uncategorized tasks
+
+Each task type can have independent model configurations:
+
+| Task Type | Environment Variable Prefix | Description |
+|-----------|----------------------------|-------------|
+| Insight | `MIDSCENE_INSIGHT_MODEL_*` | For visual understanding and element location tasks |
+| Planning | `MIDSCENE_PLANNING_MODEL_*` | For automatic planning tasks |
+| Default | `MIDSCENE_MODEL_*` | Default configuration, used as fallback for other tasks |
+
+Complete configuration options supported by each prefix:
+
+| Configuration | Description |
+|--------------|-------------|
+| `*_MODEL_NAME` | Model name |
+| `*_MODEL_API_KEY` | API key |
+| `*_MODEL_BASE_URL` | API endpoint URL |
+| `*_MODEL_HTTP_PROXY` | HTTP/HTTPS proxy |
+| `*_MODEL_SOCKS_PROXY` | SOCKS proxy |
+| `*_MODEL_INIT_CONFIG_JSON` | OpenAI SDK initialization config JSON |
+| `*_LOCATOR_MODE` | Locator mode (e.g. `qwen3-vl`, `vlm-ui-tars`, etc.) |
+
+**Example: Configure different models for Insight and Planning tasks**
+
+```bash
+# Insight tasks use the Qwen-VL model (visual understanding and location)
+export MIDSCENE_INSIGHT_MODEL_NAME="qwen-vl-plus"
+export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-insight-key"
+export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
+export MIDSCENE_INSIGHT_LOCATOR_MODE="qwen3-vl"
+
+# Planning tasks use the GPT-4o model (task planning)
+export MIDSCENE_PLANNING_MODEL_NAME="gpt-4o"
+export MIDSCENE_PLANNING_MODEL_API_KEY="sk-planning-key"
+export MIDSCENE_PLANNING_MODEL_BASE_URL="https://api.openai.com/v1"
+export MIDSCENE_PLANNING_LOCATOR_MODE="qwen3-vl"
+
+# Default configuration (used as fallback)
+export MIDSCENE_MODEL_NAME="gpt-4o"
+export MIDSCENE_MODEL_API_KEY="sk-default-key"
+```
+
+:::tip
+If a task type's configuration is not set, Midscene will automatically fall back to the default `MIDSCENE_MODEL_*` configuration. In most cases, you only need to configure the default `MIDSCENE_MODEL_*` variables.
+:::
+
 ### Advanced configs
 
 Some advanced config options are also supported; usually you don't need them.

packages/core/src/agent/agent.ts

Lines changed: 9 additions & 9 deletions
@@ -471,7 +471,7 @@ export class Agent<
     );
 
     // assume all operation in action space is related to locating
-    const modelConfig = this.modelConfigManager.getModelConfig('grounding');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
 
     const { output, runner } = await this.taskExecutor.runPlans(
       title,
@@ -796,7 +796,7 @@
     demand: ServiceExtractParam,
     opt: ServiceExtractOption = defaultServiceExtractOption,
   ): Promise<ReturnType> {
-    const modelConfig = this.modelConfigManager.getModelConfig('VQA');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
     const { output } = await this.taskExecutor.createTypeQueryExecution(
       'Query',
       demand,
@@ -810,7 +810,7 @@
     prompt: TUserPrompt,
    opt: ServiceExtractOption = defaultServiceExtractOption,
   ): Promise<boolean> {
-    const modelConfig = this.modelConfigManager.getModelConfig('VQA');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
 
     const { textPrompt, multimodalPrompt } = parsePrompt(prompt);
     const { output } = await this.taskExecutor.createTypeQueryExecution(
@@ -827,7 +827,7 @@
     prompt: TUserPrompt,
     opt: ServiceExtractOption = defaultServiceExtractOption,
   ): Promise<number> {
-    const modelConfig = this.modelConfigManager.getModelConfig('VQA');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
 
     const { textPrompt, multimodalPrompt } = parsePrompt(prompt);
     const { output } = await this.taskExecutor.createTypeQueryExecution(
@@ -844,7 +844,7 @@
     prompt: TUserPrompt,
     opt: ServiceExtractOption = defaultServiceExtractOption,
   ): Promise<string> {
-    const modelConfig = this.modelConfigManager.getModelConfig('VQA');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
 
     const { textPrompt, multimodalPrompt } = parsePrompt(prompt);
     const { output } = await this.taskExecutor.createTypeQueryExecution(
@@ -895,7 +895,7 @@
       deepThink,
     );
     // use same intent as aiLocate
-    const modelConfig = this.modelConfigManager.getModelConfig('grounding');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
 
     const text = await this.service.describe(center, modelConfig, {
       deepThink,
@@ -956,7 +956,7 @@
     assert(locateParam, 'cannot get locate param for aiLocate');
     const locatePlan = locatePlanForLocate(locateParam);
     const plans = [locatePlan];
-    const modelConfig = this.modelConfigManager.getModelConfig('grounding');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
 
     const { output } = await this.taskExecutor.runPlans(
       taskTitleStr('Locate', locateParamStr(locateParam)),
@@ -986,7 +986,7 @@
     msg?: string,
     opt?: AgentAssertOpt & ServiceExtractOption,
   ) {
-    const modelConfig = this.modelConfigManager.getModelConfig('VQA');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
 
     const serviceOpt: ServiceExtractOption = {
       domIncluded: opt?.domIncluded ?? defaultServiceExtractOption.domIncluded,
@@ -1058,7 +1058,7 @@
   }
 
   async aiWaitFor(assertion: TUserPrompt, opt?: AgentWaitForOpt) {
-    const modelConfig = this.modelConfigManager.getModelConfig('VQA');
+    const modelConfig = this.modelConfigManager.getModelConfig('insight');
     await this.taskExecutor.waitFor(
       assertion,
       {
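Because `'VQA'` and `'grounding'` no longer exist on `TIntent`, any external code that passed those literals to `getModelConfig` breaks at compile time. A hypothetical migration helper (not part of this commit or of Midscene's public API) makes the mapping explicit:

```typescript
// Hypothetical migration helper for downstream callers; not part of this
// commit or the Midscene API.
type LegacyIntent = 'VQA' | 'grounding';
type TIntent = 'insight' | 'planning' | 'default';

function toCurrentIntent(intent: LegacyIntent | TIntent): TIntent {
  // Both removed intents were unified into 'insight' by this commit.
  return intent === 'VQA' || intent === 'grounding' ? 'insight' : intent;
}
```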

packages/core/tests/ai/service/service.test.ts

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ vi.setConfig({
   testTimeout: 60 * 1000,
 });
 
-const modelConfig = globalModelConfigManager.getModelConfig('grounding');
+const modelConfig = globalModelConfigManager.getModelConfig('insight');
 
 describe.skipIf(!modelConfig.vlMode)('service locate with deep think', () => {
   test('service locate with search area', async () => {
