43 changes: 41 additions & 2 deletions apps/site/docs/en/api.mdx
@@ -27,9 +27,14 @@ In Playwright and Puppeteer, there are some common parameters:

These Agents also support the following advanced configuration parameters:

- `modelConfig: () => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as VQA, planning, grounding, etc.).
- `modelConfig: (params: { intent: TIntent }) => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as Insight, Planning, etc.).

**Example:**
The function receives a parameter object with an `intent` field indicating the current task type:
- `'insight'`: Visual understanding and element location tasks (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
- `'planning'`: Automatic planning tasks (such as `aiAct`)
- `'default'`: Other uncategorized tasks

**Basic Example:**
```typescript
const agent = new PuppeteerAgent(page, {
modelConfig: () => ({
@@ -41,6 +46,40 @@ These Agents also support the following advanced configuration parameters:
});
```

**Configure different models for different task types:**
```typescript
const agent = new PuppeteerAgent(page, {
modelConfig: ({ intent }) => {
// Use Qwen-VL model for Insight tasks (for visual understanding and location)
if (intent === 'insight') {
return {
MIDSCENE_INSIGHT_MODEL_NAME: 'qwen-vl-plus',
MIDSCENE_INSIGHT_MODEL_API_KEY: 'sk-insight-key',
MIDSCENE_INSIGHT_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1'
};
}

// Use GPT-4o model for Planning tasks (for task planning)
if (intent === 'planning') {
return {
MIDSCENE_PLANNING_MODEL_NAME: 'gpt-4o',
MIDSCENE_PLANNING_MODEL_API_KEY: 'sk-planning-key',
MIDSCENE_PLANNING_MODEL_BASE_URL: 'https://api.openai.com/v1',
MIDSCENE_PLANNING_LOCATOR_MODE: 'qwen3-vl'
};
}

// Default configuration
return {
MIDSCENE_MODEL_NAME: 'gpt-4o',
MIDSCENE_MODEL_API_KEY: 'sk-default-key',
};
}
});
```

For more information about configuring models by task type, refer to the [Configure model and provider](./model-provider#configure-models-by-task-type-advanced) documentation.

- `createOpenAIClient: (openai, options) => Promise<OpenAI | undefined>`: Optional. Custom OpenAI client wrapper function. Allows you to wrap the OpenAI client instance for integrating observability tools (such as LangSmith, LangFuse) or applying custom middleware.
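
  For example, a minimal sketch that wraps the client with the Langfuse SDK's `observeOpenAI` helper (this assumes you have Langfuse set up; any observability tool that exposes a similar client wrapper can be used the same way):

  ```typescript
  import { observeOpenAI } from 'langfuse';

  const agent = new PuppeteerAgent(page, {
    createOpenAIClient: async (openai, options) => {
      // Wrap the client so every model call is traced; Midscene uses the returned instance.
      return observeOpenAI(openai);
    },
  });
  ```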

**Parameter Description:**
2 changes: 1 addition & 1 deletion apps/site/docs/en/automate-with-scripts-in-yaml.mdx
@@ -439,7 +439,7 @@ tasks:
convertHttpImage2Base64: true
```

For VQA steps like `aiAsk`, `aiQuery`, `aiBoolean`, `aiNumber`, `aiString`, and `aiAssert`, you can set the `prompt` and `images` fields directly.
For insight steps like `aiAsk`, `aiQuery`, `aiBoolean`, `aiNumber`, `aiString`, and `aiAssert`, you can set the `prompt` and `images` fields directly.

```yaml
tasks:
16 changes: 16 additions & 0 deletions apps/site/docs/en/choose-a-model.mdx
@@ -42,6 +42,22 @@ You need to configure the following environment variables before use:
- `MIDSCENE_MODEL_API_KEY` - API key
- `MIDSCENE_MODEL_NAME` - Model name

### Configure Models by Task Type (Advanced)

Midscene supports configuring different models for different task types:

- **Insight tasks**: Visual understanding and element location (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
- **Planning tasks**: Automatic planning tasks (such as `aiAct`)
- **Default tasks**: Other uncategorized tasks

You can use the following environment variable prefixes to configure models for different task types:

- `MIDSCENE_INSIGHT_MODEL_*` - For visual understanding and element location tasks
- `MIDSCENE_PLANNING_MODEL_*` - For automatic planning tasks
- `MIDSCENE_MODEL_*` - Default configuration, used as fallback for other tasks
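
For example, a minimal sketch that routes Insight tasks to a dedicated vision model and lets everything else fall back to the default configuration (the API keys below are placeholders):

```bash
# Insight tasks use a dedicated vision model
export MIDSCENE_INSIGHT_MODEL_NAME="qwen-vl-plus"
export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-..."
export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

# All other tasks fall back to the default configuration
export MIDSCENE_MODEL_NAME="gpt-4o"
export MIDSCENE_MODEL_API_KEY="sk-..."
```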

For more details, refer to the [Configure model and provider](./model-provider#configure-models-by-task-type-advanced) documentation.


## Supported Vision Models

54 changes: 54 additions & 0 deletions apps/site/docs/en/model-provider.mdx
@@ -39,6 +39,60 @@ Extra configs to use `Gemini 2.5 Pro` model:

For more information about the models, see [Choose a model](./choose-a-model).

### Configure Models by Task Type (Advanced)

Midscene internally categorizes AI tasks into different intent types. You can configure different models for different intents:

- **Insight tasks**: Visual Question Answering (VQA) and Visual Grounding, such as `aiQuery`, `aiLocate`, `aiTap`, etc.
- **Planning tasks**: Automatic planning tasks, such as `aiAct`
- **Default tasks**: Other uncategorized tasks

Each task type can have independent model configurations:

| Task Type | Environment Variable Prefix | Description |
|-----------|---------------------------|-------------|
| Insight | `MIDSCENE_INSIGHT_MODEL_*` | For visual understanding and element location tasks |
| Planning | `MIDSCENE_PLANNING_MODEL_*` | For automatic planning tasks |
| Default | `MIDSCENE_MODEL_*` | Default configuration, used as fallback for other tasks |

Complete configuration options supported by each prefix:

| Configuration | Description |
|--------------|-------------|
| `*_MODEL_NAME` | Model name |
| `*_MODEL_API_KEY` | API key |
| `*_MODEL_BASE_URL` | API endpoint URL |
| `*_MODEL_HTTP_PROXY` | HTTP/HTTPS proxy |
| `*_MODEL_SOCKS_PROXY` | SOCKS proxy |
| `*_MODEL_INIT_CONFIG_JSON` | OpenAI SDK initialization config JSON |
| `*_LOCATOR_MODE` | Locator mode (e.g. `qwen3-vl`, `vlm-ui-tars`, etc.) |

**Example: Configure different models for Insight and Planning tasks**

```bash
# Insight tasks use Qwen-VL model (for visual understanding and location)
export MIDSCENE_INSIGHT_MODEL_NAME="qwen-vl-plus"
export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-insight-key"
export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MIDSCENE_INSIGHT_LOCATOR_MODE="qwen3-vl"

# Planning tasks use GPT-4o model (for task planning)
export MIDSCENE_PLANNING_MODEL_NAME="gpt-4o"
export MIDSCENE_PLANNING_MODEL_API_KEY="sk-planning-key"
export MIDSCENE_PLANNING_MODEL_BASE_URL="https://api.openai.com/v1"
export MIDSCENE_PLANNING_LOCATOR_MODE="qwen3-vl"

# Default configuration (used as fallback)
export MIDSCENE_MODEL_NAME="gpt-4o"
export MIDSCENE_MODEL_API_KEY="sk-default-key"
```

:::tip

If a task type's configuration is not set, Midscene will automatically use the default `MIDSCENE_MODEL_*` configuration. In most cases, you only need to configure the default `MIDSCENE_MODEL_*` variables.

:::

### Advanced configs

Some advanced configs are also supported. Usually you don't need to use them.
43 changes: 41 additions & 2 deletions apps/site/docs/zh/api.mdx
@@ -27,9 +27,14 @@ Each Agent in Midscene has its own constructor.

These Agents also support the following advanced configuration parameters:

- `modelConfig: () => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as VQA, planning, grounding, etc.).
- `modelConfig: (params: { intent: TIntent }) => IModelConfig`: Optional. Custom model configuration function. Allows you to dynamically configure different models through code instead of environment variables. This is particularly useful when you need to use different models for different AI tasks (such as Insight, Planning, etc.).

**Example:**
The function receives a parameter object with an `intent` field indicating the current task type:
- `'insight'`: Visual understanding and element location tasks (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
- `'planning'`: Automatic planning tasks (such as `aiAct`)
- `'default'`: Other uncategorized tasks

**Basic Example:**
```typescript
const agent = new PuppeteerAgent(page, {
modelConfig: () => ({
@@ -41,6 +46,40 @@ Each Agent in Midscene has its own constructor.
});
```

**Configure different models for different task types:**
```typescript
const agent = new PuppeteerAgent(page, {
modelConfig: ({ intent }) => {
// Use the Qwen-VL model for Insight tasks (for visual understanding and locating)
if (intent === 'insight') {
return {
MIDSCENE_INSIGHT_MODEL_NAME: 'qwen-vl-plus',
MIDSCENE_INSIGHT_MODEL_API_KEY: 'sk-insight-key',
MIDSCENE_INSIGHT_MODEL_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1'
};
}

// Use the GPT-4o model for Planning tasks (for task planning)
if (intent === 'planning') {
return {
MIDSCENE_PLANNING_MODEL_NAME: 'gpt-4o',
MIDSCENE_PLANNING_MODEL_API_KEY: 'sk-planning-key',
MIDSCENE_PLANNING_MODEL_BASE_URL: 'https://api.openai.com/v1',
MIDSCENE_PLANNING_LOCATOR_MODE: 'qwen3-vl'
};
}

// Default configuration
return {
MIDSCENE_MODEL_NAME: 'gpt-4o',
MIDSCENE_MODEL_API_KEY: 'sk-default-key',
};
}
});
```

For more information about configuring models by task type, refer to the [Configure model and provider](./model-provider#按任务类型配置模型高级) documentation.

- `createOpenAIClient: (openai, options) => Promise<OpenAI | undefined>`: Optional. Custom OpenAI client wrapper function. Allows you to wrap the OpenAI client instance for integrating observability tools (such as LangSmith, LangFuse) or applying custom middleware.

**Parameter Description:**
16 changes: 16 additions & 0 deletions apps/site/docs/zh/choose-a-model.mdx
@@ -42,6 +42,22 @@ Midscene requires the model provider to offer an OpenAI-compatible API.
- `MIDSCENE_MODEL_API_KEY` - API key
- `MIDSCENE_MODEL_NAME` - Model name

### Configure Models by Task Type (Advanced)

Midscene supports configuring different models for different task types:

- **Insight tasks**: Visual understanding and element location (such as `aiQuery`, `aiLocate`, `aiTap`, etc.)
- **Planning tasks**: Automatic planning tasks (such as `aiAct`)
- **Default tasks**: Other uncategorized tasks

You can use the following environment variable prefixes to configure models for different task types:

- `MIDSCENE_INSIGHT_MODEL_*` - For visual understanding and element location tasks
- `MIDSCENE_PLANNING_MODEL_*` - For automatic planning tasks
- `MIDSCENE_MODEL_*` - Default configuration, used as fallback for other tasks

For more details, refer to the [Configure model and provider](./model-provider#按任务类型配置模型高级) documentation.


## Supported Vision Models

54 changes: 54 additions & 0 deletions apps/site/docs/zh/model-provider.mdx
@@ -42,6 +42,60 @@ Midscene integrates the OpenAI SDK by default to call AI services. Using this SDK restricts

For more information about the models, see [Choose a model](./choose-a-model).

### Configure Models by Task Type (Advanced)

Midscene internally categorizes AI tasks into different intent types. You can configure different models for different intents:

- **Insight tasks**: Visual Question Answering (VQA) and Visual Grounding, such as the `aiQuery`, `aiLocate`, and `aiTap` methods
- **Planning tasks**: Automatic planning tasks, such as the `aiAct` method
- **Default tasks**: Other uncategorized tasks

Each task type can have independent model configurations:

| Task Type | Environment Variable Prefix | Description |
|-----------|---------------------------|-------------|
| Insight | `MIDSCENE_INSIGHT_MODEL_*` | For visual understanding and element location tasks |
| Planning | `MIDSCENE_PLANNING_MODEL_*` | For automatic planning tasks |
| Default | `MIDSCENE_MODEL_*` | Default configuration, used as fallback for other tasks |

Complete configuration options supported by each prefix:

| Configuration | Description |
|--------------|-------------|
| `*_MODEL_NAME` | Model name |
| `*_MODEL_API_KEY` | API key |
| `*_MODEL_BASE_URL` | API endpoint URL |
| `*_MODEL_HTTP_PROXY` | HTTP/HTTPS proxy |
| `*_MODEL_SOCKS_PROXY` | SOCKS proxy |
| `*_MODEL_INIT_CONFIG_JSON` | OpenAI SDK initialization config JSON |
| `*_LOCATOR_MODE` | Locator mode (e.g. `qwen3-vl`, `vlm-ui-tars`, etc.) |

**Example: Configure different models for Insight and Planning tasks**

```bash
# Insight tasks use the Qwen-VL model (for visual understanding and locating)
export MIDSCENE_INSIGHT_MODEL_NAME="qwen-vl-plus"
export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-insight-key"
export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MIDSCENE_INSIGHT_LOCATOR_MODE="qwen3-vl"

# Planning tasks use the GPT-4o model (for task planning)
export MIDSCENE_PLANNING_MODEL_NAME="gpt-4o"
export MIDSCENE_PLANNING_MODEL_API_KEY="sk-planning-key"
export MIDSCENE_PLANNING_MODEL_BASE_URL="https://api.openai.com/v1"
export MIDSCENE_PLANNING_LOCATOR_MODE="qwen3-vl"

# Default configuration (used as fallback)
export MIDSCENE_MODEL_NAME="gpt-4o"
export MIDSCENE_MODEL_API_KEY="sk-default-key"
```

:::tip

If a task type's configuration is not set, Midscene will automatically fall back to the default `MIDSCENE_MODEL_*` configuration. In most cases, you only need to configure the default `MIDSCENE_MODEL_*` variables.

:::

### Advanced configs

Some advanced configs are also supported. Usually you don't need to use them.
18 changes: 9 additions & 9 deletions packages/core/src/agent/agent.ts
@@ -471,7 +471,7 @@ export class Agent<
);

// assume all operation in action space is related to locating
const modelConfig = this.modelConfigManager.getModelConfig('grounding');
const modelConfig = this.modelConfigManager.getModelConfig('insight');

const { output, runner } = await this.taskExecutor.runPlans(
title,
@@ -796,7 +796,7 @@
demand: ServiceExtractParam,
opt: ServiceExtractOption = defaultServiceExtractOption,
): Promise<ReturnType> {
const modelConfig = this.modelConfigManager.getModelConfig('VQA');
const modelConfig = this.modelConfigManager.getModelConfig('insight');
const { output } = await this.taskExecutor.createTypeQueryExecution(
'Query',
demand,
@@ -810,7 +810,7 @@
prompt: TUserPrompt,
opt: ServiceExtractOption = defaultServiceExtractOption,
): Promise<boolean> {
const modelConfig = this.modelConfigManager.getModelConfig('VQA');
const modelConfig = this.modelConfigManager.getModelConfig('insight');

const { textPrompt, multimodalPrompt } = parsePrompt(prompt);
const { output } = await this.taskExecutor.createTypeQueryExecution(
@@ -827,7 +827,7 @@
prompt: TUserPrompt,
opt: ServiceExtractOption = defaultServiceExtractOption,
): Promise<number> {
const modelConfig = this.modelConfigManager.getModelConfig('VQA');
const modelConfig = this.modelConfigManager.getModelConfig('insight');

const { textPrompt, multimodalPrompt } = parsePrompt(prompt);
const { output } = await this.taskExecutor.createTypeQueryExecution(
@@ -844,7 +844,7 @@
prompt: TUserPrompt,
opt: ServiceExtractOption = defaultServiceExtractOption,
): Promise<string> {
const modelConfig = this.modelConfigManager.getModelConfig('VQA');
const modelConfig = this.modelConfigManager.getModelConfig('insight');

const { textPrompt, multimodalPrompt } = parsePrompt(prompt);
const { output } = await this.taskExecutor.createTypeQueryExecution(
@@ -895,7 +895,7 @@
deepThink,
);
// use same intent as aiLocate
const modelConfig = this.modelConfigManager.getModelConfig('grounding');
const modelConfig = this.modelConfigManager.getModelConfig('insight');

const text = await this.service.describe(center, modelConfig, {
deepThink,
@@ -956,7 +956,7 @@
assert(locateParam, 'cannot get locate param for aiLocate');
const locatePlan = locatePlanForLocate(locateParam);
const plans = [locatePlan];
const modelConfig = this.modelConfigManager.getModelConfig('grounding');
const modelConfig = this.modelConfigManager.getModelConfig('insight');

const { output } = await this.taskExecutor.runPlans(
taskTitleStr('Locate', locateParamStr(locateParam)),
@@ -986,7 +986,7 @@
msg?: string,
opt?: AgentAssertOpt & ServiceExtractOption,
) {
const modelConfig = this.modelConfigManager.getModelConfig('VQA');
const modelConfig = this.modelConfigManager.getModelConfig('insight');

const serviceOpt: ServiceExtractOption = {
domIncluded: opt?.domIncluded ?? defaultServiceExtractOption.domIncluded,
@@ -1058,7 +1058,7 @@
}

async aiWaitFor(assertion: TUserPrompt, opt?: AgentWaitForOpt) {
const modelConfig = this.modelConfigManager.getModelConfig('VQA');
const modelConfig = this.modelConfigManager.getModelConfig('insight');
await this.taskExecutor.waitFor(
assertion,
{
2 changes: 1 addition & 1 deletion packages/core/tests/ai/service/service.test.ts
@@ -12,7 +12,7 @@ vi.setConfig({
testTimeout: 60 * 1000,
});

const modelConfig = globalModelConfigManager.getModelConfig('grounding');
const modelConfig = globalModelConfigManager.getModelConfig('insight');

describe.skipIf(!modelConfig.vlMode)('service locate with deep think', () => {
test('service locate with search area', async () => {