Commit fa559e3

Two quick updates to AI evaluators UI [public preview] (#57353)
1 parent 2f93ca4


content/github-models/use-github-models/evaluating-ai-models.md

Lines changed: 3 additions & 1 deletion

@@ -131,11 +131,13 @@ After applying the parameters, you can add additional columns to compare more mo

Once the prompt is configured, run a structured evaluation to compare model outputs using real data and repeatable metrics.

-Model evaluation helps you understand how different models and prompt configurations perform across real inputs. In the Prompt view, you can apply evaluators to multiple models side by side and review metrics such as similarity, relevance, and groundedness.
+Model evaluation helps you understand how different models and prompt configurations perform across real inputs. In the Prompt view, you can apply evaluators to multiple models side by side and review metrics such as similarity, fluency, coherence, relevance, and groundedness.

The following evaluators are available:

* **Similarity**: Measures how closely a model's output matches an expected or reference answer. This is useful when you want to confirm that the model returns consistent and accurate responses aligned with a known result. The score ranges from 0 to 1, with higher values indicating greater similarity.
+* **Fluency**: Evaluates the linguistic quality of a response, including grammar, coherence, and readability. Use it when you want to confirm that responses are linguistically correct and well formed.
+* **Coherence**: Assesses the ability of the LLM to generate text that reads naturally, flows smoothly, and resembles human-like language in its responses. Use it when assessing the readability and user-friendliness of a model’s generated responses in real-world applications.
* **Relevance**: Refers to how effectively a response addresses a question. It assesses the accuracy, completeness, and direct relevance of the response based solely on the given information. The score ranges from 0 to 1, with higher values indicating stronger alignment with the input's intent.
* **Groundedness**: Measures how well an answer is anchored in the provided context, evaluating its relevance, accuracy, and completeness based exclusively on that context. It assesses the extent to which the answer fully addresses the question without introducing unrelated or incorrect information. The score ranges from 0 to 1, with higher values indicating higher accuracy.
* **Custom prompt**: Lets you define your own evaluation criteria for one LLM to assess the output of another. This allows you to score model outputs based on your own guidelines. You can choose between pass/fail or scored evaluations, making it ideal for scenarios where standard metrics do not capture testing expectations.
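
Several of these evaluators report a score from 0 to 1. As a point of reference for how such a score behaves, here is a minimal Python sketch of a toy token-overlap similarity metric. It is illustrative only and is not the metric behind the built-in **Similarity** evaluator; it simply shows a score that rises toward 1 as the output converges on the reference answer.

```python
# Toy similarity evaluator: token-overlap F1 between a model output and a
# reference answer. Returns a score from 0 to 1, higher meaning closer --
# the same scale the built-in evaluators report. Illustrative only; this
# is not the metric GitHub Models uses.
from collections import Counter

def similarity_score(output: str, reference: str) -> float:
    out_tokens = Counter(output.lower().split())
    ref_tokens = Counter(reference.lower().split())
    overlap = sum((out_tokens & ref_tokens).values())  # shared token count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(similarity_score("Paris is the capital of France",
                       "The capital of France is Paris"))  # 1.0
print(similarity_score("I am not sure", "Paris"))          # 0.0
```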
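The **Custom prompt** evaluator follows the LLM-as-judge pattern: one model grades another model's output against criteria you write. The sketch below shows that general idea outside the Prompt view, using the OpenAI-compatible client style that GitHub Models supports; the endpoint, judge model name, and criteria are assumptions for illustration, not the feature's actual wiring.

```python
# Minimal LLM-as-judge sketch for a pass/fail custom evaluation.
# ASSUMPTIONS: the base_url, the judge model name, and the criteria below
# are illustrative placeholders -- adjust them for your own setup.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://models.github.ai/inference",  # assumed endpoint
    api_key=os.environ["GITHUB_TOKEN"],             # a GitHub token
)

JUDGE_PROMPT = """You are an evaluator. Criteria: the answer must be grounded
in the provided context and must not exceed two sentences. Reply with exactly
one word: PASS or FAIL.

Context: {context}
Question: {question}
Answer: {answer}"""

def judge(context: str, question: str, answer: str) -> bool:
    """Return True if the judge model labels the answer PASS."""
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",  # hypothetical judge model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        temperature=0,  # keep judging as deterministic as possible
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```

A scored variant of the same pattern would ask the judge to reply with a number between 0 and 1 instead of PASS or FAIL.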
