@DonalEvans
Contributor

To support multimodal embedding, where inputs may be a mix of text and images, we need some way of tracking whether a given input is text or an image. The InferenceString object wraps the input String and associates it with a DataType enum value which indicates the type of data represented by the String.

  • Introduce InferenceString object to allow image inputs to be passed through inference code
  • Refactor EmbeddingsInput, EmbeddingRequestChunker and ChunkInferenceInput classes to handle InferenceString instead of String
  • Unwrap InferenceString prior to passing it into the existing Request classes used for embeddings to preserve existing behaviour
  • Update existing tests to handle InferenceString
  • Add additional test coverage for new behaviour
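
As a rough illustration of the wrapper described above, here is a minimal sketch. Only the names `InferenceString`, `DataType`, `TEXT` and `IMAGE_BASE64` come from this PR; the outer class `InferenceStringSketch`, the field and accessor names, and the `isImage()` helper are assumptions made for the example, and the real class may be shaped differently.

```java
import java.util.Objects;

// Illustrative sketch only; field and accessor names are assumptions.
final class InferenceStringSketch {

    // Kind of data represented by the wrapped String.
    enum DataType { TEXT, IMAGE_BASE64 }

    static final class InferenceString {
        private final String value;
        private final DataType dataType;

        InferenceString(String value, DataType dataType) {
            this.value = Objects.requireNonNull(value);
            this.dataType = Objects.requireNonNull(dataType);
        }

        String value() { return value; }

        DataType dataType() { return dataType; }

        boolean isText() { return DataType.TEXT.equals(dataType); }

        // Hypothetical helper, not confirmed by the PR.
        boolean isImage() { return DataType.IMAGE_BASE64.equals(dataType); }
    }
}
```

With a shape like this, a mixed request can carry text and base64-encoded image inputs in the same list, and code that only understands text can check `isText()` before unwrapping.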

@DonalEvans added the >refactoring, :ml (Machine learning), Team:ML (Meta label for the ML team) and v9.3.0 labels on Nov 7, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

return DataType.TEXT.equals(dataType);
}

public static List<String> toStringList(List<InferenceString> inferenceStrings) {
Contributor

Should we filter out DataType.IMAGE_BASE64 items?

Contributor Author

I don't think we should just filter out non-text inputs, because if any manage to make it into one of the two places we call this method, then there's a problem somewhere. Maybe an assert like in EmbeddingsInput.getTextInputs() just for safety? The two classes where this method is called (in ElasticsearchInternalService and SageMakerService) don't use EmbeddingsInput, which is why there's a slightly different flow for them.
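
A sketch of what the assert-based variant could look like, assuming a trimmed-down `InferenceString` with `value()` and `isText()` accessors. The outer class `UnwrapSketch`, the accessor names and the assertion message are hypothetical; only the `toStringList` signature and the `DataType` values come from this PR.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, trimmed-down stand-ins for the PR's types.
final class UnwrapSketch {

    enum DataType { TEXT, IMAGE_BASE64 }

    static final class InferenceString {
        private final String value;
        private final DataType dataType;

        InferenceString(String value, DataType dataType) {
            this.value = value;
            this.dataType = dataType;
        }

        String value() { return value; }

        boolean isText() { return DataType.TEXT.equals(dataType); }
    }

    static List<String> toStringList(List<InferenceString> inferenceStrings) {
        // Fail loudly (when assertions are enabled) instead of silently
        // dropping non-text inputs, which would hide a bug upstream.
        assert inferenceStrings.stream().allMatch(InferenceString::isText)
            : "Non-text input passed to toStringList";
        return inferenceStrings.stream()
            .map(InferenceString::value)
            .collect(Collectors.toList());
    }
}
```

Java `assert` statements only run when the JVM is started with `-ea`, so a check like this would catch a stray image input in tests without changing behaviour where assertions are disabled.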

  for (int chunkIndex = 0; chunkIndex < chunks.size(); chunkIndex++) {
      // If the number of chunks is larger than the maximum allowed value,
-     // scale the indices to [0, MAX) with similar number of original
+     // scale the indices to [0, MAX] with similar number of original
Contributor

Just to confirm, the change here is because MAX is inclusive, right?

Contributor Author

Oh, my mistake, I thought this was just a typo rather than indicating inclusive/exclusive. I learned something new today!
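
For readers unfamiliar with the notation: `[0, MAX)` is a half-open interval that includes 0 and excludes MAX, whereas `[0, MAX]` includes both ends. Below is a minimal sketch of scaling indices into the half-open range, assuming a simple proportional mapping. The class `IndexScalingSketch`, the method `scaleIndex` and the formula are hypothetical and are not taken from the chunker's actual code.

```java
// Hypothetical sketch of scaling chunk indices into a half-open range.
final class IndexScalingSketch {

    // Maps chunkIndex in [0, totalChunks) onto [0, max) when there are
    // more chunks than max; otherwise the index is returned unchanged.
    static int scaleIndex(int chunkIndex, int totalChunks, int max) {
        if (totalChunks <= max) {
            return chunkIndex;
        }
        // Use long arithmetic to avoid int overflow in the product.
        return (int) ((long) chunkIndex * max / totalChunks);
    }
}
```

Because `chunkIndex` is always strictly less than `totalChunks`, the scaled value is always strictly less than `max`, which is exactly what the `)` in `[0, MAX)` expresses.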
