Skip to content

Conversation

@prwhelan
Copy link
Member

@prwhelan prwhelan commented Nov 7, 2025

Implements a lazy-loading cache in front of the CCM Inference index. By default, the cache holds a single entry for 15 minutes, and cache misses search the CCM index and load the responses into the cache.

Some design decisions:

  • The cache maintains an "empty" entry so that the isPresent call can reuse the "empty" response to quickly return false. Invalidating the cache or calling get will drop this "empty" entry.
  • Invalidating the cache will broadcast a message to all nodes so that all caches on all nodes will invalidate their caches.
  • Since the broadcast message only works if all nodes are on the latest version, there is a new NodeFeature to enable the cache once all nodes in the cluster have upgraded.

Implements a lazy-loading cache in front of the CCM Inference index.
By default, the cache holds a single entry for 15 minutes, and cache
misses search the CCM index and load the responses into the cache.

Some design decisions:
- The cache maintains an "empty" entry so that the `isPresent` call can
  reuse the "empty" response to quickly return false. Invalidating the
  cache or calling get will drop this "empty" entry.
- Invalidating the cache will broadcast a message to all nodes so that
  all caches on all nodes will invalidate their caches.
- Since the broadcast message only works if all nodes are on the latest
  version, there is a new NodeFeature to enable the cache once all nodes
  in the cluster have upgraded.
@prwhelan prwhelan added >non-issue :ml Machine learning Team:ML Meta label for the ML team v9.3.0 labels Nov 7, 2025
@prwhelan prwhelan marked this pull request as ready for review November 10, 2025 17:55
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Copy link
Contributor

@jonathan-buttner jonathan-buttner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Pat! Left a few suggestions.

return state.clusterRecovered() && featureService.clusterHasFeature(state, InferenceFeatures.INFERENCE_CCM_CACHE);
}

private void enabled(ProjectId projectId, CCMModel ccmModel) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It was a bit hard to understand enabled/disabled, cacheEnabled, isEnabled, and CCMModelEntry::enabled. What about for the methods that do the put we call them enabledEntry(), setEnabled, or createEnabledEntry?

this.client = client;
}

public void get(ActionListener<CCMModel> listener) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment mentioning that the expectation is that a caller would first interact with isEnabled before calling get() and mention that if the configuration entry is in the disabled state that this call with read through to the backing index.


public class CCMCacheTests extends ESSingleNodeTestCase {

private static final TimeValue TIMEOUT = new TimeValue(30, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using this could we use TimeValue.THIRTY_SECONDS?

assertFalse(isPresent());
assertThat(ccmCache.stats().getHits(), equalTo(1L));
assertThat(ccmCache.stats().getMisses(), equalTo(1L));
}
Copy link
Contributor

@jonathan-buttner jonathan-buttner Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed it but do we have a test for the else case in this block:

CCMCache::get

if (cachedEntry != null && cachedEntry.enabled()) {
            listener.onResponse(cachedEntry.ccmModel());
        } else {

Could we add a test that when the internal entry is in the disabled state (but present and not null) that we get a cache miss aka hit the else case?

@prwhelan prwhelan enabled auto-merge (squash) November 11, 2025 14:24
@prwhelan prwhelan disabled auto-merge November 11, 2025 14:24
@prwhelan prwhelan enabled auto-merge (squash) November 11, 2025 14:42
@prwhelan prwhelan disabled auto-merge November 11, 2025 17:15
@prwhelan prwhelan enabled auto-merge (squash) November 13, 2025 18:42
@prwhelan prwhelan merged commit 4d2d053 into elastic:main Nov 13, 2025
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants