@@ -438,4 +438,46 @@ public static void enforceFeatureEnabledOrThrow(
"If set to true (default), allow credential vending for external catalogs. Note this requires ALLOW_EXTERNAL_CATALOG_CREDENTIAL_VENDING to be true first.")
.defaultValue(true)
.buildFeatureConfiguration();

public static final FeatureConfiguration<Integer> CLOUD_API_TIMEOUT_SECONDS =
PolarisConfiguration.<Integer>builder()
.key("CLOUD_API_TIMEOUT_SECONDS")
.description(
"Timeout in seconds for cloud provider API requests. "
+ "Prevents indefinite blocking when cloud provider endpoints are slow or unresponsive. "
+ "Used internally by storage integrations for credential vending and other cloud operations. "
+ "Currently only used by Azure storage integration (not yet implemented for AWS S3 or GCP).")
.defaultValue(15)
.buildFeatureConfiguration();

public static final FeatureConfiguration<Integer> CLOUD_API_RETRY_COUNT =
PolarisConfiguration.<Integer>builder()
.key("CLOUD_API_RETRY_COUNT")
.description(
"Number of retry attempts for cloud provider API requests. "
+ "Uses exponential backoff with jitter to handle transient failures. "
+ "Currently only used by Azure storage integration (not yet implemented for AWS S3 or GCP).")
.defaultValue(3)
.buildFeatureConfiguration();

public static final FeatureConfiguration<Integer> CLOUD_API_RETRY_DELAY_SECONDS =
PolarisConfiguration.<Integer>builder()
.key("CLOUD_API_RETRY_DELAY_SECONDS")
.description(
"Initial delay in seconds before first retry for cloud provider API requests. "
+ "Delay doubles with each retry (exponential backoff). "
+ "Currently only used by Azure storage integration (not yet implemented for AWS S3 or GCP).")
.defaultValue(2)
.buildFeatureConfiguration();

public static final FeatureConfiguration<Integer> CLOUD_API_RETRY_JITTER_MILLIS =


I chose to use milliseconds instead of a 0-1 jitter factor for a few reasons:

- User clarity - It's more intuitive for operators to specify "500 milliseconds of jitter" rather than understanding what "0.5 jitter factor" means (50% of the retry delay)
- Concrete vs relative - Millis gives direct control over the maximum random delay added, while a factor requires understanding how it interacts with the exponential backoff delays
- Consistency - All other time-based configs use concrete units (seconds/millis) rather than abstract factors
- Predictability - With millis, the max jitter is always clear regardless of retry delay values

The small conversion cost (jitterMillis / 1000.0) is negligible compared to the benefits of making the config more operator friendly. Happy to change to 0-1 factor if you prefer that approach though!

PolarisConfiguration.<Integer>builder()
.key("CLOUD_API_RETRY_JITTER_MILLIS")
.description(
"Maximum jitter in milliseconds added to retry delays for cloud provider API requests. "
+ "Helps prevent thundering herd when multiple requests fail simultaneously. "
+ "Actual jitter is random between 0 and this value. "
+ "Currently only used by Azure storage integration (not yet implemented for AWS S3 or GCP).")
.defaultValue(500)
.buildFeatureConfiguration();
}
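For context, a back-of-envelope sketch of the retry schedule these defaults imply (a hypothetical illustration, not code from this change; the jitter here follows the config description above, i.e. an absolute 0-500 ms offset on top of each backoff delay):

// Hypothetical illustration of the defaults: 15s per-attempt timeout, 3 retries,
// exponential backoff starting at 2s, plus up to 500 ms of jitter per retry.
int timeoutSeconds = 15;
int retryCount = 3;
int initialDelaySeconds = 2;
int jitterMillis = 500;

long backoffSumMillis = 0;
for (int retry = 1; retry <= retryCount; retry++) {
  long backoffMillis = initialDelaySeconds * 1000L << (retry - 1); // 2000, 4000, 8000
  backoffSumMillis += backoffMillis + jitterMillis;
  System.out.printf("retry %d: wait %d ms (+ up to %d ms jitter)%n", retry, backoffMillis, jitterMillis);
}
// Worst case: every attempt (1 initial + retryCount retries) hits the per-attempt timeout.
long worstCaseMillis = (retryCount + 1) * timeoutSeconds * 1000L + backoffSumMillis;
System.out.printf("worst case ~%d s%n", worstCaseMillis / 1000); // about 75s with these defaults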
@@ -18,6 +18,10 @@
*/
package org.apache.polaris.core.storage.azure;

import static org.apache.polaris.core.config.FeatureConfiguration.CLOUD_API_RETRY_COUNT;
import static org.apache.polaris.core.config.FeatureConfiguration.CLOUD_API_RETRY_DELAY_SECONDS;
import static org.apache.polaris.core.config.FeatureConfiguration.CLOUD_API_RETRY_JITTER_MILLIS;
import static org.apache.polaris.core.config.FeatureConfiguration.CLOUD_API_TIMEOUT_SECONDS;
import static org.apache.polaris.core.config.FeatureConfiguration.STORAGE_CREDENTIAL_DURATION_SECONDS;

import com.azure.core.credential.AccessToken;
@@ -122,7 +126,7 @@ public StorageAccessConfig getSubscopedCreds(
OffsetDateTime.ofInstant(
start.plusSeconds(3600), ZoneOffset.UTC); // 1 hr to sync with AWS and GCP Access token

AccessToken accessToken = getAccessToken(config().getTenantId());
AccessToken accessToken = getAccessToken(realmConfig, config().getTenantId());
// Get user delegation key.
// Set the newly generated user delegation key expiry to 7 days minus 1 min
// Azure strictly requires the end time to be <= 7 days from the current time, -1 min to avoid
@@ -315,53 +319,62 @@ private void validateAccountAndContainer(
}

/**
* Fetches an Azure access token with timeout and retry logic to handle transient failures.
* Fetches an Azure AD access token with timeout and retry logic to handle transient failures.
*
* <p>This method implements a defensive strategy against slow or failing token requests:
* <p>This access token is used internally to obtain a user delegation key from Azure Storage,
* which is then used to generate SAS tokens for client credential vending.
*
* <p>This method implements a defensive strategy against slow or failing cloud provider requests:
*
* <ul>
* <li>15-second timeout per individual request attempt
* <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
* <li>90-second overall timeout as a safety net
* <li>Per-attempt timeout (configurable via CLOUD_API_TIMEOUT_SECONDS, default 15s)
* <li>Exponential backoff retry (configurable count and initial delay via CLOUD_API_RETRY_COUNT
* and CLOUD_API_RETRY_DELAY_SECONDS, defaults: 3 attempts starting at 2s)
* <li>Jitter to prevent thundering herd (configurable via CLOUD_API_RETRY_JITTER_MILLIS, default 500ms)
* </ul>
*
* @param realmConfig the realm configuration to get timeout and retry settings
* @param tenantId the Azure tenant ID
* @return the access token
* @throws RuntimeException if token fetch fails after all retries or times out
*/
private AccessToken getAccessToken(String tenantId) {
private AccessToken getAccessToken(RealmConfig realmConfig, String tenantId) {
int timeoutSeconds = realmConfig.getConfig(CLOUD_API_TIMEOUT_SECONDS);
int retryCount = realmConfig.getConfig(CLOUD_API_RETRY_COUNT);
int initialDelaySeconds = realmConfig.getConfig(CLOUD_API_RETRY_DELAY_SECONDS);

Would you mind using millis for initialDelaySeconds... in some cases even 1 sec may be too long. Let's delegate what the min delay should be to the admin user who configures it.

Same for timeoutSeconds... I hope Azure SDK supports millis.


Thanks for the suggestion!

Changed both to milliseconds:
- CLOUD_API_TIMEOUT_MILLIS (default: 15000ms)
- CLOUD_API_RETRY_DELAY_MILLIS (default: 2000ms)
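A rough sketch of how the millisecond-based variant could plug into the pipeline below (assuming the renamed keys above resolve to Integer configs; not the final diff):

// Hypothetical millisecond-granularity variant of the settings read below.
int timeoutMillis = realmConfig.getConfig(CLOUD_API_TIMEOUT_MILLIS);          // e.g. 15000
int initialDelayMillis = realmConfig.getConfig(CLOUD_API_RETRY_DELAY_MILLIS); // e.g. 2000
// ...
//     .timeout(Duration.ofMillis(timeoutMillis))
//     .retryWhen(Retry.backoff(retryCount, Duration.ofMillis(initialDelayMillis)) ...)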

int jitterMillis = realmConfig.getConfig(CLOUD_API_RETRY_JITTER_MILLIS);
double jitter = jitterMillis / 1000.0; // Convert millis to fraction for jitter factor
@dimas-b Nov 25, 2025

I'm not sure, I fully understand this logic... per javadoc of reactor.util.retry.RetryBackoffSpec.jitter() the factor applies to the "computed delay", which may not be 1000 ms 🤔 How can the user reason about what the CLOUD_API_RETRY_JITTER_MILLIS value of 750 (for example) means?

Would it not be simpler to use the 0.0-1.0 factor value in the config?


Ah, I went with millis initially to keep it consistent with the other timeout/delay configs (which are all in milliseconds), thinking it'd be more straightforward to work with absolute values.

But you're right - that doesn't really make sense here since the jitter factor gets applied to the computed delay, which changes with each retry (2s → 4s → 8s). So 750 would mean something different on each attempt, which is pretty confusing.

Changed it to CLOUD_API_RETRY_JITTER_FACTOR using the 0.0-1.0 range
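A sketch of the factor-based variant agreed on here (assuming CLOUD_API_RETRY_JITTER_FACTOR is a Double-valued config; not the final diff). Per Reactor's RetryBackoffSpec, the factor is applied to each computed backoff delay, so 0.5 means roughly up to 1s of jitter on the 2s delay, 2s on 4s, and 4s on 8s:

// Hypothetical factor-based jitter wiring; the added randomness scales with the computed delay.
double jitterFactor = realmConfig.getConfig(CLOUD_API_RETRY_JITTER_FACTOR); // e.g. 0.5
// ...
//     .retryWhen(Retry.backoff(retryCount, Duration.ofSeconds(initialDelaySeconds))
//         .jitter(jitterFactor)
//         ...)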


String scope = "https://storage.azure.com/.default";
AccessToken accessToken =
defaultAzureCredential
.getToken(new TokenRequestContext().addScopes(scope).setTenantId(tenantId))
.timeout(Duration.ofSeconds(15)) // Per-attempt timeout
.timeout(Duration.ofSeconds(timeoutSeconds))
.doOnError(
error ->
LOGGER.warn(
"Error fetching Azure access token for tenant {}: {}",
tenantId,
error.getMessage()))

Can we log the full stack trace instead of only error.getMessage(), for better debuggability?
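With SLF4J that just means passing the exception itself as the last argument so the stack trace is appended, e.g. (a sketch of the suggestion, not the final change):

// Hypothetical: SLF4J logs the full stack trace when the last argument is a Throwable.
.doOnError(
    error ->
        LOGGER.warn(
            "Error fetching Azure access token for tenant {}", tenantId, error))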

.retryWhen(
Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s, 8s
.jitter(0.5) // ±50% jitter to prevent thundering herd
.filter(
throwable ->
throwable instanceof java.util.concurrent.TimeoutException
|| isRetriableAzureException(throwable))
Retry.backoff(retryCount, Duration.ofSeconds(initialDelaySeconds))
.jitter(jitter)
.filter(this::isRetriableAzureException)
.doBeforeRetry(
retrySignal ->
LOGGER.info(
"Retrying Azure token fetch for tenant {} (attempt {}/3)",
"Retrying Azure token fetch for tenant {} (attempt {}/{})",
tenantId,
retrySignal.totalRetries() + 1))
retrySignal.totalRetries() + 1,
retryCount))
Comment on lines +368 to +369

Should we add 1 to retryCount, so it matches the word "attempt" here? The total attempt number would be 1 + retryCount:

int maxAttempts = retryCount + 1;
...

LOGGER.info("Retrying Azure token fetch for tenant {} (attempt {}/{})",
            tenantId,
            retrySignal.totalRetries() + 1,
            maxAttempts);

.onRetryExhaustedThrow(
(retryBackoffSpec, retrySignal) ->
new RuntimeException(
String.format(
"Azure token fetch exhausted after %d attempts for tenant %s",
retrySignal.totalRetries(), tenantId),
retrySignal.failure())))
.blockOptional(Duration.ofSeconds(90)) // Maximum total wait time
.blockOptional()
.orElse(null);

if (accessToken == null) {
@@ -374,6 +387,17 @@ private AccessToken getAccessToken(String tenantId) {
/**
* Determines if an exception is retriable for Azure token requests.
*
* <p>Retries are attempted for:
*
* <ul>
* <li>TimeoutException - per-attempt timeout exceeded
* <li>AADSTS50058 - Token endpoint timeout
* <li>AADSTS50078 - Service temporarily unavailable
* <li>AADSTS700084 - Token refresh required
* <li>503 - Service unavailable
* <li>429 - Too many requests (rate limited)
* </ul>
*
* @param throwable the exception to check
* @return true if the exception should trigger a retry
*/
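The body of isRetriableAzureException is collapsed in this diff; a plausible shape matching the cases listed above would be (a hypothetical sketch, not the PR's actual implementation):

// Hypothetical sketch only: retry on per-attempt timeouts and on transient Azure AD /
// storage errors identified by the codes documented above.
private boolean isRetriableAzureException(Throwable throwable) {
  if (throwable instanceof java.util.concurrent.TimeoutException) {
    return true;
  }
  String message = throwable.getMessage();
  if (message == null) {
    return false;
  }
  return message.contains("AADSTS50058")
      || message.contains("AADSTS50078")
      || message.contains("AADSTS700084")
      || message.contains("503")
      || message.contains("429");
}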