[serve][llm] Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations #58469
+2,407
−1,466
Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations
Summary
This PR restructures the Ray LLM cloud filesystem architecture from a monolithic PyArrow-based implementation to a modular, provider-specific design. The refactor introduces:
- A new `cloud_filesystem/` module with provider-specific implementations (`S3FileSystem`, `GCSFileSystem`, `AzureFileSystem`) that inherit from an abstract base class
- The `CloudFileSystem` class now serves as a facade that delegates to provider-specific implementations, maintaining all existing APIs

Motivation
Performance Bottleneck
The previous implementation provided a unified framework for all cloud providers by using PyArrow's implementation of each filesystem type as the interface for downloads, and it suffered from significant performance issues when handling large ML model files. Benchmarking on `meta-Llama-3.2-1B-Instruct` showed that the AWS CLI took 7.97 seconds to download the model versus 27.78 seconds for PyArrow.

Future-Proof Design
Beyond addressing the immediate performance issues, disaggregating each cloud provider into a separate filesystem class enables provider-specific optimizations, such as those outlined under Future Work below.
This architectural change transforms the codebase from a monolithic abstraction into a modular, extensible system where each provider can evolve independently while maintaining a consistent unified interface.
Detailed Changes
1. New Modular Architecture
Directory Structure
Created a new `cloud_filesystem/` module under `ray/llm/_internal/common/utils/`.

Abstract Base Class (`base.py`)

Introduced a `BaseCloudFileSystem` abstract class defining the interface that all providers must implement. This interface ensures consistency across all provider implementations while allowing each to optimize for its platform.
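The interface itself is not reproduced in this description. As an illustration only, here is a minimal sketch of what such a base class could look like, using operation names that appear elsewhere in this PR; the exact method signatures in the Ray source may differ:

```python
# Minimal sketch, not the actual Ray source: a plausible shape for the
# abstract base class, using operation names mentioned elsewhere in this PR.
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseCloudFileSystem(ABC):
    """Interface every provider-specific filesystem must implement."""

    @abstractmethod
    def get_file(self, object_uri: str, decode_as_utf_8: bool = True):
        """Fetch a single object and return its contents."""

    @abstractmethod
    def list_subfolders(self, folder_uri: str) -> List[str]:
        """List the immediate subfolders under a cloud 'directory'."""

    @abstractmethod
    def download_files(
        self,
        path: str,
        bucket_uri: str,
        substrings_to_include: Optional[List[str]] = None,
    ) -> None:
        """Download objects under bucket_uri into a local path."""

    @abstractmethod
    def upload_files(self, local_path: str, bucket_uri: str) -> None:
        """Upload a local directory to a cloud location."""
```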
2. S3FileSystem: AWS CLI-Based Implementation
The `S3FileSystem` class implements all operations using AWS CLI commands for optimal performance.

Key Implementation Details

The `S3FileSystem` implementation uses AWS CLI commands for all operations:

- `aws s3 cp` for file downloads and uploads
- `aws s3 ls` for directory listings
- `aws s3 cp --recursive` for bulk transfers, with `--include` and `--exclude` flags for filtering

This provides optimal performance by leveraging AWS's native tooling for multipart uploads/downloads, automatic parallelism, and efficient bandwidth utilization.
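As a rough sketch of this approach (not the actual `S3FileSystem` code; the helper name and argument handling are assumptions), a recursive download with filtering can be expressed as a subprocess call to the AWS CLI:

```python
# Illustrative sketch of the CLI-based approach described above; not the
# actual S3FileSystem implementation.
import subprocess
from typing import List, Optional


def download_files(
    path: str, bucket_uri: str, substrings_to_include: Optional[List[str]] = None
) -> None:
    """Recursively download objects from S3 into `path` using the AWS CLI."""
    cmd = ["aws", "s3", "cp", bucket_uri, path, "--recursive"]
    if substrings_to_include:
        # Exclude everything first, then re-include only the requested patterns.
        cmd += ["--exclude", "*"]
        for substring in substrings_to_include:
            cmd += ["--include", f"*{substring}*"]
    subprocess.run(cmd, check=True)
```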
3. PyArrowFileSystem: Legacy Implementation
Extracted all PyArrow-based logic into `pyarrow_filesystem.py` (431 lines), preserving the original implementation. This class serves as the implementation for the GCS and Azure providers during the transition period.
4. GCS and Azure FileSystem Classes
Both `GCSFileSystem` and `AzureFileSystem` currently delegate to `PyArrowFileSystem`. These wrapper classes maintain the modular architecture while preserving stability; future PRs can introduce optimized implementations using `gsutil` or the `google-cloud-storage` SDK for GCS and `azcopy` or the `azure-storage-blob` SDK for Azure.
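A minimal sketch of the delegation pattern, assuming method-by-method forwarding (the real wrappers may forward more operations or use a different mechanism; the import path is inferred from the directory structure described above):

```python
# Sketch only: GCS and Azure reuse the extracted PyArrow implementation
# unchanged until provider-native versions land. Import path and method
# signatures are assumptions based on this PR's description.
from ray.llm._internal.common.utils.cloud_filesystem import PyArrowFileSystem


class GCSFileSystem:
    """Thin wrapper that delegates every operation to PyArrowFileSystem."""

    def download_files(self, path, bucket_uri, substrings_to_include=None):
        return PyArrowFileSystem().download_files(path, bucket_uri, substrings_to_include)


class AzureFileSystem:
    """Thin wrapper that delegates every operation to PyArrowFileSystem."""

    def download_files(self, path, bucket_uri, substrings_to_include=None):
        return PyArrowFileSystem().download_files(path, bucket_uri, substrings_to_include)
```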
5. CloudFileSystem: Facade Pattern

The `CloudFileSystem` class in `cloud_utils.py` was refactored from a ~400-line implementation class into a lightweight facade (~150 lines). The refactored `CloudFileSystem` keeps the same public API as before, ensuring zero breaking changes for existing code. However, it now acts as a router that detects the cloud provider from the URI scheme (`s3://`, `gs://`, `abfss://`, `azure://`) and delegates all operations to the appropriate provider-specific implementation.
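A rough sketch of the routing idea (the scheme-to-provider mapping is stated above, but the helper name, import path, and dispatch details here are assumptions, not the Ray source):

```python
# Minimal sketch of the facade/routing idea: pick a provider implementation
# from the URI scheme and delegate, keeping the public API unchanged.
from ray.llm._internal.common.utils.cloud_filesystem import (
    AzureFileSystem,
    GCSFileSystem,
    S3FileSystem,
)


class CloudFileSystem:
    @staticmethod
    def _get_fs(uri: str):
        if uri.startswith("s3://"):
            return S3FileSystem()
        if uri.startswith("gs://"):
            return GCSFileSystem()
        if uri.startswith(("abfss://", "azure://")):
            return AzureFileSystem()
        raise ValueError(f"Unsupported cloud URI: {uri}")

    @staticmethod
    def download_files(path, bucket_uri, substrings_to_include=None):
        fs = CloudFileSystem._get_fs(bucket_uri)
        return fs.download_files(path, bucket_uri, substrings_to_include)
```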
6. Removed Parallel Download Method

The `download_files_parallel()` method introduced in this PR replaces `download_files()`, using multithreading by default.
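Purely as an illustration of per-object multithreading (the function shape, listing logic, and worker count are assumptions, not the Ray implementation):

```python
# Illustrative only: concurrent per-object downloads via a thread pool,
# mirroring the "multithreading by default" behavior described above.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Tuple


def download_objects_in_parallel(pairs: Iterable[Tuple[str, str]], max_workers: int = 8) -> None:
    """pairs: (object_uri, local_path) tuples downloaded concurrently."""

    def _download(pair: Tuple[str, str]) -> None:
        object_uri, local_path = pair
        subprocess.run(["aws", "s3", "cp", object_uri, local_path], check=True)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(_download, pairs))
```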
7. Import Updates

Updated imports in dependent modules:
- `cloud_downloader.py`: now imports from the `cloud_filesystem` submodule

Testing
New Test Structure
The test suite was comprehensively refactored from a single 1,105-line file (`test_cloud_utils.py`) into five focused test modules totaling 1,465 lines with improved organization.

Test File Organization

- `test_cloud_filesystem.py` (79 lines): `CloudFileSystem` facade class; `download_model()` and `upload_model()` high-level operations
- `test_mirror_config.py` (158 lines): `CloudMirrorConfig` and `LoraMirrorConfig` classes
- `test_pyarrow_filesystem.py` (544 lines): `PyArrowFileSystem` implementation; `get_file()`, `list_subfolders()`, `download_files()`, `upload_files()`
- `test_s3_filesystem.py` (327 lines): `get_file()`, `list_subfolders()`
- `test_utils.py` (315 lines): `is_remote_path()`, `CloudObjectCache`, `@remote_object_cache`

S3-Specific Test Coverage
New comprehensive test coverage was added for `S3FileSystem`. These tests use mocking to verify the AWS CLI-based operations; additional tests using a real S3 bucket (`s3://air-example-data/rayllm-ossci/`) are also included.
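As a sketch of the mocking approach (not copied from the Ray test suite; the `download_files` stand-in here mirrors the earlier CLI sketch), `subprocess.run` can be patched so no AWS credentials or network access are needed:

```python
# Illustrative sketch of CLI-focused mock testing; not the actual Ray tests.
import subprocess
from unittest import mock


def download_files(path: str, bucket_uri: str) -> None:
    # Stand-in for the CLI-based implementation described in this PR.
    subprocess.run(["aws", "s3", "cp", bucket_uri, path, "--recursive"], check=True)


def test_download_files_invokes_aws_cli():
    with mock.patch("subprocess.run") as mock_run:
        download_files("/tmp/model", "s3://bucket/model")
        cmd = mock_run.call_args[0][0]
        # Verify the command line that would have been executed.
        assert cmd[:3] == ["aws", "s3", "cp"]
        assert "--recursive" in cmd
```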
Maintained Test Coverage
All original tests for the PyArrow-based functionality were preserved and moved to the appropriate modules.
Test Utilities and Mocking
Future Work
This refactor sets the foundation for additional optimizations:
- GCS: `gsutil` or `google-cloud-storage` SDK-based operations
- Azure: `azcopy` or `azure-storage-blob` SDK-based operations