@ahao-anyscale
Contributor

Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations

Summary

This PR restructures the Ray LLM cloud filesystem architecture from a monolithic PyArrow-based implementation to a modular, provider-specific design. The refactor introduces:

  • New modular architecture: A cloud_filesystem/ module with provider-specific implementations (S3FileSystem, GCSFileSystem, AzureFileSystem) that inherit from an abstract base class
  • S3 optimization: Native AWS CLI-based implementation for S3 operations, replacing PyArrow for significantly improved performance
  • Backward compatibility: The existing CloudFileSystem class now serves as a facade that delegates to provider-specific implementations, maintaining all existing APIs
  • Improved test organization: Test suite refactored from a single large file into multiple focused test modules organized by functionality

Motivation

Performance Bottleneck

The previous implementation provided a unified framework for all cloud providers by using PyArrow's filesystem implementations as the interface for downloads, but it suffered from significant performance issues when handling large ML model files. Benchmarking a download of meta-Llama-3.2-1B-Instruct showed the AWS CLI taking 7.97 seconds versus 27.78 seconds for PyArrow.

Future-Proof Design

Beyond addressing immediate performance issues, disaggregating each cloud provider into separate filesystem classes enables:

  1. Provider-specific optimizations: Each provider can leverage its own native SDKs and CLI tools (e.g., AWS CLI for S3, gsutil for GCS, azcopy for Azure)
  2. Incremental improvements: Optimizations can be introduced per provider without affecting others
  3. Flexible implementation strategies: Different providers can use different transfer mechanisms based on what works best
  4. Easier maintenance: Provider-specific code is isolated for better debugging and testing
  5. Extensibility: New cloud providers can be added without modifying existing implementations

This architectural change transforms the codebase from a monolithic abstraction into a modular, extensible system where each provider can evolve independently while maintaining a consistent unified interface.


Detailed Changes

1. New Modular Architecture

Directory Structure

Created a new cloud_filesystem/ module under ray/llm/_internal/common/utils/:

ray/llm/_internal/common/utils/cloud_filesystem/
├── __init__.py                 # Exports public API
├── base.py                     # Abstract base class
├── s3_filesystem.py            # S3-specific AWS CLI implementation
├── gcs_filesystem.py           # GCS implementation (delegates to PyArrow)
├── azure_filesystem.py         # Azure implementation (delegates to PyArrow)
└── pyarrow_filesystem.py       # PyArrow-based implementation
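
For illustration, the __init__.py re-exports might look roughly like the sketch below (the exact export list is not spelled out in this description):

# Illustrative sketch of cloud_filesystem/__init__.py; actual exports may differ.
from .base import BaseCloudFileSystem
from .s3_filesystem import S3FileSystem
from .gcs_filesystem import GCSFileSystem
from .azure_filesystem import AzureFileSystem
from .pyarrow_filesystem import PyArrowFileSystem

__all__ = [
    "BaseCloudFileSystem",
    "S3FileSystem",
    "GCSFileSystem",
    "AzureFileSystem",
    "PyArrowFileSystem",
]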

Abstract Base Class (base.py)

Introduced BaseCloudFileSystem abstract class defining the interface all providers must implement:

from abc import ABC, abstractmethod
from typing import List, Optional, Union


class BaseCloudFileSystem(ABC):
    @staticmethod
    @abstractmethod
    def get_file(object_uri: str, decode_as_utf_8: bool = True) -> Optional[Union[str, bytes]]:
        """Download a file from cloud storage into memory."""
        pass
    
    @staticmethod
    @abstractmethod
    def list_subfolders(folder_uri: str) -> List[str]:
        """List the immediate subfolders in a cloud directory."""
        pass
    
    @staticmethod
    @abstractmethod
    def download_files(
        path: str,
        bucket_uri: str,
        substrings_to_include: Optional[List[str]] = None,
        suffixes_to_exclude: Optional[List[str]] = None,
    ) -> None:
        """Download files from cloud storage to a local directory."""
        pass
    
    @staticmethod
    @abstractmethod
    def upload_files(local_path: str, bucket_uri: str) -> None:
        """Upload files to cloud storage."""
        pass

This interface ensures consistency across all provider implementations while allowing each to optimize for its platform.

2. S3FileSystem: AWS CLI-Based Implementation

The S3FileSystem class implements all operations with AWS CLI commands for optimal performance.

Key Implementation Details

The implementation shells out to the AWS CLI for every operation:

  • aws s3 cp for file downloads and uploads
  • aws s3 ls for directory listings
  • aws s3 cp --recursive for bulk transfers with --include and --exclude flags for filtering

This provides optimal performance by leveraging AWS's native tools for multipart uploads/downloads, automatic parallelism, and efficient bandwidth utilization.
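
As a minimal sketch of the approach (hypothetical helper, not the PR's exact code), a single-object download via the CLI could look like this:

# Minimal sketch only; the real S3FileSystem.get_file() may differ in details.
import subprocess
import tempfile
from pathlib import Path
from typing import Optional, Union


def s3_get_file(object_uri: str, decode_as_utf_8: bool = True) -> Optional[Union[str, bytes]]:
    """Download a single S3 object into memory via `aws s3 cp`."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = Path(tmp_dir) / "object"
        result = subprocess.run(
            ["aws", "s3", "cp", object_uri, str(local_path)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # The CLI exits non-zero for missing objects or access errors.
            return None
        data = local_path.read_bytes()
        return data.decode("utf-8") if decode_as_utf_8 else data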

3. PyArrowFileSystem: Legacy Implementation

Extracted all PyArrow-based logic into pyarrow_filesystem.py (431 lines), preserving the original implementation.
This class serves as the implementation for GCS and Azure providers during the transition period.

4. GCS and Azure FileSystem Classes

Both GCSFileSystem and AzureFileSystem currently delegate to PyArrowFileSystem:

class GCSFileSystem(BaseCloudFileSystem):
    @staticmethod
    def get_file(object_uri: str, decode_as_utf_8: bool = True):
        return PyArrowFileSystem.get_file(object_uri, decode_as_utf_8)
    
    # ... similar delegation for other methods

These wrapper classes maintain the modular architecture while preserving stability. Future PRs can introduce optimized implementations using gsutil/google-cloud-storage SDK for GCS and azcopy/azure-storage-blob SDK for Azure.
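
For example, a later GCS-native get_file() could call the google-cloud-storage SDK directly, along the lines of this sketch (hypothetical, not part of this PR):

# Hypothetical future optimization; in this PR, GCSFileSystem still delegates to PyArrow.
from typing import Optional, Union

from google.cloud import storage


def gcs_get_file(object_uri: str, decode_as_utf_8: bool = True) -> Optional[Union[str, bytes]]:
    """Fetch a gs:// object directly with the google-cloud-storage SDK."""
    client = storage.Client()
    blob = storage.Blob.from_string(object_uri, client=client)
    if not blob.exists():
        return None
    data = blob.download_as_bytes()
    return data.decode("utf-8") if decode_as_utf_8 else data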

5. CloudFileSystem: Facade Pattern

The CloudFileSystem class in cloud_utils.py was refactored from a ~400-line implementation class to a lightweight facade (~150 lines):

The refactored CloudFileSystem maintains the same public API as before, ensuring zero breaking changes for existing code. However, it now acts as a router that detects the cloud provider based on the URI scheme (s3://, gs://, abfss://, azure://) and delegates all operations to the appropriate provider-specific implementation.
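
A simplified sketch of that routing (the helper name _get_fs is illustrative; the real facade may structure this differently):

# Simplified sketch of scheme-based routing; not the exact facade code.
from ray.llm._internal.common.utils.cloud_filesystem import (
    AzureFileSystem,
    GCSFileSystem,
    S3FileSystem,
)


class CloudFileSystem:
    @staticmethod
    def _get_fs(uri: str):
        """Pick the provider implementation based on the URI scheme."""
        if uri.startswith("s3://"):
            return S3FileSystem
        if uri.startswith("gs://"):
            return GCSFileSystem
        if uri.startswith(("abfss://", "azure://")):
            return AzureFileSystem
        raise ValueError(f"Unsupported cloud storage URI: {uri}")

    @staticmethod
    def get_file(object_uri: str, decode_as_utf_8: bool = True):
        return CloudFileSystem._get_fs(object_uri).get_file(object_uri, decode_as_utf_8)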

6. Removed Parallel Download Method

The standalone download_files_parallel() method was removed. The provider-specific download_files() implementations introduced in this PR cover its use case and parallelize transfers by default (multithreading in the PyArrow path, the AWS CLI's native parallelism for S3).

7. Import Updates

Updated imports in dependent modules:

  • cloud_downloader.py: Now imports from cloud_filesystem submodule
  • All imports maintain backward compatibility through the facade pattern (see the example below)
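
Concretely, assuming cloud_utils.py sits alongside the new submodule as shown in the directory layout above, both import styles work:

# Existing call sites keep importing the facade (unchanged):
from ray.llm._internal.common.utils.cloud_utils import CloudFileSystem

# New code can target a provider implementation directly:
from ray.llm._internal.common.utils.cloud_filesystem import S3FileSystem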

Testing

New Test Structure

The test suite was refactored from a single 1,105-line file (test_cloud_utils.py) into five focused test modules totaling 1,465 lines.

Test File Organization

  1. test_cloud_filesystem.py (79 lines)

    • Tests for the CloudFileSystem facade class
    • Covers download_model() and upload_model() high-level operations
    • Uses mocking to verify correct delegation to provider implementations
  2. test_mirror_config.py (158 lines)

    • Tests for CloudMirrorConfig and LoraMirrorConfig classes
    • URI validation for all supported schemes (S3, GCS, ABFSS, Azure)
    • Bucket name/path parsing logic for different URI formats
    • Extra files configuration handling
  3. test_pyarrow_filesystem.py (544 lines)

    • Comprehensive tests for PyArrowFileSystem implementation
    • File operations: get_file(), list_subfolders(), download_files(), upload_files()
    • File filtering logic: substring inclusion, suffix exclusion, combined filters
    • Azure/ABFSS support: URI validation, filesystem creation, credential handling
    • Exception handling and error cases
    • Anonymous access patterns
  4. test_s3_filesystem.py (327 lines)

    • New tests for S3-specific AWS CLI implementation
    • Command execution and error handling
    • File operations with AWS CLI: get_file(), list_subfolders()
    • Bulk download operations with complex filtering scenarios
    • Upload operations (single file and recursive)
    • AWS CLI command construction and flag ordering
    • Missing file handling and error cases
  5. test_utils.py (315 lines)

    • Tests for utility functions: is_remote_path(), CloudObjectCache
    • Cache functionality: synchronous/asynchronous cache, LRU eviction, hit/miss tracking
    • Decorator functionality: @remote_object_cache
    • Remote path detection for all supported URI schemes

S3-Specific Test Coverage

New comprehensive test coverage for S3FileSystem:

class TestS3FileSystem:
    def test_get_file_utf8(self):
        """Test getting a file as UTF-8 string."""
        # Tests AWS CLI command: aws s3 cp s3://... /tmp/...
        
    def test_get_file_bytes(self):
        """Test getting a file as bytes."""
        
    def test_get_file_not_found(self):
        """Test handling of missing files."""
        
    def test_list_subfolders(self):
        """Test directory listing with PRE prefix parsing."""
        
    def test_download_files_with_filters(self):
        """Test complex filtering with --include and --exclude."""
        
    def test_upload_files(self):
        """Test recursive upload."""

These tests use mocking (see the sketch after this list) to verify:

  • Correct AWS CLI command construction
  • Proper flag ordering for filters
  • Error handling and logging
  • Temporary file cleanup
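
As a sketch of the mocking pattern (toy stand-in code, not a copy of the PR's tests), a command-construction check might look like:

# Illustrative only: download() is a toy stand-in for an AWS CLI-backed operation.
import subprocess
from unittest.mock import MagicMock, patch


def download(uri: str, dest: str) -> bool:
    """Toy AWS CLI-backed download used to demonstrate the mocking pattern."""
    result = subprocess.run(["aws", "s3", "cp", uri, dest], capture_output=True, text=True)
    return result.returncode == 0


def test_download_builds_expected_cli_command():
    completed = MagicMock(returncode=0)
    with patch("subprocess.run", return_value=completed) as mock_run:
        assert download("s3://bucket/key", "/tmp/key")
    cmd = mock_run.call_args[0][0]
    assert cmd[:3] == ["aws", "s3", "cp"]
    assert cmd[3] == "s3://bucket/key"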

Additional tests that exercise a real S3 bucket (s3://air-example-data/rayllm-ossci/) are also included.

Maintained Test Coverage

All original tests for PyArrow-based functionality were preserved and moved to appropriate modules:

  • PyArrow filesystem operations
  • Azure/ABFSS URI parsing and validation
  • File filtering logic
  • Cloud mirror configuration validation

Test Utilities and Mocking

  • Comprehensive mocking of cloud provider APIs (boto3, PyArrow filesystems)
  • Reusable mock objects for file info structures
  • Consistent test patterns across all provider tests

Future Work

This refactor sets the foundation for additional optimizations:

  1. GCS optimization: Implement native gsutil or google-cloud-storage SDK-based operations
  2. Azure optimization: Implement native azcopy or azure-storage-blob SDK-based operations
  3. Additional providers: Add support for other cloud storage providers (Oracle Cloud, IBM Cloud, etc.)
  4. Fine-grained performance tuning: Provider-specific connection pooling, retry logic, and timeout configurations
  5. Telemetry and monitoring: Add per-provider metrics for transfer speeds and success rates

Signed-off-by: ahao-anyscale <ahao@anyscale.com>