
[Review] Model Engine OnPrem Support and vLLM 0.11.1 + Model Engine Integration Fixes#1

Open
govambam wants to merge 1 commit into base-for-pr-744 from review-pr-744

Conversation

@govambam
Owner

Recreated from scaleapi#744 for Macroscope review.

Original PR: scaleapi#744 by @charlesahn-scale
Status: closed (merged)

Recreated using squash merge commit - exact merged state preserved.


…n Fixes (scaleapi#744)

* add support for on-prem

* clean up on-prem artifacts

* add back comments from initial code

* fix lint

* use ecr image repo:tag directly

* fix: isort import ordering

* fix: remove unused infra_config import

* fix: mypy type annotation errors

* fix: remove type annotation causing mypy no-redef error

* fix: mypy type errors in s3_utils.py and io.py - use botocore.config.Config

* fix: mypy typeddict-item errors - use broad type ignore

* fix: update test mocks to use get_s3_resource from s3_utils

* test: add unit tests for s3_utils, onprem_docker_repository, and onprem_queue_endpoint_resource_delegate

* style: format test files with black

* refactor: use filesystem_gateway abstraction for S3 operations

Address review feedback to use the filesystem_gateway abstraction layer
instead of directly calling get_s3_client. This ensures on-prem S3
configuration logic is properly encapsulated in the gateway.

Changes:
- Add head_object, delete_object, list_objects methods to S3FilesystemGateway
- Update S3FileStorageGateway to use self.filesystem_gateway for all S3 ops
- Remove direct import of get_s3_client from s3_file_storage_gateway
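
To make the bullet points above concrete, here is a minimal sketch of what the extended gateway interface could look like. The method names come from the commit message; the signatures and parameter names are assumptions, not the actual model-engine API:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class FilesystemGateway(ABC):
    """Hypothetical slice of the abstraction; real signatures may differ."""

    @abstractmethod
    def head_object(self, bucket: str, key: str) -> Dict[str, Any]:
        """Return object metadata without fetching the body."""

    @abstractmethod
    def delete_object(self, bucket: str, key: str) -> None:
        """Delete a single object."""

    @abstractmethod
    def list_objects(self, bucket: str, prefix: str) -> List[str]:
        """Return all keys under the prefix."""
```

With this shape, the storage gateway calls `self.filesystem_gateway.head_object(...)` and so on, and the on-prem endpoint configuration stays inside the gateway implementation.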

* fix: deduplicate S3 client config by using centralized s3_utils

Refactor open_wrapper to use get_s3_client from s3_utils instead of
duplicating the on-prem S3 configuration logic. This ensures a single
source of truth for S3 client creation across the codebase.

* fix: add pagination to list_objects to handle >1000 objects

S3 list_objects_v2 returns max 1000 objects per request. Use paginator
to iterate through all pages and return complete results. Without this
fix, directories with >1000 files would silently return truncated results.
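
The paginator pattern described above looks roughly like this (a sketch; the function name and placement are assumptions, but `get_paginator("list_objects_v2")` is the standard boto3 way to page past the 1000-object cap):

```python
def list_all_objects(s3_client, bucket: str, prefix: str) -> list:
    """Collect keys across every page; list_objects_v2 caps a single call at 1000."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page (or the whole prefix) is empty
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```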

* fix: make OnPremDockerRepository.get_image_url consistent with ECR/ACR

Include docker_repo_prefix in image URL to match behavior of ECR and ACR
implementations. Also change image_exists logging from warning to debug
to reduce log noise on every deployment.

Updated tests to mock infra_config and verify prefix handling.

* refactor: add explicit on-prem branches in dependencies.py for clarity

Add explicit elif branches for on-prem cloud provider to make it clear
that S3-based gateways are intentionally used for on-prem (with MinIO
configuration applied via s3_utils). This improves code readability
and makes the on-prem support more discoverable.

* feat: implement Redis LLEN for queue depth in OnPremQueueEndpointResourceDelegate

Replace hardcoded queue depth with actual Redis LLEN call to enable
proper autoscaling based on queue metrics. Falls back to 0 gracefully
if Redis client is unavailable.

- Add optional redis_client parameter to constructor
- Implement lazy Redis client initialization
- Add tests for both with and without Redis scenarios
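
Condensed into a sketch, the behavior the bullets describe is roughly the following (the class name is hypothetical; the real delegate lives in model-engine and also handles lazy client creation):

```python
from typing import Optional


class OnPremQueueDelegateSketch:
    """Hypothetical condensed version of the queue-depth logic described above."""

    def __init__(self, redis_client: Optional[object] = None):
        self._redis = redis_client  # the real code initializes this lazily

    def get_queue_depth(self, queue_name: str) -> int:
        if self._redis is None:
            return 0  # graceful fallback when no Redis client is available
        # LLEN returns the length of the Redis list backing the queue
        return int(self._redis.llen(queue_name))
```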

* fix: replace mutable default argument with None in _get_client

Using {} as a default argument is a Python anti-pattern that can cause
subtle bugs since the same dict instance is shared across calls.
Use Optional[Dict] = None pattern instead.
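
A generic illustration of the bug class (not the actual `_get_client` code): the default dict is created once at function-definition time, so mutations leak across calls, while the `None` pattern creates a fresh dict per call.

```python
def counter_bad(state={}):  # anti-pattern: one shared dict for every call
    state["hits"] = state.get("hits", 0) + 1
    return state


def counter_good(state=None):  # Optional[Dict] = None pattern
    if state is None:
        state = {}  # fresh dict on every call
    state["hits"] = state.get("hits", 0) + 1
    return state
```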

* refactor: extract inline import to module-level helper function

Move the infra_config import from inside the validator to a module-level
helper function _is_onprem_deployment(). This improves testability,
avoids repeated import overhead on each validation call, and follows
Python best practices for imports.

* fix: reduce excessive debug logging in s3_utils

Replace per-call debug logs with a one-time info log when S3 is
configured for on-prem. This prevents log spam from debug messages
firing on every S3 client creation.

- Extract common on-prem config to _get_onprem_client_kwargs helper
- Add _s3_config_logged flag to log endpoint only once
- Add return type annotations to get_s3_client and get_s3_resource
- Update tests to reset logging flag between tests
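
The one-time-log guard can be sketched as follows (names follow the commit message; returning a bool is an addition here purely so the behavior is observable):

```python
import logging

logger = logging.getLogger(__name__)
_s3_config_logged = False  # module-level guard, per the commit above


def log_onprem_endpoint_once(endpoint_url: str) -> bool:
    """Log the configured endpoint the first time only.

    Returns True if a log record was emitted on this call.
    """
    global _s3_config_logged
    if _s3_config_logged:
        return False
    logger.info("S3 configured for on-prem endpoint %s", endpoint_url)
    _s3_config_logged = True
    return True
```

Tests then reset `_s3_config_logged` between cases, which is why the commit mentions a reset step.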

* chore: remove unused TYPE_CHECKING import

Clean up unused import left over from refactoring the inline import.

* fix: make Dockerfile multi-arch compatible for ARM/AMD64

Use architecture detection to download the correct binaries for
aws-iam-authenticator and kubectl. This enables building the image
for both ARM64 (Mac M1/M2) and AMD64 (CI/production) platforms.
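
The architecture-detection step typically maps `uname -m` output to the token release URLs use; a sketch (the kubectl version and URL here are illustrative, not taken from the Dockerfile):

```shell
# Map `uname -m` output to the arch token used in binary download URLs.
arch_to_target() {
  case "$1" in
    x86_64)        echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

TARGET="$(arch_to_target "$(uname -m)")"
# In the Dockerfile this feeds the download step, e.g. (version illustrative):
#   curl -fsSLo kubectl "https://dl.k8s.io/release/v1.28.0/bin/linux/${TARGET}/kubectl"
echo "resolved target arch: ${TARGET}"
```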

* style: fix black formatting in test_onprem_queue_endpoint_resource_delegate

* fix: restore AWS_PROFILE env var fallback in s3_utils

The original code checked os.getenv('AWS_PROFILE') as a fallback when
no aws_profile kwarg was provided. This was accidentally removed during
refactoring, breaking S3 operations in CI where AWS_PROFILE may be set
via environment variable.

Restores the original behavior for AWS deployments while maintaining
the new on-prem path.
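
The restored fallback amounts to a precedence rule: an explicit `aws_profile` kwarg wins, and the `AWS_PROFILE` environment variable is consulted only when no kwarg is given. A sketch (the function name is hypothetical):

```python
import os
from typing import Optional


def resolve_aws_profile(aws_profile: Optional[str] = None) -> Optional[str]:
    """Explicit kwarg wins; otherwise fall back to the AWS_PROFILE env var."""
    if aws_profile is not None:
        return aws_profile
    return os.getenv("AWS_PROFILE")  # the fallback this commit restores
```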

* fix: correct isort ordering in s3_filesystem_gateway.py

* fix: use Literal type for s3 addressing_style to satisfy mypy
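
For context on the Literal fix: botocore's `Config` accepts `"auto"`, `"virtual"`, or `"path"` for `addressing_style`, and annotating the value with `typing.Literal` lets mypy reject anything else. A sketch (the helper name is an assumption):

```python
from typing import Dict, Literal

# The three values botocore accepts; Literal lets mypy catch typos statically.
AddressingStyle = Literal["auto", "virtual", "path"]


def s3_addressing_config(style: AddressingStyle = "path") -> Dict[str, Dict[str, str]]:
    """Build the `s3` section of a botocore Config.

    MinIO-style on-prem endpoints generally require path-style addressing.
    """
    return {"s3": {"addressing_style": style}}
```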

* Onprem Compatibility Change

---------

Co-authored-by: Tarun <tarun.ravikumar@scale.com>
@macroscopeapp

macroscopeapp Bot commented Feb 10, 2026

Add on‑prem deployment support across Model Engine and extend vLLM parsing to OpenAI formats while injecting S3_ENDPOINT_URL into all S3 and s5cmd paths

Introduce on‑prem selections for queues, Docker repository, Redis, and Postgres; centralize S3 client/resource creation with S3_ENDPOINT_URL; add s5cmd endpoint flags for model downloads; and update vLLM sync/stream parsing to accept OpenAI completion formats. Tests and Helm templates are updated accordingly.

📍Where to Start

Start with on‑prem behavior wiring and S3 endpoint handling in llm_model_endpoint_use_cases and s3_utils: see CreateLLMModelBundleV1UseCase and _get_s3_endpoint_flag in llm_model_endpoint_use_cases.py, then review S3 client/resource factories in s3_utils.py. After that, scan on‑prem wiring in dependencies.py and the Docker/image URL flow in live_endpoint_builder_service.py.


📊 Macroscope summarized a8cd1ab. 32 files reviewed, 20 issues evaluated, 2 issues filtered, 1 comment posted. View details

if infra_config().cloud_provider == "onprem":
user = os.environ.get("DB_USER", "postgres")
password = os.environ.get("DB_PASSWORD", "postgres")
host = os.environ.get("DB_HOST_RO") or os.environ.get("DB_HOST", "localhost")

🟡 Medium

db/base.py:65 The onprem branch ignores the read_only parameter when selecting the host. Consider using conditional logic like the AWS branch: creds.get("clusterHostRo") if read_only else creds.get("clusterHost"). Currently, if DB_HOST_RO is set, write operations could be routed to a read-only replica.

Suggested change
- host = os.environ.get("DB_HOST_RO") or os.environ.get("DB_HOST", "localhost")
+ host = os.environ.get("DB_HOST_RO") if read_only else os.environ.get("DB_HOST", "localhost")



