
--stream-experts only achieves ~5% of available SSD bandwidth on M1 Ultra (300 MB/s observed vs 5-7 GB/s capable) #24

@ericjlake

Description

Summary

Running Qwen3.5-122B-A10B-4bit with --stream-experts --ssd-prefetch on an M1 Ultra 64GB produces only 0.6 tok/s generation speed. Profiling shows SSD I/O is only ~300 MB/s — roughly 5% of the M1 Ultra's internal NVMe capacity (5–7 GB/s). The drive is not the bottleneck; something in the expert streaming pipeline is.

Hardware

  • Machine: Apple M1 Ultra, 64GB unified memory, macOS
  • SSD: Internal NVMe — confirmed 5–7 GB/s sequential read via dd / fio
  • SwiftLM version: b253
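
For anyone wanting to reproduce the raw-read baseline without dd/fio, a minimal Python sketch (the path is a placeholder; any multi-GB file on the drive works, e.g. one of the weight shards):

```python
import time

CHUNK = 8 * 1024 * 1024  # 8 MiB reads, large enough to amortize syscall cost

def sequential_read_bandwidth(path: str) -> float:
    """Return sequential read throughput in MB/s for one pass over the file."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered to avoid a double copy
        while chunk := f.read(CHUNK):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6

# Placeholder path -- substitute a real multi-GB file before running:
# print(sequential_read_bandwidth("/path/to/large_file.bin"))
```

Caveat: a second run over the same file measures the page cache, not the SSD, so use a file larger than RAM or a freshly rebooted machine for an honest number.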

Model

  • Model: Qwen3.5-122B-A10B-4bit (MoE, ~10B activated params per token)
  • Weight files: 69.6 GB (4-bit quantized)
  • Source: mlx-community on HuggingFace, converted with mlx-vlm 0.3.12

SwiftLM --info output

Strategy:     ⚠️  SWAP-ASSISTED
Overcommit:   1.32× (model is 32% larger than RAM)
GPU layers:   24/32
Est. speed:   ~3 tok/s

Launch command

SwiftLM \
  --model /path/to/Qwen3.5-122B-A10B-4bit \
  --port 8000 \
  --stream-experts \
  --ssd-prefetch

Observed metrics during generation

Metric                       Value
Generation speed             0.6 tok/s
Disk I/O (iostat disk0)      ~300 MB/s
Free memory (system-wide)    78–80%
Swap activity (swapins)      0 (no swap at all)
CPU utilization              ~55%

Zero swap activity with ~78% of RAM free confirms --stream-experts is doing its job — the model streams experts from SSD rather than falling back to macOS swap. The SSD itself, however, is barely being used.

The bandwidth gap

Available SSD bandwidth:    5,000–7,000 MB/s
Observed during generation:       ~300 MB/s
Utilization:                        ~5%

Per-token math for a 10B-activated-param MoE at 4-bit:

  • ~5 GB of expert weights need to be accessed per token
  • At full SSD speed (5 GB/s): theoretical ~1 tok/s (ignoring prefetch overlap)
  • With good prefetch overlap: 3–5 tok/s should be achievable
  • Actual: 0.6 tok/s

Even reaching 2–3 GB/s of SSD utilization (well under max) should yield 3–4 tok/s based on linear scaling from our observed baseline.
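
As a sanity check, the linear-scaling projection works out as follows (plain arithmetic on the figures above; nothing SwiftLM-specific):

```python
# Figures from the profiling above.
OBSERVED_TOKS = 0.6       # tok/s during generation
OBSERVED_BW_MBS = 300.0   # MB/s of SSD I/O during generation

def projected_toks(bandwidth_mbs: float) -> float:
    """Project tok/s at a given SSD bandwidth, scaling linearly from the
    observed baseline (assumes generation stays I/O-bound)."""
    return OBSERVED_TOKS * bandwidth_mbs / OBSERVED_BW_MBS

# Raw per-token floor: ~5 GB of expert weights per token at full drive speed.
print(5000 / 5000)             # 1.0 tok/s with zero prefetch overlap

# Well under the drive's 5-7 GB/s maximum:
print(projected_toks(2000))    # 4.0 tok/s at 2 GB/s
```

Even the pessimistic linear model, which ignores any prefetch overlap, says 2 GB/s of disk utilization is enough to land in the 3–4 tok/s range.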

What we tried

  • ulimit -n 4096 before launch — no change in I/O throughput or tok/s
  • Both with and without --ssd-prefetch — marginal difference

Hypothesis

The 16-worker prefetch pool (--ssd-prefetch) may be serializing at the MLX memory allocator level rather than issuing parallel disk reads. Each worker allocates GPU/unified memory for the incoming expert shard, and if that allocation path holds a global lock, the reads effectively become sequential regardless of worker count. The SSD is capable of much higher throughput but never gets a chance to saturate because the consumer side is bottlenecked.
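
The suspected pattern can be illustrated in miniature (pure Python, purely hypothetical — this is not SwiftLM's code): N workers each perform a "read", but a global allocator lock held across the read collapses effective concurrency to 1 regardless of worker count.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

alloc_lock = threading.Lock()   # stand-in for a global allocator lock
state_lock = threading.Lock()   # protects the concurrency counters below
active = 0
peak = 0                        # highest number of reads in flight at once

def fetch_expert(hold_alloc_lock: bool) -> None:
    """Simulate one prefetch worker fetching an expert shard."""
    global active, peak
    def read():
        global active, peak
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)        # simulated disk read latency
        with state_lock:
            active -= 1
    if hold_alloc_lock:
        with alloc_lock:        # allocation + read serialized under one lock
            read()
    else:
        read()                  # reads are free to overlap

def run(serialized: bool, workers: int = 16) -> int:
    """Return peak read concurrency across one batch of fetches."""
    global active, peak
    active = peak = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(fetch_expert, serialized)
    return peak

print(run(serialized=True))     # 1  -- the lock collapses concurrency
print(run(serialized=False))    # >1 -- reads overlap, up to 16 in flight
```

If SwiftLM's prefetch path looks like the `serialized=True` case, the 16 workers would produce exactly the queue-depth-1 I/O pattern that caps an NVMe drive at a few hundred MB/s.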

Evidence: CPU is only at 55% during generation. A properly pipelined SSD→GPU streaming path should drive CPU closer to 80–90% as the prefetch workers stay busy.

Request

  1. Can you confirm whether the prefetch workers are issuing reads concurrently or effectively sequentially?
  2. Is there a flag to increase prefetch depth or worker count beyond the current 16?
  3. Is this a known limitation of the MLX tensor allocation path on Apple Silicon, and if so, is there a fix planned?

This hardware is otherwise well-suited for this model — plenty of SSD bandwidth, no memory pressure, no swap. Getting even 3–4 tok/s would make 122B viable for async/background inference use cases.

Environment

  • macOS (Apple Silicon)
  • SwiftLM b253
  • MLX (version bundled with SwiftLM b253)
  • Model: mlx-community/Qwen3.5-122B-A10B-4bit
