## Summary
Running Qwen3.5-122B-A10B-4bit with `--stream-experts --ssd-prefetch` on an M1 Ultra 64GB produces only 0.6 tok/s generation speed. Profiling shows SSD I/O at only ~300 MB/s — roughly 5% of the internal NVMe's sequential read bandwidth (5–7 GB/s). The drive is not the bottleneck; something in the expert streaming pipeline is.
## Hardware
- Machine: Apple M1 Ultra, 64GB unified memory, macOS
- SSD: internal NVMe — confirmed 5–7 GB/s sequential read via `dd`/`fio`
- SwiftLM version: b253
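For anyone reproducing this without `dd`/`fio`, a minimal Python equivalent of the sequential-read check is sketched below (the path is a placeholder; note that a warm page cache will overstate cold-read throughput, so drop caches or use a freshly written file):

```python
import time

def seq_read_mb_s(path, chunk_mb=8):
    """Measure sequential read throughput of a file, in MB/s."""
    chunk = chunk_mb * 1024 * 1024
    total = 0
    t0 = time.perf_counter()
    # buffering=0 avoids Python-level buffering skewing the measurement
    with open(path, "rb", buffering=0) as f:
        while data := f.read(chunk):
            total += len(data)
    return total / (time.perf_counter() - t0) / 1e6

# e.g. seq_read_mb_s("/path/to/model-shard.safetensors")
```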
## Model
- Model: Qwen3.5-122B-A10B-4bit (MoE, ~10B activated params per token)
- Weight files: 69.6 GB (4-bit quantized)
- Source: mlx-community on HuggingFace, converted with mlx-vlm 0.3.12
## `SwiftLM --info` output
```
Strategy: ⚠️ SWAP-ASSISTED
Overcommit: 1.32× (model is 32% larger than RAM)
GPU layers: 24/32
Est. speed: ~3 tok/s
```
## Launch command
```shell
SwiftLM \
  --model /path/to/Qwen3.5-122B-A10B-4bit \
  --port 8000 \
  --stream-experts \
  --ssd-prefetch
```
## Observed metrics during generation

| Metric | Value |
| --- | --- |
| Generation speed | 0.6 tok/s |
| Disk I/O (`iostat` disk0) | ~300 MB/s |
| System-wide memory free | 78–80% |
| Swap activity (swapins) | 0 — no swap at all |
| CPU utilization | ~55% |
The fact that swap is zero and ~78% of RAM is free confirms `--stream-experts` is working correctly — the model is streaming experts from SSD rather than hitting macOS swap. However, the SSD is barely being used.
## The bandwidth gap
- Available SSD bandwidth: 5,000–7,000 MB/s
- Observed during generation: ~300 MB/s
- Utilization: ~5%
Per-token math for a 10B-activated-param MoE at 4-bit:
- ~5 GB of expert weights need to be accessed per token
- At full SSD speed (5 GB/s): theoretical ~1 tok/s (ignoring prefetch overlap)
- With good prefetch overlap: 3–5 tok/s should be achievable
- Actual: 0.6 tok/s
Even reaching 2–3 GB/s of SSD utilization (well under the drive's max) should yield roughly 4–6 tok/s by linear scaling from the observed baseline (300 MB/s → 0.6 tok/s).
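The arithmetic above can be sanity-checked in a few lines. Note the observed baseline implies the effective cold read per token is much smaller than the 5 GB of touched weights (presumably because some experts are RAM-resident), which is why throughput should scale roughly linearly with achieved bandwidth rather than capping at 1 tok/s:

```python
# Sanity-check of the per-token I/O arithmetic (decimal GB, as in drive specs).
GB = 1000**3

# ~10B activated params at 4 bits (0.5 bytes) each = ~5 GB touched per token
bytes_per_token = 10e9 * 0.5
assert bytes_per_token == 5 * GB

# Streaming all 5 GB cold from disk at full speed (5 GB/s) caps at ~1 tok/s,
# matching the "theoretical" figure above.
print(5 * GB / bytes_per_token)               # -> 1.0 tok/s

# What the observed baseline implies is actually read per token:
baseline_bw, baseline_tok_s = 300e6, 0.6
print(baseline_bw / baseline_tok_s / GB)      # -> ~0.5 GB read per token

# Linear-scaling projection at 2 GB/s of achieved bandwidth:
print(baseline_tok_s * 2000e6 / baseline_bw)  # -> 4.0 tok/s
```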
## What we tried
- `ulimit -n 4096` before launch — no change in I/O throughput or tok/s
- Both with and without `--ssd-prefetch` — marginal difference
## Hypothesis
The 16-worker prefetch pool (`--ssd-prefetch`) may be serializing at the MLX memory allocator level rather than issuing parallel disk reads. Each worker allocates GPU/unified memory for the incoming expert shard, and if that allocation path holds a global lock, the reads effectively become sequential regardless of worker count. The SSD is capable of much higher throughput but never gets a chance to saturate because the consumer side is bottlenecked.
Evidence: CPU is only at 55% during generation. A properly pipelined SSD→GPU streaming path should drive CPU closer to 80–90% as the prefetch workers stay busy.
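SwiftLM's worker pool isn't visible from the outside, but the suspected failure mode is easy to reproduce in miniature: if each worker's buffer allocation takes a global lock, N workers collapse to nearly sequential aggregate throughput no matter how parallel the reads are. A toy sketch (pure Python; the sleeps are stand-ins for real I/O and allocation — nothing here is SwiftLM code):

```python
import threading
import time

alloc_lock = threading.Lock()  # stand-in for a global allocator lock

def prefetch_worker(read_s=0.05, alloc_s=0.05, serialized=True):
    """Simulate one expert-shard prefetch: disk read, then buffer allocation."""
    time.sleep(read_s)              # the (parallelizable) disk read
    if serialized:
        with alloc_lock:            # global lock: allocations queue up
            time.sleep(alloc_s)
    else:
        time.sleep(alloc_s)         # lock-free allocation path

def run(n_workers=16, serialized=True):
    t0 = time.perf_counter()
    threads = [threading.Thread(target=prefetch_worker,
                                kwargs={"serialized": serialized})
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

# With the lock, 16 workers take ~ read + 16 * alloc (~0.85s);
# without it, ~ read + alloc (~0.10s) -- an ~8x throughput gap.
print(f"serialized: {run(serialized=True):.2f}s")
print(f"parallel:   {run(serialized=False):.2f}s")
```

The same 16-worker pool thus delivers an order of magnitude less aggregate throughput when a single lock sits on the consumer path, which would match both the ~300 MB/s ceiling and the idle CPU headroom.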
## Request
- Can you confirm whether the prefetch workers are issuing reads concurrently or effectively sequentially?
- Is there a flag to increase prefetch depth or worker count beyond the current 16?
- Is this a known limitation of the MLX tensor allocation path on Apple Silicon, and if so, is there a fix planned?
This hardware is otherwise well-suited for this model — plenty of SSD bandwidth, no memory pressure, no swap. Getting even 3–4 tok/s would make 122B viable for async/background inference use cases.
## Environment
- macOS (Apple Silicon)
- SwiftLM b253
- MLX (version bundled with SwiftLM b253)
- Model: mlx-community/Qwen3.5-122B-A10B-4bit