
--stream-experts only achieves ~5% of available SSD bandwidth on M1 Ultra (300 MB/s observed vs 5-7 GB/s capable) #24

@ericjlake

Description

Summary

Running Qwen3.5-122B-A10B-4bit with --stream-experts --ssd-prefetch on an M1 Ultra 64GB produces only 0.6 tok/s generation speed. Profiling shows SSD I/O is only ~300 MB/s — roughly 5% of the M1 Ultra's internal NVMe capacity (5–7 GB/s). The drive is not the bottleneck; something in the expert streaming pipeline is.

Hardware

  • Machine: Apple M1 Ultra, 64GB unified memory, macOS
  • SSD: Internal NVMe — confirmed 5–7 GB/s sequential read via dd / fio
  • SwiftLM version: b253
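
For anyone wanting to reproduce the raw-read baseline without dd/fio, a minimal Python sketch (the path is a placeholder; any multi-GB file on the drive works, e.g. one of the weight shards):

```python
import time

CHUNK = 8 * 1024 * 1024  # 8 MiB reads, large enough to amortize syscall cost

def sequential_read_bandwidth(path: str) -> float:
    """Return sequential read throughput in MB/s for one pass over the file."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered to avoid a double copy
        while chunk := f.read(CHUNK):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6

# Placeholder path -- substitute a real multi-GB file before running:
# print(sequential_read_bandwidth("/path/to/large_file.bin"))
```

Caveat: a second run over the same file measures the page cache, not the SSD, so use a file larger than RAM or a freshly rebooted machine for an honest number.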

Model

  • Model: Qwen3.5-122B-A10B-4bit (MoE, ~10B activated params per token)
  • Weight files: 69.6 GB (4-bit quantized)
  • Source: mlx-community on HuggingFace, converted with mlx-vlm 0.3.12

SwiftLM --info output

Strategy:     ⚠️  SWAP-ASSISTED
Overcommit:   1.32× (model is 32% larger than RAM)
GPU layers:   24/32
Est. speed:   ~3 tok/s

Launch command

SwiftLM \
  --model /path/to/Qwen3.5-122B-A10B-4bit \
  --port 8000 \
  --stream-experts \
  --ssd-prefetch

Observed metrics during generation

Metric                       Value
Generation speed             0.6 tok/s
Disk I/O (iostat disk0)      ~300 MB/s
Free memory (system-wide)    78–80%
Swap activity (swapins)      0 (no swap at all)
CPU utilization              ~55%

Zero swap activity with ~78% of RAM free confirms --stream-experts is doing its job — the model streams experts from SSD rather than falling back to macOS swap. The SSD itself, however, is barely being used.

The bandwidth gap

Available SSD bandwidth:    5,000–7,000 MB/s
Observed during generation:       ~300 MB/s
Utilization:                        ~5%

Per-token math for a 10B-activated-param MoE at 4-bit:

  • ~5 GB of expert weights need to be accessed per token
  • At full SSD speed (5 GB/s): theoretical ~1 tok/s (ignoring prefetch overlap)
  • With good prefetch overlap: 3–5 tok/s should be achievable
  • Actual: 0.6 tok/s

Even reaching 2–3 GB/s of SSD utilization (well under max) should yield 3–4 tok/s based on linear scaling from our observed baseline.
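
As a sanity check, the linear-scaling projection works out as follows (plain arithmetic on the figures above; nothing SwiftLM-specific):

```python
# Figures from the profiling above.
OBSERVED_TOKS = 0.6       # tok/s during generation
OBSERVED_BW_MBS = 300.0   # MB/s of SSD I/O during generation

def projected_toks(bandwidth_mbs: float) -> float:
    """Project tok/s at a given SSD bandwidth, scaling linearly from the
    observed baseline (assumes generation stays I/O-bound)."""
    return OBSERVED_TOKS * bandwidth_mbs / OBSERVED_BW_MBS

# Raw per-token floor: ~5 GB of expert weights per token at full drive speed.
print(5000 / 5000)             # 1.0 tok/s with zero prefetch overlap

# Well under the drive's 5-7 GB/s maximum:
print(projected_toks(2000))    # 4.0 tok/s at 2 GB/s
```

Even the pessimistic linear model, which ignores any prefetch overlap, says 2 GB/s of disk utilization is enough to land in the 3–4 tok/s range.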

What we tried

  • ulimit -n 4096 before launch — no change in I/O throughput or tok/s
  • Both with and without --ssd-prefetch — marginal difference

Hypothesis

The 16-worker prefetch pool (--ssd-prefetch) may be serializing at the MLX memory allocator level rather than issuing parallel disk reads. Each worker allocates GPU/unified memory for the incoming expert shard, and if that allocation path holds a global lock, the reads effectively become sequential regardless of worker count. The SSD is capable of much higher throughput but never gets a chance to saturate because the consumer side is bottlenecked.
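
The suspected pattern can be illustrated in miniature (pure Python, purely hypothetical — this is not SwiftLM's code): N workers each perform a "read", but a global allocator lock held across the read collapses effective concurrency to 1 regardless of worker count.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

alloc_lock = threading.Lock()   # stand-in for a global allocator lock
state_lock = threading.Lock()   # protects the concurrency counters below
active = 0
peak = 0                        # highest number of reads in flight at once

def fetch_expert(hold_alloc_lock: bool) -> None:
    """Simulate one prefetch worker fetching an expert shard."""
    global active, peak
    def read():
        global active, peak
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)        # simulated disk read latency
        with state_lock:
            active -= 1
    if hold_alloc_lock:
        with alloc_lock:        # allocation + read serialized under one lock
            read()
    else:
        read()                  # reads are free to overlap

def run(serialized: bool, workers: int = 16) -> int:
    """Return peak read concurrency across one batch of fetches."""
    global active, peak
    active = peak = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(fetch_expert, serialized)
    return peak

print(run(serialized=True))     # 1  -- the lock collapses concurrency
print(run(serialized=False))    # >1 -- reads overlap, up to 16 in flight
```

If SwiftLM's prefetch path looks like the `serialized=True` case, the 16 workers would produce exactly the queue-depth-1 I/O pattern that caps an NVMe drive at a few hundred MB/s.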

Evidence: CPU is only at 55% during generation. A properly pipelined SSD→GPU streaming path should drive CPU closer to 80–90% as the prefetch workers stay busy.

Request

  1. Can you confirm whether the prefetch workers are issuing reads concurrently or effectively sequentially?
  2. Is there a flag to increase prefetch depth or worker count beyond the current 16?
  3. Is this a known limitation of the MLX tensor allocation path on Apple Silicon, and if so, is there a fix planned?

This hardware is otherwise well-suited for this model — plenty of SSD bandwidth, no memory pressure, no swap. Getting even 3–4 tok/s would make 122B viable for async/background inference use cases.

Environment

  • macOS (Apple Silicon)
  • SwiftLM b253
  • MLX (version bundled with SwiftLM b253)
  • Model: mlx-community/Qwen3.5-122B-A10B-4bit
