Add fp32_lm_head flag for vLLM precision parity by jlamypoirier · Pull Request #526 · ServiceNow/Fast-LLM

jlamypoirier · 2026-05-27T19:33:49Z

Summary

Adds an `fp32_lm_head` field on `LanguageModelHeadConfig`. When `True`, the LM head linear's input and weight are upcast to FP32 before the matmul, matching vLLM's `bf16_last_layer_fp32` quantization. This lets the trainer compute log-probabilities at the same numerical precision as the actor's sampling, so the importance-sampling ratio starts near 1.0 instead of being artificially inflated by a trainer/actor precision mismatch.
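The effect of the upcast can be sketched in plain PyTorch. This is an illustration only, not the Fast-LLM implementation: `lm_head_logits`, the tensor shapes, and the use of an FP32 head as a stand-in for the actor are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def lm_head_logits(hidden: torch.Tensor, weight: torch.Tensor, fp32_lm_head: bool) -> torch.Tensor:
    """Hypothetical helper: project hidden states to vocabulary logits."""
    if fp32_lm_head:
        # Upcast both operands so the matmul itself runs in FP32.
        return F.linear(hidden.float(), weight.float())
    # Default path: matmul in the parameters' own dtype (BF16 here).
    return F.linear(hidden, weight)


def token_logp(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Log-probability of each sampled token, with the softmax done in FP32."""
    logp = torch.log_softmax(logits.float(), dim=-1)
    return logp.gather(-1, tokens[:, None]).squeeze(-1)


torch.manual_seed(0)
hidden = torch.randn(4, 64, dtype=torch.bfloat16)     # (tokens, hidden_size)
weight = torch.randn(1000, 64, dtype=torch.bfloat16)  # (vocab_size, hidden_size)
tokens = torch.randint(0, 1000, (4,))                 # sampled token ids

# Stand-in for the actor: log-probabilities computed with an FP32 head.
actor_logp = token_logp(lm_head_logits(hidden, weight, True), tokens)

for flag in (False, True):
    logits = lm_head_logits(hidden, weight, flag)
    ratio = torch.exp(token_logp(logits, tokens) - actor_logp)
    deviation = (ratio - 1).abs().max().item()
    print(f"fp32_lm_head={flag}: logits dtype={logits.dtype}, max |ratio - 1| = {deviation:.6f}")
```

With the flag on, the trainer repeats the actor's computation, so the ratio is 1 up to floating-point noise. With it off, the logits are rounded to BF16 before the softmax and the ratio drifts away from 1 even though the weights are identical.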

The FP32 copy of the weight is detached, so it has `requires_grad=False`, which makes `output_parallel_linear_backward` skip the weight-grad path. The FSDP gradient contract is restored by computing `grad_weight = grad.t() @ saved_input` explicitly and accumulating the result into the original BF16 param's `grad_buffer` via `accumulate_gradient`.
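The explicit weight gradient can be checked against autograd in a few lines. This is a minimal sketch under stated assumptions: the shapes are made up, and the in-place add into `grad_buffer` is only a stand-in for `accumulate_gradient`, whose real behavior is not shown here.

```python
import torch

torch.manual_seed(0)
saved_input = torch.randn(4, 8, dtype=torch.bfloat16)                  # (tokens, hidden_size)
weight = torch.randn(16, 8, dtype=torch.bfloat16, requires_grad=True)  # (vocab_size, hidden_size)

# Forward as described: a detached FP32 copy, so autograd does not track the weight.
weight_fp32 = weight.detach().float()
assert not weight_fp32.requires_grad
logits = saved_input.float() @ weight_fp32.t()

# Stand-in for the gradient arriving from the loss.
grad = torch.randn_like(logits)

# Explicit weight gradient, shape (vocab_size, hidden_size), computed in FP32.
grad_weight = grad.t() @ saved_input.float()

# Stand-in for accumulate_gradient: add into a buffer with the BF16 parameter's dtype.
grad_buffer = torch.zeros_like(weight)
grad_buffer += grad_weight.to(grad_buffer.dtype)

# Cross-check against autograd on an FP32 weight that does require grad.
reference = weight.detach().float().requires_grad_()
(saved_input.float() @ reference.t()).backward(grad)
assert torch.allclose(grad_weight, reference.grad, atol=1e-5)
```

The final assertion confirms that `grad.t() @ saved_input` is the same quantity autograd would have produced had the FP32 weight been tracked, which is why the parameter still receives a gradient even though the weight-grad path is skipped.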

Off by default. With the flag disabled, the code path is byte-identical to before.

Test plan

`pytest tests/layers/test_lm_head.py`: passes

Originally part of #502.


Cherry-picked from #526 to unblock the precision-evaluation tool's GSPO smoke test, which compares `fp32_lm_head=true` vs `false`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jlamypoirier mentioned this pull request May 27, 2026

Add docs_per_step for dynamic microbatch accumulation #520

jlamypoirier mentioned this pull request May 28, 2026

Tool: evaluate layer-wise numerical-error propagation #525

jlamypoirier wants to merge 1 commit into main from jlp_fp32_lm_head
