@@ -16,24 +16,6 @@

# Features and diagnostics

## Collect stack traces

When running a Single Program, Multiple Data (SPMD) job on accelerators, the whole job can hang if any VM hits an error, hangs, or crashes. In that case, capturing stack traces helps identify and troubleshoot issues in jobs running on TPU VMs.

The following settings collect stack traces to help debug a fault or a program that is stuck or hung. Set the parameter values in `src/maxtext/configs/base.yml`:

1. Set `collect_stack_trace: True` to collect stack traces on faults or when the program hangs. With this setting, the program's traces are dumped periodically to help with debugging. To disable collection, set `collect_stack_trace: False`.
2. Set `stack_trace_to_cloud: False` to print stack traces to the console. Setting `stack_trace_to_cloud: True` instead creates a temporary file in `/tmp/debugging` on the TPU VMs to store the stack traces. An agent running on the TPU VMs periodically uploads the traces from that temporary directory to Cloud Logging in the GCP project. You can view the traces in Logs Explorer on Cloud Logging using the following query:

```
logName="projects/<project_name>/logs/tpu.googleapis.com%2Fruntime_monitor"
jsonPayload.verb="stacktraceanalyzer"
```

3. `stack_trace_interval_seconds` is the interval, in seconds, between stack trace collections. For example, `stack_trace_interval_seconds: 600` collects stack traces every 600 seconds (10 minutes).

Here is the related PyPI package: https://pypi.org/project/cloud-tpu-diagnostics.
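
As a minimal sketch, the Logs Explorer query above can be generated for a given project with a small helper. The function name `stack_trace_log_filter` and the example project name are hypothetical and are not part of MaxText or `cloud-tpu-diagnostics`; only the two filter lines come from the query shown above.

```python
def stack_trace_log_filter(project_name: str) -> str:
  """Return the two-line Logs Explorer filter quoted above for one project."""
  # Hypothetical helper: only the filter text itself comes from the docs.
  log_name = f"projects/{project_name}/logs/tpu.googleapis.com%2Fruntime_monitor"
  return "\n".join([
      f'logName="{log_name}"',
      'jsonPayload.verb="stacktraceanalyzer"',
  ])


if __name__ == "__main__":
  # "my-project" is a placeholder project name.
  print(stack_trace_log_filter("my-project"))
```

The returned string can be pasted into the Logs Explorer query box as-is.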

(aot-compilation)=

## Ahead of Time compilation (AOT)
1 change: 0 additions & 1 deletion docs/reference/architecture/architecture_overview.md
@@ -177,6 +177,5 @@ The critical technology enabling this strategy is the suite of checkpoint conver

Debugging performance issues in a distributed system with thousands of accelerators is a notoriously difficult challenge. MaxText incorporates several built-in diagnostic features designed to provide visibility into the system's behavior at scale.

- Stack trace collection: To diagnose program hangs or faults, users can set `collect_stack_trace: True` in the configuration. This feature will periodically dump the Python stack traces from all worker processes. The traces can be directed to the console for immediate inspection or, more scalably, uploaded to Cloud Logging, where they can be aggregated and queried to identify misbehaving nodes.
- HLO dumping: For deep, low-level performance analysis, MaxText allows users to dump the XLA High-Level Optimizer (HLO) graph. By setting the `dump_hlo` flag, the compiled graph for a specific training step can be saved to a local directory or uploaded to Cloud Storage. This HLO representation is invaluable for compiler engineers and advanced users who need to understand exactly how XLA is interpreting and optimizing the model, making it possible to debug subtle performance regressions or compiler-related issues.
- Goodput monitoring: The framework integrates with the ml-goodput-measurement library, which provides a more holistic view of job efficiency than simple TFLOPs calculations. This allows for the tracking of metrics that capture overall "goodput," accounting for factors like data loading time, compilation overhead, and idle time, giving a truer picture of end-to-end performance.
9 changes: 2 additions & 7 deletions docs/run_maxtext/decoupled_mode.md
@@ -54,17 +54,12 @@ Optional environment variables:
MaxText exposes a single module `maxtext.common.gcloud_stub` to avoid scattering environment checks:

```python
from maxtext.common.gcloud_stub import is_decoupled, cloud_diagnostics, jetstream
from maxtext.common.gcloud_stub import is_decoupled, jetstream

if is_decoupled():
  # Skip optional integrations or use local fallbacks
  pass

# Cloud diagnostics (returns diagnostic, debug_configuration, diagnostic_configuration, stack_trace_configuration)
diagnostic, debug_configuration, diagnostic_configuration, stack_trace_configuration = (
  cloud_diagnostics()
)

# JetStream (serving) components
config_lib, engine_api, token_utils, tokenizer_api, token_params_ns = jetstream()
TokenizerParameters = getattr(token_params_ns, "TokenizerParameters", object)
@@ -78,7 +73,7 @@ Behavior when `DECOUPLE_GCLOUD=TRUE`:

## Guidelines:

- Prefer calling `jetstream()` / `cloud_diagnostics()` once at module import and branching on `is_decoupled()` for functionality that truly requires the dependency.
- Prefer calling `jetstream()` once at module import and branching on `is_decoupled()` for functionality that truly requires the dependency.
- Use `is_decoupled()` to avoid direct `os.environ["DECOUPLE_GCLOUD"]` checking.
- Use `get_test_config_path()` instead of hard-coded `base.yml`.
- Prefer conditional local fallbacks for cloud buckets and avoid introducing direct `gs://...` paths.
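
The guidelines above can be illustrated with a short sketch. It does not show the real `maxtext.common.gcloud_stub.is_decoupled` implementation, which this page does not include. `is_decoupled_sketch` and `resolve_output_dir` are hypothetical stand-ins, and the case-insensitive comparison against `TRUE` is an assumption.

```python
import os
from typing import Mapping, Optional


def is_decoupled_sketch(environ: Optional[Mapping[str, str]] = None) -> bool:
  """Hypothetical stand-in for is_decoupled(); the real logic may differ."""
  env = os.environ if environ is None else environ
  # Assumption: DECOUPLE_GCLOUD=TRUE (any letter case) enables decoupled mode.
  return env.get("DECOUPLE_GCLOUD", "").upper() == "TRUE"


def resolve_output_dir(
    bucket_path: str,
    local_fallback: str,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
  """Pick a local fallback in decoupled mode, the cloud path otherwise."""
  if is_decoupled_sketch(environ):
    return local_fallback
  return bucket_path
```

Keeping the environment check in one helper means callers branch on a single function instead of reading `os.environ["DECOUPLE_GCLOUD"]` directly, and the cloud path is passed in as a parameter rather than hard-coded.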
@@ -3,7 +3,6 @@ aqtp
array-record
chex
cloud-accelerator-diagnostics
cloud-tpu-diagnostics!=1.1.14
datasets
drjax
flax
@@ -36,7 +36,7 @@ setuptools<81.0.0
sortedcontainers
torch==2.11.0
torchax==0.0.11
torchvision==0.25.0
torchvision==0.26.0
tpu-info
watchfiles
xgrammar
@@ -23,17 +23,16 @@ cffi>=2.0.0 ; platform_python_implementation != 'PyPy'
cfgv>=3.5.0
charset-normalizer>=3.4.7
chex>=0.1.91
click>=8.3.3
click>=8.4.0
cloud-accelerator-diagnostics>=0.1.1
cloud-tpu-diagnostics>=0.1.5
cloudpickle>=3.1.2
clu>=0.0.12
colorama>=0.4.6
contourpy>=1.3.3
cryptography>=48.0.0
cycler>=0.12.1
datasets>=4.8.5
decorator>=5.2.1
decorator>=5.3.1
dill>=0.4.1
distlib>=0.4.0
distro>=1.9.0
@@ -56,10 +55,10 @@ gast>=0.7.0
gcsfs>=2026.2.0
google-api-core>=2.30.3
google-api-python-client>=2.196.0
google-auth>=2.52.0
google-auth>=2.53.0
google-auth-httplib2>=0.4.0
google-auth-oauthlib>=1.4.0
google-cloud-aiplatform>=1.152.0
google-cloud-aiplatform>=1.153.1
google-cloud-appengine-logging>=1.9.0
google-cloud-audit-log>=0.5.0
google-cloud-bigquery>=3.41.0
@@ -71,7 +70,7 @@ google-cloud-resource-manager>=1.17.0
google-cloud-storage>=3.10.1
google-cloud-storage-control>=1.11.0
google-crc32c>=1.8.0
google-genai>=1.75.0
google-genai>=2.4.0
google-pasta>=0.2.0
google-resumable-media>=2.9.0
googleapis-common-protos>=1.75.0
@@ -86,7 +85,7 @@ hf-xet>=1.5.0 ; platform_machine == 'AMD64' or platform_machine == 'aarch64' or
httpcore>=1.0.9
httplib2>=0.31.2
httpx>=0.28.1
huggingface-hub>=1.14.0
huggingface-hub>=1.15.0
humanize>=4.15.0
hypothesis>=6.142.1
identify>=2.6.19
@@ -221,7 +220,7 @@ tensorflow-metadata>=1.17.3
tensorflow-text>=2.20.1
tensorstore>=0.1.82
termcolor>=3.3.0
tiktoken>=0.12.0
tiktoken>=0.13.0
tokamax>=0.0.12
tokenizers>=0.22.2
toml>=0.10.2
@@ -240,7 +239,7 @@ typing-inspection>=0.4.2
tzdata>=2026.2 ; sys_platform == 'emscripten' or sys_platform == 'win32'
uritemplate>=4.2.0
urllib3>=2.6.3
uvicorn>=0.46.0
uvicorn>=0.47.0
uvloop>=0.22.1
virtualenv>=21.3.3
wadler-lindig>=0.1.7
@@ -23,20 +23,19 @@ astunparse>=1.6.3
attrs>=26.1.0
auditwheel>=6.6.0
black>=25.12.0
boto3>=1.43.7
botocore>=1.43.7
boto3>=1.43.10
botocore>=1.43.10
build>=1.4.3
cachetools>=7.1.1
cachetools>=7.1.3
cbor2>=6.1.1
certifi>=2026.2.25
cffi>=2.0.0 ; implementation_name == 'pypy' or platform_python_implementation != 'PyPy'
cfgv>=3.5.0
charset-normalizer>=3.4.7
cheroot>=11.1.2
chex>=0.1.91
click>=8.3.3
click>=8.4.0
cloud-accelerator-diagnostics>=0.1.1
cloud-tpu-diagnostics>=0.1.5
cloudpickle>=3.1.2
clu>=0.0.12
colorama>=0.4.6
@@ -52,7 +51,7 @@ dataclasses>=0.5
dataclasses-json>=0.0.1
datasets>=4.8.5
debugpy>=1.8.20
decorator>=5.2.1
decorator>=5.3.1
dill>=0.4.1
distlib>=0.4.0
distro>=1.9.0
@@ -80,10 +79,10 @@ gepa>=0.1.1
gguf>=0.19.0
google-api-core>=2.30.3
google-api-python-client>=2.196.0
google-auth>=2.52.0
google-auth>=2.53.0
google-auth-httplib2>=0.4.0
google-auth-oauthlib>=1.4.0
google-cloud-aiplatform>=1.152.0
google-cloud-aiplatform>=1.153.1
google-cloud-appengine-logging>=1.9.0
google-cloud-audit-log>=0.5.0
google-cloud-bigquery>=3.41.0
@@ -95,7 +94,7 @@ google-cloud-resource-manager>=1.17.0
google-cloud-storage>=3.10.1
google-cloud-storage-control>=1.11.0
google-crc32c>=1.8.0
google-genai>=1.75.0
google-genai>=2.4.0
google-metrax>=0.2.3
google-pasta>=0.2.0
google-resumable-media>=2.9.0
@@ -114,7 +113,7 @@ hf-xet>=1.5.0 ; platform_machine == 'AMD64' or platform_machine == 'aarch64' or
httpcore>=1.0.9
httplib2>=0.31.2
httpx>=0.28.1
huggingface-hub>=1.14.0
huggingface-hub>=1.15.0
humanize>=4.15.0
hypothesis>=6.142.1
identify>=2.6.19
@@ -129,7 +128,7 @@ ipython>=9.13.0
ipython-pygments-lexers>=1.1.1
ipywidgets>=8.1.8
isort>=8.0.1
jaraco-functools>=4.4.0
jaraco-functools>=4.5.0
jax>=0.10.0
jaxlib>=0.10.0
jaxtyping>=0.3.9
@@ -154,7 +153,7 @@ libtpu>=0.0.40 ; platform_machine == 'x86_64' and sys_platform == 'linux'
llguidance>=1.7.5
llvmlite>=0.47.0
loguru>=0.7.3
lxml>=6.1.0
lxml>=6.1.1
markdown>=3.10.2
markdown-it-py>=4.0.0
markupsafe>=3.0.3
@@ -203,7 +202,7 @@ nvidia-nvshmem-cu13>=3.4.5 ; sys_platform == 'linux'
nvidia-nvtx>=13.0.85 ; sys_platform == 'linux'
oauthlib>=3.3.1
omegaconf>=2.3.0
openai>=2.36.0
openai>=2.37.0
openai-harmony>=0.0.8
opentelemetry-api>=1.41.1
opt-einsum>=3.4.0
@@ -309,14 +308,15 @@ tensorflow-metadata>=1.17.3
tensorflow-text>=2.20.1
tensorstore>=0.1.82
termcolor>=3.3.0
tiktoken>=0.12.0
tiktoken>=0.13.0
tokamax>=0.0.12
tokenizers>=0.22.2
toml>=0.10.2
tomlkit>=0.15.0
toolz>=1.1.0
torch>=2.11.0
torchax>=0.0.11
torchvision>=0.26.0
tornado>=6.5.5
tpu-info>=0.11.0
tqdm>=4.66.3
@@ -331,19 +331,19 @@ typing-inspection>=0.4.2
tzdata>=2026.2 ; sys_platform == 'emscripten' or sys_platform == 'win32'
uritemplate>=4.2.0
urllib3>=2.6.3
uvicorn>=0.46.0
uvicorn>=0.47.0
uvloop>=0.22.1
virtualenv>=21.3.3
wadler-lindig>=0.1.7
watchfiles>=1.1.1
watchfiles>=1.2.0
wcwidth>=0.7.0
websockets>=16.0
werkzeug>=3.1.8
wheel>=0.46.3
widgetsnbextension>=4.0.15
win32-setctime>=1.2.0 ; sys_platform == 'win32'
wrapt>=2.1.2
xgrammar>=0.2.0
xgrammar>=0.2.1
xprof>=2.22.3
xxhash>=3.7.0
yapf>=0.43.0
@@ -23,17 +23,16 @@ cffi>=2.0.0 ; platform_python_implementation != 'PyPy'
cfgv>=3.5.0
charset-normalizer>=3.4.7
chex>=0.1.91
click>=8.3.3
click>=8.4.0
cloud-accelerator-diagnostics>=0.1.1
cloud-tpu-diagnostics>=0.1.5
cloudpickle>=3.1.2
clu>=0.0.12
colorama>=0.4.6
contourpy>=1.3.3
cryptography>=48.0.0
cycler>=0.12.1
datasets>=4.8.5
decorator>=5.2.1
decorator>=5.3.1
dill>=0.4.1
distlib>=0.4.0
distro>=1.9.0
@@ -56,10 +55,10 @@ gast>=0.7.0
gcsfs>=2026.2.0
google-api-core>=2.30.3
google-api-python-client>=2.196.0
google-auth>=2.52.0
google-auth>=2.53.0
google-auth-httplib2>=0.4.0
google-auth-oauthlib>=1.4.0
google-cloud-aiplatform>=1.152.0
google-cloud-aiplatform>=1.153.1
google-cloud-appengine-logging>=1.9.0
google-cloud-audit-log>=0.5.0
google-cloud-bigquery>=3.41.0
@@ -71,7 +70,7 @@ google-cloud-resource-manager>=1.17.0
google-cloud-storage>=3.10.1
google-cloud-storage-control>=1.11.0
google-crc32c>=1.8.0
google-genai>=1.75.0
google-genai>=2.4.0
google-pasta>=0.2.0
google-resumable-media>=2.9.0
googleapis-common-protos>=1.75.0
@@ -86,7 +85,7 @@ hf-xet>=1.5.0 ; platform_machine == 'AMD64' or platform_machine == 'aarch64' or
httpcore>=1.0.9
httplib2>=0.31.2
httpx>=0.28.1
huggingface-hub>=1.14.0
huggingface-hub>=1.15.0
humanize>=4.15.0
hypothesis>=6.142.1
identify>=2.6.19
@@ -207,7 +206,7 @@ tensorflow-metadata>=1.17.3
tensorflow-text>=2.20.1
tensorstore>=0.1.82
termcolor>=3.3.0
tiktoken>=0.12.0
tiktoken>=0.13.0
tokamax>=0.0.12
tokenizers>=0.22.2
toml>=0.10.2
@@ -223,7 +222,7 @@ typing-inspection>=0.4.2
tzdata>=2026.2 ; sys_platform == 'emscripten' or sys_platform == 'win32'
uritemplate>=4.2.0
urllib3>=2.6.3
uvicorn>=0.46.0
uvicorn>=0.47.0
uvloop>=0.22.1
virtualenv>=21.3.3
wadler-lindig>=0.1.7