29 changes: 24 additions & 5 deletions gpu-telemetry/about-telemetry.rst
@@ -27,13 +27,32 @@ to create and manage alerts. Prometheus is deployed along with `kube-state-metrics <https://github.com/kubernetes/kube-state-metrics>`_ and
`node_exporter <https://github.com/prometheus/node_exporter>`_ to expose cluster-level metrics for Kubernetes API objects and node-level
metrics such as CPU utilization.

The architecture of Prometheus is shown in the figure below:

.. image:: https://boxboat.com/2019/08/08/monitoring-kubernetes-with-prometheus/prometheus-architecture.png
:width: 800
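
For example, node-level CPU utilization gathered by ``node_exporter`` can be read back through the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus is reachable at ``localhost:9090`` (for example, through a port-forward); the ``/api/v1/query`` endpoint and the ``node_cpu_seconds_total`` metric are standard Prometheus and ``node_exporter`` names rather than anything specific to this deployment:

.. code-block:: python

   import json
   import urllib.parse
   import urllib.request

   # Assumption: Prometheus is reachable at this address; adjust for your cluster
   # (for example, after port-forwarding the Prometheus service to localhost:9090).
   PROMETHEUS_URL = "http://localhost:9090"

   # Average CPU utilization per node over the last 5 minutes, derived from the
   # node_exporter metric node_cpu_seconds_total.
   QUERY = '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'

   url = PROMETHEUS_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
   with urllib.request.urlopen(url) as resp:
       result = json.load(resp)

   for sample in result["data"]["result"]:
       instance = sample["metric"].get("instance", "unknown")
       print(f"{instance}: {float(sample['value'][1]):.1f}% CPU")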


To gather GPU telemetry in Kubernetes, it is recommended to use DCGM Exporter. DCGM Exporter, based on `DCGM <https://developer.nvidia.com/dcgm>`_, exposes
GPU metrics for Prometheus that can be visualized using Grafana. DCGM Exporter is architected to take advantage of the
``KubeletPodResources`` `API <https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/>`_ and exposes GPU metrics in a format that can be
scraped by Prometheus. A ``ServiceMonitor`` is also included to expose endpoints.
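
As a quick illustration of that format, the sketch below reads the raw exposition text straight from the exporter's ``/metrics`` endpoint and prints the GPU-utilization samples. It assumes DCGM Exporter is listening on its default address (``:9400``) and that ``DCGM_FI_DEV_GPU_UTIL`` is among the fields being collected; adjust the address and metric name to match your deployment:

.. code-block:: python

   import urllib.request

   # Assumption: dcgm-exporter is serving on its default address (":9400").
   METRICS_URL = "http://localhost:9400/metrics"

   with urllib.request.urlopen(METRICS_URL) as resp:
       exposition = resp.read().decode("utf-8")

   # Each sample is a plain-text line such as:
   #   DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",...} 42
   # which is exactly the format Prometheus scrapes from this endpoint.
   for line in exposition.splitlines():
       if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
           print(line)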

.. note::
DCGM and DCGM Exporter are deployed by default with the `NVIDIA GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/>`_.


*************************
Benefits of GPU Telemetry
*************************

Understanding GPU usage provides important insights for IT administrators managing a data center.
Trends in GPU metrics correlate with workload behavior and make it possible to optimize resource allocation,
diagnose anomalies, and increase overall data center efficiency.
As GPUs become more mainstream, users want access to GPU metrics so that they can monitor GPU resources, just
as they do today for CPUs.

*************************
Quick Links
*************************

* `DCGM Exporter GitHub repository <https://github.com/NVIDIA/dcgm-exporter>`_
* `DCGM Documentation <https://docs.nvidia.com/datacenter/dcgm/latest/>`_
* `DCGM Exporter Documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html>`_
* `Integrating GPU Telemetry into Kubernetes <https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/integrating-telemetry-kubernetes.html>`_
* `NVIDIA GPU Operator Documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/>`_
49 changes: 23 additions & 26 deletions gpu-telemetry/dcgm-exporter.rst
@@ -28,7 +28,7 @@ Introduction
************

`DCGM-Exporter <https://github.com/NVIDIA/dcgm-exporter>`__ is a tool based on the
Go APIs to `NVIDIA DCGM <https://developer.nvidia.com/dcgm>`__ that allows users to gather
Go APIs to `NVIDIA DCGM <https://developer.nvidia.com/dcgm>`__ that allows you to gather
GPU metrics and understand workload behavior or monitor GPUs in clusters. DCGM Exporter is
written in Go and exposes GPU metrics at an HTTP endpoint (``/metrics``) for monitoring solutions
such as Prometheus.
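
Because the endpoint is plain HTTP, any consumer can read it directly, not just Prometheus. The sketch below is illustrative only: it assumes the default listen address (``:9400``) and uses the third-party ``prometheus_client`` package to parse the exposition format into per-GPU samples; neither the script nor the package is part of DCGM-Exporter itself.

.. code-block:: python

   import urllib.request

   # Third-party dependency (pip install prometheus-client); its parser understands
   # the Prometheus exposition format that dcgm-exporter produces.
   from prometheus_client.parser import text_string_to_metric_families

   # Assumption: dcgm-exporter is listening on its default address (":9400").
   METRICS_URL = "http://localhost:9400/metrics"

   def scrape_gpu_samples():
       """Return (metric_name, gpu_index, value) tuples for all DCGM metrics."""
       with urllib.request.urlopen(METRICS_URL) as resp:
           text = resp.read().decode("utf-8")

       samples = []
       for family in text_string_to_metric_families(text):
           if not family.name.startswith("DCGM_"):
               continue
           for sample in family.samples:
               samples.append((sample.name, sample.labels.get("gpu"), sample.value))
       return samples

   if __name__ == "__main__":
       for name, gpu, value in scrape_gpu_samples():
           print(f"gpu={gpu} {name}={value}")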
@@ -91,31 +91,28 @@ DCGM-Exporter Customization

DCGM-Exporter has various options for adjusting its default behavior. Each option supports both a command-line flag and an environment variable, as summarized in the table below and illustrated by the example that follows it.

=================================== ==================== =============================================
Environment Variable Command-Line Flag Value
=================================== ==================== =============================================
``$DCGM_EXPORTER_COLLECTORS`` ``-f`` File Path
Path to file containing DCGM fields to collect. Default: "/etc/dcgm-exporter/default-counters.csv"
--------------------------------------------------------------------------------------------------------
``$DCGM_EXPORTER_LISTEN`` ``-a`` Address
Address of listening http server. Default: ":9400"
--------------------------------------------------------------------------------------------------------
``$DCGM_EXPORTER_INTERVAL`` ``-c`` Interval
Interval of time at which point metrics are collected. Unit is milliseconds. Default:30000
--------------------------------------------------------------------------------------------------------
``$DCGM_EXPORTER_KUBERNETES`` ``-k`` Boolean
Enable kubernetes mapping metrics to kubernetes pods. Default: false
--------------------------------------------------------------------------------------------------------
``$DCGM_EXPORTER_CONFIGMAP_DATA`` ``-m`` Namespace:Name
ConfigMap namespace and name containing DCGM fields to collect. Default: "none"
--------------------------------------------------------------------------------------------------------
``$DCGM_REMOTE_HOSTENGINE_INFO`` ``-r`` Host:Port
Connect to remote hostengine at Host:Port. Default: NA (dcgm-exporter will started in embedded mode)
--------------------------------------------------------------------------------------------------------
``$DCGM_EXPORTER_DEVICES_STR`` ``-d`` Device String (see following note)
Specify which devices to monitor. Default: all GPU instances in MIG mode, all GPUs if MIG disabled.
--------------------------------------------------------------------------------------------------------
=================================== ==================== =============================================
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
| Environment Variable | Description | Command-Line Flag | Value |
+==================================+======================================================================+=====================+==================================+
| ``$DCGM_EXPORTER_COLLECTORS`` | Path to file containing DCGM fields to collect. Default: | ``-f`` | File Path |
| | "/etc/dcgm-exporter/default-counters.csv" | | |
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
| ``$DCGM_EXPORTER_LISTEN`` | Address of listening http server. Default: ":9400" | ``-a`` | Address |
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
| ``$DCGM_EXPORTER_INTERVAL`` | Interval of time at which point metrics are collected. Unit is | ``-c`` | Interval |
| | milliseconds. Default: 30000 | | |
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
| ``$DCGM_EXPORTER_KUBERNETES`` | Enable kubernetes mapping metrics to kubernetes pods. Default: false | ``-k`` | Boolean |
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
| ``$DCGM_EXPORTER_CONFIGMAP_DATA``| ConfigMap namespace and name containing DCGM fields to collect. | ``-m`` | Namespace:Name |
| | Default: "none" | | |
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
| ``$DCGM_REMOTE_HOSTENGINE_INFO`` | Connect to remote hostengine at Host:Port. Default: NA | ``-r`` | Host:Port |
| | (dcgm-exporter will start in embedded mode) | | |
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
| ``$DCGM_EXPORTER_DEVICES_STR`` | Specify which devices to monitor. Default: all GPU instances in | ``-d`` | Device String |
| | MIG mode, all GPUs if MIG disabled. | | (see following note) |
+----------------------------------+----------------------------------------------------------------------+---------------------+----------------------------------+
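
The sketch below shows one way these options might be combined when running the exporter by hand, assuming the ``dcgm-exporter`` binary is installed locally and available on ``PATH`` (that is, outside of a Kubernetes or GPU Operator deployment). It sets a 10-second collection interval and a non-default listen address, first through the environment variables and then through the equivalent command-line flags:

.. code-block:: python

   import os
   import subprocess

   # Collect every 10 s (the value is in milliseconds) and listen on port 9401
   # instead of the default ":9400", configured through environment variables.
   env = dict(
       os.environ,
       DCGM_EXPORTER_INTERVAL="10000",
       DCGM_EXPORTER_LISTEN=":9401",
   )
   # Runs until interrupted, exposing metrics at http://localhost:9401/metrics.
   subprocess.run(["dcgm-exporter"], env=env, check=True)

   # The same configuration using the command-line flags from the table above:
   # subprocess.run(["dcgm-exporter", "-c", "10000", "-a", ":9401"], check=True)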

.. note::
Device String Syntax: ``[f] | [g[:id1[,-id2]]] | [i[:id1[,-id2]]]``
18 changes: 2 additions & 16 deletions gpu-telemetry/integrating-telemetry-kubernetes.rst
@@ -20,22 +20,6 @@
Integrating GPU Telemetry into Kubernetes
#########################################

.. contents::
:depth: 5
:local:
:backlinks: none


*************************
Benefits of GPU Telemetry
*************************

Understanding GPU usage provides important insights for IT administrators managing a data center.
Trends in GPU metrics correlate with workload behavior and make it possible to optimize resource allocation,
diagnose anomalies, and increase overall data center efficiency. As GPUs become more mainstream in
Kubernetes environments, users would like to get access to GPU metrics to monitor GPU resources, just
like they do today for CPUs.

The purpose of this document is to enumerate an end-to-end (e2e) workflow
for setting up and using `DCGM <https://developer.nvidia.com/dcgm>`_ within a Kubernetes environment.

@@ -44,6 +28,8 @@ a native installation of the NVIDIA drivers on the GPU enabled nodes (i.e. neither
the `NVIDIA GPU Operator <https://github.com/NVIDIA/gpu-operator>`_ nor containerized drivers are used
in this document).



**************
NVIDIA Drivers
**************