diff --git a/gpu-telemetry/about-telemetry.rst b/gpu-telemetry/about-telemetry.rst
index d9f4004e1..eded05eba 100644
--- a/gpu-telemetry/about-telemetry.rst
+++ b/gpu-telemetry/about-telemetry.rst
@@ -27,13 +27,32 @@
 to create and manage alerts. Prometheus is deployed along with `kube-state-metrics `_ and
 `node_exporter `_ to expose cluster-level metrics for Kubernetes API objects and node-level
 metrics such as CPU utilization.
 
-An architecture of Prometheus is shown in the figure below:
-
-.. image:: https://boxboat.com/2019/08/08/monitoring-kubernetes-with-prometheus/prometheus-architecture.png
-   :width: 800
-
 To gather GPU telemetry in Kubernetes, it is recommended to use DCGM Exporter. DCGM Exporter, based on
 `DCGM `_, exposes GPU metrics for Prometheus that can be visualized using Grafana. DCGM Exporter is
 architected to take advantage of the ``KubeletPodResources`` `API `_ and exposes GPU metrics in a
 format that can be scraped by Prometheus. A ``ServiceMonitor`` is also included to expose endpoints.
+
+.. note::
+   DCGM and DCGM Exporter are deployed by default with the `NVIDIA GPU Operator `_.
+
+
+*************************
+Benefits of GPU Telemetry
+*************************
+
+Understanding GPU usage provides important insights for IT administrators managing a data center.
+Trends in GPU metrics correlate with workload behavior and make it possible to optimize resource
+allocation, diagnose anomalies, and increase overall data center efficiency.
+As GPUs become more mainstream, users want access to GPU metrics so that they can monitor GPU
+resources just as they do for CPUs today.
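The GPU metrics described above are served by DCGM Exporter in the Prometheus exposition format (on port ``9400`` by default, per the exporter's ``-a`` option). As a rough illustration of what a scrape contains, here is a small parsing sketch; it is not part of this change, ``DCGM_FI_DEV_GPU_UTIL`` is a real DCGM field name, but the label values in the sample are invented:

```python
import re

# Illustrative sample of a DCGM Exporter scrape in the Prometheus exposition
# format. DCGM_FI_DEV_GPU_UTIL is a real DCGM field name; the gpu/UUID label
# values below are invented for the example.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 93
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 0
"""

# One sample line: metric name, {label="value",...}, numeric sample value.
METRIC_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def parse_metrics(text):
    """Return (name, labels, value) tuples for each sample line."""
    samples = []
    for line in text.splitlines():
        # '#' lines are HELP/TYPE metadata, not samples.
        if not line.strip() or line.startswith("#"):
            continue
        m = METRIC_RE.match(line)
        if m:
            name, raw_labels, value = m.groups()
            labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
            samples.append((name, labels, float(value)))
    return samples

for name, labels, value in parse_metrics(SAMPLE):
    print(f"{name} gpu={labels['gpu']} value={value}")
```

In practice Prometheus does this scraping and parsing itself; the sketch only shows why the format is easy for monitoring tools to consume.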
+ +************************* +Quick Links +************************* + +* `DCGM Exporter GitHub repository `_ +* `DCGM Documentation `_ +* `DCGM Exporter Documentation `_ +* `Integrating GPU Telemetry into Kubernetes `_ +* `NVIDIA GPU Operator Documentation `_ diff --git a/gpu-telemetry/dcgm-exporter.rst b/gpu-telemetry/dcgm-exporter.rst index d6a08bcb6..d6d936ef6 100644 --- a/gpu-telemetry/dcgm-exporter.rst +++ b/gpu-telemetry/dcgm-exporter.rst @@ -28,7 +28,7 @@ Introduction ************ `DCGM-Exporter `__ is a tool based on the -Go APIs to `NVIDIA DCGM `__ that allows users to gather +Go APIs to `NVIDIA DCGM `__ that allows you to gather GPU metrics and understand workload behavior or monitor GPUs in clusters. DCGM Exporter is written in Go and exposes GPU metrics at an HTTP endpoint (``/metrics``) for monitoring solutions such as Prometheus. @@ -91,31 +91,28 @@ DCGM-Exporter Customization DCGM-Exporter has various options for adjusting its default behavior. Each option supports both a command-line flag and environment variable. -=================================== ==================== ============================================= -Environment Variable Command-Line Flag Value -=================================== ==================== ============================================= -``$DCGM_EXPORTER_COLLECTORS`` ``-f`` File Path -Path to file containing DCGM fields to collect. Default: "/etc/dcgm-exporter/default-counters.csv" --------------------------------------------------------------------------------------------------------- -``$DCGM_EXPORTER_LISTEN`` ``-a`` Address -Address of listening http server. Default: ":9400" --------------------------------------------------------------------------------------------------------- -``$DCGM_EXPORTER_INTERVAL`` ``-c`` Interval -Interval of time at which point metrics are collected. Unit is milliseconds. 
Default:30000
--------------------------------------------------------------------------------------------------------
-``$DCGM_EXPORTER_KUBERNETES`` ``-k`` Boolean
-Enable kubernetes mapping metrics to kubernetes pods. Default: false
--------------------------------------------------------------------------------------------------------
-``$DCGM_EXPORTER_CONFIGMAP_DATA`` ``-m`` Namespace:Name
-ConfigMap namespace and name containing DCGM fields to collect. Default: "none"
--------------------------------------------------------------------------------------------------------
-``$DCGM_REMOTE_HOSTENGINE_INFO`` ``-r`` Host:Port
-Connect to remote hostengine at Host:Port. Default: NA (dcgm-exporter will started in embedded mode)
--------------------------------------------------------------------------------------------------------
-``$DCGM_EXPORTER_DEVICES_STR`` ``-d`` Device String (see following note)
-Specify which devices to monitor. Default: all GPU instances in MIG mode, all GPUs if MIG disabled.
--------------------------------------------------------------------------------------------------------
-=================================== ==================== =============================================
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
+| Environment Variable              | Description                                                      | Command-Line Flag | Value                |
++===================================+==================================================================+===================+======================+
+| ``$DCGM_EXPORTER_COLLECTORS``     | Path to the file containing DCGM fields to collect. Default:     | ``-f``            | File Path            |
+|                                   | "/etc/dcgm-exporter/default-counters.csv"                        |                   |                      |
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
+| ``$DCGM_EXPORTER_LISTEN``         | Address of the listening HTTP server. Default: ":9400"           | ``-a``            | Address              |
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
+| ``$DCGM_EXPORTER_INTERVAL``       | Interval at which metrics are collected, in milliseconds.        | ``-c``            | Interval             |
+|                                   | Default: 30000                                                   |                   |                      |
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
+| ``$DCGM_EXPORTER_KUBERNETES``     | Enable mapping of metrics to Kubernetes pods. Default: false     | ``-k``            | Boolean              |
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
+| ``$DCGM_EXPORTER_CONFIGMAP_DATA`` | ConfigMap namespace and name containing DCGM fields to collect.  | ``-m``            | Namespace:Name       |
+|                                   | Default: "none"                                                  |                   |                      |
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
+| ``$DCGM_REMOTE_HOSTENGINE_INFO``  | Connect to a remote hostengine at Host:Port. Default: NA         | ``-r``            | Host:Port            |
+|                                   | (dcgm-exporter will start in embedded mode)                      |                   |                      |
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
+| ``$DCGM_EXPORTER_DEVICES_STR``    | Specify which devices to monitor. Default: all GPU instances in  | ``-d``            | Device String        |
+|                                   | MIG mode, all GPUs if MIG is disabled.                           |                   | (see following note) |
++-----------------------------------+------------------------------------------------------------------+-------------------+----------------------+
 
 .. note::
 
    Device String Syntax: ``[f] | [g[:id1[,-id2]]] | [i[:id1[,-id2]]]``
diff --git a/gpu-telemetry/integrating-telemetry-kubernetes.rst b/gpu-telemetry/integrating-telemetry-kubernetes.rst
index f8bbeecc9..45e25ce4b 100644
--- a/gpu-telemetry/integrating-telemetry-kubernetes.rst
+++ b/gpu-telemetry/integrating-telemetry-kubernetes.rst
@@ -20,22 +20,6 @@
 Integrating GPU Telemetry into Kubernetes
 #########################################
 
-.. contents::
-   :depth: 5
-   :local:
-   :backlinks: none
-
-
-*************************
-Benefits of GPU Telemetry
-*************************
-
-Understanding GPU usage provides important insights for IT administrators managing a data center.
-Trends in GPU metrics correlate with workload behavior and make it possible to optimize resource allocation,
-diagnose anomalies, and increase overall data center efficiency. As GPUs become more mainstream in
-Kubernetes environments, users would like to get access to GPU metrics to monitor GPU resources, just
-like they do today for CPUs.
-
 The purpose of this document is to enumerate an end-to-end (e2e) workflow for setting up and using
 `DCGM `_ within a Kubernetes environment.
@@ -44,6 +28,8 @@
 a native installation of the NVIDIA drivers on the GPU enabled nodes (i.e. neither
 the `NVIDIA GPU Operator `_ nor containerized
 drivers are used in this document).
 
+
+
 **************
 NVIDIA Drivers
 **************
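The device-string grammar shown in the note for the ``-d`` option can be sketched as a small validator. This is a hypothetical helper, not DCGM Exporter's actual parser, and it reflects one reading of the grammar (an optional colon after ``g``/``i`` introduces a comma-separated id list, where a leading ``-`` closes a range); it checks surface syntax only:

```python
import re

# One reading of the device-string grammar from the note above:
#   [f] | [g[:id1[,-id2]]] | [i[:id1[,-id2]]]
# 'f' selects everything, 'g' selects GPUs, 'i' selects GPU instances; an
# optional ':' introduces a comma-separated id list, where an id written as
# '-N' closes a range (e.g. "g:0,-3"). This is an assumption about the
# grammar's intent and only checks surface syntax, not whether ids exist.
DEVICE_STR_RE = re.compile(r"^(f|[gi](:\d+(,-?\d+)*)?)$")

def is_valid_device_string(s: str) -> bool:
    """Report whether s matches the sketched -d device-string grammar."""
    return DEVICE_STR_RE.match(s) is not None

for example in ("f", "g:0", "i:0,-3", "g:", "f:0"):
    print(example, "->", is_valid_device_string(example))
```

A validator like this would reject malformed strings such as ``g:`` or ``f:0`` before they reach the exporter; the exporter's own parser remains the authority on what is accepted.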