Monitor Your Computing System with Prometheus, Grafana, Alertmanager, and Nvidia DCGM

Antonio Esteves
16 min read · May 23, 2022

This post presents a compilation of the steps, commands, and configurations needed to set up a monitoring solution on Ubuntu (most steps also work on other Linux distributions) centered on Prometheus and Grafana, complemented with a few add-ons such as the Pushgateway, the Alertmanager, and the Nvidia Data Center GPU Manager.

Please note:

  • Parts of the text that complement the code, and some figures, are taken from the references listed at the end of this article.

1. Grafana: visualization

Grafana is an open source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics, logs, and traces. It provides you with tools to turn your time-series database data into insightful graphs and visualizations.

Download and install Grafana (OSS version)

sudo apt install -y apt-transport-https
sudo apt install -y software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key |
sudo apt-key add -

Add the Grafana repository for stable releases

echo "deb https://packages.grafana.com/oss/deb stable main" |
sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install grafana

Start the Grafana server

sudo systemctl start grafana-server
sudo systemctl status grafana-server

Log in to Grafana for the first time

Open your web browser and go to http://localhost:3000. The default HTTP port that Grafana listens to is 3000, unless you have configured a different port. On the login page, enter admin for username and password. Click Log in. If login is successful, then you will see a prompt to change the password.
Click OK on the prompt, then change your password.
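
If you prefer to verify from the command line, Grafana's HTTP API exposes a health endpoint; a minimal check (assuming the default port 3000) looks like this:

# returns a small JSON document with the database status
curl -s http://localhost:3000/api/health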

Using the Grafana command-line interface (CLI)

grafana-cli admin --help
grafana-cli admin reset-admin-password <NEW-PASS>

Install a few plugins in Grafana:

sudo grafana-cli plugins install mtanda-histogram-panel
sudo grafana-cli plugins install marcusolsson-csv-datasource
sudo grafana-cli plugins install simpod-json-datasource
sudo grafana-cli plugins install grafana-worldmap-panel
sudo grafana-cli plugins install ae3e-plotly-panel
sudo grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource
...
sudo systemctl restart grafana-server
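
To confirm which plugins ended up installed, grafana-cli can list them:

# list the plugins currently installed on this Grafana instance
sudo grafana-cli plugins ls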

2. Prometheus

Prometheus is an open-source solution for monitoring and alerting. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

Prometheus’ main features are:

  • a multi-dimensional data model with time series data identified by metric name and key/value pairs
  • PromQL, a query language to leverage the multi-dimensional aspect
  • it does not rely on distributed storage, so single server nodes are autonomous
  • metrics are collected via a pull model over HTTP
  • pushing metrics to the server is supported via a gateway
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding are supported.

The prometheus ecosystem consists of multiple components, many of which are optional:

  • the main prometheus server which scrapes and stores time series data
  • client libraries for instrumenting application code
  • a push gateway for supporting short-lived jobs
  • special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
  • an alertmanager to handle alerts
  • various support tools.

The next figure illustrates the architecture of prometheus and some of its ecosystem components:

Source: https://prometheus.io/docs/introduction/overview/

Download prometheus

wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar xvfz prometheus-2.35.0.linux-amd64.tar.gz
mv prometheus-2.35.0.linux-amd64/ prometheus-2.35.0/
sudo mv prometheus-2.35.0/ /opt

Start prometheus

By default, Prometheus stores its database in ./data (flag --storage.tsdb.path). Add the command-line flag --web.listen-address=localhost:9010 to avoid a conflict with the default port 9090.

/opt/prometheus-2.35.0/prometheus \
--config.file=/opt/prometheus-2.35.0/prometheus.yml \
--web.listen-address=localhost:9010 &
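
A quick way to confirm the server came up is to hit its health and readiness endpoints (a minimal sketch, assuming the non-default port 9010 chosen above):

# both endpoints return a short text message when Prometheus is up
curl http://localhost:9010/-/healthy
curl http://localhost:9010/-/ready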

To run prometheus as a system service, create a file:

sudo nano /etc/systemd/system/prometheus.service

with content like this one:

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
User=root
Restart=on-failure
# Adjust the next line with the correct path to prometheus
ExecStart=/opt/prometheus-2.35.0/prometheus \
--config.file=/opt/prometheus-2.35.0/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=localhost:9010
[Install]
WantedBy=multi-user.target
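
Note that the unit file above references paths that do not exist after a plain tarball install. A possible preparation step (an assumption based on the flags used above; adjust the paths if you keep the data elsewhere) is:

# data directory referenced by --storage.tsdb.path
sudo mkdir -p /var/lib/prometheus
# console templates ship in the release tarball; copy them to the
# locations referenced by the --web.console.* flags
sudo mkdir -p /etc/prometheus
sudo cp -r /opt/prometheus-2.35.0/consoles /opt/prometheus-2.35.0/console_libraries /etc/prometheus/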

Reload the systemctl daemon:

sudo systemctl daemon-reload

Start and check the status of the prometheus service:

sudo systemctl start prometheus
sudo systemctl status prometheus

Enable prometheus service to start automatically after the system boots:

sudo systemctl enable prometheus

3. Prometheus node_exporter

The prometheus node_exporter exports hardware and Operating System metrics exposed by the Linux kernel.

cd /opt
sudo wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
sudo tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
sudo mv node_exporter-1.3.1.linux-amd64/ node_exporter-1.3.1/

Start the node_exporter:

/opt/node_exporter-1.3.1/node_exporter \
--web.listen-address localhost:9120 &
curl http://localhost:9120/metrics

Configuring prometheus to scrape metrics from node_exporter

Your locally running Prometheus instance needs to be properly configured in order to access Node Exporter metrics. The following prometheus.yml configuration fragment tells the Prometheus instance to scrape the Node Exporter via localhost:9120, and how frequently to do so.

nano /opt/prometheus-2.35.0/prometheus.yml

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9120']
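
Before restarting, it is worth validating the edited configuration with promtool, which ships in the Prometheus tarball (a minimal sketch using the paths from this guide):

# exits with a non-zero status if prometheus.yml is invalid
/opt/prometheus-2.35.0/promtool check config /opt/prometheus-2.35.0/prometheus.yml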

Now we should restart prometheus and query metrics with a name starting with node_, such as node_memory_MemAvailable_bytes or node_exporter_build_info.

Query metrics exposed by node_exporter on prometheus.
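
The same queries can also be issued from the command line against the Prometheus HTTP API (a minimal sketch, assuming Prometheus listens on port 9010 as configured above):

# instant query for a node_exporter metric; the result is returned as JSON
curl 'http://localhost:9010/api/v1/query?query=node_memory_MemAvailable_bytes'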

We should also install a Grafana dashboard to visualize node_exporter metrics, for example by importing into Grafana the dashboard with ID 1860. This gives us access to a large set of metrics displayed on panels.

Grafana dashboard to display metrics exposed by node_exporter.

Configure a system service for node_exporter

1. Create a user for node_exporter:

sudo useradd --no-create-home --shell /bin/false nodeusr

2. Create a node_exporter service file under /etc/systemd/system

sudo nano /etc/systemd/system/node_exporter.service

3. Add the following content to the service file:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/opt/node_exporter-1.3.1/node_exporter \
--web.listen-address localhost:9120
[Install]
WantedBy=multi-user.target

4. Reload the systemd daemon, start the node_exporter service and enable it to start on system boot

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl status node_exporter
sudo systemctl enable node_exporter

4. Prometheus Pushgateway

The Prometheus Pushgateway allows ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.

Install the Pushgateway

Get the latest version of pushgateway from prometheus.io, then download and extract it:

wget https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz
tar -xvf pushgateway-1.4.2.linux-amd64.tar.gz

Create the pushgateway user:

sudo useradd --no-create-home --shell /bin/false pushgateway

Move the binary to its final destination and update the permissions to the user that we created:

sudo mv pushgateway-1.4.2.linux-amd64 /opt/pushgateway-1.4.2
sudo chown -R pushgateway:pushgateway /opt/pushgateway-1.4.2

Create a link to the pushgateway executable in a folder with common binary utilities, for example /opt/bin:

sudo ln -s /opt/pushgateway-1.4.2/pushgateway /opt/bin/pushgateway

Create the “systemd” unit file:

sudo tee /etc/systemd/system/pushgateway.service > /dev/null << 'EOF'
[Unit]
Description=Pushgateway
Wants=network-online.target
After=network-online.target
[Service]
User=pushgateway
Group=pushgateway
Type=simple
ExecStart=/opt/pushgateway-1.4.2/pushgateway \
--web.listen-address=":9091" \
--web.telemetry-path="/metrics" \
--persistence.file="/tmp/metric.store" \
--persistence.interval=5m \
--log.level=info \
--log.format=json
[Install]
WantedBy=multi-user.target
EOF

Reload systemd and restart the pushgateway service:

sudo systemctl daemon-reload
sudo systemctl restart pushgateway

Enable pushgateway service to start automatically after system boots:

sudo systemctl enable pushgateway

or start the pushgateway manually:

/opt/pushgateway-1.4.2/pushgateway \
--web.listen-address=":9011" \
--web.telemetry-path="/metrics" \
--persistence.file="/tmp/metric.store" \
--persistence.interval=5m \
--log.level=info --log.format=json

Ensure that pushgateway has been started:

systemctl status pushgateway

Configure Prometheus

Now we want to configure prometheus to scrape the pushgateway for metrics; the scraped metrics will then be ingested into prometheus’ time series database. For example, assuming we have prometheus, node_exporter and pushgateway on the same node, the complete prometheus configuration is presented next. The pushgateway configuration is the last section in the following YAML:

nano /opt/prometheus-2.35.0/prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9010']

  - job_name: 'node_exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9120']

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9011']

Restart prometheus:

sudo systemctl restart prometheus

or kill the prometheus process and rerun it.

Push metrics to pushgateway

First we will look at a bash example to push metrics to pushgateway:

echo "cpu_utilization 20.25" | sudo curl --data-binary @-
http://localhost:9011/metrics/job/my_custom_metrics/instance/10.20.0.1:9000/provider/hetzner

Have a look at pushgateway’s metrics endpoint:

sudo curl -L http://localhost:9011/metrics/

The output is a list of metrics and values such as:

# TYPE cpu_utilization untyped
cpu_utilization{instance="10.20.0.1:9000",job="my_custom_metrics",
provider="hetzner"} 20.25
...
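
Several metrics can be pushed in a single request, optionally preceded by TYPE hints. The following is a minimal sketch using the same grouping labels as above (memory_utilization is just an illustrative metric name):

# push two gauge metrics to the same job/instance/provider group
cat <<EOF | curl --data-binary @- http://localhost:9011/metrics/job/my_custom_metrics/instance/10.20.0.1:9000/provider/hetzner
# TYPE cpu_utilization gauge
cpu_utilization 20.25
# TYPE memory_utilization gauge
memory_utilization 63.10
EOF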

Let’s look at a Python example of how we can push metrics to pushgateway:

import requests

job_name = 'my_custom_metrics'
instance_name = '10.20.0.1:9000'
provider = 'hetzner'
payload_key = 'cpu_utilization'
payload_value = '21.90'

response = requests.post(
    'http://localhost:9011/metrics/job/{j}/instance/{i}/provider/{p}'.format(
        j=job_name, i=instance_name, p=provider),
    data='{k} {v}\n'.format(k=payload_key, v=payload_value)
)
print(response.status_code)

With this method, you can push any custom metrics (bash, lambda function, etc.) to pushgateway and allow prometheus to consume that data into its time series database.
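
Pushed metrics stay in the Pushgateway until they are overwritten or deleted. A group of metrics can be removed with an HTTP DELETE on the same grouping path used to push them (a sketch, using the labels from the examples above):

# delete all metrics pushed under this job/instance/provider group
curl -X DELETE http://localhost:9011/metrics/job/my_custom_metrics/instance/10.20.0.1:9000/provider/hetzner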

5. Alertmanager

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.

Source: https://devconnected.com/alertmanager-and-prometheus-complete-setup-on-linux/

The core concepts of the Alertmanager are:

  • Grouping categorizes alerts of similar nature into a single notification;
  • Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing;
  • Silences are a straightforward way to simply mute alerts for a given time. A silence is configured based on matchers, just like the routing tree;
  • Routes are a set of paths that alerts take in order to determine which action should be associated with the alert. In short, you associate a route with a receiver. The initial route, also called the “root route”, is a route that matches every single alert sent to the AlertManager. A route can have siblings and children that are also routes themselves. This way, routes can be nested any number of times, each level defining a new action (or receiver) for the alert. Each route defines receivers. Those receivers are the alert recipients: Slack, a mail service, PagerDuty.
Source: https://devconnected.com/alertmanager-and-prometheus-complete-setup-on-linux/

Alertmanager is configured via command-line flags and a configuration file. While the command-line flags configure immutable system parameters, the configuration file defines inhibition rules, notification routing, and notification receivers. To specify which configuration file to load, use the --config.file flag:

./alertmanager --config.file=alertmanager.yml

Next is an example YAML configuration that covers the most relevant aspects of the configuration format.

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'

# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry
  # point for all alerts. It needs to have a receiver configured
  # so alerts that do not match any of the sub-routes are sent
  # to someone.
  receiver: 'team-X-mails'

  # The labels by which incoming alerts are grouped together. For
  # example, multiple alerts coming in for cluster=A and
  # alertname=LatencyHigh would be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label
  # name. This effectively disables aggregation entirely, passing
  # through all alerts as-is. This is unlikely to be what you want,
  # unless you have a very low alert volume or your upstream
  # notification system performs its own grouping. Example:
  # group_by: [...]
  group_by: ['alertname', 'cluster']

  # When a new group of alerts is created by an incoming alert,
  # wait at least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same
  # group that start firing shortly after another are batched
  # together on the first notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval'
  # to send a batch of new alerts that started firing for that
  # group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval'
  # to resend them.
  repeat_interval: 3h

  # All the above attributes are inherited by all child routes
  # and can be overwritten on each.

  # The child route trees.
  routes:
    # This route performs a regular expression match on alert
    # labels to catch alerts that are related to a list of services.
    - match_re:
        service: ^(foo1|foo2|baz)$
      receiver: team-X-mails
      # The service has a sub-route for critical alerts, any alerts
      # that do not match, i.e. severity != critical, fall back to the
      # parent node and are sent to 'team-X-mails'
      routes:
        - match:
            severity: critical
          receiver: team-X-pager

    - match:
        service: files
      receiver: team-Y-mails
      routes:
        - match:
            severity: critical
          receiver: team-Y-pager

    # This route handles all alerts coming from a database service.
    # If there's no team to handle it, it defaults to the DB team.
    - match:
        service: database
      receiver: team-DB-pager
      # Also group alerts by affected database.
      group_by: [alertname, cluster, database]
      routes:
        - match:
            owner: team-X
          receiver: team-X-pager
        - match:
            owner: team-Y
          receiver: team-Y-pager

# Inhibition rules allow to mute a set of alerts given that
# another alert is firing.
# We use this to mute any warning-level notifications if the
# same alert is already critical.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # Apply inhibition if the alertname is the same.
    # CAUTION:
    #   If all label names listed in 'equal' are missing
    #   from both the source and target alerts,
    #   the inhibition rule will apply!
    equal: ['alertname']

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'

  - name: 'team-X-pager'
    email_configs:
      - to: 'team-X+alerts-critical@example.org'
    pagerduty_configs:
      - routing_key: <team-X-key>

  - name: 'team-Y-mails'
    email_configs:
      - to: 'team-Y+alerts@example.org'

  - name: 'team-Y-pager'
    pagerduty_configs:
      - routing_key: <team-Y-key>

  - name: 'team-DB-pager'
    pagerduty_configs:
      - routing_key: <team-DB-key>

Prometheus can be used as a client of Alertmanager, sending alerts to it, as shown in the next figure:

Source: https://medium.com/devops-dudes/prometheus-alerting-with-alertmanager-e1bbba8e6a8e

Managing alerts with Alertmanager and Prometheus can be set up through the following steps:

  1. Setup and configure AlertManager.
  2. Alter the Prometheus config file so it can talk to the AlertManager.
  3. Define alert rules in Prometheus configuration.
  4. Define alert mechanism in AlertManager to send alerts via slack and/or e-mail.

Alert rules are defined in the Prometheus configuration. Prometheus just scrapes metrics from its client applications, such as the node_exporter. However, if any alert condition is met, Prometheus sends the alert to the AlertManager, which manages it through its pipeline of silencing, inhibition, grouping, and sending out notifications. Silencing mutes alerts for a given time. Inhibition suppresses notifications for certain alerts if other alerts have already fired. Grouping groups alerts of a similar nature into a single notification, which prevents sending multiple notifications simultaneously.

Download the latest Alertmanager release

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.24.0.linux-amd64.tar.gz
mv alertmanager-0.24.0.linux-amd64/ alertmanager-0.24.0/
sudo mv alertmanager-0.24.0/ /opt
cd /opt/bin
sudo ln -s /opt/alertmanager-0.24.0/alertmanager alertmanager
sudo ln -s /opt/alertmanager-0.24.0/amtool amtool

Start AlertManager as a service

Create a data folder at the root directory, with a prometheus folder inside.

sudo mkdir -p /data/alertmanager/prometheus

Create the alertmanager user:

sudo useradd --no-create-home --shell /bin/false alertmanager

Give permissions to your newly created user for the AlertManager binaries:

sudo chown alertmanager:alertmanager /opt/alertmanager-0.24.0/amtool /opt/alertmanager-0.24.0/alertmanager

Give the correct permissions to those folders recursively:

sudo chown -R alertmanager:alertmanager /data/alertmanager

Create the service file:

sudo nano /etc/systemd/system/alertmanager.service

[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/opt/alertmanager-0.24.0/alertmanager \
--config.file=/opt/alertmanager-0.24.0/alertmanager.yml \
--storage.path=/data/alertmanager
Restart=always

[Install]
WantedBy=multi-user.target

Enable the service and start it:

sudo systemctl enable alertmanager
sudo systemctl start alertmanager
systemctl status alertmanager

Let us verify that AlertManager is running by opening the browser on the default port 9093: http://localhost:9093.
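
Alternatively, the health endpoint can be queried from the command line (a quick sketch, assuming the default port):

# returns a short confirmation message when the Alertmanager process is healthy
curl http://localhost:9093/-/healthy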

Bind Prometheus to AlertManager

We need to modify the Prometheus configuration file by adding the following content:

nano /opt/prometheus-2.35.0/prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

Optionally, we can also add the Alertmanager metrics endpoint to be scraped by prometheus:

scrape_configs:
  ...
  - job_name: alertmanager
    static_configs:
      - targets: ['localhost:9093']

Restart prometheus service:

sudo systemctl restart prometheus

Add a rules file to the prometheus configuration file:

nano /opt/prometheus-2.35.0/prometheus.yml

...
# Load rules once and periodically evaluate them according
# to the global 'evaluation_interval'
rule_files:
  - "/opt/prometheus-2.35.0/alert_rules.yml"

We should now define the prometheus alert rules and add them to the “alert_rules.yml” file. An example rules configuration file is presented next:

nano /opt/prometheus-2.35.0/alert_rules.yml

groups:
- name: alert_rules
  rules:
  - alert: high_cpu_load
    expr: node_load1 > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      description: Host is under high load, the avg load 1m is
        at {{ $value }}. Reported by instance {{ $labels.instance }}
        of job {{ $labels.job }}.
      summary: Server is under high load
      type: Server
  - alert: high_memory_load
    expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes
      + node_memory_Buffers_bytes + node_memory_Cached_bytes)) /
      sum(node_memory_MemTotal_bytes) * 100 > 85
    for: 30s
    labels:
      severity: warning
    annotations:
      description: Host memory usage is {{ humanize $value }}%.
        Reported by instance {{ $labels.instance }} of job
        {{ $labels.job }}.
      summary: Server memory is almost full
      type: Server
  - alert: high_storage_load
    expr: (node_filesystem_size_bytes{fstype="aufs"} -
      node_filesystem_free_bytes{fstype="aufs"}) /
      node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
    for: 30s
    labels:
      severity: warning
    annotations:
      description: Host storage usage is {{ humanize $value }}%.
        Reported by instance {{ $labels.instance }} of job
        {{ $labels.job }}.
      summary: Server storage is almost full
      type: Server

We can check whether the alert rules file is syntactically correct using the promtool utility:

promtool check rules alert_rules.yml

We now have access to the configured alert rules in the prometheus web UI:

Configured alert rules viewed on Prometheus UI.
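
The same information is available through the Prometheus HTTP API (a sketch, assuming the port 9010 used in this guide):

# list the loaded alerting/recording rules and their state
curl http://localhost:9010/api/v1/rules
# list alerts that are currently pending or firing
curl http://localhost:9010/api/v1/alerts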

Configure AlertManager

cd /opt/alertmanager-0.24.0/
# backup original config file
sudo mv alertmanager.yml alertmanager-bak.yml

Create a config file:

sudo nano alertmanager.yml

route:
  # When a new group of alerts is created by an incoming alert, wait
  # at least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group
  # that start firing shortly after another are batched together on
  # the first notification.
  group_wait: 30s
  # When the first notification was sent, wait 'group_interval' to
  # send a batch of new alerts that started firing for that group.
  group_interval: 5m
  # If an alert has successfully been sent, wait 'repeat_interval'
  # to resend them.
  repeat_interval: 1h
  # A default receiver
  receiver: "web.hook"
  # All the above attributes are inherited by all child routes and
  # can be overwritten on each.
  routes:
    - receiver: "email-me"
      group_wait: 20s
      match_re:
        severity: critical
      continue: true
    - receiver: "web.hook"
      group_wait: 10s
      match_re:
        severity: critical|warning
      continue: true

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'email-me'
    email_configs:
      - to: 'your_email_id@gmail.com'
        from: 'your_email_id@gmail.com'
        smarthost: smtp.gmail.com:587
        auth_username: 'your_email_id@gmail.com'
        auth_identity: 'your_email_id@gmail.com'
        auth_password: 'email_password'

Check your configuration with the supplied amtool:

/opt/alertmanager-0.24.0/amtool check-config /opt/alertmanager-0.24.0/alertmanager.yml

To view the alertmanager configuration being used:

/opt/alertmanager-0.24.0/amtool config
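
amtool can also query and silence alerts at runtime; for example (a minimal sketch, assuming Alertmanager listens on its default port 9093):

# list the alerts currently held by Alertmanager
/opt/alertmanager-0.24.0/amtool --alertmanager.url=http://localhost:9093 alert query
# silence the high_cpu_load alert for two hours
/opt/alertmanager-0.24.0/amtool --alertmanager.url=http://localhost:9093 \
  silence add alertname=high_cpu_load --comment="maintenance window" --duration=2h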

You can also install the Prometheus Alertmanager Plugin in Grafana:

sudo grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource
sudo systemctl restart grafana-server

At this point we can check all the targets that were configured in prometheus: prometheus itself, pushgateway, node_exporter, and alertmanager.

List of configured targets on prometheus.
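
Besides the web UI, the configured targets and their scrape health can be retrieved from the HTTP API (a sketch, with the port used throughout this guide):

# shows every target, its labels and its last scrape status
curl http://localhost:9010/api/v1/targets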

6. Monitor Nvidia GPUs with Prometheus and Grafana

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM also integrates into the Kubernetes ecosystem using DCGM-Exporter to provide rich GPU telemetry in containerized environments [13].

Install Nvidia Data Center GPU Manager (DCGM)

First we must set up the CUDA repository GPG key. For example, on Ubuntu 22.04 and an x86_64 architecture we run:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"

If the previous command reports an error that the public key NO_PUBKEY A4B469963BF863CC is not available, install that key:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
--recv-keys A4B469963BF863CC

Now we can install DCGM:

sudo apt update && sudo apt install -y datacenter-gpu-manager

In CentOS 8, or RHEL 8, DCGM installation is as follows:

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf clean expire-cache && sudo dnf install -y datacenter-gpu-manager

We can optionally enable the automatic start of the DCGM service after the system boots:

sudo systemctl enable nvidia-dcgm
sudo systemctl start nvidia-dcgm

The installation can be checked with the dcgmi utility:

sudo nv-hostengine
dcgmi discovery -l

If the previous command succeeds, the output is similar to:

1 GPU found.
+--------+--------------------------------------------------------+
| GPU ID | Device Information |
+--------+--------------------------------------------------------+
| 0 | Name: NVIDIA GeForce RTX YYYY |
| | PCI Bus ID: 00000000:01:00.0 |
| | Device UUID: GPU-xxxxxxxx-yyyy-dddd-nnnn-zzzzzzzzzzzz |
+--------+--------------------------------------------------------+

Compile and install the DCGM exporter for Prometheus

DCGM exporter exposes GPU metrics to prometheus, leveraging Nvidia DCGM. If the Go compiler is not installed, we must install it first:

wget https://go.dev/dl/go1.18.2.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
sudo tar -C /usr/local -xzf go1.18.2.linux-amd64.tar.gz
go version

Clone dcgm-exporter github repository and compile the code:

git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary
sudo PATH=$PATH make install
sudo chmod 755 /usr/bin/dcgm-exporter

To monitor all GPUs run:

sudo dcgm-exporter &

And to monitor GPU 1 only:

sudo dcgm-exporter -d g:1 &

Test the DCGM_exporter for prometheus:

curl localhost:9400/metrics

The output is a list of metrics and the respective values, such as in the next example:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-xxxx-yyy-ddd-nnn-zzzzz",
device="nvidia1",modelName="NVIDIA GeForce RTX YYYY",
Hostname="node2"} 300
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-xxxx-yyy-ddd-nnn-zzzzz",
device="nvidia1",modelName="NVIDIA GeForce RTX YYYY",
Hostname="node2"} 34
...

Integrate DCGM exporter with prometheus

Add the following scrape configuration to the prometheus config file, in order to define one endpoint (per host with GPUs) to be scraped:

scrape_configs:
  # The job name is added as a label 'job=<job_name>' to any
  # timeseries scraped from this config
  - job_name: 'dcgm'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'
    static_configs:
      # scrape metrics from GPUs on two hosts, "node1" and "node2"
      # in this example
      - targets: ['node1:9400', 'node2:9400']

Once the Prometheus configuration file has been updated, restart prometheus service:

sudo systemctl restart prometheus

We can check that prometheus is now scraping metrics from the GPUs, via DCGM_exporter, by inspecting the list of targets and by querying metrics that start with DCGM_ on the Graph tab of the prometheus web UI.
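
For example, a metric already shown above can be queried directly through the Prometheus HTTP API (a minimal sketch, assuming the port 9010 used in this guide):

# current GPU temperature reported by dcgm-exporter on every scraped host
curl 'http://localhost:9010/api/v1/query?query=DCGM_FI_DEV_GPU_TEMP'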

References

  1. https://grafana.com/oss/grafana/
  2. https://grafana.com/go/webinar/getting-started-with-grafana/
  3. https://prometheus.io/docs/introduction/overview/
  4. https://github.com/prometheus
  5. https://prometheus.io/docs/guides/node-exporter/
  6. https://github.com/prometheus/pushgateway/
  7. https://prometheus.io/docs/alerting/latest/alertmanager/
  8. https://medium.com/devops-dudes/prometheus-alerting-with-alertmanager-e1bbba8e6a8e
  9. https://devconnected.com/alertmanager-and-prometheus-complete-setup-on-linux/
  10. https://grafana.com/grafana/plugins/camptocamp-prometheus-alertmanager-datasource/
  11. https://kifarunix.com/configure-prometheus-email-alerting-with-alertmanager/
  12. https://medium.com/techno101/how-to-send-a-mail-using-prometheus-alertmanager-7e880a3676db
  13. https://developer.nvidia.com/dcgm
  14. https://github.com/NVIDIA/dcgm-exporter
  15. https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html
