Monitor Your Computing System with Prometheus, Grafana, Alertmanager, and Nvidia DCGM
This post presents a compilation of the steps, commands, and configurations needed to set up a monitoring solution on Ubuntu (most steps also work on other Linux distributions), centered on Prometheus and Grafana and complemented with a few add-ons, such as the Pushgateway, the Alertmanager, and the Nvidia Data Center GPU Manager.
Please note:
- Parts of the text that complement the code, and some figures, come from the references listed at the end of this article.
1. Grafana: visualization
Grafana is an open source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics, logs, and traces. It provides you with tools to turn your time-series database data into insightful graphs and visualizations.
Download and install Grafana (OSS version)
sudo apt install -y apt-transport-https
sudo apt install -y software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
Add the Grafana repository for stable releases
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install grafana
Start the Grafana server
sudo systemctl start grafana-server
sudo systemctl status grafana-server
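Optionally, you can also confirm from the command line that Grafana is answering HTTP requests. A minimal check, assuming the default port 3000, uses Grafana's health endpoint, which returns a small JSON status:
curl -s http://localhost:3000/api/health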
Log in to Grafana for the first time
Open your web browser and go to http://localhost:3000. The default HTTP port that Grafana listens to is 3000, unless you have configured a different port. On the login page, enter admin for both the username and the password, and click Log in. If login is successful, you will see a prompt to change the password. Click OK on the prompt, then change your password.
Using the Grafana command-line interface (CLI)
grafana-cli admin --help
grafana-cli admin reset-admin-password <NEW-PASS>
Install a few plugins in Grafana:
sudo grafana-cli plugins install mtanda-histogram-panel
sudo grafana-cli plugins install marcusolsson-csv-datasource
sudo grafana-cli plugins install simpod-json-datasource
sudo grafana-cli plugins install grafana-worldmap-panel
sudo grafana-cli plugins install ae3e-plotly-panel
sudo grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource
...
sudo systemctl restart grafana-server
2. Prometheus
Prometheus is an open-source solution for monitoring and alerting. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
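As a small illustration of this data model, each time series is identified by a metric name plus labels, for example up{job="node_exporter", instance="localhost:9120"}. Once the server installed below is running, such series can be queried with PromQL through the HTTP API; a minimal sketch, assuming the non-default listen address localhost:9010 used later in this post:
# instant query: current value of the synthetic 'up' series for one job
curl -s 'http://localhost:9010/api/v1/query' --data-urlencode 'query=up{job="node_exporter"}'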
Prometheus' main features are:
- a multi-dimensional data model with time series data identified by metric name and key/value pairs
- PromQL, a query language to leverage the multi-dimensional aspect
- it does not rely on distributed storage, and so single server nodes are autonomous
- metrics are collected via a pull model over HTTP
- pushing metrics to the server is supported via a gateway
- targets are discovered via service discovery or static configuration
- multiple modes of graphing and dashboarding are supported.
The prometheus ecosystem consists of multiple components, many of which are optional:
- the main prometheus server which scrapes and stores time series data
- client libraries for instrumenting application code
- a push gateway for supporting short-lived jobs
- special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
- an alertmanager to handle alerts
- various support tools.
The next figure illustrates the architecture of prometheus and some of its ecosystem components:
Download prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar xvfz prometheus-2.35.0.linux-amd64.tar.gz
mv prometheus-2.35.0.linux-amd64/ prometheus-2.35.0/
sudo mv prometheus-2.35.0/ /opt
Start prometheus
By default, Prometheus stores its database in ./data (flag --storage.tsdb.path). Add the command-line flag --web.listen-address=:9010 to avoid a conflict on the default port 9090.
/opt/prometheus-2.35.0/prometheus \
--config.file=/opt/prometheus-2.35.0/prometheus.yml \
--web.listen-address=localhost:9010 &
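To quickly confirm that the server came up, Prometheus exposes simple health and readiness endpoints; a minimal check, assuming the listen address used above:
curl -s http://localhost:9010/-/healthy
curl -s http://localhost:9010/-/ready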
To run prometheus as a system service create a file:
sudo nano /etc/systemd/system/prometheus.service
with content like this one:
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
User=root
Restart=on-failure
# Adjust the next line with the correct path to prometheus
ExecStart=/opt/prometheus-2.35.0/prometheus \
--config.file=/opt/prometheus-2.35.0/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=localhost:9010

[Install]
WantedBy=multi-user.target
Reload the systemd daemon:
sudo systemctl daemon-reload
Start and check the status of the prometheus service:
sudo systemctl start prometheus
sudo systemctl status prometheus
Enable prometheus service to start automatically after the system boots:
sudo systemctl enable prometheus
3. Prometheus node_exporter
The prometheus node_exporter exports hardware and operating system metrics exposed by the Linux kernel.
cd /opt
sudo wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
sudo tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
sudo mv node_exporter-1.3.1.linux-amd64/ node_exporter-1.3.1/
Start the node_exporter:
/opt/node_exporter-1.3.1/node_exporter \
--web.listen-address localhost:9120 &

curl http://localhost:9120/metrics
Configuring prometheus to scrape metrics from node_exporter
Your locally running Prometheus instance needs to be properly configured in order to access Node Exporter metrics. The following prometheus.yml configuration fragment tells the Prometheus instance to scrape the Node Exporter via localhost:9120, and how frequently.
nano /opt/prometheus-2.35.0/prometheus.yml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9120']
Now we should restart prometheus and query metrics with a name starting with node_, such as node_memory_MemAvailable_bytes or node_exporter_build_info.
We should also install a Grafana dashboard to visualize node_exporter metrics, for example by importing in Grafana the dashboard with ID 1860. This allows us to access a huge list of metrics displayed on panels.
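Note that dashboard 1860 needs Prometheus configured as a Grafana data source. Besides adding it in the UI (Configuration > Data sources), this can be scripted against the Grafana HTTP API; a minimal sketch, assuming the admin password set earlier and Prometheus listening on localhost:9010:
curl -s -X POST http://admin:<YOUR-PASS>@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://localhost:9010", "access": "proxy", "isDefault": true}'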
Configure a system service for node_exporter
1. Create a user for node_exporter
sudo useradd --no-create-home --shell /bin/false nodeusr
2. Create a node_exporter service file under /etc/systemd/system
sudo nano /etc/systemd/system/node_exporter.service
3. Add the following content to the service file:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/opt/node_exporter-1.3.1/node_exporter \
--web.listen-address localhost:9120

[Install]
WantedBy=multi-user.target
4. Reload the system daemon, start the node_exporter service, and enable it to start on system boot:
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl status node_exporter
sudo systemctl enable node_exporter
4. Prometheus Pushgateway
The Prometheus Pushgateway allows ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.
Install the Pushgateway
Get the latest version of pushgateway from prometheus.io, then download and extract it:
wget https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz
tar -xvf pushgateway-1.4.2.linux-amd64.tar.gz
Create the pushgateway user:
sudo useradd --no-create-home --shell /bin/false pushgateway
Move the binary to its final destination and change its ownership to the user we created:
sudo mv pushgateway-1.4.2.linux-amd64 /opt/pushgateway-1.4.2
sudo chown -R pushgateway:pushgateway /opt/pushgateway-1.4.2
Create a link to the pushgateway executable in a folder with common binary utilities, for example /opt/bin:
sudo ln -s /opt/pushgateway-1.4.2/pushgateway /opt/bin/pushgateway
Create the "systemd" unit file:
sudo tee /etc/systemd/system/pushgateway.service > /dev/null << EOF
[Unit]
Description=Pushgateway
Wants=network-online.target
After=network-online.target

[Service]
User=pushgateway
Group=pushgateway
Type=simple
ExecStart=/opt/pushgateway-1.4.2/pushgateway \
--web.listen-address=":9011" \
--web.telemetry-path="/metrics" \
--persistence.file="/tmp/metric.store" \
--persistence.interval=5m \
--log.level=info \
--log.format=json

[Install]
WantedBy=multi-user.target
EOF
Reload systemd and restart the pushgateway service:
sudo systemctl daemon-reload
sudo systemctl restart pushgateway
Enable pushgateway service to start automatically after system boots:
sudo systemctl enable pushgateway
or start the pushgateway manually:
/opt/pushgateway-1.4.2/pushgateway \
--web.listen-address=":9011" \
--web.telemetry-path="/metrics" \
--persistence.file="/tmp/metric.store" \
--persistence.interval=5m \
--log.level=info --log.format=json
Ensure that pushgateway has been started:
systemctl status pushgateway
Configure Prometheus
Now we want to configure prometheus to scrape the pushgateway for metrics; the scraped metrics will then be injected into prometheus' time series database. For example, assuming we have prometheus, node_exporter and pushgateway on the same node, the complete prometheus configuration is presented next. The pushgateway configuration is the last section in the following YAML code:
nano /opt/prometheus-2.35.0/prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9010']

  - job_name: 'node_exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9120']

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9011']
Restart prometheus:
systemctl restart prometheus
or kill the prometheus process and rerun it.
Push metrics to pushgateway
First we will look at a bash example to push metrics to pushgateway:
echo "cpu_utilization 20.25" | sudo curl --data-binary @-
http://localhost:9011/metrics/job/my_custom_metrics/instance/10.20.0.1:9000/provider/hetzner
Have a look at pushgateway’s metrics endpoint:
sudo curl -L http://localhost:9011/metrics/
The output is a list of metrics and values such as:
# TYPE cpu_utilization untyped
cpu_utilization{instance="10.20.0.1:9000",job="my_custom_metrics",
provider="hetzner"} 20.25
...
Let’s look at a python example on how we can push metrics to pushgateway:
import requests

job_name = 'my_custom_metrics'
instance_name = '10.20.0.1:9000'
provider = 'hetzner'
payload_key = 'cpu_utilization'
payload_value = '21.90'

# POST the metric to the pushgateway, grouped by job/instance/provider
response = requests.post(
    'http://localhost:9011/metrics/job/{j}/instance/{i}/provider/{p}'.format(
        j=job_name, i=instance_name, p=provider),
    data='{k} {v}\n'.format(k=payload_key, v=payload_value)
)
print(response.status_code)
With this method, you can push any custom metrics (bash, lambda function, etc.) to pushgateway and allow prometheus to consume that data into its time series database.
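Pushed metrics stay in the pushgateway until they are explicitly deleted (or the persistence file is removed). A hedged example of deleting the group pushed above, using the same job/instance/provider grouping labels:
curl -X DELETE http://localhost:9011/metrics/job/my_custom_metrics/instance/10.20.0.1:9000/provider/hetzner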
5. Alertmanager
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.
The core concepts of the Alertmanager are:
- Grouping categorizes alerts of similar nature into a single notification;
- Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing;
- Silences are a straightforward way to simply mute alerts for a given time. A silence is configured based on matchers, just like the routing tree;
- Routes are a set of paths that alerts take in order to determine which action should be associated with the alert. In short, you associate a route with a receiver. The initial route, also called the "root route", is a route that matches every single alert sent to the AlertManager. A route can have siblings and children that are also routes themselves. This way, routes can be nested any number of times, each level defining a new action (or receiver) for the alert. Each route defines receivers. Those receivers are the alert recipients: Slack, a mail service, PagerDuty.
Alertmanager is configured via command-line flags and a configuration file. While the command-line flags configure immutable system parameters, the configuration file defines inhibition rules, notification routing, and notification receivers. To specify which configuration file to load, use the --config.file flag:
./alertmanager --config.file=alertmanager.yml
Next is an example YAML configuration that covers the most relevant aspects of the configuration format.
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'

# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry
  # point for all alerts. It needs to have a receiver configured
  # so alerts that do not match any of the sub-routes are sent
  # to someone.
  receiver: 'team-X-mails'

  # The labels by which incoming alerts are grouped together. For
  # example, multiple alerts coming in for cluster=A and
  # alertname=LatencyHigh would be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label
  # name. This effectively disables aggregation entirely, passing
  # through all alerts as-is. This is unlikely to be what you want,
  # unless you have a very low alert volume or your upstream
  # notification system performs its own grouping. Example:
  # group_by: [...]
  group_by: ['alertname', 'cluster']

  # When a new group of alerts is created by an incoming alert,
  # wait at least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same
  # group that start firing shortly after another are batched
  # together on the first notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval'
  # to send a batch of new alerts that started firing for that
  # group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval'
  # to resend them.
  repeat_interval: 3h

  # All the above attributes are inherited by all child routes
  # and can be overwritten on each.

  # The child route trees.
  routes:
    # This route performs a regular expression match on alert
    # labels to catch alerts that are related to a list of services.
    - match_re:
        service: ^(foo1|foo2|baz)$
      receiver: team-X-mails

      # The service has a sub-route for critical alerts, any alerts
      # that do not match, i.e. severity != critical, fall-back to the
      # parent node and are sent to 'team-X-mails'
      routes:
        - match:
            severity: critical
          receiver: team-X-pager

    - match:
        service: files
      receiver: team-Y-mails

      routes:
        - match:
            severity: critical
          receiver: team-Y-pager

    # This route handles all alerts coming from a database service.
    # If there's no team to handle it, it defaults to the DB team.
    - match:
        service: database
      receiver: team-DB-pager
      # Also group alerts by affected database.
      group_by: [alertname, cluster, database]
      routes:
        - match:
            owner: team-X
          receiver: team-X-pager
        - match:
            owner: team-Y
          receiver: team-Y-pager

# Inhibition rules allow to mute a set of alerts given that
# another alert is firing.
# We use this to mute any warning-level notifications if the
# same alert is already critical.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # Apply inhibition if the alertname is the same.
    # CAUTION:
    #   If all label names listed in 'equal' are missing
    #   from both the source and target alerts,
    #   the inhibition rule will apply!
    equal: ['alertname']

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'

  - name: 'team-X-pager'
    email_configs:
      - to: 'team-X+alerts-critical@example.org'
    pagerduty_configs:
      - routing_key: <team-X-key>

  - name: 'team-Y-mails'
    email_configs:
      - to: 'team-Y+alerts@example.org'

  - name: 'team-Y-pager'
    pagerduty_configs:
      - routing_key: <team-Y-key>

  - name: 'team-DB-pager'
    pagerduty_configs:
      - routing_key: <team-DB-key>
Prometheus can be used as a client of Alertmanager, providing alerts to it, as shown in the next figure:
Managing alerts with Alertmanager+Prometheus can be setup through the following steps:
- Setup and configure AlertManager.
- Alter the Prometheus config file so it can talk to the AlertManager.
- Define alert rules in Prometheus configuration.
- Define alert mechanism in AlertManager to send alerts via slack and/or e-mail.
Alert rules are defined in the Prometheus configuration. Prometheus just scrapes metrics from its client applications, such as the node_exporter. However, if any alert condition hits, Prometheus sends it to the AlertManager, which manages the alerts through its pipeline of silencing, inhibition, grouping, and sending out notifications. Silencing mutes alerts for a given time. Inhibition suppresses notifications for certain alerts if other alerts have already fired. Grouping groups alerts of similar nature into a single notification, which prevents sending multiple notifications simultaneously.
Download the latest Alertmanager release
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.24.0.linux-amd64.tar.gz
mv alertmanager-0.24.0.linux-amd64/ alertmanager-0.24.0/
sudo mv alertmanager-0.24.0/ /opt
cd /opt/bin
sudo ln -s /opt/alertmanager-0.24.0/alertmanager alertmanager
sudo ln -s /opt/alertmanager-0.24.0/amtool amtool
Start AlertManager as a service
Create a data folder at the root directory, with a prometheus folder inside:
sudo mkdir -p /data/alertmanager/prometheus
Create the alertmanager user:
sudo useradd --no-create-home --shell /bin/false alertmanager
Give permissions to your newly created user for the AlertManager binaries:
sudo chown alertmanager:alertmanager /opt/alertmanager-0.24.0/amtool /opt/alertmanager-0.24.0/alertmanager
Give the correct permissions to those folders recursively:
sudo chown -R alertmanager:alertmanager /data/alertmanager
Create the service file:
sudo nano /etc/systemd/system/alertmanager.service

[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/opt/alertmanager-0.24.0/alertmanager \
--config.file=/opt/alertmanager-0.24.0/alertmanager.yml \
--storage.path=/data/alertmanager
Restart=always

[Install]
WantedBy=multi-user.target
Enable the service and start it:
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
systemctl status alertmanager
Let us verify that AlertManager is running by opening the browser on the default port 9093: http://localhost:9093.
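The same check can be done from the command line, either against the Alertmanager v2 API or with the bundled amtool; a minimal sketch, assuming the default port 9093:
curl -s http://localhost:9093/api/v2/status
/opt/alertmanager-0.24.0/amtool alert query --alertmanager.url=http://localhost:9093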
Bind Prometheus to AlertManager
We need to modify the Prometheus configuration file, by adding the next content:
nano /opt/prometheus-2.35.0/prometheus.yml

alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
Optionally, we can also add the Alertmanager metrics endpoint to be scraped by prometheus:
scrape_configs:
  ...
  - job_name: alertmanager
    static_configs:
      - targets: ['localhost:9093']
Restart prometheus service:
sudo systemctl restart prometheus
Add a rules file to prometheus configuration file:
nano /opt/prometheus-2.35.0/prometheus.yml

...
# Load rules once and periodically evaluate them according
# to the global 'evaluation_interval'
rule_files:
- "/opt/prometheus-2.35.0/alert_rules.yml"
We should now define the prometheus alert rules and add them to the alert_rules.yml file. An example rules configuration file is presented next:
nano /opt/prometheus-2.35.0/alert_rules.yml

# note: the memory and filesystem metric names below follow
# node_exporter >= 0.16 (the "_bytes" suffixes used by v1.3.1)
groups:
- name: alert_rules
  rules:
  - alert: high_cpu_load
    expr: node_load1 > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      description: Host is under high load, the avg load 1m is at {{ $value }}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
      summary: Server is under high load
      type: Server

  - alert: high_memory_load
    expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
    for: 30s
    labels:
      severity: warning
    annotations:
      description: Host memory usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
      summary: Server memory is almost full
      type: Server

  - alert: high_storage_load
    expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
    for: 30s
    labels:
      severity: warning
    annotations:
      description: Host storage usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
      summary: Server storage is almost full
      type: Server
We can check if the alert rules file is syntactically correct using the promtool utility:
promtool check rules alert_rules.yml
We now have access to the configured alert rules in the prometheus web UI:
Configure AlertManager
cd /opt/alertmanager-0.24.0/
# backup original config file
sudo mv alertmanager.yml alertmanager-bak.yml
Create a config file:
sudo nano alertmanager.yml

route:
  # When a new group of alerts is created by an incoming alert, wait
  # at least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group
  # that start firing shortly after another are batched together on
  # the first notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to
  # send a batch of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval'
  # to resend them.
  repeat_interval: 1h

  # A default receiver
  receiver: "web.hook"

  # All the above attributes are inherited by all child routes and
  # can be overwritten on each.
  routes:
    - receiver: "email-me"
      group_wait: 20s
      match_re:
        severity: critical
      continue: true
    - receiver: "web.hook"
      group_wait: 10s
      match_re:
        severity: critical|warning
      continue: true

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'email-me'
    email_configs:
      - to: 'your_email_id@gmail.com'
        from: 'your_email_id@gmail.com'
        smarthost: smtp.gmail.com:587
        auth_username: 'your_email_id@gmail.com'
        auth_identity: 'your_email_id@gmail.com'
        auth_password: 'email_password'
Check your configuration with the supplied amtool:
/opt/alertmanager-0.24.0/amtool check-config /opt/alertmanager-0.24.0/alertmanager.yml
To view the alertmanager configuration being used:
/opt/alertmanager-0.24.0/amtool config
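To exercise the whole notification pipeline, a synthetic alert can be posted directly to the Alertmanager v2 API; a sketch, using a hypothetical TestAlert name and a severity that matches the routes defined above:
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "critical"}, "annotations": {"summary": "Manual test alert"}}]'
The alert should show up in the Alertmanager web UI and be delivered to the matching receivers (web.hook and email-me in the configuration above).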
You can also install the Prometheus Alertmanager Plugin in Grafana
sudo grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource
sudo systemctl restart grafana-server
At this point we can check all the targets that were configured in prometheus: prometheus itself, pushgateway, node_exporter, and alertmanager.
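The list of targets is available in the Prometheus web UI (Status > Targets) and also from the HTTP API; for example, assuming the listen address configured earlier:
curl -s http://localhost:9010/api/v1/targets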
6. Monitor Nvidia GPUs with Prometheus and Grafana
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM also integrates into the Kubernetes ecosystem using DCGM-Exporter to provide rich GPU telemetry in containerized environments [13].
Install Nvidia Data Center GPU Manager (DCGM)
First we must set up the CUDA repository GPG key. For example, on Ubuntu 22.04 and a x86_64 architecture we run:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
If the previous command issues an error reporting that NO_PUBKEY A4B469963BF863CC is available, install that key:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
--recv-keys A4B469963BF863CC
Now we can install DCGM:
sudo apt update && sudo apt install -y datacenter-gpu-manager
In CentOS 8, or RHEL 8, DCGM installation is as follows:
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf clean expire-cache && sudo dnf install -y datacenter-gpu-manager
We can opt to enable the automatic start of the DCGM service after the system boots:
sudo systemctl enable nvidia-dcgm
sudo systemctl start nvidia-dcgm
The installation can be checked with the dcgmi utility:
sudo nv-hostengine
dcgmi discovery -l
If the previous command succeeds, the output is similar to:
1 GPU found.
+--------+--------------------------------------------------------+
| GPU ID | Device Information |
+--------+--------------------------------------------------------+
| 0 | Name: NVIDIA GeForce RTX YYYY |
| | PCI Bus ID: 00000000:01:00.0 |
| | Device UUID: GPU-xxxxxxxx-yyyy-dddd-nnnn-zzzzzzzzzzzz |
+--------+--------------------------------------------------------+
Compile and install the DCGM exporter for Prometheus
DCGM exporter exposes GPU metrics to prometheus, leveraging Nvidia DCGM. If the Go compiler is not installed, we must install it first:
wget https://go.dev/dl/go1.18.2.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.18.2.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
go version
Clone dcgm-exporter github repository and compile the code:
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary
sudo PATH=$PATH make install
sudo chmod 755 /usr/bin/dcgm-exporter
To monitor all GPUs run:
sudo dcgm-exporter &
And to monitor GPU 1 only:
sudo dcgm-exporter -d g:1 &
Test the DCGM_exporter for prometheus:
curl localhost:9400/metrics
The output is a list of metrics and the respective values, such as in the next example:
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-xxxx-yyy-ddd-nnn-zzzzz",
device="nvidia1",modelName="NVIDIA GeForce RTX YYYY",
Hostname="node2"} 300
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-xxxx-yyy-ddd-nnn-zzzzz",
device="nvidia1",modelName="NVIDIA GeForce RTX YYYY",
Hostname="node2"} 34
...
Integrate DCGM exporter with prometheus
Add the following scrape configuration to the prometheus config file, in order to define one endpoint (per host with GPUs) to be scraped:
scrape_configs:
  # The job name is added as a label 'job=<job_name>' to any
  # timeseries scraped from this config
  - job_name: 'dcgm'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'
    static_configs:
      # scrape metrics from GPUs on two hosts, "node1" and "node2"
      # in this example
      - targets: ['node1:9400', 'node2:9400']
Once the Prometheus configuration file has been updated, restart prometheus service:
sudo systemctl restart prometheus
We can check that prometheus is now scraping metrics from the GPUs, via the DCGM_exporter, by inspecting the list of targets and by querying metrics that start with DCGM_ on the Graph tab of the prometheus web UI.
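As a final command-line check, one of the DCGM metrics can be queried through the Prometheus HTTP API; a minimal sketch, assuming the prometheus listen address configured earlier:
curl -s 'http://localhost:9010/api/v1/query' --data-urlencode 'query=DCGM_FI_DEV_GPU_TEMP'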
References
- https://grafana.com/oss/grafana/
- https://grafana.com/go/webinar/getting-started-with-grafana/
- https://prometheus.io/docs/introduction/overview/
- https://github.com/prometheus
- https://prometheus.io/docs/guides/node-exporter/
- https://github.com/prometheus/pushgateway/
- https://prometheus.io/docs/alerting/latest/alertmanager/
- https://medium.com/devops-dudes/prometheus-alerting-with-alertmanager-e1bbba8e6a8e
- https://devconnected.com/alertmanager-and-prometheus-complete-setup-on-linux/
- https://grafana.com/grafana/plugins/camptocamp-prometheus-alertmanager-datasource/
- https://kifarunix.com/configure-prometheus-email-alerting-with-alertmanager/
- https://medium.com/techno101/how-to-send-a-mail-using-prometheus-alertmanager-7e880a3676db
- https://developer.nvidia.com/dcgm
- https://github.com/NVIDIA/dcgm-exporter
- https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html