Monitoring

We currently provide only Ganglia for monitoring the cluster status.

Using Ganglia

Go to the following address and log in with your home organization (Charite or MDC):

Ganglia does not know about Slurm

Ganglia will not show you anything about the Slurm job scheduling system. If a job has a whole node allocated but uses no CPUs, Ganglia will display that node as unused. However, Slurm will not schedule another job on this node.

You will be shown a screen like the one below. It gives you a good idea of what is happening on the HPC.

By default, you will be shown the cluster usage of the last day. You can quickly switch the report to other periods, such as the last two or four hours.

In the first row of pictures you see the number of total CPUs (actually hardware threads), number of hosts seen as up and down by Ganglia, and cluster load/utilization. You will then see the overall cluster load, memory usage, CPU usage, and network utilization across the selected time period.

Linux load is not intuitive

Note that the technical details behind Linux load are not very intuitive. Load incorporates much more than just CPU usage. You can find a quite comprehensive treatment of Linux load here.
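
As a rough illustration of why load and CPU usage differ: on Linux, the load average counts tasks that are runnable as well as tasks in uninterruptible sleep (typically waiting for I/O), so a node can show a high load while its CPUs are mostly idle. The following minimal sketch reads the load of the node you are logged into and relates it to the number of hardware threads. It is not part of Ganglia, it only looks at the local node, and `load_per_thread` is a hypothetical helper name chosen for this example.

```python
import os


def load_per_thread(load: float, n_threads: int) -> float:
    """Return the load average divided by the number of hardware threads."""
    if n_threads <= 0:
        raise ValueError("n_threads must be positive")
    return load / n_threads


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5, and 15 minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    # os.cpu_count() counts hardware threads and may return None.
    n_threads = os.cpu_count() or 1
    print(f"1-minute load: {one_min:.2f} on {n_threads} hardware threads")
    print(f"load per thread: {load_per_thread(one_min, n_threads):.2f}")
```

A value well above 1 per thread does not necessarily mean the CPUs are saturated; it may equally point to many tasks waiting on storage.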

We are using a fast shared storage system and almost no local storage (except in /tmp). Also, almost no jobs use MPI or other heavy network communication. Thus, the network utilization is a good measure of the I/O on the cluster.

Below, you can drill down into various metrics and visualize them historically. Just try it out and find your way around; you cannot break anything. Sadly, there is no good documentation of Ganglia online.

Aggregate GPU Utilization Visualization

Ganglia allows you to obtain metrics in several interesting and useful ways. If you click on "Aggregate Graphs", you can enter the following values to get an overview of the live GPU utilization.

  • Title: Aggregate GPU Utilization
  • Host Regular expression: hpc-gpu-.*
  • Metric Regular Expressions: gpu._util
  • Graph Type: Stacked
  • Legend Options: Hide legend

Then click Create Graph.
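
To see what the two regular expressions above select, they can be tried offline. The sketch below is only an illustration and makes several assumptions: the host and metric names in `SAMPLE` are made up, the values are assumed to be utilization in percent, and the patterns are applied as unanchored searches (how Ganglia applies them internally is not verified here). `select_gpu_metrics` and `stacked_total` are hypothetical helper names.

```python
import re

HOST_RE = re.compile(r"hpc-gpu-.*")
METRIC_RE = re.compile(r"gpu._util")

# Hypothetical sample data: host -> {metric name -> utilization in percent}.
SAMPLE = {
    "hpc-gpu-1": {"gpu0_util": 100.0, "gpu1_util": 50.0, "gpu0_mem_util": 30.0},
    "hpc-gpu-2": {"gpu0_util": 0.0, "cpu_user": 12.0},
    "hpc-cpu-1": {"cpu_user": 80.0},
}


def select_gpu_metrics(metrics_by_host):
    """Return (host, metric, value) triples matched by both patterns."""
    selected = []
    for host, metrics in sorted(metrics_by_host.items()):
        if not HOST_RE.search(host):
            continue  # not a GPU node according to the host pattern
        for name, value in sorted(metrics.items()):
            if METRIC_RE.search(name):
                selected.append((host, name, value))
    return selected


def stacked_total(selected):
    """Sum the selected values, i.e. the height of a stacked graph."""
    return sum(value for _host, _name, value in selected)


if __name__ == "__main__":
    rows = select_gpu_metrics(SAMPLE)
    for host, name, value in rows:
        print(f"{host} {name} {value}")
    print("stacked total:", stacked_total(rows))
```

Note that the `.` in `gpu._util` matches exactly one character, so under these assumptions a metric named `gpu10_util` would not be selected.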

If a GPU is fully used, it will contribute 100 points on the vertical axis. See above for an example, and here is a direct link: