Operations and Maintenance 👉 Linux

1. Hardware monitoring

(1) Monitor GPU temperature, power draw, and memory usage in real time: watch -n 0.1 nvidia-smi

root@node01:~# watch -n 0.1 nvidia-smi

Every 0.1s: nvidia-smi
node01: Sun Mar  2 22:24:31 2025

Sun Mar  2 22:24:31 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           Off |   00000000:02:00.0 Off |                    0 |
| N/A   40C    P0            142W /  250W |    8439MiB /  16384MiB |     94%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off |   00000000:03:00.0 Off |                    0 |
| N/A   65C    P0             48W /  250W |   15605MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           Off |   00000000:82:00.0 Off |                    0 |
| N/A   32C    P0            139W /  250W |    8443MiB /  16384MiB |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-PCIE-16GB           Off |   00000000:83:00.0 Off |                    0 |
| N/A   71C    P0            149W /  250W |   10297MiB /  16384MiB |     98%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1391      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A    211887      C   ..._cuda12.4_intel2024.2/vasp/vasp_std       8432MiB |
|    1   N/A  N/A      1391      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A    234167      C   ..._cuda12.4_intel2024.2/vasp/vasp_std      15598MiB |
|    2   N/A  N/A      1391      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A    211888      C   ..._cuda12.4_intel2024.2/vasp/vasp_std       8436MiB |
|    3   N/A  N/A      1391      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A    234489      C   ..._cuda12.4_intel2024.2/vasp/vasp_std      10290MiB |
+-----------------------------------------------------------------------------------------+

Pay particular attention to the Temp column: GPU temperature should not exceed 80 °C.

Press Ctrl+C to exit real-time monitoring.
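For unattended checks, the same temperatures can be read in a script-friendly form. The sketch below assumes the CSV output of nvidia-smi's --query-gpu option; the helper name check_gpu_temps is hypothetical, and the default of 80 simply mirrors the limit noted above.

```shell
# check_gpu_temps is a hypothetical helper, not part of nvidia-smi.
# It reads "index, temperature" CSV lines on stdin and prints a warning
# for every GPU hotter than the limit (default 80 C).
# Exit status is 1 if any GPU is over the limit, otherwise 0.
check_gpu_temps() {
    awk -F', *' -v limit="${1:-80}" '
        $2 + 0 > limit + 0 {
            printf "WARNING: GPU %s is at %s C (limit %s C)\n", $1, $2, limit
            hot = 1
        }
        END { exit (hot ? 1 : 0) }
    '
}

# On a node with the NVIDIA driver installed, feed it live data:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | check_gpu_temps 80
```

Because the function only parses text, it can be tried on saved output before being wired into cron or another scheduler.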

(2) Change the GPU power limit: nvidia-smi -i n -pl x   # set the power limit of GPU index n to x watts

root@node01:~# nvidia-smi -i 0 -pl 150
Power limit for GPU 00000000:02:00.0 was set to 150.00 W from 250.00 W.
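Before changing a limit, it can help to confirm that the requested value falls inside the range the card accepts. The following is a minimal dry-run sketch, assuming the power.min_limit and power.max_limit query fields of nvidia-smi; plan_power_limit is a hypothetical name, and the function only prints commands, it never runs them.

```shell
# plan_power_limit is a hypothetical helper, not part of nvidia-smi.
# It reads "index, min_limit, max_limit" CSV lines on stdin and prints
# (does not execute) one nvidia-smi command per GPU whose allowed range
# contains the requested limit in watts. Other GPUs get a comment line.
plan_power_limit() {
    awk -F', *' -v w="$1" '
        w + 0 >= $2 + 0 && w + 0 <= $3 + 0 {
            printf "nvidia-smi -i %s -pl %s\n", $1, w
            next
        }
        { printf "# skip GPU %s: %s W is outside %s-%s W\n", $1, w, $2, $3 }
    '
}

# On a node with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,power.min_limit,power.max_limit --format=csv,noheader,nounits | plan_power_limit 150
```

Review the printed commands first, then run them as root if they look right.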

(3) Monitor CPU utilization: top

root@node01:~# top

top - 22:29:45 up 73 days, 12:14,  5 users,  load average: 0.21, 0.17, 0.18
Tasks: 750 total,   1 running, 749 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257820.4 total,   1107.2 free,   4931.1 used, 251782.0 buff/cache
MiB Swap:   2048.0 total,   2047.5 free,      0.5 used. 251068.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 root      20   0  168180  11648   7616 S   0.3   0.0  15:03.46 systemd
    796 root      19  -1  335656 189596 189596 S   0.3   0.1 115:38.21 systemd-journal
   1190 systemd+  20   0   14836   6272   5824 S   0.3   0.0 135:01.29 systemd-oomd
   1206 avahi     20   0    7880   2688   2688 S   0.3   0.0  13:42.68 avahi-daemon
  74214 root      20   0   15856   8960   7616 S   0.3   0.0  36:04.27 sshd
1753466 root      20   0 1084508   8064   4928 S   0.3   0.0   1:45.56 slurmctld

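In the %Cpu(s) line, 0.1 us means user-space processes are using 0.1% of CPU time. Overall CPU utilization is roughly 100% minus the id (idle) value, which here is also 0.1%.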

Press q or Ctrl+C to exit top.
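The "100 minus idle" arithmetic can be scripted against top's batch mode. This is a sketch under the assumption that the summary line looks like the %Cpu(s) line shown above; cpu_busy_percent is a hypothetical helper name.

```shell
# cpu_busy_percent is a hypothetical helper, not part of top.
# It reads top output on stdin, finds the "%Cpu(s)" summary line, and
# prints 100 minus the idle ("id") percentage, to one decimal place.
cpu_busy_percent() {
    awk '/^%Cpu\(s\)/ && match($0, /[0-9.]+ +id/) {
        # substr() yields e.g. "99.9 id"; awk uses the leading number.
        printf "%.1f\n", 100 - substr($0, RSTART, RLENGTH)
        exit
    }'
}

# One-shot reading from a live system (LC_ALL=C keeps "." as the decimal mark):
#   LC_ALL=C top -b -n 1 | cpu_busy_percent
```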

(4) Monitor memory (RAM) usage: free -h

root@node01:~# free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi       4.8Gi       3.0Gi        13Mi       244Gi       245Gi
Swap:          2.0Gi       0.0Ki       2.0Gi

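total: total RAM; used: RAM currently in use; available: RAM that new processes can still claim, including reclaimable buff/cache. A small free value is normal when buff/cache is large, so judge memory pressure by available rather than free.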
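To turn the available column into a single number, the Mem: row can be parsed directly. The sketch below assumes the six-column layout shown above (available is the seventh field once the Mem: label is counted); mem_available_percent is a hypothetical helper name.

```shell
# mem_available_percent is a hypothetical helper, not part of free.
# It reads `free -m` style output on stdin and prints available RAM as a
# percentage of total RAM, to one decimal place.
# Assumes columns: total used free shared buff/cache available.
mem_available_percent() {
    awk '/^Mem:/ { printf "%.1f\n", $7 / $2 * 100 }'
}

# On a live system (use -m, not -h, so the values are plain numbers):
#   free -m | mem_available_percent
```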

(5) Monitor disk space: df -h

root@node01:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            26G  3.3M   26G   1% /run
/dev/sda2        44T  696G   41T   2% /
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
efivarfs        304K  164K  136K  55% /sys/firmware/efi/efivars
/dev/sda1       511M  6.1M  505M   2% /boot/efi

The root filesystem (/dev/sda2) has 44T in total, 696G used, and 41T available.
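A simple threshold check on the Use% column can flag filesystems that are filling up. This is a sketch, assuming df's usual six-column layout and mount points without spaces; disks_over and its default of 90 are hypothetical choices.

```shell
# disks_over is a hypothetical helper, not part of df.
# It reads df output on stdin, skips the header line, and prints the
# mount point and Use% of every filesystem above the limit (default 90).
disks_over() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # "55%" -> "55"
            if (use + 0 > limit + 0) print $6, $5
        }
    '
}

# On a live system (-P keeps each filesystem on one line):
#   df -P | disks_over 90
```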

2. System maintenance

(1) Start the Slurm job scheduler

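root@node01:~# systemctl restart slurmctld   # restart the Slurm controller service on the control node
root@node01:~# systemctl restart slurmdbd    # restart the Slurm database service on the control node
root@node01:~# systemctl restart slurmd      # restart the Slurm service on a compute node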
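After a restart it is worth confirming that the services actually came up. The sketch below assumes the unit names used above and relies on systemctl is-active, which prints a state such as active, inactive, or failed; report_inactive is a hypothetical helper name.

```shell
# report_inactive is a hypothetical helper, not part of systemd or Slurm.
# It reads "unit state" pairs on stdin and prints every unit whose state
# is not "active". Exit status is 1 if any unit is not active.
report_inactive() {
    awk '
        $2 != "active" { print $1 " is " $2; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# On a live system, build the pairs with systemctl is-active:
#   for u in slurmctld slurmdbd slurmd; do
#       echo "$u $(systemctl is-active "$u")"
#   done | report_inactive
```

For a unit reported as failed, systemctl status followed by the unit name shows the most recent log lines.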

(2) Check node status: sinfo

root@node01:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
GPU*         up   infinite      1  down* node56
GPU*         up   infinite     21    mix node[01-48,55,57-67]
GPU*         up   infinite      6   idle node[49-54]

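STATE shows the node state: idle means the node is free, mix means some of its resources are allocated, alloc means all of its resources are allocated, and down means the node is unavailable. A trailing * (as in down*) means the node is not responding.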
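On a larger cluster, a per-state node count is quicker to scan than the full list. This sketch assumes the default sinfo column layout shown above (NODES in the fourth column, STATE in the fifth); count_nodes_by_state is a hypothetical helper name.

```shell
# count_nodes_by_state is a hypothetical helper, not part of Slurm.
# It reads default sinfo output on stdin, skips the header line, sums the
# NODES column per STATE, and prints "state count" lines sorted by state.
count_nodes_by_state() {
    awk 'NR > 1 { n[$5] += $4 } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# On a live cluster:
#   sinfo | count_nodes_by_state
```

A node that belongs to several partitions appears once per partition in sinfo output, so it is counted once per partition here as well.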

(3) Update node state

root@node01:~# scontrol update nodename=node01 state=resume

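Replace node01 with the hostname of the compute node whose state you want to update. Setting state=resume returns a node that is down or drained to service.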
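When several nodes need the same treatment, the commands can be generated from a list of hostnames. The following is a dry-run sketch: plan_resume is a hypothetical helper that only prints scontrol commands, and the sinfo options in the usage comment (-h, -N, -t, -o) are assumed to behave as in current Slurm releases.

```shell
# plan_resume is a hypothetical helper, not part of Slurm.
# It reads one hostname per line on stdin and prints (does not execute)
# the scontrol command that would resume each node. Blank lines are skipped.
plan_resume() {
    while read -r node; do
        if [ -n "$node" ]; then
            printf 'scontrol update nodename=%s state=resume\n' "$node"
        fi
    done
}

# On a live cluster, list the down nodes one per line and preview the commands:
#   sinfo -h -N -t down -o '%N' | plan_resume
```

Check why each node went down before resuming it (scontrol show node prints a Reason field), then run the printed commands as root.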