Operations & Maintenance 👉 Linux
1. Hardware Monitoring
(1) Monitor GPU temperature, power draw, and VRAM usage in real time: watch -n 0.1 nvidia-smi
root@node01:~# watch -n 0.1 nvidia-smi
Every 0.1s: nvidia-smi                                 node01: Sun Mar  2 22:24:31 2025
Sun Mar 2 22:24:31 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000000:02:00.0 Off | 0 |
| N/A 40C P0 142W / 250W | 8439MiB / 16384MiB | 94% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-PCIE-16GB Off | 00000000:03:00.0 Off | 0 |
| N/A 65C P0 48W / 250W | 15605MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla V100-PCIE-16GB Off | 00000000:82:00.0 Off | 0 |
| N/A 32C P0 139W / 250W | 8443MiB / 16384MiB | 95% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla V100-PCIE-16GB Off | 00000000:83:00.0 Off | 0 |
| N/A 71C P0 149W / 250W | 10297MiB / 16384MiB | 98% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1391 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 211887 C ..._cuda12.4_intel2024.2/vasp/vasp_std 8432MiB |
| 1 N/A N/A 1391 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 234167 C ..._cuda12.4_intel2024.2/vasp/vasp_std 15598MiB |
| 2 N/A N/A 1391 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 211888 C ..._cuda12.4_intel2024.2/vasp/vasp_std 8436MiB |
| 3 N/A N/A 1391 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 234489 C ..._cuda12.4_intel2024.2/vasp/vasp_std 10290MiB |
+-----------------------------------------------------------------------------------------+
Pay particular attention to GPU temperature: it should not exceed 80 °C.
Press Ctrl+C to exit the live view.
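For unattended monitoring, nvidia-smi's CSV query mode can drive a simple temperature alarm. The following is a minimal sketch, not part of the cluster's existing tooling; the 10-second poll interval is arbitrary, and the 80 °C threshold matches the note above:

#!/bin/bash
# Sketch: warn when any GPU exceeds 80 C.
while true; do
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
    while IFS=', ' read -r idx temp; do
        if [ "$temp" -gt 80 ]; then
            echo "$(date) WARNING: GPU $idx is at ${temp} C"
        fi
    done
    sleep 10
done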
(2) Change a GPU's power limit: nvidia-smi -i n -pl x  # set the power limit of GPU n to x watts
root@node01:~# nvidia-smi -i 0 -pl 150
Power limit for GPU 00000000:02:00.0 was set to 150.00 W from 250.00 W.
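To apply the same cap to every card in the box, the index list can come from the same query interface. A sketch, assuming the 150 W example value from above (setting -pl requires root, and x must stay within the card's supported power range):

for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    nvidia-smi -i "$i" -pl 150    # cap GPU $i at 150 W
done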
(3) Monitor CPU usage: top
root@node01:~# top
top - 22:29:45 up 73 days, 12:14, 5 users, load average: 0.21, 0.17, 0.18
Tasks: 750 total, 1 running, 749 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 0.0 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 257820.4 total, 1107.2 free, 4931.1 used, 251782.0 buff/cache
MiB Swap: 2048.0 total, 2047.5 free, 0.5 used. 251068.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 168180 11648 7616 S 0.3 0.0 15:03.46 systemd
796 root 19 -1 335656 189596 189596 S 0.3 0.1 115:38.21 systemd-journal
1190 systemd+ 20 0 14836 6272 5824 S 0.3 0.0 135:01.29 systemd-oomd
1206 avahi 20 0 7880 2688 2688 S 0.3 0.0 13:42.68 avahi-daemon
74214 root 20 0 15856 8960 7616 S 0.3 0.0 36:04.27 sshd
1753466 root 20 0 1084508 8064 4928 S 0.3 0.0 1:45.56 slurmctld
In the %Cpu(s) row, 0.1 us means user-space processes are currently using 0.1% of CPU time, and 99.9 id means 99.9% of CPU time is idle.
Press q (or Ctrl+C) to exit top.
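When no terminal is attached (e.g., from cron), top's batch mode (-b) takes one-shot snapshots instead of running interactively. A sketch that appends a timestamped CPU summary line once a minute; the log path is a placeholder:

while true; do
    echo "$(date) $(top -b -n 1 | grep '%Cpu(s)')" >> /var/log/cpu-usage.log   # hypothetical log path
    sleep 60
done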
(4) Monitor RAM usage: free -h
root@node01:~# free -h
total used free shared buff/cache available
Mem: 251Gi 4.8Gi 3.0Gi 13Mi 244Gi 245Gi
Swap: 2.0Gi 0.0Ki 2.0Gi
total is the total RAM, used is the RAM currently in use, and available is an estimate of memory available for new workloads (most of buff/cache can be reclaimed on demand).
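The same available figure can be read programmatically from /proc/meminfo, where MemAvailable is a standard kernel field reported in kiB. A sketch with an arbitrary 10 GiB warning threshold:

# Sketch: warn when available memory drops below 10 GiB.
avail_kib=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
if [ "$avail_kib" -lt $((10 * 1024 * 1024)) ]; then
    echo "WARNING: only $((avail_kib / 1024 / 1024)) GiB of memory available"
fi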
(5) Monitor disk space: df -h
root@node01:~# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 26G 3.3M 26G 1% /run
/dev/sda2 44T 696G 41T 2% /
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 304K 164K 136K 55% /sys/firmware/efi/efivars
/dev/sda1 511M 6.1M 505M 2% /boot/efi
The root filesystem is 44T in total, with 696G used and 41T available.
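Disk usage can also be checked from a script. The sketch below uses GNU df's --output option to warn when / passes 90% full; the threshold is arbitrary:

usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge 90 ]; then
    echo "WARNING: / is ${usage}% full"
fi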
2. System Maintenance
(1) Start the Slurm workload manager
root@node01:~# systemctl restart slurmctld #restart the Slurm controller service (head node)
root@node01:~# systemctl restart slurmdbd #restart the Slurm database service (head node)
root@node01:~# systemctl restart slurmd #restart the Slurm compute service (compute nodes)
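After a restart it is worth confirming the daemons actually came up before submitting jobs. A sketch using systemctl's is-active check (run the slurmctld/slurmdbd checks on the head node and the slurmd check on each compute node):

for svc in slurmctld slurmdbd; do                                      # head node
    systemctl is-active --quiet "$svc" || echo "$svc is NOT running"
done
systemctl is-active --quiet slurmd || echo "slurmd is NOT running"     # compute node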
(2) Check node status: sinfo
root@node01:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
GPU* up infinite 1 down* node56
GPU* up infinite 21 mix node[01-48,55,57-67]
GPU* up infinite 6 idle node[49-54]
STATE shows each node's state: idle means the node is free, mix means some of its resources are allocated, alloc means all of its resources are allocated, and down means the node is offline.
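To see why a node is down or drained, sinfo can list the recorded reasons; both flags below are standard Slurm options:

sinfo -R                              # REASON, USER, TIMESTAMP, NODELIST for down/drained nodes
sinfo -N -l | grep -Ei 'down|drain'   # long per-node listing, filtered to problem states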
(3) Update a node's state
root@node01:~# scontrol update nodename=node01 state=resume
Replace node01 here with the hostname of the compute node whose state you want to update.
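The same command can also take a node out of service before maintenance; note that state=drain requires a reason string. A sketch, reusing node01 as the example hostname:

scontrol update nodename=node01 state=drain reason="hardware check"  # stop new jobs; running jobs finish
scontrol update nodename=node01 state=resume                         # return the node to service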