跳转至

Fuser

fuser 是 Linux 下查询文件或 socket 占用的程序,因为 Linux 下一切系统资源都是文件,所以 fuser 是检查系统资源占用的强大工具。

用例

torch.distributed.DistributedDataParallel 训练失败多次后,留下了很多占用显卡内存的进程,如下:

Tue Aug 15 20:37:50 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:4F:00.0 Off |                  N/A |
| 30%   39C    P8    24W / 350W |     20MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:52:00.0 Off |                  N/A |
| 30%   27C    P8    26W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:56:00.0 Off |                  N/A |
| 30%   26C    P8    24W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:57:00.0 Off |                  N/A |
| 30%   28C    P8    29W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:CE:00.0 Off |                  N/A |
| 35%   47C    P2   145W / 350W |  11084MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:D1:00.0 Off |                  N/A |
| 32%   45C    P2   145W / 350W |  11084MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:D5:00.0 Off |                  N/A |
| 36%   47C    P2   155W / 350W |  11082MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:D6:00.0 Off |                  N/A |
| 30%   29C    P8    25W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      3282      G   /usr/bin/gnome-shell                6MiB |
|    1   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      6956      C   ...da3/envs/torch/bin/python     2382MiB |
|    4   N/A  N/A      8949      C   ...da3/envs/torch/bin/python     2382MiB |
|    4   N/A  N/A     10902      C   ...da3/envs/torch/bin/python     2382MiB |
|    4   N/A  N/A     12880      C   ...da3/envs/torch/bin/python     3926MiB |
|    5   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      6957      C   ...da3/envs/torch/bin/python     2382MiB |
|    5   N/A  N/A      8950      C   ...da3/envs/torch/bin/python     2382MiB |
|    5   N/A  N/A     10903      C   ...da3/envs/torch/bin/python     2382MiB |
|    5   N/A  N/A     12881      C   ...da3/envs/torch/bin/python     3926MiB |
|    6   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      6958      C   ...da3/envs/torch/bin/python     2382MiB |
|    6   N/A  N/A      8951      C   ...da3/envs/torch/bin/python     2382MiB |
|    6   N/A  N/A     10904      C   ...da3/envs/torch/bin/python     2382MiB |
|    6   N/A  N/A     12882      C   ...da3/envs/torch/bin/python     3926MiB |
|    7   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

(这是不应该出现的,具体为什么出现了有待排查)

所以当我把能看到的 python 进程全部 kill 掉:

Tue Aug 15 20:56:04 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:4F:00.0 Off |                  N/A |
| 30%   26C    P8    20W / 350W |     20MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:52:00.0 Off |                  N/A |
| 30%   26C    P8    27W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:56:00.0 Off |                  N/A |
| 30%   25C    P8    24W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:57:00.0 Off |                  N/A |
| 30%   27C    P8    35W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:CE:00.0 Off |                  N/A |
| 30%   40C    P2   106W / 350W |   2394MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:D1:00.0 Off |                  N/A |
| 30%   26C    P8    24W / 350W |     10MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:D5:00.0 Off |                  N/A |
| 30%   27C    P8    28W / 350W |     10MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:D6:00.0 Off |                  N/A |
| 30%   26C    P8    27W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      3282      G   /usr/bin/gnome-shell                6MiB |
|    1   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      2868      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

诡异的事情出现了,卡 4 上只有一个 Xorg 进程运行,但显存占用却达到 2374MiB。这时可以用 fuser 排查(具体为什么 fuser 看得到 nvidia-smi 看不到我也不知道)

sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia1:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia2:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia3:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia4:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia5:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia6:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia7:        root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidiactl:      root       2868 F...m Xorg
                     gdm        3282 F...m gnome-shell
                     ruiming   14810 F...m python
/dev/nvidia-modeset: root       2868 F.... Xorg
                     gdm        3282 F.... gnome-shell
/dev/nvidia-uvm:     ruiming   14810 F...m python

发现这些 GPU 都被 14810 进程占用,kill 之,问题解决。