Cluster storage on NFS: Python machine learning processes run from the NFS mount often end up in D state

Time:10-08

As the title says. Here is the environment; I hope someone experienced can help troubleshoot:
K8S cluster: one management node and 4 to 12 compute nodes.
All compute nodes mount the same NFS server, which holds the machine learning code and the image resources the code uses.
Under maximum concurrency, each compute node runs the program from the NFS-mounted directory, and occasionally a process ends up in D or Z state. These processes are very troublesome:
they cannot be killed. Even after killing the parent process, so that the process is reparented up to PID 1, it is still not reaped.
The GPU resources used by the process are not released (GPU memory is not cleared). The only fix is to reboot the compute node, and the reboot itself also hangs.


Here is part of the information under /proc for one such process:
 
root@n004:/proc/16496# cat stack
[<ffffffff81086248>] do_exit+0x778/0xb00
[<ffffffff81086653>] do_group_exit+0x43/0xb0
[<ffffffff81092e74>] get_signal+0x294/0x600
[<ffffffff8102e567>] do_signal+0x37/0x6f0
[<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
[<ffffffff81003c7e>] syscall_return_slowpath+0x4e/0x60
[<ffffffff8184f170>] int_ret_from_sys_call+0x25/0x9f
[<ffffffffffffffff>] 0xffffffffffffffff
root@n004:/proc/16496# cat status
Name: python
State: Z (zombie)
Tgid: 16496
Ngid: 0
Pid: 16496
PPid: 16371
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
NStgid: 16496 1953
NSpid: 16496 1953
NSpgid: 50821
NSsid: 50761 1
Threads: 1
SigQ: 3/1031197
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001005002
SigCgt: 0000000180000000
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Seccomp: 0
Cpus_allowed: ffffff,ffffffff
Cpus_allowed_list: 0-55
Mems_allowed: 00000000000000
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 3
nonvoluntary_ctxt_switches: 1
root@n004:/proc/16496# cat wchan
do_exit

CodePudding user response:

From what I have read so far, most D-state processes are blocked on I/O, but I do not know how to confirm from the process itself that I/O is really the cause. I hope an expert passing by can offer some pointers.
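
One way to start narrowing this down is to collect, for every process currently in D or Z state, the same /proc files shown in the question (status, wchan, stack) and see what the kernel says each one is waiting in. The following is only a minimal sketch, not a definitive diagnostic: the helper names (parse_status, state_letter, read_text, find_stuck) are made up for illustration, it assumes a Linux /proc layout, and reading /proc/<pid>/stack normally requires root. As a rough hint, a wait channel or stack frame whose name begins with nfs or rpc would point at the NFS client, while a process like the one above, which sits in do_exit, is stuck during teardown rather than in an ordinary read or write.

```python
import os


def parse_status(text):
    """Parse the text of /proc/<pid>/status into a dict of field -> value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def state_letter(fields):
    """Return the one-letter state, e.g. 'D' from 'D (disk sleep)'."""
    return fields.get("State", "")[:1]


def read_text(path):
    """Read a small /proc file; return '' if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def find_stuck(proc_root="/proc", states=("D", "Z")):
    """List processes under proc_root whose state letter is in `states`."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as 'self' or 'meminfo'
        base = os.path.join(proc_root, entry)
        fields = parse_status(read_text(os.path.join(base, "status")))
        if state_letter(fields) in states:
            stuck.append({
                "pid": int(entry),
                "name": fields.get("Name", ""),
                "state": fields.get("State", ""),
                "ppid": fields.get("PPid", ""),
                "wchan": read_text(os.path.join(base, "wchan")),
                "stack": read_text(os.path.join(base, "stack")),
            })
    return stuck


if __name__ == "__main__":
    for p in find_stuck():
        print(f"{p['pid']} {p['name']} state={p['state']} "
              f"ppid={p['ppid']} wchan={p['wchan']}")
        if p["stack"]:
            print(p["stack"])
```

Run it as root on an affected compute node while the problem is occurring. Comparing the wchan and stack of several stuck processes across nodes may show whether they all wait in the same kernel path.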