GBase 8a Database Cluster Network Troubleshooting: Experience Sharing

Fault symptom: the cluster state shown by gcadmin was normal, but cluster tasks suddenly ran extremely slowly, hundreds of times slower than usual, and at times work stalled almost completely.

Troubleshooting process: First we restarted the cluster, which did not help (a restart fixes about 90% of problems, -_-!). We then checked the cluster state, data distribution (shards), system logs, the express log, pstack output, and the Linux system log (messages), and found nothing abnormal. We did notice that the failures were random: sometimes a small task (for example, a query on a single table returning a little over a thousand rows) returned its result quickly, and at other times the same kind of task hung. Random behavior like this is very likely network-related, so we checked each cluster node's network with ping and found that one particular node was losing packets. After the switch card was replaced, the problem was solved.
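The per-node ping check described above can be scripted so that every node is tested the same way. The following is a minimal Python sketch, assuming Linux iputils ping, whose summary line reads like "100 packets transmitted, 97 received, 3% packet loss". The function names, the default packet count, and the loss threshold are illustrative choices, not part of any GBase tooling.

```python
import re
import subprocess

# Matches the summary line printed by Linux iputils ping, e.g.
# "100 packets transmitted, 97 received, 3% packet loss, time 99123ms"
# (an optional "+N errors," field may follow "received").
LOSS_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")


def loss_ratio(ping_output: str) -> float:
    """Return the fraction of lost packets parsed from ping's summary line."""
    match = LOSS_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    return (sent - received) / sent if sent else 1.0


def ping_node(ip: str, count: int = 20) -> float:
    """Ping one node `count` times and return its loss ratio (Linux ping assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), ip],
        capture_output=True,
        text=True,
    )
    # ping exits non-zero on total loss, but still prints the summary line.
    return loss_ratio(result.stdout)


def lossy_nodes(ips: list[str], threshold: float = 0.0) -> dict[str, float]:
    """Ping every node and keep those whose loss ratio exceeds `threshold`."""
    ratios = {ip: ping_node(ip) for ip in ips}
    return {ip: r for ip, r in ratios.items() if r > threshold}


# Example (hypothetical node addresses; substitute your cluster's node IPs):
#   lossy_nodes(["192.168.1.101", "192.168.1.102", "192.168.1.103"])
```

Running the check from more than one node helps tell a faulty node from a faulty path, since loss that appears only toward one target points at that node's link or switch port.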

Troubleshooting summary:
1. Network problems: while the cluster processes a task, a random network fault on some node can leave that node stuck, which in turn leaves the whole task hung. Such problems can be hard to find by looking only at the cluster's monitoring state or at a single node.
2. Ping with large packets: for example, if the network's maximum transmission size (MTU) is 9000, use ping -s 9000 IP to check for packet loss (with ordinary small ping packets the loss may not show up).
3. GBase 8a provides a function for setting the state of cluster nodes. If some nodes have problems while distributed tasks must keep running, you can use the gcadmin command to set those nodes to a failed state. The management node then stops sending task requests to them, so tasks no longer get stuck on the problem node. This function suits cases where the fault is not yet resolved but the business must be kept running in the meantime.
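Item 2 above depends on how ping sizes its packets. On Linux, ping -s sets the number of ICMP data bytes, so the IPv4 packet on the wire is 28 bytes larger (a 20-byte IPv4 header without options plus an 8-byte ICMP header); ping -s 9000 therefore sends 9028-byte packets that are fragmented on a 9000-byte MTU. The sketch below, assuming Linux iputils ping and its -M do option (prohibit fragmentation), computes the largest payload that fits in one MTU, so a single full-size frame either passes end to end or is rejected. The helper names are hypothetical.

```python
# IPv4 header (no options) and ICMP echo header sizes, in bytes.
IPV4_HEADER = 20
ICMP_HEADER = 8


def max_icmp_payload(mtu: int) -> int:
    """Largest `ping -s` value whose IPv4 packet still fits in one MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def jumbo_ping_cmd(ip: str, mtu: int = 9000, count: int = 100) -> list[str]:
    """Build a Linux iputils ping command that sends full-MTU packets
    with fragmentation forbidden (-M do)."""
    return ["ping", "-M", "do", "-s", str(max_icmp_payload(mtu)),
            "-c", str(count), ip]


# For a 9000-byte MTU the payload is 9000 - 20 - 8 = 8972 bytes;
# for a standard 1500-byte MTU it is 1472 bytes.
```

Both variants stress the link with large packets; the unfragmented one has the added benefit of exposing any hop along the path whose MTU is smaller than expected.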
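Items 1 and 3 above describe the same mechanism from two sides: a distributed task finishes only when its slowest node finishes, so one node stuck on a bad link hangs the whole task, and taking that node out of dispatch removes the bottleneck. The toy Python model below illustrates this idea under stated assumptions. Node, Coordinator, and mark_unavailable are hypothetical names; this is not GBase 8a's scheduler or the gcadmin interface, and it does not model how the remaining nodes take over the excluded node's data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    ip: str
    # Simulated time (seconds) this node needs for its share of a task.
    work_seconds: float
    available: bool = True


@dataclass
class Coordinator:
    """Toy stand-in for a management node, not GBase 8a's real scheduler."""
    nodes: list[Node] = field(default_factory=list)

    def mark_unavailable(self, ip: str) -> None:
        # Rough analogue of setting a node to a failed state with gcadmin.
        for node in self.nodes:
            if node.ip == ip:
                node.available = False

    def dispatch(self) -> float:
        """Send a task to every available node; it finishes when the slowest does."""
        targets = [n for n in self.nodes if n.available]
        if not targets:
            raise RuntimeError("no available nodes")
        return max(n.work_seconds for n in targets)
```

For example, with nodes taking 2.0, 2.5, and 600.0 seconds (the last one on a lossy link), dispatch() returns 600.0; after mark_unavailable on the slow node, it returns 2.5. The model shows why per-node status monitoring alone can miss the problem: every node is "up", yet the task time is set entirely by the worst one.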