Mariadb server is always with 100% processor utilization but application servers are under utilized-CodePudding

My Mariadb installation works well with normal times. However during peak usage, all CPU cores are being utilized 100% constantly. There are around 250000 active users during peak usage times. The cores have been scaled vertically from 8 to 16 to 32 for now. I am still facing issues of high utilization of cores during peak times. PFB the htop result:

All CPU Cores: 100%
Mem 5.25G/126G
Swp 0K/0K
Tasks: 43, 961 thr; 32 running
Load average: 150 148 148

The values of few important commands are below:

cat /proc/sys/kernel/threads-max
1030492
cat /proc/sys/kernel/pid_max
4194304
cat /proc/sys/vm/max_map_count
65530

innodb_buffer_pool_size=55G
max_connections        = 50000

MariaDB version: 10.6

Slow Query main issue:

# User@Host: lrp_dbuser[lrp_dbuser] @ lrp-application-server-2.internal.cloudapp.net [10.10.0.5]
# Thread_id: 3630886  Schema: lrp  QC_hit: No
# Query_time: 50.155999  Lock_time: 0.049785  Rows_sent: 0  Rows_examined: 150617
# Rows_affected: 1  Bytes_sent: 52
SET timestamp=1666331690;
UPDATE `session_lrp` SET `timestamp` = 1666331640
WHERE `id` = 'pghjejnvf8ou61usog5983h6lihh4jr0';
# Time: 221021  5:54:51
# User@Host: lrp_dbuser[lrp_dbuser] @ lrp-application-server-2.internal.cloudapp.net [10.10.0.5]
# Thread_id: 3632952  Schema: lrp  QC_hit: No
# Query_time: 50.334102  Lock_time: 0.050021  Rows_sent: 0  Rows_examined: 150632
# Rows_affected: 1  Bytes_sent: 52
SET timestamp=1666331691;
UPDATE `session_lrp` SET `timestamp` = 1666331640
WHERE `id` = 'hsun0ggh48i8t4pbv9u6ac2f13fdkujf';
# User@Host: lrp_dbuser[lrp_dbuser] @ lrp-application-server-2.internal.cloudapp.net [10.10.0.5]
# Thread_id: 3632570  Schema: lrp  QC_hit: No
# Query_time: 59.245218  Lock_time: 0.000000  Rows_sent: 1  Rows_examined: 0
# Rows_affected: 0  Bytes_sent: 70
SET timestamp=1666331691;
SELECT GET_LOCK('62f67a85ee74fca9c3b3c30a20f70f82', 300) AS ci_session_lock;
# User@Host: lrp_dbuser[lrp_dbuser] @ lrp-application-server-2.internal.cloudapp.net [10.10.0.5]
# Thread_id: 3634754  Schema: lrp  QC_hit: No
# Query_time: 15.116038  Lock_time: 0.000000  Rows_sent: 1  Rows_examined: 0
# Rows_affected: 0  Bytes_sent: 70
SET timestamp=1666331691;
SELECT GET_LOCK('6a046e7d67351ce96e8ed43e2c27eaed', 300) AS ci_session_lock;
# User@Host: lrp_dbuser[lrp_dbuser] @ lrp-application-server-1.internal.cloudapp.net [10.10.0.4]
# Thread_id: 3632996  Schema: lrp  QC_hit: No
# Query_time: 50.000255  Lock_time: 0.050056  Rows_sent: 0  Rows_examined: 0
# Rows_affected: 0  Bytes_sent: 67
SET timestamp=1666331691;
UPDATE `session_lrp` SET `timestamp` = 1666331641
WHERE `id` = '67aofej14f1eq8aqobq0hecuh7qavp1d';

Please let me know what do i need to optimize.

CodePudding user response：

Lack of INDEX?

# Query_time: 50.155999  Lock_time: 0.049785  Rows_examined: 150617
# Rows_affected: 1  Bytes_sent: 52
SET timestamp=1666331690;
UPDATE `session_lrp` SET `timestamp` = 1666331640
WHERE `id` = 'pghjejnvf8ou61usog5983h6lihh4jr0';

Is there an index on id? Please provide SHOW CREATE TABLE session_lrp

Usage of GET_LOCK?

# Query_time: 59.245218
# Rows_affected: 0  Bytes_sent: 70
SET timestamp=1666331691;
SELECT GET_LOCK('62f67a85ee74fca9c3b3c30a20f70f82', 300)
         AS ci_session_lock;

Please explain the usage of GET_LOCK -- It may be the cause of your problem. And there may be a much better workaround.

Timeout -> ROLLBACK

# Query_time: 50.000255  Lock_time: 0.050056

Do you check for errors? The query was not performed! Being that close to 50 seconds smells like you hit innodb_lock_wait_timeout.

Do not increase that setting. Instead:

fix the cause
decrease the timeout
change the code to detail with this and every other query that gets an error.