記一次項(xiàng)目上MySQL死鎖Deadlock的排查優(yōu)化過程
起因是最近兩天收到了線上項(xiàng)目的告警通知,隔一段時(shí)間會(huì)出現(xiàn)幾筆MySQL的死鎖Deadlock的錯(cuò)誤,錯(cuò)誤日志如下:
updating database. Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock;
try restarting transaction ### The error may exist in com/.../XXXMapper.java (best guess)
### The error may involve com.....MsgFrameSmsMapper.update-Inline
### The error occurred while setting parameters
### SQL: UPDATE msg SET success_count=success_count+1 WHERE (batch_name = ?)
### Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction ;
Deadlock found when trying to get lock;
try restarting transaction; nested exception is com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock;
...
看了下日志位置以及執(zhí)行的sql定位到了功能模塊,其模塊的功能概括如下圖:

三方服務(wù)會(huì)給我們推送批次號(hào)以及狀態(tài)的回調(diào)通知,一次通知中可能會(huì)包含多個(gè)批次號(hào),每個(gè)批次號(hào)可能會(huì)產(chǎn)生多個(gè)子狀態(tài),每個(gè)子狀態(tài)都需要通知到我們,我們根據(jù)子狀態(tài)來決定來更新表中的數(shù)量字段。
庫使用的引擎是InnoDB,batch_name加了索引,找DBA使用 SHOW ENGINE INNODB STATUS 命令查詢了下mysql死鎖的日志:
LATEST DETECTED DEADLOCK
------------------------
2025-05-27 10:00:29 0x7fbbb7ce7700
*** (1) TRANSACTION:
TRANSACTION 29608052, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 42 lock struct(s), heap size 8400, 47 row lock(s), undo log entries 75
MySQL thread id 11897, OS thread handle 140446306776832, query id 54085329 10.16.2.173 user_sms updating
UPDATE msg SET success_count=success_count+1 WHERE (batch_name = '1927183012550696960')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 188 page no 81 n bits 248 index ix_batch_name of table `ums_sms`.`msg` trx id 29608052 lock_mode X locks rec but not gap waiting
Record lock, heap no 98 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303132353530363936393630; asc 1927183012550696960;;
1: len 8; hex 1abebabd41802001; asc A ;;
*** (2) TRANSACTION:
TRANSACTION 29608051, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
31 lock struct(s), heap size 3520, 44 row lock(s), undo log entries 79
MySQL thread id 11906, OS thread handle 140444219373312, query id 54085339 10.16.2.173 user_sms updating
UPDATE msg SET success_count=success_count+1 WHERE (batch_name = '1927183018338836480')
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 188 page no 81 n bits 248 index ix_batch_name of table `ums_sms`.`msg` trx id 29608051 lock_mode X locks rec but not gap
Record lock, heap no 78 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313832393939393830333731393638; asc 1927182999980371968;;
1: len 8; hex 1abebaba54402001; asc T@ ;;
Record lock, heap no 93 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303038333934313435373932; asc 1927183008394145792;;
1: len 8; hex 1abebabc49802001; asc I ;;
Record lock, heap no 98 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303132353530363936393630; asc 1927183012550696960;;
1: len 8; hex 1abebabd41802001; asc A ;;
Record lock, heap no 99 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303132383639343638313630; asc 1927183012869468160;;
1: len 8; hex 1abebabd54401001; asc T@ ;;
Record lock, heap no 100 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303133383637373038343136; asc 1927183013867708416;;
1: len 8; hex 1abebabd90001001; asc ;;
Record lock, heap no 141 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303139323533313934373532; asc 1927183019253194752;;
1: len 8; hex 1abebabed0c02001; asc ;;
Record lock, heap no 157 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303231303339393732333532; asc 1927183021039972352;;
1: len 8; hex 1abebabf3b802001; asc ; ;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 188 page no 81 n bits 248 index ix_batch_name of table `ums_sms`.`msg` trx id 29608051 lock_mode X locks rec but not gap waiting
Record lock, heap no 129 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 19; hex 31393237313833303138333338383336343830; asc 1927183018338836480;;
1: len 8; hex 1abebabe9a802001; asc ;;
意思就是:
事務(wù)1 (29608052) 嘗試執(zhí)行sql語句:
UPDATE msg SET success_count=success_count+1 WHERE (batch_name = '1927183012550696960');
等待鎖定的記錄是batch_name = '1927183012550696960'
事務(wù)2(29608051) 執(zhí)行了幾個(gè)批次的update語句,其中包括:
UPDATE msg SET success_count=success_count+1 WHERE (batch_name = '1927183012550696960');
也就是事務(wù)2已經(jīng)獲得了batch_name = '1927183012550696960'的排它鎖,然后等待鎖定的記錄是batch_name = '1927183018338836480',也就是嘗試執(zhí)行下面的sql語句:
UPDATE msg SET success_count=success_count+1 WHERE (batch_name = '1927183018338836480');
根據(jù)mysql的死鎖日志可以推斷出,事務(wù)1在等待batch_name='1927183012550696960'也就是被事務(wù)2執(zhí)行的排它鎖(堆號(hào)98),事務(wù)2在等待batch_name='1927183018338836480'的排它鎖(堆號(hào)129),而這條記錄被事務(wù)事務(wù)1(或其他事務(wù),但根據(jù)日志上下文可推斷為事務(wù)1)間接依賴或鎖定,以此形成了循環(huán)依賴。
項(xiàng)目中會(huì)通過p6spy打印sql的執(zhí)行日志,根據(jù)traceId看了下批次號(hào)1927183012550696960和1927183018338836480都存在于兩條MQ消息中,說明這兩個(gè)批次號(hào)都接收到了兩次通知,這個(gè)三方服務(wù)本來只有一家,當(dāng)時(shí)對(duì)接的時(shí)候約定同一個(gè)批次號(hào)下的子狀態(tài)會(huì)聚合,不會(huì)出現(xiàn)一個(gè)批次號(hào)通知多次的場(chǎng)景,但是后面隨著后面其它廠商的接入,對(duì)接的時(shí)候沒有確定這一點(diǎn),導(dǎo)致了一個(gè)批次號(hào)多次推送從而出現(xiàn)了上述的問題,并且這個(gè)問題也不是必現(xiàn),正好要碰到這樣的場(chǎng)景才會(huì)觸發(fā)。
好在一條MQ消息中的某一個(gè)批次號(hào)執(zhí)行失敗,整個(gè)MQ消息的事務(wù)會(huì)回滾,然后其中一條MQ消息進(jìn)入重試隊(duì)列,重試之后就不會(huì)出現(xiàn)事務(wù)爭搶的情況,因?yàn)镸ySQL會(huì)選擇死鎖中的一個(gè)事務(wù)執(zhí)行成功,根據(jù)MySQL日志觀察是事務(wù)2被回滾了(事務(wù)2回滾的日志沒有貼出),事務(wù)1被提交。
畢竟有這些錯(cuò)誤日志告警持續(xù)產(chǎn)生,得想辦法解決了,首先想到的就是化數(shù)據(jù)庫的鎖為redis的鎖了,也就是在執(zhí)行某一個(gè)批次號(hào)的更新動(dòng)作時(shí),先使用redis獲取這一個(gè)批次號(hào)的鎖,獲得鎖之后進(jìn)行操作,否則等待,不過這還牽扯出一個(gè)新的問題,就是一條消息里面的多個(gè)批次在處理時(shí)是一個(gè)完整的事務(wù),只有等待所有批次處理完成后,才能釋放掉這條消息中每一個(gè)批次的鎖,初步方案如下:

方案不夠好,還有優(yōu)化的空間,想了下,可以借助MQ的唯一消息id和重試隊(duì)列來實(shí)現(xiàn),縮小事務(wù)范圍,將一條消息中的批次號(hào)作為一個(gè)事務(wù),將循環(huán)批次的代碼放在事務(wù)外面,一個(gè)批次處理成功、事務(wù)提交成功之后通過redis記錄,key是MQ的msgId+批次號(hào),value按需設(shè)置,我這邊存儲(chǔ)的是批次號(hào)處理成功時(shí)候的時(shí)間戳,因?yàn)楹竺嬷卦囘@條消息里面處理失敗的批次號(hào)時(shí),跳過處理這條消息里面成功的批次號(hào)是根據(jù)msgId+批次號(hào)去redis里查詢的。
項(xiàng)目上使用的MQ是RocketMQ,RocketMQ從4.7.1版本開始,一條消息重試時(shí)消息的msgId會(huì)變化,但是可以通過讀取消息中的屬性UNIQ_KEY來獲取原始的msgId,RocketMQ的MessageClientIDSetter類中也提供該方法:
public static String getUniqID(final Message msg) {
return msg.getProperty(MessageConst.PROPERTY_UNIQ_CLIENT_MESSAGE_ID_KEYIDX);
}
那么改版后的方案如下:

這個(gè)方案的好處就是一個(gè)事務(wù)里面不會(huì)牽扯到多個(gè)批次號(hào),化大事物為小事物,也就避免了交叉獲取批次號(hào)的鎖從而發(fā)生死鎖的問題,也不用使用redis對(duì)每一個(gè)批次號(hào)進(jìn)行加鎖,跨事務(wù)操作更新某一個(gè)批次號(hào)的數(shù)據(jù)時(shí),其中一個(gè)事務(wù)會(huì)等待另一個(gè)事務(wù)提交之后執(zhí)行。
本文來自博客園,作者:杜勁松,轉(zhuǎn)載請(qǐng)注明原文鏈接:http://www.rzrgm.cn/imadc/p/18898998

浙公網(wǎng)安備 33010602011771號(hào)