ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

生产环境新上线集群小概率gtid空洞问题

2022-01-07 15:31:24  阅读:178  来源: 互联网

标签:Last Log Replicate SSL 空洞 Master master 集群 gtid


背景:生产环境新上集群会产生gtid空洞问题

例子:

群集名

服务器

端口

master UUID

复制异常点

ibupub1group01

10.60.52.168

55944

9cd48d9c-6e29-11ec-92d8-98039ba567ea

存在GAP: 间隔10秒两次动态GTID比较:1-402:404-5103#1-402:404-5109

 

线索1:查看binlog发现空洞gtid的binlog不存在,且问题发生于集群已经搭建完毕,已经开始写心跳的阶段后

 

[op1@SVR16599HP360 tmp]$ cat log.txt |grep "9cd48d9c-6e29-11ec-92d8-98039ba567ea:40" -A 10 | head -n 50

SET @@SESSION.GTID_NEXT= '9cd48d9c-6e29-11ec-92d8-98039ba567ea:402'/*!*/;

# at 159796

#220105 22:45:58 server id 52153  end_log_pos 159880 CRC32 0xdc731cf7    Query     thread_id=6068       exec_time=0   error_code=0

SET TIMESTAMP=1641393958/*!*/;

BEGIN

/*!*/;

# at 159880

#220105 22:45:58 server id 52153  end_log_pos 159957 CRC32 0x49efcff4      Rows_query

# replace into repl_heartbeat

# values (@@hostname,now())

# at 159957

--

SET @@SESSION.GTID_NEXT= '9cd48d9c-6e29-11ec-92d8-98039ba567ea:404'/*!*/;

# at 160194

 

继续查:

线索2:线上已有集群写心跳不会出现问题,为什么新上集群才有小概率会发生?

猜测:和写心跳这个操作无关,写心跳只是gtid增长的手段,新上集群和已有集群的差别在于新上集群在搭建集群环节有reset master操作,怀疑到语句reset master

验证

经过一系列测试,接下来的操作可100%稳定复现:

第一步:在测试集群上部署一个小脚本,while 1 不停的在master上写心跳,然后在master上手工reset master。然后观察Executed_Gtid_Set:

mysql> show master status;
+-------------------------+----------+--------------+------------------+-----------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+-------------------------+----------+--------------+------------------+-----------------------------------------------+
| FATM00003214-bin.000001 | 12771 | | | 744c6c9a-766d-11eb-9807-fa163e6a649a:1-31:135 |
+-------------------------+----------+--------------+------------------+-----------------------------------------------+

发现会稳定出现一个不存在的幽灵gtid 135,且binlog中搜不到这个gtid的相关信息:

mysqlbinlog -vvv FATM00003214-bin.000001|grep 744c6c9a-766d-11eb-9807-fa163e6a649a:135   ---》 返回空

 

第二步:模拟搭建集群操作,在slave上stop slave;reset master;change master to xxx;start slave;show slave status;

mysql> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 10.6.26.233

                  Master_User: usvr_replication

                  Master_Port: 55111

                Connect_Retry: 60

              Master_Log_File: FATM00003214-bin.000001

          Read_Master_Log_Pos: 15585

               Relay_Log_File: FATM00003215-relay-bin.000002

                Relay_Log_Pos: 15812

        Relay_Master_Log_File: FATM00003214-bin.000001

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 15585

              Relay_Log_Space: 16082

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 10001

                  Master_UUID: 744c6c9a-766d-11eb-9807-fa163e6a649a

             Master_Info_File: mysql.slave_master_info

                    SQL_Delay: 0

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

           Master_Retry_Count: 86400

                  Master_Bind:

      Last_IO_Error_Timestamp:

     Last_SQL_Error_Timestamp:

               Master_SSL_Crl:

           Master_SSL_Crlpath:

           Retrieved_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-38

            Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-38

                Auto_Position: 1

         Replicate_Rewrite_DB:

                 Channel_Name:

           Master_TLS_Version:

1 row in set (0.00 sec)

 

Slave在向master拉取gtid时,拉取到了理论上应该存在的744c6c9a-766d-11eb-9807-fa163e6a649a:1-38,以及幽灵gtid 135,然后因为binlog里没有gtid135,导致slave上不会执行,不会记载在Executed_Gtid_Set中。

此时集群已经搭建完成,在心跳不断的写入的情况下,master的gtid :1-38:135会被填平至例如1-169,但因为master的135是不存在的,slave上就会被跳过,至此产生空洞:

 

mysql> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 10.6.26.233

                  Master_User: usvr_replication

                  Master_Port: 55111

               Connect_Retry: 60

              Master_Log_File: FATM00003214-bin.000001

          Read_Master_Log_Pos: 68475

               Relay_Log_File: FATM00003215-relay-bin.000002

                Relay_Log_Pos: 68702

        Relay_Master_Log_File: FATM00003214-bin.000001

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 68475

              Relay_Log_Space: 68972

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 10001

                  Master_UUID: 744c6c9a-766d-11eb-9807-fa163e6a649a

             Master_Info_File: mysql.slave_master_info

                    SQL_Delay: 0

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

           Master_Retry_Count: 86400

                  Master_Bind:

      Last_IO_Error_Timestamp:

     Last_SQL_Error_Timestamp:

               Master_SSL_Crl:

           Master_SSL_Crlpath:

           Retrieved_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-134:136-169

            Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-134:136-169

                Auto_Position: 1

         Replicate_Rewrite_DB:

                 Channel_Name:

           Master_TLS_Version:

1 row in set (0.00 sec)

 

结论:集群上线的过程中可能在mysql实例部署完毕后某个脚本就开始运行,然后这个采集脚本的insert动作和master的reset master操作撞车,产生那个幽灵gtid,导致后续slave/dr上产生gtid空洞。至于为什么撞车这个动作会生成一个幽灵gtid尚且不知

解决方法:集群搭建流程中master在reset master之前flush tables with read lock; 等最后再打开,杜绝撞车的可能性

 

标签:Last,Log,Replicate,SSL,空洞,Master,master,集群,gtid
来源: https://www.cnblogs.com/wyznb/p/15775245.html

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有