“The king is dead, long live the king”: Our Paxos-based consensus
In this blog post, we describe our Paxos-based solution, named eXtended COMmunications, or simply XCOM, which is a key component of MySQL Group Replication.
XCOM is responsible for disseminating transactions to the MySQL instances that are members of a group and for managing their membership. Its key functionalities are:
Ordered Delivery: Guarantees that messages (i.e. transactions) are delivered in the same order at all members.
Dynamic Membership: Provides functionalities to manage the set of MySQL Server instances belonging to the group.
Failure Detection: Along with Dynamic Membership, decides upon the fate of failed members.
At the beginning of the MySQL Group Replication project, we used Corosync as our group communication system, but we soon decided to switch to our own solution, as described in this blog post: Order from Chaos: Member Coordination in Group Replication.
In what follows, we present some background information on Group Replication and consensus, and then describe the design decisions behind XCOM.
1. Background
MySQL Group Replication is based on the Database State Machine. Update transactions are started and executed at a member without requiring any coordination with remote members. Upon commit, the transaction's changes are disseminated to all members in the group and delivered everywhere in the same order. Then a certification process checks whether there are conflicts among concurrent transactions, verifying whether any other transaction has tried to update a record that had already been updated. If there is no conflict, the transaction is committed at all members. Otherwise, it is aborted.
See further details on this process at MySQL Group Replication Transaction life cycle explained .
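The certification step can be illustrated with a small sketch. This is not the Group Replication implementation; the `Transaction` and `Certifier` names, the integer versions, and the write-set representation are all assumptions made for the example. The idea it shows is only that every member applies the same deterministic check to transactions delivered in the same order, so every member reaches the same commit or abort decision.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    # Hypothetical fields: the version the transaction observed when it
    # executed, and the keys of the records it updated.
    snapshot_version: int
    write_set: frozenset


@dataclass
class Certifier:
    # Number of transactions certified as committed so far.
    version: int = 0
    # Record key -> version of the transaction that last updated it.
    last_writer: dict = field(default_factory=dict)

    def certify(self, txn: Transaction) -> bool:
        """Return True to commit, False to abort."""
        # Conflict: a record in the write set was updated by a transaction
        # that committed after this transaction took its snapshot.
        for key in txn.write_set:
            if self.last_writer.get(key, 0) > txn.snapshot_version:
                return False
        self.version += 1
        for key in txn.write_set:
            self.last_writer[key] = self.version
        return True


certifier = Certifier()
t1 = Transaction(snapshot_version=0, write_set=frozenset({"row-a"}))
t2 = Transaction(snapshot_version=0, write_set=frozenset({"row-a"}))
first = certifier.certify(t1)   # no earlier writer of row-a: commit
second = certifier.certify(t2)  # row-a changed after t2's snapshot: abort
```

Because the check depends only on the delivery order and on data carried by the transactions, no further coordination between members is needed at this step.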
Behind this solution there is a group communication system which manages the membership and guarantees that the messages carrying transactions' changes are totally ordered. Membership management and totally ordered message delivery are instances of a fundamental concept in distributed systems known as consensus:
The consensus problem requires agreement among a number of processes (or agents) for a single data value .
In our case, MySQL Server instances reach an agreement on whether a new MySQL Server instance shall be added to the group or an old instance shall be removed from it or which transaction’s change shall be the next one to be delivered.
Paxos is probably the most well known consensus protocol and works in two phases.
Classic Paxos is a leader-oriented algorithm. As such, the first phase is called the Prepare or Leader Election phase, in which a member sends a message tagged with a ballot number to all members, suggesting that it wants to become the leader. Members that have not already promised to honor a request from another member with a higher ballot number will reply to the request. If the member gets replies from a majority of members in the group, it becomes the new leader. Otherwise, it will simply retry the election with a higher ballot number after a timeout, until it succeeds.
Note that this phase is only necessary if the current leader is suspected to have failed. Otherwise, only the second phase of the protocol is executed. See Good Leaders are game changes: Paxos & Raft for a discussion on leader election, ballot numbers, etc.
During the second phase, the leader proposes a value, which is considered accepted when the leader gets a reply from a majority of members. Messages are tagged with the leader's ballot number, and a member will only reply to the leader saying that it accepts the proposal if nobody else has tried to take over the leadership with a higher ballot number.
When the leader finds out that a proposal has been accepted, it sends a learn message to all members in the group saying that they can deliver the message to the Group Communication System Layer which will eventually deliver it to MySQL Group Replication. Usually, the learn message is piggybacked onto or batched along with other messages.
In practice, agreement is reached over a sequence of values and this protocol is known as Multi-Paxos.
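The member-side rules of the two phases described above can be sketched as follows. This is a textbook-style illustration for a single value, not XCOM code; the `Acceptor` class, its method names, and the `is_majority` helper are assumptions made for the example, and networking, timeouts and learn messages are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    # Highest ballot number this member has promised to honor.
    promised: int = -1
    # (ballot, value) of the last proposal this member accepted, if any.
    accepted: Optional[Tuple[int, str]] = None

    def on_prepare(self, ballot: int):
        """Phase 1: reply only if no higher ballot has been promised."""
        if ballot > self.promised:
            self.promised = ballot
            # The reply carries any previously accepted proposal.
            return ("promise", ballot, self.accepted)
        return None  # stay silent; the candidate retries with a higher ballot

    def on_accept(self, ballot: int, value: str) -> bool:
        """Phase 2: accept unless someone took over with a higher ballot."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def is_majority(replies: int, group_size: int) -> bool:
    """True when strictly more than half of the group has replied."""
    return replies > group_size // 2


group = [Acceptor() for _ in range(3)]
promises = [a.on_prepare(ballot=1) for a in group]
elected = is_majority(sum(p is not None for p in promises), len(group))
votes = [a.on_accept(ballot=1, value="txn-1") for a in group]
chosen = is_majority(sum(votes), len(group))
```

Once `chosen` is true, the leader would send the learn message mentioned above; a member that later sees a prepare with ballot 2 would reject any further accept requests tagged with ballot 1.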
Note that with classic or standard Multi-Paxos, every member has to send its transaction's changes to the leader, which then guarantees totally ordered message delivery. If the transaction was executed at the member that was elected Paxos leader, there is no problem. Otherwise, there is an extra communication step, and the leader usually becomes a bottleneck:
Such protocols scale poorly, because as the number of replicas or the load on the system increases, the leader replica quickly reaches the limits of one of its resources.
2. XCOM
MySQL Group Replication is a multi-master, update-everywhere solution, and a transaction's changes may originate at any member in the group. Having a leader-based protocol would clearly harm scalability, as all updates would have to go through the leader, which would then be responsible for disseminating them. So our first goal was to overcome the possible bottleneck of a single-leader approach, and we created a multi-leader or, more precisely, a multi-proposer solution. This approach has some similarities to Mencius (for example, it uses skip messages), but the overall design is closer to the original Paxos.
In this protocol, every member has an associated unique number and reserved slots in the stream of totally ordered messages. For example, with three members:
member 0 will get slots: 0, 3, 6, … 3 * n + 0
member 1 will get slots: 1, 4, 7, … 3 * n + 1
member 2 will get slots: 2, 5, 8, … 3 * n + 2
In a group with 'g' members, the next slot available to a member is given by the formula:
g * n + member's number
where 'n' is a monotonic counter kept by each member and incremented every time a proposal is sent.
So there is no leader election, and each member is the leader of its own slots in the stream of messages. Members can propose messages for their slots without having to wait for other members, although there is a limit on how far ahead they can get, as we will describe in the next section.
2.1 Handling Gaps
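The slot formula is easy to check with a few lines of Python. The function names below are made up for this illustration; only the arithmetic comes from the text.

```python
from itertools import count, islice


def slot(group_size: int, member: int, n: int) -> int:
    """Slot used by `member` for its n-th proposal: g * n + member's number."""
    return group_size * n + member


def owner(slot_number: int, group_size: int) -> int:
    """Inverse view: the member whose reserved slot this is."""
    return slot_number % group_size


def slots(group_size: int, member: int):
    """Endless stream of the slots reserved for `member`."""
    return (slot(group_size, member, n) for n in count())


first_slots = {m: list(islice(slots(3, m), 3)) for m in range(3)}
# first_slots == {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5, 8]}
```

Since each slot number maps back to exactly one member, two members can never propose for the same slot under this assignment.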
If members may propose messages for their own slots without any coordination, gaps are likely to appear in the message stream. For example, members 1 and 2 may have reached agreement on messages 1 and 2, respectively, while member 0, for whatever reason, has not yet proposed or reached agreement on message 0.
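To see why a gap matters, consider a minimal sketch of in-order delivery. It assumes a hypothetical `decided` map from slot number to agreed message and does not show how XCOM itself fills a gap; it only shows that totally ordered delivery cannot move past a slot with no agreed message.

```python
def deliverable(decided: dict, next_slot: int):
    """Return the messages that can be delivered in order, plus the new next slot.

    Delivery stops at the first slot that has no agreed message yet.
    """
    ready = []
    while next_slot in decided:
        ready.append(decided[next_slot])
        next_slot += 1
    return ready, next_slot


# Members 1 and 2 have agreement on their slots, member 0 does not yet.
decided = {1: "msg-from-1", 2: "msg-from-2"}
blocked, position = deliverable(decided, next_slot=0)

# Once slot 0 is decided, everything up to the next gap can be delivered.
decided[0] = "msg-from-0"
ready, position_after = deliverable(decided, next_slot=position)
```

In the first call nothing is delivered even though two of the three messages are already agreed upon, which is exactly the situation in the example above.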