Tuesday, November 27, 2012

Realtime stats to pay attention to in Percona XtraDB Cluster and Galera

I learn more and more about Galera every day.  As I learn, I try to keep my myq_gadgets toolkit up to date with what I consider important to keep an eye on on a PXC node.  In that spirit, I just pushed some changes to the 'wsrep' report today, and I thought I'd go over some of the status variables and metrics tracked there, with the aim of showing folks what they should be watching (at least in my opinion; this is subject to change!).
First, let’s take a look at the output:
[root@node3 ~]# myq_status -t 1 wsrep
Wsrep    Cluster        Node           Queue   Ops     Bytes     Flow        Conflct
    time  name P cnf  #  name  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt dst lcf bfa
19:17:01 trime P   3  3 node3 Sync T/T   0   0  35  40  54K  61K 0.0   0  17   0   2
19:17:03 trime P   3  3 node3 Sync T/T   0   0  70  85 107K 124K 0.0   0  13   0   2
19:17:04 trime P   3  3 node3 Sync T/T   0   0  72  81 111K 121K 0.0   0  16   0   3
19:17:05 trime P   3  3 node3 Sync T/T   0   0  70  85 108K 124K 0.0   0  17   0   4
19:17:06 trime P   3  3 node3 Sync T/T   0   0  66  82 100K 124K 0.0   0  17   0   3
19:17:07 trime P   3  3 node3 Sync T/T   0   0  68  78 105K 117K 0.0   0  22   0   0
19:17:08 trime P   3  3 node3 Sync T/T   0   0  65  93 101K 135K 0.0   0  14   1   5
19:17:09 trime P   3  3 node3 Sync T/T   0   0  73  83 111K 125K 0.0   0  19   0   3
19:17:10 trime P   3  3 node3 Sync T/T   0   0  30  46  46K  66K 0.0   0  10   0   2
19:17:12 trime P   3  3 node3 Sync T/T   0   0  64  80  97K 120K 0.0   0  19   0   4
19:17:13 trime P   3  3 node3 Sync T/T   0   0  69  88 106K 131K 0.0   0  28   0   1
19:17:14 trime P   3  3 node3 Sync T/T   0   0  70  83 106K 121K 0.0   0  11   0   3
19:17:15 trime P   3  3 node3 Sync T/T   0   0  72  84 111K 126K 0.0   0  15   0   3
As I've mentioned before, myq_status gives an iostat-like output of your server.  The tool takes what are usually global counters in SHOW GLOBAL STATUS, calculates the change each second, and reports that.  There are lots of other reports it can run, but this one is focused on the 'wsrep%' status variables.
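The raw material is nothing more than the wsrep counters themselves; a minimal sketch of the same per-second math in the shell (assuming a local mysql client that can log in without prompting, e.g. via ~/.my.cnf) would look something like this:
# sample one wsrep counter twice, one second apart, and print the per-second delta
V=wsrep_replicated
A=$(mysql -NBe "SHOW GLOBAL STATUS LIKE '$V'" | awk '{print $2}')
sleep 1
B=$(mysql -NBe "SHOW GLOBAL STATUS LIKE '$V'" | awk '{print $2}')
echo "$V: $((B - A))/sec"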
It's important to note that this reflects the status of a single PXC node in my cluster (node3, to be precise), so some of the information is cluster-wide while other information is specific to this particular node.  I tend to open a window for each node and run the tool on each so I can see things across the entire cluster at a glance.  Sometime in the future I'd like to build a tool that polls every cluster node, but that's not available currently.
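In the meantime, a crude way to eyeball several nodes at once is to loop over them with the plain mysql client (the node1/node2/node3 hostnames and remote access for the monitoring user are assumptions here, not something the tool does):
# quick per-node spot check of a few interesting wsrep status variables
for h in node1 node2 node3; do
  echo "== $h =="
  mysql -h "$h" -NBe "SHOW GLOBAL STATUS WHERE Variable_name IN
    ('wsrep_local_state_comment','wsrep_local_recv_queue','wsrep_flow_control_sent')"
done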
Let’s go through the columns.

Cluster

There are 4 columns in the Cluster section, and it's important to understand that the tool currently connects only to a single node (by default, localhost).  The state of the cluster could be divergent across nodes, so be careful not to assume all nodes show these same values!

name

The cluster’s name (first 5 characters).  This is wsrep_cluster_name.

P

Either P for Primary or N for Non-primary.  This is the state of this partition of the cluster.  If the cluster gets split-brained, only a partition containing a quorum (more than half of the nodes) remains Primary.  Non-primary partitions are the remaining minority and will not allow database operations.

cnf

This is wsrep_cluster_conf_id, the version number of the cluster configuration.  It changes every time a node joins or leaves the cluster.  Seeing high (or frequently increasing) values here may indicate you have nodes frequently dropping out of and rejoining the cluster, and you may need to retune some node timeouts to prevent this.

#

The number of nodes in the cluster.
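If you want to pull these Cluster columns by hand, the name comes from a configuration variable and the other three from status counters; something like this with the plain mysql client shows them all:
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_cluster_name';
          SHOW GLOBAL STATUS WHERE Variable_name IN
            ('wsrep_cluster_status','wsrep_cluster_conf_id','wsrep_cluster_size')"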

Node

This is state data about the local node that the tool happens to be connected to.

name

The name of this local node (first 5 characters).  This is handy when you have this tool running in several windows on several nodes.

cmt

This is wsrep_local_state_comment: basically a plain-text word describing the state of the node in terms of the rest of the cluster.  'Sync' (Synced) is what you want to see, but 'Dono' (Donor), 'Join' (Joiner), and others are possible.  This is handy for quickly spotting which node was elected to donate either SST or IST to another node entering the cluster.

sta

Short for state, this is two True/False values (T/F) for wsrep_ready and wsrep_connected.  These are somewhat redundant with the local_state value, so I may remove them in the future.
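The raw values behind the Node columns can be checked the same way (the node name is a configuration variable, the rest are status variables):
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_node_name';
          SHOW GLOBAL STATUS WHERE Variable_name IN
            ('wsrep_local_state_comment','wsrep_ready','wsrep_connected')"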

Queue

This is information about the replication queues in both directions.
The ‘Up’ queue is outbound replication.  This generally increases when some other node is having difficulty receiving replication events.
The ‘Dn’ (down) queue is inbound replication.  Positive values here can be an indicator that this node is slow to apply replication writesets.
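The two queue columns map to the wsrep_local_send_queue and wsrep_local_recv_queue status variables; a quick spot check of the current values is just:
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_local_send_queue','wsrep_local_recv_queue')"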

Ops

Ops are simply replicated transactions, or writesets.  Up is outbound, i.e., transactions where this node was the originator.  Dn is inbound, that is, transactions coming from other nodes in the cluster.

Bytes

Just like Ops, but in bytes instead of transaction counts.  I have seen production clusters with performance issues where the Ops and Bytes dropped to zero on all the nodes for a few seconds, and then a massive 90M+ replication transaction came through.  Using the Up and Dn columns, I could easily see which node was the originator of that transaction.
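The Ops and Bytes columns are per-second changes in cumulative counters; the raw totals behind them are visible with:
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_replicated','wsrep_received','wsrep_replicated_bytes','wsrep_received_bytes')"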

Flow

Flow gives some information about Flow Control events.  Galera has some sophisticated ways of metering replication so lag does not become a problem.

pau

wsrep_flow_control_paused: the fraction of time, since the last time SHOW GLOBAL STATUS was run, that replication was paused due to flow control.  This is a general indicator that flow control is slowing replication (and hence overall cluster writes) down.

snt

wsrep_flow_control_sent: how many flow control events were sent from this node.  Handy for finding the node that is slowing the others down.

dst

This one doesn't really belong under the Flow group.  It is wsrep_cert_deps_distance, a general indicator of how many parallel replication threads you could use.  In practice I haven't found it extremely helpful yet, and I may remove it in the future.  I think being aware of how flow control works and watching flow control events and queue sizes is a better way to detect replication lag; this really just tells you whether multi-threaded replication could improve replication speed at all.
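All three columns displayed under Flow (pau, snt, and dst) map directly to status variables, so you can also look at the raw values:
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_paused','wsrep_flow_control_sent','wsrep_cert_deps_distance')"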

Conflct

Replication conflicts, as described in my last post.  lcf is local certification failures, and bfa is brute force aborts.  This should be helpful for understanding whether or not these conflicts are happening.
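Both conflict columns come from cumulative counters (the tool shows the change per second); the running totals are available directly:
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_local_cert_failures','wsrep_local_bf_aborts')"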

Interpreting the results

Let’s look at that output again and make some observations about our cluster and this node:
[root@node3 ~]# myq_status -t 1 wsrep
Wsrep    Cluster        Node           Queue   Ops     Bytes     Flow        Conflct
    time  name P cnf  #  name  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt dst lcf bfa
19:17:01 trime P   3  3 node3 Sync T/T   0   0  35  40  54K  61K 0.0   0  17   0   2
19:17:03 trime P   3  3 node3 Sync T/T   0   0  70  85 107K 124K 0.0   0  13   0   2
19:17:04 trime P   3  3 node3 Sync T/T   0   0  72  81 111K 121K 0.0   0  16   0   3
19:17:05 trime P   3  3 node3 Sync T/T   0   0  70  85 108K 124K 0.0   0  17   0   4
19:17:06 trime P   3  3 node3 Sync T/T   0   0  66  82 100K 124K 0.0   0  17   0   3
19:17:07 trime P   3  3 node3 Sync T/T   0   0  68  78 105K 117K 0.0   0  22   0   0
19:17:08 trime P   3  3 node3 Sync T/T   0   0  65  93 101K 135K 0.0   0  14   1   5
19:17:09 trime P   3  3 node3 Sync T/T   0   0  73  83 111K 125K 0.0   0  19   0   3
19:17:10 trime P   3  3 node3 Sync T/T   0   0  30  46  46K  66K 0.0   0  10   0   2
19:17:12 trime P   3  3 node3 Sync T/T   0   0  64  80  97K 120K 0.0   0  19   0   4
19:17:13 trime P   3  3 node3 Sync T/T   0   0  69  88 106K 131K 0.0   0  28   0   1
19:17:14 trime P   3  3 node3 Sync T/T   0   0  70  83 106K 121K 0.0   0  11   0   3
19:17:15 trime P   3  3 node3 Sync T/T   0   0  72  84 111K 126K 0.0   0  15   0   3
We can see we are connected to the node identified as 'node3'.  Our cluster is Primary and there are 3 nodes total belonging to it.
There isn't any replication queue activity, which I find is common except during cluster stalls.  There are clearly a fair number of transactions being replicated to and from this node: approximately 100K worth of data outbound, and just a hair more than that coming in.
Our replication is performing well, because the Flow control columns are zeroes, but we do see some replication conflicts.  Mostly these are brute force aborts, though I was able to see the (very occasional) local certification failure.  This makes sense to me, because the inbound replication queue always reports as empty, so replication seems to be applied nearly immediately, and local certification failures only happen when the inbound queue is > 0.  Brute force aborts, by contrast, happen when applying writesets rolls back locally open transactions.
In fact, this is a sysbench test running at full speed (these are VMs, so that's not particularly fast) on two of my three nodes, and more slowly on the third.  I had to decrease my table size from 250k rows to 2.5k to start seeing replication conflicts regularly.
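For reference, the workload was plain sysbench OLTP; something roughly like the following reproduces its shape (the hostnames, credentials, and thread count here are made-up examples, not my exact invocation):
sysbench --test=oltp --mysql-host=node1 --mysql-user=test --mysql-password=test \
  --mysql-db=test --oltp-table-size=2500 --num-threads=8 prepare
sysbench --test=oltp --mysql-host=node1 --mysql-user=test --mysql-password=test \
  --mysql-db=test --oltp-table-size=2500 --num-threads=8 --max-time=600 --max-requests=0 run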
Hopefully this is helpful for you to get an idea of how to observe and interpret Galera status variables.
