VizualBusiness: Here’s a quick way to Foresee if Replication Slave is ever going to catch up and When!

Thursday, August 30, 2012

Here’s a quick way to Foresee if Replication Slave is ever going to catch up and When!

Here’s a quick way to Foresee if Replication Slave is ever going to catch up and When!:
If you ever had a replication slave that is severely behind, you probably noticed that it’s not catching up with a busy master at a steady pace. Instead, the “Seconds behind master” is going up and down so you can’t really tell whether the replica is catching up or not by looking at just few samples, unless these are spread apart. And even then you can’t tell at a glance when it is going to catch up.
Normally, the “severely behind” thing should not happen, but it does often happen in our consulting practice:

sometimes replication would break and then it needs to catch up after it is fixed,

other times new replication slave is built from a backup which is normally hours behind,

or, it could be that replication slave became too slow to catch up due to missing index

Whatever the case is, single question I am being asked by the customer every time this happens is this: When is the replica going to catch up?”
I used to tell them “I don’t know, it depends..” and indeed it is not an easy question to answer. There are few reasons catching up is so unstable:

If you have restarted the server, or started a new one, caches are cold and there’s a lot of IO happening,

Not all queries are created equal – some would run for seconds, while others can be instant,

Batch jobs: some sites would run nightly tasks like building statistics tables or table checksum – these are usually very intense and cause slave to backup slightly.

I didn’t like my own answer to The question, so I decided to do something about it. And because I love awk, I did that something in awk:

delay=60
cmd="mysql -e 'show slave status\G' | grep Seconds_Behind_Master | awk '{print \$2}'"
while sleep $delay; do
  eval $cmd
done | awk -v delay=$delay '
{
   passed += delay;
   if (count%10==0)   
      printf("s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h\n");
   if (prev==NULL){
      prev = $1;
      start = $1;
   }
   speed = (prev-$1)/delay;
   o_speed = (start-($1-passed))/passed
   if (speed == 0)    speed_d = 1;
     else             speed_d = speed;
   eta = $1/speed_d;
   if (eta<0)         eta = -86400;
   o_eta = $1/o_speed;
   printf("%8d %8.6f %9.3f %7.3f | %9.3f %7.3f %7.2f\n",
      $1, $1/86400, speed, eta/86400, o_speed, o_eta/86400, o_eta/3600);
   prev=$1;
   count++;
}'

I don't know if this is ever going to become a part of a Percona Toolkit, however since it's pretty much a one-liner, I just keep it in my snippets pool for easy copy'n'paste.
Here's a piece of an output from a server that was almost 27 days behind just yesterday:

// at the beginning:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 2309941 26.735428     0.000  26.735 |     1.000  26.735  641.65
 2309764 26.733380     2.950   9.062 |     2.475  10.801  259.23
 2308946 26.723912    13.633   1.960 |     6.528   4.094   98.25
 2308962 26.724097    -0.267  -1.000 |     5.079   5.262  126.28
 2309022 26.724792    -1.000  -1.000 |     4.063   6.577  157.85
...
// after one hour:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 2264490 26.209375    39.033   0.671 |    13.418   1.953   46.88
 2262422 26.185440    34.467   0.760 |    13.774   1.901   45.63
 2261702 26.177106    12.000   2.181 |    13.762   1.902   45.65
...
// after three hours:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 2179124 25.221343    13.383   1.885 |    13.046   1.933   46.40
 2178937 25.219178     3.117   8.092 |    12.997   1.940   46.57
 2178472 25.213796     7.750   3.253 |    12.973   1.943   46.64
...
// after 12 hours:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 1824590 21.117940    20.233   1.044 |    12.219   1.728   41.48
 1823867 21.109572    12.050   1.752 |    12.221   1.727   41.46
 1823089 21.100567    12.967   1.627 |    12.223   1.726   41.43
...
// after 21 hours:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 1501659 17.380312    -0.533  -1.000 |    11.768   1.477   35.44
 1501664 17.380370    -0.083  -1.000 |    11.760   1.478   35.47
 1501689 17.380660    -0.417  -1.000 |    11.751   1.479   35.50
...

Of course, it is still not perfectly accurate and it does not account for any potential changes in queries, workload, warm-up, nor the time it takes to run the mysql cli, but it does give you an idea and direction that replication slave is going. Note, negative values mean replication isn't catching up, but values themselves are mostly meaningless.
Here's what the weird acronyms stand for:

s_behind - current Seconds_Behind_Master value

d_behind - number of days behind based on current s_behind

c_sec_s - how many seconds per second were caught up during last interval

eta_d - this is ETA based on last interval

O_c_sec_s - overall catch-up speed in seconds per second

O_eta_d - ETA based on overall catch-up speed (in days)

O_eta_h - same like previous but in hours

Let me know if you ever find this useful.

PlanetMySQL Voting:
Vote UP /
Vote DOWN

DIGITAL JUICE

VizualBusiness

Thursday, August 30, 2012

Here’s a quick way to Foresee if Replication Slave is ever going to catch up and When!

No comments:

Post a Comment

Digital Juice for Primary Education

About Me

Translate

Total Pageviews

Thursday, August 30, 2012

Here’s a quick way to Foresee if Replication Slave is ever going to catch up and When!

No comments:

Post a Comment

Digital Juice for Primary Education

About Me

Translate

Subscribe To

Total Pageviews