Sunday, May 13, 2012

Another debugging story...

Another debugging story...: There is a saying "correlation is not causation" meaning - just because you observe A and then observe B, it does not mean that A causes B - even if every single time you observe A - you see B as a 'side effect'.

Here is a great story to back that up - a debugging session where the end users had determined that they could not send email more than 500 miles.  Read it and note how much work the end users had gone into 'proving' that email cannot be sent more than 500 miles...

I use a similar story when talking about bind variables and bind variable peeking.  In the story - I make the claim that "when ever it rains on a Monday morning - you have to restart the database in the afternoon to make it perform correctly".  That is - rain on Mondays implies poor performance, and the only fix is to reboot.

This conclusion was reached via empirical observation - every single time, *every single time*, it would rain on Monday mornings - the database would respond with really poor response times to certain critical queries in the afternoon.  Every single time, without failure.  It must be that rain affects database performance somehow.

But here is what really was happening.  The database in question was a database that was backed up using cold backups (DBA's were of the archaic type).  These cold backups took place Sunday nights - late at night.  That meant that every Monday morning all of the caches would be virtually empty.  No parsed SQL for the applications in particular would be present.  So - Monday morning was the morning of hard parses every single week.

Now, in the critical application - there was a query that went against some very skewed data.  99.9999% of the time - the queries would be very selective, they would return 1 or 2 rows.  The other times - they would be very non-selective returning much of the table - but these queries were rare.

Now, the main user of this application was a person that came in early most of the time - they would get into the office at 6am and start working.  They always used bind inputs that returned very few rows - hence the problem queries would hard parse and tend to use indexes.  When this happened - everyone was happy.

The other set of users - they came in later, usually around 9am or so.  They would run the non-selective query and were OK with the index performance.  It might not have been the *best* possible performance - but it worked OK for them.

So, what was the link to rain?  Well, the office where everyone worked was located in Washington DC - when it rains in DC it doesn't matter if you leave home at 6am - it'll been noon before you get to work.  If you leave at 7am - it'll be noon when you get to work.  If you leave at 8am - it'll still be noon before you get to work.  In short - it really didn't matter when you left for work - if you are driving into the city - it is going to take a long time.  So, when it rained, the person that did the selective queries would just hit the snooze button on the alarm and go back to sleep.  Not worth driving in yet.

However, the people that did the non-selective queries - they lived in an apartment across the street from the  office.  They were not affected by the rain.  They came in at 9am regardless.  Since they ran the non-selective queries and the cache was empty - they would hard parse the queries and result with full scan plans.  This group of people was in fact a little convinced that rain *helped* performance a bit - when it rained on Mondays - they saw slightly better performance in the morning.

When the person that slept in finally got to work - and ran the application - they would definitely notice the full scans and how slow the database was.  Since it was raining - and they had observed this hundreds of times before - they call the data center - tell them "It is Monday, it is raining, you know what to do", after the reboot, everything is running fast again.

But - did it have anything to do with rain - or was it something entirely different :)

Getting to the root cause (bind peeking issue) can lead you to an efficient, effective corrective action (perhaps query plan stability for a certain set of queries for example).

Going down the empirical path of "correlation must imply causation" will lead to never actually fixing the problem - and the invention of many myths...



ICT4PE&D

No comments:

Post a Comment

Thank's!