Thursday, April 12, 2012

Extended Stats

I’m very keen on the 11g extended stats feature, but I’ve just discovered a critical weakness in one of the implementation details that could lead to some surprising instability in execution plans. It’s a combination of “column group” statistics and “out of range” predicates. Let’s start with some sample data. (Note: I was running this test on 11.2.0.3):
create table t1
as
with generator as (
        select  --+ materialize
                rownum  id
        from    dual
        connect by
                level <= 10000
)
select
        mod(rownum,100) col1,
        mod(rownum,10)  col2
from
        generator       v1,
        generator       v2
where
        rownum <= 50000
;

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for all columns size 1'
        );
end;
/


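If you want to check the single column statistics for yourself before going any further, a query along these lines against user_tab_col_statistics should report 100 distinct values for col1 and 10 for col2:

select
        column_name, num_distinct, density
from
        user_tab_col_statistics
where
        table_name = 'T1'
order by
        column_name
;
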
I’ve got 100 distinct values for col1 and 10 distinct values for col2 – so the optimizer will assume that I’ve got 1,000 distinct combinations of the two. Since there are 50,000 rows, this means 50 rows per combination. So let’s see this demonstrated:
set autotrace traceonly explain

select *
from t1
where
col1 = 50
and col2 = 5
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    50 |   300 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |    50 |   300 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("COL1"=50 AND "COL2"=5)


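The estimate looks reasonable until you count the distinct combinations that actually exist in the table:

select
        count(distinct col1 || '/' || col2)     combinations,
        count(*)                                row_count
from
        t1
;
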
If you examine my data definition you will see why that count comes back at just 100: the col1 and col2 values align in a way that means there are really only 100 distinct combinations, with 500 rows per combination. This looks like one of those cases where extended statistics might be useful. So let’s create a suitable set:
begin
        dbms_output.put_line(
                dbms_stats.create_extended_stats(
                        ownname   => user,
                        tabname   => 'T1',
                        extension => '(col1, col2)'
                )
        );

        dbms_stats.gather_table_stats(
                ownname      => user,
                tabname      => 'T1',
                block_sample => true,
                method_opt   => 'for all hidden columns size 1'
        );
end;
/

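The call to dbms_stats.create_extended_stats() returns the system-generated name of the hidden virtual column that implements the column group – which is why I’ve wrapped it in dbms_output.put_line(). If you want to see the extension, and confirm that the new column reports 100 distinct values, queries along these lines should do the job:

select
        extension_name, extension
from
        user_stat_extensions
where
        table_name = 'T1'
;

select
        column_name, num_distinct
from
        user_tab_col_statistics
where
        table_name = 'T1'
;
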

With these stats in place, I repeat my query with the following effect:
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500 |  3000 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |   500 |  3000 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("COL1"=50 AND "COL2"=5)


Oracle has used the “column group” statistics to derive a good estimate of the combined selectivity and produced an estimate of 500 rows. But what happens as new data (following the same pattern) is entered and the queries change to keep up with the data? Consider, for example, the slightly artificial query:
select *
from t1
where
col1 = 50
and col2 = 9.001
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    50 |   300 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |    50 |   300 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("COL1"=50 AND "COL2"=9.001)

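As a side note: if you want to see the value ranges the optimizer is working from, the low and high values are exposed (in raw format) in user_tab_col_statistics, and for numeric columns something like utl_raw.cast_to_number() will decode them:

select
        column_name,
        utl_raw.cast_to_number(low_value)       low_value,
        utl_raw.cast_to_number(high_value)      high_value
from
        user_tab_col_statistics
where
        table_name = 'T1'
and     column_name in ('COL1', 'COL2')
;
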

I’ve used the value 9.001 because it is only just outside the known value range for col2 (which ranges from 0 to 9) – and the optimizer has immediately switched from the extended stats back to the single column stats and the “independent predicates” mechanism to estimate the cardinality, leading to a dramatic change in the estimate.
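For the record, that 50 is just the old independence arithmetic reappearing: 50,000 * (1/100) * (1/10) = 50. If the optimizer is applying any out-of-range scaling to the col2 selectivity at 9.001, it is far too small to show up in the rounded estimate.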
This strategy may be a deliberate choice on the part of the developers, but personally I don’t think it’s appropriate. If you’ve created “column group” stats like this it seems reasonably likely that you’re trying to show Oracle a consistent pattern in the data; however it’s quite possible that the data you’re using shows some sort of time-dependency, such that the pattern doesn’t change but the actual values in at least one of the columns keep increasing.
In this case, applying the normal “linear decay” strategy to the extended stats seems much more sensible – if one (or both) of the predicates is slightly out of range then use the extended stats to derive an estimate and then apply the decay factor(s) to that value, rather than falling back on multiplying the individual selectivities.
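To put a rough number on that suggestion: assuming the usual linear decay arithmetic, the column group gives a base estimate of 500 rows, and with col2 = 9.001 sitting just 0.001 above a high value of 9 (on a range of 9) the decay factor would be about 1 - 0.001/9 – so the estimate would still round to 500, which describes data that continues to follow the original pattern far better than the 50 we actually get.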
(The case where I saw this problem was one where rows were constantly being inserted and deleted. One column was a time-based audit id, the other was an action id, and every audit id produced a few dozen action ids, with some overlap of actions across audit ids. New, higher valued, audit rows were constantly being added to this table, while old, lower ones, were being deleted. The effect was that the extended stats became useless within seconds of being generated.)


Footnote:


Coincidentally, Randolf Geist highlighted another edge case a couple of days ago – but it hasn’t stopped me thinking that extended stats, and “column group” stats especially, are a terrific feature.
