Friday, August 10, 2012

Debugging

Debugging:
One of my recent assignments involved a company that had run into some performance problems after upgrading from 10.2.0.3 to 11.2.0.2. We had spent half an hour on the phone discussing the system before I had arrived, and I’d made a couple of suggestions that had solved most of their problems before I got on site – but they still wanted me to come in and give them some specific ideas about why the critical part of the solution had helped.
The most critical piece of advice I had given them (after listening very carefully to their description of the system) was to get rid of ALL the histograms they had on their system, and then watch very carefully for any signs that they might need to re-introduce a handful of histograms over the next few weeks.
One of their critical queries completed in less that 2 seconds when histograms were removed, but took 33 seconds to complete when histograms were in place. With their permission, the following notes record my investigation of this puzzle, the underlying Oracle bug (possibly not yet documented) that caused it, and the optimum workaround that was available to them.

The Query

The application vendor had created a number of views that were designed to hide complexity from the Java programmers, so the problem query looked very simple:

SELECT
 *
FROM
 wltuser.VLADUNGEN_FREI
WHERE
 MATERIALNR = '117098'
AND wltuser.LVSKOMM.FuncSel_BatchReservationOK0(14536, LadungNr) = 0
AND wltuser.LVSKOMM.FuncSel_QAReservationOK0(14536, ChargeNr, StatusQS) = 0
AND 1=1
AND ((UPPER (zugeordnetfktbereich) LIKE  UPPER ('Warehouse_G') ))
ORDER BY
 ladungnr
;


There are a couple clues in this text suggesting that it’s generated code – the odd “1=1″ construct, the quirky mixture of case (particularly visible in the first predicate – which also look suspect because the name suggests “number” and the comparison is with a number that has been turned into a character string). Moreover, this is just one of several variants of the same query using different constants. (A significant fraction of the work done by the system is “hard parsing”- and in this case that’s an issue worth pursuing with the vendor.)
You’ve probably also noticed two packaged functions in the where clause (such things can be really nasty: what do they do and how much “hidden” time do they take) and a possible “index-busting” upper() function.
As a last little complication, you might note that the vendor is German (or, at least Germanic). It slows me down a bit when I have to keep asking questions like “what does VLADUNGEN_FREI mean?”; it’s much easier to recognise threats in execution plans if you have a more intuitive understanding of what the tables and columns represent.
The client had previous traced various slow queries, by the way, and had noted that a single execution of some “user” queries would result in literally tens of thousands of executions of other (very simple, very precise, very efficent – yet threateningly repetitive) queries. Several of the pl/sql functions, like the ones above, were used by the system to perform simple lookups against the database – which were then used in where clauses to eliminate data. (Tip: sometimes you can declare such functions to be deterministic, sometimes (select function() from dual) is much more efficient than a simple call to the function).

The execution plans.

Since it was easy to reproduce both execution plans (by dropping or recreating histograms) the obvious first step in the investigation was to compare execution plans, ideally with rowsource execution statistics enabled. So that’s what we did in a backup copy of the production data. Here’s what we got – first the bad plan:

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                         |      1 |        |     61 |00:00:00.03 |    1821K|
|   1 |  SORT AGGREGATE                         |                         |     61 |      1 |     61 |00:00:00.01 |     129 |
|   2 |   TABLE ACCESS BY INDEX ROWID           | LADUNGEN                |     61 |      8 |    266 |00:00:00.01 |     129 |
|*  3 |    INDEX RANGE SCAN                     | LADSAMMELINDEX          |     61 |      8 |    266 |00:00:00.01 |      54 |
|*  4 |  FILTER                                 |                         |      1 |        |     61 |00:00:00.03 |    1821K|
|   5 |   NESTED LOOPS OUTER                    |                         |      1 |      1 |     83 |00:00:00.03 |    1821K|
|   6 |    NESTED LOOPS ANTI                    |                         |      1 |      1 |     83 |00:00:33.95 |    1821K|
|*  7 |     FILTER                              |                         |      1 |        |     83 |00:00:33.95 |    1821K|
|   8 |      NESTED LOOPS OUTER                 |                         |      1 |      1 |     83 |00:00:00.03 |    1821K|
|*  9 |       FILTER                            |                         |      1 |        |     83 |00:00:33.95 |    1821K|
|  10 |        NESTED LOOPS OUTER               |                         |      1 |      1 |     98 |00:00:00.03 |    1820K|
|  11 |         NESTED LOOPS                    |                         |      1 |      1 |     98 |00:00:18.40 |    1820K|
|  12 |          NESTED LOOPS                   |                         |      1 |      1 |     98 |00:00:18.39 |    1820K|
|  13 |           TABLE ACCESS BY INDEX ROWID   | LVSMATERIALSTAMM        |      1 |      1 |      1 |00:00:00.01 |       3 |
|* 14 |            INDEX UNIQUE SCAN            | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |00:00:00.01 |       2 |
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      1 |      1 |     98 |00:00:18.39 |    1820K|
|* 16 |            INDEX FULL SCAN              | LADUNGENPK              |      1 |     36 |  69157 |00:00:22.40 |    1808K|
|  17 |             SORT GROUP BY               |                         |  69296 |      1 |  69296 |00:00:00.86 |   22907 |
|  18 |              TABLE ACCESS BY INDEX ROWID| LADUNGEN                |  69296 |      8 |  30356 |00:00:00.45 |   22907 |
|* 19 |               INDEX RANGE SCAN          | LADSAMMELINDEX          |  69296 |      8 |  30356 |00:00:00.29 |   12827 |
|  20 |          TABLE ACCESS BY INDEX ROWID    | MATERIALSTAMM           |     98 |      1 |     98 |00:00:00.01 |     111 |
|* 21 |           INDEX UNIQUE SCAN             | MATERIALSTAMMPK         |     98 |      1 |     98 |00:00:00.01 |      13 |
|  22 |         TABLE ACCESS BY INDEX ROWID     | LVSCHARGE               |     98 |      1 |     97 |00:00:00.01 |     110 |
|* 23 |          INDEX UNIQUE SCAN              | LVSCHARGEPK             |     98 |      1 |     97 |00:00:00.01 |      13 |
|  24 |       TABLE ACCESS BY INDEX ROWID       | LVSKOMMAUFTRAGPOSLADUNG |     83 |  23250 |      0 |00:00:00.01 |      59 |
|* 25 |        INDEX RANGE SCAN                 | POSLADUNG_LADUNG_I      |     83 |      1 |      0 |00:00:00.01 |      59 |
|  26 |     VIEW PUSHED PREDICATE               | VW_NSO_1                |     83 |      1 |      0 |00:00:00.01 |       0 |
|* 27 |      FILTER                             |                         |     83 |        |      0 |00:00:00.01 |       0 |
|  28 |       NESTED LOOPS                      |                         |      0 |        |      0 |00:00:00.01 |       0 |
|  29 |        NESTED LOOPS                     |                         |      0 |      1 |      0 |00:00:00.01 |       0 |
|  30 |         TABLE ACCESS BY INDEX ROWID     | LADUNGEN                |      0 |      4 |      0 |00:00:00.01 |       0 |
|* 31 |          INDEX RANGE SCAN               | LADSAMMELINDEX          |      0 |      4 |      0 |00:00:00.01 |       0 |
|* 32 |         INDEX RANGE SCAN                | POSLADUNG_LADUNG_I      |      0 |      1 |      0 |00:00:00.01 |       0 |
|* 33 |        TABLE ACCESS BY INDEX ROWID      | LVSKOMMAUFTRAGPOSLADUNG |      0 |      1 |      0 |00:00:00.01 |       0 |
|  34 |    TABLE ACCESS BY INDEX ROWID          | PASXBATCHSTATUS         |     83 |      1 |     82 |00:00:00.01 |      90 |
|* 35 |     INDEX UNIQUE SCAN                   | PASXBATCHSTATUSUNIQUE   |     83 |      1 |     82 |00:00:00.01 |       8 |
|  36 |   SORT AGGREGATE                        |                         |      0 |      1 |      0 |00:00:00.01 |       0 |
|* 37 |    INDEX SKIP SCAN                      | FSRESERVIERUNGEN        |      0 |      1 |      0 |00:00:00.01 |       0 |
-----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("TSU"."SAMMELLADUNGNR"=:B1)
   4 - filter(DECODE("MAT"."CHARGENPFLICHTIG",22,21,DECODE(NVL("STA"."ALLOWEDFORPICKING",23),901,21,902,DEC
              ODE( (SELECT COUNT(*) FROM WLTUSER."FSRESERVIERUNGEN" "RS" WHERE
              "RS"."FSRS_CHARGENNR"=:B1),0,22,21),903,DECODE("LVS"."DEFAULTQSSTAT",NVL("CHA"."STATUSQS",23),21,22),22))=2
              1)
   7 - filter("KOM"."BEARBSTATUSKEY" IS NULL)
   9 - filter("LVSKOMM"."FUNCSEL_QARESERVATIONOK0"(14536,"LAD"."CHARGENR","CHA"."STATUSQS")=0 AND
              ("CHA"."VERFALLDATUM" IS NULL OR "CHA"."VERFALLDATUM">=SYSDATE@!))
  14 - access("LVS"."MATERIALNR"='117098')
  15 - filter("LAD"."MATERIALNR"='117098' AND "LAD"."LADUNGART"=1011 AND
              UPPER("LAD"."ZUGEORDNETFKTBEREICH") LIKE 'WAREHOUSE_G' AND "LAD"."INTRANSPORT"=22 AND
              "LAD"."INBEARBEITUNG"=22 AND "LAD"."INENTNAHME"=22 AND "LAD"."ININVENTUR"=22 AND
              "LAD"."ZUGEORDNETSTANDORT"="LVS"."STANDORTKEY")
  16 - filter("LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LAD"."LADUNGNR")=0 AND  (SELECT
              COUNT(DISTINCT "TSU"."MATERIALNR") FROM WLTUSER."LADUNGEN" "TSU" WHERE "TSU"."SAMMELLADUNGNR"=:B1)<=1)
  19 - access("TSU"."SAMMELLADUNGNR"=:B1)
  21 - access("MAT"."MATERIALNR"='117098' AND "LAD"."ZUGEORDNETSTANDORT"="MAT"."STANDORTKEY")
  23 - access("CHA"."MATERIALNR"(+)='117098' AND "LAD"."CHARGENR"="CHA"."CHARGENR"(+))
  25 - access("LAD"."LADUNGNR"="KOM"."LADUNGNR"(+))
  27 - filter((-1)<>"LAD"."SAMMELLADUNGNR")
  31 - access("PAL"."SAMMELLADUNGNR"="LAD"."SAMMELLADUNGNR")
       filter("PAL"."SAMMELLADUNGNR"<>(-1) AND "PAL"."SAMMELLADUNGNR"<>0)
  32 - access("KLA"."LADUNGNR"="PAL"."LADUNGNR")
  33 - filter("KLA"."BEARBSTATUSKEY"=4603)
  35 - access("STA"."QSSTATUSKEY"(+)=NVL("CHA"."STATUSQS",23))
  37 - access("RS"."FSRS_CHARGENNR"=:B1)
       filter("RS"."FSRS_CHARGENNR"=:B1)

Editor's comments:
==================
Line 17 (SORT GROUP BY) also recorded
    ----------------------------
    |  OMem |  1Mem | Used-Mem |
    ----------------------------
    |  2048 |  2048 | 2048  (0)|
    ----------------------------

I acquired the plan structure from a calld to dbms_xplan.display_cursor()
I acquired the complete predicate from calls to explain plan and dbms_xplan.display()


You can’t help noticing that the plan is a lot more complex than you might expect – the query is running against a complex view whose definition, unfortunately, I forget to pick up from the client; however, here are a few notes about the view definition that appear in the execution plan:
  • Lines 1-2 represent a scalar subquery in the view’s select list.
  • Lines 4,36,37 represent an inline scalar subquery inside a decode() in the view’s where clause.
  • Lines 26-33 represents a NOT IN subquery that has been unnested into a non-mergeable view that has then been subject to predicate pushing so that it has become the second “table” in a nested loop ANTI-JOIN driven from line 6 of the plan.
  • Lines 17-19 represents a correlated scalar subquery in the view’s where clause.
  • Line 9 is where we can see one of the pl/sql functions of the main query being called for each row
  • Line 16 is where we can see the other pl/sql function being called
The shape of the plan around lines 15 to 19 may be a little confusing. There is a filter subquery in that part of the plan, but there is no FILTER operation visible in the plan itself. There are several cases where filter operations “disappear” and this can make it easy to mis-read plans, so I’m just going to show you an edited version of those lines to make the execution a little clearer.

------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------------------------
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      1 |      1 |     98 |
|* 16a|            FILTER                       |                         |      1 |     36 |  69157 |
|  16b|             INDEX FULL SCAN             | LADUNGENPK              |      1 |     36 |  69296 |
|  17 |             SORT GROUP BY               |                         |  69296 |      1 |  69296 |
|  18 |              TABLE ACCESS BY INDEX ROWID| LADUNGEN                |  69296 |      8 |  30356 |
|* 19 |               INDEX RANGE SCAN          | LADSAMMELINDEX          |  69296 |      8 |  30356 |
------------------------------------------------------------------------------------------------------

  15 - filter("LAD"."MATERIALNR"='117098' AND "LAD"."LADUNGART"=1011 AND
              UPPER("LAD"."ZUGEORDNETFKTBEREICH") LIKE 'WAREHOUSE_G' AND "LAD"."INTRANSPORT"=22 AND
              "LAD"."INBEARBEITUNG"=22 AND "LAD"."INENTNAHME"=22 AND "LAD"."ININVENTUR"=22 AND
              "LAD"."ZUGEORDNETSTANDORT"="LVS"."STANDORTKEY")
  16a- filter(SELECT COUNT(DISTINCT "TSU"."MATERIALNR") FROM WLTUSER."LADUNGEN" "TSU" WHERE "TSU"."SAMMELLADUNGNR"=:B1)<=1)
  16b- filter("LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LAD"."LADUNGNR")=0


In this edited extract we see that Oracle has done a full scan (not a fast full scan) of index LADUNGENPK at line 16b and, for each index entry, called one of our pl/sql functions; any row that survives the function call is then used in the filter at line 16a to drive the aggregate subquery (starting at line 17) to determine whether or not the entry should be used to access the table.
Since I got the execution plan statistics by adding the hint /*+ gather_plan_statistics */ to the query we can’t be completely confident in the accuracy of the time reported – it’s based on sampling – but we do see a lot of time spent on these lines, and an index full scan on a table with thousands of rows generally looks like a bad idea, and calling a pl/sql function thousands of time looks like a bad idea, and running an aggregate subquery thousands of times looks like a bad idea – especially since we end up acquiring only 98 rows from the table: so at first sight we might have our attention drawn to this part of the plan. But before diving in too deeply, let’s look at the good execution plan:

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                         |      1 |        |     61 |00:00:00.11 |    9661 |       |       |          |
|   1 |  SORT AGGREGATE                       |                         |     61 |      1 |     61 |00:00:00.01 |     125 |       |       |          |
|   2 |   TABLE ACCESS BY INDEX ROWID         | LADUNGEN                |     61 |      8 |    266 |00:00:00.01 |     125 |       |       |          |
|*  3 |    INDEX RANGE SCAN                   | LADSAMMELINDEX          |     61 |      8 |    266 |00:00:00.01 |      51 |       |       |          |
|   4 |  SORT ORDER BY                        |                         |      1 |      1 |     61 |00:00:00.11 |    9661 | 24576 | 24576 |22528  (0)|
|*  5 |   FILTER                              |                         |      1 |        |     61 |00:00:00.05 |    9536 |       |       |          |
|   6 |    NESTED LOOPS ANTI                  |                         |      1 |      1 |     83 |00:00:00.17 |    9381 |       |       |          |
|*  7 |     FILTER                            |                         |      1 |        |     83 |00:00:00.17 |    9381 |       |       |          |
|   8 |      NESTED LOOPS OUTER               |                         |      1 |      1 |     83 |00:00:00.01 |    9381 |       |       |          |
|   9 |       NESTED LOOPS OUTER              |                         |      1 |      1 |     83 |00:00:00.17 |    9325 |       |       |          |
|* 10 |        FILTER                         |                         |      1 |        |     83 |00:00:00.17 |    9239 |       |       |          |
|  11 |         NESTED LOOPS OUTER            |                         |      1 |      1 |     98 |00:00:00.01 |    9045 |       |       |          |
|  12 |          NESTED LOOPS                 |                         |      1 |      1 |     98 |00:00:00.05 |    8939 |       |       |          |
|  13 |           NESTED LOOPS                |                         |      1 |      1 |     98 |00:00:00.05 |    8832 |       |       |          |
|  14 |            TABLE ACCESS BY INDEX ROWID| LVSMATERIALSTAMM        |      1 |      1 |      1 |00:00:00.01 |       3 |       |       |          |
|* 15 |             INDEX UNIQUE SCAN         | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|* 16 |            TABLE ACCESS BY INDEX ROWID| LADUNGEN                |      1 |      1 |     98 |00:00:00.05 |    8829 |       |       |          |
|* 17 |             INDEX RANGE SCAN          | LADMATCHARGEINDEX       |      1 |     46 |   1137 |00:00:00.01 |       5 |       |       |          |
|  18 |           TABLE ACCESS BY INDEX ROWID | MATERIALSTAMM           |     98 |      1 |     98 |00:00:00.01 |     107 |       |       |          |
|* 19 |            INDEX UNIQUE SCAN          | MATERIALSTAMMPK         |     98 |      1 |     98 |00:00:00.01 |       9 |       |       |          |
|  20 |          TABLE ACCESS BY INDEX ROWID  | LVSCHARGE               |     98 |      1 |     97 |00:00:00.01 |     106 |       |       |          |
|* 21 |           INDEX UNIQUE SCAN           | LVSCHARGEPK             |     98 |      1 |     97 |00:00:00.01 |       9 |       |       |          |
|  22 |        TABLE ACCESS BY INDEX ROWID    | PASXBATCHSTATUS         |     83 |      1 |     82 |00:00:00.01 |      86 |       |       |          |
|* 23 |         INDEX UNIQUE SCAN             | PASXBATCHSTATUSUNIQUE   |     83 |      1 |     82 |00:00:00.01 |       4 |       |       |          |
|  24 |       TABLE ACCESS BY INDEX ROWID     | LVSKOMMAUFTRAGPOSLADUNG |     83 |  23244 |      0 |00:00:00.01 |      56 |       |       |          |
|* 25 |        INDEX RANGE SCAN               | POSLADUNG_LADUNG_I      |     83 |      1 |      0 |00:00:00.01 |      56 |       |       |          |
|  26 |     VIEW PUSHED PREDICATE             | VW_NSO_1                |     83 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 27 |      FILTER                           |                         |     83 |        |      0 |00:00:00.01 |       0 |       |       |          |
|  28 |       NESTED LOOPS                    |                         |      0 |        |      0 |00:00:00.01 |       0 |       |       |          |
|  29 |        NESTED LOOPS                   |                         |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|  30 |         TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      0 |      8 |      0 |00:00:00.01 |       0 |       |       |          |
|* 31 |          INDEX RANGE SCAN             | LADSAMMELINDEX          |      0 |      8 |      0 |00:00:00.01 |       0 |       |       |          |
|* 32 |         INDEX RANGE SCAN              | POSLADUNG_LADUNG_I      |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 33 |        TABLE ACCESS BY INDEX ROWID    | LVSKOMMAUFTRAGPOSLADUNG |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|  34 |    SORT GROUP BY                      |                         |     83 |      1 |     83 |00:00:00.01 |     155 |  2048 |  2048 |          |
|  35 |     TABLE ACCESS BY INDEX ROWID       | LADUNGEN                |     83 |      8 |    377 |00:00:00.01 |     155 |       |       |          |
|* 36 |      INDEX RANGE SCAN                 | LADSAMMELINDEX          |     83 |      8 |    377 |00:00:00.01 |      65 |       |       |          |
|  37 |    SORT AGGREGATE                     |                         |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 38 |     INDEX SKIP SCAN                   | FSRESERVIERUNGEN        |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------


The difference that leaps off the page when you view this plan is that the big numbers (69,000) in the starts column have disappeared. Tracking through the detail we can see that the critical scalar subquery has been postponed almost to the last step (lines 34 – 36) and has been executed only 83 times. Tracking back up the plan to see why, we get back to the first steps of physical execution – lines 13 – 17 in the new plan (12 – 16 in the old plan). In both cases the initial steps of the plan form a nested loop between LVSMATERIALSTAMM and LADUNGEN. Here’s the comparison:

Bad Plan
------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------------------------
|  12 |          NESTED LOOPS                   |                         |      1 |      1 |     98 |
|  13 |           TABLE ACCESS BY INDEX ROWID   | LVSMATERIALSTAMM        |      1 |      1 |      1 |
|* 14 |            INDEX UNIQUE SCAN            | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      1 |      1 |     98 |
|* 16 |            INDEX FULL SCAN              | LADUNGENPK              |      1 |     36 |  69157 |
------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name                    | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------------------------------
|  13 |           NESTED LOOPS                |                         |      1 |      1 |     98 |
|  14 |            TABLE ACCESS BY INDEX ROWID| LVSMATERIALSTAMM        |      1 |      1 |      1 |
|* 15 |             INDEX UNIQUE SCAN         | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |
|* 16 |            TABLE ACCESS BY INDEX ROWID| LADUNGEN                |      1 |      1 |     98 |
|* 17 |             INDEX RANGE SCAN          | LADMATCHARGEINDEX       |      1 |     46 |   1137 |
----------------------------------------------------------------------------------------------------


It’s the same nested loop starting with the same (guaranteed) single row from LVSMATERIALSTAMM. When we have histograms the optimizer picks a full scan on the primary key index LADUNGENPK predicting 36 index entries and getting 69,000, but when we don’t have histograms the optimizer picks a range scan on index LADMATCHARGEINDEX predicting 46 index entries and getting 1,137. The bad choice of index by itself wouldn’t be a disaster – except Oracle ends up calling a pl/sql function thousands of times that executes another SQL statement, and then runs an aggregate subquery thousands of times for anything that survives the function call.
A little detail that you might spot further up the plan, by the way, is that when we drive off the primary key index (bad plan) we DON’T end up with a SORT ORDER BY operation in the plan (see line 4 of the good plan). There is an ORDER BY clause in the original query that is satisfied by the full scan on LADUNGENPK and maintained by the optimizer’s choice of all the subsequent nested loops; when we access the LADUNGEN data through the LADMATCHARGEINDEX we end up needing to sort it.
So there are two questions that we need to answer: how do we make Oracle pick the right index, and why does Oracle choose the wrong index when the histogram is in place. The answer to the second question is the easier one – when we have a histogram Oracle’s estimate of the rows returned by the range scan on LADMATCHARGEINDEX changes from 46 (the average number of rows per MATERIALNR) to 1,137 (the actual number of rows for MATERIALNR=’117098′, because that value happens to be a popular value. With the larger estimate (up by a factor of 25) the increased cardinality and costs cascade through the plan and the final cost of the plan is much higher than the cost of the accepted plan. In this example, better information has resulted in a worse plan.

The Buggy bit

The (nominally correct) increase, though, is only part of the problem. There’s another, simultaneous, issue that causes the bad plan to appear as the quality of information improves, and I can best demontrate this by looking at some partial execution plans, starting with an extract from the bad plan when I run it through explain plan.

--------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------------------------------------
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |     1 |   234 |   258   (2)|
|* 16 |            INDEX FULL SCAN              | LADUNGENPK              |    36 |       |   251   (2)|
|  17 |             SORT GROUP BY               |                         |     1 |    12 |            |
|  18 |              TABLE ACCESS BY INDEX ROWID| LADUNGEN                |     8 |    96 |     3   (0)|
|* 19 |               INDEX RANGE SCAN          | LADSAMMELINDEX          |     8 |       |     1   (0)|
--------------------------------------------------------------------------------------------------------


What I’d like you to note is that the full scan of LADUNGENPK, with the inclusion of the function call for each index entry, and the calls to the subquery, followed by the (predicted) 36 visits to the table has a total cost of 258 (Line 15). Let’s break this down, first by checking the predicted cost of just the index full scan, then the predicated cost of the index full scan with the function calls.

SQL> set autotrace traceonly explain
SQL> select /*+ index(ladungen) */ ladungnr from wltuser.ladungen;

Execution Plan
----------------------------------------------------------
Plan hash value: 3373915778

-------------------------------------------------------------------------------
| Id  | Operation        | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |            | 71590 |   489K|   249   (1)| 00:00:04 |
|   1 |  INDEX FULL SCAN | LADUNGENPK | 71590 |   489K|   249   (1)| 00:00:04 |
-------------------------------------------------------------------------------

SQL> select /*+ index(ladungen) */ ladungnr from wltuser.ladungen
  2* where wltuser."LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LADUNGNR")=0

Execution Plan
----------------------------------------------------------
Plan hash value: 3373915778

-------------------------------------------------------------------------------
| Id  | Operation        | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |            |   716 |  5012 |   277  (11)| 00:00:04 |
|*  1 |  INDEX FULL SCAN | LADUNGENPK |   716 |  5012 |   277  (11)| 00:00:04 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LADUNGNR")=0)


From the first plan we see that the optimizer knows that there are 71,590 entries in the index, and a full scan costs 249.

From the second plan we see that the incremental cost of calling the function 71,590 times is 28, and Oracle’s selectivity for the function call is one of its “guesses” – in this case 1%.
But from the earlier extract we see the cardinality prediction for the index is 36 entries – which is 5% of 716, and that’s because Oracle has a standard guess of 5% for certain types of subqueries (particularly subqueries with range predicates), and our use of the subquery comes down to:

(SELECT COUNT(DISTINCT "TSU"."MATERIALNR") FROM WLTUSER."LADUNGEN" "TSU" WHERE "TSU"."SAMMELLADUNGNR"=:B1)<=1

More dramatically, from the extract we see that the total cost of the index full scan, function calls, and subquery executions is 251 – an increment of only 2 over the cost of the basic index full scan. Here’s an interesting coincidence: 5% (subquery selectivity) of 28 (cost of 71,590 function calls) = 2 (when rounded up).
I think that Oracle has applied the 5% subquery factor to the number of rows in the index before applying the cost of calling the function, and then it hasn’t added ANY cost to the full scan to allow for the number of times that it will have to call the subquery. The entire calculation strategy is broken and, as a result, Oracle underestimates the cost of this part of the plan by a huge amount – and that makes the “correct” cost of using the primary key index appear to be the worst choice.
By this time I had already spent about an hour picking the problem apart – so I suggested that as a first (easy) step we should modify the code to tell the optimizer to run the subquery late (as it did in the good plan). I was fairly confident that this would improve things and, given the incorrect application of the 5% selectivity, might result in a switch back to the correct index.
This is where the vendor’s design strategy came to our aid. The code embedded in the application was generally constructed dynamically but very simple. Most of the complexity was in stored views. So I recreated the view VLADUNGEN_FREI with the hint /*+ no_push_subq */ embedde – and luckily the view definition allowed me to put that hint into precisely the subquery where it was needed. The effect was exactly what we wanted to see; with the delayed appearance of the subquery the COST of the wrong index made the plan much more expensive and the number of times the function and the subquery were called at run_time dropped dramatically, so the query dropped back to two-second time to completion. (There was still a slight variation in the subsequent order of tables thanks to slight variations in the arithmetic derived for the histograms, though.)

Conclusion

I didn’t want to spend any more of the client’s time on this problem – we were, after all, only trying to determine why dropping histograms had made such a difference to some of their queries. This example demonstrated an important principle, but the client had got rid of the histograms in production anyway.
The underlying problem had two components. The presence of a histogram had given Oracle better information about the data and had therefore increased the cost of the ideal execution plan; this would not have mattered if there had not been some defects in the handling of function calls and subquery costs for filters against indexes which made an alternative plan with a different index produce a lower cost estimate despite consuming far more resources at run-time.
If we had not been dropping the histograms anyway, we would have been able to work around the problem because the vendor embedded most of the code complexity in views – so we could add hints to the view definitions without having to wait for the vendor to modify the front-end code, and without having to do massive regression tests to prove the our changes still produced the same result sets.



DIGITAL JUICE

No comments:

Post a Comment

Thank's!