Tuesday, November 27, 2012

Full table scan vs full index scan performance

Full table scan vs full index scan performance:
Earlier this week, Cédric blogged about how easy we can get confused between a covering index and a full index scan in the EXPLAIN output. While a covering index (seen with EXPLAIN as Extra: Using index) is a very interesting performance optimization, a full index scan (type: index) is according to the documentation the 2nd worst possible execution plan after a full table scan.

If it is obvious that a full table scan is not good for performance, how much can we expect if we can switch to a full index scan? In other terms, is a full table scan always the worst possible execution and should it be avoided at all costs?
Let’s take the employees database, and slightly modify the employees tables:
mysql> ALTER TABLE employees ADD INDEX idx_first (first_name),ENGINE=InnoDB;
And then let’s consider this query:
SELECT * FROM employees ORDER BY first_name;
This query can of course by executed by running a full table scan, but we could also take advantage of the idx_first index, which will translate to a full index scan.

Let’s see how the optimizer will execute it:
mysql> EXPLAIN SELECT * FROM employees ORDER BY first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 300363
        Extra: Using filesort
Surprising? The optimizer preferred a full table scan, and it did not even consider scanning the idx_first index as a relevant choice (possible_keys: NULL).
What do we get if we force the optimizer to use the index?
mysql> EXPLAIN SELECT * FROM employees FORCE INDEX(idx_first) ORDER BY first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
         type: index
possible_keys: NULL
          key: idx_first
      key_len: 16
          ref: NULL
         rows: 300363
        Extra:
Honestly, it looks better: the number of rows is the same, but a full index scan instead of a full table scan and no filesort. But predicting how a query will perform by only looking at the execution plan is not enough, we must run both queries and compare execution time.
First case: the employees table does not fit in memory

With the full table scan, the query runs in about 4s.

With the full index scan, it runs in about 30s.
So the optimizer was right after all. But why? Simply because all access patterns are not equal. When we are doing a full table scan, we are doing sequential reads, which are quite fast even with slow disks. But when we are using the index, we first have to do a full index scan (fast sequential reads, no problem) and then lots of random reads to fetch rows by index value. And random reads are orders of magnitude slower than sequential reads.

The optimizer has this knowledge, so it is able to pick the right execution plan.
Second case: the employees table fits in memory

With the full table scan, the query runs in about 3.3s.

With the full index scan, the query runs in about 2.6s.
We can see here a limitation of the optimizer: it does not know on which kind of media data is stored. If it is stored on spinning disks, the assumption that random reads are much slower than sequential reads is correct, but it is not the case anymore if data is stored in memory. That’s why when execution plans look similar, you should always execute the query to really see which execution plan should be chosen. Here if we know that the table will always be in memory, we should add the FORCE INDEX hint to ensure optimal response time.
Now let’s modify the query by selecting only the first_name field instead of selecting all the fields:
mysql> explain SELECT first_name FROM employees ORDER BY first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
         type: index
possible_keys: NULL
          key: idx_first
      key_len: 16
          ref: NULL
         rows: 300584
        Extra: Using index
The optimizer chooses the full index scan. It is a good choice because the index now covers the query, which means that reading the index is enough to get the results.
Conclusions:
  • For a non covering index, the different between a full table scan and an execution plan based on a full index scan is basically the difference between sequential reads and random reads: it can be close if you have fast storage or it can be very different if you have slow storage.
  • A full index scan can become interesting when the index is covering.
  • Don’t forget to measure response time when you are trying different execution plans. It is too easy to get focused on having a good-looking execution, but the end user only cares about response time!

PlanetMySQL Voting:
Vote UP /
Vote DOWN

DIGITAL JUICE

No comments:

Post a Comment

Thank's!