Saturday, September 8, 2012

How common_schema split()s tables - internals

How common_schema split()s tables - internals:
This post exposes some of the internals, and the SQL behind QueryScript's split. common_schema/QueryScript 1.1 introduces the split statement, which auto-breaks a "large" query (one which operates on large tables as a whole or without keys) into smaller queries, and executes them in sequence.
This makes for easier transactions, less locks held, potentially (depending on the user) more idle time released back to the database. split has similar concepts to oak-chunk-update and pt-archiver, but works differently, and implemented entirely in SQL on server side.
Take the following statement as example:

split (UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR)
  pass;


It yields with (roughly) the following statements:

UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1')) OR ((`inventory`.`inventory_id` = '1'))) AND (((`inventory`.`inventory_id` < '1000')) OR ((`inventory`.`inventory_id` = '1000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1000'))) AND (((`inventory`.`inventory_id` < '2000')) OR ((`inventory`.`inventory_id` = '2000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '2000'))) AND (((`inventory`.`inventory_id` < '3000')) OR ((`inventory`.`inventory_id` = '3000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '3000'))) AND (((`inventory`.`inventory_id` < '4000')) OR ((`inventory`.`inventory_id` = '4000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '4000'))) AND (((`inventory`.`inventory_id` < '4581')) OR ((`inventory`.`inventory_id` = '4581'))));


(I say "roughly" because internally there are user defined variables at play, but for convenience, I verbose the actual values as constants.)


How does that work?


common_schema works on server side. There is no Perl script or anything. It must therefore use server-side operations to:

  • Identify table to be split

  • Analyze the table in the first place, deciding how to split it

  • Analyze the query, deciding on how to rewrite it

  • Split the table (logically) into unique and distinct chunks

  • Work out the query on each such chunk


Following is an internal look at how common_schema does all the above.


Identifying the table


When query operates on a single table, split is able to parse the query's SQL and find out that table. When multiple tables are involved, split requires user instruction: which table is it that the query should be split by?


Analyzing the table


Table analysis is done via a similar method to candidate_keys_recommended. It is almost identical, only it uses INFORMATION_SCHEMA optimizations to make the query short and lightweight. Simulating the analysis using candidate_keys_recommended, we get:

mysql> select * from candidate_keys_recommended where table_name='inventory' \G
*************************** 1. row ***************************
          table_schema: sakila
            table_name: inventory
recommended_index_name: PRIMARY
          has_nullable: 0
            is_primary: 1
 count_column_in_index: 1
          column_names: inventory_id


This is cool, simple and very easy to work with: we choose to split the table via the inventory_id column, which is conveniently an integer. We'll soon see split can handle complex cases as well.


Analyzing the query


This is done in part via Roland's query_analysis_routines, and in part just parsing the query, looking for WHERE, GROUP BY, LIMIT etc. clauses.
The nice part is injecting a WHERE condition, which didn't appear in the original query. That WHERE condition is what limits the query to a distinct chunk of rows.


Splitting the table


With a single INTEGER PRIMARY KEY this sounds simple, right? Take rows 1..1,000, then 1,001..2,000, then 2,001..3,000 etc.
Wrong: even with this simple scenario, things are much more complex. Are the numbers successive? What if there are holes? What if there is a 1,000,000 gap between every two numbers? What if there are multiple holes of differing size and frequency?
And if we have two columns in our UNIQUE KEY? What if one of them is textual, not an INTEGER, the other a TIMESTAMP, not an INTEGER either?
split doesn't work in that naive way. It makes no assumptions on the density of values. It only requires:

  • some UNIQUE KEY to work with,

  • which has no NULL values.


Given the above, it uses User Defined Variables to setup the chunks. With our single INTEGER column, the minimum value is set like this:

select 
  inventory_id
from
  `sakila`.`inventory`
order by
  inventory_id ASC
limit 1  
into @_split_column_variable_min_1
;


This sets the first value of the first chunk. What value terminates this chunk? It is calculated like this:

select 
  inventory_id
from (
  select
    inventory_id
  from
    `sakila`.`inventory`
  where
    (((`inventory`.`inventory_id` > @_split_column_variable_range_start_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_range_start_1))) and (((`inventory`.`inventory_id` < @_split_column_variable_max_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_max_1)))
  order by
    inventory_id ASC limit 1000
  ) sel_split_range  
order by
  inventory_id DESC
limit 1  
into @_split_column_variable_range_end_1
;


Now there's a query you wouldn't want to work by hand, now would you?
The cool part here is that the above works well for any type of column; this doesn't have to be an INTEGER. Dates, strings etc. are all just fine.
The above also works well for multiple columns, where the query gets more complicated (see following).


Working out the query per chunk


This part is the easy one, now that all the hard work is done. We know ho to manipulate the query, we know the lower and upper boundaries of the chunk, so we just fill in the values and execute.


Multi-columns keys


Consider a similar query on sakila.film_actor, where the PRIMARY KEY is a compound of two columns:

split (UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR)
  throttle 2;


The chunked queries will look like this:

UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` = '1'))) AND (((`film_actor`.`actor_id` < '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` < '293')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` = '293'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` > '293'))) AND (((`film_actor`.`actor_id` < '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` < '234')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` = '234'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` > '234'))) AND (((`film_actor`.`actor_id` < '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` < '513')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` = '513'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` > '513'))) AND (((`film_actor`.`actor_id` < '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` < '278')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` = '278'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` > '278'))) AND (((`film_actor`.`actor_id` < '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` < '862')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` = '862'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` > '862'))) AND (((`film_actor`.`actor_id` < '200')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` < '993')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` = '993'))));


View the complete command to realize just how much more complex each query is, and how much more complex the chunking becomes. Here's how I evaluate the chunk's "next range end" variables:

select 
  actor_id, film_id
from (
  select
    actor_id, film_id
  from
    `sakila`.`film_actor`
  where
    (((`film_actor`.`actor_id` > @_split_column_variable_range_start_1)) OR ((`film_actor`.
`actor_id` = @_split_column_variable_range_start_1) AND (`film_actor`.`film_id` > @_split_column_variable_range_start_2))) and (((`film_actor`.`actor_id` < @_split_column_variable_max_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` < @_split_column_variable_max_2)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` = @_split_column_variable_max_2)))
  order by
    actor_id ASC, film_id ASC
  limit 1000
  ) sel_split_range  
order by
  actor_id DESC, film_id DESC
limit 1  
into @_split_column_variable_range_end_1, @_split_column_variable_range_end_2
;


By the way, you may recall that everything is done server side. The WHERE condition for the chunked queries is in itself generated via SQL statement, and not too much by programmatic logic. Here's part of the query which computes the limiting condition:

  select
    group_concat('(', partial_comparison, ')' order by n separator ' OR ') as comparison
  from (
    select
      n,
      group_concat('(', column_name, ' ', if(is_last, comparison_operator, '='), ' ', variable_name, ')' order by column_order separator ' AND ') as partial_comparison
    from (
      select
        n, CONCAT(mysql_qualify(split_table_name), '.', mysql_qualify(column_name)) AS column_name,
        case split_variable_type
          when 'range_start' then range_start_variable_name
          when 'range_end' then range_end_variable_name
          when 'max' then max_variable_name
        end as variable_name,
        _split_column_names_table.column_order, _split_column_names_table.column_order = n as is_last
      from
        numbers, _split_column_names_table
      where
        n between _split_column_names_table.column_order and num_split_columns
      order by n, _split_column_names_table.column_order
    ) s1
    group by n
  ) s2
  into return_value
  ;


There is a lot of complexity to split to make it able to provide with as clean a syntax for the user as possible.

PlanetMySQL Voting:
Vote UP /
Vote DOWN

DIGITAL JUICE

No comments:

Post a Comment

Thank's!