Wednesday, April 26, 2006

Speeding up queries

What an informative session. One of my favorites at the conference.

I am sitting in the session Speeding-up queries by Timour Katchaounov.

Query engine principles

What's new in the 5.0/5.1 engine?

Query engine architecture.
A number of stages in the query engine happen in the parse tree. This is where the SQL standards compliance happens. Access rights are also checked here in Preprocessor. Then we are ready to execute the query in the optimizer.

First are the logical transformations, then cost-based optimization and then plan refinement. Once we have a complete query execution plan, it is sent to query execution component. It uses either table scan or the index scan etc and decides on the various join methods such as nested loops join, hash join etc. Then its passed to the handler API which uses the appropriate storage engine (InnoDB etc).

MySQL uses a left-deep (linear) plan vs. the bushy plan.

Limiting to left-deep (linear) plan, allows to achieve good performance.

QEP (Query Execution Plans) as operator sequences:

For each part, there is table, condition, access and index. Each of these operators contain one variables such as depending on the table. This method of query execution is called pipeline.

It is a nested loop.

1. table = City
Condition = (Population >= 5000000)
Access = Index range scan
index = Population, by (Population >= 5000000)

2. table = Country
condition = (Country.Code = City.Country AND City.ID = COuntry.Capital)
access = Index by unique reference
index = Primary (Code), by (Country.Code = City.Country)
3 table = Languages
condition = (Languages.Country = Country.Code AND Percentage > 4.0)
access = Index by reference
index = Primary (Country, Language), by (Langugaes.Country = Country.code)

Part 1: Query engine principles:

What are the query plans and their execution.

General idea is :
  • assign cost to operations
  • assign cost to (partial) plans (not to complete plans)
  • search for plans with lowest cost
This search is possible because
  • query plans are simnple
  • there is data statistics and
  • there is meta-data
Main characteristics of an optimizer
  • search space (possible plans): The total set of plans we can search through.
  • cost model:
  • search procedure
Cost model:

Cost of operations is proportional (~) to disk accesses
A cost unit in MySQL = random read of a data page of 4KB

THe main cost factors include (whats the cost to access the table)
  • Data statistics
    • number of pages per table (index) - P(R) or (P(I))
    • the cardinatily of tables / indexes N(R)
    • length of rows and keys
    • key distribution
  • Schema
    • uniqueness (PK)
    • nullability
  • Simplified cost model of table scan
    • cost (access(R)) ~ P(R))
Query optimization - exhaustive search:

Search all plans in a bottom-up manner.
  • which means begin with all 1-table plans and
  • for each plan QEP =
  • expand QEP with remaining {Tk, ..., Tn} tables

Depth-first search illustrated:
Suppose we consider tables the way they are mentioned. First languges is joined with Country and then we go back and join with City.

Next we go to table country and expand it with languages and then with city. Next, we go to City. The cost for each possible combination is searched.

New in the 5.0 engine.
If you have more than 5 joins, you may have experienced slowness. To address that greedy optimizer was introduced

"Greedy" join optimizer:
Greedy search (decisions are made on the immediately visible benefit) controls how exhaustive is the search depth. For each serarch step we estimate each potential extension up to search depth. Then the best extension (greedy) is picked and we "forget" the other extensions. Then we continue with the remaining tables.

A drawback / trade off is that we may miss the best plan.

How does greedy search procedure work.

It starts with empty plan and considers all plans with length of 1. and then thinks of all possible ways of adding 1 table. It looks all possible extensions and looks at their cost. The plan with minimum cost is picked.

Then we consider plans of length 2 and so on.

Couple of ways to Controlling the optimizer: Users can influence the index by specifyingm USE INDEX or FORCE INDEX or even IGNORE INDEX. If possible don't use it because some other plan may be better of..

How to control join optimization.
A system variable called optimizer_search_depth can be manipulated. A value of 0 means automatic, 1 means minimal and 62 means maximal. The defauly is now automatic (used to be 62)

Then there is a variable for pruning called optimizer_prune_level 0=> exhaustive, 1=>heuristic(default)

THe third way to force join order is use STRAIGHT_JOIN(you specify the order by specifiying the plan manually (strongly advised not to use it) )


Tables up to 6-7 search depth, the time to optimize is a couple of seconds, then we the search depth is 10, the optimize time is 100 seconds and near 13 search depth, the query compilation time is almost 1000.

Another feature of optimizer: The range optimizer (implemented by Monty).

The range optimizer takes a bookean combination of SARGable conditions. This results in minimal sequence of smallest possible disjoint intervals.

The range optimizer is used for single index and multiple index access and also for partition pruning.

Index merge Union
RowID ordered union. All rows in the index will be returned in RowID order. MySQL goes back and forth to two tables. Everytime it finds a bigger rowID in a different table, it starts adding rows from that table.

Index merge Intersection.
Index scan is done from both tables and each time we do opposite of Index merge union. If the rowId is found in one table and not in another table it is skipped.

Index merge superposition

merge almost any boolean expression. We can combine the above algorithms, so we can intersect first, then merge, and then perform the union.

Loose index scan for GROUP BY / DISTINCT
Another access method to retirve data called Loose index scan for GROUP BY / DISTINCT.

Consider query

SELECT Country, max (POpulation) FROM CIty WHERE Name = 'Sorrento' GROUP by country;

If we execute the query in 4.1 it will perform an index scan considering each possible key in the index then it checks the index and then it groups.

There is a way to execute these queries faster using Loose index scan. First it scans the index, and finds the first key where name is Sorrento and from that record it can find the group of the query (i.e. USA). Once the group prefix 'USA' has been found, it composes a search key <'USA', 'Sorrento'> and then jumps to first key with <'USA', 'Sorrento'>

Find the first tuple in the group,
Add the value for the name key,
compose a search key
and jump to that key so we don't have a need to scan every key.

If we need to find the minimum value, pick the first one. If we want the max, we pick the last one.

This method is applied in a bit more tricky case. Some conditions must be met. First of all teh GROUP BY column must be in front. If there are any query conditions, then those fields should be immediately after the GROUP BY column. Otherwise, this method won't be applied.

Even if we have a DISTINCT clause, the optimizer can still apply this method.

This method offers potentionally x10, x100, etc improvement. There are a few rules of thumb: when this method is good
  • the logic of choice here is opposite to range scan. (the less selective the index, the better)
  • the more keys per group, the better (there will be fewer jumps)
  • the index must cover all fields in order
The major difference is query optimizer in 5.0 and 5.1 is the partition pruning. How do we acces the partition faster? We do it with parition pruning.

Given a query over paritioned table, match table DDL against the WHERE clause and find subset of partitions we need to scan to resolve the query.

The internals of partition pruning
We create virtual table definition from partitioning description: (part_field1, ... part_fieldN, subpart_field1,...subpart_fieldN)

Rub Range Analyzer over the WHERE clause and the virtual table definition.

Obtain and interval sequence over virtual table definition.

Suppose we have range [a,b] from 2 to 5. The range optimizer can produce a sequence. We walk through all paritions and we see how it overlaps.

Equality propagation (5.0)
SARGable argument transitive closure

What's coming:

  • Subquery optimizations (5.2)
    • Applied to subquerues in the WHERE clause
    • Flattening through semi-joins
    • Use hash semi-join when can't flatten
  • Batched multiple range (MRR)
    • IMplemented for NDB (5,1)
    • Coming soon for MyISAM / InnoDB
  • Hash join (one-pass; multi-pass)
  • Loose index scan for equality conditions

No comments: