Most query optimizers represent query plans as a
tree of "plan nodes". A plan node encapsulates a single operation that is required to execute the query. The nodes are arranged as a tree, in which intermediate results flow from the bottom of the tree to the top. Each node has zero or more child nodes—those are nodes whose output is fed as input to the parent node. For example, a join node will have two child nodes, which represent the two join operands, whereas a sort node would have a single child node (the input to be sorted). The leaves of the tree are nodes which produce results by scanning the disk, for example by performing an
index scan or a sequential scan.
Join ordering The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining 3 tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders-of-magnitude more time to execute than one that joins A and C first. Most query optimizers determine
join order via a
dynamic programming algorithm pioneered by
IBM's System R database project . This algorithm works in two stages: • First, all ways to access each
relation in the query are computed. Every relation in the query can be accessed via a sequential scan. If there is an
index on a relation that can be used to answer a
predicate in the query, an index scan can also be used. For each relation, the optimizer records the cheapest way to scan the relation, as well as the cheapest way to scan the relation that produces records in a particular sorted order. • The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer will consider the available join algorithms implemented by the
DBMS. It will preserve the cheapest way to join each pair of relations, in addition to the cheapest way to join each pair of relations that produces its output according to a particular sort order. • Then all three-relation query plans are computed, by joining each two-relation plan produced by the previous phase with the remaining relations in the query. Sort order can avoid a redundant sort operation later on in processing the query. Second, a particular sort order can speed up a subsequent join because it clusters the data in a particular way.
Query planning for nested SQL queries A SQL query to a modern relational DBMS does more than just selections and joins. In particular, SQL queries often nest several layers of
SPJ blocks (Select-Project-Join), by means of
group by,
exists, and
not exists operators. In some cases such nested SQL queries can be
flattened into a select-project-join query, but not always. Query plans for nested SQL queries can also be chosen using the same dynamic programming algorithm as used for join ordering, but this can lead to an enormous escalation in query optimization time. So some database management systems use an alternative rule-based approach that uses a query graph model.
Cost estimation One of the hardest problems in query optimization is to accurately estimate the costs of alternative query plans. Optimizers cost query plans using a mathematical model of query execution costs that relies heavily on estimates of the
cardinality, or number of tuples, flowing through each edge in a query plan. Cardinality estimation in turn depends on estimates of the
selection factor of predicates in the query. Traditionally, database systems estimate selectivities through fairly detailed statistics on the distribution of values in each column, such as
histograms. This technique works well for estimation of selectivities of individual predicates. However many queries have
conjunctions of predicates such as select count(*) from R where R.make='Honda' and R.model='Accord'. Query predicates are often highly correlated (for example, model='Accord' implies make='Honda'), and it is very hard to estimate the selectivity of the conjunct in general. Poor cardinality estimates and uncaught correlation are one of the main reasons why query optimizers pick poor query plans. This is one reason why a
database administrator should regularly update the database statistics, especially after major data loads/unloads. ==Extensions==