Relational algebra

In database theory, relational algebra is a theory that uses algebraic structures for modeling data and defining queries on it with well founded semantics. The theory was introduced by Edgar F. Codd.

Introduction

Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages. A relation of arity n is a set of ntuples. Relational algebra operates on homogeneous sets of tuples, S = \{(s_{j1},s_{j2}, ... s_{jn}) | j \in 1 ... m \}, where an ntuple is a tuple (row; index j) with n (or data domains), thus m is the number of rows of tuples in a table and n is the number of columns (and all entries in each column have the same ). A relation also has a unique tuple called the header which gives each column a unique name or attribute inside the relation. Attributes are used in projections and selections. == Set operators ==

Set operators

The relational algebra uses set union, set difference, and Cartesian product from set theory, and adds additional constraints to these operators to create new ones. For set union and set difference, the two relations involved must be that is, the two relations must have the same set of attributes. Because set intersection is defined in terms of set union and set difference, the two relations involved in set intersection must also be union-compatible. For the Cartesian product to be defined, the two relations involved must have disjoint headers (i.e. they must not have a common attribute name). In addition, the Cartesian product is defined differently from the one in set theory, in the sense that tuples are considered to be "shallow" for the purposes of the operation. That means the Cartesian product of a set of n-tuples with a set of m-tuples yields a set of "flattened" (n+m)-tuples (whereas basic set theory would have prescribed a set of 2-tuples, each containing an n-tuple and an m-tuple). In relational algebra, the Cartesian product R\times S is defined formally as \begin{align}&R\times S:=\{(r_1,r_2,\dots,r_n,s_1,s_2,\dots,s_m);\\[4pt]&\text{where}\,\,\,(r_1,r_2,\dots,r_n)\in R, (s_1,s_2,\dots,s_m)\in S\}\end{align} The cardinality of the Cartesian product is the product of the cardinalities of its factors, that is, . == Projection ==

Projection

A projection () is a unary operation written as \Pi_{a_1, \ldots,a_n}( R ) where a_1,\ldots,a_n is a set of attribute names. The result of such projection is defined as the set that is obtained when all tuples in R are restricted to the set \{a_1,\ldots,a_n\}. Note: when implemented in SQL standard the "default projection" returns a multiset instead of a set, and the projection to eliminate duplicate data is obtained by the addition of the DISTINCT keyword. == Selection ==

Selection

A generalized selection (σ) is a unary operation written as \sigma_\varphi(R) where is a propositional formula that consists of atoms as allowed in the normal selection and the logical operators (and), (or) and (negation). This selection selects all those tuples in R for which holds. To obtain a listing of all friends or business associates in an address book, the selection might be written as \sigma_{\text{isFriend = true} \,\lor\, \text{isBusinessContact = true}}( \text{addressBook} ). The result would be a relation containing every attribute of every unique record where is true or where is true. == Rename ==

Rename

A rename (ρ) is a unary operation written as \rho_{a / b}(R) where the result is identical to R except that the b attribute in all tuples is renamed to an a attribute. This is commonly used to rename the attribute of a relation for the purpose of a join. To rename the "isFriend" attribute to "isBusinessContact" in a relation, \rho_{\text{isBusinessContact / isFriend} } ( \text{addressBook} ) might be used. There is also the \rho_{x(A_1, \ldots,A_n)}(R) notation, where R is renamed to x and the attributes \{a_1,\ldots,a_n\} are renamed to \{A_1,\ldots,A_n\}. == Joins and join-like operators ==

Common extensions

In practice the classical relational algebra described above is extended with various operations such as outer joins, aggregate functions and even transitive closure. Outer joins Whereas the result of a join (or inner join) consists of tuples formed by combining matching tuples in the two operands, an outer join contains those tuples and additionally some tuples formed by extending an unmatched tuple in one of the operands by "fill" values for each of the attributes of the other operand. Outer joins are not considered part of the classical relational algebra discussed so far. The operators defined in this section assume the existence of a null value, ω, which we do not define, to be used for the fill values; in practice this corresponds to the NULL in SQL. In order to make subsequent selection operations on the resulting table meaningful, a semantic meaning needs to be assigned to nulls; in Codd's approach the propositional logic used by the selection is extended to a three-valued logic, although we elide those details in this article. Three outer join operators are defined: left outer join, right outer join, and full outer join. (The word "outer" is sometimes omitted.) Left outer join The left outer join (⟕) is written as R ⟕ S where R and S are relations. The result of the left outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition (loosely speaking) to tuples in R that have no matching tuples in S. For an example consider the tables Employee and Dept and their left outer join: In the resulting relation, tuples in S which have no common values in common attribute names with tuples in R take a null value, ω. Since there are no tuples in Dept with a DeptName of Finance or Executive, ωs occur in the resulting relation where tuples in Employee have a DeptName of Finance or Executive. Let r1, r2, ..., rn be the attributes of the relation R and let {(ω, ..., ω)} be the singleton relation on the attributes that are unique to the relation S (those that are not attributes of R). Then the left outer join can be described in terms of the natural join (and hence using basic operators) as follows: (R \bowtie S) \cup ((R - \pi_{r_1, r_2, \dots, r_n}(R \bowtie S)) \times \{(\omega, \dots, \omega)\}) Right outer join The right outer join (⟖) behaves almost identically to the left outer join, but the roles of the tables are switched. The right outer join of relations R and S is written as R ⟖ S. The result of the right outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R. For example, consider the tables Employee and Dept and their right outer join: In the resulting relation, tuples in R which have no common values in common attribute names with tuples in S take a null value, ω. Since there are no tuples in Employee with a DeptName of Production, ωs occur in the Name and EmpId attributes of the resulting relation where tuples in Dept had DeptName of Production. Let s1, s2, ..., sn be the attributes of the relation S and let {(ω, ..., ω)} be the singleton relation on the attributes that are unique to the relation R (those that are not attributes of S). Then, as with the left outer join, the right outer join can be simulated using the natural join as follows: (R \bowtie S) \cup (\{(\omega, \dots, \omega)\} \times (S - \pi_{s_1, s_2, \dots, s_n}(R \bowtie S))) Full outer join The outer join (⟗) or full outer join in effect combines the results of the left and right outer joins. The full outer join is written as R ⟗ S where R and S are relations. The result of the full outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R and tuples in R that have no matching tuples in S in their common attribute names. For an example consider the tables Employee and Dept and their full outer join: In the resulting relation, tuples in R which have no common values in common attribute names with tuples in S take a null value, ω. Tuples in S which have no common values in common attribute names with tuples in R also take a null value, ω. The full outer join can be simulated using the left and right outer joins (and hence the natural join and set union) as follows: :R ⟗ S = (R ⟕ S) ∪ (R ⟖ S) Operations for domain computations There is nothing in relational algebra introduced so far that would allow computations on the data domains (other than evaluation of propositional expressions involving equality). For example, it is not possible using only the algebra introduced so far to write an expression that would multiply the numbers from two columns, e.g. a unit price with a quantity to obtain a total price. Practical query languages have such facilities, e.g. the SQL SELECT allows arithmetic operations to define new columns in the result SELECT unit_price * quantity AS total_price FROM t, and a similar facility is provided more explicitly by Tutorial D's EXTEND keyword. In database theory, this is called extended projection. SQL however officially supports such fixpoint queries since 1999, and it had vendor-specific extensions in this direction well before that. == Use of algebraic properties for query optimization ==

Use of algebraic properties for query optimization

Relational database management systems often include a query optimizer which attempts to determine the most efficient way to execute a given query. Query optimizers enumerate possible query plans, estimate their cost, and pick the plan with the lowest estimated cost. If queries are represented by operators from relational algebra, the query optimizer can enumerate possible query plans by rewriting the initial query using the algebraic properties of these operators. Queries can be represented as a tree, where • the internal nodes are operators, • leaves are relations, • subtrees are subexpressions. The primary goal of the query optimizer is to transform expression trees into equivalent expression trees, where the average size of the relations yielded by subexpressions in the tree is smaller than it was before the optimization. The secondary goal is to try to form common subexpressions within a single query, or if there is more than one query being evaluated at the same time, in all of those queries. The rationale behind the second goal is that it is enough to compute common subexpressions once, and the results can be used in all queries that contain that subexpression. Here are a set of rules that can be used in such transformations. Selection Rules about selection operators play the most important role in query optimization. Selection is an operator that very effectively decreases the number of rows in its operand, so if the selections in an expression tree are moved towards the leaves, the internal relations (yielded by subexpressions) will likely shrink. Basic selection properties Selection is idempotent (multiple applications of the same selection have no additional effect beyond the first one), and commutative (the order selections are applied in has no effect on the eventual result). • \sigma_{A}(R)=\sigma_{A}\sigma_{A}(R)\,\! • \sigma_{A}\sigma_{B}(R)=\sigma_{B}\sigma_{A}(R)\,\! Breaking up selections with complex conditions A selection whose condition is a conjunction of simpler conditions is equivalent to a sequence of selections with those same individual conditions, and selection whose condition is a disjunction is equivalent to a union of selections. These identities can be used to merge selections so that fewer selections need to be evaluated, or to split them so that the component selections may be moved or optimized separately. • \sigma_{A \land B}(R)=\sigma_{A}(\sigma_{B}(R))=\sigma_{B}(\sigma_{A}(R)) • \sigma_{A \lor B}(R)=\sigma_{A}(R)\cup\sigma_{B}(R) Selection and cross product Cross product is the costliest operator to evaluate. If the input relations have N and M rows, the result will contain NM rows. Therefore, it is important to decrease the size of both operands before applying the cross product operator. This can be effectively done if the cross product is followed by a selection operator, e.g. \sigma_{A}(R \times P). Considering the definition of join, this is the most likely case. If the cross product is not followed by a selection operator, we can try to push down a selection from higher levels of the expression tree using the other selection rules. In the above case the condition A is broken up in to conditions B, C and D using the split rules about complex selection conditions, so that A = B \wedge C \wedge D and B contains attributes only from R, C contains attributes only from P, and D contains the part of A that contains attributes from both R and P. Note, that B, C or D are possibly empty. Then the following holds: :\sigma_{A}(R \times P) = \sigma_{B \wedge C \wedge D}(R \times P) = \sigma_{D}(\sigma_{B}(R) \times \sigma_{C}(P)) Selection and set operators Selection is distributive over the set difference, intersection, and union operators. The following three rules are used to push selection below set operations in the expression tree. For the set difference and the intersection operators, it is possible to apply the selection operator to just one of the operands following the transformation. This can be beneficial where one of the operands is small, and the overhead of evaluating the selection operator outweighs the benefits of using a smaller relation as an operand. • \sigma_{A}(R\setminus P)=\sigma_{A}(R)\setminus \sigma_{A}(P) =\sigma_{A}(R)\setminus P • \sigma_{A}(R\cup P)=\sigma_{A}(R)\cup\sigma_{A}(P) • \sigma_{A}(R\cap P)=\sigma_{A}(R)\cap\sigma_{A}(P)=\sigma_{A}(R)\cap P=R\cap \sigma_{A}(P) Selection and projection Selection commutes with projection if and only if the fields referenced in the selection condition are a subset of the fields in the projection. Performing selection before projection may be useful if the operand is a cross product or join. In other cases, if the selection condition is relatively expensive to compute, moving selection outside the projection may reduce the number of tuples which must be tested (since projection may produce fewer tuples due to the elimination of duplicates resulting from omitted fields). \pi_{a_1, \ldots ,a_n}(\sigma_A( R )) = \sigma_A(\pi_{a_1, \ldots,a_n}( R ))\,\,\text{if}\,A \subseteq \{a_1,\ldots,a_n\} Projection Basic projection properties Projection is idempotent, so that a series of (valid) projections is equivalent to the outermost projection. \begin{align}&\pi_{a_1, \ldots , a_n}(\pi_{b_1,\ldots , b_m}(R)) = \pi_{a_1, \ldots , a_n}(R)\\[4pt] &\,\,\,\text{where}\,\,\{a_1, \ldots , a_n\} \subseteq \{b_1, \ldots , b_m\}\end{align} Projection and set operators Projection is distributive over set union. \pi_{a_1, \ldots, a_n}(R \cup P) = \pi_{a_1, \ldots, a_n}(R) \cup \pi_{a_1, \ldots, a_n}(P). \, Projection does not distribute over intersection and set difference. Counterexamples are given by: \pi_A(\{ \langle A=a, B=b \rangle \} \cap \{ \langle A=a, B=b' \rangle \}) = \emptyset \begin{align}&\pi_A(\{ \langle A=a, B=b \rangle \}) \cap \pi_A(\{ \langle A=a, B=b' \rangle \})\\[4pt] &= \{ \langle A=a \rangle \}\end{align} and \begin{align}&\pi_A(\{ \langle A=a, B=b \rangle \} \setminus \{ \langle A=a, B=b' \rangle \})\\[4pt] &= \{ \langle A=a\rangle \}\end{align} \pi_A(\{ \langle A=a, B=b \rangle \}) \setminus \pi_A(\{ \langle A=a, B=b' \rangle \}) = \emptyset where is assumed to be distinct from . Rename Basic rename properties Successive renames of a variable can be collapsed into a single rename. Rename operations which have no variables in common can be arbitrarily reordered with respect to one another, which can be exploited to make successive renames adjacent so that they can be collapsed. • \rho_{a / b}(\rho_{b / c}(R)) = \rho_{a / c}(R)\,\! • \rho_{a / b}(\rho_{c / d}(R)) = \rho_{c / d}(\rho_{a / b}(R))\,\! Rename and set operators Rename is distributive over set difference, union, and intersection. • \rho_{a / b}(R \setminus P) = \rho_{a / b}(R) \setminus \rho_{a / b}(P) • \rho_{a / b}(R \cup P) = \rho_{a / b}(R) \cup \rho_{a / b}(P) • \rho_{a / b}(R \cap P) = \rho_{a / b}(R) \cap \rho_{a / b}(P) Product and union Cartesian product is distributive over union. • (A \times B) \cup (A \times C) = A \times (B \cup C) == Implementations ==

Implementations

The first query language to be based on Codd's algebra was Alpha, developed by Dr. Codd himself. Subsequently, ISBL was created, and this pioneering work has been acclaimed by many authorities as having shown the way to make Codd's idea into a useful language. Business System 12 was a short-lived industry-strength relational DBMS that followed the ISBL example. In 1998 Chris Date and Hugh Darwen proposed a language called Tutorial D intended for use in teaching relational database theory, and its query language also draws on ISBL's ideas. Rel is an implementation of Tutorial D. Bmg is an implementation of relational algebra in Ruby which closely follows the principles of Tutorial D and The Third Manifesto. Even the query language of SQL is loosely based on a relational algebra, though the operands in SQL (tables) are not exactly relations and several useful theorems about the relational algebra do not hold in the SQL counterpart (arguably to the detriment of optimisers and/or users). The SQL table model is a bag (multiset), rather than a set. For example, the expression (R \cup S) \setminus T = (R \setminus T) \cup (S \setminus T) is a theorem for relational algebra on sets, but not for relational algebra on bags. == See also ==

Source: Wikipedia ↗

tickerdossier.com tickerdossier.substack.com