Range query (computer science)

In computer science, the range query problem consists of efficiently answering several queries regarding a given interval of elements within an array. For example, a common task, known as range minimum query, is finding the smallest value inside a given range within a list of numbers.

Definition

Given a function f that accepts an array, a range query f_q(l, r) on an array a=[a_1,..,a_n] takes two indices l and r and returns the result of f when applied to the subarray [a_l, \ldots, a_r]. For example, for a function \operatorname{sum} that returns the sum of all values in an array, the range query \operatorname{sum}_q(l, r) returns the sum of all values in the range [l, r]. ==Solutions==

Solutions

Prefix sum array Range sum queries may be answered in constant time and linear space by pre-computing an array of same length as the input such that for every index , the element is the sum of the first elements of . Any query may then be computed as follows: \operatorname{sum}_q(l, r) = p_r - p_{l-1}. This strategy may be extended to any other binary operation f whose inverse function f^{-1} is well-defined and easily computable. It can also be extended to higher dimensions with a similar pre-processing. For example, if contains the sum of the first elements of , then \operatorname{sum}_q(l, r, t, b) = p_{r,b} - p_{l-1,b} - p_{r, t-1} + p_{l-1,t-1}. Dynamic range queries A more difficult subset of the problem consists of executing range queries on dynamic data; that is, data that may mutate between each query. In order to efficiently update array values, more sophisticated data structures like the segment tree or Fenwick tree are necessary. ==Examples==

Examples

Semigroup operators reduced to the lowest common ancestor problem. When the function of interest in a range query is a semigroup operator, the notion of f^{-1} is not always defined, so the strategy in the previous section does not work. Andrew Yao showed that there exists an efficient solution for range queries that involve semigroup operators. He proved that for any constant , a pre-processing of time and space \Theta(c\cdot n) allows to answer range queries on lists where is a semigroup operator in \theta(\alpha_c(n)) time, where \alpha_c is a certain functional inverse of the Ackermann function. There are some semigroup operators that admit slightly better solutions. For instance when f\in \{\max,\min\}. Assume f = \min then \min(A[1..n]) returns the index of the minimum element of A[1..n]. Then \min_{i,j}(A) denotes the corresponding minimum range query. There are several data structures that allow to answer a range minimum query in O(1) time using a pre-processing of time and space O(n). One such solution is based on the equivalence between this problem and the lowest common ancestor problem. The Cartesian tree T_A of an array A[1,n] has as root a_i = \min\{a_1,a_2,\ldots,a_n\} and as left and right subtrees the Cartesian tree of A[1,i-1] and the Cartesian tree of A[i+1,n] respectively. A range minimum query \min_{i,j}(A) is the lowest common ancestor in T_A of a_i and a_j. Because the lowest common ancestor can be solved in constant time using a pre-processing of time and space O(n), range minimum query can as well. The solution when f = \max is analogous. Cartesian trees can be constructed in linear time. Mode The mode of an array is the element that appears the most in it. For instance the mode of a=[4,5,6,7,4] is . In case of a tie, any of the most frequent elements might be picked as the mode. A range mode query consists in pre-processing A[1,n] such that we can find the mode in any range of A[1,n]. Several data structures have been devised to solve this problem, we summarize some of the results in the following table. Median This particular case is of special interest since finding the median has several applications. On the other hand, the median problem, a special case of the selection problem, is solvable in O(n), using the median of medians algorithm. However its generalization through range median queries is recent. rangeMedian(A, i, j, r) { if A.length() == 1 return A[1] if A.low is undefined then m = median(A) A.low = [e in A | e m ] calculate t the number of elements of A[i, j] that belong to A.low if r A[i,j] that end up in is and this number is bigger than then we should keep looking for the element of rank in ; otherwise we should look for the element of rank (r-t) in . To find , it is enough to find the maximum index m\leq i-1 such that a_m is in and the maximum index l\leq j such that a_l is in . Then t=l-m. The total cost for any query, without considering the partitioning part, is \log n since at most \log n recursion calls are done and only a constant number of operations are performed in each of them (to get the value of fractional cascading should be used). If a linear algorithm to find the medians is used, the total cost of pre-processing for range median queries is n\log k. The algorithm can also be modified to solve the online version of the problem. which is also known as the Boyer–Moore majority vote algorithm. Boyer and Moore proposed an algorithm to find the majority element of a string (if it has one) in O(n) time and using O(1) space. In the context of Boyer and Moore's work and generally speaking, a majority element in a set of items (for example string or an array) is one whose number of instances is more than half of the size of that set. Few years later, Misra and Gries proposed a more general version of Boyer and Moore's algorithm using O \left ( n \log \left ( \frac{1}{\tau} \right ) \right ) comparisons to find all items in an array whose relative frequencies are greater than some threshold 0. A range \tau-majority query is one that, given a subrange of a data structure (for example an array) of size |R|, returns the set of all distinct items that appear more than (or in some publications equal to) \tau |R| times in that given range. In different structures that support range \tau-majority queries, \tau can be either static (specified during pre-processing) or dynamic (specified at query time). Many of such approaches are based on the fact that, regardless of the size of the range, for a given \tau there could be at most O(1/\tau) distinct candidates with relative frequencies at least \tau. By verifying each of these candidates in constant time, O(1/\tau) query time is achieved. A range \tau-majority query is decomposable in the sense that a \tau-majority in a range R with partitions R_1 and R_2 must be a \tau-majority in either R_1or R_2. Due to this decomposability, some data structures answer \tau-majority queries on one-dimensional arrays by finding the Lowest common ancestor (LCA) of the endpoints of the query range in a Range tree and validating two sets of candidates (of size O(1/\tau)) from each endpoint to the lowest common ancestor in constant time resulting in O(1/\tau) query time. Two-dimensional arrays Gagie et al. proposed a data structure that supports range \tau-majority queries on an m\times n array A. For each query \operatorname{Q}=(\operatorname{R}, \tau) in this data structure a threshold 0 and a rectangular range \operatorname{R} are specified, and the set of all elements that have relative frequencies (inside that rectangular range) greater than or equal to \tau are returned as the output. This data structure supports dynamic thresholds (specified at query time) and a pre-processing threshold \alpha based on which it is constructed. During the pre-processing, a set of vertical and horizontal intervals are built on the m \times n array. Together, a vertical and a horizontal interval form a block. Each block is part of a superblock nine times bigger than itself (three times the size of the block's horizontal interval and three times the size of its vertical one). For each block a set of candidates (with \frac{9}{\alpha} elements at most) is stored which consists of elements that have relative frequencies at least \frac{\alpha}{9} (the pre-processing threshold as mentioned above) in its respective superblock. These elements are stored in non-increasing order according to their frequencies and it is easy to see that, any element that has a relative frequency at least \alpha in a block must appear its set of candidates. Each \tau-majority query is first answered by finding the query block, or the biggest block that is contained in the provided query rectangle in O(1) time. For the obtained query block, the first \frac{9}{\tau} candidates are returned (without being verified) in O(1/\tau) time, so this process might return some false positives. Many other data structures (as discussed below) have proposed methods for verifying each candidate in constant time and thus maintaining the O(1/\tau) query time while returning no false positives. The cases in which the query block is smaller than 1/\alpha are handled by storing \log \left ( \frac{1}{\alpha} \right ) different instances of this data structure of the following form: \beta=2^{-i}, \;\; i\in \left \{ 1,\dots,\log \left (\frac{1}{\alpha} \right ) \right \} where \beta is the pre-processing threshold of the i-th instance. Thus, for query blocks smaller than 1/\alpha the \lceil\log (1 / \tau)\rceil-th instance is queried. As mentioned above, this data structure has query time O(1/\tau) and requires O \left ( m n(H+1) \log^2 \left ( \frac{1}{\alpha} \right ) \right ) bits of space by storing a Huffman-encoded copy of it (note the \log(\frac{1}{\alpha}) factor and also see Huffman coding). One-dimensional arrays Chan et al. proposed a data structure that given a one-dimensional arrayA, a subrange R of A (specified at query time) and a threshold \tau (specified at query time), is able to return the list of all \tau-majorities in O(1/\tau) time requiring O(n \log n) words of space. To answer such queries, Chan et al. Let z denote the LCA of i and j, using z and according to the decomposability of range \tau-majority queries (as described above and in ) that is able to determine in O(1) time whether or not a given subrange of an array A contains at least q instances of a particular element e. Tree paths Gagie et al. proposed a data structure which supports queries such that, given two nodes u and v in a tree, are able to report the list of elements that have a greater relative frequency than \tau on the path from u to v. More formally, let T be a labelled tree in which each node has a label from an alphabet of size \sigma. Let label(u)\in [1,\dots,\sigma] denote the label of node u in T. Let P_{uv} denote the unique path from u to v in T in which middle nodes are listed in the order they are visited. Given T, and a fixed (specified during pre-processing) threshold 0, a query Q(u,v) must return the set of all labels that appear more than \tau|P_{uv}| times in P_{uv}. To construct this data structure, first {O}(\tau n) nodes are marked. This can be done by marking any node that has distance at least \lceil 1 / \tau\rceil from the bottom of the three (height) and whose depth is divisible by \lceil 1 / \tau\rceil. After doing this, it can be observed that the distance between each node and its nearest marked ancestor is less than 2\lceil 1 / \tau\rceil. For a marked node x, \log(depth(x)) different sequences (paths towards the root) P_i(x) are stored, P_{i}(x)=\left\langle \operatorname{label}(x), \operatorname{par}(x), \operatorname{par}^{2}(x), \ldots, \operatorname{par}^{2^i}(x)\right\rangle for 0\leq i \leq \log(depth(x)) where \operatorname{par}(x) returns the label of the direct parent of node x. Put another way, for each marked node, the set of all paths with a power of two length (plus one for the node itself) towards the root is stored. Moreover, for each P_i(x), the set of all majority candidates C_i(x) are stored. More specifically, C_i(x) contains the set of all (\tau/2)-majorities in P_i(x) or labels that appear more than (\tau/2).(2^i+1) times in P_i(x). It is easy to see that the set of candidates C_i(x) can have at most 2/\tau distinct labels for each i. Gagie et al. Therefore, verifying each of the O(1/\tau) candidates in O\left(\log \log _{w} \sigma\right) time results in O\left((1/\tau)\log \log _{w} \sigma\right) total query time for returning the set of all \tau -majorities on the path from u to v . ==Related problems==