Group testing

In statistics and combinatorial mathematics, group testing is any procedure that breaks up the task of identifying objects into tests on groups of items, rather than testing each item individually. First studied by Robert Dorfman in 1943, group testing is a relatively new field of mathematics that can be applied to a wide range of practical applications and is an active area of research today.

Basic description and terms

Unlike many areas of mathematics, the origins of group testing can be traced back to a single report Variations and extensions There are many ways to extend the problem of group testing. One of the most important is called noisy group testing, and deals with a big assumption of the original problem: that testing is error-free. A group-testing problem is called noisy when there is some chance that the result of a group test is erroneous (e.g. comes out positive when the test contained no defectives). The Bernoulli noise model assumes this probability is some constant, q, but in general it can depend on the true number of defectives in the test and the number of items tested. A noisy algorithm will always have a non-zero probability of making an error (that is, mislabeling an item). Group testing with inhibitors is a variant with applications in molecular biology. Here, there is a third class of items called inhibitors, and the result of a test is positive if it contains at least one defective and no inhibitors. == History and development ==

History and development

Invention and initial progress The concept of group testing was first introduced by Robert Dorfman in 1943 in a short report published in the Notes section of Annals of Mathematical Statistics. After 1943, group testing remained largely untouched for a number of years. Then in 1957, Sterrett produced an improvement on Dorfman's procedure. This newer process starts by again performing individual testing on the positive groups, but stopping as soon as a defective is identified. Then, the remaining items in the group are tested together, since it is very likely that none of them are defective. The first thorough treatment of group testing was given by Sobel and Groll in their formative 1959 paper on the subject. They described five new procedures – in addition to generalisations for when the prevalence rate is unknown – and for the optimal one, they provided an explicit formula for the expected number of tests it would use. The paper also made the connection between group testing and information theory for the first time, as well as discussing several generalisations of the group-testing problem and providing some new applications of the theory. The fundamental result by Peter Ungar in 1960 shows that if the prevalence rate p>p_{u}, where p_{u} = (3-\sqrt{5})/2\approx 0.38, then individual testing is the optimal group testing procedure with respect to the expected number of tests, and if p, then it is not optimal. However, it is important to note that despite 80 years' worth of research effort, the optimal procedure is yet unknown for p and a general population size n>2. Combinatorial group testing Group testing was first studied in the combinatorial context by Li in 1962, There is an open question as to when individual testing is minmax. Hu, Hwang and Wang showed in 1981 that individual testing is minmax when n \leq \lfloor (5d + 1)/2 \rfloor, and that it is not minmax when n > 3d. Some progress was made in 2000 by Riccio and Colbourn, who showed that for large n, individual testing is minmax when d \geq n/\log_{3/2}(3) \approx 0.369n. Non-adaptive and probabilistic testing One of the key insights in non-adaptive group testing is that significant gains can be made by eliminating the requirement that the group-testing procedure be certain to succeed (the "combinatorial" problem), but rather permit it to have some low but non-zero probability of mis-labelling each item (the "probabilistic" problem). It is known that as the number of defective items approaches the total number of items, exact combinatorial solutions require significantly more tests than probabilistic solutions — even probabilistic solutions permitting only an asymptotically small probability of error. In this vein, Chan et al. (2011) introduced COMP, a probabilistic algorithm that requires no more than t = ed(1+\delta)\ln(n) tests to find up to d defectives in n items with a probability of error no more than n^{-\delta}. This is within a constant factor of the t = O(d\log_2 n) lower bound. Chan et al. (2011) also provided a generalisation of COMP to a simple noisy model, and similarly produced an explicit performance bound, which was again only a constant (dependent on the likelihood of a failed test) above the corresponding lower bound. In general, the number of tests required in the Bernoulli noise case is a constant factor larger than in the noiseless case. Aldridge, Baldassini and Johnson (2014) produced an extension of the COMP algorithm that added additional post-processing steps. They showed that the performance of this new algorithm, called DD, strictly exceeds that of COMP, and that DD is 'essentially optimal' in scenarios where d^2 \geq n, by comparing it to a hypothetical algorithm that defines a reasonable optimum. The performance of this hypothetical algorithm suggests that there is room for improvement when d^2 , as well as suggesting how much improvement this might be. == Formalisation of combinatorial group testing ==

Formalisation of combinatorial group testing

This section formally defines the notions and terms relating to group testing. • The input vector, \mathbf{x} = (x_1, x_2, \dots, x_n), is defined to be a binary vector of length n (that is, \mathbf{x} \in \{0,1\}^n), with the j-th item being called defective if and only if x_j= 1. Further, any non-defective item is called a 'good' item. \mathbf{x} is intended to describe the (unknown) set of defective items. The key property of \mathbf{x} is that it is an implicit input. That is to say, there is no direct knowledge of what the entries of \mathbf{x} are, other than that which can be inferred via some series of 'tests'. This leads on to the next definition. • Let \mathbf{x} be an input vector. A set, S \subseteq \{ 1, 2, \dots, n \} is called a test. When testing is noiseless, the result of a test is positive when there exists j \in S such that x_j = 1, and the result is negative otherwise. Therefore, the goal of group testing is to come up with a method for choosing a 'short' series of tests that allow \mathbf{x} to be determined, either exactly or with a high degree of certainty. • A group-testing algorithm is said to make an error if it incorrectly labels an item (that is, labels any defective item as non-defective or vice versa). This is not the same thing as the result of a group test being incorrect. An algorithm is called zero-error if the probability that it makes an error is zero. • t(d, n) denotes the minimum number of tests required to always find d defectives among n items with zero probability of error by any group-testing algorithm. For the same quantity but with the restriction that the algorithm is non-adaptive, the notation \bar{t}(d, n) is used. \mathbf{x} is defined as the number of 1's in \mathbf{x}. Hence, |\mathbf{x}| \leq d where |\mathbf{x}| is the Hamming weight. The vector \mathbf{x} is an implicit input since we do not know the positions of the 1's. The only way to find out is to run the tests.--> General bounds Since it is always possible to resort to individual testing by setting S_j = \{j\} for each 1 \leq j \leq n, it must be that that \bar{t}(d, n) \leq n. Also, since any non-adaptive testing procedure can be written as an adaptive algorithm by simply performing all the tests without regard to their outcome, t(d, n) \leq \bar{t}(d, n). Finally, when 0 \neq d \neq n, there is at least one item whose defectiveness must be determined (by at least one test), and so 1 \leq t(d, n). In summary (when assuming 0 \neq d \neq n), 1 \leq t(d,n) \leq \bar{t}(d, n) \leq n .{{efn|In fact it is possible to do much better. For example, Li's s-stage algorithm gives an explicit construction were t \leq \frac{e}{\log_2 e} d \log_2{(n/d)}.}} Information lower bound A lower bound on the number of tests needed can be described using the notion of sample space, denoted \mathcal{S}, which is simply the set of possible placements of defectives. For any group testing problem with sample space \mathcal{S} and any group-testing algorithm, it can be shown that t \geq \lceil \log_2 \rceil, where t is the minimum number of tests required to identify all defectives with a zero probability of error. This is called the information lower bound. Representation of non-adaptive algorithms Algorithms for non-adaptive group testing consist of two distinct phases. First, it is decided how many tests to perform and which items to include in each test. In the second phase, often called the decoding step, the results of each group test are analysed to determine which items are likely to be defective. The first phase is usually encoded in a matrix as follows. • Suppose a non-adaptive group testing procedure for n items consists of the tests S_1, S_2, \dots, S_t for some t \in \mathbb{N}_{\geq 0}. The testing matrix for this scheme is the t \times n binary matrix, M, where (M)_{ij} = 1 if and only if j \in S_i (and is zero otherwise). Thus each column of M represents an item and each row represents a test, with a 1 in the (i,j)\textrm{-th} entry indicating that the i\textrm{-th} test included the j\textrm{-th} item and a 0 indicating otherwise. As well as the vector \mathbf{x} (of length n) that describes the unknown defective set, it is common to introduce the result vector, which describes the results of each test. • Let t be the number of tests performed by a non-adaptive algorithm. The result vector, \mathbf{y} = (y_1, y_2, \dots, y_t), is a binary vector of length t (that is, \mathbf{y} \in \{0,1\}^t) such that y_i= 1 if and only if the result of the i\textrm{-th} test was positive (i.e. contained at least one defective).{{efn|Alternatively \mathbf {y} can be defined by the equation \mathbf{y} := M\mathbf {x}, where multiplication is logical AND (\wedge) and addition is logical OR (\vee). Here, \mathbf {y} will have a 1 in position i if and only if (M)_{i,j} and \mathbf {x} _{j} are both 1 for any j. That is, if and only if at least one defective item was included in the i\textrm{-th} test.}} With these definitions, the non-adaptive problem can be reframed as follows: first a testing matrix is chosen, M, after which the vector \mathbf{y} is returned. Then the problem is to analyse \mathbf{y} to find some estimate for \mathbf{x}. In the simplest noisy case, where there is a constant probability, q, that a group test will have an erroneous result, one considers a random binary vector, \mathbf{v}, where each entry has a probability q of being 1, and is 0 otherwise. The vector that is returned is then \hat{\mathbf{y}} = \mathbf{y} + \mathbf{v}, with the usual addition on (\mathbb{Z}/2\mathbb{Z})^n (equivalently this is the element-wise XOR operation). A noisy algorithm must estimate \mathbf{x} using \hat{\mathbf{y}} (that is, without direct knowledge of \mathbf{y}). Bounds for non-adaptive algorithms The matrix representation makes it possible to prove some bounds on non-adaptive group testing. The approach mirrors that of many deterministic designs, where d-separable matrices are considered, as defined below. • A binary matrix, M, is called d-separable if every Boolean sum (logical OR) of any d of its columns is distinct. Additionally, the notation \bar{d}-separable indicates that every sum of any of up to d of M's columns is distinct. (This is not the same as M being k-separable for every k \leq d.) When M is a testing matrix, the property of being d-separable (\bar{d}-separable) is equivalent to being able to distinguish between (up to) d defectives. However, it does not guarantee that this will be straightforward. A stronger property, called disjunctness does. • A binary matrix, M is called d-disjunct if the Boolean sum of any d columns does not contain any other column. (In this context, a column A is said to contain a column B if for every index where B has a 1, A also has a 1.) A useful property of d-disjunct testing matrices is that, with up to d defectives, every non-defective item will appear in at least one test whose outcome is negative. This means there is a simple procedure for finding the defectives: just remove every item that appears in a negative test. Using the properties of d-separable and d-disjunct matrices the following can be shown for the problem of identifying d defectives among n total items. • The number of tests needed for an asymptotically small average probability of error scales as O(d\log_2 n). • The number of tests needed for an asymptotically small maximum probability of error scales as O(d^2 \log_2 n). • The number of tests needed for a zero probability of error scales as O \left(\frac{d^2 \log_2 n}{\log_2 d} \right). == Generalised binary-splitting algorithm ==

Generalised binary-splitting algorithm

The generalised binary-splitting algorithm is an essentially-optimal adaptive group-testing algorithm that finds d or fewer defectives among n items as follows: • If n \leq 2d - 2, test the n items individually. Otherwise, set l = n - d + 1 and \alpha = \lfloor \log_2{l/d} \rfloor. • Test a group of size 2^\alpha. If the outcome is negative, every item in the group is declared to be non-defective; set n := n - 2^\alpha and go to step 1. Otherwise, use a binary search to identify one defective and an unspecified number, called x, of non-defective items; set n := n - 1 - x and d := d - 1. Go to step 1. The generalised binary-splitting algorithm requires no more than T tests where T = \begin{cases} n & n \leq 2d-2\\ (\alpha+2)d + p - 1 & n \geq 2d - 1 \end{cases} . For n/d large, it can be shown that T \rightarrow d \log_2(n/d), which compares favorably to the t = \frac{e}{\log_2 e}d\log_2 \left( \frac{n}{d} \right) tests required for Li's s-stage algorithm. In fact, the generalised binary-splitting algorithm is close to optimal in the following sense. When d \geq 2 it can be shown that T - B_I(d,n) \leq (d-1), where B_I(d,n) = \left\lceil \log_2 \sum_{i=0}^d {n \choose i} \right\rceil is the information lower bound. == Non-adaptive algorithms ==

Non-adaptive algorithms

Non-adaptive group-testing algorithms tend to assume that the number of defectives, or at least a good upper bound on them, is known. This quantity is denoted d in this section. If no bounds are known, there are non-adaptive algorithms with low query complexity that can help estimate d. Combinatorial orthogonal matching pursuit (COMP) Combinatorial Orthogonal Matching Pursuit, or COMP, is a simple non-adaptive group-testing algorithm that forms the basis for the more complicated algorithms that follow in this section. First, each entry of the testing matrix is chosen i.i.d. to be 1 with probability 1/d and 0 otherwise. The decoding step proceeds column-wise (i.e. by item). If every test in which an item appears is positive, then the item is declared defective; otherwise the item is assumed to be non-defective. Or equivalently, if an item appears in any test whose outcome is negative, the item is declared non-defective; otherwise the item is assumed to be defective. An important property of this algorithm is that it never creates false negatives, though a false positive occurs when all locations with ones in the j-th column of M (corresponding to a non-defective item j) are "hidden" by the ones of other columns corresponding to defective items. The COMP algorithm requires no more than ed(1 + \delta) \ln(n) tests to have an error probability less than or equal to n^{-\delta}. The decoding step uses a useful property of the COMP algorithm: that every item that COMP declares non-defective is certainly non-defective (that is, there are no false negatives). It proceeds as follows. • First the COMP algorithm is run, and any non-defectives that it detects are removed. All remaining items are now "possibly defective". • Next the algorithm looks at all the positive tests. If an item appears as the only "possible defective" in a test, then it must be defective, so the algorithm declares it to be defective. • All other items are assumed to be non-defective. The justification for this last step comes from the assumption that the number of defectives is much smaller than the total number of items. Note that steps 1 and 2 never make a mistake, so the algorithm can only make a mistake if it declares a defective item to be non-defective. Thus the DD algorithm can only create false negatives. Sequential COMP (SCOMP) SCOMP (Sequential COMP) is an algorithm that makes use of the fact that DD makes no mistakes until the last step, where it is assumed that the remaining items are non-defective. Let the set of declared defectives be K. A positive test is called explained by K if it contains at least one item in K. The key observation with SCOMP is that the set of defectives found by DD may not explain every positive test, and that every unexplained test must contain a hidden defective. The algorithm proceeds as follows. • Carry out steps 1 and 2 of the DD algorithm to obtain K, an initial estimate for the set of defectives. • If K explains every positive test, terminate the algorithm: K is the final estimate for the set of defectives. • If there are any unexplained tests, find the "possible defective" that appears in the largest number of unexplained tests, and declare it to be defective (that is, add it to the set K). Go to step 2. In simulations, SCOMP has been shown to perform close to optimally. The algorithm is for the construction of the pooling matrix M, which can be straightforwardly used to decode the observations in y . Similar to COMP, a sample is decoded according to the relation: x_i = 1 ~~ \text{ if } ~~ M(:,i)~.*~y = M(:,i) , where .* represents element wise multiplication and M(:,i) is the ith column of M. Since the decoding step is not difficult, PP is specialized for generating M. Forming groups A group/pool \ell is generated using a polynomial relation that specifies the indices of the samples contained in each pool. A set of input parameters determines the algorithm. For a prime number p>1 and an integer n \ge 1 any prime power is defined by q=p^n. For a dimension parameter c \ge 2 the total number of samples is n = q^c and the number of samples per pool is q^{c-1} . Further, the Finite field of order q is denoted by \mathbb{F}_q (i.e., the integers \{0,1,2,\ldots,q-1\} defined by special arithmetic operations that ensure that addition and multiplication in \mathbb{F}_q remains in \mathbb{F}_q ). The method arranges each sample in a grid and represents it by coordinates x = (u,v) . The coordinates are computed according to a polynomial relation using the integers 1 \le l \le c-1 , 0 \le u_{i_l} \le q-1 v~=~a^{c-1}~u_{i_{c-1}} + \cdots + a~u_{i_1} + b, \quad a, b, u_{i_l} \in \mathbb{F}_q. The combination of looping through the u_{i_l} values is represented by a set with q^{c-1} elements of a sequence of d-1 integers, i.e., u_{i_1} \times \cdots \times u_{i_{c-1}} = \{(i_1,\ldots,i_{c-1})\} , where 0 \le i_l \le q-1 . Without loss of generality, the combination is such that i_{d-1} cycles every q times, i_{d-2} cycles every q^2 times until i_1 cycles only once. Formulas that compute the sample indices, and thus the corresponding pools, for fixed a and b, are given by \begin{align} u_i &= \sum_{l=1}^{c-1}~q^{d-1-l}~i_l \\ v_{u_i} &= \sum_{l=1}^{c-1}~a^{l}~i_l + b\quad (\text{computed in } \mathbb{F}_q) \\ x_{q u_i + v_{u_i}} &= (u_i,v_{u_i}) \end{align} The computations in \mathbb{F}_q can be implemented with publicly available software libraries for finite fields, when q is a prime power. When q is a prime number then the computations in \mathbb{F}_q simplify to modulus arithmetics, i.e., v_{u_i} = (\sum_{l=1}^{c-1}a^{l}i_l + b) ~ \text{mod}~q . An example of how to generate one pool \ell when a = 1, b = 0, c = 2 is displayed in the table below, while the corresponding selection of samples is shown in the figure above. This method uses q(c-1)(d+1) tests to exactly identify up to d positives among n = q^c samples. Because of this PP is particularly effective for large sample sizes, since the number of tests grows only linearly with respect to c while the samples grow exponentially with this parameter. However PP can also be effective for small sample sizes. == Example applications ==

Example applications

The generality of the theory of group testing lends it to many diverse applications, including clone screening, locating electrical shorts; medical examination, quantity searching, statistics; cryptography; and data forensics. This section provides a brief overview of a small selection of these applications. Multiaccess channels A multiaccess channel is a communication channel that connects many users at once. Every user can listen and transmit on the channel, but if more than one user transmits at the same time, the signals collide, and are reduced to unintelligible noise. Multiaccess channels are important for various real-world applications, notably wireless computer networks and phone networks. A prominent problem with multiaccess channels is how to assign transmission times to the users so that their messages do not collide. A simple method is to give each user their own time slot in which to transmit, requiring n slots. (This is called time division multiplexing, or TDM.) However, this is very inefficient, since it will assign transmission slots to users that may not have a message, and it is usually assumed that only a few users will want to transmit at any given time – otherwise a multiaccess channel is not practical in the first place. In the context of group testing, this problem is usually tackled by dividing time into 'epochs' in the following way.) is that only a small number of entries of \textbf{v} are significant, meaning that they have a large magnitude. Since the measurements are dot products of \textbf{v}, the equation M\textbf{v} = \textbf{q} holds, where M is a t \times N matrix that describes the set of measurements that have been chosen and \mathbf{q} is the set of measurement results. This construction shows that compressed sensing is a kind of 'continuous' group testing. The primary difficulty in compressed sensing is identifying which entries are significant. This task of identification can be approached with a simple application of group testing. Here a group test produces a complex number: the sum of the entries that are tested. The outcome of a test is called positive if it produces a complex number with a large magnitude, which, given the assumption that the significant entries are sparse, indicates that at least one significant entry is contained in the test. There are explicit deterministic constructions for this type of combinatorial search algorithm, requiring d2^{(\log_2 \log_2 N)^{O(1)}} measurements. However, as with group-testing, these are sub-optimal, and random constructions (such as COMP) can often recover f sub-linearly in N. One example was provided by the Origami Assays project which released open source group testing designs to run on a laboratory standard 96 well plate. In a laboratory setting, one challenge of group testing is the construction of the mixtures can be time-consuming and difficult to do accurately by hand. Origami assays provided a workaround for this construction problem by providing paper templates to guide the technician on how to allocate patient samples across the test wells. Using the largest group testing designs (XL3) it was possible to test 1120 patient samples in 94 assay wells. If the true positive rate was low enough, then no additional testing was required. Data forensics Data forensics is a field dedicated to finding methods for compiling digital evidence of a crime. Such crimes typically involve an adversary modifying the data, documents or databases of a victim, with examples including the altering of tax records, a virus hiding its presence, or an identity thief modifying personal data. A common tool in data forensics is the one-way cryptographic hash. This is a function that takes the data, and through a difficult-to-reverse procedure, produces a unique number called a hash. Hashes, which are often much shorter than the data, allow us to check if the data has been changed without having to wastefully store complete copies of the information: the hash for the current data can be compared with a past hash to determine if any changes have occurred. An unfortunate property of this method is that, although it is easy to tell if the data has been modified, there is no way of determining how: that is, it is impossible to recover which part of the data has changed. One way to get around this limitation is to store more hashes – now of subsets of the data structure – to narrow down where the attack has occurred. However, to find the exact location of the attack with a naive approach, a hash would need to be stored for every datum in the structure, which would defeat the point of the hashes in the first place. (One may as well store a regular copy of the data.) Group testing can be used to dramatically reduce the number of hashes that need to be stored. A test becomes a comparison between the stored and current hashes, which is positive when there is a mismatch. This indicates that at least one edited datum (which is taken as defectiveness in this model) is contained in the group that generated the current hash. In fact, the amount of hashes needed is so low that they, along with the testing matrix they refer to, can even be stored within the organisational structure of the data itself. This means that as far as memory is concerned the test can be performed 'for free'. (This is true with the exception of a master-key/password that is used to secretly determine the hashing function.) == Notes ==

Source: Wikipedia ↗

tickerdossier.com tickerdossier.substack.com