Probabilistic context-free grammar

In theoretical linguistics and computational linguistics, probabilistic context free grammars (PCFGs) extend context-free grammars, similar to how hidden Markov models extend regular grammars. Each production is assigned a probability. The probability of a derivation (parse) is the product of the probabilities of the productions used in that derivation. These probabilities can be viewed as parameters of the model, and for large problems it is convenient to learn these parameters via machine learning. A probabilistic grammar's validity is constrained by context of its training dataset.

Definitions

Derivation: The process of recursive generation of strings from a grammar. Parsing: Finding a valid derivation using an automaton. Parse Tree: The alignment of the grammar to a sequence. An example of a parser for PCFG grammars is the pushdown automaton. The algorithm parses grammar nonterminals from left to right in a stack-like manner. This brute-force approach is not very efficient. In RNA secondary structure prediction variants of the Cocke–Younger–Kasami (CYK) algorithm provide more efficient alternatives to grammar parsing than pushdown automata. == Formal definition ==

Formal definition

Similar to a CFG, a probabilistic context-free grammar can be defined by a quintuple: :G = (M, T, R, S, P) where • is the set of non-terminal symbols • is the set of terminal symbols • is the set of production rules • is the start symbol • is the set of probabilities on production rules ==Relation with hidden Markov models==

Relation with hidden Markov models

PCFGs models extend context-free grammars the same way as hidden Markov models extend regular grammars. The Inside-Outside algorithm is an analogue of the Forward-Backward algorithm. It computes the total probability of all derivations that are consistent with a given sequence, based on some PCFG. This is equivalent to the probability of the PCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar. The Inside-Outside algorithm is used in model parametrization to estimate prior frequencies observed from training sequences in the case of RNAs. Dynamic programming variants of the CYK algorithm find the Viterbi parse of a RNA sequence for a PCFG model. This parse is the most likely derivation of the sequence by the given PCFG. ==Grammar construction==

Grammar construction

Context-free grammars are represented as a set of rules inspired from attempts to model natural languages. The rules are absolute and have a typical syntax representation known as Backus–Naur form. The production rules consist of terminal \left \{a, b \right\} and non-terminal ' symbols and a blank \epsilon' may also be used as an end point. In the production rules of CFG and PCFG the left side has only one nonterminal whereas the right side can be any string of terminal or nonterminals. In PCFG nulls are excluded. An example of a grammar: : S \to aS, S \to bS, S \to \epsilon This grammar can be shortened using the '|' ('or') character into: : S\to aS | bS |\epsilon Terminals in a grammar are words and through the grammar rules a non-terminal symbol is transformed into a string of either terminals and/or non-terminals. The above grammar is read as "beginning from a non-terminal the emission can generate either or or \epsilon". Its derivation is: : S \Rightarrow aS\Rightarrow abS\Rightarrow abbS \Rightarrow abb Ambiguous grammar may result in ambiguous parsing if applied on homographs since the same word sequence can have more than one interpretation. Pun sentences such as the newspaper headline "Iraqi Head Seeks Arms" are an example of ambiguous parses. One strategy of dealing with ambiguous parses (originating with grammarians as early as Pāṇini) is to add yet more rules, or prioritize them so that one rule takes precedence over others. This, however, has the drawback of proliferating the rules, often to the point where they become difficult to manage. Another difficulty is overgeneration, where unlicensed structures are also generated. Probabilistic grammars circumvent these problems by ranking various productions on frequency weights, resulting in a "most likely" (winner-take-all) interpretation. As usage patterns are altered in diachronic shifts, these probabilistic rules can be re-learned, thus updating the grammar. Assigning probability to production rules makes a PCFG. These probabilities are informed by observing distributions on a training set of similar composition to the language to be modeled. On most samples of broad language, probabilistic grammars where probabilities are estimated from data typically outperform hand-crafted grammars. CFGs when contrasted with PCFGs are not applicable to RNA structure prediction because while they incorporate sequence-structure relationship they lack the scoring metrics that reveal a sequence structural potential ==Weighted context-free grammar==

Weighted context-free grammar

A weighted context-free grammar (WCFG) is a more general category of context-free grammar, where each production has a numeric weight associated with it. The weight of a specific parse tree in a WCFG is the product (or sum ) of all rule weights in the tree. Each rule weight is included as often as the rule is used in the tree. A special case of WCFGs are PCFGs, where the weights are (logarithms of ) probabilities. An extended version of the CYK algorithm can be used to find the "lightest" (least-weight) derivation of a string given some WCFG. When the tree weight is the product of the rule weights, WCFGs and PCFGs can express the same set of probability distributions. ==Applications==

Applications

RNA structure prediction Since the 1990s, PCFG has been applied to model RNA structures. As a consequence, most applications of formal language theory to protein analysis have been mainly restricted to the production of grammars of lower expressive power to model simple functional patterns based on local interactions. Since protein structures commonly display higher-order dependencies including nested and crossing relationships, they clearly exceed the capabilities of any CFG. Still, development of PCFGs allows expressing some of those dependencies and providing the ability to model a wider range of protein patterns. ==See also==

Source: Wikipedia ↗

tickerdossier.com tickerdossier.substack.com