Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

• Edge-based: use the edges and their types as the data source;
• Node-based: use the nodes and their properties as the main data sources.

Other measures calculate the similarity between ontological instances:

• Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent;
• Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent.

Some examples:
Edge-based

• Pekar et al.
• Cheng and Cline
• Wu et al.
• Del Pozo et al.
• IntelliGO: Benabderrahmane et al.

Node-based

• Resnik
  • based on the notion of information content. The information content of a concept (term or word) is the negative logarithm of the probability of finding the concept in a given corpus.
  • only considers the information content of the lowest common subsumer (lcs). A lowest common subsumer is a concept in a lexical taxonomy (e.g. WordNet) that has the shortest distance from the two concepts compared. For example, animal and mammal are both subsumers of cat and dog, but mammal is a lower subsumer than animal for them.
• Lin
  • based on Resnik's similarity.
  • considers the information content of the lowest common subsumer (lcs) and of the two compared concepts.
• Maguitman, Menczer, Roinestad and Vespignani
  • generalizes Lin's similarity to arbitrary ontologies (graphs).
• Jiang and Conrath
  • based on Resnik's similarity.
  • considers the information content of the lowest common subsumer (lcs) and of the two compared concepts to calculate the distance between them; the distance is later used in computing the similarity measure. (A sketch of the Resnik, Lin, and Jiang–Conrath measures follows this list.)
• Align, Disambiguate, and Walk: Random walks on Semantic Networks
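To make the three information-content-based measures concrete, here is a minimal Python sketch. It assumes that concept probabilities p(c) have already been estimated from a corpus and that the lowest common subsumer is known; the toy taxonomy and probability values are illustrative only, not from any real corpus.

```python
import math

# Illustrative corpus probabilities p(c) for a toy taxonomy; in practice
# they are estimated by counting occurrences of each concept (and its
# descendants) in a corpus.
p = {"animal": 0.5, "mammal": 0.2, "cat": 0.05, "dog": 0.08}

def ic(concept):
    """Information content: the negative log-probability of the concept."""
    return -math.log(p[concept])

def sim_resnik(c1, c2, lcs):
    """Resnik: only the IC of the lowest common subsumer matters."""
    return ic(lcs)

def sim_lin(c1, c2, lcs):
    """Lin: shared IC normalized by the concepts' own IC."""
    return 2 * ic(lcs) / (ic(c1) + ic(c2))

def dist_jiang_conrath(c1, c2, lcs):
    """Jiang-Conrath distance; a similarity can be derived from it,
    e.g. as 1 / (1 + distance)."""
    return ic(c1) + ic(c2) - 2 * ic(lcs)

# In this toy taxonomy, mammal is the lcs of cat and dog.
print(sim_resnik("cat", "dog", "mammal"))          # ~1.61
print(sim_lin("cat", "dog", "mammal"))             # ~0.58
print(dist_jiang_conrath("cat", "dog", "mammal"))  # ~2.30
```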
Node-and-relation-content-based

• applicable to ontologies
• considers properties (content) of nodes
• considers types (content) of relations
• based on eTVSM
• based on Resnik's similarity
Pairwise

• maximum of the pairwise similarities
• composite average in which only the best-matching pairs are considered (best-match average)

Groupwise

• Jaccard index (a sketch of these pairwise and groupwise combinations follows this list)
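A minimal sketch of these combination strategies, assuming some pairwise concept-similarity function (for instance, one of the IC-based measures sketched above) is already available; the sets stand in for the concept annotations of two instances, and all names are illustrative.

```python
def max_pairwise(set1, set2, sim):
    """Pairwise: maximum of the pairwise similarities between the sets."""
    return max(sim(a, b) for a in set1 for b in set2)

def best_match_average(set1, set2, sim):
    """Pairwise: each concept is matched with its best counterpart in the
    other set, and the best-match scores are averaged."""
    best1 = [max(sim(a, b) for b in set2) for a in set1]
    best2 = [max(sim(a, b) for a in set1) for b in set2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))

def jaccard(set1, set2):
    """Groupwise: Jaccard index on the concept sets themselves."""
    return len(set1 & set2) / len(set1 | set2)

# Toy demo with a trivial similarity (1.0 if two names share a first letter):
toy_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.0
print(best_match_average({"cat", "cow"}, {"camel", "dog"}, toy_sim))  # 0.75
print(jaccard({"cat", "cow"}, {"cat", "dog"}))                        # ~0.33
```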
Statistical similarity

Statistical similarity approaches can be learned from data or predefined.
Similarity learning can often outperform predefined similarity measures. Broadly speaking, these approaches build a statistical model of documents and use it to estimate similarity.

• LSA (latent semantic analysis): (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
• PMI (pointwise mutual information): (+) large vocabulary, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
• SOC-PMI (second-order co-occurrence pointwise mutual information): (+) sorts lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
• GLSA (generalized latent semantic analysis): (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
• ICAN (incremental construction of an associative network): (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
• NGD (normalized Google distance): (+) large vocabulary, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007). (A sketch of the NGD formula follows this list.)
• TSS (Twitter semantic similarity): large vocabulary, because it uses online tweets from Twitter to compute the similarity. It has high temporal resolution, which allows it to capture high-frequency events. Open source.
• NCD (normalized compression distance)
• ESA (explicit semantic analysis), based on Wikipedia and the ODP
• SSA (salient semantic analysis), which indexes terms using salient concepts found in their immediate context.
• n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed; later, Dijkstra's shortest-path algorithm is employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph (a toy sketch follows this list).
• VGEM (vector generation of an explicitly defined multidimensional semantic space): (+) incremental vocabulary, can compare multi-word terms; (−) performance depends on choosing specific dimensions
• SimRank
• NASARI: sparse vector representations constructed by applying the hypergeometric distribution over the Wikipedia corpus in combination with the BabelNet taxonomy. Cross-lingual similarity is also possible thanks to the multilingual and unified extension.
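The NGD formula from Cilibrasi & Vitanyi (2007) is compact enough to state as code. The sketch below assumes the hit counts have already been obtained from a search-engine API; the counts in the demo are invented for illustration.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google distance.
    fx, fy: hit counts for each term alone; fxy: hit count for both
    terms together; n: total number of pages indexed by the engine."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Invented counts, standing in for real search-engine results.
print(ngd(fx=2.2e8, fy=1.5e8, fxy=5.0e7, n=2.5e10))  # ~0.29
```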
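As a toy illustration of the noW idea, the sketch below builds a small hand-made topic graph and measures geodesic distance with Dijkstra's algorithm via networkx. The real measure derives its directed acyclic graph from Wikipedia's hierarchical structure; the topics, edges, and the choice to traverse the graph undirected here are assumptions made for the example.

```python
import networkx as nx

# Hand-made fragment of a Wikipedia-like topic hierarchy (illustrative only).
g = nx.DiGraph()
g.add_edges_from([
    ("Science", "Physics"),
    ("Science", "Biology"),
    ("Physics", "Quantum mechanics"),
    ("Biology", "Genetics"),
])

# noW between two topics: shortest-path (geodesic) distance in the graph.
# Edges are unweighted, so Dijkstra uses unit edge weights here.
undirected = g.to_undirected()  # allow paths both up and down the hierarchy
print(nx.dijkstra_path_length(undirected, "Quantum mechanics", "Genetics"))  # 4
```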
Semantics-based similarity

• Marker passing: combining lexical decomposition for automated ontology creation and marker passing, the approach of Fähndrich et al. introduces a new type of semantic similarity measure. Here, markers are passed from the two target concepts, each carrying an amount of activation. This activation may increase or decrease depending on the weight of the relations over which the concepts are connected. This combines edge- and node-based approaches and includes connectionist reasoning with symbolic information (a simplified spreading-activation sketch follows this list).
• Good common subsumer (GCS)-based semantic similarity measure
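The following is not Fähndrich et al.'s implementation, only a simplified spreading-activation sketch of the general marker-passing idea: markers propagate from the two target concepts over weighted relations, and the overlap of the resulting activation maps is read as a similarity signal. The relation graph, weights, and two-step decay scheme are invented for illustration.

```python
# Toy weighted relation graph (invented; a real system would use an
# ontology built by lexical decomposition).
relations = {
    "cat": [("mammal", 0.9), ("pet", 0.7)],
    "dog": [("mammal", 0.9), ("pet", 0.8)],
    "mammal": [("animal", 0.8)],
    "pet": [("animal", 0.3)],
    "animal": [],
}

def spread(start, steps=2):
    """Pass markers outward from a start concept; activation is scaled
    by the weight of each relation it crosses."""
    activation = {start: 1.0}
    frontier = {start: 1.0}
    for _ in range(steps):
        next_frontier = {}
        for node, act in frontier.items():
            for neighbor, weight in relations[node]:
                gain = act * weight
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + gain
                activation[neighbor] = activation.get(neighbor, 0.0) + gain
        frontier = next_frontier
    return activation

def marker_similarity(c1, c2):
    """Read the overlap of the two activation maps as similarity."""
    a1, a2 = spread(c1), spread(c2)
    return sum(min(a1[n], a2[n]) for n in set(a1) & set(a2))

print(marker_similarity("cat", "dog"))  # overlap on mammal, pet, animal
```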
Semantic similarity networks

A semantic similarity network (SSN) is a special form of semantic network designed to represent concepts and their semantic similarity. Its main contribution is reducing the complexity of calculating semantic distances. Bendeck (2004, 2008) introduced the concept of semantic similarity networks (SSN) as the specialization of a semantic network to measure semantic similarity from ontological representations. Implementations include genetic information handling.
Gold standards

Researchers have collected datasets with similarity judgements on pairs of words, which are used to evaluate the cognitive plausibility of computational measures. The gold standard up to today is an old 65-word list on which humans have judged word similarity.

• RG65
• MC30
• WordSim353

== See also ==