== Steps ==

In the HITS algorithm, the first step is to retrieve the most relevant pages to the search query. This set is called the root set and can be obtained by taking the top pages returned by a text-based search algorithm. A base set is generated by augmenting the root set with all the web pages that are linked from it and some of the pages that link to it. The web pages in the base set and all hyperlinks among those pages form a focused subgraph. The HITS computation is performed only on this focused subgraph. According to Kleinberg, the reason for constructing a base set is to ensure that most (or many) of the strongest authorities are included.

Authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values of the pages that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked pages.

The algorithm performs a series of iterations, each consisting of two basic steps:

• Authority update: Update each node's authority score to be equal to the sum of the hub scores of each node that points to it. That is, a node is given a high authority score by being linked from pages that are recognized as hubs for information.
• Hub update: Update each node's hub score to be equal to the sum of the authority scores of each node that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.

The hub score and authority score for a node are calculated with the following algorithm:

• Start with each node having a hub score and authority score of 1.
• Run the authority update rule.
• Run the hub update rule.
• Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and dividing each authority score by the square root of the sum of the squares of all authority scores.
• Repeat from the second step as necessary.
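The iterative procedure above can be sketched in Python. This is an illustrative sketch, not a production implementation: the graph representation (a dict mapping each page to the set of pages it links to) and the fixed iteration count are assumptions made for the example.

```python
import math

def hits(graph, iterations=20):
    """Run the HITS iteration on a directed graph.

    `graph` maps each node to the set of nodes it links to.
    Returns (hub, auth) dicts of scores, each L2-normalized.
    """
    # Collect every node, including pure sinks that only appear as targets.
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}   # start with all hub scores = 1
    auth = {n: 1.0 for n in nodes}  # start with all authority scores = 1

    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes that point to n.
        auth = {n: sum(hub[p] for p in nodes if n in graph.get(p, ()))
                for n in nodes}
        # Hub update: sum of authority scores of the nodes n points to.
        hub = {n: sum(auth[q] for q in graph.get(n, ())) for n in nodes}
        # Normalize each score vector by the square root of the sum of squares.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}

    return hub, auth
```

On a small example graph such as `{'a': {'b', 'c'}, 'b': {'c'}, 'c': set()}`, the node with the most in-links (`'c'`) ends with the highest authority score, while `'a'`, which links to both other pages, ends with the highest hub score.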
== Comparison to PageRank ==

HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the documents on the web. However, it has some major differences:

• It is processed on a small subset of ‘relevant’ documents (a 'focused subgraph' or base set), instead of the set of all documents, as is the case with PageRank.
• It is query-dependent: the same page can receive a different hub/authority score given a different base set, which arises for a different query.
• It must, as a corollary, be executed at query time, not at indexing time, with the associated drop in performance that accompanies query-time processing.
• It computes two scores per document (hub and authority) as opposed to a single score.
• It is not commonly used by search engines (though a similar algorithm was said to be used by Teoma, which was acquired by Ask Jeeves/Ask.com).

== In detail ==