Divergence From Randomness (DFR) Framework

The Divergence from Randomness (DFR) paradigm is a generalization of one of the very first models of Information Retrieval, Harter's 2-Poisson indexing-model [1]. The 2-Poisson model is based on the hypothesis that the level of treatment of the informative words is witnessed by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the documents.

On the other hand, there are words, which do not possess elite documents, and thus their frequency follows a random distribution, that is the single Poisson model. Harter's model was first explored as a retrieval-model by Robertson, Van Rijsbergen and Porter [4]. Successively it was combined with standard probabilistic model by Robertson and Walker [3] and gave birth to the family of the BMs IR models (among them there is the well known BM25 which is at the basis the Okapi system).

DFR models are obtained by instantiating the three components of the framework: selecting a basic randomness model, applying the first normalisation and normalising the term frequencies.

Basic Randomness Models

The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d. In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:

$\displaystyle \textrm{weight} (t\vert d) \propto -\log \textrm{Prob}_{M} \left (t\in d\vert\textrm{Collection}\right)$ (1)

where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. IR is thus seen as a probabilistic process, which uses random drawings from urn models, or equivalently random placement of coloured balls into urns. Instead of urns we have documents, and instead of different colours we have different terms, where each term occurs with some multiplicity in the urns as anyone of a number of related words or phrases which are called tokens of that term. There are many ways to choose M, each of these provides a basic DFR model. The basic models are derived in the following table.

Basic DFR Models
D Divergence approximation of the binomial
P Approximation of the binomial
BE Bose-Einstein distribution
G Geometric approximation of the Bose-Einstein
I(n) Inverse Document Frequency model
I(F) Inverse Term Frequency model
I(ne) Inverse Expected Document Frequency model

If the model M is the binomial distribution, then the basic model is P and computes the value1:

$\displaystyle -\log \textrm{Prob}_{P}(t\in d\vert \textrm{Collection})=-\log \left(\begin {array}{c} TF\ tf \end{array}\right )p^{tf}q^{TF -tf}$ (2)

Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

$\displaystyle -\log \textrm{Prob}_{G}(t\in d\vert \textrm{Collection}) = -\log ...
...1}{ 1+\lambda}\right )\cdot \left(\frac{\lambda}{ 1+\lambda}\right)^{tf}\right)$ (3)
where λ = F/N.

First Normalisation

When a rare term does not occur in a document then it has almost zero probability of being informative for the document. On the contrary, if a rare term has many occurrences in a document then it has a very high probability (almost the certainty) to be informative for the topic described by the document. Similarly to Ponte and Croft's [2] language model, we include a risk component in the DFR models. If the term-frequency in the document is high then the risk for the term of not being informative is minimal. In such a case Formula 1 gives a high value, but a minimal risk has also the negative effect of providing a small information gain. Therefore, instead of using the full weight provided by the Formula 1, we tune or smooth the weight of Formula 1 by considering only the portion of it which is the amount of information gained with the term:

$\displaystyle \textrm{gain}(t\vert d)= P_{\textrm{risk}}\cdot\left(-\log \textrm{Prob}_{M} \left (t\in d\vert \textrm{Collection}\right)\right)$ (4)

The more the term occurs in the elite set, the less term-frequency is due to randomness, and thus the smaller the probability Prisk is, that is:

$\displaystyle P_{\textrm{risk}}=1- \textrm{Prob}(t\in d\vert d\in\textrm{Elite set})$ (5)

We use two models for computing the information-gain with a term within a document: the Laplace L model and the ratio of two Bernoulli's processes B:

$\displaystyle P_{\textrm{risk}} = \left\{ \begin{tabular}{ll} $\displaystyle \f...
...{TF}{df\cdot(tf+1)}$\ &(Ratio $B$\ of two binomial distributions) \end{tabular}$ (6)
where df is the number of documents containing the term.

Term Frequency Normalisation

Before using Formula 4 the document-length dl is normalized to a standard length sl. Consequently, the term-frequencies tf are also recomputed with respect to the standard document-length, that is:

$\displaystyle tfn= tf\cdot\log\left(1+ \displaystyle \frac{sl}{dl}\right)$ (7)

A more flexible formula, refered to as normalisation 2, is given below:

$\displaystyle tfn= tf\cdot\log\left(1+ \displaystyle c\cdot\frac{sl}{dl}\right)$ (8)

The parameter c can be set automatically, as described by He and Ounis [6][7]. Here we give a list of estimated parameter values on TREC collections:

Query type
c value
disk1&2 and disk4&5
title-only queries
description-only queries
title-description-narrative queries

title-only queries
description-only queries
title-description-narrative queries

title-only queries
description-only queries
title-description-narrative queries

title-only queries (TREC11 topic distillation task
title-only queries (TREC12/13 topic distillation tasks

title-only queries
title-description-narrative queries

DFR Models are finally obtained from the generating Formula 4, using a basic DFR model (such as Formulas 2 or 3) in combination with a model of information-gain (such as Formulas 6) and normalizing the term-frequency (such as in Formula 7 or Formula 8).

DFR Models in Terrier

In the distribution of Terrier, the available DFR models are the following:

Model Description
BB2 Bose Enstein model with Bernouli after-effect and normalisation 2.
IFB2 Inverse Term Frequency model with Bernouli after-effect and normalisation 2.
In_expB2 Inverse Expected Document Frequency model with Bernouli after-effect and normalisation 2. The logarithms are base 2. This model can be used for classic ad-hoc tasks.
In_expC2 Inverse Expected Document Frequency model with Bernouli after-effect and normalisation 2. The logarithms are base e. This model can be used for classic ad-hoc tasks.
InL2 Inverse Document Frequency model with Laplace after-effect and normalisation 2. This model can be used for tasks that require early precision.
PL2 Poisson model with Laplace after-effect and normalisation 2. This model can be used for tasks that require early precision [7, 8]

Another provided weighting model is a derivation of the BM25 formula from the Divergence From Randomness framework.

Query Expansion

The query expansion mechanism extracts the most informative terms from the top-returned document as the expanded query terms. In this expansion process, terms in the top-returned documents are weighted using a particular DFR term weighting model. Currently, Terrier deploys Bo1 (Bose-Enstein 1), Bo2 (Bose-Enstein 2) and KL (Kullback-Leibler) term weighting models. The DFR term weighting models follow a parameter-free approach in default.

An alternative approach is Rocchio's query expansion mechanism. A user can switch to the latter approach by setting parameter.free.expansion to false in the terrier.properties file. The default value of the parameter beta of Rocchio's approach is 0.4. To change this parameter, the user needs to specify the property rocchio_beta in the terrier.properties file.

DFR Models and Cross-Entropy

A different interpretation of the gain-risk generating Formula 4 can be explained by the notion of cross-entropy. Shannon's mathematical theory of communication in the 1940s [5] established that the minimal average code word length is about the value of the entropy of the probabilities of the source words. This result is known under the name of the Noiseless Coding Theorem. The term noiseless refers at the assumption of the theorem that there is no possibility of errors in transmitting words. Nevertheless, it may happen that different sources about the same information are available. In general each source produces a different coding. In such cases, we can make a comparison of the two sources of evidence using the cross-entropy. The cross entropy is minimised when the two pair of observations return the same probability density function, and in such a case cross-entropy coincides with the Shannon's entropy.

We possess two tests of randomness: the first test is Prisk and is relative to the term distribution within its elite set, while the second ProbM is relative to the document with respect the entire collection. The first distribution can be treated as a new source of the term distribution, while the coding of the term with the term distribution within the collection can be considered as the primary source. The definition of the cross-entropy relation of these two probabilities distribution is:

$\displaystyle \textrm{gain}(t\vert d)=CE(P_{\textrm{risk}}\vert\vert \textrm{Prob}_{M}) = P_{\textrm{risk}}\cdot\left ( -\log_2 \textrm{Prob}_{M}\right)$ (9)

Relation 9 is indeed Relation 4 of the DFR framework. DFR models can be equivalently defined as the divergence of two probabilities measuring the amount of randomness of two different sources of evidence.

For more details about the Divergence from Randomness framework, you may refer to the PhD thesis of Gianni Amati, or to Amati and Van Rijsbergen's paper Probabilistic models of information retrieval based on measuring divergence from randomness, TOIS 20(4):357-389, 2002.

[1] S.P. Harter. A probabilistic approach to automatic keyword indexing. PhD thesis, Graduate Library, The University of Chicago, Thesis No. T25146, 1974.
[2] J. Ponte and B. Croft. A Language Modeling Approach in Information Retrieval. In The 21st ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), B. Croft, A.Moffat, and C.J. van Rijsbergen, Eds., pp.275-281.
[3] S.E. Robertson and S. Walker. Some simple approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, June 1994), Springer-Verlag, pp. 232-241.
[4] S.E. Robertson, C.J. van Risjbergen and M. Porter. Probabilistic models of indexing and searching. In Information retrieval Research, S.E. Robertson, C.J. van Risjbergen and P. Williams, Eds. Butterworths, 1981, ch. 4, pp. 35-56.
[5] C. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois, 1949.
[6] B. He and I. Ounis. A study of parameter tuning for term frequency normalization, in Proceedings of the twelfth international conference on Information and knowledge management, New Orleans, LA, USA, 2003.
[7] B. He and I. Ounis. Term Frequency Normalisation Tuning for BM25 and DFR Model, in Proceedings of the 27th European Conference on Information Retrieval (ECIR'05), 2005.
[8] V. Plachouras and I. Ounis. Usefulness of Hyperlink Structure for Web Information Retrieval. In Proceedings of ACM SIGIR 2004.
[9] V. Plachouras, B. He and I. Ounis. University of Glasgow in TREC 2004: experiments in Web, Robust and Terabyte tracks with Terrier. In Proceedings of the 13th Text REtrieval Conference (TREC 2004), 2004.


1: We actually use approximating formulas for the factorials. [back]