org.terrier.structures
Class CollectionStatistics

java.lang.Object
  extended by org.terrier.structures.CollectionStatistics
All Implemented Interfaces:
Serializable, org.apache.hadoop.io.Writable
Direct Known Subclasses:
Index.UpdatingCollectionStatistics

public class CollectionStatistics
extends Object
implements Serializable, org.apache.hadoop.io.Writable

This class provides basic statistics for the indexed collection of documents, such as the average length of documents, or the total number of documents in the collection.
After indexing, the statistics are saved in the PREFIX.log file, along with the names of the classes that should be used for the Lexicon, the DocumentIndex, the DirectIndex and the InvertedIndex. This means that an index records how it was built and how it should be opened again.

Author:
Gianni Amati, Vassilis Plachouras, Craig Macdonald
See Also:
Serialized Form

Field Summary
protected  double averageDocumentLength
          The average length of a document in the collection.
protected  double[] avgFieldLengths
          Average length of each field
protected  long[] fieldTokens
          Number of tokens in each field
protected  int numberOfDocuments
          The total number of documents in the collection.
protected  int numberOfFields
          Number of fields used to index
protected  long numberOfPointers
          The total number of pointers in the inverted file.
protected  long numberOfTokens
          The total number of tokens in the collection.
protected  int numberOfUniqueTerms
          The total number of unique terms in the collection.
 
Constructor Summary
CollectionStatistics()
           
CollectionStatistics(int numDocs, int numTerms, long numTokens, long numPointers, long[] _fieldTokens)
          Constructs an instance of the class with the specified statistics.
 
Method Summary
 void addStatistics(CollectionStatistics cs)
          Increments these statistics by those of the specified CollectionStatistics object
 double getAverageDocumentLength()
          Returns the documents' average length.
 double[] getAverageFieldLengths()
          Returns the average length of each field in tokens
 long[] getFieldTokens()
          Returns the length of each field in tokens
 int getNumberOfDocuments()
          Returns the total number of documents in the collection.
 int getNumberOfFields()
          Returns the number of fields being used to index
 long getNumberOfPointers()
          Returns the total number of pointers in the collection.
 long getNumberOfTokens()
          Returns the total number of tokens in the collection.
 int getNumberOfUniqueTerms()
          Returns the total number of unique terms in the lexicon.
 void readFields(DataInput in)
           
protected  void relcaluateAverageLengths()
           
 String toString()
          Returns a concrete representation of an index's statistics
 void write(DataOutput out)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

numberOfFields

protected int numberOfFields
Number of fields used to index


fieldTokens

protected long[] fieldTokens
Number of tokens in each field


avgFieldLengths

protected double[] avgFieldLengths
Average length of each field


numberOfDocuments

protected int numberOfDocuments
The total number of documents in the collection.


numberOfTokens

protected long numberOfTokens
The total number of tokens in the collection.


numberOfPointers

protected long numberOfPointers
The total number of pointers in the inverted file. This corresponds to the sum of the document frequencies for the terms in the lexicon.


numberOfUniqueTerms

protected int numberOfUniqueTerms
The total number of unique terms in the collection. This corresponds to the number of entries in the lexicon.
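
The relationship between the lexicon and these two counts can be illustrated with a small sketch. It does not use the Terrier API: the class LexiconCounts, its methods, and the sample terms are hypothetical, and the lexicon is modelled as a plain map from each term to its document frequency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Terrier: a toy lexicon maps each term
// to its document frequency (the number of documents containing the term).
class LexiconCounts {
    // numberOfPointers: the sum of the document frequencies of all terms.
    static long numberOfPointers(Map<String, Integer> lexicon) {
        long pointers = 0L;
        for (int documentFrequency : lexicon.values()) {
            pointers += documentFrequency;
        }
        return pointers;
    }

    // numberOfUniqueTerms: one lexicon entry per unique term.
    static int numberOfUniqueTerms(Map<String, Integer> lexicon) {
        return lexicon.size();
    }

    // Made-up sample data for illustration only.
    static Map<String, Integer> sampleLexicon() {
        Map<String, Integer> lexicon = new LinkedHashMap<>();
        lexicon.put("index", 3);
        lexicon.put("terrier", 2);
        lexicon.put("query", 1);
        return lexicon;
    }

    public static void main(String[] args) {
        Map<String, Integer> lexicon = sampleLexicon();
        System.out.println("numberOfPointers = " + numberOfPointers(lexicon));
        System.out.println("numberOfUniqueTerms = " + numberOfUniqueTerms(lexicon));
    }
}
```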


averageDocumentLength

protected double averageDocumentLength
The average length of a document in the collection.
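
As a rough illustration, the averages can be derived from the token and document counts. The sketch below assumes that the average document length is numberOfTokens divided by numberOfDocuments, and that each field average is that field's token count divided by numberOfDocuments. This page does not state either formula, and the class AverageLengths is hypothetical rather than part of Terrier.

```java
// Hypothetical helper, not part of Terrier. Assumes each average is a
// token count divided by the number of documents.
class AverageLengths {
    // Mean number of tokens per document; 0 for an empty collection,
    // which avoids a division by zero.
    static double averageDocumentLength(long numberOfTokens, int numberOfDocuments) {
        if (numberOfDocuments == 0) {
            return 0.0d;
        }
        return (double) numberOfTokens / numberOfDocuments;
    }

    // Mean number of tokens per document for each field.
    static double[] averageFieldLengths(long[] fieldTokens, int numberOfDocuments) {
        double[] averages = new double[fieldTokens.length];
        if (numberOfDocuments == 0) {
            return averages; // all zeros for an empty collection
        }
        for (int i = 0; i < fieldTokens.length; i++) {
            averages[i] = (double) fieldTokens[i] / numberOfDocuments;
        }
        return averages;
    }

    public static void main(String[] args) {
        System.out.println(averageDocumentLength(1000L, 4));
        System.out.println(java.util.Arrays.toString(
                averageFieldLengths(new long[] {200L, 800L}, 4)));
    }
}
```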

Constructor Detail

CollectionStatistics

public CollectionStatistics(int numDocs,
                            int numTerms,
                            long numTokens,
                            long numPointers,
                            long[] _fieldTokens)
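Constructs an instance of the class with the specified statistics.

Parameters:
numDocs - the total number of documents in the collection
numTerms - the total number of unique terms in the collection
numTokens - the total number of tokens in the collection
numPointers - the total number of pointers in the inverted file
_fieldTokens - the number of tokens in each field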

CollectionStatistics

public CollectionStatistics()

Method Detail

relcaluateAverageLengths

protected void relcaluateAverageLengths()

toString

public String toString()
Returns a concrete representation of an index's statistics

Overrides:
toString in class Object

getAverageDocumentLength

public double getAverageDocumentLength()
Returns the documents' average length.

Returns:
the average length of the documents in the collection.

getNumberOfDocuments

public int getNumberOfDocuments()
Returns the total number of documents in the collection.

Returns:
the total number of documents in the collection

getNumberOfPointers

public long getNumberOfPointers()
Returns the total number of pointers in the collection.

Returns:
the total number of pointers in the collection

getNumberOfTokens

public long getNumberOfTokens()
Returns the total number of tokens in the collection.

Returns:
the total number of tokens in the collection

getNumberOfUniqueTerms

public int getNumberOfUniqueTerms()
Returns the total number of unique terms in the lexicon.

Returns:
the total number of unique terms in the lexicon

getNumberOfFields

public int getNumberOfFields()
Returns the number of fields being used to index


getFieldTokens

public long[] getFieldTokens()
Returns the length of each field in tokens


getAverageFieldLengths

public double[] getAverageFieldLengths()
Returns the average length of each field in tokens


addStatistics

public void addStatistics(CollectionStatistics cs)
Increments these statistics by those of the specified CollectionStatistics object
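
One plausible way to combine two sets of statistics is sketched below. The class MergeSketch is a hypothetical stand-in, not the Terrier implementation. It assumes that document, token, pointer and per-field token counts are summed, that both objects use the same number of fields, and that the average is recomputed afterwards. Unique term counts cannot simply be added because the two lexicons may share terms, so the sketch keeps the larger count as a lower bound; that choice is also an assumption.

```java
// Hypothetical stand-in for a statistics holder; not the Terrier class.
class MergeSketch {
    int numberOfDocuments;
    int numberOfUniqueTerms;
    long numberOfTokens;
    long numberOfPointers;
    long[] fieldTokens;
    double averageDocumentLength;

    MergeSketch(int numDocs, int numTerms, long numTokens, long numPointers, long[] fieldTokens) {
        this.numberOfDocuments = numDocs;
        this.numberOfUniqueTerms = numTerms;
        this.numberOfTokens = numTokens;
        this.numberOfPointers = numPointers;
        this.fieldTokens = fieldTokens.clone(); // defensive copy
        recalculateAverage();
    }

    // Assumed formula: tokens per document, 0 for an empty collection.
    void recalculateAverage() {
        averageDocumentLength = numberOfDocuments == 0
                ? 0.0d
                : (double) numberOfTokens / numberOfDocuments;
    }

    // Adds the counts of 'other' to this object (same field count assumed).
    void addStatistics(MergeSketch other) {
        numberOfDocuments += other.numberOfDocuments;
        numberOfTokens += other.numberOfTokens;
        numberOfPointers += other.numberOfPointers;
        // Assumption: the lexicons may overlap, so keep the larger count
        // as a lower bound instead of summing.
        numberOfUniqueTerms = Math.max(numberOfUniqueTerms, other.numberOfUniqueTerms);
        for (int i = 0; i < fieldTokens.length; i++) {
            fieldTokens[i] += other.fieldTokens[i];
        }
        recalculateAverage();
    }

    public static void main(String[] args) {
        MergeSketch a = new MergeSketch(2, 10, 100L, 40L, new long[] {30L, 70L});
        MergeSketch b = new MergeSketch(3, 12, 200L, 60L, new long[] {50L, 150L});
        a.addStatistics(b);
        System.out.println(a.numberOfDocuments + " documents, average length "
                + a.averageDocumentLength);
    }
}
```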


readFields

public void readFields(DataInput in)
                throws IOException
Specified by:
readFields in interface org.apache.hadoop.io.Writable
Throws:
IOException

write

public void write(DataOutput out)
           throws IOException
Specified by:
write in interface org.apache.hadoop.io.Writable
Throws:
IOException
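
The write and readFields methods follow the Writable pattern of serialising to a DataOutput and restoring from a DataInput. The sketch below shows that pattern as a round trip using only java.io, so it runs without Hadoop or Terrier. The class WritableSketch and the order in which it writes its fields are assumptions made for illustration; the byte layout actually used by CollectionStatistics is not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in with Writable-style methods; the field order is assumed.
class WritableSketch {
    int numberOfDocuments;
    int numberOfUniqueTerms;
    long numberOfTokens;
    long numberOfPointers;
    long[] fieldTokens = new long[0];

    // Writes every field in a fixed order, prefixing the array with its length.
    public void write(DataOutput out) throws IOException {
        out.writeInt(numberOfDocuments);
        out.writeInt(numberOfUniqueTerms);
        out.writeLong(numberOfTokens);
        out.writeLong(numberOfPointers);
        out.writeInt(fieldTokens.length);
        for (long tokens : fieldTokens) {
            out.writeLong(tokens);
        }
    }

    // Reads the fields back in exactly the order used by write.
    public void readFields(DataInput in) throws IOException {
        numberOfDocuments = in.readInt();
        numberOfUniqueTerms = in.readInt();
        numberOfTokens = in.readLong();
        numberOfPointers = in.readLong();
        fieldTokens = new long[in.readInt()];
        for (int i = 0; i < fieldTokens.length; i++) {
            fieldTokens[i] = in.readLong();
        }
    }

    // Serialises to an in-memory buffer and restores into a fresh object.
    static WritableSketch roundTrip(WritableSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                original.write(out);
            }
            WritableSketch copy = new WritableSketch();
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Made-up sample values for illustration only.
    static WritableSketch sample() {
        WritableSketch stats = new WritableSketch();
        stats.numberOfDocuments = 5;
        stats.numberOfUniqueTerms = 12;
        stats.numberOfTokens = 300L;
        stats.numberOfPointers = 100L;
        stats.fieldTokens = new long[] {80L, 220L};
        return stats;
    }

    public static void main(String[] args) {
        WritableSketch copy = roundTrip(sample());
        System.out.println(copy.numberOfDocuments + " documents restored");
    }
}
```

Whatever the actual layout is, readFields must consume values in the same order and with the same types that write produced them, since the stream carries no field names.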


Terrier 3.6. Copyright © 2004-2011 University of Glasgow