Package org.terrier.structures
Class CollectionStatistics
- java.lang.Object
  - org.terrier.structures.CollectionStatistics
- All Implemented Interfaces:
java.io.Serializable, org.apache.hadoop.io.Writable
- Direct Known Subclasses:
MemoryCollectionStatistics, MultiStats, PropertiesIndex.UpdatingCollectionStatistics
public class CollectionStatistics extends java.lang.Object implements java.io.Serializable, org.apache.hadoop.io.Writable
This class provides basic statistics for the indexed collection of documents, such as the average length of documents, or the total number of documents in the collection.
After indexing, statistics are saved in the PREFIX.log file, along with the classes that should be used for the Lexicon, the DocumentIndex, the DirectIndex and the InvertedIndex. This means that an index knows how it was built and how it should be opened again.
- Author:
- Gianni Amati, Vassilis Plachouras, Craig Macdonald
- See Also:
- Serialized Form
-
-
Field Summary
Modifier and Type | Field | Description
protected double | averageDocumentLength | Average length of a document in the collection.
protected double[] | avgFieldLengths | Average length of each field.
protected java.lang.String[] | fieldNames | Field names.
protected long[] | fieldTokens | Number of tokens in each field.
protected boolean | hasPositions | Whether the index has positions.
protected int | numberOfDocuments | Total number of documents in the collection.
protected int | numberOfFields | Number of fields used to index.
protected long | numberOfPointers | Total number of pointers in the inverted file.
protected long | numberOfTokens | Total number of tokens in the collection.
protected int | numberOfUniqueTerms | Total number of unique terms in the collection.
-
Constructor Summary
Constructor | Description
CollectionStatistics() | Default constructor.
CollectionStatistics(int numDocs, int numTerms, long numTokens, long numPointers, long[] _fieldTokens, java.lang.String[] _fieldNames) | Deprecated.
CollectionStatistics(int numDocs, int numTerms, long numTokens, long numPointers, long[] _fieldTokens, java.lang.String[] _fieldNames, boolean positions) | Constructs an instance of the class.
-
Method Summary
Modifier and Type | Method | Description
void | addStatistics(CollectionStatistics cs) | Increment the collection statistics with the provided collection statistics.
double | getAverageDocumentLength() | Returns the documents' average length.
double[] | getAverageFieldLengths() | Returns the average length of each field in tokens.
java.lang.String[] | getFieldNames() | Returns the field names.
long[] | getFieldTokens() | Returns the length of each field in tokens.
int | getNumberOfDocuments() | Returns the total number of documents in the collection.
int | getNumberOfFields() | Returns the number of fields being used to index.
long | getNumberOfPointers() | Deprecated.
long | getNumberOfPostings() | Returns the total number of postings in the collection.
long | getNumberOfTokens() | Returns the total number of tokens in the collection.
int | getNumberOfUniqueTerms() | Returns the total number of unique terms in the lexicon.
boolean | hasPositions() | Returns true if the inverted index will have position information.
void | readFields(java.io.DataInput in) |
void | readFieldsV5(java.io.DataInput in) |
protected void | recalculateAverageLengths() |
java.lang.String | toString() |
void | write(java.io.DataOutput out) |
-
-
-
Field Detail
-
numberOfFields
protected int numberOfFields
Number of fields used to index
-
fieldTokens
protected long[] fieldTokens
Number of tokens in each field
-
avgFieldLengths
protected double[] avgFieldLengths
Average length of each field
-
fieldNames
protected java.lang.String[] fieldNames
Field names
-
numberOfDocuments
protected int numberOfDocuments
Total number of documents in the collection.
-
numberOfTokens
protected long numberOfTokens
Total number of tokens in the collection.
-
numberOfPointers
protected long numberOfPointers
Total number of pointers in the inverted file. This corresponds to the sum of the document frequencies for the terms in the lexicon.
-
numberOfUniqueTerms
protected int numberOfUniqueTerms
Total number of unique terms in the collection. This corresponds to the number of entries in the lexicon.
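The relationship between the lexicon and the two fields above can be illustrated with a small, self-contained sketch. It does not use Terrier's Lexicon classes: the `LexiconCounts` class and the toy map from term to document frequency are hypothetical stand-ins, chosen only to show that the pointer count is a sum of document frequencies while the unique-term count is the number of lexicon entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a lexicon: term -> document frequency.
// This is not Terrier's Lexicon API, only an illustration of the two counts.
class LexiconCounts {

    /** Sum of the document frequencies of all lexicon entries. */
    static long numberOfPointers(Map<String, Integer> documentFrequencies) {
        long pointers = 0L;
        for (int df : documentFrequencies.values()) {
            pointers += df;
        }
        return pointers;
    }

    /** Number of entries in the lexicon. */
    static int numberOfUniqueTerms(Map<String, Integer> documentFrequencies) {
        return documentFrequencies.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("index", 3);      // "index" occurs in 3 documents
        df.put("retrieval", 2);  // "retrieval" occurs in 2 documents
        df.put("terrier", 1);    // "terrier" occurs in 1 document
        System.out.println("pointers=" + numberOfPointers(df));
        System.out.println("uniqueTerms=" + numberOfUniqueTerms(df));
    }
}
```

The pointer count is accumulated in a `long` because, as in the field declarations above, it can exceed the `int` range on large collections even when the number of unique terms does not.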
-
averageDocumentLength
protected double averageDocumentLength
Average length of a document in the collection.
-
hasPositions
protected boolean hasPositions
Whether the index has positions.
-
-
Constructor Detail
-
CollectionStatistics
@Deprecated public CollectionStatistics(int numDocs, int numTerms, long numTokens, long numPointers, long[] _fieldTokens, java.lang.String[] _fieldNames)
Deprecated.
-
CollectionStatistics
public CollectionStatistics(int numDocs, int numTerms, long numTokens, long numPointers, long[] _fieldTokens, java.lang.String[] _fieldNames, boolean positions)
Constructs an instance of the class.
- Parameters:
numDocs - the number of documents in the collection.
numTerms - the number of terms in the collection.
numTokens - the number of tokens in the collection.
numPointers - the number of pointers in the inverted file.
_fieldTokens - the number of tokens in each field.
_fieldNames - the field names.
positions - whether the index has positions.
-
CollectionStatistics
public CollectionStatistics()
Default constructor.
-
-
Method Detail
-
recalculateAverageLengths
protected void recalculateAverageLengths()
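This method is undocumented here. A plausible reading, given the field descriptions, is that it derives averageDocumentLength and avgFieldLengths from the token and document counts. The sketch below shows that arithmetic in isolation. The `AverageLengths` class is hypothetical, and both the per-document division for fields and the zero-document guard are assumptions rather than confirmed behaviour of this class.

```java
// Hypothetical helper, not part of Terrier: shows how average lengths
// could be derived from token counts and the number of documents.
class AverageLengths {

    /** Tokens per document; assumed to be 0 when the collection is empty. */
    static double averageDocumentLength(long numberOfTokens, int numberOfDocuments) {
        if (numberOfDocuments == 0) {
            return 0.0d;
        }
        return (double) numberOfTokens / numberOfDocuments;
    }

    /** Tokens per document for each field, under the same empty-collection assumption. */
    static double[] averageFieldLengths(long[] fieldTokens, int numberOfDocuments) {
        double[] averages = new double[fieldTokens.length];
        for (int i = 0; i < fieldTokens.length; i++) {
            averages[i] = numberOfDocuments == 0
                    ? 0.0d
                    : (double) fieldTokens[i] / numberOfDocuments;
        }
        return averages;
    }

    public static void main(String[] args) {
        long[] fieldTokens = {90L, 60L}; // e.g. tokens in two fields
        int docs = 15;
        System.out.println(averageDocumentLength(150L, docs));
        System.out.println(java.util.Arrays.toString(averageFieldLengths(fieldTokens, docs)));
    }
}
```

The cast to `double` before dividing matters: dividing a `long` by an `int` directly would truncate the average to a whole number.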
-
toString
public java.lang.String toString()
- Overrides:
toString
in class java.lang.Object
-
hasPositions
public boolean hasPositions()
Returns true if the inverted index will have position information.
-
getAverageDocumentLength
public double getAverageDocumentLength()
Returns the documents' average length.
- Returns:
- the average length of the documents in the collection.
-
getNumberOfDocuments
public int getNumberOfDocuments()
Returns the total number of documents in the collection.
- Returns:
- the total number of documents in the collection.
-
getNumberOfPointers
@Deprecated public long getNumberOfPointers()
Deprecated.
Returns the total number of postings in the collection.
- Returns:
- the total number of postings in the collection.
-
getNumberOfPostings
public long getNumberOfPostings()
Returns the total number of postings in the collection.
- Returns:
- the total number of postings in the collection.
-
getNumberOfTokens
public long getNumberOfTokens()
Returns the total number of tokens in the collection.
- Returns:
- the total number of tokens in the collection.
-
getNumberOfUniqueTerms
public int getNumberOfUniqueTerms()
Returns the total number of unique terms in the lexicon.
- Returns:
- the total number of unique terms in the lexicon.
-
getNumberOfFields
public int getNumberOfFields()
Returns the number of fields being used to index.
- Returns:
- the number of fields being used to index.
-
getFieldTokens
public long[] getFieldTokens()
Returns the length of each field in tokens.
- Returns:
- the length of each field in tokens.
-
getAverageFieldLengths
public double[] getAverageFieldLengths()
Returns the average length of each field in tokens.
- Returns:
- the average length of each field in tokens.
-
getFieldNames
public java.lang.String[] getFieldNames()
Returns the field names.
- Returns:
- the field names.
-
addStatistics
public void addStatistics(CollectionStatistics cs)
Increment the collection statistics with the provided collection statistics.
- Parameters:
cs - the collection statistics to use to increment.
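To make "increment" concrete, here is a minimal sketch of how two sets of statistics might be combined, for example when merging indices. The `StatsMerge` class is a hypothetical stand-in, not this class. Summing the document, token, pointer and per-field token counts follows from their definitions. The treatment of unique terms is an assumption: vocabularies of two collections overlap, so the sketch keeps the larger of the two counts as a lower bound instead of adding them.

```java
// Hypothetical stand-in for illustrating an additive merge of statistics.
class StatsMerge {
    int numberOfDocuments;
    long numberOfTokens;
    long numberOfPointers;
    int numberOfUniqueTerms;
    long[] fieldTokens;

    StatsMerge(int docs, long tokens, long pointers, int uniqueTerms, long[] fieldTokens) {
        this.numberOfDocuments = docs;
        this.numberOfTokens = tokens;
        this.numberOfPointers = pointers;
        this.numberOfUniqueTerms = uniqueTerms;
        this.fieldTokens = fieldTokens.clone();
    }

    /** Adds the counts of other to this instance. */
    void add(StatsMerge other) {
        if (other.fieldTokens.length != fieldTokens.length) {
            throw new IllegalArgumentException("field counts differ");
        }
        numberOfDocuments += other.numberOfDocuments;
        numberOfTokens += other.numberOfTokens;
        numberOfPointers += other.numberOfPointers;
        // Assumption: overlapping vocabularies cannot be summed, so keep a lower bound.
        numberOfUniqueTerms = Math.max(numberOfUniqueTerms, other.numberOfUniqueTerms);
        for (int i = 0; i < fieldTokens.length; i++) {
            fieldTokens[i] += other.fieldTokens[i];
        }
    }

    public static void main(String[] args) {
        StatsMerge a = new StatsMerge(10, 100L, 40L, 25, new long[] {60L, 40L});
        StatsMerge b = new StatsMerge(5, 50L, 20L, 30, new long[] {30L, 20L});
        a.add(b);
        System.out.println(a.numberOfDocuments + " documents, " + a.numberOfTokens + " tokens");
    }
}
```

After such a merge, any averages derived from the counts would need to be recomputed, since the token and document totals have both changed.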
-
readFields
public void readFields(java.io.DataInput in) throws java.io.IOException
- Specified by:
readFields
in interface org.apache.hadoop.io.Writable
- Throws:
java.io.IOException
-
readFieldsV5
public void readFieldsV5(java.io.DataInput in) throws java.io.IOException
- Throws:
java.io.IOException
-
write
public void write(java.io.DataOutput out) throws java.io.IOException
- Specified by:
write
in interface org.apache.hadoop.io.Writable
- Throws:
java.io.IOException
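The write and readFields pair follows the Writable contract: readFields must consume exactly the bytes that write produced, in the same order. The sketch below illustrates that contract with plain java.io streams and no Hadoop dependency. The `StatsRoundTrip` class and its field order are hypothetical; the actual byte layout used by CollectionStatistics is not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in showing a symmetric write/readFields pair.
class StatsRoundTrip {
    int numberOfDocuments;
    int numberOfUniqueTerms;
    long numberOfTokens;
    long numberOfPointers;
    boolean hasPositions;
    long[] fieldTokens = new long[0];

    /** Writes every field in a fixed order (order chosen for this sketch only). */
    void write(DataOutput out) throws IOException {
        out.writeInt(numberOfDocuments);
        out.writeInt(numberOfUniqueTerms);
        out.writeLong(numberOfTokens);
        out.writeLong(numberOfPointers);
        out.writeBoolean(hasPositions);
        out.writeInt(fieldTokens.length); // length prefix so the reader can size the array
        for (long tokens : fieldTokens) {
            out.writeLong(tokens);
        }
    }

    /** Reads the fields back in exactly the order write emitted them. */
    void readFields(DataInput in) throws IOException {
        numberOfDocuments = in.readInt();
        numberOfUniqueTerms = in.readInt();
        numberOfTokens = in.readLong();
        numberOfPointers = in.readLong();
        hasPositions = in.readBoolean();
        fieldTokens = new long[in.readInt()];
        for (int i = 0; i < fieldTokens.length; i++) {
            fieldTokens[i] = in.readLong();
        }
    }

    /** Serializes to a byte array and deserializes into a fresh instance. */
    static StatsRoundTrip copyViaBytes(StatsRoundTrip source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            StatsRoundTrip copy = new StatsRoundTrip();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StatsRoundTrip stats = new StatsRoundTrip();
        stats.numberOfDocuments = 15;
        stats.numberOfTokens = 150L;
        stats.fieldTokens = new long[] {90L, 60L};
        StatsRoundTrip copy = copyViaBytes(stats);
        System.out.println(copy.numberOfDocuments + " documents after round trip");
    }
}
```

The presence of a separate readFieldsV5 method suggests that the serialized layout has changed between versions, which is one reason a reader should not rely on the field order shown in this sketch.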
-
-