org.terrier.indexing
Class TRECCollection

java.lang.Object
  extended by org.terrier.indexing.TRECCollection
All Implemented Interfaces:
java.io.Closeable, Collection, DocumentExtractor
Direct Known Subclasses:
TRECUTFCollection, TRECWebCollection

public class TRECCollection
extends java.lang.Object
implements Collection, DocumentExtractor

Models a TREC test collection by implementing the interfaces Collection and DocumentExtractor. It provides sequential access to the documents in the collection and can also return the text of a document as a String. The precise Document class to be used can be specified with the trec.document.class property. TREC format files are opened using the default encoding unless the trec.encoding property has been set to a valid supported encoding. Since 3.5, the contents of tags can be added to the meta index instead of being indexed normally. This is useful for holding URLs or dates that you need later during retrieval. To use this, the fields in the TREC file need to be ordered, the tags to add need to be specified in TrecDocTags.propertytags and indexer.meta.forward.keys, and the maximum length for each tag needs to be given in indexer.meta.forward.keylens.
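As a rough illustration of how the meta-index properties named above relate to each other, the sketch below loads an example configuration with java.util.Properties and pairs each entry of indexer.meta.forward.keys with its length from indexer.meta.forward.keylens. The property names come from the description above. The values (url, date, the docno entry and the lengths) and the MetaKeyConfig class are assumptions made for this example; they are not Terrier defaults or Terrier code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper, not part of Terrier: pairs meta keys with their maximum lengths. */
final class MetaKeyConfig {

    /** Example settings; every value here is an illustrative assumption. */
    static final String EXAMPLE =
          "trec.encoding=UTF-8\n"
        + "TrecDocTags.propertytags=url,date\n"
        + "indexer.meta.forward.keys=docno,url,date\n"
        + "indexer.meta.forward.keylens=20,256,32\n";

    /** Parses properties from text in the standard java.util.Properties format. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns key -> maximum length, in the order the keys were declared. */
    static Map<String, Integer> keyLengths(Properties props) {
        String[] keys = props.getProperty("indexer.meta.forward.keys", "").split(",");
        String[] lens = props.getProperty("indexer.meta.forward.keylens", "").split(",");
        if (keys.length != lens.length) {
            throw new IllegalArgumentException(
                "keys and keylens must have the same number of entries");
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            result.put(keys[i].trim(), Integer.parseInt(lens[i].trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(keyLengths(load(EXAMPLE)));
    }
}
```

Checking that the two lists have the same number of entries catches a mismatch in the configuration before any lengths are used.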

Properties:

Author:
Craig Macdonald & Vassilis Plachouras & Richard McCreadie

Field Summary
protected  CountingInputStream br
          The inputstream used for reading data.
protected  java.lang.String currentFilename
          Filename of current file
protected  java.lang.String desiredEncoding
          Encoding to be used to open all files.
protected  java.util.HashSet<java.lang.String> DocIDBlacklist
           
protected  java.lang.String docnotag
          The docno tag
protected  java.util.Map<java.lang.String,java.lang.String> DocProperties
          properties for the current document
protected  java.lang.Class<? extends Document> documentClass
           
protected  int documentCounter
          Counts the documents that are found in the collection, ignoring those documents that appear in the black list
protected  int documentsInThisFile
          Counts the number of documents that have been found in this file.
protected  char[] end_docnoTag
          The closing document number tag.
protected  int end_docnoTagLength
          The length of the closing document number tag.
protected  java.lang.String end_docTag
          The closing document tag.
protected  int end_docTagLength
          The length of the closing document tag.
protected  boolean endOfCollection
          Indicates whether the end of the collection has been reached.
protected  char[][] endPropertyTags
          The end property tags
protected  int FileNumber
          The index in the FilesToProcess of the currently processed file.
protected  java.util.ArrayList<java.lang.String> FilesToProcess
          The list of files to process.
protected  boolean ignoreProperties
          Do we ignore properties?
protected static org.apache.log4j.Logger logger
          logger for this class
protected  int[] propertyTagLengths
          The length of each property tag
protected  boolean SkipFile
          A boolean which is true when a new file is open.
protected  char[] start_docnoTag
          The opening document number tag.
protected  int start_docnoTagLength
          The length of the opening document number tag.
protected  char[] start_docTag
          The opening document tag.
protected  int start_docTagLength
          The length of the opening document tag.
protected  char[][] startPropertyTags
          The start property tags
protected  boolean tags_CaseSensitive
          Is the markup case-sensitive?
protected  java.lang.String ThisDocID
          The string identifier of the current document.
protected  Tokeniser tokeniser
           
 
Constructor Summary
TRECCollection()
          A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids and opens the first collection file to process.
TRECCollection(java.io.InputStream input)
          A constructor that reads only the document in the specified InputStream.
TRECCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
          Specific constructor: reads the files listed in CollectionSpecFilename and the blacklist of document IDs in BlacklistSpecFilename. The fourth parameter is no longer used.
 
Method Summary
protected  void afterPropertyTags()
           
 void close()
          Closes the files and streams used by the collection object.
 boolean endOfCollection()
          Indicates whether the end of the collection has been reached.
 java.lang.String getDocid()
          Returns the document number of the current document.
 Document getDocument()
          Returns the current document to process.
 Document getDocument(TagSet _tags, TagSet _exact, TagSet _fields)
          Deprecated. 
 java.lang.String getDocumentString(int docid)
          Deprecated. 
protected  java.lang.StringBuilder getTag(int taglength, char[] startTag, char[] endTag)
          Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object
 boolean hasNext()
          Checks whether more documents remain in the collection
protected  void loadDocumentClass()
          Loads the class that will supply all documents for this Collection.
 Document next()
          Return next document
 boolean nextDocument()
          Moves to the next document to process from the collection.
protected  boolean openNextFile()
          Opens the next file from the collection specification.
protected  void readCollectionSpec(java.lang.String CollectionSpecFilename)
           
protected  void readDocumentBlacklist(java.lang.String BlacklistSpecFilename)
           
 void remove()
          This is unsupported by this Collection implementation; all invocations throw UnsupportedOperationException.
 void reset()
          Resets the collection object back to the beginning of the collection.
protected  void setTags(java.lang.String TagSet)
          Initialises the opening and closing document and document number tags.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
logger for this class


currentFilename

protected java.lang.String currentFilename
Filename of current file


documentsInThisFile

protected int documentsInThisFile
Counts the number of documents that have been found in this file.


DocProperties

protected java.util.Map<java.lang.String,java.lang.String> DocProperties
properties for the current document


documentCounter

protected int documentCounter
Counts the documents that are found in the collection, ignoring those documents that appear in the black list


DocIDBlacklist

protected java.util.HashSet<java.lang.String> DocIDBlacklist

FilesToProcess

protected java.util.ArrayList<java.lang.String> FilesToProcess
The list of files to process.


FileNumber

protected int FileNumber
The index in the FilesToProcess of the currently processed file.


ThisDocID

protected java.lang.String ThisDocID
The string identifier of the current document.


br

protected CountingInputStream br
The inputstream used for reading data.


SkipFile

protected boolean SkipFile
A boolean which is true when a new file is open.


start_docTag

protected char[] start_docTag
The opening document tag.


start_docTagLength

protected int start_docTagLength
The length of the opening document tag.


end_docTag

protected java.lang.String end_docTag
The closing document tag.


end_docTagLength

protected int end_docTagLength
The length of the closing document tag.


start_docnoTag

protected char[] start_docnoTag
The opening document number tag.


start_docnoTagLength

protected int start_docnoTagLength
The length of the opening document number tag.


end_docnoTag

protected char[] end_docnoTag
The closing document number tag.


end_docnoTagLength

protected int end_docnoTagLength
The length of the closing document number tag.


tags_CaseSensitive

protected boolean tags_CaseSensitive
Is the markup case-sensitive?


ignoreProperties

protected boolean ignoreProperties
Do we ignore properties?


docnotag

protected java.lang.String docnotag
The docno tag


propertyTagLengths

protected int[] propertyTagLengths
The length of each property tag


startPropertyTags

protected char[][] startPropertyTags
The start property tags


endPropertyTags

protected char[][] endPropertyTags
The end property tags


desiredEncoding

protected java.lang.String desiredEncoding
Encoding to be used to open all files.
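
The fallback described for trec.encoding (use the configured encoding if it is valid and supported, otherwise the default) can be pictured with standard java.nio.charset calls. This is a minimal sketch under that reading, not the class's actual code; EncodingFallback and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Terrier. */
final class EncodingFallback {

    /** Resolves the requested charset, or the platform default if it is missing or unsupported. */
    static Charset resolve(String desiredEncoding) {
        if (desiredEncoding == null || desiredEncoding.isEmpty()) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.isSupported(desiredEncoding)
                    ? Charset.forName(desiredEncoding)
                    : Charset.defaultCharset();
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return Charset.defaultCharset();
        }
    }

    /** Decodes the whole stream with the resolved charset. */
    static String readAll(InputStream in, String desiredEncoding) {
        try (Reader reader = new InputStreamReader(in, resolve(desiredEncoding))) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<DOC>caf\u00e9</DOC>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes), "UTF-8"));
    }
}
```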


documentClass

protected java.lang.Class<? extends Document> documentClass

tokeniser

protected Tokeniser tokeniser

endOfCollection

protected boolean endOfCollection
Indicates whether the end of the collection has been reached.

Constructor Detail

TRECCollection

public TRECCollection(java.lang.String CollectionSpecFilename,
                      java.lang.String TagSet,
                      java.lang.String BlacklistSpecFilename,
                      java.lang.String ignored)
Specific constructor: reads the files listed in CollectionSpecFilename and the blacklist of document IDs in BlacklistSpecFilename. The fourth parameter is no longer used. The collection will be parsed according to the TagSet specified by the TagSet string.

Parameters:
CollectionSpecFilename - The collections specification filename. The file contains a list of filenames to read. Must be specified, fatal error otherwise.
TagSet - the TagSet constructor string to use to obtain the tags to parse for.
BlacklistSpecFilename - The name of a file containing a list of document identifiers that must NOT be processed. Not loaded if null or of length 0.
ignored - no longer used
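
To make the two input files concrete, here is a small sketch of reading a list with one entry per line, as could be used for both the collection specification and the document blacklist. Skipping blank lines and lines starting with '#' is an assumption about the file format, and SpecFiles is a hypothetical class. It takes text rather than filenames so that the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Terrier. */
final class SpecFiles {

    /** Reads one entry per line, skipping blank lines and (by assumption) '#' comment lines. */
    static List<String> readEntries(Reader source) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) {
                    entries.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    /** Builds a blacklist set; a null or empty source is not loaded and yields an empty set. */
    static Set<String> readBlacklist(String text) {
        if (text == null || text.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(readEntries(new StringReader(text)));
    }

    public static void main(String[] args) {
        String spec = "# example collection spec\n/data/part1.gz\n\n/data/part2.gz\n";
        System.out.println(readEntries(new StringReader(spec)));
        System.out.println(readBlacklist("DOC-0002\n"));
    }
}
```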

TRECCollection

public TRECCollection()
A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids and opens the first collection file to process. TagSet TagSet.TREC_DOC_TAGS is used to tokenize the collection.


TRECCollection

public TRECCollection(java.io.InputStream input)
A constructor that reads only the document in the specified InputStream. Also reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids, and opens the first collection file to process.

Method Detail

setTags

protected void setTags(java.lang.String TagSet)
Initialises the opening and closing document and document number tags.


readCollectionSpec

protected void readCollectionSpec(java.lang.String CollectionSpecFilename)

readDocumentBlacklist

protected void readDocumentBlacklist(java.lang.String BlacklistSpecFilename)

loadDocumentClass

protected void loadDocumentClass()
Loads the class that will supply all documents for this Collection. Set by property trec.document.class


hasNext

public boolean hasNext()
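Checks whether more documents remain, i.e. whether the end of the collection has not yet been reached.

Returns:
boolean true if there are more documents to process, otherwise false.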

next

public Document next()
Return next document

Returns:
next document

remove

public void remove()
This is unsupported by this Collection implementation; all invocations throw UnsupportedOperationException.
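
The hasNext(), next() and remove() methods follow the familiar read-only iterator pattern. The sketch below shows that pattern over an in-memory list of document numbers. ReadOnlyDocIterator is a hypothetical stand-in for illustration and does not use any Terrier classes.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical read-only iterator, not part of Terrier. */
final class ReadOnlyDocIterator implements Iterator<String> {
    private final List<String> docnos;
    private int position = 0;

    ReadOnlyDocIterator(List<String> docnos) {
        this.docnos = docnos;
    }

    /** True while at least one more element can be returned by next(). */
    @Override
    public boolean hasNext() {
        return position < docnos.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return docnos.get(position++);
    }

    /** Removal is not supported; every call throws. */
    @Override
    public void remove() {
        throw new UnsupportedOperationException("remove() is not supported");
    }

    public static void main(String[] args) {
        Iterator<String> it = new ReadOnlyDocIterator(Arrays.asList("doc-1", "doc-2"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```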


nextDocument

public boolean nextDocument()
Moves to the next document to process from the collection.

Specified by:
nextDocument in interface Collection
Returns:
boolean true if there are more documents to process in the collection, otherwise it returns false.

afterPropertyTags

protected void afterPropertyTags()
                          throws java.io.IOException
Throws:
java.io.IOException

getTag

protected java.lang.StringBuilder getTag(int taglength,
                                         char[] startTag,
                                         char[] endTag)
                                  throws java.io.IOException
Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object

Parameters:
taglength - the length of the start tag
startTag - the start tag
endTag - the end tag
Returns:
the tag contents
Throws:
java.io.IOException
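
The scan described for getTag can be sketched as a two-phase read over a character stream: skip input until the start tag has just been read, then collect characters until the end tag has just been read. This is an illustrative sketch only. TagScanner, its Reader-based signature and the null return for a missing tag are assumptions, and it ignores the case-sensitivity option.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical scanner, not part of Terrier. */
final class TagScanner {

    /** True when the last tag.length characters of buffer equal tag. */
    private static boolean endsWith(StringBuilder buffer, char[] tag) {
        int offset = buffer.length() - tag.length;
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < tag.length; i++) {
            if (buffer.charAt(offset + i) != tag[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the contents of the first startTag...endTag pair, or null if none is complete. */
    static StringBuilder firstTagContents(Reader in, char[] startTag, char[] endTag) {
        try {
            // Phase 1: keep a sliding window of the last startTag.length characters.
            StringBuilder window = new StringBuilder();
            boolean started = false;
            int c;
            while ((c = in.read()) != -1) {
                window.append((char) c);
                if (window.length() > startTag.length) {
                    window.deleteCharAt(0);
                }
                if (endsWith(window, startTag)) {
                    started = true;
                    break;
                }
            }
            if (!started) {
                return null;
            }
            // Phase 2: collect characters until the end tag has just been read.
            StringBuilder contents = new StringBuilder();
            while ((c = in.read()) != -1) {
                contents.append((char) c);
                if (endsWith(contents, endTag)) {
                    contents.setLength(contents.length() - endTag.length);
                    return contents;
                }
            }
            return null; // the end tag never appeared
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<DOC>\n<DOCNO>AP-0001</DOCNO>\n<TEXT>hello</TEXT>\n</DOC>";
        System.out.println(firstTagContents(new StringReader(doc),
                "<DOCNO>".toCharArray(), "</DOCNO>".toCharArray()));
    }
}
```

Comparing the tail of the buffer after every character is slower than a dedicated string-matching algorithm, but it stays correct for tags with repeated prefixes and keeps the sketch short.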

getDocument

public Document getDocument()
Returns the current document to process.

Specified by:
getDocument in interface Collection
Returns:
Document the object of the current document to process.

getDocument

@Deprecated
public Document getDocument(TagSet _tags,
                                       TagSet _exact,
                                       TagSet _fields)
Deprecated. 

A TREC-specific getDocument method, that allows the tags to be specified for each document.

Returns:
Document the object of the current document to process.

getDocid

public java.lang.String getDocid()
Returns the document number of the current document.

Returns:
String the document number of the current document.

endOfCollection

public boolean endOfCollection()
Indicates whether the end of the collection has been reached.

Specified by:
endOfCollection in interface Collection
Returns:
boolean true if there are no more documents to process in the collection, otherwise it returns false.

reset

public void reset()
Resets the collection object back to the beginning of the collection.

Specified by:
reset in interface Collection
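
The nextDocument(), getDocid(), endOfCollection() and reset() methods are typically used together in a simple loop. The sketch below imitates that contract with an in-memory list and also skips blacklisted document numbers, in the spirit of the documentCounter field description. InMemoryCollection is a hypothetical stand-in written for this example, not the Terrier implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in, not part of Terrier. */
final class InMemoryCollection {
    private final List<String> docnos;
    private final Set<String> blacklist;
    private int cursor = 0;                  // index of the next candidate document
    private String currentDocno = null;      // document number of the current document
    private boolean endOfCollection = false;

    InMemoryCollection(List<String> docnos, Set<String> blacklist) {
        this.docnos = docnos;
        this.blacklist = blacklist;
    }

    /** Moves to the next non-blacklisted document; returns false when none remain. */
    boolean nextDocument() {
        while (cursor < docnos.size()) {
            String candidate = docnos.get(cursor++);
            if (!blacklist.contains(candidate)) {
                currentDocno = candidate;
                return true;
            }
        }
        currentDocno = null;
        endOfCollection = true;
        return false;
    }

    String getDocid() {
        return currentDocno;
    }

    boolean endOfCollection() {
        return endOfCollection;
    }

    /** Returns to the beginning so the documents can be read again. */
    void reset() {
        cursor = 0;
        currentDocno = null;
        endOfCollection = false;
    }

    /** Reads every remaining document number using the nextDocument() loop. */
    static List<String> drain(InMemoryCollection collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocid());
        }
        return seen;
    }

    public static void main(String[] args) {
        InMemoryCollection collection = new InMemoryCollection(
                Arrays.asList("d1", "d2", "d3"),
                new HashSet<>(Arrays.asList("d2")));
        System.out.println(drain(collection));
        collection.reset();
        System.out.println(drain(collection));
    }
}
```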

openNextFile

protected boolean openNextFile()
                        throws java.io.IOException
Opens the next file from the collection specification.

Returns:
boolean true if the file was opened successfully. If there are no more files to open, it returns false.
Throws:
java.io.IOException - if there is an exception while opening the collection files.

getDocumentString

@Deprecated
public java.lang.String getDocumentString(int docid)
Deprecated. 

Returns the text of a document with the given identifier.

Specified by:
getDocumentString in interface DocumentExtractor
Parameters:
docid - the internal identifier of a document.
Returns:
String the text of the document as a string.

close

public void close()
Closes the files and streams used by the collection object.

Specified by:
close in interface java.io.Closeable


Terrier 3.5. Copyright © 2004-2011 University of Glasgow