Terrier IR Platform 2.2.1
java.lang.Object uk.ac.gla.terrier.indexing.TRECCollection
public class TRECCollection
Models a TREC test collection by implementing the interfaces Collection and DocumentExtractor. It provides sequential access to the documents in the collection and can also return the text of a document as a String. TREC format files are opened using the default encoding unless the property trec.encoding has been set to a valid, supported encoding.
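The encoding rule described above can be sketched as follows. This is a minimal illustration and not Terrier's actual code: a plain java.util.Properties object stands in for Terrier's own property handling, and the names EncodingChoiceSketch and resolveCharset are hypothetical. Only the property name trec.encoding comes from this page, and falling back to the default for an unsupported value is an assumption.

```java
import java.nio.charset.Charset;
import java.util.Properties;

// Hypothetical helper; Terrier's real property lookup is not shown here.
class EncodingChoiceSketch {

    // Returns the charset named by trec.encoding when it is set to a
    // supported encoding, otherwise the platform default charset.
    static Charset resolveCharset(Properties props) {
        String name = props.getProperty("trec.encoding");
        if (name == null || name.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        try {
            if (Charset.isSupported(name.trim())) {
                return Charset.forName(name.trim());
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal charset name: fall through to the default.
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("trec.encoding", "UTF-8");
        System.out.println(resolveCharset(props).name());
    }
}
```

The resolved Charset could then be passed to an InputStreamReader when a collection file is opened.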
Properties:
Constructor Summary | |
---|---|
TRECCollection()
A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids, and opens the first collection file to process. |
|
TRECCollection(java.io.InputStream input)
A constructor that reads only the document in the specified InputStream. |
|
TRECCollection(java.lang.String CollectionSpecFilename,
java.lang.String TagSet,
java.lang.String BlacklistSpecFilename,
java.lang.String docPointersFilename)
Specific constructor: reads the files listed in CollectionSpecFilename, the Blacklist of Document IDs in BlacklistSpecFilename, and stores document offsets and lengths in the document pointers file docPointersFilename. |
Method Summary | |
---|---|
void | close() - Closes the files and streams used by the collection object. |
boolean | endOfCollection() - Indicates whether the end of the collection has been reached. |
java.lang.String | getDocid() - Returns the document number of the current document. |
Document | getDocument() - Returns the current document to process. |
Document | getDocument(TagSet _tags, TagSet _exact, TagSet _fields) - A TREC-specific getDocument method that allows the tags to be specified for each document. |
java.lang.String | getDocumentString(int docid) - Returns the text of a document with the given identifier. |
boolean | nextDocument() - Moves to the next document to process from the collection. |
void | reset() - Resets the collection object back to the beginning of the collection. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
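The sequential-access methods in the summary above are typically combined into a simple loop. The sketch below cannot depend on Terrier itself, so it declares a local stand-in interface (SketchCollection) that copies the method names and signatures from this page, plus a hypothetical in-memory implementation. It assumes that nextDocument() returns true when it has moved to another document and false once the collection is exhausted; the page does not state the meaning of the return value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local stand-in mirroring the method names listed on this page;
// it is NOT Terrier's Collection interface.
interface SketchCollection {
    boolean nextDocument();
    String getDocid();
    boolean endOfCollection();
    void reset();
    void close();
}

// Hypothetical in-memory collection, used only to make the sketch runnable.
class InMemoryCollection implements SketchCollection {
    private final List<String> docnos;
    private int position = -1; // index of the current document; -1 before the first

    InMemoryCollection(List<String> docnos) {
        this.docnos = new ArrayList<>(docnos);
    }

    // Assumption: true means "moved to another document".
    public boolean nextDocument() {
        if (position + 1 >= docnos.size()) {
            position = docnos.size();
            return false;
        }
        position++;
        return true;
    }

    public String getDocid() { return docnos.get(position); }
    public boolean endOfCollection() { return position >= docnos.size(); }
    public void reset() { position = -1; }
    public void close() { /* nothing to release for an in-memory list */ }
}

class SequentialAccessSketch {
    // Walks the collection from its current position and gathers document numbers.
    static List<String> collectDocnos(SketchCollection collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocid());
        }
        return seen;
    }

    public static void main(String[] args) {
        SketchCollection collection = new InMemoryCollection(Arrays.asList("DOC-1", "DOC-2"));
        System.out.println(collectDocnos(collection));
        collection.reset(); // back to the beginning for a second pass
        System.out.println(collectDocnos(collection));
        collection.close();
    }
}
```

Calling reset() before the second pass is what makes the same documents visible again; close() is left until all passes are finished.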
Constructor Detail |
---|
public TRECCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String docPointersFilename)
CollectionSpecFilename - The collection specification filename. The file contains a list of filenames to read. Must be specified; it is a fatal error otherwise.
TagSet - The TagSet constructor string used to obtain the tags to parse for.
BlacklistSpecFilename - The name of a file containing a list of document identifiers that must NOT be processed. Not loaded if null or of length 0.
docPointersFilename - Where to store document offsets and lengths. Not used if null.
public TRECCollection()
public TRECCollection(java.io.InputStream input)
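The blacklist handling described for the constructors can be illustrated with a short sketch. It is not Terrier's implementation: the class BlacklistSketch and its methods are hypothetical, the file is assumed to hold one document identifier per line, and UTF-8 is an arbitrary choice of encoding. The only behaviour taken from this page is that nothing is loaded when the filename is null or empty.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the real blacklist file format is an assumption.
class BlacklistSketch {

    // Reads identifiers from a reader, assuming one identifier per line.
    static Set<String> readBlacklist(Reader source) {
        Set<String> blacklist = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String docno = line.trim();
                if (!docno.isEmpty()) {
                    blacklist.add(docno);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blacklist;
    }

    // Mirrors the documented rule: nothing is loaded for a null or empty filename.
    static Set<String> loadBlacklist(String filename) {
        if (filename == null || filename.length() == 0) {
            return new HashSet<>();
        }
        try {
            return readBlacklist(
                Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Set<String> blacklist = readBlacklist(new StringReader("DOC-2\n\nDOC-7\n"));
        for (String docno : new String[] {"DOC-1", "DOC-2"}) {
            System.out.println(docno + (blacklist.contains(docno) ? " skipped" : " processed"));
        }
    }
}
```

A set gives constant-time membership checks, which matters because the check runs once for every document in the collection.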
Method Detail |
---|
public boolean nextDocument()
Specified by: nextDocument in interface Collection
public Document getDocument()
Specified by: getDocument in interface Collection
public Document getDocument(TagSet _tags, TagSet _exact, TagSet _fields)
public java.lang.String getDocid()
Specified by: getDocid in interface Collection
public boolean endOfCollection()
Specified by: endOfCollection in interface Collection
public void reset()
Specified by: reset in interface Collection
public java.lang.String getDocumentString(int docid)
Specified by: getDocumentString in interface DocumentExtractor
docid - the internal identifier of a document.
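One way to picture how stored document offsets and lengths could support getDocumentString(int docid) is the sketch below. It is an assumption-laden illustration, not Terrier's code: the class DocumentPointerSketch is hypothetical, the internal identifier is treated as an index into two parallel arrays, offsets and lengths are counted in bytes, and a byte array stands in for the collection file. The actual layout of the document pointers file is not described on this page.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical lookup structure; not Terrier's document pointers format.
class DocumentPointerSketch {
    private final byte[] data;     // stands in for the collection file contents
    private final long[] offsets;  // byte offset at which each document starts
    private final int[] lengths;   // length of each document in bytes

    DocumentPointerSketch(byte[] data, long[] offsets, int[] lengths) {
        if (offsets.length != lengths.length) {
            throw new IllegalArgumentException("offsets and lengths must align");
        }
        this.data = data.clone();
        this.offsets = offsets.clone();
        this.lengths = lengths.clone();
    }

    // Returns the text of the document with the given internal identifier.
    String getDocumentString(int docid) {
        if (docid < 0 || docid >= offsets.length) {
            throw new IllegalArgumentException("unknown docid " + docid);
        }
        return new String(data, (int) offsets[docid], lengths[docid], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String first = "<DOC>one</DOC>";
        String second = "<DOC>two</DOC>";
        byte[] data = (first + second).getBytes(StandardCharsets.UTF_8);
        // ASCII-only text, so character counts equal byte counts here.
        long[] offsets = {0, first.length()};
        int[] lengths = {first.length(), second.length()};
        DocumentPointerSketch pointers = new DocumentPointerSketch(data, offsets, lengths);
        System.out.println(pointers.getDocumentString(1));
    }
}
```

With a real file, the same offset and length pair would drive a seek followed by a bounded read instead of an array slice.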
public void close()
Specified by: close in interface Collection