Package org.terrier.indexing
Class TRECCollection
java.lang.Object
  org.terrier.indexing.MultiDocumentFileCollection
    org.terrier.indexing.TRECCollection
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
- Direct Known Subclasses:
TRECUTFCollection, TRECWebCollection
public class TRECCollection extends MultiDocumentFileCollection
Models a TREC test collection by implementing the interfaces Collection and DocumentExtractor. It provides sequential access to the documents in the collection and can also return the text of a document as a String. The precise Document class to be used can be specified with the trec.document.class property. TREC format files are opened using the default encoding unless the trec.encoding property has been set to a valid supported encoding. Since 3.5, the contents of tags can be added to the meta index instead of being indexed normally. This is useful for holding URLs or dates that you need later during retrieval. To use this, the fields in the TREC file need to be ordered, the tags to add need to be specified in TrecDocTags.propertytags and indexer.meta.forward.keys, and the maximum length of the tags needs to be given in indexer.meta.forward.keylens.
Properties:
- trec.document.class - the Document class to parse individual documents (defaults to TaggedDocument).
- trec.encoding - encoding to use to open all files. Leave unset for System default encoding.
- (tagset).propertytags - list of tags to add to the meta index rather than to index. Tags are assumed to be IN ORDER after the docid.
- indexer.meta.forward.keys - list of keys to add to the meta index, remember to put any property tags here as well.
- indexer.meta.forward.keylens - lengths of each of the meta keys, remember to put the lengths of the property tags here as well.
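As a rough illustration of how the properties above fit together, the following sketch builds them with java.util.Properties. The property names come from the list above; the tag names (url, date), the docno key, the lengths, and the comma-separated value format are assumptions made for the example, not values confirmed by this page.

```java
import java.util.Properties;

class MetaTagConfigSketch {
    // Builds the properties named in the class description.
    // All values are illustrative assumptions, not Terrier defaults.
    static Properties metaTagProperties() {
        Properties p = new Properties();
        // Tags whose contents go to the meta index, in the order in which
        // they appear after the docid (hypothetical tag names).
        p.setProperty("TrecDocTags.propertytags", "url,date");
        // Meta index keys: the property tags are listed here as well.
        p.setProperty("indexer.meta.forward.keys", "docno,url,date");
        // One maximum length per key, in the same order as the keys.
        p.setProperty("indexer.meta.forward.keylens", "20,256,32");
        // Optional: encoding used to open all collection files.
        p.setProperty("trec.encoding", "UTF-8");
        return p;
    }

    // Sanity check: every meta key needs exactly one length.
    static boolean keysMatchLengths(Properties p) {
        int keys = p.getProperty("indexer.meta.forward.keys").split(",").length;
        int lens = p.getProperty("indexer.meta.forward.keylens").split(",").length;
        return keys == lens;
    }

    public static void main(String[] args) {
        Properties p = metaTagProperties();
        System.out.println(p.getProperty("indexer.meta.forward.keys"));
        System.out.println(keysMatchLengths(p));
    }
}
```

The check in keysMatchLengths reflects the reminder in the list: each property tag needs both a key entry and a length entry.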
- Author:
- Craig Macdonald & Vassilis Plachouras & Richard McCreadie
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- protected CountingInputStream br - The input stream used for reading data.
- protected java.util.HashSet<java.lang.String> DocIDBlacklist
- protected java.lang.String docnotag - The docno tag.
- protected int documentCounter - Counts the documents that are found in the collection, ignoring those documents that appear in the black list.
- protected char[] end_docnoTag - The closing document number tag.
- protected int end_docnoTagLength - The length of the closing document number tag.
- protected java.lang.String end_docTag - The closing document tag.
- protected int end_docTagLength - The length of the closing document tag.
- protected char[][] endPropertyTags - The end property tags.
- protected boolean ignoreProperties - Do we ignore properties?
- protected int[] propertyTagLengths - The length of each property tag.
- protected java.lang.String[] propertyTags - Tag names for tags that should be added as properties.
- protected char[] start_docnoTag - The opening document number tag.
- protected int start_docnoTagLength - The length of the opening document number tag.
- protected char[] start_docTag - The opening document tag.
- protected int start_docTagLength - The length of the opening document tag.
- protected char[][] startPropertyTags - The start property tags.
- protected boolean tags_CaseSensitive - Is the markup case-sensitive?
- protected java.lang.String ThisDocID - The string identifier of the current document.
Fields inherited from class org.terrier.indexing.MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
-
-
Constructor Summary
Constructors (Constructor, Description):
- TRECCollection() - A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids, and opens the first collection file to process.
- TRECCollection(java.io.InputStream input) - A constructor that reads only the document in the specified InputStream.
- TRECCollection(java.lang.String collSpec)
- TRECCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored) - Specific constructor: reads the files listed in CollectionSpecFilename, the Blacklist of Document IDs in BlacklistSpecFilename, and stores document offsets and lengths in the document pointers file docPointersFilename.
- TRECCollection(java.util.List<java.lang.String> files)
- TRECCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename)
- TRECCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
Method Summary
All Methods, Instance Methods, Concrete Methods (Modifier and Type, Method, Description):
- protected void afterPropertyTags()
- Document getDocument() - Returns the current document to process.
- protected java.lang.StringBuilder getTag(int taglength, char[] startTag, char[] endTag) - Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object.
- boolean hasNext() - Check whether it is the end of the collection.
- Document next() - Return next document.
- boolean nextDocument() - Moves to the next document to process from the collection.
- protected void openNewFile()
- protected void readDocumentBlacklist(java.lang.String BlacklistSpecFilename)
- void reset() - Resets the Collection iterator to the start of the collection.
- protected void setTags(java.lang.String TagSet) - Protected method for initialising the opening and closing document and document number tags.
Methods inherited from class org.terrier.indexing.MultiDocumentFileCollection
checkEncoding, close, endOfCollection, extractCharset, loadDocumentClass, openNextFile
-
-
-
-
Field Detail
-
propertyTags
protected java.lang.String[] propertyTags
Tag names for tags that should be added as properties
-
documentCounter
protected int documentCounter
Counts the documents that are found in the collection, ignoring those documents that appear in the black list
-
DocIDBlacklist
protected java.util.HashSet<java.lang.String> DocIDBlacklist
-
ThisDocID
protected java.lang.String ThisDocID
The string identifier of the current document.
-
br
protected CountingInputStream br
The inputstream used for reading data.
-
start_docTag
protected char[] start_docTag
The opening document tag.
-
start_docTagLength
protected int start_docTagLength
The length of the opening document tag.
-
end_docTag
protected java.lang.String end_docTag
The closing document tag.
-
end_docTagLength
protected int end_docTagLength
The length of the closing document tag.
-
start_docnoTag
protected char[] start_docnoTag
The opening document number tag.
-
start_docnoTagLength
protected int start_docnoTagLength
The length of the opening document number tag.
-
end_docnoTag
protected char[] end_docnoTag
The closing document number tag.
-
end_docnoTagLength
protected int end_docnoTagLength
The length of the closing document number tag.
-
tags_CaseSensitive
protected boolean tags_CaseSensitive
Is the markup case-sensitive?
-
ignoreProperties
protected boolean ignoreProperties
Do we ignore properties?
-
docnotag
protected java.lang.String docnotag
The docno tag
-
propertyTagLengths
protected int[] propertyTagLengths
The length of each property tag
-
startPropertyTags
protected char[][] startPropertyTags
The start property tags
-
endPropertyTags
protected char[][] endPropertyTags
The end property tag
-
-
Constructor Detail
-
TRECCollection
public TRECCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
Specific constructor: reads the files listed in CollectionSpecFilename, the Blacklist of Document IDs in BlacklistSpecFilename, and stores document offsets and lengths in the document pointers file docPointersFilename. The collection will be parsed according to the TagSet specified by the TagSet string.
- Parameters:
CollectionSpecFilename - The collection specification filename. The file contains a list of filenames to read. Must be specified, fatal error otherwise.
TagSet - the TagSet constructor string to use to obtain the tags to parse for.
BlacklistSpecFilename - A filename to a file containing a list of document identifiers that must NOT be processed. Not loaded if null or of length 0.
ignored - no longer used
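The collection specification file described above is a list of filenames. A minimal sketch of preparing such a file follows, assuming one filename per line; the class name, the example paths, and the "TrecDocTags" TagSet string in the trailing comment are assumptions for illustration and are not confirmed by this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class CollectionSpecSketch {
    // Writes one filename per line to a temporary file
    // (assumed layout of a collection specification file).
    static Path writeSpec(List<String> collectionFiles) {
        try {
            Path spec = Files.createTempFile("collection", ".spec");
            Files.write(spec, collectionFiles, StandardCharsets.UTF_8);
            return spec;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the filenames back, one per line.
    static List<String> readSpec(Path spec) {
        try {
            return Files.readAllLines(spec, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical collection file paths.
        Path spec = writeSpec(Arrays.asList("/data/trec/part1.trec", "/data/trec/part2.trec"));
        System.out.println(readSpec(spec));
        // With Terrier on the classpath, the path could then be passed as
        // CollectionSpecFilename, for example (TagSet string assumed):
        // new TRECCollection(spec.toString(), "TrecDocTags", null, null);
    }
}
```

Passing null for BlacklistSpecFilename in the commented call relies on the parameter description above, which says the blacklist is not loaded if the name is null or empty.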
-
TRECCollection
public TRECCollection(java.util.List<java.lang.String> files)
-
TRECCollection
public TRECCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
TRECCollection
public TRECCollection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename)
-
TRECCollection
public TRECCollection(java.lang.String collSpec)
-
TRECCollection
public TRECCollection()
A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids and opens the first collection file to process. TagSet TagSet.TREC_DOC_TAGS is used to tokenize the collection.
-
TRECCollection
public TRECCollection(java.io.InputStream input)
A constructor that reads only the document in the specified InputStream. Also reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids, and opens the first collection file to process.
-
-
Method Detail
-
setTags
protected void setTags(java.lang.String TagSet)
Protected method for initialising the opening and closing document and document number tags.
-
readDocumentBlacklist
protected void readDocumentBlacklist(java.lang.String BlacklistSpecFilename)
-
hasNext
public boolean hasNext()
Check whether it is the end of the collection.
- Overrides:
hasNext in class MultiDocumentFileCollection
- Returns:
- boolean
-
next
public Document next()
Return next document.
- Overrides:
next in class MultiDocumentFileCollection
- Returns:
- next document
-
nextDocument
public boolean nextDocument()
Moves to the next document to process from the collection.
- Specified by:
nextDocument in interface Collection
- Specified by:
nextDocument in class MultiDocumentFileCollection
- Returns:
- boolean true if there are more documents to process in the collection, otherwise it returns false.
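The nextDocument() and getDocument() pair suggests a cursor-style loop: advance, then fetch. The sketch below shows that loop against a small in-memory stand-in rather than Terrier itself. DocnoCursor is hypothetical, it returns document identifiers instead of Document objects, and skipping blacklisted identifiers inside nextDocument() is an assumption based on the blacklist description on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CursorIterationSketch {
    // Hypothetical stand-in exposing the same two calls as the collection.
    static class DocnoCursor {
        private final List<String> docnos;
        private final Set<String> blacklist;
        private int pos = -1;

        DocnoCursor(List<String> docnos, Set<String> blacklist) {
            this.docnos = docnos;
            this.blacklist = blacklist;
        }

        // Advances to the next identifier that is not blacklisted;
        // returns false once the end of the list is reached.
        boolean nextDocument() {
            while (++pos < docnos.size()) {
                if (!blacklist.contains(docnos.get(pos))) {
                    return true;
                }
            }
            return false;
        }

        // Returns the identifier the cursor currently points at.
        String getDocument() {
            return docnos.get(pos);
        }
    }

    // The usual consumption loop: advance first, then read.
    static List<String> drain(DocnoCursor cursor) {
        List<String> seen = new ArrayList<>();
        while (cursor.nextDocument()) {
            seen.add(cursor.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocnoCursor cursor = new DocnoCursor(
                Arrays.asList("D1", "D2", "D3"),
                new HashSet<>(Arrays.asList("D2")));
        System.out.println(drain(cursor));
    }
}
```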
-
afterPropertyTags
protected void afterPropertyTags() throws java.io.IOException
- Throws:
java.io.IOException
-
getTag
protected java.lang.StringBuilder getTag(int taglength, char[] startTag, char[] endTag) throws java.io.IOException
Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object.
- Parameters:
taglength - the length of the start tag
startTag - the start tag
endTag - the end tag
- Returns:
- the tag contents
- Throws:
java.io.IOException
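To make the scanning behaviour described for getTag concrete, here is a minimal, self-contained sketch of the general technique: read characters until the start tag has been seen, then collect characters until the end tag. It is not the Terrier implementation. The class and method names are hypothetical, it reads from a plain java.io.Reader, it omits the taglength parameter, and it wraps IOException in an unchecked exception for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TagScanSketch {
    // Returns the text between the first startTag and the following endTag,
    // or null if the input ends before a complete tag pair has been read.
    static StringBuilder firstTagContents(Reader in, char[] startTag, char[] endTag) {
        try {
            if (!skipPast(in, startTag)) {
                return null;
            }
            StringBuilder contents = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                contents.append((char) c);
                if (endsWith(contents, endTag)) {
                    // Drop the end tag itself from the result.
                    contents.setLength(contents.length() - endTag.length);
                    return contents;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Consumes characters until the given tag has just been read,
    // keeping a sliding window of the most recent tag.length characters.
    static boolean skipPast(Reader in, char[] tag) throws IOException {
        StringBuilder window = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (window.length() > tag.length) {
                window.deleteCharAt(0);
            }
            if (endsWith(window, tag)) {
                return true;
            }
        }
        return false;
    }

    // True if the buffer currently ends with the characters of tag.
    static boolean endsWith(StringBuilder sb, char[] tag) {
        int offset = sb.length() - tag.length;
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < tag.length; i++) {
            if (sb.charAt(offset + i) != tag[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("<DOC>\n<DOCNO>AP-001</DOCNO>\n<TEXT>body</TEXT>\n</DOC>");
        System.out.println(firstTagContents(in, "<DOCNO>".toCharArray(), "</DOCNO>".toCharArray()));
    }
}
```

This sketch matches tags case-sensitively; the tags_CaseSensitive field above indicates that the real class can be configured differently.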
-
openNewFile
protected void openNewFile() throws java.lang.Exception
- Overrides:
openNewFile
in classMultiDocumentFileCollection
- Throws:
java.lang.Exception
-
getDocument
public Document getDocument()
Returns the current document to process.
- Specified by:
getDocument in interface Collection
- Specified by:
getDocument in class MultiDocumentFileCollection
- Returns:
- Document the object of the current document to process.
-
reset
public void reset()
Description copied from class: MultiDocumentFileCollection
Resets the Collection iterator to the start of the collection.
- Specified by:
reset in interface Collection
- Overrides:
reset in class MultiDocumentFileCollection
-
-