public class TRECCollection extends Object implements Collection, DocumentExtractor
The Document class to be used can be specified with the trec.document.class property.
TREC format files are opened using the default encoding unless the
trec.encoding property has been set to a valid, supported encoding.
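The fallback rule above can be sketched in plain Java. This is a minimal illustration and not Terrier's own code: the class EncodingSketch and its methods are hypothetical, and the configured argument stands in for whatever value trec.encoding holds (null when unset). Falling back silently on an unsupported name is an assumption drawn from the sentence above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Terrier.
class EncodingSketch {

    /**
     * Returns the charset named by configured (standing in for the value of
     * trec.encoding), or the platform default when the value is unset,
     * malformed, or not supported by this JVM.
     */
    static Charset resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        String name = configured.trim();
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : Charset.defaultCharset();
        } catch (IllegalCharsetNameException e) {
            return Charset.defaultCharset(); // syntactically invalid charset name
        }
    }

    /** Wraps a raw byte stream in a Reader that decodes with the resolved charset. */
    static Reader open(InputStream in, String configured) {
        return new InputStreamReader(in, resolve(configured));
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "<DOC>caf\u00e9</DOC>".getBytes(StandardCharsets.UTF_8);
        try (Reader r = open(new ByteArrayInputStream(bytes), "UTF-8")) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            System.out.println(sb);
        }
    }
}
```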
Since 3.5, the contents of tags can be added to the meta index instead of being indexed normally. This is useful
for holding URLs or dates that you need later during retrieval. To use this, the fields in the TREC file
need to be ordered, the tags to add need to be specified in TrecDocTags.propertytags and indexer.meta.forward.keys,
and the maximum length of each tag needs to be given in indexer.meta.forward.keylens.
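As a rough sketch of how the three properties named above fit together, the snippet below builds them with java.util.Properties. The tag names (URL, DATE), the lengths, the presence of docno as the first key, and the pairing of lengths with keys by position are all assumed example values, not taken from this page; the class MetaTagConfigSketch is hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch; tag names and lengths are assumed example values.
class MetaTagConfigSketch {

    static Properties build() {
        Properties p = new Properties();
        // Tags whose contents go to the meta index instead of being indexed normally.
        p.setProperty("TrecDocTags.propertytags", "URL,DATE");
        // Meta index keys; "docno" is assumed here to remain the first key.
        p.setProperty("indexer.meta.forward.keys", "docno,URL,DATE");
        // Maximum length per key, assumed to pair with the keys by position.
        p.setProperty("indexer.meta.forward.keylens", "20,256,32");
        return p;
    }

    /** Sanity check: exactly one maximum length is listed per meta index key. */
    static boolean lengthsMatchKeys(Properties p) {
        String[] keys = p.getProperty("indexer.meta.forward.keys", "").split(",");
        String[] lens = p.getProperty("indexer.meta.forward.keylens", "").split(",");
        return keys.length == lens.length;
    }

    public static void main(String[] args) {
        Properties p = build();
        System.out.println(lengthsMatchKeys(p)); // prints true
    }
}
```

A check like lengthsMatchKeys is cheap to run before indexing, since a mismatch between the two lists is an easy mistake to make when adding a tag.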
Properties:
trec.document.class - the Document class used to parse individual documents (defaults to TaggedDocument).
Modifier and Type | Field and Description |
---|---|
protected CountingInputStream | br - The input stream used for reading data. |
protected String | currentFilename - Filename of the current file. |
protected String | desiredEncoding - Encoding to be used to open all files. |
protected HashSet<String> | DocIDBlacklist |
protected String | docnotag - The docno tag. |
protected Map<String,String> | DocProperties - Properties for the current document. |
protected Class<? extends Document> | documentClass |
protected int | documentCounter - Counts the documents that are found in the collection, ignoring those documents that appear in the blacklist. |
protected int | documentsInThisFile - Counts the number of documents that have been found in this file. |
protected char[] | end_docnoTag - The closing document number tag. |
protected int | end_docnoTagLength - The length of the closing document number tag. |
protected String | end_docTag - The closing document tag. |
protected int | end_docTagLength - The length of the closing document tag. |
protected boolean | endOfCollection - Indicates whether the end of the collection has been reached. |
protected char[][] | endPropertyTags - The end property tags. |
protected int | FileNumber - The index in the FilesToProcess of the currently processed file. |
protected ArrayList<String> | FilesToProcess - The list of files to process. |
protected boolean | ignoreProperties - Do we ignore properties? |
protected static org.apache.log4j.Logger | logger - Logger for this class. |
protected int[] | propertyTagLengths - The length of each property tag. |
protected boolean | SkipFile - A boolean which is true when a new file is open. |
protected char[] | start_docnoTag - The opening document number tag. |
protected int | start_docnoTagLength - The length of the opening document number tag. |
protected char[] | start_docTag - The opening document tag. |
protected int | start_docTagLength - The length of the opening document tag. |
protected char[][] | startPropertyTags - The start property tags. |
protected boolean | tags_CaseSensitive - Is the markup case-sensitive? |
protected String | ThisDocID - The string identifier of the current document. |
protected Tokeniser | tokeniser |
Constructor and Description |
---|
TRECCollection() - A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids, and opens the first collection file to process. |
TRECCollection(InputStream input) - A constructor that reads only the document in the specified InputStream. |
TRECCollection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored) - Specific constructor: reads the files listed in CollectionSpecFilename and the blacklist of document IDs in BlacklistSpecFilename. The final argument, which formerly named the document pointers file (docPointersFilename) used to store document offsets and lengths, is no longer used. |
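The blacklist handling mentioned by the constructors can be illustrated with a small standalone sketch. It is not Terrier's implementation: BlacklistSketch and its methods are hypothetical, the one-identifier-per-line file format is an assumption, and UTF-8 is chosen arbitrarily for reading the file. Only the "not loaded if null or length 0" rule and the idea of a counter that ignores blacklisted documents come from this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of Terrier.
class BlacklistSketch {

    /**
     * Loads document identifiers to skip. Returns an empty set when the
     * filename is null or empty. Assumes one identifier per line.
     */
    static Set<String> readBlacklist(String filename) {
        Set<String> ids = new HashSet<>();
        if (filename == null || filename.isEmpty()) {
            return ids; // nothing to load
        }
        try {
            for (String line : Files.readAllLines(Paths.get(filename), StandardCharsets.UTF_8)) {
                String id = line.trim();
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    /** Counts the documents that would be processed, ignoring blacklisted ones. */
    static int countProcessed(List<String> docids, Set<String> blacklist) {
        int documentCounter = 0;
        for (String id : docids) {
            if (!blacklist.contains(id)) {
                documentCounter++;
            }
        }
        return documentCounter;
    }

    public static void main(String[] args) {
        Set<String> blacklist = new HashSet<>(Arrays.asList("DOC-2"));
        List<String> docids = Arrays.asList("DOC-1", "DOC-2", "DOC-3");
        System.out.println(countProcessed(docids, blacklist)); // prints 2
    }
}
```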
Modifier and Type | Method and Description |
---|---|
protected void | afterPropertyTags() |
void | close() - Closes the files and streams used by the collection object. |
boolean | endOfCollection() - Indicates whether the end of the collection has been reached. |
String | getDocid() - Returns the document number of the current document. |
Document | getDocument() - Returns the current document to process. |
String | getDocumentString(int docid) - Deprecated. |
protected StringBuilder | getTag(int taglength, char[] startTag, char[] endTag) - Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object. |
boolean | hasNext() - Checks whether the end of the collection has been reached. |
protected void | loadDocumentClass() - Loads the class that will supply all documents for this Collection. |
Document | next() - Returns the next document. |
boolean | nextDocument() - Moves to the next document to process from the collection. |
protected boolean | openNextFile() - Opens the next file from the collection specification. |
protected void | readCollectionSpec(String CollectionSpecFilename) |
protected void | readDocumentBlacklist(String BlacklistSpecFilename) |
void | remove() - Unsupported by this Collection implementation: throws UnsupportedOperationException on all invocations. |
void | reset() - Resets the collection object back to the beginning of the collection. |
protected void | setTags(String TagSet) - Protected method for initialising the opening and closing document and document number tags. |
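The calling pattern suggested by nextDocument(), getDocid(), endOfCollection(), reset() and close() can be sketched without the Terrier library. The class below is a hypothetical in-memory stand-in that only borrows those method names; it is not TRECCollection. The assumption that nextDocument() returns true when it has advanced to another document is not stated on this page.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that mirrors a few method names from the summary above.
// It is NOT the Terrier class; it only makes the calling pattern runnable.
class InMemoryCollectionSketch implements Closeable {
    private final List<String> docids;
    private int position = -1; // index of the current document, -1 before the first

    InMemoryCollectionSketch(List<String> docids) {
        this.docids = new ArrayList<>(docids);
    }

    /** Assumed semantics: returns true if it advanced to another document. */
    boolean nextDocument() {
        if (position < docids.size()) {
            position++;
        }
        return position < docids.size();
    }

    boolean endOfCollection() {
        return position >= docids.size();
    }

    String getDocid() {
        return docids.get(position);
    }

    void reset() {
        position = -1;
    }

    @Override
    public void close() {
        // Nothing to release in this in-memory stand-in.
    }

    /** Walks the collection from its current position and collects the document numbers. */
    static List<String> readAllDocids(InMemoryCollectionSketch collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocid());
        }
        return seen;
    }

    public static void main(String[] args) {
        try (InMemoryCollectionSketch c =
                 new InMemoryCollectionSketch(Arrays.asList("DOC-1", "DOC-2"))) {
            System.out.println(readAllDocids(c)); // prints [DOC-1, DOC-2]
            c.reset();                            // back to the beginning
            System.out.println(readAllDocids(c)); // prints [DOC-1, DOC-2]
        }
    }
}
```

Because the class implements Closeable, try-with-resources releases it automatically, which matches the close() entry in the table.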
protected static final org.apache.log4j.Logger logger
protected String currentFilename
protected int documentsInThisFile
protected int documentCounter
protected int FileNumber
protected String ThisDocID
protected CountingInputStream br
protected boolean SkipFile
protected char[] start_docTag
protected int start_docTagLength
protected String end_docTag
protected int end_docTagLength
protected char[] start_docnoTag
protected int start_docnoTagLength
protected char[] end_docnoTag
protected int end_docnoTagLength
protected boolean tags_CaseSensitive
protected boolean ignoreProperties
protected String docnotag
protected int[] propertyTagLengths
protected char[][] startPropertyTags
protected char[][] endPropertyTags
protected String desiredEncoding
protected Tokeniser tokeniser
protected boolean endOfCollection
public TRECCollection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored)
CollectionSpecFilename
- The collections specification filename. The file contains
a list of filenames to read. Must be specified, fatal error otherwise.
TagSet
- the TagSet constructor string to use to obtain the tags to parse for.
BlacklistSpecFilename
- A filename to a file containing a list of document identifiers
that must NOT be processed. Not loaded if null or length 0.
ignored
- no longer used
public TRECCollection()
public TRECCollection(InputStream input)
protected void setTags(String TagSet)
protected void readCollectionSpec(String CollectionSpecFilename)
protected void readDocumentBlacklist(String BlacklistSpecFilename)
protected void loadDocumentClass()
public boolean hasNext()
public Document next()
public void remove()
public boolean nextDocument()
Specified by: nextDocument in interface Collection
protected void afterPropertyTags() throws IOException
IOException
protected StringBuilder getTag(int taglength, char[] startTag, char[] endTag) throws IOException
taglength
- the length of the start tag
startTag
- the start tag
endTag
- the end tag
IOException
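The kind of scan that getTag describes (find the first occurrence of a start tag, then collect everything up to the matching end tag) can be sketched as follows. This is an illustrative reimplementation under assumptions, not the method's actual body: TagScanSketch, scanTag and firstTagContents are hypothetical names, the sketch reads from a Reader rather than the class's own stream, it omits the taglength parameter, it is always case-sensitive, and returning null when a tag is missing is an assumed convention.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch of a first-occurrence tag scan; not Terrier's code.
class TagScanSketch {

    /** Returns the text between the first startTag and the following endTag, or null if either is missing. */
    static StringBuilder scanTag(Reader in, char[] startTag, char[] endTag) throws IOException {
        if (!skipPast(in, startTag)) {
            return null; // start tag never seen
        }
        StringBuilder contents = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            contents.append((char) c);
            if (endsWith(contents, endTag)) {
                contents.setLength(contents.length() - endTag.length); // drop the end tag itself
                return contents;
            }
        }
        return null; // end tag never seen
    }

    /** Consumes characters until the stream has just produced the given tag. */
    private static boolean skipPast(Reader in, char[] tag) throws IOException {
        StringBuilder window = new StringBuilder(); // sliding window of the last tag.length chars
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (window.length() > tag.length) {
                window.deleteCharAt(0);
            }
            if (endsWith(window, tag)) {
                return true;
            }
        }
        return false;
    }

    private static boolean endsWith(StringBuilder sb, char[] tag) {
        int offset = sb.length() - tag.length;
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < tag.length; i++) {
            if (sb.charAt(offset + i) != tag[i]) {
                return false;
            }
        }
        return true;
    }

    /** Convenience wrapper over an in-memory document. */
    static String firstTagContents(String document, String startTag, String endTag) {
        try {
            StringBuilder sb = scanTag(new StringReader(document),
                                       startTag.toCharArray(), endTag.toCharArray());
            return sb == null ? null : sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<DOC><DOCNO>AB-1</DOCNO><TEXT>hello</TEXT></DOC>";
        System.out.println(firstTagContents(doc, "<DOCNO>", "</DOCNO>")); // prints AB-1
    }
}
```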
public Document getDocument()
Specified by: getDocument in interface Collection
public String getDocid()
public boolean endOfCollection()
Specified by: endOfCollection in interface Collection
public void reset()
Specified by: reset in interface Collection
protected boolean openNextFile() throws IOException
IOException
- if there is an exception while opening the collection files.
@Deprecated public String getDocumentString(int docid)
Specified by: getDocumentString in interface DocumentExtractor
docid
- the internal identifier of a document.
public void close()
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable
Terrier 4.0. Copyright © 2004-2014 University of Glasgow