public class TRECCollection extends MultiDocumentFileCollection
The Document class to be used can be specified with the trec.document.class property.
TREC format files are opened using the default encoding unless the
trec.encoding property has been set to a valid supported encoding.
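The encoding rule above can be illustrated with a minimal sketch. It is not Terrier's own code: the class EncodingSketch and the method resolveCharset are hypothetical names, and falling back to the JVM default for a missing, illegal, or unsupported name is an assumption drawn from the sentence above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Properties;

// Hypothetical helper, not part of Terrier.
final class EncodingSketch {

    /**
     * Returns the charset named by the trec.encoding property when the JVM
     * supports it; otherwise returns the JVM default charset (assumed fallback).
     */
    static Charset resolveCharset(Properties props) {
        String name = props.getProperty("trec.encoding");
        if (name == null || name.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        try {
            if (Charset.isSupported(name.trim())) {
                return Charset.forName(name.trim());
            }
        } catch (IllegalCharsetNameException e) {
            // The name is syntactically illegal: fall through to the default.
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("trec.encoding", "UTF-8");
        System.out.println("Resolved charset: " + resolveCharset(props));
    }
}
```

A reader for a collection file could then be built with new InputStreamReader(stream, resolveCharset(props)).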
Since 3.5, the contents of tags can be added to the meta index instead of being indexed normally. This is useful
for holding URLs or dates that you need later during retrieval. To use this, the fields in the TREC file
need to be ordered, the tags to add need to be specified in TrecDocTags.propertytags and indexer.meta.forward.keys,
and the maximum length of each tag needs to be given in indexer.meta.forward.keylens.
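As an illustration of the three properties named above, the following sketch builds them with java.util.Properties. The tag names url and date and the lengths 256 and 32 are hypothetical values chosen for the example, and the check that each key has exactly one length is an assumption about how the two lists line up, not a documented Terrier rule.

```java
import java.util.Properties;

// Hypothetical helper, not part of Terrier.
final class MetaTagPropertiesSketch {

    /** Builds example settings; the tag names and lengths are illustrative only. */
    static Properties metaTagProperties() {
        Properties props = new Properties();
        // Tags whose contents go to the meta index instead of being indexed normally.
        props.setProperty("TrecDocTags.propertytags", "url,date");
        // The same tags listed as meta index keys.
        props.setProperty("indexer.meta.forward.keys", "url,date");
        // Maximum length for each key, in the same order (assumed alignment).
        props.setProperty("indexer.meta.forward.keylens", "256,32");
        return props;
    }

    /** Returns true when the key list and the length list have the same number of entries. */
    static boolean keysAndLengthsAlign(Properties props) {
        String keys = props.getProperty("indexer.meta.forward.keys", "");
        String lens = props.getProperty("indexer.meta.forward.keylens", "");
        return keys.split(",").length == lens.split(",").length;
    }

    public static void main(String[] args) {
        Properties props = metaTagProperties();
        System.out.println("Keys and lengths align: " + keysAndLengthsAlign(props));
    }
}
```

The same three key and value pairs could equally be written directly into a properties file.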
Properties:
trec.document.class - the Document class to parse individual documents (defaults to TaggedDocument).
Modifier and Type | Field and Description |
---|---|
protected CountingInputStream | br - The inputstream used for reading data. |
protected HashSet<String> | DocIDBlacklist |
protected String | docnotag - The docno tag |
protected int | documentCounter - Counts the documents that are found in the collection, ignoring those documents that appear in the black list |
protected char[] | end_docnoTag - The closing document number tag. |
protected int | end_docnoTagLength - The length of the closing document number tag. |
protected String | end_docTag - The closing document tag. |
protected int | end_docTagLength - The length of the closing document tag. |
protected char[][] | endPropertyTags - The end property tag |
protected boolean | ignoreProperties - Do we ignore properties? |
protected int[] | propertyTagLengths - The length of each property tag |
protected char[] | start_docnoTag - The opening document number tag. |
protected int | start_docnoTagLength - The length of the opening document number tag. |
protected char[] | start_docTag - The opening document tag. |
protected int | start_docTagLength - The length of the opening document tag. |
protected char[][] | startPropertyTags - The start property tags |
protected boolean | tags_CaseSensitive - Is the markup case-sensitive? |
protected String | ThisDocID - The string identifier of the current document. |
Fields inherited from class MultiDocumentFileCollection: currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
Constructor and Description |
---|
TRECCollection() - A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids, and opens the first collection file to process. |
TRECCollection(InputStream input) - A constructor that reads only the document in the specified InputStream. |
TRECCollection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored) - Specific constructor: reads the files listed in CollectionSpecFilename, the Blacklist of Document IDs in BlacklistSpecFilename, and stores document offsets and lengths in the document pointers file docPointersFilename. Note that the fourth parameter, ignored, is no longer used. |
Modifier and Type | Method and Description |
---|---|
protected void | afterPropertyTags() |
Document | getDocument() - Returns the current document to process. |
protected StringBuilder | getTag(int taglength, char[] startTag, char[] endTag) - Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object |
boolean | hasNext() - Check whether it is the end of the collection |
Document | next() - Return next document |
boolean | nextDocument() - Moves to the next document to process from the collection. |
protected void | openNewFile() |
protected void | readDocumentBlacklist(String BlacklistSpecFilename) |
void | reset() - Resets the Collection iterator to the start of the collection. |
protected void | setTags(String TagSet) - Protected method for initialising the opening and closing document and document number tags. |
Methods inherited from class MultiDocumentFileCollection: close, endOfCollection, extractCharset, loadDocumentClass, openNextFile, readCollectionSpec
protected int documentCounter
protected String ThisDocID
protected CountingInputStream br
protected char[] start_docTag
protected int start_docTagLength
protected String end_docTag
protected int end_docTagLength
protected char[] start_docnoTag
protected int start_docnoTagLength
protected char[] end_docnoTag
protected int end_docnoTagLength
protected boolean tags_CaseSensitive
protected boolean ignoreProperties
protected String docnotag
protected int[] propertyTagLengths
protected char[][] startPropertyTags
protected char[][] endPropertyTags
public TRECCollection(String CollectionSpecFilename, String TagSet, String BlacklistSpecFilename, String ignored)
CollectionSpecFilename - The collection specification filename. The file contains a list of filenames to read. Must be specified, fatal error otherwise.
TagSet - the TagSet constructor string to use to obtain the tags to parse for.
BlacklistSpecFilename - The name of a file containing a list of document identifiers that must NOT be processed. Not loaded if null or of length 0.
ignored - no longer used
public TRECCollection()
public TRECCollection(InputStream input)
protected void setTags(String TagSet)
protected void readDocumentBlacklist(String BlacklistSpecFilename)
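The blacklist handling described for BlacklistSpecFilename and documentCounter might look roughly like the sketch below. It is not Terrier's implementation: the class BlacklistSketch and both method names are hypothetical, and the one-identifier-per-line file layout is an assumption, since the format of the blacklist file is not stated here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Terrier.
final class BlacklistSketch {

    /** Reads one document identifier per line (assumed layout); blank lines are skipped. */
    static Set<String> readBlacklist(Reader source) {
        Set<String> ids = new HashSet<>();
        if (source == null) {
            return ids; // nothing to load, mirroring "not loaded if null"
        }
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String id = line.trim();
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    /** Counts the documents whose identifier is not in the blacklist. */
    static int countNonBlacklisted(List<String> docnos, Set<String> blacklist) {
        int counter = 0;
        for (String docno : docnos) {
            if (!blacklist.contains(docno)) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        Set<String> blacklist = readBlacklist(new StringReader("D2\n\n D4 \n"));
        List<String> docnos = Arrays.asList("D1", "D2", "D3");
        System.out.println("Documents kept: " + countNonBlacklisted(docnos, blacklist));
    }
}
```

A HashSet is used here because the field DocIDBlacklist is declared as a HashSet<String>, which gives constant-time membership checks per document.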
public boolean hasNext()
hasNext in class MultiDocumentFileCollection
public Document next()
next in class MultiDocumentFileCollection
public boolean nextDocument()
nextDocument in interface Collection
nextDocument in class MultiDocumentFileCollection
protected void afterPropertyTags() throws IOException
IOException
protected StringBuilder getTag(int taglength, char[] startTag, char[] endTag) throws IOException
taglength - the length of the start tag
startTag - the start tag
endTag - the end tag
IOException
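The idea behind getTag, finding the first occurrence of a tag and returning what lies between its start and end markers, can be sketched on an in-memory string. This is a simplification and not Terrier's code: the real method reads from the collection's input stream and takes a taglength argument, whereas the hypothetical firstTagContents below works on a String, derives the length from the tag itself, and returns null when the tag is absent (an assumption).

```java
// Hypothetical helper, not part of Terrier.
final class TagScanSketch {

    /**
     * Returns the text between the first startTag and the next endTag,
     * or null if either marker is missing (assumed behaviour).
     */
    static StringBuilder firstTagContents(String document, char[] startTag, char[] endTag) {
        String start = new String(startTag);
        String end = new String(endTag);
        int open = document.indexOf(start);
        if (open < 0) {
            return null;
        }
        int contentStart = open + start.length();
        int close = document.indexOf(end, contentStart);
        if (close < 0) {
            return null;
        }
        return new StringBuilder(document.substring(contentStart, close));
    }

    public static void main(String[] args) {
        String doc = "<DOC>\n<DOCNO> D-001 </DOCNO>\n<TEXT>hello</TEXT>\n</DOC>";
        StringBuilder docno = firstTagContents(
                doc, "<DOCNO>".toCharArray(), "</DOCNO>".toCharArray());
        System.out.println("Document number: " + docno.toString().trim());
    }
}
```

The sample document uses the conventional TREC tag names DOC, DOCNO and TEXT; the tags actually scanned for depend on the TagSet in use.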
protected void openNewFile() throws Exception
openNewFile in class MultiDocumentFileCollection
Exception
public Document getDocument()
getDocument in interface Collection
getDocument in class MultiDocumentFileCollection
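The methods nextDocument(), getDocument() and reset() together suggest a cursor-style loop: advance, read the current document, and rewind when a second pass is needed. The sketch below shows that pattern with a stand-in interface so that it runs without Terrier on the classpath. DocCursor and ListCursor are hypothetical types, documents are plain strings rather than Document objects, and the exact contract of the real methods may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types, not Terrier's Collection interface.
final class CursorIterationSketch {

    interface DocCursor {
        boolean nextDocument(); // moves to the next document, false when none is left
        String getDocument();   // returns the current document
        void reset();           // rewinds to the start
    }

    /** Toy cursor over an in-memory list of documents. */
    static final class ListCursor implements DocCursor {
        private final List<String> docs;
        private int position = -1;

        ListCursor(List<String> docs) {
            this.docs = docs;
        }

        public boolean nextDocument() {
            if (position + 1 >= docs.size()) {
                return false;
            }
            position++;
            return true;
        }

        public String getDocument() {
            return position >= 0 ? docs.get(position) : null;
        }

        public void reset() {
            position = -1;
        }
    }

    /** Walks the cursor to the end and collects every document seen. */
    static List<String> drain(DocCursor cursor) {
        List<String> seen = new ArrayList<>();
        while (cursor.nextDocument()) {
            seen.add(cursor.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocCursor cursor = new ListCursor(Arrays.asList("first doc", "second doc"));
        System.out.println("First pass: " + drain(cursor));
        cursor.reset();
        System.out.println("Second pass: " + drain(cursor));
    }
}
```

With the real class, the equivalent loop would call nextDocument() and getDocument() on a TRECCollection instance, or use hasNext() and next() instead.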
public void reset()
MultiDocumentFileCollection
reset in interface Collection
reset in class MultiDocumentFileCollection
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow