java.lang.Object org.terrier.indexing.TRECCollection
public class TRECCollection
Models a TREC test collection by implementing the interfaces
Collection and DocumentExtractor. It provides sequential access
to the documents in the collection and can also return the text
of a document as a String. The precise Document class to be
used can be specified with the trec.document.class property.
TREC format files are opened using the default encoding unless the
trec.encoding property has been set to a valid supported encoding.
Since 3.5, the contents of tags can be added to the meta index instead of being indexed normally. This is useful
for holding URLs or dates that you need later during retrieval. To use this, the fields in the TREC file
must be ordered, the tags to add must be specified in TrecDocTags.propertytags and indexer.meta.forward.keys,
and the maximum lengths of the tags must be given in indexer.meta.forward.keylens.
Properties:
trec.document.class - the Document class used to parse individual documents (defaults to TaggedDocument).
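As a rough illustration of the properties named above, the following sketch collects them in a plain java.util.Properties object. The property names are taken from this page; every value (the class name string, the encoding, the url tag, the key length) is an assumed example, and the way Terrier itself loads its configuration is outside the scope of the sketch.

```java
import java.util.Properties;

class TrecPropertiesSketch {

    /** Builds an example configuration; all values are illustrative assumptions. */
    static Properties exampleProperties() {
        Properties p = new Properties();
        // Document class used to parse each document (the documented default is TaggedDocument).
        p.setProperty("trec.document.class", "TaggedDocument");
        // Encoding used to open the TREC files; leave unset to use the default encoding.
        p.setProperty("trec.encoding", "UTF-8");
        // Hypothetical tag whose contents go to the meta index instead of being indexed.
        p.setProperty("TrecDocTags.propertytags", "url");
        p.setProperty("indexer.meta.forward.keys", "url");
        // Assumed maximum length for the tag listed above.
        p.setProperty("indexer.meta.forward.keylens", "256");
        return p;
    }

    public static void main(String[] args) {
        // Print the entries in key=value form, as they might appear in a properties file.
        exampleProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

A real configuration may need further keys in indexer.meta.forward.keys; the single url entry here only shows how the three tag-related properties line up with each other.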
Field Summary | |
---|---|
protected CountingInputStream |
br
The inputstream used for reading data. |
protected java.lang.String |
currentFilename
Filename of current file |
protected java.lang.String |
desiredEncoding
Encoding to be used to open all files. |
protected java.util.HashSet<java.lang.String> |
DocIDBlacklist
|
protected java.lang.String |
docnotag
The docno tag |
protected java.util.Map<java.lang.String,java.lang.String> |
DocProperties
properties for the current document |
protected java.lang.Class<? extends Document> |
documentClass
|
protected int |
documentCounter
Counts the documents that are found in the collection, ignoring those documents that appear in the black list |
protected int |
documentsInThisFile
Counts the number of documents that have been found in this file. |
protected char[] |
end_docnoTag
The closing document number tag. |
protected int |
end_docnoTagLength
The length of the closing document number tag. |
protected java.lang.String |
end_docTag
The closing document tag. |
protected int |
end_docTagLength
The length of the closing document tag. |
protected boolean |
endOfCollection
Indicates whether the end of the collection has been reached. |
protected char[][] |
endPropertyTags
The end property tag |
protected int |
FileNumber
The index in the FilesToProcess of the currently processed file. |
protected java.util.ArrayList<java.lang.String> |
FilesToProcess
The list of files to process. |
protected boolean |
ignoreProperties
Do we ignore properties? |
protected static org.apache.log4j.Logger |
logger
logger for this class |
protected int[] |
propertyTagLengths
The length of each property tag |
protected boolean |
SkipFile
A boolean which is true when a new file is open. |
protected char[] |
start_docnoTag
The opening document number tag. |
protected int |
start_docnoTagLength
The length of the opening document number tag. |
protected char[] |
start_docTag
The opening document tag. |
protected int |
start_docTagLength
The length of the opening document tag. |
protected char[][] |
startPropertyTags
The start property tags |
protected boolean |
tags_CaseSensitive
Is the markup case-sensitive? |
protected java.lang.String |
ThisDocID
The string identifier of the current document. |
protected Tokeniser |
tokeniser
|
Constructor Summary | |
---|---|
TRECCollection()
A default constructor that reads the collection specification file configured by the property collection.spec, reads the list of blacklisted document numbers specified by the property trec.blacklist.docids, and opens the first collection file to process. |
|
TRECCollection(java.io.InputStream input)
A constructor that reads only the document in the specified InputStream. |
|
TRECCollection(java.lang.String CollectionSpecFilename,
java.lang.String TagSet,
java.lang.String BlacklistSpecFilename,
java.lang.String ignored)
Specific constructor: reads the files listed in CollectionSpecFilename and the blacklist of document IDs in BlacklistSpecFilename. The fourth argument (ignored) is no longer used. |
Method Summary | |
---|---|
protected void |
afterPropertyTags()
|
void |
close()
Closes the files and streams used by the collection object. |
boolean |
endOfCollection()
Indicates whether the end of the collection has been reached. |
java.lang.String |
getDocid()
Returns the document number of the current document. |
Document |
getDocument()
Returns the current document to process. |
Document |
getDocument(TagSet _tags,
TagSet _exact,
TagSet _fields)
Deprecated. |
java.lang.String |
getDocumentString(int docid)
Deprecated. |
protected java.lang.StringBuilder |
getTag(int taglength,
char[] startTag,
char[] endTag)
Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object |
boolean |
hasNext()
Checks whether more documents remain, i.e. whether the end of the collection has not yet been reached. |
protected void |
loadDocumentClass()
Loads the class that will supply all documents for this Collection. |
Document |
next()
Return next document |
boolean |
nextDocument()
Moves to the next document to process from the collection. |
protected boolean |
openNextFile()
Opens the next file from the collection specification. |
protected void |
readCollectionSpec(java.lang.String CollectionSpecFilename)
|
protected void |
readDocumentBlacklist(java.lang.String BlacklistSpecFilename)
|
void |
remove()
Unsupported by this Collection implementation: every invocation throws UnsupportedOperationException. |
void |
reset()
Resets the collection object back to the beginning of the collection. |
protected void |
setTags(java.lang.String TagSet)
protected method for initialising the opening and closing document and document number tags. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final org.apache.log4j.Logger logger
protected java.lang.String currentFilename
protected int documentsInThisFile
protected java.util.Map<java.lang.String,java.lang.String> DocProperties
protected int documentCounter
protected java.util.HashSet<java.lang.String> DocIDBlacklist
protected java.util.ArrayList<java.lang.String> FilesToProcess
protected int FileNumber
protected java.lang.String ThisDocID
protected CountingInputStream br
protected boolean SkipFile
protected char[] start_docTag
protected int start_docTagLength
protected java.lang.String end_docTag
protected int end_docTagLength
protected char[] start_docnoTag
protected int start_docnoTagLength
protected char[] end_docnoTag
protected int end_docnoTagLength
protected boolean tags_CaseSensitive
protected boolean ignoreProperties
protected java.lang.String docnotag
protected int[] propertyTagLengths
protected char[][] startPropertyTags
protected char[][] endPropertyTags
protected java.lang.String desiredEncoding
protected java.lang.Class<? extends Document> documentClass
protected Tokeniser tokeniser
protected boolean endOfCollection
Constructor Detail |
---|
public TRECCollection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
CollectionSpecFilename
- The collection specification filename. The file contains
a list of filenames to read. Must be specified, fatal error otherwise.
TagSet
- the TagSet constructor string to use to obtain the tags to parse for.
BlacklistSpecFilename
- A filename to a file containing a list of document identifiers
that must NOT be processed. Not loaded if null or length 0.
ignored
- no longer used
public TRECCollection()
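The collection specification file is described above as a list of filenames to read. A minimal sketch of turning such a file's lines into a list of files to process might look as follows. The class and method names are hypothetical, the paths are made up, and skipping blank lines and lines starting with '#' is an assumption of this sketch rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

class CollectionSpecSketch {

    /**
     * Turns the lines of a specification file into a list of filenames.
     * Blank lines and '#' comment lines are skipped (an assumption of this sketch).
     */
    static List<String> parseSpec(List<String> lines) {
        List<String> files = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            files.add(trimmed);
        }
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical specification contents; the paths are placeholders.
        List<String> spec = List.of("# example spec", "/data/trec/file1.gz", "", "/data/trec/file2.gz");
        System.out.println(parseSpec(spec));
    }
}
```

The lines could be obtained with java.nio.file.Files.readAllLines. A blacklist file of document identifiers could be read the same way into a HashSet, matching the type of the DocIDBlacklist field.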
public TRECCollection(java.io.InputStream input)
Method Detail |
---|
protected void setTags(java.lang.String TagSet)
protected void readCollectionSpec(java.lang.String CollectionSpecFilename)
protected void readDocumentBlacklist(java.lang.String BlacklistSpecFilename)
protected void loadDocumentClass()
public boolean hasNext()
public Document next()
public void remove()
public boolean nextDocument()
nextDocument
in interface Collection
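To make the sequential access pattern concrete, the sketch below imitates it with a small stand-in class that walks over TREC-formatted text held in memory. MiniTrecCollection is hypothetical and is not the Terrier implementation: it assumes the tags are literally <DOC> and <DOCNO>, matches them case-sensitively, and skips blacklisted document numbers without counting them, as described for documentCounter.

```java
import java.util.Set;

class MiniTrecCollection {
    private final String data;           // whole TREC-formatted input, held in memory
    private final Set<String> blacklist; // document numbers to skip
    private int pos = 0;                 // scan position in data
    private int documentCounter = 0;     // documents found, ignoring blacklisted ones
    private String docid;                // document number of the current document
    private String text;                 // raw content of the current document

    MiniTrecCollection(String data, Set<String> blacklist) {
        this.data = data;
        this.blacklist = blacklist;
    }

    /** Moves to the next non-blacklisted document; returns false at the end. */
    boolean nextDocument() {
        while (true) {
            int start = data.indexOf("<DOC>", pos);
            if (start < 0) return false;
            int end = data.indexOf("</DOC>", start);
            if (end < 0) return false;
            String body = data.substring(start + "<DOC>".length(), end);
            pos = end + "</DOC>".length();
            int a = body.indexOf("<DOCNO>");
            int b = body.indexOf("</DOCNO>");
            if (a < 0 || b < a) continue;             // no usable document number
            String docno = body.substring(a + "<DOCNO>".length(), b).trim();
            if (blacklist.contains(docno)) continue;  // skipped and not counted
            docid = docno;
            text = body;
            documentCounter++;
            return true;
        }
    }

    String getDocid() { return docid; }
    String getDocumentText() { return text; }
    int getDocumentCounter() { return documentCounter; }

    /** Goes back to the beginning of the collection. */
    void reset() { pos = 0; documentCounter = 0; docid = null; text = null; }

    public static void main(String[] args) {
        String trec = "<DOC>\n<DOCNO> D1 </DOCNO>\nfirst text\n</DOC>\n"
                    + "<DOC>\n<DOCNO> D2 </DOCNO>\nskipped\n</DOC>\n"
                    + "<DOC>\n<DOCNO> D3 </DOCNO>\nthird text\n</DOC>\n";
        MiniTrecCollection c = new MiniTrecCollection(trec, Set.of("D2"));
        while (c.nextDocument()) {
            System.out.println(c.getDocid());
        }
    }
}
```

The loop in main mirrors the intended calling style: call nextDocument() until it returns false, and read the current document's identifier and content in between.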
protected void afterPropertyTags() throws java.io.IOException
java.io.IOException
protected java.lang.StringBuilder getTag(int taglength, char[] startTag, char[] endTag) throws java.io.IOException
taglength
- the length of the start tag
startTag
- the start tag
endTag
- the end tag
java.io.IOException
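The idea of scanning for the first occurrence of a tag and returning its contents can be sketched as below. This is an independent illustration under stated assumptions, not the class's actual code: it reads from a java.io.Reader passed as a parameter, omits the taglength argument, matches tags case-sensitively, and returns null when the tag is not found. The helper getTagFromString is a hypothetical convenience for trying it out on a String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TagScanSketch {

    /** True if the buffer currently ends with the given tag. */
    static boolean endsWith(StringBuilder sb, char[] tag) {
        int off = sb.length() - tag.length;
        if (off < 0) return false;
        for (int i = 0; i < tag.length; i++) {
            if (sb.charAt(off + i) != tag[i]) return false;
        }
        return true;
    }

    /** Returns the text between the first startTag and the following endTag, or null. */
    static StringBuilder getTag(Reader in, char[] startTag, char[] endTag) throws IOException {
        StringBuilder window = new StringBuilder();
        boolean inside = false;
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (!inside) {
                if (endsWith(window, startTag)) {
                    inside = true;          // start tag seen: begin collecting content
                    window.setLength(0);
                } else if (window.length() > startTag.length) {
                    window.deleteCharAt(0); // keep only enough characters to match the start tag
                }
            } else if (endsWith(window, endTag)) {
                window.setLength(window.length() - endTag.length); // drop the end tag itself
                return window;
            }
        }
        return null; // input ended before a complete tag was read
    }

    /** Convenience wrapper for in-memory text. */
    static String getTagFromString(String text, String startTag, String endTag) {
        try {
            StringBuilder sb = getTag(new StringReader(text), startTag.toCharArray(), endTag.toCharArray());
            return sb == null ? null : sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<DOC>\n<DOCNO> D1 </DOCNO>\nsome text\n</DOC>";
        System.out.println("[" + getTagFromString(doc, "<DOCNO>", "</DOCNO>") + "]");
    }
}
```

Reading one character at a time keeps the sketch short. A production scanner would normally read from a buffered stream.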
public Document getDocument()
getDocument
in interface Collection
@Deprecated public Document getDocument(TagSet _tags, TagSet _exact, TagSet _fields)
public java.lang.String getDocid()
public boolean endOfCollection()
endOfCollection
in interface Collection
public void reset()
reset
in interface Collection
protected boolean openNextFile() throws java.io.IOException
java.io.IOException
- if there is an exception while opening the
collection files.
@Deprecated public java.lang.String getDocumentString(int docid)
getDocumentString
in interface DocumentExtractor
docid
- the internal identifier of a document.
public void close()
close
in interface java.io.Closeable