Class TRECCollection

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, Collection
    Direct Known Subclasses:
    TRECUTFCollection, TRECWebCollection

    public class TRECCollection
    extends MultiDocumentFileCollection
    Models a TREC test collection by implementing the interfaces Collection and DocumentExtractor. It provides sequential access to the documents in the collection and can also return the text of a document as a String. The precise Document class to be used can be specified with the trec.document.class property. TREC format files are opened using the default encoding unless the trec.encoding property has been set to a valid supported encoding. Since 3.5, the contents of tags can be added to the meta index instead of being indexed normally. This is useful for holding URLs or dates that you need later during retrieval. To use this, the fields in the TREC file need to be ordered, the tags to add need to be specified in TrecDocTags.propertytags and indexer.meta.forward.keys, and the maximum lengths of the tags need to be given in indexer.meta.forward.keylens.

    Properties:

    • trec.document.class - the Document class used to parse individual documents (defaults to TaggedDocument).
    • trec.encoding - encoding to use to open all files. Leave unset for System default encoding.
    • (tagset).propertytags - list of tags to add to the meta index rather than to index. Tags are assumed to be IN ORDER after the docid.
    • indexer.meta.forward.keys - list of keys to add to the meta index, remember to put any property tags here as well.
    • indexer.meta.forward.keylens - lengths of each of the meta keys, remember to put the lengths of the property tags here as well.
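
The relationship between the three meta index properties can be sketched in plain Java. This is only an illustration under stated assumptions: MetaConfigSketch is a hypothetical helper and not part of Terrier, the tag names url and date and all lengths are made-up values, docno is assumed to be an existing meta key, and the list values are assumed to be comma-separated.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Illustrative only: MetaConfigSketch is a hypothetical helper, not part of Terrier.
final class MetaConfigSketch {

    // Example settings; the tag names (url, date) and the lengths are made-up values.
    static final String EXAMPLE = String.join("\n",
        "trec.encoding=UTF-8",
        "TrecDocTags.propertytags=url,date",
        "indexer.meta.forward.keys=docno,url,date",
        "indexer.meta.forward.keylens=20,256,32");

    // Parses properties-file text into a Properties object.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Splits a comma-separated property value into trimmed entries (assumed delimiter).
    static List<String> list(Properties p, String key) {
        String value = p.getProperty(key, "").trim();
        if (value.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(value.split("\\s*,\\s*"));
    }

    // True when every property tag is also a meta key and each key has a length.
    static boolean consistent(Properties p) {
        List<String> keys = list(p, "indexer.meta.forward.keys");
        List<String> lens = list(p, "indexer.meta.forward.keylens");
        List<String> tags = list(p, "TrecDocTags.propertytags");
        return keys.size() == lens.size() && keys.containsAll(tags);
    }

    public static void main(String[] args) {
        System.out.println(consistent(parse(EXAMPLE)));
    }
}
```

The check mirrors the two reminders in the property list: each property tag also appears among the meta keys, and each meta key has a matching length entry.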
    Author:
    Craig Macdonald & Vassilis Plachouras & Richard McCreadie
    • Constructor Summary

      Constructors 
      Constructor Description
      TRECCollection()
      A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids and opens the first collection file to process.
      TRECCollection​(java.io.InputStream input)
      A constructor that reads only the document in the specified InputStream.
      TRECCollection​(java.lang.String collSpec)  
      TRECCollection​(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
      Specific constructor: reads the files listed in CollectionSpecFilename, the Blacklist of Document IDs in BlacklistSpecFilename, and stores document offsets and lengths in the document pointers file docPointersFilename.
      TRECCollection​(java.util.List<java.lang.String> files)  
      TRECCollection​(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename)  
      TRECCollection​(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)  
    • Field Detail

      • propertyTags

        protected java.lang.String[] propertyTags
        Tag names for tags that should be added as properties
      • documentCounter

        protected int documentCounter
        Counts the documents that are found in the collection, ignoring those documents that appear in the black list
      • DocIDBlacklist

        protected java.util.HashSet<java.lang.String> DocIDBlacklist
      • ThisDocID

        protected java.lang.String ThisDocID
        The string identifier of the current document.
      • start_docTag

        protected char[] start_docTag
        The opening document tag.
      • start_docTagLength

        protected int start_docTagLength
        The length of the opening document tag.
      • end_docTag

        protected java.lang.String end_docTag
        The closing document tag.
      • end_docTagLength

        protected int end_docTagLength
        The length of the closing document tag.
      • start_docnoTag

        protected char[] start_docnoTag
        The opening document number tag.
      • start_docnoTagLength

        protected int start_docnoTagLength
        The length of the opening document number tag.
      • end_docnoTag

        protected char[] end_docnoTag
        The closing document number tag.
      • end_docnoTagLength

        protected int end_docnoTagLength
        The length of the closing document number tag.
      • tags_CaseSensitive

        protected boolean tags_CaseSensitive
        Is the markup case-sensitive?
      • ignoreProperties

        protected boolean ignoreProperties
        Do we ignore properties?
      • docnotag

        protected java.lang.String docnotag
        The docno tag
      • propertyTagLengths

        protected int[] propertyTagLengths
        The length of each property tag
      • startPropertyTags

        protected char[][] startPropertyTags
        The start property tags
      • endPropertyTags

        protected char[][] endPropertyTags
        The end property tags
    • Constructor Detail

      • TRECCollection

        public TRECCollection​(java.lang.String CollectionSpecFilename,
                              java.lang.String TagSet,
                              java.lang.String BlacklistSpecFilename,
                              java.lang.String ignored)
        Specific constructor: reads the files listed in CollectionSpecFilename, the Blacklist of Document IDs in BlacklistSpecFilename, and stores document offsets and lengths in the document pointers file docPointersFilename. The collection will be parsed according to the TagSet specified by the TagSet string.
        Parameters:
        CollectionSpecFilename - The collections specification filename. The file contains a list of filenames to read. Must be specified, fatal error otherwise.
        TagSet - the TagSet constructor string to use to obtain the tags to parse for.
        BlacklistSpecFilename - The name of a file containing a list of document identifiers that must NOT be processed. Not loaded if null or of length 0.
        ignored - no longer used
      • TRECCollection

        public TRECCollection​(java.util.List<java.lang.String> files)
      • TRECCollection

        public TRECCollection​(java.util.List<java.lang.String> files,
                              java.lang.String TagSet,
                              java.lang.String BlacklistSpecFilename,
                              java.lang.String ignored)
      • TRECCollection

        public TRECCollection​(java.util.List<java.lang.String> files,
                              java.lang.String TagSet,
                              java.lang.String BlacklistSpecFilename)
      • TRECCollection

        public TRECCollection​(java.lang.String collSpec)
      • TRECCollection

        public TRECCollection()
        A default constructor that reads the collection specification file, as configured by the property collection.spec, reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids and opens the first collection file to process. TagSet TagSet.TREC_DOC_TAGS is used to tokenize the collection.
      • TRECCollection

        public TRECCollection​(java.io.InputStream input)
        A constructor that reads only the document in the specified InputStream. Also reads a list of blacklisted document numbers, specified by the property trec.blacklist.docids and opens the first collection file to process.
    • Method Detail

      • setTags

        protected void setTags​(java.lang.String TagSet)
        Initialises the opening and closing document and document number tags.
      • readDocumentBlacklist

        protected void readDocumentBlacklist​(java.lang.String BlacklistSpecFilename)
      • nextDocument

        public boolean nextDocument()
        Moves to the next document to process from the collection.
        Specified by:
        nextDocument in interface Collection
        Specified by:
        nextDocument in class MultiDocumentFileCollection
        Returns:
        boolean true if there are more documents to process in the collection, otherwise it returns false.
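
The sequential access pattern, together with the blacklist and the document counter described under Field Detail, can be illustrated with a stand-alone miniature. This is a sketch, not the Terrier implementation: SequentialDocsSketch is a hypothetical class, it assumes the conventional TREC markup of <DOC> and <DOCNO> with each tag on its own line, and it reads from an in-memory String rather than from collection files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;

// Illustrative only: a stand-alone miniature, not the Terrier implementation.
final class SequentialDocsSketch {
    private static final String OPEN_DOCNO = "<DOCNO>";   // assumed tag names
    private static final String CLOSE_DOCNO = "</DOCNO>";
    private static final String CLOSE_DOC = "</DOC>";

    private final BufferedReader reader;
    private final Set<String> blacklist;
    private String currentDocno;   // plays the role of ThisDocID
    private int documentCounter;   // counts accepted (non-blacklisted) documents

    SequentialDocsSketch(String trecText, Set<String> blacklist) {
        this.reader = new BufferedReader(new StringReader(trecText));
        this.blacklist = blacklist;
    }

    // Advances to the next non-blacklisted document; false once input is exhausted.
    boolean nextDocument() {
        try {
            String docno = null;
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.startsWith(OPEN_DOCNO) && line.endsWith(CLOSE_DOCNO)) {
                    docno = line.substring(OPEN_DOCNO.length(),
                            line.length() - CLOSE_DOCNO.length()).trim();
                } else if (line.equals(CLOSE_DOC)) {
                    if (docno != null && !blacklist.contains(docno)) {
                        currentDocno = docno;
                        documentCounter++;
                        return true;
                    }
                    docno = null; // blacklisted or missing docno: keep scanning
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String docno() { return currentDocno; }

    int count() { return documentCounter; }

    public static void main(String[] args) {
        String text = "<DOC>\n<DOCNO> D1 </DOCNO>\nfirst\n</DOC>\n"
                    + "<DOC>\n<DOCNO> D2 </DOCNO>\nsecond\n</DOC>\n";
        SequentialDocsSketch docs = new SequentialDocsSketch(text, Set.of("D2"));
        while (docs.nextDocument()) {
            System.out.println(docs.docno()); // D2 is blacklisted and never reported
        }
    }
}
```

Callers drive the loop with while (nextDocument()), which matches the boolean contract above: true while another document is available, false at the end of the collection.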
      • afterPropertyTags

        protected void afterPropertyTags()
                                  throws java.io.IOException
        Throws:
        java.io.IOException
      • getTag

        protected java.lang.StringBuilder getTag​(int taglength,
                                                 char[] startTag,
                                                 char[] endTag)
                                          throws java.io.IOException
        Scans through a document reading in the first occurrence of the specified tag, returning its contents as a StringBuilder object
        Parameters:
        taglength - the length of the start tag
        startTag - the start tag
        endTag - the end tag
        Returns:
        the tag contents
        Throws:
        java.io.IOException
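
The scan that getTag describes, finding the first occurrence of a tag and returning its contents, might look roughly like the following stand-alone sketch. It is not the Terrier implementation: TagScanSketch and its helper methods are hypothetical, the taglength parameter is omitted because the array length already carries it, and returning an empty or partial result when a tag is missing or unclosed is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative only: a stand-alone sketch, not the Terrier implementation.
final class TagScanSketch {

    // True if the buffer currently ends with the given pattern.
    static boolean endsWith(StringBuilder sb, char[] pattern) {
        int offset = sb.length() - pattern.length;
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (sb.charAt(offset + i) != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads characters into out until pattern has been consumed, then strips the
    // pattern from out. Returns false if the input ends before the pattern is seen.
    static boolean readUntil(Reader in, char[] pattern, StringBuilder out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            out.append((char) c);
            if (endsWith(out, pattern)) {
                out.setLength(out.length() - pattern.length);
                return true;
            }
        }
        return false;
    }

    // Returns the contents of the first startTag...endTag pair found in the input.
    static StringBuilder getTag(Reader in, char[] startTag, char[] endTag) throws IOException {
        StringBuilder skipped = new StringBuilder();
        StringBuilder contents = new StringBuilder();
        if (!readUntil(in, startTag, skipped)) {
            return contents; // assumption: empty result when the tag never opens
        }
        readUntil(in, endTag, contents); // assumption: keep partial text if unclosed
        return contents;
    }

    // Convenience wrapper for in-memory text.
    static String firstTag(String text, String startTag, String endTag) {
        try {
            return getTag(new StringReader(text),
                    startTag.toCharArray(), endTag.toCharArray()).toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<DOC><DOCNO>D1</DOCNO><url>http://example.org/</url>text</DOC>";
        System.out.println(firstTag(doc, "<url>", "</url>"));
    }
}
```

Because the reader is consumed as the scan advances, repeated calls on the same Reader pick up tags in the order they appear, which is consistent with the requirement that property tags are in order after the docid.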