Class MultiDocumentFileCollection

    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected java.lang.String currentFilename
      Filename of current file
      protected java.lang.String desiredEncoding
      Encoding to be used to open all files.
      protected java.util.Map<java.lang.String,java.lang.String> DocProperties
      properties for the current document
      protected java.lang.Class<? extends Document> documentClass
      Class to use for all documents parsed by this class
      protected int documentsInThisFile
      Counts the number of documents that have been found in this file.
      protected boolean eoc
      are we at the end of the collection?
      protected boolean eof
      has the end of the current input file been reached?
      protected int FileNumber
      The index in the FilesToProcess of the currently processed file.
      protected java.util.List<java.lang.String> FilesToProcess
      The list of files to process.
      protected boolean forceUTF8
      should UTF8 encoding be assumed?
      protected java.io.InputStream is
      the input stream of the current input file
      protected static org.slf4j.Logger logger
      logger for this class
      protected boolean SkipFile
      A boolean which is true when a new file is open.
      protected Tokeniser tokeniser
      Tokeniser to use for all documents parsed by this class
    • Method Summary

      Modifier and Type Method Description
      protected void checkEncoding()  
      void close()
      Closes the collection and any files that may be open.
      boolean endOfCollection()
      Returns true if the end of the collection has been reached
      protected void extractCharset()  
      abstract Document getDocument()
      Get the document object representing the current document.
      boolean hasNext()
      Checks whether more documents remain in the collection.
      protected void loadDocumentClass()
      Loads the class that will supply all documents for this Collection.
      Document next()
      Return the next document
      abstract boolean nextDocument()
      Move the collection to the start of the next document.
      protected void openNewFile()  
      protected boolean openNextFile()
      Opens the next file from the collection specification.
      void reset()
      Resets the Collection iterator to the start of the collection.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
        logger for this class
      • documentsInThisFile

        protected int documentsInThisFile
        Counts the number of documents that have been found in this file.
      • eoc

        protected boolean eoc
        are we at the end of the collection?
      • eof

        protected boolean eof
        has the end of the current input file been reached?
      • SkipFile

        protected boolean SkipFile
        A boolean which is true when a new file is open.
      • currentFilename

        protected java.lang.String currentFilename
        Filename of current file
      • forceUTF8

        protected final boolean forceUTF8
        should UTF8 encoding be assumed?
      • is

        protected java.io.InputStream is
        the input stream of the current input file
      • DocProperties

        protected java.util.Map<java.lang.String,java.lang.String> DocProperties
        properties for the current document
      • FilesToProcess

        protected java.util.List<java.lang.String> FilesToProcess
        The list of files to process.
      • FileNumber

        protected int FileNumber
        The index in the FilesToProcess of the currently processed file.
      • desiredEncoding

        protected java.lang.String desiredEncoding
        Encoding to be used to open all files.
      • documentClass

        protected java.lang.Class<? extends Document> documentClass
        Class to use for all documents parsed by this class
      • tokeniser

        protected Tokeniser tokeniser
        Tokeniser to use for all documents parsed by this class
    • Constructor Detail

      • MultiDocumentFileCollection

        protected MultiDocumentFileCollection()
      • MultiDocumentFileCollection

        protected MultiDocumentFileCollection(java.util.List<java.lang.String> _FilesToProcess)
        Constructs a collection from the given list of files to process.
      • MultiDocumentFileCollection

        protected MultiDocumentFileCollection(java.lang.String CollectionSpecFilename)
        Constructs a collection from the denoted collection.spec file.
      • MultiDocumentFileCollection

        protected MultiDocumentFileCollection(java.io.InputStream input)
        A constructor that reads only the specified InputStream.
    • Method Detail

      • getDocument

        public abstract Document getDocument()
        Description copied from interface: Collection
        Get the document object representing the current document.
        Specified by:
        getDocument in interface Collection
        Returns:
        Document the current document.
      • checkEncoding

        protected void checkEncoding()
      • loadDocumentClass

        protected void loadDocumentClass()
        Loads the class that will supply all documents for this Collection. Set by property trec.document.class
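
        The page says only that the document class is set by the property trec.document.class. As a rough illustration of how a class name taken from a property could be resolved reflectively, here is a minimal sketch. The types DocSketch, PlainDocSketch and DocumentClassLoaderSketch are hypothetical stand-ins, and the use of java.util.Properties is an assumption; the configuration mechanism Terrier actually uses is not shown on this page.

```java
import java.util.Properties;

// Hypothetical stand-ins for illustration; these are not the Terrier Document types.
interface DocSketch { }

class PlainDocSketch implements DocSketch { }

class DocumentClassLoaderSketch {
    // Resolves the class named by the "trec.document.class" property,
    // falling back to defaultName when the property is not set.
    static Class<? extends DocSketch> load(Properties props, String defaultName) {
        String name = props.getProperty("trec.document.class", defaultName);
        try {
            // asSubclass throws ClassCastException if the class is not a DocSketch.
            return Class.forName(name).asSubclass(DocSketch.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown document class: " + name, e);
        }
    }
}
```

        Resolving the class once and storing it in a field such as documentClass would let every document in the collection be created from the same type.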
      • hasNext

        public boolean hasNext()
        Checks whether more documents remain in the collection.
        Returns:
        boolean true if more documents remain, otherwise false.
      • next

        public Document next()
        Return the next document
        Returns:
        next document
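
        hasNext() and next() give the collection an iterator-style view alongside the nextDocument()/getDocument() cursor. The sketch below shows one way the two styles could be layered over an in-memory list. It is an assumption that hasNext() is the negation of endOfCollection() and that next() advances and then returns the current document; this page does not confirm either. CursorIteratorSketch is a hypothetical class, and documents are plain strings here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: documents are plain strings held in memory.
class CursorIteratorSketch implements Iterator<String> {
    private final List<String> docs;
    private int position = -1; // index of the current document; -1 before the first

    CursorIteratorSketch(List<String> docs) { this.docs = docs; }

    // Cursor style: move to the next document, reporting whether one exists.
    boolean nextDocument() {
        if (position + 1 >= docs.size()) { position = docs.size(); return false; }
        position++;
        return true;
    }

    // Cursor style: the current document, or null if there is none.
    String getDocument() {
        return (position >= 0 && position < docs.size()) ? docs.get(position) : null;
    }

    // True when no document follows the current one.
    boolean endOfCollection() { return position + 1 >= docs.size(); }

    // Iterator style, layered on the cursor methods (an assumption, see lead-in).
    @Override public boolean hasNext() { return !endOfCollection(); }

    @Override public String next() {
        if (!nextDocument()) throw new NoSuchElementException();
        return getDocument();
    }

    // Collects everything an iterator yields, for demonstration.
    static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        return out;
    }
}
```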
      • close

        public void close()
        Closes the collection and any files that may be open.
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
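
        Because close() is specified by java.io.Closeable, a collection can be managed with try-with-resources. The following is a minimal sketch of that pattern; CloseableCollectionSketch is a hypothetical stand-in that holds a single input stream, not the Terrier class.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a stream-backed collection.
class CloseableCollectionSketch implements Closeable {
    private InputStream is; // stream of the current input file
    private boolean closed = false;

    CloseableCollectionSketch(InputStream input) { this.is = input; }

    boolean isClosed() { return closed; }

    // Declares no checked exception, like the close() signature documented above.
    @Override
    public void close() {
        try {
            if (is != null) { is.close(); is = null; }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        closed = true;
    }

    // Uses the collection inside try-with-resources and reports whether it was closed.
    static boolean closedAfterUse() {
        CloseableCollectionSketch c =
            new CloseableCollectionSketch(new ByteArrayInputStream(new byte[0]));
        try (CloseableCollectionSketch inUse = c) {
            // documents would be read from inUse here
        }
        return c.isClosed();
    }
}
```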
      • endOfCollection

        public boolean endOfCollection()
        Returns true if the end of the collection has been reached
        Specified by:
        endOfCollection in interface Collection
        Returns:
        boolean true if the end of collection has been reached, otherwise it returns false.
      • openNextFile

        protected boolean openNextFile()
                                throws java.io.IOException
        Opens the next file from the collection specification.
        Returns:
        boolean true if the file was opened successfully. If there are no more files to open, it returns false.
        Throws:
        java.io.IOException - if there is an exception while opening the collection files.
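
        The fields FilesToProcess, FileNumber, currentFilename and is suggest a cursor that walks a list of file names. The sketch below illustrates that idea under stated assumptions: FileCursorSketch is hypothetical, an in-memory map stands in for the file system, and skipping entries that cannot be opened is a choice made for this example, not behaviour confirmed by this page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a cursor over a list of files.
class FileCursorSketch {
    private final List<String> filesToProcess;
    private final Map<String, String> contents; // in-memory stand-in for the file system
    private int fileNumber = 0;                 // index of the next entry to try
    String currentFilename;
    InputStream is;

    FileCursorSketch(List<String> files, Map<String, String> contents) {
        this.filesToProcess = files;
        this.contents = contents;
    }

    // Returns true if a file was opened, false once the list is exhausted.
    boolean openNextFile() throws IOException {
        if (is != null) { is.close(); is = null; }
        while (fileNumber < filesToProcess.size()) {
            String name = filesToProcess.get(fileNumber++);
            String body = contents.get(name);
            if (body == null) continue; // assumption: skip entries that cannot be opened
            currentFilename = name;
            is = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
            return true;
        }
        return false;
    }

    // Lists the names of the files that openNextFile() manages to open, in order.
    static List<String> openedNames(List<String> files, Map<String, String> contents) {
        FileCursorSketch c = new FileCursorSketch(files, contents);
        List<String> opened = new ArrayList<>();
        try {
            while (c.openNextFile()) opened.add(c.currentFilename);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return opened;
    }
}
```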
      • extractCharset

        protected void extractCharset()
      • openNewFile

        protected void openNewFile()
                            throws java.lang.Exception
        Throws:
        java.lang.Exception
      • nextDocument

        public abstract boolean nextDocument()
        Move the collection to the start of the next document.
        Specified by:
        nextDocument in interface Collection
        Returns:
        boolean true if there exists another document in the collection, otherwise it returns false.
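
        nextDocument() and getDocument() together form a cursor: advance first, then read the current document. The sketch below shows that loop over a single input holding several documents. MultiDocReaderSketch is hypothetical, and separating documents by blank lines is an assumption made for the example; real subclasses define their own document boundaries.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one input holding several blank-line-separated documents.
class MultiDocReaderSketch {
    private final BufferedReader reader;
    private String current;      // text of the current document, or null
    private boolean eof = false; // has the end of the input been reached?
    int documentsInThisFile = 0;

    MultiDocReaderSketch(String fileContents) {
        this.reader = new BufferedReader(new StringReader(fileContents));
    }

    // Moves to the next document; returns false when none is left.
    boolean nextDocument() {
        if (eof) { current = null; return false; }
        StringBuilder sb = new StringBuilder();
        try {
            while (true) {
                String line = reader.readLine();
                if (line == null) { eof = true; break; }
                if (line.trim().isEmpty()) {
                    if (sb.length() > 0) break; // blank line ends a non-empty document
                    continue;                   // ignore leading blank lines
                }
                if (sb.length() > 0) sb.append('\n');
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (sb.length() == 0) { current = null; return false; }
        current = sb.toString();
        documentsInThisFile++;
        return true;
    }

    String getDocument() { return current; }

    // The typical traversal loop: advance, then fetch the current document.
    static List<String> readAll(String contents) {
        MultiDocReaderSketch c = new MultiDocReaderSketch(contents);
        List<String> out = new ArrayList<>();
        while (c.nextDocument()) out.add(c.getDocument());
        return out;
    }
}
```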
      • reset

        public void reset()
        Resets the Collection iterator to the start of the collection.
        Specified by:
        reset in interface Collection