Class BasicSinglePassIndexer

  • Direct Known Subclasses:
    BlockSinglePassIndexer, ExtensibleSinglePassIndexer, NoDuplicatesSinglePassIndexing

    public class BasicSinglePassIndexer
    extends BasicIndexer
    This class indexes a document collection, skipping the construction of the direct file. It implements a single-pass algorithm that operates in two phases:
    First, it traverses the document collection, passes the terms through the TermPipeline and builds an in-memory representation of the posting lists. When it has exhausted the main memory, it flushes the sorted postings to disk along with the lexicon (collectively known as a run), and continues traversing the collection.
    The second phase merges the sorted runs (with their partial lexicons) on disk to create the final inverted file. This class follows the template pattern, so the bulk of the code is reused for block (and fields) indexing. A few hook methods choose the right classes to instantiate, depending on the indexing options defined.
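    The two phases can be illustrated with a minimal, self-contained sketch. It is not Terrier's implementation: the class SinglePassSketch, its methods and the posting budget are hypothetical, postings are plain int[] {docId, tf} pairs, and the "runs" are kept in a list rather than written to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy single-pass indexer (hypothetical): runs live in a list, not on disk. */
class SinglePassSketch {
    // In-memory postings, keyed by term; a TreeMap keeps each run sorted by term.
    private TreeMap<String, List<int[]>> memory = new TreeMap<>();
    // Stand-in for the sorted runs that a real indexer would write to disk.
    private final List<TreeMap<String, List<int[]>>> runs = new ArrayList<>();
    private final int maxPostingsInMemory; // assumed budget, counted in postings
    private int postingsInMemory = 0;
    private int nextDocId = 0;

    SinglePassSketch(int maxPostingsInMemory) {
        this.maxPostingsInMemory = maxPostingsInMemory;
    }

    /** Phase one: add one document, flushing a run once the budget is reached. */
    void indexDocument(String[] terms) {
        int docId = nextDocId++;
        Map<String, Integer> tfs = new TreeMap<>();
        for (String t : terms) {
            tfs.merge(t, 1, Integer::sum); // term frequency within this document
        }
        for (Map.Entry<String, Integer> e : tfs.entrySet()) {
            memory.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                  .add(new int[] {docId, e.getValue()});
            postingsInMemory++;
        }
        if (postingsInMemory >= maxPostingsInMemory) {
            flush();
        }
    }

    /** Turns the current in-memory postings into a run and starts afresh. */
    void flush() {
        if (memory.isEmpty()) {
            return;
        }
        runs.add(memory);
        memory = new TreeMap<>();
        postingsInMemory = 0;
    }

    /** Phase two: merge all runs, in flush order, into one inverted index. */
    TreeMap<String, List<int[]>> merge() {
        flush(); // the last partial run also takes part in the merge
        TreeMap<String, List<int[]>> inverted = new TreeMap<>();
        for (TreeMap<String, List<int[]>> run : runs) {
            run.forEach((term, postings) ->
                inverted.computeIfAbsent(term, k -> new ArrayList<>()).addAll(postings));
        }
        return inverted;
    }

    int runCount() {
        return runs.size();
    }
}
```

    Because runs are appended in the order they are flushed, each merged posting list stays sorted by document id without any extra sorting step.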

    Memory tracking is a key concern in this class. Five properties are provided: two thresholds on the amount of memory consumed, how regularly to check the memory, and (optional) maximums on the amount of memory that can be used for the postings and on the number of documents before a flush is committed.

    Properties:

    • memory.reserved - threshold of free memory below which a run is committed. Default is 50 000 000 (50MB) on 32-bit JVMs and 100 000 000 (100MB) on 64-bit JVMs.
    • memory.heap.usage - proportion of the maximum heap allocated to the JVM before a run is committed. Default is 0.70.
    • indexing.singlepass.max.postings.memory - maximum amount of memory that the postings can consume before a run is committed. Default is 0, which means no limit.
    • indexing.singlepass.max.documents.flush - maximum number of documents before a run is committed. Default is 0, which means no limit.
    • docs.check - number of documents indexed between checks of the amount of free memory. Default is 20, i.e. memory consumption is checked every 20 documents.
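    As an illustration of the property names and defaults listed above, the following sketch reads them from a plain java.util.Properties object. The FlushSettings class and the is64BitJvm flag are hypothetical, and how Terrier itself loads these values is not shown here.

```java
import java.util.Properties;

/** Hypothetical holder for the flush-related properties and their documented defaults. */
class FlushSettings {
    final long memoryReserved;
    final double heapUsage;
    final long maxPostingsMemory;
    final int maxDocsPerFlush;
    final int docsPerCheck;

    FlushSettings(Properties p, boolean is64BitJvm) {
        // Documented defaults: 50MB on 32-bit JVMs, 100MB on 64-bit JVMs.
        String reservedDefault = is64BitJvm ? "100000000" : "50000000";
        memoryReserved = Long.parseLong(p.getProperty("memory.reserved", reservedDefault));
        heapUsage = Double.parseDouble(p.getProperty("memory.heap.usage", "0.70"));
        // For both optional maximums, 0 means "no limit".
        maxPostingsMemory = Long.parseLong(
            p.getProperty("indexing.singlepass.max.postings.memory", "0"));
        maxDocsPerFlush = Integer.parseInt(
            p.getProperty("indexing.singlepass.max.documents.flush", "0"));
        docsPerCheck = Integer.parseInt(p.getProperty("docs.check", "20"));
    }
}
```

    A property that is absent falls back to its documented default, so an empty Properties object yields the default configuration.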
    Author:
    Roi Blanco
    • Field Detail

      • currentId

        protected int currentId
        Current document Id
      • maxMemory

        protected long maxMemory
      • memoryCheck

        protected MemoryChecker memoryCheck
        Memory Checker - provides the method for checking to see if the system is running low on memory
      • docsPerCheck

        protected int docsPerCheck
        Number of documents read per memory check
      • maxDocsPerFlush

        protected int maxDocsPerFlush
      • runtime

        protected static final java.lang.Runtime runtime
        Runtime system JVM running this instance of Terrier
      • numberOfDocsSinceCheck

        protected int numberOfDocsSinceCheck
        Number of documents read since the memory consumption was last checked
      • numberOfDocsSinceFlush

        protected int numberOfDocsSinceFlush
        Number of documents read since the memory runs were last flushed to disk
      • memoryAfterFlush

        protected long memoryAfterFlush
        Memory status after flush
      • fileNames

        protected java.util.Queue<java.lang.String[]> fileNames
        Queue with the file names for the runs on disk
      • currentFile

        protected int currentFile
        Number of the current run to be written to disk
      • mp

        protected org.terrier.structures.indexing.singlepass.MemoryPostings mp
        Structure that keeps the posting lists in memory
      • merger

        protected org.terrier.structures.indexing.singlepass.RunsMerger merger
        Structure for merging the runs
      • numberOfDocuments

        protected int numberOfDocuments
        Number of documents indexed
      • numberOfTokens

        protected long numberOfTokens
        Number of tokens indexed
      • numberOfUniqueTerms

        protected int numberOfUniqueTerms
        Number of unique terms indexed
      • numberOfPointers

        protected long numberOfPointers
        Number of pointers indexed
      • invertedIndexClass

        protected java.lang.String invertedIndexClass
        The class that should be used to read the generated inverted index
      • basicInvertedIndexPostingIteratorClass

        protected java.lang.String basicInvertedIndexPostingIteratorClass
      • fieldInvertedIndexPostingIteratorClass

        protected java.lang.String fieldInvertedIndexPostingIteratorClass
      • invertedIndexInputStreamClass

        protected java.lang.String invertedIndexInputStreamClass
        The class that should be used to read the inverted index as a stream
    • Constructor Detail

      • BasicSinglePassIndexer

        public BasicSinglePassIndexer​(java.lang.String pathname,
                                      java.lang.String prefix)
        Constructs an instance of a BasicSinglePassIndexer, using the given path name for storing the data structures.
        Parameters:
        pathname - String the path where the datastructures will be created. This is assumed to be absolute.
        prefix - String the prefix of the index, usually "data".
      • BasicSinglePassIndexer

        protected BasicSinglePassIndexer​(long a,
                                         long b,
                                         long c)
        Protected do-nothing constructor for use by child classes
    • Method Detail

      • createDirectIndex

        public void createDirectIndex​(Collection[] collections)
        Description copied from class: BasicIndexer
        Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase).
        Overrides:
        createDirectIndex in class BasicIndexer
        Parameters:
        collections - Collection[] the collections to be indexed.
      • createInvertedIndex

        public void createInvertedIndex()
        Description copied from class: BasicIndexer
        Creates the inverted index after having created the direct index, document index and lexicon.
        Overrides:
        createInvertedIndex in class BasicIndexer
      • createInvertedIndex

        public void createInvertedIndex​(Collection[] collections)
        Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase).
        Parameters:
        collections - Collection[] the collections to be indexed.
      • checkFlush

        protected void checkFlush()
                           throws java.io.IOException
        Checks whether a flush is required, and performs it if necessary.
        Throws:
        java.io.IOException
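        The kind of decision a flush check has to make can be sketched as follows. This is not the code of checkFlush or of MemoryChecker: the class FlushPolicySketch is hypothetical, and in particular the rule that both the free-memory and the heap-usage conditions must hold is an assumption made for this example. The optional limit on postings memory is left out for brevity.

```java
/** Hypothetical flush policy, fed with memory figures by the caller. */
class FlushPolicySketch {
    private final long memoryReserved;   // cf. memory.reserved
    private final double heapUsage;      // cf. memory.heap.usage
    private final int docsPerCheck;      // cf. docs.check
    private final int maxDocsPerFlush;   // cf. indexing.singlepass.max.documents.flush, 0 = no limit
    private int docsSinceCheck = 0;
    private int docsSinceFlush = 0;

    FlushPolicySketch(long memoryReserved, double heapUsage,
                      int docsPerCheck, int maxDocsPerFlush) {
        this.memoryReserved = memoryReserved;
        this.heapUsage = heapUsage;
        this.docsPerCheck = docsPerCheck;
        this.maxDocsPerFlush = maxDocsPerFlush;
    }

    /** Assumed rule: little free memory AND most of the maximum heap already allocated. */
    static boolean lowOnMemory(long free, long total, long max,
                               long reserved, double heapUsage) {
        boolean littleFree = free < reserved;
        boolean heapMostlyAllocated = (double) total / max > heapUsage;
        return littleFree && heapMostlyAllocated;
    }

    /** Call once per indexed document; returns true when a run should be flushed. */
    boolean documentIndexed(long free, long total, long max) {
        docsSinceCheck++;
        docsSinceFlush++;
        boolean flush = false;
        if (maxDocsPerFlush > 0 && docsSinceFlush >= maxDocsPerFlush) {
            flush = true; // the document limit applies regardless of memory
        } else if (docsSinceCheck >= docsPerCheck) {
            docsSinceCheck = 0; // memory is only inspected every docsPerCheck documents
            flush = lowOnMemory(free, total, max, memoryReserved, heapUsage);
        }
        if (flush) {
            docsSinceFlush = 0;
            docsSinceCheck = 0;
        }
        return flush;
    }
}
```

        In a running JVM the three figures could be taken from Runtime.getRuntime().freeMemory(), totalMemory() and maxMemory(); passing them in as arguments keeps the policy easy to test with made-up numbers.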
      • forceFlush

        protected void forceFlush()
                           throws java.io.IOException
        Throws:
        java.io.IOException
      • indexDocument

        protected void indexDocument​(java.util.Map<java.lang.String,​java.lang.String> docProperties,
                                     DocumentPostingList termsInDocument)
                              throws java.lang.Exception
        This adds a document to the direct and document indexes, as well as its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument. This implementation only places content in the runs in memory, which will eventually be flushed to disk.
        Overrides:
        indexDocument in class BasicIndexer
        Parameters:
        docProperties - Map<String,String> properties of the document
        termsInDocument - DocumentPostingList the terms in the document.
        Throws:
        java.lang.Exception
      • finishMemoryPosting

        protected java.lang.String[] finishMemoryPosting()
        Adds the name of the current run + partial lexicon to be flushed to disk.
        Returns:
        the String[] array with the names of the run and partial lexicon to write.
      • performMultiWayMerge

        public void performMultiWayMerge()
                                  throws java.io.IOException
        Uses the merger class to perform a k-way merge over a set of previously written runs. The file names and the number of runs are given by the fileNames queue.
        Throws:
        java.io.IOException
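        For readers unfamiliar with k-way merging, here is a minimal sketch of the general technique using a priority queue. It is not the RunsMerger implementation: the class KWayMergeSketch is hypothetical, runs are in-memory lists of terms rather than files, and a term that appears in several runs is simply emitted once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

/** Hypothetical k-way merge over sorted runs of terms. */
class KWayMergeSketch {
    /** Cursor over one sorted run, exposing the run's current smallest term. */
    private static final class Cursor {
        final Iterator<String> rest;
        String head;

        Cursor(Iterator<String> it) {
            rest = it;
            head = it.next();
        }
    }

    /** Merges the sorted runs into one sorted list of distinct terms. */
    static List<String> merge(Queue<List<String>> runs) {
        // The heap always yields the cursor with the smallest current term.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            // A term seen in several runs is emitted once; a real merger would
            // combine the term's posting lists from each run at this point.
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(c.head)) {
                merged.add(c.head);
            }
            if (c.rest.hasNext()) {
                c.head = c.rest.next(); // advance, then re-insert under the new key
                heap.add(c);
            }
        }
        return merged;
    }
}
```

        With k runs and n terms in total, each term passes through the heap once, so the merge takes O(n log k) comparisons and only needs one term per run in memory at a time.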
      • getFileNames

        protected java.lang.String[][] getFileNames()
        Returns:
        the String[][] structure with the name of the runs files and partial lexicons.
      • createFieldRunMerger

        protected void createFieldRunMerger​(java.lang.String[][] files)
                                     throws java.lang.Exception
        Hook method that creates a FieldRunMerger instance
        Throws:
        java.io.IOException - if an I/O error occurs.
        java.lang.Exception
      • createRunMerger

        protected void createRunMerger​(java.lang.String[][] files)
                                throws java.lang.Exception
        Hook method that creates a RunsMerger instance
        Throws:
        java.io.IOException - if an I/O error occurs.
        java.lang.Exception
      • createMemoryPostings

        protected void createMemoryPostings()
        Hook method that creates the right type of MemoryPostings class.
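        The hook methods above follow the template method pattern: the base class fixes the overall algorithm and subclasses override only the factory-style hooks. The sketch below shows the idea with hypothetical classes (BasicSketch, BlockSketch and the Postings interface are invented for this example and are not Terrier types).

```java
/** Hypothetical base indexer: fixed algorithm skeleton plus an overridable hook. */
class BasicSketch {
    /** Placeholder for whatever in-memory structure a variant needs. */
    interface Postings {
        String describe();
    }

    /** Hook: decides which Postings implementation to instantiate. */
    protected Postings createPostings() {
        return () -> "basic postings";
    }

    /** Template method: the shared steps, written once and not overridable. */
    final String index() {
        Postings p = createPostings();
        return "indexed with " + p.describe();
    }
}

/** Variant that reuses index() and only swaps the structure created by the hook. */
class BlockSketch extends BasicSketch {
    @Override
    protected Postings createPostings() {
        return () -> "block postings";
    }
}
```

        Calling index() on a BlockSketch runs the same steps as on a BasicSketch, but with the structure chosen by the overridden hook.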