Class Indexer

  • Direct Known Subclasses:
    BasicIndexer, BlockIndexer

    public abstract class Indexer
    extends java.lang.Object
    Properties:
    • termpipelines - the sequence of TermPipeline stages (e.g. Stopwords removal and PorterStemmer).
    • termpipelines.skip - a list of tokens which should be skipped from the term pipeline. If not set or empty, then none will be skipped.
    • indexing.max.tokens - The maximum number of tokens the indexer will attempt to index in a document. If 0, then all tokens will be indexed (default).
    • ignore.empty.documents - Assign empty documents with docids. Default true
    • indexing.max.docs.per.builder - Maximum number of documents in an index before a new index is created, and merged later.
    • indexing.builder.boundary.docnos - Docnos of documents that force the index being created to be completed, and a new index to be commenced. An alternative to indexing.max.docs.per.builder
    • indexer.meta.forward.keys - comma delimited list of Document properties to index as document metadata in the MetaIndex. Defaults to "docno", which permits docid->docno lookups. Examples are "docno,url" or "docno,url,content".
    • indexer.meta.forward.keylens - comma delimited list of the length of the values to record in the MetaIndex. Defaults to 20.
    • indexer.meta.reverse.keys - comma delimited list of Document properties to permit lookups for (i.e. docno->docid). Defaults to empty (none are enabled).
    • indexer.meta.builder - name of the class to build the MetaIndex. Defaults to ZstdMetaIndexBuilder, which uses zstandard compression.
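As a minimal sketch, the properties listed above could be collected in a java.util.Properties object before being written to a properties file. Only the property keys come from the list above. The values are illustrative assumptions (in particular the key length of 256 for "url", and the idea of one length per forward key), and the class name IndexerPropertiesSketch is hypothetical.

```java
import java.util.Properties;

final class IndexerPropertiesSketch {
    /** Builds an illustrative set of indexer properties; all values are assumptions. */
    static Properties exampleProperties() {
        Properties p = new Properties();
        p.setProperty("termpipelines", "Stopwords,PorterStemmer");
        p.setProperty("indexing.max.tokens", "0");                  // 0 = index all tokens
        p.setProperty("indexing.max.docs.per.builder", "18000000");
        p.setProperty("indexer.meta.forward.keys", "docno,url");
        p.setProperty("indexer.meta.forward.keylens", "20,256");    // assumed: one length per key
        p.setProperty("indexer.meta.reverse.keys", "docno");        // enables docno->docid lookups
        return p;
    }

    public static void main(String[] args) {
        // Print the properties in key=value form.
        exampleProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```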
    Author:
    Craig Macdonald
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
        Indexer()
      Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
      protected Indexer​(long a, long b, long c)
      Protected do-nothing constructor for use by child classes
        Indexer​(java.lang.String _path, java.lang.String _prefix)
      Creates an instance of the class.
    • Method Summary

      Modifier and Type Method Description
      abstract void createDirectIndex​(Collection[] collections)
      An abstract method for creating the direct index, the document index and the lexicon for the given collections.
      abstract void createInvertedIndex()
      An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
      protected MetaIndexBuilder createMetaIndexBuilder()  
      protected void finishedDirectIndexBuild()
      event method to be overridden by child classes
      protected void finishedInvertedIndexBuild()
      event method to be overridden by child classes
      protected abstract TermPipeline getEndOfPipeline()
      An abstract method that returns the last component of the term pipeline.
      int getExternalParalllism()
      Returns how many indexers are running in this and other threads.
      void index​(Collection[] collections)
      Creates the data structures for a set of collections.
      protected void indexEmpty​(java.util.Map<java.lang.String,​java.lang.String> docProperties)
      Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
      protected void init()
      This method must be called by anything which directly extends Indexer.
      protected void load_builder_boundary_documents()
      Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
      protected void load_field_ids()
      loads a mapping of field name -> field id
      protected void load_indexer_properties()  
      protected void load_pipeline()
      Creates the term pipeline, as specified by the property termpipelines in the properties file.
      static void main​(java.lang.String[] args)
      Utility method for merging indices
      static void merge​(java.lang.String mpath, java.lang.String mprefix, int lowest, int highest, boolean blocks)
      Merge a series of numbered indices in the same path/prefix area.
      static void merge​(java.lang.String mpath, java.lang.String mprefix, java.util.LinkedList<java.lang.String[]> llist, int counterMerged, boolean blocks)
      Merge a series of indices, in pair-wise fashion
      protected static void mergeTwoIndices​(java.lang.String[] index1, java.lang.String[] index2, java.lang.String[] outputIndex, boolean blocks)
      Merge two indices.
      protected static int[] parseInts​(java.lang.String[] in)  
      void setExternalParalllism​(int externalParalllism)
      Sets how many indexers are running in this and other threads.
      boolean useFieldInformation()
      Returns whether the index will record fields.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
        the logger for this class
      • MAX_DOCS_PER_BUILDER

        protected int MAX_DOCS_PER_BUILDER
        The number of documents indexed with a set of builders. If a collection consists of more documents, then we need to create new builders and later merge the data structures. The corresponding property is indexing.max.docs.per.builder and the default value is 18000000 (18 million documents). If the property is set equal to zero, then there is no limit.
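The effect of this limit can be sketched with a ceiling division. The helper below is hypothetical and not part of Indexer. It assumes documents are split purely by count and ignores any boundaries forced by indexing.builder.boundary.docnos.

```java
final class BuilderSplitSketch {
    /**
     * Returns how many partial indices would be needed for numDocs documents
     * when at most maxDocsPerBuilder documents go into each one.
     * A limit of zero (or less) is treated as "no limit", so one index suffices.
     */
    static long partialIndexCount(long numDocs, int maxDocsPerBuilder) {
        if (numDocs <= 0) {
            return 0;
        }
        if (maxDocsPerBuilder <= 0) {
            return 1;
        }
        // Ceiling division: a final, partially filled builder still counts.
        return (numDocs + maxDocsPerBuilder - 1) / maxDocsPerBuilder;
    }

    public static void main(String[] args) {
        System.out.println(partialIndexCount(50_000_000L, 18_000_000)); // 3
        System.out.println(partialIndexCount(50_000_000L, 0));          // 1
    }
}
```

Under these assumptions, a collection of 50 million documents with the default limit would produce three partial indices that are merged afterwards.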
      • MAX_TOKENS_IN_DOCUMENT

        protected int MAX_TOKENS_IN_DOCUMENT
        The maximum number of tokens in a document. If it is set to zero, then there is no limit in the number of tokens indexed for a document. Set by property indexing.max.tokens.
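A minimal sketch of the zero-means-unlimited convention follows. It assumes that tokens beyond the limit are simply not indexed. The class and method names are hypothetical and do not appear in Indexer.

```java
import java.util.List;

final class MaxTokensSketch {
    /**
     * Returns the tokens that would be indexed under a per-document limit.
     * A limit of zero (or less) means that every token is kept.
     */
    static List<String> tokensToIndex(List<String> tokens, int maxTokens) {
        if (maxTokens <= 0 || tokens.size() <= maxTokens) {
            return tokens;
        }
        // Keep only the first maxTokens tokens of the document.
        return tokens.subList(0, maxTokens);
    }

    public static void main(String[] args) {
        System.out.println(tokensToIndex(List.of("a", "b", "c"), 2)); // [a, b]
        System.out.println(tokensToIndex(List.of("a", "b", "c"), 0)); // [a, b, c]
    }
}
```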
      • BUILDER_BOUNDARY_DOCUMENTS

        protected final java.util.HashSet<java.lang.String> BUILDER_BOUNDARY_DOCUMENTS
        The DOCNOs of documents that force builder boundaries
      • useFieldInformation

        protected boolean useFieldInformation
        Indicates whether field information should be saved in the created data structures.
      • pipeline_first

        protected TermPipeline pipeline_first
        The first component of the term pipeline.
      • IndexEmptyDocuments

        protected boolean IndexEmptyDocuments
        Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored.
      • emptyDocCount

        protected int emptyDocCount
      • docIndexBuilder

        protected DocumentIndexBuilder docIndexBuilder
        The builder that creates the document index.
      • invertedIndexBuilder

        protected InvertedIndexBuilder invertedIndexBuilder
        The builder that creates the inverted index.
      • lexiconBuilder

        protected LexiconBuilder lexiconBuilder
        The builder that creates the lexicon.
      • fileNameNoExtension

        protected java.lang.String fileNameNoExtension
        The common prefix of the data structures filenames.
      • path

        protected java.lang.String path
        The path in which the data structures are stored.
      • prefix

        protected java.lang.String prefix
        The prefix of the data structures, i.e. the first part of the filename
      • currentIndex

        protected IndexOnDisk currentIndex
        The index being worked on, denoted by path and prefix
      • fieldNames

        protected gnu.trove.TObjectIntHashMap<java.lang.String> fieldNames
        mapping: field name -> field id, returns 0 for no mapping
      • numFields

        protected int numFields
        the number of fields
      • blocks

        protected boolean blocks
        Whether block indexing is in use
      • externalParalllism

        protected int externalParalllism
        how many instances are being used by the code calling this class in parallel
    • Constructor Detail

      • Indexer

        public Indexer()
        Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX
      • Indexer

        public Indexer​(java.lang.String _path,
                       java.lang.String _prefix)
        Creates an instance of the class. The generated data structures will be saved in the given path. The filename of the data structures is given by the prefix parameter.
        Parameters:
        _path - String the path where the generated data structures will be saved.
        _prefix - String the filename that the data structures will have.
      • Indexer

        protected Indexer​(long a,
                          long b,
                          long c)
        Protected do-nothing constructor for use by child classes
    • Method Detail

      • init

        protected void init()
        This method must be called by anything which directly extends Indexer. See: http://benpryor.com/blog/2008/01/02/dont-call-subclass-methods-from-a-superclass-constructor/
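The reason for a separate init() call can be illustrated with toy classes. This is a generic Java sketch of the constructor-ordering pitfall, not code from Indexer, and all class names are hypothetical. A superclass constructor runs before the subclass's field initializers, so an overridable method called from it sees unassigned fields.

```java
final class InitOrderSketch {
    /** Anti-pattern: the superclass constructor calls an overridable method. */
    static abstract class EagerBase {
        String observed;
        EagerBase() {
            observed = describe(); // runs before subclass fields are assigned
        }
        abstract String describe();
    }

    static final class EagerChild extends EagerBase {
        // Not a compile-time constant, so it is assigned only after super() returns.
        private final String name = String.valueOf("child");
        @Override String describe() {
            return name; // still null when called from the EagerBase constructor
        }
    }

    /** Safer pattern: the subclass calls init() once its own fields are assigned. */
    static abstract class LazyBase {
        String observed;
        protected void init() {
            observed = describe();
        }
        abstract String describe();
    }

    static final class LazyChild extends LazyBase {
        private final String name = String.valueOf("child");
        LazyChild() {
            init(); // field initializers have already run at this point
        }
        @Override String describe() {
            return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(new EagerChild().observed); // null
        System.out.println(new LazyChild().observed);  // child
    }
}
```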
      • createDirectIndex

        public abstract void createDirectIndex​(Collection[] collections)
        An abstract method for creating the direct index, the document index and the lexicon for the given collections.
        Parameters:
        collections - Collection[] An array of collections to index
      • createInvertedIndex

        public abstract void createInvertedIndex()
        An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created.
      • getEndOfPipeline

        protected abstract TermPipeline getEndOfPipeline()
        An abstract method that returns the last component of the term pipeline.
        Returns:
        TermPipeline the end of the term pipeline.
      • getExternalParalllism

        public int getExternalParalllism()
        Returns how many indexers are running in this and other threads.
      • setExternalParalllism

        public void setExternalParalllism​(int externalParalllism)
        Sets how many indexers are running in this and other threads.
      • parseInts

        protected static final int[] parseInts​(java.lang.String[] in)
      • load_indexer_properties

        protected void load_indexer_properties()
      • load_field_ids

        protected void load_field_ids()
        loads a mapping of field name -> field id
      • load_pipeline

        protected void load_pipeline()
        Creates the term pipeline, as specified by the property termpipelines in the properties file. The default value of the property termpipelines is Stopwords,PorterStemmer. This means that we first remove stopwords and then apply Porter's stemming algorithm.
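The chaining idea behind a stopwords-then-stemmer pipeline can be sketched with plain java.util.function.Consumer stages. This does not use the TermPipeline interface, and every name below is hypothetical. The "stemmer" is a toy stand-in that only strips a trailing "s"; it is not Porter's algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

final class PipelineSketch {
    /** Drops terms found in the stopword set and forwards the rest to the next stage. */
    static Consumer<String> stopwords(Set<String> stop, Consumer<String> next) {
        return term -> {
            if (!stop.contains(term)) {
                next.accept(term);
            }
        };
    }

    /** Toy stand-in for a stemmer: strips one trailing "s" from terms longer than 3 characters. */
    static Consumer<String> toyStemmer(Consumer<String> next) {
        return term -> next.accept(
            term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term);
    }

    /** Sends tokens through stopword removal, then the toy stemmer, and collects the output. */
    static List<String> run(List<String> tokens) {
        List<String> out = new ArrayList<>();
        // The stages are built back to front: out::add plays the role of the end of the pipeline.
        Consumer<String> first = stopwords(Set.of("the", "and"), toyStemmer(out::add));
        tokens.forEach(first);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the", "cats", "and", "dogs"))); // [cat, dog]
    }
}
```

In this sketch, `first` corresponds loosely to the role of pipeline_first, and the collector passed as the last stage corresponds loosely to what getEndOfPipeline() supplies.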
      • load_builder_boundary_documents

        protected void load_builder_boundary_documents()
        Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited.
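Parsing a comma-delimited property value into a set of docnos might look like the sketch below. Trimming whitespace and skipping empty entries are assumptions, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class BoundaryDocnosSketch {
    /** Splits a comma-delimited value into a set of docnos; null yields an empty set. */
    static Set<String> parse(String propertyValue) {
        Set<String> docnos = new HashSet<>();
        if (propertyValue == null) {
            return docnos;
        }
        for (String part : propertyValue.split(",")) {
            String docno = part.trim();
            if (!docno.isEmpty()) {
                docnos.add(docno);
            }
        }
        return docnos;
    }

    public static void main(String[] args) {
        System.out.println(parse("D1, D2,,D3").size()); // 3
    }
}
```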
      • index

        public void index​(Collection[] collections)
        Creates the data structures for a set of collections. It creates a set of data structures for every indexing.max.docs.per.builder documents, if the value of this property is greater than zero, and then it merges the generated data structures.
        Parameters:
        collections - The document collection objects to index.
      • merge

        public static void merge​(java.lang.String mpath,
                                 java.lang.String mprefix,
                                 int lowest,
                                 int highest,
                                 boolean blocks)
        Merge a series of numbered indices in the same path/prefix area. New merged index will be stored at mpath/mprefix_highest+1.
        Parameters:
        mpath - Path of all indices
        mprefix - Common prefix of all indices
        lowest - lowest suffix of prefix
        highest - highest suffix of prefix
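The naming convention implied by this description can be sketched as follows. The sketch assumes that the source indices are named mprefix_lowest through mprefix_highest, which is inferred from the parameter descriptions rather than stated outright. The helper names are hypothetical and no merging is performed.

```java
import java.util.ArrayList;
import java.util.List;

final class NumberedMergeSketch {
    /** Assumed names of the source indices: mprefix_lowest .. mprefix_highest. */
    static List<String> sourcePrefixes(String mprefix, int lowest, int highest) {
        List<String> prefixes = new ArrayList<>();
        for (int i = lowest; i <= highest; i++) {
            prefixes.add(mprefix + "_" + i);
        }
        return prefixes;
    }

    /** Prefix of the merged result, following the "highest+1" rule from the description. */
    static String mergedPrefix(String mprefix, int highest) {
        return mprefix + "_" + (highest + 1);
    }

    public static void main(String[] args) {
        System.out.println(sourcePrefixes("data", 1, 3)); // [data_1, data_2, data_3]
        System.out.println(mergedPrefix("data", 3));      // data_4
    }
}
```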
      • mergeTwoIndices

        protected static void mergeTwoIndices​(java.lang.String[] index1,
                                              java.lang.String[] index2,
                                              java.lang.String[] outputIndex,
                                              boolean blocks)
        Merge two indices.
        Parameters:
        index1 - Path/Prefix of source index 1
        index2 - Path/Prefix of source index 2
        outputIndex - Path/Prefix of destination index
        blocks - TODO
      • merge

        public static void merge​(java.lang.String mpath,
                                 java.lang.String mprefix,
                                 java.util.LinkedList<java.lang.String[]> llist,
                                 int counterMerged,
                                 boolean blocks)
        Merge a series of indices, in pair-wise fashion
        Parameters:
        mpath - Common path of all indices
        mprefix - Prefix of target index
        counterMerged - number of indices to merge
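Pair-wise merging in general can be sketched with a queue: take two entries, merge them, append the result, and repeat until one entry remains. The sketch below works on labels only. The queue order and the naming of intermediate results are assumptions and may differ from what this method actually does.

```java
import java.util.LinkedList;
import java.util.List;

final class PairwiseMergeSketch {
    /**
     * Repeatedly merges the first two entries and appends the result to the
     * end of the list until a single entry remains. Returns null for an empty list.
     */
    static String mergeAll(LinkedList<String> names) {
        int counter = 0; // numbers the intermediate results
        while (names.size() > 1) {
            String a = names.removeFirst();
            String b = names.removeFirst();
            names.addLast("merged" + (counter++) + "(" + a + "+" + b + ")");
        }
        return names.isEmpty() ? null : names.getFirst();
    }

    public static void main(String[] args) {
        LinkedList<String> names = new LinkedList<>(List.of("a", "b", "c"));
        System.out.println(mergeAll(names)); // merged1(c+merged0(a+b))
    }
}
```

Merging n entries this way always takes n - 1 pair-wise merges, regardless of the order in which pairs are chosen.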
      • finishedDirectIndexBuild

        protected void finishedDirectIndexBuild()
        event method to be overridden by child classes
      • finishedInvertedIndexBuild

        protected void finishedInvertedIndexBuild()
        event method to be overridden by child classes
      • useFieldInformation

        public boolean useFieldInformation()
        Returns whether the index will record fields.
      • indexEmpty

        protected void indexEmpty​(java.util.Map<java.lang.String,​java.lang.String> docProperties)
                           throws java.io.IOException
        Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
        Throws:
        java.io.IOException
      • main

        public static void main​(java.lang.String[] args)
                         throws java.lang.Exception
        Utility method for merging indices
        Throws:
        java.lang.Exception