Class LexiconBuilder


  • public class LexiconBuilder
    extends java.lang.Object
    Builds temporary lexicons while a collection is being indexed, and merges them once indexing of the collection has finished.
    Author:
    Craig Macdonald & Vassilis Plachouras
    • Field Detail

      • lexiconOutputStream

        protected java.lang.Class<? extends LexiconOutputStream> lexiconOutputStream
        Class to be used as the LexiconOutputStream. Set by this class and its child classes.
      • lexiconEntryFactoryValueClass

        protected final java.lang.String lexiconEntryFactoryValueClass
      • logger

        protected static final org.slf4j.Logger logger
        The logger used for this class
      • DocCount

        protected int DocCount
        How many documents have been processed so far.
      • TermCount

        protected int TermCount
        How many terms are in the final lexicon.
      • DocumentsPerLexicon

        protected static final int DocumentsPerLexicon
        The number of documents for which a temporary lexicon is created. Corresponds to property bundle.size, default value 2000.
      • tempLexFiles

        protected final java.util.LinkedList<java.lang.String> tempLexFiles
        The list in which the temporary lexicon structure names are stored. These are merged into a single Lexicon by the merge() method. A LinkedList suits this well, because every operation on the list either appends an element or removes the first element.
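        The append-and-remove-first access pattern described above can be sketched as follows. This is an illustration only, not the merge() implementation: the class name TempLexiconQueueSketch, the drain method, the fanIn parameter and the "merged" names are all hypothetical, and the merge itself is reduced to recording which names were taken together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration only: this is not Terrier's merge() implementation.
class TempLexiconQueueSketch {

    /**
     * Repeatedly removes up to fanIn names from the head of the queue, records
     * them as one merge step, and appends a placeholder name for the merged
     * result to the tail, until a single name is left.
     */
    static List<List<String>> drain(LinkedList<String> queue, int fanIn) {
        if (fanIn < 2) {
            throw new IllegalArgumentException("fanIn must be at least 2");
        }
        List<List<String>> steps = new ArrayList<>();
        int mergedCount = 0;
        while (queue.size() > 1) {
            List<String> batch = new ArrayList<>();
            while (batch.size() < fanIn && !queue.isEmpty()) {
                batch.add(queue.removeFirst());          // constant-time removal from the head
            }
            steps.add(batch);
            queue.addLast("merged" + (mergedCount++));   // constant-time append to the tail
        }
        return steps;
    }

    public static void main(String[] args) {
        LinkedList<String> names =
                new LinkedList<>(Arrays.asList("tmp0", "tmp1", "tmp2", "tmp3", "tmp4"));
        System.out.println("merge steps: " + drain(names, 2));
        System.out.println("remaining structure: " + names.getFirst());
    }
}
```

        Both operations used in the loop, removeFirst() and addLast(), run in constant time on a LinkedList, which is the property the field description relies on.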
      • TempLex

        protected LexiconMap TempLex
        The in-memory lexicon map to which the current term stream is written.
      • indexPath

        protected java.lang.String indexPath
        The directory to which the final lexicons are written.
      • indexPrefix

        protected java.lang.String indexPrefix
        The filename prefix of the lexicons.
      • TempLexCount

        protected int TempLexCount
        How many temporary lexicons have been generated so far.
      • MERGE2LEXATTIME

        protected static final boolean MERGE2LEXATTIME
        Whether lexicons should only be merged in pairs (the Terrier 1.0.x scheme). Set by property lexicon.builder.merge.2lex.attime.
      • MAXLEXMERGE

        protected static final int MAXLEXMERGE
        Number of lexicons to merge at once. Set by property lexicon.builder.merge.lex.max, default value 16.
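        The three properties named in the field descriptions (bundle.size, lexicon.builder.merge.2lex.attime and lexicon.builder.merge.lex.max) could be read along the following lines. This is a sketch under assumptions: java.util.Properties stands in for Terrier's own configuration mechanism so that the example runs on its own, the class LexiconBuilderSettingsSketch is hypothetical, the default of false for the pairwise flag is not documented here, and the fanIn() rule combining the two merge settings is a guess at how they interact.

```java
import java.util.Properties;

// Hypothetical stand-in: java.util.Properties replaces Terrier's own
// configuration lookup so that this sketch is self-contained.
class LexiconBuilderSettingsSketch {
    final int documentsPerLexicon;
    final boolean mergeTwoAtATime;
    final int maxLexiconsPerMerge;

    LexiconBuilderSettingsSketch(Properties props) {
        // Defaults of 2000 and 16 are the ones stated in the field descriptions.
        documentsPerLexicon = Integer.parseInt(props.getProperty("bundle.size", "2000"));
        maxLexiconsPerMerge =
                Integer.parseInt(props.getProperty("lexicon.builder.merge.lex.max", "16"));
        // Assumption: no default is documented for this flag, so "false" is used here.
        mergeTwoAtATime =
                Boolean.parseBoolean(props.getProperty("lexicon.builder.merge.2lex.attime", "false"));
    }

    /** Assumption: the pairwise flag, when set, overrides the maximum merge width. */
    int fanIn() {
        return mergeTwoAtATime ? 2 : maxLexiconsPerMerge;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bundle.size", "500");
        LexiconBuilderSettingsSketch settings = new LexiconBuilderSettingsSketch(props);
        System.out.println("documents per temporary lexicon: " + settings.documentsPerLexicon);
        System.out.println("lexicons merged per step: " + settings.fanIn());
    }
}
```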
      • defaultStructureName

        protected java.lang.String defaultStructureName
    • Constructor Detail

      • LexiconBuilder

        public LexiconBuilder​(IndexOnDisk i,
                              java.lang.String _structureName,
                              TermCodes tc)
        constructor
        Parameters:
        i -
        _structureName -
      • LexiconBuilder

        public LexiconBuilder​(IndexOnDisk i,
                              java.lang.String _structureName,
                              java.lang.Class<? extends LexiconMap> _LexiconMapClass,
                              java.lang.String _lexiconEntryClass,
                              TermCodes termCodes)
        constructor
        Parameters:
        i -
        _structureName -
        _LexiconMapClass -
        _lexiconEntryClass -
      • LexiconBuilder

        public LexiconBuilder​(IndexOnDisk i,
                              java.lang.String _structureName,
                              LexiconMap lexiconMap,
                              java.lang.String _lexiconEntryClass,
                              TermCodes termCodes)
        constructor
        Parameters:
        i -
        _structureName -
        lexiconMap -
        _lexiconEntryClass -
      • LexiconBuilder

        public LexiconBuilder​(IndexOnDisk i,
                              java.lang.String _structureName,
                              LexiconMap lexiconMap,
                              java.lang.String _lexiconEntryClass,
                              java.lang.String valueFactoryParamTypes,
                              java.lang.String valueFactoryParamValues,
                              TermCodes _termCodes)
        constructor
        Parameters:
        i -
        _structureName -
        lexiconMap -
        _lexiconEntryClass -
        valueFactoryParamTypes -
        valueFactoryParamValues -
    • Method Detail

      • instantiate

        protected static LexiconMap instantiate​(java.lang.Class<? extends LexiconMap> LexiconMapClass)
      • getFinalNumberOfTerms

        public int getFinalNumberOfTerms()
        Returns the number of terms in the final lexicon. Only updated once finishedDirectIndexBuild() has executed.
      • addTemporaryLexicon

        public void addTemporaryLexicon​(java.lang.String structureName)
        Deprecated.
        If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.
        Parameters:
        structureName - Full path to a lexicon to merge
      • writeTemporaryLexicon

        protected void writeTemporaryLexicon()
        Writes the current contents of the in-memory temporary lexicon (TempLex) to a temporary lexicon on disk.
      • addTerm

        public void addTerm​(java.lang.String term,
                            int tf)
        Adds a single term to the lexicon being built.
        Parameters:
        term - The String term
        tf - the frequency of the term
      • addDocumentTerms

        public void addDocumentTerms​(DocumentPostingList terms)
        Adds the terms of a document to the temporary lexicon in memory.
        Parameters:
        terms - DocumentPostingList the terms of the document to add to the temporary lexicon
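        How per-document terms might accumulate in an in-memory temporary lexicon, and be flushed after a fixed number of documents, is sketched below. This is not the class's actual code: TemporaryLexiconSketch is a hypothetical name, a plain Map of term to frequency stands in for DocumentPostingList, a TreeMap stands in for LexiconMap, an in-memory list stands in for the temporary lexicons on disk, and flushing on a document-count boundary is inferred from the DocumentsPerLexicon description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the real class uses LexiconMap and DocumentPostingList.
class TemporaryLexiconSketch {
    // Per-term statistics: [0] = documents containing the term, [1] = total occurrences.
    private TreeMap<String, int[]> current = new TreeMap<>();
    // Stands in for the temporary lexicons written to disk.
    private final List<TreeMap<String, int[]>> flushed = new ArrayList<>();
    private final int documentsPerLexicon;
    private int docCount = 0;

    TemporaryLexiconSketch(int documentsPerLexicon) {
        this.documentsPerLexicon = documentsPerLexicon;
    }

    /** Adds one document's term frequencies, flushing every documentsPerLexicon documents. */
    void addDocumentTerms(Map<String, Integer> termFrequencies) {
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            int[] stats = current.computeIfAbsent(e.getKey(), k -> new int[2]);
            stats[0] += 1;
            stats[1] += e.getValue();
        }
        docCount++;
        if (docCount % documentsPerLexicon == 0) {
            flush();
        }
    }

    /** Moves the in-memory lexicon to the flushed list and starts a fresh one. */
    void flush() {
        if (current.isEmpty()) {
            return;
        }
        flushed.add(current);
        current = new TreeMap<>();
    }

    List<TreeMap<String, int[]>> flushedLexicons() {
        return flushed;
    }

    public static void main(String[] args) {
        TemporaryLexiconSketch lexicon = new TemporaryLexiconSketch(2);
        Map<String, Integer> doc = new HashMap<>();
        doc.put("lexicon", 3);
        doc.put("merge", 1);
        lexicon.addDocumentTerms(doc);
        lexicon.addDocumentTerms(doc);
        System.out.println("temporary lexicons so far: " + lexicon.flushedLexicons().size());
    }
}
```

        Keeping the in-memory structure sorted by term means each flushed lexicon is already in term order, which is what a later sorted merge needs.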
      • flush

        public void flush()
        Forces the current temporary lexicon to be flushed.
      • finishedInvertedIndexBuild

        public void finishedInvertedIndexBuild()
        Processes the lexicon after the inverted index has been created.
      • finishedDirectIndexBuild

        public void finishedDirectIndexBuild()
        Processes the lexicon after the direct and document indexes have been created.
      • merge

        public void merge​(java.util.LinkedList<java.lang.String> filesToMerge)
                   throws java.io.IOException
        Merges the intermediate lexicon files created during the indexing.
        Parameters:
        filesToMerge - java.util.LinkedList the list containing the filenames of the temporary files.
        Throws:
        java.io.IOException - thrown if an input/output problem is encountered.
      • newLexiconEntry

        protected LexiconEntry newLexiconEntry​(int termid)
      • mergeNLexicons

        protected void mergeNLexicons​(java.util.Iterator<java.util.Map.Entry<java.lang.String,​LexiconEntry>>[] lis,
                                      LexiconOutputStream<java.lang.String> los)
                               throws java.io.IOException
        Throws:
        java.io.IOException
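        An N-way merge over sorted entry iterators is commonly written with a priority queue, as in the sketch below. It is not taken from this class: NWayLexiconMergeSketch, Head, entry and refill are hypothetical names, an Integer frequency stands in for LexiconEntry, the inputs are passed as a List rather than an array to avoid generic array creation, the result is returned as a list instead of being written to a LexiconOutputStream, and each input is assumed to be sorted by term.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration of an N-way merge; not the body of mergeNLexicons.
class NWayLexiconMergeSketch {

    /** The next unread entry of one input, plus the index of that input. */
    private static final class Head {
        final Map.Entry<String, Integer> entry;
        final int source;

        Head(Map.Entry<String, Integer> entry, int source) {
            this.entry = entry;
            this.source = source;
        }
    }

    static Map.Entry<String, Integer> entry(String term, int frequency) {
        return new AbstractMap.SimpleEntry<>(term, frequency);
    }

    /** Merges inputs that are each sorted by term, summing frequencies of equal terms. */
    static List<Map.Entry<String, Integer>> merge(List<Iterator<Map.Entry<String, Integer>>> inputs) {
        PriorityQueue<Head> heap =
                new PriorityQueue<>((a, b) -> a.entry.getKey().compareTo(b.entry.getKey()));
        for (int i = 0; i < inputs.size(); i++) {
            refill(heap, inputs, i);
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            String term = smallest.entry.getKey();
            int frequency = smallest.entry.getValue();
            refill(heap, inputs, smallest.source);
            // Fold in every other input whose current term is the same.
            while (!heap.isEmpty() && heap.peek().entry.getKey().equals(term)) {
                Head same = heap.poll();
                frequency += same.entry.getValue();
                refill(heap, inputs, same.source);
            }
            out.add(entry(term, frequency));
        }
        return out;
    }

    /** Pushes the next entry of the given input onto the heap, if it has one. */
    private static void refill(PriorityQueue<Head> heap,
                               List<Iterator<Map.Entry<String, Integer>>> inputs, int source) {
        Iterator<Map.Entry<String, Integer>> it = inputs.get(source);
        if (it.hasNext()) {
            heap.add(new Head(it.next(), source));
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> a = Arrays.asList(entry("apple", 1), entry("cat", 2));
        List<Map.Entry<String, Integer>> b = Arrays.asList(entry("bat", 5), entry("cat", 1));
        List<Map.Entry<String, Integer>> c = Arrays.asList(entry("apple", 4));
        System.out.println(merge(Arrays.asList(a.iterator(), b.iterator(), c.iterator())));
    }
}
```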
      • mergeTwoLexicons

        protected void mergeTwoLexicons​(java.util.Iterator<java.util.Map.Entry<java.lang.String,​LexiconEntry>> lis1,
                                        java.util.Iterator<java.util.Map.Entry<java.lang.String,​LexiconEntry>> lis2,
                                        LexiconOutputStream<java.lang.String> los)
                                 throws java.io.IOException
        Merges the two lexicon input streams into the given LexiconOutputStream.
        Parameters:
        lis1 - First lexicon to be merged
        lis2 - Second lexicon to be merged
        los - Lexicon to be merged to
        Throws:
        java.io.IOException
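        The pairwise case reduces to the classic merge step over two sorted sequences. The sketch below is a simplified illustration rather than this method's body: TwoWayLexiconMergeSketch, entry and next are hypothetical names, an Integer frequency stands in for LexiconEntry, the merged entries are collected in a list instead of being written to los, and both inputs are assumed to be sorted by term with no repeated terms.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a two-way merge; not the body of mergeTwoLexicons.
class TwoWayLexiconMergeSketch {

    static Map.Entry<String, Integer> entry(String term, int frequency) {
        return new AbstractMap.SimpleEntry<>(term, frequency);
    }

    /** Returns the next element of the iterator, or null once it is exhausted. */
    private static <T> T next(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }

    static List<Map.Entry<String, Integer>> merge(Iterator<Map.Entry<String, Integer>> lis1,
                                                  Iterator<Map.Entry<String, Integer>> lis2) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Map.Entry<String, Integer> e1 = next(lis1);
        Map.Entry<String, Integer> e2 = next(lis2);
        while (e1 != null && e2 != null) {
            int cmp = e1.getKey().compareTo(e2.getKey());
            if (cmp < 0) {                       // term only in the first lexicon so far
                out.add(e1);
                e1 = next(lis1);
            } else if (cmp > 0) {                // term only in the second lexicon so far
                out.add(e2);
                e2 = next(lis2);
            } else {                             // same term in both: combine the statistics
                out.add(entry(e1.getKey(), e1.getValue() + e2.getValue()));
                e1 = next(lis1);
                e2 = next(lis2);
            }
        }
        // Copy whatever remains of the input that is not yet exhausted.
        for (; e1 != null; e1 = next(lis1)) {
            out.add(e1);
        }
        for (; e2 != null; e2 = next(lis2)) {
            out.add(e2);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> first = Arrays.asList(entry("index", 2), entry("term", 7));
        List<Map.Entry<String, Integer>> second = Arrays.asList(entry("lexicon", 1), entry("term", 3));
        System.out.println(merge(first.iterator(), second.iterator()));
    }
}
```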
      • createLexiconIndex

        public static void createLexiconIndex​(IndexOnDisk index)
                                       throws java.io.IOException
        Deprecated.
        Use optimise(IndexOnDisk, String) instead.
        Creates a lexicon index for the specified index
        Parameters:
        index - IndexOnDisk to make the lexicon index for
        Throws:
        java.io.IOException
      • createLexiconHash

        public static void createLexiconHash​(IndexOnDisk index)
                                      throws java.io.IOException
        Deprecated.
        Use optimise(IndexOnDisk, String) instead.
        Creates a lexicon hash for the specified index
        Parameters:
        index - IndexOnDisk to make the lexicon hash for
        Throws:
        java.io.IOException
      • optimiseLexicon

        public void optimiseLexicon()
        Optimises the lexicon.
      • optimise

        public static void optimise​(IndexOnDisk index,
                                    java.lang.String structureName)
        Optimises the lexicon, e.g. the lexid file.
      • reAssignTermIds

        public static void reAssignTermIds​(IndexOnDisk index,
                                           java.lang.String structureName,
                                           int numEntries)
                                    throws java.io.IOException
        Re-assigns the termids within the named lexicon structure so that termids ascend as term frequency descends, i.e. the term with termid 0 has the highest frequency.
        Parameters:
        index -
        structureName -
        numEntries -
        Throws:
        java.io.IOException
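        The ordering described above (termid 0 for the most frequent term) can be illustrated in memory as follows. This sketch does not touch an index and is not the method's implementation: TermIdReassignmentSketch and reassign are hypothetical names, a Map of term to frequency stands in for the lexicon structure, and breaking ties alphabetically by term is an assumption made only to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; the real method rewrites a lexicon structure on disk.
class TermIdReassignmentSketch {

    /** Returns term -> new termid, with ids ascending as frequency descends. */
    static Map<String, Integer> reassign(Map<String, Integer> termFrequencies) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(termFrequencies.entrySet());
        entries.sort((a, b) -> {
            // Highest frequency first.
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            // Assumption: ties are broken by term so the output is deterministic.
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> termIds = new LinkedHashMap<>();
        int nextId = 0;
        for (Map.Entry<String, Integer> e : entries) {
            termIds.put(e.getKey(), nextId++);
        }
        return termIds;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        frequencies.put("lexicon", 7);
        frequencies.put("the", 100);
        frequencies.put("of", 60);
        System.out.println(reassign(frequencies));
    }
}
```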
      • getLexInputStream

        protected java.util.Iterator<java.util.Map.Entry<java.lang.String,​LexiconEntry>> getLexInputStream​(java.lang.String structureName)
                                                                                                          throws java.io.IOException
        Returns the lexicon input stream for the current index at the specified filename.
        Throws:
        java.io.IOException
      • getLexOutputStream

        protected LexiconOutputStream<java.lang.String> getLexOutputStream​(java.lang.String structureName)
                                                                    throws java.io.IOException
        Returns the lexicon output stream for the current index at the specified filename.
        Throws:
        java.io.IOException