org.terrier.structures.indexing
Class LexiconBuilder

java.lang.Object
  extended by org.terrier.structures.indexing.LexiconBuilder
Direct Known Subclasses:
BlockLexiconBuilder

public class LexiconBuilder
extends java.lang.Object

Builds temporary lexicons while a collection is being indexed, and merges them once indexing of the collection has finished.

Author:
Craig Macdonald & Vassilis Plachouras

Nested Class Summary
static class LexiconBuilder.BasicLexiconCollectionStaticticsCounter
          counts global statistics in the non-fields case
static interface LexiconBuilder.CollectionStatisticsCounter
          Counter of LexiconEntries
protected static class LexiconBuilder.FieldLexiconCollectionStaticticsCounter
          counts global statistics in the fields case
protected static class LexiconBuilder.NullCollectionStatisticsCounter
           
 
Field Summary
protected  java.lang.String defaultStructureName
           
protected  int DocCount
          How many documents have been processed so far.
protected static int DocumentsPerLexicon
          The number of documents for which a temporary lexicon is created.
protected  Index index
           
protected  java.lang.String indexPath
          The directory to write the final lexicons to
protected  java.lang.String indexPrefix
          The filename of the lexicons.
protected  java.lang.String lexiconEntryFactoryValueClass
           
protected  java.lang.Class<? extends LexiconOutputStream> lexiconOutputStream
          Class to be used as the LexiconOutputStream.
protected static org.apache.log4j.Logger logger
          The logger used for this class
protected static int MAXLEXMERGE
          Number of lexicons to merge at once.
protected static boolean MERGE2LEXATTIME
          Should we only merge lexicons in pairs (Terrier 1.0.x scheme)? Set by property lexicon.builder.merge.2lex.attime
protected  LexiconMap TempLex
          The lexicontree to write the current term stream to
protected  int TempLexCount
          How many temporary lexicons have been generated so far
protected  java.util.LinkedList<java.lang.String> tempLexFiles
          The LinkedList in which the temporary lexicon structure names are stored.
protected  int TermCount
          How many terms are in the final lexicon
protected  FixedSizeWriteableFactory<LexiconEntry> valueFactory
           
 
Constructor Summary
LexiconBuilder(Index i, java.lang.String _structureName)
          constructor
LexiconBuilder(Index i, java.lang.String _structureName, java.lang.Class<? extends LexiconMap> _LexiconMapClass, java.lang.String _lexiconEntryClass)
          constructor
LexiconBuilder(Index i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass)
          constructor
 
Method Summary
 void addDocumentTerms(DocumentPostingList terms)
          adds the terms of a document to the temporary lexicon in memory.
 void addTemporaryLexicon(java.lang.String structureName)
          Deprecated.  
 void addTerm(java.lang.String term, int tf)
          Add a single term to the lexicon being built
static void createLexiconHash(Index index)
          Deprecated. use optimise instead
static void createLexiconIndex(Index index)
          Deprecated. use optimise instead
 void finishedDirectIndexBuild()
          Processes the lexicon once the direct and document indexes have been created.
 void finishedInvertedIndexBuild()
          Processes the lexicon once the inverted index has been created.
 void flush()
          Force a temporary lexicon to be flushed
 int getFinalNumberOfTerms()
          Returns the number of terms in the final lexicon.
protected  java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> getLexInputStream(java.lang.String structureName)
          Returns the lexicon input stream for the current index at the specified filename.
protected  LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String structureName)
          Returns the lexicon output stream for the current index at the specified filename.
protected static LexiconMap instantiate(java.lang.Class<? extends LexiconMap> LexiconMapClass)
           
 void merge(java.util.LinkedList<java.lang.String> filesToMerge)
          Merges the intermediate lexicon files created during the indexing.
protected  void mergeNLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>>[] lis, LexiconOutputStream<java.lang.String> los)
           
protected  void mergeTwoLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis1, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis2, LexiconOutputStream<java.lang.String> los)
          Merge the two LexiconInputStreams into the given LexiconOutputStream
protected  LexiconEntry newLexiconEntry(int termid)
           
static void optimise(Index index, java.lang.String structureName)
          Optimises the lexicon, e.g. the lexid file
 void optimiseLexicon()
          optimise the lexicon
protected  void writeTemporaryLexicon()
          Writes the current contents of the TempLex temporary lexicon binary tree to a temporary lexicon on disk.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lexiconOutputStream

protected java.lang.Class<? extends LexiconOutputStream> lexiconOutputStream
Class to be used as the LexiconOutputStream. Set by this class and its child classes.


lexiconEntryFactoryValueClass

protected final java.lang.String lexiconEntryFactoryValueClass

logger

protected static final org.apache.log4j.Logger logger
The logger used for this class


DocCount

protected int DocCount
How many documents have been processed so far.


TermCount

protected int TermCount
How many terms are in the final lexicon


DocumentsPerLexicon

protected static final int DocumentsPerLexicon
The number of documents for which a temporary lexicon is created. Corresponds to property bundle.size, default value 2000.


tempLexFiles

protected final java.util.LinkedList<java.lang.String> tempLexFiles
The LinkedList in which the temporary lexicon structure names are stored. These are merged into a single Lexicon by the merge() method. LinkedList suits this use because every operation either appends an element or removes the first element, both of which are constant-time on a LinkedList.
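
The append-and-remove-first access pattern described for tempLexFiles can be sketched as follows. This is an illustration only, not Terrier's code: the class MergeQueueSketch, the method mergeAll, and the generated names are hypothetical, and the real merging of lexicon contents is replaced by joining the structure names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical sketch: a LinkedList used as a queue of structure names awaiting merge. */
final class MergeQueueSketch {

    /**
     * Removes up to maxMerge names from the head of the queue, combines them
     * (a stand-in for a real lexicon merge), and appends the combined name to
     * the tail. Repeats until one name is left; returns the number of passes.
     */
    static int mergeAll(LinkedList<String> queue, int maxMerge) {
        if (maxMerge < 2) {
            throw new IllegalArgumentException("maxMerge must be at least 2");
        }
        int passes = 0;
        while (queue.size() > 1) {
            List<String> batch = new ArrayList<>();
            while (batch.size() < maxMerge && !queue.isEmpty()) {
                batch.add(queue.removeFirst()); // remove the first element
            }
            queue.addLast("(" + String.join("+", batch) + ")"); // append an element
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        LinkedList<String> queue =
                new LinkedList<>(Arrays.asList("tmp0", "tmp1", "tmp2", "tmp3", "tmp4"));
        int passes = mergeAll(queue, 2);
        System.out.println(passes + " passes, result: " + queue.getFirst());
    }
}
```

Only removeFirst and addLast touch the list, which is why a linked list is sufficient here; no random access is needed.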


TempLex

protected LexiconMap TempLex
The lexicontree to write the current term stream to


indexPath

protected java.lang.String indexPath
The directory to write the final lexicons to


indexPrefix

protected java.lang.String indexPrefix
The filename of the lexicons.


index

protected Index index

TempLexCount

protected int TempLexCount
How many temporary lexicons have been generated so far


MERGE2LEXATTIME

protected static final boolean MERGE2LEXATTIME
Should we only merge lexicons in pairs (Terrier 1.0.x scheme)? Set by property lexicon.builder.merge.2lex.attime


MAXLEXMERGE

protected static final int MAXLEXMERGE
Number of lexicons to merge at once. Set by property lexicon.builder.merge.lex.max, defaults to 16
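
How the two merge-related properties could interact is sketched below. This is a minimal sketch under stated assumptions: java.util.Properties stands in for Terrier's own configuration mechanism, the class MergeSettingsSketch and its method lexiconsPerMerge are hypothetical, and the default of false for lexicon.builder.merge.2lex.attime is assumed because the descriptions above do not state one. The property names and the default of 16 are taken from the field descriptions.

```java
import java.util.Properties;

/** Hypothetical sketch: java.util.Properties stands in for Terrier's configuration. */
final class MergeSettingsSketch {
    final boolean mergeTwoAtATime;
    final int maxLexMerge;

    MergeSettingsSketch(Properties props) {
        // The "false" default is an assumption; the documentation does not state it.
        this.mergeTwoAtATime = Boolean.parseBoolean(
                props.getProperty("lexicon.builder.merge.2lex.attime", "false"));
        // The default of 16 comes from the MAXLEXMERGE description.
        this.maxLexMerge = Integer.parseInt(
                props.getProperty("lexicon.builder.merge.lex.max", "16"));
    }

    /** Number of lexicons combined per merge step under these settings. */
    int lexiconsPerMerge() {
        return mergeTwoAtATime ? 2 : maxLexMerge;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("lexicon.builder.merge.lex.max", "4");
        System.out.println(new MergeSettingsSketch(props).lexiconsPerMerge());
    }
}
```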


defaultStructureName

protected java.lang.String defaultStructureName

valueFactory

protected FixedSizeWriteableFactory<LexiconEntry> valueFactory

Constructor Detail

LexiconBuilder

public LexiconBuilder(Index i,
                      java.lang.String _structureName)
constructor

Parameters:
i -
_structureName -

LexiconBuilder

public LexiconBuilder(Index i,
                      java.lang.String _structureName,
                      java.lang.Class<? extends LexiconMap> _LexiconMapClass,
                      java.lang.String _lexiconEntryClass)
constructor

Parameters:
i -
_structureName -
_LexiconMapClass -
_lexiconEntryClass -

LexiconBuilder

public LexiconBuilder(Index i,
                      java.lang.String _structureName,
                      LexiconMap lexiconMap,
                      java.lang.String _lexiconEntryClass)
constructor

Parameters:
i -
_structureName -
lexiconMap -
_lexiconEntryClass -

Method Detail

instantiate

protected static LexiconMap instantiate(java.lang.Class<? extends LexiconMap> LexiconMapClass)

getFinalNumberOfTerms

public int getFinalNumberOfTerms()
Returns the number of terms in the final lexicon. Only updated once finishedDirectIndexBuild() has executed.


addTemporaryLexicon

public void addTemporaryLexicon(java.lang.String structureName)
Deprecated. 

If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.

Parameters:
structureName - Full path to a lexicon to merge

writeTemporaryLexicon

protected void writeTemporaryLexicon()
Writes the current contents of the TempLex temporary lexicon binary tree to a temporary lexicon on disk.


addTerm

public void addTerm(java.lang.String term,
                    int tf)
Add a single term to the lexicon being built

Parameters:
term - The String term
tf - the frequency of the term

addDocumentTerms

public void addDocumentTerms(DocumentPostingList terms)
adds the terms of a document to the temporary lexicon in memory.

Parameters:
terms - DocumentPostingList the terms of the document to add to the temporary lexicon
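
The interplay between addDocumentTerms, flush and DocumentsPerLexicon might look like the sketch below: term frequencies accumulate in a sorted in-memory map, and a temporary lexicon is emitted after every fixed number of documents. This is an assumption-laden illustration, not Terrier's implementation. TreeMap stands in for LexiconMap, a plain Map stands in for DocumentPostingList, flushed lexicons are kept in a list instead of being written to disk, and the class name TempLexiconSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of accumulating terms in memory and flushing periodically. */
final class TempLexiconSketch {
    private final int documentsPerLexicon;
    private TreeMap<String, Integer> tempLex = new TreeMap<>(); // stand-in for LexiconMap
    private int docCount = 0;
    /** Stand-in for temporary lexicons on disk. */
    final List<TreeMap<String, Integer>> flushed = new ArrayList<>();

    TempLexiconSketch(int documentsPerLexicon) {
        this.documentsPerLexicon = documentsPerLexicon;
    }

    /** Adds one document's term frequencies; flushes after every documentsPerLexicon documents. */
    void addDocumentTerms(Map<String, Integer> terms) {
        for (Map.Entry<String, Integer> e : terms.entrySet()) {
            tempLex.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        docCount++;
        if (docCount % documentsPerLexicon == 0) {
            flush();
        }
    }

    /** Forces the in-memory lexicon out, if it holds any terms. */
    void flush() {
        if (!tempLex.isEmpty()) {
            flushed.add(tempLex);
            tempLex = new TreeMap<>();
        }
    }

    public static void main(String[] args) {
        TempLexiconSketch builder = new TempLexiconSketch(2);
        builder.addDocumentTerms(Collections.singletonMap("terrier", 3));
        builder.addDocumentTerms(Collections.singletonMap("terrier", 1));
        builder.flush();
        System.out.println(builder.flushed);
    }
}
```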

flush

public void flush()
Force a temporary lexicon to be flushed


finishedInvertedIndexBuild

public void finishedInvertedIndexBuild()
Processes the lexicon once the inverted index has been created.


finishedDirectIndexBuild

public void finishedDirectIndexBuild()
Processes the lexicon once the direct and document indexes have been created.


merge

public void merge(java.util.LinkedList<java.lang.String> filesToMerge)
           throws java.io.IOException
Merges the intermediate lexicon files created during the indexing.

Parameters:
filesToMerge - java.util.LinkedList the list containing the filenames of the temporary files.
Throws:
java.io.IOException - an input/output exception is thrown if a problem is encountered.

newLexiconEntry

protected LexiconEntry newLexiconEntry(int termid)

mergeNLexicons

protected void mergeNLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>>[] lis,
                              LexiconOutputStream<java.lang.String> los)
                       throws java.io.IOException
Throws:
java.io.IOException
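
No description is given for mergeNLexicons. One common way to merge several term-sorted streams in a single pass is a heap-based N-way merge, sketched below. This illustrates that general technique and is not taken from Terrier: the class NWayMergeSketch is hypothetical, an Integer frequency stands in for LexiconEntry, a returned list stands in for LexiconOutputStream, and summing the values of equal terms is an assumption about how entries are combined.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical sketch: heap-based merge of N iterators, each sorted by term. */
final class NWayMergeSketch {

    /** The current entry of one input, together with the iterator it came from. */
    private static final class Head {
        final Map.Entry<String, Integer> entry;
        final Iterator<Map.Entry<String, Integer>> source;

        Head(Map.Entry<String, Integer> entry, Iterator<Map.Entry<String, Integer>> source) {
            this.entry = entry;
            this.source = source;
        }
    }

    static List<Map.Entry<String, Integer>> merge(
            List<Iterator<Map.Entry<String, Integer>>> inputs) {
        PriorityQueue<Head> heap = new PriorityQueue<>(
                (a, b) -> a.entry.getKey().compareTo(b.entry.getKey()));
        for (Iterator<Map.Entry<String, Integer>> it : inputs) {
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            String term = heap.peek().entry.getKey();
            int tf = 0;
            // Drain every head that carries the smallest term, refilling from its source.
            while (!heap.isEmpty() && heap.peek().entry.getKey().equals(term)) {
                Head h = heap.poll();
                tf += h.entry.getValue();
                if (h.source.hasNext()) {
                    heap.add(new Head(h.source.next(), h.source));
                }
            }
            out.add(new SimpleEntry<>(term, tf));
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> first = new TreeMap<>();
        first.put("index", 2);
        TreeMap<String, Integer> second = new TreeMap<>();
        second.put("index", 1);
        second.put("lexicon", 4);
        List<Iterator<Map.Entry<String, Integer>>> inputs = new ArrayList<>();
        inputs.add(first.entrySet().iterator());
        inputs.add(second.entrySet().iterator());
        System.out.println(merge(inputs));
    }
}
```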

mergeTwoLexicons

protected void mergeTwoLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis1,
                                java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis2,
                                LexiconOutputStream<java.lang.String> los)
                         throws java.io.IOException
Merge the two LexiconInputStreams into the given LexiconOutputStream

Parameters:
lis1 - First lexicon to be merged
lis2 - Second lexicon to be merged
los - Lexicon to be merged to
Throws:
java.io.IOException
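
A pairwise merge of two term-sorted streams can be sketched as below. This is an illustration under assumptions, not Terrier's code: the class TwoWayMergeSketch is hypothetical, an Integer frequency stands in for LexiconEntry, a returned list stands in for the LexiconOutputStream, and adding the values of a term present in both inputs is an assumed way of combining entries.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: merges two iterators that are each sorted by term. */
final class TwoWayMergeSketch {

    static List<Map.Entry<String, Integer>> merge(
            Iterator<Map.Entry<String, Integer>> lis1,
            Iterator<Map.Entry<String, Integer>> lis2) {
        List<Map.Entry<String, Integer>> los = new ArrayList<>();
        Map.Entry<String, Integer> e1 = advance(lis1);
        Map.Entry<String, Integer> e2 = advance(lis2);
        while (e1 != null && e2 != null) {
            int cmp = e1.getKey().compareTo(e2.getKey());
            if (cmp < 0) {            // term only in the first input so far
                los.add(new SimpleEntry<>(e1));
                e1 = advance(lis1);
            } else if (cmp > 0) {     // term only in the second input so far
                los.add(new SimpleEntry<>(e2));
                e2 = advance(lis2);
            } else {                  // same term in both inputs: combine
                los.add(new SimpleEntry<>(e1.getKey(), e1.getValue() + e2.getValue()));
                e1 = advance(lis1);
                e2 = advance(lis2);
            }
        }
        // Copy whatever remains in the input that is not yet exhausted.
        for (; e1 != null; e1 = advance(lis1)) {
            los.add(new SimpleEntry<>(e1));
        }
        for (; e2 != null; e2 = advance(lis2)) {
            los.add(new SimpleEntry<>(e2));
        }
        return los;
    }

    /** Returns the next entry, or null when the iterator is exhausted. */
    private static Map.Entry<String, Integer> advance(Iterator<Map.Entry<String, Integer>> it) {
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> first = new TreeMap<>();
        first.put("index", 2);
        TreeMap<String, Integer> second = new TreeMap<>();
        second.put("index", 1);
        second.put("lexicon", 4);
        System.out.println(merge(first.entrySet().iterator(), second.entrySet().iterator()));
    }
}
```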

createLexiconIndex

public static void createLexiconIndex(Index index)
                               throws java.io.IOException
Deprecated. use optimise instead

Creates a lexicon index for the specified index

Parameters:
index - Index to make the lexicon index for
Throws:
java.io.IOException

createLexiconHash

public static void createLexiconHash(Index index)
                              throws java.io.IOException
Deprecated. use optimise instead

Creates a lexicon hash for the specified index

Parameters:
index - Index to make the lexicon hash for
Throws:
java.io.IOException

optimiseLexicon

public void optimiseLexicon()
optimise the lexicon


optimise

public static void optimise(Index index,
                            java.lang.String structureName)
Optimises the lexicon, e.g. the lexid file


getLexInputStream

protected java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> getLexInputStream(java.lang.String structureName)
                                                                                            throws java.io.IOException
Returns the lexicon input stream for the current index at the specified filename.

Throws:
java.io.IOException

getLexOutputStream

protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String structureName)
                                                            throws java.io.IOException
Returns the lexicon output stream for the current index at the specified filename.

Throws:
java.io.IOException


Terrier 3.5. Copyright © 2004-2011 University of Glasgow