Package org.terrier.structures.indexing
Class LexiconBuilder
- java.lang.Object
-
- org.terrier.structures.indexing.LexiconBuilder
-
public class LexiconBuilder extends java.lang.Object
Builds temporary lexicons while a collection is being indexed and merges them once indexing of the collection has finished.
- Author:
- Craig Macdonald & Vassilis Plachouras
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class: Description)
- static class LexiconBuilder.BasicLexiconCollectionStaticticsCounter: counts global statistics in the non-fields case
- static interface LexiconBuilder.CollectionStatisticsCounter: Counter of LexiconEntries
- protected static class LexiconBuilder.FieldLexiconCollectionStaticticsCounter: counts global statistics in the fields case
- protected static class LexiconBuilder.NullCollectionStatisticsCounter
-
Field Summary
Fields (Modifier and Type, Field: Description)
- protected java.lang.String defaultStructureName
- protected int DocCount: How many documents have been processed so far.
- protected static int DocumentsPerLexicon: The number of documents for which a temporary lexicon is created.
- protected IndexOnDisk index
- protected java.lang.String indexPath: The directory to write the final lexicons to.
- protected java.lang.String indexPrefix: The filename of the lexicons.
- protected java.lang.String lexiconEntryFactoryValueClass
- protected java.lang.Class<? extends LexiconOutputStream> lexiconOutputStream: Class to be used as a LexiconOutputStream.
- protected static org.slf4j.Logger logger: The logger used for this class.
- protected static int MAXLEXMERGE: Number of lexicons to merge at once.
- protected MemoryChecker memCheck
- protected static boolean MERGE2LEXATTIME: Should we only merge lexicons in pairs (Terrier 1.0.x scheme)? Set by property lexicon.builder.merge.2lex.attime
- protected LexiconMap TempLex: The lexicon tree to write the current term stream to.
- protected int TempLexCount: How many temporary lexicons have been generated so far.
- protected java.util.LinkedList<java.lang.String> tempLexFiles: The list in which the temporary lexicon structure names are stored.
- protected TermCodes termCodes
- protected int TermCount: How many terms are in the final lexicon.
- protected FixedSizeWriteableFactory<LexiconEntry> valueFactory
-
Constructor Summary
Constructors (Constructor: Description)
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, java.lang.Class<? extends LexiconMap> _LexiconMapClass, java.lang.String _lexiconEntryClass, TermCodes termCodes): constructor
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, java.lang.String valueFactoryParamTypes, java.lang.String valueFactoryParamValues, TermCodes _termCodes): constructor
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, TermCodes termCodes): constructor
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, TermCodes tc): constructor
-
Method Summary
Methods (Modifier and Type, Method: Description)
- void addDocumentTerms(DocumentPostingList terms): Adds the terms of a document to the temporary lexicon in memory.
- void addTemporaryLexicon(java.lang.String structureName): Deprecated.
- void addTerm(java.lang.String term, int tf): Adds a single term to the lexicon being built.
- static void createLexiconHash(IndexOnDisk index): Deprecated. Use optimise instead.
- static void createLexiconIndex(IndexOnDisk index): Deprecated. Use optimise instead.
- void finishedDirectIndexBuild(): Processes the lexicon after the direct and document indexes have been created.
- void finishedInvertedIndexBuild(): Processes the lexicon after the inverted index has been created.
- void flush(): Forces a temporary lexicon to be flushed.
- int getFinalNumberOfTerms(): Returns the number of terms in the final lexicon.
- protected java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> getLexInputStream(java.lang.String structureName): Returns the lexicon input stream for the current index at the specified filename.
- protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String structureName): Returns the lexicon output stream for the current index at the specified filename.
- protected static LexiconMap instantiate(java.lang.Class<? extends LexiconMap> LexiconMapClass)
- void merge(java.util.LinkedList<java.lang.String> filesToMerge): Merges the intermediate lexicon files created during indexing.
- protected void mergeNLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>>[] lis, LexiconOutputStream<java.lang.String> los)
- protected void mergeTwoLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis1, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis2, LexiconOutputStream<java.lang.String> los): Merges the two LexiconInputStreams into the given LexiconOutputStream.
- protected LexiconEntry newLexiconEntry(int termid)
- static void optimise(IndexOnDisk index, java.lang.String structureName): Optimises the lexicon, e.g. the lexid file.
- void optimiseLexicon(): Optimises the lexicon.
- static void reAssignTermIds(IndexOnDisk index, java.lang.String structureName, int numEntries): Re-assigns the termids within the named lexicon structure so that termids ascend as term frequency descends.
- protected void writeTemporaryLexicon(): Writes the current contents of the TempLex temporary lexicon binary tree to a temporary disk lexicon.
-
-
-
Field Detail
-
lexiconOutputStream
protected java.lang.Class<? extends LexiconOutputStream> lexiconOutputStream
Class to be used as a LexiconOutputStream. Set by this class and its child classes.
-
lexiconEntryFactoryValueClass
protected final java.lang.String lexiconEntryFactoryValueClass
-
logger
protected static final org.slf4j.Logger logger
The logger used for this class
-
DocCount
protected int DocCount
How many documents have been processed so far.
-
TermCount
protected int TermCount
How many terms are in the final lexicon
-
DocumentsPerLexicon
protected static final int DocumentsPerLexicon
The number of documents for which a temporary lexicon is created. Corresponds to property bundle.size, default value 2000.
-
tempLexFiles
protected final java.util.LinkedList<java.lang.String> tempLexFiles
The list in which the temporary lexicon structure names are stored. These are merged into a single Lexicon by the merge() method. A LinkedList is the best List implementation here, because every operation either appends an element or removes the first element.
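The append-at-tail, remove-from-head access pattern described for tempLexFiles can be sketched as below. This is a minimal illustration and not Terrier's code: the class name TempLexQueueSketch is hypothetical, and building a new string stands in for actually merging two on-disk lexicons.

```java
import java.util.LinkedList;

// Hypothetical stand-in: structure names are plain strings, and "merging"
// two of them only produces a new name recording what was combined.
class TempLexQueueSketch {
    static String mergeAll(LinkedList<String> names) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        int round = 0;
        // Only two list operations are needed: removeFirst() and addLast().
        while (names.size() > 1) {
            String a = names.removeFirst();
            String b = names.removeFirst();
            names.addLast("merge" + (round++) + "(" + a + "," + b + ")");
        }
        return names.getFirst();
    }

    public static void main(String[] args) {
        LinkedList<String> names = new LinkedList<>();
        names.add("lex0");
        names.add("lex1");
        names.add("lex2");
        System.out.println(mergeAll(names));
    }
}
```

Both operations are constant-time on a LinkedList, which is the property the field description relies on.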
-
TempLex
protected LexiconMap TempLex
The lexicon tree to which the current term stream is written.
-
termCodes
protected TermCodes termCodes
-
indexPath
protected java.lang.String indexPath
The directory to write the final lexicons to
-
indexPrefix
protected java.lang.String indexPrefix
The filename of the lexicons.
-
index
protected IndexOnDisk index
-
TempLexCount
protected int TempLexCount
How many temporary lexicons have been generated so far
-
MERGE2LEXATTIME
protected static final boolean MERGE2LEXATTIME
Should we only merge lexicons in pairs (Terrier 1.0.x scheme)? Set by property lexicon.builder.merge.2lex.attime
-
MAXLEXMERGE
protected static final int MAXLEXMERGE
Number of lexicons to merge at once. Set by property lexicon.builder.merge.lex.max, defaults to 16
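As a rough sketch of how the two merge properties might be read, the example below uses plain java.util.Properties rather than Terrier's own configuration mechanism. The class name MergeSettingsSketch, the "false" default for lexicon.builder.merge.2lex.attime, and the batchSize() helper are assumptions made for illustration; only the property names and the default of 16 come from the field descriptions.

```java
import java.util.Properties;

// Hypothetical holder for the two merge-related settings.
class MergeSettingsSketch {
    final boolean mergeTwoAtATime;
    final int maxLexMerge;

    MergeSettingsSketch(Properties props) {
        // Assumption: pairwise merging is off unless explicitly enabled.
        mergeTwoAtATime = Boolean.parseBoolean(
            props.getProperty("lexicon.builder.merge.2lex.attime", "false"));
        // Documented default: 16 lexicons merged at once.
        maxLexMerge = Integer.parseInt(
            props.getProperty("lexicon.builder.merge.lex.max", "16"));
    }

    // Assumption: pairwise mode means batches of 2, otherwise up to maxLexMerge.
    int batchSize() {
        return mergeTwoAtATime ? 2 : maxLexMerge;
    }

    public static void main(String[] args) {
        MergeSettingsSketch s = new MergeSettingsSketch(new Properties());
        System.out.println(s.batchSize());
    }
}
```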
-
defaultStructureName
protected java.lang.String defaultStructureName
-
valueFactory
protected FixedSizeWriteableFactory<LexiconEntry> valueFactory
-
memCheck
protected MemoryChecker memCheck
-
-
Constructor Detail
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, TermCodes tc)
Constructor.
- Parameters:
i -
_structureName -
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, java.lang.Class<? extends LexiconMap> _LexiconMapClass, java.lang.String _lexiconEntryClass, TermCodes termCodes)
Constructor.
- Parameters:
i -
_structureName -
_LexiconMapClass -
_lexiconEntryClass -
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, TermCodes termCodes)
Constructor.
- Parameters:
i -
_structureName -
lexiconMap -
_lexiconEntryClass -
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, java.lang.String valueFactoryParamTypes, java.lang.String valueFactoryParamValues, TermCodes _termCodes)
Constructor.
- Parameters:
i -
_structureName -
lexiconMap -
_lexiconEntryClass -
valueFactoryParamTypes -
valueFactoryParamValues -
-
-
Method Detail
-
instantiate
protected static LexiconMap instantiate(java.lang.Class<? extends LexiconMap> LexiconMapClass)
-
getFinalNumberOfTerms
public int getFinalNumberOfTerms()
Returns the number of terms in the final lexicon. Only updated once finishedDirectIndexBuild() has executed.
-
addTemporaryLexicon
public void addTemporaryLexicon(java.lang.String structureName)
Deprecated. If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.
- Parameters:
structureName - Full path to a lexicon to merge
-
writeTemporaryLexicon
protected void writeTemporaryLexicon()
Writes the current contents of TempLex temporary lexicon binary tree down to a temporary disk lexicon.
-
addTerm
public void addTerm(java.lang.String term, int tf)
Adds a single term to the lexicon being built.
- Parameters:
term - The String term
tf - the frequency of the term
-
addDocumentTerms
public void addDocumentTerms(DocumentPostingList terms)
Adds the terms of a document to the temporary lexicon in memory.
- Parameters:
terms - DocumentPostingList containing the terms of the document to add to the temporary lexicon
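The interaction between addDocumentTerms, DocCount, and DocumentsPerLexicon might look roughly like the sketch below. It is an illustration under stated assumptions, not the class's implementation: BundledLexiconSketch is a hypothetical name, a Map of term to frequency stands in for DocumentPostingList, an in-memory list stands in for temporary lexicons on disk, and the flush-every-N-documents trigger is assumed from the DocumentsPerLexicon description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of bundling documents into temporary lexicons.
class BundledLexiconSketch {
    private final int documentsPerLexicon;
    private TreeMap<String, Integer> tempLex = new TreeMap<>();
    private int docCount = 0;
    // Stand-in for the temporary lexicons that would be written to disk.
    final List<TreeMap<String, Integer>> flushed = new ArrayList<>();

    BundledLexiconSketch(int documentsPerLexicon) {
        this.documentsPerLexicon = documentsPerLexicon;
    }

    void addDocumentTerms(Map<String, Integer> docTerms) {
        // Accumulate each term's frequency in the sorted in-memory map.
        for (Map.Entry<String, Integer> e : docTerms.entrySet()) {
            tempLex.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        docCount++;
        // Assumption: a temporary lexicon is emitted every N documents.
        if (docCount % documentsPerLexicon == 0) {
            flush();
        }
    }

    void flush() {
        if (tempLex.isEmpty()) {
            return;
        }
        flushed.add(tempLex);
        tempLex = new TreeMap<>();
    }

    public static void main(String[] args) {
        BundledLexiconSketch b = new BundledLexiconSketch(2);
        Map<String, Integer> doc = new HashMap<>();
        doc.put("terrier", 3);
        b.addDocumentTerms(doc);
        b.addDocumentTerms(doc);
        System.out.println(b.flushed);
    }
}
```

Keeping the in-memory map sorted means each flushed lexicon is already in term order, which is what a later merge step would need.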
-
flush
public void flush()
Force a temporary lexicon to be flushed
-
finishedInvertedIndexBuild
public void finishedInvertedIndexBuild()
Processes the lexicon after the inverted index has been created.
-
finishedDirectIndexBuild
public void finishedDirectIndexBuild()
Processes the lexicon after the direct and document indexes have been created.
-
merge
public void merge(java.util.LinkedList<java.lang.String> filesToMerge) throws java.io.IOException
Merges the intermediate lexicon files created during indexing.
- Parameters:
filesToMerge - java.util.LinkedList the list containing the filenames of the temporary files.
- Throws:
java.io.IOException - an input/output exception is thrown if a problem is encountered.
-
newLexiconEntry
protected LexiconEntry newLexiconEntry(int termid)
-
mergeNLexicons
protected void mergeNLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>>[] lis, LexiconOutputStream<java.lang.String> los) throws java.io.IOException
- Throws:
java.io.IOException
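mergeNLexicons carries no description, but a generic N-way merge of sorted term streams can be sketched with a priority queue. The example below is an assumption-laden illustration rather than the method's actual logic: NWayMergeSketch is a hypothetical name, Integer frequencies stand in for LexiconEntry, a List of iterators replaces the array parameter, a returned list replaces the LexiconOutputStream, and summing the frequencies of equal terms is assumed.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class NWayMergeSketch {
    // The current head entry of one input, tagged with the input's index.
    private static final class Head {
        final Map.Entry<String, Integer> entry;
        final int source;
        Head(Map.Entry<String, Integer> entry, int source) {
            this.entry = entry;
            this.source = source;
        }
    }

    // Each input must yield terms in ascending order.
    static List<Map.Entry<String, Integer>> merge(
            List<Iterator<Map.Entry<String, Integer>>> inputs) {
        PriorityQueue<Head> heap = new PriorityQueue<Head>(
            (x, y) -> x.entry.getKey().compareTo(y.entry.getKey()));
        for (int i = 0; i < inputs.size(); i++) {
            refill(heap, inputs, i);
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            String term = h.entry.getKey();
            int freq = h.entry.getValue();
            refill(heap, inputs, h.source);
            // Fold in every other head carrying the same term.
            while (!heap.isEmpty() && heap.peek().entry.getKey().equals(term)) {
                Head same = heap.poll();
                freq += same.entry.getValue();
                refill(heap, inputs, same.source);
            }
            out.add(new SimpleEntry<String, Integer>(term, freq));
        }
        return out;
    }

    // Push the next entry of the given input onto the heap, if it has one.
    private static void refill(PriorityQueue<Head> heap,
            List<Iterator<Map.Entry<String, Integer>>> inputs, int source) {
        Iterator<Map.Entry<String, Integer>> it = inputs.get(source);
        if (it.hasNext()) {
            heap.add(new Head(it.next(), source));
        }
    }
}
```

The heap holds at most one entry per input, so memory use depends on the number of lexicons being merged rather than on their size.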
-
mergeTwoLexicons
protected void mergeTwoLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis1, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis2, LexiconOutputStream<java.lang.String> los) throws java.io.IOException
Merges the two LexiconInputStreams into the given LexiconOutputStream.
- Parameters:
lis1 - First lexicon to be merged
lis2 - Second lexicon to be merged
los - Lexicon to be merged to
- Throws:
java.io.IOException
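A two-way merge of sorted term streams, in the spirit of mergeTwoLexicons, might be sketched as follows. This is not the method's source: TwoWayMergeSketch is a hypothetical name, Integer frequencies stand in for LexiconEntry, the returned list stands in for the LexiconOutputStream, and adding the frequencies of a term present in both inputs is an assumption.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoWayMergeSketch {
    // Both inputs must yield terms in ascending order, each term at most once.
    static List<Map.Entry<String, Integer>> merge(
            Iterator<Map.Entry<String, Integer>> lis1,
            Iterator<Map.Entry<String, Integer>> lis2) {
        List<Map.Entry<String, Integer>> los = new ArrayList<>();
        Map.Entry<String, Integer> e1 = next(lis1);
        Map.Entry<String, Integer> e2 = next(lis2);
        while (e1 != null && e2 != null) {
            int cmp = e1.getKey().compareTo(e2.getKey());
            if (cmp < 0) {
                los.add(e1);
                e1 = next(lis1);
            } else if (cmp > 0) {
                los.add(e2);
                e2 = next(lis2);
            } else {
                // Same term in both inputs: combine into one output entry.
                los.add(new SimpleEntry<String, Integer>(
                    e1.getKey(), e1.getValue() + e2.getValue()));
                e1 = next(lis1);
                e2 = next(lis2);
            }
        }
        // Drain whichever input still has entries.
        for (; e1 != null; e1 = next(lis1)) {
            los.add(e1);
        }
        for (; e2 != null; e2 = next(lis2)) {
            los.add(e2);
        }
        return los;
    }

    // Returns the next entry, or null once the iterator is exhausted.
    private static Map.Entry<String, Integer> next(
            Iterator<Map.Entry<String, Integer>> it) {
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> a = new TreeMap<>();
        a.put("index", 2);
        TreeMap<String, Integer> b = new TreeMap<>();
        b.put("index", 1);
        b.put("lexicon", 4);
        System.out.println(merge(a.entrySet().iterator(), b.entrySet().iterator()));
    }
}
```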
-
createLexiconIndex
public static void createLexiconIndex(IndexOnDisk index) throws java.io.IOException
Deprecated. Use optimise instead.
Creates a lexicon index for the specified index.
- Parameters:
index - IndexOnDisk to make the lexicon index for
- Throws:
java.io.IOException
-
createLexiconHash
public static void createLexiconHash(IndexOnDisk index) throws java.io.IOException
Deprecated. Use optimise instead.
Creates a lexicon hash for the specified index.
- Parameters:
index - IndexOnDisk to make the lexicon hash for
- Throws:
java.io.IOException
-
optimiseLexicon
public void optimiseLexicon()
Optimises the lexicon.
-
optimise
public static void optimise(IndexOnDisk index, java.lang.String structureName)
Optimises the lexicon, e.g. the lexid file.
-
reAssignTermIds
public static void reAssignTermIds(IndexOnDisk index, java.lang.String structureName, int numEntries) throws java.io.IOException
Re-assigns the termids within the named lexicon structure so that termids ascend as term frequency descends, i.e. the term with termid 0 will have the highest frequency.
- Parameters:
index -
structureName -
numEntries -
- Throws:
java.io.IOException
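The ordering that reAssignTermIds describes can be illustrated on an in-memory map. The sketch below is not how the method operates on an on-disk lexicon: TermIdReassignSketch is a hypothetical name, a Map of term to frequency stands in for the lexicon structure, and breaking frequency ties alphabetically is an assumption the description does not state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermIdReassignSketch {
    // Returns term -> new termid, with termid 0 given to the most frequent term.
    static Map<String, Integer> reassign(Map<String, Integer> termFrequencies) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(termFrequencies.entrySet());
        entries.sort((x, y) -> {
            // Highest frequency first.
            int byFreq = Integer.compare(y.getValue(), x.getValue());
            // Assumption: ties are broken by term, in ascending order.
            return byFreq != 0 ? byFreq : x.getKey().compareTo(y.getKey());
        });
        Map<String, Integer> ids = new LinkedHashMap<>();
        int nextId = 0;
        for (Map.Entry<String, Integer> e : entries) {
            ids.put(e.getKey(), nextId++);
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("the", 100);
        tf.put("cat", 3);
        tf.put("sat", 7);
        System.out.println(reassign(tf));
    }
}
```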
-
getLexInputStream
protected java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> getLexInputStream(java.lang.String structureName) throws java.io.IOException
Returns the lexicon input stream for the current index at the specified filename.
- Throws:
java.io.IOException
-
getLexOutputStream
protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String structureName) throws java.io.IOException
Returns the lexicon output stream for the current index at the specified filename.
- Throws:
java.io.IOException
-
-