Terrier IR Platform
2.2.1

uk.ac.gla.terrier.structures.indexing
Class LexiconBuilder

java.lang.Object
  extended by uk.ac.gla.terrier.structures.indexing.LexiconBuilder
Direct Known Subclasses:
BlockLexiconBuilder, UTFLexiconBuilder

public class LexiconBuilder
extends java.lang.Object

Builds temporary lexicons while a collection is being indexed and merges them once indexing of the collection has finished.
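
The buffering half of this lifecycle can be illustrated with a minimal, self-contained sketch. It is not Terrier code: the class TempLexiconSketch, the flush threshold, and the choice to keep flushed runs in memory rather than in temporary files are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical stand-in for a temporary lexicon buffer; not part of Terrier. */
final class TempLexiconSketch {
    // Assumed threshold: flush once this many distinct terms are buffered.
    private final int maxTermsInMemory;
    // A TreeMap keeps the buffered terms in lexicographical order.
    private final TreeMap<String, Integer> buffer = new TreeMap<>();
    // Each flushed run stands in for one temporary lexicon file.
    private final List<SortedMap<String, Integer>> runs = new ArrayList<>();

    TempLexiconSketch(int maxTermsInMemory) {
        this.maxTermsInMemory = maxTermsInMemory;
    }

    /** Adds tf occurrences of term and flushes when the buffer is full. */
    void addTerm(String term, int tf) {
        buffer.merge(term, tf, Integer::sum);
        if (buffer.size() >= maxTermsInMemory) {
            flush();
        }
    }

    /** Copies the buffered terms into a new sorted run and empties the buffer. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        runs.add(new TreeMap<>(buffer));
        buffer.clear();
    }

    int runCount() {
        return runs.size();
    }

    SortedMap<String, Integer> run(int i) {
        return runs.get(i);
    }

    public static void main(String[] args) {
        TempLexiconSketch lexicon = new TempLexiconSketch(2);
        lexicon.addTerm("retrieval", 3);
        lexicon.addTerm("index", 2);   // second distinct term triggers a flush
        lexicon.addTerm("terrier", 5);
        lexicon.flush();               // force the remaining terms out
        for (int i = 0; i < lexicon.runCount(); i++) {
            System.out.println("run " + i + ": " + lexicon.run(i));
        }
    }
}
```

Each run is sorted, which is what makes a later merge of the runs a simple sequential pass.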

Version:
$Revision: 1.47 $
Author:
Craig Macdonald & Vassilis Plachouras

Constructor Summary
LexiconBuilder()
          Deprecated.  
LexiconBuilder(Index i)
           
LexiconBuilder(java.lang.String pathname, java.lang.String prefix)
          Creates an instance of the class, given the path to save the temporary lexicons.
 
Method Summary
 void addDocumentTerms(DocumentPostingList terms)
          Adds the terms of a document to the temporary lexicon in memory.
 void addTemporaryLexicon(java.lang.String filename)
          If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.
 void addTerm(java.lang.String term, int tf)
          Add a single term to the lexicon being built
static void createLexiconHash(Index index)
          Creates a lexicon hash for the specified index
 void createLexiconHash(LexiconInputStream lexStream)
          Create a lexicon hash for the current index
static void createLexiconHash(LexiconInputStream lexStream, java.io.OutputStream out)
           
static void createLexiconHash(LexiconInputStream lexStream, java.lang.String path, java.lang.String prefix)
          Creates a Lexicon hash.
static void createLexiconIndex(Index index)
          Creates a lexicon index for the specified index
 void createLexiconIndex(LexiconInputStream lexicon, int lexiconEntries, int lexiconEntrySize)
          Creates the lexicon index file that contains a mapping from the given term id to the offset in the lexicon, in order to be able to retrieve the term information according to the term identifier.
static void createLexiconIndex(LexiconInputStream lexicon, int lexiconEntries, int lexiconEntrySize, java.io.DataOutputStream dosLexid)
           
static void createLexiconIndex(LexiconInputStream lexicon, int lexiconEntries, int lexiconEntrySize, java.lang.String path, java.lang.String prefix)
          Creates the lexicon index file that contains a mapping from the given term id to the offset in the lexicon, in order to be able to retrieve the term information according to the term identifier.
 void finishedDirectIndexBuild()
          Processes the lexicon after the direct and document indexes have been created.
 void finishedInvertedIndexBuild()
          Processes the lexicon after the inverted index has been created.
 void flush()
          Force a temporary lexicon to be flushed
 int getFinalNumberOfTerms()
          Returns the number of terms in the final lexicon.
static void main(java.lang.String[] args)
           
 void merge(java.util.LinkedList<java.lang.String> filesToMerge)
          Merges the intermediate lexicon files created during the indexing.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LexiconBuilder

public LexiconBuilder()
Deprecated. 

A default constructor of the class. The lexicon is built in the default path and file: ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX respectively.


LexiconBuilder

public LexiconBuilder(Index i)

LexiconBuilder

public LexiconBuilder(java.lang.String pathname,
                      java.lang.String prefix)
Creates an instance of the class, given the path to save the temporary lexicons.

Parameters:
pathname - String the path to save the temporary lexicons.

Method Detail

getFinalNumberOfTerms

public int getFinalNumberOfTerms()
Returns the number of terms in the final lexicon. The value is only updated once finishedDirectIndexBuild() has executed.


addTemporaryLexicon

public void addTemporaryLexicon(java.lang.String filename)
If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.

Parameters:
filename - Full path to a lexicon to merge

addTerm

public void addTerm(java.lang.String term,
                    int tf)
Add a single term to the lexicon being built

Parameters:
term - The String term
tf - the frequency of the term

addDocumentTerms

public void addDocumentTerms(DocumentPostingList terms)
Adds the terms of a document to the temporary lexicon in memory.

Parameters:
terms - DocumentPostingList the terms of the document to add to the temporary lexicon

flush

public void flush()
Force a temporary lexicon to be flushed


finishedInvertedIndexBuild

public void finishedInvertedIndexBuild()
Processes the lexicon after the inverted index has been created.


finishedDirectIndexBuild

public void finishedDirectIndexBuild()
Processes the lexicon after the direct and document indexes have been created.


merge

public void merge(java.util.LinkedList<java.lang.String> filesToMerge)
           throws java.io.IOException
Merges the intermediate lexicon files created during the indexing.

Parameters:
filesToMerge - java.util.LinkedList the list containing the filenames of the temporary files.
Throws:
java.io.IOException - if an input/output problem is encountered while merging.
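
As an illustration of what merging sorted intermediate lexicons involves, the following sketch performs a k-way merge over in-memory runs and sums the frequencies of terms that occur in more than one run. It is a sketch under stated assumptions, not the method's actual implementation: the class LexiconMergeSketch is hypothetical, the runs are maps rather than files, and summing frequencies is an assumed merge rule.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical k-way merge of sorted term runs; not part of Terrier. */
final class LexiconMergeSketch {

    /** Tracks the current entry of one run and the entries still to come. */
    private static final class Cursor {
        Map.Entry<String, Integer> head;
        final Iterator<Map.Entry<String, Integer>> rest;

        Cursor(Iterator<Map.Entry<String, Integer>> it) {
            this.rest = it;
            this.head = it.next();
        }

        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            head = rest.next();
            return true;
        }
    }

    /** Merges the runs into one map whose iteration order is lexicographical. */
    static LinkedHashMap<String, Integer> merge(List<SortedMap<String, Integer>> runs) {
        // The queue always exposes the cursor with the smallest current term.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                (a, b) -> a.head.getKey().compareTo(b.head.getKey()));
        for (SortedMap<String, Integer> run : runs) {
            if (!run.isEmpty()) {
                queue.add(new Cursor(run.entrySet().iterator()));
            }
        }
        LinkedHashMap<String, Integer> merged = new LinkedHashMap<>();
        while (!queue.isEmpty()) {
            Cursor smallest = queue.poll();
            // Assumed merge rule: frequencies of the same term are summed.
            merged.merge(smallest.head.getKey(), smallest.head.getValue(), Integer::sum);
            if (smallest.advance()) {
                queue.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> first = new TreeMap<>();
        first.put("apple", 1);
        first.put("cat", 2);
        SortedMap<String, Integer> second = new TreeMap<>();
        second.put("bat", 4);
        second.put("cat", 3);
        List<SortedMap<String, Integer>> runs = new ArrayList<>();
        runs.add(first);
        runs.add(second);
        System.out.println(merge(runs));
    }
}
```

Because every run is already sorted, the merge only ever looks at the current head of each run, so a file-based version could stream its input instead of loading whole lexicons into memory.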

createLexiconIndex

public void createLexiconIndex(LexiconInputStream lexicon,
                               int lexiconEntries,
                               int lexiconEntrySize)
                        throws java.io.IOException
Creates the lexicon index file that contains a mapping from the given term id to the offset in the lexicon, in order to be able to retrieve the term information according to the term identifier. This is necessary, because the terms in the lexicon file are saved in lexicographical order, and we also want to have fast access based on their term identifier.

Parameters:
lexicon - The input stream of the lexicon that we are creating the lexid file for
lexiconEntries - The number of entries in this lexicon
lexiconEntrySize - The size of one entry in this lexicon
Throws:
java.io.IOException - if an input/output error occurs.
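
The mapping described above can be sketched as follows. This is an illustrative model only: the class LexiconIndexSketch is hypothetical, the lexicon is represented by an array of term ids in lexicon order, and term ids are assumed to be dense, running from 0 to lexiconEntries - 1. The real on-disk layout may differ.

```java
/** Hypothetical term id to lexicon offset table; not part of Terrier. */
final class LexiconIndexSketch {

    /**
     * Builds the offset table.
     *
     * @param termIdsInLexiconOrder the term id stored in the i-th lexicon entry
     * @param lexiconEntrySize      the size in bytes of one fixed-length entry
     * @return offsets[termId] = byte offset of that term's entry in the lexicon
     */
    static long[] buildOffsets(int[] termIdsInLexiconOrder, int lexiconEntrySize) {
        long[] offsets = new long[termIdsInLexiconOrder.length];
        for (int i = 0; i < termIdsInLexiconOrder.length; i++) {
            // Entry i starts at i * entrySize because entries have a fixed length.
            offsets[termIdsInLexiconOrder[i]] = (long) i * lexiconEntrySize;
        }
        return offsets;
    }

    /** Converts a term id back to the position of its entry in the lexicon. */
    static int entryPosition(long[] offsets, int termId, int lexiconEntrySize) {
        return (int) (offsets[termId] / lexiconEntrySize);
    }

    public static void main(String[] args) {
        // Terms in lexicographical order, with ids assigned in order of first occurrence.
        String[] terms = {"apple", "bat", "cat"};
        int[] termIds = {2, 0, 1};
        int entrySize = 10; // assumed entry size for the example

        long[] offsets = buildOffsets(termIds, entrySize);
        for (int id = 0; id < offsets.length; id++) {
            int pos = entryPosition(offsets, id, entrySize);
            System.out.println("term id " + id + " -> offset " + offsets[id] + " (" + terms[pos] + ")");
        }
    }
}
```

With such a table, a lookup by term id is a single array access followed by one seek, instead of a scan of the lexicographically ordered lexicon.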

createLexiconIndex

public static void createLexiconIndex(LexiconInputStream lexicon,
                                      int lexiconEntries,
                                      int lexiconEntrySize,
                                      java.lang.String path,
                                      java.lang.String prefix)
                               throws java.io.IOException
Creates the lexicon index file that contains a mapping from the given term id to the offset in the lexicon, in order to be able to retrieve the term information according to the term identifier. This is necessary, because the terms in the lexicon file are saved in lexicographical order, and we also want to have fast access based on their term identifier.

Parameters:
lexicon - The input stream of the lexicon that we are creating the lexid file for
lexiconEntries - The number of entries in this lexicon
lexiconEntrySize - The size of one entry in this lexicon
path - The path to the index containing the lexicon
prefix - The prefix of the index containing the lexicon
Throws:
java.io.IOException - if an input/output error occurs.

createLexiconIndex

public static void createLexiconIndex(LexiconInputStream lexicon,
                                      int lexiconEntries,
                                      int lexiconEntrySize,
                                      java.io.DataOutputStream dosLexid)
                               throws java.io.IOException
Throws:
java.io.IOException

createLexiconIndex

public static void createLexiconIndex(Index index)
                               throws java.io.IOException
Creates a lexicon index for the specified index

Parameters:
index - Index to make the lexicon index for
Throws:
java.io.IOException

createLexiconHash

public void createLexiconHash(LexiconInputStream lexStream)
Create a lexicon hash for the current index

Parameters:
lexStream - LexiconInputStream to process

createLexiconHash

public static void createLexiconHash(Index index)
                              throws java.io.IOException
Creates a lexicon hash for the specified index

Parameters:
index - Index to make the LexiconHash for
Throws:
java.io.IOException

createLexiconHash

public static void createLexiconHash(LexiconInputStream lexStream,
                                     java.lang.String path,
                                     java.lang.String prefix)
Creates a Lexicon hash. This method reads the lexicon and finds the entries which start with a different letter. The offset of these entries is used to speed up the binary search performed during retrieval. These offsets are saved to a lex hash file beside the Lexicon in the Index.

Parameters:
lexStream - LexiconInputStream to process
path - Path to the index containing the lexicon
prefix - Prefix of the index containing the lexicon
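
The idea behind the lexicon hash can be sketched with an in-memory model: record, for each initial character, the range of entries that start with it, then restrict the binary search to that range. The class LexiconHashSketch and its methods are hypothetical, and a sorted String array stands in for the lexicon; the format of the real lex hash file is not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical first-letter range table for a sorted lexicon; not part of Terrier. */
final class LexiconHashSketch {

    /** Maps each first character to {firstIndex, endIndexExclusive} in sortedTerms. */
    static Map<Character, int[]> buildHash(String[] sortedTerms) {
        Map<Character, int[]> hash = new HashMap<>();
        for (int i = 0; i < sortedTerms.length; i++) {
            char first = sortedTerms[i].charAt(0); // assumes terms are non-empty
            int[] range = hash.get(first);
            if (range == null) {
                // A new initial letter: this entry opens a new range.
                hash.put(first, new int[] {i, i + 1});
            } else {
                // Same initial letter as before: extend the current range.
                range[1] = i + 1;
            }
        }
        return hash;
    }

    /** Returns the position of term in sortedTerms, or -1 if it is absent. */
    static int find(String[] sortedTerms, Map<Character, int[]> hash, String term) {
        if (term.isEmpty()) {
            return -1;
        }
        int[] range = hash.get(term.charAt(0));
        if (range == null) {
            return -1; // no term starts with this character
        }
        // Binary search only within the entries sharing the first character.
        int pos = Arrays.binarySearch(sortedTerms, range[0], range[1], term);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        String[] lexicon = {"ant", "apple", "bat", "cat", "cow"};
        Map<Character, int[]> hash = buildHash(lexicon);
        System.out.println("cow at " + find(lexicon, hash, "cow"));
        System.out.println("dog at " + find(lexicon, hash, "dog"));
    }
}
```

The table stays small because it holds one range per distinct initial character, while each lookup skips every entry outside that range.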

createLexiconHash

public static void createLexiconHash(LexiconInputStream lexStream,
                                     java.io.OutputStream out)

main

public static void main(java.lang.String[] args)


Terrier Information Retrieval Platform 2.2.1. Copyright 2004-2008 University of Glasgow