Terrier IR Platform
1.1.1

uk.ac.gla.terrier.structures.indexing
Class LexiconBuilder

java.lang.Object
  extended by uk.ac.gla.terrier.structures.indexing.LexiconBuilder
Direct Known Subclasses:
BlockLexiconBuilder, UTFLexiconBuilder

public class LexiconBuilder
extends java.lang.Object

Builds temporary lexicons while a collection is being indexed, and merges them once indexing has finished.

Version:
$Revision: 1.36 $
Author:
Craig Macdonald & Vassilis Plachouras

Constructor Summary
LexiconBuilder()
          A default constructor of the class.
LexiconBuilder(java.lang.String pathname, java.lang.String prefix)
          Creates an instance of the class, given the path to save the temporary lexicons.
 
Method Summary
 void addDocumentTerms(DocumentPostingList terms)
          Adds the terms of a document to the temporary lexicon in memory.
 void addDocumentTerms(FieldDocumentTreeNode[] terms)
          Adds the terms of a document to the temporary lexicon in memory.
 void addTemporaryLexicon(java.lang.String filename)
          If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.
 void createLexiconHash(LexiconInputStream lexStream)
           
static void createLexiconHash(LexiconInputStream lexStream, java.lang.String path, java.lang.String prefix)
          Reads the lexicon and finds the entries at which the first letter of the term changes.
 void createLexiconIndex(LexiconInputStream lexicon, int lexiconEntries, int lexiconEntrySize)
          Creates the lexicon index file that contains a mapping from the given term id to the offset in the lexicon, in order to be able to retrieve the term information according to the term identifier.
static void createLexiconIndex(LexiconInputStream lexicon, int lexiconEntries, int lexiconEntrySize, java.lang.String path, java.lang.String prefix)
           
 void finishedDirectIndexBuild()
          Processes the lexicon after the direct and document indexes have been created.
 void finishedInvertedIndexBuild()
          Processes the lexicon after the inverted index has been created.
 int getFinalNumberOfTerms()
          Returns the number of terms in the final lexicon.
 LexiconInputStream getLexInputStream(java.lang.String filename)
           
 LexiconOutputStream getLexOutputStream(java.lang.String filename)
           
static void main(java.lang.String[] args)
           
 void merge(java.util.LinkedList<java.lang.String> filesToMerge)
          Merges the intermediate lexicon files created during the indexing.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LexiconBuilder

public LexiconBuilder()
A default constructor of the class. The lexicon is built using the default path and file prefix, ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX respectively.


LexiconBuilder

public LexiconBuilder(java.lang.String pathname,
                      java.lang.String prefix)
Creates an instance of the class, given the path to save the temporary lexicons.

Parameters:
pathname - String the path to save the temporary lexicons.
Method Detail

getFinalNumberOfTerms

public int getFinalNumberOfTerms()
Returns the number of terms in the final lexicon. The value is only updated once finishedDirectIndexBuild() has executed.


addTemporaryLexicon

public void addTemporaryLexicon(java.lang.String filename)
If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.

Parameters:
filename - Full path to a lexicon to merge

addDocumentTerms

public void addDocumentTerms(FieldDocumentTreeNode[] terms)
Adds the terms of a document to the temporary lexicon in memory.

Parameters:
terms - FieldDocumentTreeNode[] the terms of the document to add to the temporary lexicon in memory.

addDocumentTerms

public void addDocumentTerms(DocumentPostingList terms)
Adds the terms of a document to the temporary lexicon in memory.

Parameters:
terms - DocumentPostingList the terms of the document to add to the temporary lexicon

finishedInvertedIndexBuild

public void finishedInvertedIndexBuild()
Processes the lexicon after the inverted index has been created.


finishedDirectIndexBuild

public void finishedDirectIndexBuild()
Processes the lexicon after the direct and document indexes have been created.


merge

public void merge(java.util.LinkedList<java.lang.String> filesToMerge)
           throws java.io.IOException
Merges the intermediate lexicon files created during the indexing.

Parameters:
filesToMerge - java.util.LinkedList the list containing the filenames of the temporary files.
Throws:
java.io.IOException - an input/output exception is thrown if a problem is encountered.
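
The merging step can be illustrated with a minimal sketch. It is not Terrier's implementation: the Entry class, the in-memory lists, and the rule that frequencies of the same term are summed are all assumptions made for illustration. It only shows the general idea of merging two lexicographically sorted partial lexicons into one sorted result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; none of these names come from Terrier. */
class LexiconMergeSketch {

    /** Hypothetical lexicon entry: a term and its frequency. */
    static final class Entry {
        final String term;
        final int frequency;

        Entry(String term, int frequency) {
            this.term = term;
            this.frequency = frequency;
        }

        @Override
        public String toString() {
            return term + ":" + frequency;
        }
    }

    /**
     * Merges two lists that are each sorted by term.
     * Assumption: when a term occurs in both lists, its frequencies are summed.
     */
    static List<Entry> mergeSorted(List<Entry> a, List<Entry> b) {
        List<Entry> merged = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            Entry x = a.get(i);
            Entry y = b.get(j);
            int cmp = x.term.compareTo(y.term);
            if (cmp < 0) {
                merged.add(x);
                i++;
            } else if (cmp > 0) {
                merged.add(y);
                j++;
            } else {
                // Same term in both partial lexicons: combine into one entry.
                merged.add(new Entry(x.term, x.frequency + y.frequency));
                i++;
                j++;
            }
        }
        // Copy whatever remains in the list that was not exhausted.
        while (i < a.size()) {
            merged.add(a.get(i++));
        }
        while (j < b.size()) {
            merged.add(b.get(j++));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Entry> first = Arrays.asList(new Entry("apple", 2), new Entry("cherry", 1));
        List<Entry> second = Arrays.asList(new Entry("banana", 4), new Entry("cherry", 3));
        System.out.println(mergeSorted(first, second));
    }
}
```

Because both inputs are already sorted, the merge needs a single pass and never holds more than one entry per input in flight, which is why the same pattern also suits files that are too large to load into memory.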

createLexiconIndex

public void createLexiconIndex(LexiconInputStream lexicon,
                               int lexiconEntries,
                               int lexiconEntrySize)
                        throws java.io.IOException
Creates the lexicon index file that contains a mapping from the given term id to the offset in the lexicon, in order to be able to retrieve the term information according to the term identifier. This is necessary, because the terms in the lexicon file are saved in lexicographical order, and we also want to have fast access based on their term identifier.

Parameters:
lexicon - The input stream of the lexicon that we are creating the lexid file for
lexiconEntries - The number of entries in this lexicon
lexiconEntrySize - The size of one entry in this lexicon
Throws:
java.io.IOException - if an input/output error occurs.
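
The mapping described above can be sketched as follows. This is a hedged illustration, not the actual file format: it assumes term ids are dense (0 to n-1), that every lexicon entry has the same size, and that the mapping is kept in an array rather than written to a file. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative sketch only; none of these names come from Terrier. */
class LexiconIdIndexSketch {

    /**
     * termIdsInLexiconOrder[i] is the term id stored in the i-th entry of the
     * lexicographically sorted lexicon. The result maps each term id to the
     * byte offset of its entry, assuming fixed-size entries and dense ids.
     */
    static long[] buildOffsetsByTermId(int[] termIdsInLexiconOrder, int entrySize) {
        long[] offsets = new long[termIdsInLexiconOrder.length];
        for (int position = 0; position < termIdsInLexiconOrder.length; position++) {
            int termId = termIdsInLexiconOrder[position];
            // Entry number 'position' starts at position * entrySize bytes.
            offsets[termId] = (long) position * entrySize;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Lexicon order: "apple" (id 2), "banana" (id 0), "cherry" (id 1).
        int[] termIds = {2, 0, 1};
        int entrySize = 32; // arbitrary value chosen for the example
        System.out.println(Arrays.toString(buildOffsetsByTermId(termIds, entrySize)));
    }
}
```

With such a table, looking up a term by id is a single array access followed by one seek into the lexicon, instead of a scan over entries that are ordered by term rather than by id.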

createLexiconIndex

public static void createLexiconIndex(LexiconInputStream lexicon,
                                      int lexiconEntries,
                                      int lexiconEntrySize,
                                      java.lang.String path,
                                      java.lang.String prefix)
                               throws java.io.IOException
Throws:
java.io.IOException

createLexiconHash

public void createLexiconHash(LexiconInputStream lexStream)

createLexiconHash

public static void createLexiconHash(LexiconInputStream lexStream,
                                     java.lang.String path,
                                     java.lang.String prefix)
Reads the lexicon and finds the entries at which the first letter of the term changes. The offsets of these entries are used to narrow the range of the binary search performed during retrieval.
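
A minimal sketch of this idea, under stated assumptions: terms are non-empty, start with a lowercase letter from 'a' to 'z', and are held in a sorted in-memory array rather than a lexicon file. The names firstLetterBounds and find are hypothetical and do not correspond to Terrier methods.

```java
import java.util.Arrays;

/** Illustrative sketch only; none of these names come from Terrier. */
class LexiconHashSketch {

    /**
     * Returns a 27-slot table for a sorted term array. Terms starting with
     * letter c (0 = 'a') occupy the index range [bounds[c], bounds[c + 1]).
     */
    static int[] firstLetterBounds(String[] sortedTerms) {
        int n = sortedTerms.length;
        int[] bounds = new int[27];
        Arrays.fill(bounds, n);
        // Walking backwards leaves the smallest index for each first letter.
        for (int i = n - 1; i >= 0; i--) {
            bounds[sortedTerms[i].charAt(0) - 'a'] = i;
        }
        // A letter with no terms gets an empty range ending at the next letter's start.
        for (int c = 25; c >= 0; c--) {
            if (bounds[c] == n) {
                bounds[c] = bounds[c + 1];
            }
        }
        return bounds;
    }

    /** Binary search restricted to the range for the term's first letter; -1 if absent. */
    static int find(String[] sortedTerms, int[] bounds, String term) {
        if (term.isEmpty()) {
            return -1;
        }
        int c = term.charAt(0) - 'a';
        if (c < 0 || c > 25) {
            return -1;
        }
        int index = Arrays.binarySearch(sortedTerms, bounds[c], bounds[c + 1], term);
        return index >= 0 ? index : -1;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "avocado", "banana", "cherry", "cranberry"};
        int[] bounds = firstLetterBounds(terms);
        System.out.println(find(terms, bounds, "cherry"));
        System.out.println(find(terms, bounds, "blueberry"));
    }
}
```

The table costs one integer per letter, and each lookup then searches only the entries that share the query term's first letter rather than the whole lexicon.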


main

public static void main(java.lang.String[] args)

getLexInputStream

public LexiconInputStream getLexInputStream(java.lang.String filename)

getLexOutputStream

public LexiconOutputStream getLexOutputStream(java.lang.String filename)

Terrier Information Retrieval Platform 1.1.1. Copyright 2004-2007 University of Glasgow