public abstract class Indexer extends Object
TermPipeline stages (e.g. Stopwords removal and PorterStemmer).
Document properties to index as document metadata in the MetaIndex. Defaults to "docno", which permits docid->docno lookups. Examples are "docno,url" or "docno,url,content".
MetaIndex. Defaults to 20.
Document properties to permit lookups for (i.e. docno->docid). Defaults to empty (none are enabled).
Modifier and Type | Field and Description |
---|---|
protected HashSet<String> | BUILDER_BOUNDARY_DOCUMENTS: The DOCNO of documents to force builder boundaries |
protected IndexOnDisk | currentIndex: The index being worked on, denoted by path and prefix |
protected AbstractPostingOutputStream | directIndexBuilder: The builder that creates the direct index. |
protected DocumentIndexBuilder | docIndexBuilder: The builder that creates the document index. |
protected DocumentIndexEntry | emptyDocIndexEntry |
protected gnu.trove.TObjectIntHashMap<String> | fieldNames: mapping: field name -> field id, returns 0 for no mapping |
protected String | fileNameNoExtension: The common prefix of the data structures filenames. |
protected boolean | IndexEmptyDocuments: Indicates whether an entry for empty documents is stored in the document index, or empty documents should be ignored. |
protected InvertedIndexBuilder | invertedIndexBuilder: The builder that creates the inverted index. |
protected LexiconBuilder | lexiconBuilder: The builder that creates the lexicon. |
protected static org.apache.log4j.Logger | logger: the logger for this class |
protected int | MAX_DOCS_PER_BUILDER: The number of documents indexed with a set of builders. |
protected int | MAX_TOKENS_IN_DOCUMENT: The maximum number of tokens in a document. |
protected MetaIndexBuilder | metaBuilder |
protected int | numFields: the number of fields |
protected String | path: The path in which the data structures are stored. |
protected TermPipeline | pipeline_first: The first component of the term pipeline. |
protected String | prefix: The prefix of the data structures, i.e. the first part of the filename |
protected boolean | useFieldInformation: Indicates whether field information should be saved in the created data structures. |
Modifier | Constructor and Description |
---|---|
public | Indexer(): Creates an indexer at the location ApplicationSetup.TERRIER_INDEX_PATH and ApplicationSetup.TERRIER_INDEX_PREFIX |
protected | Indexer(long a, long b, long c): Protected do-nothing constructor for use by child classes |
public | Indexer(String _path, String _prefix): Creates an instance of the class. |
Modifier and Type | Method and Description |
---|---|
abstract void | createDirectIndex(Collection[] collections): An abstract method for creating the direct index, the document index and the lexicon for the given collections. |
abstract void | createInvertedIndex(): An abstract method for creating the inverted index, given that the direct index, the document index and the lexicon have already been created. |
protected MetaIndexBuilder | createMetaIndexBuilder() |
protected void | finishedDirectIndexBuild(): Event method to be overridden by child classes. |
protected void | finishedInvertedIndexBuild(): Event method to be overridden by child classes. |
protected abstract TermPipeline | getEndOfPipeline(): An abstract method that returns the last component of the term pipeline. |
void | index(Collection[] collections): Creates the data structures for a set of collections. |
protected void | indexEmpty(Map<String,String> docProperties): Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true. |
protected void | init(): This method must be called by anything which directly extends Indexer. |
protected void | load_builder_boundary_documents(): Loads the builder boundary documents from the property indexing.builder.boundary.docnos, comma delimited. |
protected void | load_field_ids(): Loads a mapping of field name -> field id. |
protected void | load_indexer_properties() |
protected void | load_pipeline(): Creates the term pipeline, as specified by the property termpipelines in the properties file. |
static void | main(String[] args): Utility method for merging indices. |
static void | merge(String mpath, String mprefix, int lowest, int highest): Merge a series of numbered indices in the same path/prefix area. |
static void | merge(String mpath, String mprefix, LinkedList<String[]> llist, int counterMerged): Merge a series of indices, in pair-wise fashion. |
protected static void | mergeTwoIndices(String[] index1, String[] index2, String[] outputIndex): Merge two indices. |
protected static int[] | parseInts(String[] in) |
boolean | useFieldInformation(): Returns whether the index will record fields. |
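
The pair-wise merge listed in the method summary can be pictured as a queue of {path, prefix} pairs that shrinks by one with every merge. The following is only a sketch of that idea, not Terrier's implementation: the class PairwiseMergeSketch, the helper mergeTwo, the "_merge" naming scheme and the /tmp/index path are all hypothetical, and no index files are touched.

```java
import java.util.Arrays;
import java.util.LinkedList;

class PairwiseMergeSketch {
    /** Hypothetical stand-in for merging two on-disk indices; it only names the output index. */
    static String[] mergeTwo(String[] a, String[] b, String path, String prefix, int counter) {
        // A real merge would combine the lexicons, postings and document indices of a and b.
        return new String[] { path, prefix + "_merge" + counter };
    }

    /** Merges the first two indices in the queue repeatedly until a single index remains. */
    static String[] mergePairwise(String path, String prefix, LinkedList<String[]> queue) {
        int counter = 0;
        while (queue.size() > 1) {
            String[] first = queue.removeFirst();
            String[] second = queue.removeFirst();
            // The merged result goes to the back of the queue so that it is merged again later.
            queue.addLast(mergeTwo(first, second, path, prefix, counter++));
        }
        return queue.getFirst();
    }

    public static void main(String[] args) {
        LinkedList<String[]> queue = new LinkedList<>();
        for (int i = 1; i <= 4; i++) {
            queue.add(new String[] { "/tmp/index", "data_" + i });
        }
        String[] result = mergePairwise("/tmp/index", "data", queue);
        System.out.println(Arrays.toString(result)); // prints [/tmp/index, data_merge2]
    }
}
```

Merging n indices this way always takes n - 1 two-way merges; four inputs therefore produce three intermediate outputs, the last of which is the final index.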
protected static final org.apache.log4j.Logger logger
protected int MAX_DOCS_PER_BUILDER
protected int MAX_TOKENS_IN_DOCUMENT
protected final HashSet<String> BUILDER_BOUNDARY_DOCUMENTS
protected boolean useFieldInformation
protected TermPipeline pipeline_first
protected boolean IndexEmptyDocuments
protected AbstractPostingOutputStream directIndexBuilder
protected DocumentIndexBuilder docIndexBuilder
protected InvertedIndexBuilder invertedIndexBuilder
protected LexiconBuilder lexiconBuilder
protected MetaIndexBuilder metaBuilder
protected String fileNameNoExtension
protected String path
protected String prefix
protected IndexOnDisk currentIndex
protected gnu.trove.TObjectIntHashMap<String> fieldNames
protected int numFields
protected DocumentIndexEntry emptyDocIndexEntry
public Indexer()
public Indexer(String _path, String _prefix)
_path - String the path where the generated data structures will be saved.
_prefix - String the filename that the data structures will have.
protected Indexer(long a, long b, long c)
protected void init()
public abstract void createDirectIndex(Collection[] collections)
collections - Collection[] An array of collections to index
public abstract void createInvertedIndex()
protected abstract TermPipeline getEndOfPipeline()
protected MetaIndexBuilder createMetaIndexBuilder()
protected static final int[] parseInts(String[] in)
protected void load_indexer_properties()
protected void load_field_ids()
protected void load_pipeline()
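
As a rough illustration of what a chain of term pipeline stages does, the sketch below links a stopword filter to a toy suffix stripper and ends in a collecting stage, so that pipeline_first corresponds to the first stage and getEndOfPipeline() to the last. The Stage interface and all three stage classes are hypothetical stand-ins written for this example; they are not Terrier's TermPipeline, Stopwords or PorterStemmer classes, and the suffix rule is far cruder than Porter stemming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TermPipelineSketch {
    /** Hypothetical pipeline stage: receives a term and may pass it on to the next stage. */
    interface Stage {
        void processTerm(String term);
    }

    /** Drops terms found in the stopword set and forwards everything else. */
    static class StopwordStage implements Stage {
        private final Set<String> stopwords;
        private final Stage next;
        StopwordStage(Set<String> stopwords, Stage next) {
            this.stopwords = stopwords;
            this.next = next;
        }
        public void processTerm(String term) {
            if (!stopwords.contains(term)) {
                next.processTerm(term);
            }
        }
    }

    /** Toy normaliser: strips one trailing "s" from terms longer than three characters. */
    static class SuffixStage implements Stage {
        private final Stage next;
        SuffixStage(Stage next) {
            this.next = next;
        }
        public void processTerm(String term) {
            boolean strip = term.length() > 3 && term.endsWith("s");
            next.processTerm(strip ? term.substring(0, term.length() - 1) : term);
        }
    }

    /** End of the pipeline: collects the terms that survived every earlier stage. */
    static class CollectingStage implements Stage {
        final List<String> terms = new ArrayList<>();
        public void processTerm(String term) {
            terms.add(term);
        }
    }

    /** Builds the chain back to front and pushes every token through the first stage. */
    static List<String> run(List<String> tokens) {
        CollectingStage end = new CollectingStage();
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of", "and"));
        Stage first = new StopwordStage(stopwords, new SuffixStage(end));
        for (String token : tokens) {
            first.processTerm(token);
        }
        return end.terms;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the", "builders", "of", "documents")));
        // prints [builder, document]
    }
}
```

Building the chain from the last stage backwards keeps each stage unaware of anything except its immediate successor, which is why an indexer only needs to hold the first stage and to supply the final one.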
protected void load_builder_boundary_documents()
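
A comma-delimited property such as indexing.builder.boundary.docnos can be turned into a set of DOCNOs along the following lines. This is a minimal sketch that reads from a plain java.util.Properties object rather than from Terrier's own configuration handling; the class and method names are hypothetical, and trimming whitespace and skipping empty entries are assumptions, not documented behaviour.

```java
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

class BoundaryDocnosSketch {
    /** Parses the comma-delimited property into a set of DOCNOs (empty set if unset). */
    static Set<String> loadBoundaryDocnos(Properties props) {
        Set<String> docnos = new HashSet<>();
        String raw = props.getProperty("indexing.builder.boundary.docnos", "");
        for (String docno : raw.split(",")) {
            String trimmed = docno.trim(); // assumption: surrounding whitespace is not significant
            if (!trimmed.isEmpty()) {
                docnos.add(trimmed);
            }
        }
        return docnos;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("indexing.builder.boundary.docnos", "DOC-100, DOC-200,,DOC-300");
        Set<String> docnos = loadBoundaryDocnos(props);
        System.out.println(docnos.size());              // prints 3
        System.out.println(docnos.contains("DOC-200")); // prints true
    }
}
```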
public void index(Collection[] collections)
collections - The document collection objects to index.
public static void merge(String mpath, String mprefix, int lowest, int highest)
mpath - Path of all indices
mprefix - Common prefix of all indices
lowest - lowest suffix of prefix
highest - highest suffix of prefix
protected static void mergeTwoIndices(String[] index1, String[] index2, String[] outputIndex)
index1 - Path/Prefix of source index 1
index2 - Path/Prefix of source index 2
outputIndex - Path/Prefix of destination index
public static void merge(String mpath, String mprefix, LinkedList<String[]> llist, int counterMerged)
mpath - Common path of all indices
mprefix - Prefix of target index
counterMerged - number of indices to merge
protected void finishedDirectIndexBuild()
protected void finishedInvertedIndexBuild()
public boolean useFieldInformation()
protected void indexEmpty(Map<String,String> docProperties) throws IOException
IOException
Terrier 4.0. Copyright © 2004-2014 University of Glasgow