Class NoDuplicatesSinglePassIndexing
- java.lang.Object
-
- org.terrier.structures.indexing.Indexer
-
- org.terrier.structures.indexing.classical.BasicIndexer
-
- org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
-
- org.terrier.structures.indexing.singlepass.NoDuplicatesSinglePassIndexing
-
public class NoDuplicatesSinglePassIndexing extends BasicSinglePassIndexer
Single pass indexer that performs document deduplication based upon the docno.
- Since:
- 4.0
- Author:
- Dyaa Albakour
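The docno-based deduplication described above can be illustrated with a minimal, self-contained sketch. It is not the Terrier implementation: the class name DocnoDeduplicationSketch, the "docno" property key, the `indexed` list, and the decision to skip documents without a docno are all assumptions made for illustration. Only the `seenDocnos` field name and its TreeSet type come from this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; this is not the Terrier class.
class DocnoDeduplicationSketch {
    // Mirrors the documented field: the docnos accepted so far.
    protected final TreeSet<String> seenDocnos = new TreeSet<>();
    // Assumption: a simple list standing in for the real index structures.
    protected final List<String> indexed = new ArrayList<>();

    // Returns true if the document was accepted, false if it was skipped.
    // Assumption: the document identifier is stored under the "docno" key.
    boolean indexDocument(Map<String, String> docProperties) {
        String docno = docProperties.get("docno");
        // Assumption: documents without a docno are skipped.
        // TreeSet.add returns false when the element is already present.
        if (docno == null || !seenDocnos.add(docno)) {
            return false;
        }
        indexed.add(docno);
        return true;
    }

    public static void main(String[] args) {
        DocnoDeduplicationSketch sketch = new DocnoDeduplicationSketch();
        Map<String, String> doc = new HashMap<>();
        doc.put("docno", "D1");
        System.out.println(sketch.indexDocument(doc)); // first occurrence is accepted
        System.out.println(sketch.indexDocument(doc)); // repeat docno is skipped
    }
}
```

A sorted set keeps membership checks at O(log n) per document, at the cost of holding every docno seen so far in memory.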
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.terrier.structures.indexing.classical.BasicIndexer
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
-
-
Field Summary
Modifier and Type: protected java.util.TreeSet<java.lang.String>
Field: seenDocnos
-
Fields inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime
-
Fields inherited from class org.terrier.structures.indexing.classical.BasicIndexer
compressionDirectConfig, compressionInvertedConfig, numOfTokensInDocument, termCodes, termFields, termsInDocument
-
Fields inherited from class org.terrier.structures.indexing.Indexer
blocks, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocCount, emptyDocIndexEntry, externalParalllism, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
-
-
Constructor Summary
protected NoDuplicatesSinglePassIndexing(long a, long b, long c)
NoDuplicatesSinglePassIndexing(java.lang.String pathname, java.lang.String prefix)
-
Method Summary
protected void
indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList termsInDocument)
Adds a document to the direct and document indexes, and its terms to the lexicon.
protected void
indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties)
Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
-
Methods inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, createInvertedIndex, createMemoryPostings, createRunMerger, finishMemoryPosting, forceFlush, getFileNames, load_indexer_properties, performMultiWayMerge
-
Methods inherited from class org.terrier.structures.indexing.classical.BasicIndexer
createDocumentPostings, finishedInvertedIndexBuild, getEndOfPipeline
-
Methods inherited from class org.terrier.structures.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, getExternalParalllism, index, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, setExternalParalllism, useFieldInformation
-
-
-
-
Method Detail
-
indexDocument
protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties, DocumentPostingList termsInDocument) throws java.lang.Exception
Adds a document to the direct and document indexes, and its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument. This implementation only places content in the runs in memory, which will eventually be flushed to disk.
- Overrides:
indexDocument
in class BasicSinglePassIndexer
- Parameters:
docProperties
- Map<String,String> properties of the document
termsInDocument
- DocumentPostingList the terms in the document.
- Throws:
java.lang.Exception
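The idea of placing content in in-memory runs that are later flushed can be sketched as follows. This is a simplified illustration, not Terrier's code: the class InMemoryRunSketch, the `flushedRuns` list standing in for run files on disk, and the purely document-count-based flush trigger are assumptions. The names maxDocsPerFlush, numberOfDocsSinceFlush, and forceFlush are borrowed from the inherited members listed on this page, but their behaviour here is assumed rather than confirmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for illustration only; this is not the Terrier class.
class InMemoryRunSketch {
    // Assumption: flush after a fixed number of documents.
    private final int maxDocsPerFlush;
    private int numberOfDocsSinceFlush = 0;
    // Current run: term -> postings, each posting stored as {docid, frequency}.
    private TreeMap<String, List<int[]>> memoryPostings = new TreeMap<>();
    // Assumption: an in-memory list stands in for run files written to disk.
    final List<TreeMap<String, List<int[]>>> flushedRuns = new ArrayList<>();

    InMemoryRunSketch(int maxDocsPerFlush) {
        this.maxDocsPerFlush = maxDocsPerFlush;
    }

    // Adds one document's terms to the current in-memory run.
    void indexDocument(int docid, Map<String, Integer> termFrequencies) {
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            memoryPostings
                .computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                .add(new int[] {docid, e.getValue()});
        }
        numberOfDocsSinceFlush++;
        if (numberOfDocsSinceFlush >= maxDocsPerFlush) {
            forceFlush();
        }
    }

    // Moves the current run out of memory and starts a fresh one.
    void forceFlush() {
        if (!memoryPostings.isEmpty()) {
            flushedRuns.add(memoryPostings);
            memoryPostings = new TreeMap<>();
        }
        numberOfDocsSinceFlush = 0;
    }
}
```

In a single-pass design of this kind, the flushed runs would later be combined by a multi-way merge into the final inverted index; that step is omitted from the sketch.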
-
indexEmpty
protected void indexEmpty(java.util.Map<java.lang.String,java.lang.String> docProperties) throws java.io.IOException
Adds an entry to the document index for an empty document, only if IndexEmptyDocuments is set to true.
- Overrides:
indexEmpty
in class Indexer
- Throws:
java.io.IOException
-
-