[TR-108] Some indexers do not set the IterablePosting class for the DirectIndex Created: 25/Mar/10  Updated: 25/Mar/10  Resolved: 25/Mar/10

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: 3.0
Fix Version/s: 3.5

Type: Bug Priority: Major
Reporter: Richard Eckart de Castilho Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Text File IncrementalIndexerImpl.java    

 Description   
org.terrier.structures.BitPostingIndex.getPostings(BitIndexPointer) is not able to instantiate the postingImplementation when fields are used. The error happens in line 105ff:

{code}
if (fieldCount > 0)
rtr = postingImplementation
.getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class, Integer.TYPE)
.newInstance(file, pointer.getNumberOfEntries(), null, fieldCount);
else
rtr = postingImplementation
.getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class)
.newInstance(file, pointer.getNumberOfEntries(), null);
{code}


The reason is that per default this constructor is called and thus a BasicIterablePosting is used:

{code}
public DirectIndex(Index index, String structureName) throws IOException {
super(index, structureName, BasicIterablePosting.class);
docIndex = index.getDocumentIndex();
}
{code}

The solution is to adjust the code as follows:

{code}
public DirectIndex(Index index, String structureName) throws IOException {
super(index, structureName, index.getIntIndexProperty("index.direct.fields.count", 0) > 0 ?
FieldIterablePosting.class : BasicIterablePosting.class);
docIndex = index.getDocumentIndex();
}
{code}

The same goes for org.terrier.structures.DirectIndexInputStream.DirectIndexInputStream(Index, String):

{code}
public DirectIndexInputStream(Index index, String structureName) throws IOException
{
super(index, structureName, (Iterator<DocumentIndexEntry>)index.getIndexStructureInputStream("document"),
index.getIntIndexProperty("index.direct.fields.count", 0) > 0 ?
FieldIterablePosting.class : BasicIterablePosting.class);
}
{code}

I didn't check if this problem is also present in additional parts of the code.

 Comments   
Comment by Craig Macdonald [ 25/Mar/10 ]

Can you tell me two things:

(a) Which indexing method did you use? (classical, singlepass, MapReduce)

(b) Can you paste me your index data.properties file (grep "index.direct" data.properties)

Thanks for your detailed investigations.

Comment by Richard Eckart de Castilho [ 25/Mar/10 ]

I used a custom indexer (IncrementalIndexer). It basically allows me to index a document term-by-term. Basically it is a a reordered, stateful BasicIndexer. I upgraded this to Terrier 3.0 now and think that I did not miss anything important. If you are interested, I would consider contributing it.

Here are the properties:

index.direct.fields.names=TOKEN,UTF8,NO_NAME_FIELD_2,NO_NAME_FIELD_3,NO_NAME_FIELD_4,NO_NAME_FIELD_5,NO_NAME_FIELD_6,NO_NAME_FIELD_7,NO_NAME_FIELD_8,NO_NAME_FIELD_9,NO_NAME_FIELD_10,NO_NAME_FIELD_11,NO_NAME_FIELD_12,NO_NAME_FIELD_13,NO_NAME_FIELD_14,NO_NAME_FIELD_15,NO_NAME_FIELD_16,NO_NAME_FIELD_17,NO_NAME_FIELD_18,NO_NAME_FIELD_19,NO_NAME_FIELD_20,NO_NAME_FIELD_21,NO_NAME_FIELD_22,NO_NAME_FIELD_23,NO_NAME_FIELD_24,NO_NAME_FIELD_25,NO_NAME_FIELD_26,NO_NAME_FIELD_27,NO_NAME_FIELD_28,NO_NAME_FIELD_29,NO_NAME_FIELD_30
index.direct.class=org.terrier.structures.DirectIndex
index.direct-inputstream.parameter_values=index,structureName,org.terrier.structures.postings.FieldIterablePosting
index.document-factory.parameter_values=${index.direct.fields.count}
index.direct.parameter_types=org.terrier.structures.Index,java.lang.String
index.direct-inputstream.parameter_types=org.terrier.structures.Index,java.lang.String,java.lang.Class
index.direct.fields.count=31
index.lexicon-valuefactory.parameter_values=${index.direct.fields.count}
index.direct-inputstream.class=org.terrier.structures.DirectIndexInputStream
Comment by Craig Macdonald [ 25/Mar/10 ]

Thanks.

Can you update the following index properties?
{{{
index.direct.parameter_types=org.terrier.structures.Index,java.lang.String,java.lang.Class
index.direct.parameter_values=index,structureName,org.terrier.structures.postings.FieldIterablePosting
}}}

I would rather change BasicIndexer to write the correct index properties than add more intelligence in the constructors.

Your IncrementalIndexer sounds interesting. I'm currently refactoring the Indexer classes - document analysing is a distinct notion from structure writing, so I'm trying to suitably move things around.

What I can't work out is why the unit tests don't identify this problem, as the correctness of the direct index is checked in ShakespeareEndToEndTest.java

Comment by Craig Macdonald [ 25/Mar/10 ]

Ok, I fixed BasicIndexer, BlockIndexer, StrutureMerger and Inverted2DirectIndexBuilder. BasicShakespeareEndToEndTest now catchs this.\

Here is a patch. Can you make the same change to IncrementalIndexer and re-test? Many thanks

Index: src/core/org/terrier/structures/merging/StructureMerger.java
===================================================================
--- src/core/org/terrier/structures/merging/StructureMerger.java	(revision 2991)
+++ src/core/org/terrier/structures/merging/StructureMerger.java	(working copy)
@@ -487,8 +487,9 @@
 			destIndex.addIndexStructure(
 					"direct", 
 					"org.terrier.structures.DirectIndex", 
-					"org.terrier.structures.Index,java.lang.String", 
-					"index,structureName");
+					"org.terrier.structures.Index,java.lang.String,java.lang.Class", 
+					"index,structureName,"+ 
+						(fieldCount > 0 ? fieldDirectIndexPostingIteratorClass : basicDirectIndexPostingIteratorClass));
 			destIndex.addIndexStructureInputStream(
 					"direct",
 					"org.terrier.structures.DirectIndexInputStream", 
Index: src/core/org/terrier/structures/indexing/singlepass/Inverted2DirectIndexBuilder.java
===================================================================
--- src/core/org/terrier/structures/indexing/singlepass/Inverted2DirectIndexBuilder.java	(revision 2991)
+++ src/core/org/terrier/structures/indexing/singlepass/Inverted2DirectIndexBuilder.java	(working copy)
@@ -246,8 +246,9 @@
 			index.addIndexStructure(
 				destinationStructure, 
 				directIndexClass, 
-				"org.terrier.structures.Index,java.lang.String", 
-				"index,structureName");
+				"org.terrier.structures.Index,java.lang.String,java.lang.Class", 
+				"index,structureName,"+ 
+					(fieldCount > 0 ? fieldDirectIndexPostingIteratorClass : basicDirectIndexPostingIteratorClass));
 			index.addIndexStructureInputStream(
 				destinationStructure, 
 				directIndexInputStreamClass,
Index: src/core/org/terrier/indexing/BasicIndexer.java
===================================================================
--- src/core/org/terrier/indexing/BasicIndexer.java	(revision 2991)
+++ src/core/org/terrier/indexing/BasicIndexer.java	(working copy)
@@ -303,8 +303,9 @@
 		currentIndex.addIndexStructure(
 				"direct", 
 				"org.terrier.structures.DirectIndex", 
-				"org.terrier.structures.Index,java.lang.String", 
-				"index,structureName");
+				"org.terrier.structures.Index,java.lang.String,java.lang.Class", 
+				"index,structureName,"+ 
+					(FieldScore.FIELDS_COUNT > 0 ? fieldDirectIndexPostingIteratorClass : basicDirectIndexPostingIteratorClass));
 		currentIndex.addIndexStructureInputStream(
 				"direct",
 				"org.terrier.structures.DirectIndexInputStream", 
Index: src/test/org/terrier/tests/BasicShakespeareEndToEndTest.java
===================================================================
--- src/test/org/terrier/tests/BasicShakespeareEndToEndTest.java	(revision 2991)
+++ src/test/org/terrier/tests/BasicShakespeareEndToEndTest.java	(working copy)
@@ -24,9 +24,16 @@
  *   Craig Macdonald <craigm{a.}dcs.gla.ac.uk> (original contributor)
  */
 package org.terrier.tests;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
 import org.junit.Test;
-import static org.junit.Assert.*;
+import org.terrier.structures.BitIndexPointer;
+import org.terrier.structures.BitPostingIndex;
+import org.terrier.structures.BitPostingIndexInputStream;
 import org.terrier.structures.Index;
+import org.terrier.structures.postings.FieldPosting;
+import org.terrier.structures.postings.IterablePosting;
 
 public class BasicShakespeareEndToEndTest extends ShakespeareEndToEndTest {
 
@@ -73,6 +80,28 @@
 			assertEquals(2, index.getCollectionStatistics().getNumberOfFields());
 			assertEquals(123, index.getCollectionStatistics().getFieldTokens()[0]);
 			assertEquals(611, index.getCollectionStatistics().getFieldTokens()[1]);
+			
+			
+			
+			//no check correct type of all structures
+			BitPostingIndexInputStream bpiis;
+			IterablePosting ip;
+			BitPostingIndex bpi;
+			
+			//check stream structures
+			bpiis = (BitPostingIndexInputStream) index.getIndexStructureInputStream("direct");
+			ip = bpiis.next();
+			assertTrue(ip instanceof FieldPosting);
+			bpiis = (BitPostingIndexInputStream) index.getIndexStructureInputStream("inverted");
+			ip = bpiis.next();
+			assertTrue(ip instanceof FieldPosting);
+			
+			//check random structures
+			bpi = (BitPostingIndex) index.getInvertedIndex();
+			ip = bpi.getPostings((BitIndexPointer) index.getLexicon().getLexiconEntry(0).getValue());
+			assertTrue(ip instanceof FieldPosting);
+			bpi = (BitPostingIndex) index.getDirectIndex();
+			ip = bpi.getPostings((BitIndexPointer) index.getDocumentIndex().getDocumentEntry(0));
 		}
 	}
 	
Index: src/core/org/terrier/indexing/BlockIndexer.java
===================================================================
--- src/core/org/terrier/indexing/BlockIndexer.java	(revision 2991)
+++ src/core/org/terrier/indexing/BlockIndexer.java	(working copy)
@@ -421,8 +421,9 @@
 		currentIndex.addIndexStructure(
 				"direct", 
 				"org.terrier.structures.BlockDirectIndex", 
-				"org.terrier.structures.Index,java.lang.String", 
-				"index,structureName");
+				"org.terrier.structures.Index,java.lang.String,java.lang.Class", 
+				"index,structureName,"+ 
+				(FieldScore.FIELDS_COUNT > 0 ? fieldDirectIndexPostingIteratorClass : basicDirectIndexPostingIteratorClass));
 		currentIndex.addIndexStructureInputStream(
 				"direct",
 				"org.terrier.structures.BlockDirectIndexInputStream", 

Comment by Craig Macdonald [ 25/Mar/10 ]

updated title.

Comment by Richard Eckart de Castilho [ 25/Mar/10 ]

I upgraded my IncrementalIndexer accordingly and that works fine. Thanks very much for the fast fix

Comment by Richard Eckart de Castilho [ 25/Mar/10 ]

The IncrementalIndexer basically allows you to do this:

                // Write a simple index
		indexer.startDirectIndex();
		indexer.startDocument("doc1");
		indexer.indexTerm("term1", singleton("token")); // The second argument are the field names to add to
		indexer.indexTerm("term2", singleton("token"));
		indexer.indexTerm("tag1", singleton("tag"));
		indexer.endDocument();
		indexer.syncFieldNames(); // This is where the fields names discovered during term adding are synched with FieldScore
		indexer.endDirectIndex();
		indexer.createInvertedIndex();

Btw. it would be great if not everything in Terrier was done using (public) static variables for almost anything. I had to do some black
classpath voodoo to be able to deploy multiple Terrier instances in different configurations in the same JVM.

Comment by Craig Macdonald [ 25/Mar/10 ]

Closing as resolved. Fix has been commited to trunk. Anyone experiencing this issue with an index already created by Terrier v3.0 should amend the data.properties file of their index:

index.direct.parameter_types=org.terrier.structures.Index,java.lang.String,java.lang.Class
index.direct.parameter_values=index,structureName,org.terrier.structures.postings.FieldIterablePosting
#or BlockFieldIterablePosting for a block & field index
Comment by Craig Macdonald [ 25/Mar/10 ]

My proposed refactor would let you do similar code as the following:

IndexWriter iw = //choose some appropriate implementation

//create a document posting list for a document
DocumentPostingList d = new DocumentPostingList();
d.indexTerm("term", fieldid);

//add document to index
iw.addDocument(d);

//finish all structures
iw.close();

Would this be OK by you?

I know all about ApplicationSetup. I have plans round this, it takes time though! We thought best to release what we had at present.

Comment by Richard Eckart de Castilho [ 25/Mar/10 ]

That should be fine I think. And if that enables us to incrementally build block indexes, that would be great.

Comment by Richard Eckart de Castilho [ 25/Mar/10 ]

Just one more comment. It may become an issue for us that Hadoop is so deeply integrated into Terrier. I set up Terrier as a Maven project here and hoped to
be able to make Hadoop an optional dependency. Mainly for two reasons: a) reduce dependencies for non-Hadoop projects and b) to be able to run Terrier
on another version of Hadoop - that is not using Terriers Hadoop capabilities, but embedding it as part of another Hadoop program. Maybe it would be possible
to use adapters to use adapter classes to adapt Terrier classes to Hadoop instead of having Terrier classes directly inherit from Hadoop. That is, making Hadoop
really an optional feature on top of Terrier.

Comment by Craig Macdonald [ 25/Mar/10 ]

Its just an interface on the existing code, not incremental indexing directly. However, if you have the relevant code, then we wrap it up in the same interface.

Can you file a new issue for the last comment re Hadoop please?

Comment by Richard Eckart de Castilho [ 25/Mar/10 ]

Variant of BasicIndexer which allows to index one term a a time. Feel free to integrate if you like it.

Generated at Tue Dec 12 13:57:28 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.