[TR-129] Posting.getDocumentLength() does not work for postings from the direct file Created: 29/Sep/09  Updated: 05/Apr/11  Resolved: 07/Mar/11

Status: Resolved
Project: Terrier Core
Component/s: .structures
Affects Version/s: 3.0
Fix Version/s: 3.5

Type: Bug
Priority: Major
Reporter: Rodrygo L. T. Santos
Assignee: Rodrygo L. T. Santos
Resolution: Fixed
Labels: None


 Description   
The following code raises an error when WeightingModel.score(Posting) is called, as a posting retrieved from the direct file apparently does not encapsulate the document length appropriately.

WeightingModel wm = new BM25();

DocumentIndex document = index.getDocumentIndex();
DirectIndex direct = index.getDirectIndex();
DocumentIndexEntry de = document.getDocumentEntry(docid);
IterablePosting ip = direct.getPostings(de);

double score = 0;
while (ip.next()) {
    score += wm.score(ip);
}


 Comments   
Comment by Craig Macdonald [ 29/Sep/09 ]

Indeed, this is an issue. The problem with IterablePosting.getDocumentLength() arises from two conflicting requirements:

  • This method is time-critical when scoring postings during normal retrieval.
  • This method has different semantics for DirectIndex and InvertedIndex lookups.

In particular, the InvertedIndex case should look like:

public int getDocumentLength()
{
 return docIndex.getDocumentLength(this.getId());
}

For the DirectIndex, a call to getDocumentLength() should return the same document length for every term in the postings of a given document, i.e. it is not related to getId(). Indeed, the IterablePosting may already have the document length as part of its pointer.

public int getDocumentLength()
{
 return this.pointer.getDocumentLength();
}

I can't see an elegant solution to address this, apart from sub-classing the various Posting implementations. Putting an if() into the current implementation may have a marked impact on efficiency.
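
To make the conflict concrete, here is a minimal sketch of the sub-classing option. Every name in it (PostingLengthSketch, Posting, InvertedPosting, DirectPosting, and the Map standing in for the document index) is a hypothetical stand-in rather than a Terrier class. It assumes that getId() is a docid for an inverted-file posting and a termid for a direct-file posting.

```java
import java.util.Map;

// Hypothetical, simplified stand-ins; these are NOT Terrier's real classes.
final class PostingLengthSketch {

    /** Minimal posting view: an id plus the length of the document it belongs to. */
    interface Posting {
        int getId();
        int getDocumentLength();
    }

    /** Inverted-file posting: getId() is a docid, so the length is looked up per posting. */
    static final class InvertedPosting implements Posting {
        private final int docid;
        private final Map<Integer, Integer> docLengths; // stands in for the document index

        InvertedPosting(int docid, Map<Integer, Integer> docLengths) {
            this.docid = docid;
            this.docLengths = docLengths;
        }

        public int getId() { return docid; }

        public int getDocumentLength() { return docLengths.get(docid); }
    }

    /** Direct-file posting: getId() is assumed to be a termid, so the length is fixed per document. */
    static final class DirectPosting implements Posting {
        private final int termid;
        private final int documentLength; // taken once from the document's entry

        DirectPosting(int termid, int documentLength) {
            this.termid = termid;
            this.documentLength = documentLength;
        }

        public int getId() { return termid; }

        public int getDocumentLength() { return documentLength; }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> docLengths = Map.of(0, 120, 1, 45);

        // Inverted lookup: the length depends on the posting's own id.
        Posting inverted = new InvertedPosting(1, docLengths);
        System.out.println(inverted.getDocumentLength());

        // Direct lookup for document 1: two different terms, one shared length.
        Posting termA = new DirectPosting(17, docLengths.get(1));
        Posting termB = new DirectPosting(99, docLengths.get(1));
        System.out.println(termA.getDocumentLength() == termB.getDocumentLength());
    }
}
```

Neither subclass needs a branch inside getDocumentLength(), which is the point of the approach. The cost is a second variant of every posting implementation.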

Comment by Craig Macdonald [ 21/Oct/09 ]

An alternative I have thought of is for DirectIndex to pass a mock DocumentIndex object to the IterablePosting, one that returns the length of the document being read regardless of the id it is asked about.
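
A minimal sketch of that idea follows, assuming a posting that always delegates to whatever length lookup it was constructed with. The names (FixedDocumentIndexSketch, DocLengths, FixedDocLengths, and this simplified Posting) are hypothetical and do not mirror Terrier's actual interfaces.

```java
// Hypothetical, simplified stand-ins; these are NOT Terrier's real interfaces.
final class FixedDocumentIndexSketch {

    /** Stand-in for the part of a document index that getDocumentLength() relies on. */
    interface DocLengths {
        int getDocumentLength(int docid);
    }

    /** One posting class for both files: it always asks its lookup about getId(). */
    static final class Posting {
        private final int id;
        private final DocLengths lengths;

        Posting(int id, DocLengths lengths) {
            this.id = id;
            this.lengths = lengths;
        }

        int getId() { return id; }

        // No if(): the hot path is identical for inverted and direct postings.
        int getDocumentLength() { return lengths.getDocumentLength(id); }
    }

    /** Lookup handed out for direct-file reads: it ignores the id it is asked about. */
    static final class FixedDocLengths implements DocLengths {
        private final int length;

        FixedDocLengths(int length) { this.length = length; }

        public int getDocumentLength(int ignoredId) { return length; }
    }

    public static void main(String[] args) {
        int[] realLengths = {120, 45};
        DocLengths real = docid -> realLengths[docid];

        // Inverted-file posting for docid 1 uses the real lookup.
        Posting inverted = new Posting(1, real);
        System.out.println(inverted.getDocumentLength()); // prints 45

        // Direct-file postings of document 1: ids are termids, far outside the docid range.
        DocLengths fixed = new FixedDocLengths(real.getDocumentLength(1));
        Posting term = new Posting(1700, fixed);
        System.out.println(term.getDocumentLength()); // prints 45
    }
}
```

The posting code stays untouched and branch-free. Only the object that creates direct-file postings has to know which document is being read.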

Comment by Craig Macdonald [ 26/Jan/10 ]

Here's an initial patch:

Index: .
===================================================================
--- .	(revision 2790)
+++ .	(working copy)
@@ -19,6 +19,34 @@
  */
 public class BitPostingIndex implements PostingIndex<BitIndexPointer>
 {
+	static class DocidSpecificDocumentIndex implements FieldDocumentIndex
+	{
+		DocumentIndexEntry die;
+		DocumentIndex di;
+		
+		public DocidSpecificDocumentIndex(DocumentIndex _di, DocumentIndexEntry _die)
+		{
+			di = _di;
+			die = _die;
+		}
+		
+		public DocumentIndexEntry getDocumentEntry(int docid) throws IOException {
+			return die;
+		}
+
+		public int getDocumentLength(int docid) throws IOException {
+			return die.getDocumentLength();
+		}
+
+		public int getNumberOfDocuments() {
+			return di.getNumberOfDocuments();
+		}
+
+		public int[] getFieldLengths(int docid) throws IOException {
+			return ((FieldDocumentIndexEntry)die).getFieldLengths();
+		}
+	}
+	
 	protected BitInSeekable[] file;
 	protected Class<? extends IterablePosting> postingImplementation;
 	protected Index index = null;
@@ -74,6 +102,9 @@
 	{
 		final BitIn file = this.file[pointer.getFileNumber()].readReset(pointer.getOffset(), pointer.getOffsetBits());
 		IterablePosting rtr = null;
+		DocumentIndex fixedDi = pointer instanceof DocumentIndexEntry
+			? new DocidSpecificDocumentIndex(index.getDocumentIndex(), (DocumentIndexEntry)pointer)
+			: null;
 		try{
 			if (fieldCount > 0)
 				rtr = postingImplementation
@@ -78,11 +109,11 @@
 			if (fieldCount > 0)
 				rtr = postingImplementation
 					.getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class, Integer.TYPE)
-					.newInstance(file, pointer.getNumberOfEntries(), null, fieldCount);
+					.newInstance(file, pointer.getNumberOfEntries(), fixedDi, fieldCount);
 			else
 				rtr = postingImplementation
 					.getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class)
-					.newInstance(file, pointer.getNumberOfEntries(), null);
+					.newInstance(file, pointer.getNumberOfEntries(), fixedDi);
 		} catch (Exception e) {
 			throw new WrappedIOException(e);
 		}


Comment by Craig Macdonald [ 17/Feb/11 ]

Can you test the patch for all four index variants: normal, fields, blocks, and fields+blocks?

Comment by Craig Macdonald [ 17/Feb/11 ]

Tagging for 3.1

Comment by Rodrygo L. T. Santos [ 07/Mar/11 ]

Applied patch and updated TestIndexers accordingly. Passes all tests.

Comment by Rodrygo L. T. Santos [ 01/Apr/11 ]

Committed fix to input stream structures in addition to the standard direct index structure.
