Uploaded image for project: 'Terrier Core'
  1. Terrier Core
  2. TR-129

Posting.getDocumentLength() does not work for postings from the direct file

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.5
    • Component/s: .structures
    • Labels:
      None

      Description

      The following code raises an error when WeightingModel.score(Posting) is called, as a posting retrieved from the direct file apparently does not encapsulate the document length appropriately.

      WeightingModel wm = new BM25();

      DocumentIndex document = index.getDirectIndex();
      DocumentIndexEntry de = document.getDocumentEntry(docid);
      IterablePosting ip = direct.getPostings(de);

      double score = 0;
      while (ip.next()) {
          score += wm.score(ip);
      }

        Attachments

          Activity

          Hide
          rodrygo Rodrygo L. T. Santos added a comment -

          Committed fix to input stream structures in addition to the standard direct index structure.

          Show
          rodrygo Rodrygo L. T. Santos added a comment - Committed fix to input stream structures in addition to the standard direct index structure.
          Hide
          rodrygo Rodrygo L. T. Santos added a comment -

          Applied patch and updated TestIndexers accordingly. Passes all tests.

          Show
          rodrygo Rodrygo L. T. Santos added a comment - Applied patch and updated TestIndexers accordingly. Passes all tests.
          Hide
          craigm Craig Macdonald added a comment -

          Tagging for 3.1

          Show
          craigm Craig Macdonald added a comment - Tagging for 3.1
          Hide
          craigm Craig Macdonald added a comment -

          can you test the patch, for all four variants of index: normal, field, blocks, fields+blocks

          Show
          craigm Craig Macdonald added a comment - can you test the patch, for all four variants of index: normal, field, blocks, fields+blocks
          Hide
          craigm Craig Macdonald added a comment -

          Here's an initial patch:

          Index: .
          ===================================================================
          --- .	(revision 2790)
          +++ .	(working copy)
          @@ -19,6 +19,34 @@
            */
           public class BitPostingIndex implements PostingIndex<BitIndexPointer>
           {
          +	static class DocidSpecificDocumentIndex implements FieldDocumentIndex
          +	{
          +		DocumentIndexEntry die;
          +		DocumentIndex di;
          +		
          +		public DocidSpecificDocumentIndex(DocumentIndex _di, DocumentIndexEntry _die)
          +		{
          +			di = _di;
          +			die = _die;
          +		}
          +		
          +		public DocumentIndexEntry getDocumentEntry(int docid) throws IOException {
          +			return die;
          +		}
          +
          +		public int getDocumentLength(int docid) throws IOException {
          +			return die.getDocumentLength();
          +		}
          +
          +		public int getNumberOfDocuments() {
          +			return di.getNumberOfDocuments();
          +		}
          +
          +		public int[] getFieldLengths(int docid) throws IOException {
          +			return ((FieldDocumentIndexEntry)die).getFieldLengths();
          +		}
          +	}
          +	
           	protected BitInSeekable[] file;
           	protected Class<? extends IterablePosting> postingImplementation;
           	protected Index index = null;
          @@ -74,6 +102,9 @@
           	{
           		final BitIn file = this.file[pointer.getFileNumber()].readReset(pointer.getOffset(), pointer.getOffsetBits());
           		IterablePosting rtr = null;
          +		DocumentIndex fixedDi = pointer instanceof DocumentIndexEntry
          +			? new DocidSpecificDocumentIndex(index.getDocumentIndex(), (DocumentIndexEntry)pointer)
          +			: null;
           		try{
           			if (fieldCount > 0)
           				rtr = postingImplementation
          @@ -78,11 +109,11 @@
           			if (fieldCount > 0)
           				rtr = postingImplementation
           					.getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class, Integer.TYPE)
          -					.newInstance(file, pointer.getNumberOfEntries(), null, fieldCount);
          +					.newInstance(file, pointer.getNumberOfEntries(), fixedDi, fieldCount);
           			else
           				rtr = postingImplementation
           					.getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class)
          -					.newInstance(file, pointer.getNumberOfEntries(), null);
          +					.newInstance(file, pointer.getNumberOfEntries(), fixedDi);
           		} catch (Exception e) {
           			throw new WrappedIOException(e);
           		}
          
          
          
          Show
          craigm Craig Macdonald added a comment - Here's an initial patch: Index: . =================================================================== --- . (revision 2790) +++ . (working copy) @@ -19,6 +19,34 @@ */ public class BitPostingIndex implements PostingIndex<BitIndexPointer> { + static class DocidSpecificDocumentIndex implements FieldDocumentIndex + { + DocumentIndexEntry die; + DocumentIndex di; + + public DocidSpecificDocumentIndex(DocumentIndex _di, DocumentIndexEntry _die) + { + di = _di; + die = _die; + } + + public DocumentIndexEntry getDocumentEntry(int docid) throws IOException { + return die; + } + + public int getDocumentLength(int docid) throws IOException { + return die.getDocumentLength(); + } + + public int getNumberOfDocuments() { + return di.getNumberOfDocuments(); + } + + public int[] getFieldLengths(int docid) throws IOException { + return ((FieldDocumentIndexEntry)die).getFieldLengths(); + } + } + protected BitInSeekable[] file; protected Class<? extends IterablePosting> postingImplementation; protected Index index = null; @@ -74,6 +102,9 @@ { final BitIn file = this.file[pointer.getFileNumber()].readReset(pointer.getOffset(), pointer.getOffsetBits()); IterablePosting rtr = null; + DocumentIndex fixedDi = pointer instanceof DocumentIndexEntry + ? new DocidSpecificDocumentIndex(index.getDocumentIndex(), (DocumentIndexEntry)pointer) + : null; try{ if (fieldCount > 0) rtr = postingImplementation @@ -78,11 +109,11 @@ if (fieldCount > 0) rtr = postingImplementation .getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class, Integer.TYPE) - .newInstance(file, pointer.getNumberOfEntries(), null, fieldCount); + .newInstance(file, pointer.getNumberOfEntries(), fixedDi, fieldCount); else rtr = postingImplementation .getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class) - .newInstance(file, pointer.getNumberOfEntries(), null); + .newInstance(file, pointer.getNumberOfEntries(), fixedDi); } catch (Exception e) { throw new WrappedIOException(e); }
          Hide
          craigm Craig Macdonald added a comment -

          An alternative I have thought of is that DirectIndex passes a mockup DocumentIndex object to the IterablePosting that does the correct thing.

          Show
          craigm Craig Macdonald added a comment - An alternative I have thought of is that DirectIndex passes a mockup DocumentIndex object to the IterablePosting that does the correct thing.
          Hide
          craigm Craig Macdonald added a comment -

          Indeed, this is an issue. The problem with IterablePostings.getDocumentLength() arises from two conflicting requirements:

          • This method is time critical when scoring postings during normal retrieval
          • This method has different semantics for DirectIndex and InvertedIndex lookups.

          In particular, the InvertedIndex case should look like:

          public int getDocumentLength()
          {
           return docIndex.getDocumentLength(this.getId());
          }
          

          For the DirectIndex, a call to getDocumentLength() should return a constant document length for every term in the postings of a given document, i.e. it is not related to getId(). Indeed, the IterablePosting may already have the document length as part of it's pointer.

          public int getDocumentLength()
          {
           return this.pointer.getDocumentLength();
          }
          

          I cant see an elegant solution to address this, apart from sub-classing various Posting implementations. Putting an if() into the current implementation may have a marked effectiveness impact.

          Show
          craigm Craig Macdonald added a comment - Indeed, this is an issue. The problem with IterablePostings.getDocumentLength() arises from two conflicting requirements: This method is time critical when scoring postings during normal retrieval This method has different semantics for DirectIndex and InvertedIndex lookups. In particular, the InvertedIndex case should look like: public int getDocumentLength() { return docIndex.getDocumentLength( this .getId()); } For the DirectIndex, a call to getDocumentLength() should return a constant document length for every term in the postings of a given document, i.e. it is not related to getId(). Indeed, the IterablePosting may already have the document length as part of it's pointer. public int getDocumentLength() { return this .pointer.getDocumentLength(); } I cant see an elegant solution to address this, apart from sub-classing various Posting implementations. Putting an if() into the current implementation may have a marked effectiveness impact.

            People

            • Assignee:
              rodrygo Rodrygo L. T. Santos
              Reporter:
              rodrygo Rodrygo L. T. Santos
            • Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: