  1. Terrier Core
  2. TR-108

Some indexers do not set the IterablePosting class for the DirectIndex

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.5
    • Component/s: .structures
    • Labels:
      None

      Description

      org.terrier.structures.BitPostingIndex.getPostings(BitIndexPointer) cannot instantiate the postingImplementation when fields are used. The error occurs at line 105 and following:

      {code}
      if (fieldCount > 0)
          rtr = postingImplementation
              .getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class, Integer.TYPE)
              .newInstance(file, pointer.getNumberOfEntries(), null, fieldCount);
      else
          rtr = postingImplementation
              .getConstructor(BitIn.class, Integer.TYPE, DocumentIndex.class)
              .newInstance(file, pointer.getNumberOfEntries(), null);
      {code}
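The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming the problem is that the default posting class offers no constructor with the extra field-count argument, so the reflective lookup throws NoSuchMethodException. All class and method names here (ConstructorLookupDemo, PlainPosting, FieldPosting, canInstantiate) are hypothetical stand-ins, not Terrier classes.

```java
// Hypothetical stand-ins for the Terrier posting classes; names and
// signatures are illustrative only and are not part of the Terrier API.
class ConstructorLookupDemo {

    /** Posting without field support: only a (String, int) constructor. */
    public static class PlainPosting {
        final int entries;

        public PlainPosting(String file, int entries) {
            this.entries = entries;
        }
    }

    /** Posting with field support: adds a (String, int, int) constructor. */
    public static class FieldPosting extends PlainPosting {
        final int fieldCount;

        public FieldPosting(String file, int entries) {
            this(file, entries, 0);
        }

        public FieldPosting(String file, int entries, int fieldCount) {
            super(file, entries);
            this.fieldCount = fieldCount;
        }
    }

    /**
     * Mirrors the branching of the snippet above: with fields, the
     * constructor taking the extra int is looked up reflectively.
     * Returns false when the class does not declare that constructor.
     */
    static boolean canInstantiate(Class<? extends PlainPosting> impl, int fieldCount) {
        try {
            PlainPosting posting;
            if (fieldCount > 0) {
                posting = impl
                    .getConstructor(String.class, Integer.TYPE, Integer.TYPE)
                    .newInstance("file", 1, fieldCount);
            } else {
                posting = impl
                    .getConstructor(String.class, Integer.TYPE)
                    .newInstance("file", 1);
            }
            return posting != null;
        } catch (NoSuchMethodException e) {
            // The requested constructor signature does not exist on impl.
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(PlainPosting.class, 0));
        System.out.println(canInstantiate(PlainPosting.class, 2));
        System.out.println(canInstantiate(FieldPosting.class, 2));
    }
}
```

In this sketch, only the combination of a field count above zero and a class lacking the wider constructor fails, which matches the symptom described above.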


      The reason is that this constructor is called by default, so a BasicIterablePosting is always used, even when the index has fields:

      {code}
      public DirectIndex(Index index, String structureName) throws IOException {
          super(index, structureName, BasicIterablePosting.class);
          docIndex = index.getDocumentIndex();
      }
      {code}

      The solution is to adjust the code as follows:

      {code}
      public DirectIndex(Index index, String structureName) throws IOException {
          super(index, structureName, index.getIntIndexProperty("index.direct.fields.count", 0) > 0 ?
              FieldIterablePosting.class : BasicIterablePosting.class);
          docIndex = index.getDocumentIndex();
      }
      {code}
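The selection logic in the proposed fix can be tried out on its own. This is a minimal sketch, assuming the property lookup behaves like an integer read with a default; java.util.Properties stands in for the index properties, and PostingClassSelector, BasicPosting, FieldPosting, getIntProperty and select are hypothetical names, not Terrier classes or methods.

```java
import java.util.Properties;

// Hypothetical sketch of choosing the posting class from an index property.
// java.util.Properties stands in for the Terrier index properties.
class PostingClassSelector {

    /** Stand-in for the posting class used when no fields are indexed. */
    public static class BasicPosting {}

    /** Stand-in for the posting class used when fields are indexed. */
    public static class FieldPosting extends BasicPosting {}

    /** Reads an int property, falling back to defaultValue if absent or malformed. */
    static int getIntProperty(Properties props, String key, int defaultValue) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    /** Picks the field-aware class only when the field count is positive. */
    static Class<? extends BasicPosting> select(Properties props) {
        return getIntProperty(props, "index.direct.fields.count", 0) > 0
            ? FieldPosting.class
            : BasicPosting.class;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(select(props).getSimpleName());
        props.setProperty("index.direct.fields.count", "2");
        System.out.println(select(props).getSimpleName());
    }
}
```

Keeping the decision in one helper would also avoid repeating the ternary in every constructor that needs it, though whether that suits the Terrier code base is a design choice for the maintainers.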

      The same goes for org.terrier.structures.DirectIndexInputStream.DirectIndexInputStream(Index, String):

      {code}
      public DirectIndexInputStream(Index index, String structureName) throws IOException
      {
          super(index, structureName, (Iterator<DocumentIndexEntry>)index.getIndexStructureInputStream("document"),
              index.getIntIndexProperty("index.direct.fields.count", 0) > 0 ?
                  FieldIterablePosting.class : BasicIterablePosting.class);
      }
      {code}

      I didn't check if this problem is also present in additional parts of the code.

          Activity

          rec Richard Eckart de Castilho added a comment -

          Variant of BasicIndexer which allows indexing one term at a time. Feel free to integrate it if you like it.

          craigm Craig Macdonald added a comment -

          It's just an interface on the existing code, not incremental indexing directly. However, if you have the relevant code, then we can wrap it up in the same interface.

          Can you file a new issue for the last comment re Hadoop please?

          rec Richard Eckart de Castilho added a comment -

          Just one more comment. It may become an issue for us that Hadoop is so deeply integrated into Terrier. I set up Terrier as a Maven project here and hoped to
          be able to make Hadoop an optional dependency, mainly for two reasons: a) to reduce dependencies for non-Hadoop projects and b) to be able to run Terrier
          on another version of Hadoop, that is, not using Terrier's Hadoop capabilities, but embedding it as part of another Hadoop program. Maybe it would be possible
          to use adapter classes to adapt Terrier classes to Hadoop instead of having Terrier classes directly inherit from Hadoop. That is, making Hadoop
          really an optional feature on top of Terrier.

          rec Richard Eckart de Castilho added a comment -

          That should be fine I think. And if that enables us to incrementally build block indexes, that would be great.

          craigm Craig Macdonald added a comment -

          My proposed refactor would let you write code similar to the following:

          IndexWriter iw = //choose some appropriate implementation
          
          //create a document posting list for a document
          DocumentPostingList d = new DocumentPostingList();
          d.indexTerm("term", fieldid);
          
          //add document to index
          iw.addDocument(d);
          
          //finish all structures
          iw.close();
          

          Would this be OK by you?

          I know all about ApplicationSetup. I have plans around this, but it takes time! We thought it best to release what we had at present.


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              rec Richard Eckart de Castilho