Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 2.2.1
    • Component/s: None
    • Labels:
      None

      Description

      Terrier currently loads all postings for a given term into memory. This is very space in-efficient, and the need to find large amounts of memory can slow matters down. We should move to an iterable posting design, where the posting list is not fully decompressed at once.

        Attachments

          Issue Links

            Activity

            Hide
            craigm Craig Macdonald added a comment -

            Proposal:

            A single posting in an inverted index will be respresented by the following interface:

            public interface Posting 
            {
                /** Return the document id of the current posting */
                public int getDocId();
                /** Return the frequency of the term in the current document */
                public int getFrequency();
                //usually uses the DocumentIndex, may do otherwise if doclength in posting list 
                public int getDocumentLength();
            }
            

            Note that getDocumentLength() is attached the posting. The intention is that using the current Terrier data structures, this will be accessed from the DocumentIndex. However, for very large collections, the document index is expensive to keep in memory, and hence it would be beneficial to put document statistics into the posting list.

            An interface will be implemented when a Posting object is iterable, by calling the next() method. NB: java.util.Iterable will not be implemented, as we dont want to create a new Posing object for each posting in the posting list.

            public interface IterablePosting extends Closable, Posting
            {
                //stuff to move forward in a posting list or close it. Could separate this to a separate interface? */
                /** move to the next document.
                  * return false iff end of posting list has been met */
                public boolean next() throws IOException;
            }
            

            In the case where posting lists are sorted by docid and skipping tables are supported, the following extended interface will be supported:

            public interface SkippablePosting extends IterablePosting
            {
                /** move as far as desiredDocid. Stops as soon as getDocid() >= desiredDocid.
                  * Use getDocid() to determine what document was moved to. Only works on docid sorted
                  * postings lists.
                  * @return false iff end of posting list has been met */
                public boolean next(int desiredDocid) throws IOException;
            }
            

            Posting lists can also contain block (position) information. This will be represented as another extension of Posting:

            public interface BlockPosting extends Posting
            {
                public int[] getPositions();
            }
            

            In the future, for TR-13, field frequency information will be encapsulated in the inverted index. We will support this by a further extension on Posting with additional field statistics.

            How will these Posting classes be used? Well, currently a weighting model is only passed a frequency and a document length. To support future extensions, the WeightingModel would have the following interface:

            public abstract class WeightingModel
            {
                /* from the lexicon */
                public void setEntryStatistics(EntryStatistics ts);
                /* from the index */
                public void setCollectionStatistics(CollectionStatistics cs);
                /* do the actual scoring */
                public abstract double score(Posting p, double currentScore?);
                /* access to parameter settings */
                public void setRequest(Request q); //Request will give access to parameters
            }
            
            Show
            craigm Craig Macdonald added a comment - Proposal: A single posting in an inverted index will be respresented by the following interface: public interface Posting { /** Return the document id of the current posting */ public int getDocId(); /** Return the frequency of the term in the current document */ public int getFrequency(); //usually uses the DocumentIndex, may do otherwise if doclength in posting list public int getDocumentLength(); } Note that getDocumentLength() is attached the posting. The intention is that using the current Terrier data structures, this will be accessed from the DocumentIndex. However, for very large collections, the document index is expensive to keep in memory, and hence it would be beneficial to put document statistics into the posting list. An interface will be implemented when a Posting object is iterable, by calling the next() method. NB: java.util.Iterable will not be implemented, as we dont want to create a new Posing object for each posting in the posting list. public interface IterablePosting extends Closable, Posting { //stuff to move forward in a posting list or close it. Could separate this to a separate interface ? */ /** move to the next document. * return false iff end of posting list has been met */ public boolean next() throws IOException; } In the case where posting lists are sorted by docid and skipping tables are supported, the following extended interface will be supported: public interface SkippablePosting extends IterablePosting { /** move as far as desiredDocid. Stops as soon as getDocid() >= desiredDocid. * Use getDocid() to determine what document was moved to. Only works on docid sorted * postings lists. * @ return false iff end of posting list has been met */ public boolean next( int desiredDocid) throws IOException; } Posting lists can also contain block (position) information. This will be represented as another extension of Posting: public interface BlockPosting extends Posting { public int [] getPositions(); } In the future, for TR-13 , field frequency information will be encapsulated in the inverted index. We will support this by a further extension on Posting with additional field statistics. How will these Posting classes be used? Well, currently a weighting model is only passed a frequency and a document length. To support future extensions, the WeightingModel would have the following interface: public abstract class WeightingModel { /* from the lexicon */ public void setEntryStatistics(EntryStatistics ts); /* from the index */ public void setCollectionStatistics(CollectionStatistics cs); /* do the actual scoring */ public abstract double score(Posting p, double currentScore?); /* access to parameter settings */ public void setRequest(Request q); //Request will give access to parameters }
            Hide
            craigm Craig Macdonald added a comment -

            I'm thinking of adding a void prepare() method to WeightingModel, so that each model can do any pre-calculations on the background statistics that it is likely to do time and time again.

            Matching would then look like:

            wmodel.setCollectionStatistics(index.getCollectionStatistics());
            for (String term : query)
            {
              LexiconEntry le = lexicon.getLexiconEntry(term);
              if (le == null)
                continue;
              wmodel.setEntryStatistics(le);
              wmodel.prepare();
              IterablePosting p = invertedIndex.getPostings(le);
              do{
                score[p.getDocumentId()] += wmodel.score(p);
              } while (p.next());
            }
            

            Note:
            1. wmodel.prepare() is called after the statistics are prepared prior to scoring a posting list
            2. a do{}while loop is used as IterablePosting starts off with the first posting loaded.

            Show
            craigm Craig Macdonald added a comment - I'm thinking of adding a void prepare() method to WeightingModel, so that each model can do any pre-calculations on the background statistics that it is likely to do time and time again. Matching would then look like: wmodel.setCollectionStatistics(index.getCollectionStatistics()); for ( String term : query) { LexiconEntry le = lexicon.getLexiconEntry(term); if (le == null ) continue ; wmodel.setEntryStatistics(le); wmodel.prepare(); IterablePosting p = invertedIndex.getPostings(le); do { score[p.getDocumentId()] += wmodel.score(p); } while (p.next()); } Note: 1. wmodel.prepare() is called after the statistics are prepared prior to scoring a posting list 2. a do{}while loop is used as IterablePosting starts off with the first posting loaded.
            Hide
            craigm Craig Macdonald added a comment -

            Work on this patch is progressing. Work is currently in related changes to Matching, in particular in relation to the TermScoreModifiers. TSMs are permitted to alter the score of documents as they are scored for a given term. There are currently three TSMs:

            • RequiredTermModifier just checks that a term actually exists in a document
            • TermInFieldModifier checks that it exists in a given field
            • FieldScoreModifier is a primitive field weighting model

            As fields are being rewritten as part of TR-13, then FieldScoreModifier will be removed. TermInFieldModifier will be dropped for the time being. This leaves RequiredTermModifier.

            I am currently considering whether TSMs can be implemented as WeightingModels. Do they actually need the score to alter? If they do, then WeightingModel has to look like:

             score = wmodel.score(p, score); 
            Show
            craigm Craig Macdonald added a comment - Work on this patch is progressing. Work is currently in related changes to Matching, in particular in relation to the TermScoreModifiers. TSMs are permitted to alter the score of documents as they are scored for a given term. There are currently three TSMs: RequiredTermModifier just checks that a term actually exists in a document TermInFieldModifier checks that it exists in a given field FieldScoreModifier is a primitive field weighting model As fields are being rewritten as part of TR-13 , then FieldScoreModifier will be removed. TermInFieldModifier will be dropped for the time being. This leaves RequiredTermModifier. I am currently considering whether TSMs can be implemented as WeightingModels. Do they actually need the score to alter? If they do, then WeightingModel has to look like: score = wmodel.score(p, score);
            Hide
            craigm Craig Macdonald added a comment -

            First version of a patch for this issue.

            Notes:

            • TermScoreModifiers are not used (and will be removed shortly).
            • WeightingModels are added for specific query terms into MatchingQueryTerms. Matching then asks MatchingQueryTerms which WeightingModels it should apply for a given query term.
            • All end-to-end tests pass, except those involving fields. As noted above, the fields implementation in Terrier will be evolving in TR-13, which will be directly after this patch. This patch has some of the foot-work for TR-13 - e.g. the FieldPosting interface is in place.

            Questions:

            1. Some more work requires to be done wrt to Manager, WeightingModels and MatchingQueryTerms. When are they instantiated etc, when should a term not use the default WeightingModel etc.
            2. We need to decide what we use: WeightingModel or Model. Either only one is required, or they have separate purposes, which is not clear from their specification or documentation. Also, how does WeightingModel and Model bear in QueryExpansion?
            3. If we move BasicIterablePosting and BlockIterablePosting out of InvertedIndex and BlockInvertedIndex, then InvertedIndex could be made to take a posting class implementation. However, if we are maintaining the existing getDocuments() methods for InvertedIndex, then this will not be possible. Discuss
            4. Relationships with other classes: DirectIndex, DirectIndexInputStream & InvertedIndexInputStreams need careful consideration. Should they all use the same posting interface?

            Efficiency-based testing still required.

            Show
            craigm Craig Macdonald added a comment - First version of a patch for this issue. Notes: TermScoreModifiers are not used (and will be removed shortly). WeightingModels are added for specific query terms into MatchingQueryTerms. Matching then asks MatchingQueryTerms which WeightingModels it should apply for a given query term. All end-to-end tests pass, except those involving fields. As noted above, the fields implementation in Terrier will be evolving in TR-13 , which will be directly after this patch. This patch has some of the foot-work for TR-13 - e.g. the FieldPosting interface is in place. Questions: Some more work requires to be done wrt to Manager, WeightingModels and MatchingQueryTerms. When are they instantiated etc, when should a term not use the default WeightingModel etc. We need to decide what we use: WeightingModel or Model. Either only one is required, or they have separate purposes, which is not clear from their specification or documentation. Also, how does WeightingModel and Model bear in QueryExpansion? If we move BasicIterablePosting and BlockIterablePosting out of InvertedIndex and BlockInvertedIndex, then InvertedIndex could be made to take a posting class implementation. However, if we are maintaining the existing getDocuments() methods for InvertedIndex, then this will not be possible. Discuss Relationships with other classes: DirectIndex, DirectIndexInputStream & InvertedIndexInputStreams need careful consideration. Should they all use the same posting interface? Efficiency-based testing still required.
            Hide
            craigm Craig Macdonald added a comment -

            Update on progress. Work on this patch integrated several other issues, such as TR-12 & TR-17. However, on efficiency testing, I have found that my working patch is noticeably slower than Terrier 2.x.

            Using some profiling, I have identified that the problem is with my DocumentIndex. All statistical information about a document is wrapped in DocumentIndexEntry, which contains: document length (int), number of terms (int), direct index offset (long & byte).

            Posting uses the DocumentIndex to satisfy the getDocumentLength() method as thus:

            public int getDocumentLength() {
              return doi.getDocumentEntry(id).getDocumentLength();
            }
            

            The issue is that DocumentIndexEntry.readFields() has to read all information, even though all it needs is one int. Note that there is no disk IO, only decoding bytes to ints from a byte array in memory.

            I'm not sure what to do here. (a) I liked DocumentIndex using an abstract DocumentIndex entry objects, however this appears too slow. It also allows generality for the fields setting (b) The direct index offsets could go to a different index structure, as they could also be kept on disk.

            Show
            craigm Craig Macdonald added a comment - Update on progress. Work on this patch integrated several other issues, such as TR-12 & TR-17 . However, on efficiency testing, I have found that my working patch is noticeably slower than Terrier 2.x. Using some profiling, I have identified that the problem is with my DocumentIndex. All statistical information about a document is wrapped in DocumentIndexEntry, which contains: document length (int), number of terms (int), direct index offset (long & byte). Posting uses the DocumentIndex to satisfy the getDocumentLength() method as thus: public int getDocumentLength() { return doi.getDocumentEntry(id).getDocumentLength(); } The issue is that DocumentIndexEntry.readFields() has to read all information, even though all it needs is one int. Note that there is no disk IO, only decoding bytes to ints from a byte array in memory. I'm not sure what to do here. (a) I liked DocumentIndex using an abstract DocumentIndex entry objects, however this appears too slow. It also allows generality for the fields setting (b) The direct index offsets could go to a different index structure, as they could also be kept on disk.

              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                craigm Craig Macdonald
              • Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: