  1. Terrier Core
  2. TR-10

Term Pipeline only supports token events

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The TermPipeline introduced in Terrier 1 allows tokens to be transformed during indexing, e.g. by stemming, stopword removal, and more. The design has proved useful, but it has some limitations. In particular, TermPipeline objects may require access to state information.

      For instance, consider the following examples, which require state:
       * POS tagger: needs to know when a sentence boundary occurs and when the document ends. It also needs to decorate the tokens with POS tags somehow
       * Language-specific stemming: needs to know when the language of a document (or query) stream has changed

      To this end, there are in fact two problems:
      1. Access to events other than tokens
      2. Access to state associated with events: e.g. a document boundary has a document name, a token may have a position, and/or fields
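
As an illustration of the two problems above, here is a minimal Java sketch of a pipeline that carries events other than tokens, with state attached to each event. All type names (PipelineEvent, TokenEvent, SentenceBoundaryEvent, DocumentBoundaryEvent, EventPipeline, SentenceLengthCounter) are hypothetical and are not part of Terrier; this is one possible shape under those assumptions, not a proposed design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event types; none of these names come from Terrier.
interface PipelineEvent {}

final class TokenEvent implements PipelineEvent {
    final String term;
    final int position;      // state attached to the token
    TokenEvent(String term, int position) { this.term = term; this.position = position; }
}

final class SentenceBoundaryEvent implements PipelineEvent {}

final class DocumentBoundaryEvent implements PipelineEvent {
    final String docName;    // state attached to the boundary
    DocumentBoundaryEvent(String docName) { this.docName = docName; }
}

// A pipeline stage that receives every event, not only tokens.
interface EventPipeline {
    void processEvent(PipelineEvent e);
}

// Example stage: counts tokens per sentence, which requires boundary events.
final class SentenceLengthCounter implements EventPipeline {
    final List<Integer> sentenceLengths = new ArrayList<>();
    private int current = 0;

    @Override
    public void processEvent(PipelineEvent e) {
        if (e instanceof TokenEvent) {
            current++;
        } else if (e instanceof SentenceBoundaryEvent || e instanceof DocumentBoundaryEvent) {
            // A boundary closes the current sentence, if it has any tokens.
            if (current > 0) sentenceLengths.add(current);
            current = 0;
        }
    }
}
```

A stage such as a POS tagger would use the same boundary events to decide when to tag a completed sentence.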

            Activity

            juanito Giovanni Stilo added a comment -

            I think it should be something more flexible, such as:

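            interface EventProducer
            {
                // Returns the next event, or null when there are no more events.
                Event produceNextEvent();
            }

            In Indexer:
            {
                Pipeline p = new Pipeline();

                for (Document d : collection) {
                    if (d instanceof EventProducer) {
                        EventProducer producer = (EventProducer) d;
                        // One context per document, holding the document and its events.
                        Context c = new Context();
                        c.addDocument(d);
                        Event e;
                        while ((e = producer.produceNextEvent()) != null) {
                            c.addEvent(e);
                        }
                        p.process(c);
                    }
                }
            }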
            

            You can then choose how to manage the pipeline, but that is a separate problem.
            The code above is just an example to give an idea; it is not meant to be the definitive design.
            At this point I don't have THE SOLUTION; I'm just brainstorming, for my own benefit as well.

            craigm Craig Macdonald added a comment -

            Advantage: So the context object adds the ability for a given pipeline phase to look forward and backward in the pipe?

            I'm worried that this will increase memory requirements, as all of a document then has to be in memory (e.g. 3 objects for each of 100,000 tokens). This is a higher memory requirement than at present, where we are only incrementing counters for each term (cf. DocumentPostingList).

            Moreover, the event pipeline can already look forwards and backwards by buffering events. I have implementations which do this already.
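
The buffering approach mentioned above could look roughly like the following Java sketch, in which a stage holds back a single term so that it can inspect the next one before forwarding. The interface TermStage, its flush() hook, and both classes are hypothetical names chosen for illustration; they are not Terrier's API, and the implementations referred to in the comment are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token-only stage interface, loosely modelled on a term pipeline.
interface TermStage {
    void processTerm(String term);
    void flush();   // assumed end-of-document hook
}

// Terminal stage that records what it receives.
final class CollectingStage implements TermStage {
    final List<String> terms = new ArrayList<>();
    @Override public void processTerm(String term) { terms.add(term); }
    @Override public void flush() {}
}

// Holds back one term so it can inspect its successor before forwarding.
final class CollapseRepeatsStage implements TermStage {
    private final TermStage next;
    private String held;   // the single buffered term, or null
    CollapseRepeatsStage(TermStage next) { this.next = next; }

    @Override public void processTerm(String term) {
        if (held != null && !held.equals(term)) {
            next.processTerm(held);   // successor differs, so release the held term
        }
        held = term;                  // a repeated term simply replaces the held one
    }

    @Override public void flush() {
        if (held != null) next.processTerm(held);   // release the last buffered term
        held = null;
        next.flush();
    }
}
```

The buffer here is a single term, so memory use stays constant regardless of document length; a stage needing a wider window would buffer more.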

            juanito Giovanni Stilo added a comment -

            Yes.
            You would probably have to reuse objects, so you don't need to have everything in memory,
            especially if you consider one context for each document.
            Then you can think of the context as a kind of buffering strategy.
            In the end, I think you could still use Terrier as it is; why do you need to change it?

            craigm Craig Macdonald added a comment -

            > Yes.
            > You would probably have to reuse objects, so you don't need to have everything in memory,
            > especially if you consider one context for each document.

            I'm unclear here - are you suggesting that Context could swap events to disk for very large documents?

            > In the end, I think you could still use Terrier as it is; why do you need to change it?

            I like the Terrier model at present, but it does need to evolve. I think that much is clear, both from Gianni's and my presentations in Rome and from the motivations in the original post for this issue. Any use of the current model to address the existing problem results in non-standard code, whereas with careful thought we could have an improved model and easy code reuse between applications.

            I'm trying to pursue one of two evolutions to the current model, rather than a revolution. However, it's good to discuss such changes to make sure we are evolving in the correct manner.

            craigm Craig Macdonald added a comment -

            For the time being, TR-106 deals with the most salient point of this, the reset().
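
To illustrate why a reset() matters for stateful stages, here is a minimal Java sketch. It only assumes that reset() is called between documents; the interface ResettableStage, its boolean return value, and the PositionTracker class are hypothetical and do not reproduce what TR-106 actually implements.

```java
// Hypothetical interface: the thread only says that TR-106 concerns a reset().
interface ResettableStage {
    void processTerm(String term);
    boolean reset();   // assumed to be called between documents; true on success
}

// Example stateful stage: tags each term with its position in the document.
final class PositionTracker implements ResettableStage {
    private int position = 0;
    private String last = null;

    @Override public void processTerm(String term) {
        last = position + ":" + term;
        position++;
    }

    // Without this, positions would keep counting across document boundaries.
    @Override public boolean reset() {
        position = 0;
        last = null;
        return true;
    }

    String lastTagged() { return last; }
}
```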


              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                craigm Craig Macdonald
              • Watchers:
                4
