Terrier Core / TR-146

Tokenisation should be done separately from Document parsing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5
    • Component/s: .indexing, tests
    • Labels:
      None

      Description

      Various Document objects perform hard-coded tokenisations for different languages. For instance, TRECUTFCollection$UTFDocument.

      It would be desirable to separate tokenisation from the rest of the document parsing, which deals with other tasks such as HTML parsing, metadata collection, etc.

            Activity

            craigm Craig Macdonald added a comment -

            Much hard work and perseverance by Rodrygo and Craig, and this is now done. Tokenisation occurs in the indexing.tokenisation package. Tokeniser defaults to EnglishTokeniser, but can be controlled by property tokeniser=.
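As a reference for users, a minimal sketch of how this property might be set in a Terrier properties file. The property name tokeniser and the EnglishTokeniser default are stated above; the exact file layout shown here is assumed:

```properties
# Select the tokeniser implementation (EnglishTokeniser is the default).
# Substitute another Tokeniser class name from indexing.tokenisation
# to change tokenisation behaviour.
tokeniser=EnglishTokeniser
```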

            craigm Craig Macdonald added a comment -

            I'm considering whether we should change Tokeniser to be abstract. Moreover, the current interface can't be directly called for an entire (possibly lengthy) document. This wouldn't happen for HTMLDocument (which will pass one tag at a time), but it might happen for FileDocument classes.

            Such an interface would look like:

            interface TokenStream extends Iterator<String>
            {}

            abstract class Tokeniser
            {
             public abstract TokenStream tokeniserStream(Reader text);

             public String[] tokenise(Reader text)
             {
              List<String> terms = new ArrayList<String>();
              TokenStream i = this.tokeniserStream(text);
              while(i.hasNext())
              {
               String t = i.next();
               if (t != null)
                terms.add(t);
              }
              return terms.toArray(new String[terms.size()]);
             }
            }

            TokenStream would need some way to implement hasNext(), or be allowed to return null after saying hasNext().
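One way to resolve the hasNext() question is the usual one-token lookahead: the stream buffers the next token when constructed, so hasNext() is answered definitively and callers never see null. A self-contained sketch under that assumption (the WhitespaceTokenStream class and its splitting rule are hypothetical, for illustration only):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;

interface TokenStream extends Iterator<String> {}

// Hypothetical TokenStream that splits on whitespace, buffering one
// token ahead so hasNext() can be answered without returning nulls.
class WhitespaceTokenStream implements TokenStream {
    private final Reader reader;
    private String lookahead;

    WhitespaceTokenStream(Reader reader) {
        this.reader = reader;
        this.lookahead = readToken();
    }

    // Reads the next whitespace-delimited token, or null at end of input.
    private String readToken() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) break; // token complete
                } else {
                    sb.append((char) c);
                }
            }
            return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public boolean hasNext() { return lookahead != null; }

    public String next() {
        if (lookahead == null) throw new NoSuchElementException();
        String t = lookahead;
        lookahead = readToken();
        return t;
    }
}
```

With lookahead, allowing next() to return null after hasNext() becomes unnecessary, at the cost of reading one token eagerly.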

            craigm Craig Macdonald added a comment -

            The interface defined by Rodrygo and myself earlier today was

            interface Tokeniser
            {
             public String[] tokenise(Reader text);
            }
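A minimal, self-contained sketch of what an implementation of this interface might look like. The SimpleEnglishTokeniser class and its splitting rule (lowercase, split on non-alphanumerics) are assumptions for illustration, not the actual EnglishTokeniser shipped in the package:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

interface Tokeniser {
    public String[] tokenise(Reader text);
}

// Hypothetical implementation: lowercases input and splits tokens
// on any character that is not a letter or digit.
class SimpleEnglishTokeniser implements Tokeniser {
    public String[] tokenise(Reader text) {
        List<String> terms = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = text.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    current.append(Character.toLowerCase((char) c));
                } else if (current.length() > 0) {
                    terms.add(current.toString()); // token boundary
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        if (current.length() > 0) terms.add(current.toString());
        return terms.toArray(new String[terms.size()]);
    }
}
```

Taking a Reader rather than a String keeps the interface usable for lengthy documents, which is the concern raised above.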
            

              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                rodrygo Rodrygo L. T. Santos
              • Watchers:
                1
