Terrier Core / TR-146

Tokenisation should be done separately from Document parsing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5
    • Component/s: .indexing, tests
    • Labels:
      None

      Description

      Various Document objects perform hard-coded tokenisations for different languages. For instance, TRECUTFCollection$UTFDocument.

      It would be desirable to separate tokenisation from the rest of the document parsing, which deals with other tasks such as HTML parsing, metadata collection, etc.
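The separation being requested can be sketched as a Document that delegates term extraction to a pluggable tokeniser, instead of embedding language-specific rules in each Document subclass. All class names below are illustrative, not the actual Terrier classes:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative only: the tokenisation rules live behind this interface,
// so a Document never hard-codes them.
interface SimpleTokeniser {
    String[] tokenise(Reader text);
}

// A trivial whitespace tokeniser, standing in for a real one.
class WhitespaceSplit implements SimpleTokeniser {
    public String[] tokenise(Reader text) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = text.read()) != -1)
                sb.append((char) c);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        String s = sb.toString().trim();
        return s.isEmpty() ? new String[0] : s.split("\\s+");
    }
}

// The document handles parsing and metadata; tokenisation is supplied.
class SimpleDocument {
    private final String[] terms;
    private int pos = 0;

    SimpleDocument(Reader text, SimpleTokeniser tokeniser) {
        // HTML parsing, metadata collection etc. would happen here;
        // term extraction is the tokeniser's job alone.
        this.terms = tokeniser.tokenise(text);
    }

    /** Returns the next term, or null when the document is exhausted. */
    String getNextTerm() {
        return pos < terms.length ? terms[pos++] : null;
    }
}
```

Swapping the tokeniser (e.g. for a different language) then requires no change to any Document class.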

        Attachments

          Issue Links

            Activity

            rodrygo Rodrygo L. T. Santos created issue -
            rodrygo Rodrygo L. T. Santos made changes -
            Field Original Value New Value
            Link This issue is related to TREC-224 [ TREC-224 ]
            rodrygo Rodrygo L. T. Santos made changes -
            Link This issue blocks TREC-220 [ TREC-220 ]
            rodrygo Rodrygo L. T. Santos made changes -
            Link This issue blocks TREC-200 [ TREC-200 ]
            craigm Craig Macdonald made changes -
            Link This issue blocks TREC-228 [ TREC-228 ]
            craigm Craig Macdonald made changes -
            Link This issue blocks TREC-227 [ TREC-227 ]
            craigm Craig Macdonald added a comment -

            The interface defined by Rodrygo and myself earlier today was

            interface Tokeniser
            {
             public String[] tokenise(Reader text);
            }
            
            craigm Craig Macdonald added a comment -

            I'm considering whether we should change Tokeniser to be abstract. Moreover, the current interface can't be directly called for an entire (possibly lengthy) document. This wouldn't happen for HTMLDocument (which will pass one tag at a time), but it might happen for FileDocument classes.

            Such a design would look like:

            interface TokenStream extends Iterator<String>
            {}
            
            abstract class Tokeniser
            {
             public abstract TokenStream tokeniserStream(Reader text);
            
             public String[] tokenise(Reader text)
             {
              List<String> terms = new ArrayList<String>();
              TokenStream i = this.tokeniserStream(text);
              while (i.hasNext())
              {
                String t = i.next();
                if (t != null)
                 terms.add(t);
              }
              return terms.toArray(new String[terms.size()]);
             }
            }
            
            

            TokenStream would need some way to implement hasNext(), or be allowed to return null after saying hasNext().

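One way to meet the hasNext() requirement raised in the comment above is one-token lookahead: hasNext() reads ahead and caches the next token, so next() never has to return null. The sketch below is illustrative, not the committed implementation; the whitespace-splitting rule stands in for real tokenisation logic:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;

interface TokenStream extends Iterator<String> {}

// Illustrative tokeniser whose stream uses one-token lookahead, so
// hasNext() can be answered without the caller consuming input.
class WhitespaceTokeniser {
    public TokenStream tokeniserStream(final Reader text) {
        return new TokenStream() {
            private String pending;   // token read ahead, not yet consumed
            private boolean eof;

            public boolean hasNext() {
                if (pending != null) return true;
                if (eof) return false;
                pending = readToken();
                return pending != null;
            }

            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                String t = pending;
                pending = null;
                return t;
            }

            public void remove() { throw new UnsupportedOperationException(); }

            // Reads the next whitespace-delimited token, or null at end of input.
            private String readToken() {
                StringBuilder sb = new StringBuilder();
                try {
                    int c;
                    while ((c = text.read()) != -1) {
                        if (Character.isWhitespace(c)) {
                            if (sb.length() > 0) return sb.toString();
                        } else {
                            sb.append((char) c);
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
                eof = true;
                return sb.length() > 0 ? sb.toString() : null;
            }
        };
    }
}
```

With lookahead, the alternative mentioned in the comment (allowing next() to return null after hasNext() said true) is unnecessary, which keeps the Iterator contract intact for callers.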
            craigm Craig Macdonald added a comment -

            After much hard work and perseverance by Rodrygo and Craig, this is now done. Tokenisation occurs in the indexing.tokenisation package. The tokeniser defaults to EnglishTokeniser, but can be controlled by the property tokeniser=.

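The tokeniser= property mentioned in the resolving comment is set like any other Terrier property, e.g. in terrier.properties. A sketch of such a configuration (the UTFTokeniser class name is an assumption based on the indexing.tokenisation package, not something this issue confirms):

```properties
# terrier.properties -- select the tokeniser used during indexing.
# EnglishTokeniser is the default; UTFTokeniser is assumed here as
# the multilingual alternative.
tokeniser=UTFTokeniser
```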
            craigm Craig Macdonald made changes -
            Status Open [ 1 ] Resolved [ 5 ]
            Assignee Rodrygo L. T. Santos [ rodrygo ] Craig Macdonald [ craigm ]
            Resolution Fixed [ 1 ]
            craigm Craig Macdonald made changes -
            Project TREC [ 10010 ] Terrier Core [ 10000 ]
            Key TREC-225 TR-146
            Issue Type New Feature [ 2 ] Bug [ 1 ]
            Workflow jira [ 10502 ] Terrier Open Source [ 10539 ]
            Component/s .indexing [ 10002 ]
            Component/s tests [ 10006 ]
            Component/s Core [ 10020 ]
            Fix Version/s 3.1 [ 10040 ]
            Fix Version/s 3.1 [ 10021 ]

              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                rodrygo Rodrygo L. T. Santos
              • Watchers:
                1

                Dates

                • Created:
                  Updated:
                  Resolved: