[TR-146] Tokenisation should be done separately from Document parsing Created: 17/Mar/11  Updated: 05/Apr/11  Resolved: 24/Mar/11

Status: Resolved
Project: Terrier Core
Component/s: .indexing, tests
Affects Version/s: None
Fix Version/s: 3.5

Type: Bug Priority: Major
Reporter: Rodrygo L. T. Santos Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Issue Links:
Block
blocks TR-140 Indexing support for query-biased sum... Resolved
Related
is related to TR-147 Allow various Collection implementati... Resolved

 Description   
Various Document objects perform hard-coded tokenisations for different languages. For instance, TRECUTFCollection$UTFDocument.

It would be desirable to separate tokenisation from the rest of the document parsing, which deals with other tasks such as HTML parsing, metadata collection, etc.

 Comments   
Comment by Craig Macdonald [ 17/Mar/11 ]

The interface that Rodrygo and I defined earlier today was:

interface Tokeniser
{
 public String[] tokenise(Reader text);
}
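
A minimal implementation of this interface might look as follows. This is only an illustrative sketch, not Terrier's actual tokeniser: it splits on non-alphanumeric characters and lowercases each token, and the class name SimpleTokeniser is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

interface Tokeniser {
    String[] tokenise(Reader text);
}

// Hypothetical implementation: emits lowercased runs of letters/digits.
class SimpleTokeniser implements Tokeniser {
    public String[] tokenise(Reader text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = text.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    current.append(Character.toLowerCase((char) c));
                } else if (current.length() > 0) {
                    // Delimiter reached: flush the pending token
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        if (current.length() > 0)
            tokens.add(current.toString());
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Tokeniser t = new SimpleTokeniser();
        String[] terms = t.tokenise(new StringReader("Hello, Terrier World!"));
        System.out.println(String.join(" ", terms)); // hello terrier world
    }
}
```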
Comment by Craig Macdonald [ 17/Mar/11 ]

I'm considering whether we should change Tokeniser to be abstract. Moreover, the current interface can't be directly called for an entire (possibly lengthy) document. This wouldn't happen for HTMLDocument (which will pass one tag at a time), but it might happen for FileDocument classes.

The result would look like:

interface TokenStream extends Iterator<String>
{}


abstract class Tokeniser
{
 // Subclasses implement streaming tokenisation over the Reader
 public abstract TokenStream tokeniserStream(Reader text);

 // Convenience method: drains the stream into an array, skipping nulls
 public String[] tokenise(Reader text)
 {
  List<String> terms = new ArrayList<String>();
  TokenStream i = this.tokeniserStream(text);
  while (i.hasNext())
  {
    String t = i.next();
    if (t != null)
     terms.add(t);
  }
  return terms.toArray(new String[terms.size()]);
 }
}

TokenStream would need some way to implement hasNext() accurately, or be allowed to return null from next() even after hasNext() has returned true.
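
One way to make hasNext() accurate is to buffer one token ahead. This is a sketch under that assumption, not Terrier's actual implementation; the class name LookaheadTokenStream and the whitespace-based tokenisation rule are both illustrative.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;

interface TokenStream extends Iterator<String> {}

// Hypothetical stream that reads one token ahead, so hasNext() is exact
// and next() never needs to return null.
class LookaheadTokenStream implements TokenStream {
    private final Reader text;
    private String buffered; // next token, or null at end of input

    LookaheadTokenStream(Reader text) {
        this.text = text;
        this.buffered = readToken();
    }

    // Reads the next lowercased alphanumeric run, or null at EOF.
    private String readToken() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = text.read()) != -1) {
                if (Character.isLetterOrDigit(c))
                    sb.append(Character.toLowerCase((char) c));
                else if (sb.length() > 0)
                    break; // delimiter after a token: stop here
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    public boolean hasNext() { return buffered != null; }

    public String next() {
        if (buffered == null)
            throw new NoSuchElementException();
        String t = buffered;
        buffered = readToken(); // advance the lookahead
        return t;
    }

    public void remove() { throw new UnsupportedOperationException(); }

    public static void main(String[] args) {
        TokenStream s = new LookaheadTokenStream(new StringReader("to be, or not"));
        while (s.hasNext())
            System.out.println(s.next());
    }
}
```

With lookahead, the default Tokeniser.tokenise() no longer needs the null check when collecting terms.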

Comment by Craig Macdonald [ 24/Mar/11 ]

After much hard work and perseverance by Rodrygo and Craig, this is now done. Tokenisation occurs in the indexing.tokenisation package. The tokeniser defaults to EnglishTokeniser, but can be controlled by the tokeniser= property.
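
As an illustration of the tokeniser= property mentioned above, a terrier.properties fragment selecting a non-default tokeniser might look like this (the UTFTokeniser class name is an assumption, not confirmed by this issue):

```properties
# terrier.properties -- EnglishTokeniser is the default;
# the alternative named here is illustrative
tokeniser=UTFTokeniser
```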

Generated at Sat Dec 16 18:31:21 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.