I'm considering whether we should change Tokeniser to be abstract. Moreover, the current interface can't be called directly on an entire (possibly lengthy) document. This wouldn't happen for HTMLDocument (which will pass one tag at a time), but it might happen for FileDocument classes.
The change would look something like this:
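    interface TokenStream extends Iterator<String> {}

    abstract class Tokeniser {
        // Subclasses supply the streaming tokenisation.
        public abstract TokenStream tokeniserStream(Reader text);

        // Convenience wrapper: collects the whole stream into an array.
        public String[] tokenise(Reader text) {
            List<String> terms = new ArrayList<String>();
            TokenStream i = this.tokeniserStream(text);
            while (i.hasNext()) {
                String t = i.next();
                if (t != null) // tolerate a null after hasNext() said true
                    terms.add(t);
            }
            return terms.toArray(new String[terms.size()]);
        }
    }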
TokenStream would need some way to implement hasNext() without knowing in advance whether another token follows, or else next() would have to be allowed to return null after hasNext() has returned true.
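One possible way to implement hasNext() honestly is to read one token ahead and buffer it. The following is only a minimal sketch under assumptions, not a proposed final implementation: WhitespaceTokenStream is a hypothetical class name, splitting on whitespace is a stand-in for the real tokenisation rules, and wrapping IOException in UncheckedIOException assumes Java 8 or later. TokenStream is redeclared so the example compiles on its own.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

interface TokenStream extends Iterator<String> {}

/** Hypothetical example: whitespace-delimited tokens with one-token lookahead. */
class WhitespaceTokenStream implements TokenStream {
    private final Reader in;
    private String lookahead;          // buffered next token, or null if none is buffered
    private boolean exhausted = false; // true once the reader has hit end of input

    WhitespaceTokenStream(Reader in) {
        this.in = in;
    }

    /** Reads the next token from the reader, or returns null at end of input. */
    private String readToken() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // whitespace after some characters ends the token
                    }
                    // otherwise this is leading whitespace: skip it
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    @Override
    public boolean hasNext() {
        // Fill the lookahead buffer on demand, so the answer is always accurate.
        if (lookahead == null && !exhausted) {
            lookahead = readToken();
            exhausted = (lookahead == null);
        }
        return lookahead != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String t = lookahead;
        lookahead = null; // consume the buffered token
        return t;
    }
}
```

With this approach next() never returns null, so the null check in tokenise() becomes a safety net rather than a requirement. The cost is that hasNext() may read from the underlying Reader, which callers need to be aware of.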