Interface Tokenizer

  • All Known Implementing Classes:
    TRECFullTokenizer

    public interface Tokenizer
    The specification of the interface implemented by tokeniser classes.
    Author:
    Gianni Amati, Vassilis Plachouras
    • Method Summary

      Modifier and Type    Method                                   Description
      java.lang.String     currentTag()                             Returns the identifier of the tag the tokenizer is currently in.
      long                 getByteOffset()                          Returns the byte offset in the current indexed file.
      boolean              inDocnoTag()                             Indicates whether we are in a special document number tag.
      boolean              inTagToProcess()                         Indicates whether we are in a tag to process.
      boolean              inTagToSkip()                            Indicates whether we are in a tag to skip.
      boolean              isEndOfDocument()                        Returns true if the end of document is encountered.
      boolean              isEndOfFile()                            Returns true if the end of file is encountered.
      void                 nextDocument()                           Proceed to process the next document.
      java.lang.String     nextToken()                              Returns the next token from the input stream used.
      void                 setInput(java.io.BufferedReader input)   Sets the input of the tokenizer.
    • Method Detail

      • currentTag

        java.lang.String currentTag()
        Returns the identifier of the tag the tokenizer is currently in.
        Returns:
        the name of the tag the tokenizer is processing
      • nextToken

        java.lang.String nextToken()
        Returns the next token from the input stream used.
        Returns:
        the next token, or null if the end of file is encountered.
      • inDocnoTag

        boolean inDocnoTag()
        Indicates whether we are in a special document number tag.
        Returns:
        true if the tokenizer is in a document number tag.
      • inTagToProcess

        boolean inTagToProcess()
        Indicates whether we are in a tag to process.
        Returns:
        true if we are in a tag to process.
      • inTagToSkip

        boolean inTagToSkip()
        Indicates whether we are in a tag to skip.
        Returns:
        true if we are in a tag to skip.
      • isEndOfDocument

        boolean isEndOfDocument()
        Returns true if the end of document is encountered.
        Returns:
        true if the end of document is encountered.
      • isEndOfFile

        boolean isEndOfFile()
        Returns true if the end of file is encountered.
        Returns:
        true if the end of file is encountered.
      • nextDocument

        void nextDocument()
        Proceed to process the next document.
      • getByteOffset

        long getByteOffset()
        Returns the byte offset in the current indexed file.
      • setInput

        void setInput(java.io.BufferedReader input)
        Sets the input of the tokenizer.
        Parameters:
        input - the BufferedReader supplying the input stream to tokenize
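
The methods above suggest a nested read loop: an outer loop over documents until isEndOfFile() returns true, and an inner loop over tokens until isEndOfDocument() returns true. The following is a minimal sketch of that pattern, not the library's own code. The Tokenizer interface is redeclared locally so the example compiles by itself, and BlankLineTokenizer, TokenizerDemo and readAll are hypothetical names introduced for illustration. The toy implementation assumes untagged text with blank-line document separators, and it assumes nextToken() returns null at a document boundary as well as at end of file; a real implementation such as TRECFullTokenizer may behave differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Local mirror of the signatures documented above, so the sketch compiles on its own.
interface Tokenizer {
    String currentTag();
    long getByteOffset();
    boolean inDocnoTag();
    boolean inTagToProcess();
    boolean inTagToSkip();
    boolean isEndOfDocument();
    boolean isEndOfFile();
    void nextDocument();
    String nextToken();
    void setInput(BufferedReader input);
}

// Hypothetical toy implementation: untagged text, documents separated by blank lines.
final class BlankLineTokenizer implements Tokenizer {
    private BufferedReader in;
    private final ArrayDeque<String> pending = new ArrayDeque<>();
    private boolean endOfDocument;
    private boolean endOfFile;
    // Approximate stand-in for a byte offset: characters consumed, counting one per line break.
    private long charsRead;

    public void setInput(BufferedReader input) {
        in = input;
        pending.clear();
        endOfDocument = false;
        endOfFile = false;
        charsRead = 0;
    }

    // Assumption of this toy: null is returned at a document boundary as well as at end of file.
    public String nextToken() {
        if (endOfDocument || endOfFile) return null;
        while (pending.isEmpty()) {
            String line;
            try {
                line = in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (line == null) { endOfFile = true; endOfDocument = true; return null; }
            charsRead += line.length() + 1;
            if (line.trim().isEmpty()) { endOfDocument = true; return null; }
            for (String t : line.trim().split("\\s+")) pending.add(t);
        }
        return pending.poll();
    }

    public void nextDocument() { if (!endOfFile) endOfDocument = false; }
    public boolean isEndOfDocument() { return endOfDocument; }
    public boolean isEndOfFile() { return endOfFile; }
    public long getByteOffset() { return charsRead; }
    public String currentTag() { return null; }       // the toy input has no tags
    public boolean inDocnoTag() { return false; }
    public boolean inTagToProcess() { return true; }  // everything is treated as indexable
    public boolean inTagToSkip() { return false; }
}

final class TokenizerDemo {
    // Collects the tokens of each document, relying only on the interface methods.
    static List<List<String>> readAll(Tokenizer tokenizer, BufferedReader input) {
        tokenizer.setInput(input);
        List<List<String>> documents = new ArrayList<>();
        while (!tokenizer.isEndOfFile()) {
            List<String> tokens = new ArrayList<>();
            while (!tokenizer.isEndOfDocument()) {
                String token = tokenizer.nextToken();
                if (token != null && tokenizer.inTagToProcess() && !tokenizer.inTagToSkip()) {
                    tokens.add(token);
                }
            }
            if (!tokens.isEmpty()) documents.add(tokens);
            tokenizer.nextDocument();
        }
        return documents;
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new StringReader("first doc\n\nsecond doc here\n"));
        System.out.println(readAll(new BlankLineTokenizer(), in));
        // prints [[first, doc], [second, doc, here]]
    }
}
```

Because readAll depends only on the interface, the same loop should work with any implementing class, provided that class signals document and file boundaries through isEndOfDocument() and isEndOfFile() in the way assumed here.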