Class TRv2PorterStemmer

  • All Implemented Interfaces:
    Stemmer, TermPipeline
    Direct Known Subclasses:
    TRv2WeakPorterStemmer

    public class TRv2PorterStemmer
    extends StemmerTermPipeline
    This is the Porter stemming algorithm, coded in Java by Gianni Amati. All comments are Porter's, apart from a few that reflect implementation choices. For Porter's own implementation in Java, see PorterStemmer.
    Porter says "It may be regarded as canonical, in that it follows the algorithm presented in Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137, only differing from it at the points marked --DEPARTURE-- below. The algorithm as described in the paper could be exactly replicated by adjusting the points of DEPARTURE, but this is barely necessary, because (a) the points of DEPARTURE are definitely improvements, and (b) no encoding of the Porter stemmer I have seen is anything like as exact as this version, even with the points of DEPARTURE!".
    This class is not thread safe.
    Author:
    Gianni Amati, modified into a TermPipeline and (Java) optimised by Craig Macdonald
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected char[] b
      A buffer for the word to be stemmed.
      protected int j
      A general offset into the string.
      protected int k  
      protected int k0  
    • Constructor Summary

      Constructors 
      Constructor Description
      TRv2PorterStemmer​(TermPipeline next)
      Constructs an instance of the TRv2PorterStemmer.
    • Method Summary

      Modifier and Type Method Description
      protected boolean cons​(int i)
      cons(i) is TRUE <=> b[i] is a consonant.
      protected boolean consonantinstem()  
      protected boolean cvc​(int i)
      Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the second consonant is not w, x or y.
      protected void defineBuffer​(java.lang.String s)  
      protected boolean doublec​(int _j)
      Returns true if _j,(_j-1) contain a double consonant, i.e. the same consonant twice.
      protected boolean ends​(java.lang.String s)
      Returns true if k0,...k ends with the string s.
      protected int m()
      Measures the number of consonant sequences between k0 and j.
      static void main​(java.lang.String[] args)
      main
      protected void setto​(int i1, int i2, java.lang.String str)
      Sets (j+1),...,k to the characters in the string str, readjusting k and j.
      java.lang.String stem​(java.lang.String s)
      Returns the stem of a given term.
      protected void step1ab()
      Removes the plurals and -ed or -ing.
      protected void step1c()
      Turns terminal y to i when there is another vowel in the stem.
      protected void step2()
      Maps double suffixes to single ones.
      protected void step3()
      Deals with -ic-, -full, -ness etc., using a strategy similar to step2.
      protected void step4()
      Takes off -ant, -ence etc., in context <c>vcvc<v>.
      protected void step5()
      Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
      protected boolean vowelinstem()
      Returns TRUE if k0,...j contains a vowel.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • b

        protected char[] b
        A buffer for the word to be stemmed.
      • k

        protected int k
      • k0

        protected int k0
      • j

        protected int j
        A general offset into the string.
    • Constructor Detail

      • TRv2PorterStemmer

        public TRv2PorterStemmer​(TermPipeline next)
        Constructs an instance of the TRv2PorterStemmer.
        Parameters:
        next - the next TermPipeline component, which follows this stemmer in the pipeline.
    • Method Detail

      • cons

        protected boolean cons​(int i)
        cons(i) is TRUE <=> b[i] is a consonant.
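
The consonant test can be sketched as follows. This is a minimal standalone illustration that follows the definition in Porter's paper (a, e, i, o, u are vowels; y is a vowel only when it follows a consonant), not a copy of this class's code. The class name ConsonantSketch, the explicit buffer parameter and the assumption that the word starts at index 0 are all hypothetical.

```java
final class ConsonantSketch {
    // Stand-in for cons(i): takes the buffer explicitly instead of reading the field b,
    // and assumes the word starts at index 0 (k0 == 0).
    static boolean isConsonant(char[] b, int i) {
        switch (b[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                // 'y' counts as a consonant at the start of the word or after a vowel
                return i == 0 || !isConsonant(b, i - 1);
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isConsonant("toy".toCharArray(), 2));    // true: 'y' follows the vowel 'o'
        System.out.println(isConsonant("syzygy".toCharArray(), 1)); // false: 'y' follows the consonant 's'
    }
}
```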
      • consonantinstem

        protected boolean consonantinstem()
      • cvc

        protected final boolean cvc​(int i)
        Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the second consonant is not w, x or y. This is used when trying to restore an e at the end of a short word. For example:
        • cav(e)
        • lov(e)
        • hop(e)
        • crim(e)
        but keep terms snow, box, tray as they are.
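
The examples above can be checked with a small standalone sketch. It tests the last three characters of a String rather than positions in the buffer b, and the names CvcSketch and endsCvc are hypothetical, chosen only for this illustration.

```java
final class CvcSketch {
    // Porter's consonant definition: 'y' is a vowel only when it follows a consonant.
    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
        if (ch == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    // True if the word ends consonant-vowel-consonant and the final consonant is not w, x or y.
    static boolean endsCvc(String w) {
        int i = w.length() - 1;
        if (i < 2) return false;
        if (!isConsonant(w, i) || isConsonant(w, i - 1) || !isConsonant(w, i - 2)) return false;
        char last = w.charAt(i);
        return last != 'w' && last != 'x' && last != 'y';
    }

    public static void main(String[] args) {
        System.out.println(endsCvc("hop"));  // true, so an e may be restored: hop(e)
        System.out.println(endsCvc("snow")); // false, the final consonant is w
    }
}
```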
      • defineBuffer

        protected final void defineBuffer​(java.lang.String s)
      • doublec

        protected final boolean doublec​(int _j)
        Returns true if _j,(_j-1) contain a double consonant, i.e. the same consonant twice.
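
A minimal sketch of the double-consonant test, assuming it means "the same consonant at both positions". It works on a String argument instead of the buffer b; DoubleConsonantSketch and doubleConsonant are hypothetical names.

```java
final class DoubleConsonantSketch {
    // Porter's consonant definition: 'y' is a vowel only when it follows a consonant.
    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
        if (ch == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    // True if positions j-1 and j hold the same character and that character is a consonant.
    static boolean doubleConsonant(String w, int j) {
        if (j < 1) return false;
        if (w.charAt(j) != w.charAt(j - 1)) return false;
        return isConsonant(w, j);
    }

    public static void main(String[] args) {
        System.out.println(doubleConsonant("matt", 3));  // true: "tt"
        System.out.println(doubleConsonant("agree", 4)); // false: "ee" is a double vowel
    }
}
```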
      • ends

        protected final boolean ends​(java.lang.String s)
        Returns true if k0,...k ends with the string s.
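
The suffix test can be sketched as below. In Porter's reference code a successful match also moves j to the last character of the stem before the suffix; whether this class does the same is an assumption. EndsSketch is a hypothetical stand-in that mirrors the fields b, k and j and assumes k0 is 0.

```java
final class EndsSketch {
    char[] b;  // the word being examined
    int k;     // index of the last character of the word
    int j;     // after a successful ends(), index of the last stem character

    EndsSketch(String word) {
        b = word.toCharArray();
        k = b.length - 1;
        j = k;
    }

    boolean ends(String s) {
        int l = s.length();
        if (l > k + 1) return false;               // suffix longer than the word
        for (int i = 0; i < l; i++) {
            if (b[k - l + 1 + i] != s.charAt(i)) return false;
        }
        j = k - l;                                 // assumption: j now marks the end of the stem
        return true;
    }

    public static void main(String[] args) {
        EndsSketch e = new EndsSketch("caresses");
        System.out.println(e.ends("sses") + " " + e.j); // true 3 (the stem is "care")
    }
}
```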
      • m

        protected final int m()
        Measures the number of consonant sequences between k0 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence:
        • <c><v> gives 0
        • <c>vc<v> gives 1
        • <c>vcvc<v> gives 2
        • <c>vcvcvc<v> gives 3
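
The measure can be illustrated with a short standalone sketch that counts vowel-to-consonant transitions over a whole String, rather than over the range k0..j of the buffer. MeasureSketch and measure are hypothetical names, and the sample words are the ones Porter's paper uses for m = 0, 1 and 2.

```java
final class MeasureSketch {
    // Porter's consonant definition: 'y' is a vowel only when it follows a consonant.
    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
        if (ch == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    // Counts the "vc" pairs in the form <c>(vc)...(vc)<v>.
    static int measure(String w) {
        int n = 0, i = 0, len = w.length();
        while (i < len && isConsonant(w, i)) i++;        // optional leading <c>
        while (i < len) {
            while (i < len && !isConsonant(w, i)) i++;   // vowel run
            if (i == len) break;                         // trailing <v>, no new pair
            n++;                                         // a consonant run closes one vc pair
            while (i < len && isConsonant(w, i)) i++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(measure("tree"));    // 0
        System.out.println(measure("trouble")); // 1
        System.out.println(measure("private")); // 2
    }
}
```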
      • setto

        protected final void setto​(int i1,
                                   int i2,
                                   java.lang.String str)
        Sets (j+1),...,k to the characters in the string str, readjusting k and j.
      • stem

        public java.lang.String stem​(java.lang.String s)
        Returns the stem of a given term.
        Parameters:
        s - String the term to be stemmed.
        Returns:
        String the stem of a given term.
      • step1ab

        protected final void step1ab()
        Removes the plurals and -ed or -ing. For example,
        • caresses becomes caress
        • ponies becomes poni
        • ties becomes ti
        • caress becomes caress
        • cats becomes cat
        • feed becomes feed
        • agreed becomes agree
        • disabled becomes disable
        • matting becomes mat
        • mating becomes mate
        • meeting becomes meet
        • milling becomes mill
        • messing becomes mess
        • meetings becomes meet
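
The examples above can be reproduced with a standalone sketch of steps 1a and 1b as described in Porter's paper. It works on Strings rather than on the buffer b and is not this class's code. All names (Step1abSketch, chop, endsCvc and so on) are hypothetical, and the early return for words shorter than three characters is an assumption.

```java
final class Step1abSketch {
    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
        if (ch == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    static int measure(String w) {                       // number of vc pairs
        int n = 0, i = 0, len = w.length();
        while (i < len && isConsonant(w, i)) i++;
        while (i < len) {
            while (i < len && !isConsonant(w, i)) i++;
            if (i == len) break;
            n++;
            while (i < len && isConsonant(w, i)) i++;
        }
        return n;
    }

    static boolean hasVowel(String w) {
        for (int i = 0; i < w.length(); i++) if (!isConsonant(w, i)) return true;
        return false;
    }

    static boolean endsCvc(String w) {                   // cvc, final consonant not w, x, y
        int i = w.length() - 1;
        if (i < 2 || !isConsonant(w, i) || isConsonant(w, i - 1) || !isConsonant(w, i - 2)) return false;
        char last = w.charAt(i);
        return last != 'w' && last != 'x' && last != 'y';
    }

    static boolean endsDoubleConsonant(String w) {
        int n = w.length();
        return n >= 2 && w.charAt(n - 1) == w.charAt(n - 2) && isConsonant(w, n - 1);
    }

    static String chop(String w, int n) { return w.substring(0, w.length() - n); }

    static String step1ab(String w) {
        if (w.length() < 3) return w;                    // assumption: very short words are left alone
        // Step 1a: plurals (-sses -> -ss, -ies -> -i, -s -> nothing unless -ss)
        if (w.endsWith("sses") || w.endsWith("ies")) w = chop(w, 2);
        else if (w.endsWith("s") && !w.endsWith("ss")) w = chop(w, 1);
        // Step 1b: -eed -> -ee when the stem has m > 0
        if (w.endsWith("eed")) return measure(chop(w, 3)) > 0 ? chop(w, 1) : w;
        String stem;
        if (w.endsWith("ed")) stem = chop(w, 2);
        else if (w.endsWith("ing")) stem = chop(w, 3);
        else return w;
        if (!hasVowel(stem)) return w;                   // e.g. "sing" keeps its -ing
        // Tidy up the stem after removing -ed or -ing
        if (stem.endsWith("at") || stem.endsWith("bl") || stem.endsWith("iz")) return stem + "e";
        if (endsDoubleConsonant(stem)) {
            char last = stem.charAt(stem.length() - 1);
            return (last == 'l' || last == 's' || last == 'z') ? stem : chop(stem, 1);
        }
        return (measure(stem) == 1 && endsCvc(stem)) ? stem + "e" : stem;
    }

    public static void main(String[] args) {
        System.out.println(step1ab("meetings")); // meet
        System.out.println(step1ab("matting"));  // mat
        System.out.println(step1ab("disabled")); // disable
    }
}
```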
      • step1c

        protected final void step1c()
        Turns terminal y to i when there is another vowel in the stem.
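
As a small illustration of this rule, the sketch below rewrites a final y to i only when the part of the word before it contains a vowel. It works on Strings, and the names Step1cSketch and hasVowel are hypothetical.

```java
final class Step1cSketch {
    // Porter's consonant definition: 'y' is a vowel only when it follows a consonant.
    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
        if (ch == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    static boolean hasVowel(String w) {
        for (int i = 0; i < w.length(); i++) if (!isConsonant(w, i)) return true;
        return false;
    }

    static String step1c(String w) {
        int n = w.length();
        if (n > 1 && w.charAt(n - 1) == 'y' && hasVowel(w.substring(0, n - 1))) {
            return w.substring(0, n - 1) + "i";
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step1c("happy")); // happi
        System.out.println(step1c("sky"));   // sky ("sk" has no vowel)
    }
}
```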
      • step2

        protected final void step2()
        Maps double suffixes to single ones. So -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
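
The mapping and its m() > 0 condition can be sketched as follows. The rule table is a small illustrative subset of the suffix pairs listed in Porter's paper, not the full table used by this class. Step2Sketch and RULES are hypothetical names, and the sketch operates on Strings.

```java
final class Step2Sketch {
    // Illustrative subset of the step 2 suffix pairs; longer suffixes are listed first
    // so that "ational" is tried before "tional".
    static final String[][] RULES = {
        {"ational", "ate"}, {"ization", "ize"}, {"iveness", "ive"},
        {"fulness", "ful"}, {"ousness", "ous"}, {"tional", "tion"},
    };

    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
        if (ch == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    static int measure(String w) {                       // number of vc pairs
        int n = 0, i = 0, len = w.length();
        while (i < len && isConsonant(w, i)) i++;
        while (i < len) {
            while (i < len && !isConsonant(w, i)) i++;
            if (i == len) break;
            n++;
            while (i < len && isConsonant(w, i)) i++;
        }
        return n;
    }

    static String step2(String w) {
        for (String[] rule : RULES) {
            if (w.endsWith(rule[0])) {
                String stem = w.substring(0, w.length() - rule[0].length());
                // Only the first matching suffix is considered; it is replaced when m(stem) > 0.
                return measure(stem) > 0 ? stem + rule[1] : w;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step2("relational")); // relate
        System.out.println(step2("rational"));   // rational (the stem "r" has m = 0)
    }
}
```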
      • step3

        protected final void step3()
        Deals with -ic-, -full, -ness etc., using a strategy similar to step2.
      • step4

        protected final void step4()
        Takes off -ant, -ence etc., in context <c>vcvc<v>.
      • step5

        protected final void step5()
        Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
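
The two rules as stated here can be sketched as below. This covers only the m() > 1 conditions given in the description; Porter's published algorithm also removes a final -e in one m() = 1 case, which is left out of this sketch. Step5Sketch is a hypothetical name and the code works on Strings.

```java
final class Step5Sketch {
    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
        if (ch == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    static int measure(String w) {                       // number of vc pairs
        int n = 0, i = 0, len = w.length();
        while (i < len && isConsonant(w, i)) i++;
        while (i < len) {
            while (i < len && !isConsonant(w, i)) i++;
            if (i == len) break;
            n++;
            while (i < len && isConsonant(w, i)) i++;
        }
        return n;
    }

    static String step5(String w) {
        // Rule 1: drop a final -e when m > 1 (simplified, see the note above the code).
        if (w.endsWith("e") && measure(w) > 1) w = w.substring(0, w.length() - 1);
        // Rule 2: -ll becomes -l when m > 1.
        if (w.endsWith("ll") && measure(w) > 1) w = w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step5("probate"));  // probat
        System.out.println(step5("controll")); // control
        System.out.println(step5("roll"));     // roll (m = 1)
    }
}
```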
      • vowelinstem

        protected final boolean vowelinstem()
        Returns TRUE if k0,...j contains a vowel.
        Returns:
        true if k0,...,j contains a vowel.
      • main

        public static void main​(java.lang.String[] args)
        main
        Parameters:
        args -