org.terrier.terms
Class TRv2PorterStemmer

java.lang.Object
  extended by org.terrier.terms.StemmerTermPipeline
      extended by org.terrier.terms.TRv2PorterStemmer
All Implemented Interfaces:
Stemmer, TermPipeline
Direct Known Subclasses:
TRv2WeakPorterStemmer

public class TRv2PorterStemmer
extends StemmerTermPipeline

This is the Porter stemming algorithm, coded in Java by Gianni Amati. All comments are Porter's, except for a few that reflect implementation choices. For Porter's own implementation in Java, see PorterStemmer.
Porter says: "It may be regarded as canonical, in that it follows the algorithm presented in Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137, only differing from it at the points marked --DEPARTURE-- below. The algorithm as described in the paper could be exactly replicated by adjusting the points of DEPARTURE, but this is barely necessary, because (a) the points of DEPARTURE are definitely improvements, and (b) no encoding of the Porter stemmer I have seen is anything like as exact as this version, even with the points of DEPARTURE!".
This class is not thread safe.
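Because the stemmer keeps its working state in instance fields (b, j, k, k0), one common way to use such a class from several threads is to give each thread its own instance. The sketch below illustrates that pattern with ThreadLocal. StatefulStemmer and its toy rule are hypothetical stand-ins so that the example runs without Terrier on the classpath; they are not the Terrier implementation.

```java
class PerThreadStemmerSketch {
    /** Hypothetical stand-in for a stateful stemmer; NOT the Terrier class. */
    static final class StatefulStemmer {
        private char[] buffer = new char[0]; // mutable working state, like b/j/k

        String stem(String term) {
            buffer = term.toCharArray();
            int end = buffer.length;
            // Toy rule for illustration only: drop one trailing 's'.
            if (end > 1 && buffer[end - 1] == 's') {
                end--;
            }
            return new String(buffer, 0, end);
        }
    }

    // One instance per thread, so the mutable buffer is never shared.
    private static final ThreadLocal<StatefulStemmer> STEMMER =
            ThreadLocal.withInitial(StatefulStemmer::new);

    static String stem(String term) {
        return STEMMER.get().stem(term);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> System.out.println(stem("cats")));
        Thread t2 = new Thread(() -> System.out.println(stem("dogs")));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

The alternative is to guard a single shared instance with a lock; per-thread instances avoid contention at the cost of one buffer per thread.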

Author:
Gianni Amati, modified into a TermPipeline and (Java) optimised by Craig Macdonald

Field Summary
protected  char[] b
          A buffer for word to be stemmed.
protected  int j
          A general offset into the string.
protected  int k
           
protected  int k0
           
 
Fields inherited from class org.terrier.terms.StemmerTermPipeline
next
 
Constructor Summary
TRv2PorterStemmer(TermPipeline next)
          Constructs an instance of the TRv2PorterStemmer.
 
Method Summary
protected  boolean cons(int i)
          cons(i) is TRUE <=> b[i] is a consonant.
protected  boolean consonantinstem()
           
protected  boolean cvc(int i)
          Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the second consonant is not w, x or y.
protected  void defineBuffer(java.lang.String s)
           
protected  boolean doublec(int _j)
          Returns true if _j,(_j-1) contain a double consonant.
protected  boolean ends(java.lang.String s)
          Returns true if k0,...k ends with the string s.
protected  int m()
          Measures the number of consonant sequences between k0 and j.
static void main(java.lang.String[] args)
          Command-line entry point.
protected  void setto(int i1, int i2, java.lang.String str)
          Sets (j+1),...k to the characters in the string str, readjusting k and j.
 java.lang.String stem(java.lang.String s)
          Returns the stem of a given term.
protected  void step1ab()
          Removes the plurals and -ed or -ing.
protected  void step1c()
          Turns terminal y to i when there is another vowel in the stem.
protected  void step2()
          Maps double suffices to single ones.
protected  void step3()
          Deals with -ic-, -full, -ness etc, similarly to the strategy of step2.
protected  void step4()
          Takes off -ant, -ence etc., in context vcvc.
protected  void step5()
          Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
protected  boolean vowelinstem()
          Returns TRUE if k0,...j contains a vowel.
 
Methods inherited from class org.terrier.terms.StemmerTermPipeline
processTerm, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

b

protected char[] b
A buffer for word to be stemmed.


k

protected int k

k0

protected int k0

j

protected int j
A general offset into the string.

Constructor Detail

TRv2PorterStemmer

public TRv2PorterStemmer(TermPipeline next)
Constructs an instance of the TRv2PorterStemmer.

Parameters:
next - the next component in the term pipeline
Method Detail

cons

protected boolean cons(int i)
cons(i) is TRUE <=> b[i] is a consonant.
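As an illustration, a consonant test in the style of Porter's reference code can be sketched as below. It assumes the word starts at index 0 (that is, k0 = 0) and passes the buffer as a parameter instead of using the field b; ConsSketch is a hypothetical name, and this is not necessarily the exact code of this class.

```java
class ConsSketch {
    /** True if word[i] is a consonant under Porter's definition. */
    static boolean cons(char[] word, int i) {
        switch (word[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                // y counts as a consonant at the start of the word or after a vowel,
                // and as a vowel after a consonant.
                return i == 0 || !cons(word, i - 1);
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        char[] toy = "toy".toCharArray();
        char[] syzygy = "syzygy".toCharArray();
        System.out.println(cons(toy, 2));    // y after the vowel o
        System.out.println(cons(syzygy, 1)); // y after the consonant s
    }
}
```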


consonantinstem

protected boolean consonantinstem()

cvc

protected final boolean cvc(int i)
Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the second consonant is not w, x or y. This is used when trying to restore an e at the end of a short word. For example:
cav(e), lov(e), hop(e), crim(e), but keep terms snow, box, tray as they are.
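The check can be sketched as follows, modelled on Porter's reference code. The sketch assumes k0 = 0 and takes the buffer as a parameter; CvcSketch and the local cons helper are illustrative names, not the members of this class.

```java
class CvcSketch {
    /** Porter-style consonant test (word assumed to start at index 0). */
    static boolean cons(char[] w, int i) {
        switch (w[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(w, i - 1);
            default:
                return true;
        }
    }

    /** True if w[i-2..i] is consonant-vowel-consonant and w[i] is not w, x or y. */
    static boolean cvc(char[] w, int i) {
        if (i < 2 || !cons(w, i) || cons(w, i - 1) || !cons(w, i - 2)) {
            return false;
        }
        char last = w[i];
        return last != 'w' && last != 'x' && last != 'y';
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hop", "cav", "snow", "box", "tray"}) {
            System.out.println(s + " -> " + cvc(s.toCharArray(), s.length() - 1));
        }
    }
}
```

With this rule, hop and cav qualify for a restored e, while snow, box and tray do not.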


defineBuffer

protected final void defineBuffer(java.lang.String s)

doublec

protected final boolean doublec(int _j)
Returns true if _j,(_j-1) contain a double consonant.
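A minimal sketch of the double-consonant test, again assuming k0 = 0 and a buffer passed as a parameter. DoublecSketch and its cons helper are hypothetical names used only for this illustration.

```java
class DoublecSketch {
    /** Porter-style consonant test (word assumed to start at index 0). */
    static boolean cons(char[] w, int i) {
        switch (w[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(w, i - 1);
            default:
                return true;
        }
    }

    /** True if w[pos-1] and w[pos] are the same letter and that letter is a consonant. */
    static boolean doublec(char[] w, int pos) {
        if (pos < 1 || w[pos] != w[pos - 1]) {
            return false;
        }
        return cons(w, pos);
    }

    public static void main(String[] args) {
        System.out.println(doublec("hopp".toCharArray(), 3));  // double p
        System.out.println(doublec("agree".toCharArray(), 4)); // double e is a vowel pair
    }
}
```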


ends

protected final boolean ends(java.lang.String s)
Returns true if k0,...k ends with the string s.
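The suffix test can be sketched as below. Following Porter's reference implementation, the sketch also moves j to the character just before the matched suffix; whether this class does exactly the same is an assumption. EndsSketch is a hypothetical stand-alone class with its own b, j and k fields and k0 fixed at 0.

```java
class EndsSketch {
    final char[] b;      // word buffer
    final int k0 = 0;    // start of the word
    int k;               // index of the last character
    int j;               // set to the end of the stem on a successful match

    EndsSketch(String word) {
        b = word.toCharArray();
        k = b.length - 1;
        j = k;
    }

    /** True if b[k0..k] ends with s; on success j points just before the suffix. */
    boolean ends(String s) {
        int len = s.length();
        int start = k - len + 1;
        if (start < k0) {
            return false; // suffix longer than the word
        }
        for (int i = 0; i < len; i++) {
            if (b[start + i] != s.charAt(i)) {
                return false;
            }
        }
        j = k - len;
        return true;
    }

    public static void main(String[] args) {
        EndsSketch e = new EndsSketch("caresses");
        System.out.println(e.ends("sses") + " j=" + e.j);
    }
}
```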


m

protected final int m()
Measures the number of consonant sequences between k0 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence:
<c><v>       gives 0
<c>vc<v>     gives 1
<c>vcvc<v>   gives 2
<c>vcvcvc<v> gives 3
....
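One way to compute this measure is to count the places where a vowel run is followed by a consonant, since each such place closes one "vc" pair. The sketch below does that over a whole string rather than over b[k0..j]; MeasureSketch and its helper are hypothetical names, and the loop is a simplification, not the code of this class.

```java
class MeasureSketch {
    /** Porter-style consonant test (word assumed to start at index 0). */
    static boolean cons(char[] w, int i) {
        switch (w[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(w, i - 1);
            default:
                return true;
        }
    }

    /** Number of vowel-to-consonant transitions, i.e. the m in <c>(vc)^m<v>. */
    static int measure(String stem) {
        char[] w = stem.toCharArray();
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < w.length; i++) {
            boolean vowel = !cons(w, i);
            if (previousWasVowel && !vowel) {
                m++; // a vowel run has just been closed by a consonant
            }
            previousWasVowel = vowel;
        }
        return m;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tree", "by", "trouble", "oats", "private", "orrery"}) {
            System.out.println(s + " -> " + measure(s));
        }
    }
}
```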


setto

protected final void setto(int i1,
                           int i2,
                           java.lang.String str)
Sets (j+1),...k to the characters in the string str, readjusting k and j.


stem

public java.lang.String stem(java.lang.String s)
Returns the stem of a given term.

Parameters:
s - String the term to be stemmed.
Returns:
String the stem of a given term.

step1ab

protected final void step1ab()
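Removes the plurals and -ed or -ing. For example,
caresses  ->  caress
ponies    ->  poni
ties      ->  ti
caress    ->  caress
cats      ->  cat

feed      ->  feed
agreed    ->  agree
disabled  ->  disable

matting   ->  mat
mating    ->  mate
meeting   ->  meet
milling   ->  mill
messing   ->  mess

meetings  ->  meet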
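The plural-removal half of this step (step 1a in Porter's 1980 paper) can be sketched on plain strings as follows. It covers only the -sses, -ies, -ss and -s rules and leaves out the -ed and -ing handling; Step1aSketch is a hypothetical name, and the real method works on the char buffer instead.

```java
class Step1aSketch {
    /** Applies the first matching plural rule: sses->ss, ies->i, ss->ss, s->"". */
    static String step1a(String word) {
        if (word.endsWith("sses") || word.endsWith("ies")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ss")) {
            return word; // a final -ss is kept
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"caresses", "ponies", "ties", "caress", "cats"}) {
            System.out.println(s + " -> " + step1a(s));
        }
    }
}
```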


step1c

protected final void step1c()
Turns terminal y to i when there is another vowel in the stem.
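A string-based sketch of this rule, assuming Porter's definition of a vowel: the final y becomes i only if some earlier character is a vowel. Step1cSketch and its cons helper are hypothetical names for illustration.

```java
class Step1cSketch {
    /** Porter-style consonant test (word assumed to start at index 0). */
    static boolean cons(char[] w, int i) {
        switch (w[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(w, i - 1);
            default:
                return true;
        }
    }

    /** Replaces a terminal y with i when the part before it contains a vowel. */
    static String step1c(String word) {
        int last = word.length() - 1;
        if (last < 1 || word.charAt(last) != 'y') {
            return word;
        }
        char[] w = word.toCharArray();
        for (int i = 0; i < last; i++) {
            if (!cons(w, i)) {
                return word.substring(0, last) + "i";
            }
        }
        return word; // no vowel before the y, e.g. "sky"
    }

    public static void main(String[] args) {
        System.out.println(step1c("happy"));
        System.out.println(step1c("sky"));
    }
}
```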


step2

protected final void step2()
Maps double suffices to single ones. So -ization (= -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
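The mapping and its m() > 0 condition can be sketched with a small rule table. The table below is only an illustrative subset of Porter's step 2 suffix pairs, and Step2Sketch, RULES and the helper methods are hypothetical names. As in Porter's reference code, the first suffix that matches decides the outcome, even if the measure condition then fails.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Step2Sketch {
    // Illustrative subset of suffix pairs; longer overlapping suffixes come first.
    static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ational", "ate");
        RULES.put("tional", "tion");
        RULES.put("ization", "ize");
        RULES.put("fulness", "ful");
    }

    /** Porter-style consonant test (word assumed to start at index 0). */
    static boolean cons(char[] w, int i) {
        switch (w[i]) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(w, i - 1);
            default:
                return true;
        }
    }

    /** Counts vowel-to-consonant transitions in the stem. */
    static int measure(String stem) {
        char[] w = stem.toCharArray();
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < w.length; i++) {
            boolean vowel = !cons(w, i);
            if (previousWasVowel && !vowel) {
                m++;
            }
            previousWasVowel = vowel;
        }
        return m;
    }

    /** Rewrites the first matching suffix if the remaining stem has measure > 0. */
    static String step2(String word) {
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            String suffix = rule.getKey();
            if (word.endsWith(suffix)) {
                String stem = word.substring(0, word.length() - suffix.length());
                return measure(stem) > 0 ? stem + rule.getValue() : word;
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"relational", "rational", "organization", "conditional"}) {
            System.out.println(s + " -> " + step2(s));
        }
    }
}
```

Here rational is left unchanged because the stem "r" has measure 0.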


step3

protected final void step3()
Deals with -ic-, -full, -ness etc, similarly to the strategy of step2.


step4

protected final void step4()
Takes off -ant, -ence etc., in context vcvc.


step5

protected final void step5()
Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.


vowelinstem

protected final boolean vowelinstem()
Returns TRUE if k0,...j contains a vowel.

Returns:
true if k0,...,j contains a vowel.

main

public static void main(java.lang.String[] args)
Command-line entry point.

Parameters:
args -


Terrier 3.5. Copyright © 2004-2011 University of Glasgow