Package org.terrier.terms
Class TRv2PorterStemmer
java.lang.Object
  org.terrier.terms.StemmerTermPipeline
    org.terrier.terms.TRv2PorterStemmer
All Implemented Interfaces: Stemmer, TermPipeline
Direct Known Subclasses: TRv2WeakPorterStemmer
public class TRv2PorterStemmer extends StemmerTermPipeline
This is the Porter stemming algorithm, implemented in Java by Gianni Amati. The comments are Porter's own, apart from a few added to explain implementation choices. For Porter's own implementation in Java, see PorterStemmer.
Porter says "It may be regarded as canonical, in that it follows the algorithm presented in Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137, only differing from it at the points marked --DEPARTURE-- below. The algorithm as described in the paper could be exactly replicated by adjusting the points of DEPARTURE, but this is barely necessary, because (a) the points of DEPARTURE are definitely improvements, and (b) no encoding of the Porter stemmer I have seen is anything like as exact as this version, even with the points of DEPARTURE!".
This class is not thread safe.
Author:
Gianni Amati, modified into a TermPipeline and (Java) optimised by Craig Macdonald
Constructor Summary
Constructor: TRv2PorterStemmer(TermPipeline next)
Description: Constructs an instance of the TRv2PorterStemmer.
-
Method Summary
protected boolean cons(int i)
    cons(i) is TRUE <=> b[i] is a consonant.
protected boolean consonantinstem()
protected boolean cvc(int i)
    Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the second consonant (the character at i) is not w, x or y.
protected void defineBuffer(java.lang.String s)
protected boolean doublec(int _j)
    Returns true if j,(j-1) contain a double consonant.
protected boolean ends(java.lang.String s)
    Returns true if k0,...k ends with the string s.
protected int m()
    Measures the number of consonant sequences between k0 and j.
static void main(java.lang.String[] args)
    main
protected void setto(int i1, int i2, java.lang.String str)
    Sets (j+1),...k to the characters in the string s, readjusting k and j.
java.lang.String stem(java.lang.String s)
    Returns the stem of a given term.
protected void step1ab()
    Removes the plurals and -ed or -ing.
protected void step1c()
    Turns terminal y to i when there is another vowel in the stem.
protected void step2()
    Maps double suffixes to single ones.
protected void step3()
    Deals with -ic-, -ful, -ness etc, similarly to the strategy of step2.
protected void step4()
    Takes off -ant, -ence etc., in context <c>vcvc<v>.
protected void step5()
    Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
protected boolean vowelinstem()
    Returns TRUE if k0,...j contains a vowel.
Methods inherited from class org.terrier.terms.StemmerTermPipeline
processTerm, reset
Constructor Detail
-
TRv2PorterStemmer
public TRv2PorterStemmer(TermPipeline next)
Constructs an instance of the TRv2PorterStemmer.
Parameters:
next
Method Detail
-
cons
protected boolean cons(int i)
cons(i) is TRUE <=> b[i] is a consonant.
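As a rough illustration of the consonant test, the sketch below follows Porter's published definition: a, e, i, o and u are vowels, and y counts as a consonant only at the start of the word or directly after a vowel. The class name ConsonantSketch and the String-based signature are assumptions made for this example; the real method reads the internal buffer b and may differ in detail.

```java
// Hypothetical stand-alone sketch; not the Terrier source.
final class ConsonantSketch {

    /** Returns true if the character at index i of word counts as a consonant. */
    static boolean isConsonant(String word, int i) {
        switch (word.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                // 'y' is a consonant at the start of the word or after a vowel.
                return i == 0 || !isConsonant(word, i - 1);
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        String word = "syzygy";
        for (int i = 0; i < word.length(); i++) {
            System.out.println(word.charAt(i) + " consonant? " + isConsonant(word, i));
        }
    }
}
```

Under this definition the y in "toy" is a consonant, while each y in "syzygy" follows a consonant and is treated as a vowel.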
-
consonantinstem
protected boolean consonantinstem()
-
cvc
protected final boolean cvc(int i)
Returns true if i-2,i-1,i has the form consonant - vowel - consonant and the second consonant (the character at i) is not w, x or y. This is used when trying to restore an e at the end of a short word. For example:
- cav(e)
- lov(e)
- hop(e)
- crim(e)
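The check can be sketched on a plain String as follows. This is a minimal illustration based on Porter's published definition, not the Terrier code: the class name CvcSketch and the helper isConsonant are assumptions, and the word is passed explicitly instead of being read from the internal buffer.

```java
// Hypothetical stand-alone sketch; not the Terrier source.
final class CvcSketch {

    /** Porter's consonant test: 'y' is a consonant at index 0 or after a vowel. */
    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !isConsonant(w, i - 1);
            default:
                return true;
        }
    }

    /** True if w[i-2], w[i-1], w[i] is consonant-vowel-consonant and w[i] is not w, x or y. */
    static boolean cvc(String w, int i) {
        if (i < 2 || !isConsonant(w, i) || isConsonant(w, i - 1) || !isConsonant(w, i - 2)) {
            return false;
        }
        char last = w.charAt(i);
        return last != 'w' && last != 'x' && last != 'y';
    }

    public static void main(String[] args) {
        for (String s : new String[] {"cav", "lov", "hop", "crim", "snow", "box", "tray"}) {
            System.out.println(s + " -> " + cvc(s, s.length() - 1));
        }
    }
}
```

The four stems listed above pass the test, whereas "snow", "box" and "tray" fail it because they end in w, x and y respectively.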
-
defineBuffer
protected final void defineBuffer(java.lang.String s)
-
doublec
protected final boolean doublec(int _j)
Returns true if j,(j-1) contain a double consonant.
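A possible reading of this test, sketched on a plain String: the characters at j and j-1 must be identical and must be a consonant. The class name DoubleConsonantSketch and the explicit word parameter are assumptions for this example; the real method works on the internal buffer.

```java
// Hypothetical stand-alone sketch; not the Terrier source.
final class DoubleConsonantSketch {

    /** Porter's consonant test: 'y' is a consonant at index 0 or after a vowel. */
    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !isConsonant(w, i - 1);
            default:
                return true;
        }
    }

    /** True if w[j-1] and w[j] are the same letter and that letter is a consonant. */
    static boolean doubleConsonant(String w, int j) {
        if (j < 1 || w.charAt(j) != w.charAt(j - 1)) {
            return false;
        }
        return isConsonant(w, j);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"matt", "mill", "meet", "tree"}) {
            System.out.println(s + " -> " + doubleConsonant(s, s.length() - 1));
        }
    }
}
```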
-
ends
protected final boolean ends(java.lang.String s)
Returns true if k0,...k ends with the string s.
-
m
protected final int m()
Measures the number of consonant sequences between k0 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence:
- <c><v> gives 0
- <c>vc<v> gives 1
- <c>vcvc<v> gives 2
- <c>vcvcvc<v> gives 3
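One way to compute this measure is to count the places where a vowel is immediately followed by a consonant, since each vc group in the patterns above contributes exactly one such transition. The sketch below does this on a plain String; the class name MeasureSketch and the method names are assumptions for this example, and the real method scans the internal buffer between k0 and j.

```java
// Hypothetical stand-alone sketch; not the Terrier source.
final class MeasureSketch {

    /** Porter's consonant test: 'y' is a consonant at index 0 or after a vowel. */
    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !isConsonant(w, i - 1);
            default:
                return true;
        }
    }

    /** Counts vowel-to-consonant transitions, i.e. the number of vc groups in <c>(vc)^m<v>. */
    static int measure(String stem) {
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean consonant = isConsonant(stem, i);
            if (consonant && previousWasVowel) {
                m++; // a vowel run has just been closed by a consonant
            }
            previousWasVowel = !consonant;
        }
        return m;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tree", "by", "trouble", "oats", "private", "orrery"}) {
            System.out.println(s + " -> m = " + measure(s));
        }
    }
}
```

With this sketch, "tree" and "by" give 0, "trouble" and "oats" give 1, and "private" and "orrery" give 2, matching the examples in Porter's paper.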
-
setto
protected final void setto(int i1, int i2, java.lang.String str)
Sets (j+1),...k to the characters in the string s, readjusting k and j.
-
stem
public java.lang.String stem(java.lang.String s)
Returns the stem of a given term.
Parameters:
s - String the term to be stemmed.
Returns:
String the stem of a given term.
-
step1ab
protected final void step1ab()
Removes the plurals and -ed or -ing. For example:
- caresses becomes caress
- ponies becomes poni
- ties becomes ti
- caress becomes caress
- cats becomes cat
- feed becomes feed
- agreed becomes agree
- disabled becomes disable
- matting becomes mat
- mating becomes mate
- meeting becomes meet
- milling becomes mill
- messing becomes mess
- meetings becomes meet
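The examples above can be reproduced with the following sketch of step 1a and 1b, written against Porter's published rules and operating on a plain String. It is an approximation for illustration only: the class name Step1abSketch and all helper names are assumptions, and the Terrier implementation works on an internal character buffer and may deviate at the points Porter marks as DEPARTURE.

```java
// Hypothetical stand-alone sketch; not the Terrier source.
final class Step1abSketch {

    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u': return false;
            case 'y': return i == 0 || !isConsonant(w, i - 1);
            default: return true;
        }
    }

    /** Number of vowel-to-consonant transitions in s (Porter's measure). */
    static int measure(String s) {
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < s.length(); i++) {
            boolean consonant = isConsonant(s, i);
            if (consonant && previousWasVowel) m++;
            previousWasVowel = !consonant;
        }
        return m;
    }

    static boolean hasVowel(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isConsonant(s, i)) return true;
        }
        return false;
    }

    static boolean endsDoubleConsonant(String s) {
        int n = s.length();
        return n >= 2 && s.charAt(n - 1) == s.charAt(n - 2) && isConsonant(s, n - 1);
    }

    static boolean endsCvc(String s) {
        int i = s.length() - 1;
        if (i < 2 || !isConsonant(s, i) || isConsonant(s, i - 1) || !isConsonant(s, i - 2)) {
            return false;
        }
        char c = s.charAt(i);
        return c != 'w' && c != 'x' && c != 'y';
    }

    /** Drops the last n characters of w. */
    static String chop(String w, int n) {
        return w.substring(0, w.length() - n);
    }

    static String step1ab(String w) {
        if (w.length() <= 2) return w; // very short words are left alone

        // Step 1a: plurals.
        if (w.endsWith("sses") || w.endsWith("ies")) {
            w = chop(w, 2);
        } else if (w.endsWith("s") && !w.endsWith("ss")) {
            w = chop(w, 1);
        }

        // Step 1b: -eed, -ed, -ing.
        if (w.endsWith("eed")) {
            return measure(chop(w, 3)) > 0 ? chop(w, 1) : w;
        }
        String stem;
        if (w.endsWith("ed")) {
            stem = chop(w, 2);
        } else if (w.endsWith("ing")) {
            stem = chop(w, 3);
        } else {
            return w;
        }
        if (!hasVowel(stem)) return w;

        // Tidy up the stem after the suffix has been removed.
        if (stem.endsWith("at") || stem.endsWith("bl") || stem.endsWith("iz")) {
            return stem + "e";
        }
        if (endsDoubleConsonant(stem)) {
            char c = stem.charAt(stem.length() - 1);
            return (c == 'l' || c == 's' || c == 'z') ? stem : chop(stem, 1);
        }
        if (measure(stem) == 1 && endsCvc(stem)) {
            return stem + "e"; // restore the e of a short word
        }
        return stem;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"caresses", "ponies", "agreed", "matting", "mating", "meetings"}) {
            System.out.println(s + " becomes " + step1ab(s));
        }
    }
}
```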
-
step1c
protected final void step1c()
Turns terminal y to i when there is another vowel in the stem.
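A minimal sketch of this rule on a plain String, assuming Porter's published definition of a vowel. The class name Step1cSketch and the helper names are assumptions for this example; the real method edits the internal buffer in place.

```java
// Hypothetical stand-alone sketch; not the Terrier source.
final class Step1cSketch {

    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u': return false;
            case 'y': return i == 0 || !isConsonant(w, i - 1);
            default: return true;
        }
    }

    static boolean hasVowel(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isConsonant(s, i)) return true;
        }
        return false;
    }

    /** Replaces a final y with i if the part before the y contains a vowel. */
    static String step1c(String w) {
        if (w.endsWith("y")) {
            String stem = w.substring(0, w.length() - 1);
            if (hasVowel(stem)) {
                return stem + "i";
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println("happy becomes " + step1c("happy"));
        System.out.println("sky becomes " + step1c("sky"));
    }
}
```

Here "happy" becomes "happi", while "sky" is unchanged because "sk" contains no vowel.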
-
step2
protected final void step2()
Maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
-
step3
protected final void step3()
Deals with -ic-, -ful, -ness etc, similarly to the strategy of step2.
-
step4
protected final void step4()
Takes off -ant, -ence etc., in context <c>vcvc<v>.
-
step5
protected final void step5()
Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
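The sketch below illustrates this step on a plain String. It follows Porter's paper, which also removes a final -e when the measure is exactly 1 and the stem does not end consonant-vowel-consonant (with the last consonant not w, x or y); that extra condition is taken from the paper and is not stated in the description above, so treat it as an assumption about this class. The class name Step5Sketch and the helper names are likewise hypothetical.

```java
// Hypothetical stand-alone sketch; not the Terrier source.
final class Step5Sketch {

    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u': return false;
            case 'y': return i == 0 || !isConsonant(w, i - 1);
            default: return true;
        }
    }

    /** Number of vowel-to-consonant transitions in s (Porter's measure). */
    static int measure(String s) {
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < s.length(); i++) {
            boolean consonant = isConsonant(s, i);
            if (consonant && previousWasVowel) m++;
            previousWasVowel = !consonant;
        }
        return m;
    }

    static boolean endsCvc(String s) {
        int i = s.length() - 1;
        if (i < 2 || !isConsonant(s, i) || isConsonant(s, i - 1) || !isConsonant(s, i - 2)) {
            return false;
        }
        char c = s.charAt(i);
        return c != 'w' && c != 'x' && c != 'y';
    }

    static String step5(String w) {
        // Step 5a: drop a final e when the stem is long enough.
        if (w.endsWith("e")) {
            String stem = w.substring(0, w.length() - 1);
            int m = measure(stem);
            if (m > 1 || (m == 1 && !endsCvc(stem))) {
                w = stem;
            }
        }
        // Step 5b: reduce a final ll to l when m > 1.
        if (w.endsWith("ll") && measure(w) > 1) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"probate", "rate", "cease", "controll", "roll"}) {
            System.out.println(s + " becomes " + step5(s));
        }
    }
}
```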
-
vowelinstem
protected final boolean vowelinstem()
Returns TRUE if k0,...j contains a vowel.
Returns:
true if k0,...,j contains a vowel.
-
main
public static void main(java.lang.String[] args)
main
Parameters:
args