Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0
Component/s: None
Labels: None
Description
Martin Porter's website provides some test cases for stemming. Our Porter stemmer was hand-coded by Gianni and predates the Porter stemmer in Java. It has some known points of difference from Porter's algorithm.
Below is a list of terms that our current stemmer stems differently from Porter's. These are just the terms starting with "a":
"abruption", "acquisition", "addiction", "addition", "additions", "admission",
"admonition", "adoption", "affection", "affections", "affliction", "afflictions",
"allusion", "ambition", "ambitions", "apparition", "apparitions",
"apprehension", "apprehensions", "ascension", "aspersion", "assumption", "assumptions",
"attention", "attraction", "attribution"
For these terms, it seems that we either remove one character too many or do not remove anything at all.
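A comparison like the one above can be automated. The following is a minimal sketch in Python, not our actual test code: `find_mismatches` and `toy_stemmer` are hypothetical names, and `toy_stemmer` is only a stand-in for our stemmer that never strips "-ion". The reference stems are what Porter's algorithm should produce for four of the terms listed above.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    stem: Callable[[str], str],
    vocabulary: Iterable[str],
    expected: Iterable[str],
) -> List[Tuple[str, str, str]]:
    """Return (term, expected_stem, actual_stem) for every disagreement."""
    mismatches = []
    for term, want in zip(vocabulary, expected):
        got = stem(term)
        if got != want:
            mismatches.append((term, want, got))
    return mismatches


# Reference stems per Porter's algorithm for a few of the listed terms.
TERMS = ["addition", "admission", "adoption", "affection"]
REFERENCE = ["addit", "admiss", "adopt", "affect"]


def toy_stemmer(term: str) -> str:
    """Hypothetical stand-in, NOT our real stemmer: it removes nothing."""
    return term


# Print one line per term where the stand-in disagrees with the reference.
for term, want, got in find_mismatches(toy_stemmer, TERMS, REFERENCE):
    print(f"{term}: expected {want!r}, got {got!r}")
```

In practice the two lists would be read from the vocabulary and expected-output test data on Porter's website, and `stem` would wrap a call into our real stemmer.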
Changing the stemmer could make all of our current TRv3 indices unusable. Everyone else has to re-index anyway.
Here are the options: