  1. Terrier Core
  2. TR-87

PorterStemmer doesn't match the expected output published by Porter himself

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Component/s: None
    • Labels:
      None

      Description

      Martin Porter's website provides some test cases for stemming. Our Porter stemmer predates the Porter stemmer in Java, as it was hand-coded by Gianni. It has some known points of difference from Porter's algorithm.

      Below is a list of terms that our current stemmer stems differently from Porter's.
      These are just the terms starting with "a":

      "abruption", "acquisition", "addiction", "addition", "additions", "admission",
      "admonition", "adoption", "affection", "affections", "affliction", "afflictions",
      "allusion", "ambition", "ambitions", "apparition", "apparitions",
      "apprehension", "apprehensions", "ascension", "aspersion", "assumption", "assumptions",
      "attention", "attraction", "attribution"

      For these terms, it seems that we either remove one character too many or don't remove the suffix at all.
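A check like the one described above can be sketched as a small harness that runs a stemmer over a term list and reports every term whose stem differs from the expected one. This is only an illustration: `StemmerCheck`, `mismatches` and `toyStem` are hypothetical names, `toyStem` is a deliberately wrong stand-in rather than Terrier's stemmer, and the expected stems are sample values that should be verified against the vocabulary and output files on Porter's website.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class StemmerCheck {

    /** Returns one message per term whose actual stem differs from the expected stem. */
    static List<String> mismatches(UnaryOperator<String> stemmer,
                                   List<String> terms,
                                   List<String> expected) {
        if (terms.size() != expected.size()) {
            throw new IllegalArgumentException("terms and expected must have the same length");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            String actual = stemmer.apply(terms.get(i));
            if (!actual.equals(expected.get(i))) {
                out.add(terms.get(i) + ": expected " + expected.get(i) + ", got " + actual);
            }
        }
        return out;
    }

    /**
     * Hypothetical stand-in stemmer, NOT Terrier's code. It strips "tion",
     * which removes one character too many, and leaves "sion" words untouched.
     */
    static String toyStem(String term) {
        return term.endsWith("tion") ? term.substring(0, term.length() - 4) : term;
    }

    public static void main(String[] args) {
        // Sample terms from the list above; expected stems are assumptions to verify.
        List<String> terms = List.of("addition", "attention", "admission");
        List<String> expected = List.of("addit", "attent", "admiss");
        mismatches(StemmerCheck::toyStem, terms, expected).forEach(System.out::println);
    }
}
```

To test a real stemmer, pass its stemming method as the `stemmer` argument and load the two lists from Porter's published files instead of hard-coding them.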

          Activity

          craigm Craig Macdonald added a comment - - edited

          Changing the stemmer could make all of our current TRv3 indices unusable. Everyone else has to re-index anyway.

          Here are the options:

          • do nothing
          • add Porter's actual stemmer as another non-default option
          • add Porter's stemmer as default option - we should do this at a major version change. Also, it would be good to see if performance was positively impacted for any of our test collections.
          rodrygo Rodrygo L. T. Santos added a comment -

          I second the idea of having Porter's correct implementation as the default option (and maybe providing the current one as a deprecated version, just for backwards compatibility). Also, as we discussed, this is the best opportunity to correct this, since indices will change anyway with TRv3. The only disadvantage is indeed having to rebuild our own TRv3 indices.

          craigm Craig Macdonald added a comment -

          Resolved.

          I have replaced PorterStemmer and WeakPorterStemmer with Porter's own implementation.
          The TRv2 implementations are now TRv2PorterStemmer and TRv2WeakPorterStemmer. If you have indices built with them, you need to update your property files NOW.
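For an index built with the old stemmer, the update is probably a one-line change to the term pipeline. A sketch, assuming the index was built with the `termpipelines` property set to `Stopwords,PorterStemmer` in `terrier.properties` (both the file name and the pipeline are assumptions; check your own configuration):

```properties
# terrier.properties (assumed file name and pipeline; adjust to your setup)
# before: termpipelines=Stopwords,PorterStemmer
termpipelines=Stopwords,TRv2PorterStemmer
```

Indices built with the old WeakPorterStemmer would use TRv2WeakPorterStemmer in the same way.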


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              craigm Craig Macdonald