  1. Terrier Core
  2. TR-275

TRECWebCollection doesn't normalise encodings

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: None
    • Labels:
      None

      Description

      StringTools.normaliseEncoding() is never called by TRECWebCollection, so encoding names are passed on unnormalised.

          Activity

          craigm Craig Macdonald added a comment -

          Committed r3679

          craigm Craig Macdonald added a comment -

          Prevent these exceptions:

          java.io.UnsupportedEncodingException: x-mac-roman
          at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:52)
          at java.io.InputStreamReader.<init>(InputStreamReader.java:83)
          at org.terrier.indexing.TaggedDocument.<init>(TaggedDocument.java:156)
          at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
          at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
          at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
          at org.terrier.indexing.TRECCollection.getDocument(TRECCollection.java:509)
          at org.terrier.indexing.FilterCollection.getDocument(FilterCollection.java:66)
          at org.terrier.indexing.URLCollection.getDocument(URLCollection.java:78)
          at org.terrier.spam.BenderskyFeatures.main(BenderskyFeatures.java:37)
          INFO - Processing /local/terrier/Collections/TREC/DOTGOV2/gov2-corpus//GX006/99.gz
          WARN - Desired encoding (iso-8859-1;charset=iso-8859-1) unsupported. Resorting to platform default.
          java.io.UnsupportedEncodingException: iso-8859-1;charset=iso-8859-1
          at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:52)
          at java.io.InputStreamReader.<init>(InputStreamReader.java:83)
          at org.terrier.indexing.TaggedDocument.<init>(TaggedDocument.java:156)
          at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
          at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
          at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
          at org.terrier.indexing.TRECCollection.getDocument(TRECCollection.java:509)
          at org.terrier.indexing.FilterCollection.getDocument(FilterCollection.java:66)
          at org.terrier.indexing.URLCollection.getDocument(URLCollection.java:78)
          at org.terrier.spam.BenderskyFeatures.main(BenderskyFeatures.java:37)
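Both traces come from handing a raw encoding label straight to InputStreamReader. The following is a minimal sketch of the kind of clean-up that avoids this, assuming the label may carry trailing parameters (as in "iso-8859-1;charset=iso-8859-1") or name a charset the JVM does not provide. The class EncodingSketch and the method resolveCharset are hypothetical names for illustration only; this is not the code committed for this issue and not the behaviour of StringTools.normaliseEncoding().

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical illustration; not part of Terrier.
class EncodingSketch {

    /**
     * Turns a declared encoding label into a Charset, returning the
     * supplied fallback when the label is missing, malformed, or unknown
     * to this JVM.
     */
    public static Charset resolveCharset(String declared, Charset fallback) {
        if (declared == null) {
            return fallback;
        }
        String name = declared.trim();
        // Drop anything after the first ';', e.g.
        // "iso-8859-1;charset=iso-8859-1" becomes "iso-8859-1".
        int semi = name.indexOf(';');
        if (semi >= 0) {
            name = name.substring(0, semi).trim();
        }
        // Remove stray quotes that sometimes surround the label.
        name = name.replace("\"", "").replace("'", "");
        if (name.isEmpty()) {
            return fallback;
        }
        try {
            // Charset.forName is case-insensitive and understands aliases.
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Illegal syntax, or a charset this JVM does not ship.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8; // assumed default for the sketch
        System.out.println(resolveCharset("iso-8859-1;charset=iso-8859-1", fallback));
        System.out.println(resolveCharset("x-mac-roman", fallback));
    }
}
```

Passing the resulting Charset to the InputStreamReader(InputStream, Charset) constructor, rather than passing a name, also sidesteps UnsupportedEncodingException, because that constructor does not declare it.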


            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              craigm Craig Macdonald
