[TR-275] TRECWebCollection doesn't normalise encodings Created: 10/Jan/13  Updated: 04/Apr/14  Resolved: 11/Mar/13

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Bug
Priority: Minor
Reporter: Craig Macdonald
Assignee: Craig Macdonald
Resolution: Fixed
Labels: None


 Description   
TRECWebCollection never calls StringTools.normaliseEncoding(), so encoding names are used without being normalised.

 Comments   
Comment by Craig Macdonald [ 10/Jan/13 ]

Normalising the encoding name should prevent exceptions like these:

java.io.UnsupportedEncodingException: x-mac-roman
at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:52)
at java.io.InputStreamReader.<init>(InputStreamReader.java:83)
at org.terrier.indexing.TaggedDocument.<init>(TaggedDocument.java:156)
at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.terrier.indexing.TRECCollection.getDocument(TRECCollection.java:509)
at org.terrier.indexing.FilterCollection.getDocument(FilterCollection.java:66)
at org.terrier.indexing.URLCollection.getDocument(URLCollection.java:78)
at org.terrier.spam.BenderskyFeatures.main(BenderskyFeatures.java:37)
INFO - Processing /local/terrier/Collections/TREC/DOTGOV2/gov2-corpus//GX006/99.gz
WARN - Desired encoding (iso-8859-1;charset=iso-8859-1) unsupported. Resorting to platform default.
java.io.UnsupportedEncodingException: iso-8859-1;charset=iso-8859-1
at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:52)
at java.io.InputStreamReader.<init>(InputStreamReader.java:83)
at org.terrier.indexing.TaggedDocument.<init>(TaggedDocument.java:156)
at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.terrier.indexing.TRECCollection.getDocument(TRECCollection.java:509)
at org.terrier.indexing.FilterCollection.getDocument(FilterCollection.java:66)
at org.terrier.indexing.URLCollection.getDocument(URLCollection.java:78)
at org.terrier.spam.BenderskyFeatures.main(BenderskyFeatures.java:37)
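
For illustration, here is a minimal sketch of the kind of clean-up that would avoid both traces above. It is not the code of StringTools.normaliseEncoding() and not what r3679 changed; the class EncodingSketch, the method normalise() and its fallback parameter are hypothetical names used only for this example. It assumes that stripping any ";charset=..." parameter suffix and falling back to a caller-supplied charset is acceptable behaviour.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration only; not part of Terrier.
class EncodingSketch {

    /**
     * Returns a charset name that the running JVM supports.
     * Falls back to the given name when the declared one is missing,
     * malformed, or unsupported.
     */
    static String normalise(String declared, String fallback) {
        if (declared == null) {
            return fallback;
        }
        String name = declared.trim();
        // Drop trailing parameters, e.g. "iso-8859-1;charset=iso-8859-1".
        int semi = name.indexOf(';');
        if (semi >= 0) {
            name = name.substring(0, semi).trim();
        }
        // Drop stray quotes that sometimes surround the value.
        name = name.replace("\"", "").replace("'", "");
        if (name.isEmpty()) {
            return fallback;
        }
        try {
            // Use the canonical name when the JVM knows this charset.
            return Charset.isSupported(name)
                    ? Charset.forName(name).name()
                    : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are illegal in a charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // The two encoding names taken from the traces in this issue.
        System.out.println(normalise("iso-8859-1;charset=iso-8859-1", "UTF-8"));
        System.out.println(normalise("x-mac-roman", "UTF-8"));
    }
}
```

Whether "x-mac-roman" resolves to a real charset or to the fallback depends on the charsets installed in the JVM. Either way, the name returned can be passed to InputStreamReader without an UnsupportedEncodingException.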

Comment by Craig Macdonald [ 11/Mar/13 ]

Committed r3679

Generated at Mon Dec 18 09:00:56 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.