Description
This could be a regression, or I may have a bitrot issue in my copy of Blog06. In either case, an exception thrown while iterating over documents during indexing probably shouldn't kill the indexer.
13:23:35.006 [main] INFO o.t.i.MultiDocumentFileCollection - TRECWebCollection 49% processing /home/collections/blog06/20051227/permalinks-047
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1931)
at org.terrier.indexing.TRECWebCollection.afterPropertyTags(TRECWebCollection.java:178)
at org.terrier.indexing.TRECCollection.nextDocument(TRECCollection.java:351)
at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:200)
at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:159)
at org.terrier.structures.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:154)
at org.terrier.applications.BatchIndexing$Command.run(BatchIndexing.java:102)
at org.terrier.applications.CLITool$CLIParsedCLITool.run(CLITool.java:130)
at org.terrier.applications.CLITool.main(CLITool.java:244)
Hi Ian,
Starting to look at your issues. This one happens because a header doesn't have a value.
Change the if statement further up to read:
if ((Colon = lines[i].indexOf(':') ) > 1 && Colon < lines[i].length() -1)
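To illustrate why the extra bound helps, here is a minimal, self-contained sketch of a header parser using the same guard. It is not the actual TRECWebCollection code: the class name HeaderParseSketch, the method parseHeaders, and the way the name and value are extracted are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the header parsing in afterPropertyTags; not Terrier code.
public class HeaderParseSketch {

    /** Parses "Name: value" lines, skipping lines with no usable name or no value. */
    public static Map<String, String> parseHeaders(String[] lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (int i = 0; i < lines.length; i++) {
            int colon;
            // Same guard as the suggested fix: the colon must sit past index 1
            // and must not be the last character of the line.
            if ((colon = lines[i].indexOf(':')) > 1 && colon < lines[i].length() - 1) {
                String name = lines[i].substring(0, colon).trim().toLowerCase();
                // At least one character follows the colon, so this substring is in range.
                String value = lines[i].substring(colon + 1).trim();
                headers.put(name, value);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String[] lines = { "Content-Type: text/html", "X-Empty:", "no colon here" };
        System.out.println(parseHeaders(lines));
    }
}
```

With this guard, a line such as "X-Empty:" (a header name with nothing after the colon) is skipped instead of reaching a substring call with an out-of-range index.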
I think traditionally we don't use the Web data in TRECWebCollection, which is why we never observed this.
On the other hand, indexing in Terrier is generally fail-fast, as you don't want to silently miss large portions of a (potentially large) collection without knowing it!
I'm travelling for the next 2 weeks, so I might not get to all of your issues for a little while.
Craig