  1. Terrier Core
  2. TR-538

TRECWebCollection doesn't parse malformed HTTP headers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0
    • Fix Version/s: 5.1
    • Component/s: .indexing
    • Labels:
      None

      Description

      This could be a regression, or my copy of Blog06 may have suffered bit rot. Either way, an exception thrown while iterating over documents during indexing probably shouldn't kill the indexer.

      13:23:35.006 [main] INFO o.t.i.MultiDocumentFileCollection - TRECWebCollection 49% processing /home/collections/blog06/20051227/permalinks-047
      Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
      at java.lang.String.substring(String.java:1931)
      at org.terrier.indexing.TRECWebCollection.afterPropertyTags(TRECWebCollection.java:178)
      at org.terrier.indexing.TRECCollection.nextDocument(TRECCollection.java:351)
      at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:200)
      at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:159)
      at org.terrier.structures.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:154)
      at org.terrier.applications.BatchIndexing$Command.run(BatchIndexing.java:102)
      at org.terrier.applications.CLITool$CLIParsedCLITool.run(CLITool.java:130)
      at org.terrier.applications.CLITool.main(CLITool.java:244)
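
The trace shows String.substring being handed an index of -1, which is what typically happens when the result of indexOf is used without checking that the separator was found. The following is only a minimal sketch of a tolerant approach, assuming the failing code splits each header line at the first colon; the class and method names (LenientHeaderParser, parseHeaderLine, parseHeaders) are hypothetical and are not Terrier's API or its actual fix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Terrier.
final class LenientHeaderParser {

    /**
     * Splits a "Name: value" line at the first colon.
     * Returns null instead of throwing when the line has no usable separator.
     */
    static String[] parseHeaderLine(String line) {
        int colon = line.indexOf(':');
        if (colon <= 0) {
            // indexOf returned -1 (no colon) or the name is empty.
            // Passing -1 to substring here would raise
            // StringIndexOutOfBoundsException, as in the trace above.
            return null;
        }
        String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        String value = line.substring(colon + 1).trim();
        return new String[] { name, value };
    }

    /** Collects well-formed headers and silently skips malformed lines. */
    static Map<String, String> parseHeaders(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : lines) {
            String[] pair = parseHeaderLine(line);
            if (pair != null) {
                headers.put(pair[0], pair[1]);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        // Assumed sample input: one malformed line between two valid headers.
        List<String> raw = Arrays.asList(
                "Content-Type: text/html",
                "this line has no separator",
                "Content-Length: 1234");
        System.out.println(parseHeaders(raw));
    }
}
```

Skipping a bad line (ideally with a logged warning) keeps one corrupt document from stopping the whole indexing run, which is the behaviour requested in the description.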

            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              isoboroff Ian Soboroff
            • Watchers:
              2
