  1. Terrier Core
  2. TR-538

TRECWebCollection doesn't parse malformed HTTP headers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0
    • Fix Version/s: 5.1
    • Component/s: .indexing
    • Labels:
      None

      Description

      This could be a regression, or I may have a bitrot issue in my copy of Blog06. In either event, exceptions iterating documents during indexing probably shouldn't kill the indexer.

      13:23:35.006 [main] INFO o.t.i.MultiDocumentFileCollection - TRECWebCollection 49% processing /home/collections/blog06/20051227/permalinks-047
      Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
      at java.lang.String.substring(String.java:1931)
      at org.terrier.indexing.TRECWebCollection.afterPropertyTags(TRECWebCollection.java:178)
      at org.terrier.indexing.TRECCollection.nextDocument(TRECCollection.java:351)
      at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:200)
      at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:159)
      at org.terrier.structures.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:154)
      at org.terrier.applications.BatchIndexing$Command.run(BatchIndexing.java:102)
      at org.terrier.applications.CLITool$CLIParsedCLITool.run(CLITool.java:130)
      at org.terrier.applications.CLITool.main(CLITool.java:244)

          Activity

          craigm Craig Macdonald added a comment (edited)

          Hi Ian,

          Starting to look at your issues. This one happens because a header doesn't have a value.

          Change the if statement further up to read:

          if ((Colon = lines[i].indexOf(':') ) > 1 && Colon < lines[i].length() -1)
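To illustrate how the guard above behaves, here is a minimal stand-alone sketch. It is not the actual TRECWebCollection code: the class name `HeaderGuardSketch`, the method `parseHeaders`, and the way name and value are extracted are assumptions made for illustration. Only the condition itself is taken from the comment. A line that ends in a colon (a header with no value) or that has no colon at all is skipped instead of reaching `substring`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not the actual TRECWebCollection code.
class HeaderGuardSketch {

    /** Parses "Name: value" lines, skipping any line that fails the colon guard. */
    static Map<String, String> parseHeaders(String[] lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (int i = 0; i < lines.length; i++) {
            int colon;
            // Guard from the comment: a colon must exist, must sit at index 2 or
            // later, and must not be the last character of the line.
            if ((colon = lines[i].indexOf(':')) > 1 && colon < lines[i].length() - 1) {
                // Assumed extraction: name before the colon, value after it.
                String name = lines[i].substring(0, colon).trim();
                String value = lines[i].substring(colon + 1).trim();
                headers.put(name, value);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Content-Type: text/html",
            "X-Empty:",       // header with no value: skipped by the guard
            "no colon here",  // indexOf returns -1: skipped by the guard
            "Server: Apache"
        };
        // prints {Content-Type=text/html, Server=Apache}
        System.out.println(parseHeaders(lines));
    }
}
```

Note that the guard silently drops malformed header lines rather than failing the document, so only the header is lost, not the page.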

          I think traditionally we don't use the Web data in TRECWebCollection, which is why we never observed this.

          On the other hand, in general, indexing in Terrier is fail-fast, as you don't want to mistakenly miss large portions of a (potentially large) collection without knowing it!
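The trade-off between the two positions in this thread can be sketched generically. The code below is a hypothetical illustration only: `FailFastSketch`, `indexFailFast`, `indexLenient`, and `Report` are invented names and do not exist in Terrier. The fail-fast loop lets the first exception abort the run, while the lenient loop skips the bad document but records it so the loss is not silent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names exist in Terrier.
class FailFastSketch {

    /** Outcome of a lenient run: count of processed documents plus the skipped ones. */
    static final class Report {
        int processed = 0;
        final List<String> skipped = new ArrayList<>();
    }

    /** Fail-fast: the first bad document aborts the whole run. */
    static int indexFailFast(List<String> docs, Consumer<String> handler) {
        int processed = 0;
        for (String doc : docs) {
            handler.accept(doc); // any RuntimeException propagates to the caller
            processed++;
        }
        return processed;
    }

    /** Lenient: bad documents are skipped, but each one is recorded. */
    static Report indexLenient(List<String> docs, Consumer<String> handler) {
        Report report = new Report();
        for (String doc : docs) {
            try {
                handler.accept(doc);
                report.processed++;
            } catch (RuntimeException e) {
                report.skipped.add(doc);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("doc1", "", "doc3");
        // Stand-in for document parsing: an empty document is treated as malformed.
        Consumer<String> handler = d -> {
            if (d.isEmpty()) throw new IllegalArgumentException("empty document");
        };
        Report r = indexLenient(docs, handler);
        // prints 2 processed, 1 skipped
        System.out.println(r.processed + " processed, " + r.skipped.size() + " skipped");
    }
}
```

With the same input, `indexFailFast` would throw on the second document, which is the behaviour described above as the default.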

          I'm travelling for the next 2 weeks, so I might not get to all of your issues for a little while.

          Craig

          craigm Craig Macdonald added a comment

          I have committed the fix to GitHub.

          https://github.com/terrier-org/terrier-core/commit/f7024dfe399cdfbdaf78e67335c6b6d1ff92b3ef

          Thanks Ian.

          Craig

            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              isoboroff Ian Soboroff
            • Watchers:
              2
