[TR-295] WARC10Collection incorrectly misses some documents Created: 12/May/14  Updated: 16/Jun/14  Resolved: 20/May/14

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.6
Fix Version/s: 4.0

Type: Bug Priority: Major
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Text File WARC018Collection.java     Text File WARC10Collection.java    

 Description   
ClueWeb12_00/0000tw/0000tw-00.warc.gz should have 24644 records (of which 3 are redirects). Currently WARC10Collection only finds 23377 documents in that file.

 Comments   
Comment by Craig Macdonald [ 12/May/14 ]

The issue appears to be that WARC records use DOS line endings (\r\n), which take two bytes each. We were counting only one byte per line ending.
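To illustrate the byte-counting point, here is a minimal sketch (not the attached fix). It assumes the reader tracks its position by summing the bytes of each header line. The class and method names (CrlfHeaderScan, endOfLine, headerBlockLength) are hypothetical and do not come from WARC10Collection.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; names are hypothetical, not taken from Terrier. */
final class CrlfHeaderScan {

    /** Returns the offset just past the '\n' that ends the line starting at start, or data.length if there is none. */
    static int endOfLine(byte[] data, int start) {
        int i = start;
        while (i < data.length) {
            if (data[i++] == '\n') {
                return i;
            }
        }
        return data.length;
    }

    /** Returns the number of bytes in the header block at start, including the blank line that ends it. */
    static int headerBlockLength(byte[] data, int start) {
        int pos = start;
        while (pos < data.length) {
            int next = endOfLine(data, pos);
            int textLen = next - pos;
            // Strip the terminator from the text length only; both bytes of "\r\n" still count towards pos.
            if (textLen > 0 && data[next - 1] == '\n') {
                textLen--;
            }
            if (textLen > 0 && data[pos + textLen - 1] == '\r') {
                textLen--;
            }
            pos = next;
            if (textLen == 0) {
                break; // a blank line ends the header block
            }
        }
        return pos - start;
    }

    public static void main(String[] args) {
        byte[] data = "WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\nhello"
                .getBytes(StandardCharsets.ISO_8859_1);
        int headerBytes = headerBlockLength(data, 0);
        String payload = new String(data, headerBytes, data.length - headerBytes, StandardCharsets.ISO_8859_1);
        System.out.println(headerBytes + " header bytes, payload = " + payload);
    }
}
```

Counting one byte per line ending here would put the payload offset four bytes too early, one for each of the four header lines, so every later offset in the file would drift.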

Comment by Craig Macdonald [ 13/May/14 ]

The other problem was that the HTTP status line (/usually/ 14 bytes) was not accounted for in the blob length.
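One possible reading of this comment, sketched under assumptions: the reader consumes the HTTP status line before reading the rest of the record block, so those bytes have to be subtracted from the record's Content-Length when working out how much is left. The names below (BlobLengthSketch, statusLineLength, remainingAfterStatusLine) are hypothetical and are not the committed code.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; names are hypothetical, not taken from Terrier. */
final class BlobLengthSketch {

    /** Returns the byte length of the first line of the block, including its line terminator. */
    static int statusLineLength(byte[] block) {
        int i = 0;
        while (i < block.length) {
            if (block[i++] == '\n') {
                break;
            }
        }
        return i;
    }

    /** Bytes of the record block still unread once the status line has been consumed. */
    static long remainingAfterStatusLine(long contentLength, byte[] block) {
        return contentLength - statusLineLength(block);
    }

    public static void main(String[] args) {
        byte[] block = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        // In a WARC response record, Content-Length covers the whole block, status line included.
        long contentLength = block.length;
        System.out.println("remaining = " + remainingAfterStatusLine(contentLength, block));
    }
}
```

If the status line is read but not subtracted, the reader runs past the end of the block by that many bytes and can lose its place at the next record boundary.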

Comment by Craig Macdonald [ 13/May/14 ]

These files address the issue.

Comment by Craig Macdonald [ 20/May/14 ]

Committed r3816
