Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.6
- Fix Version/s: 4.0
- Component/s: None
- Labels: None
Description
ClueWeb12_00/0000tw/0000tw-00.warc.gz should have 24644 records (of which 3 are redirects). Currently WARC10Collection only finds 23377 documents in that file.
The issue appears to be that lines in WARC records are terminated with DOS line endings (\r\n), which take two bytes. We were counting only one byte per line ending, so the byte offsets used to locate records drifted.
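To illustrate the byte accounting involved, here is a minimal sketch of a line reader that reports how many raw bytes each line occupied, terminator included. It is not the actual WARC10Collection code or the fix that was committed; the names `WarcLineCounter`, `Line`, `readLine` and `headerLength` are hypothetical and chosen only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of WARC10Collection.
final class WarcLineCounter {

    /** A decoded line plus the number of raw bytes it occupied, terminator included. */
    static final class Line {
        public final String text;
        public final int bytesConsumed;

        Line(String text, int bytesConsumed) {
            this.text = text;
            this.bytesConsumed = bytesConsumed;
        }
    }

    /**
     * Reads one line starting at offset, accepting "\n" or "\r\n" as the terminator.
     * Returns null when offset is at or past the end of the data.
     */
    public static Line readLine(byte[] data, int offset) {
        if (offset >= data.length) {
            return null;
        }
        int end = offset;
        while (end < data.length && data[end] != '\n') {
            end++;
        }
        boolean terminated = end < data.length;
        int textEnd = end;
        // Strip the CR from the text, but still count it as a consumed byte.
        if (terminated && textEnd > offset && data[textEnd - 1] == '\r') {
            textEnd--;
        }
        int consumed = (end - offset) + (terminated ? 1 : 0);
        String text = new String(data, offset, textEnd - offset, StandardCharsets.ISO_8859_1);
        return new Line(text, consumed);
    }

    /** Returns the byte length of a header block, including the blank line that ends it. */
    public static int headerLength(byte[] data, int offset) {
        int pos = offset;
        Line line;
        while ((line = readLine(data, pos)) != null) {
            pos += line.bytesConsumed;
            if (line.text.isEmpty()) {
                break;
            }
        }
        return pos - offset;
    }
}
```

With this accounting, a header line such as `WARC/1.0\r\n` consumes 10 bytes rather than 9. Undercounting by one byte per header line shifts where the reader believes the record body starts, which would be consistent with records being missed.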