[TR-144] CollectionRecordReader.next should not be recursive Created: 17/Feb/11 Updated: 05/Apr/11 Resolved: 04/Mar/11
|Reporter:||Rodrygo L. T. Santos||Assignee:||Rodrygo L. T. Santos|
org.terrier.structures.indexing.singlepass.hadoop.CollectionRecordReader.next recursively locates the next Document to be processed from the Collection object. However, when many documents in the sequence are missing (e.g., when indexing only a few selected documents), each skipped document adds another recursive call, and a long enough run of them overflows the stack (a StackOverflowError in Java).
CollectionRecordReader.next should be made iterative instead of recursive.
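To illustrate the difference, here is a minimal sketch of the two approaches. It is not the Terrier code: DocSource, hasMore(), fetch(), and sparseSource() are hypothetical names made up for this example, and a null return stands in for a missing document. The recursive version uses one stack frame per skipped document, since Java does not eliminate tail calls, while the loop keeps the stack depth constant.

```java
public class IterativeNextSketch {

    /** Hypothetical, simplified document source; NOT Terrier's Collection API. */
    interface DocSource {
        boolean hasMore();
        /** Returns the next document, or null if that position is missing. */
        String fetch();
    }

    /** Recursive variant: stack depth grows with the number of skipped documents. */
    static String nextRecursive(DocSource src) {
        if (!src.hasMore()) {
            return null;
        }
        String doc = src.fetch();
        if (doc != null) {
            return doc;
        }
        return nextRecursive(src); // one extra frame per missing document
    }

    /** Iterative variant: same result, constant stack depth. */
    static String nextIterative(DocSource src) {
        while (src.hasMore()) {
            String doc = src.fetch();
            if (doc != null) {
                return doc;
            }
            // missing document: keep scanning in the same frame
        }
        return null;
    }

    /** Hypothetical helper: `gaps` missing documents followed by one real one. */
    static DocSource sparseSource(final int gaps, final String last) {
        return new DocSource() {
            private int pos = 0;
            public boolean hasMore() { return pos <= gaps; }
            public String fetch() { return pos++ < gaps ? null : last; }
        };
    }

    public static void main(String[] args) {
        // Five million consecutive gaps are handled without deep recursion.
        System.out.println(nextIterative(sparseSource(5_000_000, "doc-42")));
    }
}
```

Calling nextRecursive on the same five-million-gap source would be expected to exhaust a default-sized thread stack, although the exact depth at which that happens depends on the JVM and its stack size setting.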
|Comment by Craig Macdonald [ 21/Feb/11 ]|
Did your implementation for this work out OK?
If so, you should test for normal indexing scenarios as well as the Hadoop end-to-end test before committing.
|Comment by Rodrygo L. T. Santos [ 04/Mar/11 ]|
Committed version with an iterative implementation of next(). Tested under a standard indexing scenario (TRECCollection), as well as under the scenario that caused problems before (WhitelistCollection,TRECCollection). The number of indexed documents matches the expected value in both scenarios.