This issue is progressing well:
1. An adaption to BitIndexPointer to handle file ids
2. Alterations to HadoopIndexing to create one inverted file per reducer
3. A partitioner to split terms among reducers.
The last part of this requires the most thought. Primarily, we wish to ensure that terms in different inverted file segments are contiguous, to retain backward compatibility. However, this means that partitioning reducers without a-prior knowledge of the lexical distribution of terms is difficult.
As an initial implementation, terms are partitioned in the a-z range, with the following special cases: all numbers end up on the first reducers (and everything less than 'a'), everything everything greater than 'z' ends up on the last reducer). This assumes that most content is mostly English - OK for most of our corpora. However, a more problematic assumption is that there is an even distribution between starting characters. Clearly, this is invalid, and causes some reducers to do more work than others. E.g. there are many more terms starting with numbers, not very many terms starting with 'z', etc.
Ivory uses a different approach, hashing terms across reducers, however, this means that while the content of an inverted file is sorted in itself, you need to read the files in parallel to achieve the terms in sorted order. I'm not sure if there are any particular downsides to this approach, although merge reading many files at once in a stream may be somewhat inefficient for index analysis jobs.
This issue is progressing well:
1. An adaption to BitIndexPointer to handle file ids
2. Alterations to HadoopIndexing to create one inverted file per reducer
3. A partitioner to split terms among reducers.
The last part of this requires the most thought. Primarily, we wish to ensure that terms in different inverted file segments are contiguous, to retain backward compatibility. However, this means that partitioning reducers without a-prior knowledge of the lexical distribution of terms is difficult.
As an initial implementation, terms are partitioned in the a-z range, with the following special cases: all numbers end up on the first reducers (and everything less than 'a'), everything everything greater than 'z' ends up on the last reducer). This assumes that most content is mostly English - OK for most of our corpora. However, a more problematic assumption is that there is an even distribution between starting characters. Clearly, this is invalid, and causes some reducers to do more work than others. E.g. there are many more terms starting with numbers, not very many terms starting with 'z', etc.
Ivory uses a different approach, hashing terms across reducers, however, this means that while the content of an inverted file is sorted in itself, you need to read the files in parallel to achieve the terms in sorted order. I'm not sure if there are any particular downsides to this approach, although merge reading many files at once in a stream may be somewhat inefficient for index analysis jobs.