[TR-71] Allow BitInputStream structures to be split across multiple files Created: 19/Oct/09  Updated: 05/Mar/10  Resolved: 17/Dec/09

Status: Resolved
Project: Terrier Core
Component/s: .indexing, .structures
Affects Version/s: None
Fix Version/s: 3.0

Type: Improvement Priority: Minor
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Inverted/direct files are traditionally contained in one single file. There is no reason for this to be the case. BitIndexPointer has two fields: long (for byte offset), and byte (for bit offset).

A byte has range 0-255, but we only need 0-7. This means we could use the remainder of the byte to indicate which file a given pointers points to.

Why? Well:
 * for MR indexing, we create one entire index per reducer. We could instead create one file of the inverted index per reducer.
 * for Inverted2DirectMultiReducer, we create one file of the direct index per reducer, but then have to join them later. This join is unnecessary.

Comment by Craig Macdonald [ 14/Dec/09 ]

This issue is progressing well:
1. An adaption to BitIndexPointer to handle file ids
2. Alterations to HadoopIndexing to create one inverted file per reducer
3. A partitioner to split terms among reducers.

The last part of this requires the most thought. Primarily, we wish to ensure that terms in different inverted file segments are contiguous, to retain backward compatibility. However, this means that partitioning reducers without a-prior knowledge of the lexical distribution of terms is difficult.

As an initial implementation, terms are partitioned in the a-z range, with the following special cases: all numbers end up on the first reducers (and everything less than 'a'), everything everything greater than 'z' ends up on the last reducer). This assumes that most content is mostly English - OK for most of our corpora. However, a more problematic assumption is that there is an even distribution between starting characters. Clearly, this is invalid, and causes some reducers to do more work than others. E.g. there are many more terms starting with numbers, not very many terms starting with 'z', etc.

Ivory uses a different approach, hashing terms across reducers, however, this means that while the content of an inverted file is sorted in itself, you need to read the files in parallel to achieve the terms in sorted order. I'm not sure if there are any particular downsides to this approach, although merge reading many files at once in a stream may be somewhat inefficient for index analysis jobs.

Comment by Craig Macdonald [ 17/Dec/09 ]

Resolved. Testing on Glagrid2 obtained identical results to a single reducer scenario, with much improved reduce speed.

Generated at Mon Jul 16 05:39:33 BST 2018 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.