  Terrier Core / TR-71

Allow BitInputStream structures to be split across multiple files

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Component/s: .indexing, .structures
    • Labels: None

      Description

      Inverted/direct index structures are traditionally contained in a single file. There is no reason for this to be the case. BitIndexPointer has two fields: a long (for the byte offset) and a byte (for the bit offset).

      A byte has range 0-255, but we only need 0-7. This means we could use the remainder of the byte to indicate which file a given pointer points to.

      Why? Well:
       * for MR indexing, we create one entire index per reducer. We could instead create one file of the inverted index per reducer.
       * for Inverted2DirectMultiReducer, we create one file of the direct index per reducer, but then have to join them later. This join is unnecessary.
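
      A minimal sketch of the bit-packing idea, assuming hypothetical field and accessor names rather than the actual BitIndexPointer API: reserving 3 bits for the bit offset leaves 5 spare bits, enough to address up to 32 files.

      public class SplitFilePointer {
          private long byteOffset;  // byte offset within the target file
          private byte offsetBits;  // low 3 bits: bit offset (0-7); high 5 bits: file id

          public void setOffset(long bytes, byte bits) {
              byteOffset = bytes;
              offsetBits = (byte) ((offsetBits & 0xF8) | (bits & 0x07));
          }

          public void setFileNumber(int fileId) {
              offsetBits = (byte) ((offsetBits & 0x07) | ((fileId & 0x1F) << 3));
          }

          public long getByteOffset() {
              return byteOffset;
          }

          public byte getOffsetBits() {
              return (byte) (offsetBits & 0x07);
          }

          public int getFileNumber() {
              return (offsetBits >>> 3) & 0x1F;
          }
      }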

        Attachments

          Activity

          craigm Craig Macdonald added a comment -

          This issue is progressing well:
          1. An adaptation of BitIndexPointer to handle file ids
          2. Alterations to HadoopIndexing to create one inverted file per reducer
          3. A partitioner to split terms among reducers.

          The last part of this requires the most thought. Primarily, we wish to ensure that terms in different inverted file segments are contiguous, to retain backward compatibility. However, this means that partitioning reducers without a priori knowledge of the lexical distribution of terms is difficult.

          As an initial implementation, terms are partitioned over the a-z range, with the following special cases: all numbers (and everything else less than 'a') end up on the first reducer, and everything greater than 'z' ends up on the last reducer. This assumes that content is mostly English - OK for most of our corpora. However, a more problematic assumption is that there is an even distribution between starting characters. Clearly, this is invalid, and causes some reducers to do more work than others. E.g. there are many more terms starting with numbers, and not very many terms starting with 'z', etc.

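          A sketch of the range-based partitioner described above, using the old Hadoop mapred API; the class name and the key/value types are illustrative and not necessarily those used by HadoopIndexing.

          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.io.Writable;
          import org.apache.hadoop.mapred.JobConf;
          import org.apache.hadoop.mapred.Partitioner;

          public class AlphabeticSplitPartitioner implements Partitioner<Text, Writable> {

              public void configure(JobConf job) { /* no configuration needed */ }

              public int getPartition(Text term, Writable postings, int numReducers) {
                  char first = Character.toLowerCase(term.toString().charAt(0));
                  if (first < 'a')
                      return 0;                 // digits and anything else below 'a'
                  if (first > 'z')
                      return numReducers - 1;   // everything beyond 'z'
                  // spread 'a'..'z' evenly over the available reducers
                  return (first - 'a') * numReducers / 26;
              }
          }
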
          Ivory uses a different approach, hashing terms across reducers. However, this means that while each inverted file is sorted in itself, the files must be read in parallel to obtain the terms in globally sorted order. I'm not sure if there are any particular downsides to this approach, although merge-reading many files at once in a stream may be somewhat inefficient for index analysis jobs.
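
          For contrast, a hash-based split in the style described for Ivory balances the load but interleaves the global term order across files; a minimal sketch, reusing the imports and illustrative types from the range partitioner above:

          public class HashedTermPartitioner implements Partitioner<Text, Writable> {

              public void configure(JobConf job) { }

              public int getPartition(Text term, Writable postings, int numReducers) {
                  // terms are spread evenly, but a globally sorted scan then
                  // requires a merge-read over all inverted file segments
                  return (term.hashCode() & Integer.MAX_VALUE) % numReducers;
              }
          }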

          craigm Craig Macdonald added a comment -

          Resolved. Testing on Glagrid2 obtained identical results to the single-reducer scenario, with much improved reduce speed.


            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              craigm Craig Macdonald
            • Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved: