  1. Terrier Core
  2. TR-383

Default constructors don't work for WARC Collection implementations

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.2
    • Component/s: .indexing
    • Labels:
      None

      Description

      Hi

      We're trying to index the ClueWeb12 collection using Terrier 4.1. When we try to index a single warc.gz file, we get the following error.

      ---
      [terrier@xxxx terrier]$ ./bin/trec_terrier.sh -i -j
      Setting TERRIER_HOME to /home/terrier/terrier
      Picked up _JAVA_OPTIONS: -Xmx8192M
      Starting building the inverted file ...
      15:28:28.191 [main] INFO o.t.structures.indexing.Indexer - Checking memory usage every 20 maxDocPerFlush=0
      15:28:28.200 [main] INFO o.t.structures.indexing.Indexer - creating the data structures data_1
      15:28:28.200 [main] INFO o.t.structures.indexing.Indexer - Creating IF (no direct file)..
      A problem occurred: java.lang.NullPointerException
      java.lang.NullPointerException
      at org.terrier.indexing.WARC018Collection.readLine(WARC018Collection.java:234)
      at org.terrier.indexing.WARC10Collection.nextDocument(WARC10Collection.java:69)
      at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:200)
      at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:159)
      at org.terrier.structures.indexing.Indexer.index(Indexer.java:348)
      at org.terrier.applications.TRECIndexing.createSinglePass(TRECIndexing.java:272)
      at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:402)
      at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:599)
      at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:246)
      [terrier@ibq0 terrier]$
      ---

      Any ideas?

      Here is some more detailed information.

      [terrier@xxxx terrier]$ cat etc/terrier.properties
      trec.collection.class=WARC10Collection
      indexer.meta.forward.keys=docno,url
      indexer.meta.forward.keylens=26,256
      indexer.meta.reverse.keys=docno
      TrecDocTags.skip=SCRIPT,STYLE

      [terrier@xxxx terrier]$ cat etc/collection.spec
      #add the files to index
      /home/terrier/0000tw-00.warc.gz

      [terrier@xxxx terrier]$ echo $JAVA_HOME
      /opt/jdk1.7.0_60/

      [terrier@xxxx terrier]$ java -version
      Picked up _JAVA_OPTIONS: -Xmx8192M
      java version "1.7.0_60"
      Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
      Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)


      Thanks,

      Hideo

        Attachments

          Activity

          hideojoho Hideo Joho added a comment -

          Solved the problem. Thanks!

          Hideo

          craigm Craig Macdonald added a comment -

          Patch attached.

          craigm Craig Macdonald added a comment -

          Yes. Minor bug.

          On line 47, in the default constructor of WARC10Collection, change super() to this(ApplicationSetup.COLLECTION_SPEC);

          A new test case will be added in the next release.

          Craig
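
          The one-line fix described in this comment relies on constructor delegation. Below is a minimal sketch of that pattern, not Terrier's actual code: BaseCollection, BrokenCollection, FixedCollection and DEFAULT_SPEC are hypothetical stand-ins, and the sketch assumes the NullPointerException arises because the no-argument parent constructor never loads the collection spec, leaving the file source null.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ConstructorDelegationSketch {

    /** Hypothetical stand-in for the parent collection class. */
    static class BaseCollection {
        // Stays null unless a constructor that takes a spec is used.
        protected Iterator<String> files;

        BaseCollection() { }  // assumption: sets up nothing

        BaseCollection(List<String> spec) {
            this.files = spec.iterator();
        }

        String nextFile() {
            // Throws NullPointerException when files was never initialised.
            return files.hasNext() ? files.next() : null;
        }
    }

    /** Default constructor calls super(), so files is never initialised. */
    static class BrokenCollection extends BaseCollection {
        BrokenCollection() { super(); }
        BrokenCollection(List<String> spec) { super(spec); }
    }

    /** Default constructor delegates to the spec-taking constructor. */
    static class FixedCollection extends BaseCollection {
        // Hypothetical default, playing the role of ApplicationSetup.COLLECTION_SPEC.
        static final List<String> DEFAULT_SPEC = Arrays.asList("example.warc.gz");

        FixedCollection() { this(DEFAULT_SPEC); }
        FixedCollection(List<String> spec) { super(spec); }
    }

    /** Returns true if reading from the collection throws a NullPointerException. */
    static boolean throwsNpe(BaseCollection c) {
        try {
            c.nextFile();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsNpe(new BrokenCollection())); // prints true
        System.out.println(throwsNpe(new FixedCollection()));  // prints false
    }
}
```

          Delegating with this(...) keeps all initialisation in a single constructor, so a class instantiated through its default constructor (as happens when it is named in trec.collection.class) ends up in the same state as one built with an explicit spec.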


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              hideojoho Hideo Joho