  1. Terrier Core
  2. TR-518

Microblog 2013 collection does not work

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1
    • Component/s: None
    • Labels:
      None

      Description

      Reported at SIGIR by Zia from CNRS ?

        Attachments

          Activity

          craigm Craig Macdonald created issue -
          richardm Richard McCreadie added a comment -

          The crop function in CompressingMetaIndexBuilder is not effective, which leads to indexing failures for tweets.

          Crop operates on characters, while the length check at encoding time counts bytes.

          It appears that FixedSizeTextFactory.getMaximumTextLength underestimates the maximum number of bytes needed to encode a string of N characters.
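
          The character-versus-byte mismatch described above can be illustrated with plain JDK code. This is only a sketch under stated assumptions, not Terrier's implementation or the committed fix: the class `ByteAwareCrop` and its methods `utf8Length` and `cropToUtf8Bytes` are hypothetical names introduced here, and UTF-8 is assumed as the encoding.

```java
import java.nio.charset.StandardCharsets;

public class ByteAwareCrop {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes.
     * Cuts only on code point boundaries, so a surrogate pair is never split.
     */
    public static String cropToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 width of this code point: 1 to 4 bytes
            int width = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + width > maxBytes) {
                break;
            }
            bytes += width;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String ascii = "hello";
        // Six Thai characters, each three bytes in UTF-8 (illustrative sample)
        String thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35";

        // prints "5 chars, 5 bytes"
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        // prints "6 chars, 18 bytes"
        System.out.println(thai.length() + " chars, " + utf8Length(thai) + " bytes");

        // A crop to 10 characters would leave all 18 bytes; a crop to 10 bytes keeps 3 characters.
        String cropped = cropToUtf8Bytes(thai, 10);
        // prints "3 chars, 9 bytes after crop"
        System.out.println(cropped.length() + " chars, " + utf8Length(cropped) + " bytes after crop");
    }
}
```

          A crop that counts characters lets the six-character string through unchanged even though it needs 18 bytes, which is the kind of gap the byte-based check then rejects.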

          richardm Richard McCreadie added a comment -

          Relevant error:

          Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: Data ('??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????') with 134 characters and byte length 591 for key text exceeds max byte length of 452(string length of 150). Crop in the Document, or increase indexer.meta.forward.keylens

          richardm Richard McCreadie added a comment - - edited

          Test confirms above result:

          @Test
          public void testCropFunction() throws IOException {
              // Locate the example tweet under share/tests/tweets in TERRIER_HOME
              String separator = ApplicationSetup.FILE_SEPARATOR;
              String exampleTweetFile = ApplicationSetup.TERRIER_HOME + separator + "share" + separator + "tests" + separator + "tweets" + separator + "utf8-tweet.json";
              File tweetFile = new File(exampleTweetFile);
              assertTrue("Tweet file is available", tweetFile.exists());

              // Read the first line of the file as UTF-8; the reader is closed even if reading fails
              String tweet;
              try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(tweetFile), "UTF-8"))) {
                  tweet = br.readLine();
              }

              FlatJSONDocument doc = new FlatJSONDocument(tweet);
              String tweetText = doc.getProperty("text");
              final byte[] textAsBytes = Text.encode(tweetText).array();

              // Maximum byte length that the factory reports for a 140-character string
              int maxBytesFor140Chars = FixedSizeTextFactory.getMaximumTextLength(140);

              int textLengthCharacters = tweetText.length();
              int textLengthBytes = textAsBytes.length;

              System.err.println(textLengthCharacters + " " + textLengthBytes + " " + maxBytesFor140Chars);

              // The text is within the character limit; the second assertion checks its byte length
              assertTrue("Crop test, text is less than 140 characters", textLengthCharacters < 140);
              assertTrue("Crop test, text (of length " + textLengthCharacters + " characters) expressed as bytes is less than 140 characters expressed as bytes", textLengthBytes < maxBytesFor140Chars);
          }
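
          A side note on how the test above measures byte length, offered as an observation rather than a description of the committed fix. The test reads the length from `Text.encode(tweetText).array()`. If `Text.encode` returns a `java.nio.ByteBuffer` (an assumption, suggested by the `.array()` call), the backing array may be larger than the encoded content, whose size is given by `remaining()`. UTF-8 needs at most three bytes per Java `char`, so 134 characters cannot encode to more than 402 bytes; the 591 bytes in the reported error would be consistent with a backing-array length rather than a true encoded length. The sketch below uses only JDK classes; `EncodedLengthCheck`, `encodeUtf8` and the sample string are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class EncodedLengthCheck {

    /** Encodes s as UTF-8 with the JDK encoder and returns the resulting buffer. */
    public static ByteBuffer encodeUtf8(String s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not encodable as UTF-8", e);
        }
    }

    public static void main(String[] args) {
        // 134 copies of a three-byte character (U+0E01), a stand-in for the tweet text
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 134; i++) {
            sb.append('\u0e01');
        }
        String text = sb.toString();

        ByteBuffer encoded = encodeUtf8(text);
        System.out.println("characters:           " + text.length());
        // remaining() is the number of bytes actually produced by the encoder
        System.out.println("encoded bytes:        " + encoded.remaining());
        // array().length is the capacity of the backing array, which may be larger
        System.out.println("backing array length: " + encoded.array().length);
        System.out.println("upper bound (3 * n):  " + 3 * text.length());
    }
}
```

          If the two printed lengths differ on a given JVM, comparing `remaining()` (or `String.getBytes(StandardCharsets.UTF_8).length`) against the limit gives the true encoded size.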
          
          richardm Richard McCreadie added a comment -

          java.lang.AssertionError: Crop test, text (of length 134 characters) expressed as bytes is less than 140 characters expressed as bytes
          at org.junit.Assert.fail(Assert.java:91)
          at org.junit.Assert.assertTrue(Assert.java:43)
          at org.terrier.structures.TestCompressingMetaIndex.testCropFunction(TestCompressingMetaIndex.java:387)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:498)
          at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
          at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
          at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
          at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
          at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:110)
          at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:43)
          at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
          at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
          at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
          at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
          at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
          at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
          at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
          at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
          at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
          at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:538)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:760)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:460)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:206)

          richardm Richard McCreadie made changes -
          Field Original Value New Value
          Attachment utf8-tweet.json [ 10700 ]
          richardm Richard McCreadie made changes -
          Attachment TestCompressingMetaIndex.java [ 10701 ]
          richardm Richard McCreadie made changes -
          Attachment CompressingMetaIndexBuilder.java [ 10702 ]
          richardm Richard McCreadie made changes -
          Comment [ Attached updated files for fix and unit test as I don't seem to be able to push to the 5.x branch and can't see the logs to fix that.

          @Craigm can you check and merge in? ]
          richardm Richard McCreadie made changes -
          Attachment CompressingMetaIndexBuilder.java [ 10702 ]
          richardm Richard McCreadie made changes -
          Attachment utf8-tweet.json [ 10700 ]
          richardm Richard McCreadie made changes -
          Attachment TestCompressingMetaIndex.java [ 10701 ]
          richardm Richard McCreadie added a comment -

          Fix committed, c7755498 (classes) and 1444183d (example tweet)

          richardm Richard McCreadie added a comment -

          Resolved via changes to CompressingMetaIndexBuilder

          richardm Richard McCreadie made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              richardm Richard McCreadie
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              2
