Terrier Core / TR-518

Microblog 2013 collection does not work

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1
    • Component/s: None
    • Labels:
      None

      Description

      Reported at SIGIR by Zia from CNRS ?

          Activity

          richardm Richard McCreadie added a comment -

          Resolved via changes to CompressingMetaIndexBuilder

          richardm Richard McCreadie added a comment -

          Fix committed, c7755498 (classes) and 1444183d (example tweet)

          richardm Richard McCreadie added a comment -

          java.lang.AssertionError: Crop test, text (of length 134 characters) expressed as bytes is less than 140 characters expressed as bytes
          at org.junit.Assert.fail(Assert.java:91)
          at org.junit.Assert.assertTrue(Assert.java:43)
          at org.terrier.structures.TestCompressingMetaIndex.testCropFunction(TestCompressingMetaIndex.java:387)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:498)
          at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
          at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
          at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
          at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
          at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:110)
          at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:43)
          at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
          at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
          at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
          at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
          at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
          at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
          at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
          at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
          at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
          at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:538)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:760)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:460)
          at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:206)

          richardm Richard McCreadie added a comment (edited) -

          Test confirms above result:

          @Test
          public void testCropFunction() throws IOException {
              // Locate the example tweet shipped with the test resources
              String separator = ApplicationSetup.FILE_SEPARATOR;
              String exampleTweetFile = ApplicationSetup.TERRIER_HOME+separator+"share"+separator+"tests"+separator+"tweets"+separator+"utf8-tweet.json";
              File tweetFile = new File(exampleTweetFile);
              assertTrue("Tweet file is available",tweetFile.exists());

              // Read the first line of the file; the reader is closed even if reading fails
              String tweet;
              try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(tweetFile), "UTF-8"))) {
                  tweet = br.readLine();
              }

              FlatJSONDocument doc = new FlatJSONDocument(tweet);
              String tweetText = doc.getProperty("text");
              // Caution: if Text.encode returns a java.nio.ByteBuffer, array() is its whole
              // backing array, which may be longer than the encoded text
              final byte[] textAsBytes = Text.encode(tweetText).array();

              // Byte budget that the factory allows for a 140-character string
              int maxBytesFor140Chars = FixedSizeTextFactory.getMaximumTextLength(140);

              int textLengthCharacters = tweetText.length();
              int textLengthBytes = textAsBytes.length;

              System.err.println(textLengthCharacters+" "+textLengthBytes+" "+maxBytesFor140Chars);

              assertTrue("Crop test, text is less than 140 characters",textLengthCharacters<140);
              assertTrue("Crop test, text (of length "+textLengthCharacters+" characters) expressed as bytes is less than 140 characters expressed as bytes", textLengthBytes<maxBytesFor140Chars);
          }
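
A possible caveat when reading the byte count reported by this test: if Text.encode returns a java.nio.ByteBuffer filled by a JDK CharsetEncoder (an assumption, since the issue does not show that class), then array().length is the capacity of the backing array, while limit() is the number of bytes actually written. The sketch below uses only JDK classes to show the difference. EncodedLengthCheck and its methods are hypothetical names for illustration and are not Terrier code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Terrier: compares two ways of
// measuring the UTF-8 size of a string held in a ByteBuffer.
final class EncodedLengthCheck {

    /** Encodes with a JDK CharsetEncoder, which may over-allocate its backing array. */
    static ByteBuffer encode(String s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            // Only thrown for malformed input such as an unpaired surrogate
            throw new IllegalStateException(e);
        }
    }

    /** Number of bytes actually written: the buffer's limit after encoding. */
    static int encodedLength(String s) {
        return encode(s).limit();
    }

    /** Size of the backing array, which can exceed the encoded length. */
    static int backingArrayLength(String s) {
        return encode(s).array().length;
    }

    public static void main(String[] args) {
        // 134 three-byte characters, roughly the shape of the failing tweet
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 134; i++) {
            sb.append('\u3042');
        }
        String s = sb.toString();
        System.out.println("chars=" + s.length()
                + " limit=" + encodedLength(s)
                + " array=" + backingArrayLength(s));
    }
}
```

If the two numbers differ for the example tweet, comparing limit() (or String.getBytes with UTF-8) against the byte budget would give a more reliable check than array().length.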
          
          richardm Richard McCreadie added a comment -

          Relevant error:

          Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: Data ('??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????') with 134 characters and byte length 591 for key text exceeds max byte length of 452(string length of 150). Crop in the Document, or increase indexer.meta.forward.keylens

          richardm Richard McCreadie added a comment -

          The CompressingMetaIndexBuilder crop function is not effective, which leads to indexing failures for tweets.

          Crop operates on characters, while the encoding check operates on bytes.

          It appears that FixedSizeTextFactory.getMaximumTextLength underestimates the maximum number of bytes needed to encode a string of N characters.
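
To illustrate the character/byte mismatch, here is a minimal sketch of a crop that counts UTF-8 bytes rather than characters. It is not the fix committed for this issue. ByteAwareCrop, utf8Length and cropToBytes are hypothetical names, and the sketch assumes the stored value is plain UTF-8 with no length prefix.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Terrier code: crops a string so that its
// UTF-8 encoding fits within a byte budget.
final class ByteAwareCrop {

    /** Number of UTF-8 bytes needed for one Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /**
     * Returns the longest prefix of text whose UTF-8 encoding is at most
     * maxBytes long, without splitting a surrogate pair.
     */
    static String cropToBytes(String text, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int need = utf8Length(cp);
            if (bytes + need > maxBytes) {
                break; // the next code point would exceed the budget
            }
            bytes += need;
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        String cropped = cropToBytes("caf\u00e9 \u3042\u3042\u3042", 8);
        System.out.println(cropped.length() + " chars, "
                + cropped.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

An unpaired surrogate is counted as three bytes here, which is at least as many bytes as String.getBytes produces for it, so the result stays within the budget.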


            People

            • Assignee:
              richardm Richard McCreadie
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              2
