[TR-518] Microblog 2013 collection does not work Created: 12/Jul/18  Updated: 19/Dec/18  Resolved: 19/Dec/18

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: None
Fix Version/s: 5.1

Type: Improvement
Priority: Minor
Reporter: Craig Macdonald
Assignee: Richard McCreadie
Resolution: Fixed  
Labels: None


 Description   
Reported at SIGIR by Zia from CNRS ?

 Comments   
Comment by Richard McCreadie [ 13/Dec/18 ]

The CompressingMetaIndexBuilder crop function is not effective, leading to indexing failures for tweets.

Crop operates on characters, while the encoding check operates on bytes.

It appears that FixedSizeTextFactory.getMaximumTextLength underestimates the maximum number of bytes needed to encode a string of N characters.
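
To illustrate the character/byte mismatch described above, here is a minimal standalone sketch. It does not use any Terrier classes and is not the committed fix; the class name CropByBytes and both helper methods are hypothetical. It assumes UTF-8 encoding and shows that cropping by character count gives no guarantee on the encoded byte length, whereas cropping against a byte budget does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Terrier).
class CropByBytes {

    /** Number of bytes needed to encode s as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes,
     * never splitting a code point (so surrogate pairs stay together).
     */
    static String cropToBytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 width of this code point: 1 to 4 bytes.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break;
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String cjk = "\u65e5\u672c\u8a9e\u30c6\u30ad"; // 5 characters, 3 bytes each in UTF-8

        // Same character count, very different byte counts.
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        System.out.println(cjk.length() + " chars, " + utf8Length(cjk) + " bytes");

        // Cropping against a byte budget keeps the encoded form within the limit.
        String cropped = cropToBytes(cjk, 7);
        System.out.println(cropped.length() + " chars, " + utf8Length(cropped) + " bytes");
    }
}
```

A crop that counts bytes this way would keep the stored value within whatever byte limit the meta index enforces, regardless of the script the tweet is written in.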

Comment by Richard McCreadie [ 13/Dec/18 ]

Relevant error:

Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: Data ('??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????') with 134 characters and byte length 591 for key text exceeds max byte length of 452(string length of 150). Crop in the Document, or increase indexer.meta.forward.keylens

Comment by Richard McCreadie [ 17/Dec/18 ]

A test confirms the above result:

@Test
	public void testCropFunction() throws IOException {
		String separator = ApplicationSetup.FILE_SEPARATOR;
		String exampleTweetFile = ApplicationSetup.TERRIER_HOME+separator+"share"+separator+"tests"+separator+"tweets"+separator+"utf8-tweet.json";
		File tweetFile = new File(exampleTweetFile);
		assertTrue("Tweet file is available",tweetFile.exists());
		
		BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(tweetFile), "UTF-8"));
		String tweet = br.readLine();
		br.close();
		
		FlatJSONDocument doc = new FlatJSONDocument(tweet);
		String tweetText = doc.getProperty("text");
		final byte[] textAsBytes = Text.encode(tweetText).array();
		
		int HundredAndFortyCharLengthBytes = FixedSizeTextFactory.getMaximumTextLength(140);
		
		int textLengthCharacters = tweetText.length();
		int textLengthBytes = textAsBytes.length;
		
		System.err.println(textLengthCharacters+" "+textLengthBytes+" "+HundredAndFortyCharLengthBytes);
		
		assertTrue("Crop test, text is less than 140 characters",textLengthCharacters<140);
		assertTrue("Crop test, text (of length "+textLengthCharacters+" characters) expressed as bytes is less than 140 characters expressed as bytes", textLengthBytes<HundredAndFortyCharLengthBytes);
	}

Comment by Richard McCreadie [ 17/Dec/18 ]

java.lang.AssertionError: Crop test, text (of length 134 characters) expressed as bytes is less than 140 characters expressed as bytes
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.terrier.structures.TestCompressingMetaIndex.testCropFunction(TestCompressingMetaIndex.java:387)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:110)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:43)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:538)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:760)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:460)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:206)
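
One possible contributing factor, offered as an assumption rather than a confirmed diagnosis: the test measures the byte length with Text.encode(tweetText).array().length. If Text.encode returns a java.nio.ByteBuffer produced by a CharsetEncoder, the backing array can be larger than the encoded data, and the real length is limit(). The reported numbers are at least consistent with this, since 134 UTF-16 characters need at most 3 bytes each in UTF-8 (402 bytes), which is below the reported 591. The sketch below uses only JDK classes; the class name EncodedLength and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Terrier).
class EncodedLength {

    /** Encodes s as UTF-8 with a CharsetEncoder, returning the resulting buffer. */
    static ByteBuffer encode(String s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            // Malformed input such as an unpaired surrogate.
            throw new IllegalArgumentException(e);
        }
    }

    /** Number of valid encoded bytes: the buffer's limit, not its array size. */
    static int encodedLength(String s) {
        return encode(s).limit();
    }

    /** Size of the backing array, which may include unused trailing capacity. */
    static int backingArrayLength(String s) {
        return encode(s).array().length;
    }

    public static void main(String[] args) {
        String cjk = "\u65e5\u672c\u8a9e\u30c6\u30ad"; // 5 characters, 15 bytes in UTF-8
        System.out.println("limit()        = " + encodedLength(cjk));
        System.out.println("array().length = " + backingArrayLength(cjk));
    }
}
```

If this applies, comparing limit() rather than array().length against the maximum byte length would avoid rejecting text that actually fits.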

Comment by Richard McCreadie [ 19/Dec/18 ]

Fix committed: c7755498 (classes) and 1444183d (example tweet).

Comment by Richard McCreadie [ 19/Dec/18 ]

Resolved via changes to CompressingMetaIndexBuilder.

Generated at Fri Nov 15 13:25:03 GMT 2019 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.