Package org.terrier.indexing
Class FlatJSONDocument
- java.lang.Object
  - org.terrier.indexing.FlatJSONDocument
- All Implemented Interfaces: Document
public class FlatJSONDocument extends java.lang.Object implements Document
This is a Terrier Document implementation for a document stored in JSON format. It assumes that each JSON document has at least one attribute called 'text' that contains the text of the document.
Fields: This implementation supports a single field named 'TEXT' by default. FieldTags.process is a comma-delimited list of properties to use as fields.
Meta-Data: During parsing, the properties of each FlatJSONDocument are decorated with document meta-data. This decoration is performed by 'flattening' the layered structure of the JSON object and its sub-attributes into individual properties. For property naming, attributes in different layers are joined with a dot '.', e.g. user.name
- Since:
- 5.1
- Author:
- Richard McCreadie and Saul Vargas
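The 'flattening' of nested attributes into dot-separated property names, described in the Meta-Data note above, can be illustrated with a small standalone sketch. This is not Terrier's implementation: the class FlattenSketch and its flatten methods are hypothetical, nested java.util.Map instances stand in for a parsed JSON object, and JSON arrays are not handled because the documentation does not say how they are treated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of org.terrier.indexing.
class FlattenSketch {

    // Recursively copies leaf values of a nested map into 'out',
    // joining the attribute names of each layer with '.'.
    static void flatten(String prefix, Map<String, Object> node, Map<String, String> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                flatten(key, child, out);
            } else {
                // Leaf value: store its string form under the dotted key.
                out.put(key, String.valueOf(value));
            }
        }
    }

    // Convenience entry point that starts with an empty prefix.
    static Map<String, String> flatten(Map<String, Object> root) {
        Map<String, String> out = new LinkedHashMap<>();
        flatten("", root, out);
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the JSON object {"text": "hello world", "user": {"name": "alice"}}
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", "alice");
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text", "hello world");
        doc.put("user", user);
        System.out.println(flatten(doc)); // prints {text=hello world, user.name=alice}
    }
}
```

Under this naming scheme, a nested attribute such as the user's name would be looked up with the property name user.name.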
-
-
Field Summary
Fields
Modifier and Type, Field:
protected int fieldIndex
protected java.util.List<java.lang.String> fieldQueue
protected java.lang.String[] fieldsToProcess
protected java.util.Map<java.lang.String,java.lang.String> properties
protected int remainingTokens
protected int tokenIndex
protected Tokeniser tokenizer
java.lang.String[][] tokens
-
Constructor Summary
Constructors
FlatJSONDocument(com.google.gson.JsonObject json)
FlatJSONDocument(java.lang.String rawJson)
-
Method Summary
Modifier and Type, Method, and Description:
boolean endOfDocument()
    Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
java.util.Map<java.lang.String,java.lang.String> getAllProperties()
    Returns the underlying map of all the properties defined by this Document.
java.util.Set<java.lang.String> getFields()
    Returns the set of fields the current term appears in.
java.lang.String getNextTerm()
    Gets the next term of the document.
java.lang.String getProperty(java.lang.String name)
    Allows access to a named property of the Document.
java.io.Reader getReader()
    Returns a Reader object so client code can tokenise the document or deal with the document itself.
protected void initalize(java.lang.String rawJson)
-
-
-
Field Detail
-
properties
protected java.util.Map<java.lang.String,java.lang.String> properties
-
tokenizer
protected Tokeniser tokenizer
-
tokens
public java.lang.String[][] tokens
-
fieldQueue
protected java.util.List<java.lang.String> fieldQueue
-
fieldsToProcess
protected java.lang.String[] fieldsToProcess
-
fieldIndex
protected int fieldIndex
-
tokenIndex
protected int tokenIndex
-
remainingTokens
protected int remainingTokens
-
-
Constructor Detail
-
FlatJSONDocument
public FlatJSONDocument(com.google.gson.JsonObject json)
-
FlatJSONDocument
public FlatJSONDocument(java.lang.String rawJson) throws com.fasterxml.jackson.core.JsonParseException, com.fasterxml.jackson.databind.JsonMappingException, java.io.IOException
- Throws:
com.fasterxml.jackson.core.JsonParseException
com.fasterxml.jackson.databind.JsonMappingException
java.io.IOException
-
-
Method Detail
-
initalize
protected void initalize(java.lang.String rawJson)
-
endOfDocument
public boolean endOfDocument()
Description copied from interface: Document
Returns true when the end of the document has been reached, and there are no other terms to be retrieved from it.
- Specified by:
endOfDocument in interface Document
- Returns:
- boolean: true if there are no more terms in the document; otherwise false.
-
getAllProperties
public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Description copied from interface: Document
Returns the underlying map of all the properties defined by this Document.
- Specified by:
getAllProperties in interface Document
-
getFields
public java.util.Set<java.lang.String> getFields()
Description copied from interface: Document
Returns the set of fields the current term appears in.
-
getNextTerm
public java.lang.String getNextTerm()
Description copied from interface: Document
Gets the next term of the document. NB: A null string returned from getNextTerm() should be ignored. It does not signify that there are no more terms; use endOfDocument() to check for that.
- Specified by:
getNextTerm in interface Document
- Returns:
- String the next term of the document. Null returns should be ignored.
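The contract above (skip null terms, and rely on endOfDocument() to detect the end) can be sketched as a simple loop. To keep the example self-contained it does not depend on Terrier: TermSource is a hypothetical stand-in exposing only the two Document methods used here, and fromList is a toy source invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not part of org.terrier.indexing.
class TermLoopSketch {

    // Stand-in for the two Document methods this loop relies on.
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    // Collects every non-null term. A null term is skipped rather than
    // treated as the end of the document; only endOfDocument() ends the loop.
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }

    // Toy source backed by a list that may contain null entries.
    static TermSource fromList(List<String> raw) {
        Iterator<String> it = raw.iterator();
        return new TermSource() {
            public String getNextTerm() { return it.next(); }
            public boolean endOfDocument() { return !it.hasNext(); }
        };
    }

    public static void main(String[] args) {
        TermSource doc = fromList(Arrays.asList("hello", null, "world"));
        System.out.println(collectTerms(doc)); // prints [hello, world]
    }
}
```

The same loop shape should carry over to any object that follows the Document contract, since it never interprets a null term as termination.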
-
getProperty
public java.lang.String getProperty(java.lang.String name)
Description copied from interface: Document
Allows access to a named property of the Document. Examples might be URL, filename etc.
- Specified by:
getProperty in interface Document
- Parameters:
name - Name of the property. It is suggested, but not required, that this name be treated as case sensitive.
-
-