Package org.terrier.utility
Class TagSet
- java.lang.Object
  - org.terrier.utility.TagSet
public class TagSet extends java.lang.Object
A class that models a set of tags to process (white list), a set of tags to skip (black list), a tag that is used as a document delimiter, and a tag whose contents are used as a unique identifier. The text within any tag encountered within the scope of a white-listed tag is processed by default, unless that tag is explicitly black listed.
For example, to index all the text within the DOC tag of a document from a typical TREC collection, without indexing the contents of the DOCHDR tag, we could define the following properties in the properties file:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=DOCHDR
TrecDocTags.casesensitive=false
In the source code, we would create an instance of the class as follows:
TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
All the tags are converted to uppercase, in order to check whether they belong to the specified set of tags.
- Author:
- Vassilis Plachouras, Craig Macdonald
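The roles described above (document delimiter, identifier tag, and black list) can be illustrated with a small self-contained sketch. It does not use Terrier: TrecScanSketch and its scan method are hypothetical names invented for this illustration, and the regular-expression tokenizer is only an assumption about how such tags might be walked, not how Terrier parses TREC files.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; this is not Terrier code.
class TrecScanSketch {

    // Matches either an opening/closing tag such as <DOC> or </DOC>, or a run of text.
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)");

    // Returns {id, text} for the first document in the input.
    // docTag, idTag and the entries of skip are expected in uppercase,
    // because tags read from the input are uppercased before comparison.
    static String[] scan(String input, String docTag, String idTag, Set<String> skip) {
        StringBuilder id = new StringBuilder();
        StringBuilder text = new StringBuilder();
        boolean inDoc = false;
        boolean inId = false;
        int skipDepth = 0; // greater than zero while inside a black-listed tag

        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(2) != null) {
                String tag = m.group(2).toUpperCase(Locale.ROOT);
                boolean closing = !m.group(1).isEmpty();
                if (tag.equals(docTag)) {
                    if (closing) {
                        break; // end of the first document
                    }
                    inDoc = true;
                } else if (tag.equals(idTag)) {
                    inId = !closing;
                } else if (skip.contains(tag)) {
                    skipDepth += closing ? -1 : 1;
                }
            } else if (inDoc) {
                if (inId) {
                    id.append(m.group(3));   // contents of the identifier tag
                } else if (skipDepth == 0) {
                    text.append(m.group(3)); // text that is not black listed
                }
            }
        }
        return new String[] { id.toString().trim(), text.toString().trim() };
    }

    public static void main(String[] args) {
        String doc = "<DOC><DOCNO> D1 </DOCNO><DOCHDR>http://x</DOCHDR>"
                   + "<TITLE>Hello</TITLE> world</DOC>";
        String[] result = scan(doc, "DOC", "DOCNO", Collections.singleton("DOCHDR"));
        System.out.println("id   = " + result[0]);
        System.out.println("text = " + result[1]);
    }
}
```

With the white list left empty, as in the properties above, every tag inside DOC contributes text except DOCHDR, while the contents of DOCNO are kept separately as the identifier.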
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | TagSet.TagSetFactory |
-
Field Summary
Fields
Modifier and Type | Field | Description
protected java.util.Set<java.lang.String> | blackList | The set of tags to skip.
protected int | blackListSize | Size of the blackList hash set.
protected java.util.List<java.lang.String> | blackListTags |
protected boolean | caseSensitive | Whether this TagSet is case-sensitive.
protected java.lang.String | docTag | The tag that is used for denoting the beginning of a document.
static java.lang.String | EMPTY_TAGS | A prefix for an empty set of tags, that is, a set of tags that are not defined in the properties file.
static java.lang.String | FIELD_TAGS | The prefix for the tags to consider as fields during indexing.
protected java.lang.String | idTag | The tag that is used as a unique identifier.
static java.lang.String | TREC_DOC_TAGS | The prefix for the TREC document tags.
static java.lang.String | TREC_EXACT_DOC_TAGS | The prefix for the TREC document exact tags.
static java.lang.String | TREC_PROPERTY_TAGS | The prefix for the TREC property tags.
static java.lang.String | TREC_QUERY_TAGS | The prefix for the TREC topic tags.
protected java.util.Set<java.lang.String> | whiteList | The set of tags to process.
protected int | whiteListSize | Size of the whiteList hash set.
protected java.util.List<java.lang.String> | whiteListTags |
-
Constructor Summary
Constructors
Constructor | Description
TagSet(java.lang.String prefix) | Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.
-
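To make the prefix-to-property mapping concrete, the following sketch reads properties of the form prefix.process and prefix.skip and splits them into tag sets. It is only an approximation under stated assumptions: TagPropertiesSketch, parseTags and load are hypothetical names, the properties are loaded from an in-memory string with java.util.Properties rather than through Terrier's own configuration mechanism, and the comma-separated value format is assumed rather than taken from the TagSet source.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in for illustration only; this is not Terrier's TagSet.
class TagPropertiesSketch {

    // Splits a comma-separated property value into an ordered set of tags.
    // A missing or empty value yields an empty set. Tags are uppercased
    // when the set is not case-sensitive.
    static Set<String> parseTags(String value, boolean caseSensitive) {
        Set<String> tags = new LinkedHashSet<>();
        if (value == null) {
            return tags;
        }
        for (String raw : value.split(",")) {
            String tag = raw.trim();
            if (!tag.isEmpty()) {
                tags.add(caseSensitive ? tag : tag.toUpperCase(Locale.ROOT));
            }
        }
        return tags;
    }

    // Loads properties from text; a real application would read its properties file.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String prefix = "TrecDocTags";
        Properties props = load(
            "TrecDocTags.doctag=DOC\n"
            + "TrecDocTags.idtag=DOCNO\n"
            + "TrecDocTags.process=\n"
            + "TrecDocTags.skip=DOCHDR\n"
            + "TrecDocTags.casesensitive=false\n");

        // parseBoolean returns false when the property is absent (an assumption of this sketch).
        boolean caseSensitive =
            Boolean.parseBoolean(props.getProperty(prefix + ".casesensitive"));

        System.out.println("doctag  = " + props.getProperty(prefix + ".doctag"));
        System.out.println("idtag   = " + props.getProperty(prefix + ".idtag"));
        System.out.println("process = " + parseTags(props.getProperty(prefix + ".process"), caseSensitive));
        System.out.println("skip    = " + parseTags(props.getProperty(prefix + ".skip"), caseSensitive));
    }
}
```

Keeping the prefix as a variable mirrors the constructor argument: the same code reads TrecQueryTags or FieldTags properties by changing only the prefix string.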
Method Summary
All Methods | Static Methods | Instance Methods | Concrete Methods
Modifier and Type | Method | Description
static TagSet.TagSetFactory | factory() |
java.lang.String | getDocTag() | Returns the document delimiter tag.
java.lang.String | getIdTag() | Returns the id tag.
java.lang.String[] | getTagsToProcess() | Returns the tags to process.
java.lang.String[] | getTagsToSkip() | Returns the tags to skip.
boolean | hasWhitelist() | Returns true if whiteListSize > 0.
boolean | isCaseSensitive() | Returns true if this tag set has been specified as case-sensitive.
boolean | isDocTag(java.lang.String tag) | Checks whether the given tag indicates the limits of a document.
boolean | isIdTag(java.lang.String tag) | Checks whether the given tag is a unique identifier tag, that is, the document number of a document or the identifier of a topic.
boolean | isTagToProcess(java.lang.String tag) | Checks whether the tag should be processed.
boolean | isTagToSkip(java.lang.String tag) | Checks whether a tag should be skipped.
-
-
-
Field Detail
-
EMPTY_TAGS
public static final java.lang.String EMPTY_TAGS
A prefix for an empty set of tags, that is, a set of tags that are not defined in the properties file.
See Also:
Constant Field Values
-
TREC_DOC_TAGS
public static final java.lang.String TREC_DOC_TAGS
The prefix for the TREC document tags. The corresponding properties in the setup file should start with TrecDocTags.
See Also:
Constant Field Values
-
TREC_EXACT_DOC_TAGS
public static final java.lang.String TREC_EXACT_DOC_TAGS
The prefix for the TREC document exact tags. The corresponding properties in the setup file should start with TrecExactDocTags.
See Also:
Constant Field Values
-
TREC_QUERY_TAGS
public static final java.lang.String TREC_QUERY_TAGS
The prefix for the TREC topic tags. The corresponding properties in the setup file should start with TrecQueryTags.
See Also:
Constant Field Values
-
TREC_PROPERTY_TAGS
public static final java.lang.String TREC_PROPERTY_TAGS
The prefix for the TREC property tags. The corresponding properties in the setup file should start with TrecPropertyTags.
See Also:
Constant Field Values
-
FIELD_TAGS
public static final java.lang.String FIELD_TAGS
The prefix for the tags to consider as fields during indexing. The corresponding properties in the setup file should start with FieldTags.
See Also:
Constant Field Values
-
whiteList
protected java.util.Set<java.lang.String> whiteList
The set of tags to process.
-
whiteListSize
protected final int whiteListSize
Size of the whiteList hash set.
-
whiteListTags
protected java.util.List<java.lang.String> whiteListTags
-
blackList
protected java.util.Set<java.lang.String> blackList
The set of tags to skip.
-
blackListSize
protected final int blackListSize
Size of the blackList hash set.
-
blackListTags
protected java.util.List<java.lang.String> blackListTags
-
idTag
protected final java.lang.String idTag
The tag that is used as a unique identifier.
-
docTag
protected final java.lang.String docTag
The tag that is used for denoting the beginning of a document.
-
caseSensitive
protected final boolean caseSensitive
Whether this TagSet is case-sensitive. Defaults to true for all sets except TrecDocTags.
-
-
Method Detail
-
factory
public static TagSet.TagSetFactory factory()
-
hasWhitelist
public boolean hasWhitelist()
Returns true if whiteListSize > 0.
Returns:
true if whiteListSize > 0
-
isTagToProcess
public boolean isTagToProcess(java.lang.String tag)
Checks whether the tag should be processed.
Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag should be processed, otherwise false.
-
isTagToSkip
public boolean isTagToSkip(java.lang.String tag)
Checks whether a tag should be skipped. Prefer isTagToProcess, which checks both the whitelist and the blacklist.
Parameters:
tag - the tag to check.
Returns:
true if the tag should be skipped, otherwise false.
-
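The relationship between isTagToProcess and isTagToSkip can be sketched as follows. This is one plausible reading of the descriptions above, not Terrier's actual implementation: TagDecisionSketch is a hypothetical class, and the rule it applies (a black-listed tag is never processed; otherwise a tag is processed when the white list is empty or contains it) is an assumption chosen to match the TrecDocTags example in the class description.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration only; the decision rule is an assumption.
class TagDecisionSketch {
    private final Set<String> whiteList = new HashSet<>();
    private final Set<String> blackList = new HashSet<>();
    private final boolean caseSensitive;

    TagDecisionSketch(String[] process, String[] skip, boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
        for (String tag : process) {
            whiteList.add(normalise(tag));
        }
        for (String tag : skip) {
            blackList.add(normalise(tag));
        }
    }

    // Uppercases the tag unless the set is case-sensitive.
    private String normalise(String tag) {
        return caseSensitive ? tag : tag.toUpperCase(Locale.ROOT);
    }

    boolean hasWhitelist() {
        return !whiteList.isEmpty();
    }

    // Consults the black list only.
    boolean isTagToSkip(String tag) {
        return blackList.contains(normalise(tag));
    }

    // Consults both lists: black-listed tags are rejected first, then the
    // white list is applied only if it is non-empty (assumed rule).
    boolean isTagToProcess(String tag) {
        String t = normalise(tag);
        if (blackList.contains(t)) {
            return false;
        }
        return whiteList.isEmpty() || whiteList.contains(t);
    }

    public static void main(String[] args) {
        TagDecisionSketch tags =
            new TagDecisionSketch(new String[] {}, new String[] {"DOCHDR"}, false);
        System.out.println("TITLE processed?  " + tags.isTagToProcess("title"));
        System.out.println("DOCHDR processed? " + tags.isTagToProcess("dochdr"));
        System.out.println("DOCHDR skipped?   " + tags.isTagToSkip("DocHdr"));
    }
}
```

Under this assumed rule, a caller that needs a single yes/no answer only has to call isTagToProcess, which is why the description above recommends it over isTagToSkip.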
isIdTag
public boolean isIdTag(java.lang.String tag)
Checks whether the given tag is a unique identifier tag, that is, the document number of a document or the identifier of a topic.
Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag is an identifier tag, otherwise false.
-
isDocTag
public boolean isDocTag(java.lang.String tag)
Checks whether the given tag indicates the limits of a document.
Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag is a document delimiter tag, otherwise false.
-
isCaseSensitive
public boolean isCaseSensitive()
Returns true if this tag set has been specified as case-sensitive.
-
getTagsToProcess
public java.lang.String[] getTagsToProcess()
Returns the tags to process.
Returns:
String[] the tags to process
-
getTagsToSkip
public java.lang.String[] getTagsToSkip()
Returns the tags to skip.
Returns:
String[] the tags to skip
-
getIdTag
public java.lang.String getIdTag()
Returns the id tag.
Returns:
String the id tag
-
getDocTag
public java.lang.String getDocTag()
Returns the document delimiter tag.
Returns:
String the document delimiter tag
-
-