TagSet (Terrier 3.6 API)

org.terrier.utility
Class TagSet

java.lang.Object
  org.terrier.utility.TagSet

public class TagSet
extends Object

A class that models a set of tags to process (white list), a set of tags to skip (black list), a tag that is used as a document delimiter, and a tag whose contents are used as a unique identifier. The text within any tag encountered within the scope of a white-listed tag is processed by default, unless that tag is explicitly black listed.
For example, in order to index all the text within the DOC tag of a document from a typical TREC collection, without indexing the contents of the DOCHDR tag, we could define the following properties in the properties file:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=DOCHDR
TrecDocTags.casesensitive=false

In the source code, we would create an instance of the class as follows:
TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
All the tags are converted to uppercase, in order to check whether they belong to the specified set of tags.
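As a rough illustration of how properties like the ones above could be turned into tag sets, the following self-contained sketch reads them from a plain java.util.Properties object. The class PrefixedTagsSketch and all of its members are hypothetical and are not part of Terrier; the real TagSet reads Terrier's own properties file and may parse values differently. The sketch assumes that tags are always uppercased and ignores the casesensitive property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

/** Hypothetical stand-in for TagSet, for illustration only; not Terrier code. */
final class PrefixedTagsSketch {
    final String docTag;
    final String idTag;
    final Set<String> process;
    final Set<String> skip;

    /** Reads prefix.doctag, prefix.idtag, prefix.process and prefix.skip. */
    PrefixedTagsSketch(Properties props, String prefix) {
        docTag = normalise(props.getProperty(prefix + ".doctag", ""));
        idTag = normalise(props.getProperty(prefix + ".idtag", ""));
        process = parseList(props.getProperty(prefix + ".process", ""));
        skip = parseList(props.getProperty(prefix + ".skip", ""));
    }

    /** Assumption: tags are trimmed and uppercased before comparison. */
    static String normalise(String tag) {
        return tag.trim().toUpperCase(Locale.ROOT);
    }

    /** Splits a comma separated list into a set, dropping empty entries. */
    static Set<String> parseList(String csv) {
        Set<String> tags = new HashSet<>();
        for (String tag : csv.split(",")) {
            String t = normalise(tag);
            if (!t.isEmpty()) {
                tags.add(t);
            }
        }
        return tags;
    }

    /** Convenience factory that parses properties from in-memory text. */
    static PrefixedTagsSketch fromText(String text, String prefix) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new PrefixedTagsSketch(props, prefix);
    }
}
```

With the example properties above, the sketch would yield an empty set of tags to process and a set of tags to skip containing only DOCHDR.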

Author: Vassilis Plachouras, Craig Macdonald

Field Summary
`protected HashSet<String>`	`blackList` The set of tags to skip.
`protected String`	`blackListTags` A comma separated list of tags to skip.
`protected boolean`	`caseSensitive` Whether this TagSet is case sensitive.
`protected String`	`docTag` The tag that is used for denoting the beginning of a document.
`static String`	`EMPTY_TAGS` A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.
`static String`	`FIELD_TAGS` The prefix for the tags to consider as fields, during indexing.
`protected String`	`idTag` The tag that is used as a unique identifier.
`static String`	`TREC_DOC_TAGS` The prefix for the TREC document tags.
`static String`	`TREC_EXACT_DOC_TAGS` The prefix for the TREC document exact tags.
`static String`	`TREC_PROPERTY_TAGS` The prefix for the TREC property tags.
`static String`	`TREC_QUERY_TAGS` The prefix for the TREC topic tags.
`protected HashSet<String>`	`whiteList` The set of tags to process.
`protected int`	`whiteListSize` The size of the whiteList hash set.
`protected String`	`whiteListTags` A comma separated list of tags to process.

Constructor Summary
`TagSet(String prefix)` Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.

Method Summary
`String`	`getDocTag()` Return the document delimiter tag.
`String`	`getIdTag()` Return the id tag.
`String`	`getTagsToProcess()` Returns a comma separated list of tags to process
`String`	`getTagsToSkip()` Returns a comma separated list of tags to skip
`boolean`	`hasWhitelist()` Returns true if whiteListSize > 0.
`boolean`	`isCaseSensitive()` Returns true if this tag set has been specified as case-sensitive
`boolean`	`isDocTag(String tag)` Checks whether the given tag indicates the limits of a document.
`boolean`	`isIdTag(String tag)` Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.
`boolean`	`isTagToProcess(String tag)` Checks whether the tag should be processed.
`boolean`	`isTagToSkip(String tag)` Checks whether a tag should be skipped.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

EMPTY_TAGS

public static final String EMPTY_TAGS

A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.

See Also: Constant Field Values

TREC_DOC_TAGS

public static final String TREC_DOC_TAGS

The prefix for the TREC document tags. The corresponding properties in the setup file should start with TrecDocTags.

See Also: Constant Field Values

TREC_EXACT_DOC_TAGS

public static final String TREC_EXACT_DOC_TAGS

The prefix for the TREC document exact tags. The corresponding properties in the setup file should start with TrecExactDocTags.

See Also: Constant Field Values

TREC_QUERY_TAGS

public static final String TREC_QUERY_TAGS

The prefix for the TREC topic tags. The corresponding properties in the setup file should start with TrecQueryTags.

See Also: Constant Field Values

TREC_PROPERTY_TAGS

public static final String TREC_PROPERTY_TAGS

The prefix for the TREC property tags. The corresponding properties in the setup file should start with TrecPropertyTags.

See Also: Constant Field Values

FIELD_TAGS

public static final String FIELD_TAGS

The prefix for the tags to consider as fields, during indexing. The corresponding properties in the setup file should start with FieldTags.

See Also: Constant Field Values

whiteList

protected HashSet<String> whiteList

The set of tags to process.

whiteListSize

protected final int whiteListSize

The size of the whiteList hash set.

whiteListTags

protected String whiteListTags

A comma separated list of tags to process.

blackList

protected HashSet<String> blackList

The set of tags to skip.

blackListTags

protected String blackListTags

A comma separated list of tags to skip.

idTag

protected String idTag

The tag that is used as a unique identifier.

docTag

protected String docTag

The tag that is used for denoting the beginning of a document.

caseSensitive

protected boolean caseSensitive

Whether this TagSet is case sensitive. Defaults to true for all tag sets except TrecDocTags.
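To make the effect of this flag concrete, here is a minimal sketch of how a lookup key might be derived from a tag name. TagCaseSketch, lookupKey and sameTag are hypothetical names introduced for illustration; the sketch assumes that a case-insensitive tag set uppercases both the configured and the encountered tag, and that a case-sensitive one compares them unchanged. Terrier's actual behaviour may differ.

```java
import java.util.Locale;

/** Hypothetical helper, for illustration only; not Terrier code. */
final class TagCaseSketch {
    /** Assumption: case-insensitive sets compare the uppercased form of a tag. */
    static String lookupKey(String tag, boolean caseSensitive) {
        return caseSensitive ? tag : tag.toUpperCase(Locale.ROOT);
    }

    /** Compares a configured tag with a tag met in a document. */
    static boolean sameTag(String configured, String encountered, boolean caseSensitive) {
        return lookupKey(configured, caseSensitive)
                .equals(lookupKey(encountered, caseSensitive));
    }
}
```

Under these assumptions, a configured tag DOC matches an encountered tag doc only when the set is not case sensitive.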

Constructor Detail

TagSet

public TagSet(String prefix)

Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.

Parameters: prefix - the common prefix of the properties to read.

Method Detail

hasWhitelist

public boolean hasWhitelist()

Returns true if whiteListSize > 0.

Returns: true if whiteListSize > 0, that is, if the white list is not empty

isTagToProcess

public boolean isTagToProcess(String tag)

Checks whether the tag should be processed.

Parameters: tag - String the tag to check.
Returns: boolean true if the tag should be processed

isTagToSkip

public boolean isTagToSkip(String tag)

Checks whether a tag should be skipped. Prefer isTagToProcess, which checks both the white list and the black list.

Parameters: tag - the tag to check.
Returns: true if the tag should be skipped, otherwise it returns false.
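The relationship between hasWhitelist, isTagToProcess and isTagToSkip can be sketched as follows. This is one plausible reading of the class description, not Terrier's implementation: it assumes that a black-listed tag is never processed, that a non-empty white list restricts processing to its members, and that an empty white list lets every tag that is not black listed through. TagFilterSketch is a hypothetical class, and tags are assumed to be already normalised.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical white list / black list filter; not Terrier code. */
final class TagFilterSketch {
    private final Set<String> whiteList;
    private final Set<String> blackList;

    TagFilterSketch(Set<String> whiteList, Set<String> blackList) {
        // Defensive copies so later changes by the caller have no effect.
        this.whiteList = new HashSet<>(whiteList);
        this.blackList = new HashSet<>(blackList);
    }

    /** True if at least one tag to process was specified. */
    boolean hasWhitelist() {
        return !whiteList.isEmpty();
    }

    /** Consults the black list only. */
    boolean isTagToSkip(String tag) {
        return blackList.contains(tag);
    }

    /** Consults both lists; the black list takes precedence (assumption). */
    boolean isTagToProcess(String tag) {
        if (isTagToSkip(tag)) {
            return false;
        }
        return !hasWhitelist() || whiteList.contains(tag);
    }
}
```

With an empty white list and DOCHDR on the black list, as in the TrecDocTags example, this sketch processes a tag such as TEXT and rejects DOCHDR.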

isIdTag

public boolean isIdTag(String tag)

Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.

Parameters: tag - String the tag to check.
Returns: boolean true if the tag is an identifier tag, otherwise it returns false.

isDocTag

public boolean isDocTag(String tag)

Checks whether the given tag indicates the limits of a document.

Parameters: tag - String the tag to check.
Returns: boolean true if the tag is a document delimiter tag, otherwise it returns false.
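To show what document delimiter and identifier tags mean in practice, the sketch below scans TREC-style text and collects the identifier of each document. It is an illustration under stated assumptions, not how Terrier parses collections: DocSplitSketch and its methods are hypothetical, the tag names DOC and DOCNO are hard-coded from the TrecDocTags example, matching is case insensitive, and tags are assumed to have no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical document splitter, for illustration only; not Terrier code. */
final class DocSplitSketch {
    static final String DOC_TAG = "DOC";   // assumed document delimiter tag
    static final String ID_TAG = "DOCNO";  // assumed identifier tag

    static boolean isDocTag(String tag) {
        return DOC_TAG.equals(tag.toUpperCase(Locale.ROOT));
    }

    static boolean isIdTag(String tag) {
        return ID_TAG.equals(tag.toUpperCase(Locale.ROOT));
    }

    /** Returns the identifier of each document found in the text, in order. */
    static List<String> documentIds(String text) {
        // Matches <TAG> or </TAG>: group 1 is "/" for a closing tag, group 2 the name.
        Pattern tagPattern = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");
        Matcher m = tagPattern.matcher(text);
        List<String> ids = new ArrayList<>();
        boolean inDoc = false;
        int idStart = -1;
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String tag = m.group(2);
            if (isDocTag(tag)) {
                // An opening delimiter starts a document, a closing one ends it.
                inDoc = !closing;
            } else if (inDoc && isIdTag(tag)) {
                if (!closing) {
                    idStart = m.end();
                } else if (idStart >= 0) {
                    ids.add(text.substring(idStart, m.start()).trim());
                    idStart = -1;
                }
            }
        }
        return ids;
    }
}
```

Identifier tags that appear outside a document delimiter are ignored by this sketch, which is a design choice of the example rather than documented Terrier behaviour.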

isCaseSensitive

public boolean isCaseSensitive()

Returns true if this tag set has been specified as case-sensitive

getTagsToProcess

public String getTagsToProcess()

Returns a comma separated list of tags to process

Returns: String the tags to process

getTagsToSkip

public String getTagsToSkip()

Returns a comma separated list of tags to skip

Returns: String the tags to skip

getIdTag

public String getIdTag()

Return the id tag.

Returns: String the id tag

getDocTag

public String getDocTag()

Return the document delimiter tag.

Returns: String the document delimiter tag