Terrier IR Platform 1.1.1
java.lang.Object
  uk.ac.gla.terrier.utility.TagSet
public class TagSet
A class that models a set of tags to process (white list),
a set of tags to skip (black list), a tag that is used as a
document delimiter, and a tag the contents of which are
used as a unique identifier. The text within any tag encountered
within the scope of a tag from the white list is processed
by default, unless that tag is explicitly black listed.
For example, in order to index all the text within
the DOC tag of a document from a typical TREC collection,
without indexing the contents of the DOCHDR tag,
we could define in the properties file the following properties:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=DOCHDR
In the source code, we would create an instance of
the class as follows:
TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
All tags are converted to uppercase before they are checked
against the specified sets of tags.
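The behaviour described above can be illustrated with a small, self-contained sketch. It is not Terrier's implementation: the class name TagSetSketch, its four-argument constructor and the helper methods normalise and parse are hypothetical, and the values that the real class reads from the properties file are passed in directly. It only assumes what the description states, namely comma-separated tag lists and uppercase conversion before membership checks.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for illustration only; not the Terrier TagSet class. */
class TagSetSketch {
    private final String docTag;
    private final String idTag;
    private final Set<String> whiteList;
    private final Set<String> blackList;

    /** Arguments mirror the doctag, idtag, process and skip properties. */
    TagSetSketch(String docTag, String idTag, String process, String skip) {
        this.docTag = normalise(docTag);
        this.idTag = normalise(idTag);
        this.whiteList = parse(process);
        this.blackList = parse(skip);
    }

    /** Tags are compared in uppercase, as the description states. */
    private static String normalise(String tag) {
        return tag.trim().toUpperCase(Locale.ROOT);
    }

    /** Splits a comma-separated list, ignoring empty entries. */
    private static Set<String> parse(String commaSeparated) {
        Set<String> tags = new LinkedHashSet<>();
        for (String part : commaSeparated.split(",")) {
            String tag = normalise(part);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    boolean hasWhitelist() { return !whiteList.isEmpty(); }
    boolean isDocTag(String tag) { return docTag.equals(normalise(tag)); }
    boolean isIdTag(String tag) { return idTag.equals(normalise(tag)); }
    boolean isTagToProcess(String tag) { return whiteList.contains(normalise(tag)); }
    boolean isTagToSkip(String tag) { return blackList.contains(normalise(tag)); }

    public static void main(String[] args) {
        // Same values as the TrecDocTags example: empty white list, DOCHDR skipped.
        TagSetSketch tags = new TagSetSketch("DOC", "DOCNO", "", "DOCHDR");
        System.out.println(tags.isDocTag("doc"));       // true
        System.out.println(tags.isIdTag("DocNo"));      // true
        System.out.println(tags.hasWhitelist());        // false
        System.out.println(tags.isTagToSkip("dochdr")); // true
    }
}
```

In this sketch an empty process list simply yields an empty white list, so hasWhitelist() returns false; how a caller treats that case is left to the parser that uses the tag set.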
Field Summary | |
---|---|
static java.lang.String |
EMPTY_TAGS
A prefix for an empty set of tags, that is, a set of tags that is not defined in the properties file. |
static java.lang.String |
FIELD_TAGS
The prefix for the tags to consider as fields, during indexing. |
static java.lang.String |
TREC_DOC_TAGS
The prefix for the TREC document tags. |
static java.lang.String |
TREC_EXACT_DOC_TAGS
The prefix for the TREC document exact tags. |
static java.lang.String |
TREC_QUERY_TAGS
The prefix for the TREC topic tags. |
Constructor Summary | |
---|---|
TagSet(java.lang.String prefix)
Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file. |
Method Summary | |
---|---|
java.lang.String |
getDocTag()
Return the document delimiter tag. |
java.lang.String |
getIdTag()
Return the id tag. |
java.lang.String |
getTagsToProcess()
Returns a comma-separated list of tags to process. |
java.lang.String |
getTagsToSkip()
Returns a comma-separated list of tags to skip. |
boolean |
hasWhitelist()
Returns true if the white list contains at least one tag (whiteListSize > 0). |
boolean |
isDocTag(java.lang.String tag)
Checks whether the given tag indicates the limits of a document. |
boolean |
isIdTag(java.lang.String tag)
Checks whether the given tag is a unique identifier tag, that is, the document number of a document or the identifier of a topic. |
boolean |
isTagToProcess(java.lang.String tag)
Checks whether the tag should be processed. |
boolean |
isTagToSkip(java.lang.String tag)
Checks whether a tag should be skipped. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String EMPTY_TAGS
public static final java.lang.String TREC_DOC_TAGS
public static final java.lang.String TREC_EXACT_DOC_TAGS
public static final java.lang.String TREC_QUERY_TAGS
public static final java.lang.String FIELD_TAGS
Constructor Detail |
---|
public TagSet(java.lang.String prefix)
prefix
- the common prefix of the properties to read.
Method Detail |
---|
public boolean hasWhitelist()
public boolean isTagToProcess(java.lang.String tag)
tag
- String the tag to check.
public boolean isTagToSkip(java.lang.String tag)
tag
- the tag to check.
public boolean isIdTag(java.lang.String tag)
tag
- String the tag to check.
public boolean isDocTag(java.lang.String tag)
tag
- String the tag to check.
public java.lang.String getTagsToProcess()
public java.lang.String getTagsToSkip()
public java.lang.String getIdTag()
public java.lang.String getDocTag()