Package org.terrier.terms
Class Stopwords
- java.lang.Object
  - org.terrier.terms.Stopwords
- All Implemented Interfaces:
TermPipeline
public class Stopwords extends java.lang.Object implements TermPipeline
Implements stopword removal, as a TermPipeline object. The stopword list to load can be passed in the constructor or loaded from the stopwords.filename property. Note that this TermPipeline uses the system default encoding for the stopword list unless stopwords.encoding is set.
Properties:
- stopwords.filename - the stopword list to load. More than one stopword list can be specified, by comma-separating the filenames. The default is resource:/stopword-list.txt which is included in the terrier-core jar file.
- stopwords.intern.terms - an indexing optimisation: stopword terms are likely to appear extremely frequently in a Collection, so interning them in Java saves on GC costs during indexing.
- stopwords.encoding - encoding of the file containing the stopwords; if that is not set, the default system encoding is used.
- Author:
- Craig Macdonald
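The stopwords.intern.terms property above refers to string interning. As a minimal sketch of the idea, assuming the standard String.intern() call is what is meant (the class and helper names here are hypothetical and not part of Terrier), two equal strings created separately collapse to one shared instance once interned:

```java
class StopwordInternSketch {
    // Hypothetical helper: returns the canonical (interned) instance of a term.
    static String canonical(String term) {
        return term.intern();
    }

    public static void main(String[] args) {
        // Two equal strings built at runtime are distinct objects.
        String a = new String("the");
        String b = new String("the");
        System.out.println(a == b);                       // false

        // Their interned forms are a single shared instance.
        System.out.println(canonical(a) == canonical(b)); // true
    }
}
```

Because a frequent stopword would otherwise be allocated once per occurrence, sharing one instance is the kind of saving the property description points at.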
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- protected static boolean INTERN_STOPWORDS
- protected TermPipeline next - The next component in the term pipeline.
- protected gnu.trove.THashSet<java.lang.String> stopWords - The hashset that contains all the stop words.
-
Constructor Summary
Constructors (Constructor, Description):
- Stopwords(TermPipeline _next) - Makes a new stopword term pipeline object.
- Stopwords(TermPipeline _next, java.lang.String StopwordsFile) - Makes a new stopword term pipeline object.
- Stopwords(TermPipeline _next, java.lang.String[] StopwordsFiles) - Makes a new stopword term pipeline object.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- void clear() - Clear all stopwords from this stopword list object.
- boolean isStopword(java.lang.String t) - Returns true if term t is a stopword.
- void loadStopwordsList(java.lang.String stopwordsFilename) - Loads the specified stopwords file.
- void loadStopwordsList(java.lang.String[] StopwordsFiles) - Loads the specified stopwords files.
- void processTerm(java.lang.String t) - Checks to see if term t is a stopword.
- boolean reset() - This method implements the specific reset option needed to implement a query- or document-oriented policy.
-
-
-
Field Detail
-
INTERN_STOPWORDS
protected static final boolean INTERN_STOPWORDS
-
next
protected final TermPipeline next
The next component in the term pipeline.
-
stopWords
protected final gnu.trove.THashSet<java.lang.String> stopWords
The hashset that contains all the stop words.
-
-
Constructor Detail
-
Stopwords
public Stopwords(TermPipeline _next)
Makes a new stopword term pipeline object. The stopwords file is loaded from the application setup file, under the property stopwords.filename.
- Parameters:
_next
- TermPipeline the next component in the term pipeline.
-
Stopwords
public Stopwords(TermPipeline _next, java.lang.String StopwordsFile)
Makes a new stopword term pipeline object. The stopwords file(s) are loaded from the filename parameter. If the filename is not absolute, it is assumed to be in TERRIER_SHARE. If the StopwordsFile parameter contains a comma, it is split on \s*,\s*.
- Parameters:
_next
- TermPipeline the next component in the term pipeline.
StopwordsFile
- The filename(s) of the file to use as the stopwords list. Split on comma, and passed to the (TermPipeline,String[]) constructor.
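The comma handling described for this constructor can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and the real constructor may differ in detail. It simply applies the documented \s*,\s* pattern when a comma is present.

```java
import java.util.Arrays;

class StopwordFileSplitSketch {
    // Hypothetical helper mirroring the documented behaviour:
    // split on \s*,\s* only if the argument contains a comma.
    static String[] splitFilenames(String stopwordsFile) {
        if (stopwordsFile.indexOf(',') >= 0) {
            return stopwordsFile.split("\\s*,\\s*");
        }
        return new String[] { stopwordsFile };
    }

    public static void main(String[] args) {
        String[] files = splitFilenames("stopword-list.txt , extra-stopwords.txt");
        System.out.println(Arrays.toString(files));
        // prints [stopword-list.txt, extra-stopwords.txt]
    }
}
```

Note that the pattern only absorbs whitespace around each comma; whitespace at the very start or end of the whole string is left in place by this sketch.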
-
Stopwords
public Stopwords(TermPipeline _next, java.lang.String[] StopwordsFiles)
Makes a new stopword term pipeline object. The stopwords file(s) are loaded from the filenames array parameter. The non-existence of any file is not enough to stop the system. If a filename is not absolute, it is assumed to be in TERRIER_SHARE.
- Parameters:
_next
- TermPipeline the next component in the term pipeline.
StopwordsFiles
- Array of filenames of stopword lists.
- Since:
- 1.1.0
-
-
Method Detail
-
loadStopwordsList
public void loadStopwordsList(java.lang.String[] StopwordsFiles)
Loads the specified stopwords files. Calls loadStopwordsList(String).
- Parameters:
StopwordsFiles
- Array of filenames of stopword lists.
- Since:
- 1.1.0
-
loadStopwordsList
public void loadStopwordsList(java.lang.String stopwordsFilename)
Loads the specified stopwords file. Used internally by Stopwords(TermPipeline, String[]). If a stopword list filename is not absolute, then it is resolved against ApplicationSetup.TERRIER_SHARE.
- Parameters:
stopwordsFilename
- The filename of the file to use as the stopwords list.
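The filename rule above (absolute names used as given, other names placed under the share directory) might look like the sketch below. The class name, method name, and shareDir parameter are assumptions made for illustration; the sketch uses java.io.File and does not model the resource:/ scheme mentioned for the default stopword list.

```java
import java.io.File;

class StopwordPathSketch {
    // Hypothetical helper: shareDir stands in for ApplicationSetup.TERRIER_SHARE.
    static String resolve(String filename, String shareDir) {
        File f = new File(filename);
        if (f.isAbsolute()) {
            // Absolute names are used exactly as given.
            return filename;
        }
        // Relative names are looked up inside the share directory.
        return new File(shareDir, filename).getPath();
    }

    public static void main(String[] args) {
        String shareDir = new File("share").getAbsolutePath();
        System.out.println(resolve("stopword-list.txt", shareDir));
        System.out.println(resolve(new File("custom.txt").getAbsolutePath(), shareDir));
    }
}
```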
-
clear
public void clear()
Clear all stopwords from this stopword list object.
- Since:
- 1.1.0
-
isStopword
public boolean isStopword(java.lang.String t)
Returns true if term t is a stopword.
-
processTerm
public void processTerm(java.lang.String t)
Checks to see if term t is a stopword. If so, then the TermPipeline is exited. Otherwise, the term is passed on to the next TermPipeline object. This is the TermPipeline implementation part of this object.
- Specified by:
processTerm
in interface TermPipeline
- Parameters:
t
- The term to be checked.
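The drop-or-forward behaviour of processTerm can be illustrated with a self-contained sketch. The Stage interface and StopwordStage class below are hypothetical stand-ins written for this example, not Terrier's TermPipeline or Stopwords classes, and a plain java.util.HashSet replaces the Trove set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopwordPipelineSketch {
    /** Hypothetical stand-in for one stage of a term pipeline. */
    interface Stage {
        void processTerm(String t);
    }

    /** Drops stopwords and forwards every other term to the next stage. */
    static class StopwordStage implements Stage {
        private final Stage next;
        private final Set<String> stopWords = new HashSet<>();

        StopwordStage(Stage next, List<String> words) {
            this.next = next;
            this.stopWords.addAll(words);
        }

        boolean isStopword(String t) {
            return stopWords.contains(t);
        }

        @Override
        public void processTerm(String t) {
            if (isStopword(t)) {
                return; // stopword: the pipeline is exited for this term
            }
            next.processTerm(t); // otherwise hand the term on
        }
    }

    /** Runs terms through a stopword stage and collects what reaches the end. */
    static List<String> filter(List<String> terms, List<String> stopwords) {
        List<String> kept = new ArrayList<>();
        Stage stage = new StopwordStage(kept::add, stopwords);
        for (String t : terms) {
            stage.processTerm(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> out = filter(
                Arrays.asList("the", "quick", "fox", "of"),
                Arrays.asList("the", "of"));
        System.out.println(out); // prints [quick, fox]
    }
}
```

The final stage here is just a collecting list; in a real pipeline the next component would typically be another processing step.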
-
reset
public boolean reset()
This method implements the specific reset option needed to implement a query- or document-oriented policy.
- Specified by:
reset
in interface TermPipeline
- Returns:
- results of the operation
-
-