CrawlStrategy (Terrier Information Retrieval Platform 4.1 API)

java.lang.Object
- edu.uci.ics.crawler4j.crawler.WebCrawler
- - org.terrier.services.websitesearch.crawler4j.CrawlStrategy

All Implemented Interfaces:

Runnable
```
public class CrawlStrategy
extends edu.uci.ics.crawler4j.crawler.WebCrawler
```
Overrides the Crawler4J methods in WebCrawler so that crawling is restricted to a named host and fetched pages are passed to the Terrier index. The class configures itself by overriding the following properties, which are normally loaded from the terrier.properties file:
- TaggedDocument.abstracts = title,content
- TaggedDocument.abstracts.tags = title,ELSE
- TaggedDocument.abstracts.lengths = 140,5000
- WebCrawlerTags.process = p,title
- WebCrawlerTags.skip = ""
- WebCrawlerTags.casesensitive = false
- trec.model = DirichletLM
Since:

4.0

Author:

Richard McCreadie
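The property overrides listed in the class description can be pictured as a layer applied on top of whatever terrier.properties supplied. The sketch below models that with plain `java.util.Properties`; it is not the class's actual code. The class name `CrawlDefaultsSketch`, the method `withCrawlDefaults`, and the `BM25` starting value are hypothetical, and the `""` shown for WebCrawlerTags.skip is assumed to mean an empty string.

```java
import java.util.Properties;

// Hypothetical illustration only: models the override step with java.util.Properties.
class CrawlDefaultsSketch {

    /** Returns a copy of base with the crawl defaults from the class description applied on top. */
    static Properties withCrawlDefaults(Properties base) {
        Properties p = new Properties();
        p.putAll(base); // start from the values loaded from the properties file
        p.setProperty("TaggedDocument.abstracts", "title,content");
        p.setProperty("TaggedDocument.abstracts.tags", "title,ELSE");
        p.setProperty("TaggedDocument.abstracts.lengths", "140,5000");
        p.setProperty("WebCrawlerTags.process", "p,title");
        p.setProperty("WebCrawlerTags.skip", ""); // assumption: "" denotes an empty value
        p.setProperty("WebCrawlerTags.casesensitive", "false");
        p.setProperty("trec.model", "DirichletLM");
        return p;
    }

    public static void main(String[] args) {
        Properties fromFile = new Properties();
        fromFile.setProperty("trec.model", "BM25"); // hypothetical value from terrier.properties
        Properties effective = withCrawlDefaults(fromFile);
        System.out.println(effective.getProperty("trec.model")); // prints DirichletLM
    }
}
```

Copying the base properties before applying the overrides keeps the file-loaded values intact, which makes it easy to see which settings the crawler changed.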

Field Summary
- Fields inherited from class edu.uci.ics.crawler4j.crawler.WebCrawler
  logger, myController, myId

Constructor Summary

Constructors
Constructor and Description

CrawlStrategy()

Method Summary

Methods
| Modifier and Type | Method and Description |
| --- | --- |
| `void` | `init()` |
| `boolean` | `shouldVisit(edu.uci.ics.crawler4j.crawler.Page page, edu.uci.ics.crawler4j.url.WebURL url)` Check to see if the page is on the specified host |
| `void` | `visit(edu.uci.ics.crawler4j.crawler.Page page)` Get the page and make a Terrier document from it |

Methods inherited from class edu.uci.ics.crawler4j.crawler.WebCrawler
getMyController, getMyId, getMyLocalData, getThread, handlePageStatusCode, handleUrlBeforeProcess, init, isNotWaitingForNewURLs, onBeforeExit, onContentFetchError, onPageBiggerThanMaxSize, onParseError, onStart, onUnexpectedStatusCode, run, setThread

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - CrawlStrategy
```
public CrawlStrategy()
```
- Method Detail
  - init
```
public void init()
```
  - shouldVisit
```
public boolean shouldVisit(edu.uci.ics.crawler4j.crawler.Page page,
                           edu.uci.ics.crawler4j.url.WebURL url)
```
    Check to see if the page is on the specified host
    
    Overrides:
    
    shouldVisit in class edu.uci.ics.crawler4j.crawler.WebCrawler
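A host restriction of the kind shouldVisit describes can be sketched as a plain string comparison on the URL's host. This is a minimal illustration that assumes an exact, case-insensitive host match; the class `HostFilterSketch`, the helper `isOnHost`, and the example URLs are hypothetical, and the real method may apply different rules (for example, to subdomains).

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only: an exact-host filter using java.net.URI.
class HostFilterSketch {

    /** Returns true when url parses and its host equals allowedHost, ignoring case. */
    static boolean isOnHost(String url, String allowedHost) {
        try {
            String host = new URI(url).getHost(); // null for URIs without a host, e.g. mailto:
            return host != null && host.equalsIgnoreCase(allowedHost);
        } catch (URISyntaxException e) {
            return false; // malformed URLs are never visited
        }
    }

    public static void main(String[] args) {
        System.out.println(isOnHost("http://www.example.org/a/b.html", "www.example.org")); // prints true
        System.out.println(isOnHost("http://other.example.com/", "www.example.org"));       // prints false
    }
}
```

In an actual override, the URL string would presumably come from the WebURL argument, and the allowed host from the crawler's configuration.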
  - visit
```
public void visit(edu.uci.ics.crawler4j.crawler.Page page)
```
    Get the page and make a Terrier document from it
    
    Overrides:
    
    visit in class edu.uci.ics.crawler4j.crawler.WebCrawler
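The abstract settings in the class description (abstracts named title and content, with lengths 140 and 5000) suggest that each indexed page keeps a short title and a longer content snippet. The sketch below shows one way such length-limited abstracts could be built from already extracted text. It is an assumption about the effect of those settings, not Terrier's implementation; `AbstractSketch`, `truncate`, and `buildAbstracts` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: builds length-limited abstracts from extracted text.
class AbstractSketch {

    /** Trims text and cuts it to at most maxLength characters; null becomes "". */
    static String truncate(String text, int maxLength) {
        String t = (text == null) ? "" : text.trim();
        return t.length() <= maxLength ? t : t.substring(0, maxLength);
    }

    /** Maps abstract names to values, using the limits 140 and 5000 from the class description. */
    static Map<String, String> buildAbstracts(String title, String content) {
        Map<String, String> abstracts = new LinkedHashMap<>();
        abstracts.put("title", truncate(title, 140));
        abstracts.put("content", truncate(content, 5000));
        return abstracts;
    }

    public static void main(String[] args) {
        Map<String, String> a = buildAbstracts("  Example page  ", "Some body text.");
        System.out.println(a.get("title")); // prints Example page
    }
}
```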

Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow