Documentation ¶
Overview ¶
Crawlbot is a simple, efficient, and flexible webcrawler / spider. Crawlbot is easy to use out-of-the-box, but also provides extensive flexibility for advanced users.
package main

import (
	"fmt"
	"log"

	"github.com/phayes/crawlbot"
)

func main() {
	crawler := crawlbot.NewCrawler("http://cnn.com", myURLHandler, 4)
	crawler.Start()
	crawler.Wait()
}

func myURLHandler(resp *crawlbot.Response) {
	if resp.Err != nil {
		log.Fatal(resp.Err)
	}
	fmt.Println("Found URL at " + resp.URL)
}
Crawlbot provides extensive customizability for advanced use cases. Please see the documentation on Crawler and Response for more details.
package main

import (
	"fmt"
	"log"

	"github.com/phayes/crawlbot"
)

func main() {
	crawler := crawlbot.Crawler{
		URLs:       []string{"http://example.com", "http://cnn.com", "http://en.wikipedia.org"},
		NumWorkers: 12,
		Handler:    PrintTitle,
		CheckURL:   AllowEverything,
	}
	crawler.Start()
	crawler.Wait()
}

// Print the title of the page
func PrintTitle(resp *crawlbot.Response) {
	if resp.Err != nil {
		log.Println(resp.Err)
	}
	if resp.Doc != nil {
		title, err := resp.Doc.Search("//title")
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Printf("Title of %s is %s\n", resp.URL, title[0].Content())
	} else {
		fmt.Println("HTML was not parsed for " + resp.URL)
	}
}

// Crawl everything! Returning nil tells the crawler to follow the URL.
func AllowEverything(crawler *crawlbot.Crawler, url string) error {
	return nil
}
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	ErrReqFailed      = errors.New("HTTP request failed")
	ErrBodyRead       = errors.New("Error reading HTTP response body")
	ErrAlreadyStarted = errors.New("Cannot start crawler that is already running")
	ErrHeaderRejected = errors.New("CheckHeader rejected URL")
	ErrURLRejected    = errors.New("CheckURL rejected URL")
	ErrBadHttpCode    = errors.New("Bad HTTP reponse code")
	ErrBadContentType = errors.New("Unsupported Content-Type")
)
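These sentinel errors are the values you can expect to find in Response.Err when the corresponding step fails. As a minimal sketch, assuming the crawler assigns these sentinel values to Err directly (and that fmt, log, and the crawlbot package are imported), a Handler might branch on them:

func myURLHandler(resp *crawlbot.Response) {
	switch resp.Err {
	case nil:
		fmt.Println("Crawled " + resp.URL)
	case crawlbot.ErrBadContentType:
		// The body was not an HTML Content-Type, so there is nothing to parse.
		log.Println("Skipping non-HTML resource: " + resp.URL)
	case crawlbot.ErrBadHttpCode:
		log.Println("Bad HTTP response code for " + resp.URL)
	default:
		log.Println(resp.Err)
	}
}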
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
	// A list of URLs to start crawling. This is your list of seed URLs.
	URLs []string
	// Number of concurrent workers
	NumWorkers int
	// For each page crawled this function will be called.
	// This is where your business logic should reside.
	// There is no default. If Handler is not set the crawler will panic.
	Handler func(resp *Response)
	// Before a URL is crawled it is passed to this function to see if it should be followed or not.
	// A good URL should return nil.
	// By default we follow the link if it's in one of the same domains as our seed URLs.
	CheckURL func(crawler *Crawler, url string) error
	// Before reading in the body we can check the headers to see if we want to continue.
	// By default we abort if it's not HTTP 200 OK or not an html Content-Type.
	// Override this function if you wish to handle non-html files such as binary images.
	// This function should return nil if we wish to continue and read the body.
	CheckHeader func(crawler *Crawler, url string, status int, header http.Header) error
	// This function is called to find new URLs in the document to crawl. By default it will
	// find all <a href> links in an html document. Override this function if you wish to follow
	// non <a href> links such as <img src>, or if you wish to find links in non-html documents.
	LinkFinder func(resp *Response) []string
	// The crawler will call this function when it needs a new http.Client to give to a worker.
	// The default client is the built-in net/http Client with a 15 second timeout.
	// A sensible alternative might be a simple round-tripper (e.g. github.com/pkulak/simpletransport/simpletransport).
	// If you wish to rate-throttle your crawler you would do so by implementing a custom http.Client.
	Client func() *http.Client
	// Set this to true and the crawler will not stop by itself; you will need to explicitly call Stop().
	// This is useful when you need a long-running crawler that you occasionally feed new URLs via Add().
	Persistent bool
	// contains filtered or unexported fields
}
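As a rough sketch of how these hooks fit together (the seed URL, the policies, and the 5 second timeout below are illustrative choices rather than defaults, and the snippet assumes imports of errors, fmt, log, net/http, strings, time, and the crawlbot package), a Crawler might be configured with a custom CheckURL, CheckHeader, and Client:

crawler := crawlbot.Crawler{
	URLs:       []string{"http://example.com"},
	NumWorkers: 4,
	Handler: func(resp *crawlbot.Response) {
		if resp.Err != nil {
			log.Println(resp.Err)
			return
		}
		fmt.Println("Crawled " + resp.URL)
	},
	// Follow only URLs on example.com; returning a non-nil error rejects the URL.
	CheckURL: func(c *crawlbot.Crawler, url string) error {
		if !strings.HasPrefix(url, "http://example.com") {
			return errors.New("off-site URL")
		}
		return nil
	},
	// Read the body only for 200 OK responses; returning a non-nil error skips it.
	CheckHeader: func(c *crawlbot.Crawler, url string, status int, header http.Header) error {
		if status != 200 {
			return errors.New("unexpected HTTP status")
		}
		return nil
	},
	// Give each worker an http.Client with a shorter timeout than the 15 second default.
	Client: func() *http.Client {
		return &http.Client{Timeout: 5 * time.Second}
	},
}
crawler.Start()
crawler.Wait()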
func NewCrawler ¶
Create a new simple crawler. If more customization options are needed then a Crawler{} should be created directly.
func (*Crawler) Add ¶
Add a URL to the crawler. If the item already exists this is a no-op. TODO: change this behavior so an item is re-queued if it already exists -- tricky if the item is StateRunning
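Combined with the Persistent option, Add is how you feed a long-running crawler. A minimal sketch, assuming Add accepts a single URL string as the description above suggests; myURLHandler and the newURLs channel are hypothetical:

crawler := crawlbot.Crawler{
	URLs:       []string{"http://example.com"},
	NumWorkers: 4,
	Handler:    myURLHandler,
	Persistent: true, // don't stop when the queue drains
}
crawler.Start()

// Feed the crawler as new URLs arrive; newURLs is a hypothetical chan string.
for url := range newURLs {
	crawler.Add(url)
}

// Shut the crawler down explicitly once the feed is exhausted.
crawler.Stop()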
func (*Crawler) Start ¶
Start crawling. Start() will return immediately; if you wish to wait for the crawl to finish, call Wait() after calling Start().
type Response ¶
type Response struct {
	// The http.Response object
	*http.Response
	// The URL for this Response
	URL string
	// If any errors were encountered in retrieving or processing this item, Err will be non-nil.
	// Your Handler function should generally check this first.
	Err error
	// The Crawler object that retrieved this item. You may use this to stop the crawler, add more URLs, etc.
	// Calling Crawler.Wait() from within your Handler will cause a deadlock. Don't do this.
	Crawler *Crawler
	// contains filtered or unexported fields
}
When handling a crawled page, a Response is passed to the Handler function. A crawlbot.Response is an http.Response with a few extra fields.
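For example, a Handler can read fields of the embedded http.Response directly and use the Crawler field to queue further work. A sketch (the handler name and the added URL are placeholders, not part of the API):

func myURLHandler(resp *crawlbot.Response) {
	if resp.Err != nil {
		log.Println(resp.Err)
		return
	}
	// StatusCode and Header are promoted from the embedded *http.Response.
	fmt.Printf("%s returned %d (%s)\n", resp.URL, resp.StatusCode, resp.Header.Get("Content-Type"))

	// Queue another URL on the same crawler.
	resp.Crawler.Add("http://example.com/next-page")
}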