grawlr

package module
v0.1.1

This package is not in the latest version of its module.

Published: Apr 6, 2025 License: Apache-2.0 Imports: 12 Imported by: 0

README

Grawlr

Grawlr is a simple web crawler written in Go.

Table of Contents

  • Installation
  • Usage
  • Testing
  • Linting
  • Documentation

Installation

Prerequisites
  • Go (version 1.23+)
Clone the Repository

To download the source code, clone the repository:

git clone git@github.com:HRemonen/Grawlr.git
cd grawlr
Install Dependencies

Run the following command to install Go module dependencies:

go mod tidy

This will install any necessary packages for the project.

Usage

For detailed usage instructions, refer to the Usage Documentation.

Testing

This project includes tests for the various modules.

To run the tests for a single package, use the command:

go test -v ./<package-dir>

To run all the tests:

go test ./...

Linting

To ensure that the codebase follows Go best practices and maintains a clean, consistent style, we use golangci-lint, a popular linter aggregator for Go.

Installing the Linter

First, install golangci-lint by following the official installation instructions. You can also install it using go install:

go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
Running the Linter

Once installed, you can run the linter on the project using the following command:

golangci-lint run

This will check the entire codebase for issues and display any linting errors, warnings, or suggestions.

In some cases, the linter can automatically fix issues like formatting errors. To apply fixes automatically, run:

golangci-lint run --fix

The linter is also run in the CI pipeline.

Documentation

Overview

Copyright 2024 Henri Remonen

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Index

Constants

This section is empty.

Variables

View Source
var (
	// ErrForbiddenURL is returned when a URL is not permitted by the AllowedURLs/DisallowedURLs settings.
	ErrForbiddenURL = func(u string) error {
		return fmt.Errorf("URL %s is forbidden", u)
	}
	// ErrRobotsDisallowed is returned when a URL is disallowed by robots.txt.
	ErrRobotsDisallowed = func(u string) error {
		return fmt.Errorf("URL %s is disallowed by robots.txt", u)
	}
	// ErrVisitedURL is returned when a URL has already been visited.
	ErrVisitedURL = func(u string) error {
		return fmt.Errorf("URL %s has already been visited", u)
	}
	// ErrDepthLimitExceeded is returned when the maximum depth limit is exceeded.
	ErrDepthLimitExceeded = func(depth, limit int) error {
		return fmt.Errorf("depth limit exceeded: %d > %d", depth, limit)
	}
)
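Note that these are error-constructor functions rather than sentinel values: every call allocates a fresh error, so callers cannot match them with errors.Is against a stored value. A minimal self-contained sketch (reproducing only the ErrForbiddenURL constructor from the block above) shows one way to branch on them:

```go
package main

import (
	"fmt"
	"strings"
)

// ErrForbiddenURL mirrors the constructor pattern used by the package:
// each call returns a new error value carrying the offending URL.
var ErrForbiddenURL = func(u string) error {
	return fmt.Errorf("URL %s is forbidden", u)
}

func main() {
	err := ErrForbiddenURL("https://example.com/admin")

	// Since each call produces a distinct error value, inspect the
	// message (or wrap with your own sentinel types) to classify it.
	if strings.Contains(err.Error(), "is forbidden") {
		fmt.Println("skipping:", err)
	}
}
```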

Functions

This section is empty.

Types

type Harvester

type Harvester struct {
	// Client is the http.Client used to fetch web pages.
	Client *http.Client
	// AllowedURLs is a list of URLs that are allowed to be fetched. Can be set with the WithAllowedURLs functional option.
	AllowedURLs []string
	// DisallowedURLs is a list of URLs that are disallowed to be fetched. Can be set with the WithDisallowedURLs functional option.
	DisallowedURLs []string
	// DepthLimit is the maximum depth of links to follow. If set to 0, all links are followed. Can be set with the WithDepthLimit functional option.
	DepthLimit int
	// AllowRevisit is a flag that determines whether to allow revisiting URLs. If set to true, URLs can be revisited even if they have already been visited. Defaults to false.
	AllowRevisit bool
	// Context is the context used to optionally cancel ALL harvester's requests. Can be set with the WithContext functional option.
	Context context.Context
	// contains filtered or unexported fields
}

Harvester is a web crawler that uses an http.Client to fetch web pages.

func NewHarvester

func NewHarvester(options ...Options) *Harvester

NewHarvester creates a new Harvester configured with the given functional options.
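The constructor follows Go's functional-options pattern. As a self-contained sketch of how that pattern composes — using stripped-down stand-ins rather than the real Harvester and Options types — configuration works like this:

```go
package main

import "fmt"

// harvester is a stripped-down stand-in for grawlr's Harvester,
// used only to illustrate the functional-options pattern.
type harvester struct {
	depthLimit   int
	allowRevisit bool
}

// options mirrors grawlr's Options type: a function that mutates the harvester.
type options func(h *harvester)

func withDepthLimit(depth int) options {
	return func(h *harvester) { h.depthLimit = depth }
}

func withAllowRevisit(allow bool) options {
	return func(h *harvester) { h.allowRevisit = allow }
}

// newHarvester applies each option in order to a zero-valued harvester.
func newHarvester(opts ...options) *harvester {
	h := &harvester{}
	for _, opt := range opts {
		opt(h)
	}
	return h
}

func main() {
	h := newHarvester(withDepthLimit(2), withAllowRevisit(true))
	fmt.Println(h.depthLimit, h.allowRevisit) // 2 true
}
```

With the real package, the equivalent call is NewHarvester(WithDepthLimit(2), WithAllowRevisit(true)).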

func (*Harvester) Clone

func (h *Harvester) Clone() *Harvester

Clone returns a new Harvester with the same options as the original except for the middleware functions.

func (*Harvester) HtmlDo

func (h *Harvester) HtmlDo(gqSelector string, fn HtmlCallback)

HtmlDo adds an HTML middleware to the Harvester. The given HtmlCallback is executed for every HtmlElement that matches the given GoQuery selector.

See the GoQuery documentation for more information on selectors: https://pkg.go.dev/github.com/PuerkitoBio/goquery

func (*Harvester) RequestDo

func (h *Harvester) RequestDo(mw ReqMiddleware)

RequestDo adds a request middleware to the Harvester. The given ReqMiddleware is triggered for each request before it is fetched.

func (*Harvester) ResponseDo

func (h *Harvester) ResponseDo(mw ResMiddleware)

ResponseDo adds a response middleware to the Harvester. The given ResMiddleware is triggered for each response after a request completes.

func (*Harvester) Visit

func (h *Harvester) Visit(u string) error

Visit requests the web page at the given URL if it is allowed to be fetched. It returns a Response with the response data or an error if the request fails.

type HtmlCallback

type HtmlCallback func(el *HtmlElement)

type HtmlElement

type HtmlElement struct {
	Text string

	Request   *Request
	Response  *Response
	Selection *goquery.Selection
	// contains filtered or unexported fields
}

HtmlElement is a representation of an HTML element.

func (*HtmlElement) Attribute

func (e *HtmlElement) Attribute(key string) string

Attribute returns the value of the attribute with the given key.

type HtmlMiddleware

type HtmlMiddleware struct {
	Selector string
	Function HtmlCallback
}

type InMemoryStore

type InMemoryStore struct {
	// contains filtered or unexported fields
}

func NewInMemoryStore

func NewInMemoryStore() *InMemoryStore

func (*InMemoryStore) Visit

func (s *InMemoryStore) Visit(url string)

func (*InMemoryStore) Visited

func (s *InMemoryStore) Visited(url string) bool

type Options

type Options func(h *Harvester)

Options is a type for functional options that can be used to configure a Harvester.

func WithAllowRevisit

func WithAllowRevisit(allow bool) Options

WithAllowRevisit is a functional option that sets the AllowRevisit flag for the Harvester.

func WithAllowedURLs

func WithAllowedURLs(urls []string) Options

WithAllowedURLs is a functional option that sets the allowed URLs for the Harvester.

func WithClient

func WithClient(client *http.Client) Options

WithClient is a functional option that sets the http.Client for the Harvester.

func WithContext

func WithContext(ctx context.Context) Options

WithContext is a functional option that sets the context for the Harvester.

func WithDepthLimit

func WithDepthLimit(depth int) Options

WithDepthLimit is a functional option that sets the maximum depth for the Harvester.

func WithDisallowedURLs

func WithDisallowedURLs(urls []string) Options

WithDisallowedURLs is a functional option that sets the disallowed URLs for the Harvester.

func WithIgnoreRobots

func WithIgnoreRobots(ignore bool) Options

WithIgnoreRobots is a functional option that sets the ignoreRobots flag for the Harvester.

func WithStore

func WithStore(store Storer) Options

WithStore is a functional option that sets the Storer for the Harvester. See the Storer interface in store.go for more information.

type ReqMiddleware

type ReqMiddleware func(req *Request)

ReqMiddleware is a type for request middlewares that can be used to modify a Request before it is fetched.

type Request

type Request struct {
	URL     *url.URL
	BaseURL *url.URL
	Headers *http.Header
	Host    string
	Method  string
	Body    io.Reader
	Depth   int
	// contains filtered or unexported fields
}

func (*Request) GetAbsoluteURL

func (r *Request) GetAbsoluteURL(link string) string

GetAbsoluteURL returns the absolute URL for a link found on the page.

func (*Request) Visit

func (r *Request) Visit(u string) error

Visit continues the crawling process by visiting a new URL while preserving the current request context.

type ResMiddleware

type ResMiddleware func(res *Response)

ResMiddleware is a type for response middlewares that can be used to modify a Response after it is fetched.

type Response

type Response struct {
	StatusCode int
	Headers    *http.Header
	Request    *Request
	Body       io.Reader
}

Response is a representation of the response from a Harvester.

type Storer

type Storer interface {
	// Visited returns true if the URL has been visited.
	Visited(url string) bool
	// Visit marks the URL as visited.
	Visit(url string)
}

Storer is an interface for a cache that stores the Harvester's internal data.
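Any type satisfying this two-method interface can be supplied via WithStore. A self-contained sketch of a concurrency-safe implementation — the interface is copied from above, but the sync.Map backing is a hypothetical choice, not necessarily how InMemoryStore is built:

```go
package main

import (
	"fmt"
	"sync"
)

// Storer is copied verbatim from the package documentation.
type Storer interface {
	// Visited returns true if the URL has been visited.
	Visited(url string) bool
	// Visit marks the URL as visited.
	Visit(url string)
}

// syncMapStore is a hypothetical Storer backed by sync.Map,
// safe to call from concurrent crawling goroutines.
type syncMapStore struct {
	m sync.Map
}

func (s *syncMapStore) Visit(url string) {
	s.m.Store(url, struct{}{})
}

func (s *syncMapStore) Visited(url string) bool {
	_, ok := s.m.Load(url)
	return ok
}

func main() {
	var store Storer = &syncMapStore{}
	store.Visit("https://example.com")
	fmt.Println(store.Visited("https://example.com")) // true
	fmt.Println(store.Visited("https://example.org")) // false
}
```

With the real package this would be wired in as NewHarvester(WithStore(&syncMapStore{})).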
