Documentation ¶
Overview ¶
Copyright 2024 Henri Remonen
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Index ¶
- Variables
- type Harvester
- type HtmlCallback
- type HtmlElement
- type HtmlMiddleware
- type InMemoryStore
- type Options
- func WithAllowRevisit(allow bool) Options
- func WithAllowedURLs(urls []string) Options
- func WithClient(client *http.Client) Options
- func WithContext(ctx context.Context) Options
- func WithDepthLimit(depth int) Options
- func WithDisallowedURLs(urls []string) Options
- func WithIgnoreRobots(ignore bool) Options
- func WithStore(store Storer) Options
- type ReqMiddleware
- type Request
- type ResMiddleware
- type Response
- type Storer
Constants ¶
This section is empty.
Variables ¶
var (
	// ErrForbiddenURL is returned when a URL is not permitted by the AllowedURLs or DisallowedURLs settings.
	ErrForbiddenURL = func(u string) error {
		return fmt.Errorf("URL %s is forbidden", u)
	}
	// ErrRobotsDisallowed is returned when a URL is disallowed by robots.txt.
	ErrRobotsDisallowed = func(u string) error {
		return fmt.Errorf("URL %s is disallowed by robots.txt", u)
	}
	// ErrVisitedURL is returned when a URL has already been visited.
	ErrVisitedURL = func(u string) error {
		return fmt.Errorf("URL %s has already been visited", u)
	}
	// ErrDepthLimitExceeded is returned when the maximum depth limit is exceeded.
	ErrDepthLimitExceeded = func(depth, limit int) error {
		return fmt.Errorf("depth limit exceeded: %d > %d", depth, limit)
	}
)
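Note that these variables are error constructors (functions that return an error), not sentinel error values. The sketch below copies ErrVisitedURL from the block above so that it compiles on its own, and shows one consequence: two calls with the same argument produce distinct error values, so an equality or errors.Is comparison does not match. The helper isVisitedErr is hypothetical and not part of the package; matching on the message text is only one possible approach.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrVisitedURL is copied from the package's variable block so that this
// sketch is self-contained.
var ErrVisitedURL = func(u string) error {
	return fmt.Errorf("URL %s has already been visited", u)
}

// isVisitedErr is a hypothetical helper, not part of the package. It matches
// on the message text because every constructor call returns a new value.
func isVisitedErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), "has already been visited")
}

func main() {
	err := ErrVisitedURL("https://example.com")
	fmt.Println(err) // URL https://example.com has already been visited

	// A second call with the same argument is a different error value.
	fmt.Println(errors.Is(err, ErrVisitedURL("https://example.com"))) // false

	fmt.Println(isVisitedErr(err)) // true
}
```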
Functions ¶
This section is empty.
Types ¶
type Harvester ¶
type Harvester struct {
// Client is the http.Client used to fetch web pages.
Client *http.Client
// AllowedURLs is a list of URLs that are allowed to be fetched. Can be set with the WithAllowedURLs functional option.
AllowedURLs []string
// DisallowedURLs is a list of URLs that are disallowed to be fetched. Can be set with the WithDisallowedURLs functional option.
DisallowedURLs []string
// DepthLimit is the maximum depth of links to follow. If set to 0, all links are followed. Can be set with the WithDepthLimit functional option.
DepthLimit int
// AllowRevisit is a flag that determines whether to allow revisiting URLs. If set to true, URLs can be revisited even if they have already been visited. Defaults to false.
AllowRevisit bool
// Context is the context used to optionally cancel all of the harvester's requests. Can be set with the WithContext functional option.
Context context.Context
// contains filtered or unexported fields
}
Harvester fetches web pages with an http.Client and applies the registered request, response, and HTML middleware to them.
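As an illustration of the DepthLimit field comment ("If set to 0, all links are followed"), here is a minimal sketch of how such a check could look. The function checkDepth is hypothetical, and the comparison depth > limit is an assumption inferred from the ErrDepthLimitExceeded message format; the package's actual check is not shown in this documentation.

```go
package main

import "fmt"

// ErrDepthLimitExceeded is copied from the package's variable block so that
// this sketch is self-contained.
var ErrDepthLimitExceeded = func(depth, limit int) error {
	return fmt.Errorf("depth limit exceeded: %d > %d", depth, limit)
}

// checkDepth is a hypothetical helper. It treats a limit of 0 as "no limit"
// and rejects any depth strictly greater than a positive limit (assumed).
func checkDepth(depth, limit int) error {
	if limit > 0 && depth > limit {
		return ErrDepthLimitExceeded(depth, limit)
	}
	return nil
}

func main() {
	fmt.Println(checkDepth(5, 0)) // <nil>: a limit of 0 disables the check
	fmt.Println(checkDepth(2, 2)) // <nil>: depth equal to the limit is allowed
	fmt.Println(checkDepth(3, 2)) // depth limit exceeded: 3 > 2
}
```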
func NewHarvester ¶
NewHarvester creates a new Harvester with the given http.Client.
func (*Harvester) Clone ¶
Clone returns a new Harvester with the same options as the original except for the middleware functions.
func (*Harvester) HtmlDo ¶
func (h *Harvester) HtmlDo(gqSelector string, fn HtmlCallback)
HtmlDo adds an HTML middleware to the Harvester. The given HtmlCallback is executed on every HtmlElement that matches the given GoQuery selector.
See the GoQuery documentation for more information on selectors: https://pkg.go.dev/github.com/PuerkitoBio/goquery
func (*Harvester) RequestDo ¶
func (h *Harvester) RequestDo(mw ReqMiddleware)
RequestDo adds a request middleware to the Harvester. The given ReqMiddleware is triggered for each request before it is fetched.
func (*Harvester) ResponseDo ¶
func (h *Harvester) ResponseDo(mw ResMiddleware)
ResponseDo adds a response middleware to the Harvester. The given ResMiddleware is triggered for each response after its request has completed.
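The registration pattern behind RequestDo can be sketched with local stand-in types. Everything below is illustrative: the lowercase harvester type, its reqMiddleware field, the prepare method, and newRequest are hypothetical, since the real Harvester keeps its middleware in unexported fields that this documentation does not show. Running middleware in registration order is also an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// Request is a local stand-in that keeps only the documented Headers field.
type Request struct {
	Headers *http.Header
}

// ReqMiddleware mirrors the documented type.
type ReqMiddleware func(req *Request)

// harvester is a hypothetical minimal type used only for this sketch.
type harvester struct {
	reqMiddleware []ReqMiddleware
}

// RequestDo appends a middleware to the list, like the documented method.
func (h *harvester) RequestDo(mw ReqMiddleware) {
	h.reqMiddleware = append(h.reqMiddleware, mw)
}

// prepare runs every registered middleware in registration order (assumed).
func (h *harvester) prepare(req *Request) {
	for _, mw := range h.reqMiddleware {
		mw(req)
	}
}

// newRequest is a hypothetical helper that returns a Request with empty headers.
func newRequest() *Request {
	headers := http.Header{}
	return &Request{Headers: &headers}
}

func main() {
	h := &harvester{}
	h.RequestDo(func(req *Request) {
		req.Headers.Set("User-Agent", "example-bot/1.0")
	})

	req := newRequest()
	h.prepare(req)
	fmt.Println(req.Headers.Get("User-Agent")) // example-bot/1.0
}
```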
type HtmlCallback ¶
type HtmlCallback func(el *HtmlElement)
type HtmlElement ¶
type HtmlElement struct {
Text string
Request *Request
Response *Response
Selection *goquery.Selection
// contains filtered or unexported fields
}
HtmlElement is a representation of an HTML element.
func (*HtmlElement) Attribute ¶
func (e *HtmlElement) Attribute(key string) string
Attribute returns the value of the attribute with the given key.
type HtmlMiddleware ¶
type HtmlMiddleware struct {
Selector string
Function HtmlCallback
}
type InMemoryStore ¶
type InMemoryStore struct {
// contains filtered or unexported fields
}
func NewInMemoryStore ¶
func NewInMemoryStore() *InMemoryStore
func (*InMemoryStore) Visit ¶
func (s *InMemoryStore) Visit(url string)
func (*InMemoryStore) Visited ¶
func (s *InMemoryStore) Visited(url string) bool
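InMemoryStore's fields are unexported, so its internals are not visible here. As a rough sketch of what a visited-URL store with these two methods might look like, the example below uses a mutex-guarded set. The memoryStore type and its constructor are hypothetical, and the Storer interface is assumed to consist of exactly the Visit and Visited methods shown above.

```go
package main

import (
	"fmt"
	"sync"
)

// Storer is assumed to consist of the two methods InMemoryStore exposes.
type Storer interface {
	Visit(url string)
	Visited(url string) bool
}

// memoryStore is a hypothetical stand-in for InMemoryStore.
type memoryStore struct {
	mu   sync.RWMutex
	seen map[string]struct{}
}

func newMemoryStore() *memoryStore {
	return &memoryStore{seen: make(map[string]struct{})}
}

// Visit records the URL as visited.
func (s *memoryStore) Visit(url string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[url] = struct{}{}
}

// Visited reports whether the URL has been recorded by Visit.
func (s *memoryStore) Visited(url string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.seen[url]
	return ok
}

func main() {
	var store Storer = newMemoryStore()
	fmt.Println(store.Visited("https://example.com")) // false
	store.Visit("https://example.com")
	fmt.Println(store.Visited("https://example.com")) // true
}
```

A type with the same two methods, backed by a database or a file, could presumably be passed to WithStore in the same way.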
type Options ¶
type Options func(h *Harvester)
Options is a type for functional options that can be used to configure a Harvester.
func WithAllowRevisit ¶
WithAllowRevisit is a functional option that sets the AllowRevisit flag for the Harvester.
func WithAllowedURLs ¶
WithAllowedURLs is a functional option that sets the allowed URLs for the Harvester.
func WithClient ¶
WithClient is a functional option that sets the http.Client for the Harvester.
func WithContext ¶
WithContext is a functional option that sets the context for the Harvester.
func WithDepthLimit ¶
WithDepthLimit is a functional option that sets the maximum depth for the Harvester.
func WithDisallowedURLs ¶
WithDisallowedURLs is a functional option that sets the disallowed URLs for the Harvester.
func WithIgnoreRobots ¶
WithIgnoreRobots is a functional option that sets the ignoreRobots flag for the Harvester.
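The functional options pattern used by the With* functions can be sketched as follows. The Harvester here is a trimmed local stand-in with two of the documented fields, the option bodies are assumed, and newHarvester is a hypothetical constructor: the real NewHarvester signature is not shown in this documentation.

```go
package main

import "fmt"

// Harvester is a local stand-in with only two of the documented fields.
type Harvester struct {
	DepthLimit   int
	AllowRevisit bool
}

// Options mirrors the documented type: a function that configures a Harvester.
type Options func(h *Harvester)

// WithDepthLimit follows the documented signature; the body is assumed.
func WithDepthLimit(depth int) Options {
	return func(h *Harvester) { h.DepthLimit = depth }
}

// WithAllowRevisit follows the documented signature; the body is assumed.
func WithAllowRevisit(allow bool) Options {
	return func(h *Harvester) { h.AllowRevisit = allow }
}

// newHarvester is a hypothetical constructor that applies options in order.
func newHarvester(opts ...Options) *Harvester {
	h := &Harvester{}
	for _, opt := range opts {
		opt(h)
	}
	return h
}

func main() {
	h := newHarvester(WithDepthLimit(2), WithAllowRevisit(true))
	fmt.Printf("%+v\n", *h) // {DepthLimit:2 AllowRevisit:true}
}
```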
type ReqMiddleware ¶
type ReqMiddleware func(req *Request)
ReqMiddleware is a type for request middlewares that can be used to modify a Request before it is fetched.
type Request ¶
type Request struct {
URL *url.URL
BaseURL *url.URL
Headers *http.Header
Host string
Method string
Body io.Reader
Depth int
// contains filtered or unexported fields
}
func (*Request) GetAbsoluteURL ¶
GetAbsoluteURL returns the absolute URL for a link found on the page.
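The signature of GetAbsoluteURL is not shown here, so the following is only a sketch of how a link can be resolved against a page URL with the standard net/url package. The absoluteURL helper and its parameters are hypothetical and may differ from what the method actually does.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL is a hypothetical helper. It resolves link against base using
// the reference resolution rules implemented by url.URL.ResolveReference.
func absoluteURL(base, link string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	return baseURL.ResolveReference(ref).String(), nil
}

func main() {
	base := "https://example.com/docs/index.html"
	for _, link := range []string{"../about", "/contact", "https://other.example/x"} {
		abs, err := absoluteURL(base, link)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
	// Prints:
	// https://example.com/about
	// https://example.com/contact
	// https://other.example/x
}
```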
type ResMiddleware ¶
type ResMiddleware func(res *Response)
ResMiddleware is a type for response middlewares that can be used to modify a Response after it is fetched.