Documentation
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler is used to crawl the web.
func (*Crawler) AddFetcherRules ¶
func (c *Crawler) AddFetcherRules(rules ...*FetcherRule) error
AddFetcherRules adds new fetcher rules to the crawler. The rules will be re-sorted by priority after adding.
func (*Crawler) AddParserRules ¶
func (c *Crawler) AddParserRules(rule ...*ParserRule) error
AddParserRules adds new parser rules to the crawler. The rules will be re-sorted by priority after adding.
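The re-sorting described for AddFetcherRules and AddParserRules can be sketched as follows. This is only an illustration under assumptions: the `rule` type and `addRules` helper are hypothetical stand-ins, and the stable ordering among equal priorities is a guess rather than documented behavior. The "higher priority first" order comes from the MatchRule field comment.

```go
package main

import (
	"fmt"
	"sort"
)

// rule is a hypothetical stand-in for FetcherRule/ParserRule; only the
// Pattern and Priority fields mirror the documented MatchRule.
type rule struct {
	Pattern  string
	Priority int
}

// addRules appends the new rules to a copy of the existing ones and
// re-sorts so that higher priorities come first. The stable sort keeps
// insertion order among equal priorities (an assumption).
func addRules(existing []rule, added ...rule) []rule {
	out := append(append([]rule{}, existing...), added...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Priority > out[j].Priority
	})
	return out
}

func main() {
	rules := addRules(nil, rule{"example.com", 0}, rule{"*.example.com", 5})
	rules = addRules(rules, rule{"blog.example.com", 10})
	for _, r := range rules {
		fmt.Println(r.Priority, r.Pattern)
	}
}
```

With this ordering, a rule evaluator can walk the slice front to back and stop at the first match.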
func (*Crawler) Crawl ¶
Crawl crawls the provided URLs and calls the callback for each processed page. Links may be followed depending on the configured follow behavior (see FollowBehavior).
func (*Crawler) GetStats ¶
func (c *Crawler) GetStats() *CrawlerStats
GetStats returns the current crawling statistics.
type CrawlerStats ¶
type CrawlerStats struct {
// contains filtered or unexported fields
}
CrawlerStats tracks crawling statistics. All methods are thread-safe.
func (*CrawlerStats) GetFailed ¶
func (s *CrawlerStats) GetFailed() int64
GetFailed returns the number of URLs that failed to process.
func (*CrawlerStats) GetProcessed ¶
func (s *CrawlerStats) GetProcessed() int64
GetProcessed returns the number of URLs processed.
func (*CrawlerStats) GetSucceeded ¶
func (s *CrawlerStats) GetSucceeded() int64
GetSucceeded returns the number of URLs successfully processed.
func (*CrawlerStats) IncrementFailed ¶
func (s *CrawlerStats) IncrementFailed()
IncrementFailed atomically increments the failed counter.
func (*CrawlerStats) IncrementProcessed ¶
func (s *CrawlerStats) IncrementProcessed()
IncrementProcessed atomically increments the processed counter.
func (*CrawlerStats) IncrementSucceeded ¶
func (s *CrawlerStats) IncrementSucceeded()
IncrementSucceeded atomically increments the succeeded counter.
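CrawlerStats is documented as thread-safe with atomic increments. One way such counters could be built is on top of sync/atomic, as in the sketch below. The `stats` type and the `simulate` helper are hypothetical; the real struct has unexported fields that this page does not show.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stats is a hypothetical stand-in for CrawlerStats.
type stats struct {
	processed int64
	succeeded int64
	failed    int64
}

func (s *stats) IncrementProcessed() { atomic.AddInt64(&s.processed, 1) }
func (s *stats) IncrementSucceeded() { atomic.AddInt64(&s.succeeded, 1) }
func (s *stats) IncrementFailed()    { atomic.AddInt64(&s.failed, 1) }

func (s *stats) GetProcessed() int64 { return atomic.LoadInt64(&s.processed) }
func (s *stats) GetSucceeded() int64 { return atomic.LoadInt64(&s.succeeded) }
func (s *stats) GetFailed() int64    { return atomic.LoadInt64(&s.failed) }

// simulate starts n goroutines that each record one processed URL.
// Every fourth one (i%4 == 0) is counted as failed, the rest as succeeded.
func simulate(n int) *stats {
	s := &stats{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.IncrementProcessed()
			if i%4 == 0 {
				s.IncrementFailed()
			} else {
				s.IncrementSucceeded()
			}
		}(i)
	}
	wg.Wait()
	return s
}

func main() {
	s := simulate(100)
	fmt.Println(s.GetProcessed(), s.GetSucceeded(), s.GetFailed())
}
```

Because each counter is updated with a single atomic operation, workers need no mutex, although the three values read together are not guaranteed to form one consistent snapshot.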
type FetcherRule ¶
type FetcherRule struct {
	MatchRule
	Fetcher fetch.Fetcher // The fetcher to use for matching domains
}
FetcherRule defines a rule for matching domains to fetchers.
func NewFetcherRule ¶
func NewFetcherRule(pattern string, fetcher fetch.Fetcher, opts ...FetcherRuleOption) *FetcherRule
NewFetcherRule creates a new fetcher rule with the given pattern and fetcher. By default, it uses exact matching with priority 0. Use functional options to customize behavior.
Example:
rule := NewFetcherRule("example.com", fetcher, WithFetcherPriority(10))
rule := NewFetcherRule("*.example.com", fetcher, WithFetcherMatchType(MatchGlob), WithFetcherPriority(5))
type FetcherRuleOption ¶
type FetcherRuleOption func(*FetcherRule)
FetcherRuleOption defines a function that modifies a FetcherRule.
func WithFetcherMatchType ¶
func WithFetcherMatchType(matchType MatchType) FetcherRuleOption
WithFetcherMatchType sets the match type for a fetcher rule.
func WithFetcherPriority ¶
func WithFetcherPriority(priority int) FetcherRuleOption
WithFetcherPriority sets the priority for a fetcher rule.
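The functional-options pattern behind NewFetcherRule and its FetcherRuleOption helpers can be sketched in a self-contained form. The lower-case names below (`fetcherRule`, `newFetcherRule`, `withPriority`, `withMatchType`) are hypothetical stand-ins, and the fetcher argument is omitted to keep the example free of external packages. The defaults of exact matching and priority 0 follow the description above.

```go
package main

import "fmt"

type matchType string

const (
	matchExact matchType = "exact"
	matchGlob  matchType = "glob"
)

// fetcherRule is a simplified, hypothetical stand-in for FetcherRule.
type fetcherRule struct {
	Pattern  string
	Type     matchType
	Priority int
}

// fetcherRuleOption mirrors the documented shape: func(*FetcherRule).
type fetcherRuleOption func(*fetcherRule)

func withPriority(p int) fetcherRuleOption {
	return func(r *fetcherRule) { r.Priority = p }
}

func withMatchType(t matchType) fetcherRuleOption {
	return func(r *fetcherRule) { r.Type = t }
}

// newFetcherRule applies the documented defaults first, then each option
// in the order given, so later options override earlier ones.
func newFetcherRule(pattern string, opts ...fetcherRuleOption) *fetcherRule {
	r := &fetcherRule{Pattern: pattern, Type: matchExact, Priority: 0}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

func main() {
	plain := newFetcherRule("example.com")
	glob := newFetcherRule("*.example.com", withMatchType(matchGlob), withPriority(5))
	fmt.Println(plain.Type, plain.Priority)
	fmt.Println(glob.Type, glob.Priority)
}
```

The appeal of this design is that new settings can be added as further option functions without changing the constructor's signature.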
type FollowBehavior ¶
type FollowBehavior string
FollowBehavior is used to determine how to follow links.
const (
	FollowAny               FollowBehavior = "any"
	FollowSameDomain        FollowBehavior = "same-domain"
	FollowRelatedSubdomains FollowBehavior = "related-subdomains"
	FollowNone              FollowBehavior = "none"
)
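This page does not spell out the exact semantics of each FollowBehavior value, so the following is only a rough sketch of how a follow decision might look. It assumes that "same-domain" means identical hostnames and that "related-subdomains" means one hostname is a subdomain of the other. Both interpretations, and the `shouldFollow` helper itself, are assumptions rather than the package's actual logic.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

type followBehavior string

const (
	followAny               followBehavior = "any"
	followSameDomain        followBehavior = "same-domain"
	followRelatedSubdomains followBehavior = "related-subdomains"
	followNone              followBehavior = "none"
)

// shouldFollow reports whether a link from one page to another would be
// followed under the assumed semantics. Unparseable URLs are never followed.
func shouldFollow(b followBehavior, from, to string) bool {
	src, err := url.Parse(from)
	if err != nil {
		return false
	}
	dst, err := url.Parse(to)
	if err != nil {
		return false
	}
	a := strings.ToLower(src.Hostname())
	c := strings.ToLower(dst.Hostname())
	switch b {
	case followAny:
		return true
	case followSameDomain:
		return a == c
	case followRelatedSubdomains:
		// Assumed meaning: equal hosts, or one is a subdomain of the other.
		return a == c || strings.HasSuffix(a, "."+c) || strings.HasSuffix(c, "."+a)
	default:
		// followNone and any unknown value.
		return false
	}
}

func main() {
	from := "https://example.com/index.html"
	to := "https://blog.example.com/post"
	for _, b := range []followBehavior{followAny, followSameDomain, followRelatedSubdomains, followNone} {
		fmt.Println(b, shouldFollow(b, from, to))
	}
}
```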
type MatchRule ¶
type MatchRule struct {
	Pattern  string    // The pattern to match against
	Type     MatchType // The type of matching to perform
	Priority int       // Priority for rule evaluation (higher = first)
	// contains filtered or unexported fields
}
MatchRule defines the core matching logic that can be used by different rule types.
type MatchType ¶
type MatchType string
MatchType defines the type of pattern matching for rules.
const (
	MatchExact  MatchType = "exact"  // Exact domain match
	MatchRegex  MatchType = "regex"  // Regular expression match
	MatchSuffix MatchType = "suffix" // Domain suffix match (e.g., ".com")
	MatchPrefix MatchType = "prefix" // Domain prefix match (e.g., "blog.")
	MatchGlob   MatchType = "glob"   // Glob pattern match (e.g., "*.example.com")
)
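To make the five match types concrete, here is a minimal sketch of how each could be evaluated against a domain using only the standard library. The `matches` function is hypothetical. In particular, using path.Match for glob patterns and compiling the regular expression on every call are simplifications chosen for this example, not a description of the package's internals.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

type matchType string

const (
	matchExact  matchType = "exact"
	matchRegex  matchType = "regex"
	matchSuffix matchType = "suffix"
	matchPrefix matchType = "prefix"
	matchGlob   matchType = "glob"
)

// matches reports whether domain satisfies pattern under the given type.
// It returns an error for an invalid regex or glob, or an unknown type.
func matches(t matchType, pattern, domain string) (bool, error) {
	switch t {
	case matchExact:
		return pattern == domain, nil
	case matchRegex:
		re, err := regexp.Compile(pattern)
		if err != nil {
			return false, err
		}
		return re.MatchString(domain), nil
	case matchSuffix:
		return strings.HasSuffix(domain, pattern), nil
	case matchPrefix:
		return strings.HasPrefix(domain, pattern), nil
	case matchGlob:
		// path.Match treats "*" as any run of non-"/" characters.
		return path.Match(pattern, domain)
	default:
		return false, fmt.Errorf("unknown match type %q", t)
	}
}

func main() {
	ok, err := matches(matchGlob, "*.example.com", "blog.example.com")
	fmt.Println(ok, err)
	ok, err = matches(matchSuffix, ".com", "example.com")
	fmt.Println(ok, err)
}
```

Note that under this sketch the glob "*.example.com" does not match the bare domain "example.com", since the pattern requires a literal dot before "example".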
type MockParser ¶
MockParser implements the Parser interface for testing.
func NewMockParser ¶
func NewMockParser() *MockParser
func (*MockParser) SetParseFunc ¶
type Options ¶
type Options struct {
	MaxURLs              int
	Workers              int
	Cache                cache.Cache
	RequestDelay         time.Duration
	KnownURLs            []string
	ParserRules          []*ParserRule
	DefaultParser        Parser
	FetcherRules         []*FetcherRule
	DefaultFetcher       fetch.Fetcher
	FollowBehavior       FollowBehavior
	Logger               *slog.Logger
	ShowProgress         bool
	ShowProgressInterval time.Duration
	QueueSize            int
}
Options is used to configure a crawler.
type Parser ¶
Parser is an interface describing a webpage parser. It accepts the fetched page and returns a parsed object.
type ParserRule ¶
ParserRule defines a rule for matching domains to parsers.
func NewParserRule ¶
func NewParserRule(pattern string, parser Parser, opts ...ParserRuleOption) *ParserRule
NewParserRule creates a new parser rule with the given pattern and parser. By default, it uses exact matching with priority 0. Use functional options to customize behavior.
Example:
rule := NewParserRule("example.com", parser, WithParserPriority(10))
rule := NewParserRule("*.example.com", parser, WithParserMatchType(MatchGlob), WithParserPriority(5))
type ParserRuleOption ¶
type ParserRuleOption func(*ParserRule)
ParserRuleOption defines a function that modifies a ParserRule.
func WithParserMatchType ¶
func WithParserMatchType(matchType MatchType) ParserRuleOption
WithParserMatchType sets the match type for a parser rule.
func WithParserPriority ¶
func WithParserPriority(priority int) ParserRuleOption
WithParserPriority sets the priority for a parser rule.