Distributed Crawler System
A distributed web crawler system with a hub-node architecture.
Configuration
The system reads configuration from three sources, listed here from highest to lowest precedence:
- Command-line flags (highest priority)
- Environment variables
- Configuration files (TOML format)
Using Configuration Files
Hub Configuration
# Using TOML config file
./hub -config config/hub.toml
# Override specific settings
./hub -config config/hub.toml -redis localhost:6380 -grpc :50052
Node Configuration
# Using TOML config file
./node -config config/node.toml
# Override specific settings
./node -config config/node.toml -redis localhost:6380 -hub localhost:50052
Using Environment Variables
# Copy example env file
cp .env.example .env
# Edit .env with your settings
vim .env
# Run with environment variables (set -a exports every variable that .env defines)
set -a; . ./.env; set +a; ./hub
set -a; . ./.env; set +a; ./node
Configuration Priority
Settings are applied in this order (later overrides earlier):
- Default values in code
- Configuration file (TOML)
- Environment variables
- Command-line flags
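The layering above can be sketched as a small Go function. This is an illustration under stated assumptions rather than the system's real implementation: the function name `resolve` is hypothetical, and an empty string is assumed to mean "this layer did not set the value".

```go
package main

import "fmt"

// resolve is a hypothetical helper. It takes candidate values ordered from
// lowest to highest priority (default, config file, environment, flag) and
// returns the last non-empty one, so later layers override earlier ones.
func resolve(layers ...string) string {
	result := ""
	for _, v := range layers {
		if v != "" {
			result = v
		}
	}
	return result
}

func main() {
	// Illustrative values: the default and the flag value appear in the
	// examples in this document; the file value is made up.
	redisAddr := resolve(
		"localhost:6379",      // default in code
		"redis.internal:6379", // configuration file (assumed value)
		"",                    // environment variable not set
		"localhost:6380",      // command-line flag
	)
	fmt.Println(redisAddr)
}
```

In this run the flag value wins because it is the last non-empty layer; if the flag were omitted, the file value would be used instead.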
Example: Development Setup
# Start Redis
redis-server
# Start Hub with config
./hub -config config/hub.toml -log debug
# Start Node with config
./node -config config/node.toml -log debug
# Start TUI Dashboard
./tui -redis localhost:6379
Example: Production Setup
# Using environment variables for secrets and per-host settings
export REDIS_PASSWORD=your_redis_password
export NODE_ID=prod-node-01
# Start with production config
./hub -config config/production/hub.toml
./node -config config/production/node.toml
Configuration Files
Hub Configuration (hub.toml)
- server: gRPC and HTTP server settings
- redis: Redis connection and prefix
- cluster: Cluster identification and limits
- coordinator: Task coordination settings
- dispatcher: Task dispatching settings
- communication: Node communication timeouts
- monitoring: Metrics and monitoring
- logging: Log level and format
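To show how these sections might map onto Go types, here is a hedged sketch covering three of them. Every field name, struct tag key, and the function `defaultHubConfig` are assumptions made for illustration; the real hub.toml keys may differ. The addresses are the ones used in the command-line examples in this document, and the log values are placeholders.

```go
package main

import "fmt"

// HubConfig mirrors a subset of the hub.toml sections listed above.
// All field names and toml tag keys below are hypothetical.
type HubConfig struct {
	Server  ServerConfig  `toml:"server"`
	Redis   RedisConfig   `toml:"redis"`
	Logging LoggingConfig `toml:"logging"`
}

// ServerConfig holds gRPC and HTTP server settings.
type ServerConfig struct {
	GRPCAddr string `toml:"grpc_addr"`
	HTTPAddr string `toml:"http_addr"`
}

// RedisConfig holds the Redis connection and key prefix.
type RedisConfig struct {
	Addr   string `toml:"addr"`
	Prefix string `toml:"prefix"`
}

// LoggingConfig holds the log level and format.
type LoggingConfig struct {
	Level  string `toml:"level"`
	Format string `toml:"format"`
}

// defaultHubConfig returns illustrative defaults that a TOML file,
// environment variables, and flags would then override.
func defaultHubConfig() HubConfig {
	return HubConfig{
		Server:  ServerConfig{GRPCAddr: ":50051"},
		Redis:   RedisConfig{Addr: "localhost:6379"},
		Logging: LoggingConfig{Level: "info", Format: "text"},
	}
}

func main() {
	cfg := defaultHubConfig()
	fmt.Printf("%+v\n", cfg)
}
```

A TOML decoding library would fill such a struct from the file using the tags; the sketch leaves that step out so that it depends only on the standard library.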
Node Configuration (node.toml)
- node: Node identity and capabilities
- hub: Hub connection settings
- redis: Redis connection settings
- crawler: Crawling behavior and rate limits
- fetcher: HTTP fetching settings
- extractor: Data extraction settings
- storage: Storage backend configuration
- monitoring: Node metrics reporting
- logging: Log level and format
Benefits of TOML Configuration
- Structured: Clear hierarchy for complex settings
- Typed: Values carry explicit types (strings, integers, booleans, arrays, tables)
- Documented: Inline comments can explain each option
- Readable: Human-friendly format
- Standard: Common in the Go ecosystem
Migration from Command-Line Flags
Instead of:
./hub -redis localhost:6379 -grpc :50051 -log debug
Use:
./hub -config config/hub.toml
Or with overrides:
./hub -config config/hub.toml -log debug
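One way to let a flag override the file only when it is actually passed is the standard library's `FlagSet.Visit`, which visits only flags that were set on the command line. The sketch below assumes this approach; the `settings` type and `applyFlags` function are hypothetical, and the hub may implement overrides differently.

```go
package main

import (
	"flag"
	"fmt"
)

// settings is a hypothetical holder for the three values used in the
// flag examples above.
type settings struct {
	Redis string
	GRPC  string
	Log   string
}

// applyFlags parses args and overwrites only the fields whose flags were
// explicitly passed, leaving file-provided values intact otherwise.
func applyFlags(base settings, args []string) (settings, error) {
	fs := flag.NewFlagSet("hub", flag.ContinueOnError)
	redis := fs.String("redis", "", "Redis address")
	grpc := fs.String("grpc", "", "gRPC listen address")
	logLevel := fs.String("log", "", "log level")
	fs.String("config", "", "path to TOML config file")
	if err := fs.Parse(args); err != nil {
		return base, err
	}
	// Visit calls the function only for flags set on the command line.
	fs.Visit(func(f *flag.Flag) {
		switch f.Name {
		case "redis":
			base.Redis = *redis
		case "grpc":
			base.GRPC = *grpc
		case "log":
			base.Log = *logLevel
		}
	})
	return base, nil
}

func main() {
	// Values assumed to have been loaded from config/hub.toml.
	fromFile := settings{Redis: "localhost:6379", GRPC: ":50051", Log: "info"}
	out, err := applyFlags(fromFile, []string{"-config", "config/hub.toml", "-log", "debug"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", out)
}
```

Here only the log level changes, because `-log` is the only override passed; the Redis and gRPC addresses keep their file values.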