web

package module

Version: v0.0.10 (latest) | Published: Jul 20, 2025 | License: Apache-2.0 | Imports: 12 | Imported by: 3

Repository

github.com/myzie/web

README ¶

Web Utilities

A lightweight Go library for working with web pages, URLs, and HTML documents.

Installation

go get github.com/myzie/web

Quick Start

Text and URL Normalization

package main

import (
    "fmt"
    "github.com/myzie/web"
)

func main() {
    // Clean up messy text
    text := "  Hello\t&nbsp;World!  \n\n  "
    clean := web.NormalizeText(text)
    fmt.Println(clean) // "Hello World!"
    
    // Normalize URLs
    url, _ := web.NormalizeURL("example.com/path/")
    fmt.Println(url.String()) // "https://example.com/path"
}

HTML Document Parsing

package main

import (
    "fmt"
    "github.com/myzie/web"
)

func main() {
    html := `
    <html>
        <head>
            <title>My Blog Post</title>
            <meta name="description" content="An amazing article">
            <meta property="og:image" content="https://example.com/image.jpg">
        </head>
        <body>
            <h1>Welcome</h1>
            <p>This is a great article about web scraping.</p>
            <a href="https://example.com">Visit our site</a>
        </body>
    </html>`
    
    doc, err := web.NewDocument(html)
    if err != nil {
        panic(err)
    }
    
    // Extract metadata
    fmt.Println("Title:", doc.Title())             // "My Blog Post"
    fmt.Println("Description:", doc.Description()) // "An amazing article"
    fmt.Println("Image:", doc.Image())             // "https://example.com/image.jpg"
    fmt.Println("H1:", doc.H1())                   // "Welcome"
    
    // Get all links
    links := doc.Links()
    for _, link := range links {
        fmt.Printf("Link: %s -> %s\n", link.Text, link.URL)
    }
    
    // Get clean paragraphs
    paragraphs := doc.Paragraphs()
    for _, p := range paragraphs {
        fmt.Println("Paragraph:", p)
    }
}

URL and Domain Utilities

package main

import (
    "fmt"
    "net/url"
    "github.com/myzie/web"
)

func main() {
    // Check if URL points to media
    u, _ := url.Parse("https://example.com/video.mp4")
    isMedia := web.IsMediaURL(u)
    fmt.Println("Is media:", isMedia) // true
    
    // Compare domains
    u1, _ := url.Parse("https://blog.example.com")
    u2, _ := url.Parse("https://shop.example.com") 
    related := web.AreRelatedHosts(u1, u2)
    fmt.Println("Related domains:", related) // true
    
    // Smart text chunking
    longText := "This is a very long article. It has multiple sentences. We want to split it intelligently."
    chunks := web.Chunk(longText, 30)
    for i, chunk := range chunks {
        fmt.Printf("Chunk %d: %s\n", i+1, chunk)
    }
}

API Reference

Text Processing

Function	Description
`NormalizeText(text string) string`	Clean and normalize text by removing extra whitespace and non-printable characters
`Chunk(text string, size int) []string`	Split text into chunks with intelligent boundary detection
`EndsWithPunctuation(s string) bool`	Check if a string ends with common punctuation

URL Utilities

Function	Description
`NormalizeURL(value string) (*url.URL, error)`	Parse and normalize a URL string
`IsMediaURL(u *url.URL) bool`	Check if URL points to a media file
`AreSameHost(url1, url2 *url.URL) bool`	Check if two URLs have the same host
`AreRelatedHosts(url1, url2 *url.URL) bool`	Check if two URLs share the same base domain
`SortURLs(urls []*url.URL)`	Sort URLs alphabetically

Document Extraction

Method	Description
`NewDocument(html string) (*Document, error)`	Create a new document parser
`Title() string`	Extract page title (with fallbacks)
`Description() string`	Get meta description
`Image() string`	Get Open Graph image URL
`CanonicalURL() string`	Get canonical URL
`H1() string`	Get first H1 heading
`Language() string`	Get document language
`Author() string`	Get author metadata
`Keywords() []string`	Extract keywords
`PublishedTime() time.Time`	Get publication date
`Links() []*Link`	Get all links with text
`Images() []*Link`	Get all images with alt text
`Paragraphs() []string`	Extract all paragraph text
`Render(options RenderOptions) (string, error)`	Get cleaned/prettified HTML

HTML Cleaning

Clean up HTML before processing with predefined removal patterns:

options := web.RenderOptions{
    ExcludeTags: web.StandardExcludeTags, // Remove scripts, styles, modals, etc.
    Prettify:    true,                    // Format HTML nicely
}

cleanHTML, err := doc.Render(options)

StandardExcludeTags includes common unwanted elements like scripts, styles, cookie banners, modals, and form inputs.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Documentation ¶

Index ¶

Variables
func AreRelatedHosts(url1, url2 *url.URL) bool
func AreSameHost(url1, url2 *url.URL) bool
func Chunk(text string, size int) []string
func EndsWithPunctuation(s string) bool
func FormatHTML(html string) string
func IsMediaURL(u *url.URL) bool
func Markdown(html string) (string, error)
func NormalizeText(text string) string
func NormalizeURL(value string) (*url.URL, error)
func ReadFileItems(filename string) ([]string, error)
func ResolveLink(domain, value string) (string, bool)
func SortURLs(urls []*url.URL)
type Document
- func NewDocument(html string) (*Document, error)
- func (d *Document) Author() string
- func (d *Document) CanonicalURL() string
- func (d *Document) Description() string
- func (d *Document) GoqueryDocument() *goquery.Document
- func (d *Document) H1() string
- func (d *Document) Icon() string
- func (d *Document) Image() string
- func (d *Document) Images() []*Link
- func (d *Document) Keywords() []string
- func (d *Document) Language() string
- func (d *Document) Links() []*Link
- func (d *Document) Meta() []*Meta
- func (d *Document) Metadata() Metadata
- func (d *Document) Paragraphs() []string
- func (d *Document) PublishedTime() time.Time
- func (d *Document) Raw() string
- func (d *Document) Render(options RenderOptions) (string, error)
- func (d *Document) Robots() string
- func (d *Document) Title() string
- func (d *Document) TwitterSite() string
type Link
- func (l *Link) Host() string
type Meta
type Metadata
type RenderOptions
- func (opts RenderOptions) HasFiltering() bool
- func (opts RenderOptions) IsEmpty() bool

Constants ¶

This section is empty.

Variables ¶

var MediaExtensions = map[string]bool{
	".7z":      true,
	".aac":     true,
	".apk":     true,
	".avi":     true,
	".bin":     true,
	".bmp":     true,
	".css":     true,
	".deb":     true,
	".dmg":     true,
	".doc":     true,
	".docx":    true,
	".eot":     true,
	".exe":     true,
	".flac":    true,
	".flv":     true,
	".gif":     true,
	".gz":      true,
	".ico":     true,
	".img":     true,
	".iso":     true,
	".jpeg":    true,
	".jpg":     true,
	".m4a":     true,
	".m4v":     true,
	".mkv":     true,
	".mov":     true,
	".mp3":     true,
	".mp4":     true,
	".msi":     true,
	".ogg":     true,
	".otf":     true,
	".pdf":     true,
	".pkg":     true,
	".png":     true,
	".ppt":     true,
	".pptx":    true,
	".rar":     true,
	".rpm":     true,
	".svg":     true,
	".tar":     true,
	".tif":     true,
	".tiff":    true,
	".torrent": true,
	".ttf":     true,
	".wav":     true,
	".webp":    true,
	".wmv":     true,
	".woff":    true,
	".woff2":   true,
	".xls":     true,
	".xlsx":    true,
	".zip":     true,
}

MediaExtensions is a map of file extensions that are considered media files.

var StandardExcludeTags = []string{
	`[role="dialog"]`,
	`[aria-modal="true"]`,
	`[id*="cookie"]`,
	`[id*="popup"]`,
	`[id*="modal"]`,
	`[class*="modal"]`,
	`[class*="dialog"]`,
	"img[data-cookieconsent]",
	"script",
	"style",
	"hr",
	"noscript",
	"iframe",
	"select",
	"input",
	"button",
	"svg",
	"form",
	"nav",
	"footer",
}

StandardExcludeTags contains the suggested tags to exclude from HTML.
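
Because the list is only a suggestion, callers may want to extend it. The sketch below shows one way to build an extended list without mutating the shared slice. It uses a shortened stand-in slice, `standardExcludeTags`, and a hypothetical helper, `withExtraTags`; neither is part of the package.

```go
package main

import "fmt"

// standardExcludeTags is a shortened stand-in for web.StandardExcludeTags.
var standardExcludeTags = []string{"script", "style", "nav", "footer"}

// withExtraTags returns a new slice holding the standard selectors followed
// by the extra ones. Copying first avoids writing into the shared slice.
func withExtraTags(extra ...string) []string {
	tags := make([]string, 0, len(standardExcludeTags)+len(extra))
	tags = append(tags, standardExcludeTags...)
	return append(tags, extra...)
}

func main() {
	tags := withExtraTags("aside", `[class*="banner"]`)
	fmt.Println(len(tags))                // 6
	fmt.Println(len(standardExcludeTags)) // 4, the original is unchanged
}
```

With the real package, the resulting slice would presumably be passed as the ExcludeTags field of RenderOptions.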

Functions ¶

func AreRelatedHosts ¶

func AreRelatedHosts(url1, url2 *url.URL) bool

AreRelatedHosts checks if two URLs are the same or are related by a common parent domain.

func AreSameHost ¶

func AreSameHost(url1, url2 *url.URL) bool

AreSameHost checks if two URLs have the same host value.

func Chunk ¶

func Chunk(text string, size int) []string

Chunk splits a string into chunks of approximately the given size. Attempts to split on periods or spaces if present, near the split points.
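
To make the boundary behavior concrete, here is a minimal sketch of one possible strategy: cut after the last period inside the size window, otherwise at the last space, otherwise hard-cut at the limit. `chunkSketch` is a hypothetical name, and the library's actual rules (for example, how close to the split point a boundary must be) may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkSketch splits text into pieces of at most size runes, preferring to
// break after the last period in the window, then at the last space.
func chunkSketch(text string, size int) []string {
	if size <= 0 {
		return nil
	}
	runes := []rune(strings.TrimSpace(text))
	var chunks []string
	for len(runes) > size {
		cut := -1
		for i := size - 1; i > 0; i-- {
			if runes[i] == '.' {
				cut = i + 1 // keep the period with the preceding chunk
				break
			}
		}
		if cut < 0 {
			for i := size - 1; i > 0; i-- {
				if runes[i] == ' ' {
					cut = i
					break
				}
			}
		}
		if cut < 0 {
			cut = size // no boundary found: hard cut at the size limit
		}
		if piece := strings.TrimSpace(string(runes[:cut])); piece != "" {
			chunks = append(chunks, piece)
		}
		runes = []rune(strings.TrimSpace(string(runes[cut:])))
	}
	if len(runes) > 0 {
		chunks = append(chunks, string(runes))
	}
	return chunks
}

func main() {
	text := "This is a very long article. It has multiple sentences. We want to split it intelligently."
	for i, c := range chunkSketch(text, 30) {
		fmt.Printf("Chunk %d: %q\n", i+1, c)
	}
}
```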

func EndsWithPunctuation ¶

func EndsWithPunctuation(s string) bool

EndsWithPunctuation checks if a string ends with a punctuation mark.
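
A check like this might be written as follows. This is a sketch only: `endsWithPunctuationSketch` is a hypothetical name, and the set of marks it accepts is an assumption, since the documentation does not list which characters count.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// endsWithPunctuationSketch reports whether the last rune of s, ignoring
// trailing whitespace, is one of a small set of punctuation marks.
// Assumption: the set below is illustrative; the library's set may differ.
func endsWithPunctuationSketch(s string) bool {
	s = strings.TrimSpace(s)
	if s == "" {
		return false
	}
	last, _ := utf8.DecodeLastRuneInString(s)
	return strings.ContainsRune(".!?,;:", last)
}

func main() {
	fmt.Println(endsWithPunctuationSketch("Done.")) // true
	fmt.Println(endsWithPunctuationSketch("Done"))  // false
}
```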

func FormatHTML ¶

func FormatHTML(html string) string

FormatHTML parses the input HTML string, formats it and returns the result.

func IsMediaURL ¶

func IsMediaURL(u *url.URL) bool

IsMediaURL returns true if the URL appears to point to a media file.

func Markdown ¶ added in v0.0.2

func Markdown(html string) (string, error)

Markdown converts HTML to Markdown.

func NormalizeText ¶

func NormalizeText(text string) string

NormalizeText applies transformations to the given text that are commonly helpful for cleaning up text read from a webpage:

- Trim whitespace
- Unescape HTML entities
- Remove non-printable characters
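
The three transformations can be sketched with the standard library alone. `normalizeTextSketch` is a hypothetical name, and mapping every whitespace character to a plain space (so that words separated by tabs or newlines do not fuse together) is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"html"
	"strings"
	"unicode"
)

// normalizeTextSketch unescapes HTML entities, drops non-printable
// characters, and trims surrounding whitespace.
func normalizeTextSketch(text string) string {
	text = html.UnescapeString(text)
	text = strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return ' ' // assumption: keep whitespace as a plain space
		}
		if !unicode.IsPrint(r) {
			return -1 // drop non-printable characters
		}
		return r
	}, text)
	return strings.TrimSpace(text)
}

func main() {
	fmt.Printf("%q\n", normalizeTextSketch("  Hello &amp; goodbye\x00  ")) // "Hello & goodbye"
}
```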

func NormalizeURL ¶

func NormalizeURL(value string) (*url.URL, error)

NormalizeURL parses a URL string and returns a normalized URL. The following transformations are applied:

- Trim whitespace
- Convert http:// to https://
- Add https:// prefix if missing
- Remove any query parameters and URL fragments
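
The listed transformations can be reproduced with net/url. The following is a sketch of those four steps only; `normalizeURLSketch` is a hypothetical name, and the real function may apply further rules not listed here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURLSketch applies the four documented steps in order.
func normalizeURLSketch(value string) (*url.URL, error) {
	value = strings.TrimSpace(value)
	if value == "" {
		return nil, fmt.Errorf("empty URL")
	}
	switch {
	case strings.HasPrefix(value, "http://"):
		// Upgrade plain HTTP to HTTPS.
		value = "https://" + strings.TrimPrefix(value, "http://")
	case !strings.HasPrefix(value, "https://"):
		// No scheme given: assume HTTPS.
		value = "https://" + value
	}
	u, err := url.Parse(value)
	if err != nil {
		return nil, err
	}
	// Drop query parameters and the fragment.
	u.RawQuery = ""
	u.Fragment = ""
	return u, nil
}

func main() {
	u, err := normalizeURLSketch("  http://example.com/docs?page=2#intro ")
	if err != nil {
		panic(err)
	}
	fmt.Println(u.String()) // https://example.com/docs
}
```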

func ReadFileItems ¶ added in v0.0.7

func ReadFileItems(filename string) ([]string, error)

func ResolveLink ¶ added in v0.0.7

func ResolveLink(domain, value string) (string, bool)

func SortURLs ¶

func SortURLs(urls []*url.URL)

SortURLs sorts a slice of URLs by their string representation.

Types ¶

type Document ¶

type Document struct {
	// contains filtered or unexported fields
}

Document helps parse and extract information from an HTML document.

func NewDocument ¶

func NewDocument(html string) (*Document, error)

NewDocument creates a new Document from an HTML string.

func (*Document) Author ¶

func (d *Document) Author() string

Author returns the author meta tag of the document.

func (*Document) CanonicalURL ¶ added in v0.0.4

func (d *Document) CanonicalURL() string

CanonicalURL returns the canonical URL of the document.

func (*Document) Description ¶

func (d *Document) Description() string

Description returns the description meta tag of the document.

func (*Document) GoqueryDocument ¶

func (d *Document) GoqueryDocument() *goquery.Document

GoqueryDocument returns the underlying goquery document.

func (*Document) H1 ¶

func (d *Document) H1() string

H1 returns the first H1 element of the document.

func (*Document) Icon ¶

func (d *Document) Icon() string

Icon returns the icon link of the document.

func (*Document) Image ¶

func (d *Document) Image() string

Image returns the image meta tag of the document.

func (*Document) Images ¶

func (d *Document) Images() []*Link

Images returns the images on the document.

func (*Document) Keywords ¶

func (d *Document) Keywords() []string

Keywords returns the keywords meta tag of the document.

func (*Document) Language ¶ added in v0.0.4

func (d *Document) Language() string

Language returns the language of the document.

func (*Document) Links ¶

func (d *Document) Links() []*Link

Links returns the links on the document.

func (*Document) Meta ¶

func (d *Document) Meta() []*Meta

Meta returns the meta tags of the document.

func (*Document) Metadata ¶ added in v0.0.4

func (d *Document) Metadata() Metadata

Metadata returns the metadata summary for the document.

func (*Document) Paragraphs ¶

func (d *Document) Paragraphs() []string

Paragraphs returns the paragraphs on the document.

func (*Document) PublishedTime ¶

func (d *Document) PublishedTime() time.Time

PublishedTime returns the published time meta tag of the document.

func (*Document) Raw ¶

func (d *Document) Raw() string

Raw returns the raw HTML text of the document.

func (*Document) Render ¶ added in v0.0.4

func (d *Document) Render(options RenderOptions) (string, error)

Render returns the document as HTML, with optional transformations applied.

func (*Document) Robots ¶

func (d *Document) Robots() string

Robots returns the robots meta tag of the document.

func (*Document) Title ¶

func (d *Document) Title() string

Title returns the title of the document.

func (*Document) TwitterSite ¶

func (d *Document) TwitterSite() string

TwitterSite returns the twitter site meta tag of the document.

type Link ¶

type Link struct {
	URL  string `json:"url"`
	Text string `json:"text,omitempty"`
}

Link represents a link on a page.

func (*Link) Host ¶

func (l *Link) Host() string

Host returns the host of the link.

type Meta ¶

type Meta struct {
	Tag      string `json:"tag"`
	Name     string `json:"name,omitempty"`
	Property string `json:"property,omitempty"`
	Content  string `json:"content,omitempty"`
	Charset  string `json:"charset,omitempty"`
}

Meta represents a meta tag on a page.

type Metadata ¶ added in v0.0.4

type Metadata struct {
	Title         string   `json:"title,omitempty"`
	Description   string   `json:"description,omitempty"`
	Language      string   `json:"language,omitempty"`
	Author        string   `json:"author,omitempty"`
	CanonicalURL  string   `json:"canonical_url,omitempty"`
	Heading       string   `json:"heading,omitempty"`
	Robots        string   `json:"robots,omitempty"`
	Image         string   `json:"image,omitempty"`
	Icon          string   `json:"icon,omitempty"`
	PublishedTime string   `json:"published_time,omitempty"`
	Keywords      []string `json:"keywords,omitempty"`
	Tags          []*Meta  `json:"tags,omitempty"`
}

Metadata conveys high level information about a page.

type RenderOptions ¶ added in v0.0.4

type RenderOptions struct {
	ExcludeTags     []string
	OnlyMainContent bool
	Prettify        bool
}

RenderOptions contains HTML rendering options.

func (RenderOptions) HasFiltering ¶ added in v0.0.4

func (opts RenderOptions) HasFiltering() bool

HasFiltering returns true if any filtering is requested.

func (RenderOptions) IsEmpty ¶ added in v0.0.4

func (opts RenderOptions) IsEmpty() bool

IsEmpty returns true if no transformations are requested.
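
One plausible reading of these two predicates is sketched below with a local mirror of the struct. The names `renderOptions`, `hasFiltering`, and `isEmpty` are stand-ins, and the choice of which fields count as "filtering" (ExcludeTags and OnlyMainContent, but not Prettify) is an assumption, not confirmed by the documentation.

```go
package main

import "fmt"

// renderOptions mirrors the fields of RenderOptions shown above.
type renderOptions struct {
	ExcludeTags     []string
	OnlyMainContent bool
	Prettify        bool
}

// hasFiltering reports whether any content would be removed.
// Assumption: Prettify only reformats and so is not treated as filtering.
func (o renderOptions) hasFiltering() bool {
	return len(o.ExcludeTags) > 0 || o.OnlyMainContent
}

// isEmpty reports whether no transformation at all is requested.
func (o renderOptions) isEmpty() bool {
	return !o.hasFiltering() && !o.Prettify
}

func main() {
	var none renderOptions
	pretty := renderOptions{Prettify: true}
	filtered := renderOptions{ExcludeTags: []string{"script"}}

	fmt.Println(none.isEmpty(), none.hasFiltering())         // true false
	fmt.Println(pretty.isEmpty(), pretty.hasFiltering())     // false false
	fmt.Println(filtered.isEmpty(), filtered.hasFiltering()) // false true
}
```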

Directories ¶

Path	Synopsis
cache
cmd
crawl
crawler
errors
fetch
