web

package module
v0.0.10 Latest
Published: Jul 20, 2025 License: Apache-2.0 Imports: 12 Imported by: 3

README

Web Utilities

A lightweight Go library for working with web pages, URLs, and HTML documents.

Installation

go get github.com/myzie/web

Quick Start

Text Normalization

package main

import (
    "fmt"
    "github.com/myzie/web"
)

func main() {
    // Clean up messy text
    text := "  Hello\t World!  \n\n  "
    clean := web.NormalizeText(text)
    fmt.Println(clean) // "Hello World!"
    
    // Normalize URLs
    url, _ := web.NormalizeURL("example.com/path/")
    fmt.Println(url.String()) // "https://example.com/path"
}

HTML Document Parsing

package main

import (
    "fmt"
    "github.com/myzie/web"
)

func main() {
    html := `
    <html>
        <head>
            <title>My Blog Post</title>
            <meta name="description" content="An amazing article">
            <meta property="og:image" content="https://example.com/image.jpg">
        </head>
        <body>
            <h1>Welcome</h1>
            <p>This is a great article about web scraping.</p>
            <a href="https://example.com">Visit our site</a>
        </body>
    </html>`
    
    doc, err := web.NewDocument(html)
    if err != nil {
        panic(err)
    }
    
    // Extract metadata
    fmt.Println("Title:", doc.Title())             // "My Blog Post"
    fmt.Println("Description:", doc.Description()) // "An amazing article"
    fmt.Println("Image:", doc.Image())             // "https://example.com/image.jpg"
    fmt.Println("H1:", doc.H1())                   // "Welcome"
    
    // Get all links
    links := doc.Links()
    for _, link := range links {
        fmt.Printf("Link: %s -> %s\n", link.Text, link.URL)
    }
    
    // Get clean paragraphs
    paragraphs := doc.Paragraphs()
    for _, p := range paragraphs {
        fmt.Println("Paragraph:", p)
    }
}

URL and Domain Utilities

package main

import (
    "fmt"
    "net/url"
    "github.com/myzie/web"
)

func main() {
    // Check if URL points to media
    u, _ := url.Parse("https://example.com/video.mp4")
    isMedia := web.IsMediaURL(u)
    fmt.Println("Is media:", isMedia) // true
    
    // Compare domains
    u1, _ := url.Parse("https://blog.example.com")
    u2, _ := url.Parse("https://shop.example.com") 
    related := web.AreRelatedHosts(u1, u2)
    fmt.Println("Related domains:", related) // true
    
    // Smart text chunking
    longText := "This is a very long article. It has multiple sentences. We want to split it intelligently."
    chunks := web.Chunk(longText, 30)
    for i, chunk := range chunks {
        fmt.Printf("Chunk %d: %s\n", i+1, chunk)
    }
}

API Reference

Text Processing

| Function | Description |
| --- | --- |
| NormalizeText(text string) string | Clean and normalize text by removing extra whitespace and non-printable characters |
| Chunk(text string, size int) []string | Split text into chunks with intelligent boundary detection |
| EndsWithPunctuation(s string) bool | Check if a string ends with common punctuation |

URL Utilities

| Function | Description |
| --- | --- |
| NormalizeURL(value string) (*url.URL, error) | Parse and normalize a URL string |
| IsMediaURL(u *url.URL) bool | Check if URL points to a media file |
| AreSameHost(url1, url2 *url.URL) bool | Check if two URLs have the same host |
| AreRelatedHosts(url1, url2 *url.URL) bool | Check if two URLs share the same base domain |
| SortURLs(urls []*url.URL) | Sort URLs alphabetically |

Document Extraction

| Method | Description |
| --- | --- |
| NewDocument(html string) (*Document, error) | Create a new document parser |
| Title() string | Extract page title (with fallbacks) |
| Description() string | Get meta description |
| Image() string | Get Open Graph image URL |
| CanonicalURL() string | Get canonical URL |
| H1() string | Get first H1 heading |
| Language() string | Get document language |
| Author() string | Get author metadata |
| Keywords() []string | Extract keywords |
| PublishedTime() time.Time | Get publication date |
| Links() []*Link | Get all links with text |
| Images() []*Link | Get all images with alt text |
| Paragraphs() []string | Extract all paragraph text |
| Render(options RenderOptions) (string, error) | Get cleaned/prettified HTML |

HTML Cleaning

Clean up HTML before processing by excluding a predefined set of elements:

options := web.RenderOptions{
    ExcludeTags: web.StandardExcludeTags, // Remove scripts, styles, modals, etc.
    Prettify:    true,                    // Format HTML nicely
}

cleanHTML, err := doc.Render(options)

StandardExcludeTags covers common unwanted elements such as scripts, styles, cookie banners, modals, and form inputs.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Documentation

Variables

var MediaExtensions = map[string]bool{
	".7z":      true,
	".aac":     true,
	".apk":     true,
	".avi":     true,
	".bin":     true,
	".bmp":     true,
	".css":     true,
	".deb":     true,
	".dmg":     true,
	".doc":     true,
	".docx":    true,
	".eot":     true,
	".exe":     true,
	".flac":    true,
	".flv":     true,
	".gif":     true,
	".gz":      true,
	".ico":     true,
	".img":     true,
	".iso":     true,
	".jpeg":    true,
	".jpg":     true,
	".m4a":     true,
	".m4v":     true,
	".mkv":     true,
	".mov":     true,
	".mp3":     true,
	".mp4":     true,
	".msi":     true,
	".ogg":     true,
	".otf":     true,
	".pdf":     true,
	".pkg":     true,
	".png":     true,
	".ppt":     true,
	".pptx":    true,
	".rar":     true,
	".rpm":     true,
	".svg":     true,
	".tar":     true,
	".tif":     true,
	".tiff":    true,
	".torrent": true,
	".ttf":     true,
	".wav":     true,
	".webp":    true,
	".wmv":     true,
	".woff":    true,
	".woff2":   true,
	".xls":     true,
	".xlsx":    true,
	".zip":     true,
}

MediaExtensions is a map of file extensions that are considered media files.
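
A map like this is typically consulted by looking up the lower-cased extension of the URL path. The sketch below illustrates that idea using only the standard library; it is an assumption about the approach, not the package's actual code. The names `mediaExtensions`, `isMediaURL`, and `mustParse` are hypothetical, and the map holds only a few of the entries listed above.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// mediaExtensions is a small, illustrative subset of the map above.
var mediaExtensions = map[string]bool{
	".jpg": true,
	".mp4": true,
	".pdf": true,
	".zip": true,
}

// isMediaURL reports whether the URL path ends in a known extension.
// Hypothetical helper: the real IsMediaURL may behave differently.
func isMediaURL(u *url.URL) bool {
	ext := strings.ToLower(path.Ext(u.Path))
	return mediaExtensions[ext]
}

// mustParse is a hypothetical convenience wrapper for the examples.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	fmt.Println(isMediaURL(mustParse("https://example.com/video.MP4")))  // true
	fmt.Println(isMediaURL(mustParse("https://example.com/article")))    // false
	fmt.Println(isMediaURL(mustParse("https://example.com/a.pdf?dl=1"))) // true
}
```

Because the lookup uses only `u.Path`, query strings and fragments do not affect the result in this sketch.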

var StandardExcludeTags = []string{
	`[role="dialog"]`,
	`[aria-modal="true"]`,
	`[id*="cookie"]`,
	`[id*="popup"]`,
	`[id*="modal"]`,
	`[class*="modal"]`,
	`[class*="dialog"]`,
	"img[data-cookieconsent]",
	"script",
	"style",
	"hr",
	"noscript",
	"iframe",
	"select",
	"input",
	"button",
	"svg",
	"form",
	"nav",
	"footer",
}

StandardExcludeTags contains the suggested tags to exclude from HTML.

Functions

func AreRelatedHosts

func AreRelatedHosts(url1, url2 *url.URL) bool

AreRelatedHosts checks if two URLs are the same or are related by a common parent domain.
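
To make "related by a common parent domain" concrete, here is a minimal sketch that compares the last two labels of each hostname. This is a deliberately naive assumption: it mishandles multi-label public suffixes such as `co.uk`, and the package may use a different rule. `baseDomain`, `relatedHosts`, and `mustParse` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// baseDomain returns the last two labels of a hostname, for example
// "blog.example.com" -> "example.com". Naive assumption: it ignores
// multi-label public suffixes such as "co.uk".
func baseDomain(host string) string {
	labels := strings.Split(strings.ToLower(host), ".")
	if len(labels) <= 2 {
		return strings.Join(labels, ".")
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// relatedHosts is a hypothetical stand-in for AreRelatedHosts.
func relatedHosts(a, b *url.URL) bool {
	return baseDomain(a.Hostname()) == baseDomain(b.Hostname())
}

// mustParse is a hypothetical convenience wrapper for the examples.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	blog := mustParse("https://blog.example.com")
	shop := mustParse("https://shop.example.com")
	other := mustParse("https://example.org")
	fmt.Println(relatedHosts(blog, shop))  // true
	fmt.Println(relatedHosts(blog, other)) // false
}
```

A more robust variant would derive the registrable domain from the public suffix list instead of counting labels.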

func AreSameHost

func AreSameHost(url1, url2 *url.URL) bool

AreSameHost checks if two URLs have the same host value.

func Chunk

func Chunk(text string, size int) []string

Chunk splits a string into chunks of approximately the given size. It attempts to split on a period or space near each split point, if one is present.
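
The boundary-seeking behavior can be sketched roughly as follows. This is one possible reading of the description, not the package's algorithm: the hypothetical `chunk` function looks backward from each split point for a period, then for a space, and only accepts a boundary in the second half of the window (an arbitrary threshold chosen for this example).

```go
package main

import (
	"fmt"
	"strings"
)

// lastIndex returns the index of the last occurrence of r in rs, or -1.
func lastIndex(rs []rune, r rune) int {
	for i := len(rs) - 1; i >= 0; i-- {
		if rs[i] == r {
			return i
		}
	}
	return -1
}

// chunk is a hypothetical stand-in for Chunk. It cuts after a period or
// space found in the second half of each window, else does a hard split.
func chunk(text string, size int) []string {
	if size <= 0 {
		return nil
	}
	var chunks []string
	rest := []rune(text)
	for len(rest) > size {
		window := rest[:size]
		cut := size // fall back to a hard split
		if i := lastIndex(window, '.'); i >= size/2 {
			cut = i + 1 // keep the period with its sentence
		} else if i := lastIndex(window, ' '); i >= size/2 {
			cut = i + 1
		}
		if piece := strings.TrimSpace(string(rest[:cut])); piece != "" {
			chunks = append(chunks, piece)
		}
		rest = rest[cut:]
	}
	if piece := strings.TrimSpace(string(rest)); piece != "" {
		chunks = append(chunks, piece)
	}
	return chunks
}

func main() {
	text := "This is a very long article. It has multiple sentences. We want to split it intelligently."
	for i, c := range chunk(text, 30) {
		fmt.Printf("Chunk %d: %q\n", i+1, c)
	}
	// Chunk 1: "This is a very long article."
	// Chunk 2: "It has multiple sentences."
	// Chunk 3: "We want to split it"
	// Chunk 4: "intelligently."
}
```

Working on runes rather than bytes keeps multi-byte characters intact when a hard split is unavoidable.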

func EndsWithPunctuation

func EndsWithPunctuation(s string) bool

EndsWithPunctuation checks if a string ends with a punctuation mark.
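
As a rough illustration, such a check can be written by decoding the final rune and testing its Unicode category. The sketch assumes "punctuation mark" means Unicode category P; the package may instead use a fixed set of common characters. `endsWithPunctuation` is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// endsWithPunctuation reports whether the last rune of s is punctuation.
// Assumption: Unicode category P counts as punctuation here.
func endsWithPunctuation(s string) bool {
	r, size := utf8.DecodeLastRuneInString(s)
	if size == 0 {
		return false // empty string
	}
	return unicode.IsPunct(r)
}

func main() {
	fmt.Println(endsWithPunctuation("Done."))  // true
	fmt.Println(endsWithPunctuation("Hello!")) // true
	fmt.Println(endsWithPunctuation("Hello"))  // false
	fmt.Println(endsWithPunctuation(""))       // false
}
```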

func FormatHTML

func FormatHTML(html string) string

FormatHTML parses the input HTML string, formats it and returns the result.

func IsMediaURL

func IsMediaURL(u *url.URL) bool

IsMediaURL returns true if the URL appears to point to a media file.

func Markdown added in v0.0.2

func Markdown(html string) (string, error)

Markdown converts HTML to Markdown.

func NormalizeText

func NormalizeText(text string) string

NormalizeText applies transformations to the given text that are commonly helpful for cleaning up text read from a webpage:

- Trim whitespace
- Unescape HTML entities
- Remove non-printable characters
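
The three transformations listed above can be approximated with the standard library. This is a sketch under assumptions, not the package's implementation: `normalizeText` is a hypothetical name, and "non-printable" is taken to mean any rune for which `unicode.IsPrint` returns false, which includes tabs and newlines.

```go
package main

import (
	"fmt"
	"html"
	"strings"
	"unicode"
)

// normalizeText is a hypothetical stand-in for NormalizeText.
func normalizeText(text string) string {
	// Unescape HTML entities such as &amp;.
	text = html.UnescapeString(text)
	// Drop runes that unicode.IsPrint rejects (tabs, newlines, NUL, ...).
	text = strings.Map(func(r rune) rune {
		if !unicode.IsPrint(r) {
			return -1
		}
		return r
	}, text)
	// Trim leading and trailing whitespace.
	return strings.TrimSpace(text)
}

func main() {
	fmt.Printf("%q\n", normalizeText("  Hello\t World!  \n\n  ")) // "Hello World!"
	fmt.Printf("%q\n", normalizeText("Fish &amp; Chips\x00"))     // "Fish & Chips"
}
```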

func NormalizeURL

func NormalizeURL(value string) (*url.URL, error)

NormalizeURL parses a URL string and returns a normalized URL. The following transformations are applied:

- Trim whitespace
- Convert http:// to https://
- Add https:// prefix if missing
- Remove any query parameters and URL fragments
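
The listed steps map closely onto `net/url`. The following sketch applies exactly those four transformations and nothing else; `normalizeURL` is a hypothetical name, the error for empty input is an assumption, and the real function may perform additional cleanup that is not documented here.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical stand-in for NormalizeURL.
func normalizeURL(value string) (*url.URL, error) {
	value = strings.TrimSpace(value)
	if value == "" {
		return nil, errors.New("empty URL") // assumption: reject empty input
	}
	switch {
	case strings.HasPrefix(value, "http://"):
		value = "https://" + strings.TrimPrefix(value, "http://")
	case !strings.HasPrefix(value, "https://"):
		value = "https://" + value
	}
	u, err := url.Parse(value)
	if err != nil {
		return nil, err
	}
	// Remove query parameters and fragments.
	u.RawQuery = ""
	u.ForceQuery = false
	u.Fragment = ""
	u.RawFragment = ""
	return u, nil
}

func main() {
	for _, raw := range []string{" http://example.com/a?x=1#top ", "example.com/path"} {
		u, err := normalizeURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(u.String())
	}
	// https://example.com/a
	// https://example.com/path
}
```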

func ReadFileItems added in v0.0.7

func ReadFileItems(filename string) ([]string, error)

func ResolveLink

func ResolveLink(domain, value string) (string, bool)

func SortURLs

func SortURLs(urls []*url.URL)

SortURLs sorts a slice of URLs by their string representation.
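
Sorting by string representation amounts to an in-place sort keyed on `String()`. A minimal sketch, with `sortURLs` and `parseAll` as hypothetical names:

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// sortURLs sorts the slice in place by each URL's string form.
// Hypothetical stand-in for SortURLs.
func sortURLs(urls []*url.URL) {
	sort.Slice(urls, func(i, j int) bool {
		return urls[i].String() < urls[j].String()
	})
}

// parseAll is a hypothetical helper that parses each input or panics.
func parseAll(raws ...string) []*url.URL {
	urls := make([]*url.URL, 0, len(raws))
	for _, raw := range raws {
		u, err := url.Parse(raw)
		if err != nil {
			panic(err)
		}
		urls = append(urls, u)
	}
	return urls
}

func main() {
	urls := parseAll(
		"https://b.example.com",
		"https://a.example.com/z",
		"https://a.example.com/a",
	)
	sortURLs(urls)
	for _, u := range urls {
		fmt.Println(u)
	}
	// https://a.example.com/a
	// https://a.example.com/z
	// https://b.example.com
}
```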

Types

type Document

type Document struct {
	// contains filtered or unexported fields
}

Document helps parse and extract information from an HTML document.

func NewDocument

func NewDocument(html string) (*Document, error)

NewDocument creates a new Document from an HTML string.

func (*Document) Author

func (d *Document) Author() string

Author returns the author meta tag of the document.

func (*Document) CanonicalURL added in v0.0.4

func (d *Document) CanonicalURL() string

CanonicalURL returns the canonical URL of the document.

func (*Document) Description

func (d *Document) Description() string

Description returns the description meta tag of the document.

func (*Document) GoqueryDocument

func (d *Document) GoqueryDocument() *goquery.Document

GoqueryDocument returns the underlying goquery document.

func (*Document) H1

func (d *Document) H1() string

H1 returns the first H1 element of the document.

func (*Document) Icon

func (d *Document) Icon() string

Icon returns the icon link of the document.

func (*Document) Image

func (d *Document) Image() string

Image returns the image meta tag of the document.

func (*Document) Images

func (d *Document) Images() []*Link

Images returns the images on the document.

func (*Document) Keywords

func (d *Document) Keywords() []string

Keywords returns the keywords meta tag of the document.

func (*Document) Language added in v0.0.4

func (d *Document) Language() string

Language returns the language of the document.

func (*Document) Links

func (d *Document) Links() []*Link

Links returns the links on the document.

func (*Document) Meta

func (d *Document) Meta() []*Meta

Meta returns the meta tags of the document.

func (*Document) Metadata added in v0.0.4

func (d *Document) Metadata() Metadata

Metadata returns the metadata summary for the document.

func (*Document) Paragraphs

func (d *Document) Paragraphs() []string

Paragraphs returns the paragraphs on the document.

func (*Document) PublishedTime

func (d *Document) PublishedTime() time.Time

PublishedTime returns the published time meta tag of the document.

func (*Document) Raw

func (d *Document) Raw() string

Raw returns the raw HTML text of the document.

func (*Document) Render added in v0.0.4

func (d *Document) Render(options RenderOptions) (string, error)

Render the document as HTML, with optional transformations.

func (*Document) Robots

func (d *Document) Robots() string

Robots returns the robots meta tag of the document.

func (*Document) Title

func (d *Document) Title() string

Title returns the title of the document.

func (*Document) TwitterSite

func (d *Document) TwitterSite() string

TwitterSite returns the twitter site meta tag of the document.

type Link

type Link struct {
	URL  string `json:"url"`
	Text string `json:"text,omitempty"`
}

Link represents a link on a page.

func (*Link) Host

func (l *Link) Host() string

Host returns the host of the link.
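
A method like this can be sketched by parsing the stored URL and returning its host component. The local `link` type mirrors the struct above for illustration only; whether the real method includes a port, or what it returns for an unparsable URL, is not stated in the documentation, so both choices below are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// link is a local copy of the Link struct shown above.
type link struct {
	URL  string `json:"url"`
	Text string `json:"text,omitempty"`
}

// host returns the host part of the link's URL.
// Assumption: an unparsable URL yields an empty string.
func (l *link) host() string {
	u, err := url.Parse(l.URL)
	if err != nil {
		return ""
	}
	return u.Host
}

func main() {
	l := &link{URL: "https://example.com/about", Text: "About"}
	fmt.Println(l.host()) // example.com
}
```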

type Meta

type Meta struct {
	Tag      string `json:"tag"`
	Name     string `json:"name,omitempty"`
	Property string `json:"property,omitempty"`
	Content  string `json:"content,omitempty"`
	Charset  string `json:"charset,omitempty"`
}

Meta represents a meta tag on a page.

type Metadata added in v0.0.4

type Metadata struct {
	Title         string   `json:"title,omitempty"`
	Description   string   `json:"description,omitempty"`
	Language      string   `json:"language,omitempty"`
	Author        string   `json:"author,omitempty"`
	CanonicalURL  string   `json:"canonical_url,omitempty"`
	Heading       string   `json:"heading,omitempty"`
	Robots        string   `json:"robots,omitempty"`
	Image         string   `json:"image,omitempty"`
	Icon          string   `json:"icon,omitempty"`
	PublishedTime string   `json:"published_time,omitempty"`
	Keywords      []string `json:"keywords,omitempty"`
	Tags          []*Meta  `json:"tags,omitempty"`
}

Metadata conveys high level information about a page.
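
Because every field carries an `omitempty` JSON tag, unset fields drop out of the encoded output. The sketch below demonstrates this with a local `metadata` type that copies a few of the fields above; `metadata` and `encode` are hypothetical names used only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadata mirrors a subset of the Metadata struct shown above.
type metadata struct {
	Title       string   `json:"title,omitempty"`
	Description string   `json:"description,omitempty"`
	Language    string   `json:"language,omitempty"`
	Keywords    []string `json:"keywords,omitempty"`
}

// encode marshals m to a JSON string, panicking on error.
func encode(m metadata) string {
	b, err := json.Marshal(m)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	m := metadata{Title: "My Blog Post", Keywords: []string{"go", "web"}}
	fmt.Println(encode(m))
	// {"title":"My Blog Post","keywords":["go","web"]}
}
```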

type RenderOptions added in v0.0.4

type RenderOptions struct {
	ExcludeTags     []string
	OnlyMainContent bool
	Prettify        bool
}

RenderOptions contains HTML rendering options.

func (RenderOptions) HasFiltering added in v0.0.4

func (opts RenderOptions) HasFiltering() bool

HasFiltering returns true if any filtering is requested.

func (RenderOptions) IsEmpty added in v0.0.4

func (opts RenderOptions) IsEmpty() bool

IsEmpty returns true if no transformations are requested.
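
One plausible reading of these two predicates, given the three fields of RenderOptions, is sketched below. The exact definitions are assumptions: the documentation does not say which fields count as "filtering", so this example treats ExcludeTags and OnlyMainContent as filters and Prettify as a non-filtering transformation. The local `renderOptions` type and its methods are hypothetical.

```go
package main

import "fmt"

// renderOptions is a local copy of the RenderOptions struct shown above.
type renderOptions struct {
	ExcludeTags     []string
	OnlyMainContent bool
	Prettify        bool
}

// hasFiltering reports whether any content would be removed.
// Assumption: Prettify alone does not count as filtering.
func (o renderOptions) hasFiltering() bool {
	return len(o.ExcludeTags) > 0 || o.OnlyMainContent
}

// isEmpty reports whether no transformation at all is requested.
func (o renderOptions) isEmpty() bool {
	return !o.hasFiltering() && !o.Prettify
}

func main() {
	none := renderOptions{}
	pretty := renderOptions{Prettify: true}
	filtered := renderOptions{ExcludeTags: []string{"nav", "footer"}}

	fmt.Println(none.isEmpty(), none.hasFiltering())         // true false
	fmt.Println(pretty.isEmpty(), pretty.hasFiltering())     // false false
	fmt.Println(filtered.isEmpty(), filtered.hasFiltering()) // false true
}
```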

Directories

Path Synopsis
cmd
