Documentation
¶
Index ¶
- Variables
- func AreRelatedHosts(url1, url2 *url.URL) bool
- func AreSameHost(url1, url2 *url.URL) bool
- func Chunk(text string, size int) []string
- func EndsWithPunctuation(s string) bool
- func FormatHTML(html string) string
- func IsMediaURL(u *url.URL) bool
- func Markdown(html string) (string, error)
- func NormalizeText(text string) string
- func NormalizeURL(value string) (*url.URL, error)
- func ReadFileItems(filename string) ([]string, error)
- func ResolveLink(domain, value string) (string, bool)
- func SortURLs(urls []*url.URL)
- type Document
- func (d *Document) Author() string
- func (d *Document) CanonicalURL() string
- func (d *Document) Description() string
- func (d *Document) GoqueryDocument() *goquery.Document
- func (d *Document) H1() string
- func (d *Document) Icon() string
- func (d *Document) Image() string
- func (d *Document) Images() []*Link
- func (d *Document) Keywords() []string
- func (d *Document) Language() string
- func (d *Document) Links() []*Link
- func (d *Document) Meta() []*Meta
- func (d *Document) Metadata() Metadata
- func (d *Document) Paragraphs() []string
- func (d *Document) PublishedTime() time.Time
- func (d *Document) Raw() string
- func (d *Document) Render(options RenderOptions) (string, error)
- func (d *Document) Robots() string
- func (d *Document) Title() string
- func (d *Document) TwitterSite() string
- type Link
- type Meta
- type Metadata
- type RenderOptions
Constants ¶
This section is empty.
Variables ¶
var MediaExtensions = map[string]bool{ ".7z": true, ".aac": true, ".apk": true, ".avi": true, ".bin": true, ".bmp": true, ".css": true, ".deb": true, ".dmg": true, ".doc": true, ".docx": true, ".eot": true, ".exe": true, ".flac": true, ".flv": true, ".gif": true, ".gz": true, ".ico": true, ".img": true, ".iso": true, ".jpeg": true, ".jpg": true, ".m4a": true, ".m4v": true, ".mkv": true, ".mov": true, ".mp3": true, ".mp4": true, ".msi": true, ".ogg": true, ".otf": true, ".pdf": true, ".pkg": true, ".png": true, ".ppt": true, ".pptx": true, ".rar": true, ".rpm": true, ".svg": true, ".tar": true, ".tif": true, ".tiff": true, ".torrent": true, ".ttf": true, ".wav": true, ".webp": true, ".wmv": true, ".woff": true, ".woff2": true, ".xls": true, ".xlsx": true, ".zip": true, }
MediaExtensions is a map of file extensions that are considered media files.
var StandardExcludeTags = []string{
`[role="dialog"]`,
`[aria-modal="true"]`,
`[id*="cookie"]`,
`[id*="popup"]`,
`[id*="modal"]`,
`[class*="modal"]`,
`[class*="dialog"]`,
"img[data-cookieconsent]",
"script",
"style",
"hr",
"noscript",
"iframe",
"select",
"input",
"button",
"svg",
"form",
"nav",
"footer",
}
StandardExcludeTags contains the suggested tags to exclude from HTML.
Functions ¶
func AreRelatedHosts ¶
AreRelatedHosts checks if two URLs are the same or are related by a common parent domain.
func AreSameHost ¶
AreSameHost checks if two URLs have the same host value.
func Chunk ¶
Chunk splits a string into chunks of approximately the given size. Attempts to split on periods or spaces if present, near the split points.
func EndsWithPunctuation ¶
EndsWithPunctuation checks if a string ends with a punctuation mark.
func FormatHTML ¶
FormatHTML parses the input HTML string, formats it and returns the result.
func IsMediaURL ¶
IsMediaURL returns true if the URL appears to point to a media file.
func NormalizeText ¶
NormalizeText applies transformations to the given text that are commonly helpful for cleaning up text read from a webpage. - Trim whitespace - Unescape HTML entities - Remove non-printable characters
func NormalizeURL ¶
NormalizeURL parses a URL string and returns a normalized URL. The following transformations are applied: - Trim whitespace - Convert http:// to https:// - Add https:// prefix if missing - Remove any query parameters and URL fragments
func ReadFileItems ¶ added in v0.0.7
func ResolveLink ¶ added in v0.0.7
Types ¶
type Document ¶
type Document struct {
// contains filtered or unexported fields
}
Document helps parse and extract information from an HTML document.
func NewDocument ¶
NewDocument creates a new Document from an HTML string.
func (*Document) CanonicalURL ¶ added in v0.0.4
CanonicalURL returns the canonical URL of the document.
func (*Document) Description ¶
Description returns the description meta tag of the document.
func (*Document) GoqueryDocument ¶
GoqueryDocument returns the underlying goquery document.
func (*Document) Paragraphs ¶
Paragraphs returns the paragraphs on the document.
func (*Document) PublishedTime ¶
PublishedTime returns the published time meta tag of the document.
func (*Document) Render ¶ added in v0.0.4
func (d *Document) Render(options RenderOptions) (string, error)
Render the document as HTML, with optional transformations.
func (*Document) TwitterSite ¶
TwitterSite returns the twitter site meta tag of the document.
type Meta ¶
type Meta struct { Tag string `json:"tag"` Name string `json:"name,omitempty"` Property string `json:"property,omitempty"` Content string `json:"content,omitempty"` Charset string `json:"charset,omitempty"` }
Meta represents a meta tag on a page.
type Metadata ¶ added in v0.0.4
type Metadata struct { Title string `json:"title,omitempty"` Description string `json:"description,omitempty"` Language string `json:"language,omitempty"` Author string `json:"author,omitempty"` CanonicalURL string `json:"canonical_url,omitempty"` Heading string `json:"heading,omitempty"` Robots string `json:"robots,omitempty"` Image string `json:"image,omitempty"` Icon string `json:"icon,omitempty"` PublishedTime string `json:"published_time,omitempty"` Keywords []string `json:"keywords,omitempty"` Tags []*Meta `json:"tags,omitempty"` }
Metadata conveys high level information about a page.
type RenderOptions ¶ added in v0.0.4
RenderOptions contains HTML rendering options.
func (RenderOptions) HasFiltering ¶ added in v0.0.4
func (opts RenderOptions) HasFiltering() bool
HasFiltering returns true if any filtering is requested.
func (RenderOptions) IsEmpty ¶ added in v0.0.4
func (opts RenderOptions) IsEmpty() bool
IsEmpty returns true if no transformations are requested.