Documentation
Overview ¶
Example ¶
// URL to extract contents (title, description, images, ...)
url := "https://en.wikipedia.org/wiki/Lego"
// Default option
opt := readability.NewOption()
// You can modify some option values if needed.
opt.ImageRequestTimeout = 3000 // ms
content, err := readability.Extract(url, opt)
if err != nil {
	log.Fatal(err)
}
log.Println(content.Title)
log.Println(content.Description)
log.Println(content.Images)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type Content ¶
Content contains the primary readable content of a webpage.
func ExtractFromDocument ¶
ExtractFromDocument returns the extracted Content on success and an error otherwise. reqURL is required for converting relative image paths to absolute URLs.
Use this function if you already have a *goquery.Document from an earlier HTTP request. Otherwise, use Extract(reqURL, opt).
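To illustrate why reqURL is needed, here is a minimal sketch of resolving an image src against the request URL with the standard net/url package. resolveImageURL is a hypothetical helper, not part of this package, and the package's own resolution logic may differ.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// resolveImageURL is a hypothetical helper (not part of the package API).
// It resolves an img src value against the URL the page was requested from.
func resolveImageURL(reqURL, src string) (string, error) {
	base, err := url.Parse(reqURL)
	if err != nil {
		return "", fmt.Errorf("parse request URL: %w", err)
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", fmt.Errorf("parse image src: %w", err)
	}
	// ResolveReference follows RFC 3986: absolute references are returned
	// unchanged, relative ones are resolved against base.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	reqURL := "https://en.wikipedia.org/wiki/Lego"
	srcs := []string{
		"/static/images/logo.png",          // root-relative path
		"//upload.example.org/lego.jpg",    // scheme-relative URL
		"https://example.com/absolute.png", // already absolute
	}
	for _, src := range srcs {
		abs, err := resolveImageURL(reqURL, src)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(abs)
	}
}
```

Without the request URL, the first two src values could not be turned into fetchable addresses.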
type OpenGraph ¶
type OpenGraph struct {
	Title       string `json:"og:title,omitempty"`
	Description string `json:"og:description,omitempty"`
	ImageURL    string `json:"og:image,omitempty"`
}
OpenGraph contains Open Graph meta values.
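The struct tags determine how the value is serialized as JSON: each key carries the og: prefix, and empty fields are omitted. The sketch below copies the struct definition locally so that it runs without the package. encodeOpenGraph is a hypothetical helper added for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// OpenGraph is a local copy of the struct documented above,
// reproduced here so the sketch is self-contained.
type OpenGraph struct {
	Title       string `json:"og:title,omitempty"`
	Description string `json:"og:description,omitempty"`
	ImageURL    string `json:"og:image,omitempty"`
}

// encodeOpenGraph is a hypothetical helper that marshals the value to JSON.
func encodeOpenGraph(og OpenGraph) (string, error) {
	b, err := json.Marshal(og)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Description is left empty, so omitempty drops it from the output.
	og := OpenGraph{
		Title:    "Lego",
		ImageURL: "https://example.com/lego.png",
	}
	s, err := encodeOpenGraph(og)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s)
}
```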
type Option ¶
type Option struct {
	// RetryLength is the minimum length of a page description.
	// If the extracted description is shorter than this value,
	// extraction is retried with more liberal rules.
	RetryLength int

	// MinTextLength is the minimum length of a tag's inner text.
	// If a tag's inner text is shorter than MinTextLength,
	// the text is discarded from the page description candidates.
	MinTextLength int

	// RemoveUnlikelyCandidates is a flag indicating whether to remove tags
	// that are considered relatively unimportant.
	RemoveUnlikelyCandidates bool

	// WeightClasses is a flag indicating whether to give more or less weight
	// to tags whose id/class value contains positive or negative words.
	WeightClasses bool

	// CleanConditionally is a flag indicating whether to remove some tags
	// using the rules in conditionalCleanReason().
	CleanConditionally bool

	// RemoveEmptyNodes is a flag indicating whether to remove tags with empty inner text.
	RemoveEmptyNodes bool

	// MinImageWidth is the minimum width (in pixels) for choosing images.
	MinImageWidth uint32

	// MinImageHeight is the minimum height (in pixels) for choosing images.
	MinImageHeight uint32

	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int

	// CheckImageLoopCount is the number of images
	// requested in parallel to fetch the image size.
	// For example, if this value is set to 10,
	// the first 10 img src URLs without width/height attributes
	// will be requested over the network.
	// (img tags with both width and height attributes (integer pixels) are not counted,
	// since they are not requested over the network to get the image size.)
	CheckImageLoopCount uint

	// ImageRequestTimeout is the timeout (ms) for a single image request.
	ImageRequestTimeout uint

	// IgnoreImageFormat is a list of strings used to ignore some images.
	// If an image URL contains at least one of the strings in this list, the image is ignored.
	IgnoreImageFormat []string

	// DescriptionAsPlainText is a flag indicating whether to strip all tags from the description value.
	DescriptionAsPlainText bool

	// DescriptionExtractionTimeout is the timeout (ms) for extracting the description of a page.
	DescriptionExtractionTimeout uint

	// LookupOpenGraphTags is a flag indicating whether to use Open Graph tag values
	// for the title, description, and image when they exist.
	LookupOpenGraphTags bool
}
Option contains a variety of options for extracting page content and images.
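To show how the image-related options could interact, the sketch below filters a list of candidate images by IgnoreImageFormat, MinImageWidth, MinImageHeight, and MaxImageCount. It is an illustration based only on the field comments above. The image and imageOptions types and the selectImages function are hypothetical, and the order of checks and the boundary comparisons in the package itself may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// image is a hypothetical candidate image whose size is already known.
type image struct {
	URL           string
	Width, Height uint32
}

// imageOptions holds only the image-related fields of Option,
// copied locally so the sketch is self-contained.
type imageOptions struct {
	MinImageWidth     uint32
	MinImageHeight    uint32
	MaxImageCount     int
	IgnoreImageFormat []string
}

// containsAny reports whether s contains at least one of the substrings.
func containsAny(s string, substrs []string) bool {
	for _, sub := range substrs {
		if strings.Contains(s, sub) {
			return true
		}
	}
	return false
}

// selectImages keeps candidates that pass every filter, in document order,
// and stops once MaxImageCount images have been selected.
// Assumption: sizes equal to the minimum are accepted.
func selectImages(candidates []image, opt imageOptions) []image {
	var selected []image
	for _, img := range candidates {
		if len(selected) >= opt.MaxImageCount {
			break
		}
		if containsAny(img.URL, opt.IgnoreImageFormat) {
			continue // URL matches an ignored pattern
		}
		if img.Width < opt.MinImageWidth || img.Height < opt.MinImageHeight {
			continue // too small to be a useful page image
		}
		selected = append(selected, img)
	}
	return selected
}

func main() {
	candidates := []image{
		{"https://example.com/sprite.gif", 500, 500},
		{"https://example.com/icon.png", 50, 50},
		{"https://example.com/a.jpg", 400, 300},
		{"https://example.com/b.jpg", 800, 600},
		{"https://example.com/c.jpg", 1000, 800},
	}
	opt := imageOptions{
		MinImageWidth:     100,
		MinImageHeight:    75,
		MaxImageCount:     2,
		IgnoreImageFormat: []string{".gif", "data:image"},
	}
	for _, img := range selectImages(candidates, opt) {
		fmt.Println(img.URL)
	}
}
```

In this sketch a MaxImageCount of 0 selects no images. Whether the package treats 0 the same way is not stated on this page.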