readability

v0.1.3 (latest), published Jan 18, 2025. License: Apache-2.0. Imports: 15. Imported by: 1.

Repository

github.com/giulianopz/go-readability

README ¶

go-readability

A Go port of Mozilla's Readability.js, an algorithm based on heuristics (e.g. link density, text similarity, number of images) that just somehow work well. It powers the Firefox Reader View, which offers a distraction-free reading experience for articles, blog posts, and other text-heavy web pages by removing ads, GDPR-compliant cookie banners, and other unsolicited junk.

The source code is aligned with the latest commit (97db40b) on the main branch.
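
To give a feel for one of the heuristics mentioned above, here is a minimal sketch of link density: the share of a block's text that sits inside links. The helper `linkDensity` and its inputs are hypothetical and not part of this package's API; the real algorithm works on DOM nodes and applies further weighting.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// linkDensity is a hypothetical helper (not part of go-readability).
// It returns the fraction of text, counted in runes, that belongs to links.
// text is the full text of a block; linkTexts are the texts of its <a> elements.
func linkDensity(text string, linkTexts []string) float64 {
	total := utf8.RuneCountInString(text)
	if total == 0 {
		return 0
	}
	linked := 0
	for _, l := range linkTexts {
		linked += utf8.RuneCountInString(l)
	}
	return float64(linked) / float64(total)
}

func main() {
	// A navigation bar is almost entirely links: 16 of 18 runes.
	nav := "Home About Contact"
	fmt.Printf("navigation: %.2f\n", linkDensity(nav, []string{"Home", "About", "Contact"})) // navigation: 0.89

	// An article paragraph usually has only a small linked portion.
	para := "Redis remains BSD licensed, see the license page for details."
	fmt.Printf("paragraph: %.2f\n", linkDensity(para, []string{"license page"}))
}
```

Blocks with a high link density tend to be navigation or related-content lists rather than article text, which is why a score like this is useful as one signal among several.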

A Bit of History

Readability.js, maintained by Mozilla, is based on a JavaScript bookmarklet developed by Arc90, a consulting firm which was experimenting with Web technology at that time and which used to share some of their stuff as open source software. The company site has long disappeared, but it can still be found with the Wayback Machine.

The source code was released in 2009 under the Apache 2.0 software license on Google Code. It was abandoned in 2010 and repackaged as a web service called Readability.com, which was itself discontinued in 2016. The main source code contributor was Chris Dary (@umbrae).

Most modern browsers still use one of the available forks of the Arc90 original implementation when displaying web pages in reading mode.

For a historical and detailed analysis of this topic, please read this excellent series of articles by Daniel Aleksandersen.

Basic usage

Add a dependency for the package:

go get -u github.com/giulianopz/go-readability

Get text content from a web page article:

package main

import (
	"fmt"

	"github.com/giulianopz/go-readability"
)

func main() {

	var htmlSource = `<!DOCTYPE html>
<html>

<head>
	<meta charset="utf-8" />
	<title>
		Redis will remain BSD licensed - &lt;antirez&gt;
	</title>
	<link href="/rss" rel="alternate" type="application/rss+xml" />
</head>

<body>
	<div id="container">
		<header>
			<h1><a href="/">&lt;antirez&gt;</a></h1>
		</header>
		<div id="content">
			<section id="newslist">
				<article data-news-id="120">
					<h2><a href="/news/120">Redis will remain BSD licensed</a></h2>
				</article>
			</section>
			<article class="comment" style="margin-left:0px" data-comment-id="120-" id="120-"><span class="info"><span
						class="username"><a href="/user/antirez">antirez</a></span> 2095 days ago.
					170643 views. </span>
				<pre>Today a page about the new Common Clause license in the Redis Labs web site was interpreted as if Redis itself switched license. This is not the case, Redis is, and will remain, BSD licensed. However in the era of [edit] uncontrollable spreading of information, my attempts to provide the correct information failed, and I’m still seeing everywhere “Redis is no longer open source”. The reality is that Redis remains BSD, and actually Redis Labs did the right thing supporting my effort to keep the Redis core open as usually.

				[...]

We at Redis Labs are sorry for the confusion generated by the Common Clause page, and my colleagues are working to fix the page with better wording.</pre>
			</article>
		</div>
	</div>
</body>

</html>`

	isReaderable := readability.IsProbablyReaderable(htmlSource)
	fmt.Printf("Is probably readerable?: %t\n", isReaderable)

	reader, err := readability.New(
		htmlSource,
		"http://antirez.com/news/120",
		readability.ClassesToPreserve("caption"),
	)
	if err != nil {
		panic(err)
	}

	result, err := reader.Parse()
	if err != nil {
		panic(err)
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Byline)
	fmt.Printf("Length: %d\n", result.Length)
	fmt.Printf("Excerpt: %s\n", result.Excerpt)
	fmt.Printf("SiteName: %s\n", result.SiteName)
	fmt.Printf("Lang: %s\n", result.Lang)
	fmt.Printf("PublishedTime: %s\n", result.PublishedTime)
	fmt.Printf("HTMLContent: %s\n", result.HTMLContent)
	fmt.Printf("TextContent: %s\n", result.TextContent)
}

Documentation ¶

Index ¶

func IsProbablyReaderable(htmlSource string, opts ...Option) bool
type Node
type Option
type Options
type Readability
- func New(htmlSource, uri string, opts ...Option) (*Readability, error)
- func (r *Readability) Parse() (*Result, error)
type Result

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func IsProbablyReaderable ¶

func IsProbablyReaderable(htmlSource string, opts ...Option) bool

Decides whether or not the document is reader-able without parsing the whole thing. Options:

options.minContentLength (default 140): the minimum node content length used to decide if the document is readerable (see MinContentLength)
options.minScore (default 20): the minimum cumulative 'score' used to determine if the document is readerable (see MinScore)
options.visibilityChecker (default isNodeVisible): the function used to determine if a node is visible (see VisibilityChecker)

Types ¶

type Node ¶ added in v0.1.2

type Node struct {
	NodeType  uint
	LocalName string

	TagName    string
	Attributes []*attribute

	// relations
	ParentNode             *Node
	NextSibling            *Node
	PreviousSibling        *Node
	PreviousElementSibling *Node
	NextElementSibling     *Node
	ChildNodes             []*Node
	Children               []*Node

	// document
	DocumentURI string

	Body                 *Node
	DocumentElement      *Node
	ReadabilityNode      *readabilityNode
	ReadabilityDataTable *readabilityDataTable
	// contains filtered or unexported fields
}

func (*Node) AppendChild ¶ added in v0.1.2

func (n *Node) AppendChild(child *Node)

func (*Node) FirstChild ¶ added in v0.1.2

func (n *Node) FirstChild() *Node

func (*Node) FirstElementChild ¶ added in v0.1.2

func (n *Node) FirstElementChild() *Node

func (*Node) GetAttribute ¶ added in v0.1.2

func (n *Node) GetAttribute(name string) string

func (*Node) GetAttributeByIndex ¶ added in v0.1.2

func (n *Node) GetAttributeByIndex(idx int) *attribute

func (*Node) GetAttributeLen ¶ added in v0.1.2

func (n *Node) GetAttributeLen() int

func (*Node) GetClassName ¶ added in v0.1.2

func (n *Node) GetClassName() string

// setStyle sets the CSS property mapped from jsName in the node's inline
// "style" attribute, replacing any existing declaration of that property.
func (s *style) setStyle(jsName, styleValue string) {
	cssName := styleMap[jsName]
	value := s.node.getAttribute("style")

	// Walk the ';'-separated declarations looking for cssName.
	index := 0
	for index < len(value) {
		// end is the position of the next ';', or len(value) for the last declaration.
		end, next := len(value), len(value)
		if i := strings.IndexByte(value[index:], ';'); i >= 0 {
			end = index + i
			next = end + 1
		}
		name, _, found := strings.Cut(value[index:end], ":")
		if found && strings.TrimSpace(name) == cssName {
			// Drop the existing declaration and stop; the new one is appended below.
			value = strings.TrimSpace(strings.TrimSpace(value[:index]) + " " + strings.TrimSpace(value[next:]))
			break
		}
		index = next
	}

	value += " " + cssName + ": " + styleValue + ";"
	s.node.setAttribute("style", strings.TrimSpace(value))
}

func (*Node) GetElementById ¶ added in v0.1.2

func (n *Node) GetElementById(id string) *Node

func (*Node) GetId ¶ added in v0.1.2

func (n *Node) GetId() string

func (*Node) GetInnerHTML ¶ added in v0.1.2

func (n *Node) GetInnerHTML() string

func (*Node) GetNodeName ¶ added in v0.1.2

func (n *Node) GetNodeName() string

// setSrcset sets the node's "srcset" attribute to str.
func (n *node) setSrcset(str string) {
	n.setAttribute("srcset", str)
}

func (*Node) GetSrc ¶ added in v0.1.2

func (n *Node) GetSrc() string

func (*Node) GetSrcset ¶ added in v0.1.2

func (n *Node) GetSrcset() string

func (*Node) GetTextContent ¶ added in v0.1.2

func (n *Node) GetTextContent() string

func (*Node) HasAttribute ¶ added in v0.1.2

func (n *Node) HasAttribute(name string) bool

func (*Node) LastChild ¶ added in v0.1.2

func (n *Node) LastChild() *Node

func (*Node) RemoveAttribute ¶ added in v0.1.2

func (n *Node) RemoveAttribute(name string)

func (*Node) RemoveChild ¶ added in v0.1.2

func (n *Node) RemoveChild(child *Node) (*Node, error)

func (*Node) ReplaceChild ¶ added in v0.1.2

func (n *Node) ReplaceChild(newNode, oldNode *Node) *Node

func (*Node) SetAttribute ¶ added in v0.1.2

func (n *Node) SetAttribute(name, value string)

func (*Node) SetClassName ¶ added in v0.1.2

func (n *Node) SetClassName(str string)

func (*Node) SetId ¶ added in v0.1.2

func (n *Node) SetId(str string)

func (*Node) SetInnerHTML ¶ added in v0.1.2

func (n *Node) SetInnerHTML(html string)

func (*Node) SetTextContent ¶ added in v0.1.2

func (n *Node) SetTextContent(text string)

type Option ¶

type Option func(*Options)

func AllowedVideoRegex ¶

func AllowedVideoRegex(rgx *regexp.Regexp) Option

func CharThreshold ¶

func CharThreshold(n int) Option

func ClassesToPreserve ¶

func ClassesToPreserve(classes ...string) Option

func DisableJSONLD ¶

func DisableJSONLD(b bool) Option

func Html2Text ¶ added in v0.1.2

func Html2Text(f func(string) string) Option

func KeepClasses ¶

func KeepClasses(b bool) Option

func MaxElemsToParse ¶

func MaxElemsToParse(n int) Option

func MinContentLength ¶

func MinContentLength(len int) Option

func MinScore ¶

func MinScore(score float64) Option

func NTopCandidates ¶

func NTopCandidates(n int) Option

func Serializer ¶

func Serializer(f func(*Node) string) Option

func VisibilityChecker ¶

func VisibilityChecker(f func(*html.Node) bool) Option

type Options ¶

type Options struct {
	// contains filtered or unexported fields
}
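
The Option type and its constructors follow Go's functional options pattern: each constructor returns a closure that mutates the unexported fields of Options. The sketch below shows the mechanics with hypothetical lower-case stand-ins. The field names and the default value are assumptions, since the real fields are unexported.

```go
package main

import "fmt"

// options and option are hypothetical stand-ins for the package's
// Options and Option types; the field names are assumptions.
type options struct {
	charThreshold     int
	keepClasses       bool
	classesToPreserve []string
}

type option func(*options)

// Each constructor captures its argument and applies it later.
func charThreshold(n int) option {
	return func(o *options) { o.charThreshold = n }
}

func keepClasses(b bool) option {
	return func(o *options) { o.keepClasses = b }
}

func classesToPreserve(classes ...string) option {
	return func(o *options) {
		o.classesToPreserve = append(o.classesToPreserve, classes...)
	}
}

// newOptions starts from defaults and applies the options in order,
// so later options override earlier ones.
func newOptions(opts ...option) *options {
	o := &options{charThreshold: 500} // illustrative default
	for _, apply := range opts {
		apply(o)
	}
	return o
}

func main() {
	o := newOptions(charThreshold(200), classesToPreserve("caption"))
	fmt.Printf("%+v\n", *o)
}
```

The benefit of this design is that callers only name the settings they want to change, and new options can be added later without breaking existing call sites.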

type Readability ¶

type Readability struct {
	// contains filtered or unexported fields
}

func New ¶

func New(htmlSource, uri string, opts ...Option) (*Readability, error)

New is the public constructor of Readability. It supports the following options:

options.debug
options.maxElemsToParse
options.nbTopCandidates
options.charThreshold
options.classesToPreserve
options.keepClasses
options.serializer

func (*Readability) Parse ¶

func (r *Readability) Parse() (*Result, error)

Runs readability. Workflow:

Prep the document by removing script tags, css, etc.
Build readability's DOM tree.
Grab the article content from the current DOM tree.
Replace the current DOM tree with the new one.
Read peacefully.
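
As a rough illustration of the first step in the workflow above, the sketch below strips script and style elements from an HTML string. It is a deliberately crude stand-in: the helper `stripScriptsAndStyles` is hypothetical, and it uses regular expressions on raw text, whereas a real implementation would be expected to operate on a parsed DOM.

```go
package main

import (
	"fmt"
	"regexp"
)

// (?is) makes the match case-insensitive and lets '.' span newlines;
// '.*?' is non-greedy so each element is matched on its own.
var (
	scriptRe = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`)
	styleRe  = regexp.MustCompile(`(?is)<style\b[^>]*>.*?</style\s*>`)
)

// stripScriptsAndStyles is a hypothetical helper (not part of go-readability)
// that removes <script> and <style> elements, including their content.
func stripScriptsAndStyles(htmlSource string) string {
	withoutScripts := scriptRe.ReplaceAllString(htmlSource, "")
	return styleRe.ReplaceAllString(withoutScripts, "")
}

func main() {
	src := `<p>Hello</p><script>alert(1)</script><style>p{color:red}</style>`
	fmt.Println(stripScriptsAndStyles(src)) // <p>Hello</p>
}
```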

type Result ¶

type Result struct {
	// article title
	Title string
	// HTML string of the processed article content
	HTMLContent string
	// text content of the article, with all the HTML tags removed
	TextContent string
	// length of an article, in characters (runes)
	Length int
	// article description, or short excerpt from the content
	Excerpt string
	// author metadata
	Byline string
	// content direction
	Dir string
	// name of the site
	SiteName string
	// content language
	Lang string
	// published time
	PublishedTime string
}
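
Because Length is documented as a count of characters (runes), it can differ from `len` applied to the same text, which counts bytes. A minimal sketch of the difference, using a hypothetical helper `runeLength` rather than anything from this package:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLength is a hypothetical helper that counts characters (runes)
// rather than bytes.
func runeLength(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	text := "na\u00efve caf\u00e9" // "naïve café": two characters need 2 bytes each in UTF-8

	fmt.Println(len(text))        // 12 (bytes)
	fmt.Println(runeLength(text)) // 10 (runes)
}
```

For plain ASCII text the two numbers are equal; they diverge as soon as the article contains accented letters, typographic quotes, or non-Latin scripts.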


Directories ¶

Path	Synopsis
cmd
readability
