readability

package module
v0.1.3
Published: Jan 18, 2025 License: Apache-2.0 Imports: 15 Imported by: 1

README

go-readability

A Go port of Mozilla's Readability.js, the library that powers the Firefox Reader View. The algorithm relies on heuristics (link density, text similarity, number of images, and so on) that work well in practice. It offers a distraction-free reading experience for articles, blog posts, and other text-heavy web pages by removing ads, GDPR cookie banners, and other unsolicited junk.
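
To make one of these heuristics concrete, here is a minimal sketch of link density: the share of an element's text that sits inside links. The elem type and the function names are hypothetical stand-ins for illustration only, not this package's Node type or its internal code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// elem is a hypothetical, minimal stand-in for a DOM element.
// It is not the package's Node type.
type elem struct {
	tag      string
	text     string // text directly inside this element
	children []*elem
}

// textLen returns the number of runes of text in e and all its descendants.
func textLen(e *elem) int {
	n := utf8.RuneCountInString(e.text)
	for _, c := range e.children {
		n += textLen(c)
	}
	return n
}

// linkTextLen returns the number of runes of text found inside <a> elements.
func linkTextLen(e *elem) int {
	if e.tag == "a" {
		return textLen(e)
	}
	n := 0
	for _, c := range e.children {
		n += linkTextLen(c)
	}
	return n
}

// linkDensity is the ratio of link text to total text; 0 for empty elements.
func linkDensity(e *elem) float64 {
	total := textLen(e)
	if total == 0 {
		return 0
	}
	return float64(linkTextLen(e)) / float64(total)
}

func main() {
	// A navigation block is made of links only.
	nav := &elem{tag: "div", children: []*elem{
		{tag: "a", text: "Home"},
		{tag: "a", text: "News"},
	}}
	// An article paragraph is mostly prose with an occasional link.
	article := &elem{tag: "p", text: "A long paragraph of prose. ", children: []*elem{
		{tag: "a", text: "link"},
	}}
	fmt.Printf("nav: %.2f, article: %.2f\n", linkDensity(nav), linkDensity(article))
}
```

A block whose text is mostly links (menus, footers, related-article lists) scores close to 1 and is a likely candidate for removal, while article prose scores close to 0.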

The source code is aligned with the latest commit (97db40b) on the main branch.

A Bit of History

Readability.js, maintained by Mozilla, is based on a JavaScript bookmarklet developed by Arc90, a consulting firm that was experimenting with Web technology at the time and used to share some of its work as open source software. The company site disappeared long ago, but it can still be found with the Wayback Machine.

The source code was released in 2009 under the Apache 2.0 software license on Google Code. It was abandoned in 2010 and repackaged as a web service called Readability.com, which was discontinued in 2016. The main source code contributor was Chris Dary (@umbrae).

Most modern browsers still use one of the available forks of the Arc90 original implementation when displaying web pages in reading mode.

For a historical and detailed analysis of this topic, please read this excellent series of articles by Daniel Aleksandersen.

Basic usage

Add a dependency for the package:

go get -u github.com/giulianopz/go-readability

Get text content from a web page article:

package main

import (
	"fmt"

	"github.com/giulianopz/go-readability"
)

func main() {

	var htmlSource = `<!DOCTYPE html>
<html>

<head>
	<meta charset="utf-8" />
	<title>
		Redis will remain BSD licensed - &lt;antirez&gt;
	</title>
	<link href="/rss" rel="alternate" type="application/rss+xml" />
</head>

<body>
	<div id="container">
		<header>
			<h1><a href="/">&lt;antirez&gt;</a></h1>
		</header>
		<div id="content">
			<section id="newslist">
				<article data-news-id="120">
					<h2><a href="/news/120">Redis will remain BSD licensed</a></h2>
				</article>
			</section>
			<article class="comment" style="margin-left:0px" data-comment-id="120-" id="120-"><span class="info"><span
						class="username"><a href="/user/antirez">antirez</a></span> 2095 days ago.
					170643 views. </span>
				<pre>Today a page about the new Common Clause license in the Redis Labs web site was interpreted as if Redis itself switched license. This is not the case, Redis is, and will remain, BSD licensed. However in the era of [edit] uncontrollable spreading of information, my attempts to provide the correct information failed, and I’m still seeing everywhere “Redis is no longer open source”. The reality is that Redis remains BSD, and actually Redis Labs did the right thing supporting my effort to keep the Redis core open as usually.

				[...]

We at Redis Labs are sorry for the confusion generated by the Common Clause page, and my colleagues are working to fix the page with better wording.</pre>
			</article>
		</div>
	</div>
</body>

</html>`

	isReaderable := readability.IsProbablyReaderable(htmlSource)
	fmt.Printf("Contains any text?: %t\n", isReaderable)

	reader, err := readability.New(
		htmlSource,
		"http://antirez.com/news/120",
		readability.ClassesToPreserve("caption"),
	)
	if err != nil {
		panic(err)
	}

	result, err := reader.Parse()
	if err != nil {
		panic(err)
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Byline)
	fmt.Printf("Length: %d\n", result.Length)
	fmt.Printf("Excerpt: %s\n", result.Excerpt)
	fmt.Printf("SiteName: %s\n", result.SiteName)
	fmt.Printf("Lang: %s\n", result.Lang)
	fmt.Printf("PublishedTime: %s\n", result.PublishedTime)
	fmt.Printf("Content: %s\n", result.HTMLContent)
	fmt.Printf("TextContent: %s\n", result.TextContent)
}

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func IsProbablyReaderable

func IsProbablyReaderable(htmlSource string, opts ...Option) bool

Decides whether or not the document is readerable without parsing the whole thing. Options:

  • options.minContentLength (default 140), the minimum node content length used to decide if the document is readerable; set with MinContentLength
  • options.minScore (default 20), the minimum cumulative score used to determine if the document is readerable; set with MinScore
  • options.visibilityChecker (default isNodeVisible), the function used to determine if a node is visible; set with VisibilityChecker
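
As a rough illustration of how the two numeric options interact, the sketch below follows the scoring used by the upstream Readability.js check: nodes shorter than minContentLength are ignored, each remaining node adds the square root of its excess length to a running score, and the document counts as readerable once the score exceeds minScore. The function probablyReaderable is hypothetical. It takes node texts directly instead of walking a parsed document, skips the visibility and class-name checks, and counts runes, so it is not this package's implementation.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode/utf8"
)

// probablyReaderable is a simplified, hypothetical sketch of the check.
// nodeTexts stands in for the text content of candidate nodes such as <p>.
func probablyReaderable(nodeTexts []string, minContentLength int, minScore float64) bool {
	score := 0.0
	for _, t := range nodeTexts {
		n := utf8.RuneCountInString(strings.TrimSpace(t))
		if n < minContentLength {
			continue // too short to count as article content
		}
		// Longer nodes contribute more, with diminishing returns.
		score += math.Sqrt(float64(n - minContentLength))
		if score > minScore {
			return true
		}
	}
	return false
}

func main() {
	menu := []string{"Menu", "Login", "Subscribe"}
	article := []string{strings.Repeat("word ", 200)}

	// 140 and 20 are the documented defaults.
	fmt.Println(probablyReaderable(menu, 140, 20))
	fmt.Println(probablyReaderable(article, 140, 20))
}
```

With the defaults, a single node needs a little over 540 characters to pass on its own, while several medium-sized paragraphs can pass together.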

Types

type Node added in v0.1.2

type Node struct {
	NodeType  uint
	LocalName string

	TagName    string
	Attributes []*attribute

	// relations
	ParentNode             *Node
	NextSibling            *Node
	PreviousSibling        *Node
	PreviousElementSibling *Node
	NextElementSibling     *Node
	ChildNodes             []*Node
	Children               []*Node

	// document
	DocumentURI string

	Body                 *Node
	DocumentElement      *Node
	ReadabilityNode      *readabilityNode
	ReadabilityDataTable *readabilityDataTable
	// contains filtered or unexported fields
}

func (*Node) AppendChild added in v0.1.2

func (n *Node) AppendChild(child *Node)

func (*Node) FirstChild added in v0.1.2

func (n *Node) FirstChild() *Node

func (*Node) FirstElementChild added in v0.1.2

func (n *Node) FirstElementChild() *Node

func (*Node) GetAttribute added in v0.1.2

func (n *Node) GetAttribute(name string) string

func (*Node) GetAttributeByIndex added in v0.1.2

func (n *Node) GetAttributeByIndex(idx int) *attribute

func (*Node) GetAttributeLen added in v0.1.2

func (n *Node) GetAttributeLen() int

func (*Node) GetClassName added in v0.1.2

func (n *Node) GetClassName() string
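
// setStyle sets a single CSS property in the node's inline "style" attribute,
// replacing any existing declaration of the same property.
// indexOfFrom and substring are helpers defined elsewhere in the package.
func (s *style) setStyle(jsName, styleValue string) {
	cssName := styleMap[jsName]

	value := s.node.getAttribute("style")
	index := 0
	for index >= 0 {
		next := indexOfFrom(value, ";", index)
		length := next - index - 1
		var style string
		if length > 0 {
			style = substring(value, index, length)
		} else {
			// length <= 0 (for example, no further ";"): take the rest of the value
			style = substring(value, index, len(value))
		}
		substr := substring(style, 0, strings.IndexRune(style, ':'))
		if strings.TrimSpace(substr) == cssName {
			// drop the existing declaration, keeping what precedes and follows it
			rest := ""
			if next >= 0 {
				rest = " " + strings.TrimSpace(substring(value, next, len(value)))
			}
			value = strings.TrimSpace(substring(value, 0, index)) + rest
			break
		}
		index = next
	}
	// append the new declaration at the end
	value += " " + cssName + ": " + styleValue + ";"
	s.node.setAttribute("style", strings.TrimSpace(value))
}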

func (*Node) GetElementById added in v0.1.2

func (n *Node) GetElementById(id string) *Node

func (*Node) GetId added in v0.1.2

func (n *Node) GetId() string

func (*Node) GetInnerHTML added in v0.1.2

func (n *Node) GetInnerHTML() string

func (*Node) GetNodeName added in v0.1.2

func (n *Node) GetNodeName() string

// setSrcset sets the node's "srcset" attribute.
func (n *node) setSrcset(str string) {
	n.setAttribute("srcset", str)
}

func (*Node) GetSrc added in v0.1.2

func (n *Node) GetSrc() string

func (*Node) GetSrcset added in v0.1.2

func (n *Node) GetSrcset() string

func (*Node) GetTextContent added in v0.1.2

func (n *Node) GetTextContent() string

func (*Node) HasAttribute added in v0.1.2

func (n *Node) HasAttribute(name string) bool

func (*Node) LastChild added in v0.1.2

func (n *Node) LastChild() *Node

func (*Node) RemoveAttribute added in v0.1.2

func (n *Node) RemoveAttribute(name string)

func (*Node) RemoveChild added in v0.1.2

func (n *Node) RemoveChild(child *Node) (*Node, error)

func (*Node) ReplaceChild added in v0.1.2

func (n *Node) ReplaceChild(newNode, oldNode *Node) *Node

func (*Node) SetAttribute added in v0.1.2

func (n *Node) SetAttribute(name, value string)

func (*Node) SetClassName added in v0.1.2

func (n *Node) SetClassName(str string)

func (*Node) SetId added in v0.1.2

func (n *Node) SetId(str string)

func (*Node) SetInnerHTML added in v0.1.2

func (n *Node) SetInnerHTML(html string)

func (*Node) SetTextContent added in v0.1.2

func (n *Node) SetTextContent(text string)

type Option

type Option func(*Options)

func AllowedVideoRegex

func AllowedVideoRegex(rgx *regexp.Regexp) Option

func CharThreshold

func CharThreshold(n int) Option

func ClassesToPreserve

func ClassesToPreserve(classes ...string) Option

func DisableJSONLD

func DisableJSONLD(b bool) Option

func Html2Text added in v0.1.2

func Html2Text(f func(string) string) Option

func KeepClasses

func KeepClasses(b bool) Option

func MaxElemsToParse

func MaxElemsToParse(n int) Option

func MinContentLength

func MinContentLength(len int) Option

func MinScore

func MinScore(score float64) Option

func NTopCandidates

func NTopCandidates(n int) Option

func Serializer

func Serializer(f func(*Node) string) Option

func VisibilityChecker

func VisibilityChecker(f func(*html.Node) bool) Option
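
The constructors above follow Go's functional options pattern: each one returns a function that mutates a private options struct, and the caller passes any number of them. The sketch below shows the pattern in isolation. The lowercase names, the struct fields, and the default of 500 (taken from upstream Readability.js) are assumptions for illustration, since the package keeps the fields of Options unexported.

```go
package main

import "fmt"

// options is a hypothetical stand-in for the package's Options struct.
type options struct {
	charThreshold     int
	keepClasses       bool
	classesToPreserve []string
}

// option mirrors the shape of `type Option func(*Options)`.
type option func(*options)

// charThreshold returns an option that overrides the character threshold.
func charThreshold(n int) option {
	return func(o *options) { o.charThreshold = n }
}

// keepClasses returns an option that toggles class preservation.
func keepClasses(b bool) option {
	return func(o *options) { o.keepClasses = b }
}

// classesToPreserve returns an option that adds class names to keep.
func classesToPreserve(classes ...string) option {
	return func(o *options) {
		o.classesToPreserve = append(o.classesToPreserve, classes...)
	}
}

// newOptions starts from defaults and applies each option in order.
func newOptions(opts ...option) *options {
	o := &options{charThreshold: 500} // assumed default, as in Readability.js
	for _, apply := range opts {
		apply(o)
	}
	return o
}

func main() {
	o := newOptions(charThreshold(200), classesToPreserve("caption"))
	fmt.Printf("%+v\n", *o)
}
```

The pattern keeps the constructor signature stable: new settings can be added as new option functions without breaking existing callers, and omitted options fall back to defaults.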

type Options

type Options struct {
	// contains filtered or unexported fields
}

type Readability

type Readability struct {
	// contains filtered or unexported fields
}

func New

func New(htmlSource, uri string, opts ...Option) (*Readability, error)

New is the public constructor of Readability. It supports the following options:

  • options.debug
  • options.maxElemsToParse (MaxElemsToParse)
  • options.nbTopCandidates (NTopCandidates)
  • options.charThreshold (CharThreshold)
  • options.classesToPreserve (ClassesToPreserve)
  • options.keepClasses (KeepClasses)
  • options.serializer (Serializer)

func (*Readability) Parse

func (r *Readability) Parse() (*Result, error)

Runs readability. Workflow:

  1. Prep the document by removing script tags, css, etc.
  2. Build readability's DOM tree.
  3. Grab the article content from the current DOM tree.
  4. Replace the current DOM tree with the new one.
  5. Read peacefully.
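
As a very rough illustration of step 1 only, the sketch below strips script and style elements from raw HTML with a regular expression. This is an assumption-laden simplification: prepDocument is a hypothetical helper, and a regexp is a crude substitute for working on a parsed DOM tree, which is what steps 2 to 4 imply the library does.

```go
package main

import (
	"fmt"
	"regexp"
)

// scriptOrStyle matches whole <script> and <style> elements,
// case-insensitively (i) and across newlines (s).
var scriptOrStyle = regexp.MustCompile(
	`(?is)<script\b[^>]*>.*?</script\s*>|<style\b[^>]*>.*?</style\s*>`)

// prepDocument is a hypothetical helper that removes script and style
// elements, leaving the rest of the markup untouched.
func prepDocument(htmlSource string) string {
	return scriptOrStyle.ReplaceAllString(htmlSource, "")
}

func main() {
	src := `<p>Hello</p><script>track()</script><style>p { color: red }</style>`
	fmt.Println(prepDocument(src))
}
```

Regular expressions cannot handle every HTML edge case (for example, a closing tag inside a script string), which is one reason a real implementation operates on a DOM instead.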

type Result

type Result struct {
	// article title
	Title string
	// HTML string of the processed article content
	HTMLContent string
	// text content of the article, with all the HTML tags removed
	TextContent string
	// length of an article, in characters (runes)
	Length int
	// article description, or short excerpt from the content
	Excerpt string
	// author metadata
	Byline string
	// content direction
	Dir string
	// name of the site
	SiteName string
	// content language
	Lang string
	// published time
	PublishedTime string
}
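
Because Length is documented as a count of characters (runes) rather than bytes, it can differ from what len returns for the same string whenever the text contains non-ASCII characters. A minimal sketch of the difference, using a hypothetical runeLength helper that is not part of the package:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLength is a hypothetical helper that counts characters (runes)
// rather than bytes.
func runeLength(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	text := "caf\u00e9" // the last character takes two bytes in UTF-8
	fmt.Println("bytes:", len(text))
	fmt.Println("runes:", runeLength(text))
}
```

This matters when Length is compared against byte-based limits, such as buffer sizes or database column widths.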

Directories

Path Synopsis
cmd
