readability

Version: v0.0.0-...-0f57a44
Published: Nov 25, 2025 License: MIT Imports: 22 Imported by: 1

README

Go-Readability

Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This is a fork of github.com/go-shiori/go-readability originally written by Radhi Fadlillah and maintained by Felipe Martin and GitHub contributors. For more information about the changes in this fork, see FORK.md.

Radhi Fadlillah initially ported Readability.js to Go line by line so that it looks and works as similarly as possible. The hope is that every web page that Readability.js can parse is parseable by go-readability as well.

This module is compatible with Readability.js v0.6.0.

Installation

Note: you are viewing documentation for version 0, which is API-compatible with github.com/go-shiori/go-readability. Development of this project continues in the v2 branch, which you should choose for the best speed and memory efficiency. Its API-breaking change is that some Article fields were converted to methods.

To add this package to your project, use go get:

go get -u codeberg.org/readeck/go-readability

And to get the v2 branch instead:

go get -u codeberg.org/readeck/go-readability/v2

Example

package main

import (
	"fmt"
	"log"
	"net/url"
	"os"

	readability "codeberg.org/readeck/go-readability"
)

func main() {
	srcFile, err := os.Open("index.html")
	if err != nil {
		log.Fatal(err)
	}
	defer srcFile.Close()

	baseURL, _ := url.Parse("https://example.com/path/to/article")
	article, err := readability.FromReader(srcFile, baseURL)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Found article with title %q\n\n", article.Title)
	// Print the parsed, cleaned-up HTML markup of the article.
	fmt.Println(article.Content)
}

Command Line Usage

You can also use go-readability as a command-line tool:

go install codeberg.org/readeck/go-readability/cmd/go-readability@latest

Now you can use it by running go-readability in your terminal:

$ go-readability -h

go-readability is a parser that extracts article contents from a web page.
The source can be a URL or a filesystem path to a HTML file.
Pass "-" or no argument to read the HTML document from standard input.
Use "--http :0" to automatically choose an available port for the HTTP server.

Usage:
  go-readability [<flags>...] [<url> | <file> | -]

Flags:
  -h, --help          help for go-readability
  -l, --http string   start the http server at the specified address
  -m, --metadata      only print the page's metadata
  -t, --text          only print the page's text

Documentation

Overview

Package readability is a Go package that finds the main readable content of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on Readability.js by Mozilla and was written line by line so that it looks and works as similarly as possible. The hope is that every web page that Readability.js can parse is parseable by go-readability as well.

Functions

func Check

func Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing. It is a wrapper for `Parser.Check()` and is useful if you only need the default parser.

func CheckDocument

func CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing. It is a wrapper for `Parser.CheckDocument()` and is useful if you only need the default parser.

Types

type Article

type Article struct {
	Title         string
	Byline        string
	Node          *html.Node
	Content       string
	TextContent   string
	Length        int
	Excerpt       string
	SiteName      string
	Image         string
	Favicon       string
	Language      string
	PublishedTime *time.Time
	ModifiedTime  *time.Time
}

Article is the final readable content.

func FromDocument

func FromDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

FromDocument parses a document and returns the readable content. It is a wrapper for `Parser.ParseDocument()` and is useful if you only want to use the default parser.

func FromReader

func FromReader(input io.Reader, pageURL *nurl.URL) (Article, error)

FromReader parses an `io.Reader` and returns the readable content. It is a wrapper for `Parser.Parse()` and is useful if you only want to use the default parser.

Example
srcFile, err := os.Open("index.html")
if err != nil {
	log.Fatal(err)
}
defer srcFile.Close()

baseURL, _ := url.Parse("https://example.com/path/to/article")
article, err := FromReader(srcFile, baseURL)
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Found article with title %q\n\n", article.Title)
// Print the parsed, cleaned-up HTML markup of the article.
fmt.Println(article.Content)
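The example above discards the error from `url.Parse`. Note that `url.Parse` also accepts relative references without error, so a caller who wants an absolute page URL has to check for that separately. A minimal sketch of such a check follows; `parsePageURL` is a hypothetical helper, and the assumption that `pageURL` should be absolute is mine rather than something this documentation states.

```go
package main

import (
	"fmt"
	"net/url"
)

// parsePageURL is a hypothetical helper: it parses raw and rejects
// anything that is not an absolute URL with a host.
func parsePageURL(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parse page URL: %w", err)
	}
	if !u.IsAbs() || u.Host == "" {
		return nil, fmt.Errorf("page URL %q is not absolute", raw)
	}
	return u, nil
}

func main() {
	inputs := []string{"https://example.com/path/to/article", "/path/to/article"}
	for _, raw := range inputs {
		u, err := parsePageURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ok:", u)
	}
}
```

The returned `*url.URL` has the same type as the `pageURL` parameter of `FromReader` and `FromDocument`.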

func FromURL

func FromURL(pageURL string, timeout time.Duration, requestModifiers ...RequestWith) (Article, error)

FromURL fetches the web page from the specified URL, then parses the response to find the readable content.

type Parser

type Parser struct {
	// MaxElemsToParse is the max number of nodes supported by this
	// parser. Default: 0 (no limit)
	MaxElemsToParse int
	// NTopCandidates is the number of top candidates to consider when
	// analysing how tight the competition is among candidates.
	NTopCandidates int
	// CharThresholds is the default number of chars an article must
	// have in order to return a result
	CharThresholds int
	// ClassesToPreserve are the classes that readability sets itself.
	ClassesToPreserve []string
	// KeepClasses specifies whether the classes should be stripped or not.
	KeepClasses bool
	// TagsToScore is the list of element tags to score by default.
	TagsToScore []string
	// Deprecated: opt into printing logs to stderr. Use Logger instead.
	Debug bool
	// The structured logger to write to. The default log is written to io.Discard.
	Logger *slog.Logger
	// DisableJSONLD determines if metadata in JSON+LD will be extracted
	// or not. Default: false.
	DisableJSONLD bool
	// AllowedVideoRegex is a regular expression that matches video URLs that should be
	// allowed to be included in the article content. If undefined, it will use default filter.
	AllowedVideoRegex *regexp.Regexp
	// contains filtered or unexported fields
}

Parser is the parser that parses the page to get the readable content.

func NewParser

func NewParser() Parser

NewParser returns a new Parser set up with default values.

func (*Parser) Check

func (ps *Parser) Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing.

func (*Parser) CheckDocument

func (ps *Parser) CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing.

func (*Parser) Parse

func (ps *Parser) Parse(input io.Reader, pageURL *nurl.URL) (Article, error)

Parse parses a reader and finds the main readable content.

func (*Parser) ParseAndMutate

func (ps *Parser) ParseAndMutate(doc *html.Node, pageURL *nurl.URL) (Article, error)

ParseAndMutate is like ParseDocument, but mutates doc during parsing.

func (*Parser) ParseDocument

func (ps *Parser) ParseDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

ParseDocument parses the specified document and finds the main readable content.

type RequestWith

type RequestWith func(r *http.Request)

Directories

Path                 Synopsis
cmd/go-readability   command
internal/re2go       Code generated by re2go 4.0.2, DO NOT EDIT.
