Documentation
¶
Overview ¶
Package xmlutils is mostly content analysis for XML files, including both with and without DOCTYPE. .
Index ¶
- Variables
- func DoParseRaw_xml(s string) (xtokens []CT.CToken, err error)
- func DoParse_xml(s string) (xtokens []CT.CToken, err error)
- func DoParse_xml_locationAware(s string) (xtokens []CT.LAToken, err error)
- func NewConfiguredDecoder(r io.Reader) *xml.Decoder
- type CommonCPR
- type ContentityBasics
- type ContypingInfo
- type DitaContype
- type DitaFlavor
- type DoctypeMType
- type KeyElmTriplet
- type MType
- type NS
- type NSsnapshot
- type PIDFPIfields
- type PIDSIDcatalogFileRecord
- type ParsedDoctype
- type ParsedPreamble
- type ParserResults_xml
- type SliceBounds
- type XmlCatalogFile
- type XmlContype
- type XmlDoctype
- type XmlPeek
- type XmlPublicID
- type XmlSystemID
Constants ¶
This section is empty.
Variables ¶
var DITArootElms = []string{
"topic", "concept", "reference", "task", "bookmap",
"map", "glossentry", "glossgroup"}
DITArootElms are all the XML root elements that can be classified as DITA-type. Note that LwDITA uses only "topic".
var DITAtypeFileExtensions = []string{".dita", ".ditamap", ".ditaval"}
DITAtypeFileExtensions are all the file extensions that are automatically classified as being DITA-type.
var DTDtypeFileExtensions = []string{".dtd", ".mod", ".ent"}
DTDtypeFileExtensions are all the file extensions that are automatically classified as being DTD-type.
var DTMTmap = []DoctypeMType{ {"html", "html/cnt/html5", "html", false, true}, {"//DTD LIGHTWEIGHT DITA Topic//", "xml/cnt/topic", "topic", true, true}, {"//DTD LW DITA Topic//", "xml/cnt/topic", "topic", true, true}, {"//DTD XDITA Topic//", "html/cnt/topic", "topic", true, true}, {"//DTD LIGHTWEIGHT DITA Map//", "xml/map/---", "map", true, true}, {"//DTD LW DITA Map//", "xml/map/---", "map", true, true}, {"//DTD XDITA Map//", "html/map/---", "map", true, true}, {"//DTD DITA Concept//", "xml/cnt/concept", "concept", false, false}, {"//DTD DITA Topic//", "xml/cnt/topic", "topic", false, false}, {"//DTD DITA Task//", "xml/cnt/task", "task", false, false}, {"//DTD HTML 4.", "html/cnt/html4", "html", false, false}, {"//DTD XHTML 1.0 ", "html/cnt/xhtml1.0", "html", false, false}, {"//DTD XHTML 1.1//", "html/cnt/xhtml1.1", "html", false, false}, {"//DTD MathML 2.0//", "html/cnt/mathml", "", false, false}, {"//DTD SVG 1.0//", "xml/img/svg1.0", "svg", false, false}, {"//DTD SVG 1.1", "xml/img/svg", "svg", false, false}, {"//DTD XHTML Basic 1.1//", "html/cnt/topic", "html", false, false}, {"//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//", "html/cnt/blarg", "html", false, false}, }
DTMTmap maps DOCTYPEs to MTypes (and: Is it LwDITA ?). This list should suffice for all ordinary XML files (except of course Docbook).
var DitaContypes = []DitaContype{"Map", "Bookmap", "Topic", "Task", "Concept",
"Reference", "Dita", "Glossary", "Conrefs", "LwMap", "LwTopic"}
DitaContypes - see DitaContype.
var DitaFlavors = []DitaFlavor{"1.2", "1.3", "XDITA", "HDITA", "MDATA"}
DitaFlavors - see DitaFlavor.
var HtmlKeyContentElms = []string{"main", "content"}
HtmlKeyContentElms is elements that often surround the actual page content.
var HtmlSectioningContentElms = []string{"article", "aside", "nav", "section"}
HtmlSectioningContentElms have internal sections and subsections.
var HtmlSectioningRootElms = []string{
"blockquote", "body", "details", "dialog", "fieldset", "figure", "td"}
HtmlSectioningRootElms have their OWN outlines, separate from the outlines of their ancestors, i.e. self-contained hierarchies.
var HtmlSelfClosingTags = []string{
"area", "base", "br", "col", "command", "embed", "hr", "img", "input",
"keygen", "link", "meta", "param", "source", "track", "wbr",
}
HtmlSelfClosingTags is tags that do not need closing tags.
var KeyElmTriplets = []*KeyElmTriplet{
{"html", "head", "body"},
{"topic", "prolog", "body"},
{"map", "topicmeta", ""},
{"reference", "", ""},
{"task", "", ""},
{"bookmap", "", ""},
{"glossentry", "", ""},
{"glossgroup", "", ""},
{"meta", "", ""},
}
KeyElmTriplets are for HTML5, LwDITA, and DITA.
var MarkdownFileExtensions = []string{".md", ".mdown", ".markdown", ".mkdn"}
MarkdownFileExtensions are all the file extensions that are automatically classified as being Markdown-type, even tho we generally use a regex instead.
var MiscFileExtensions = []string{".sqlar"}
MiscFileExtensions are all the other file extensions that we want to process.
var NS_OASIS_XML_CATALOG = "urn:oasis:names:tc:entity:xmlns:xml:catalog:"
NS_OASIS_XML_CATALOG is the OASIS namespace (as URN) for XML catalogs.
var NS_XML = "http://www.w3.org/XML/1998/namespace"
NS_XML is the XML namespace.
var STD_PREAMBLE CT.Raw = xml.Header
STD_PREAMBLE is "<?xml version="1.0" encoding="UTF-8"?>" + "\n"
var XML_NS_Recognized = []string{
"lang", "space", "base", "id", "Father"}
var XmlContypes = []XmlContype{"Unknown", "DTD", "DTDmod", "DTDent",
"RootTagData", "RootTagMixedContent", "MultipleRootTags", "INVALID"}
XmlContypes categorise an XML file by structure and content. NOTE: Maybe DTDmod should be DTDelms.
Functions ¶
func DoParseRaw_xml ¶
DoParseRaw_xml( is TBS.
func DoParse_xml ¶
DoParse_xml takes a string, so we can assume that we can discard it after use cos the caller has another copy of it. To be safe, it copies every token using `xml.CopyToken(T)`.
func DoParse_xml_locationAware ¶
DoParse_xml_locationAware is TBS.
func NewConfiguredDecoder ¶
NewConfiguredDecoder returns a new xml.Decoder that has been confgured with non-strict namespace parsing, HTML auto-closing tags, and HTML entities.
Types ¶
type CommonCPR ¶
type CommonCPR struct { NodeDepths []int FilePosns []*CT.FilePosition CPR_raw string // Writer is usually the GTokens Writer io.Writer }
CommonCPR is Concrete Parse Results common to all formats processed.
type ContentityBasics ¶
type ContentityBasics struct { // XmlRoot is not meaningful for non-XML XmlRoot CT.Span Text CT.Span Meta CT.Span // MetaFormat is "YAML" or "XML" MetaFormat string // MetaProps uses dot separators if hierarchy is needed MetaProps SU.PropSet }
ContentityBasics describes the top-level structure of a Contentity (content entity). It has XmlRoot, Text, Meta, MetaProps, and is embedded in struct XmlPeek. .
func (*ContentityBasics) CheckTopTags ¶
func (p *ContentityBasics) CheckTopTags() (bool, string)
HasRootTag returns true is a root element was found, plus a message about any missing top-level constructs, and it can return warnings. .
func (*ContentityBasics) HasNone ¶
func (p *ContentityBasics) HasNone() bool
HasNone returns false if no expected top-level structure is found.
func (*ContentityBasics) SetToNonXml ¶
func (p *ContentityBasics) SetToNonXml(L int)
SetToNonXml just needs the length of the content, and sets no useful information about deeper structure.
type ContypingInfo ¶
ContypingInfo has simple fields related to typing content (i.e. determining its type). .
func (ContypingInfo) MultilineString ¶
func (p ContypingInfo) MultilineString() (s string)
MultilineString should probably be renamed to Debug, in implementing [Stringser].
func (*ContypingInfo) ParseDoctype ¶
func (pC *ContypingInfo) ParseDoctype(sRaw CT.Raw) (*ParsedDoctype, error)
ParseDoctype should probably NOT be a method on ContypingInfo.
It expects to receive (a file extension) plus (a content type as determined by the HTTP stdlib. However a DOCTYPE is always considered authoritative, so this func can ignore things like the file extension, and overwrite or set any field it wants to.
It works by first trying to match the DOCTYPE against a list. If that fails, stronger measures are called for.
Note two things about this function:
Firstly, it can handle PID, SID, or both:
<!DOCTYPE topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN"> <!DOCTYPE topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN" "./foo.dtd"> <!DOCTYPE topic SYSTEM "./foo.dtd">
Secondly, it can handle a less-than-complete declaration:
DOCTYPE topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN" (and variations) topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN" (and variations) PUBLIC "-//OASIS//DTD LWDITA Topic//EN" (and variations)
The last one is quite important because it is the format that appears in XML catalog files. .
func (ContypingInfo) String ¶
func (p ContypingInfo) String() (s string)
type DitaContype ¶
type DitaContype string
DitaContype is a [Lw]DITA Topic, Map, etc. See DitaContypes.
type DoctypeMType ¶
type DoctypeMType struct { ToMatch string DoctypesMType string RootElm string IsLwDITA bool // LwDITA, HTML5, and not much more (if any) IsInScope bool }
DoctypeMType maps a DOCTYPE string to an MType string and a bool, ? Is it LwDITA ?
type KeyElmTriplet ¶
KeyElmTriplet is a set of tags that appear together in a well-known content format.
func GetKeyElmTriplet ¶
func GetKeyElmTriplet(localName string) *KeyElmTriplet
GetKeyElmTriplet uses an XML localName to retrieve the other elements that are expected.
type MType ¶
type MType string
An MType is specific to this app and/but is modeled after the prior concept of Mime-type. An MType has three fields.
Its value is generally based on two to four inputs:
- The Mime-type guess returned by Go stdlib func net/http.DetectContentType(data []byte) string (which is based on https://mimesniff.spec.whatwg.org/ ) (The no-op default return value is "application/octet-stream")
- Our own shallow analysis of file contents
- The file extension (it is normally present)
- The DOCTYPE (iff XML, incl. HTML)
Note that
- a plain text file MAY be presumed to be Markdown, altho it is not clear (yet) which (TXT or MKDN) should take precedence.
- a Markdown file CAN and WILL be presumed to be LwDITA MDITA; this may cause conflicts/problems for other dialects.
- mappings can appear bogus, for example HTTP stdlib "text/html" might become MType "xml/html".
String possibilities (but in LOWER CASE!) in each field:
- [0] XML, HTML, BIN, TXT, MKDN, (new!) DIRLIKE (i.e. non-contentful)
- We might (or not) keep XML and HTML distinct for a number of reasons, but partly because in the Go stdlib, they are processed quite differently, and we take advantage of it to keep HTML pro- cessing free of nasty surprises and unhelpful strictness
- We might (or might not) keep MKDN distinct from TXT
- [1] CNT (Content), MAP (ToC), IMG, SCH(ema) [and maybe others TBD?]
- [2] Depends on [0]: XML: per-DTD [and/or Pub-ID/filext]; HTML: per-DTD [and/or Pub-ID/filext]; BIN: format/filext; SCH: format/filext [DTD,MOD,XSD,wotevs]; TXT: TBD MKDN: flavor of Markdown (?) (note also AsciiDoc, RST, ...) DIRLIKE: dir, symlink, pipe, socket, ...?
Possible FIXME: Let [2] (3rd) be version info (html5, lwdiat, dita13) and then keep root tag info separately.
Possible FIXME: Append version info, probably after a semicolon. .
type NS ¶
type NS struct { // Prefix is the shorthand version. Prefix string // URI is the full version. URI string }
NS is e.g. { "xml", "http://www.w3.org/XML/1998/namespace" }
type NSsnapshot ¶
One of these has to be filled in for the NS declarations at the top of a content file. Also this can describe the NS state at any point in parsing or traversing a content tree.
type PIDFPIfields ¶
type PIDFPIfields struct { // Registration is "+" or "-" Registration string // IsOasis but if not, then could be any of many others IsOasis bool // Organization is "OASIS" or maybe something else Organization string // PublicTextClass is typically "DTD" (filename.dtd) // or "ELEMENTS" (filename.mod) PublicTextClass string // PublicTextDesc is the distinguishing string, // e.g. PUBLIC "-//OASIS//DTD (_PublicTextDesc_)//EN". // It can end with the root tag of the document // (e.g. "Topic"). It can have an optional // embedded version number, such as "DITA 1.3". PublicTextDesc string }
PIDFPIfields holds the parsed results of a PID (PublicID) a.k.a. Formal Public Identifier, for example "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN"
func (PIDFPIfields) String ¶
func (p PIDFPIfields) String() string
type PIDSIDcatalogFileRecord ¶
type PIDSIDcatalogFileRecord struct { // XMLName probably does not ever need to be printed. XMLName xml.Name `xml:"public"` // XmlPublicID (PID) (FPI) is the DOCTYPE string XmlPublicID `xml:"publicId,attr"` PIDFPIfields // PublicID // XmlSystemID is the path to the file. Tipicly a relative filepath. XmlSystemID `xml:"uri,attr"` // The filepath long form, as resolved. // Note that we must use a string in order to avoid an import cycle. AbsFilePath string // FU.AbsFilePath HttpPath string Err error // in case an entry barfs }
PIDSIDcatalogFileRecord representa a line item from a parsed XML catalog file. One with a simple structure, such as the catalog file for LwDITA. This same struct is also used to record the PID and/or SID of a DOCTYPE declaration.
func NewPIDSIDcatalogFileRecord ¶
func NewPIDSIDcatalogFileRecord(pid string, sid string) (*PIDSIDcatalogFileRecord, error)
NewPIDSIDcatalogFileRecord is pretty self-explanatory.
func NewSIDPIDcatalogRecordfromStartTag ¶
func NewSIDPIDcatalogRecordfromStartTag(ct CT.CToken) (pID *PIDSIDcatalogFileRecord, err error)
NewSIDPIDcatalogRecordfromStartTag is TBS.
func (PIDSIDcatalogFileRecord) DString ¶
func (p PIDSIDcatalogFileRecord) DString() string
DString returns a comprehensive dump.
func (PIDSIDcatalogFileRecord) Echo ¶
func (p PIDSIDcatalogFileRecord) Echo() string
Echo returns the public ID _unquoted_. <!DOCTYPE topic "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN">
func (*PIDSIDcatalogFileRecord) HasPID ¶
func (p *PIDSIDcatalogFileRecord) HasPID() bool
func (*PIDSIDcatalogFileRecord) HasSID ¶
func (p *PIDSIDcatalogFileRecord) HasSID() bool
func (PIDSIDcatalogFileRecord) String ¶
func (p PIDSIDcatalogFileRecord) String() string
String returns the juicy part. For example, <!DOCTYPE topic "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN"> maps to "DTD LIGHTWEIGHT DITA Topic".
type ParsedDoctype ¶
type ParsedDoctype struct { // Raw is the raw Doctype string CT.Raw // PIDSIDcatalogFileRecord is the PID + SID. PIDSIDcatalogFileRecord // DTrootElm is the tag declared in the DOCTYPE, which // should match the root tag in the text of the file. DTrootElm string // contains filtered or unexported fields }
ParsedDoctype is a parse of a complete DOCTYPE declaration. For [Lw]DITA, what interests us is something like
PUBLIC "-//OASIS//DTD (PublicTextDesc)//EN" or sometimes PUBLIC "-//OASIS//ELEMENTS (PublicTextDesc)//EN" and maybe followed by SYSTEM...
The structure of a DOCTYPE is like so:
- PUBLIC | SYSTEM = Availability
- - = Registration = Organization & DTD are not registeredd with ISO.
- OASIS = Organization
- DTD = Public Text Class (CAPACITY | CHARSET | DOCUMENT | DTD | ELEMENTS | ENTITIES | LPD | NONSGML | NOTATION | SHORTREF | SUBDOC | SYNTAX | TEXT )
- (*) = Public Text Description, incl. any version number
- EN = Public Text Language
- URL = optional, explicit
We don't include the raw DOCTYPE here because this structure can be optional but we still need to have the Doctype string in the DB as a separate column, even if it is empty (i.e. "").
type ParsedPreamble ¶
type ParsedPreamble struct { // Preamble_raw does not include a trailing newline Preamble_raw string // MinorVersion "0" means XML 1.0 MinorVersion string // Encoding has Valid values and forms TBS Encoding string // IsStandalone has value "yes" or "no" IsStandalone bool }
ParsedPreamble is a parse of an optional PI (processing instruction) at the start of an XML file. The most typical form is defined in the stdlib:
"<?xml version="1.0" encoding="UTF-8"?>" + "\n"
Here the major version MUST be 1. XML has a version 1.1 but nobody uses it, so also the minor version MUST be 0, because that's what the Go stdlib XML parser understands, and anything else is gonna cause crazy breakage. Fields:
<?xml version="version_number" <= required, "1.0" encoding="encoding_declaration" <= optional, assume "UTF-8" standalone="standalone_status" ?> <= opt'l, can be "yes", dflt "no"
Probably any errors returned by this function should be panicked on, because any such error is pretty fundamental and also ridiculous. Note also that strictly speaking, an XML preamble is NOT a PI.
var STD_PreambleParsed ParsedPreamble
STD_PreambleFields is our parse of variable "STD_PREAMBLE".
func ParsePreamble ¶
func ParsePreamble(sRaw CT.Raw) (*ParsedPreamble, error)
ParsePreamble parses an XML preamble, which (BTW) MUST be the first line in a file. XML version MUST be "1.0". Encoding handling is incomplete.
- Example: <?xml version="1.0" encoding='UTF-8' standalone="yes"?>
- Also OK: xml version="1.0" encoding='UTF-8' standalone="yes"
- Also OK: version="1.0" encoding='UTF-8' standalone="yes"
- Also OK: fields as documented for struct "XmlPreambleFields".
type ParserResults_xml ¶
type ParserResults_xml struct { // NodeSlice is driven by the stdlib XML parser in [encoding/xml]. NodeSlice []CT.CToken // []xml.Token CommonCPR }
ParserResults_xml is TBS.
func GenerateParserResults_xml ¶
func GenerateParserResults_xml(s string) (*ParserResults_xml, error)
GenerateParserResults_xml is TBS.
func (*ParserResults_xml) NodeCount ¶
func (p *ParserResults_xml) NodeCount() int
func (*ParserResults_xml) NodeDebug ¶
func (p *ParserResults_xml) NodeDebug(i int) string
func (*ParserResults_xml) NodeEcho ¶
func (p *ParserResults_xml) NodeEcho(i int) string
func (*ParserResults_xml) NodeInfo ¶
func (p *ParserResults_xml) NodeInfo(i int) string
type XmlCatalogFile ¶
type XmlCatalogFile struct { XMLName xml.Name `xml:"catalog"` // Prefer is "public" or "system" Prefer string `xml:"prefer,attr"` XmlPublicIDsubrecords []PIDSIDcatalogFileRecord `xml:"public"` // AbsFilePath is so we can peel off the directory path AbsFilePath string }
XmlCatalogFile represents a parsed XML catalog file, at the top level.
func NewXmlCatalogFile ¶
func NewXmlCatalogFile(fpath string) (pXC *XmlCatalogFile, err error)
NewXmlCatalogFile is a convenience function that reads in the file and then processes the file contents. It is not clear what the constraints on the path are (but a relative path should work okay).
func (*XmlCatalogFile) GetByPublicID ¶
func (p *XmlCatalogFile) GetByPublicID(s string) *PIDSIDcatalogFileRecord
GetByPublicID retrieves from [XmlPublicIDsubrecords].
func (*XmlCatalogFile) Validate ¶
func (p *XmlCatalogFile) Validate() (retval bool)
Validate validates an XML catalog. It checks that the listed files exist and that the IDs (as strings that are not parsed yet) are well-formed. It assumes that the catalog has already been loaded from an XML catalog file on-disk. The return value is false if _any_ entry fails to load, but also each entry has its own error field.
type XmlContype ¶
type XmlContype string
XmlContype categorizes the XML file. See variable 8XmlContypes9.
type XmlDoctype ¶
type XmlDoctype string
XmlDoctype is just a DOCTYPE string, for example: <!DOCTYPE html>
type XmlPeek ¶
type XmlPeek struct { PreambleRaw CT.Raw // string DoctypeRaw CT.Raw // string HasDTDstuff bool ContentityBasics }
XmlPeek is used by [fileutils.AnalyseFile] when preparing a [fileutils.AnalysisRecord]. Note that ContentityBasics has chunks of Raw but not the full "Raw" string. .
func Peek_xml ¶
Peek_xml takes a string and does the minimum to find XML preamble, DOCTYPE, root element, whether DTD stuff was encountered, and the locations of outer elements containing metadata and body text.
It uses the Go stdlib parser, so success in finding a root element in this function all but guarantees that the string is valid XML.
It is called by [fileutils.AnalyzeFile]. .
type XmlPublicID ¶
type XmlPublicID string
XmlPublicID = PID = Public ID = FPI = Formal Public Identifier
type XmlSystemID ¶
type XmlSystemID string
XmlSystemID = SID = System ID = URI (Universal Resource Identifier) (can be a filepath or an HTTP address)