myst_parser.parsers.parse_html

A simple but complete HTML to Abstract Syntax Tree (AST) parser.

The AST can also reproduce the HTML text.

Example:

>>> from myst_parser.parsers.parse_html import tokenize_html
>>> text = '<div class="note"><p>text</p></div>'
>>> ast = tokenize_html(text)
>>> list(ast.walk(include_self=True))
[Root(''), Tag('div', {'class': 'note'}), Tag('p'), Data('text')]
>>> str(ast)
'<div class="note"><p>text</p></div>'
>>> str(ast[0][0])
'<p>text</p>'

Note: optional tags are not accounted for (see https://html.spec.whatwg.org/multipage/syntax.html#optional-tags)
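To make the note above concrete, here is a minimal sketch that uses only the standard library's html.parser.HTMLParser (the base class of HtmlToAst, per the API section). It shows that the underlying tokenizer reports no end-tag event when an optional end tag is omitted. EventLogger and tag_events are hypothetical helpers written for this illustration, not part of myst_parser.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record start/end tag events in order."""

    def __init__(self) -> None:
        super().__init__()
        self.events: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(html: str) -> list[tuple[str, str]]:
    """Feed the markup to a fresh logger and return its events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# The end tag of <p> is optional in HTML, so this markup is valid,
# but the tokenizer never synthesizes the missing </p> events.
print(tag_events("<p>one<p>two"))  # [('start', 'p'), ('start', 'p')]
print(tag_events("<p>one</p>"))    # [('start', 'p'), ('end', 'p')]
```

A tree built purely from these events would presumably nest the second <p> inside the first instead of treating the two as siblings, which is why markup that relies on optional tags may not round-trip as expected.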

1.  Module Contents

1.1.  Classes

Attribute

This class holds the tag’s attributes.

Element

An Element of the xml/html document.

Root

The root of the AST tree.

Tag

Represent xml/html tags of the form: <name key="value" …> … </name>.

XTag

Represent XHTML style tags with no children, like <img src="t.gif" />

VoidTag

Represent tags with no children, only start tag, like <img src="t.gif">

TerminalElement

Data

Represent data inside xml/html documents, like raw text.

Declaration

Represent declarations, like <!DOCTYPE html>

Comment

Represent HTML comments

Pi

Represent processing instructions like <?xml-stylesheet ?>

Char

Represent character codes like: &#0

Entity

Represent entities like &amp

Tree

The engine class to generate the AST tree.

HtmlToAst

The tokenizer class.

1.2.  Functions

tokenize_html

1.3.  API

class myst_parser.parsers.parse_html.Attribute

Bases: dict

This class holds the tag’s attributes.

Initialization

Initialize self. See help(type(self)) for accurate signature.

property classes: list[str]

Return the ‘class’ attribute as a list.
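As a rough illustration of what such a property can look like, the sketch below defines a hypothetical AttrSketch class (not the library's Attribute). It assumes that class names are separated by whitespace, as in HTML, and that a missing ‘class’ key yields an empty list.

```python
class AttrSketch(dict):
    """Hypothetical stand-in for a dict of tag attributes."""

    @property
    def classes(self) -> list[str]:
        # Assumption: class names are whitespace-separated, and a missing
        # 'class' key is treated as an empty string.
        return self.get("class", "").split()


attrs = AttrSketch({"class": "note  warning", "id": "n1"})
print(attrs.classes)         # ['note', 'warning']
print(AttrSketch().classes)  # []
```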

class myst_parser.parsers.parse_html.Element(name: str = '', attr: dict | None = None)

Bases: collections.abc.MutableSequence

An Element of the xml/html document.

All xml/html entities inherit from this class.

Initialization

Initialise the element.

property parent: myst_parser.parsers.parse_html.Element | None

Return parent.

property children: list[myst_parser.parsers.parse_html.Element]

Return copy of children.

reset_children(children: list[myst_parser.parsers.parse_html.Element], deepcopy: bool = False)

insert(index: int, item: myst_parser.parsers.parse_html.Element)

deepcopy() → myst_parser.parsers.parse_html.Element

Recursively copy and remove parent.

abstract render(tag_overrides: dict[str, Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str

Return an HTML string representation of the element.

Parameters:

tag_overrides – A dictionary of render functions, keyed by tag name, that override the normal render format for those tags.
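The override mechanism can be sketched with a small stand-alone class. NodeSketch and render_b_as_strong are hypothetical names for this illustration only. The sketch assumes that an override callable receives the element and the overrides mapping (matching the Callable[[Element, dict], str] shape in the signature) and returns the full replacement string; the library's actual calling convention may differ.

```python
from typing import Callable, Optional


class NodeSketch:
    """Hypothetical element: a tag name plus child nodes or plain strings."""

    def __init__(self, name: str, children: Optional[list] = None) -> None:
        self.name = name
        self.children = children or []

    def render(self, tag_overrides: Optional[dict[str, Callable]] = None) -> str:
        # If an override is registered for this tag name, delegate to it.
        if tag_overrides and self.name in tag_overrides:
            return tag_overrides[self.name](self, tag_overrides)
        inner = "".join(
            child if isinstance(child, str) else child.render(tag_overrides)
            for child in self.children
        )
        return f"<{self.name}>{inner}</{self.name}>"


def render_b_as_strong(node: NodeSketch, overrides: dict) -> str:
    # Assumed override signature: (element, overrides mapping) -> str.
    inner = "".join(
        child if isinstance(child, str) else child.render(overrides)
        for child in node.children
    )
    return f"<strong>{inner}</strong>"


tree = NodeSketch("p", ["a ", NodeSketch("b", ["bold"]), " word"])
print(tree.render())                           # <p>a <b>bold</b> word</p>
print(tree.render({"b": render_b_as_strong}))  # <p>a <strong>bold</strong> word</p>
```

Passing the overrides mapping down to the children keeps the override active at every depth, not just for the node on which render was first called.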

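walk(include_self: bool = False) → Iterator[myst_parser.parsers.parse_html.Element]

Walk through the xml/html AST, depth first, yielding each parent before its children (the order shown in the example at the top of this page).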
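A depth-first walk of this kind can be written as a short recursive generator. The sketch below uses a hypothetical NodeSketch class and assumes the parent-before-children order seen in the module example; it illustrates the traversal pattern and is not the library's code.

```python
from collections.abc import Iterator


class NodeSketch:
    """Hypothetical tree node with a name and a list of children."""

    def __init__(self, name: str, children=None) -> None:
        self.name = name
        self.children = list(children or [])

    def walk(self, include_self: bool = False) -> Iterator["NodeSketch"]:
        # Assumed order: depth first, each parent before its children.
        if include_self:
            yield self
        for child in self.children:
            yield child
            yield from child.walk()


tree = NodeSketch(
    "root",
    [NodeSketch("div", [NodeSketch("p", [NodeSketch("text")])]), NodeSketch("hr")],
)
print([node.name for node in tree.walk(include_self=True)])
# ['root', 'div', 'p', 'text', 'hr']
print([node.name for node in tree.walk()])
# ['div', 'p', 'text', 'hr']
```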

strip(inplace: bool = False, recurse: bool = False) → myst_parser.parsers.parse_html.Element

Return a copy with all Data tokens that contain only whitespace or newlines removed.

find(identifier: str | type[myst_parser.parsers.parse_html.Element], attrs: dict | None = None, classes: Iterable[str] | None = None, include_self: bool = False, recurse: bool = True) → Iterator[myst_parser.parsers.parse_html.Element]

Find all elements that match the given name (or element type) and the specified attributes and classes.
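The filtering described by the find signature can be sketched as follows. NodeSketch and find_sketch are hypothetical names. The sketch assumes that a string identifier matches on tag name, a class identifier matches with isinstance, every requested class must be present, and every requested attribute must match exactly; the include_self and recurse options are left out for brevity.

```python
import inspect
from collections.abc import Iterable, Iterator
from typing import Optional, Union


class NodeSketch:
    """Hypothetical element with a name, an attribute dict, and children."""

    def __init__(self, name: str, attrs: Optional[dict] = None, children=()) -> None:
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = list(children)

    def walk(self) -> Iterator["NodeSketch"]:
        for child in self.children:
            yield child
            yield from child.walk()


def find_sketch(
    root: NodeSketch,
    identifier: Union[str, type],
    attrs: Optional[dict] = None,
    classes: Optional[Iterable[str]] = None,
) -> Iterator[NodeSketch]:
    wanted = set(classes) if classes is not None else None
    for node in root.walk():
        # A class identifier matches by type, a string matches by tag name.
        if inspect.isclass(identifier):
            if not isinstance(node, identifier):
                continue
        elif node.name != identifier:
            continue
        # Every requested class must be present on the node.
        node_classes = node.attrs.get("class", "").split()
        if wanted is not None and not wanted.issubset(node_classes):
            continue
        # Every requested attribute must match exactly.
        if all(node.attrs.get(k) == v for k, v in (attrs or {}).items()):
            yield node


tree = NodeSketch("root", children=[
    NodeSketch("div", {"class": "note big", "id": "a"}),
    NodeSketch("div", {"class": "note"}, [NodeSketch("p", {"id": "b"})]),
])
print([n.attrs.get("id") for n in find_sketch(tree, "div", classes=["note"])])
# ['a', None]
print([n.attrs.get("id") for n in find_sketch(tree, "div", classes=["note", "big"])])
# ['a']
print([n.name for n in find_sketch(tree, NodeSketch, attrs={"id": "b"})])
# ['p']
```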

class myst_parser.parsers.parse_html.Root(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

The root of the AST tree.

Initialization

Initialise the element.

render(**kwargs) → str

Return an HTML string representation of the structure.

class myst_parser.parsers.parse_html.Tag(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

Represent xml/html tags of the form: <name key="value" …> … </name>.

Initialization

Initialise the element.

render(tag_overrides: dict[str, Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str

class myst_parser.parsers.parse_html.XTag(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

Represent XHTML style tags with no children, like <img src="t.gif" />

Initialization

Initialise the element.

render(tag_overrides: dict[str, Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str

class myst_parser.parsers.parse_html.VoidTag(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

Represent tags with no children, only start tag, like <img src="t.gif">

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.TerminalElement(data: str)

Bases: myst_parser.parsers.parse_html.Element

deepcopy() → myst_parser.parsers.parse_html.TerminalElement

Copy and remove parent.

class myst_parser.parsers.parse_html.Data(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent data inside xml/html documents, like raw text.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Declaration(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent declarations, like <!DOCTYPE html>

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Comment(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent HTML comments

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Pi(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent processing instructions like <?xml-stylesheet ?>

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Char(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent character codes like: &#0

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Entity(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent entities like &amp

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Tree(name: str = '')

The engine class to generate the AST tree.

Initialization

Initialise Tree

clear()

Clear the outermost element and the stack before a new parse.

last() → myst_parser.parsers.parse_html.Element

Return the last pointer, which points to the current tag scope.

nest_tag(name: str, attrs: dict)

Nest a given tag at the bottom of the tree, using the last pointer on the stack.

nest_xtag(name: str, attrs: dict)

Nest an XTag onto the tree.

nest_vtag(name: str, attrs: dict)

Nest a VoidTag onto the tree.

nest_terminal(klass: type[myst_parser.parsers.parse_html.TerminalElement], data: str)

Nest the data onto the tree.

enclose(name: str)

When a closing tag is found, pop the pointer’s scope from the stack, so that the pointer returns to the enclosing scope’s tag.
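The stack-and-pointer idea behind these methods can be sketched in a few lines. TreeSketch is a hypothetical class that stores nodes as plain dicts. Its handling of closing tags is an assumption made for this sketch: a closing tag pops back to the nearest matching open tag, discarding any unclosed scopes nested inside it, and a stray closing tag is ignored.

```python
class TreeSketch:
    """Hypothetical stack-based builder; nodes are plain dicts."""

    def __init__(self) -> None:
        self.root = {"name": "", "children": []}
        self.stack = [self.root]

    def last(self) -> dict:
        # The top of the stack is the scope new nodes are nested into.
        return self.stack[-1]

    def nest_tag(self, name: str) -> None:
        node = {"name": name, "children": []}
        self.last()["children"].append(node)
        self.stack.append(node)  # the new tag becomes the current scope

    def nest_terminal(self, data: str) -> None:
        # Terminal nodes never open a scope, so the stack is untouched.
        self.last()["children"].append({"name": "#text", "data": data, "children": []})

    def enclose(self, name: str) -> None:
        # Assumed policy: search from the innermost scope outward (never the
        # root) for the matching open tag, then pop it together with any
        # unclosed scopes nested inside it. Stray closing tags are ignored.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth]["name"] == name:
                del self.stack[depth:]
                return


def shape(node: dict) -> tuple:
    """Reduce a node to (name, [child shapes]) for easy inspection."""
    return (node["name"], [shape(child) for child in node["children"]])


builder = TreeSketch()
builder.nest_tag("div")
builder.nest_tag("p")
builder.nest_terminal("text")
builder.enclose("p")     # back to the <div> scope
builder.nest_tag("span")
builder.enclose("div")   # closes <div> and the unclosed <span> inside it
print(shape(builder.root))
# ('', [('div', [('p', [('#text', [])]), ('span', [])])])
print(len(builder.stack))  # 1
```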

class myst_parser.parsers.parse_html.HtmlToAst(name: str = '', convert_charrefs: bool = False)

Bases: html.parser.HTMLParser

The tokenizer class.

Initialization

Initialize and reset this instance.

If convert_charrefs is True, all character references are automatically converted to the corresponding Unicode characters. This class defaults to False, unlike the base html.parser.HTMLParser, whose default is True.
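The effect of this flag can be observed with the standard library alone. The sketch below subclasses html.parser.HTMLParser directly; RefLogger and ref_events are hypothetical helpers, not part of myst_parser. With the flag off, references arrive as separate handle_charref and handle_entityref events, which is presumably what allows them to be kept as Char and Entity nodes.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical helper: log text and reference events."""

    def __init__(self, convert_charrefs: bool) -> None:
        super().__init__(convert_charrefs=convert_charrefs)
        self.events: list[tuple[str, str]] = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def ref_events(html: str, convert_charrefs: bool) -> list[tuple[str, str]]:
    """Feed the markup to a fresh logger and return its events."""
    parser = RefLogger(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.events


source = "a &amp; b &#65; c"

# References are reported separately and left undecoded.
print(ref_events(source, convert_charrefs=False))
# [('data', 'a '), ('entityref', 'amp'), ('data', ' b '), ('charref', '65'), ('data', ' c')]

# References are decoded and merged into the surrounding text.
print(ref_events(source, convert_charrefs=True))
# [('data', 'a & b A c')]
```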

void_elements = None

feed(source: str) → myst_parser.parsers.parse_html.Root

Parse the source string.

handle_starttag(name: str, attr)

When an opening tag is found, nest it onto the tree.

handle_startendtag(name: str, attr)

When an XHTML-style tag is found, nest it onto the tree.

handle_endtag(name: str)

When a closing tag is found, move the pointer back to the right scope.

handle_data(data: str)

Nest data onto the tree.

handle_decl(decl: str)

unknown_decl(decl: str)

handle_charref(data: str)

handle_entityref(data: str)

handle_pi(data: str)

handle_comment(data: str)

myst_parser.parsers.parse_html.tokenize_html(text: str, name: str = '', convert_charrefs: bool = False) → myst_parser.parsers.parse_html.Root