myst_parser.parsers.parse_html

`myst_parser.parsers.parse_html`#

A simple but complete HTML to Abstract Syntax Tree (AST) parser.

The AST can also reproduce the HTML text.

Example:

>>> text = '<div class="note"><p>text</p></div>'
>>> ast = tokenize_html(text)
>>> list(ast.walk(include_self=True))
[Root(''), Tag('div', {'class': 'note'}), Tag('p'), Data('text')]
>>> str(ast)
'<div class="note"><p>text</p></div>'
>>> str(ast[0][0])
'<p>text</p>'

Note: optional tags are not accounted for (see https://html.spec.whatwg.org/multipage/syntax.html#optional-tags)
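To make the note concrete, here is a minimal sketch that uses only the standard library's `html.parser.HTMLParser`, which is the documented base class of `HtmlToAst`. `EventLogger` is a hypothetical helper and not part of myst_parser. It shows that the stdlib parser reports no end tag for an `<li>` whose closing tag is omitted, so a tree built directly from these events would presumably nest the second `<li>` inside the first.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record the raw events emitted by the stdlib parser."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
# The closing </li> tags are optional in HTML and are omitted here.
logger.feed("<ul><li>one<li>two</ul>")
logger.close()
events = logger.events
for event in events:
    print(event)
```

No `("end", "li")` event appears in the output; the only end tag reported is the one for `ul`.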

1. Module Contents#

1.1. Classes#

`Attribute`	This class holds the tag’s attributes.
`Element`	An Element of the xml/html document.
`Root`	The root of the AST tree.
`Tag`	Represent xml/html tags under the form: <name key="value" …> … </name>.
`XTag`	Represent XHTML style tags with no children, like <img src="t.gif" />
`VoidTag`	Represent tags with no children, only start tag, like <img src="t.gif" >
`TerminalElement`
`Data`	Represent data inside xml/html documents, like raw text.
`Declaration`	Represent declarations, like <!DOCTYPE html>
`Comment`	Represent HTML comments
`Pi`	Represent processing instructions like <?xml-stylesheet ?>
`Char`	Represent character codes like: &#0
`Entity`	Represent entities like &amp
`Tree`	The engine class to generate the AST tree.
`HtmlToAst`	The tokenizer class.

1.2. Functions#

tokenize_html

1.3. API#

class myst_parser.parsers.parse_html.Attribute[source]#

Bases: dict

This class holds the tag’s attributes.

Initialization

Initialize self. See help(type(self)) for accurate signature.

property classes: list[str]#: Return ‘class’ attribute as list.
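As a rough illustration of the `classes` property, the sketch below defines a hypothetical `AttributeSketch` dict subclass. It is not the library's code. It assumes that class names are whitespace-separated and that a missing `class` key yields an empty list; the real `Attribute` may treat missing keys differently.

```python
from __future__ import annotations


class AttributeSketch(dict):
    """Hypothetical stand-in for Attribute, not the library's implementation."""

    @property
    def classes(self) -> list[str]:
        # Assumption: class names are separated by whitespace, and a
        # missing 'class' key is treated as an empty string.
        return self.get("class", "").split()


attrs = AttributeSketch({"class": "note  warning", "id": "n1"})
print(attrs.classes)
print(AttributeSketch().classes)
```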

class myst_parser.parsers.parse_html.Element(name: str = '', attr: dict | None = None)[source]#

Bases: collections.abc.MutableSequence

An Element of the xml/html document.

All xml/html entities inherit from this class.

Initialization

Initialise the element.

property parent: myst_parser.parsers.parse_html.Element | None#: Return parent.

property children: list[myst_parser.parsers.parse_html.Element]#: Return copy of children.

reset_children(children: list[myst_parser.parsers.parse_html.Element], deepcopy: bool = False)[source]#

insert(index: int, item: myst_parser.parsers.parse_html.Element)[source]#

deepcopy() → myst_parser.parsers.parse_html.Element[source]#: Recursively copy and remove parent.

abstractmethod render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str[source]#

Return an HTML string representation of the element.

Parameters: tag_overrides – A dictionary of render functions keyed by tag name, used to override the normal render format for those tags.
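The following sketch shows one way a `tag_overrides` mapping could work. `Text`, `Node` and `bold_as_markdown` are hypothetical names introduced for illustration. The callable shape `(element, dict) -> str` follows the documented type hint, but treating the second argument as the overrides dictionary itself is an assumption.

```python
from __future__ import annotations

from collections.abc import Callable


class Text:
    """Hypothetical leaf node holding raw text."""

    def __init__(self, data: str):
        self.data = data

    def render(self, tag_overrides=None) -> str:
        return self.data


class Node:
    """Hypothetical tag node; a stand-in for Tag, not the library's code."""

    def __init__(self, name: str, children: list | None = None):
        self.name = name
        self.children = children or []

    def render(
        self, tag_overrides: dict[str, Callable[[Node, dict], str]] | None = None
    ) -> str:
        # An override registered for this tag name replaces the default format.
        if tag_overrides and self.name in tag_overrides:
            return tag_overrides[self.name](self, tag_overrides)
        inner = "".join(child.render(tag_overrides) for child in self.children)
        return f"<{self.name}>{inner}</{self.name}>"


def bold_as_markdown(node: Node, overrides: dict) -> str:
    # Assumption: the second argument is the overrides dict, so the
    # children can be rendered with the same overrides.
    inner = "".join(child.render(overrides) for child in node.children)
    return f"**{inner}**"


tree = Node("p", [Text("a "), Node("b", [Text("bold")])])
default_html = tree.render()
custom = tree.render({"b": bold_as_markdown})
print(default_html)
print(custom)
```

Only the `b` tag is rendered differently; the enclosing `p` tag keeps the default format.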

walk(include_self: bool = False) → collections.abc.Iterator[myst_parser.parsers.parse_html.Element][source]#: Walk through the xml/html AST.
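The element order in the module example (`Root`, `div`, `p`, `text`) is consistent with a pre-order, depth-first traversal. The sketch below reproduces that order with a hypothetical `Node` class. It assumes pre-order traversal and is not the library's implementation.

```python
from __future__ import annotations

from collections.abc import Iterator


class Node:
    """Hypothetical minimal tree node, standing in for Element."""

    def __init__(self, name: str, children: list[Node] | None = None):
        self.name = name
        self.children = children or []

    def walk(self, include_self: bool = False) -> Iterator[Node]:
        # Pre-order, depth-first: a node is yielded before its descendants.
        if include_self:
            yield self
        for child in self.children:
            yield child
            yield from child.walk()


tree = Node("root", [Node("div", [Node("p", [Node("text")])]), Node("hr")])
names = [node.name for node in tree.walk(include_self=True)]
without_self = [node.name for node in tree.walk()]
print(names)
```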

strip(inplace: bool = False, recurse: bool = False) → myst_parser.parsers.parse_html.Element[source]#: Return copy with all Data tokens that only contain whitespace / newlines removed.
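A minimal sketch of the whitespace-stripping idea, using plain tuples and strings instead of `Element` and `Data` objects. The `(name, children)` tuple layout and the `strip_whitespace` function are hypothetical and chosen only to keep the example short.

```python
def strip_whitespace(node, recurse=False):
    """Return a copy of (name, children) without whitespace-only text children."""
    name, children = node
    kept = []
    for child in children:
        if isinstance(child, str):
            # Text that is empty after stripping is dropped.
            if child.strip() == "":
                continue
            kept.append(child)
        else:
            kept.append(strip_whitespace(child, recurse=True) if recurse else child)
    return (name, kept)


tree = ("div", ["\n  ", ("p", ["text", " "]), "\n"])
shallow = strip_whitespace(tree)
deep = strip_whitespace(tree, recurse=True)
print(shallow)
print(deep)
```

Without `recurse`, only the direct children of the node are filtered, so the trailing space inside `p` survives in `shallow` but not in `deep`.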

find(identifier: str | type[myst_parser.parsers.parse_html.Element], attrs: dict | None = None, classes: collections.abc.Iterable[str] | None = None, include_self: bool = False, recurse: bool = True) → collections.abc.Iterator[myst_parser.parsers.parse_html.Element][source]#: Find all elements that match name and specific attributes.
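The sketch below illustrates plausible matching rules for a `find`-style search: match by tag name, require every requested class to be present, and require every requested attribute to be equal. `Node` is hypothetical and these rules are assumptions. The documented `find` additionally accepts an element type as identifier, plus the `include_self` and `recurse` flags, none of which are modelled here.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


class Node:
    """Hypothetical minimal node with a name, attributes and children."""

    def __init__(
        self,
        name: str,
        attrs: dict | None = None,
        children: list[Node] | None = None,
    ):
        self.name = name
        self.attrs = attrs or {}
        self.children = children or []

    def walk(self) -> Iterator[Node]:
        for child in self.children:
            yield child
            yield from child.walk()

    def find(
        self,
        name: str,
        attrs: dict | None = None,
        classes: Iterable[str] | None = None,
    ) -> Iterator[Node]:
        wanted = set(classes) if classes is not None else None
        for node in self.walk():
            if node.name != name:
                continue
            # Assumption: every requested class must be present on the node.
            node_classes = node.attrs.get("class", "").split()
            if wanted is not None and not wanted.issubset(node_classes):
                continue
            # Assumption: every requested attribute must match exactly.
            if all(node.attrs.get(k) == v for k, v in (attrs or {}).items()):
                yield node


tree = Node("root", children=[
    Node("div", {"class": "note big", "id": "a"}, [Node("div", {"class": "tip"})]),
    Node("p", {"class": "note"}),
])
notes = [node.attrs["id"] for node in tree.find("div", classes=["note"])]
all_divs = list(tree.find("div"))
by_id = list(tree.find("div", attrs={"id": "a"}))
print(notes, len(all_divs), len(by_id))
```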

class myst_parser.parsers.parse_html.Root(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

The root of the AST tree.

Initialization

Initialise the element.

render(**kwargs) → str[source]#: Return an HTML string representation of the structure.

class myst_parser.parsers.parse_html.Tag(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

Represent xml/html tags under the form: <name key="value" …> … </name>.

Initialization

Initialise the element.

render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str[source]#

class myst_parser.parsers.parse_html.XTag(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

Represent XHTML style tags with no children, like <img src="t.gif" />

Initialization

Initialise the element.

render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str[source]#

class myst_parser.parsers.parse_html.VoidTag(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

Represent tags with no children, only start tag, like <img src="t.gif" >

Initialization

Initialise the element.

render(**kwargs) → str[source]#

class myst_parser.parsers.parse_html.TerminalElement(data: str)[source]#

Bases: myst_parser.parsers.parse_html.Element

deepcopy() → myst_parser.parsers.parse_html.TerminalElement[source]#: Copy and remove parent.

class myst_parser.parsers.parse_html.Data(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent data inside xml/html documents, like raw text.

Initialization

Initialise the element.

render(**kwargs) → str[source]#

class myst_parser.parsers.parse_html.Declaration(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent declarations, like <!DOCTYPE html>

Initialization

Initialise the element.

render(**kwargs) → str[source]#

class myst_parser.parsers.parse_html.Comment(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent HTML comments

Initialization

Initialise the element.

render(**kwargs) → str[source]#

class myst_parser.parsers.parse_html.Pi(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent processing instructions like <?xml-stylesheet ?>

Initialization

Initialise the element.

render(**kwargs) → str[source]#

class myst_parser.parsers.parse_html.Char(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent character codes like: &#0

Initialization

Initialise the element.

render(**kwargs) → str[source]#

class myst_parser.parsers.parse_html.Entity(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent entities like &amp

Initialization

Initialise the element.

render(**kwargs) → str[source]#

class myst_parser.parsers.parse_html.Tree(name: str = '')[source]#

The engine class to generate the AST tree.

Initialization

Initialise Tree

clear()[source]#: Clear the outmost element and the stack before a new parse.

last() → myst_parser.parsers.parse_html.Element[source]#: Return the last pointer on the stack, which points to the current tag scope.

nest_tag(name: str, attrs: dict)[source]#: Nest a given tag at the bottom of the tree using the stack’s last pointer.

nest_xtag(name: str, attrs: dict)[source]#: Nest an XTag onto the tree.

nest_vtag(name: str, attrs: dict)[source]#: Nest a VoidTag onto the tree.

nest_terminal(klass: type[myst_parser.parsers.parse_html.TerminalElement], data: str)[source]#: Nest the data onto the tree.

enclose(name: str)[source]#: When a closing tag is found, pop the pointer’s scope from the stack so that the pointer refers to the enclosing scope’s tag again.
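The stack-of-scopes idea behind `Tree` can be sketched as follows. `TreeSketch` is a hypothetical, simplified stand-in that stores nodes as plain dicts. How an unmatched closing tag is handled (ignored here) is an assumption and may not match the library.

```python
class TreeSketch:
    """Hypothetical stack-based tree builder, not the library's Tree class."""

    def __init__(self):
        self.root = {"name": "", "children": []}
        self.stack = [self.root]  # the last item is the current scope

    def last(self):
        return self.stack[-1]

    def nest_tag(self, name):
        node = {"name": name, "children": []}
        self.last()["children"].append(node)
        self.stack.append(node)  # the new tag becomes the current scope

    def nest_terminal(self, data):
        # Terminal items are attached but never open a new scope.
        self.last()["children"].append(data)

    def enclose(self, name):
        # Pop back to, and including, the nearest open tag with this name.
        # Assumption: an unmatched closing tag is ignored; the root is never popped.
        for depth, node in enumerate(reversed(self.stack[1:]), start=1):
            if node["name"] == name:
                del self.stack[-depth:]
                break


tree = TreeSketch()
tree.nest_tag("div")
tree.nest_tag("p")
tree.nest_terminal("text")
tree.enclose("p")
tree.enclose("div")
tree.enclose("span")  # unmatched closing tag: no effect in this sketch
print(tree.root)
```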

class myst_parser.parsers.parse_html.HtmlToAst(name: str = '', convert_charrefs: bool = False)[source]#

Bases: html.parser.HTMLParser

The tokenizer class.

Initialization

Initialize and reset this instance.

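If convert_charrefs is True, all character references are automatically converted to the corresponding Unicode characters. This description is inherited from html.parser.HTMLParser, where True is the default; for HtmlToAst the signature above sets the default to False.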
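The effect of `convert_charrefs` can be observed with the standard library parser alone. In the sketch below, `RefLogger` and `log_events` are hypothetical helpers. With `convert_charrefs=False` the parser reports entity and character references through separate handler calls, which is presumably what allows `HtmlToAst` to keep them as `Entity` and `Char` nodes. With `True` they arrive already decoded as ordinary data.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical helper: record text and reference events."""

    def __init__(self, convert_charrefs: bool):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entity", name))

    def handle_charref(self, name):
        self.events.append(("char", name))


def log_events(source: str, convert_charrefs: bool) -> list:
    parser = RefLogger(convert_charrefs)
    parser.feed(source)
    parser.close()
    return parser.events


source = "<p>a &amp; b &#65;</p>"
kept = log_events(source, convert_charrefs=False)
converted = log_events(source, convert_charrefs=True)
print(kept)
print(converted)
```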

void_elements = None#

feed(source: str) → myst_parser.parsers.parse_html.Root[source]#: Parse the source string.

handle_starttag(name: str, attr)[source]#: When an opening tag is found, nest it onto the tree.

handle_startendtag(name: str, attr)[source]#: When an XHTML-style tag is found, nest it onto the tree.

handle_endtag(name: str)[source]#: When a closing tag is found, move the pointer back to the right scope.

handle_data(data: str)[source]#: Nest data onto the tree.

handle_decl(decl: str)[source]#

unknown_decl(decl: str)[source]#

handle_charref(data: str)[source]#

handle_entityref(data: str)[source]#

handle_pi(data: str)[source]#

handle_comment(data: str)[source]#
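To show how handler methods like those above can feed a tree, here is a small sketch built on `html.parser.HTMLParser`. `MiniBuilder` and the `VOID` set are hypothetical. The set is a tiny illustrative subset, not the value of `void_elements`, and nodes are plain `(name, children)` tuples rather than `Element` objects.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of HTML void elements.
VOID = {"br", "hr", "img"}


class MiniBuilder(HTMLParser):
    """Hypothetical parser that builds nested (name, children) tuples."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.root = ("", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID:
            self.stack.append(node)  # only non-void tags open a new scope

    def handle_endtag(self, tag):
        # Close the nearest open scope with this name; ignore unmatched tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth][0] == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        self.stack[-1][1].append(data)


builder = MiniBuilder()
builder.feed("<div>a<br>b</div>")
builder.close()
print(builder.root)
```

Because `br` is treated as void, the text `b` stays a sibling of `br` inside `div` instead of becoming its child.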

myst_parser.parsers.parse_html.tokenize_html(text: str, name: str = '', convert_charrefs: bool = False) → myst_parser.parsers.parse_html.Root[source]#
