htmltok - HTML and XML tokenizer and normalizer

Documentation Index

This library splits HTML code into semantic units such as “beginning of open tag”, “attribute name”, “attribute value”, “comment”, etc. It respects preprocessing instructions (like <?...?>), so it can be used to implement HTML-based templating languages.

This library can also tokenize XML markup, but it is HTML5-centric: when named entities are decoded, the HTML5 ones are recognized. Decoding is not part of tokenization, and happens only when you call Token.getValue().

During tokenization, this library finds errors in the markup, such as unclosed tags and duplicate attribute names, and suggests fixes. It can be used to convert HTML to canonical form.

Example

// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/htmltok/v3.0.1/README.md' | perl -ne 's/^> //; $y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&$m+/<example-p9mn>/' > /tmp/example-p9mn.ts
// deno run /tmp/example-p9mn.ts

import {htmltok, TokenType} from 'https://deno.land/x/htmltok@v3.0.1/mod.ts';
import {assertEquals} from 'jsr:@std/assert@1.0.14/equals';

const source =
`	<meta name=viewport content="width=device-width, initial-scale=1.0">
	<div title="&quot;Title&quot;">
		Text.
	</div>
`;

assertEquals
(	[...htmltok(source)].map(v => Object.assign<Record<never, never>, unknown>({}, v)),
    [	{nLine: 1,  nColumn: 1,  level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TEXT,                         text: "\t"},
        {nLine: 1,  nColumn: 5,  level: 0, tagName: "meta",    isSelfClosing: false, isForeign: false, type: TokenType.TAG_OPEN_BEGIN,               text: "<meta"},
        {nLine: 1,  nColumn: 10, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TAG_OPEN_SPACE,               text: " "},
        {nLine: 1,  nColumn: 11, level: 0, tagName: "meta",    isSelfClosing: false, isForeign: false, type: TokenType.ATTR_NAME,                    text: "name"},
        {nLine: 1,  nColumn: 15, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.ATTR_EQ,                      text: "="},
        {nLine: 1,  nColumn: 16, level: 0, tagName: "meta",    isSelfClosing: false, isForeign: false, type: TokenType.ATTR_VALUE,                   text: "viewport"},
        {nLine: 1,  nColumn: 24, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TAG_OPEN_SPACE,               text: " "},
        {nLine: 1,  nColumn: 25, level: 0, tagName: "meta",    isSelfClosing: false, isForeign: false, type: TokenType.ATTR_NAME,                    text: "content"},
        {nLine: 1,  nColumn: 32, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.ATTR_EQ,                      text: "="},
        {nLine: 1,  nColumn: 33, level: 0, tagName: "meta",    isSelfClosing: false, isForeign: false, type: TokenType.ATTR_VALUE,                   text: "\"width=device-width, initial-scale=1.0\""},
        {nLine: 1,  nColumn: 72, level: 0, tagName: "",        isSelfClosing: true,  isForeign: false, type: TokenType.TAG_OPEN_END,                 text: ">"},
        {nLine: 1,  nColumn: 73, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TEXT,                         text: "\n\t"},
        {nLine: 2,  nColumn: 5,  level: 0, tagName: "div",     isSelfClosing: false, isForeign: false, type: TokenType.TAG_OPEN_BEGIN,               text: "<div"},
        {nLine: 2,  nColumn: 9,  level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TAG_OPEN_SPACE,               text: " "},
        {nLine: 2,  nColumn: 10, level: 0, tagName: "div",     isSelfClosing: false, isForeign: false, type: TokenType.ATTR_NAME,                    text: "title"},
        {nLine: 2,  nColumn: 15, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.ATTR_EQ,                      text: "="},
        {nLine: 2,  nColumn: 16, level: 0, tagName: "div",     isSelfClosing: false, isForeign: false, type: TokenType.ATTR_VALUE,                   text: "\"&quot;Title&quot;\""},
        {nLine: 2,  nColumn: 35, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TAG_OPEN_END,                 text: ">"},
        {nLine: 2,  nColumn: 36, level: 1, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TEXT,                         text: "\n\t\tText.\n\t"},
        {nLine: 4,  nColumn: 5,  level: 0, tagName: "div",     isSelfClosing: false, isForeign: false, type: TokenType.TAG_CLOSE,                    text: "</div>"},
        {nLine: 4,  nColumn: 11, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.MORE_REQUEST,                 text: "\n"},
        {nLine: 4,  nColumn: 11, level: 0, tagName: "",        isSelfClosing: false, isForeign: false, type: TokenType.TEXT,                         text: "\n"},
    ]
);

for (const token of htmltok(source))
{	//console.log(token.debug());
    if (token.type == TokenType.ATTR_VALUE)
    {	console.log(`Attribute value: ${token.getValue()}`);
    }
}

Prints:

Attribute value: viewport
Attribute value: width=device-width, initial-scale=1.0
Attribute value: "Title"

htmltok() - Tokenize string

function htmltok(source: string, settings: Settings={}, hierarchy: string[]=new Array<string>, tabWidth: number=4, nLine: number=1, nColumn: number=1): Generator<Token, void, string>

This function returns an iterator over the tokens found in the given HTML source string.

htmltok() arguments:

  • source - HTML or XML string.
  • settings - Affects how the code will be parsed.
  • hierarchy - If you pass an array, it will be modified during tokenization, after each token is yielded. It reflects the current element nesting hierarchy. For normal operation pass an empty array; if you resume parsing from some point, you can provide the initial hierarchy. All tag names in it are lowercased.
  • tabWidth - Width of TAB stops. Affects nColumn of returned tokens.
  • nLine - Will start counting lines from this line number.
  • nColumn - Will start counting columns from this column number.
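
When only a fragment of a larger document is tokenized, nLine and nColumn let the reported positions refer to the original file. The sketch below computes a starting position for a character offset. positionAt is a hypothetical helper, not part of htmltok. It assumes "\n" line endings, counts UTF-16 code units, and assumes a TAB advances to the next tab stop. That matches the positions in the example above, but is not confirmed for every input.

```typescript
// Hypothetical helper (not part of htmltok): 1-based line and column of a character offset.
// Assumptions: lines end with "\n"; columns count UTF-16 code units;
// a TAB moves the column to the next tab stop (1, 1+tabWidth, 1+2*tabWidth, ...).
function positionAt(source: string, offset: number, tabWidth = 4): {nLine: number, nColumn: number} {
    let nLine = 1;
    let nColumn = 1;
    const end = Math.min(offset, source.length);
    for (let i = 0; i < end; i++) {
        const c = source[i];
        if (c === '\n') {
            // A new line resets the column counter
            nLine++;
            nColumn = 1;
        } else if (c === '\t') {
            // Jump to the next tab stop
            nColumn += tabWidth - (nColumn - 1) % tabWidth;
        } else {
            nColumn++;
        }
    }
    return {nLine, nColumn};
}
```

The result could then be passed on with the documented argument order, for example htmltok(fragment, {}, [], 4, pos.nLine, pos.nColumn).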

This function returns a Token iterator.

Before yielding the last token in the source, this function generates a TokenType.MORE_REQUEST token. You can ignore it, or you can respond by passing a string with the continuation of the source to the next it.next(more) call on the iterator. In that case the string is appended to the last token, and tokenization continues.

// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/htmltok/v3.0.1/README.md' | perl -ne 's/^> //; $y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&$m+/<example-65ya>/' > /tmp/example-65ya.ts
// deno run /tmp/example-65ya.ts

import {htmltok, TokenType} from 'https://deno.land/x/htmltok@v3.0.1/mod.ts';

let source =
`	<meta name=viewport content="width=device-width, initial-scale=1.0">
	<div title="&quot;Title&quot;">
		Text.
	</div>
`;

function read()
{	const part = source.slice(0, 10);
    source = source.slice(10);
    return part;
}

const it = htmltok(read());
let token;
L:while ((token = it.next().value))
{	while (token.type == TokenType.MORE_REQUEST)
    {	token = it.next(read()).value;
        if (!token)
        {	break L;
        }
    }

    console.log(token.debug());
}

Token

class Token
{
    🔧 constructor(text: string, type: TokenType, nLine: number=1, nColumn: number=1, level: number=0, tagName: string="", isSelfClosing: boolean=false, isForeign: boolean=false)
    📄 text: string
    📄 type: TokenType
    📄 nLine: number
    📄 nColumn: number
    📄 level: number
    📄 tagName: string
    📄 isSelfClosing: boolean
    📄 isForeign: boolean
    ⚙ toString(): string
    ⚙ normalized(): string
    ⚙ debug(): string
    ⚙ getValue(): string
}

Token.toString() - returns the original token text (Token.text), except for TokenType.MORE_REQUEST and FIX_STRUCTURE_* tokens, for which it returns an empty string.

Token.normalized() - returns the token text as suggested by HTML normalization rules.

Token.debug() - returns the Token object stringified for console.log().

Token.getValue() - returns the decoded value of the token.
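
As an illustration of how these fields and getValue() fit together, here is a minimal sketch that gathers the attributes of each open tag into a plain object. collectAttributes, TokenLike and OpenTag are hypothetical names introduced for this example. The numeric constants are copied from the TokenType listing in this document, so the block runs without importing the library.

```typescript
// Structural stand-in for the Token members used here (hypothetical, so the sketch is self-contained).
interface TokenLike {
    type: number;
    text: string;
    tagName: string;
    getValue(): string;
}

// Hypothetical result shape: one entry per open tag.
interface OpenTag {
    tagName: string;
    attrs: Record<string, string>;
}

// Numeric values copied from the TokenType listing in this document.
const TAG_OPEN_BEGIN = 14, ATTR_NAME = 16, ATTR_VALUE = 18, TAG_OPEN_END = 19;

function collectAttributes(tokens: Iterable<TokenLike>): OpenTag[] {
    const tags: OpenTag[] = [];
    let current: OpenTag | undefined;
    let lastName = '';
    for (const token of tokens) {
        switch (token.type) {
            case TAG_OPEN_BEGIN:
                current = {tagName: token.tagName, attrs: {}};
                lastName = '';
                break;
            case ATTR_NAME:
                if (current) {
                    // An attribute without a value stays as an empty string
                    lastName = token.text;
                    current.attrs[lastName] = '';
                }
                break;
            case ATTR_VALUE:
                if (current && lastName) {
                    // getValue() gives the decoded value, without the surrounding quotes
                    current.attrs[lastName] = token.getValue();
                }
                break;
            case TAG_OPEN_END:
                if (current) {
                    tags.push(current);
                }
                current = undefined;
                break;
        }
    }
    return tags;
}
```

With the real library this would presumably be called as collectAttributes(htmltok(source)), since Token objects carry the same fields.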

TokenType

const enum TokenType
{
    TEXT = 0
    ENTITY = 1
    PI_BEGIN = 2
    PI_MID = 3
    PI_END = 4
    COMMENT_BEGIN = 5
    COMMENT_MID = 6
    COMMENT_MID_PI = 7
    COMMENT_END = 8
    CDATA_BEGIN = 9
    CDATA_MID = 10
    CDATA_MID_PI = 11
    CDATA_END = 12
    DTD = 13
    TAG_OPEN_BEGIN = 14
    TAG_OPEN_SPACE = 15
    ATTR_NAME = 16
    ATTR_EQ = 17
    ATTR_VALUE = 18
    TAG_OPEN_END = 19
    TAG_CLOSE = 20
    RAW_LT = 21
    RAW_AMP = 22
    JUNK = 23
    JUNK_DUP_ATTR_NAME = 24
    FIX_STRUCTURE_TAG_OPEN = 25
    FIX_STRUCTURE_TAG_OPEN_SPACE = 26
    FIX_STRUCTURE_TAG_OPEN_END = 27
    FIX_STRUCTURE_TAG_CLOSE = 28
    FIX_STRUCTURE_ATTR_QUOT = 29
    FIX_STRUCTURE_PI_END = 30
    FIX_STRUCTURE_COMMENT_END = 31
    FIX_STRUCTURE_CDATA_END = 32
    MORE_REQUEST = 33
}
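
The JUNK and FIX_STRUCTURE_* types are how markup problems surface in the token stream. The sketch below turns them into simple messages with positions. describeProblems and PositionedToken are hypothetical names, and the numeric values are copied from the listing above. Treating every JUNK token as a problem is an assumption: with Settings.unquoteAttributes, JUNK also marks quotes that are merely unnecessary.

```typescript
// Structural stand-in for the Token fields used here (hypothetical, so the sketch is self-contained).
interface PositionedToken {
    type: number;
    text: string;
    nLine: number;
    nColumn: number;
}

// Numeric values copied from the TokenType listing in this document.
const JUNK = 23, JUNK_DUP_ATTR_NAME = 24;
const FIX_STRUCTURE_FIRST = 25; // FIX_STRUCTURE_TAG_OPEN
const FIX_STRUCTURE_LAST = 32;  // FIX_STRUCTURE_CDATA_END

function describeProblems(tokens: Iterable<PositionedToken>): string[] {
    const messages: string[] = [];
    for (const t of tokens) {
        if (t.type === JUNK || t.type === JUNK_DUP_ATTR_NAME) {
            // Text that the tokenizer suggests dropping
            messages.push(`${t.nLine}:${t.nColumn}: unexpected ${JSON.stringify(t.text)}`);
        } else if (t.type >= FIX_STRUCTURE_FIRST && t.type <= FIX_STRUCTURE_LAST) {
            // A suggested insertion that repairs the structure
            messages.push(`${t.nLine}:${t.nColumn}: suggested fix (token type ${t.type})`);
        }
    }
    return messages;
}
```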

Settings

interface Settings
{
    📄 mode?: "html" | "xml"
    📄 noCheckAttributes?: boolean
    📄 quoteAttributes?: boolean
    📄 unquoteAttributes?: boolean
    📄 maxTokenLength?: number
}

  • mode - Tokenize in either HTML or XML mode. In XML mode, tag and attribute names are case-sensitive, and there is no special treatment for tags like <script>, <style>, <textarea> and <title>. Also, no tags are self-closing by definition, and /> can be used in any tag to make it self-closing. XML mode also implies Settings.quoteAttributes.
  • noCheckAttributes - If true, will not try to determine duplicate attribute names. This can save some computing resources.
  • quoteAttributes - If true, will generate TokenType.FIX_STRUCTURE_ATTR_QUOT tokens to suggest quotes around unquoted attribute values.
  • unquoteAttributes - If true, will return quotes around attribute values as TokenType.JUNK, if such quotes are not necessary. HTML5 standard allows unquoted attributes (unlike XML), and removing quotes can make markup lighter, and more readable by humans and robots.
  • maxTokenLength - If a single unsplittable token exceeds this length, an exception will be thrown. However, this check is only performed before issuing TokenType.MORE_REQUEST (so tokens can be longer as long as there’s enough space in the buffer). Some tokens are splittable (returned in parts), like comments, CDATA sections, and text, so this setting doesn’t apply to them. Unsplittable tokens include attribute names, attribute values and DTD.

HTML normalization

htmltok() can be used to normalize HTML, that is, to fix markup errors. This includes closing unclosed tags, quoting attributes (in XML or if Settings.quoteAttributes is set), etc.

import {htmltok} from 'https://deno.land/x/htmltok@v3.0.1/mod.ts';

const html = `<a target=_blank>Click here`;
const normalHtml = [...htmltok(html, {quoteAttributes: true})].map(t => t.normalized()).join('');
console.log(normalHtml);

Prints:

<a target="_blank">Click here</a>

Preprocessing instructions

This tokenizer allows you to build template parsers that use the “preprocessing instructions” (PI) feature of XML-like markup languages. There is one limitation, however: PIs must not cross markup boundaries.

If you want to execute preprocessing instructions before parsing the markup, that is simple to do and you don’t need htmltok for it (just str.replace(/<\?[\S\s]*?\?>/g, exec)). Parsers that first recognize the markup structure, maybe split it, and execute PIs in later steps must deal with PIs as part of the markup, and htmltok can help here.
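
The str.replace() approach can be fleshed out as follows. This is only a sketch: executePis and its vars lookup are hypothetical, only the <?=name?> form is handled, and any other PI is replaced with an empty string.

```typescript
// Hypothetical pre-pass (no htmltok needed): executes PIs before the markup is parsed.
// Only "<?=name?>" is supported here; "name" is looked up in vars.
function executePis(source: string, vars: Record<string, string>): string {
    return source.replace(/<\?[\S\s]*?\?>/g, (pi) => {
        // Strip the "<?" and "?>" delimiters
        const body = pi.slice(2, -2).trim();
        if (body.startsWith('=')) {
            const name = body.slice(1).trim();
            return vars[name] ?? '';
        }
        // Other kinds of PI are not interpreted in this sketch
        return '';
    });
}
```

Because the replacement happens on the raw string, this pre-pass doesn't care whether a PI crosses markup boundaries.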

The following code has PIs that cross markup boundaries, so it is not suitable for htmltok:

<!-- Crosses markup boundaries -->
<?='<div'?> id="main"></div>

The following is alright:

<!-- Doesn't cross markup boundaries -->
<<?='div'?> id="main"></<?='div'?>>

htmltokStream() - Tokenize ReadableStream

function htmltokStream(source: ReadableStream<Uint8Array> | Reader, settings: Settings={}, hierarchy: string[]=[], tabWidth: number=4, nLine: number=1, nColumn: number=1, decoder: TextDecoder=new TextDecoder, buffer: number | ArrayBuffer=BUFFER_SIZE): AsyncGenerator<Token, void, any>

This function tokenizes a ReadableStream<Uint8Array> of HTML or XML source code. It never generates TokenType.MORE_REQUEST.

If decoder is provided, it is used to convert bytes to text.

import {htmltokStream} from 'https://deno.land/x/htmltok@v3.0.1/mod.ts';

const res = await fetch("https://example.com/");
// res.body is a ReadableStream<Uint8Array>, which htmltokStream() accepts directly
for await (const token of htmltokStream(res.body!))
{	console.log(token.debug());
}

htmlDecode() - Decode HTML5 entities

function htmlDecode(str: string, skipPi: boolean=false): string

This function decodes entities (character references), like &apos;, &#39; or &#x27;. If skipPi is true, it will operate only on parts between preprocessing instructions.

import {htmlDecode} from 'https://deno.land/x/htmltok@v3.0.1/mod.ts';

console.log(htmlDecode(`Text&amp;text<?&amp;?>text`)); // prints: Text&text<?&?>text
console.log(htmlDecode(`Text&amp;text<?&amp;?>text`, true)); // prints: Text&text<?&amp;?>text