Simple Parser for LaTeX Code

The latexwalker module provides a simple API for parsing LaTeX snippets and representing their contents using a data structure based on node classes.

LatexWalker understands the syntax of most common macros. However, latexwalker is NOT a replacement for a full LaTeX engine. (Originally, latexwalker was designed to extract useful text from LaTeX content for indexing in text database searches.)

You can also use latexwalker directly from the command line, producing JSON or a human-readable node tree:

$ echo '\textit{italic} text' | python -m pylatexenc.latexwalker \
                                --output-format=json --json-compact
[{"nodetype": "LatexMacroNode", "macroname": "textit", "nodeoptarg": null,
"nodeargs": [{"nodetype": "LatexGroupNode", "nodelist": [{"nodetype":
"LatexCharsNode", "chars": "italic"}]}], "macro_post_space": ""},
{"nodetype": "LatexCharsNode", "chars": " text"}]

$ python -m pylatexenc.latexwalker --help
[...]
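
The JSON printed above can be post-processed with any JSON library. As a minimal sketch, the snippet below loads that exact output with Python's standard json module and collects the plain characters. The helper name collect_chars is hypothetical, and the set of keys it descends into (nodeoptarg, nodeargs, nodelist) is an assumption based only on the keys visible in the example output.

```python
import json

# JSON reproduced verbatim from the example command output above.
raw = r'''[{"nodetype": "LatexMacroNode", "macroname": "textit", "nodeoptarg": null,
"nodeargs": [{"nodetype": "LatexGroupNode", "nodelist": [{"nodetype":
"LatexCharsNode", "chars": "italic"}]}], "macro_post_space": ""},
{"nodetype": "LatexCharsNode", "chars": " text"}]'''


def collect_chars(node):
    """Recursively gather the 'chars' strings of a decoded JSON node tree."""
    if isinstance(node, list):
        return "".join(collect_chars(n) for n in node)
    if not isinstance(node, dict):
        return ""  # e.g. None for an absent nodeoptarg or a missing key
    if node.get("nodetype") == "LatexCharsNode":
        return node["chars"]
    # Assumption: child nodes only live under these three keys.
    return "".join(collect_chars(node.get(key))
                   for key in ("nodeoptarg", "nodeargs", "nodelist"))


nodes = json.loads(raw)
plain_text = collect_chars(nodes)
print(plain_text)
```

Macro names are dropped here on purpose; a real text extractor would need per-macro rules for deciding what to keep.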

The main LatexWalker class

class pylatexenc.latexwalker.LatexWalker(s, macro_dict=None, **flags)

A parser which walks through an input stream, parsing it as LaTeX markup.

Arguments:

  • s: the string to parse as LaTeX code
  • macro_dict: a context dictionary of known LaTeX macros. By default, the default global macro dictionary default_macro_dict is used. This should be a dictionary where the keys are macro names (see MacrosDef.macname) and values are MacrosDef instances.

Additional keyword arguments are flags which influence the parsing. Accepted flags are:

  • keep_inline_math=True|False If this option is set to True, then inline math is parsed and stored using LatexMathNode instances. Otherwise, inline math is not treated differently and is simply kept as text.
  • tolerant_parsing=True|False If set to True, then the parser generally ignores syntax errors rather than raising an exception.
  • strict_braces=True|False This option refers specifically to encountering a closing brace where an expression is expected. You generally won’t need to specify this flag; use tolerant_parsing instead.

The methods provided in this class parse parts of the given string s. They typically accept a pos parameter, an integer giving the position in the string s at which to start parsing.

These methods, unless otherwise documented, return a tuple (node, pos, len), where node is a LatexNode describing the parsed content, pos is the position at which the LaTeX element of interest was encountered, and len is the length of the string that is considered to be part of the node. That is, the position in the string that is immediately after the node is pos+len.
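
To make the (node, pos, len) convention concrete, here is a toy sketch that does not use pylatexenc at all. The function read_braced_group is hypothetical and only mimics the return convention described above: it returns the group contents as a plain string instead of a node, and it ignores escaped braces such as \{. The third item is named length to avoid shadowing Python's built-in len.

```python
def read_braced_group(s, pos):
    """Toy reader: return (content, pos, length) for a {...} group at s[pos].

    Simplification: escaped braces (\\{ and \\}) are not treated specially.
    """
    if s[pos] != "{":
        raise ValueError("expected an opening brace at position %d" % pos)
    depth = 0
    for i in range(pos, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                length = i + 1 - pos  # counts both the opening and closing brace
                return s[pos + 1:i], pos, length
    raise ValueError("unbalanced braces")


s = r"\textbf{bold {nested}} tail"
content, pos, length = read_braced_group(s, 7)  # s[7] is the opening brace
rest = s[pos + length:]  # a caller would resume parsing at pos + length
print(repr(content), pos, length, repr(rest))
```

The point is the last step: a caller that parses several elements in a row simply continues at pos + length after each one.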

get_latex_braced_group(pos, brace_type='{')

Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to read a latex group delimited by braces.

Reads a latex expression enclosed in braces { ... }. The first token of s[pos:] must be an opening brace.

Returns a tuple (node, pos, len), where node is a LatexGroupNode instance, pos is the position of the first char of the expression (which has to be an opening brace), and len is the length of the group, including the closing brace (relative to the starting position).

get_latex_environment(pos, environmentname=None)

Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to read a latex environment.

Reads a latex expression enclosed in a \begin{environment}...\end{environment}. The first token in the stream must be the \begin{environment}.

If environmentname is given and nonempty, then additionally a LatexWalkerParseError is raised if the environment in the input stream does not match the provided name.

This function will attempt to heuristically parse an optional argument, and possibly a mandatory argument given to the environment. No space is allowed between \begin{environment} and an opening square bracket or opening brace.

Returns a tuple (node, pos, len) with node being a LatexEnvironmentNode.

get_latex_expression(pos, strict_braces=None)

Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to parse a single LaTeX expression.

Reads a latex expression, e.g. a macro argument. This may be a single char, an escape sequence, or an expression placed in braces. This is what TeX calls a “token” (and not what we call a token… anyway).

Returns a tuple (node, pos, len), where pos is the position of the first char of the expression and len the length of the expression.

get_latex_maybe_optional_arg(pos)

Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to attempt to parse an optional argument.

Attempts to parse an optional argument. If this is successful, the method returns a tuple (node, pos, len), where node is a LatexGroupNode. Otherwise, it returns None.

get_latex_nodes(pos=0, stop_upon_closing_brace=None, stop_upon_end_environment=None, stop_upon_closing_mathmode=None)

Parses the latex content given to the constructor (and stored in self.s) into a list of nodes.

Returns a tuple (nodelist, pos, len) where nodelist is a list of LatexNode’s.

If stop_upon_closing_brace is given and set to a character, then parsing stops once the given closing brace is encountered (but not inside a subgroup). The brace is given as a character, ‘]’ or ‘}’. The returned len includes the closing brace, but the closing brace is not included in any of the nodes in the nodelist.

If stop_upon_end_environment is provided, then parsing stops once the given environment is closed. If there is an environment mismatch, then a LatexWalkerParseError is raised except in tolerant parsing mode (see parse_flags()). Again, the closing environment is included in the length count but not the nodes.

If stop_upon_closing_mathmode is specified, then the parsing stops once the corresponding math mode (assumed already open) is closed. Currently, only inline math modes delimited by $ are supported. I.e., currently, if set, only the value stop_upon_closing_mathmode='$' is valid.

get_token(pos, brackets_are_chars=True, environments=True, keep_inline_math=None)

Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to parse a single “token”, as defined by LatexToken.

Parse the token in the stream pointed to at position pos.

Returns a LatexToken. Raises LatexWalkerEndOfStream if end of stream reached.

If brackets_are_chars=False, then square bracket characters count as ‘brace_open’ and ‘brace_close’ token types (see LatexToken); otherwise (the default) they are considered just like other normal characters.

If environments=False, then ‘begin’ and ‘end’ tokens count as regular ‘macro’ tokens (see LatexToken); otherwise (the default) they are considered as the token types ‘begin_environment’ and ‘end_environment’.

If keep_inline_math is not None, then that value overrides that of self.keep_inline_math for the duration of this method call.

parse_flags()

The parse flags currently set on this object. Returns a dictionary with keys ‘keep_inline_math’, ‘tolerant_parsing’ and ‘strict_braces’.

Exception Classes

class pylatexenc.latexwalker.LatexWalkerError

Generic exception class raised by this module.

class pylatexenc.latexwalker.LatexWalkerParseError(msg, s=None, pos=None)

Parse error. The following attributes are available: msg (the error message), s (the parsed string), pos (the position of the error in the string, 0-based index).
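
The msg, s and pos attributes are enough to build a readable error report. The following is only a sketch: describe_parse_error is a hypothetical helper, and FakeParseError is a stand-in class used so that the snippet runs without pylatexenc installed. It assumes nothing beyond the three attributes documented above, so an actual LatexWalkerParseError caught in an except clause could presumably be passed to it in the same way.

```python
def describe_parse_error(err):
    """Render an error carrying msg, s and pos as a message plus a caret line."""
    s, pos = err.s, err.pos
    if s is None or pos is None:
        return err.msg  # no location information available
    line_start = s.rfind("\n", 0, pos) + 1  # 0 if the error is on the first line
    line_end = s.find("\n", pos)
    if line_end == -1:
        line_end = len(s)
    lineno = s.count("\n", 0, pos) + 1
    col = pos - line_start  # 0-based column, like pos itself
    return "%s (line %d, column %d)\n%s\n%s^" % (
        err.msg, lineno, col + 1, s[line_start:line_end], " " * col)


class FakeParseError(Exception):
    """Hypothetical stand-in exposing the documented msg, s and pos attributes."""

    def __init__(self, msg, s=None, pos=None):
        super().__init__(msg)
        self.msg, self.s, self.pos = msg, s, pos


source = "first line\n\\begin{a} text \\end{b}"
err = FakeParseError("environment mismatch", source, source.index("\\end"))
report = describe_parse_error(err)
print(report)
```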

class pylatexenc.latexwalker.LatexWalkerEndOfStream

Reached end of input stream (e.g., end of file).

Macro Definitions

class pylatexenc.latexwalker.MacrosDef(macname, optarg, numargs)

Class which stores a Macro syntax.

  • macname stores the name of the macro, without the leading backslash.
  • optarg may be one of True, False, or None.
    • if True, the macro expects as first argument an optional argument in square brackets. Then, numargs specifies the number of additional mandatory arguments to the command, given in usual curly braces (or simply as one TeX token)
    • if False, the macro only expects a number of mandatory arguments given by numargs. The mandatory arguments are given in usual curly braces (or simply as one TeX token)
    • if None, then numargs is a string of either characters “{” or “[”, in which each curly brace specifies a mandatory argument and each square bracket specifies an optional argument in square brackets. For example, “{{[{” expects two mandatory arguments, then an optional argument in square brackets, and then another mandatory argument.

pylatexenc.latexwalker.default_macro_dict

The default context dictionary of known LaTeX macros. The keys are the macro names (MacrosDef.macname) and the values are MacrosDef instances.
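
The three forms of the optarg/numargs pair described for MacrosDef can be summarised in a few lines of code. This is a sketch of the documented convention only, not the library's implementation; expand_arg_spec is a hypothetical helper that turns the pair into the ordered list of argument kinds a macro expects.

```python
def expand_arg_spec(optarg, numargs):
    """Return 'optional'/'mandatory' markers in the order arguments are read."""
    if optarg is None:
        # numargs is a string such as "{{[{": one character per argument.
        kinds = {"{": "mandatory", "[": "optional"}
        return [kinds[c] for c in numargs]
    # optarg is True or False; numargs counts the mandatory arguments.
    leading = ["optional"] if optarg else []
    return leading + ["mandatory"] * numargs


print(expand_arg_spec(True, 1))       # one optional, then one mandatory
print(expand_arg_spec(False, 2))      # two mandatory arguments
print(expand_arg_spec(None, "{{[{"))  # the example given in the text
```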

Data Node Classes

class pylatexenc.latexwalker.LatexNode(**kwargs)

Represents an abstract ‘node’ of the latex document.

Use nodeType() to figure out what type of node this is, and isNodeType() to test whether it is of a given type.

isNodeType(t)

Returns True if the current node is of the given type. The argument t must be a Python class such as, e.g. LatexGroupNode.

nodeType()

Returns the class which corresponds to the type of this node. This is a Python class object, that is one of LatexCharsNode, LatexGroupNode, etc.

class pylatexenc.latexwalker.LatexCharsNode(chars, **kwargs)

Bases: pylatexenc.latexwalker.LatexNode

A string of characters in the LaTeX document, without any special LaTeX code.

chars

The string of characters represented by this node.

class pylatexenc.latexwalker.LatexGroupNode(nodelist, **kwargs)

Bases: pylatexenc.latexwalker.LatexNode

A LaTeX group delimited by braces, {like this}.

nodelist

A list of nodes describing the contents of the LaTeX braced group. Each item of the list is a LatexNode.

class pylatexenc.latexwalker.LatexCommentNode(comment, comment_post_space='', **kwargs)

Bases: pylatexenc.latexwalker.LatexNode

A LaTeX comment, delimited by a percent sign until the end of line.

comment

The comment string, including neither the ‘%’ sign nor the following newline.

comment_post_space

The newline that terminated the comment possibly followed by spaces (e.g., indentation spaces of the next line)

class pylatexenc.latexwalker.LatexMacroNode(macroname, nodeoptarg=None, nodeargs=[], macro_post_space='', **kwargs)

Bases: pylatexenc.latexwalker.LatexNode

Represents a ‘macro’ type node, e.g. \textbf.

macroname

The name of the macro (string), without the leading backslash.

nodeoptarg

If non-None, this corresponds to the optional argument of the macro.

nodeargs

A list of arguments to the macro. Each item in the list is a LatexNode.

macro_post_space

Any spaces that were encountered immediately after the macro.

class pylatexenc.latexwalker.LatexEnvironmentNode(envname, nodelist, optargs=[], args=[], **kwargs)

Bases: pylatexenc.latexwalker.LatexNode

A LaTeX Environment Node, i.e. \begin{something} ... \end{something}.

envname

The name of the environment (‘itemize’, ‘equation’, …)

nodelist

A list of LatexNode’s that represent all the contents between the \begin{...} instruction and the \end{...} instruction.

optargs

Any possible optional argument passed to the \begin{...} instruction, for example in \begin{enumerate}[label=\roman*)]. (Currently, only a single optional argument is parsed, but this attribute is still a list of LatexNode’s.)

args

Any possible regular arguments passed to the \begin{...} instruction, for example in \begin{tabular}{clr}. Currently, at most a single regular argument is parsed, but this attribute is still a list of LatexNode’s.

class pylatexenc.latexwalker.LatexMathNode(displaytype, nodelist=[], **kwargs)

Bases: pylatexenc.latexwalker.LatexNode

A Math node type.

Note that currently only ‘inline’ math environments are detected.

displaytype

Either ‘inline’ or ‘display’, to indicate an inline math block or a display math block. (Note that math environments such as \begin{equation}…\end{equation} are reported as LatexEnvironmentNode’s, and not as LatexMathNode’s.)

Note

Currently, the ‘display’ type is never used. Display blocks delimited e.g. by $$ .. $$ or \[ ... \] are always reported as regular text with LatexCharsNode. This might change in the future.

nodelist

The contents of the environment, given as a list of LatexNode’s.
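
Since every node class above stores its children in a small set of attributes (nodelist, nodeoptarg, nodeargs, optargs, args), a generic traversal can be written with duck typing. The sketch below is an illustration under that assumption; iter_chars is a hypothetical helper, and the tree it runs on is built from SimpleNamespace stand-ins rather than real parser output, so the snippet does not depend on pylatexenc being installed.

```python
from types import SimpleNamespace


def iter_chars(nodes):
    """Yield the chars of every chars-bearing node reachable from a node list."""
    for node in nodes:
        if node is None:
            continue
        chars = getattr(node, "chars", None)
        if chars is not None:
            yield chars
        # nodeoptarg holds a single node (or None), the others hold lists.
        optarg = getattr(node, "nodeoptarg", None)
        if optarg is not None:
            yield from iter_chars([optarg])
        for attr in ("optargs", "args", "nodeargs", "nodelist"):
            yield from iter_chars(getattr(node, attr, None) or [])


# Stand-in tree shaped like the parse of '\textit{italic} text'.
tree = [
    SimpleNamespace(
        macroname="textit",
        nodeoptarg=None,
        nodeargs=[SimpleNamespace(nodelist=[SimpleNamespace(chars="italic")])],
    ),
    SimpleNamespace(chars=" text"),
]
text = "".join(iter_chars(tree))
print(text)
```

Comment nodes carry a comment attribute rather than chars, so this traversal skips them.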

class pylatexenc.latexwalker.LatexToken(tok, arg, pos, len, pre_space, post_space='')

Represents a token read from the LaTeX input.

This is used internally by LatexWalker’s methods. You probably don’t need to worry about individual tokens. Rather, you should use the high-level functions provided by LatexWalker (e.g., get_latex_nodes()). So most likely, you can ignore this class entirely.

This is not the same thing as a LaTeX token; it is just a part of the input which we treat in the same way (e.g. a bunch of content characters, a comment, a macro, etc.).

Information about the object is stored into the fields tok and arg. The tok field is a string which identifies the type of the token. The arg depends on what tok is, and describes the actual input.

Additionally, this class stores information about the position of the token in the input stream in the field pos. This pos is an integer which corresponds to the index in the input string. The field len stores the length of the token in the input string. This means that this token spans in the input string from pos to pos+len.

Leading whitespace before the token is not returned as a separate ‘char’-type token, but it is given in the pre_space field of the token which follows. Pre-space may contain a newline, but not two consecutive newlines.

The post_space is only used for ‘macro’ and ‘comment’ tokens, and it stores any spaces encountered after a macro, or the newline with any following spaces that terminates a LaTeX comment.

The tok field may be one of:

  • ‘char’: raw characters which have no special LaTeX meaning; they are part of the text content.

    The arg field contains the characters themselves.

  • ‘macro’: a macro invocation, but not ‘begin’ or ‘end’.

    The arg field contains the name of the macro, without the leading backslash.

  • ‘begin_environment’: an invocation of ‘\begin{environment}’.

    The arg field contains the name of the environment inside the braces.

  • ‘end_environment’: an invocation of ‘\end{environment}’.

    The arg field contains the name of the environment inside the braces.

  • ‘comment’: a LaTeX comment delimited by a percent sign up to the end of the line.

    The arg field contains the text in the comment line, not including the percent sign nor the newline.

  • ‘brace_open’: an opening brace. This is usually a curly brace, and sometimes also a square bracket. What is parsed as a brace depends on the arguments to get_token().

    The arg is a string which contains the relevant brace character.

  • ‘brace_close’: a closing brace. This is usually a curly brace, and sometimes also a square bracket. What is parsed as a brace depends on the arguments to get_token().

    The arg is a string which contains the relevant brace character.

  • ‘mathmode_inline’: a delimiter which starts inline math. This is (e.g.) a single ‘$’ character which is not part of a double ‘$$’ display environment delimiter.

    The arg is the string value of the delimiter in question (‘$’).
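
To illustrate how the tok, arg, pos, len and pre_space fields relate to the input string, here is a deliberately reduced toy tokenizer. It is not the library's get_token() and makes several simplifying assumptions: it handles only the ‘macro’, ‘comment’, ‘brace_open’, ‘brace_close’ and ‘char’ types, it groups ordinary characters into runs (the real parser may split them differently), it has no post_space, and it places no restriction on newlines in pre_space. The names Token, TOKEN_RE and tokenize are hypothetical.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", "tok arg pos len pre_space")

TOKEN_RE = re.compile(r"""
    (?P<pre_space>\s*)
    (?:
        \\(?P<macro>[A-Za-z]+|.)      # \name, or a one-character macro
      | %(?P<comment>[^\n]*)          # comment text, newline excluded
      | (?P<brace_open>\{)
      | (?P<brace_close>\})
      | (?P<char>[^\\%{}\s]+)         # run of ordinary characters
    )
""", re.VERBOSE | re.DOTALL)


def tokenize(s):
    """Split s into toy tokens; pos and len exclude the leading whitespace."""
    tokens = []
    i = 0
    while True:
        m = TOKEN_RE.match(s, i)
        if m is None:
            break  # only whitespace (or a lone trailing backslash) is left
        kind = m.lastgroup  # name of the alternative that matched
        start = m.end("pre_space")
        tokens.append(Token(kind, m.group(kind), start, m.end() - start,
                            m.group("pre_space")))
        i = m.end()
    return tokens


s = "\\textit{it}  % note\nend"
tokens = tokenize(s)
for t in tokens:
    print(t)
```

Each token spans s[pos:pos+len], and whitespace never becomes a token of its own: it is attached to the token that follows it, as described for pre_space above.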