Simple Parser for LaTeX Code
The latexwalker module provides a simple API for parsing LaTeX snippets and representing their contents as a data structure built from node classes.
LatexWalker understands the syntax of most common macros. However, latexwalker is NOT a replacement for a full LaTeX engine. (Originally, latexwalker was designed to extract useful text from LaTeX content for indexing in text database searches.)
You can also run latexwalker directly from the command line, producing JSON or a human-readable node tree:
$ echo '\textit{italic} text' | python -m pylatexenc.latexwalker \
--output-format=json --json-compact
[{"nodetype": "LatexMacroNode", "macroname": "textit", "nodeoptarg": null,
"nodeargs": [{"nodetype": "LatexGroupNode", "nodelist": [{"nodetype":
"LatexCharsNode", "chars": "italic"}]}], "macro_post_space": ""},
{"nodetype": "LatexCharsNode", "chars": " text"}]
$ python -m pylatexenc.latexwalker --help
[...]
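The JSON output can be post-processed with any JSON library. The following is a minimal sketch, assuming the output has the shape shown above (the sample string is copied from that output). The helper name collect_chars is made up for this illustration and is not part of pylatexenc.

```python
import json

# Sample output copied from the command-line example above.
SAMPLE_OUTPUT = '''
[{"nodetype": "LatexMacroNode", "macroname": "textit", "nodeoptarg": null,
"nodeargs": [{"nodetype": "LatexGroupNode", "nodelist": [{"nodetype":
"LatexCharsNode", "chars": "italic"}]}], "macro_post_space": ""},
{"nodetype": "LatexCharsNode", "chars": " text"}]
'''


def collect_chars(nodes):
    """Hypothetical helper: concatenate the text of all LatexCharsNode entries."""
    parts = []
    for node in nodes:
        if node is None:
            continue
        if node["nodetype"] == "LatexCharsNode":
            parts.append(node["chars"])
        # Recurse into the child containers seen in the sample output.
        parts.append(collect_chars(node.get("nodelist") or []))
        parts.append(collect_chars(node.get("nodeargs") or []))
    return "".join(parts)


nodes = json.loads(SAMPLE_OUTPUT)
text = collect_chars(nodes)
print(text)  # prints: italic text
```

Only the keys visible in the sample (nodetype, chars, nodelist, nodeargs) are used; other node types may carry additional fields that this sketch ignores.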
The main LatexWalker class
class pylatexenc.latexwalker.LatexWalker(s, macro_dict=None, **flags)
A parser which walks through an input stream, parsing it as LaTeX markup.
Arguments:
- s: the string to parse as LaTeX code
- macro_dict: a context dictionary of known LaTeX macros. By default, the global macro dictionary default_macro_dict is used. This should be a dictionary where the keys are macro names (see MacrosDef.macname) and the values are MacrosDef instances.
Additional keyword arguments are flags which influence the parsing. Accepted flags are:
- keep_inline_math=True|False If this option is set to True, then inline math is parsed and stored using LatexMathNode instances. Otherwise, inline math is not treated differently and is simply kept as text.
- tolerant_parsing=True|False If set to True, then the parser generally ignores syntax errors rather than raising an exception.
- strict_braces=True|False This option refers specifically to encountering a closing brace when an expression is needed. You generally won’t need to specify this flag; use tolerant_parsing instead.
The methods provided in this class parse various parts of the given string s. These methods typically accept a pos parameter, an integer giving the position in the string s at which to start parsing.
These methods, unless otherwise documented, return a tuple (node, pos, len), where node is a LatexNode describing the parsed content, pos is the position at which the LaTeX element of interest was encountered, and len is the length of the string that is considered to be part of the node. That is, the position in the string that is immediately after the node is pos+len.
get_latex_braced_group(pos, brace_type='{')
Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to read a latex group delimited by braces.
Reads a latex expression enclosed in braces { ... }. The first token of s[pos:] must be an opening brace.
Returns a tuple (node, pos, len), where node is a LatexGroupNode instance, pos is the position of the first char of the expression (which has to be an opening brace), and len is the length of the group, including the closing brace (relative to the starting position).
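To illustrate the (pos, len) convention for groups without depending on the parser, here is a stand-alone sketch. find_braced_group is a hypothetical helper, not pylatexenc's implementation: it does not build any node, and it ignores escaped braces such as \{ and comments, which a real parser has to handle.

```python
def find_braced_group(s, pos):
    """Hypothetical helper: return (pos, length) of the brace group starting at s[pos].

    The length includes both the opening and the closing brace, mirroring the
    convention described above.
    """
    if s[pos] != '{':
        raise ValueError("expected an opening brace at position %d" % pos)
    depth = 0
    for i in range(pos, len(s)):
        if s[i] == '{':
            depth += 1
        elif s[i] == '}':
            depth -= 1
            if depth == 0:
                return pos, i - pos + 1
    raise ValueError("unbalanced braces")


s = 'a {b {c} d} e'
group_pos, group_len = find_braced_group(s, 2)
print(s[group_pos:group_pos + group_len])  # prints: {b {c} d}
print(s[group_pos + group_len:])           # text immediately after the group
```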
get_latex_environment(pos, environmentname=None)
Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to read a latex environment.
Reads a latex expression enclosed in a \begin{environment}...\end{environment}. The first token in the stream must be the \begin{environment}.
If environmentname is given and nonempty, then additionally a LatexWalkerParseError is raised if the environment in the input stream does not match the provided name.
This function will attempt to heuristically parse an optional argument, and possibly a mandatory argument given to the environment. No space is allowed between \begin{environment} and an opening square bracket or opening brace.
Returns a tuple (node, pos, len) with node being a LatexEnvironmentNode.
get_latex_expression(pos, strict_braces=None)
Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to parse a single LaTeX expression.
Reads a latex expression, e.g. a macro argument. This may be a single char, an escape sequence, or an expression placed in braces. This is what TeX calls a “token” (which is not what this module calls a token; see LatexToken).
Returns a tuple (node, pos, len), where pos is the position of the first char of the expression and len the length of the expression.
get_latex_maybe_optional_arg(pos)
Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to attempt to parse an optional argument.
If an optional argument is found, returns a tuple (node, pos, len) where node is a LatexGroupNode. Otherwise, this method returns None.
get_latex_nodes(pos=0, stop_upon_closing_brace=None, stop_upon_end_environment=None, stop_upon_closing_mathmode=None)
Parses the latex content given to the constructor (and stored in self.s) into a list of nodes.
Returns a tuple (nodelist, pos, len) where nodelist is a list of LatexNode’s.
If stop_upon_closing_brace is given and set to a character, then parsing stops once the given closing brace is encountered (but not inside a subgroup). The brace is given as a character, ‘]’ or ‘}’. The returned len includes the closing brace, but the closing brace is not included in any of the nodes in the nodelist.
If stop_upon_end_environment is provided, then parsing stops once the given environment is closed. If there is an environment mismatch, then a LatexWalkerParseError is raised except in tolerant parsing mode (see parse_flags()). Again, the closing environment is included in the length count but not in the nodes.
If stop_upon_closing_mathmode is specified, then parsing stops once the corresponding math mode (assumed already open) is closed. Currently, only inline math modes delimited by $ are supported. That is, if set, the only valid value is stop_upon_closing_mathmode='$'.
get_token(pos, brackets_are_chars=True, environments=True, keep_inline_math=None)
Parses the latex content given to the constructor (and stored in self.s), starting at position pos, to parse a single “token”, as defined by LatexToken.
Returns a LatexToken. Raises LatexWalkerEndOfStream if the end of the stream is reached.
If brackets_are_chars=False, then square bracket characters count as ‘brace_open’ and ‘brace_close’ token types (see LatexToken); otherwise (the default) they are considered just like other normal characters.
If environments=False, then ‘begin’ and ‘end’ tokens count as regular ‘macro’ tokens (see LatexToken); otherwise (the default) they are considered as the token types ‘begin_environment’ and ‘end_environment’.
If keep_inline_math is not None, then that value overrides that of self.keep_inline_math for the duration of this method call.
parse_flags()
The parse flags currently set on this object. Returns a dictionary with keys ‘keep_inline_math’, ‘tolerant_parsing’ and ‘strict_braces’.
Exception Classes
class pylatexenc.latexwalker.LatexWalkerError
Generic exception class raised by this module.
class pylatexenc.latexwalker.LatexWalkerParseError(msg, s=None, pos=None)
Parse error. The following attributes are available: msg (the error message), s (the parsed string), pos (the position of the error in the string, 0-based index).
class pylatexenc.latexwalker.LatexWalkerEndOfStream
Reached end of input stream (e.g., end of file).
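The msg, s and pos attributes of a parse error carry enough information to point at the offending location. The sketch below is one way to do that; format_error_context is a hypothetical helper, not part of pylatexenc, and the message and input string are invented for the illustration rather than produced by the parser.

```python
def format_error_context(msg, s, pos):
    """Hypothetical helper: render msg with the source line and a caret under pos."""
    line_start = s.rfind("\n", 0, pos) + 1   # index of first char on the line
    line_end = s.find("\n", pos)
    if line_end == -1:
        line_end = len(s)
    line_no = s.count("\n", 0, pos) + 1      # 1-based line number
    col = pos - line_start                   # 0-based column
    return "%s (line %d, column %d)\n%s\n%s^" % (
        msg, line_no, col + 1, s[line_start:line_end], " " * col)


# Invented values standing in for the msg, s and pos attributes of an error.
source = "first line\n\\begin{a} x \\end{b}\n"
error_pos = source.index("\\end")
report = format_error_context("environment mismatch", source, error_pos)
print(report)
```

In real use, the three arguments would come from the attributes of a caught LatexWalkerParseError; s and pos default to None in the constructor, so they should be checked before formatting.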
Macro Definitions
class pylatexenc.latexwalker.MacrosDef(macname, optarg, numargs)
Class which stores the syntax of a macro.
- macname stores the name of the macro, without the leading backslash.
- optarg may be one of True, False, or None.
  - if True, the macro expects as first argument an optional argument in square brackets. Then, numargs specifies the number of additional mandatory arguments to the command, given in the usual curly braces (or simply as one TeX token).
  - if False, the macro only expects a number of mandatory arguments given by numargs. The mandatory arguments are given in the usual curly braces (or simply as one TeX token).
  - if None, then numargs is a string consisting of the characters “{” and “[”, in which each curly brace specifies a mandatory argument and each square bracket specifies an optional argument in square brackets. For example, “{{[{” expects two mandatory arguments, then an optional argument in square brackets, and then another mandatory argument.
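The three cases can be summarised with a small stand-alone sketch. describe_args is a hypothetical helper written for this illustration; it is not part of pylatexenc and only spells out the rules listed above.

```python
def describe_args(optarg, numargs):
    """Hypothetical helper: expand an (optarg, numargs) pair into a list of argument kinds."""
    if optarg is True:
        # One leading optional argument, then numargs mandatory ones.
        return ["optional"] + ["mandatory"] * numargs
    if optarg is False:
        # Only mandatory arguments.
        return ["mandatory"] * numargs
    if optarg is None:
        # numargs is a specification string made of '{' and '[' characters.
        kinds = {"{": "mandatory", "[": "optional"}
        return [kinds[c] for c in numargs]
    raise ValueError("optarg must be True, False or None")


print(describe_args(True, 1))       # ['optional', 'mandatory']
print(describe_args(False, 2))      # ['mandatory', 'mandatory']
print(describe_args(None, "{{[{"))  # ['mandatory', 'mandatory', 'optional', 'mandatory']
```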
pylatexenc.latexwalker.default_macro_dict
The default context dictionary of known LaTeX macros. The keys are the macro names (MacrosDef.macname) and the values are MacrosDef instances.
Data Node Classes
class pylatexenc.latexwalker.LatexNode(**kwargs)
Represents an abstract ‘node’ of the latex document.
Use nodeType() to figure out what type of node this is, and isNodeType() to test whether it is of a given type.
isNodeType(t)
Returns True if the current node is of the given type. The argument t must be a Python class such as, e.g., LatexGroupNode.
nodeType()
Returns the class which corresponds to the type of this node. This is a Python class object, that is, one of LatexCharsNode, LatexGroupNode, etc.
class pylatexenc.latexwalker.LatexCharsNode(chars, **kwargs)
Bases: pylatexenc.latexwalker.LatexNode
A string of characters in the LaTeX document, without any special LaTeX code.
chars
The string of characters represented by this node.
class pylatexenc.latexwalker.LatexGroupNode(nodelist, **kwargs)
Bases: pylatexenc.latexwalker.LatexNode
A LaTeX group delimited by braces, {like this}.
class pylatexenc.latexwalker.LatexCommentNode(comment, comment_post_space='', **kwargs)
Bases: pylatexenc.latexwalker.LatexNode
A LaTeX comment, which starts at a percent sign and runs until the end of the line.
comment
The comment string, including neither the ‘%’ sign nor the following newline.
comment_post_space
The newline that terminated the comment, possibly followed by spaces (e.g., indentation spaces of the next line).
class pylatexenc.latexwalker.LatexMacroNode(macroname, nodeoptarg=None, nodeargs=[], macro_post_space='', **kwargs)
Bases: pylatexenc.latexwalker.LatexNode
Represents a ‘macro’ type node, e.g. ‘textbf’.
macroname
The name of the macro (string), without the leading backslash.
nodeoptarg
If non-None, this corresponds to the optional argument of the macro.
macro_post_space
Any spaces that were encountered immediately after the macro.
class pylatexenc.latexwalker.LatexEnvironmentNode(envname, nodelist, optargs=[], args=[], **kwargs)
Bases: pylatexenc.latexwalker.LatexNode
A LaTeX Environment Node, i.e. \begin{something} ... \end{something}.
envname
The name of the environment (‘itemize’, ‘equation’, …)
nodelist
A list of LatexNode’s that represent all the contents between the \begin{...} instruction and the \end{...} instruction.
class pylatexenc.latexwalker.LatexMathNode(displaytype, nodelist=[], **kwargs)
Bases: pylatexenc.latexwalker.LatexNode
A Math node type.
Note that currently only ‘inline’ math environments are detected.
displaytype
Either ‘inline’ or ‘display’, to indicate an inline math block or a display math block. (Note that math environments such as \begin{equation}...\end{equation} are reported as LatexEnvironmentNode’s, and not as LatexMathNode’s.)
Note: Currently, the ‘display’ type is never used. Display blocks delimited e.g. by $$ .. $$ or \[ ... \] are always reported as regular text with LatexCharsNode. This might change in the future.
class pylatexenc.latexwalker.LatexToken(tok, arg, pos, len, pre_space, post_space='')
Represents a token read from the LaTeX input.
This is used internally by LatexWalker’s methods. You probably don’t need to worry about individual tokens. Rather, you should use the high-level functions provided by LatexWalker (e.g., get_latex_nodes()). So most likely, you can ignore this class entirely.
This is not the same thing as a LaTeX token; it’s just a part of the input which we treat in the same way (e.g. a bunch of content characters, a comment, a macro, etc.)
Information about the object is stored in the fields tok and arg. The tok field is a string which identifies the type of the token. The arg field depends on what tok is, and describes the actual input.
Additionally, this class stores information about the position of the token in the input stream in the field pos. This pos is an integer which corresponds to the index in the input string. The field len stores the length of the token in the input string. This means that the token spans the input string from pos to pos+len.
Leading whitespace before the token is not returned as a separate ‘char’-type token, but is given in the pre_space field of the token which follows. Pre-space may contain a newline, but not two consecutive newlines.
The post_space field is only used for ‘macro’ and ‘comment’ tokens, and it stores any spaces encountered after a macro, or the newline with any following spaces that terminates a LaTeX comment.
The tok field may be one of:
- ‘char’: raw characters which have no special LaTeX meaning; they are part of the text content. The arg field contains the characters themselves.
- ‘macro’: a macro invocation, but not ‘\begin’ or ‘\end’. The arg field contains the name of the macro, without the leading backslash.
- ‘begin_environment’: an invocation of ‘\begin{environment}’. The arg field contains the name of the environment inside the braces.
- ‘end_environment’: an invocation of ‘\end{environment}’. The arg field contains the name of the environment inside the braces.
- ‘comment’: a LaTeX comment, from a percent sign up to the end of the line. The arg field contains the text in the comment line, including neither the percent sign nor the newline.
- ‘brace_open’: an opening brace. This is usually a curly brace, and sometimes also a square bracket. What is parsed as a brace depends on the arguments to get_token(). The arg is a string which contains the relevant brace character.
- ‘brace_close’: a closing brace. This is usually a curly brace, and sometimes also a square bracket. What is parsed as a brace depends on the arguments to get_token(). The arg is a string which contains the relevant brace character.
- ‘mathmode_inline’: a delimiter which starts inline math. This is (e.g.) a single ‘$’ character which is not part of a double ‘$$’ display environment delimiter. The arg is the string value of the delimiter in question (‘$’).