latexnodes — LaTeX Nodes Tree and Parsers

New in version 3.0: The latexnodes module was introduced in pylatexenc 3.

Parsing State

class pylatexenc.latexnodes.ParsingState(**kwargs)

Stores some information about the current parsing state, such as whether we are currently in a math mode block.

One of the ideas of pylatexenc is to make the parsing of LaTeX code mostly state-independent mark-up parsing (in contrast to a full TeX engine, whose state constantly changes and whose parsing behavior is altered dynamically while parsing). However a minimal state of the context might come in handy sometimes. Perhaps some macros or specials should behave differently in math mode than in text mode.

This class also stores some essential information that is associated with LatexNode instances and which provides a context to better understand the node structure. For instance, we store the original parsed string, and each node records which part of that string it represents.

s

The string that is parsed by the LatexWalker

Deprecated since version 3.0: The s attribute is deprecated starting in pylatexenc 3. If you have access to a node instance (cf. LatexNode) and would like to find out the original string that was parsed, use node.latex_walker.s instead of querying the parsing state. (The rationale of removing the s attribute from the parsing state is for parsing state objects to have a meaning of their own independently of any string being parsed or any latex walker instance.)

latex_context

The latex context (with macros/environments specifications) that was used when parsing the string s. This is a pylatexenc.macrospec.LatexContextDb object.

in_math_mode

Whether or not we are in a math mode chunk of LaTeX (True or False). This can be inline or display, and can be caused by an equation environment.

math_mode_delimiter

Information about the kind of math mode we are currently in, if in_math_mode is True. This is a string which can be set to aid the parser. The parser sets this field to the math mode delimiter that initiated the math mode (one of '$', '$$', r'\(', r'\['). For user-initiated math modes (e.g. by a custom environment definition), you can set this string to any custom value EXCEPT any of the core math mode delimiters listed above.

Note

The tokenizer/parser relies on the value of the math_mode_delimiter attribute to disambiguate two consecutive dollar signs ...$$... into either a display math mode delimiter or two inline math mode delimiters (as in $a$$b$). You should only set math_mode_delimiter='$' if you know what you're doing.

latex_group_delimiters

Doc …………

latex_inline_math_delimiters

Doc …………

latex_display_math_delimiters

Doc …………

enable_double_newline_paragraphs

Doc …………

enable_environments

Doc …………

enable_comments

Doc …………

macro_alpha_chars

Doc …………

macro_escape_char

Doc …………….

forbidden_characters

Characters that are simply forbidden to occur as regular characters. You can use this, for instance, if you'd like to disable some LaTeX-like features and have the corresponding character trigger an error instead. For example, you can force inline math to be typed as \(...\) and not as $...$, and yet still require users to type \$ for a literal dollar sign, by including '$' in the list of forbidden characters.

The forbidden_characters can be a string, or a list of single-character strings; this attribute will be used with the syntax if (c in forbidden_characters): ...
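
For instance (a minimal sketch, assuming you already have a parsing_state instance, e.g. one obtained from a LatexWalker; see also sub_context() below):

# disallow raw '%' and '$' characters; users must type \% and \$ instead
strict_parsing_state = parsing_state.sub_context(
    enable_comments=False,
    forbidden_characters='%$',
)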

New in version 2.0: This class was introduced in version 2.0.

New in version 2.7: The attribute math_mode_delimiter was introduced in version 2.7.

Changed in version 2.7: All arguments must now be specified as keyword arguments as of version 2.7.

New in version 3.0: The attributes latex_group_delimiters, latex_inline_math_delimiters, latex_display_math_delimiters, enable_double_newline_paragraphs, enable_environments, enable_comments, macro_alpha_chars, macro_escape_char, and forbidden_characters were introduced in version 3.

New in version 3.0: This class was moved to pylatexenc.latexnodes.ParsingState starting in pylatexenc 3.0. In earlier versions, this class was located in the latexwalker module, see ParsingState.

sub_context(**kwargs)

Return a new ParsingState instance that is a copy of the current parsing state, but where the given properties keys have been set to the corresponding values (given as keyword arguments).

This makes it easy to create a sub-context in a given parser. For instance, if we enter math mode, we might write:

parsing_state_inner = parsing_state.sub_context(in_math_mode=True)

If no arguments are provided, this returns a copy of the present parsing context object.

get_fields()

Returns the fields and values associated with this ParsingState as a dictionary.

class pylatexenc.latexnodes.ParsingStateDelta(set_attributes=None, _fields=None, **kwargs)

Describe a change in the parsing state. This can be a transition into math mode, a definition of a new macro that causes the latex context to change, etc.

There are many ways in which the parsing state can change, and this is reflected in the many different subclasses of ParsingStateDelta (e.g., ParsingStateDeltaEnterMathMode).

This class serves both as a base class for general parsing state changes, as well as a simple implementation of a parsing state change based on parsing state attributes that are to be changed.

get_updated_parsing_state(parsing_state, latex_walker)

Apply any required changes to the given parsing_state and return a new parsing state that reflects all the necessary changes.

The new parsing state instance may be the same object as the one passed in if no changes need to be applied.
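
For instance, the attribute-based behavior of this base class can be used directly (a minimal sketch; parsing_state and latex_walker are assumed to be available, e.g. from within a parser implementation):

delta = ParsingStateDelta(set_attributes={'in_math_mode': True})
new_parsing_state = delta.get_updated_parsing_state(parsing_state, latex_walker)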

class pylatexenc.latexnodes.ParsingStateDeltaReplaceParsingState(set_parsing_state, **kwargs)

A parsing state change in which a new full parsing state object entirely replaces the previous parsing state.

class pylatexenc.latexnodes.ParsingStateDeltaChained(parsing_state_deltas, **kwargs)

Apply multiple parsing state deltas, in the order specified.

class pylatexenc.latexnodes.ParsingStateDeltaWalkerEvent(walker_event_name, walker_event_kwargs)

A parsing state change representing a logical “event” (like entering math mode), for which the actual parsing state changes should be requested to the latex walker instance.

DOC………………….

class pylatexenc.latexnodes.ParsingStateDeltaEnterMathMode(math_mode_delimiter=None, trigger_token=None)

A parsing state change representing the beginning of math mode contents.

This class is a semantic marker for entering math mode and does not itself set the field in_math_mode=True for the parsing state. It’s a “walker event parsing state delta”, see ParsingStateDeltaWalkerEvent. The latexwalker is queried to obtain the actual parsing state change that should be effected because of the change to math mode. (There might be changes other than in_math_mode=True, such as a different set of macro definitions, etc.)

class pylatexenc.latexnodes.ParsingStateDeltaLeaveMathMode(trigger_token=None)

A parsing state change representing contents in text mode.

See also ParsingStateDeltaEnterMathMode.

Latex Token

class pylatexenc.latexnodes.LatexToken(tok, arg, pos, pos_end=None, pre_space='', post_space='', **kwargs)

Represents a token read from the LaTeX input. Instances of this class are returned by token readers such as LatexTokenReader.

This is not the same thing as a LaTeX token; it is simply a piece of the input that we treat as a single unit (e.g., a text character, a comment, a macro, etc.).

Information about the object is stored into the fields tok and arg. The tok field is a string which identifies the type of the token. The arg depends on what tok is, and describes the actual input.

Additionally, this class stores information about the position of the token in the input stream in the field pos. This pos is an integer which corresponds to the index in the input string. The field pos_end stores the position immediately past the token in the input string. This means that the string length spanned by this token is pos_end - pos (without leading whitespace).

Leading whitespace before the token is not returned as a separate ‘char’-type token, but it is given in the pre_space field of the token which follows. Pre-space may contain a newline, but not two consecutive newlines. The pos position is the position of the first character of the token itself, which immediately follows any leading whitespace.

The post_space is only used for ‘macro’ and ‘comment’ tokens, and it stores any spaces encountered after a macro, or the newline with any following spaces that terminates a LaTeX comment. When we encounter two consecutive newlines, these are not included in post_space. Unlike pre_space, the post_space is accounted for in the attribute pos_end, i.e., pos_end points immediately past any trailing whitespace.

The tok field may be one of:

  • ‘char’: raw character(s) which have no special LaTeX meaning and which are part of the text content.

    The arg field contains the characters themselves.

  • ‘macro’: a macro invocation, but not \begin or \end

    The arg field contains the name of the macro, without the leading backslash.

  • ‘begin_environment’: an invocation of \begin{environment}.

    The arg field contains the name of the environment inside the braces.

  • ‘end_environment’: an invocation of \end{environment}.

    The arg field contains the name of the environment inside the braces.

  • ‘comment’: a LaTeX comment delimited by a percent sign up to the end of the line.

    The arg field contains the text in the comment line, not including the percent sign nor the newline.

  • ‘brace_open’: an opening brace. This is usually a curly brace, and sometimes also a square bracket. What is parsed as a brace depends on the arguments to get_token().

    The arg is a string which contains the relevant brace character.

  • ‘brace_close’: a closing brace. This is usually a curly brace, and sometimes also a square bracket. What is parsed as a brace depends on the arguments to get_token().

    The arg is a string which contains the relevant brace character.

  • ‘mathmode_inline’: a delimiter which starts/ends inline math. This is (e.g.) a single ‘$’ character which is not part of a double ‘$$’ display environment delimiter.

    The arg is the string value of the delimiter in question (‘$’)

  • ‘mathmode_display’: a delimiter which starts/ends display math, e.g., \[.

    The arg is the string value of the delimiter in question (e.g., \[ or $$)

  • ‘specials’: a character or character sequence that has a special meaning in LaTeX. E.g., ‘~’, ‘&’, etc.

    The arg field is then the corresponding SpecialsSpec instance.

    The rationale for setting arg to a SpecialsSpec instance, in contrast to the behavior for macros and environments, is that macros and environments are delimited directly by LaTeX syntax and are determined unambiguously without any lookup in the latex context database. This is not the case for specials, where successfully parsing specials already requires a lookup in the context database, and so the spec object is readily available.
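
As an illustration, the following minimal sketch reads all tokens from a string and prints their tok, arg, pos and pos_end fields. (It assumes the LatexTokenReader class documented below, as well as the default latex context from pylatexenc.latexwalker; a realistic use would pick a parsing state appropriate to the surrounding parsing context.)

from pylatexenc.latexwalker import get_default_latex_context_db
from pylatexenc.latexnodes import LatexTokenReader, ParsingState

token_reader = LatexTokenReader(r"Hello \textbf{world}!  % a comment")
parsing_state = ParsingState(latex_context=get_default_latex_context_db())

while True:
    # peek_token_or_none() returns None once the end of the stream is reached
    tok = token_reader.peek_token_or_none(parsing_state)
    if tok is None:
        break
    print(tok.tok, repr(tok.arg), tok.pos, tok.pos_end)
    token_reader.move_past_token(tok)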

Changed in version 3.0: Starting in pylatexenc 3, the len argument was replaced by pos_end. For backwards compatibility, kwargs arguments are inspected for a len argument. If a len argument is provided and pos_end was left None, then pos_end is set to pos+len.

New in version 3.0: This class was moved to pylatexenc.latexnodes.LatexToken starting in pylatexenc 3.0. In earlier versions, this class was located in the latexwalker module, see LatexToken.

Token Readers

class pylatexenc.latexnodes.LatexTokenReaderBase(**kwargs)

Base class for token readers.

A token reader is able to transform input characters (usually given as a single string) into tokens. Tokens are instances of LatexToken.

A token reader also has an internal position pointer that remembers where in the string we should continue to read more tokens. A call to next_token() will both parse a new token and advance the internal position pointer past the token that was just read, such that future calls to next_token() continue parsing tokens as they appear in the string.

A token reader should at minimum provide implementations to peek_token(), move_to_token(), move_past_token(), and cur_pos().

A token reader can (but does not have to) also provide character-level access to the input. This can be used by some special parsers like verbatim parsers. In this case, the token reader should implement peek_chars(), next_chars(), and move_to_pos_chars().

New in version 3.0: The LatexTokenReaderBase class was introduced in pylatexenc 3.0.

make_token(**kwargs)

Return a new LatexToken instance with the given parameters. Can be reimplemented if you want to use a custom token class, although I’m not sure why you’d want to do that.

move_to_token(tok, rewind_pre_space=True)

Move the internal position pointer of this token reader to point to the position of the given token tok. That is, a subsequent call to peek_token() or next_token() should read the given token again.

For token readers that keep track of whitespace, if rewind_pre_space=True, then the internal position is set to point at the whitespace that precedes the token tok (as specified in the instance tok); if rewind_pre_space=False, the internal position pointer is set to point at the actual token, after the preceding whitespace.

move_past_token(tok, fastforward_post_space=True)

Move the internal position pointer of this token reader to point immediately past the given token tok. That is, a subsequent call to peek_token() or next_token() should return the token that follows tok in the input stream.

For token readers that keep track of whitespace, if fastforward_post_space=True, then any whitespace that follows the given tok (for macro and comment tokens) is also skipped.

peek_token(parsing_state)

Parse a single token at the current position in the input stream. Parsing is influenced by the given parsing_state. (See ParsingState.)

The internal position pointer is not updated. I.e., a subsequent call to peek_token() with the same parsing state should return the same token.

If the end of stream is reached, i.e., if there are no remaining tokens at the current internal position, then LatexWalkerEndOfStream is raised.

peek_token_or_none(parsing_state)

A convenience method that calls peek_token(), but that returns None instead of raising LatexWalkerEndOfStream.

next_token(parsing_state)

Same as peek_token(), but then also updates the internal position pointer of this token reader to advance past the token that was read.

cur_pos()

Return the current internal position pointer’s state.

peek_space_chars(parsing_state)

Read a sequence of whitespace characters and return them. Whitespace characters should be read until a nonwhitespace character is found.

The current internal position pointer should remain as it is.

skip_space_chars(parsing_state)

Read a sequence of whitespace characters and return them. Whitespace characters should be read until a nonwhitespace character is found.

Advance internal position as whitespace characters are read. The position pointer should be left immediately after any encountered whitespace. If the current pointed position is not whitespace, the position should not be advanced.

peek_chars(num_chars, parsing_state)

Reads at most num_chars characters at the current position and returns them. The internal position pointer is not changed.

If the pointer is already at the end of the string and there are no chars we can read, then LatexWalkerEndOfStream is raised.

next_chars(num_chars, parsing_state)

Reads at most num_chars characters at the current position and returns them. The internal position pointer is advanced to point immediately after the characters read.

If the pointer is already at the end of the string and there are no chars we can read, then LatexWalkerEndOfStream is raised.

move_to_pos_chars(pos)

Move the internal position pointer to a specific character-level position in the input string/stream.

class pylatexenc.latexnodes.LatexTokenReader(s, *, tolerant_parsing=False)

Parse tokens from an input string to create LatexToken instances.

Inherits LatexTokenReaderBase. See also the methods there for the standard token reader interface (such as LatexTokenReaderBase.peek_token() and friends).

The main functionality of this class is coded in the impl_***() methods. To extend this class with custom functionality, you should reimplement those. The methods reimplemented from LatexTokenReaderBase add layers of exception catching and recovery, etc., so be wary of reimplementing them manually.

Attributes:

New in version 3.0: The LatexTokenReader class was introduced in pylatexenc 3.0.

move_to_token(tok, rewind_pre_space=True)

Reimplemented from LatexTokenReaderBase.move_to_token().

move_past_token(tok, fastforward_post_space=True)

Reimplemented from LatexTokenReaderBase.move_past_token().

peek_chars(num_chars, parsing_state)

Reimplemented from LatexTokenReaderBase.peek_chars().

next_chars(num_chars, parsing_state)

Reimplemented from LatexTokenReaderBase.next_chars().

cur_pos()

Reimplemented from LatexTokenReaderBase.cur_pos().

move_to_pos_chars(pos)

Reimplemented from LatexTokenReaderBase.move_to_pos_chars().

skip_space_chars(parsing_state)

Move internal position to skip any whitespace. The position pointer is left immediately after any encountered whitespace. If the current pointed position is not whitespace, the position is not advanced.

If parsing_state.enable_double_newline_paragraphs is set, then two consecutive newlines do not count as whitespace.

Returns the string of whitespace characters that was skipped.

Reimplemented from LatexTokenReaderBase.skip_space_chars().

peek_space_chars(parsing_state)

Reimplemented from LatexTokenReaderBase.peek_space_chars().

peek_token(parsing_state)

Read a single token without updating the current position pointer. Returns the token that was parsed.

Parse errors encountered while reading the token are handled differently depending on whether or not we are in tolerant parsing mode. (See the tolerant_parsing attribute and constructor argument.) If not in tolerant mode, the error is raised. In tolerant parsing mode, the error is translated into a “recovery token” provided by the error object. The “recovery token” is returned as if no error had occurred, in order to continue parsing.

Reimplemented from LatexTokenReaderBase.peek_token().

impl_peek_token(parsing_state)

Read a single token and return it.

If the end of stream is reached, raise LatexWalkerEndOfStream (regardless of whether or not we are in tolerant parsing mode).

impl_peek_space_chars(s, pos, parsing_state)

Look at the string s, and identify how many characters need to be skipped in order to skip whitespace. Does not update the internal position pointer.

Return a tuple (space_string, pos, pos_end) where space_string is the string of whitespace characters that would be skipped at the current position pointer (reported in pos). The integer pos_end is the position immediately after the space characters.

No exception is raised if we encounter the end of the stream, we simply stop looking for more spaces.

impl_char_token(c, pos, pos_end, parsing_state, pre_space)

Read a character token.

This method checks that the given character is not a forbidden character, see ParsingState.forbidden_characters.

impl_maybe_read_math_mode_delimiter(s, pos, parsing_state, pre_space)

See if we can read a math mode delimiter token. This method is called only after a first check (math mode is enabled in parsing state, and the character is one of the first characters of known math mode delimiters).

Return the math mode token, or None if we didn’t encounter a math mode delimiter.

impl_read_macro(s, pos, parsing_state, pre_space)

Read a macro call token. Called when the character at the current position is a macro escape character (usually \, see ParsingState.macro_escape_char).

The characters that may form a multi-character macro name are determined by the ParsingState.macro_alpha_chars attribute.

Return the macro token.

rx_environment_name = re.compile('\\s*\\{(?P<environmentname>[A-Za-z0-9*._ :/!^()\\[\\]-]+)\\}')

A regular expression that will read the environment name after encountering the \begin or \end constructs.

parse_latex_environment_name(pos, beginend, pos_envname)

Parse an environment name in curly braces after encountering \begin or \end.

We allow for whitespace, an opening curly brace, an environment name made up of ASCII alphanumeric characters and some standard punctuation, and a closing curly brace.

We use the regular expression stored as the class attribute rx_environment_name. To override it, you can simply set this attribute on your token reader object instance, e.g., my_token_reader.rx_environment_name = .....

Return a tuple (environmentname, environment_match_end_pos). If the environment name could not be read because of a parse error, then return (None, None).
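
For instance (a hypothetical sketch of such an override, which additionally allows '@' in environment names while keeping the environmentname capture group used by the default pattern):

import re

my_token_reader.rx_environment_name = re.compile(
    r'\s*\{(?P<environmentname>[A-Za-z0-9@*._ :/!^()\[\]-]+)\}'
)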

impl_read_environment(s, pos, parsing_state, beginend, pre_space)

Parse a \begin{environmentname} or \end{environmentname} token.

This method is called after we have seen that at the position pos in the string we indeed have \begin or \end (or with the current escape character instead of \).

Return the parsed token.

impl_read_comment(s, pos, parsing_state, pre_space)

Parse and return a comment token.

We also parse the post-space and include it in the token object. New paragraph tokens are never included in the comment’s post-space attribute.

class pylatexenc.latexnodes.LatexTokenListTokenReader(token_list)

A token reader object that simply yields tokens from a list of already-parsed tokens.

This object doesn’t parse any LaTeX code. Use LatexTokenReader for that.

Arguments and Parsed Arguments

class pylatexenc.latexnodes.LatexArgumentSpec(parser, argname=None, parsing_state_delta=None)

Specify an argument accepted by a callable (a macro, an environment, or specials).

parser

The parser instance to use to parse an argument to this callable.

For the constructor you can also specify a string representing a standard argument type, such as ‘{’, ‘[’, ‘*’, or also some xparse-inspired strings. See LatexStandardArgumentParser. In this case, a suitable parser is instantiated and stored in the parser attribute.

argname

A name for the argument (which can be None, if the argument is to be referred to only by number).

The name makes argument lookups easier and offers more future-proof flexibility: e.g., while adding more optional arguments renumbers all positional arguments, references by name remain valid.

See ParsedArgumentsInfo for an interface for looking up argument values on a node instance.
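
For example, a hypothetical macro taking one optional bracket-delimited argument followed by one mandatory brace-delimited argument could declare its arguments as follows (a minimal sketch; the argument names are purely illustrative):

arguments_spec_list = [
    LatexArgumentSpec('[', argname='options'),
    LatexArgumentSpec('{', argname='content'),
]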

parsing_state_delta

Specify if this argument should be parsed with a specifically altered parsing state (e.g., if the argument should be parsed in math mode).

New in version 3.0: This class was introduced in pylatexenc 3.

class pylatexenc.latexnodes.ParsedArguments(argnlist=None, arguments_spec_list=None, **kwargs)

Parsed representation of macro arguments.

The base class provides a simple way of storing the arguments as a list of parsed nodes.

This base class can be subclassed to store additional information and provide more advanced APIs to access macro arguments for certain categories of macros.

Arguments:

  • argnlist is a list of latexwalker nodes that represent macro arguments. If the macro arguments are too complicated to store in a list, leave this as None. (But then code that uses the latexwalker must be aware of your own API to access the macro arguments.)

    The difference between argnlist and the legacy nodeargs (in pylatexenc 1.x) is that all arguments, whether optional or mandatory, are stored in the list argnlist, with None at the places where optional arguments were not provided. Previously, whether a first optional argument was included in nodeoptarg or nodeargs depended on how the macro specification was given.

  • argspec is a string or a list that describes what each corresponding argument in argnlist represents. If the macro arguments are too complicated to store in a list, leave this as None. For standard macros and parsed arguments, this is a string with characters ‘*’, ‘[’, ‘{’ describing, respectively, an optional star argument, an optional square-bracket-delimited argument, and a mandatory argument.

Attributes:

argnlist

The list of latexwalker nodes that was provided to the constructor

arguments_spec_list

Argument types, etc. …………….

argspec

Argument type specification provided to the constructor

Deprecated since version 3.0: The attribute argspec is deprecated and read-only starting from pylatexenc 3. Use the arguments_spec_list attribute instead.

legacy_nodeoptarg_nodeargs

A tuple (nodeoptarg, nodeargs) that should be exposed as properties in LatexMacroNode to provide (as best as possible) compatibility with pylatexenc < 2.

This is either (<1st optional arg node>, <list of remaining args>) if the first argument is optional and all remaining args are mandatory; or it is (None, <list of args>) for any other argument structure.

Deprecated since version 2.0: The legacy_nodeoptarg_nodeargs might be removed in a future version of pylatexenc.

Changed in version 3.0: This class used to be called ParsedMacroArgs in pylatexenc 2. It provides a mostly backwards-compatible interface to the earlier ParsedMacroArgs class, and is still exposed as macrospec.ParsedMacroArgs.

to_json_object()

Called when we export the node structure to JSON, e.g., when running latexwalker in command-line mode.

Return a representation of the current parsed arguments in an object, typically a dictionary, that can easily be exported to JSON. The object may contain latex nodes and other parsed-argument objects, as we use a custom JSON encoder that understands these types.

class pylatexenc.latexnodes.ParsedArgumentsInfo(parsed_arguments=None, node=None)

Utility class that can gather information about the arguments stored in a ParsedArguments instance.

get_argument_info(arg)

Return some information about an argument.

If arg is an integer, then it is interpreted as an index in the list of arguments. If it is a string, then it is interpreted as a named argument, and a corresponding LatexArgumentSpec will be sought with a matching argname attribute.

The returned object is a SingleParsedArgumentInfo instance.

get_all_arguments_info(args=None, allow_additional_arguments=False, skip_nonexistent_arguments=False, return_argnames_only=True)

A helper function to return info objects for all arguments.

Here, args specifies which arguments to retrieve information for. If args=None, then information about all known arguments is returned. Otherwise, you can specify a list wherein each item is an argument name or an argument index.

This method returns a dictionary of argument names to SingleParsedArgumentInfo instances. (Unless you set return_argnames_only=False, in which case the returned dictionary keys match exactly what you specified in args, if the latter is non-None.)

The allow_additional_arguments flag sets the behavior to adopt if an argument was found in the present argument list that is not in args. If False, then a parse error is raised complaining about an unexpected argument. If True, it is ignored.

The skip_nonexistent_arguments flag defines the behavior to adopt if an argument requested in args does not appear in the present argument list. If False, then a parse error is raised complaining about a missing argument. If True, the error is ignored and the returned dictionary will not include an entry for that argument.

class pylatexenc.latexnodes.SingleParsedArgumentInfo(argument_node_object)

Helper class to retrieve information about a given argument that was specified and parsed to a latex callable object (macro, environment, or specials).

You normally won’t have to instantiate this object yourself, rather, instances are returned by ParsedArgumentsInfo.get_argument_info() and ParsedArgumentsInfo.get_all_arguments_info().

New in version 3.0: This class was introduced in pylatexenc 3.

was_provided()

Return True if the given argument was provided to the macro (or environment/specials) call, False if the argument was not provided. This only makes sense for optional arguments and will always return True for a mandatory argument that was provided.

Checks that the given node object argument_node_object is not None.

get_content_nodelist()

Return a node list with the contents of the argument. The returned object is always a LatexNodeList instance.

If the argument node is a LatexGroupNode instance (e.g., a mandatory argument delimited by braces as in \textbf{Hello world}), then we return the node list contents of that group node. If the argument is a single node instance of a type other than a group node, then we return a new node list containing that single node. If an optional argument was not provided, then we return a node list that contains a single None item.

get_content_as_chars()

Return the argument contents as a single character string.

The argument must be such that only character nodes (and possibly comment nodes) were given, and an error will be raised otherwise. The content might still be contained in a single group node.

This method first extracts the content node list with get_content_nodelist(). Then, it iterates through the node list, ignoring None items and comment nodes, while concatenating strings in character nodes. Any other node type causes a LatexWalkerParseError to be raised.

This method is useful to extract character arguments from macro calls with an argument that requires a single string, such as \label{my-label} or \href{https://example.com/}{...}.

If the argument consists of a group which contains character and comment nodes (as happens with arguments delimited by braces), the group delimiters are not included in the returned string.
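
For instance (a minimal sketch; node is assumed to be a LatexMacroNode obtained from parsing, say a \label{...} call whose specification declares a single mandatory argument):

# look up the first (and only) argument and extract its character content
arg_info = ParsedArgumentsInfo(node=node).get_argument_info(0)
label_text = arg_info.get_content_as_chars()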

Nodes Collector

class pylatexenc.latexnodes.LatexNodesCollector(latex_walker, token_reader, parsing_state, stop_token_condition=None, stop_nodelist_condition=None, make_child_parsing_state=None, include_stop_token_pre_space_chars=True)

Process a stream of LaTeX tokens and convert them into a list of nodes.

The LatexNodesCollector class functions hand-in-hand with parsers to transform tokens into nodes. A parser such as LatexGeneralNodesParser might set up the parsing state correctly and then defer to a LatexNodesCollector instance to actually parse a bulk of contents. The LatexNodesCollector instance, on the other hand, recurses down to calling parsers when we encounter new macros, environments, specials, etc. in the bulk that is being parsed. The result is a node list containing a full tree of child nodes that represents the logical structure of the tokens that were encountered.

The public API of this class resides essentially in the process_tokens(), as well as the get_final_nodelist() (and some other friends, see docs below).

New in version 3.0: The LatexNodesCollector class was added in pylatexenc 3.0.

exception ReachedEndOfStream

Raised by the process_one_token() method if we reached the end of stream.

You should not have to worry about this exception unless you call process_one_token() yourself. But most of the time you’ll be calling process_tokens() instead, which does not raise this exception; it directly raises LatexWalkerEndOfStream as the higher-level parsers do.

exception ReachedStoppingCondition(stop_data, **kwargs)

Raised by the process_one_token() method to indicate that a stopping condition was met.

You should not have to worry about this exception unless you call process_one_token() yourself. But most of the time you’ll be calling process_tokens() instead, which simply stops processing tokens if a stopping condition is met.

get_final_nodelist()

Returns the final nodelist collected from the processed tokens.

The return value is a LatexNodeList instance.
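
As an illustration, a parser implementation might drive the collector roughly as follows (a minimal sketch; latex_walker, token_reader, and parsing_state are assumed to be available as they typically are inside a parser, and the stopping condition is purely illustrative):

collector = LatexNodesCollector(
    latex_walker=latex_walker,
    token_reader=token_reader,
    parsing_state=parsing_state,
    # stop collecting when a closing brace token is encountered
    stop_token_condition=lambda token: token.tok == 'brace_close',
)
collector.process_tokens()
nodelist = collector.get_final_nodelist()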

get_parser_parsing_state_delta()

Doc. …………

pos_start()

Returns the first position of nodes in the collected node list (collected up to this point).

pos_end()

Returns the position immediately after the last node in the collected node list (collected up to this point).

stop_token_condition_met()

Returns True if the condition set as stop_token_condition was met while processing tokens.

stop_token_condition_met_token()

Returns the token that caused the stop condition to be met.

stop_nodelist_condition_met()

Returns True if the condition set as stop_nodelist_condition was met while processing tokens.

stop_condition_stop_data()

If a stopping condition was met, returns whatever the stopping condition callback returned that was non-None and caused the processing to stop.

reached_end_of_stream()

Returns True if we reached the end of the stream.

is_finalized()

Whether this object’s node list has been finalized.

Once the object is finalized, you cannot parse any more tokens. See finalize().

finalize()

Finalize this object’s node list. This ensures that any pending characters that were read are collected into a final chars node. (In the future, there might be other tasks to perform to finalize the node list.)

Normally you don’t have to worry about calling finalize() yourself, because it is automatically called by process_tokens(). You should only worry about calling finalize() if you are calling process_one_token() manually.

Once you call finalize(), you can no longer make any further calls to process_tokens() or process_one_token().

push_pending_chars(chars, pos)

This method should only be called internally or by subclass derived methods.

Adds chars to the pending chars string, i.e., the latest chars that we have seen that will have to be collected into a chars node once we encounter anything other than a regular char.

flush_pending_chars()

This method should only be called internally or by subclass derived methods.

Create a chars node out of all the pending chars that were added with calls to push_pending_chars(). Adds the chars node to the node list, and clears the pending chars string.

push_to_nodelist(node)

This method should only be called internally or by subclass derived methods.

Add the given node to the final node list that we are building.

update_state_from_parsing_state_delta(parsing_state_delta)

This method should only be called internally or by subclass derived methods.

Update our parsing_state attribute to account for any parsing state change information that might have been provided by some parsed construct (say, a macro call).

process_tokens()

Read tokens from token_reader until either we reach the end of the stream, or a stopping condition is met.

This function never returns anything interesting.

In all cases, the object is finalized (see finalize()) before this method finishes its execution, regardless of whether the function finishes by normal return or by raising an exception.

You can inspect the reason that caused the end of the processing using the methods stop_token_condition_met(), stop_nodelist_condition_met() and reached_end_of_stream().

You can then call get_final_nodelist() to get the nodelist, get_parser_parsing_state_delta() to get any carry-over information for the parser for future parsing, etc.

process_one_token()

Read a single token and process it, recursing into brace blocks and environments etc if needed, and appending stuff to nodelist.

Whereas process_tokens() gathers tokens into nodes until a stopping condition is met or until the end of the stream is reached, the process_one_token() provides finer control on the execution of the process of collecting tokens and gathering them into nodes.

Warning

Normally, it is better to use process_tokens() directly. If you want to read a single node, simply set a stopping condition that stops for instance once the node list has length at least one.

The process_one_token() method requires you to take care of some tasks yourself, which are normally automatically taken care of by process_tokens(). Read on below for more information.

A number of tasks that are taken care of by process_tokens() are NOT taken care of here:

  • If an end of stream is reached, we raise the exception LatexNodesCollector.ReachedEndOfStream. It’s up to you to catch it and do something relevant.

  • If a stopping condition is met, we raise the exception LatexNodesCollector.ReachedStoppingCondition. It’s up to you to catch it and do something relevant.

  • The function returns normally (without any return value) if neither a stopping condition is met nor the end of stream is reached. Normally, this means we should continue processing tokens.

  • You have to take care that you call finalize() on the nodes collector instance once you’re done processing tokens.

make_child_parsing_state(parsing_state, node_class)

Create a parsing state for a child node of the given type node_class.

You can reimplement this method to customize the parsing state of child nodes.
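
For instance (a hypothetical sketch of such a customization):

class MyNodesCollector(LatexNodesCollector):
    # hypothetical: parse all child content with LaTeX comments disabled
    def make_child_parsing_state(self, parsing_state, node_class):
        child_ps = super().make_child_parsing_state(parsing_state, node_class)
        return child_ps.sub_context(enable_comments=False)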

parse_comment_node(tok)

Process a token that introduces a comment. The token tok is of type tok.tok == 'comment'.

The default implementation creates a LatexCommentNode and pushes it onto the node list.

This method can be reimplemented to customize its behavior. Implementations should create the relevant node(s) and push them onto the node list with a call to push_to_nodelist() (refer to that method’s doc).

parse_latex_group(tok)

Process a token that introduces a LaTeX group (e.g. {a group}). The token tok is of type tok.tok == 'brace_open' according to the current parsing state.

The default implementation uses the make_latex_group_parser provided by the LatexWalker instance to parse the group node, and pushes the resulting node onto the node list.

This method can be reimplemented to customize its behavior. Implementations should create the relevant node(s) and push them onto the node list with a call to push_to_nodelist() (refer to that method’s doc).

parse_macro(tok)

Process a token representing a macro (e.g. \macro). The token tok is of type tok.tok == 'macro'.

The default implementation looks up the corresponding macro specification object via the parsing state’s latex context database, and defers to parse_invocable_token_type().

This method can be reimplemented to customize its behavior. Implementations should create the relevant node(s) and push them onto the node list with a call to push_to_nodelist() (refer to that method’s doc).

parse_environment(tok)

Process a token representing an environment (e.g. \begin{environment}). The token tok is of type tok.tok == 'begin_environment'.

The default implementation looks up the corresponding environment specification object via the parsing state’s latex context database, and defers to parse_invocable_token_type().

This method can be reimplemented to customize its behavior. Implementations should create the relevant node(s) and push them onto the node list with a call to push_to_nodelist() (refer to that method’s doc).

parse_specials(tok)

Process a token representing LaTeX specials (e.g. ~). The token tok is of type tok.tok == 'specials'.

The default implementation defers to parse_invocable_token_type().

This method can be reimplemented to customize its behavior. Implementations should create the relevant node(s) and push them onto the node list with a call to push_to_nodelist() (refer to that method’s doc).

parse_invocable_token_type(tok, spec, node_class, what)

Process a token representing either a macro call, a begin environment call, or specials chars.

This method is a convenience method that collects the similar processing for these three node types. The specification class is queried for the relevant parser object (spec.get_node_parser()), to which we defer for parsing the macro call / the environment / the specials.

Additionally, the current parsing state is updated using the carry-over information reported by the call parser.

This method can be reimplemented to customize its behavior. Implementations should create the relevant node(s) and push them onto the node list with a call to push_to_nodelist() (refer to that method’s doc).

parse_math(tok)

Process a token that introduces LaTeX math mode (e.g. $ ... $ or \[ ... \]). The token tok is of type tok.tok in ('mathmode_inline', 'mathmode_display') according to the current parsing state.

The default implementation uses the make_latex_math_parser() provided by the latex walker to parse the math mode node, and pushes the resulting node onto the node list.

This method can be reimplemented to customize its behavior. Implementations should create the relevant node(s) and push them onto the node list with a call to push_to_nodelist() (refer to that method’s doc).

Exception classes

class pylatexenc.latexnodes.LatexWalkerError

Generic exception class raised while parsing LaTeX code. Common base class of LatexWalkerLocatedError as well as LatexWalkerEndOfStream.

class pylatexenc.latexnodes.LatexWalkerLocatedError(msg, s=None, pos=None, lineno=None, colno=None, error_type_info=None, **kwargs)

Exception class raised to the user when there was an error dealing with LaTeX code. The exception is accompanied by information about where the error occurred in the source LaTeX code.

The following attributes are available if they were provided to the class constructor:

msg

The error message

s

The string that was currently being parsed

pos

The index in the string where the error occurred, starting at zero.

lineno

The line number where the error occurred, starting at 1.

colno

The column number where the error occurred in the line lineno, starting at 0.

input_source

The name of the source (e.g. file name) from which the LaTeX code was obtained. (Optional.)

error_type_info

Specify additional information about the error so that specific applications can interpret the error and provide more meaningful messages to the user. For instance, the message “Character is forbidden: ‘%’” might be cryptic to a user, whereas an application might be able to parse the error_type_info to see that the error is of the type of a forbidden character, and issue a message like “LaTeX comments are not permitted (‘%’ char forbidden), use ‘\%’ for a literal percent sign.”

The error_type_info attribute is a dictionary with at least one key named what. The what key should reflect the type of error that occurred, e.g., token_forbidden_character. Other keys might give additional information about the error (e.g., which character was encountered and was forbidden).
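
For instance, an application might catch the error and report its location as follows (a minimal sketch, assuming the high-level latexwalker.LatexWalker interface):

from pylatexenc.latexwalker import LatexWalker
from pylatexenc.latexnodes import LatexWalkerLocatedError

try:
    nodes, pos, length = LatexWalker(r"\textbf{oops", tolerant_parsing=False).get_latex_nodes()
except LatexWalkerLocatedError as e:
    # pos is a zero-based index into the input string; lineno/colno are set when available
    print("Parse error at position {}: {}".format(e.pos, e.msg))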

class pylatexenc.latexnodes.LatexWalkerLocatedErrorFormatter(exc)

Utility class to format the error message and location information of a LatexWalkerLocatedError exception for display to the user.

class pylatexenc.latexnodes.LatexWalkerParseError(msg, s=None, pos=None, lineno=None, colno=None, error_type_info=None, **kwargs)

Represents an error encountered while parsing LaTeX code, specifically while parsing the code into the nodes structure.

class pylatexenc.latexnodes.LatexWalkerNodesParseError(recovery_nodes=None, recovery_parsing_state_delta=None, recovery_at_token=None, recovery_past_token=None, **kwargs)

Represents an error while parsing content nodes, typically as a consequence of LatexWalker.parse_content(). This class carries some additional information about how best to recover from this parse error if we are operating in tolerant parsing mode. E.g., we can already report a list of nodes parsed so far.

In addition to the attributes inherited by LatexWalkerParseError, we have:

recovery_nodes

Nodes result (a LatexNode or LatexNodeList instance) to use as if the parser call had returned successfully.

recovery_parsing_state_delta

Parsing state delta to use as if the parser call had returned successfully.

recovery_at_token

If non-None, then we should reset the token reader’s internal position to try to continue parsing at the given token’s position.

recovery_past_token

If non-None, then we should reset the token reader’s internal position to try to continue parsing immediately after the given token’s position.

This attribute is not to be set if recovery_at_token is already non-None.

New in version 3.0: The LatexWalkerNodesParseError class was introduced in pylatexenc 3.

class pylatexenc.latexnodes.LatexWalkerTokenParseError(recovery_token_placeholder, recovery_token_at_pos, **kwargs)

Represents an error while parsing a single token of LaTeX code. See LatexTokenReader.

In addition to the attributes inherited by LatexWalkerParseError, we have:

recovery_token_placeholder

A LatexToken instance to use in place of a token that we tried, but failed, to parse.

recovery_token_at_pos

The position at which to reset the token_reader’s internal state to attempt to recover from this error.

New in version 3.0: The LatexWalkerTokenParseError class was introduced in pylatexenc 3.

class pylatexenc.latexnodes.LatexWalkerEndOfStream(final_space='')

Raised when the end of the input string is reached.

final_space

Any trailing space at the end of the input string that might need to be included in a character node.

New in version 2.0: The attribute final_space was added in pylatexenc 2.

Base classes

class pylatexenc.latexnodes.CallableSpecBase

The base class for macro, environment, and specials spec classes (see the pylatexenc.macrospec module).

As far as this latexnodes module’s classes are concerned, a spec object is simply something that can provide a parser to parse the given construct (macro, environment, or specials).

The spec object should implement get_node_parser(), and it should return a parser instance that can be used to parse the entire construct.

See macrospec.MacroSpec for how this is implemented in the pylatexenc.macrospec module.

New in version 3.0: The CallableSpecBase class was added in pylatexenc 3.0.

class pylatexenc.latexnodes.LatexWalkerParsingStateEventHandler

A LatexWalker parsing state event handler.

The LatexWalker instance will call methods on this object to determine how to update the parsing state upon certain events, such as entering or exiting math mode.

Events:

  • enter math mode

  • exit math mode

New in version 3.0: The LatexWalkerParsingStateEventHandler class was added in pylatexenc 3.0.

class pylatexenc.latexnodes.LatexWalkerBase

Base class for a latex-walker. Essentially, this is all that the classes and methods in the latexnodes module need to know about what a LatexWalker does.

See also latexwalker.LatexWalker.

New in version 3.0: The LatexWalkerBase class was added in pylatexenc 3.0.

parsing_state_event_handler()

Doc……

parse_content(parser, token_reader=None, parsing_state=None, open_context=None, **kwargs)

Doc……

make_node(node_class, **kwargs)

Doc……

make_nodelist(nodelist, **kwargs)

Doc……

make_nodes_collector(token_reader, parsing_state, **kwargs)

Doc……

make_latex_group_parser(delimiters)

Doc……

make_latex_math_parser(math_mode_delimiters)

Doc……

check_tolerant_parsing_ignore_error(exc)

You can inspect the exception object exc and decide whether or not to attempt to recover from the exception (if you want to be tolerant to parsing errors).

Return the exception object if it should be raised, or return None if recovery should be attempted.

format_node_pos(node)

Doc……

class pylatexenc.latexnodes.LatexContextDbBase

Base class for a parsing state’s LaTeX context database.

A full implementation of how to specify macro, environment, and specials definitions is actually provided in the pylatexenc.macrospec module. As far as this latexnodes module is concerned, a latex context database object is simply an object that provides the get_***_spec() family of methods along with test_for_specials(), and these return the relevant spec objects.

The spec objects returned by get_***_spec() and test_for_specials() are subclasses of CallableSpecBase.

New in version 3.0: The LatexContextDbBase class was added in pylatexenc 3.0.

get_macro_spec(macroname)

Return the macro spec to use to parse a macro named macroname. The macroname does not contain the escape character (\) itself.

This method should return the relevant spec object, which should be an instance of a subclass of CallableSpecBase.

The latex context database object may choose to provide a default spec object if macroname wasn’t formally defined. As far as the parsers are concerned, if get_macro_spec() returns a spec object, then the parsers know how to parse the given macro and will happily proceed.

If a macro of name macroname should not be considered as defined, and the parser should raise an error instead of attempting to parse a macro (or recover from the error in tolerant parsing mode), then this method should return None.

get_environment_spec(environmentname)

Like get_macro_spec(), but for environments. The environmentname is the name of the environment specified between the curly braces after the \begin call.

This method should return the relevant spec object, which should be an instance of a subclass of CallableSpecBase.

The latex context database object may choose to provide a default spec object if an environment named environmentname wasn’t somehow formally defined. As far as the parsers are concerned, if get_environment_spec() returns a spec object, then the parsers know how to parse the given environment and will happily proceed.

If an environment of name environmentname should not be considered as defined, and the parser should raise an error instead of attempting to parse the environment (or recover from the error in tolerant parsing mode), then this method should return None.

get_specials_spec(specials_chars)

Like get_macro_spec(), but for specials. The specials_chars is the sequence of characters for which we would like to know whether it forms a specials construct.

Parsing of specials is different from macros and environments, because there is no universal syntax that distinguishes them (macros and environments are always initiated with the escape character \). So the token reader will call test_for_specials() to see if the string at the given position can be matched for specials.

The result is that get_specials_spec() usually doesn’t get called when parsing tokens. The get_specials_spec() method is only called in certain specific situations, such as to get the spec object associated with the new paragraph token \n\n.

This method should return the relevant spec object, which should be an instance of a subclass of CallableSpecBase, or None if these characters are not to be considered as specials.

test_for_specials(s, pos, parsing_state)

Test the string s at position pos for the presence of specials.

For instance, if the parser tests the string "Eq.~\eqref{eq:xyz}" at position 3, then the latex context database might want to report the character ~ as a specials construct and return a specials spec for it.

If specials characters are recognized, then this method should return a corresponding spec object. The spec object should be an instance of a CallableSpecBase subclass. In addition, the returned spec object must expose the attribute specials_chars. That attribute should contain the sequence of characters that were recognized as special.

If no specials characters are recognized at exactly the position pos, then this method should return None.
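
As a minimal sketch, a custom latex context database might look like the following (hypothetical; the macro_spec constructor argument is assumed to be an instance of a CallableSpecBase subclass, e.g. a macrospec.MacroSpec):

class SingleMacroContextDb(LatexContextDbBase):
    # hypothetical context database that knows about a single macro and
    # defines no environments and no specials
    def __init__(self, macro_name, macro_spec):
        super().__init__()
        self.macro_name = macro_name
        self.macro_spec = macro_spec

    def get_macro_spec(self, macroname):
        return self.macro_spec if macroname == self.macro_name else None

    def get_environment_spec(self, environmentname):
        return None

    def get_specials_spec(self, specials_chars):
        return None

    def test_for_specials(self, s, pos, parsing_state):
        return None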

Node Classes

Parser Classes