Simple Latex to Text Converter

A simplistic, heuristic LaTeX code parser that returns a text-only approximation of its input. It is suitable, e.g., for indexing TeX code in a database for full-text search.

The main class is LatexNodes2Text. For a quick start, try:

from pylatexenc.latex2text import LatexNodes2Text

latex = "... LaTeX code ..."
text = LatexNodes2Text(strict_latex_spaces='macros').latex_to_text(latex)

Latex to Text Converter Class

class pylatexenc.latex2text.LatexNodes2Text(env_dict=None, macro_dict=None, text_replacements=None, **flags)

Simplistic Latex-To-Text Converter.

This class parses a nodes structure generated by the latexwalker module, and creates a text representation of the structure.

It can parse \input directives safely; see set_tex_input_directory() and read_input_file(). By default, \input and \include directives are ignored.

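Arguments to the constructor:

  • env_dict: a dictionary of known environment definitions; see default_env_dict for the expected format.

  • macro_dict: a dictionary of known macro definitions; see default_macro_dict for the expected format.

  • text_replacements: the replacements to apply to the converted text; see default_text_replacements for the expected format.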

Additional keyword arguments are flags which may influence the behavior:

  • keep_inline_math=True|False: If set to True, then inline math is kept using dollar signs, otherwise it is incorporated as normal text. (By default this is False)

  • keep_comments=True|False: If set to True, then LaTeX comments are kept (including the percent-sign); otherwise they are discarded. (By default this is False)

  • strict_latex_spaces=True|False: If set to True, then LaTeX’s handling of whitespace is followed closely. For instance, whitespace following a bare macro (i.e., one without any delimiting character such as ‘{’) is consumed. If set to False (the default), then some liberties are taken with whitespace (hopefully making the result slightly more aesthetic, although this behavior exists mostly for historical reasons).

    You may also use one of the presets strict_latex_spaces='default'|'macros'|'except-in-equations', which allow finer control over how whitespace is handled. The 'default' preset is the same as False. The 'macros' preset makes macros and other sequences of LaTeX constructs obey LaTeX space rules, but keeps indentation after comments and the more liberal whitespace rules in equations. The 'except-in-equations' preset applies strict LaTeX spacing only outside of equation contexts.

    Finally, strict_latex_spaces may also be set to a dictionary with the keys 'between-macro-and-chars', 'after-comment', 'between-latex-constructs', and 'in-equations'. Each value is either True or False and dictates the whitespace behavior in that specific case (True indicates strict LaTeX behavior). The value for 'in-equations' may instead be another dictionary with the same keys, to override the values inside equations.

    In the future, the default value of this setting might change, e.g., to ‘macros’.

  • keep_braced_groups=True|False: If set to True, then braces delimiting a TeX group {Like this} will be kept in the output, with the contents of the group converted to text as usual. (By default this is False)

  • keep_braced_groups_minlen=<int>: If keep_braced_groups is set to True, then we keep braced groups only if the length of their contents (after conversion to text) is longer than the given value. E.g., if keep_braced_groups_minlen=2, then {\'e}tonnant still becomes étonnant but {\'etonnant} becomes {étonnant}.

latex_to_text(latex, **parse_flags)

Parses the given latex code and returns its textual representation.

The parse_flags are passed on to the pylatexenc.latexwalker.LatexWalker constructor.

node_to_text(node, prev_node_hint=None)

Return the textual representation of the given node.

If prev_node_hint is specified, then the current node is formatted suitably as following the node given in prev_node_hint. This might affect how much space we keep/discard, etc.

nodelist_to_text(nodelist)

Extracts text from a node list. nodelist is a list of nodes as returned by pylatexenc.latexwalker.LatexWalker.get_latex_nodes().

In addition to converting each node in the list to text using node_to_text(), we apply some global replacements and fine-tuning to the resulting text to account for text_replacements (e.g., to fix quotes, handle tab-alignment '&' characters, etc.).

read_input_file(fn)

This method may be overridden to implement a custom lookup mechanism when encountering \input or \include directives.

The default implementation looks for a file of the given name relative to the directory set by set_tex_input_directory(). If strict_input=True was set, we ensure strictly that the file resides in a subtree of the reference input directory (after canonicalizing the paths and resolving all symlinks).

You may override this method to obtain the input data in whatever way you see fit. (In that case, a call to set_tex_input_directory() may not be needed, as that function simply sets properties which are used by the default implementation of read_input_file().)

This function accepts the referred filename as argument (the argument to the \input macro), and should return a string with the file contents (or generate a warning or raise an error).
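
The strict containment check described above can be pictured with the standard library alone. This is a stand-alone sketch of the idea, not the library's implementation; the function name resolves_inside and the file names are hypothetical.

```python
import os
import tempfile


def resolves_inside(base_dir, filename):
    """Return True if filename, taken relative to base_dir, stays inside base_dir.

    Both paths are canonicalized first, which also resolves symbolic links.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, filename))
    return os.path.commonpath([base, target]) == base


base_dir = tempfile.gettempdir()  # any existing directory works as the reference

print(resolves_inside(base_dir, "chapters/intro.tex"))  # True
print(resolves_inside(base_dir, "../outside.tex"))      # False
```

An override of read_input_file() that reads from the file system would typically want a check of this kind before opening the file.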

set_tex_input_directory(tex_input_directory, latex_walker_init_args=None, strict_input=True)

Set where to look for input files when encountering the \input or \include macro.

Alternatively, you may also override read_input_file() to implement a custom file lookup mechanism.

The argument tex_input_directory is the directory relative to which to search for input files.

If strict_input is set to True, then we always check that the referenced file lies within the subtree of tex_input_directory, prohibiting, for instance, tricks with ‘..’ in filenames or symbolic links that refer to files outside of the directory tree.

The argument latex_walker_init_args allows you to specify the parse flags passed to the constructor of pylatexenc.latexwalker.LatexWalker when parsing the input file.

Known Macros and Environments

class pylatexenc.latex2text.EnvDef(envname, simplify_repl=None, discard=False)

An environment definition.

  • envname: the name of the environment
  • simplify_repl: the replacement text of the environment. This is either a callable or a string. If it is a callable, it must accept a single argument, the pylatexenc.latexwalker.LatexEnvironmentNode representing the LaTeX environment. If it is a string, it may contain ‘%s’ which will be replaced by the environment contents converted to text.
  • discard: if set to True, then the full environment is discarded, i.e., it is converted to an empty string.

class pylatexenc.latex2text.MacroDef(macname, simplify_repl=None, discard=None)

A macro definition.

  • macname: the name of the macro (no backslash)
  • simplify_repl: either a string or a callable. The string may contain ‘%s’ replacements, in which the macro arguments will be substituted. The callable should accept the corresponding pylatexenc.latexwalker.LatexMacroNode as an argument.
  • discard: if set to True, then the macro call is discarded, i.e., it is converted to an empty string.

pylatexenc.latex2text.default_env_dict

The default context dictionary of known LaTeX environment definitions and how to convert them to text.

This is a dictionary whose keys are environment names (EnvDef.envname) and whose values are EnvDef instances.

pylatexenc.latex2text.default_macro_dict

The default context dictionary of known LaTeX macro definitions and how to convert them to text.

This is a dictionary whose keys are macro names (MacroDef.macname) and whose values are MacroDef instances.

pylatexenc.latex2text.default_text_replacements

Default text replacements (final touches) to apply to the converted text. (For instance, converting ~ to a space or '' to ".)

This is a list (or tuple) of (regex-pattern, replacement-text) pairs, each describing one replacement to perform.
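
How a list of (regex-pattern, replacement-text) pairs acts on a string can be sketched with the standard re module. The pairs below are illustrative stand-ins modeled on the examples above, not the library's actual defaults, and apply_replacements is a hypothetical helper.

```python
import re

# Illustrative (regex-pattern, replacement-text) pairs; not the actual defaults.
replacements = (
    (r"~", " "),       # non-breaking space becomes a plain space
    (r"``|''", '"'),   # TeX-style quotes become straight double quotes
)


def apply_replacements(text, pairs):
    """Apply each (pattern, replacement) pair in order to the whole text."""
    for pattern, replacement in pairs:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_replacements("Fig.~3 shows ``quotes''", replacements))
# prints: Fig. 3 shows "quotes"
```

The pairs are applied in order, so a later pattern sees the output of the earlier ones.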