latex2text — Simple Latex to Text Converter

A simplistic, heuristic LaTeX code parser that returns a text-only approximation. Suitable, e.g., for indexing TeX code in a database for full-text searching.

The main class is LatexNodes2Text. For a quick start, try:

from pylatexenc.latex2text import LatexNodes2Text

latex = "... LaTeX code ..."
text = LatexNodes2Text().latex_to_text(latex)

You may also use the command-line version of latex2text:

$ echo '\textit{italic} \`acc\^ented text' | latex2text
italic àccênted text

Custom latex conversion rules: A simple template

Here is a short introduction on how to customize the way that LatexNodes2Text converts LaTeX constructs (macros, environments, and specials) to unicode text. You can start off with the example template below and adapt it to your needs.

Macros, environments and specials are parsed as corresponding node objects by the parser (see pylatexenc.latexwalker.LatexMacroNode, pylatexenc.latexwalker.LatexEnvironmentNode, and pylatexenc.latexwalker.LatexSpecialsNode). These node objects are then converted to unicode text by the LatexNodes2Text object.

You can define new macros, environments, or specials, or override existing definitions. The definitions need to be provided twice. First, at the level of the parser using the macrospec module; the parser needs to know the argument structure of your macros, environments, and specials, along with which characters to recognize as “specials”. Second, at the level of latex2text, you need to specify what the replacement strings are for the different LaTeX constructs after they have been parsed into the latex node tree by the parser.

The following template is a simple illustrative example that implements the following definitions:

  • A new macro \putinquotes[`][']{text} that puts its mandatory argument into quotes defined by the two optional arguments. Let’s say that the default quotes that are used are `` and ''. Another simpler macro \putindblquotes{text} is also provided for the sake of the example.

  • A new environment \begin{inquotes}[`]['] ... \end{inquotes} that does the same thing as its macro equivalent. Another simpler environment \begin{indblquotes}...\end{indblquotes} is also provided for the sake of the example.

  • The usual LaTeX quote symbols `, ``, ', and '', which are converted to unicode quotes. (See also issue #39)

Here is the code (see also docs for pylatexenc.macrospec.MacroSpec, pylatexenc.macrospec.EnvironmentSpec, pylatexenc.macrospec.SpecialsSpec, as well as pylatexenc.latex2text.MacroTextSpec, pylatexenc.latex2text.EnvironmentTextSpec, pylatexenc.latex2text.SpecialsTextSpec):

from pylatexenc import latexwalker, latex2text, macrospec

#
# Define macros, environments, specials for the *parser*
#
lw_context_db = latexwalker.get_default_latex_context_db()
lw_context_db.add_context_category(
    'my-quotes',
    prepend=True,
    macros=[
        macrospec.MacroSpec("putindblquotes", "{"),
        macrospec.MacroSpec("putinquotes", "[[{"),
    ],
    environments=[
        macrospec.EnvironmentSpec("indblquotes", ""),
        macrospec.EnvironmentSpec("inquotes", "[["),
    ],
    specials=[
        macrospec.SpecialsSpec("`"),
        macrospec.SpecialsSpec("'"),
        macrospec.SpecialsSpec("``"),
        macrospec.SpecialsSpec("''"),
    ],
)

#
# Implement macros, environments, specials for the *conversion to text*
#

def _get_optional_arg(node, default, l2tobj):
    """Helper that returns the `node` converted to text, or `default`
    if the node is `None` (e.g. an optional argument that was not
    specified)"""
    if node is None:
        return default
    return l2tobj.nodelist_to_text([node])

def put_in_quotes_macro_repl(n, l2tobj):
    r"""Get the text replacement for the macro
    \putinquotes[open-quote][close-quote]{text}"""
    if not n.nodeargd:
        # n.nodeargd can be empty if e.g. \putinquotes was a single
        # token passed as an argument to a macro,
        # e.g. \newcommand\putinquotes...
        return ''
    open_q_s = _get_optional_arg(n.nodeargd.argnlist[0], '“', l2tobj)
    close_q_s = _get_optional_arg(n.nodeargd.argnlist[1], '”', l2tobj)
    return (open_q_s + l2tobj.nodelist_to_text([n.nodeargd.argnlist[2]])
            + close_q_s)

def in_quotes_env_repl(n, l2tobj):
    """Get the text replacement for the {inquotes} environment"""
    open_q_s = _get_optional_arg(n.nodeargd.argnlist[0], '“', l2tobj)
    close_q_s = _get_optional_arg(n.nodeargd.argnlist[1], '”', l2tobj)
    return open_q_s + l2tobj.nodelist_to_text(n.nodelist) + close_q_s

l2t_context_db = latex2text.get_default_latex_context_db()
l2t_context_db.add_context_category(
    'my-quotes',
    prepend=True,
    macros=[
        latex2text.MacroTextSpec("putindblquotes",
                                 simplify_repl=r'“%(1)s”'),
        latex2text.MacroTextSpec("putinquotes",
                                 simplify_repl=put_in_quotes_macro_repl),
    ],
    environments=[
        latex2text.EnvironmentTextSpec("indblquotes",
                                       simplify_repl=r'“%(body)s”'),
        latex2text.EnvironmentTextSpec("inquotes",
                                       simplify_repl=in_quotes_env_repl),
    ],
    specials=[
        latex2text.SpecialsTextSpec('`', "‘"),
        latex2text.SpecialsTextSpec("'", "’"),
        latex2text.SpecialsTextSpec('``', "“"),
        latex2text.SpecialsTextSpec("''", "”"),
    ],
)


#
# Here is an example usage:
#

def custom_latex_to_text( input_latex ):
    # the latex parser instance with custom latex_context
    lw_obj = latexwalker.LatexWalker(input_latex,
                                     latex_context=lw_context_db)
    # parse to node list
    nodelist, pos, length = lw_obj.get_latex_nodes()
    # initialize the converter to text with custom latex_context
    l2t_obj = latex2text.LatexNodes2Text(latex_context=l2t_context_db)
    # convert to text
    return l2t_obj.nodelist_to_text( nodelist )


print(custom_latex_to_text(
    r"""\begin{inquotes}[`][']Hello, world\end{inquotes}"""))
# ‘Hello, world’

print(custom_latex_to_text(r"""\putinquotes[``]['']{Hello, world}"""))
# “Hello, world”

print(custom_latex_to_text(r"""\putinquotes{Hello, world}"""))
# “Hello, world”

print(custom_latex_to_text(r"""\putinquotes[`][']{Hello, world}"""))
# ‘Hello, world’

Latex to Text Converter Class

class pylatexenc.latex2text.LatexNodes2Text(latex_context=None, **flags)

Simplistic Latex-To-Text Converter.

This class parses a nodes structure generated by the latexwalker module, and creates a text representation of the structure.

It is capable of parsing \input directives safely, see set_tex_input_directory() and read_input_file(). By default, \input and \include directives are ignored.

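Arguments to the constructor:

  • latex_context: A pylatexenc.macrospec.LatexContextDb instance that holds the rules for converting macros, environments, and specials to text. If None (the default), the default collection of rules is used; see get_default_latex_context_db().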

Additional keyword arguments are flags which may influence the behavior:

  • math_mode='text'|'with-delimiters'|'verbatim'|'remove': Specify how to treat chunks of LaTeX code that correspond to math modes. If ‘text’ (the default), then the math mode contents are incorporated as normal text. If ‘with-delimiters’, the contents are incorporated as normal text but are still enclosed in the original math-mode delimiters, such as ‘$…$’. If ‘verbatim’, then the math mode chunk is kept verbatim, including the delimiters. The value ‘remove’ means to remove the math mode sections entirely and not to produce any replacement text.

  • keep_comments=True|False: If set to True, then LaTeX comments are kept (including the percent-sign); otherwise they are discarded. (By default this is False)

  • fill_text: If set to True or to a positive integer, then the whitespace of LaTeX char blocks is re-laid out to fill lines at the given width in characters, or 80 by default. The fill is far from perfect, but the resulting text might be slightly more readable.

  • strict_latex_spaces=True|False: If set to True, then we follow closely LaTeX’s handling of whitespace. For instance, whitespace following a bare macro (i.e. without any delimiting characters like ‘{’) is consumed/removed. If set to False (the default), then some liberties are taken with respect to whitespace [hopefully making the result slightly more aesthetic, but this behavior is mostly there for historical reasons].

    You may also use one of the presets strict_latex_spaces='based-on-source'|'macros'|'except-in-equations', which allow for finer control of how whitespace is handled:

    • The value ‘based-on-source’ is the option that is furthest from LaTeX’s behavior with spaces. It takes liberties in including spaces that are present in the source file in several situations where LaTeX would remove them, including after macros. The intent is a slightly more aesthetic result. However, this option might inadvertently break up words. For instance:

      Sk\l odowska
      

      would be replaced by:

      Skł odowska
      
    • The value ‘macros’ is the same as specifying strict_latex_spaces=False, and it is the default. It will make macros and other sequences of LaTeX constructions obey LaTeX space rules, but will keep indentations after comments and keep more liberal whitespace rules in equations for a hopefully more aesthetic result.

    • The ‘except-in-equations’ preset goes as you would expect, setting strict latex spacing only outside of equation contexts.

    Finally, the argument strict_latex_spaces may also be set to a dictionary with keys ‘between-macro-and-chars’, ‘after-comment’, ‘between-latex-constructs’, and ‘in-equations’, with individual values either True or False, dictating whitespace behavior in specific cases (True indicates strict latex behavior). The value for ‘in-equations’ may even be another dictionary with the same keys to override values in equations. A value of False for ‘in-equations’ has the same meaning as ‘macros’.

    Changed in version 2.0: Since pylatexenc 2.0, the default value of strict_latex_spaces is ‘macros’, and no longer ‘based-on-source’.

    Deprecated since version 2.0: The value ‘default’ is also accepted, but it is no longer the default! It is an alias for ‘based-on-source’

    Changed in version 2.6: In pylatexenc versions 2.0–2.5, contrary to the documentation, the default value of strict_latex_spaces was actually still ‘based-on-source’. This bug was fixed in version 2.6, so that now, the default setting is actually ‘macros’.

  • keep_braced_groups=True|False: If set to True, then braces delimiting a TeX group {Like this} will be kept in the output, with the contents of the group converted to text as usual. (By default this is False)

  • keep_braced_groups_minlen=<int>: If keep_braced_groups is set to True, then we keep braced groups only if their contents length (after conversion to text) is longer than the given value. E.g., if keep_braced_groups_minlen=2, then {\'e}tonnant still goes to étonnant but {\'etonnant} becomes {étonnant}.

Additionally, the following arguments are accepted for backwards compatibility:

  • keep_inline_math=True|False: Obsolete since pylatexenc 2. If set to True, then this is the same as math_mode='verbatim', and if set to False, this is the same as math_mode='text'.

    Deprecated since version 2.0: The keep_inline_math= option is deprecated because it had a weird behavior and was poorly implemented, especially given that a similarly named option in LatexWalker had a different effect. See issue #14.

  • text_replacements: This argument is ignored starting from pylatexenc 2.

    Deprecated since version 2.0: Text replacements are no longer made at the end of the text conversion. This feature is replaced by the concept of LaTeX specials—see, e.g., pylatexenc.latexwalker.LatexSpecialsNode.

    To keep existing code working, add a call to apply_text_replacements() immediately after nodelist_to_text() to achieve the same effect as in pylatexenc 1.x. See apply_text_replacements().

  • env_dict, macro_dict: Obsolete since pylatexenc 2. If set, they are dictionaries of known environment and macro definitions. They default to default_env_dict and default_macro_dict, respectively.

    Deprecated since version 2.0: You should now use the more powerful option latex_context=. You cannot specify both macro_dict (or env_dict) and latex_context.

set_tex_input_directory(tex_input_directory, latex_walker_init_args=None, strict_input=True)

Set where to look for input files when encountering the \input or \include macro.

Alternatively, you may also override read_input_file() to implement a custom file lookup mechanism.

The argument tex_input_directory is the directory relative to which to search for input files.

If strict_input is set to True, then we always check that the referenced file lies within the subtree of tex_input_directory, prohibiting for instance hacks with ‘..’ in filenames or using symbolic links to refer to files out of the directory tree.

The argument latex_walker_init_args allows you to specify the parse flags passed to the constructor of pylatexenc.latexwalker.LatexWalker when parsing the input file.

read_input_file(fn)

This method may be overridden to implement a custom lookup mechanism when encountering \input or \include directives.

The default implementation looks for a file of the given name relative to the directory set by set_tex_input_directory(). If strict_input=True was set, we ensure strictly that the file resides in a subtree of the reference input directory (after canonicalizing the paths and resolving all symlinks).

If set_tex_input_directory() was not called, or if it was called with a value of None, then no file system access is attempted and an empty string is returned.

You may override this method to obtain the input data in whatever way you see fit. In that case, a call to set_tex_input_directory() may not be needed, as that function simply sets properties which are used by the default implementation of read_input_file().

This function accepts the referred filename as argument (the argument to the \input macro), and should return a string with the file contents (or generate a warning or raise an error).
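The subtree check described for strict_input can be sketched with the standard library alone. This is a minimal sketch of the general technique (canonicalize both paths, then compare path components), not pylatexenc's actual implementation; the helper name is_within_directory and the file names are hypothetical.

```python
import os
import tempfile

def is_within_directory(base_dir, filename):
    """Return True if `filename`, resolved relative to `base_dir`,
    lies inside the subtree rooted at `base_dir`.

    Hypothetical helper for illustration, not part of pylatexenc."""
    # realpath() canonicalizes the path and resolves symbolic links.
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, filename))
    # commonpath() compares whole path components, so "/a/bc" is not
    # mistaken for a child of "/a/b".
    return os.path.commonpath([base, target]) == base

with tempfile.TemporaryDirectory() as tmp:
    inside = is_within_directory(tmp, "chapters/intro.tex")
    outside = is_within_directory(tmp, "../secret.tex")
    print(inside, outside)  # True False
```

A custom read_input_file() that reads from the file system would probably want a similar guard before opening anything.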

latex_to_text(latex, **parse_flags)

Parses the given latex code and returns its textual representation.

This is equivalent to constructing a pylatexenc.latexwalker.LatexWalker with the given latex string, parsing the string into general nodes with a LatexGeneralNodesParser (see parse_content()), and providing the outcome to nodelist_to_text().

The parse_flags are keyword arguments to provide to the pylatexenc.latexwalker.LatexWalker constructor.

nodelist_to_text(nodelist)

Extracts text from a node list. nodelist is a list of latexwalker nodes, typically parsed using a LatexGeneralNodesParser (see parse_content()).

This function basically applies node_to_text() to each node and concatenates the results into one string. (This is not exactly the case, since we take some care as to where we add whitespace according to the class options.)

node_to_text(node, prev_node_hint=None, textcol=0)

Return the textual representation of the given node.

If prev_node_hint is specified, then the current node is formatted suitably as following the node given in prev_node_hint. This might affect how much space we keep/discard, etc.

chars_node_to_text(node, textcol=0)

Return the textual representation of the given node representing a block of simple latex text with no special characters or macros. The node is LatexCharsNode.

comment_node_to_text(node)

Return the textual representation of the given node representing a latex comment. The node is LatexCommentNode.

group_node_to_text(node)

Return the textual representation of the given node representing a latex group. The node is LatexGroupNode.

macro_node_to_text(node)

Return the textual representation of the given node representing a latex macro invocation. The node is LatexMacroNode.

environment_node_to_text(node)

Return the textual representation of the given node representing a full latex environment. The node is LatexEnvironmentNode.

specials_node_to_text(node)

Return the textual representation of the given node representing a special latex character (or characters). The node is LatexSpecialsNode.

math_node_to_text(node)

Return the textual representation of the given node representing a block of math mode latex. The node is either a LatexMathNode or a LatexEnvironmentNode.

This method is responsible for honoring the math_mode=… option provided to the constructor.

apply_simplify_repl(node, simplify_repl, what)

Utility to get the replacement text associated with a node for which we have a simplify_repl object (given by e.g. a MacroTextSpec or similar).

The argument what is used in error messages.

node_arg_to_text(node, k)

Return the textual representation of the k-th argument of the given node. This might be useful for substitution lambdas in macro and environment specs.

apply_text_replacements(s, text_replacements)

Convenience function for code that used text_replacements= in pylatexenc 1.x.

If you used custom text_replacements= in pylatexenc 1.x then you will have to change:

# pylatexenc 1.x with text_replacements
text_replacements = ...
l2t = LatexNodes2Text(..., text_replacements=text_replacements)
text = l2t.nodelist_to_text(...)

to:

# pylatexenc 2 text_replacements compatibility code
text_replacements = ...
l2t = LatexNodes2Text(...)
temp = l2t.nodelist_to_text(...)
text = l2t.apply_text_replacements(temp, text_replacements)

as a quick fix. It is recommended however to treat text replacements instead as “latex specials”. (Otherwise the brutal text replacements might act on text generated from macros and environments and give unwanted results.) See pylatexenc.macrospec.SpecialsSpec and SpecialsTextSpec.

Deprecated since version 2.0: The apply_text_replacements() method was introduced in pylatexenc 2.0 as a deprecated method. You can use it as a quick fix to make existing code run as it did in pylatexenc 1.x. Its use is however not recommended for new code. You should use “latex specials” instead for characters that have special LaTeX meaning.
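To see why blunt text replacements can give unwanted results, consider the following standalone sketch. It assumes that replacements are given as (pattern, replacement) pairs, where a pattern is a plain string or a compiled regular expression; the function apply_replacements is a hypothetical stand-in written for this example, not the library method.

```python
import re

def apply_replacements(text, replacements):
    """Apply (pattern, replacement) pairs to `text`, in order.

    Hypothetical stand-in for a post-processing step."""
    for pattern, replacement in replacements:
        if hasattr(pattern, "sub"):
            text = pattern.sub(replacement, text)      # compiled regex
        else:
            text = text.replace(pattern, replacement)  # plain string
    return text

replacements = [
    ("---", "\u2014"),           # three hyphens become an em dash
    ("--", "\u2013"),            # two hyphens become an en dash
    (re.compile(r"\s+"), " "),   # collapse runs of whitespace
]

# Imagine this text came out of the converter, and that "--verbose"
# was produced by a macro whose output should stay exactly as typed.
converted = "pages 3--5,   run with --verbose"
result = apply_replacements(converted, replacements)
print(result)
```

The page range is fixed as intended, but the hyphens of "--verbose" are rewritten as well, because the replacement step cannot tell where the text came from. Handling such sequences as latex specials at parse time avoids this.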

pylatexenc.latex2text.get_default_latex_context_db()

Return a pylatexenc.macrospec.LatexContextDb instance initialized with a collection of text replacements for known macros and environments.

TODO: clean up and document categories.

If you want to add your own definitions, you should use the pylatexenc.macrospec.LatexContextDb.add_context_category() method. If you would like to override some definitions, use that method with the argument prepend=True. See docs for pylatexenc.macrospec.LatexContextDb.add_context_category().

If there are too many macro/environment definitions, or if there are some irrelevant ones, you can always filter the returned database using pylatexenc.macrospec.LatexContextDb.filter_context().

New in version 2.0: The pylatexenc.macrospec.LatexContextDb class, as well as this method, was introduced in pylatexenc 2.0.

Define replacement texts

class pylatexenc.latex2text.MacroTextSpec(macroname, simplify_repl=None, discard=None)

A specification of how to obtain a textual representation of a macro.

macroname

The name of the macro (no backslash)

simplify_repl

The replacement text of the macro invocation. This is either a string or a callable:

  • If simplify_repl is a string, this string is used as the text representation of this macro node.

    The string may contain a single ‘%s’ replacement placeholder which will be replaced by the concatenated textual representation of all macro arguments. Alternatively, the string may contain ‘%(<n>)s’ (where <n> is an integer) to refer to the n-th argument (starting at ‘%(1)s’). You cannot mix the two %-formatting styles.

  • If simplify_repl is a callable, it should accept the corresponding pylatexenc.latexwalker.LatexMacroNode as an argument.

    The callable will be inspected to see what other arguments it accepts. If it accepts an argument named l2tobj, the LatexNodes2Text instance is provided to that argument. If it accepts an argument named macroname, the name of the macro is provided to that argument.
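The two placeholder styles for string replacements correspond to Python's two %-formatting styles. The sketch below shows them on plain strings only; the argument texts are made up, and building the mapping with string keys '1', '2', ... is an assumption made for the illustration.

```python
# Texts of the macro arguments after conversion (made-up values).
args = ["Hello", "world"]

# Style 1: a single '%s' receives all argument texts concatenated.
template_all = "<%s>"
out_all = template_all % "".join(args)

# Style 2: '%(<n>)s' refers to the n-th argument, counting from 1.
# The placeholders are named, so they are filled from a mapping whose
# keys are the argument numbers as strings.
template_nth = "%(2)s, %(1)s!"
out_nth = template_nth % {str(i): a for i, a in enumerate(args, start=1)}

print(out_all)  # <Helloworld>
print(out_nth)  # world, Hello!
```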

discard

If set to True, then the macro call is discarded, i.e., it is converted to an empty string.

New in version 2.0: The class MacroTextSpec was introduced in pylatexenc 2.0 to succeed the class previously named MacroDef.

class pylatexenc.latex2text.EnvironmentTextSpec(environmentname, simplify_repl=None, discard=False)

A specification of how to obtain a textual representation of an environment.

environmentname

The name of the environment

simplify_repl

The replacement text of the environment. This is either a string or a callable:

  • If simplify_repl is a string, this string is used as the text representation of this environment node.

    The string may contain a single ‘%s’ replacement placeholder, in which the (processed) environment body will be substituted.

    Alternatively, the simplify_repl string may contain ‘%(<n>)s’ (where <n> is an integer) to refer to the n-th argument after \begin{environment} (starting at ‘%(1)s’). The body of the environment has to be referred to with %(body)s.

    You cannot mix the two %-formatting styles.

  • If simplify_repl is a callable, it should accept the corresponding pylatexenc.latexwalker.LatexEnvironmentNode as an argument.

    The callable will be inspected to see what other arguments it accepts. If it accepts an argument named l2tobj, the LatexNodes2Text instance is provided to that argument. If it accepts an argument named environmentname, the name of the environment is provided to that argument.
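The idea of inspecting a callable and passing only the arguments it declares can be sketched with the standard inspect module. This is a minimal sketch of the general technique; call_with_accepted_args and the two replacement functions are hypothetical, and plain strings stand in for the node and the converter object.

```python
import inspect

def call_with_accepted_args(func, node, **optional):
    """Call func(node, ...) passing only those optional keyword
    arguments that `func` declares as parameters.

    Hypothetical helper, not pylatexenc's actual implementation."""
    params = inspect.signature(func).parameters
    accepted = {name: value for name, value in optional.items()
                if name in params}
    return func(node, **accepted)

def repl_plain(node):
    return "[%s]" % node

def repl_with_name(node, environmentname):
    return "[%s in %s]" % (node, environmentname)

# Strings stand in for the node and the LatexNodes2Text instance here.
extras = {"l2tobj": "<converter>", "environmentname": "inquotes"}
plain = call_with_accepted_args(repl_plain, "body", **extras)
named = call_with_accepted_args(repl_with_name, "body", **extras)
print(plain)  # [body]
print(named)  # [body in inquotes]
```

This is why a replacement callable may declare l2tobj or environmentname only when it needs them.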

discard

If set to True, then the full environment is discarded, i.e., it is converted to an empty string.

New in version 2.0: The class EnvironmentTextSpec was introduced in pylatexenc 2.0 to succeed the class previously named EnvDef.

class pylatexenc.latex2text.SpecialsTextSpec(specials_chars, simplify_repl=None)

A specification of how to obtain a textual representation of latex specials.

specials_chars

The sequence of special LaTeX characters

simplify_repl

The replacement text for the given latex specials. This is either a string or a callable:

  • If simplify_repl is a string, this string is used as the text representation of this specials node.

    The string may contain a single ‘%s’ replacement placeholder which will be replaced by the concatenated textual representation of all arguments.

    Alternatively, the string may contain ‘%(<n>)s’ (where <n> is an integer) to refer to the n-th argument (starting at ‘%(1)s’). You cannot mix the two %-formatting styles.

  • If simplify_repl is a callable, it should accept the corresponding pylatexenc.latexwalker.LatexSpecialsNode as an argument.

    The callable will be inspected to see what other arguments it accepts. If it accepts an argument named l2tobj, the LatexNodes2Text instance is provided to that argument. If it accepts an argument named specials_chars, the characters that were parsed into this “latex specials” node are provided to that argument.

New in version 2.0: Latex specials were introduced in pylatexenc 2.0.

Obsolete members

pylatexenc.latex2text.EnvDef(envname, simplify_repl=None, discard=False)

Deprecated since version 2.0: Instantiate an EnvironmentTextSpec instead.

Since pylatexenc 2.0, EnvDef is a function which returns an EnvironmentTextSpec instance. In this way the earlier idiom EnvDef(...) still works in pylatexenc 2.

pylatexenc.latex2text.MacroDef(macname, simplify_repl=None, discard=None)

Deprecated since version 2.0: Instantiate a MacroTextSpec instead.

Since pylatexenc 2.0, MacroDef is a function which returns a MacroTextSpec instance. In this way the earlier idiom MacroDef(...) still works in pylatexenc 2.

pylatexenc.latex2text.default_env_dict

Deprecated since version 2.0: Use get_default_latex_context_db() instead, or create your own pylatexenc.macrospec.LatexContextDb object.

Provides access to the default environment text replacement specs for latex2text in a form that is compatible with pylatexenc 1.x's default_env_dict module-level dictionary.

This is implemented using a custom lazy mutable mapping, which behaves just like a regular dictionary but loads the data only once the dictionary is accessed. In this way the default latex specs are not loaded into a python dictionary unless they are actually queried or modified, and thus users of pylatexenc 2.0 who don’t rely on the default macro/environment definitions shouldn’t notice any decrease in performance.

pylatexenc.latex2text.default_macro_dict

Deprecated since version 2.0: Use get_default_latex_context_db() instead, or create your own pylatexenc.macrospec.LatexContextDb object.

Provides access to the default macro text replacement specs for latex2text in a form that is compatible with pylatexenc 1.x's default_macro_dict module-level dictionary.

This is implemented using a custom lazy mutable mapping, which behaves just like a regular dictionary but loads the data only once the dictionary is accessed. In this way the default latex specs are not loaded into a python dictionary unless they are actually queried or modified, and thus users of pylatexenc 2.0 who don’t rely on the default macro/environment definitions shouldn’t notice any decrease in performance.
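A lazy mutable mapping of the kind described for default_env_dict and default_macro_dict can be sketched as follows. The class LazyDict, the loader function, and the sample entries are hypothetical and only illustrate the general pattern; the library's own class may differ.

```python
from collections.abc import MutableMapping

class LazyDict(MutableMapping):
    """Dictionary-like object that builds its contents on first access.

    Hypothetical class for illustration; `loader` is any callable that
    returns the initial dictionary."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None  # nothing loaded yet

    def _ensure_loaded(self):
        if self._data is None:
            self._data = dict(self._loader())
        return self._data

    def __getitem__(self, key):
        return self._ensure_loaded()[key]

    def __setitem__(self, key, value):
        self._ensure_loaded()[key] = value

    def __delitem__(self, key):
        del self._ensure_loaded()[key]

    def __iter__(self):
        return iter(self._ensure_loaded())

    def __len__(self):
        return len(self._ensure_loaded())

load_calls = []

def load_defaults():
    load_calls.append(1)  # record that the expensive load happened
    return {"textbf": "%s", "emph": "%s"}

specs = LazyDict(load_defaults)
calls_before = len(load_calls)   # 0: creating the object loads nothing
first = specs["emph"]            # first access triggers the load
specs["mymacro"] = "(%s)"        # modifications work as with a dict
calls_after = len(load_calls)    # 1: the loader ran exactly once
print(calls_before, calls_after, sorted(specs))
# 0 1 ['emph', 'mymacro', 'textbf']
```

Code that never touches the mapping never pays for the load, which is the performance property described above.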

pylatexenc.latex2text.default_text_replacements

Deprecated since version 2.0: Text replacements are deprecated since pylatexenc 2.0 with the advent of “latex specials”. See LatexNodes2Text.apply_text_replacements() for a quick solution to keep existing code working if it uses custom text replacements.