latex2text — Simple Latex to Text Converter¶
A simplistic, heuristic LaTeX code parser that returns a text-only approximation of its input. It is suitable, for example, for indexing TeX code in a database for full-text searching.
The main class is LatexNodes2Text. For a quick start, try:
from pylatexenc.latex2text import LatexNodes2Text
latex = "... LaTeX code ..."
text = LatexNodes2Text().latex_to_text(latex)
You may also use the command-line version of latex2text:
$ echo '\textit{italic} \`acc\^ented text' | latex2text
italic àccênted text
Custom latex conversion rules: A simple template¶
Here is a short introduction on how to customize the way that LatexNodes2Text converts LaTeX constructs (macros, environments, and specials) to unicode text. You can start off with the example template below and adapt it to your needs.
Macros, environments and specials are parsed as corresponding node objects by the parser (see pylatexenc.latexwalker.LatexMacroNode, pylatexenc.latexwalker.LatexEnvironmentNode, and pylatexenc.latexwalker.LatexSpecialsNode). These node objects are then converted to unicode text by the LatexNodes2Text object.
You can define new macros, environments, or specials, or override existing definitions. The definitions need to be provided twice. First, at the level of the parser using the macrospec module; the parser needs to know the argument structure of your macros, environments, and specials, along with which characters to recognize as “specials”. Second, at the level of latex2text, you need to specify what the replacement strings are for the different LaTeX constructs after they have been parsed into the latex node tree by the parser.
The following template is a simple illustrative example that implements the following definitions:
- A new macro \putinquotes[`][']{text} that puts its mandatory argument into quotes defined by the two optional arguments. Let’s say that the default quotes are `` and ''. Another, simpler macro \putindblquotes{text} is also provided for the sake of the example.
- A new environment \begin{inquotes}[`]['] ... \end{inquotes} that does the same thing as its macro equivalent. Another, simpler environment \begin{indblquotes}...\end{indblquotes} is also provided for the sake of the example.
- The usual LaTeX quote symbols `, ``, ', and '' are converted to unicode quotes. (See also issue #39)
Here is the code (see also docs for pylatexenc.macrospec.MacroSpec, pylatexenc.macrospec.EnvironmentSpec, pylatexenc.macrospec.SpecialsSpec, as well as pylatexenc.latex2text.MacroTextSpec, pylatexenc.latex2text.EnvironmentTextSpec, pylatexenc.latex2text.SpecialsTextSpec):
from pylatexenc import latexwalker, latex2text, macrospec

#
# Define macros, environments, specials for the *parser*
#
lw_context_db = latexwalker.get_default_latex_context_db()
lw_context_db.add_context_category(
    'my-quotes',
    prepend=True,
    macros=[
        macrospec.MacroSpec("putindblquotes", "{"),
        macrospec.MacroSpec("putinquotes", "[[{"),
    ],
    environments=[
        macrospec.EnvironmentSpec("indblquotes", ""),
        macrospec.EnvironmentSpec("inquotes", "[["),
    ],
    specials=[
        macrospec.SpecialsSpec("`"),
        macrospec.SpecialsSpec("'"),
        macrospec.SpecialsSpec("``"),
        macrospec.SpecialsSpec("''"),
    ],
)

#
# Implement macros, environments, specials for the *conversion to text*
#

def _get_optional_arg(node, default, l2tobj):
    """Helper that returns the `node` converted to text, or `default`
    if the node is `None` (e.g. an optional argument that was not
    specified)"""
    if node is None:
        return default
    return l2tobj.nodelist_to_text([node])

def put_in_quotes_macro_repl(n, l2tobj):
    r"""Get the text replacement for the macro
    \putinquotes[open-quote][close-quote]{text}"""
    if not n.nodeargd:
        # n.nodeargd can be empty if e.g. \putinquotes was a single
        # token passed as an argument to a macro,
        # e.g. \newcommand\putinquotes...
        return ''
    open_q_s = _get_optional_arg(n.nodeargd.argnlist[0], '“', l2tobj)
    close_q_s = _get_optional_arg(n.nodeargd.argnlist[1], '”', l2tobj)
    return (open_q_s + l2tobj.nodelist_to_text([n.nodeargd.argnlist[2]])
            + close_q_s)

def in_quotes_env_repl(n, l2tobj):
    """Get the text replacement for the {inquotes} environment"""
    open_q_s = _get_optional_arg(n.nodeargd.argnlist[0], '“', l2tobj)
    close_q_s = _get_optional_arg(n.nodeargd.argnlist[1], '”', l2tobj)
    return open_q_s + l2tobj.nodelist_to_text(n.nodelist) + close_q_s

l2t_context_db = latex2text.get_default_latex_context_db()
l2t_context_db.add_context_category(
    'my-quotes',
    prepend=True,
    macros=[
        latex2text.MacroTextSpec("putindblquotes",
                                 simplify_repl=r'“%(1)s”'),
        latex2text.MacroTextSpec("putinquotes",
                                 simplify_repl=put_in_quotes_macro_repl),
    ],
    environments=[
        latex2text.EnvironmentTextSpec("indblquotes",
                                       simplify_repl=r'“%(body)s”'),
        latex2text.EnvironmentTextSpec("inquotes",
                                       simplify_repl=in_quotes_env_repl),
    ],
    specials=[
        latex2text.SpecialsTextSpec('`', "‘"),
        latex2text.SpecialsTextSpec("'", "’"),
        latex2text.SpecialsTextSpec('``', "“"),
        latex2text.SpecialsTextSpec("''", "”"),
    ],
)

#
# Here is an example usage:
#

def custom_latex_to_text(input_latex):
    # the latex parser instance with custom latex_context
    lw_obj = latexwalker.LatexWalker(input_latex,
                                     latex_context=lw_context_db)
    # parse to node list
    nodelist, pos, length = lw_obj.get_latex_nodes()
    # initialize the converter to text with custom latex_context
    l2t_obj = latex2text.LatexNodes2Text(latex_context=l2t_context_db)
    # convert to text
    return l2t_obj.nodelist_to_text(nodelist)

print(custom_latex_to_text(
    r"""\begin{inquotes}[`][']Hello, world\end{inquotes}"""))
# ‘Hello, world’

print(custom_latex_to_text(r"""\putinquotes[``]['']{Hello, world}"""))
# “Hello, world”

print(custom_latex_to_text(r"""\putinquotes{Hello, world}"""))
# “Hello, world”

print(custom_latex_to_text(r"""\putinquotes[`][']{Hello, world}"""))
# ‘Hello, world’
Latex to Text Converter Class¶
- class pylatexenc.latex2text.LatexNodes2Text(latex_context=None, **flags)¶
Simplistic Latex-To-Text Converter.
This class processes a node structure generated by the latexwalker module, and creates a text representation of the structure.
It is capable of parsing \input directives safely, see set_tex_input_directory() and read_input_file(). By default, \input and \include directives are ignored.
Arguments to the constructor:
latex_context is a pylatexenc.macrospec.LatexContextDb object storing a collection of rules for converting macros, environments, and other latex specials to text. The LatexContextDb should contain specifications via MacroTextSpec, EnvironmentTextSpec, and SpecialsTextSpec objects.
The default latex context database can be obtained using get_default_latex_context_db().
Additional keyword arguments are flags which may influence the behavior:
math_mode=’text’|’with-delimiters’|’verbatim’|’remove’: Specify how to treat chunks of LaTeX code that correspond to math modes. If ‘text’ (the default), then the math mode content is incorporated as normal text. If ‘with-delimiters’, the content is incorporated as normal text but it is still enclosed in the original math-mode delimiters, such as ‘$…$’. If ‘verbatim’, then the math mode chunk is kept verbatim, including the delimiters. The value ‘remove’ means to remove the math mode sections entirely and not to produce any replacement text.
keep_comments=True|False: If set to True, then LaTeX comments are kept (including the percent-sign); otherwise they are discarded. (By default this is False)
fill_text: If set to True or to a positive integer, then the whitespace of LaTeX char blocks is re-laid out to fill at the given number of characters, or 80 by default. The fill is far from perfect, but the resulting text might be slightly more readable.
strict_latex_spaces=True|False: If set to True, then we follow closely LaTeX’s handling of whitespace. For instance, whitespace following a bare macro (i.e. without any delimiting characters like ‘{’) is consumed/removed. If set to False (the default), then some liberties are taken with respect to whitespace [hopefully making the result slightly more aesthetic, but this behavior is mostly there for historical reasons].
You may also use one of the presets strict_latex_spaces=’based-on-source’|’macros’|’except-in-equations’, which allow for finer control of how whitespace is handled:
The value ‘based-on-source’ is the option that is furthest from latex’s behavior with spaces, and takes liberties in including spaces that are present in the source file in several situations where LaTeX would remove them, including after macros. This is meant to be hopefully slightly more aesthetic. However, this option might inadvertently break up words. For instance:
Sk\l odowska
would be replaced by:
Skł odowska
The value ‘macros’ is the same as specifying strict_latex_spaces=False, and it is the default. It will make macros and other sequences of LaTeX constructions obey LaTeX space rules, but will keep indentations after comments and keep more liberal whitespace rules in equations for a hopefully more aesthetic result.
The ‘except-in-equations’ preset goes as you would expect, setting strict latex spacing only outside of equation contexts.
Finally, the argument strict_latex_spaces may also be set to a dictionary with keys ‘between-macro-and-chars’, ‘after-comment’, ‘between-latex-constructs’, and ‘in-equations’, with individual values either True or False, dictating whitespace behavior in specific cases (True indicates strict latex behavior). The value for ‘in-equations’ may even be another dictionary with the same keys to override values in equations. A value of False for ‘in-equations’ has the same meaning as ‘macros’.
Changed in version 2.0: Since pylatexenc 2.0, the default value of strict_latex_spaces is ‘macros’, and no longer ‘based-on-source’.
Deprecated since version 2.0: The value ‘default’ is also accepted, but it is no longer the default! It is an alias for ‘based-on-source’
Changed in version 2.6: In pylatexenc versions 2.0–2.5, contrary to the documentation, the default value of strict_latex_spaces was actually still ‘based-on-source’. This bug was fixed in version 2.6, so that now, the default setting is actually ‘macros’.
keep_braced_groups=True|False: If set to True, then braces delimiting a TeX group {Like this} will be kept in the output, with the contents of the group converted to text as usual. (By default this is False)
keep_braced_groups_minlen=<int>: If keep_braced_groups is set to True, then we keep braced groups only if their contents length (after conversion to text) is longer than the given value. E.g., if keep_braced_groups_minlen=2, then {\'e}tonnant still goes to étonnant but {\'etonnant} becomes {étonnant}.
Additionally, the following arguments are accepted for backwards compatibility:
keep_inline_math=True|False: Obsolete since pylatexenc 2. If set to True, then this is the same as math_mode=’verbatim’, and if set to False, this is the same as math_mode=’text’.
Deprecated since version 2.0: The keep_inline_math= option is deprecated because it had a weird behavior and was poorly implemented, especially given that a similarly named option in LatexWalker had a different effect. See issue #14.
text_replacements: This argument is ignored starting from pylatexenc 2.
Deprecated since version 2.0: Text replacements are no longer made at the end of the text conversion. This feature is replaced by the concept of LaTeX specials; see, e.g., pylatexenc.latexwalker.LatexSpecialsNode.
To keep existing code working, add a call to apply_text_replacements() immediately after nodelist_to_text() to achieve the same effect as in pylatexenc 1.x. See apply_text_replacements().
env_dict, macro_dict: Obsolete since pylatexenc 2. If set, they are dictionaries of known environment and macro definitions. They default to default_env_dict and default_macro_dict, respectively.
Deprecated since version 2.0: You should now use the more powerful option latex_context=. You cannot specify both macro_dict (or env_dict) and latex_context.
- set_tex_input_directory(tex_input_directory, latex_walker_init_args=None, strict_input=True)¶
Set where to look for input files when encountering the \input or \include macro.
Alternatively, you may also override read_input_file() to implement a custom file lookup mechanism.
The argument tex_input_directory is the directory relative to which to search for input files.
If strict_input is set to True, then we always check that the referenced file lies within the subtree of tex_input_directory, prohibiting for instance hacks with ‘..’ in filenames or using symbolic links to refer to files out of the directory tree.
The argument latex_walker_init_args allows you to specify the parse flags passed to the constructor of pylatexenc.latexwalker.LatexWalker when parsing the input file.
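To make the strict_input idea concrete, here is a standalone sketch of a subtree containment check using only the Python standard library. It is not pylatexenc's own code: the helper name is_within_directory is hypothetical, and the library's actual check may differ in detail.

```python
import os
import tempfile


def is_within_directory(base_dir, candidate):
    """Return True if `candidate`, taken relative to `base_dir`, resolves
    to a path inside `base_dir` once '..' and symlinks are resolved."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, candidate))
    # commonpath() equals the base exactly when target lies at or below it
    return os.path.commonpath([base, target]) == base


with tempfile.TemporaryDirectory() as root:
    inside_ok = is_within_directory(root, "chapters/intro.tex")
    escape_blocked = not is_within_directory(root, "../secret.tex")

print(inside_ok, escape_blocked)
# True True
```

Resolving both paths before comparing them is what defeats ‘..’ components and symbolic links; a plain string-prefix comparison on the unresolved paths would not.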
- read_input_file(fn)¶
This method may be overridden to implement a custom lookup mechanism when encountering \input or \include directives.
The default implementation looks for a file of the given name relative to the directory set by set_tex_input_directory(). If strict_input=True was set, we ensure strictly that the file resides in a subtree of the reference input directory (after canonicalizing the paths and resolving all symlinks).
If set_tex_input_directory() was not called, or if it was called with a value of None, then no file system access is attempted and an empty string is returned.
You may override this method to obtain the input data in whatever way you see fit. In that case, a call to set_tex_input_directory() may not be needed as that function simply sets properties which are used by the default implementation of read_input_file().
This function accepts the referred filename as argument (the argument to the \input macro), and should return a string with the file contents (or generate a warning or raise an error).
- latex_to_text(latex, **parse_flags)¶
Parses the given latex code and returns its textual representation.
This is equivalent to constructing a pylatexenc.latexwalker.LatexWalker with the given latex string, parsing the string into general nodes with a LatexGeneralNodesParser (see parse_content()), and providing the outcome to nodelist_to_text().
The parse_flags are keyword arguments to provide to the pylatexenc.latexwalker.LatexWalker constructor.
- nodelist_to_text(nodelist)¶
Extracts text from a node list. nodelist is a list of latexwalker nodes, typically parsed using a LatexGeneralNodesParser (see parse_content()).
This function basically applies node_to_text() to each node and concatenates the results into one string. (This is not quite the case, since we take some care as to where we add whitespace according to the class options.)
- node_to_text(node, prev_node_hint=None, textcol=0)¶
Return the textual representation of the given node.
If prev_node_hint is specified, then the current node is formatted suitably as following the node given in prev_node_hint. This might affect how much space we keep/discard, etc.
- chars_node_to_text(node, textcol=0)¶
Return the textual representation of the given node representing a block of simple latex text with no special characters or macros. The node is a LatexCharsNode.
- comment_node_to_text(node)¶
Return the textual representation of the given node representing a latex comment. The node is a LatexCommentNode.
- group_node_to_text(node)¶
Return the textual representation of the given node representing a latex group. The node is a LatexGroupNode.
- macro_node_to_text(node)¶
Return the textual representation of the given node representing a latex macro invocation. The node is a LatexMacroNode.
- environment_node_to_text(node)¶
Return the textual representation of the given node representing a full latex environment. The node is a LatexEnvironmentNode.
- specials_node_to_text(node)¶
Return the textual representation of the given node representing a special latex character (or characters). The node is a LatexSpecialsNode.
- math_node_to_text(node)¶
Return the textual representation of the given node representing a block of math mode latex. The node is either a LatexMathNode or a LatexEnvironmentNode.
This method is responsible for honoring the math_mode=… option provided to the constructor.
- apply_simplify_repl(node, simplify_repl, what)¶
Utility to get the replacement text associated with a node for which we have a simplify_repl object (given by e.g. a MacroTextSpec or similar).
The argument what is used in error messages.
- node_arg_to_text(node, k)¶
Return the textual representation of the k-th argument of the given node. This might be useful for substitution lambdas in macro and environment specs.
- apply_text_replacements(s, text_replacements)¶
Convenience function for code that used text_replacements= in pylatexenc 1.x.
If you used custom text_replacements= in pylatexenc 1.x then you will have to change:
# pylatexenc 1.x with text_replacements
text_replacements = ...
l2t = LatexNodes2Text(..., text_replacements=text_replacements)
text = l2t.nodelist_to_text(...)
to:
# pylatexenc 2 text_replacements compatibility code
text_replacements = ...
l2t = LatexNodes2Text(...)
temp = l2t.nodelist_to_text(...)
text = l2t.apply_text_replacements(temp, text_replacements)
as a quick fix. It is recommended however to treat text replacements instead as “latex specials”. (Otherwise the brutal text replacements might act on text generated from macros and environments and give unwanted results.) See pylatexenc.macrospec.SpecialsSpec and SpecialsTextSpec.
Deprecated since version 2.0: The apply_text_replacements() method was introduced in pylatexenc 2.0 as a deprecated method. You can use it as a quick fix to make existing code run as it did in pylatexenc 1.x. Its use is however not recommended for new code. You should use “latex specials” instead for characters that have special LaTeX meaning.
- pylatexenc.latex2text.get_default_latex_context_db()¶
Return a pylatexenc.macrospec.LatexContextDb instance initialized with a collection of text replacements for known macros and environments.
TODO: clean up and document categories.
If you want to add your own definitions, you should use the pylatexenc.macrospec.LatexContextDb.add_context_category() method. If you would like to override some definitions, use that method with the argument prepend=True. See docs for pylatexenc.macrospec.LatexContextDb.add_context_category().
If there are too many macro/environment definitions, or if there are some irrelevant ones, you can always filter the returned database using pylatexenc.macrospec.LatexContextDb.filter_context().
New in version 2.0: The pylatexenc.macrospec.LatexContextDb class as well as this method, were all introduced in pylatexenc 2.0.
Define replacement texts¶
- class pylatexenc.latex2text.MacroTextSpec(macroname, simplify_repl=None, discard=None)¶
A specification of how to obtain a textual representation of a macro.
- macroname¶
The name of the macro (no backslash)
- simplify_repl¶
The replacement text of the macro invocation. This is either a string or a callable:
If simplify_repl is a string, this string is used as the text representation of this macro node.
The string may contain a single ‘%s’ replacement placeholder which will be replaced by the concatenated textual representation of all macro arguments. Alternatively, the string may contain ‘%(<n>)s’ (where <n> is an integer) to refer to the n-th argument (starting at ‘%(1)s’). You cannot mix the two %-formatting styles.
If simplify_repl is a callable, it should accept the corresponding pylatexenc.latexwalker.LatexMacroNode as an argument.
The callable will be inspected to see what other arguments it accepts. If it accepts an argument named l2tobj, the LatexNodes2Text instance is provided to that argument. If it accepts an argument named macroname, the name of the macro is provided to that argument.
- discard¶
If set to True, then the macro call is discarded, i.e., it is converted to an empty string.
New in version 2.0: The class MacroTextSpec was introduced in pylatexenc 2.0 to succeed the previously named MacroDef class.
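The "callable will be inspected" behavior can be pictured with the standard inspect module. The sketch below is an independent illustration of the idea, not pylatexenc's implementation: the helper call_with_supported_kwargs and the three sample callables are hypothetical, and plain strings stand in for the node and converter objects.

```python
import inspect


def call_with_supported_kwargs(fn, node, **optional):
    """Call fn(node, ...) passing only the optional keyword arguments
    that fn actually declares (or all of them if fn takes **kwargs)."""
    params = inspect.signature(fn).parameters
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD
                      for p in params.values())
    kwargs = {name: value for name, value in optional.items()
              if accepts_any or name in params}
    return fn(node, **kwargs)


def plain_repl(node):
    return "plain"


def named_repl(node, macroname):
    return "macro:" + macroname


def full_repl(node, l2tobj, macroname):
    return "%s via %s" % (macroname, l2tobj)


extras = {"l2tobj": "converter", "macroname": "putinquotes"}
results = [call_with_supported_kwargs(fn, None, **extras)
           for fn in (plain_repl, named_repl, full_repl)]
print(results)
# ['plain', 'macro:putinquotes', 'putinquotes via converter']
```

With this kind of dispatch, a replacement callable only declares the extra arguments it needs, which is why put_in_quotes_macro_repl(n, l2tobj) in the template above does not have to accept macroname.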
- class pylatexenc.latex2text.EnvironmentTextSpec(environmentname, simplify_repl=None, discard=False)¶
A specification of how to obtain a textual representation of an environment.
- environmentname¶
The name of the environment
- simplify_repl¶
The replacement text of the environment. This is either a string or a callable:
If simplify_repl is a string, this string is used as the text representation of this environment node.
The string may contain a single ‘%s’ replacement placeholder, in which the (processed) environment body will be substituted.
Alternatively, the simplify_repl string may contain ‘%(<n>)s’ (where <n> is an integer) to refer to the n-th argument after \begin{environment} (starting at ‘%(1)s’). The body of the environment has to be referred to with %(body)s.
You cannot mix the two %-formatting styles.
If simplify_repl is a callable, it should accept the corresponding pylatexenc.latexwalker.LatexEnvironmentNode as an argument.
The callable will be inspected to see what other arguments it accepts. If it accepts an argument named l2tobj, the LatexNodes2Text instance is provided to that argument. If it accepts an argument named environmentname, the name of the environment is provided to that argument.
- discard¶
If set to True, then the full environment is discarded, i.e., it is converted to an empty string.
New in version 2.0: The class EnvironmentTextSpec was introduced in pylatexenc 2.0 to succeed the previously named EnvDef class.
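The two placeholder styles follow Python's printf-style string formatting. The sketch below mimics the documented behavior with a hypothetical helper, render_template; how pylatexenc builds the substitution values internally is an assumption and is not shown here.

```python
def render_template(template, args, body=None):
    """Fill a simplify_repl-style template string (illustrative only).

    args is a list of already-converted argument texts; body is the
    converted environment body, if any.
    """
    if '%' not in template:
        # A plain string is used as-is
        return template
    if '%(' in template:
        # Named style: '%(1)s', '%(2)s', ... and '%(body)s'
        mapping = {str(i): text for i, text in enumerate(args, start=1)}
        if body is not None:
            mapping['body'] = body
        return template % mapping
    # Single '%s' style: the body for environments, otherwise the
    # concatenation of all arguments
    return template % (body if body is not None else "".join(args))


macro_text = render_template('“%(1)s”', ['Hello'])
env_text = render_template('“%(body)s”', [], body='Hello, world')
joined_text = render_template('*%s*', ['a', 'b'])
fixed_text = render_template('fixed', ['ignored'])
print(macro_text, env_text, joined_text, fixed_text)
```

The reason the two styles cannot be mixed is visible here: Python's % operator takes either a mapping (for named placeholders) or positional values, never both in the same format string.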
- class pylatexenc.latex2text.SpecialsTextSpec(specials_chars, simplify_repl=None)¶
A specification of how to obtain a textual representation of latex specials.
- specials_chars¶
The sequence of special LaTeX characters
- simplify_repl¶
The replacement text for the given latex specials. This is either a string or a callable:
If simplify_repl is a string, this string is used as the text representation of this specials node.
The string may contain a single ‘%s’ replacement placeholder which will be replaced by the concatenated textual representation of all arguments of the specials node.
Alternatively, the string may contain ‘%(<n>)s’ (where <n> is an integer) to refer to the n-th argument (starting at ‘%(1)s’). You cannot mix the two %-formatting styles.
If simplify_repl is a callable, it should accept the corresponding pylatexenc.latexwalker.LatexSpecialsNode as an argument.
The callable will be inspected to see what other arguments it accepts. If it accepts an argument named l2tobj, the LatexNodes2Text instance is provided to that argument. If it accepts an argument named specials_chars, the characters that were parsed for this “latex specials” node are provided to that argument.
New in version 2.0: Latex specials were introduced in pylatexenc 2.0.
Obsolete members¶
- pylatexenc.latex2text.EnvDef(envname, simplify_repl=None, discard=False)¶
Deprecated since version 2.0: Instantiate an EnvironmentTextSpec instead.
Since pylatexenc 2.0, EnvDef is a function which returns an EnvironmentTextSpec instance. In this way the earlier idiom EnvDef(...) still works in pylatexenc 2.
- pylatexenc.latex2text.MacroDef(macname, simplify_repl=None, discard=None)¶
Deprecated since version 2.0: Instantiate a MacroTextSpec instead.
Since pylatexenc 2.0, MacroDef is a function which returns a MacroTextSpec instance. In this way the earlier idiom MacroDef(...) still works in pylatexenc 2.
- pylatexenc.latex2text.default_env_dict¶
Deprecated since version 2.0: Use get_default_latex_context_db() instead, or create your own pylatexenc.macrospec.LatexContextDb object.
Provides access to the default environment text replacement specs for latex2text in a form that is compatible with pylatexenc 1.x’s default_env_dict module-level dictionary.
This is implemented using a custom lazy mutable mapping, which behaves just like a regular dictionary but loads its data only once the dictionary is accessed. In this way the default latex specs are not loaded into a python dictionary unless they are actually queried or modified, and thus users of pylatexenc 2.0 that don’t rely on the default macro/environment definitions shouldn’t notice any decrease in performance.
- pylatexenc.latex2text.default_macro_dict¶
Deprecated since version 2.0: Use get_default_latex_context_db() instead, or create your own pylatexenc.macrospec.LatexContextDb object.
Provides access to the default macro text replacement specs for latex2text in a form that is compatible with pylatexenc 1.x’s default_macro_dict module-level dictionary.
This is implemented using a custom lazy mutable mapping, which behaves just like a regular dictionary but loads its data only once the dictionary is accessed. In this way the default latex specs are not loaded into a python dictionary unless they are actually queried or modified, and thus users of pylatexenc 2.0 that don’t rely on the default macro/environment definitions shouldn’t notice any decrease in performance.
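A lazy mutable mapping of this kind can be sketched with collections.abc.MutableMapping. This is a generic illustration of the pattern under assumed names (LazyDict, load_defaults), not the class that pylatexenc actually uses.

```python
from collections.abc import MutableMapping


class LazyDict(MutableMapping):
    """Dictionary-like object that calls `loader` on first access only."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None

    def _ensure_loaded(self):
        if self._data is None:
            self._data = dict(self._loader())
        return self._data

    def __getitem__(self, key):
        return self._ensure_loaded()[key]

    def __setitem__(self, key, value):
        self._ensure_loaded()[key] = value

    def __delitem__(self, key):
        del self._ensure_loaded()[key]

    def __iter__(self):
        return iter(self._ensure_loaded())

    def __len__(self):
        return len(self._ensure_loaded())


load_calls = []


def load_defaults():
    # Stand-in for an expensive load of default definitions
    load_calls.append(1)
    return {"textbf": "%s"}


lazy = LazyDict(load_defaults)
calls_before = len(load_calls)  # nothing loaded yet
value = lazy["textbf"]          # first access triggers the load
lazy["emph"] = "*%s*"           # later accesses reuse the loaded data
calls_after = len(load_calls)
print(calls_before, calls_after, sorted(lazy))
# 0 1 ['emph', 'textbf']
```

Code that never touches the mapping never pays for the load, which is the performance property described above.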
- pylatexenc.latex2text.default_text_replacements¶
Deprecated since version 2.0: Text replacements are deprecated since pylatexenc 2.0 with the advent of “latex specials”. See LatexNodes2Text.apply_text_replacements() for a quick solution to keep existing code working if it uses custom text replacements.