Encode Unicode to LaTeX

The latexencode module provides a set of routines for converting a unicode string into LaTeX escape sequences.

For basic usage you can use the unicode_to_latex() function directly:

>>> from pylatexenc.latexencode import unicode_to_latex
>>> print(unicode_to_latex('À votre santé'))
\`A votre sant\'e
>>> print(unicode_to_latex('The length of samples #3 & #4 is 3μm'))
The length of samples \#3 \& \#4 is 3\ensuremath{\mu}m

The conversion is handled by the class UnicodeToLatexEncoder. If you are converting multiple strings, you may create an instance with the flags you like and invoke its method unicode_to_latex() as many times as necessary:

>>> from pylatexenc.latexencode import UnicodeToLatexEncoder
>>> u = UnicodeToLatexEncoder(unknown_char_policy='replace')
>>> print(u.unicode_to_latex('À votre santé'))
\`A votre sant\'e
>>> print(u.unicode_to_latex('The length of samples #3 & #4 is 3μm'))
The length of samples \#3 \& \#4 is 3\ensuremath{\mu}m
>>> print(u.unicode_to_latex('À votre santé: 乾杯'))
No known latex representation for character: U+4E7E - ‘乾’
No known latex representation for character: U+676F - ‘杯’
\`A votre sant\'e: {\bfseries ?}{\bfseries ?}

Example using custom conversion rules:

>>> import re
>>> from pylatexenc.latexencode import UnicodeToLatexEncoder, \
...     UnicodeToLatexConversionRule, RULE_REGEX
>>> u = UnicodeToLatexEncoder(
...     conversion_rules=[
...         UnicodeToLatexConversionRule(rule_type=RULE_REGEX, rule=[
...             (re.compile(r'-->'), r'\\textrightarrow'),
...             (re.compile(r'<--'), r'\\textleftarrow'),
...         ]),
...         'defaults'
...     ]
... )
>>> print(u.unicode_to_latex("Cheers --> À votre santé"))
Cheers {\textrightarrow} \`A votre sant\'e

See UnicodeToLatexEncoder and UnicodeToLatexConversionRule. Note that for regex rules, the replacement text is expanded like the second argument of re.sub(), so a literal backslash must be written as two backslashes (\\) even inside a raw string.
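
The escaping requirement comes from Python's own re module and can be checked without pylatexenc. The following minimal sketch uses only the standard library; the sample strings are made up for illustration.

```python
import re

arrow = re.compile(r'-->')

# Doubled backslash in a raw string: the replacement-template parser reads
# `\\` as one literal backslash, so the output contains \textrightarrow.
good = arrow.sub(r'\\textrightarrow', 'a --> b')

# Single backslash: `\t` is parsed as a TAB escape, which silently corrupts
# the macro name.
bad = arrow.sub(r'\textrightarrow', 'a --> b')

# A single backslash before a letter that has no escape meaning (here `\l`)
# is rejected by the template parser (Python 3.7 and later).
try:
    re.compile(r'\.\.\.').sub(r'\ldots', 'wait...')
    ldots_rejected = False
except re.error:
    ldots_rejected = True

print(repr(good))
print(repr(bad))
print(ldots_rejected)
```

The second case is the dangerous one because it raises no error: the result simply contains a tab character where the backslash and the letter t should be.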

New in version 2.0: The class UnicodeToLatexEncoder, along with its helper functions and classes, was introduced in pylatexenc 2.0.

The earlier function utf8tolatex() that was available in pylatexenc 1.x is still provided unchanged, so code written for pylatexenc 1.x should work without changes. New code is however strongly encouraged to employ the new API.

Unicode to Latex Conversion Class and Helper Function

class pylatexenc.latexencode.UnicodeToLatexEncoder(**kwargs)

Encode a string with unicode characters into a LaTeX snippet.

The following general attributes can be specified as keyword arguments to the constructor. Note: These attributes must be specified to the constructor and may NOT be subsequently modified. This is because in the constructor we pre-compile some rules and flags to optimize calls to unicode_to_latex().

non_ascii_only

Whether we should convert only non-ascii characters into LaTeX sequences, or also all known ascii characters with special LaTeX meaning such as ‘\’, ‘$’, ‘&’, etc.

If non_ascii_only is set to True (the default is False), then conversion rules are not applied at positions in the string where an ASCII character is encountered.

conversion_rules

The conversion rules, specified as a list of UnicodeToLatexConversionRule objects. For each position in the string, the rules will be applied in the given sequence until a replacement string is found.
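
The lookup order described here, together with the effect of non_ascii_only, can be pictured with a small stand-alone sketch. It is not pylatexenc's implementation: the function toy_encode and the two dictionaries are hypothetical stand-ins, and each "rule" is reduced to a plain mapping from one character to its replacement.

```python
def toy_encode(s, rules, non_ascii_only=False):
    """Apply `rules` in order at each position; the first rule that answers wins.

    Each rule is a plain dict mapping a single character to replacement text.
    This only mirrors the lookup order, nothing more.
    """
    out = []
    for ch in s:
        if non_ascii_only and ord(ch) < 128:
            out.append(ch)      # ASCII positions are skipped entirely
            continue
        for rule in rules:
            if ch in rule:
                out.append(rule[ch])
                break           # later rules are not consulted
        else:
            out.append(ch)      # no rule matched: keep the character
    return ''.join(out)


# Hypothetical stand-ins for a custom rule and for the built-in defaults.
custom = {'ß': r'{\ss}'}
defaults = {'ß': r'\ss', '&': r'\&'}

print(toy_encode('Straße & Co', [custom, defaults]))
print(toy_encode('Straße & Co', [custom, defaults], non_ascii_only=True))
print(toy_encode('Straße & Co', [defaults, custom]))
```

Placing the custom mapping first overrides the default for ß, while & still falls through to the defaults; swapping the order makes the custom entry unreachable.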

Instead of a UnicodeToLatexConversionRule object you may also specify a string specifying a built-in rule (e.g., ‘defaults’), which will be expanded to the corresponding rules according to get_builtin_conversion_rules().

If you specify your own list of rules using this argument, you will probably want to include the element ‘defaults’, typically at the end of your list, so that all built-in default conversion rules still apply. To override built-in rules, simply add your custom rules earlier in the list. Example:

conversion_rules = [
    # our custom rules
    UnicodeToLatexConversionRule(RULE_REGEX, [
        # double \\ needed, see UnicodeToLatexConversionRule
        ( re.compile(r'\.\.\.'), r'\\ldots' ),
        ( re.compile(r'î'), r'\\^i' ),
    ]),
    # plus all the default rules
    'defaults'
]
u = UnicodeToLatexEncoder(conversion_rules=conversion_rules)

replacement_latex_protection

How to “protect” LaTeX replacement text that looks like it could be interpreted differently if concatenated to arbitrary strings before and after.

Currently only one situation is recognized: the replacement string ends with a latex macro invocation with a non-symbol macro name, e.g. \textemdash or \^\i. Indeed, if we naively substitute such replacement text into an arbitrary string (like maître), we might get an invalid macro invocation (like ma\^\itre, which refers to the unknown macro name \itre).

Possible protection schemes are:

  • ‘braces’ (the default): Any suspicious replacement text (that might look fragile) is placed in curly braces {...}.
  • ‘braces-all’: All replacement latex escapes are surrounded in protective curly braces {...}, regardless of whether or not they might be deemed “fragile” or “unsafe”.
  • ‘braces-almost-all’: Almost all replacement latex escapes are surrounded in protective curly braces {...}. This option emulates closely the behavior of brackets=True of the function utf8tolatex() in pylatexenc 1.x, though I’m not sure it is really useful. [Specifically, all those replacement strings that start with a backslash are surrounded by curly braces].
  • ‘braces-after-macro’: In the situation where the replacement latex code ends with a string-named macro, then a pair of empty braces is added at the end of the replacement text to protect the macro.
  • ‘none’: No protection is applied, even in “unsafe” cases. This is not recommended, as this will likely result in invalid LaTeX code.
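
To make the schemes concrete, here is a simplified stand-alone sketch of what each one does to a replacement string. The function protect and its "fragile" test are written for this illustration only and are not the library's code; ‘braces-almost-all’ is left out for brevity.

```python
import re

# A replacement counts as "fragile" here if it ends with a backslash followed
# by letters (e.g. \textemdash or \^\i). This is a simplified test.
_ENDS_WITH_NAMED_MACRO = re.compile(r'\\[A-Za-z]+$')


def protect(repl, scheme='braces'):
    fragile = _ENDS_WITH_NAMED_MACRO.search(repl) is not None
    if scheme == 'braces':
        return '{' + repl + '}' if fragile else repl
    if scheme == 'braces-all':
        return '{' + repl + '}'
    if scheme == 'braces-after-macro':
        return repl + '{}' if fragile else repl
    if scheme == 'none':
        return repl
    raise ValueError('unknown scheme: %r' % (scheme,))


# Substitute the replacement for the accented i into "maitre" under each scheme.
for scheme in ('none', 'braces', 'braces-after-macro', 'braces-all'):
    print(scheme, 'ma' + protect(r'\^\i', scheme) + 'tre')
```

With ‘none’ the output contains \itre, which LaTeX reads as a single unknown macro; the other schemes keep \i separated from the letters that follow.
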

unknown_char_policy

What to do when a non-ascii character is encountered without any known substitution macro. The attribute unknown_char_policy can be set to one of:

  • ‘keep’: keep the character as is;
  • ‘replace’: replace the character by a boldface question mark;
  • ‘ignore’: ignore the character from the input entirely and don’t output anything for it;
  • ‘fail’: raise a ValueError exception;
  • ‘unihex’: output the unicode hexadecimal code (U+XXXX) of the character in typewriter font;
  • a Python callable — will be called with argument the character that could not be encoded. (If the callable accepts a second argument called ‘u2lobj’, then the UnicodeToLatexEncoder instance is provided to that argument.) The return value of the callable is used as LaTeX replacement code.
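
As a sketch of the callable form, the function below turns an unknown character into a call to a user-defined macro. The macro name \missingchar is hypothetical (your document would have to define it), and the final part only runs if pylatexenc 2.x is installed.

```python
def missing_char_macro(ch):
    """Return LaTeX code for a character that has no known escape.

    Hypothetical policy: emit \\missingchar{XXXX} with the code point in
    hexadecimal, padded to at least four digits.
    """
    return r'\missingchar{%04X}' % ord(ch)


print(missing_char_macro(chr(0x4E7E)))

try:
    from pylatexenc.latexencode import UnicodeToLatexEncoder
except ImportError:
    UnicodeToLatexEncoder = None   # pylatexenc not available; skip the demo

if UnicodeToLatexEncoder is not None:
    encoder = UnicodeToLatexEncoder(
        unknown_char_policy=missing_char_macro,
        unknown_char_warning=False,
    )
    print(encoder.unicode_to_latex('À votre santé: 乾杯'))
```
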

unknown_char_warning

In addition to the unknown_char_policy, this attribute indicates whether or not (True or False) a warning should be generated when a non-ascii character without any known latex representation is encountered. (Default: True)

latex_string_class

The return type of unicode_to_latex(). Normally this is a simple unicode string (str on Python 3 or unicode on Python 2).

But you can specify your custom string type via the latex_string_class argument. The latex_string_class will be invoked with no arguments to construct an empty object (so latex_string_class can be either a class that can be constructed with no arguments or a function with no arguments that returns a fresh object instance). The object must support the operation “+=”, i.e., you should overload the __iadd__() method.

For instance, you can record the chunks that would have been appended into a single string as follows:

class LatexChunkList:
    def __init__(self):
        self.chunks = []

    def __iadd__(self, s):
        self.chunks.append(s)
        return self

u = UnicodeToLatexEncoder(latex_string_class=LatexChunkList,
                          replacement_latex_protection='none')
result = u.unicode_to_latex("é → α")
# result.chunks == [ r"\'e", ' ', r'\textrightarrow', ' ',
#                    r'\ensuremath{\alpha}' ]

Warning

None of the above attributes should be modified after constructing the object. The values specified to the class constructor are final and cannot be changed. [Indeed, the class constructor “compiles” these attribute values into a data structure that makes unicode_to_latex() slightly more efficient.]

New in version 2.0: This class was introduced in pylatexenc 2.0.

unicode_to_latex(s)

Convert unicode characters in the string s into latex escape sequences, according to the rules and options given to the constructor.

pylatexenc.latexencode.unicode_to_latex(s, non_ascii_only=False, replacement_latex_protection='braces', unknown_char_policy='keep', unknown_char_warning=True)

Shorthand for constructing a UnicodeToLatexEncoder instance and calling its unicode_to_latex() method.

The UnicodeToLatexEncoder instances for given option settings are cached, making repeated calls to unicode_to_latex() possible without creating a new instance upon each call.

The parameters non_ascii_only, replacement_latex_protection, unknown_char_policy, and unknown_char_warning are directly passed on to the UnicodeToLatexEncoder constructor. See the class doc for UnicodeToLatexEncoder for more information about what they do.

You may only use arguments to this function that are python hashable (like True, False, or simple strings) to help us keep a cache of previously constructed UnicodeToLatexEncoder instances. For instance, it is not possible to provide a callable to unknown_char_policy. It is also not possible to specify custom conversion rules with this helper function. If you need any of these features, simply create a UnicodeToLatexEncoder instance directly.
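
The hashability requirement follows from using the option values as a cache key. The sketch below shows one plausible way such a cache could be organized; ToyEncoder and get_cached_encoder are hypothetical names, and this is not the library's actual code.

```python
_encoder_cache = {}


class ToyEncoder:
    """Hypothetical stand-in for an encoder that is expensive to construct."""

    def __init__(self, **options):
        self.options = options


def get_cached_encoder(**options):
    # The option values become part of a dictionary key, so each value must
    # be hashable; a list (such as a list of conversion rules) is not.
    key = tuple(sorted(options.items()))
    if key not in _encoder_cache:
        _encoder_cache[key] = ToyEncoder(**options)
    return _encoder_cache[key]


a = get_cached_encoder(non_ascii_only=True, unknown_char_policy='keep')
b = get_cached_encoder(unknown_char_policy='keep', non_ascii_only=True)
print(a is b)   # same options, same cached instance

try:
    get_cached_encoder(conversion_rules=['defaults'])
except TypeError as exc:
    print('not cacheable:', exc)
```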

Specifying conversion rules

pylatexenc.latexencode.RULE_DICT = 0

Indicates a rule type that is a dictionary of unicode point values to replacement strings. See UnicodeToLatexConversionRule.

New in version 2.0: This member was introduced in pylatexenc version 2.0.

pylatexenc.latexencode.RULE_REGEX = 1

Indicates a rule type that is a list (or iterable) of pairs (compiled_regular_expression, replacement_string). See UnicodeToLatexConversionRule.

New in version 2.0: This member was introduced in pylatexenc version 2.0.

pylatexenc.latexencode.RULE_CALLABLE = 2

Indicates a rule type that is a custom callable. See UnicodeToLatexConversionRule.

New in version 2.0: This member was introduced in pylatexenc version 2.0.

class pylatexenc.latexencode.UnicodeToLatexConversionRule(rule_type, rule=None)

Specify a rule for how to convert unicode characters into LaTeX escapes.

rule_type

One of RULE_DICT, RULE_REGEX, or RULE_CALLABLE.

rule

A specification of the rule itself. The rule attribute is an object that depends on what rule_type is set to. See below.

Constructor syntax:

UnicodeToLatexConversionRule(RULE_XXX, <...>)
UnicodeToLatexConversionRule(rule_type=RULE_XXX, rule=<...>)

Note that you can get some built-in rules via the get_builtin_conversion_rules() function:

conversion_rules = get_builtin_conversion_rules('defaults') # all defaults

Rule types:

  • RULE_DICT: If rule_type is RULE_DICT, then rule should be a dictionary whose keys are integers representing unicode code points (e.g., 0x210F), and whose values are corresponding replacement strings (e.g., r'\hbar'). See get_builtin_uni2latex_dict() for an example.

  • RULE_REGEX: If rule_type is RULE_REGEX, then rule should be an iterable of tuple pairs (compiled_regular_expression, replacement_string) where compiled_regular_expression was obtained with re.compile(…) and replacement_string is anything that can be specified as the second (repl) argument of re.sub(…). This can be a replacement string that includes escapes (like \1, \2, \g<name>) for captured sub-expressions or a callable that takes a match object as argument.

    Note

    The replacement string is parsed like the second argument to re.sub() and backslashes have a special meaning because they can refer to captured sub-expressions. For a literal backslash, use two backslashes \\ in raw strings, four backslashes in normal strings.

    Example:

    regex_conversion_rule = UnicodeToLatexConversionRule(
        rule_type=RULE_REGEX,
        rule=[
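            # protect acronyms of capital letters with braces,
            # e.g.: ABC -> {ABC}; the parentheses capture the acronym
            # so that \1 can refer to it
            (re.compile(r'([A-Z]{2,})'), r'{\1}'),
            # Additional rules, e.g., "..." -> "\ldots"; the dots are
            # escaped so that they match literal periods
            (re.compile(r'\.\.\.'), r'\\ldots'), # note double \\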
        ]
    )
    
  • RULE_CALLABLE: If rule_type is RULE_CALLABLE, then rule should be a callable that accepts two arguments, the unicode string and the position in the string (an integer). The callable will be called with the original unicode string as argument and the position of the character that needs to be encoded. If this rule can encode the given character at the given position, it should return a tuple (consumed_length, replacement_string) where consumed_length is the number of characters in the unicode string that replacement_string represents. If the character(s) at the given position can’t be encoded by this rule, the callable should return None to indicate that further rules should be attempted.

    If the callable accepts an additional argument called u2lobj, then the UnicodeToLatexEncoder instance is provided to that argument.

    For example, the following callable should achieve the same effect as the previous example with regexes:

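    ACRONYM_RX = re.compile(r'[A-Z]{2,}')

    def convert_stuff(s, pos):
        # a compiled pattern's match() accepts a start position; the
        # module-level re.match() does not (its third argument is flags)
        m = ACRONYM_RX.match(s, pos)
        if m is not None:
            return (m.end()-m.start(), '{'+m.group()+'}')
        if s.startswith('...', pos): # or  s[pos:pos+3] == '...'
            return (3, r'\ldots')
        return None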
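
For the RULE_DICT case, the detail that is easy to get wrong is that keys are integer code points rather than one-character strings. A minimal sketch follows; the dictionary my_symbols and its replacement strings are illustrative choices, and the encoder part only runs if pylatexenc 2.x is installed.

```python
# Keys are integer code points, so build them with ord() or write them as
# hexadecimal literals.
my_symbols = {
    ord('\N{PLANCK CONSTANT OVER TWO PI}'): r'\ensuremath{\hbar}',  # 0x210F
    0x2192: r'\ensuremath{\rightarrow}',                            # RIGHTWARDS ARROW
}

try:
    from pylatexenc.latexencode import (
        UnicodeToLatexEncoder, UnicodeToLatexConversionRule, RULE_DICT,
    )
except ImportError:
    UnicodeToLatexEncoder = None   # pylatexenc not available; skip the demo

if UnicodeToLatexEncoder is not None:
    encoder = UnicodeToLatexEncoder(conversion_rules=[
        UnicodeToLatexConversionRule(RULE_DICT, my_symbols),
        'defaults',   # fall back to the built-in rules for everything else
    ])
    print(encoder.unicode_to_latex('\u210f \u2192 0'))
```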
    

New in version 2.0: This class was introduced in pylatexenc 2.0.

pylatexenc.latexencode.get_builtin_conversion_rules(builtin_name)

Return a built-in set of conversion rules specified by a given name builtin_name.

There are two built-in sets of conversion rules, with the following names:

  • ‘defaults’: the default conversion rules, a custom-curated list of unicode chars to LaTeX escapes.
  • ‘unicode-xml’: the conversion rules derived from the unicode.xml file maintained at https://www.w3.org/TR/xml-entity-names/#source by David Carlisle.

The return value is a list of UnicodeToLatexConversionRule objects that can be either directly specified to the conversion_rules= argument of UnicodeToLatexEncoder, or included in a larger list that can be provided to that argument.

New in version 2.0: This function was introduced in pylatexenc 2.0.

pylatexenc.latexencode.get_builtin_uni2latex_dict()

Return a dictionary that contains the default collection of known LaTeX escape sequences for unicode characters.

The keys of the dictionary are integers that correspond to unicode code points (i.e., ord(char)). The values are the corresponding LaTeX replacement strings.

The returned dictionary may not be modified. To alter the behavior of unicode_to_latex(), you should specify custom rules to a new instance of UnicodeToLatexEncoder.

New in version 2.0: This function was introduced in pylatexenc 2.0.

Compatibility with pylatexenc 1.x

pylatexenc.latexencode.utf8tolatex(s, non_ascii_only=False, brackets=True, substitute_bad_chars=False, fail_bad_chars=False)

Note

Since pylatexenc 2.0, it is recommended to use the unicode_to_latex() function or the UnicodeToLatexEncoder class instead of the earlier function utf8tolatex().

The new routines provide much more flexibility and versatility. For instance, you can specify custom escape sequences for certain characters. Some cheap benchmarks seem to indicate that the new routines are not significantly slower than the utf8tolatex() function. Also, the name utf8tolatex() was poorly chosen, since the argument is in fact not ‘utf-8’-encoded but rather a Python unicode string object.

The function utf8tolatex() is still provided unchanged from pylatexenc 1.x. We do not plan to remove this function in the near future, so it is not (yet) considered deprecated and will continue to be provided in upcoming versions of pylatexenc. Bug reports, improvements, and new features will however be directed to UnicodeToLatexEncoder().

Encode a Python unicode string (despite the function's name, not UTF-8-encoded bytes) to a LaTeX snippet.

If non_ascii_only is set to True, then usual (ascii) characters such as #, {, } etc. will not be escaped. If set to False (the default), they are escaped to their respective LaTeX escape sequences.

If brackets is set to True (the default), then LaTeX macros are enclosed in curly braces. For example, santé is replaced by sant{\'e} if brackets=True and by sant\'e if brackets=False.

Warning

Using brackets=False might give you an invalid LaTeX string, so avoid it! (for instance, maître will be replaced incorrectly by ma\^\itre resulting in an unknown macro \itre).

If substitute_bad_chars=True, then any non-ascii character for which no LaTeX escape sequence is known is replaced by a question mark in boldface. Otherwise (by default), the character is left as it is.

If fail_bad_chars=True, then a ValueError is raised if we cannot find a character substitution for any non-ascii character.
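
When porting 1.x code, the flags above correspond roughly to constructor options of UnicodeToLatexEncoder. The helper below is a hypothetical sketch of that correspondence, inferred only from the descriptions in this document; in particular, giving fail_bad_chars priority over substitute_bad_chars is an assumption.

```python
def legacy_flags_to_options(non_ascii_only=False, brackets=True,
                            substitute_bad_chars=False, fail_bad_chars=False):
    """Translate utf8tolatex() flags into UnicodeToLatexEncoder keyword options.

    Approximate mapping only. Assumption: when both "bad chars" flags are
    set, failing takes priority over substituting.
    """
    if fail_bad_chars:
        policy = 'fail'
    elif substitute_bad_chars:
        policy = 'replace'
    else:
        policy = 'keep'
    return {
        'non_ascii_only': non_ascii_only,
        # 'braces-almost-all' is described as closely emulating brackets=True
        'replacement_latex_protection':
            'braces-almost-all' if brackets else 'none',
        'unknown_char_policy': policy,
    }


print(legacy_flags_to_options())
print(legacy_flags_to_options(brackets=False, substitute_bad_chars=True))
```

The returned dictionary can be passed as keyword arguments, e.g. UnicodeToLatexEncoder(**legacy_flags_to_options(brackets=False)).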

Changed in version 1.3: Added fail_bad_chars switch

pylatexenc.latexencode.utf82latex = <pylatexenc._util.LazyDict object>

Deprecated since version 2.0: Pylatexenc 1.x exposed the module-level dictionary utf82latex that could be modified to alter the behavior of utf8tolatex().

If you would like to obtain a copy of the built-in unicode-to-LaTeX dictionary, see get_builtin_uni2latex_dict(). If you would like to alter the behavior of utf8tolatex(), you should use UnicodeToLatexEncoder, which provides a rich interface for specifying rules for how to convert chars to LaTeX escapes.

For backwards compatibility, you can still modify the module-level dictionary utf82latex (but you can’t assign a new object to it) and this will directly modify the global built-in dictionary of known latex escapes. This is not recommended however, and the utf82latex module-level dictionary might be removed in the future.

Warning

Modifying the utf82latex module-level dictionary is not recommended. Doing so will alter the behavior of the utf8tolatex() function for all other modules that use pylatexenc as well!