## XeLaTeX: Unicode font fallback for unsupported characters

Traditionally I only used to use LaTeX to typeset documents, and it works perfectly when you have a single language script (e.g. only English or German). But as soon as you want to typeset Unicode text in multiple languages, you’re quickly out of luck. LaTeX is just not made for Unicode, and you need a lot of helper packages, documentation reading, and complicated configuration in your document to get it all right.

All I wanted was to typeset the following Unicode text. It contains regular latin characters, chinese characters, modern greek and polytonic (ancient) greek.

Latin text. Chinese text: 紫薇北斗星  Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.

I thought it was a simple task. I thought: let’s just use XeLaTeX, which has out-of-the-box Unicode support. In the end, it was a simple task, but only after struggling to solve a particular problem. To show you the problem, I ran the following straightforward code through XeLaTeX…

\documentclass[]{book}

\usepackage{fontspec}

\begin{document}
Latin text. Chinese text: 紫薇北斗星 Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.
\end{document}

… and the following PDF was produced:

It turns out that the missing unicode characters are not XeLaTeX’s fault. The problem is that the used font (XeLaTeX by default uses a slightly more encompassing Computer Modern font) has not all unicode characters implemented. To implement all unicode characters in a single font (about 1.1 million characters) is a monumental task, and there are only a small handful of fonts whose maintainers aim to have full support of all characters (one of them is GNU FreeFont, which is already part of the Debian distribution, and therefore available to XeLaTeX).

So, I thought, let’s just use a font which is dedicated to unicode. I selected in my document the pretty Junicode font:

\setmainfont{Junicode}

The result was:

Now, greek worked, but still no chinese characters. It turned out that even fonts which are dedicated to unicode do not yet have all possible characters implemented. Because it’s a lot of work to produce high-quality fonts with matching styles for millions of possible characters.

So, how do regular web browsers or office applications do it? They use a mechanism called font fallback. When a particular character is not implemented in the chosen main font, another font is silently used which does have this particular character implemented. XeLaTeX can do the same with a package called ucharclasses, and it gives you full control over the fallback font selection process. The ucharclasses documentation gives an example using the \fontspec  font selection. I decided to use the font IPAexMincho which supports chinese characters. So I added to my document:

\usepackage[CJK]{ucharclasses}
\setTransitionsForCJK{\fontspec{IPAexMincho}}{\fontspec{Junicode}}

… but when running XeLaTeX with this addition, ucharclasses somehow entered an endless loop with high CPU usage for the TexLive 2014 distribution (part of Debian). It hung at the line:

(./ucharclass.aux) (/usr/share/texmf/tex/latex/tipa/t3cmr.fd)


Endless googling didn’t bring up any useful hints. Something must have changed in the internals, and the ucharclasses documentation needs updating. In any event, it took me 4 hours to find a fix. The solution was to use a font selection other than \fontspec{} — because it doesn’t seem to be compatible with ucharclasses any more. Instead, I used fontspec‘s suggested \newfontfamily  mechanism. Here is the final working code:

\documentclass[]{book}

\usepackage{fontspec}
\setmainfont{Junicode}
\newfontfamily\myregularfont{Junicode}
\newfontfamily\mychinesefont{IPAexMincho}

\usepackage[CJK]{ucharclasses}
\setTransitionsForCJK{\mychinesefont}{\myregularfont}

\begin{document}
Latin text. Chinese text: 紫薇北斗星  Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.
\end{document}

Here is the result: Mixed latin, chinese, and greek scripts with two different fonts: Junicode and IPAexMincho:

Pretty!