1. Math Editing and Display
in Word 2007
Murray Sargent III
Publisher Text Services
28-may-2008
2. Overview
8 math infrastructures enable better math
display/editing
New Office math edit/display environment
Interoperate with math programs such as
Mathematica, Maple, publisher workflow
Input methods and formats
Layout
Math font
3. Complex Project
Intricacies of math typesetting
Creating and using a large set of glyph variants
Vagaries of math notation
Embedding math zones into international text
environments
Interaction with complex scripts
Math in other objects like hyperlinks, ruby
Input with nonASCII keyboards
4. Eight Math Infrastructures
[La]TeX: current tech-doc standards
Unicode 5.0: includes ~2000 math symbols
MathML 2.0: math K – 12 and beyond
OpenType font technology: special math tables
New math font (Cambria Math)
Math layout handler
Shared math input components
MS Office environment, autocorrect
5. [La]TeX
Widely used, high-quality tech document
preparation language
Simple ASCII keyboard entry
Usage and math typography are very well
documented
Stable since 1990
Complex scenarios are hard to edit
Numerous dialects, user macros, and lack of
Unicode complicate interchange
Fonts aren’t well suited to screen display
6. Unicode 5.0
340 math chars exist in ASCII, U+2200 block,
arrows, combining marks
1016 math alphanumeric characters are in
Unicode Plane 1 or Letterlike Symbols
591 new math symbols and operators are on
BMP
One math variant selector
One new combining character (reverse solidus)
New math characters were requested by STIX
7. Basic Set of Alphanumeric
Characters
Latin digits (0 - 9)
Upper- & lowercase Latin letters (a - z, A - Z)
Uppercase Greek letters Α - Ω plus the nabla ∇
and a variant of theta Θ
Lowercase Greek letters α - ω plus the partial
differential sign ∂ and glyph variants of ε, θ, κ, φ,
ρ, and π
Only unaccented forms of letters are used
8. Legibility Loss
Without math alphabetics, the Hamiltonian formula
ℋ = ∫dτ [εE² + μH²]
becomes an integral equation
H = ∫dτ [εE² + μH²]
9. Math Alphanumeric Characters
• Math needs various Latin and Greek styles like
normal, bold, italic, script, Fraktur, and open-face
• May appear to be font variations, but have distinct
semantics and spacings
• Without these distinctions, you get gibberish, violating
Unicode rule: plain text must contain enough info to
permit text to be rendered legibly, and nothing more
• Plain-text searches should distinguish between
alphabets, e.g., a search for script H shouldn’t match
H, etc.
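As a rough illustration of the plain-text distinction described above, the following Python sketch maps ASCII letters to Unicode math italic letters and shows that a plain-text search for H does not match script H. The function name to_math_italic is hypothetical. The code points are those of the Unicode Mathematical Alphanumeric Symbols block, where the slot for italic small h is reserved and U+210E in Letterlike Symbols is used instead.

```python
import unicodedata

MATH_ITALIC_CAPITAL_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
MATH_ITALIC_SMALL_A = 0x1D44E    # MATHEMATICAL ITALIC SMALL A
PLANCK_CONSTANT = "\u210E"       # stands in for the reserved italic small h

def to_math_italic(ch: str) -> str:
    """Map one ASCII letter to its math italic counterpart (hypothetical helper)."""
    if ch == "h":
        return PLANCK_CONSTANT
    if "A" <= ch <= "Z":
        return chr(MATH_ITALIC_CAPITAL_A + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(MATH_ITALIC_SMALL_A + ord(ch) - ord("a"))
    return ch  # digits, Greek, operators are left untouched in this sketch

# A plain-text search distinguishes the alphabets because the code points differ.
script_h = "\u210B"  # SCRIPT CAPITAL H
text = f"{script_h} = T + V"
plain_h_found = "H" in text
script_h_found = script_h in text
print(unicodedata.name(to_math_italic("a")), plain_h_found, script_h_found)
```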
10. MathML
MathML 1.0 (April, 1998) was the first World
Wide Web Consortium (W3C) endorsed XML
vocabulary
Low-level format for describing mathematics as
a basis for machine to machine communication
MathML facilitates the use and re-use of
scientific content on the Web
MathML 2.0 released in late 2003 is now widely
used in exchanging mathematical text
MathML 2.0 spec has a wealth of math info
11. MathML Presentation Markup
Presentation markup directs how the math
should be rendered. For E = mc²:
<mrow>
<mi>E</mi>
<mo>=</mo>
<mrow>
<mi>m</mi>
<mo>&InvisibleTimes;</mo>
<msup>
<mi>c</mi>
<mn>2</mn>
</msup>
</mrow>
</mrow>
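To make the structure of the markup above concrete, here is a minimal Python sketch that walks the same presentation tree with the standard library and prints a linear text form. The function name linearize and the choice of ^ for superscripts are assumptions made for illustration. The invisible times is written as the numeric reference &#x2062; so that the plain XML parser needs no MathML entity definitions.

```python
import xml.etree.ElementTree as ET

MATHML = """<mrow>
  <mi>E</mi><mo>=</mo>
  <mrow>
    <mi>m</mi><mo>&#x2062;</mo>
    <msup><mi>c</mi><mn>2</mn></msup>
  </mrow>
</mrow>"""

# Invisible operators carry semantics but show no glyph.
INVISIBLES = {"\u2061", "\u2062", "\u2063", "\u2064"}

def linearize(node: ET.Element) -> str:
    """Turn a small subset of Presentation MathML into linear text (sketch only)."""
    if node.tag in ("mi", "mn", "mo"):
        text = (node.text or "").strip()
        return "" if text in INVISIBLES else text
    if node.tag == "msup":
        base, superscript = node  # msup has exactly two children
        return f"{linearize(base)}^{linearize(superscript)}"
    # mrow and anything else: concatenate the children in order
    return "".join(linearize(child) for child in node)

linear = linearize(ET.fromstring(MATHML))
print(linear)
```

Only mi, mn, mo, msup, and mrow are handled; a real converter would need the rest of the presentation elements.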
13. MathML with Custom XML
Can put arbitrary namespace attributes in
MathML tags
More complicated embellishments can use
<semantics>
MathML representation
<annotation-xml>
Enhancements
</annotation-xml>
</semantics>
14. MathML Parsing
MathML can be tricky to parse. For sin x:
<mrow>
<mi>sin</mi>
<mo>&ApplyFunction;</mo>
<mi>x</mi>
</mrow>
Don’t know it’s a function-apply object until
reaching &ApplyFunction;, so have to analyze
expressions as with the linear format
16. Math RTF
Math RTF is OMML in RTF syntax
Somewhat simplified (doesn’t need text tag)
For example,
<m:f> ... </m:f> → {\mf ... }
Thoroughly defined in latest RTF spec
Reading spec is great way to learn how Word
represents math
17. Accented characters
Accents are handled by math accent
object
Accents may apply to multiple characters
Accents may be flattened
18. Vagaries of Math Notation
Choice of subscript/superscript base
Function arguments like
Integrands and n-aryands
Absolute value ambiguities like ||a|-|b||.
Actually this example is unambiguous, but
|a|b - c|d| has two possible meanings
Context sensitive ellipses: … vs ⋯
19. Math Spacing
Operators have math spacing given by extended
TeX spacing rules
Function object gives correct spacing between
object and neighbors, and between function
name and argument
n-aryand object gives correct spacing between
n-ary operator and its n-aryand
Automates much of the need for TeX spacing “tweaks”
Context-dependent operator spacing like + - . , :
20. Font Sizing
Text style, script style (70%), script script
style (60%)
Sub/sups…, fractions in line
Cramped
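The size percentages above amount to simple arithmetic. The sketch below assumes that both percentages are taken relative to the text-style size and that nesting deeper than script-script style does not shrink further. The function name and the script_level parameter are hypothetical.

```python
SCRIPT_SCALE = 0.70         # script style, from the slide
SCRIPT_SCRIPT_SCALE = 0.60  # script-script style, from the slide

def math_font_size(text_size_pt: float, script_level: int) -> float:
    """Return the font size for a nesting level: 0 = text, 1 = script, 2+ = script-script."""
    if script_level <= 0:
        return text_size_pt
    if script_level == 1:
        return text_size_pt * SCRIPT_SCALE
    return text_size_pt * SCRIPT_SCRIPT_SCALE

for level in range(4):
    print(level, round(math_font_size(10.0, level), 2))
```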
21. Confusables
[Side-by-side comparisons of look-alike glyphs; several glyphs were lost in extraction]
1 vs l
𝑎 vs [glyph lost]
𝒳 vs [glyph lost]
Y vs Υ
Other letter similarities are so close that they
are avoided, e.g., UC alpha and LC omicron
are never used.
22. Math Input Methods
Linear format input and manual buildup
Formula autobuildup (FAB)
Math ribbons
Recognition of handwritten formulae
Hex code input
WYSIWYG editing
Hybrid editing (combination of WYSIWYG
and FAB)
23. Hex to Unicode Input Method
Type Unicode character hexadecimal code
Make corrections as need be
Type Alt+x to convert to character
Type Alt+x to convert back to hex (useful
especially for “missing glyph” character)
Resolve ambiguities by selection
Input higher-plane chars using 5 or 6-digit code
MS Word and RichEdit standard
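The core of the hex-to-Unicode toggle can be sketched in a few lines of Python. This is only an illustration of the conversion itself, not of how Word or RichEdit implement Alt+x. The function names are hypothetical, and selection handling for ambiguous leading hex digits is left out.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the 17 Unicode planes

def hex_to_char(hex_code: str) -> str:
    """Convert a typed hexadecimal code (4 to 6 digits) to its character."""
    value = int(hex_code, 16)
    if value > MAX_CODE_POINT:
        raise ValueError(f"{hex_code} is beyond U+10FFFF")
    return chr(value)

def char_to_hex(ch: str) -> str:
    """Convert a character back to its hex code, padded to at least 4 digits."""
    return format(ord(ch), "04X")

print(hex_to_char("222B"), char_to_hex("\u222B"))
print(char_to_hex(hex_to_char("1D44E")))  # higher-plane code needs 5 digits
```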
24. Autocorrect Examples
Type delta and get δ, Delta and get Δ
Define quadratic to be
x = (-b ± √(b^2 - 4ac))/2a
Then typing quadratic<space> inserts:
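The built-up result appeared as an image on the original slide. Assuming the linear-format text above is read with 2a as the denominator (a simple operand being a span of alphanumerics), it corresponds to the familiar quadratic formula, which can be written in LaTeX as:

```latex
x = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}
```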
25. Math Alphabetics
scriptA, frakturA, doubleA, etc., are used to
insert math script, Fraktur, and double-struck
alphabetics
Italic and bold are controlled by italic & bold
format tools and only apply to math alphabetics
Italic and/or bold is ignored for characters that
don’t have corresponding Unicode
26. Linear format math
• Simple operand is a span of alphanumeric
characters
• E.g., simple numerator or denominator is
terminated by any nonalphanumeric
character
• abc/d gives the built-up fraction with abc over d
• More complicated operands use parentheses
( ), brackets [ ], or { }
• Outermost parens in fractions aren’t
displayed in built-up form
27. Linear format math (cont)
E.g., plain text (a + c)/d displays as a built-up fraction with
a + c over d, without the parentheses
• Easier to read than TeX’s, e.g., {a + c \over d}
• MathML: <mfrac><mrow><mi>a</mi><mo>+</mo>
<mi>c</mi></mrow><mrow><mi>d</mi>
</mrow></mfrac>
• Neat feature: linear-format text looks like math
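The operand rules for fractions can be sketched as follows. This is a simplified illustration, not the actual linear-format parser: it assumes the whole input is a single fraction, finds the first top-level slash, and drops one outermost set of parentheses from each operand. The names split_fraction and strip_outer_parens are hypothetical.

```python
def strip_outer_parens(operand: str) -> str:
    """Remove one outermost ( ) pair if it encloses the whole operand."""
    operand = operand.strip()
    if not (operand.startswith("(") and operand.endswith(")")):
        return operand
    depth = 0
    for i, ch in enumerate(operand):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i != len(operand) - 1:
                return operand  # first paren closes early, e.g. "(a)(b)"
    return operand[1:-1].strip()

def split_fraction(linear: str):
    """Return (numerator, denominator) for the first top-level '/', else None."""
    depth = 0
    for i, ch in enumerate(linear):
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif ch == "/" and depth == 0:
            return strip_outer_parens(linear[:i]), strip_outer_parens(linear[i + 1:])
    return None

print(split_fraction("abc/d"))
print(split_fraction("(a + c)/d"))
```

Doubling the parentheses keeps one set visible, since only the outermost set is dropped.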
28. Subscripts and Superscripts
Unicode has numeric subscripts and
superscripts along with some operators
(U+2070-U+208E): convert these to regular
subscripts/superscripts
Others need some kind of markup like <msup>…
</msup>
Use TeX’s _ and ^ subscript/superscript ops for
input; they can be displayed as a subscripted
down arrow and superscripted up arrow
Use parentheses as for fractions to overrule
built-in precedence order
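A small sketch of the conversion from Unicode's built-in subscripts and superscripts to the _ and ^ operators is shown below. The function name to_linear is hypothetical. The mapping covers digits plus the plus and minus signs only, and it also includes ¹, ² and ³, which live in Latin-1 rather than in the U+2070 block. Parentheses are added only when the script is not a simple alphanumeric operand.

```python
import re

SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻"
SUBSCRIPTS = "₀₁₂₃₄₅₆₇₈₉₊₋"
ASCII_FORMS = "0123456789+-"

def _rewrite(text: str, chars: str, op: str) -> str:
    """Replace each run of script characters with op followed by the ASCII form."""
    table = str.maketrans(chars, ASCII_FORMS)

    def repl(match: re.Match) -> str:
        script = match.group().translate(table)
        # A span of alphanumerics is a simple operand and needs no parentheses.
        return op + (script if script.isalnum() else f"({script})")

    return re.sub(f"[{chars}]+", repl, text)

def to_linear(text: str) -> str:
    """Express built-in Unicode sub/superscripts with the _ and ^ operators."""
    return _rewrite(_rewrite(text, SUPERSCRIPTS, "^"), SUBSCRIPTS, "_")

print(to_linear("E = mc²"))
print(to_linear("x⁻¹ + a₁₂"))
```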
29. Formula Autobuildup
Enter formulas in linear format in a math zone
When a character is typed that renders an
expression syntactically unambiguous, the
expression is built up
Edit expressions in built-up form or in linear form
For integrals, type int (which autocorrects to ∫ )
optionally followed by subscript and superscript
for limits, which auto build up
Can autocorrect <letters> to built-up characters
or expressions
30. Roles of Space (U+0020)
The ASCII space is rarely needed inside math
expressions, since math spacing is automatic
Use to terminate autocorrect entries and to
terminate expressions. When so used, is deleted
Use as command to build up math objects
Use to define spacings for , . and : and to force a
unary operator to display with binary spacing
A space builds up one subexpression; other
operators build up as many as they can
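The first role of the space, terminating an autocorrect entry and then being deleted, can be sketched like this. The table entries follow the TeX-style names mentioned in the notes (\delta, \int). The function name type_key and the three-entry table are assumptions for illustration, not Word's actual autocorrect data.

```python
# Hypothetical miniature autocorrect table.
AUTOCORRECT = {r"\int": "\u222B", r"\delta": "\u03B4", r"\Delta": "\u0394"}

def type_key(buffer: str, key: str) -> str:
    """Append a key to the buffer; a space that ends an entry triggers replacement."""
    if key == " ":
        for name, symbol in AUTOCORRECT.items():
            if buffer.endswith(name):
                # The terminating space is consumed, not inserted.
                return buffer[: -len(name)] + symbol
    return buffer + key

buffer = ""
for key in "\\delta ":
    buffer = type_key(buffer, key)
print(buffer)
```

A space that does not terminate an entry is kept in this sketch; the build-up and spacing roles listed above are not modeled.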
31. Unicode Spaces
Space Unicode Autocorrect
0 em U+200B zwsp
1/18 em U+200A hairsp
3/18 em U+2009 thinsp
4/18 em U+205F medsp
5/18 em U+2005 thicksp
6/18 em U+2004 vthicksp
9/18 em U+2002 ensp
18/18 em U+2003 emsp
(digit width) U+2007 numsp
(space width) U+00A0 nbsp
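The table above translates directly into a lookup structure. The sketch below records each row as (autocorrect name, code point, width in eighteenths of an em) and uses the standard unicodedata module to show the official character names. The helper space_for_width is hypothetical. The two rows with font-dependent widths get None.

```python
import unicodedata

# Rows copied from the table above; widths are in 18ths of an em.
MATH_SPACES = [
    ("zwsp", 0x200B, 0),
    ("hairsp", 0x200A, 1),
    ("thinsp", 0x2009, 3),
    ("medsp", 0x205F, 4),
    ("thicksp", 0x2005, 5),
    ("vthicksp", 0x2004, 6),
    ("ensp", 0x2002, 9),
    ("emsp", 0x2003, 18),
    ("numsp", 0x2007, None),  # digit width, depends on the font
    ("nbsp", 0x00A0, None),   # space width, depends on the font
]

def space_for_width(eighteenths: int) -> str:
    """Return the space character whose nominal width is eighteenths/18 em."""
    for _name, code_point, width in MATH_SPACES:
        if width == eighteenths:
            return chr(code_point)
    raise KeyError(f"no Unicode space of {eighteenths}/18 em in the table")

names = {name: unicodedata.name(chr(cp)) for name, cp, _ in MATH_SPACES}
for name, cp, _ in MATH_SPACES:
    print(f"{name:9} U+{cp:04X} {names[name]}")
```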
33. Four Math Invisibles
There are four “invisible” math control codes
Math control code Unicode
Invisible Function Apply U+2061
Invisible Times U+2062
Invisible Comma U+2063
Invisible Plus U+2064
Used for semantic content and usually don’t
display a glyph. May have a small width, e.g.,
Function Apply has thinsp
34. Math Layout
Collaboration between 5 entities:
Unicode rich-text text processing program
such as Word or RichEdit
LineServices math handler
Page/TableServices math handler
Math font, e.g., Cambria Math
Math-font handler
35. Equation Breaking & Numbering
PTS math handler can break equations into
multiple lines automatically or by user breaks
PTS can handle layout of equation numbers
Client needs to support “math paragraph”
Two kinds of user breaks: at operator via context
menu, at line break (Shift+Enter)
At operator indentation: each TAB indents to
next binary/relational operator
Line break: align at specific operators, e.g., =
37. Glyph Variants
Subscripts/superscripts
Primes
Dotless i, j used in bases of accent objects
Flattened and wide accents
Growable brackets, integrals, arrows
Display of differentials using U+2146
Mirror images for right-to-left math
Variation selector U+FE00
38. Cambria Math Font
Cambria typeface designed by Jelle Bosma
Extended for math by Ross Mills and Andrei
Burago in collaboration with the ClearType and
math-layout groups
Contains extensive math tables, glyph variants
and much of the Unicode math set
Is designed with ClearType and excellent screen
readability in mind
Enables best screen-resolution display of math
39. New Math Fonts
Cambria Math has new version with more math
characters, e.g., U+2900..U+2AFF
202 math characters still needed for Unicode 5.1
STIX Times Roman math font is in beta; doesn’t
support Word 2007 math well
STIX has full math character set + some
STIX font is Type 1, so it doesn’t work with the
Office PDF writer
Font demos
40. Font Math Tables
Specialized math tables have been created to
control glyph placements
Position subscripts/superscripts horizontally
using cut-ins and italic corrections
Many math constants: axis height, fraction rule
thickness, etc.
Compare kerning of
The math tables are formalized as OpenType
tables accessible via mathfont.dll
42. User Spacing Adjustments
Layout engine attempts to render with high
typographic quality
Users can spoil layout by inserting space where
engine would insert it automatically
Have autocorrect procedure to reduce this
Users can insert Unicode spaces
Phantoms and smashes
Size and placement overrides
43. Phantoms and Smashes
Phantoms have size but no display. Can
have both width & height, ascent only,
descent only
Smashes display, but remove one or more
sizes, e.g., descent, ascent, and/or width
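One way to picture the difference is with a box model: a phantom keeps dimensions but draws nothing, while a smash draws but zeroes chosen dimensions. The Box class and both functions below are hypothetical illustrations and are not the LineServices objects themselves.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    width: float
    ascent: float
    descent: float
    visible: bool = True

def phantom(box: Box, keep_width=True, keep_ascent=True, keep_descent=True) -> Box:
    """Keep the selected dimensions but suppress display."""
    return Box(
        width=box.width if keep_width else 0.0,
        ascent=box.ascent if keep_ascent else 0.0,
        descent=box.descent if keep_descent else 0.0,
        visible=False,
    )

def smash(box: Box, width=False, ascent=False, descent=False) -> Box:
    """Keep the display but zero out the selected dimensions."""
    return replace(
        box,
        width=0.0 if width else box.width,
        ascent=0.0 if ascent else box.ascent,
        descent=0.0 if descent else box.descent,
    )

glyph = Box(width=5.0, ascent=7.0, descent=2.0)
print(phantom(glyph, keep_width=False, keep_descent=False))  # ascent-only phantom
print(smash(glyph, descent=True))                            # descent smashed
```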
44. Word 2007 Math Facility
Elegant math entry and display
Display is competitive with TeX
Automatic line breaking, special kerning
More math semantics than TeX: greater
interoperability (Presentation MathML)
Input with math ribbon, context menus
Formula autobuildup input method
WYSIWYG editing as well as linear format
MS Math graphing calculator add-in
45. What Word 2007 doesn’t have
Built-in equation numbering
Math Find/Replace
OpenType enhancements (aside from math
table functionality)
Optimal line breaking
Configurable math-zone vertical spacing
[La]TeX import/export
Document wide MathML support (only MathML
for a single math zone)
46. Conclusions
Eight infrastructures allow us to do math display and
editing better than ever before
High quality math handler and font enable typography
competitive with or better than TeX
Best screen-resolution display of mathematics
Streamlined input methods such as Formula Autobuildup
Incorporated into Word 2007, Word down-level
converter, Microsoft Math calculator
Cambria Math font: state-of-art math font
Editor’s Notes
This talk describes and demonstrates how Unicode’s rich mathematical character set, combined with OpenType font technology, TeX’s mathematical typography principles, and enhanced autocorrection, can be used to produce high-quality, streamlined technical text processing in Word 2007.
This project was considerably harder than any of us imagined it would be. Mathematical typography is very intricate and varied, and making it work in an international rich-text environment encounters many complications one might not expect. On the other hand, that environment offers many advantages too. Mathematical expressions are always entered into math zones. These zones are regions of text like those in between $’s or $$’s in TeX, but are handled by a character format run attribute in our approach.
Infrastructures outside and inside of Microsoft have emerged to enable major advances in the editing and display of mathematical formulae. While TeX has been stable since about 1986 (last major changes were in 1990), most of the other infrastructures have become available only recently.
TeX (see the TeXbook, by Donald Knuth), a widely used document preparation program, provides both fundamental examples and many specifications for our new math editing and display facility. TeX is the most dominant technical document preparation program today, used to typeset technical books and journals throughout the world. It’s also used widely on the web to display technical documents, either in TeX or pdf form. The experts and users alike agree that the typography used is excellent and sufficient to meet their needs. The program allows the user or copy editor to tweak settings to match end preferences. TeX’s input method can be used with any plain-text editor. While easy to use in principle, the method becomes awkward for complicated mathematical formulae. In addition, one of TeX’s strengths—easy definition of macros—is also a problem when it comes to interchange. The TeXbook is a user manual that includes a detailed specification for mathematical typography. We have used many of its choices and methodology in creating our solutions, which are appropriately enhanced with the use of OpenType tables and some additional constructs. Although the TeX source code is available, it cannot be used directly for several reasons. First the code is like a web rather than being hierarchical and uses many global variables. This makes it cumbersome to employ in the instance-oriented contexts used at Microsoft. Complicating this is that TeX is a complete document imaging system, not one limited to mathematics. As such many aspects of the program that are used for mathematics are used also for other kinds of layout like headers, footers, figures, and footnotes. Extricating the mathematical algorithms from this web of code would be significantly harder than recreating the desired display quality using our own methodologies and the specifications given in The TeXbook . 
Furthermore we want to take advantage of our OpenType math fonts to obtain better positioning of subscripts, superscripts, and other symbols than possible by default using TeX. Another complication is that Office is an international environment and our math facility needs to be compatible with all languages that we support, potentially simultaneously. Limitations on screen display quality are discussed in later slides.
Unicode is a character encoding system that Knuth would have loved to have had when he and his students developed TeX. Unicode 5.0 contains all standard mathematical characters used in print today. This includes about 2000 characters plus all the combinations that can be made with combining marks. As such Unicode provides an excellent foundation for technical documents, significantly better than the character sets used in TeX itself. In particular, all of TeX’s characters are included in Unicode or in glyph variants thereof.
See http://www.unicode.org/charts for displays of all characters in Unicode 4.0. This slide shows some of the Miscellaneous Mathematical Symbols-B, range U+2980 – U+29FF. For information about the Unicode math characters, see B. Beeton, A. Freytag, M. Sargent III, Unicode support for mathematics , http://www.unicode.org/reports/tr25/ (2003).
Mathematical notation uses a basic set of mathematical alphanumeric characters which consists of: - set of basic Latin digits (0 - 9) (U+0030 – U+0039) - set of basic upper- and lowercase Latin letters (a - z, A - Z) - uppercase Greek letters Α - Ω (U+0391 – U+03A9), plus the nabla ∇ (U+2207) and the variant of theta Θ given by U+03F4 - lowercase Greek letters α - ω (U+03B1 – U+03C9), plus the partial differential sign ∂ (U+2202) and the six glyph variants of ε, θ, κ, φ, ρ, and π, given by U+03F5, U+03D1, U+03F0, U+03D5, U+03F1, and U+03D6. Only unaccented forms of the letters are used for mathematical notation, because general accents such as the acute accent would interfere with common mathematical diacritics. Examples of common mathematical diacritics that can interfere with general accents are the circumflex, macron, or the single or double dot above, the latter two of which are used in physics to denote derivatives with respect to the time variable. Mathematical symbols with diacritics are always represented by combining character sequences, except as required by normalization. In addition to this basic set, mathematical notation also uses the four Hebrew-derived characters (U+2135 – U+2138). Occasional uses of other alphabetic and numeric characters are known. Examples include U+0428 cyrillic capital letter sha, U+306E hiragana letter no, and Eastern Arabic-Indic digits (U+06F0 – U+06F9). However, these characters are used in only the basic form.
Generally the math alphanumerics substantially reduce the verbosity of markup, although one can construct cases that aren’t so verbose. But a markup representation is poor for several reasons: 1) it complicates a search for a bold italic a, since the search engine needs to understand the bold and italic tags or attributes and dissect the tag contents, 2) it doesn’t tag the characters individually as math identifiers, which is a MathML requirement, and 3) it introduces complexity into the tag model by introducing multiple variable identifier tags. The last of these disadvantages can be overcome by representing the nature of the variables with attributes, e.g., <mi style=bolditalic> , but this approach is quite verbose for items as small as math characters. Admittedly this approach is necessary to handle (quite rare) alphanumeric math symbols that aren’t included in the math alphanumeric block. Searching for such symbols requires a sophisticated attribute-aware search engine since simple plain-text search engines would yield many undesired search hits.
Mathematics has need for a number of Latin and Greek alphabets that on first thought appear to be just font variations of one another, e.g., normal, bold, italic and script H. However in any given document, these characters have distinct mathematical semantics. For example, a normal H represents a different variable from a bold H, etc. If one drops these distinctions in plain text, one gets gibberish. The next slide shows that instead of the well-known Hamiltonian formula ℋ = ∫dτ (εE² + μH²), you’d get the integral equation H = ∫dτ (εE² + μH²). Accordingly, Unicode encodes bold, italic, script, etc., Latin and Greek alphabets as math alphanumerics. Straight encoding leads to 996 characters. They allow plain text to retain the proper character semantics and simple (nonrich) search methods to work. For example when you want to search for a script upper-case H math variable, you don’t want to find any other kind of H.
The World Wide Web Consortium (W3C) recognized the need for a format for representing scientific and technical information. In fact, the HTML 3.0 working draft (1994) included a proposal for HTML Math from Dave Raggett. In March, 1997, the W3C HTML Math working group was formally constituted. The first product of the W3C HTML Math working group was the Mathematical Markup Language (MathML). MathML 1.0 was released as a W3C Recommendation in April, 1998. As the first W3C-endorsed XML application, MathML is a low-level format for describing mathematics. MathML provides a much needed foundation for the inclusion of mathematical expressions in Web pages and as a common encoding for scientific processors. Indeed, MathML facilitates the use and re-use of scientific content. The MathML 2.0 specification also provides a wealth of information about putting math on computers.
Each MathML element falls into one of three categories: presentation elements, content elements and interface elements. Just as titles, sections, and paragraphs capture the higher-level syntactic structure of a textual document, presentation elements are meant to express the syntactic structure of math notation. Content elements describe mathematical objects directly, as opposed to describing the notation which represents them. Presentation MathML specifies how to display mathematical formulae, but it doesn’t specify the content unambiguously. Here the 2 is a square, known to most everyone. But such notation can also be used as an index. The corresponding content markup specifies the two cases unambiguously.
Content MathML unambiguously defines the meaning of expressions. But it doesn’t specify how to display such expressions. It is possible to give both content and presentation forms for expressions using the <semantics> tag.
See MathML 2.0 Section 7.2.3, Attributes for unspecified data. Could put WordprocessingML or DrawingML in attributes or inside <annotation-xml>.
The linear format is by far the simplest, but it’s not XML.
Math information is collected into two areas: 1) Document default math properties in the {\mmathPr…} group, and 2) Math zones in {\mmath…} groups. A math zone is a text range within which math typography rules usually apply and outside of which math typography rules do not apply. Math zones can contain specially marked normal text runs for which math typography rules don’t apply (see \mnor). With Office math, math zones are identified internally by a character-format effect bit like bold. Hence if you delete the ordinary text separating two math zones, you get a single merged math zone.
Math zones can be inline or display, corresponding to TeX’s $ and $$ toggle keys. If a math zone fills an entire paragraph, it is a display math zone, i.e., it is displayed on its own line(s). If a math zone is preceded and/or followed by nonmath text other than a \par, the math zone is inline and is rendered in a more compressed fashion. Inline math zones usually consist of math expressions or variables, whereas display math zones usually consist of one or more equations or formulas.
The RTF for the content of an inline math zone replaces the first ellipsis of the nested group structure {\mmath{\*\moMath…}{\mmathPict…}}. Readers that do not understand the ignorable {\*\moMath…} group can use one of the pictures in the {\mmathPict…} group. The RTF for the content of a display math zone replaces the second ellipsis in the nested group structure {\mmath{\*\moMathPara{\moMathParaPr…}{\*\moMath…}+}{\mmathPict…}}. Here the + means that a {\*\moMath…} group is emitted for each instance of mathematical text that should start on a new line, e.g., for each new equation. The control word \moMathPara stands for a “math paragraph”, which can contain multiple equations with various alignment and breaking options. A math paragraph may be part of a text paragraph (text ending in a \par and either starting a document or following a \par). In general, a text paragraph can contain multiple math paragraphs separated from one another by lines of normal text.
In this discussion, we see that math RTF uses two ways to assign property values depending on the property: 1) the standard RTF way with a parameter N as in \msty2, and 2) using a mini group like {\mtype skw}. The latter way is inspired by the corresponding OMML syntax, such as <m:type m:val="skw"/>, while the RTF way is more succinct. For detailed information see the RTF Specification, Version 1.9.1.
Mathematics is the product of a myriad ingenious minds and many notational variations are in use. We have attempted to support most of these variations.
Rigorous math spacing is essential for high quality mathematical typography. In the simplest cases, such as an equation like a = b + c, the variables a, b, and c are represented by Unicode math-italic letters and the operators are separated from the letters by spacing chosen according to a set of rules specified in Chap. 18 of The TeXbook. In more complicated equations, special “built-up” math-handler objects are used to place the glyphs in the correct places. These objects allow the math handler in conjunction with the math font to place glyphs as TeX would along with automating a number of spacing refinements that TeX delegates to the user. The objects are summarized in a later slide. The MathML 2.0 specification also has math spacing information.
Math ribbons and handwriting recognition are beyond the scope of this talk.
A handy hex-to-Unicode entry method works with WordPad 2000/XP, Office 2000/XP edit boxes, RichEdit controls in general, and in Microsoft Word starting with Word 2002. Basically you type a character’s hexadecimal code (in ASCII), making corrections as need be, and then type Alt+x. Presto! The hexadecimal code is replaced by the corresponding Unicode character. The Alt+x can be a toggle (as in Microsoft Word 2002). That is, type it once to convert the hex code to a character and type it again to convert the character back to a hex code. If the hex code is preceded by one or more hexadecimal digits, you need to “select” the code so that the preceding hexadecimal characters aren’t included in the code. The code can range up to the value 0x10FFFF, which is the highest character in the 17 planes of Unicode. The only problem with this approach is that some programs use Alt+x for something else (like quit) or the keyboard doesn’t have direct access to ASCII alphabetics.
You can add autocorrect entries using the Tools/Autocorrect Options dialog. Type what you want replaced in the “Replace:” dialog and what you want it replaced with in the “With:” dialog. You can put mathematical expressions in linear form into the “With:” dialog. Then when the replace text is encountered, it will be replaced by a built-up form of the replacement text.
It’s possible to define a “plain text” encoding that often looks like mathematics. Some constructs require some simplified markup, but many expressions are literally plain (Unicode) text. The notation is handy as a math input language for more elaborate markup languages like TeX and MathML and can be used in its own right. We define a simple operand to consist of all consecutive alphanumeric characters. We call this sequence of one or more alphanumeric characters a span of alphanumerics. As such, a simple numerator or denominator is terminated by any operator, including, for example, arithmetic operators, the blank operator U+0020, and all Unicode characters with codes U+22xx. The fraction operator is the ASCII forward slash U+002F.
For more complicated operands, such as those that include operators, parentheses ( ), brackets [ ], or { } can be used to enclose the desired character combinations. If parentheses are used and the outermost parenthesis set is preceded and followed by operators, that set is not displayed in built-up form, since usually one doesn’t want to see such parentheses. So the plain text (a + b)/c displays as shown in the slide. In practice, this approach leads to a linear text that is significantly easier to read than TeX’s, e.g., {a + c \over d}, since in many cases, outermost parentheses are not needed, while TeX requires { }’s except for single letters. To force the display of an outermost parenthesis set, one encloses the set, in turn, within parentheses, which then become the outermost set. A really neat feature of this notation is that the linear text is, in fact, a legitimate mathematical notation in its own right, so it’s relatively easy to read. I plan to submit the full linear format as a Unicode Technical Note.
Nature isn’t so kind with subscripts and superscripts, but they’re still quite readable. Specifically, we introduce a subscript by a subscript operator _ which we display as a subscripted down arrow. Similarly we introduce a superscript with a superscript operator ^, which we display as a superscripted up arrow. The subscript itself can be any operand as defined above. Another compound subscript is a subscripted subscript, which works using right-to-left associativity. This associativity can be overruled using parentheses as described for fractions. If you use Unicode’s built-in subscripts and superscripts, they should be rendered to look the same as if they had been represented by the corresponding general subscript/superscript markup. The numeric subscripts and superscripts are often used and can streamline the look of technical plain text.
A large community of technically oriented people have TeX input “in their fingers”. In addition, this kind of input is easy to describe and appears in many readily available books. The problem is that it becomes cumbersome to work with in plain text for formulae that have much complexity. However this problem goes away in our environment thanks to autocorrect in combination with formula autobuildup. Essentially the user sees the formulae automatically build up on the screen as s/he types them in. This contrasts remarkably with the traditional TeX scenario, in which the user always edits the full original text in TeX’s linear format. To get an idea of how simple the new approach is, consider the following. In TeX a user types \delta to see δ in print. With autocorrect and the right autocorrect data file (even Word 97 autocorrect), as soon as a blank or punctuation symbol is typed after the a in \delta, the Greek letter δ appears on the screen. No need to wait for a printout or preview. Similarly with the formula autobuildup facility, one can type in integrals with \int, fractions, square roots, etc., and see them displayed in built-up form on the screen instead of the relatively complicated way they appear when typed in. You never have to search the original plain text input to find where to edit. You just point and click at the right place in a formula and edit as desired. Typically such WYSIWYG editing is preferred once a formula is built up, and you can use formula autobuildup wherever you want to, including inside built-up formulas. You can also toggle back to the linear format if that makes things easier, e.g., in converting a fraction to something else. A complete mathematical expression can be entered in linear form into an autocorrect target. The formula autobuildup mechanism automatically builds such expressions up as they are entered.
The space bar is the easiest key to hit on the keyboard and we make extensive use of it.
Unicode has a variety of spaces that can be used in mathematical text. Fonts need to show no glyph for these.
Note that many characters that are not operators in algebra nevertheless behave as operators in the linear format, namely all characters of the category concatenation. This includes space characters, along with arithmetic operators like +, *, =, etc. Note also that the absolute-value and norm operators don’t appear in the table, since they require a slightly more complicated formalism to handle (sometimes a ‘|’ acts like an opOpen and sometimes like an opClose). Similarly period and comma don’t appear, since when sandwiched between ASCII digits they are treated as part of an operand, while otherwise they have a precedence of 4.
In this model, math layout is performed by a collaboration between four entities: 1) a Unicode rich-text text processing program such as Word or RichEdit, 2) the math handler built into the latest version of the Microsoft text layout component, 3) the math font, and 4) the math-font handler. This collaboration is invoked whenever text inside a math zone needs to be displayed. All such text is rendered using appropriate glyphs with measurements dependent on the glyph ascents, descents, and widths. In the simplest cases, such as an equation like a = b + c, the variables a, b, and c are represented by Unicode math-italic letters and the operators are separated from the letters by spacing chosen according to a set of rules specified in The TeXbook.
The math handler is also capable of breaking equations into multiple lines, either automatically or at user-defined breaks. This feature is particularly valuable on screen, where window widths tend to change readily, making the hand breaking used for paper less successful. While no special font properties are required for this feature, the client backing store has to support the concept of a “math paragraph”. Word 2007 implements the line-breaking functionality, but equation numbering was postponed. One can get equation numbers using tables.
In more complicated equations, special “built-up” math-handler objects are used to place the glyphs in the correct places. These objects allow the math handler in conjunction with the math font to place glyphs as TeX would, along with automating a number of spacing refinements that TeX delegates to the user. The Line Services math objects are:
Accent: Display accent over base character(s)
Box: Give properties to base
Boxed formula: Display borders and/or lines through base
Delimiters: Enclose base in parens, brackets, braces, etc.
Delimiters with separators: Enclose bases separated by separator character, such as a vertical bar
Equation array: Display set of horizontally aligned equations
Fraction: Display normal or small built-up fraction
Function apply: Display trigonometric and other functions with function name and base
Left subsup: Prefix a subscript and/or superscript to base
Lower limit: Display limit below base
Matrix: Display matrix with n columns and m rows
n-ary: Display large n-ary operator with a base and optional upper and lower limits
Operator character: Used internally to give proper spacing to operators
Overbar: Display bar over base (boxed formula special case)
Phantom: Suppress any combination of base ascent, descent, width, display, or transparency
Radical: Display square and nth roots
Slashed fraction: Display slashed or built-up linear fraction
Stack: Display first argument over second (like fraction w/o bar)
Stretch stack: Display stretchable character above/below base or a limit above/below stretchable base character
Subscript: Display subscript relative to base
Subsup: Display subscript and superscript relative to base
Superscript: Display superscript relative to base
Underbar: Display bar under base (boxed formula special case)
Upper limit: Display limit above base
Cambria Math contains a full set of glyph variants that have a heavier weighting so that when scaled down to the first script level (about 71% of text size) the stem widths match those of the text-level glyphs. Prime (U+2032) and multiple primes need to be superscripted and scaled down accordingly. Dotless i and j are automatically used in the bases of accent objects. When putting an accent over capital letters, partially flattened glyph variants are used. Furthermore, the glyph variants are requested to have sufficient width to cover the accent base, which may consist of more than one character. Brackets, braces, parentheses and other growable characters have a number of larger glyph variants as well as arbitrarily large sizes created using glyph assemblies. When the assemblies are displayed, the pieces are clipped to prevent overlap, which would create ClearType artifacts. According to a document setting, the italic open-face characters U+2145..U+2149 (differentials, e, i, j) can be displayed as themselves (useful for patent applications) or as the corresponding math italic or ASCII letters. Serifed italic glyphs are used for these in most math publications, but serifed upright glyphs are used in some European math publications. The use of the differential d (U+2146) automatically introduces a small space between it and the preceding character if that character is alphabetic. Right-to-left math requires mirroring the images of parentheses, integrals, square roots, arrows, etc. Many such mirror images can be obtained by using corresponding Unicode characters. For example, the mirror image of a left parenthesis is a right parenthesis and vice versa. But for many characters, such as integral signs and square roots, Unicode has no mirror-image counterpart. Furthermore, it seems that a glyph-variant approach for these characters makes more sense than adding characters to serve as the mirror images.
Other approaches include using world transforms and mirrored bitmaps. The present version of our software doesn’t handle true right-to-left math. Math zones in right-to-left paragraphs are treated as left-to-right objects, with all characters in the math zone being strong left-to-right except those defined by Unicode to be strong right-to-left.
New Unicode 4.0 math fonts have been developed both at Microsoft and by the STIX committee, which played a key role in generalizing Unicode to include all standard math characters. Our new math facilities have been developed along with the Cambria Math font, each influencing the other to obtain ideal results. The Cambria Math effort was managed by Geraldine Wade and Michael Duggan together with Tiro Typeworks. Andrei Burago and Sergey Malkin also contributed in key ways. Cambria Math is part of a TrueType collection that also includes Cambria, Cambria Italic, Cambria Bold, and Cambria Bold Italic. High-quality low-resolution screen display is very important for the way people work with documents in the Internet age: most documents are perused on screen and only printed for purposes of detailed examination. This is a major advantage of our math system.
Here’s a list of 202 other characters needed to complete the math character set (from UTR #25; includes the circled single digits and 52 circled alphabetics and six parenthesized alphabetics that I use in the math linear format): 232C..232E, 23E1..23E7, 2460..2468, 24A9, 249D, 249E, 24A8, 24AD, 24B1, 24B6..24EA, 25A2, 25AA..25AB, 25B2, , 25B4..25B9, 25BC..25BF, 25C0..25C3, 25C6..25C7, 25C9, 25CE..25CF, 25E6, 25EF, 25FB..25FE, 2605..2606, 2609, 26AA..26AC, 2772..2773, 27C0..27C9, 27CC, 2B00..2B03, 2B05, 2B08..2B0C, 2B0E..2B19, 2B1B..2B54. The circled/parenthesized characters in the new math linear format are: 24A9, 249D, 249E, 24A8, 24AD, 24B1, 24B7, 24B8, 24C1, 24C3, 24C9, 24D1, 24D2. The geometric characters in UTR #25 Table 2.5 need to be sized appropriately. We probably should brainstorm about these and compare sizes in Cambria Math, STIX and UTR #25. A passionate user has also written up comments on this (http://www.unicode.org/~rick/Chastney-Phillip-Shapes-II.pdf). We also ought to discuss the glyph variants in UTR #25 Tables 2.7 to 2.9. Presumably these can be accommodated using shaping.
The new font tables enable one to position subscripts and superscripts horizontally better than TeX, as well as offering richer glyph choices for operators like the integral sign, square root, and growable brackets. The tables include parameters such as the em-size-dependent sub/superscript values:
    LONG lSubscriptShiftDown;
    LONG lSubscriptTopMax;
    LONG lSubscriptBottomDropMin;
    LONG lSuperscriptShiftUp;
    LONG lSuperscriptShiftUpCramped;
    LONG lSuperscriptBottomMin;
    LONG lSuperscriptTopRiseMin;
    LONG lSubSuperscriptMinGap;
    LONG lSuperscriptBottomMaxWithSubscript;
    LONG lSpaceAfterScript;
In addition, math characters have four cut-in values, one for each corner, allowing sub/superscripts to be kerned with their bases. The information in the tables can be obtained from mathfont.dll along with appropriate scaling and glyph assemblies.
Functions in mathfont.dll accessing math font tables
The subscript/superscript/prescript callbacks are shown
For example, consider Einstein’s most famous equation, E = mc². The E is in its own text run, the equal sign is a mathematical operator object, the m is in its own text run, and the c² is a superscript object with text runs for arguments. The text runs result in various callbacks to obtain character properties, widths, and glyphs, as well as to display the glyphs or variants thereof once the whole line is laid out. All text is treated using glyphs and glyph-ink ascents and descents. The math italic letters are given by Unicode math alphabetics in Plane 1. The operator object for the equal sign results in callbacks to determine the operator’s text characteristics and its default spacing class, in this case relational. The superscript object results in callbacks to get text-run information for the base and superscript text, as well as to obtain the superscript vertical shift and the cut-in values for the upper-right corner of the c and the lower-left corner of the 2. These displacements are obtained from the math font handler (MFH), which is responsible for access to the math font’s math tables along with appropriate scaling. When the glyph for the superscript 2 is fetched, the MFH is requested to return a script level-1 glyph variant with a relative size specified by the font (typically about 70% of the text size). This example shows how even a simple mathematical equation involves interplay between the client, the math layout handler, the math font handler, and the font itself. More complicated examples have math objects like brackets or integrals and need glyph assemblies and other information. In addition, larger equations may need to be wrapped onto two or more lines, a process that involves further callbacks and information.
If a standard fraction's argument is itself a standard fraction whose width is greater than the outer fraction's rule length minus 20% of an em, increase the outer rule length by 20% of an em to reveal which fraction contains the other.
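The nested-fraction rule amounts to a one-line comparison. The sketch below is an illustration under stated assumptions: the function and parameter names are hypothetical, and all three lengths are assumed to be in the same units.

```python
def adjusted_rule_length(outer_rule, inner_width, em):
    """Return the outer fraction's rule length after applying the rule.

    outer_rule:  current rule (fraction bar) length of the outer fraction
    inner_width: width of the nested fraction used as an argument
    em:          em size, in the same units as the two lengths
    """
    slack = em / 5  # 20% of an em
    if inner_width > outer_rule - slack:
        # The inner bar would be nearly as long as the outer one, so
        # lengthen the outer bar to show which fraction contains the other.
        return outer_rule + slack
    return outer_rule
```

With a 20-unit em, an outer rule of 100 units is left alone for an inner fraction 95 units wide but grows to 104 units for one 98 units wide.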
Math editing and display in Office 2007 appear only in Word 2007, but the appearance is stunning. The math typography is competitive with or superior to TeX’s, the input methods are state of the art, and the environment is Office’s, which comes with internationalization, spelling and grammar checking, interoperability, bibliography support, and many other features one expects from the leading word processor. Much of the underlying functionality is based on sharable components: PTLS 4.0 (Page/Table/LineServices with its high-quality math handlers), RichEdit, the math font library, Uniscribe, and the incredible Cambria Math font.
The underlying technology was mostly available, but there wasn’t enough time to integrate it
WordPad uses a RichEdit control for editing and displaying text. By using the latest RichEdit control with WordPad, we can edit and display mathematical expressions as described in this talk. To get an upgraded msftedit.dll for use with WordPad, go to \\scratch2\scratch\murrays\wordpad. The math build up/down facility is housed in Office 12’s RichEdit 6.0 DLL and communicates with clients using a subset of the TOM2 (Text Object Model 2) interface methods. Any application that implements this subset of methods can have formula autobuildup and manual build up/down. In particular, the client needs to implement the ITextStrings rich-text strings interface. This interface gives access to a set of strings similar to a stack of C strings, but the ITextStrings strings may have rich-text properties that the build up/down facility doesn’t need to understand.
Many people around the company and many technologies are involved in this effort. This slide lists the people and groups most directly involved. Many thanks also are due to our managers at all levels, who offered lots of support and encouragement.
The Unicode Standard, Version 4.0 (Reading, MA: Addison-Wesley, 2003), ISBN 0-321-18578-1, or online at http://www.unicode.org/versions/Unicode4.0.0/
Barbara Beeton, Asmus Freytag, Murray Sargent III, Unicode Technical Report #25, “Unicode Support for Mathematics”, http://www.unicode.org/reports/tr25
Donald E. Knuth, The TeXbook (Reading, MA: Addison-Wesley, 1984)
Mathematical Markup Language (MathML) Version 2.0 (Second Edition), http://www.w3.org/TR/2003/REC-MathML2-20031021/
Murray Sargent III, Unicode Nearly Plain-Text Encoding of Mathematics, http://www.unicode.org/notes/tn28/UTN28-PlainTextMath.pdf
Caveat: we’re not finished yet. Right-to-left Arabic math and a number of other important features have been postponed to the next iteration of Office. PowerPoint and OneNote didn’t quite make it, although we have impressive demos. We don’t have converters to/from TeX, although one could import/export TeX via MathML.