Text::ASCIIMathML - Perl extension for parsing ASCIIMathML text into MathML

SYNOPSIS

use Text::ASCIIMathML;

$parser = Text::ASCIIMathML->new();

$parser->SetAttributes(ForMoz => 1);

$ASCIIMathML = "int_0^1 e^x dx";
$mathML = $parser->TextToMathML($ASCIIMathML);
$mathML = $parser->TextToMathML($ASCIIMathML, [title=>$ASCIIMathML]);
$mathML = $parser->TextToMathML($ASCIIMathML, undef, [displaystyle=>1]);

$mathMLTree = $parser->TextToMathMLTree($ASCIIMathML);
$mathMLTree = $parser->TextToMathMLTree($ASCIIMathML, [title=>$ASCIIMathML]);
$mathMLTree = $parser->TextToMathMLTree($ASCIIMathML,undef,[displaystyle=>1]);

$mathML = $mathMLTree->text();
$latex  = $mathMLTree->latex();

DESCRIPTION

Text::ASCIIMathML is a parser for ASCIIMathML text which produces MathML XML markup strings that are suitable for rendering by any MathML-compliant browser.

The parser uses the following attributes which are settable through the SetAttributes method:

ForMoz

Specifies that the fonts should be optimized for Netscape/Mozilla/Firefox.

The output of the TextToMathML method always follows the schema <math><mstyle>...</mstyle></math>.

The first argument of TextToMathML is the ASCIIMathML text to be parsed into MathML. The second argument is a reference to an array of attribute/value pairs to be attached to the <math> node, and the third argument is a reference to an array of attribute/value pairs for the <mstyle> node.

Common attributes for the <math> node are "title" and "xmlns"=>"&mathml;". Common attributes for the <mstyle> node are "mathcolor" (for text color), "displaystyle"=>"true" (to use display style instead of inline style), and "fontfamily".

ASCIIMathML markup

The syntax is very permissive and does not generate syntax errors. This allows mathematically incorrect expressions to be displayed, which is important for teaching purposes. It also causes less frustration when previewing formulas.

If you encode 'x^2' or 'a_(mn)' or 'a_{mn}' or '(x+1)/y' or 'sqrtx', you pretty much get what you expect. The choice of grouping parentheses is up to you (they don't have to match either). If the displayed expression can be parsed uniquely without them, they are omitted. Most LaTeX commands are also supported, so the last two formulas above can also be written as '\frac{x+1}{y}' and '\sqrt{x}'.

The parser uses no operator precedence and only respects the grouping brackets, subscripts, superscripts, fractions and (square) roots. This is done for reasons of efficiency and generality. The resulting MathML code can quite easily be processed further to enforce the additional syntactic requirements of any particular application.

The grammar

Here is a definition of the grammar used to parse ASCIIMathML expressions. In the Backus-Naur form given below, the letter on the left of the ::= represents a category of symbols that could be one of the possible sequences of symbols listed on the right. The vertical bar | separates the alternatives.

     c ::= [A-Za-z] | numbers | greek letters | other constant symbols
                                        (see below)
     u ::= 'sqrt' | 'text' | 'bb' | other unary symbols for font commands
     b ::= 'frac' | 'root' | 'stackrel' | 'newcommand' | 'newsymbol'
                                        binary symbols
     l ::= ( | [ | { | (: | {:          left brackets
     r ::= ) | ] | } | :) | :}          right brackets
     S ::= c | lEr | uS | bSS | "any"   simple expression
     E ::= SE | S/S | S_S | S^S | S_S^S expression (fraction, sub-,
                                        super-, subsuperscript)

The translation rules

Each terminal symbol is translated into a corresponding MathML node. The constants are mostly converted to their respective Unicode symbols. The other expressions are converted as follows:

     lSr             ->  <mrow>lSr</mrow>
                         (note that any pair of brackets can be used to
                         delimit subexpressions; they don't have to match)
     sqrt S          ->  <msqrt>S'</msqrt>
     text S          ->  <mtext>S'</mtext>
     "any"           ->  <mtext>any</mtext>
     frac S1 S2      ->  <mfrac>S1' S2'</mfrac>
     root S1 S2      ->  <mroot>S2' S1'</mroot>
     stackrel S1 S2  ->  <mover>S2' S1'</mover>
     S1/S2           ->  <mfrac>S1' S2'</mfrac>
     S1_S2           ->  <msub>S1 S2'</msub>
     S1^S2           ->  <msup>S1 S2'</msup>
     S1_S2^S3        ->  <msubsup>S1 S2' S3'</msubsup> or
                         <munderover>S1 S2' S3'</munderover> (in some cases)
     S1^S2_S3        ->  <msubsup>S1 S3' S2'</msubsup> or
                         <munderover>S1 S3' S2'</munderover> (in some cases)

In the rules above, the expression S' is the same as S, except that if S has an outer level of brackets, then S' is the expression inside these brackets.
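
The S' rule can be illustrated at the token level. The following is a minimal sketch, not the module's implementation: the helper name strip_outer_brackets and the representation of S as a list of tokens are assumptions made for this example. It removes the first and last token only when they are a left and a right bracket that enclose the whole expression, and, as in the grammar, the two brackets need not match.

```perl
use strict;
use warnings;

# Bracket tokens from the grammar's l and r rules.
my %LEFT  = map { $_ => 1 } ('(', '[', '{', '(:', '{:');
my %RIGHT = map { $_ => 1 } (')', ']', '}', ':)', ':}');

# Hypothetical helper: given the tokens of S, return the tokens of S'.
sub strip_outer_brackets {
    my @tokens = @_;
    return @tokens
        unless @tokens >= 2 && $LEFT{ $tokens[0] } && $RIGHT{ $tokens[-1] };
    my $depth = 0;
    for my $i (0 .. $#tokens) {
        $depth++ if $LEFT{ $tokens[$i] };
        $depth-- if $RIGHT{ $tokens[$i] };
        # Closing before the last token means the outer pair does not span S.
        return @tokens if $depth == 0 && $i < $#tokens;
    }
    return @tokens[1 .. $#tokens - 1];
}

print join(' ', strip_outer_brackets('(', 'x', '+', '1', ')')), "\n";       # x + 1
print join(' ', strip_outer_brackets('(', 'a', ')', '(', 'b', ')')), "\n";  # ( a ) ( b )
```

Under this reading, 'sqrt(x+1)' places only 'x+1' inside <msqrt>, while a bare 'sqrtx' is left unchanged.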

Matrices

A simple syntax for matrices is also recognized:

l(S11,...,S1n),(...),(Sm1,...,Smn)r
    or    
l[S11,...,S1n],[...],[Sm1,...,Smn]r.

Here l and r stand for any of the left and right brackets (just like in the grammar they do not have to match). Both of these expressions are translated to

<mrow>l<mtable><mtr><mtd>S11</mtd>...
<mtd>S1n</mtd></mtr>...
<mtr><mtd>Sm1</mtd>... 
<mtd>Smn</mtd></mtr></mtable>r</mrow>.

Note that each row must have the same number of expressions, and there should be at least two rows.

LaTeX matrix commands are not recognized.
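
As an illustration of the target markup, here is a small sketch that assembles the <mtable> structure from cells that are assumed to be translated already. The function name matrix_markup and its argument layout are hypothetical; the checks simply mirror the constraints stated above (at least two rows, all rows of equal length).

```perl
use strict;
use warnings;

# Hypothetical helper: $left and $right are the bracket markup, $rows is a
# reference to an array of rows, each a reference to an array of cells.
# Returns nothing when the rows do not form a matrix.
sub matrix_markup {
    my ($left, $right, $rows) = @_;
    return if @$rows < 2;
    my $width = @{ $rows->[0] };
    return if grep { @$_ != $width } @$rows;

    my $body = '';
    for my $row (@$rows) {
        my $cells = join '', map { "<mtd>$_</mtd>" } @$row;
        $body .= "<mtr>$cells</mtr>";
    }
    return "<mrow>$left<mtable>$body</mtable>$right</mrow>";
}

print matrix_markup('[', ']', [ [ 'S11', 'S12' ], [ 'S21', 'S22' ] ]), "\n";
```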

Tokenization

The input formula is broken into tokens using a "longest matching initial substring search". Suppose the input formula has been processed from left to right up to a fixed position. The longest string from the list of constants (given below) that matches the initial part of the remainder of the formula is the next token. If there is no matching string, then the first character of the remainder is the next token. The symbol table at the top of the ASCIIMathML.js script specifies whether a symbol is a math operator (surrounded by a <mo> tag) or a math identifier (surrounded by a <mi> tag). For single character tokens, letters are treated as math identifiers, and non-alphanumeric characters are treated as math operators. For digits, see "Numbers" below.

Spaces are significant when they separate characters and thus prevent a certain string of characters from matching one of the constants. Multiple spaces and end-of-line characters are equivalent to a single space.
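
The longest-match rule can be sketched in a few lines of Perl. This is an illustration under stated assumptions rather than the module's code: the symbol list is a tiny hypothetical excerpt, the function name tokenize is invented for the example, and digits are not treated specially here.

```perl
use strict;
use warnings;

# Hypothetical, heavily abbreviated symbol list; the real table is far larger.
my @SYMBOLS = ('sin', 'sinh', 'in', 'int', 'oo', '<=', '<', '^^', '^^^', '^', '->', '-');

# Sort longest first so the first hit at a position is the longest match.
my @BY_LENGTH = sort { length($b) <=> length($a) } @SYMBOLS;

sub tokenize {
    my ($text) = @_;
    my @tokens;
    my $pos = 0;
    while ($pos < length $text) {
        my $rest = substr($text, $pos);
        # Whitespace only separates tokens; a run of it counts as one space.
        if ($rest =~ /^(\s+)/) {
            $pos += length $1;
            next;
        }
        # Fall back to the first character when no symbol matches.
        my $token = substr($rest, 0, 1);
        for my $sym (@BY_LENGTH) {
            if (index($rest, $sym) == 0) {
                $token = $sym;
                last;
            }
        }
        push @tokens, $token;
        $pos += length $token;
    }
    return @tokens;
}

print join(' | ', tokenize('sinhx')), "\n";    # sinh | x
print join(' | ', tokenize('sin hx')), "\n";   # sin | h | x
```

The second call shows why spaces matter: the space stops 'sinh' from matching, so 'sin' becomes the token and 'h' is read as a separate identifier.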

Numbers

A string of digits, optionally followed by a decimal point (a period) and another string of digits, is parsed as a single token and converted to a MathML number, i.e., enclosed with the <mn> tag.
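
A regular expression is one way to express this rule. The sketch below uses a hypothetical helper, leading_number, and assumes that a period with no digits after it is not part of the number; the text above does not say how the module handles that case.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number at the start of $text wrapped in
# <mn>, or undef if $text does not start with a digit.
sub leading_number {
    my ($text) = @_;
    # Digits, optionally followed by a period and more digits.
    return $text =~ /^([0-9]+(?:\.[0-9]+)?)/ ? "<mn>$1</mn>" : undef;
}

print leading_number('3.14x'), "\n";   # <mn>3.14</mn>
print leading_number('42.'), "\n";     # <mn>42</mn>
```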

Greek letters

Lowercase letters

alpha beta chi delta epsilon eta gamma iota kappa lambda mu nu omega phi pi psi rho sigma tau theta upsilon xi zeta

Uppercase letters

Delta Gamma Lambda Omega Phi Pi Psi Sigma Theta Xi

Variants

varepsilon varphi vartheta

Standard functions

sin cos tan csc sec cot sinh cosh tanh log ln det dim lim mod gcd lcm min max

Operation symbols

Type	  Description					Entity
+	  +						+
-	  -						-
*	  Mid dot					&sdot;
**	  Star						&Star;
//	  /						/
\\	  \						\
xx	  Cross product					&times;
-:	  Divided by					&divide;
@	  Compose functions				&SmallCircle;
o+	  Circle with plus 				&oplus;
ox	  Circle with x					&otimes;
o.	  Circle with dot				&CircleDot;
sum	  Sum for sub- and superscript			&sum;
prod	  Product for sub- and superscript		&prod;
^^	  Logic "and"					&and;
^^^	  Logic "and" for sub- and superscript 		&Wedge;
vv	  Logic "or"					&or;
vvv	  Logic "or" for sub- and superscript		&Vee;
nn	  Logic "intersect"				&cap;
nnn	  Logic "intersect" for sub- and superscript	&Intersection;
uu	  Logic "union"					&cup;
uuu	  Logic "union" for sub- and superscript	&Union;

Relation symbols

Type	  Description 					Entity
=	  =						=
!=	  Not equals					&ne;
<	  <						&lt;
>	  >						&gt;
<=	  Less than or equal				&le;
>=	  Greater than or equal				&ge;
-<	  Precedes					&Precedes;
>-	  Succeeds					&Succeeds;
in	  Element of					&isin;
!in	  Not an element of				&notin;
sub	  Subset					&sub;
sup	  Superset					&sup;
sube	  Subset or equal				&sube;
supe	  Superset or equal				&supe;
-=	  Equivalent					&equiv;
~=	  Congruent to					&cong;
~~	  Asymptotically equal to			&asymp;
prop	  Proportional to				&prop;

Logical symbols

Type	  Description 					Entity
and	  And						" and "
or	  Or						" or "
not	  Not						&not;
=>	  Implies					&rArr;
if	  If						" if "
iff	  If and only if				&hArr;
AA	  For all					&forall;
EE	  There exists					&exist;
_|_	  Perpendicular, bottom				&perp;
TT	  Top						&DownTee;
|--	  Right tee					&RightTee;
|==	  Double right tee				&DoubleRightTee;

Grouping brackets

Type	  Description 					Entity
(	  (						(
)	  )						)
[	  [						[
]	  ]						]
{	  {						{
}	  }						}
(:	  Left angle bracket				&lang;
:)	  Right angle bracket				&rang;
{:	  Invisible left grouping element
:}	  Invisible right grouping element

Miscellaneous symbols

Type	  Description 					Entity
int	  Integral					&int;
oint	  Contour integral				&ContourIntegral;
del	  Partial derivative				&del;
grad	  Gradient					&nabla;
+-	  Plus or minus					&plusmn;
O/	  Null set					&empty;
oo       Infinity					&infin;
aleph	  Hebrew letter aleph				&alefsym;
/_	  Angle						&ang;
:.	  Therefore					&there4;
...	  Ellipsis					...
cdots	  Three centered dots				&ctdot;
\<sp>    Non-breaking space (<sp> means space)		&nbsp;
quad	  Quad space					&nbsp;&nbsp;
diamond  Diamond					&Diamond;
square	  Square					&Square;
|__	  Left floor					&lfloor;
__|	  Right floor					&rfloor;
|~	  Left ceiling					&lceil;
~|	  Right ceiling					&rceil;
CC	  Complex numbers				&Copf;
NN	  Natural numbers				&Nopf;
QQ	  Rational numbers				&Qopf;
RR	  Real numbers					&Ropf;
ZZ	  Integers					&Zopf;

Arrows

Type	  Description 					Entity
uarr	  Up arrow					&uarr;
darr	  Down arrow					&darr;
rarr	  Right arrow					&rarr;
->	  Right arrow					&rarr;
larr	  Left arrow					&larr;
harr     Horizontal (two-way) arrow			&harr;
rArr	  Right double arrow				&rArr;
lArr	  Left double arrow				&lArr;
hArr	  Horizontal double arrow			&hArr;

Accents

Type	 Description	     Output
hat x	 Hat over x	     <mover><mi>x</mi><mo>^</mo></mover>
bar x	 Bar over x	     <mover><mi>x</mi><mo>&macr;</mo></mover>
ul x	 Underbar under x    <munder><mi>x</mi><mo>&UnderBar;</mo></munder>
vec x	 Right arrow over x  <mover><mi>x</mi><mo>&rarr;</mo></mover>
dot x	 Dot over x	     <mover><mi>x</mi><mo>.</mo></mover>
ddot x	 Double dot over x   <mover><mi>x</mi><mo>..</mo></mover>
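
The accent table maps naturally onto a lookup from accent name to wrapper element and accent character. The sketch below is illustrative only; the hash %ACCENT and the function accent_markup are hypothetical names, and the base is assumed to be translated already.

```perl
use strict;
use warnings;

# Accent name => [wrapper element, accent character], taken from the table.
my %ACCENT = (
    hat  => [ 'mover',  '^' ],
    bar  => [ 'mover',  '&macr;' ],
    ul   => [ 'munder', '&UnderBar;' ],
    vec  => [ 'mover',  '&rarr;' ],
    dot  => [ 'mover',  '.' ],
    ddot => [ 'mover',  '..' ],
);

# Wrap the base markup in the accent's element.
# Returns nothing for names that are not in the table.
sub accent_markup {
    my ($name, $base) = @_;
    my $entry = $ACCENT{$name} or return;
    my ($tag, $mark) = @$entry;
    return "<$tag>$base<mo>$mark</mo></$tag>";
}

print accent_markup('vec', '<mi>x</mi>'), "\n";
# <mover><mi>x</mi><mo>&rarr;</mo></mover>
```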

Font commands

Type	  Description
bb A	  Bold A
bbb A	  Double-struck A
cc A	  Calligraphic (script) A
tt A	  Teletype (monospace) A
fr A	  Fraktur A
sf A	  Sans-serif A

Defining new commands and symbols

It is possible to define new commands and symbols using the 'newcommand' and 'newsymbol' binary operators. The former defines a macro that gets expanded and reparsed as ASCIIMathML, and the latter defines a constant that gets used as a math operator (<mo>) element. Both of the arguments must be text, optionally enclosed in grouping operators. The 'newsymbol' operator also allows the second argument to be a group of two text strings, where the first is the MathML operator and the second is the LaTeX code to be output.

For example, 'newcommand "DDX" "{:d/dx:}"' would define a new command 'DDX'. It could then be invoked like 'DDXf(x)', which would expand to '{:d/dx:}f(x)'. The text 'newsymbol{"!le"}{"&#x2270;"}' could be used to create a symbol you could invoke with '!le', as in 'a !le b'.
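
The expand-then-reparse behavior of 'newcommand' can be pictured as plain text substitution. This is a simplified sketch, not the module's mechanism: the %macros table and the expand_macros function are hypothetical, and the sketch does not guard against a macro body that contains another macro name.

```perl
use strict;
use warnings;

# Hypothetical macro table, as 'newcommand "DDX" "{:d/dx:}"' might fill it.
my %macros = ('DDX' => '{:d/dx:}');

# Replace each macro name by its body; the result would then be reparsed
# as ASCIIMathML.
sub expand_macros {
    my ($text) = @_;
    # Longest names first so a short name cannot shadow a longer one.
    for my $name (sort { length($b) <=> length($a) } keys %macros) {
        $text =~ s/\Q$name\E/$macros{$name}/g;
    }
    return $text;
}

print expand_macros('DDXf(x)'), "\n";   # {:d/dx:}f(x)
```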

Attributes for <math>

title

The title attribute for the element, if specified. In many browsers, this string will appear if you hover over the MathML markup.

id

The id attribute for the element, if specified.

class

The class attribute for the element, if specified.

Attributes for <mstyle>

displaystyle

The displaystyle attribute for the element, if specified. One of the values "true" or "false". If the displaystyle is false, then fractions are represented with a smaller font size and the placement of subscripts and superscripts of sums and integrals changes.

mathvariant

The mathvariant attribute for the element, if specified. One of the values "normal", "bold", "italic", "bold-italic", "double-struck", "bold-fraktur", "script", "bold-script", "fraktur", "sans-serif", "bold-sans-serif", "sans-serif-italic", "sans-serif-bold-italic", or "monospace".

mathsize

The mathsize attribute for the element, if specified. Either "small", "normal" or "big", or of the form "number v-unit".

mathfamily

A string representing the font family.

mathcolor

The mathcolor attribute for the element, if specified. It should be in one of the forms "#rgb" or "#rrggbb", or an html-color-name.

mathbackground

The mathbackground attribute for the element, if specified. It should be in one of the forms "#rgb" or "#rrggbb", or an html-color-name, or the keyword "transparent".

METHODS

Text::ASCIIMathML

TextToMathML($text, [$math_attr], [$mstyle_attr])

Converts $text to a MathML string. If the optional $math_attr argument is provided, it should be a reference to an array of attribute/value pairs for the <math> node. If the optional $mstyle_attr argument is provided, it should be a reference to an array of attribute/value pairs for the <mstyle> node.

TextToMathMLTree($text, [$math_attr], [$mstyle_attr])

Like TextToMathML except that instead of returning a string, it returns a Text::ASCIIMathML::Node representing the parsed MathML structure.

Text::ASCIIMathML::Node

text

Returns a MathML string representing the parsed MathML structure encoded by the Text::ASCIIMathML::Node.

latex

Returns a LaTeX string representing the parsed MathML structure encoded by the Text::ASCIIMathML::Node.

BUGS AND SUGGESTIONS

If you find bugs, think of anything that could improve Text::ASCIIMathML or have any questions related to it, feel free to contact the author.

AUTHOR

Mark Nodine <mnodine@alum.mit.edu>

SEE ALSO

MathML::Entities, 
<http://www1.chapman.edu/~jipsen/mathml/asciimathsyntax.xml>

ACKNOWLEDGEMENTS

This Perl module has been created by modifying Peter Jipsen's ASCIIMathML.js script. He deserves full credit for the original implementation; any bugs have probably been introduced by me.

COPYRIGHT

The Text::ASCIIMathML module is copyright (c) 2006 Mark Nodine, USA. All rights reserved.

You may use and distribute them under the terms of either the GNU General Public License or the Artistic License, as specified in the Perl README file.
