MicroLaTeX
The MicroLaTeX parser first transforms source text into a list of primitive LaTeX blocks using a shift-reduce parser with error recovery. It then maps the parser for the internal language over this list, producing a list of expression blocks.
Data Structures
Primitive Blocks
A primitive LaTeX block is a 13-field record, as displayed below.
-- Parser.PrimitiveLaTeXBlock
type alias PrimitiveLaTeXBlock =
    { indent : Int
    , lineNumber : Int
    , position : Int
    , level : Int
    , content : List String
    , firstLine : String
    , name : Maybe String
    , args : List String
    , properties : Dict String String
    , sourceText : String
    , blockType : PrimitiveBlockType
    , status : Status
    , error : Maybe PrimitiveBlockError
    }
- When the source text is parsed into a list of blocks, the lines of each block are stored as a list of strings in the content field.
- The content is also stored as a single string in the sourceText field. This facilitates synchronization of source text and rendered text. The sourceText field is carried into the final syntax tree (a forest of expression blocks). Consequently, if a piece of source text is selected, the syntax tree can be searched, the matching element located, and that element used to highlight the corresponding part of the rendered text.
- The indent field is the number of spaces of indentation of the first line of the block; lineNumber is the line number of the first line of the block in the source string; position is its character position in that string. The level field is the depth of the block in the eventual tree structure.
- Blocks may be unnamed, as in the case of a paragraph, or named, as in the case of a LaTeX environment. This information is stored in the name field. For example, the name of the block \begin{theorem} ... \end{theorem} is Just "theorem".
- The firstLine field is the first line of a block, i.e., its header. If the header of a block is "\begin{theorem}[Pythagoras]", then its name is Just "theorem" and args is the list ["Pythagoras"]. If the header is "\begin{theorem}[Pythagoras, foo:bar]", then args is still ["Pythagoras"], and properties is a dictionary with one key, "foo", whose value is "bar". Thus args is the list of unnamed arguments, and properties is a dictionary of key-value pairs derived from the named (key:value) arguments.
- The blockType field has type
  type PrimitiveBlockType = PBVerbatim | PBOrdinary | PBParagraph
  It describes the kind of block: unnamed (PBParagraph), an environment like "theorem" in which the body of the block is parsed (PBOrdinary), or an environment like "equation" whose body is passed verbatim to the renderer (PBVerbatim).
- The status field has type
  type Status = Finished | Started | Filled
  It is used by the primitive block parser and is needed to handle nested blocks.
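To make the distinction between args and properties concrete, here is a minimal sketch of how the text between the square brackets of a header might be split. The module name and the function splitArgs are hypothetical and are not taken from the library; the sketch assumes that named arguments have the form key:value and that items are separated by commas.

```elm
module ArgsSketch exposing (splitArgs)

import Dict exposing (Dict)


{-| Hypothetical helper (not the library's code): split the bracket text of a
block header into unnamed args and a dictionary of key:value properties.
-}
splitArgs : String -> ( List String, Dict String String )
splitArgs bracketText =
    let
        -- Comma-separated items, trimmed, with empty items dropped
        items =
            bracketText
                |> String.split ","
                |> List.map String.trim
                |> List.filter (\s -> s /= "")

        -- An item of the form key:value is treated as a named argument
        toProperty item =
            case String.split ":" item of
                [ key, value ] ->
                    Just ( String.trim key, String.trim value )

                _ ->
                    Nothing

        isNamed item =
            toProperty item /= Nothing
    in
    ( List.filter (\item -> not (isNamed item)) items
    , items |> List.filterMap toProperty |> Dict.fromList
    )
```

Under these assumptions, splitArgs "Pythagoras, foo:bar" yields the args ["Pythagoras"] and a dictionary mapping "foo" to "bar", matching the example above.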
Primitive Block Parser
The parser is defined in module Parser.PrimitiveLaTeXBlock. Lists of lines of text are parsed into lists of primitive blocks by the function
-- Parser.PrimitiveLaTeXBlock
parse : List String -> List PrimitiveLaTeXBlock
parse lines =
    lines |> parseLoop |> .blocks
The strategy is to examine each line in turn, committing the current block if its matching end tag is found, otherwise pushing it onto a stack of blocks. All blocks are moved from the stack to the committed list when the "root" (first) block on the stack and all of its children are closed. If the stack is nonempty after all lines have been consumed, there has been a syntax error, and so the error recovery procedure is invoked.
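The stack discipline described above can be illustrated with a deliberately simplified sketch. It is not the library's implementation: it tracks only environment names, handles only begin and end lines, and all type and function names (Line, Entry, State, step, run) are hypothetical.

```elm
module ShiftReduceSketch exposing (Entry, Line(..), State, Status(..), run)


type Line
    = Begin String
    | End String


type Status
    = Started
    | Finished


type alias Entry =
    { name : String, status : Status }


type alias State =
    { committed : List String, stack : List Entry }


{-| Mark the most recently started block with the given name as finished.
-}
closeFirst : String -> List Entry -> List Entry
closeFirst name stack =
    case stack of
        [] ->
            []

        entry :: rest ->
            if entry.name == name && entry.status == Started then
                { entry | status = Finished } :: rest

            else
                entry :: closeFirst name rest


step : Line -> State -> State
step line state =
    case line of
        Begin name ->
            -- Shift: push a newly started block onto the stack
            { state | stack = { name = name, status = Started } :: state.stack }

        End name ->
            let
                newStack =
                    closeFirst name state.stack
            in
            if List.all (\e -> e.status == Finished) newStack then
                -- Reduce: root and children are all closed, so commit them
                { committed = state.committed ++ List.reverse (List.map .name newStack)
                , stack = []
                }

            else
                { state | stack = newStack }


{-| A nonempty stack in the result signals a missing end tag.
-}
run : List Line -> State
run lines =
    List.foldl step { committed = [], stack = [] } lines
```

In this sketch a well-nested input leaves the stack empty, while an input with a missing end tag leaves the unclosed blocks on the stack, which is the condition under which the real parser invokes error recovery.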
Data structure
-- Parser.PrimitiveLaTeXBlock
-- 14 fields
type alias State =
    { blocks : List PrimitiveLaTeXBlock
    , stack : List PrimitiveLaTeXBlock
    , holdingStack : List PrimitiveLaTeXBlock
    , labelStack : List Label
    , lines : List String
    , sourceText : String
    , firstBlockLine : Int
    , indent : Int
    , level : Int
    , lineNumber : Int
    , position : Int
    , verbatimClassifier : Maybe Classification
    , count : Int
    , label : String
    }
where
type alias Label =
    { classification : ClassifyBlock.Classification
    , level : Int
    , status : Status
    , lineNumber : Int
    }
- The blocks field holds the committed blocks, the eventual output of the parser.
- The stack field holds the blocks that have been started but not yet committed.
Main parsing functions
parse : List String -> List PrimitiveLaTeXBlock
parse lines =
    lines |> parseLoop |> .blocks
and
parseLoop : List String -> ParserOutput
parseLoop lines =
    loop (init lines) nextStep |> finalize
where
type alias ParserOutput =
    { blocks : List PrimitiveLaTeXBlock
    , stack : List PrimitiveLaTeXBlock
    , holdingStack : List PrimitiveLaTeXBlock
    }
The nextStep function
This is the driver function for the parser's functional loop. It operates as follows:
- Increment state.lineNumber.
- If the input (state.lines) has been consumed and
  - the stack is empty, return Done state;
  - the stack is nonempty, return recoverFromError state.
- Otherwise, let the current raw line be the string at index state.lineNumber of state.lines.
- Classify the raw line, producing a value of type Classification:
  type Classification
      = CBeginBlock String
      | CEndBlock String
      | CSpecialBlock LXSpecial
      | CMathBlockDelim
      | CVerbatimBlockDelim
      | CPlainText
      | CEmpty
- Invoke a handler based on the classification. The handler returns a value of type Step State State.
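The functional loop that drives nextStep can be sketched as follows. The definitions of Step and loop shown here are an assumption, inferred from the call loop (init lines) nextStep and the return type Step State State; the toy State and nextStep are hypothetical and merely count nonempty lines. Unlike the real parser, the toy version consumes the list of lines rather than indexing into it.

```elm
module LoopSketch exposing (State, Step(..), loop, nextStep)


type Step state a
    = Loop state
    | Done a


{-| Apply the step function repeatedly until it returns Done.
-}
loop : state -> (state -> Step state a) -> a
loop state next =
    case next state of
        Loop newState ->
            loop newState next

        Done result ->
            result


{-| Toy state for illustration only.
-}
type alias State =
    { lines : List String
    , lineNumber : Int
    , nonEmpty : Int
    }


nextStep : State -> Step State State
nextStep state =
    case state.lines of
        [] ->
            -- Input consumed: stop the loop
            Done state

        line :: rest ->
            Loop
                { state
                    | lines = rest
                    , lineNumber = state.lineNumber + 1
                    , nonEmpty =
                        if String.trim line == "" then
                            state.nonEmpty

                        else
                            state.nonEmpty + 1
                }
```

Because loop is tail-recursive, the Elm compiler turns it into a while loop, so long inputs do not grow the call stack.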
New blocks are constructed by nextStep using
blockFromLine : Int -> Line -> PrimitiveLaTeXBlock
The primitive block type (PBVerbatim, PBOrdinary, PBParagraph) and the label (in the case of the first two variants) are determined by examining the contents of the line. For example, if the line is "\begin{equation}", then the primitive block type is PBVerbatim and the label is "equation". The label is used to run the parser loop; when a block is committed, the label is used to form the name : Maybe String field of the primitive block. This field is Nothing in the case of PBParagraph and is a Just String in the case of the other two block types.
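As an illustration of how the block type and label can be read off a line, here is a minimal sketch. It is not the library's blockFromLine: the functions envName and blockTypeAndLabel are hypothetical, only \begin{...} headers are recognized, and the list of verbatim names is a short excerpt used for the example.

```elm
module BlockTypeSketch exposing (PrimitiveBlockType(..), blockTypeAndLabel)


type PrimitiveBlockType
    = PBVerbatim
    | PBOrdinary
    | PBParagraph


{-| Excerpt of verbatim environment names, for illustration only.
-}
verbatimNames : List String
verbatimNames =
    [ "equation", "aligned", "math", "code", "verbatim" ]


{-| Extract "theorem" from a line such as "\\begin{theorem}[Pythagoras]".
-}
envName : String -> Maybe String
envName line =
    let
        prefix =
            "\\begin{"
    in
    if String.startsWith prefix line then
        line
            |> String.dropLeft (String.length prefix)
            |> String.split "}"
            |> List.head

    else
        Nothing


{-| Hypothetical helper: the block type and the optional label of a line.
-}
blockTypeAndLabel : String -> ( PrimitiveBlockType, Maybe String )
blockTypeAndLabel rawLine =
    case envName (String.trim rawLine) of
        Just name ->
            if List.member name verbatimNames then
                ( PBVerbatim, Just name )

            else
                ( PBOrdinary, Just name )

        Nothing ->
            -- Anything that is not an environment header starts a paragraph
            ( PBParagraph, Nothing )
```

Here the line "\\begin{equation}" gives PBVerbatim with label "equation", while a line of plain text gives PBParagraph with no label.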
Error recovery
Recall that error recovery is invoked when the stack is nonempty after all input has been consumed. The recovery strategy is to commit the root block on the stack, setting its error field to missingTagError block, and then to reparse the input starting from the line immediately after that of the offending block. Because each recovery pass commits a block and resumes strictly later in the input, error recovery is guaranteed to terminate, and any additional errors are dealt with in the same way. Error recovery is handled by
recoverFromError : State -> State
Transform
Module MicroLaTeX.Parser.Transform
The purpose of the transform function is to convert a primitive block such as the one coming from a single-line paragraph with text "\section{Intro}" into an ordinary block (blockType PBOrdinary) with name "section", args ["1"], and content ["Intro"]. This coerces parsed MicroLaTeX source to our standard model.
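A minimal sketch of this coercion, for the section macro only, might look as follows. The record type MiniBlock and the helper parseMacro are hypothetical stand-ins: the real transform operates on PrimitiveLaTeXBlock and presumably handles more macros than the single case shown.

```elm
module TransformSketch exposing (MiniBlock, parseMacro, transform)


{-| Hypothetical cut-down block with only the fields needed here.
-}
type alias MiniBlock =
    { name : Maybe String
    , args : List String
    , content : List String
    }


{-| Split a line like "\\section{Intro}" into ( "section", "Intro" ).
-}
parseMacro : String -> Maybe ( String, String )
parseMacro line =
    let
        trimmed =
            String.trim line
    in
    if String.startsWith "\\" trimmed && String.endsWith "}" trimmed then
        case String.indices "{" trimmed of
            i :: _ ->
                Just
                    ( String.slice 1 i trimmed
                    , String.slice (i + 1) (String.length trimmed - 1) trimmed
                    )

            [] ->
                Nothing

    else
        Nothing


{-| Coerce a single-line "\\section{...}" paragraph to a named block.
Every other block is returned unchanged.
-}
transform : MiniBlock -> MiniBlock
transform block =
    case block.content of
        [ line ] ->
            case parseMacro line of
                Just ( "section", title ) ->
                    { name = Just "section", args = [ "1" ], content = [ title ] }

                _ ->
                    block

        _ ->
            block
```

Applied to a block with name Nothing and content ["\\section{Intro}"], this sketch returns a block with name Just "section", args ["1"], and content ["Intro"].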
Verbatim Blocks
If a block is to be treated as a verbatim block, its name must appear in
Parser.LaTeXBlock.verbatimBlockNames : List String
Note. What's the deal with Parser.Common.verbatimBlockNames?
verbatimBlockNames =
    [ "equation"
    , "aligned"
    , "math"
    , "code"
    , "verbatim"
    , "verse"
    , "mathmacros"
    , "textmacros"
    , "tabular"
    , "hide"
    , "docinfo"
    , "datatable"
    , "chart"
    , "svg"
    , "quiver"
    , "image"
    , "tikz"
    , "load-files"
    , "include"
    , "iframe"
    ]
Tests
Test parsing of text to a list of primitive blocks:
-- MicroLaTeXParserTest
primitiveBlockRoundTripTest "nested environments" text1
Test the internal language:
-- MicroLaTeXParserTest
roundTrip1 "\\blue{\\italic{abc \\strong{def}}}"
Test coercion of MicroLaTeX macros to blocks:
-- TransformLaTeXTest
test_ "tags" (toL0 [ "\\tags{AAA}" ]) [ "| tags AAA " ]
-- TransformTest
test_ "transform, args"
    (toPrimitiveBlocks "\n\n\\section{Intro}\n\n" |> List.map transform |> List.map .args)
    [ [ "1" ] ]
where
toPrimitiveBlocks =
    Markup.toPrimitiveBlocks MicroLaTeXLang
Command line tools
The ./CLI folder contains various CLI tools for testing and benchmarking. All use Albert Dahlin's elm/posix package and can be run using velociraptor (command: vr). Some examples:
- vr lxpb lxtest/a1.txt
- vr rt foo.txt
- vr bench init 100 bench/harmonic.tex