MicroLaTeX
The MicroLaTeX parser first transforms source text into a list of primitive LaTeX blocks using a shift-reduce parser with error recovery. It then maps the parser for the internal language over this list, producing a list of expression blocks.
Data Structures
Primitive Blocks
A primitive LaTeX block is a 13-field record, as displayed below.
-- Parser.PrimitiveLaTeXBlock
type alias PrimitiveLaTeXBlock =
    { indent : Int
    , lineNumber : Int
    , position : Int
    , level : Int
    , content : List String
    , firstLine : String
    , name : Maybe String
    , args : List String
    , properties : Dict String String
    , sourceText : String
    , blockType : PrimitiveBlockType
    , status : Status
    , error : Maybe PrimitiveBlockError
    }
- When the source text is parsed into a list of blocks, the lines belonging to each block are stored as a list of strings in the content field.
- The content is also stored as a single string in the sourceText field. This facilitates synchronization of source text and rendered text. The sourceText field is carried into the final syntax tree (a forest of expression blocks). Consequently, if a piece of source text is selected, the syntax tree can be searched, the matching element located, and that element used to highlight the corresponding part of the rendered text.
- The indent field is the number of spaces of indentation of the first line of the block; lineNumber is the line number of the first line of the block in the source string; position is its character position in that string. The level field is the depth of the block in the eventual tree structure.
- Blocks may be unnamed, as in the case of a paragraph, or named, as in the case of a LaTeX environment. This information is stored in the name field. For example, the name of the block \begin{theorem} ... \end{theorem} is Just "theorem".
- The firstLine field is the first line of a block, i.e., its header. If the header of a block is "\begin{theorem}[Pythagoras]", then its name is "theorem" and args is the list ["Pythagoras"]. If the header is "\begin{theorem}[Pythagoras, foo:bar]", then args is still ["Pythagoras"], and properties is a dictionary with one key, "foo", whose value is "bar". Thus args is the list of unnamed arguments and properties is the dictionary of key-value pairs derived from the named arguments.
- The blockType field has type
    type PrimitiveBlockType = PBVerbatim | PBOrdinary | PBParagraph
  It describes the kind of block: an unnamed block (PBParagraph), an environment like "theorem" in which the body of the block is parsed (PBOrdinary), or an environment like "equation" whose body is passed verbatim to the renderer (PBVerbatim).
- The status field has type
    type Status = Finished | Started | Filled
  It is used by the primitive block parser and is needed to handle nested blocks.
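The split of bracketed header arguments into args and properties described above can be sketched as follows. This is a minimal illustration, not the parser's actual code: the module name ArgsSketch and the function splitArgs are hypothetical, and the rule "an item containing a colon is a named argument" is an assumption inferred from the "foo:bar" example.

```elm
module ArgsSketch exposing (splitArgs)

import Dict exposing (Dict)


{-| Hypothetical helper (not taken from the MicroLaTeX source).
Takes the text between the square brackets of a block header and
returns the unnamed args together with the key-value properties.
Assumption: an item containing a colon is a named argument.
-}
splitArgs : String -> ( List String, Dict String String )
splitArgs bracketContent =
    let
        -- Comma-separated items, trimmed, with empty items dropped
        items =
            bracketContent
                |> String.split ","
                |> List.map String.trim
                |> List.filter (\item -> item /= "")

        -- "key:value" becomes Just ( key, value ); anything else is Nothing
        toProperty item =
            case String.split ":" item of
                key :: first :: rest ->
                    Just ( String.trim key, String.trim (String.join ":" (first :: rest)) )

                _ ->
                    Nothing

        unnamed =
            List.filter (\item -> not (String.contains ":" item)) items

        named =
            List.filterMap toProperty items
    in
    ( unnamed, Dict.fromList named )
```

With this sketch, splitArgs "Pythagoras, foo:bar" evaluates to ( [ "Pythagoras" ], Dict.fromList [ ( "foo", "bar" ) ] ), matching the example in the list above.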
Primitive Block Parser
The parser is defined in module Parser.PrimitiveLaTeXBlock. Lists of lines of text are parsed into lists of primitive blocks by the function
-- Parser.PrimitiveLaTeXBlock
parse : List String -> List PrimitiveLaTeXBlock
parse lines =
    lines |> parseLoop |> .blocks
The strategy is to examine each line in turn, committing the current block if its matching end tag is found, and otherwise pushing it onto a stack of blocks. All blocks are moved from the stack to the committed list when the "root" (first) block on the stack and all of its children are closed. If the stack is nonempty after all input lines have been consumed, there has been a syntax error, and so the error recovery procedure is invoked.
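The stack discipline described above can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: blocks are reduced to their environment names, body text is ignored, and a mismatched end tag is simply skipped. The names StackSketch, Entry, Toy, step, and run are hypothetical and do not appear in the parser.

```elm
module StackSketch exposing (Toy, run)

-- A stack entry: the environment name and whether its end tag was seen
type alias Entry =
    { name : String, closed : Bool }


-- committed: names of committed blocks; stack: pending blocks, innermost first
type alias Toy =
    { committed : List String, stack : List Entry }


-- Extract "theorem" from "\begin{theorem}..." given the prefix "\begin{"
envName : String -> String -> Maybe String
envName prefix line =
    let
        trimmed =
            String.trim line
    in
    if String.startsWith prefix trimmed then
        trimmed
            |> String.dropLeft (String.length prefix)
            |> String.split "}"
            |> List.head

    else
        Nothing


-- Mark the innermost still-open entry as closed if its name matches
closeInnermost : String -> List Entry -> List Entry
closeInnermost name stack =
    case stack of
        [] ->
            []

        entry :: rest ->
            if entry.closed then
                entry :: closeInnermost name rest

            else if entry.name == name then
                { entry | closed = True } :: rest

            else
                stack


step : String -> Toy -> Toy
step line toy =
    case ( envName "\\begin{" line, envName "\\end{" line ) of
        ( Just name, _ ) ->
            -- Shift: a newly opened block goes onto the stack
            { toy | stack = { name = name, closed = False } :: toy.stack }

        ( Nothing, Just name ) ->
            let
                updated =
                    closeInnermost name toy.stack
            in
            if List.all .closed updated then
                -- Reduce: the root and all its children are closed, so commit them
                { committed = toy.committed ++ List.reverse (List.map .name updated)
                , stack = []
                }

            else
                { toy | stack = updated }

        ( Nothing, Nothing ) ->
            toy


run : List String -> Toy
run lines =
    List.foldl step { committed = [], stack = [] } lines
```

In this model a nonempty stack in the result of run corresponds to the syntax-error case in which the real parser invokes error recovery.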
Data structure
-- Parser.PrimitiveLaTeXBlock
-- 14 fields
type alias State =
    { blocks : List PrimitiveLaTeXBlock
    , stack : List PrimitiveLaTeXBlock
    , holdingStack : List PrimitiveLaTeXBlock
    , labelStack : List Label
    , lines : List String
    , sourceText : String
    , firstBlockLine : Int
    , indent : Int
    , level : Int
    , lineNumber : Int
    , position : Int
    , verbatimClassifier : Maybe Classification
    , count : Int
    , label : String
    }
where
type alias Label =
    { classification : ClassifyBlock.Classification
    , level : Int
    , status : Status
    , lineNumber : Int
    }
- The blocks field holds the committed blocks, the eventual output of the parser.
- The stack field holds the blocks that have been pushed but not yet committed.
Main parsing functions
parse : List String -> List PrimitiveLaTeXBlock
parse lines =
    lines |> parseLoop |> .blocks
and
parseLoop : List String -> ParserOutput
parseLoop lines =
    loop (init lines) nextStep |> finalize
where
type alias ParserOutput =
    { blocks : List PrimitiveLaTeXBlock
    , stack : List PrimitiveLaTeXBlock
    , holdingStack : List PrimitiveLaTeXBlock
    }
The nextStep function
This is the driver function for the parser's functional loop. It operates as follows:
- Increment state.lineNumber.
- If the input (state.lines) has been consumed and
    - the stack is empty, return Done state;
    - the stack is nonempty, return recoverFromError state.
- Otherwise, let the current raw line be the string at index state.lineNumber of state.lines.
- Classify the raw line, producing a value of type Classification:
    type Classification
        = CBeginBlock String
        | CEndBlock String
        | CSpecialBlock LXSpecial
        | CMathBlockDelim
        | CVerbatimBlockDelim
        | CPlainText
        | CEmpty
- Invoke a handler, chosen according to the classification, that returns a value of type Step State State.
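The functional loop that drives nextStep can be sketched as below. The Done constructor is named in the steps above; the Loop constructor, the exact definition of loop, and the Counter example are assumptions made for illustration, following the common Elm pattern for this kind of driver.

```elm
module LoopSketch exposing (Step(..), countNonEmpty, loop)

-- Either continue with a new state or stop with a result
type Step state a
    = Loop state
    | Done a


-- Apply the step function repeatedly until it returns Done
loop : state -> (state -> Step state a) -> a
loop state next =
    case next state of
        Loop newState ->
            loop newState next

        Done result ->
            result


-- Hypothetical toy state, much smaller than the parser's State record
type alias Counter =
    { lines : List String, lineNumber : Int, nonEmpty : Int }


-- A toy step function with the same shape as the parser's nextStep:
-- stop when the input is consumed, otherwise handle one line
nextStep : Counter -> Step Counter Counter
nextStep state =
    case state.lines of
        [] ->
            Done state

        line :: rest ->
            Loop
                { lines = rest
                , lineNumber = state.lineNumber + 1
                , nonEmpty =
                    if String.trim line == "" then
                        state.nonEmpty

                    else
                        state.nonEmpty + 1
                }


countNonEmpty : List String -> Int
countNonEmpty lines =
    loop { lines = lines, lineNumber = 0, nonEmpty = 0 } nextStep |> .nonEmpty
```

Because the recursive call in loop is in tail position, the Elm compiler can turn it into a plain loop, so long inputs do not grow the call stack.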
New blocks are constructed by nextStep using
blockFromLine : Int -> Line -> PrimitiveLaTeXBlock
The primitive block type (PBVerbatim, PBOrdinary, PBParagraph) and the label (in the case of the first two variants) are determined by examining the contents of the line. For example, if the line is "\begin{equation}", then the primitive block type is PBVerbatim and the label is "equation". The label is used to run the parser loop; when a block is committed, the label is used to form the name : Maybe String field of the primitive block. This field is Nothing in the case of PBParagraph and is a Just String in the case of the other two block types.
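How the block type and label might be read off a line can be sketched as follows. This is an illustration only: the real blockFromLine takes an Int and a Line and builds a full PrimitiveLaTeXBlock, whereas the hypothetical typeAndLabel below works on a raw String and returns just the two pieces of information discussed above. The list verbatimNames is a short excerpt, chosen here for brevity, of the verbatim block names used by the parser.

```elm
module BlockTypeSketch exposing (PrimitiveBlockType(..), typeAndLabel)


type PrimitiveBlockType
    = PBVerbatim
    | PBOrdinary
    | PBParagraph


-- Excerpt only; the parser's own list of verbatim block names is longer
verbatimNames : List String
verbatimNames =
    [ "equation", "aligned", "math", "code", "verbatim" ]


-- Extract the environment name from a line starting with "\begin{"
beginLabel : String -> Maybe String
beginLabel line =
    let
        trimmed =
            String.trim line

        prefix =
            "\\begin{"
    in
    if String.startsWith prefix trimmed then
        trimmed
            |> String.dropLeft (String.length prefix)
            |> String.split "}"
            |> List.head

    else
        Nothing


-- Hypothetical helper: decide the block type and the label for one line
typeAndLabel : String -> ( PrimitiveBlockType, Maybe String )
typeAndLabel line =
    case beginLabel line of
        Nothing ->
            ( PBParagraph, Nothing )

        Just label ->
            if List.member label verbatimNames then
                ( PBVerbatim, Just label )

            else
                ( PBOrdinary, Just label )
```

For instance, typeAndLabel "\\begin{equation}" gives ( PBVerbatim, Just "equation" ), while a line of plain text gives ( PBParagraph, Nothing ).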
Error recovery
Recall that error recovery is invoked when the stack is nonempty after all input has been consumed. The recovery strategy is to commit the root block on the stack, setting the error field to missingTagError block, then reparse the input starting from the line immediately after that of the offending block. Consequently, error recovery is guaranteed to terminate and also to deal with additional errors. Error recovery is handled by
recoverFromError : State -> State
Transform
Module MicroLaTeX.Parser.Transform
The purpose of the transform function is to convert a primitive block such as the one coming from a single-line paragraph with text "\section{Intro}" into an ordinary block (blockType PBOrdinary) with name "section", args ["1"], and content ["Intro"]. This coerces parsed MicroLaTeX source to our standard model.
Verbatim Blocks
If a block is to be treated as a verbatim block, its name must appear in
Parser.LaTeXBlock.verbatimBlockNames : List String
Note. What's the deal with Parser.Common.verbatimBlockNames?
verbatimBlockNames =
    [ "equation"
    , "aligned"
    , "math"
    , "code"
    , "verbatim"
    , "verse"
    , "mathmacros"
    , "textmacros"
    , "tabular"
    , "hide"
    , "docinfo"
    , "datatable"
    , "chart"
    , "svg"
    , "quiver"
    , "image"
    , "tikz"
    , "load-files"
    , "include"
    , "iframe"
    ]
Tests
Test parsing of text to a list of primitive blocks:
-- MicroLaTeXParserTest
primitiveBlockRoundTripTest "nested environments" text1
Test the internal language:
-- MicroLaTeXParserTest
roundTrip1 "\\blue{\\italic{abc \\strong{def}}}"
Test coercion of MicroLaTeX macros to blocks:
-- TransformLaTeXTest
test_ "tags" (toL0 [ "\\tags{AAA}" ]) [ "| tags AAA " ]
-- TransformTest
test_ "transform, args"
    (toPrimitiveBlocks "\n\n\\section{Intro}\n\n" |> List.map transform |> List.map .args)
    [ [ "1" ] ]
where
toPrimitiveBlocks =
    Markup.toPrimitiveBlocks MicroLaTeXLang
Command line tools
The ./CLI folder contains various CLI tools for testing and benchmarking. All use Albert Dahlin's elm-posix package and can be run using velociraptor (command: vr).
Some examples:
- vr lxpb lxtest/a1.txt
- vr rt foo.txt
- vr bench init 100 bench/harmonic.tex