mirror of
https://github.com/kuhyx/WUT_Computer_Science.git
synced 2026-07-04 14:23:07 +02:00
feat: remove inspirations folder
This commit is contained in:
parent
fc0b625df0
commit
fb40542b64
Binary file not shown.
@ -1,426 +0,0 @@
\documentclass{article}

\usepackage{graphicx}
\usepackage{pdfpages}
\usepackage{hyperref}
\setlength{\parskip}{1em}

\begin{document}

\title{ECOTE preliminary report:\\
Top-down parser with backtracking}
\author{Michał Szopiński 300182}
\date{May 11, 2022}
\maketitle
\section{General overview and assumptions}

The goal of this project is to write a program that parses an arbitrary
input file against an arbitrary grammar and produces its syntax tree.

The parsing is to be implemented using a top-down recursive descent
algorithm, i.e. one that attempts to find a combination of productions
matching the input token sequence, starting from the root production.
Backtracking means that the algorithm may abandon previously chosen
productions if it discovers that they cannot lead to a match.

Because a parser operates on tokens, which are produced during the lexical
analysis stage, the program must have a built-in lexer utility. To reduce
complexity, the lexeme recognition algorithm is hard-coded and not
customizable. The built-in lexer recognizes tokens that are common to
popular C-like languages.

As mentioned before, the program checks arbitrary inputs against arbitrary
grammars. This implies that the user supplies two files, one containing
the input and one containing a description of the grammar.

The program tokenizes both files using the built-in lexer and parses the
grammar description file using a hard-coded grammar description
meta-language. The produced syntax tree is then validated and transformed
into a grammar description object, which is in turn used to parse the input
file. As such, the same parser may be used to process both input files.

The program implements rudimentary diagnostics and error handling. In
particular, the user may receive lexical, parse and semantic errors at the
corresponding stage of processing. Changes in the syntax tree are also
displayed as they occur.
\section{Functional requirements}

The programming language of choice for this project is Python. Its dynamic
typing makes it suitable for straightforward operations on complex data
types. The previous proposal of using C/C++ has been withdrawn.

\subsection{Lexical analysis}

Because the lexical analyser is hard-coded, it must strive to resemble the
lexical ruleset of mainstream C-like languages, so as to match user
expectations. A set of popular token categories is defined:

\begin{center}
    \begin{tabular}{ |c|p{2.5cm}|p{6cm}| }
        \hline
        Category & Examples & Description \\
        \hline
        Identifier & \texttt{hello\_world123} & Used for variable names and keywords. \\
        \hline
        Operator & \texttt{\$ ++ ===} & Used to define multiple-character non-identifier entities. \\
        \hline
        Separator & \texttt{, ; ( \}} & Used to define single-character non-identifier entities, typically neighboring each other. \\
        \hline
        String literal & \texttt{"can't" 'won\textbackslash't'} & Incorporates rules for string enclosure and escaping. \\
        \hline
        Number literal & \texttt{123 +1.0} & Incorporates rules for digit sequences, sign prefixes and decimal points. \\
        \hline
        Comment & \texttt{//hello \newline /* world */} & Incorporates rules for single-line and multi-line comments. \\
        \hline
        Invalid & \texttt{123abc "hello} & Marks lexical errors. Used for diagnostics. \\
        \hline
        End of file & & Denotes the end of the input file. Used for grammar description. \\
        \hline
    \end{tabular}
\end{center}

\subsubsection{Scanning and evaluation}

Most of the above tokens are produced during the scanning phase. The
end-of-file token is appended at the end of the token sequence during the
evaluation phase. Comment tokens are removed from the sequence before they
reach the parser. The presence of invalid tokens prevents the program from
progressing to the parsing phase.
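The evaluation phase described above could be sketched as follows. This is only an illustration under stated assumptions: the \texttt{Token} tuple, the \texttt{LexicalError} exception, the \texttt{evaluate} function and the lowercase category strings are hypothetical names chosen for the example, not the project's actual interface.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token representation: a category and the matched text."""
    category: str  # e.g. "identifier", "comment", "invalid", "end_of_file"
    value: str


class LexicalError(Exception):
    """Hypothetical error raised when the scanned sequence has invalid tokens."""


def evaluate(scanned: list[Token]) -> list[Token]:
    """Post-process the scanner output before it is handed to the parser."""
    # Invalid tokens stop processing before the parsing phase.
    invalid = [t for t in scanned if t.category == "invalid"]
    if invalid:
        raise LexicalError(f"invalid tokens: {[t.value for t in invalid]}")
    # Comment tokens never reach the parser.
    kept = [t for t in scanned if t.category != "comment"]
    # The end-of-file token closes the sequence.
    kept.append(Token("end_of_file", ""))
    return kept
```

Checking for invalid tokens before filtering is a design choice of this sketch; it lets all lexical errors be reported together instead of stopping at the first one.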
\subsection{Grammar description meta-language}

Once the grammar description file has been tokenized using the universal
lexer, the program applies a predefined meta-grammar to parse the file into
a syntax tree for further processing.

At the top level, the meta-language is a set of definitions describing each
production in the language. The fundamental building blocks for definitions
are binary \textbf{compound expressions} and \textbf{terminal expressions}.

Compound expressions are the framework for backtracking recursive descent
logic. They accept two arguments and define the logical relation between
them. Three such expressions are defined:

\begin{enumerate}
    \item \textbf{Concatenation} - accepts if both arguments accept.
    \item \textbf{Optional concatenation} - accepts if either both or only
    the second argument accepts.
    \item \textbf{Alternative} - accepts if either argument accepts.
\end{enumerate}

Terminal expressions are used to describe the terminal symbols of the
language. Three kinds of terminal expressions are distinguished:

\begin{enumerate}
    \item \textbf{String literal} - accepts a token of any category whose
    value is equal to that enclosed in the literal.
    \item \textbf{Identifier}
    \begin{enumerate}
        \item \textbf{Reserved identifier} - an identifier belonging to the set
        \texttt{identifier}, \texttt{string\_literal}, \texttt{number\_literal},
        \texttt{end\_of\_file}.
        Accepts a token of any value belonging to the matching category.
        \item \textbf{Arbitrary identifier} - resolves to a different definition in the grammar.
    \end{enumerate}
\end{enumerate}

\subsubsection{Formal description of the meta-language}

The following is a formal description of the above rules, written as a
grammar description object using Python syntax:
\scriptsize\begin{verbatim}meta_grammar = {
    "root": Alternative(
        "definitions",
        Terminal("end_of_file")
    ),
    "definitions": Concatenation(
        "definition",
        Alternative(
            "definitions",
            Terminal("end_of_file")
        )
    ),
    "definition": Concatenation(
        "definition_key",
        Concatenation(
            Terminal("operator", "="),
            Concatenation(
                "definition_expression",
                Terminal("separator", ";")
            )
        )
    ),
    "definition_key": Terminal("identifier"),
    "definition_expression": "expression",
    "expression": Alternative(
        "concat_expression",
        Alternative(
            "opt_concat_expression",
            Alternative(
                "alt_expression",
                Alternative(
                    "expr_identifier",
                    "expr_string_literal"
                )
            )
        )
    ),
    "expr_identifier": Terminal("identifier"),
    "expr_string_literal": Terminal("string_literal"),
    "concat_expression": Concatenation(
        Terminal("identifier", "concat"),
        "argument"
    ),
    "opt_concat_expression": Concatenation(
        Terminal("identifier", "opt_concat"),
        "argument"
    ),
    "alt_expression": Concatenation(
        Terminal("identifier", "alt"),
        "argument"
    ),
    "argument": Concatenation(
        Terminal("separator", "("),
        Concatenation(
            "expr_arg1",
            Concatenation(
                Terminal("separator", ","),
                Concatenation(
                    "expr_arg2",
                    Terminal("separator", ")")
                )
            )
        )
    ),
    "expr_arg1": "expression",
    "expr_arg2": "expression"
}\end{verbatim}

\normalsize There are two additional semantic constraints: (1) there must
be a definition named \texttt{root}, and (2) there must not be any
definitions whose names belong to the set of reserved identifiers.
\subsection{Top-down parser}

The parser is the core feature of the software. It takes the root production
of the given grammar and attempts to find a set of productions stemming from
the root which could accept all the tokens in the sequence. It does so by
implementing the logical rules of the three compound expressions discussed
earlier.

Each step of the parser is a recursive call to a function which processes
a single binary or terminal production. If it is determined that the set of
logical rules for that production cannot yield a combination of productions
to parse the entire token sequence, the function raises an exception and
returns control to its caller.

Exceptions do not originate at compound productions; they are merely
propagated upwards by them. All exceptions stem from terminal productions at
the leaves of the production tree. A terminal symbol matches the current
token in the sequence against its signature and either increments the token
iterator (``accepts'' the token), or raises an error to be handled by the
logic of compound productions higher in the syntax tree.

Backtracking is achieved by remembering the state of the token iterator at
the initialization of a compound production. If one path fails to parse
the token sequence, the iterator is reset and a different path is tried.
If neither path succeeds, the error from the later path is propagated
upwards, where backtracking may occur as well. If both paths are exhausted
at the root level, the token sequence is declared unparseable.

The above algorithm merely checks the validity of the token sequence against
the grammar. To build a parse tree, each call to the parsing function may
additionally add a node to a data structure mirroring the history of chosen
productions. Nodes added along an abandoned path are discarded when the
parser backtracks.
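The validity check described in this subsection can be illustrated with a minimal recognizer. It is a sketch under several assumptions rather than the project's implementation: tokens are plain \texttt{(category, value)} tuples, the functions \texttt{recognize} and \texttt{accepts} and the exception \texttt{ParseError} are hypothetical, and the token iterator is modelled as an integer position that is passed down and returned.

```python
from dataclasses import dataclass
from typing import Optional


class ParseError(Exception):
    """Raised by a terminal that cannot accept the current token."""


@dataclass
class Terminal:
    category: Optional[str] = None  # None disables the category check
    value: Optional[str] = None     # None disables the value check


@dataclass
class Concatenation:
    first: object
    second: object


@dataclass
class OptionalConcatenation:
    first: object
    second: object


@dataclass
class Alternative:
    first: object
    second: object


def recognize(prod, grammar, tokens, pos):
    """Return the position reached after matching `prod`, or raise ParseError."""
    if isinstance(prod, str):  # a name resolves to another definition
        return recognize(grammar[prod], grammar, tokens, pos)
    if isinstance(prod, Terminal):
        if pos >= len(tokens):
            raise ParseError("unexpected end of input")
        category, value = tokens[pos]
        if prod.category not in (None, category) or prod.value not in (None, value):
            raise ParseError(f"unexpected token {value!r} at index {pos}")
        return pos + 1  # "accept" the token
    if isinstance(prod, Concatenation):
        # No backtracking here: errors from either child simply propagate.
        after_first = recognize(prod.first, grammar, tokens, pos)
        return recognize(prod.second, grammar, tokens, after_first)
    saved = pos  # remembered so that a failed path can be undone
    try:
        after_first = recognize(prod.first, grammar, tokens, saved)
        if isinstance(prod, Alternative):
            return after_first
        return recognize(prod.second, grammar, tokens, after_first)
    except ParseError:
        # Backtrack: restart from the saved position and try the second child
        # alone. An error raised here is the one that propagates upwards.
        return recognize(prod.second, grammar, tokens, saved)


def accepts(grammar, tokens):
    """True if the root production consumes the whole token sequence."""
    try:
        return recognize("root", grammar, tokens, 0) == len(tokens)
    except ParseError:
        return False
```

Note that in this sketch backtracking is local: once a compound production has returned successfully, its choice is not revisited by its callers.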
\subsection{Grammar generator}

Parsing the grammar description file against the meta-grammar yields a
syntax tree containing named and anonymous nodes corresponding to various
productions. The grammar generator searches this tree for definitions
and recursively parses them to build a dictionary of named productions
(a grammar description object) for the input file.
\section{Implementation}

\subsection{General architecture}

The program is divided into the entry point script and several modules,
each providing a separate layer of functionality.

\begin{center}
    \begin{tabular}{ |c|p{8.5cm}| }
        \hline
        Module & Description \\
        \hline
        Entry point & Handles user interaction, file I/O and data flow between the main modules of the program. \\
        \hline
        Diagnostic & Contains functions for displaying data, visualizing data structures and printing diagnostic messages. \\
        \hline
        Lexer & Implements a finite-state machine to parse the raw input into tokens. \\
        \hline
        Lexer handlers & Defines the delta function of the finite-state machine. \\
        \hline
        Meta-language & Contains the hard-coded grammar description object for the meta-language. \\
        \hline
        Productions & Defines classes for compound and terminal productions. \\
        \hline
        Parser & Utilities for initializing a top-down recursive descent. \\
        \hline
        Parser handlers & Logical rules for parsing productions. \\
        \hline
        Grammar & Syntax tree analysis and grammar description object generation. \\
        \hline
    \end{tabular}
\end{center}
\subsection{Data structures}

\subsubsection{Productions}

Four classes are defined to describe the three non-terminal and the single
terminal production types: \texttt{Concatenation},
\texttt{OptionalConcatenation}, \texttt{Alternative} and \texttt{Terminal}.

The non-terminal productions each hold two slots for their child nodes. They
are kept as separate classes because the parser function inspects the type of
the production to invoke the appropriate handler.

The terminal production holds a slot for the category and a slot for the
value of the token it matches against. Either may be null to disable
verification for that field. A method is provided for matching against tokens.
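As an illustration of the terminal production, the matching rule with nullable fields might look like the sketch below. The method name \texttt{matches} and its two-argument signature are assumptions made for the example; only the class name \texttt{Terminal} and the meaning of its two slots come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Terminal:
    """Terminal production: each slot may be None to skip that check."""
    category: Optional[str] = None
    value: Optional[str] = None

    def matches(self, token_category: str, token_value: str) -> bool:
        # A None slot acts as a wildcard for the corresponding token field.
        category_ok = self.category is None or self.category == token_category
        value_ok = self.value is None or self.value == token_value
        return category_ok and value_ok
```

With this rule, \texttt{Terminal("separator", ";")} checks both fields, \texttt{Terminal("identifier")} checks only the category, and \texttt{Terminal(value="alpha")} behaves like a string literal that accepts a token of any category.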
\subsubsection{Syntax node}

The \texttt{Node} class holds a single node of the syntax tree. It has a
name field for named productions and a children field. It may hold other
nodes, representing compound productions, or tokens, representing terminal
productions. Named terminal productions are wrapped in a single-child
\texttt{Node} object.

To facilitate backtracking, the class exposes methods for adding and
removing children without directly accessing the children field.
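A possible shape for such a node is sketched below. The method names (\texttt{add\_child}, \texttt{child\_count}, \texttt{truncate}) are hypothetical; the report only states that children can be added and removed through methods.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Syntax tree node; children are other nodes or tokens."""
    name: Optional[str] = None                     # set only for named productions
    _children: list = field(default_factory=list)  # not accessed directly by callers

    def add_child(self, child) -> None:
        self._children.append(child)

    def child_count(self) -> int:
        """Checkpoint to remember before attempting a path."""
        return len(self._children)

    def truncate(self, count: int) -> None:
        """Drop every child added after the checkpoint (used when backtracking)."""
        del self._children[count:]

    @property
    def children(self) -> tuple:
        # Read-only view, so callers cannot bypass the methods above.
        return tuple(self._children)
```

A handler would record \texttt{child\_count()} before trying a path and call \texttt{truncate} with that value if the path fails, which restores the node to its earlier state.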
\subsubsection{State classes}

The classes \texttt{MachineState} and \texttt{ParserState} are data
aggregates representing the internal state of the lexer and the parser,
respectively.

The \texttt{MachineState} class contains an assortment of states necessary
to provide context for tokenization.

The \texttt{ParserState} class holds the token sequence and the grammar
that the parser is currently operating on, as well as the token iterator.
\subsection{Detailed implementation}

\subsubsection{Lexer}

The lexer is a finite-state machine. The lexing process begins by
initializing the state. The input file is then scanned character by character
to determine which characters constitute which tokens. On the boundary between
tokens and non-tokens (or neighboring tokens), the currently recognized token
is appended to the output sequence.

Once the entire input has been scanned, an evaluation phase occurs, where
transformations are performed on the output sequence. Comments are removed and
the end-of-file token is appended.
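The character-by-character scanning loop can be sketched as a small state machine. The example below is deliberately reduced and rests on assumptions: it covers only identifiers, unsigned integer literals, separators and whitespace, it leaves out operators, strings and comments, and the function name \texttt{scan}, the state names and the separator set are all hypothetical. Any character outside this subset is reported as invalid purely for brevity.

```python
SEPARATORS = set(",;(){}[]")  # assumed separator set for this sketch


def scan(text: str) -> list[tuple[str, str]]:
    """Split `text` into (category, value) pairs with a tiny state machine."""
    tokens = []
    state, lexeme = "start", ""

    def flush():
        # Token boundary: emit the token recognized so far, if any.
        nonlocal state, lexeme
        if lexeme:
            tokens.append((state, lexeme))
        state, lexeme = "start", ""

    for ch in text:
        is_word = ch.isalnum() or ch == "_"
        if state == "identifier" and is_word:
            lexeme += ch
        elif state == "number_literal" and ch.isdigit():
            lexeme += ch
        elif state == "number_literal" and is_word:
            state, lexeme = "invalid", lexeme + ch  # e.g. 123abc
        elif state == "invalid" and is_word:
            lexeme += ch
        else:
            flush()
            if ch.isalpha() or ch == "_":
                state, lexeme = "identifier", ch
            elif ch.isdigit():
                state, lexeme = "number_literal", ch
            elif ch in SEPARATORS:
                tokens.append(("separator", ch))
            elif not ch.isspace():
                tokens.append(("invalid", ch))  # outside this sketch's subset
    flush()
    return tokens
```

Using the category name as the state name keeps the sketch short; a fuller lexer would more likely keep the transition rules in separate handler functions.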
\subsubsection{Parser}

The parser is initialized by creating a ``super-root'' node and invoking
the parser function on the first token in the sequence.

The parser function accepts three arguments:
\begin{enumerate}
    \item The current parser state, \texttt{ParserState}.
    \item The prescribed production, either one of the four production types
    or a string to be resolved from the grammar description object.
    \item The parent node, where the parsed production is to be added as a
    child node.
\end{enumerate}
The root element is parsed by specifying the prescribed production as
\texttt{"root"} and the parent node as the super-root. Upon exit, the
entry point function returns the first child of the super-root, i.e. the
root node.

If the production is specified as a string, the main parser function
performs name resolution to obtain the corresponding production object.
The specified production string then becomes the name of the node to be
appended to the parent node. Named productions aid in syntax tree analysis.

Once the production is resolved, the main function looks up and
invokes the appropriate handler for its class.
\subsubsection{Terminal handler}

Terminal handlers accept input tokens and are the sole source of syntax
errors, which makes them crucial to the backtracking mechanism. The root node
may be a terminal node, in which case the language only accepts a single token.

The terminal handler resolves the token at the current index and compares
it against the production's signature. In case of a category or value
mismatch, a syntax error is raised and propagated upwards in the call stack.

Upon success, the token iterator is incremented and the token is added to the
parent node. If the terminal production is a named production, the token
is wrapped in a single-child named node first.
\subsubsection{Non-terminal handlers}

The concatenation handler parses its two children in sequence. If either of
them fails, the error is propagated. No backtracking occurs in this handler.

The optional concatenation handler tries two paths: one where the first
child is skipped and one where it is not. If both paths fail, the error
from the second child is propagated.

Backtracking is implemented by saving the token iterator before attempting
the first path. If the first path fails, the iterator is restored and the
second path is attempted. A new node is created for each of the paths.
If a path succeeds, the corresponding node is appended to the parent.

The alternative handler is implemented in a similar way, the only difference
being the logical rules of the attempted paths.
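The optional concatenation handler with tree building could be sketched as follows. To stay self-contained, the example makes simplifying assumptions: both children are terminals matched by value only, \texttt{ParserState} omits the grammar, and the names \texttt{parse\_terminal}, \texttt{parse\_optional\_concatenation} and \texttt{ParseError} are hypothetical. The order of the two paths (first child present, then first child skipped) is also an assumption of this sketch.

```python
from dataclasses import dataclass, field


class ParseError(Exception):
    """Syntax error raised by the terminal handler."""


@dataclass
class Node:
    name: str = ""
    children: list = field(default_factory=list)


@dataclass
class ParserState:
    tokens: list    # token values only, for brevity
    index: int = 0  # the token iterator


def parse_terminal(state, expected, parent):
    """Stand-in for the terminal handler: match one token by value."""
    if state.index >= len(state.tokens) or state.tokens[state.index] != expected:
        raise ParseError(f"expected {expected!r} at index {state.index}")
    parent.children.append(state.tokens[state.index])
    state.index += 1  # "accept" the token


def parse_optional_concatenation(state, first, second, parent):
    saved = state.index  # remembered before the first path is attempted
    node = Node()        # fresh node for the first path
    try:
        parse_terminal(state, first, node)
        parse_terminal(state, second, node)
    except ParseError:
        state.index = saved                  # backtrack the token iterator
        node = Node()                        # fresh node for the second path
        parse_terminal(state, second, node)  # an error here propagates upwards
    parent.children.append(node)             # only a successful path is attached
```

Because each path builds its own node, a failed path leaves no trace in the parent: the partially filled node is simply dropped.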
\subsection{Grammar generator}

The grammar generator traverses the syntax tree of the parsed grammar file
in search of all named nodes corresponding to definitions.

For each definition, it searches nearby descendant nodes for the definition
key and expression. The expression is evaluated recursively until all
terminal productions are found. Compound productions found along the way are
translated into their production classes. String literals are translated into
terminal productions with the given value. Identifiers are translated into
terminal productions of the given category or into references to other
definitions.

Once all definitions have been evaluated, and before the entry point function
exits, the semantic rules are validated: the grammar must define a root
and it must not use reserved identifiers as keys.
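The two semantic rules can be checked with a few lines of code. In the sketch below the function name \texttt{validate\_grammar} and the exception \texttt{SemanticError} are hypothetical, and the grammar description object is assumed to be a dictionary keyed by definition name; the set of reserved identifiers is the one listed for the meta-language.

```python
RESERVED = frozenset({"identifier", "string_literal", "number_literal", "end_of_file"})


class SemanticError(Exception):
    """Hypothetical error for a grammar that breaks a semantic rule."""


def validate_grammar(grammar: dict) -> None:
    """Raise SemanticError if the grammar breaks either semantic rule."""
    # Rule 1: there must be a definition named "root".
    if "root" not in grammar:
        raise SemanticError("grammar does not define 'root'")
    # Rule 2: no definition may be named after a reserved identifier.
    clashes = sorted(RESERVED.intersection(grammar))
    if clashes:
        raise SemanticError(f"reserved identifiers used as keys: {clashes}")
```

The function returns nothing on success, so the caller can invoke it once after all definitions have been collected.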
\section{Test cases}

The most important test case validates backtracking. Given the following
production:

\begin{verbatim}root = concat(
    "alpha",
    opt_concat(
        identifier,
        "beta"
    )
);
\end{verbatim}

The parser must be able to recognize the input \texttt{alpha beta}. A naive
greedy algorithm would consume the token \texttt{beta} as the identifier rather
than as the literal \texttt{"beta"}, leaving \texttt{opt\_concat} unable to
match \texttt{"beta"} as its second child, thus failing the validation.

A more exhaustive test case would be to provide a grammar for JSON and
successfully validate a file against it.

\end{document}