chore: initial commit

This commit is contained in:
Krzysztof Rudnicki 2023-04-19 14:31:31 +02:00
parent 85f8bb0c31
commit 35bf301586
17 changed files with 1309 additions and 0 deletions

@ -0,0 +1,426 @@
\documentclass{article}
\usepackage{graphicx}
\usepackage{pdfpages}
\usepackage{hyperref}
\setlength{\parskip}{1em}
\begin{document}
\title{ECOTE preliminary report:\\
Top-down parser with backtracking}
\author{Michał Szopiński 300182}
\date{May 11, 2022}
\maketitle
\section{General overview and assumptions}
The goal of this project is to write a program to parse and produce a syntax
tree for an arbitrary input file using an arbitrary grammar.
The parsing is to be implemented using a top-down recursive descent
algorithm, i.e. one that attempts to find a combination of productions
matching the input token sequence, starting from the root production.
Backtracking means that the algorithm may abandon previously chosen
productions if it discovers that they cannot lead to a match.
Because a parser operates on tokens, which are produced during the lexical
analysis stage, the program must have a built-in lexer utility. To reduce
complexity, the lexeme recognition algorithm is hard-coded and not
customizable. The built-in lexer recognizes tokens that are common to
popular C-like languages.
As mentioned before, the program checks arbitrary inputs against arbitrary
grammars. This implies that the user supplies two files, one containing
the input and one containing a description of the grammar.
The program tokenizes both files using the built-in lexer and parses the
grammar description file using a hard-coded grammar description
meta-language. The produced syntax tree is then validated and transformed
into a grammar descriptor object, which is in turn used to parse the input
file. As such, the same parser may be used to process both input files.
The program implements rudimentary diagnostics and error handling. In
particular, the user may receive lexical, parse and semantic errors during
each stage of processing. Changes in the syntax tree are also displayed
as they occur.
\section{Functional requirements}
The programming language of choice for this project is Python. Its dynamic
typing makes it suitable for straightforward operations on complex data
types. The previous proposal of using C/C++ has been withdrawn.
\subsection{Lexical analysis}
Because the lexical analyser is hard-coded, it must strive to resemble the
lexical ruleset of mainstream C-like languages, so as to match user
expectations. A set of popular token categories is defined:
\begin{center}
\begin{tabular}{ |c|p{2.5cm}|p{6cm}| }
\hline
Category & Examples & Description \\
\hline
Identifier & \texttt{hello\_world123} & Used for variable names and keywords. \\
\hline
Operator & \texttt{\$ ++ ===} & Used to define multiple-character non-identifier entities. \\
\hline
Separator & \texttt{, ; ( \}} & Used to define single-character non-identifier entities, typically neighboring each other. \\
\hline
String literal & \texttt{"can't" 'won\textbackslash't'} & Incorporates rules for string enclosure and escaping. \\
\hline
Number literal & \texttt{123 +1.0} & Incorporates rules for digit sequences, sign prefixes and decimal points. \\
\hline
Comment & \texttt{//hello \newline /* world */} & Incorporates rules for single-line and multi-line comments. \\
\hline
Invalid & \texttt{123abc "hello} & Marks lexical errors. Used for diagnostics. \\
\hline
End of file & & Denotes the end of the input file. Used for grammar description. \\
\hline
\end{tabular}
\end{center}
\subsubsection{Scanning and evaluation}
Most of the above tokens are produced during the scanning phase. The
end-of-file token is appended at the end of the token sequence during the
evaluation phase. Comment tokens are removed from the sequence before they
reach the parser. The presence of invalid tokens prevents the program from
progressing to the parsing phase.
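The evaluation phase described above could be sketched as follows. This is a minimal illustration rather than the project's actual code: the \texttt{Token} shape, the \texttt{LexicalError} exception and the function name \texttt{evaluate} are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    category: str          # e.g. "identifier", "comment", "invalid"
    value: Optional[str]   # lexeme text; None for the end-of-file token


class LexicalError(Exception):
    """Raised when the scanner produced at least one invalid token."""


def evaluate(scanned: List[Token]) -> List[Token]:
    """Post-process the scanner output before it reaches the parser."""
    invalid = [t for t in scanned if t.category == "invalid"]
    if invalid:
        # Invalid tokens prevent the program from reaching the parsing phase.
        raise LexicalError(f"invalid tokens: {[t.value for t in invalid]}")
    # Comments never reach the parser.
    tokens = [t for t in scanned if t.category != "comment"]
    # The end-of-file token is appended during evaluation, not scanning.
    tokens.append(Token("end_of_file", None))
    return tokens
```

Reporting every invalid token in one error, instead of stopping at the first, is a choice made for this sketch only.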
\subsection{Grammar description meta-language}
Once the grammar description file has been tokenized using the universal
lexer, the program applies a predefined meta-grammar to parse the file into
a syntax tree for further processing.
At the top level, the meta-language is a set of definitions describing each
production in the language. The fundamental building blocks for definitions
are binary \textbf{compound expressions} and \textbf{terminal expressions}.
Compound expressions are the framework for backtracking recursive descent
logic. They accept two arguments and define the logical relation between
them. Three such expressions are defined:
\begin{enumerate}
\item \textbf{Concatenation} - accepts if both arguments accept.
\item \textbf{Optional concatenation} - accepts if either both or only
the second argument accepts.
\item \textbf{Alternative} - accepts if either argument accepts.
\end{enumerate}
Terminal expressions are used to describe the terminal symbols of the
language. Three kinds of such expressions may be discerned:
\begin{enumerate}
\item \textbf{String literal} - accepts a token of any category whose
value is equal to that enclosed in the literal.
\item \textbf{Identifier}
\begin{enumerate}
\item \textbf{Reserved identifier} - identifier belonging to the set \texttt{identifier string\_literal number\_literal end\_of\_file}.
Accepts a token of any value belonging to the matching category.
\item \textbf{Arbitrary identifier} - resolves to a different definition in the grammar.
\end{enumerate}
\end{enumerate}
\subsubsection{Formal description of the meta-language}
The following is a formal description of the above rules, written as a
grammar description object using Python syntax:
\scriptsize\begin{verbatim}meta_grammar = {
    "root": Alternative(
        "definitions",
        Terminal("end_of_file")
    ),
    "definitions": Concatenation(
        "definition",
        Alternative(
            "definitions",
            Terminal("end_of_file")
        )
    ),
    "definition": Concatenation(
        "definition_key",
        Concatenation(
            Terminal("operator", "="),
            Concatenation(
                "definition_expression",
                Terminal("separator", ";")
            )
        )
    ),
    "definition_key": Terminal("identifier"),
    "definition_expression": "expression",
    "expression": Alternative(
        "concat_expression",
        Alternative(
            "opt_concat_expression",
            Alternative(
                "alt_expression",
                Alternative(
                    "expr_identifier",
                    "expr_string_literal"
                )
            )
        )
    ),
    "expr_identifier": Terminal("identifier"),
    "expr_string_literal": Terminal("string_literal"),
    "concat_expression": Concatenation(
        Terminal("identifier", "concat"),
        "argument"
    ),
    "opt_concat_expression": Concatenation(
        Terminal("identifier", "opt_concat"),
        "argument"
    ),
    "alt_expression": Concatenation(
        Terminal("identifier", "alt"),
        "argument"
    ),
    "argument": Concatenation(
        Terminal("separator", "("),
        Concatenation(
            "expr_arg1",
            Concatenation(
                Terminal("separator", ","),
                Concatenation(
                    "expr_arg2",
                    Terminal("separator", ")")
                )
            )
        )
    ),
    "expr_arg1": "expression",
    "expr_arg2": "expression"
}\end{verbatim}
\normalsize There are two additional semantic constraints: (1) there must
be a definition named \texttt{root}, and (2) there mustn't be any
definitions whose names belong to the set of reserved identifiers.
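The two semantic constraints could be checked on a grammar description object along these lines. This is a hedged sketch: the \texttt{SemanticError} exception and the function name \texttt{validate\_grammar} are hypothetical, and the grammar is assumed to be a plain dictionary keyed by definition name, as in the listing above.

```python
RESERVED = {"identifier", "string_literal", "number_literal", "end_of_file"}


class SemanticError(Exception):
    """Raised when a grammar description violates a semantic constraint."""


def validate_grammar(grammar: dict) -> None:
    # Constraint (1): a definition named "root" must exist.
    if "root" not in grammar:
        raise SemanticError("the grammar must define 'root'")
    # Constraint (2): no definition may be named like a reserved identifier.
    clashes = sorted(RESERVED.intersection(grammar))
    if clashes:
        raise SemanticError(f"reserved identifiers used as definition names: {clashes}")
```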
\subsection{Top-down parser}
The parser is the core feature of the software. It takes the root production
of the given grammar and attempts to find a set of productions stemming from
the root which could accept all the tokens in the sequence. It does so by
implementing the logical rules of the three compound expressions discussed
earlier.
Each step of the parser is a recursive call to a function which processes
a single binary or terminal production. If it is determined that the set of
logical rules for that production cannot yield a combination of productions to
parse the entire token sequence, the function raises an exception and returns
control to its caller.
Exceptions do not originate at compound productions; they are merely propagated
upwards by them. All exceptions stem from terminal productions at the leaves
of the production tree. A terminal production matches the current token in the
sequence against its signature and either increments the token iterator
(``accepts'' the token), or raises an error to be handled by the logic of
compound productions higher in the syntax tree.
Backtracking is achieved by remembering the state of the token iterator at
the initialization of a compound production. If one path fails to parse
the token sequence, the iterator is reset and a different path is tried.
If neither path succeeds, the error from the later path is propagated
upwards, where backtracking may occur as well. If both paths are exhausted
at the root level, the token sequence is declared unparseable.
The above algorithm merely checks the validity of the token sequence against
the grammar. To build a parse tree, each call to the parsing function may
additionally result in the addition of a node to a data structure mirroring
the history of chosen productions. Backtracking rules apply.
\subsection{Grammar generator}
Parsing the grammar description file against the meta-grammar yields a
syntax tree containing named and anonymous nodes corresponding to various
productions. The grammar generator searches this tree for definitions
and recursively parses them to build a dictionary of named productions
(a grammar description object) for the input file.
\section{Implementation}
\subsection{General architecture}
The program is divided into the entry point script and several modules,
each providing a separate layer of functionality.
\begin{center}
\begin{tabular}{ |c|p{8.5cm}| }
\hline
Module & Description \\
\hline
Entry point & Handles user interaction, file I/O and data flow between the main modules of the program. \\
\hline
Diagnostic & Contains functions for displaying data, visualizing data structures and printing diagnostic messages. \\
\hline
Lexer & Implements a finite-state machine to parse the raw input into tokens. \\
\hline
Lexer handlers & Defines the delta function of the finite-state machine. \\
\hline
Meta-language & Contains the hard-coded grammar description object for the meta-language. \\
\hline
Productions & Defines classes for compound and terminal productions. \\
\hline
Parser & Utilities for initializing a top-down recursive descent. \\
\hline
Parser handlers & Logical rules for parsing productions. \\
\hline
Grammar & Syntax tree analysis and grammar description object generation. \\
\hline
\end{tabular}
\end{center}
\subsection{Data structures}
\subsubsection{Productions}
Four classes are defined to describe the three non-terminal and the single
terminal production types: \texttt{Concatenation},
\texttt{OptionalConcatenation}, \texttt{Alternative} and \texttt{Terminal}.
The non-terminal productions hold two slots for their child nodes. They
are separate classes because the parser function looks at the type of the
production to invoke the appropriate handler.
The terminal production holds a slot for the category and the value of the
token it matches against. Each may be null to disable verification for that
field. A method is provided for matching against tokens.
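A minimal sketch of the four production classes, assuming they are plain data holders. The field names \texttt{first}, \texttt{second}, \texttt{category} and \texttt{value}, and the method name \texttt{matches}, are assumptions made for illustration. Only the positional order \texttt{Terminal(category, value)} follows the meta-grammar listing.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Concatenation:
    first: Any   # a production object or the name of a definition
    second: Any


@dataclass(frozen=True)
class OptionalConcatenation:
    first: Any
    second: Any


@dataclass(frozen=True)
class Alternative:
    first: Any
    second: Any


@dataclass(frozen=True)
class Terminal:
    category: Optional[str] = None  # None disables the category check
    value: Optional[str] = None     # None disables the value check

    def matches(self, token_category: str, token_value: Optional[str]) -> bool:
        """Return True if the token satisfies every enabled check."""
        category_ok = self.category is None or self.category == token_category
        value_ok = self.value is None or self.value == token_value
        return category_ok and value_ok
```

Under these assumptions a string literal from the meta-language would become \texttt{Terminal(None, "alpha")}, which accepts a token of any category with that value.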
\subsubsection{Syntax node}
The \texttt{Node} class holds a single node of the syntax tree. It has a
name field for named productions and a children field. It may hold other
nodes, representing compound productions, or tokens, representing terminal
productions. Named terminal productions are wrapped in a single-child
\texttt{Node} object.
To facilitate backtracking, the class exposes methods for adding and
removing children without directly accessing the children field.
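One possible shape for such a class is sketched below. The method names \texttt{add\_child} and \texttt{remove\_last\_child} are hypothetical. The report only states that methods for adding and removing children exist.

```python
from typing import Any, List, Optional, Tuple


class Node:
    """A syntax tree node; children are other nodes or tokens."""

    def __init__(self, name: Optional[str] = None) -> None:
        self.name = name                 # set only for named productions
        self._children: List[Any] = []   # not meant to be touched directly

    def add_child(self, child: Any) -> None:
        self._children.append(child)

    def remove_last_child(self) -> Any:
        """Undo the most recent addition, e.g. when a path is abandoned."""
        return self._children.pop()

    @property
    def children(self) -> Tuple[Any, ...]:
        # Read-only view, so callers cannot bypass the methods above.
        return tuple(self._children)
```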
\subsubsection{State classes}
The classes \texttt{MachineState} and \texttt{ParserState} are data
aggregates representing the internal state of the lexer and the parser,
respectively.
The \texttt{MachineState} class contains an assortment of states necessary
to provide context for tokenization.
The \texttt{ParserState} class holds the token sequence and the grammar
that the parser is currently operating on, as well as the token iterator.
\subsection{Detailed implementation}
\subsubsection{Lexer}
The lexer is a finite-state machine. The lexing process begins by
initializing the state. The input file is then scanned character by character
to determine which characters constitute which tokens. On the boundary between
tokens and non-tokens (or neighboring tokens), the currently recognized token
is appended to the output sequence.
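The character-by-character scanning could look roughly like the sketch below. It is deliberately reduced to identifiers, unsigned integers, separators and invalid lexemes. Operators, string literals, signed or decimal numbers and comments are left out, and the state names are assumptions.

```python
from typing import List, Tuple

SEPARATORS = set(",;(){}[]")


def scan(text: str) -> List[Tuple[str, str]]:
    """Split text into (category, lexeme) pairs with a small state machine."""
    tokens: List[Tuple[str, str]] = []
    state, lexeme = "idle", ""

    def flush() -> None:
        # Token boundary: emit the token recognized so far, if any.
        nonlocal state, lexeme
        if state != "idle":
            tokens.append((state, lexeme))
        state, lexeme = "idle", ""

    for ch in text:
        if ch.isspace():
            flush()
        elif ch in SEPARATORS:
            flush()
            tokens.append(("separator", ch))
        elif state == "idle":
            if ch.isalpha() or ch == "_":
                state = "identifier"
            elif ch.isdigit():
                state = "number_literal"
            else:
                state = "invalid"
            lexeme = ch
        else:
            # Stay in the current state only while the character fits it.
            if state == "identifier" and not (ch.isalnum() or ch == "_"):
                state = "invalid"
            elif state == "number_literal" and not ch.isdigit():
                state = "invalid"
            lexeme += ch
    flush()
    return tokens
```

For example, \texttt{scan("foo(12, 3x);")} marks \texttt{3x} as invalid, in the same way that \texttt{123abc} is listed as an invalid token in the category table.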
Once the entire input is scanned, an evaluation phase occurs, where transformations
are performed on the output sequence. Comments are removed and the
end-of-file token is appended.
\subsubsection{Parser}
The parser is initialized by creating a ``super-root'' node and invoking
the parser function on the first token in the sequence.
The parser function accepts three arguments:
\begin{enumerate}
\item The current parser state, \texttt{ParserState}.
\item The prescribed production, either one of the four production types
or a string to be resolved from the grammar description object.
\item The parent node, where the parsed production is to be added as a
child node.
\end{enumerate}
The root element is parsed by specifying the prescribed production as
\texttt{"root"} and the parent node as the super-root. Upon exit, the
entry point function returns the first child of the super-root, i.e. the
root node.
If the production is specified as a string, the main parser function
performs name resolution to obtain the corresponding production class.
The specified production string then becomes the name for the node to be
appended to the parent node. Named productions aid in syntax tree analysis.
Once the production class is resolved, the main function looks up and
invokes the appropriate handler for that production.
\subsubsection{Terminal handler}
Terminal handlers accept input tokens and are the source of syntax errors,
crucial to the backtracking mechanism. The root node may be a terminal node,
in which case the language only accepts a single token.
The terminal handler resolves the token at the current index and compares
it against the production's signature. In case of category or value
mismatch, a syntax error is raised and propagated upwards in the call stack.
Upon success, the token iterator is incremented and a token is added to the
parent node. If the terminal production is a named production, the token
is wrapped in a single-child named node first.
\subsubsection{Non-terminal handlers}
The concatenation handler parses its two children in sequence. If either of
them fails, the error is propagated. No backtracking occurs in this handler.
The optional concatenation handler tries two paths: one where the first
child is skipped and one where it is not. If both paths fail, the error
from the second child is propagated.
Backtracking is implemented by saving the token iterator before attempting
the first path. If the first path fails, the iterator is restored and the
second path is attempted. A new node is created for each of the paths.
If a path succeeds, the corresponding node is appended to the parent.
The alternative handler is implemented in a similar way, the only difference
being the logical rules of the attempted paths.
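The handlers described in this section can be condensed into the following sketch. It is not the project's code. Tokens are assumed to be \texttt{(category, value)} tuples, all class and field names are illustrative, and the three compound handlers are folded into one loop over ``paths'' for brevity. The optional concatenation is assumed to try the path that includes the first child first.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (category, value); an assumed token shape

class ParseError(Exception):
    """Raised by the terminal handler, propagated by compound handlers."""

@dataclass(frozen=True)
class Terminal:
    category: Optional[str] = None  # None disables the category check
    value: Optional[str] = None     # None disables the value check

@dataclass(frozen=True)
class Concatenation:
    first: Any
    second: Any

@dataclass(frozen=True)
class OptionalConcatenation:
    first: Any
    second: Any

@dataclass(frozen=True)
class Alternative:
    first: Any
    second: Any

@dataclass
class Node:
    name: Optional[str] = None
    children: list = field(default_factory=list)

@dataclass
class ParserState:
    tokens: List[Token]
    grammar: dict
    index: int = 0  # the token iterator

def parse(state: ParserState, production: Any, parent: Node) -> None:
    if isinstance(production, str):
        # Name resolution: the result is wrapped in a named node.
        node = Node(production)
        parse(state, state.grammar[production], node)
        parent.children.append(node)
        return
    if isinstance(production, Terminal):
        if state.index >= len(state.tokens):
            raise ParseError("unexpected end of input")
        category, value = state.tokens[state.index]
        if (production.category not in (None, category)
                or production.value not in (None, value)):
            raise ParseError(f"unexpected token {value!r} at index {state.index}")
        state.index += 1                      # accept the token
        parent.children.append((category, value))
        return
    # Compound productions: each path is a sequence of children to parse.
    if isinstance(production, Concatenation):
        paths = [(production.first, production.second)]
    elif isinstance(production, OptionalConcatenation):
        paths = [(production.first, production.second), (production.second,)]
    elif isinstance(production, Alternative):
        paths = [(production.first,), (production.second,)]
    else:
        raise TypeError(f"unknown production: {production!r}")
    saved = state.index
    last_error = None
    for path in paths:
        state.index = saved     # backtrack: restore the token iterator
        node = Node()           # a new node for each attempted path
        try:
            for child in path:
                parse(state, child, node)
        except ParseError as error:
            last_error = error  # the error of the later path wins
            continue
        parent.children.append(node)
        return
    state.index = saved
    raise last_error
```

In this sketch a choice is not revisited once its handler has returned successfully, so backtracking is local to each compound production. Wrapping a named production in a node that in turn contains an anonymous node is a simplification of the wrapping rules given in the text.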
\subsection{Grammar generator}
The grammar generator traverses the syntax tree of the parsed grammar file
in search of all named nodes corresponding to definitions.
For each definition, it searches nearby descendant nodes for the definition
key and expression. The expression is evaluated recursively until all
terminal productions are found. Found compound productions are translated
into their production classes. String literals are translated into tokens
with the given value. Identifiers are translated into tokens of the given
category or into references to other definitions.
Once all definitions are evaluated and prior to exit from the entry point
function, semantic rules are validated: the grammar must define a root
and it mustn't use reserved identifiers as keys.
\section{Test cases}
The most important test case validates backtracking. Given the following
production:
\begin{verbatim}root = concat(
"alpha",
opt_concat(
identifier,
"beta"
)
)
\end{verbatim}
The parser must be able to recognize the input \texttt{alpha beta}. A naive greedy
algorithm would consume the token \texttt{beta} as the identifier rather
than as the string literal \texttt{"beta"}, leaving \texttt{opt\_concat} unable to
match its second child, thus failing the validation.
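The difference can be made concrete with two hand-written recognizers for this one production. Both are illustrative only: tokens are assumed to be plain strings, identifiers are approximated with \texttt{str.isidentifier}, and neither function is part of the described program.

```python
from typing import List


def is_identifier(token: str) -> bool:
    return token.isidentifier()


def recognize_greedy(tokens: List[str]) -> bool:
    """Commit to the identifier whenever it matches; never undo that choice."""
    if not tokens or tokens[0] != "alpha":
        return False
    i = 1
    if i < len(tokens) and is_identifier(tokens[i]):
        i += 1  # greedy: the optional first child swallows the token
    return tokens[i:] == ["beta"]


def recognize_backtracking(tokens: List[str]) -> bool:
    """Try the path with the identifier first, then reset and skip it."""
    if not tokens or tokens[0] != "alpha":
        return False
    saved = 1  # token iterator saved on entry to opt_concat
    i = saved
    if i < len(tokens) and is_identifier(tokens[i]) and tokens[i + 1:] == ["beta"]:
        return True
    i = saved  # backtrack: restore the iterator and skip the identifier
    return tokens[i:] == ["beta"]
```

With the input \texttt{["alpha", "beta"]} the greedy version rejects and the backtracking version accepts, while both accept \texttt{["alpha", "x", "beta"]}.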
A more exhaustive test case would be to provide a grammar for JSON and
successfully validate a file against it.
\end{document}

@ -0,0 +1,9 @@
Every error message
Every possible input
Design code for errors
Design of test cases to introductory document
Execution must be presented during hours scheduled in laboratory
Write code easy to modify
Test cases:
Input data and result of test (input/output of data) -> both correct and incorrect data

@ -0,0 +1,11 @@
Every error message
Every possible input
Design code for errors
Design of test cases to introductory document
Execution must be presented during hours scheduled in laboratory
Write code easy to modify
Test cases:
Input data and result of test (input/output of data) -> both correct and incorrect data
decide whether to use ANTLR
bachus one

@ -0,0 +1,45 @@
\documentclass[12pt]{article}
\date{\today}
\title{ECOTE - preliminary project \\
Translator of a LaTeX subset to HTML
}
\author{Krzysztof Rudnicki, 307585 \\
Semester: 2023L}
\begin{document}
\maketitle
\section{General overview and assumptions}
initial task proposals (at least: assumptions, variant selection, implementation technology, scope, etc.). \\
My task is to create a translator from a subset of \LaTeX \, to a selected text format, with a focus on \LaTeX \, tables. \\
I chose HTML as the target format because I know \LaTeX \, very well and HTML relatively well. HTML is also easy to generate, noticeably different from \LaTeX \, and popular, which makes this translator a practical tool.
\subsection{Assumptions}
\begin{itemize}
\item There are no \LaTeX \, comments (\%) in the script
\item There are no extra packages in the \LaTeX \, script (loaded with the \\ \textbackslash usepackage keyword) besides the ones distributed with \LaTeX
\item There are no extra classes in the \LaTeX \, script besides the ones distributed with \LaTeX
\item There is nothing between the \textbackslash documentclass keyword and the \\ \textbackslash begin\{document\} keyword
\item No standard \LaTeX \, instructions are modified in the script
\item ``Tables'' will be represented using the \LaTeX \, \emph{table} environment
\end{itemize}
\section{Functional requirements}
\subsection{\LaTeX \, subset}
This project will focus almost exclusively on the \emph{table} environment, \\
more specifically on a table environment containing a tabular environment inside of it.
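As a rough illustration of the intended translation, the sketch below maps the body of a tabular environment to an HTML table. It is only a sketch under strong assumptions: rows end with a double backslash, cells are separated by \&, \textbackslash hline is dropped, and escaped ampersands, multicolumn cells and nested environments are ignored. The function name \texttt{tabular\_to\_html} is hypothetical.

```python
import html


def tabular_to_html(body: str) -> str:
    """Translate the rows found inside a tabular environment to HTML."""
    rows = []
    for raw_row in body.split(r"\\"):
        # Horizontal rules carry no content in this simplified mapping.
        row = raw_row.replace(r"\hline", "").strip()
        if not row:
            continue
        cells = [html.escape(cell.strip()) for cell in row.split("&")]
        rows.append("<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Splitting on fixed delimiters keeps the example short. A real translator would more likely tokenize the input first, so that escaped characters and commands inside cells are handled correctly.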
\section{Implementation}
I decided to use Python as the language in which I will implement my solution. \\
The reasons for using Python are as follows:
\begin{enumerate}
\item It is the easiest language among those that I know
\item I know it well enough to be confident in my ability to implement this solution in Python
\item I want to learn Python better through this project
\end{enumerate}
The main drawback of Python, its slow execution speed, does not bother me, as I believe the project scope will not be large enough for this to become an issue.
\subsection{General architecture}
\subsection{Data structures}
\subsection{Module descriptions}
\subsection{Input/output description}
\subsection{Others}
\section{Functional test cases}
\end{document}