mirror of
https://github.com/kuhyx/WUT_Computer_Science.git
synced 2026-07-04 16:03:11 +02:00
chore: initial commit
This commit is contained in:
parent
85f8bb0c31
commit
35bf301586
File diff suppressed because one or more lines are too long
BIN
inspirations/ECOTE_project_documentation.pdf
Normal file
Binary file not shown.
BIN
inspirations/ECOTEproject_CanerKaya.pdf
Normal file
Binary file not shown.
BIN
inspirations/PreliminaryProjectTomkiewicz.pdf
Normal file
Binary file not shown.
BIN
inspirations/godBlessLachcim/lachcim.pdf
Normal file
Binary file not shown.
426
inspirations/godBlessLachcim/lachcim.tex
Normal file
@ -0,0 +1,426 @@
|
||||
\documentclass{article}
|
||||
|
||||
\usepackage{graphicx}
|
||||
\usepackage{pdfpages}
|
||||
\usepackage{hyperref}
|
||||
\setlength{\parskip}{1em}
|
||||
|
||||
\begin{document}
|
||||
|
||||
\title{ECOTE preliminary report:\\
|
||||
Top-down parser with backtracking}
|
||||
\author{Michał Szopiński 300182}
|
||||
\date{May 11, 2022}
|
||||
\maketitle
|
||||
|
||||
\section{General overview and assumptions}
|
||||
|
||||
The goal of this project is to write a program to parse and produce a syntax
|
||||
tree for an arbitrary input file using an arbitrary grammar.
|
||||
|
||||
The parsing is to be implemented using a top-down recursive descent
|
||||
algorithm, i.e. one that attempts to find a combination of productions
|
||||
matching the input token sequence, starting from the root production.
|
||||
Backtracking means that the algorithm may abandon previously chosen
|
||||
productions if it discovers that they cannot lead to a match.
|
||||
|
||||
Because a parser operates on tokens, which are produced during the lexical
|
||||
analysis stage, the program must have a built-in lexer utility. To reduce
|
||||
complexity, the lexeme recognition algorithm is hard-coded and not
|
||||
customizable. The built-in lexer recognizes tokens that are common to
|
||||
popular C-like languages.
|
||||
|
||||
As mentioned before, the program checks arbitrary inputs against arbitrary
|
||||
grammars. This implies that the user supplies two files, one containing
|
||||
the input and one containing a description of the grammar.
|
||||
|
||||
The program tokenizes both files using the built-in lexer and parses the
|
||||
grammar description file using a hard-coded grammar description
|
||||
meta-language. The produced syntax tree is then validated and transformed
|
||||
into a grammar descriptor object, which is in turn used to parse the input
|
||||
file. As such, the same parser may be used to process both input files.
|
||||
|
||||
The program implements rudimentary diagnostics and error handling. In
|
||||
particular, the user may receive lexical, parse and semantic errors during
|
||||
each stage of processing. Changes in the syntax tree are also displayed
|
||||
as they occur.
|
||||
|
||||
\section{Functional requirements}
|
||||
|
||||
The programming language of choice for this project is Python. Its dynamic
|
||||
typing makes it suitable for straightforward operations on complex data
|
||||
types. The previous proposal of using C/C++ has been withdrawn.
|
||||
|
||||
\subsection{Lexical analysis}
|
||||
|
||||
Because the lexical analyser is hard-coded, it must strive to resemble the
|
||||
lexical ruleset of mainstream C-like languages, so as to match user
|
||||
expectations. A set of popular token categories is defined:
|
||||
|
||||
\begin{center}
|
||||
\begin{tabular}{ |c|p{2.5cm}|p{6cm}| }
|
||||
\hline
|
||||
Category & Examples & Description \\
|
||||
\hline
|
||||
Identifier & \texttt{hello\_world123} & Used for variable names and keywords. \\
|
||||
\hline
|
||||
Operator & \texttt{\$ ++ ===} & Used to define multiple-character non-identifier entities. \\
|
||||
\hline
|
||||
Separator & \texttt{, ; ( \}} & Used to define single-character non-identifier entities, typically neighboring each other. \\
|
||||
\hline
|
||||
String literal & \texttt{"can't" 'won\textbackslash't'} & Incorporates rules for string enclosure and escaping. \\
|
||||
\hline
|
||||
Number literal & \texttt{123 +1.0} & Incorporates rules for digit sequences, sign prefixes and decimal points. \\
|
||||
\hline
|
||||
Comment & \texttt{//hello \newline /* world */} & Incorporates rules for single-line and multi-line comments. \\
|
||||
\hline
|
||||
Invalid & \texttt{123abc "hello} & Marks lexical errors. Used for diagnostics. \\
|
||||
\hline
|
||||
End of file & & Denotes the end of the input file. Used for grammar description. \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
|
||||
\subsubsection{Scanning and evaluation}
|
||||
|
||||
Most of the above tokens are produced during the scanning phase. The
|
||||
end-of-file token is appended at the end of the token sequence during the
|
||||
evaluation phase. Comment tokens are removed from the sequence before they
|
||||
reach the parser. The presence of invalid tokens prevents the program from
|
||||
progressing to the parsing phase.
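
The evaluation phase described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names \texttt{Token}, \texttt{LexicalError} and \texttt{evaluate} are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: a category name plus the matched text."""
    category: str  # e.g. "identifier", "comment", "invalid", "end_of_file"
    value: str = ""


class LexicalError(Exception):
    """Raised when the scanned sequence contains invalid tokens."""


def evaluate(scanned):
    """Post-process the scanner output before it reaches the parser."""
    invalid = [t.value for t in scanned if t.category == "invalid"]
    if invalid:
        # Invalid tokens prevent the program from reaching the parsing phase.
        raise LexicalError(f"invalid tokens: {invalid}")
    # Comment tokens are removed from the sequence.
    tokens = [t for t in scanned if t.category != "comment"]
    # The end-of-file token is appended at the end of the sequence.
    tokens.append(Token("end_of_file"))
    return tokens
```

Raising an exception on the first evaluation pass keeps the parser free of any special handling for invalid tokens.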
|
||||
|
||||
\subsection{Grammar description meta-language}
|
||||
|
||||
Once the grammar description file has been tokenized using the universal
|
||||
lexer, the program applies a predefined meta-grammar to parse the file into
|
||||
a syntax tree for further processing.
|
||||
|
||||
At the top level, the meta-language is a set of definitions describing each
|
||||
production in the language. The fundamental building blocks for definitions
|
||||
are binary \textbf{compound expressions} and \textbf{terminal expressions}.
|
||||
|
||||
Compound expressions are the framework for backtracking recursive descent
|
||||
logic. They accept two arguments and define the logical relation between
|
||||
them. Three such expressions are defined:
|
||||
|
||||
\begin{enumerate}
|
||||
\item \textbf{Concatenation} - accepts if both arguments accept.
|
||||
\item \textbf{Optional concatenation} - accepts if either both or only
|
||||
the second argument accepts.
|
||||
\item \textbf{Alternative} - accepts if either argument accepts.
|
||||
\end{enumerate}
|
||||
|
||||
Terminal expressions are used to describe the terminal symbols of the
|
||||
language. Three kinds of such expressions may be discerned:
|
||||
|
||||
\begin{enumerate}
|
||||
\item \textbf{String literal} - accepts a token of any category whose
|
||||
value is equal to that enclosed in the literal.
|
||||
\item \textbf{Identifier}
|
||||
\begin{enumerate}
|
||||
\item \textbf{Reserved identifier} - identifier belonging to the set \texttt{identifier string\_literal number\_literal end\_of\_file}.
|
||||
Accepts a token of any value belonging to the matching category.
|
||||
\item \textbf{Arbitrary identifier} - resolves to a different definition in the grammar.
|
||||
\end{enumerate}
|
||||
\end{enumerate}
|
||||
|
||||
\subsubsection{Formal description of the meta-language}
|
||||
|
||||
The following is a formal description of the above rules, written as a
|
||||
grammar description object using Python syntax:
|
||||
|
||||
\scriptsize\begin{verbatim}meta_grammar = {
|
||||
"root": Alternative(
|
||||
"definitions",
|
||||
Terminal("end_of_file")
|
||||
),
|
||||
"definitions": Concatenation(
|
||||
"definition",
|
||||
Alternative(
|
||||
"definitions",
|
||||
Terminal("end_of_file")
|
||||
)
|
||||
),
|
||||
"definition": Concatenation(
|
||||
"definition_key",
|
||||
Concatenation(
|
||||
Terminal("operator", "="),
|
||||
Concatenation(
|
||||
"definition_expression",
|
||||
Terminal("separator", ";")
|
||||
)
|
||||
)
|
||||
),
|
||||
"definition_key": Terminal("identifier"),
|
||||
"definition_expression": "expression",
|
||||
"expression": Alternative(
|
||||
"concat_expression",
|
||||
Alternative(
|
||||
"opt_concat_expression",
|
||||
Alternative(
|
||||
"alt_expression",
|
||||
Alternative(
|
||||
"expr_identifier",
|
||||
"expr_string_literal"
|
||||
)
|
||||
)
|
||||
)
|
||||
),
|
||||
"expr_identifier": Terminal("identifier"),
|
||||
"expr_string_literal": Terminal("string_literal"),
|
||||
"concat_expression": Concatenation(
|
||||
Terminal("identifier", "concat"),
|
||||
"argument"
|
||||
),
|
||||
"opt_concat_expression": Concatenation(
|
||||
Terminal("identifier", "opt_concat"),
|
||||
"argument"
|
||||
),
|
||||
"alt_expression": Concatenation(
|
||||
Terminal("identifier", "alt"),
|
||||
"argument"
|
||||
),
|
||||
"argument": Concatenation(
|
||||
Terminal("separator", "("),
|
||||
Concatenation(
|
||||
"expr_arg1",
|
||||
Concatenation(
|
||||
Terminal("separator", ","),
|
||||
Concatenation(
|
||||
"expr_arg2",
|
||||
Terminal("separator", ")")
|
||||
)
|
||||
)
|
||||
)
|
||||
),
|
||||
"expr_arg1": "expression",
|
||||
"expr_arg2": "expression"
|
||||
}\end{verbatim}
|
||||
|
||||
\normalsize There are two additional semantic constraints: (1) there must
|
||||
be a definition named \texttt{root}, and (2) there mustn't be any
|
||||
definitions whose names belong to the set of reserved identifiers.
|
||||
|
||||
\subsection{Top-down parser}
|
||||
|
||||
The parser is the core feature of the software. It takes the root production
|
||||
of the given grammar and attempts to find a set of productions stemming from
|
||||
the root which could accept all the tokens in the sequence. It does so by
|
||||
implementing the logical rules of the three compound expressions discussed
|
||||
earlier.
|
||||
|
||||
Each step of the parser is a recursive call to a function which processes
|
||||
a single binary or terminal production. If it is determined that the set of
|
||||
logical rules for that production cannot yield a combination of productions to
|
||||
parse the entire token sequence, the function raises an exception and returns
|
||||
control to its caller.
|
||||
|
||||
Exceptions don't originate at compound productions; they are merely propagated
|
||||
upwards by them. All exceptions stem from terminal productions at the leaves
|
||||
of the production tree. A terminal symbol matches the current token in the
|
||||
sequence against its signature and either increments the token iterator
|
||||
(``accepts'' the token), or raises an error to be handled by the logic of
|
||||
compound productions higher in the syntax tree.
|
||||
|
||||
Backtracking is achieved by remembering the state of the token iterator at
|
||||
the initialization of a compound production. If one path fails to parse
|
||||
the token sequence, the iterator is reset and a different path is tried.
|
||||
If neither path succeeds, the error from the later path is propagated
|
||||
upwards, where backtracking may occur as well. If both paths are exhausted
|
||||
at the root level, the token sequence is declared unparseable.
|
||||
|
||||
The above algorithm merely checks the validity of the token sequence against
|
||||
the grammar. To build a parse tree, each call to the parsing function may
|
||||
additionally result in the addition of a node to a data structure mirroring
|
||||
the history of chosen productions. Backtracking rules apply.
|
||||
|
||||
\subsection{Grammar generator}
|
||||
|
||||
Parsing the grammar description file against the meta-grammar yields a
|
||||
syntax tree containing named and anonymous nodes corresponding to various
|
||||
productions. The grammar generator searches this tree for definitions
|
||||
and recursively parses them to build a dictionary of named productions
|
||||
(a grammar description object) for the input file.
|
||||
|
||||
\section{Implementation}
|
||||
|
||||
\subsection{General architecture}
|
||||
|
||||
The program is divided into the entry point script and several modules,
|
||||
each providing a separate layer of functionality.
|
||||
|
||||
\begin{center}
|
||||
\begin{tabular}{ |c|p{8.5cm}| }
|
||||
\hline
|
||||
Module & Description \\
|
||||
\hline
|
||||
Entry point & Handles user interaction, file I/O and data flow between the main modules of the program. \\
|
||||
\hline
|
||||
Diagnostic & Contains functions for displaying data, visualizing data structures and printing diagnostic messages. \\
|
||||
\hline
|
||||
Lexer & Implements a finite-state machine to parse the raw input into tokens. \\
|
||||
\hline
|
||||
Lexer handlers & Defines the delta function of the finite-state machine. \\
|
||||
\hline
|
||||
Meta-language & Contains the hard-coded grammar description object for the meta-language. \\
|
||||
\hline
|
||||
Productions & Defines classes for compound and terminal productions. \\
|
||||
\hline
|
||||
Parser & Utilities for initializing a top-down recursive descent. \\
|
||||
\hline
|
||||
Parser handlers & Logical rules for parsing productions. \\
|
||||
\hline
|
||||
Grammar & Syntax tree analysis and grammar description object generation. \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
|
||||
\subsection{Data structures}
|
||||
|
||||
\subsubsection{Productions}
|
||||
|
||||
Four classes are defined to describe the three non-terminal and the single
|
||||
terminal production types: \texttt{Concatenation},
|
||||
\texttt{OptionalConcatenation}, \texttt{Alternative} and \texttt{Terminal}.
|
||||
|
||||
The non-terminal productions hold two slots for their children nodes. They
|
||||
are separate classes because the parser function looks at the type of the production
|
||||
to invoke the appropriate handler.
|
||||
|
||||
The terminal production holds a slot for the category and the value of the
|
||||
token it matches against. Each may be null to disable verification for that
|
||||
field. A method is provided for matching against tokens.
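
A minimal sketch of such a terminal production is given below. The field names and the method name \texttt{matches} are assumptions chosen for illustration; only the nullable category and value slots come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record produced by the lexer."""
    category: str
    value: str


@dataclass(frozen=True)
class Terminal:
    """Terminal production; a None field disables verification of that field."""
    category: Optional[str] = None
    value: Optional[str] = None

    def matches(self, token: Token) -> bool:
        """Return True if the token satisfies every non-null field."""
        if self.category is not None and token.category != self.category:
            return False
        if self.value is not None and token.value != self.value:
            return False
        return True
```

Under this sketch, \texttt{Terminal("identifier")} accepts any identifier, \texttt{Terminal("operator", "=")} accepts only the \texttt{=} operator, and \texttt{Terminal(value="alpha")} accepts a token of any category whose value is \texttt{alpha}.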
|
||||
|
||||
\subsubsection{Syntax node}
|
||||
|
||||
The \texttt{Node} class holds a single node of the syntax tree. It has a
|
||||
name field for named productions and a children field. It may hold other
|
||||
nodes, representing compound productions, or tokens, representing terminal
|
||||
productions. Named terminal productions are wrapped in a single-child
|
||||
\texttt{Node} object.
|
||||
|
||||
To facilitate backtracking, the class exposes methods for adding and
|
||||
removing children without directly accessing the children field.
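
One possible shape for this class is sketched below. The method names \texttt{add\_child} and \texttt{remove\_last\_child} are hypothetical; the report only states that such methods exist.

```python
class Node:
    """Sketch of a syntax tree node with an encapsulated children list."""

    def __init__(self, name=None):
        self.name = name      # set for named productions, None otherwise
        self._children = []   # other Node objects or tokens

    def add_child(self, child):
        """Append a node or a token as the last child."""
        self._children.append(child)

    def remove_last_child(self):
        """Undo the most recent addition, e.g. when a path is abandoned."""
        return self._children.pop()

    @property
    def children(self):
        """Read-only view of the children."""
        return tuple(self._children)
```

Returning a tuple from \texttt{children} keeps callers from mutating the list behind the node's back, so every change goes through the two methods.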
|
||||
|
||||
\subsubsection{State classes}
|
||||
|
||||
The classes \texttt{MachineState} and \texttt{ParserState} are data
|
||||
aggregates representing the internal state of the lexer and the parser,
|
||||
respectively.
|
||||
|
||||
The \texttt{MachineState} class contains an assortment of states necessary
|
||||
to provide context for tokenization.
|
||||
|
||||
The \texttt{ParserState} class holds the token sequence and the grammar
|
||||
that the parser is currently operating on, as well as the token iterator.
|
||||
|
||||
\subsection{Detailed implementation}
|
||||
|
||||
\subsubsection{Lexer}
|
||||
|
||||
The lexer is a finite-state machine. The lexing process begins by
|
||||
initializing the state. The input file is then scanned character by character
|
||||
to determine which characters constitute which tokens. On the boundary between
|
||||
tokens and non-tokens (or neighboring tokens), the currently recognized token
|
||||
is appended to the output sequence.
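
The scanning loop can be illustrated with a deliberately reduced finite-state machine that only knows identifiers, unsigned integers and separators. It is a sketch under those assumptions: the real lexer also handles operators, string literals, signs, decimal points and comments, and the function name \texttt{scan} is made up for the example.

```python
SEPARATORS = ",;(){}"


def scan(text):
    """Scan text character by character into (category, value) pairs."""
    tokens = []
    state, lexeme = "idle", ""  # the state doubles as the token category

    def flush():
        """Emit the token recognized so far and return to the idle state."""
        nonlocal state, lexeme
        if state != "idle":
            tokens.append((state, lexeme))
        state, lexeme = "idle", ""

    for ch in text:
        word_char = ch.isalnum() or ch == "_"
        if state == "identifier" and word_char:
            lexeme += ch
        elif state == "number_literal" and ch.isdigit():
            lexeme += ch
        elif state == "number_literal" and word_char:
            # A letter glued to a number, as in 123abc, is a lexical error.
            state, lexeme = "invalid", lexeme + ch
        elif state == "invalid" and word_char:
            lexeme += ch
        else:
            # Boundary between tokens: emit the current one, then restart.
            flush()
            if ch.isalpha() or ch == "_":
                state, lexeme = "identifier", ch
            elif ch.isdigit():
                state, lexeme = "number_literal", ch
            elif ch in SEPARATORS:
                tokens.append(("separator", ch))
            elif not ch.isspace():
                tokens.append(("invalid", ch))
    flush()  # emit a token that runs up to the end of the input
    return tokens
```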
|
||||
|
||||
Once the entire input is scanned, an evaluation phase occurs, where transformations
|
||||
are performed on the output sequence. Comments are removed and the end of
|
||||
file token is appended.
|
||||
|
||||
\subsubsection{Parser}
|
||||
|
||||
The parser is initialized by creating a ``super-root'' node and invoking
|
||||
the parser function on the first token in the sequence.
|
||||
|
||||
The parser function accepts three arguments:
|
||||
\begin{enumerate}
|
||||
\item The current parser state, \texttt{ParserState}.
|
||||
\item The prescribed production, either one of the four production types
|
||||
or a string to be resolved from the grammar description object.
|
||||
\item The parent node, where the parsed production is to be added as a
|
||||
child node.
|
||||
\end{enumerate}
|
||||
The root element is parsed by specifying the prescribed production as
|
||||
\texttt{"root"} and the parent node as the super-root. Upon exit, the
|
||||
entry point function returns the first child of the super-root, i.e. the
|
||||
root node.
|
||||
|
||||
If the production is specified as a string, the main parser function
|
||||
performs name resolution to obtain the corresponding production class.
|
||||
The specified production string then becomes the name for the node to be
|
||||
appended to the parent node. Named productions aid in syntax tree analysis.
|
||||
|
||||
Once the production class is resolved, the main function looks up and
|
||||
invokes the appropriate handler for that production.
|
||||
|
||||
\subsubsection{Terminal handler}
|
||||
|
||||
Terminal handlers accept input tokens and are the source of syntax errors,
|
||||
crucial to the backtracking mechanism. The root node may be a terminal node,
|
||||
in which case the language only accepts a single token.
|
||||
|
||||
The terminal handler resolves the token at the current index and compares
|
||||
it against the production's signature. In case of category or value
|
||||
mismatch, a syntax error is raised and propagated upwards in the call stack.
|
||||
|
||||
Upon success, the token iterator is incremented and a token is added to the
|
||||
parent node. If the terminal production is a named production, the token
|
||||
is wrapped in a single-child named node first.
|
||||
|
||||
\subsubsection{Non-terminal handlers}
|
||||
|
||||
The concatenation handler parses its two children in sequence. If either of
|
||||
them fails, the error is propagated. No backtracking occurs in this handler.
|
||||
|
||||
The optional concatenation handler tries two paths: one where the first
|
||||
child is skipped and one where it is not. If both paths fail, the error
|
||||
from the second child is propagated.
|
||||
|
||||
Backtracking is implemented by saving the token iterator before attempting
|
||||
the first path. If the first path fails, the iterator is restored and the
|
||||
second path is attempted. A new node is created for each of the paths.
|
||||
If a path succeeds, the corresponding node is appended to the parent.
|
||||
|
||||
The alternative handler is implemented in a similar way, the only difference
|
||||
being the logical rules of the attempted paths.
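
The handlers described in this subsection can be sketched in one function. This is an illustrative approximation, not the project's code: tokens are plain (category, value) pairs, \texttt{Node} is a simple record, all names are assumptions, and optional concatenation is assumed to try the path that includes the first child before the path that skips it.

```python
from dataclasses import dataclass, field


class ParseError(Exception):
    """Syntax error raised by terminals and propagated by compound handlers."""


@dataclass
class Terminal:
    category: object = None  # None disables the category check
    value: object = None     # None disables the value check


@dataclass
class Concatenation:
    first: object
    second: object


@dataclass
class OptionalConcatenation:
    first: object
    second: object


@dataclass
class Alternative:
    first: object
    second: object


@dataclass
class Node:
    name: object = None
    children: list = field(default_factory=list)


@dataclass
class ParserState:
    tokens: list    # (category, value) pairs
    grammar: dict   # definition name -> production
    index: int = 0  # the token iterator


def parse(state, production, parent):
    """Parse one production at the current token and attach it to parent."""
    if isinstance(production, str):
        # Name resolution: the string names the node wrapping the result.
        node = Node(production)
        parse(state, state.grammar[production], node)
        parent.children.append(node)
    elif isinstance(production, Terminal):
        if state.index >= len(state.tokens):
            raise ParseError("unexpected end of input")
        token = state.tokens[state.index]
        category, value = token
        if (production.category not in (None, category)
                or production.value not in (None, value)):
            raise ParseError(f"unexpected token {token!r}")
        state.index += 1  # accept the token
        parent.children.append(token)
    elif isinstance(production, Concatenation):
        node = Node()
        parse(state, production.first, node)  # errors simply propagate
        parse(state, production.second, node)
        parent.children.append(node)
    elif isinstance(production, (OptionalConcatenation, Alternative)):
        if isinstance(production, OptionalConcatenation):
            paths = [(production.first, production.second), (production.second,)]
        else:
            paths = [(production.first,), (production.second,)]
        saved = state.index  # remembered for backtracking
        error = None
        for path in paths:
            state.index = saved  # restore the iterator before each path
            node = Node()        # a fresh node per attempted path
            try:
                for child in path:
                    parse(state, child, node)
            except ParseError as exc:
                error = exc      # keep the error from the later path
                continue
            parent.children.append(node)
            return
        raise error
    else:
        raise TypeError(f"unknown production: {production!r}")


grammar = {
    "root": Concatenation(
        Terminal(value="alpha"),
        OptionalConcatenation(Terminal("identifier"), Terminal(value="beta")),
    )
}
tokens = [("identifier", "alpha"), ("identifier", "beta"), ("end_of_file", "")]
state = ParserState(tokens, grammar)
super_root = Node()
parse(state, "root", super_root)
```

Because a node is attached to its parent only after its path succeeds, abandoning a path discards the partial subtree automatically, and only the token iterator has to be restored explicitly. In the usage at the end, the first path lets the identifier consume \texttt{beta} and then fails, so the handler backtracks and accepts \texttt{beta} through the second path.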
|
||||
|
||||
\subsection{Grammar generator}
|
||||
|
||||
The grammar generator traverses the syntax tree of the parsed grammar file
|
||||
in search of all named nodes corresponding to definitions.
|
||||
|
||||
For each definition, it searches nearby descendant nodes for the definition
|
||||
key and expression. The expression is evaluated recursively until all
|
||||
terminal productions are found. Found compound productions are translated
|
||||
into their production classes. String literals are translated into tokens
|
||||
with the given value. Identifiers are translated into tokens of the given
|
||||
category or into references to other definitions.
|
||||
|
||||
Once all definitions have been evaluated, and prior to exit from the entry point
|
||||
function, semantic rules are validated: the grammar must define a root
|
||||
and it mustn't use reserved identifiers as keys.
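
The two semantic rules can be checked with a few lines of Python. The sketch assumes the grammar description object is a dictionary keyed by definition name, as in the meta-grammar listing; \texttt{SemanticError} and \texttt{validate\_grammar} are hypothetical names.

```python
RESERVED_IDENTIFIERS = frozenset(
    {"identifier", "string_literal", "number_literal", "end_of_file"}
)


class SemanticError(Exception):
    """Raised when a grammar violates one of the semantic constraints."""


def validate_grammar(grammar):
    """Check that a root exists and that no key is a reserved identifier."""
    if "root" not in grammar:
        raise SemanticError("the grammar must define 'root'")
    clashes = sorted(RESERVED_IDENTIFIERS.intersection(grammar))
    if clashes:
        raise SemanticError(f"reserved identifiers used as keys: {clashes}")
```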
|
||||
|
||||
\section{Test cases}
|
||||
|
||||
The most important test case validates backtracking. Given the following
|
||||
production:
|
||||
|
||||
\begin{verbatim}root = concat(
|
||||
"alpha",
|
||||
opt_concat(
|
||||
identifier,
|
||||
"beta"
|
||||
)
|
||||
);
|
||||
\end{verbatim}
|
||||
|
||||
It must be able to recognize the string \texttt{alpha beta}. A naive greedy
|
||||
algorithm would consume the token \texttt{beta} as the identifier rather
|
||||
than the token \texttt{"beta"}, leaving \texttt{opt\_concat} unable to
|
||||
consume \texttt{beta} as its second child, thus failing the validation.
|
||||
|
||||
A more exhaustive test case would be to provide a grammar for JSON and
|
||||
successfully validate a file against it.
|
||||
|
||||
\end{document}
|
||||
BIN
inspirations/mskarzyn ECOTE documentation.pdf
Normal file
Binary file not shown.
9
preliminaryReport/actualReport/labNotes.tx
Normal file
@ -0,0 +1,9 @@
|
||||
Every error message
|
||||
Every possible input
|
||||
Design code for errors
|
||||
Design of test cases to introductory document
|
||||
Execution must be presented during hours scheduled in laboratory
|
||||
Write code that is easy to modify
|
||||
|
||||
Test cases:
|
||||
Input data and result of test (input/output of data) -> both correct and incorrect data
|
||||
11
preliminaryReport/actualReport/labNotes.txt
Normal file
@ -0,0 +1,11 @@
|
||||
Every error message
|
||||
Every possible input
|
||||
Design code for errors
|
||||
Design of test cases to introductory document
|
||||
Execution must be presented during hours scheduled in laboratory
|
||||
Write code that is easy to modify
|
||||
|
||||
Test cases:
|
||||
Input data and result of test (input/output of data) -> both correct and incorrect data
|
||||
Decide whether to use ANTLR
|
||||
Backus one
|
||||
BIN
preliminaryReport/actualReport/report.pdf
Normal file
Binary file not shown.
45
preliminaryReport/actualReport/report.tex
Normal file
@ -0,0 +1,45 @@
|
||||
\documentclass[12pt]{article}
|
||||
|
||||
\date{\today}
|
||||
\title{ECOTE - preliminary project \\
|
||||
Translator of a LaTeX subset to HTML
|
||||
}
|
||||
\author{Krzysztof Rudnicki, 307585 \\
|
||||
Semester: 2023L}
|
||||
|
||||
\begin{document}
|
||||
\maketitle
|
||||
\section{General overview and assumptions}
|
||||
initial task proposals (at least: assumptions, variant selection, implementation technology, scope, etc.). \\
|
||||
My task is to create a translator of a \LaTeX \, subset to a selected text format, with a focus on \LaTeX \, tables. \\
|
||||
I chose HTML as the target format because I know \LaTeX \, very well and HTML relatively well. HTML is also easy, somewhat different from \LaTeX, and popular, which makes this translator a practical tool.
|
||||
\subsection{Assumptions}
|
||||
\begin{itemize}
|
||||
\item No \LaTeX \, (\%) comments in the script
|
||||
\item There are no extra packages in \LaTeX \, script (provided with \\ \textbackslash usepackage keyword) besides ones distributed with \LaTeX
|
||||
\item There are no extra classes in \LaTeX \, script besides ones distributed with \LaTeX
|
||||
\item There is nothing between \textbackslash documentclass keyword and \\ \textbackslash begin\{document\} keyword
|
||||
\item No standard \LaTeX \, instructions are modified in the script
|
||||
\item ``Tables'' will be represented using the \LaTeX \, \emph{table} environment
|
||||
\end{itemize}
|
||||
\section{Functional requirements}
|
||||
\subsection{\LaTeX \, subset}
|
||||
This project will focus almost exclusively on the \emph{table} environment, \\
|
||||
more specifically a table environment containing a tabular environment inside of it.
|
||||
\section{Implementation}
|
||||
I decided to implement my solution in Python. \\
|
||||
The reasons for using Python are as follows:
|
||||
\begin{enumerate}
|
||||
\item It is the easiest language among those that I know
|
||||
\item I know it well enough to be confident in my ability to implement this solution in Python
|
||||
\item I want to learn more Python through this project
|
||||
\end{enumerate}
|
||||
The main drawback of Python, its slow execution speed, does not bother me, as I believe the project scope will not be big enough for this to become an issue.
|
||||
|
||||
\subsection{General architecture}
|
||||
\subsection{Data structures}
|
||||
\subsection{Module descriptions}
|
||||
\subsection{Input/output description}
|
||||
\subsection{Others}
|
||||
\section{Functional test cases}
|
||||
\end{document}
|
||||
Binary file not shown.
BIN
preliminaryReport/teamsMaterials/ECOTE_TasksG101&104_2023Lv2.pdf
Normal file
Binary file not shown.
BIN
preliminaryReport/teamsMaterials/ECOTE_labIntro101_104.pdf
Normal file
Binary file not shown.
BIN
preliminaryReport/teamsMaterials/ECOTEproject_pattern.doc
Normal file
Binary file not shown.