chore: initial commit

2026-07-04 16:03:11 +02:00 · 2023-04-19 14:31:31 +02:00 · 2023-04-19 14:31:31 +02:00 · 35bf301586
commit 35bf301586
parent 85f8bb0c31
17 changed files with 1309 additions and 0 deletions
--- a/helpfulMaterials/Floats
+++ b/helpfulMaterials/Floats
--- a/helpfulMaterials/table
+++ b/helpfulMaterials/table
--- a/helpfulMaterials/tabular
+++ b/helpfulMaterials/tabular
--- a/inspirations/ECOTE_project_documentation.pdf
+++ b/inspirations/ECOTE_project_documentation.pdf
--- a/inspirations/ECOTEproject_CanerKaya.pdf
+++ b/inspirations/ECOTEproject_CanerKaya.pdf
--- a/inspirations/PreliminaryProjectTomkiewicz.pdf
+++ b/inspirations/PreliminaryProjectTomkiewicz.pdf
--- a/inspirations/godBlessLachcim/lachcim.pdf
+++ b/inspirations/godBlessLachcim/lachcim.pdf
--- a/inspirations/godBlessLachcim/lachcim.tex
+++ b/inspirations/godBlessLachcim/lachcim.tex
@@ -0,0 +1,426 @@
+\documentclass{article}
+
+\usepackage{graphicx}
+\usepackage{pdfpages}
+\usepackage{hyperref}
+\setlength{\parskip}{1em}
+
+\begin{document}
+
+	\title{ECOTE preliminary report:\\
+	Top-down parser with backtracking}
+	\author{Michał Szopiński 300182}
+	\date{May 11, 2022}
+	\maketitle
+
+	\section{General overview and assumptions}
+
+	The goal of this project is to write a program to parse and produce a syntax
+	tree for an arbitrary input file using an arbitrary grammar.
+
+	The parsing is to be implemented using a top-down recursive descent
+	algorithm, i.e. one that attempts to find a combination of productions
+	matching the input token sequence, starting from the root production.
+	Backtracking means that the algorithm may abandon previously chosen
+	productions if it discovers that they cannot lead to a match.
+
+	Because a parser operates on tokens, which are produced during the lexical
+	analysis stage, the program must have a built-in lexer utility. To reduce
+	complexity, the lexeme recognition algorithm is hard-coded and not
+	customizable. The built-in lexer recognizes tokens that are common to
+	popular C-like languages.
+
+	As mentioned before, the program checks arbitrary inputs against arbitrary
+	grammars. This implies that the user supplies two files, one containing
+	the input and one containing a description of the grammar.
+
+	The program tokenizes both files using the built-in lexer and parses the
+	grammar description file using a hard-coded grammar description
+	meta-language. The produced syntax tree is then validated and transformed
+	into a grammar descriptor object, which is in turn used to parse the input
+	file. As such, the same parser may be used to process both input files.
+
+	The program implements rudimentary diagnostics and error handling. In
+	particular, the user may receive lexical, parse and semantic errors during
+	each stage of processing. Changes in the syntax tree are also displayed
+	as they occur.
+
+	\section{Functional requirements}
+
+	The programming language of choice for this project is Python. Its dynamic
+	typing makes it suitable for straightforward operations on complex data
+	types. The previous proposal of using C/C++ has been withdrawn.
+
+	\subsection{Lexical analysis}
+
+	Because the lexical analyser is hard-coded, it must strive to resemble the
+	lexical ruleset of mainstream C-like languages, so as to match user
+	expectations. A set of popular token categories is defined:
+
+	\begin{center}
+	\begin{tabular}{ |c|p{2.5cm}|p{6cm}| }
+		\hline
+			Category & Examples & Description \\
+		\hline
+			Identifier & \texttt{hello\_world123} & Used for variable names and keywords. \\
+		\hline
+			Operator & \texttt{\$ ++ ===} & Used to define multiple-character non-identifier entities. \\
+		\hline
+			Separator & \texttt{, ; ( \}} & Used to define single-character non-identifier entities, typically neighboring each other. \\
+		\hline
+			String literal & \texttt{"can't" 'won\textbackslash't'} & Incorporates rules for string enclosure and escaping. \\
+		\hline
+			Number literal & \texttt{123 +1.0} & Incorporates rules for digit sequences, sign prefixes and decimal points. \\
+		\hline
+			Comment & \texttt{//hello \newline /* world */} & Incorporates rules for single-line and multi-line comments. \\
+		\hline
+			Invalid & \texttt{123abc "hello} & Marks lexical errors. Used for diagnostics. \\
+		\hline
+			End of file & & Denotes the end of the input file. Used for grammar description. \\
+		\hline
+	\end{tabular}
+	\end{center}
+
+	\subsubsection{Scanning and evaluation}
+
+	Most of the above tokens are produced during the scanning phase. The
+	end-of-file token is appended at the end of the token sequence during the
+	evaluation phase. Comment tokens are removed from the sequence before they
+	reach the parser. The presence of invalid tokens prevents the program from
+	progressing to the parsing phase.
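A minimal sketch of the evaluation phase described above, assuming tokens are plain (category, value) pairs. The `Token` type, the category strings, `LexicalError` and the `evaluate` function are hypothetical names chosen for illustration and are not taken from the project.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    category: str  # assumed names: "identifier", "comment", "invalid", ...
    value: str


class LexicalError(Exception):
    """Raised when the scanned sequence contains invalid tokens."""


def evaluate(scanned: List[Token]) -> List[Token]:
    """Post-process the output of the scanning phase."""
    invalid = [t for t in scanned if t.category == "invalid"]
    if invalid:
        # Invalid tokens stop the program before the parsing phase.
        raise LexicalError(f"invalid tokens: {[t.value for t in invalid]}")
    # Comments never reach the parser.
    tokens = [t for t in scanned if t.category != "comment"]
    # The end-of-file token is appended last.
    tokens.append(Token("end_of_file", ""))
    return tokens
```

Checking for invalid tokens first means that a broken input is reported before any other transformation is applied.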
+
+	\subsection{Grammar description meta-language}
+
+	Once the grammar description file has been tokenized using the universal
+	lexer, the program applies a predefined meta-grammar to parse the file into
+	a syntax tree for further processing.
+
+	At the top level, the meta-language is a set of definitions describing each
+	production in the language. The fundamental building blocks for definitions
+	are binary \textbf{compound expressions} and \textbf{terminal expressions}.
+
+	Compound expressions are the framework for backtracking recursive descent
+	logic. They accept two arguments and define the logical relation between
+	them. Three such expressions are defined:
+
+	\begin{enumerate}
+		\item \textbf{Concatenation} - accepts if both arguments accept.
+		\item \textbf{Optional concatenation} - accepts if either both or only
+		the second argument accepts.
+		\item \textbf{Alternative} - accepts if either argument accepts.
+	\end{enumerate}
+
+	Terminal expressions are used to describe the terminal symbols of the
+	language. Three kinds of such expressions may be discerned:
+
+	\begin{enumerate}
+		\item \textbf{String literal} - accepts a token of any category whose
+		value is equal to that enclosed in the literal.
+		\item \textbf{Identifier}
+		\begin{enumerate}
+			\item \textbf{Reserved identifier} - identifier belonging to the set \texttt{identifier string\_literal number\_literal end\_of\_file}.
+			Accepts a token of any value belonging to the matching category.
+			\item \textbf{Arbitrary identifier} - resolves to a different definition in the grammar.
+		\end{enumerate}
+	\end{enumerate}
+
+	\subsubsection{Formal description of the meta-language}
+
+	The following is a formal description of the above rules, written as a
+	grammar description object using Python syntax:
+
+	\scriptsize\begin{verbatim}meta_grammar = {
+    "root": Alternative(
+        "definitions",
+        Terminal("end_of_file")
+    ),
+    "definitions": Concatenation(
+        "definition",
+        Alternative(
+            "definitions",
+            Terminal("end_of_file")
+        )
+    ),
+    "definition": Concatenation(
+        "definition_key",
+        Concatenation(
+            Terminal("operator", "="),
+            Concatenation(
+                "definition_expression",
+                Terminal("separator", ";")
+            )
+        )
+    ),
+    "definition_key": Terminal("identifier"),
+    "definition_expression": "expression",
+    "expression": Alternative(
+        "concat_expression",
+        Alternative(
+            "opt_concat_expression",
+            Alternative(
+                "alt_expression",
+                Alternative(
+                    "expr_identifier",
+                    "expr_string_literal"
+                )
+            )
+        )
+    ),
+    "expr_identifier": Terminal("identifier"),
+    "expr_string_literal": Terminal("string_literal"),
+    "concat_expression": Concatenation(
+        Terminal("identifier", "concat"),
+        "argument"
+    ),
+    "opt_concat_expression": Concatenation(
+        Terminal("identifier", "opt_concat"),
+        "argument"
+    ),
+    "alt_expression": Concatenation(
+        Terminal("identifier", "alt"),
+        "argument"
+    ),
+    "argument": Concatenation(
+        Terminal("separator", "("),
+        Concatenation(
+            "expr_arg1",
+            Concatenation(
+                Terminal("separator", ","),
+                Concatenation(
+                    "expr_arg2",
+                    Terminal("separator", ")")
+                )
+            )
+        )
+    ),
+    "expr_arg1": "expression",
+    "expr_arg2": "expression"
+}\end{verbatim}
+
+	\normalsize There are two additional semantic constraints: (1) there must
+	be a definition named \texttt{root}, and (2) there mustn't be any
+	definitions whose names belong to the set of reserved identifiers.
+
+	\subsection{Top-down parser}
+
+	The parser is the core feature of the software. It takes the root production
+	of the given grammar and attempts to find a set of productions stemming from
+	the root which could accept all the tokens in the sequence. It does so by
+	implementing the logical rules of the three compound expressions discussed
+	earlier.
+
+	Each step of the parser is a recursive call to a function which processes
+	a single binary or terminal production. If it is determined that the set of
+	logical rules for that production cannot yield a combination of productions to
+	parse the entire token sequence, the function raises an exception and returns
+	control to its caller.
+
+	Exceptions don't originate at compound productions; they are merely propagated
+	upwards by them. All exceptions stem from terminal productions at the leaves
+	of the production tree. A terminal symbol matches the current token in the
+	sequence against its signature and either increments the token iterator
+	(``accepts'' the token), or raises an error to be handled by the logic of
+	compound productions higher in the syntax tree.
+
+	Backtracking is achieved by remembering the state of the token iterator at
+	the initialization of a compound production. If one path fails to parse
+	the token sequence, the iterator is reset and a different path is tried.
+	If neither path succeeds, the error from the later path is propagated
+	upwards, where backtracking may occur as well. If both paths are exhausted
+	at the root level, the token sequence is declared unparseable.
+
+	The above algorithm merely checks the validity of the token sequence against
+	the grammar. To build a parse tree, each call to the parsing function may
+	additionally result in the addition of a node to a data structure mirroring
+	the history of chosen productions. Backtracking rules apply.
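The validity check described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: tokens are assumed to be (category, value) pairs, the field names `first` and `second` and the `parse` function are hypothetical, and the token position is passed by value, so "resetting the iterator" amounts to reusing the saved `pos`. The order in which the two paths of an optional concatenation are tried is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Token = Tuple[str, str]  # (category, value); an assumed representation


class ParseError(Exception):
    """Raised by terminals; compound productions only propagate it."""


@dataclass
class Terminal:
    category: Optional[str] = None  # None disables the category check
    value: Optional[str] = None     # None disables the value check


@dataclass
class Concatenation:
    first: "Production"
    second: "Production"


@dataclass
class OptionalConcatenation:
    first: "Production"
    second: "Production"


@dataclass
class Alternative:
    first: "Production"
    second: "Production"


Production = Union[Terminal, Concatenation, OptionalConcatenation, Alternative]


def parse(prod: Production, tokens: List[Token], pos: int) -> int:
    """Return the position after matching prod at pos, or raise ParseError."""
    if isinstance(prod, Terminal):
        if pos >= len(tokens):
            raise ParseError("unexpected end of input")
        category, value = tokens[pos]
        if prod.category not in (None, category) or prod.value not in (None, value):
            raise ParseError(f"unexpected token {value!r}")
        return pos + 1  # "accept" the token
    if isinstance(prod, Concatenation):
        # No backtracking here: any error simply propagates.
        return parse(prod.second, tokens, parse(prod.first, tokens, pos))
    if isinstance(prod, OptionalConcatenation):
        try:  # first path: both children
            return parse(prod.second, tokens, parse(prod.first, tokens, pos))
        except ParseError:
            # Backtrack: retry from the saved position with the second child only.
            return parse(prod.second, tokens, pos)
    if isinstance(prod, Alternative):
        try:
            return parse(prod.first, tokens, pos)
        except ParseError:
            return parse(prod.second, tokens, pos)
    raise TypeError(f"unknown production: {prod!r}")


# concat("alpha", opt_concat(identifier, "beta")) applied to the input `alpha beta`
root = Concatenation(
    Terminal(value="alpha"),
    OptionalConcatenation(Terminal("identifier"), Terminal(value="beta")),
)
tokens = [("identifier", "alpha"), ("identifier", "beta")]
end = parse(root, tokens, 0)
assert end == len(tokens)  # succeeds only after the first path is abandoned
```

In this sketch backtracking is local to a single compound production: once a production has returned successfully, its choice is not revisited if a later sibling fails.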
+
+	\subsection{Grammar generator}
+
+	Parsing the grammar description file against the meta-grammar yields a
+	syntax tree containing named and anonymous nodes corresponding to various
+	productions. The grammar generator searches this tree for definitions
+	and recursively parses them to build a dictionary of named productions
+	(a grammar description object) for the input file.
+
+	\section{Implementation}
+
+	\subsection{General architecture}
+
+	The program is divided into the entry point script and several modules,
+	each providing a separate layer of functionality.
+
+	\begin{center}
+	\begin{tabular}{ |c|p{8.5cm}| }
+		\hline
+			Module & Description \\
+		\hline
+			Entry point & Handles user interaction, file I/O and data flow between the main modules of the program. \\
+		\hline
+			Diagnostic & Contains functions for displaying data, visualizing data structures and printing diagnostic messages. \\
+		\hline
+			Lexer & Implements a finite-state machine to parse the raw input into tokens. \\
+		\hline
+			Lexer handlers & Defines the delta function of the finite-state machine. \\
+		\hline
+			Meta-language & Contains the hard-coded grammar description object for the meta-language. \\
+		\hline
+			Productions & Defines classes for compound and terminal productions. \\
+		\hline
+			Parser & Utilities for initializing a top-down recursive descent. \\
+		\hline
+			Parser handlers & Logical rules for parsing productions. \\
+		\hline
+			Grammar & Syntax tree analysis and grammar description object generation. \\
+		\hline
+	\end{tabular}
+	\end{center}
+
+	\subsection{Data structures}
+
+	\subsubsection{Productions}
+
+	Four classes are defined to describe the three non-terminal and the single
+	terminal production types: \texttt{Concatenation},
+	\texttt{OptionalConcatenation}, \texttt{Alternative} and \texttt{Terminal}.
+
+	The non-terminal productions hold two slots for their children nodes. They
+	are separate because the parser function looks at the type of the production
+	to invoke the appropriate handler.
+
+	The terminal production holds a slot for the category and the value of the
+	token it matches against. Each may be null to disable verification for that
+	field. A method is provided for matching against tokens.
+
+	\subsubsection{Syntax node}
+
+	The \texttt{Node} class holds a single node of the syntax tree. It has a
+	name field for named productions and a children field. It may hold other
+	nodes, representing compound productions, or tokens, representing terminal
+	productions. Named terminal productions are wrapped in a single-child
+	\texttt{Node} object.
+
+	To facilitate backtracking, the class exposes methods for adding and
+	removing children without directly accessing the children field.
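A minimal sketch of such a node class, assuming children are stored in a private list. The method names `add_child` and `remove_child` and the read-only `children` property are hypothetical; the source only states that adding and removing go through methods.

```python
from typing import Any, List, Optional, Tuple


class Node:
    """One node of the syntax tree; children are nodes or tokens."""

    def __init__(self, name: Optional[str] = None) -> None:
        self.name = name  # None for anonymous productions
        self._children: List[Any] = []

    def add_child(self, child: Any) -> None:
        self._children.append(child)

    def remove_child(self, child: Any) -> None:
        """Undo an earlier add_child, e.g. when a path is abandoned."""
        self._children.remove(child)

    @property
    def children(self) -> Tuple[Any, ...]:
        # Read-only view, so callers cannot mutate the list directly.
        return tuple(self._children)
```

Keeping the list private gives the parser a single place to undo additions when it backtracks.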
+
+	\subsubsection{State classes}
+
+	The classes \texttt{MachineState} and \texttt{ParserState} are data
+	aggregates representing the internal state of the lexer and the parser,
+	respectively.
+
+	The \texttt{MachineState} class contains an assortment of states necessary
+	to provide context for tokenization.
+
+	The \texttt{ParserState} class holds the token sequence and the grammar
+	that the parser is currently operating on, as well as the token iterator.
+
+	\subsection{Detailed implementation}
+
+	\subsubsection{Lexer}
+
+	The lexer is a finite-state machine. The lexing process begins by
+	initializing the state. The input file is then scanned character by character
+	to determine which characters constitute which tokens. On the boundary between
+	tokens and non-tokens (or neighboring tokens), the currently recognized token
+	is appended to the output sequence.
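The boundary-driven scanning described above can be illustrated with a deliberately tiny state machine. It is a sketch under loud assumptions: it recognizes only identifiers and separators, marks every other non-space character as invalid, and the names `scan`, `SEPARATORS` and the two state labels are hypothetical.

```python
from typing import List, Tuple

SEPARATORS = set(",;(){}[]")  # assumed separator set


def scan(text: str) -> List[Tuple[str, str]]:
    """Split text into (category, value) tokens, one character at a time."""
    tokens: List[Tuple[str, str]] = []
    state, buffer = "idle", ""
    for ch in text + "\n":  # the sentinel newline flushes the last token
        if state == "identifier":
            if ch.isalnum() or ch == "_":
                buffer += ch
                continue
            # Boundary reached: emit the token recognized so far.
            tokens.append(("identifier", buffer))
            state, buffer = "idle", ""
        # From here on the machine is in the "idle" state.
        if ch.isalpha() or ch == "_":
            state, buffer = "identifier", ch
        elif ch in SEPARATORS:
            tokens.append(("separator", ch))
        elif not ch.isspace():
            tokens.append(("invalid", ch))
    return tokens
```

A fuller lexer would add states for operators, literals and comments, but the emit-on-boundary structure stays the same.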
+
+	Once the entire input is scanned, an evaluation phase occurs, where transformations
+	are performed on the output sequence. Comments are removed and the
+	end-of-file token is appended.
+
+	\subsubsection{Parser}
+
+	The parser is initialized by creating a ``super-root'' node and invoking
+	the parser function on the first token in the sequence.
+
+	The parser function accepts three arguments:
+	\begin{enumerate}
+		\item The current parser state, \texttt{ParserState}.
+		\item The prescribed production, either one of the four production types
+		or a string to be resolved from the grammar description object.
+		\item The parent node, where the parsed production is to be added as a
+		child node.
+	\end{enumerate}
+	The root element is parsed by specifying the prescribed production as
+	\texttt{"root"} and the parent node as the super-root. Upon exit, the
+	entry point function returns the first child of the super-root, i.e. the
+	root node.
+
+	If the production is specified as a string, the main parser function
+	performs name resolution to obtain the corresponding production class.
+	The specified production string then becomes the name for the node to be
+	appended to the parent node. Named productions aid in syntax tree analysis.
+
+	Once the production class is resolved, the main function looks up and
+	invokes the appropriate handler for that production.
+
+	\subsubsection{Terminal handler}
+
+	Terminal handlers accept input tokens and are the source of syntax errors,
+	crucial to the backtracking mechanism. The root node may be a terminal node,
+	in which case the language only accepts a single token.
+
+	The terminal handler resolves the token at the current index and compares
+	it against the production's signature. In case of category or value
+	mismatch, a syntax error is raised and propagated upwards in the call stack.
+
+	Upon success, the token iterator is incremented and a token is added to the
+	parent node. If the terminal production is a named production, the token
+	is wrapped in a single-child named node first.
+
+	\subsubsection{Non-terminal handlers}
+
+	The concatenation handler parses its two children in sequence. If either of
+	them fails, the error is propagated. No backtracking occurs in this handler.
+
+	The optional concatenation handler tries two paths: one where the first
+	child is skipped and one where it is not. If both paths fail, the error
+	from the second child is propagated.
+
+	Backtracking is implemented by saving the token iterator before attempting
+	the first path. If the first path fails, the iterator is restored and the
+	second path is attempted. A new node is created for each of the paths.
+	If a path succeeds, the corresponding node is appended to the parent.
+
+	The alternative handler is implemented in a similar way, the only difference
+	being the logical rules of the attempted paths.
+
+	\subsection{Grammar generator}
+
+	The grammar generator traverses the syntax tree of the parsed grammar file
+	in search of all named nodes corresponding to definitions.
+
+	For each definition, it searches nearby descendant nodes for the definition
+	key and expression. The expression is evaluated recursively until all
+	terminal productions are found. Found compound productions are translated
+	into their production classes. String literals are translated into tokens
+	with the given value. Identifiers are translated into tokens of the given
+	category or into references to other definitions.
+
+	When definitions are evaluated and prior to exit from the entry point
+	function, semantic rules are validated: the grammar must define a root
+	and it mustn't use reserved identifiers as keys.
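The two semantic rules can be sketched as a single check over the generated dictionary. The reserved names are the ones listed in the meta-language section; `SemanticError` and `validate_grammar` are hypothetical names, and the grammar is assumed to be a plain dictionary keyed by definition name.

```python
from typing import Any, Dict

RESERVED = {"identifier", "string_literal", "number_literal", "end_of_file"}


class SemanticError(Exception):
    """Raised when a grammar description violates a semantic rule."""


def validate_grammar(grammar: Dict[str, Any]) -> None:
    """Raise SemanticError if the grammar breaks either rule."""
    if "root" not in grammar:
        raise SemanticError("the grammar must define 'root'")
    # Definition keys must not shadow the reserved token categories.
    clashes = sorted(RESERVED.intersection(grammar))
    if clashes:
        raise SemanticError(f"reserved identifiers used as keys: {clashes}")
```

Running the check once, after all definitions are collected, keeps the recursive translation free of semantic concerns.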
+
+	\section{Test cases}
+
+	The most important test case validates backtracking. Given the following
+	production:
+
+	\begin{verbatim}root = concat(
+    "alpha",
+    opt_concat(
+        identifier,
+        "beta"
+    )
+)
+\end{verbatim}
+
+	The parser must be able to recognize the string \texttt{alpha beta}. A naive greedy
+	algorithm would consume the token \texttt{beta} as the identifier rather
+	than as the literal \texttt{"beta"}, leaving \texttt{opt\_concat} unable to
+	consume \texttt{beta} as its second child, thus failing the validation.
+
+	A more exhaustive test case would be to provide a grammar for JSON and
+	successfully validate a file against it.
+
+\end{document}
--- a/inspirations/mskarzyn
+++ b/inspirations/mskarzyn
--- a/preliminaryReport/actualReport/labNotes.tx
+++ b/preliminaryReport/actualReport/labNotes.tx
@@ -0,0 +1,9 @@
+Every error message
+Every possible input
+Design code for errors 
+Design of test cases to introductory document 
+Execution must be presented during hours scheduled in laboratory
+Write code that is easy to modify
+
+Test cases:
+Input data and result of test (input/output of data) -> both correct and incorrect data
--- a/preliminaryReport/actualReport/labNotes.txt
+++ b/preliminaryReport/actualReport/labNotes.txt
@@ -0,0 +1,11 @@
+Every error message
+Every possible input
+Design code for errors 
+Design of test cases to introductory document 
+Execution must be presented during hours scheduled in laboratory
+Write code that is easy to modify
+
+Test cases:
+Input data and result of test (input/output of data) -> both correct and incorrect data
+decide whether to use ANTLR
+bachus one
--- a/preliminaryReport/actualReport/report.pdf
+++ b/preliminaryReport/actualReport/report.pdf
--- a/preliminaryReport/actualReport/report.tex
+++ b/preliminaryReport/actualReport/report.tex
@@ -0,0 +1,45 @@
+\documentclass[12pt]{article}
+
+\date{\today}
+\title{ECOTE - preliminary project \\ 
+Translator of a LaTeX subset to HTML
+}
+\author{Krzysztof Rudnicki, 307585 \\
+Semester: 2023L}
+
+\begin{document}
+\maketitle
+\section{General overview and assumptions}
+initial task proposals (at least: assumptions, variant selection, implementation technology, scope, etc.). \\
+My task is to create a translator of a \LaTeX \, subset to a selected text format, with a focus on \LaTeX \, tables. \\
+I decided to change the task to a translator of a \LaTeX \, subset to HTML, since I know \LaTeX \, very well and HTML relatively well. HTML is also easy, somewhat different from \LaTeX, and popular, which makes this translator a practical tool.
+\subsection{Assumptions}
+\begin{itemize}
+    \item No \LaTeX \, (\%) comments in the script 
+    \item There are no extra packages in \LaTeX \, script (provided with \\ \textbackslash usepackage keyword) besides ones distributed with \LaTeX
+    \item There are no extra classes in \LaTeX \, script besides ones distributed with \LaTeX
+    \item There is nothing between \textbackslash documentclass keyword and \\ \textbackslash begin\{document\} keyword 
+    \item No standard \LaTeX \, instructions are modified in the script
+    \item ``Tables'' will be represented using the \LaTeX \, \emph{table} environment
+\end{itemize}
+\section{Functional requirements}
+\subsection{\LaTeX \, subset}
+This project will focus almost exclusively on the \emph{table} environment, \\
+more specifically on a table environment containing a tabular environment.
+\section{Implementation}
+I decided to implement my solution in Python. \\
+The reasons for using Python are as follows:
+\begin{enumerate}
+    \item It is the easiest language among those that I know
+    \item I know it well enough to be confident in my ability to implement this solution in Python
+    \item I want to learn more Python through this project
+\end{enumerate}
+Python's main drawback, its slow execution speed, does not bother me, as I believe the project scope will not be big enough for this to become an issue.
+  
+\subsection{General architecture}
+\subsection{Data structures}
+\subsection{Module descriptions}
+\subsection{Input/output description}
+\subsection{Others}
+\section{Functional test cases}
+\end{document}
--- a/preliminaryReport/teamsMaterials/ECOTE_TaskAssignmentG101&104_2023Lv2.pdf
+++ b/preliminaryReport/teamsMaterials/ECOTE_TaskAssignmentG101&104_2023Lv2.pdf
--- a/preliminaryReport/teamsMaterials/ECOTE_TasksG101&104_2023Lv2.pdf
+++ b/preliminaryReport/teamsMaterials/ECOTE_TasksG101&104_2023Lv2.pdf
--- a/preliminaryReport/teamsMaterials/ECOTE_labIntro101_104.pdf
+++ b/preliminaryReport/teamsMaterials/ECOTE_labIntro101_104.pdf
--- a/preliminaryReport/teamsMaterials/ECOTEproject_pattern.doc
+++ b/preliminaryReport/teamsMaterials/ECOTEproject_pattern.doc