diff --git a/inspirations/ECOTE_project_documentation.pdf b/inspirations/ECOTE_project_documentation.pdf deleted file mode 100644 index 6ec7b244..00000000 Binary files a/inspirations/ECOTE_project_documentation.pdf and /dev/null differ diff --git a/inspirations/ECOTEproject_CanerKaya.pdf b/inspirations/ECOTEproject_CanerKaya.pdf deleted file mode 100644 index c2053d35..00000000 Binary files a/inspirations/ECOTEproject_CanerKaya.pdf and /dev/null differ diff --git a/inspirations/PreliminaryProjectTomkiewicz.pdf b/inspirations/PreliminaryProjectTomkiewicz.pdf deleted file mode 100644 index c891c13b..00000000 Binary files a/inspirations/PreliminaryProjectTomkiewicz.pdf and /dev/null differ diff --git a/inspirations/godBlessLachcim/lachcim.pdf b/inspirations/godBlessLachcim/lachcim.pdf deleted file mode 100644 index 764819b4..00000000 Binary files a/inspirations/godBlessLachcim/lachcim.pdf and /dev/null differ diff --git a/inspirations/godBlessLachcim/lachcim.tex b/inspirations/godBlessLachcim/lachcim.tex deleted file mode 100644 index 32bebfcd..00000000 --- a/inspirations/godBlessLachcim/lachcim.tex +++ /dev/null @@ -1,426 +0,0 @@ -\documentclass{article} - -\usepaczkage{graphicx} -\usepackage{pdfpages} -\usepackage{hyperref} -\setlength{\parskip}{1em} - -\begin{document} - - \title{ECOTE preliminary report:\\ - Top-down parser with backtracking} - \author{Michał Szopiński 300182} - \date{May 11, 2022} - \maketitle - - \section{General overview and assumptions} - - The goal of this project is to write a program to parse and produce a syntax - tree for an arbitrary input file using an arbitrary grammar. - - The parsing is to be implemented using a top-down recursive descent - algorithm, i.e. one that attempts to find a combination of productions - matching the input token sequence, starting from the root production. - Backtracking means that the algorithm may abandon previously chosen - productions if it discovers that they cannot lead to a match. - - Because a parser operates on tokens, which are produced during the lexical - analysis stage, the program must have a built-in lexer utility. To reduce - complexity, the lexeme recognition algorithm is hard-coded and not - customizable. The built-in lexer recognizes tokens that are common to - popular C-like languages. - - As mentioned before, the program checks arbitrary inputs against arbitrary - grammars. This implies that the user supplies two files, one containing - the input and one containing a description of the grammar. - - The program tokenizes both files using the built-in lexer and parses the - grammar description file using a hard-coded grammar description - meta-language. The produced syntax tree is then validated and transformed - into a grammar descriptor object, which is in turn used to parse the input - file. As such, the same parser may be used to process both input files. - - The program implements rudimentary diagnostics and error handling. In - particular, the user may receive lexical, parse and semantic errors during - each stage of processing. Changes in the syntax tree are also displayed - as they occur. - - \section{Functional requirements} - - The programming language of choice for this project is Python. Its dynamic - typing makes it suitable for straightforward operations on complex data - types. The previous proposal of using C/C++ has been withdrawn. - - \subsection{Lexical analysis} - - Because the lexical analyser is hard-coded, it must strive to resemble the - lexical ruleset of mainstream C-like languages, so as to match user - expectations. A set of popular token categories is defined: - - \begin{center} - \begin{tabular}{ |c|p{2.5cm}|p{6cm}| } - \hline - Category & Examples & Description \\ - \hline - Identifier & \texttt{hello\_world123} & Used for variable names and keywords. \\ - \hline - Operator & \texttt{\$ ++ ===} & Used to define multiple-character non-identifier entities. \\ - \hline - Separator & \texttt{, ; ( \}} & Used to define single-character non-identifier entities, typically neighboring each other. \\ - \hline - String literal & \texttt{"can't" 'won\textbackslash't'} & Incorporates rules for string enclosure and escaping. \\ - \hline - Number literal & \texttt{123 +1.0} & Incorporates rules for digit sequences, sign prefixes and decimal points. \\ - \hline - Comment & \texttt{//hello \newline /* world */} & Incorporates rules for single-line and multi-line comments. \\ - \hline - Invalid & \texttt{123abc "hello} & Marks lexical errors. Used for diagnostics. \\ - \hline - End of file & & Denotes the end of the input file. Used for grammar description. \\ - \hline - \end{tabular} - \end{center} - - \subsubsection{Scanning and evaluation} - - Most of the above tokens are produced during the scanning phase. The - end-of-file token is appended at the end of the token sequence during the - evaluation phase. Comment tokens are removed from the sequence before they - reach the parser. The presence of invalid tokens prevents the program from - progressing to the parsing phase. - - \subsection{Grammar description meta-language} - - Once the grammar description file has been tokenized using the universal - lexer, the program applies a predefined meta-grammar to parse the file into - a syntax tree for further processing. - - At the top level, the meta-language is a set of definitions describing each - production in the language. The fundamental building blocks for definitions - are binary \textbf{compound expressions} and \textbf{terminal expressions}. - - Compound expressions are the framework for backtracking recursive descent - logic. They accept two arguments and define the logical relation between - them. Three such expressions are defined: - - \begin{enumerate} - \item \textbf{Concatenation} - accepts if both arguments accept. - \item \textbf{Optional concatenation} - accepts if either both or only - the second argument accepts. - \item \textbf{Alternative} - accepts if either argument accepts. - \end{enumerate} - - Terminal expressions are used to describe the terminal symbols of the - language. Three kinds of such tokens may be discerned: - - \begin{enumerate} - \item \textbf{String literal} - accepts a token of any category whose - value is equal to that enclosed in the literal. - \item \textbf{Identifier} - \begin{enumerate} - \item \textbf{Reserved identifier} - identifier belonging to the set \texttt{identifier string\_literal number\_literal end\_of\_file}. - Accepts a token of any value belonging to the matching category. - \item \textbf{Arbitrary identifier} - resolves to a different definition in the grammar. - \end{enumerate} - \end{enumerate} - - \subsubsection{Formal description of the meta-language} - - The following is a formal description of the above rules, written as a - grammar description object using Python syntax: - - \scriptsize\begin{verbatim}meta_grammar = { - "root": Alternative( - "definitions", - Terminal("end_of_file") - ), - "definitions": Concatenation( - "definition", - Alternative( - "definitions", - Terminal("end_of_file") - ) - ), - "definition": Concatenation( - "definition_key", - Concatenation( - Terminal("operator", "="), - Concatenation( - "definition_expression", - Terminal("separator", ";") - ) - ) - ), - "definition_key": Terminal("identifier"), - "definition_expression": "expression", - "expression": Alternative( - "concat_expression", - Alternative( - "opt_concat_expression", - Alternative( - "alt_expression", - Alternative( - "expr_identifier", - "expr_string_literal" - ) - ) - ) - ), - "expr_identifier": Terminal("identifier"), - "expr_string_literal": Terminal("string_literal"), - "concat_expression": Concatenation( - Terminal("identifier", "concat"), - "argument" - ), - "opt_concat_expression": Concatenation( - Terminal("identifier", "opt_concat"), - "argument" - ), - "alt_expression": Concatenation( - Terminal("identifier", "alt"), - "argument" - ), - "argument": Concatenation( - Terminal("separator", "("), - Concatenation( - "expr_arg1", - Concatenation( - Terminal("separator", ","), - Concatenation( - "expr_arg2", - Terminal("separator", ")") - ) - ) - ) - ), - "expr_arg1": "expression", - "expr_arg2": "expression" -}\end{verbatim} - - \normalsize There are two additional semantic constraints: (1) there must - be a definition named \texttt{root}, and (2) there mustn't be any - definitions whose names belong to the set of reserved identifiers. - - \subsection{Top-down parser} - - The parser is the core feature of the software. It takes the root production - of the given grammar and attempts to find a set of productions stemming from - the root which could accept all the tokens in the sequence. It does so by - implementing the logical rules of the three compound expressions discussed - earlier. - - Each step of the parser is a recursive call to a function which processes - a single binary or terminal production. If it is determined that the set of - logical rules for that production can not yield a combination of productions to - parse the entire token sequence, the function generates an exception and returns - control to its caller. - - Exceptions don't originate at compound productions, they are merely propagated - upwards by them. All exceptions stem from terminal productions at the leaves - of the production tree. A terminal symbol matches the current token in the - sequence against its signature and either increments the token iterator - (''accepts" the token), or raises an error to be handled by the logic of - compound productions higher in the syntax tree. - - Backtracking is achieved by remembering the state of the token iterator at - the initialization of a compound production. If one path fails to parse - the token sequence, the iterator is reset and a different path is tried. - If neither path succeeds, the error from the later path is propagated - upwards, where backtracking may occur as well. If both paths are exhausted - at the root level, the token tree is declared unparseable. - - The above algorithm merely checks the validity of the token sequence against - the grammar. To build a parse tree, each call to the parsing function may - additionally result in the addition of a node to a data structure mirroring - the history of chosen productions. Backtracking rules apply. - - \subsection{Grammar generator} - - Parsing the grammar description file against the meta-grammar yields a - syntax tree containing named and anonymous nodes corresponding to various - productions. The grammar generator searches this tree for definitions - and recursively parses them to build a dictionary of named productions - (a grammar description object) for the input file. - - \section{Implementation} - - \subsection{General architecture} - - The program is divided into the entry point script and several modules, - each providing a separate layer of functionality. - - \begin{center} - \begin{tabular}{ |c|p{8.5cm}| } - \hline - Module & Description \\ - \hline - Entry point & Handles user interaction, file I/O and data flow between the main modules of the program. \\ - \hline - Diagnostic & Contains functions for displaying data, visualizing data structures and printing diagnostic messages. \\ - \hline - Lexer & Implements a finite-state machine to parse the raw input into tokens. \\ - \hline - Lexer handlers & Defines the delta function of the finite-state machine. \\ - \hline - Meta-language & Contains the hard-coded grammar description object for the meta-language. \\ - \hline - Productions & Defines classes for compound and terminal productions. \\ - \hline - Parser & Utilities for initializing a top-down recursive descent. \\ - \hline - Parser handlers & Logical rules for parsing productions. \\ - \hline - Grammar & Syntax tree analysis and grammar description object generation. \\ - \hline - \end{tabular} - \end{center} - - \subsection{Data structures} - - \subsubsection{Productions} - - Four classes are defined to describe the three non-terminal and the single - terminal production types: \texttt{Concatenation}, - \texttt{OptionalConcatenation}, \texttt{Alternative} and \texttt{Terminal}. - - The non-terminal productions hold two slots for their children nodes. They - are separate because the parser function looks at the type of the production - to invoke the appropriate handler. - - The terminal production holds a slot for the category and the value of the - token it matches against. Each may be null to disable verification for that - field. A method is provided for matching against tokens. - - \subsubsection{Syntax node} - - The \texttt{Node} class holds a single node of the syntax tree. It has a - name field for named productions and a children field. It may hold other - nodes, representing compound productions, or tokens, representing terminal - productions. Named terminal productions are wrapped in a single-child - \texttt{Node} object. - - To facilitate backtracking, the class exposes methods for adding and - removing children without directly accessing the children field. - - \subsubsection{State classes} - - The classes \texttt{MachineState} and \texttt{ParserState} are data - aggregates representing the internal state of the lexer and the parser, - respectively. - - The \texttt{MachineState} class contains an assortment of states necessary - to provide context for tokenization. - - The \texttt{ParserState} class holds the token sequence and the grammar - that the parser is currently operating on, as well as the token iterator. - - \subsection{Detailed implementation} - - \subsubsection{Lexer} - - The lexer is a finite-state machine. The lexing process begins by - initializing the state. The input file is then scanned character by character - to determine which characters constitute which tokens. On the boundary between - tokens and non-tokens (or neighboring tokens), the currently recognized token - is appended to the output sequence. - - Once the entire input is parsed, an evaluation phase occurs, where transformations - are performed on the output sequence. Comments are removed and the end of - file token is appended. - - \subsubsection{Parser} - - The parser is initialized by creating a ``super-root" node and invoking - the parser function on the first token in the sequence. - - The parser function accepts three arguments: - \begin{enumerate} - \item The current parser state, \texttt{ParserState}. - \item The prescribed production, either one of the four production types - or a string to be resolved from the grammar description object. - \item The parent node, where the parsed production is to be added as a - child node. - \end{enumerate} - The root element is parsed by specifying the prescribed production as - \texttt{"root"} and the parent node as the super-root. Upon exit, the - entry point function returns the first child of the super-root, i.e. the - root node. - - If the production is specified as a string, the main parser function - performs name resolution to obtain the corresponding production class. - The specified production string then becomes the name for the node to be - appended to the parent node. Named productions aid in syntax tree analysis. - - Once the production class is resolved, the main function looks up and - invokes the appropriate handler for that production. - - \subsubsection{Terminal handler} - - Terminal handlers accept input tokens and are the source of syntax errors, - crucial to the backtracking mechanism. The root node may be a terminal node, - in which case the language only accepts a single token. - - The terminal handler resolves the token at the current index and compares - it against the production's signature. In case of category or value - mismatch, a syntax error is raised and propagated upwards in the call stack. - - Upon success, the token iterator is incremented and a token is added to the - parent node. If the terminal production is a named production, the token - is wrapped in a single-child named node first. - - \subsubsection{Non-terminal handlers} - - The concatenation handler parses its two children in sequence. If any of - them fails, the error is propagated. No backtracking occurs in this handler. - - The optional concatenation handler tries two paths: one where the first - child is skipped and one where it is not. If both paths fail, the error - from the second child is propagated. - - Backtracking is implemented by saving the token iterator before attempting - the first path. If the first path fails, the iterator is restored and the - second path is attempted. A new node is created for each of the paths. - If a path succeeds, the corresponding node is appended to the parent. - - The alternative handler is implemented in a similar way, the only difference - being the logical rules of the attempted paths. - - \subsection{Grammar generator} - - The grammar generator traverses the syntax tree of the parsed grammar file - in search of all named nodes corresponding to definitions. - - For each definition, it searches nearby descendant nodes for the definition - key and expression. The expression is evaluated recursively until all - terminal productions are found. Found compound productions are translated - into their production classes. String literals are translated into tokens - with the given value. Identifiers are translated into tokens of the given - category or into references to other definitions. - - When definitions are evaluated and prior to exit from the entry point - function, semantic rules are validated: the grammar must define a root - and it mustn't use reserved identifiers as keys. - - \section{Test cases} - - The most important test case validates backtracking. Given the following - production: - - \begin{verbatim}root = concat( - "alpha", - opt_concat( - identifier, - "beta" - ) -) -\end{verbatim} - - It must be able to recognize the string \texttt{alpha beta}. A naive greedy - algorithm would consume the token \texttt{beta} as the identifier rather - that the token \texttt{"beta"}, leaving \texttt{opt\_concat} unable to - consume \texttt{beta} as its second child, thus failing the validation. - - A more exhaustive test case would be to provide a grammar for JSON and - successfully validate a file against it. - -\end{document} \ No newline at end of file diff --git a/inspirations/mskarzyn ECOTE documentation.pdf b/inspirations/mskarzyn ECOTE documentation.pdf deleted file mode 100644 index d5ae951e..00000000 Binary files a/inspirations/mskarzyn ECOTE documentation.pdf and /dev/null differ