\documentclass[12pt]{article}
\usepackage{listings}
\usepackage{hyperref}
\usepackage{graphicx}
\title{EARIN project Final report}
\author{Krzysztof Rudnicki \\ Jakub Kliszko}
\begin{document}
\maketitle
\section{Introduction}
The goal of our project was to build an anime recommender model. \\
Given the name of an anime from the database, the model should output a list of recommended anime.

\section{Used data and algorithms}
\subsection{Data}
We used a different dataset from the one originally specified in the project description. \\
We decided to use the Anime Recommendation Database from Kaggle: \href{https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020}{LINK} \\
The main reasons for choosing this dataset were that it is larger and more recent than the original one, that Kaggle describes it as 100\% usable, and that it still comes with a decent number of code examples. \\
We are mostly interested in the \texttt{rating\_complete.csv} file, which contains the ratings given by users who completed the anime.

\subsection{Algorithms}
We decided to use collaborative filtering to develop our model. It makes personalized recommendations based on the preferences of similar users. \\
We represent the anime dataset as embedding vectors. \\
We use a K-nearest neighbors model and decided to test it with different metrics, numbers of neighbors and algorithms. \\
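The approach above can be illustrated with a minimal, dependency-free sketch. It assumes that each anime is represented by the sparse vector of ratings it received and that neighbors are found by brute-force search with cosine distance. The report does not fix these details, and the function names and toy ratings below are hypothetical.

\begin{lstlisting}[language=Python]
import math
from collections import defaultdict


def build_item_vectors(ratings):
    """Map each anime id to a sparse vector {user_id: rating}."""
    vectors = defaultdict(dict)
    for user_id, anime_id, rating in ratings:
        vectors[anime_id][user_id] = rating
    return dict(vectors)


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors (dicts)."""
    dot = sum(v * b[u] for u, v in a.items() if u in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def recommend(anime_id, vectors, k=5):
    """Brute-force search: the k anime closest to anime_id."""
    query = vectors[anime_id]
    distances = [
        (cosine_distance(query, vec), other)
        for other, vec in vectors.items()
        if other != anime_id
    ]
    return [other for _, other in sorted(distances)[:k]]


# Hypothetical toy ratings: (user_id, anime_id, rating)
ratings = [
    (1, "A", 9), (1, "B", 8),
    (2, "A", 10), (2, "B", 9), (2, "C", 2),
    (3, "C", 9), (3, "D", 8),
]
vectors = build_item_vectors(ratings)
print(recommend("A", vectors, k=2))
\end{lstlisting}

In the toy data, anime A and B are rated highly by the same users, so B comes out as the closest neighbor of A.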
\subsubsection{Algorithms}
We decided to test our model with 2 neighbor search algorithms:
\begin{enumerate}
\item Brute
\item Auto
\end{enumerate}
Ball Tree and KD Tree do not work on sparse input (as is the case with our input), so we decided to omit them.

\subsubsection{Neighbor number}
We decided to test our model with 5 different numbers of neighbors:
\begin{enumerate}
\item 5 - A popular starting point for small to medium datasets
\item square root of the available data - Usually helps to balance between underfitting and overfitting
\item half of the available data - Usually useful for checking the overall trend rather than specific nuances
\item logarithm of the available data - Used for very large datasets
\item $n-1$ neighbors, where $n$ is the number of available data points - Usually leads to overgeneralization, as we use all instances except one for prediction
\end{enumerate}
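The five choices above can be computed from the dataset size $n$ as in the sketch below. It relies on assumptions the report does not state: the natural logarithm, rounding to the nearest integer, and clamping each value to the range from 1 to $n-1$. The function name is hypothetical.

\begin{lstlisting}[language=Python]
import math


def candidate_neighbor_counts(n):
    """Return the five candidate k values for n data points."""
    if n < 2:
        raise ValueError("need at least two data points")
    candidates = {
        "fixed": 5,
        "sqrt": round(math.sqrt(n)),
        "half": n // 2,
        "log": round(math.log(n)),  # assumption: natural log
        "all_but_one": n - 1,
    }
    # A query has at most n - 1 neighbors besides itself,
    # and k must be at least 1.
    return {
        name: max(1, min(k, n - 1))
        for name, k in candidates.items()
    }


print(candidate_neighbor_counts(10000))
\end{lstlisting}

For $n = 10000$ this yields 5, 100, 5000, 9 and 9999 neighbors respectively.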
\subsubsection{Metrics}
For the brute algorithm we tested the model with all available metrics:
\begin{enumerate}
\item Cityblock
\item Cosine
\item Euclidean
\item l1
\item l2
\item Manhattan
\end{enumerate}
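For reference, under their usual definitions the six metric names above correspond to three distinct distance functions. We assume here the common convention in which Cityblock, l1 and Manhattan are synonyms, as are Euclidean and l2. For two vectors $x$ and $y$ with $m$ components each:
\[ d_{1}(x, y) = \sum_{i=1}^{m} |x_i - y_i| \]
\[ d_{2}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \]
\[ d_{\cos}(x, y) = 1 - \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \, \sqrt{\sum_{i=1}^{m} y_i^2}} \]
Here $d_{1}$ covers Cityblock, l1 and Manhattan, $d_{2}$ covers Euclidean and l2, and $d_{\cos}$ is the Cosine distance. Under this assumption, metrics within the same group should return identical neighbors.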
\section{Intermediate results}
\subsection{Results}
\subsection{Insights}

\section{Using the program}
\subsection{Arguments}
\subsubsection{Default arguments}
\subsubsection{Reproducing}

\section{Final experimental results}
\subsection{Experiments}
\subsection{Results}
\subsection{Discussion}
\subsection{Comparison}

\section{Challenges}
\subsection{Challenges themselves}
\subsection{Tackling challenges}

\section{Conclusions}
\paragraph{Best algorithm}
\subsection{Solution satisfaction}
\subsection{Potential improvements}

\end{document}