feat: add project

Krzysztof Rudnicki 2023-06-16 16:36:31 +02:00
commit a09c96dd65
28 changed files with 1668 additions and 2 deletions

312
.gitignore vendored

@ -1,5 +1,317 @@
<<<<<<< HEAD
*.swp
.DS_Store
bin/configlet
bin/configlet.exe
temp.pl
=======
database
test_results
anime_with_synopsis.csv
anime.csv
animelist.csv
rating_complete.csv
watching_status.csv
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2
.*.lb
## Intermediate documents:
*.dvi
*.xdv
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf
## Generated if empty string is given at "Please type another file name for output:"
.pdf
## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.run.xml
## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex(busy)
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync
## Build tool directories for auxiliary files
# latexrun
latex.out/
## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa
# achemso
acs-*.bib
# amsthm
*.thm
# beamer
*.nav
*.pre
*.snm
*.vrb
# changes
*.soc
# comment
*.cut
# cprotect
*.cpt
# elsarticle (documentclass of Elsevier journals)
*.spl
# endnotes
*.ent
# fixme
*.lox
# feynmf/feynmp
*.mf
*.mp
*.t[1-9]
*.t[1-9][0-9]
*.tfm
#(r)(e)ledmac/(r)(e)ledpar
*.end
*.?end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R
# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
*.glsdefs
*.lzo
*.lzs
*.slg
*.slo
*.sls
# uncomment this for glossaries-extra (will ignore makeindex's style files!)
# *.ist
# gnuplot
*.gnuplot
*.table
# gnuplottex
*-gnuplottex-*
# gregoriotex
*.gaux
*.glog
*.gtex
# htlatex
*.4ct
*.4tc
*.idv
*.lg
*.trc
*.xref
# hyperref
*.brf
# knitr
*-concordance.tex
# TODO Uncomment the next line if you use knitr and want to ignore its generated tikz files
# *.tikz
*-tikzDictionary
# listings
*.lol
# luatexja-ruby
*.ltjruby
# makeidx
*.idx
*.ilg
*.ind
# minitoc
*.maf
*.mlf
*.mlt
*.mtc[0-9]*
*.slf[0-9]*
*.slt[0-9]*
*.stc[0-9]*
# minted
_minted*
*.pyg
# morewrites
*.mw
# newpax
*.newpax
# nomencl
*.nlg
*.nlo
*.nls
# pax
*.pax
# pdfpcnotes
*.pdfpc
# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd
# scrwfile
*.wrt
# svg
svg-inkscape/
# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/
# pdfcomment
*.upa
*.upb
# pythontex
*.pytxcode
pythontex-files-*/
# tcolorbox
*.listing
# thmtools
*.loe
# TikZ & PGF
*.dpth
*.md5
*.auxlock
# titletoc
*.ptc
# todonotes
*.tdo
# vhistory
*.hst
*.ver
# easy-todo
*.lod
# xcolor
*.xcp
# xmpincl
*.xmpi
# xindy
*.xdy
# xypic precompiled matrices and outlines
*.xyc
*.xyd
# endfloat
*.ttt
*.fff
# Latexian
TSWLatexianTemp*
## Editors:
# WinEdt
*.bak
*.sav
# Texpad
.texpadtmp
# LyX
*.lyx~
# Kile
*.backup
# gummi
.*.swp
# KBibTeX
*~[0-9]*
# TeXnicCenter
*.tps
# auto folder when using emacs and auctex
./auto/*
*.el
# expex forward references with \gathertags
*-tags.tex
# standalone packages
*.sta
# Makeindex log files
*.lpz
# xwatermark package
*.xwm
# REVTeX puts footnotes in the bibliography by default, unless the nofootinbib
# option is specified. Footnotes are the stored in a file with suffix Notes.bib.
# Uncomment the next line to have this generated file ignored.
#*Notes.bib
>>>>>>> project/main


@ -1,4 +1,5 @@
{
<<<<<<< HEAD
"recommendations": [
<<<<<<< HEAD
"ms-python.python",
@ -13,3 +14,7 @@
>>>>>>> lab7/main
]
}
=======
"recommendations": ["james-yu.latex-workshop"]
}
>>>>>>> project/main

92
project/README.md Normal file

@ -0,0 +1,92 @@
# EARIN_project
## Getting started
To make it easy for you to get started with GitLab, here's a list of recommended next steps.
Already a pro? Just edit this README.md and make it your own. Want to make it easy? [Use the template at the bottom](#editing-this-readme)!
## Add your files
- [ ] [Create](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-file) or [upload](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#upload-a-file) files
- [ ] [Add files using the command line](https://docs.gitlab.com/ee/gitlab-basics/add-file.html#add-a-file-using-the-command-line) or push an existing Git repository with the following command:
```
cd existing_repo
git remote add origin https://gitlab-stud.elka.pw.edu.pl/krudnic3/earin_project.git
git branch -M main
git push -uf origin main
```
## Integrate with your tools
- [ ] [Set up project integrations](https://gitlab-stud.elka.pw.edu.pl/krudnic3/earin_project/-/settings/integrations)
## Collaborate with your team
- [ ] [Invite team members and collaborators](https://docs.gitlab.com/ee/user/project/members/)
- [ ] [Create a new merge request](https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html)
- [ ] [Automatically close issues from merge requests](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#closing-issues-automatically)
- [ ] [Enable merge request approvals](https://docs.gitlab.com/ee/user/project/merge_requests/approvals/)
- [ ] [Automatically merge when pipeline succeeds](https://docs.gitlab.com/ee/user/project/merge_requests/merge_when_pipeline_succeeds.html)
## Test and Deploy
Use the built-in continuous integration in GitLab.
- [ ] [Get started with GitLab CI/CD](https://docs.gitlab.com/ee/ci/quick_start/index.html)
- [ ] [Analyze your code for known vulnerabilities with Static Application Security Testing(SAST)](https://docs.gitlab.com/ee/user/application_security/sast/)
- [ ] [Deploy to Kubernetes, Amazon EC2, or Amazon ECS using Auto Deploy](https://docs.gitlab.com/ee/topics/autodevops/requirements.html)
- [ ] [Use pull-based deployments for improved Kubernetes management](https://docs.gitlab.com/ee/user/clusters/agent/)
- [ ] [Set up protected environments](https://docs.gitlab.com/ee/ci/environments/protected_environments.html)
***
# Editing this README
When you're ready to make this README your own, just edit this file and use the handy template below (or feel free to structure it however you want - this is just a starting point!). Thank you to [makeareadme.com](https://www.makeareadme.com/) for this template.
## Suggestions for a good README
Every project is different, so consider which of these sections apply to yours. The sections used in the template are suggestions for most open source projects. Also keep in mind that while a README can be too long and detailed, too long is better than too short. If you think your README is too long, consider utilizing another form of documentation rather than cutting out information.
## Name
Choose a self-explaining name for your project.
## Description
Let people know what your project can do specifically. Provide context and add a link to any reference visitors might be unfamiliar with. A list of Features or a Background subsection can also be added here. If there are alternatives to your project, this is a good place to list differentiating factors.
## Badges
On some READMEs, you may see small images that convey metadata, such as whether or not all the tests are passing for the project. You can use Shields to add some to your README. Many services also have instructions for adding a badge.
## Visuals
Depending on what you are making, it can be a good idea to include screenshots or even a video (you'll frequently see GIFs rather than actual videos). Tools like ttygif can help, but check out Asciinema for a more sophisticated method.
## Installation
Within a particular ecosystem, there may be a common way of installing things, such as using Yarn, NuGet, or Homebrew. However, consider the possibility that whoever is reading your README is a novice and would like more guidance. Listing specific steps helps remove ambiguity and gets people to using your project as quickly as possible. If it only runs in a specific context like a particular programming language version or operating system or has dependencies that have to be installed manually, also add a Requirements subsection.
## Usage
Use examples liberally, and show the expected output if you can. It's helpful to have inline the smallest example of usage that you can demonstrate, while providing links to more sophisticated examples if they are too long to reasonably include in the README.
## Support
Tell people where they can go to for help. It can be any combination of an issue tracker, a chat room, an email address, etc.
## Roadmap
If you have ideas for releases in the future, it is a good idea to list them in the README.
## Contributing
State if you are open to contributions and what your requirements are for accepting them.
For people who want to make changes to your project, it's helpful to have some documentation on how to get started. Perhaps there is a script that they should run or some environment variables that they need to set. Make these steps explicit. These instructions could also be useful to your future self.
You can also document commands to lint the code or run tests. These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something. Having instructions for running tests is especially helpful if it requires external setup, such as starting a Selenium server for testing in a browser.
## Authors and acknowledgment
Show your appreciation to those who have contributed to the project.
## License
For open source projects, say how it is licensed.
## Project status
If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers.

388
project/final/code/main.py Normal file

@ -0,0 +1,388 @@
"""
Code for preprocessing data and creating a model that predicts and
recommends anime based on another anime entered by the user
"""
import math
import argparse
import shutil
import os
import datetime
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import VALID_METRICS_SPARSE
from scipy.sparse import csr_matrix


def get_data_cpu(limit_data=-1, data_folder_path="database"):
    """
    Reads anime from csv database
    """
    if limit_data > -1:
        # User can limit number of data taken into consideration,
        # model seems to work with limit_data value as low as 500,000
        rating_data = pd.read_csv(
            data_folder_path + "/animelist.csv", nrows=limit_data)
    else:
        rating_data = pd.read_csv(data_folder_path + "/animelist.csv")
    anime_data = pd.read_csv(data_folder_path + "/anime.csv")
    return rating_data, anime_data


def get_data(limit_data=-1, data_folder_path="database", gpu=False):
    rating_data, anime_data = get_data_cpu(limit_data, data_folder_path)
    # used to fetch anime_id(MAL_ID)
    anime_data = anime_data.rename(columns={"MAL_ID": "anime_id"})
    anime_contact_data = anime_data[["anime_id", "Name"]]
    rows_number = rating_data.shape[0]
    return rating_data, anime_contact_data, rows_number
def merge_rating_anime_data(rating_data, anime_contact_data, debug=False):
    """
    Preprocesses the data used for rating
    """
    rating_data = rating_data.merge(
        anime_contact_data, left_on="anime_id", right_on="anime_id", how="left"
    )
    rating_data = rating_data[
        ["user_id", "Name", "anime_id", "rating",
         "watching_status", "watched_episodes"]
    ]
    rating_head = rating_data.head()
    if debug:
        print(rating_head)
    rating_shape_complete = rating_data.shape
    if debug:
        print(rating_shape_complete)
    return rating_data


def split_data_below_thresholds(rating_data, data_name, threshold=-1, debug=False):
    """
    Removes data with data_name which is below given threshold
    """
    if threshold != -1:
        count = rating_data[data_name].value_counts()
        rating_data = rating_data[
            rating_data[data_name].isin(count[count >= threshold].index)
        ].copy()
    rating_shape_cut = rating_data.shape
    if debug:
        print(rating_shape_cut)
    return rating_data


def combine_name_and_ratings(rating_data, debug=False):
    """
    Create table which holds name of the anime and number of its reviews
    then we merge this with rating_data
    """
    combine_movie_rating = rating_data.dropna(axis=0, subset=["Name"])
    movie_rating_count = (
        combine_movie_rating.groupby(by=["Name"])["rating"]
        .count()
        .reset_index()[["Name", "rating"]]
    )
    rating_head = movie_rating_count.head()
    if debug:
        print(rating_head)
    rating_data = combine_movie_rating.merge(
        movie_rating_count, left_on="Name", right_on="Name", how="left"
    )
    return rating_data
def get_length_of_data(rating_data, data_name):
    """
    Returns the number of distinct values in the data_name + "_id" column
    (for example the number of users or the number of anime)
    """
    # Encoding categorical data
    column_ids = rating_data[data_name + "_id"].unique().tolist()
    column_to_column = {x: i for i, x in enumerate(column_ids)}
    rating_data[data_name] = rating_data[data_name +
                                         "_id"].map(column_to_column)
    users_number = len(column_to_column)
    return users_number


def get_top_ranked(rating_data, data_name, join_table=None, top_data_taken=20):
    """
    Get the entries (users or anime) with the highest number of ratings
    """
    if join_table is None:
        join_table = rating_data
    group_data_by_rating = rating_data.groupby(
        data_name + "_id")["rating"].count()
    top_users = group_data_by_rating.dropna().sort_values(ascending=False)[
        :top_data_taken]
    top_rated = join_table.join(top_users, rsuffix="_r",
                                how="inner", on=data_name + "_id")
    return top_rated


def get_data_info(rating_data, debug=False, gpu=False):
    """
    Get some information about the data
    """
    users_number = get_length_of_data(rating_data, "user")
    animes_number = get_length_of_data(rating_data, "anime")
    top_rated = get_top_ranked(rating_data, "user")
    top_rated = get_top_ranked(rating_data, "anime", top_rated)
    pivot = pd.crosstab(top_rated.user_id, top_rated.anime_id,
                        top_rated.rating, aggfunc=np.sum)
    pivot.fillna(0, inplace=True)
    smallest_rating = min(rating_data["rating"])
    highest_rating = max(rating_data["rating"])
    if debug:
        print(pivot)
    if debug:
        print(f"Num of users: {users_number}, Num of animes: {animes_number}")
        print(
            f"Min total rating: {smallest_rating}, Max total rating: {highest_rating}")
def preprocessing(rating_data, anime_contact_data,
                  debug=False, user_threshold=500, anime_threshold=200, auto=False):
    """
    Preprocesses data for making model more accurate and/or faster
    """
    rating_data = merge_rating_anime_data(rating_data, anime_contact_data)
    rating_data = split_data_below_thresholds(
        rating_data, "user_id", user_threshold)
    rating_data = split_data_below_thresholds(
        rating_data, "anime_id", anime_threshold)
    rating_data = combine_name_and_ratings(rating_data)
    rating_data = rating_data.drop(columns="rating_y")
    rating_data = rating_data.rename(columns={"rating_x": "rating"})
    if debug and not auto:
        print(rating_data)
        get_data_info(rating_data, True)
    pivot_table = rating_data.pivot_table(
        index="Name", columns="user_id", values="rating"
    ).fillna(0)
    if debug and not auto:
        print(pivot_table)
    return pivot_table
def predict(prediction_model, pivot_table, seed=42, anime="RANDOM", recommendation_number=6, auto=False, debug=False):
    """
    Picks the query anime (a random one unless a name is given) and lets
    prediction_model find similar anime.
    """
    np.random.seed(seed)
    if anime == "RANDOM":
        chosen_anime = np.random.choice(pivot_table.shape[0])
        query = pivot_table.iloc[chosen_anime, :].values.reshape(1, -1)
        chosen_anime_name = pivot_table.index[chosen_anime]
    else:
        # Row position of the requested anime; it is returned below and the
        # caller uses it to index pivot_table, so it must always be defined
        chosen_anime = pivot_table.index.get_loc(anime)
        query = pivot_table.loc[anime].values.reshape(1, -1)
        chosen_anime_name = anime
    distance, suggestions = prediction_model.kneighbors(
        query)
    if debug:
        print("prediction model, distance: ", distance)
    for i in range(0, 2):
        if i == 0:
            print(f"Recommendations for {chosen_anime_name}:\n")
        else:
            print(
                f"""{i}: {pivot_table.index[suggestions.flatten()[i]]},
                with distance of {distance.flatten()[i]}:"""
            )
    average_distance = np.mean(distance.flatten())
    closest_anime_name = pivot_table.index[suggestions.flatten()[1]]
    closest_anime_distance = distance.flatten()[1]
    average_minus_closest_distance = average_distance - closest_anime_distance
    print(
        f"Average distance: {average_distance}, average_minus_closest_distance: {average_minus_closest_distance}")
    return chosen_anime, suggestions.flatten()[1:recommendation_number+1], distance.flatten()[1:recommendation_number+1], f"{closest_anime_distance}_{average_distance}_{average_minus_closest_distance}"
    # return f"{chosen_anime_name}_{closest_anime_name}_{closest_anime_distance}_{average_distance}_{average_minus_closest_distance}"
def calculate_neighbors(rows_number, neighbors=5):
    neighbor_value = {
        "default": 5,
        "sqrt": math.floor(math.sqrt(rows_number)),
        "half": math.floor(rows_number / 2),
        "log": math.floor(math.log(rows_number)),
        "n-1": rows_number - 1
    }
    if isinstance(neighbors, str):
        return neighbor_value[neighbors]
    return neighbors


def create_model(pivot_table, rows_number, metric="cosine", algorithm="brute", neighbors=5):
    """
    Creates model based on nearest neighbor for anime prediction
    """
    neighbors_number = calculate_neighbors(pivot_table.shape[0], neighbors)
    pivot_table_matrix = csr_matrix(pivot_table.values)
    if algorithm == "brute":
        model = NearestNeighbors(n_neighbors=neighbors_number,
                                 metric=metric, algorithm=algorithm)
    else:
        model = NearestNeighbors(
            n_neighbors=neighbors_number, algorithm=algorithm)
    try:
        model.fit(pivot_table_matrix)
    except Exception as error:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt and SystemExit are not swallowed
        print(f"""Error in create_model, probably wrong metric for data
        Metric: {metric}, algorithm: {algorithm}, details: {error}""")
        return "Error!"
    return model
def str_to_bool(value):
    """
    Converts a command-line string such as "True" or "false" to a bool.
    argparse's type=bool treats every non-empty string, including "False", as True
    """
    if value.lower() in ("true", "t", "yes", "y", "1"):
        return True
    if value.lower() in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean value, got {value!r}")


def parse_neighbors(value):
    """
    Converts numeric strings to int and keeps named rules such as "sqrt" as strings
    """
    return int(value) if value.isdigit() else value


def handle_arguments():
    """
    Handles all arguments that can be used to change algorithm behaviour or program display
    """
    parser = argparse.ArgumentParser(description='Example script with pyargs')
    parser.add_argument('--data_limit', '-dl',
                        help="""Specify data limit,
                        Recommended at least 500k, set to -1 for no limit""",
                        required=False, type=int, default=-1)
    parser.add_argument('--seed', '-s',
                        help='Specify seed',
                        type=int, required=False, default=42)
    parser.add_argument('--debug', '-d',
                        help='Use debug (more information) prints',
                        type=str_to_bool, required=False, default=False)
    parser.add_argument('--database', '-db',
                        help='Specify database path',
                        required=False, default="database")
    allowed_metric = ["cosine", "mahalanobis", "euclidean"]
    parser.add_argument('--metric', '-m',
                        help='Specify metric for NearestNeighbor learner',
                        required=False, default="cosine", choices=allowed_metric)
    allowed_algorithms = ['auto', 'brute']
    parser.add_argument('--algorithm', '-a',
                        help='Specify algorithm for Nearest Neighbor learner',
                        required=False, default="brute", choices=allowed_algorithms)
    parser.add_argument('--anime', '-an',
                        help='Specify anime to choose',
                        required=False, default="RANDOM")
    parser.add_argument('--neighbors', '-n',
                        help='Specify number of nearest neighbors',
                        type=parse_neighbors, required=False, default=5)
    parser.add_argument('--user_threshold', '-ut',
                        help="""Specify minimal number of votes required for user to be
                        included in the data, set to -1 for no threshold""",
                        required=False, type=int, default=500)
    parser.add_argument('--anime_threshold', '-at',
                        help="""Specify minimal number of votes required for anime
                        to be included in the data, set to -1 for no threshold""",
                        required=False, type=int, default=200)
    parser.add_argument('--recommendation_amount', '-ra',
                        help='Specify how much anime should be recommended',
                        required=False, type=int, default=5)
    parser.add_argument('--auto', '-au',
                        help="""Enable auto mode, no debug, no user parameters,
                        automatic testing and saving results""",
                        type=str_to_bool, required=False, default=False)
    # Parse the command-line arguments
    args = parser.parse_args()
    args.recommendation_amount = args.recommendation_amount + 1
    # Access the values of the arguments
    return args.seed, args.debug, args.data_limit, args.database, args.metric, args.algorithm, args.anime, args.neighbors, args.user_threshold, args.anime_threshold, args.recommendation_amount, args.auto
def auto_mode(data_limit=-1, seed=42, anime="RANDOM"):
    print("Started auto mode")
    algorithm_spread = ['auto', 'brute']
    metric_spread = ['manhattan', 'euclidean', 'cosine']
    neighbor_spread = [5, "sqrt", "half", "log", "n-1"]
    # No reason to access and waste computational power every time we run the simulation
    starting_rating_data, starting_anime_contact_data, starting_rows_number = get_data(
        limit_data=data_limit)
    original_pivot_table = preprocessing(
        starting_rating_data, starting_anime_contact_data)
    if os.path.exists('test_results'):
        shutil.rmtree('test_results')
    for algorithm in algorithm_spread:
        possibleMetrics = []
        if algorithm != 'auto':
            possibleMetrics = metric_spread
        print("testing for algorithm: ", algorithm, possibleMetrics)
        if possibleMetrics == []:
            possibleMetrics = [""]
        for metric in possibleMetrics:
            if metric != 'precomputed':
                print("testing for algorithm, metric: ", algorithm, metric)
                for neighbor_amount in neighbor_spread:
                    print("testing for algorithm, metric, neighbor_amount: ",
                          algorithm, metric, neighbor_amount)
                    preprocess_model_predict(starting_rating_data, starting_anime_contact_data,
                                             starting_rows_number, original_pivot_table, seed=seed, anime=anime, neighbors=neighbor_amount, algorithm=algorithm, metric=metric)
def write_test_results(title, result=""):
    # Create directory if it doesn't already exist
    if not os.path.exists('test_results'):
        os.makedirs('test_results')
    # Generate timestamped filename
    timestamp = datetime.datetime.now().strftime(
        '%Y%m%d%H%M%S')  # e.g., 20230611235959
    filename = f"{title}_{timestamp}.txt"
    # Create and write to the file
    with open(os.path.join('test_results', filename), 'a') as file:
        file.write(result)


def calculate_precision(predictions, threshold=8):
    # Mean of the non-zero ratings of each recommended anime
    ratings = [anime[anime > 0].mean() for anime in predictions]
    # A recommendation counts as good if that mean reaches the threshold
    precision = [1 if r >= threshold else 0 for r in ratings]
    return np.mean(precision)
def preprocess_model_predict(rating_data, anime_contact_data, rows_number, pivot_table, data_limit=-1, db="database", debug=False, user_threshold=500, anime_threshold=200, metric="cosine", algorithm="brute", neighbors=5, seed=42, anime="RANDOM", recommendation_amount=5):
    MODEL = create_model(pivot_table, rows_number,
                         metric, algorithm, neighbors)
    result = ""
    if MODEL != "Error!":
        chosen_anime, suggestions, distance, distance_data = predict(MODEL, pivot_table, seed,
                                                                     anime, recommendation_amount)
        chosen_anime_name = pivot_table.index[chosen_anime]
        # average_distance = np.mean(distance)
        # closest_anime_name = pivot_table.index[suggestions[1]]
        # closest_anime_distance = distance[1]
        # average_minus_closest_distance = closest_anime_distance - average_distance
        precision = calculate_precision(
            [pivot_table.iloc[s] for s in suggestions])
        result = f"{chosen_anime_name}:\n"
        for i in range(len(suggestions)):
            result += f"{pivot_table.index[suggestions[i]]}; Distance: {distance[i]}\n"
        result += f"Precision: {precision*100}%\n"
        result += "Smallest distance, average distance, Average - Smallest distance: " + distance_data
        # result = f"{chosen_anime_name}_{closest_anime_name}_{closest_anime_distance}_{average_distance}_{average_minus_closest_distance}"
    write_test_results(
        f"dl={rows_number}&s={seed}&m={metric}&a={algorithm}&ut={user_threshold}&at={anime_threshold}&n={neighbors}", result)
if __name__ == "__main__":
    SEED, DEBUG, DATA_LIMIT, DB, METRIC, ALGORITHM, ANIME, NEIGHBORS, USER_THRESHOLD, ANIME_THRESHOLD, RECOMMENDATION_AMOUNT, AUTO = handle_arguments()
    if not AUTO:
        print("Entered not auto mode")
        starting_rating_data, starting_anime_contact_data, starting_rows_number = get_data(
            limit_data=DATA_LIMIT, data_folder_path=DB)
        # Pass the thresholds by keyword: the third positional parameter of
        # preprocessing is debug, not user_threshold
        pivot_table = preprocessing(
            starting_rating_data, starting_anime_contact_data,
            debug=DEBUG, user_threshold=USER_THRESHOLD, anime_threshold=ANIME_THRESHOLD)
        preprocess_model_predict(starting_rating_data, starting_anime_contact_data, starting_rows_number,
                                 pivot_table, data_limit=DATA_LIMIT, db=DB, debug=DEBUG, user_threshold=USER_THRESHOLD, anime_threshold=ANIME_THRESHOLD,
                                 metric=METRIC, algorithm=ALGORITHM, neighbors=NEIGHBORS, seed=SEED, anime=ANIME, recommendation_amount=RECOMMENDATION_AMOUNT)
    if AUTO:
        auto_mode(DATA_LIMIT, SEED, ANIME)


@ -0,0 +1,4 @@
pandas
numpy
scikit-learn
scipy
seaborn
matplotlib

Binary file not shown.


@ -0,0 +1,39 @@
Parameters:
- datalimit (usable between 500k and max) [max = 109,224,747 ]
- seed (very important make sure it stays the same through all testing [maybe just 42?])
- metric (either cosine, mahalanobis or euclidean as in preliminary report)
- NN algorithm (either auto, ball_tree, kd_tree, brute)
- neighbors - number of nearest neighbors
- User threshold - minimal numbers of votes for user to be included in data
- Anime threshold - same for anime
These are 6 parameters that influence program behaviour and 1 parameter for seed
Probably would do simulations for 3 variants of each parameters (excluding seed), rest will be default
so in total 6 * 3 = 18 simulations
Default values:
Datalimit: all of data
Seed: 42
Metric: cosine
NN algorithm: brute
Neighbors: 5
User threshold: 500
Anime threshold: 200
Neighbors number count:
k = 3-5: default starting points for small-medium dataset
k = sqrt(n): rule of thumb, n is number of instances in dataset (balanced between underfitting and overfitting)
k = n / 2: look at half of dataset for each prediction
k = log(n): for very large datasets
k = n - 1: Use all data except one, will probably overgeneralize the model
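The rules above can be sketched in a few lines of Python. This is only an illustration: the function name neighbor_candidates and the example size n = 10,000 are made up here, and the logarithm is assumed to be the natural one (as math.log in main.py).

```python
import math


def neighbor_candidates(n):
    """Return the candidate k values for a dataset with n rows (n >= 2)."""
    return {
        "default": 5,
        "sqrt": math.floor(math.sqrt(n)),  # rule of thumb
        "half": n // 2,                    # half of the dataset
        "log": math.floor(math.log(n)),    # natural logarithm (assumption)
        "n-1": n - 1,                      # everything except the query itself
    }


candidates = neighbor_candidates(10_000)
print(candidates)
```

For n = 10,000 this gives 5, 100, 5000, 9 and 9999 neighbors, which shows how far apart the rules are.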
Values spread:
Datalimit: [27306186, 54612373, 109224747] (max on the right, then halved and halved)
Metric: ["cosine", "mahalanobis", "euclidean"]
NN algorithm: ['auto', 'ball_tree', 'kd_tree', 'brute']
neighbors: [5, sqrt(n), n / 2, log(n), n - 1]
User threshold: [0, 500, 1000]
Anime threshold: [0, 200, 500]
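A possible way to generate the simulation list from the defaults and spreads above, changing one parameter at a time and keeping the rest at default. The dictionary keys and the helper name one_at_a_time are illustrative, not names from main.py. With exactly 3 variants per parameter this gives the 6 * 3 = 18 runs planned above; with the spreads exactly as listed (4 algorithms, 5 neighbor rules) it gives 21.

```python
# Defaults and spreads copied from the notes; key names are illustrative.
MAX_ROWS = 109_224_747

defaults = {
    "data_limit": MAX_ROWS,
    "metric": "cosine",
    "algorithm": "brute",
    "neighbors": 5,
    "user_threshold": 500,
    "anime_threshold": 200,
}

spreads = {
    "data_limit": [27_306_186, 54_612_373, MAX_ROWS],
    "metric": ["cosine", "mahalanobis", "euclidean"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    "neighbors": [5, "sqrt", "half", "log", "n-1"],
    "user_threshold": [0, 500, 1000],
    "anime_threshold": [0, 200, 500],
}


def one_at_a_time(defaults, spreads):
    """Yield one configuration per (parameter, value), others at default."""
    for name, values in spreads.items():
        for value in values:
            config = dict(defaults)
            config[name] = value
            yield config


runs = list(one_at_a_time(defaults, spreads))
print(len(runs))
```

Some of these runs are identical to the all-default run (for example metric = cosine), so duplicates could be skipped to save time.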

Binary file not shown.



@ -0,0 +1,200 @@
\documentclass[12pt]{article}
\usepackage{listings}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{float}
\title{EARIN project Final report}
\author{Krzysztof Rudnicki \\ Jakub Kliszko}
\begin{document}
\maketitle
\section{Introduction}
The goal of our project was to create a model for an anime recommender. \\
After the user enters the name of an anime from the database, the model should output recommended anime.
\section{Used data and algorithms}
\subsection{Data}
We used a different data-set from the one originally specified in the project description. \\
We decided to use the Anime Recommendation Database from Kaggle: \href{https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020}{LINK} \\
The main reasons for choosing this database were that it is bigger and more recent than the original one, that Kaggle describes it as 100\% usable, and that it still has a decent number of code examples. \\
We are mostly interested in the rating\_complete.csv file, which contains anime ratings from users who completed the anime.
\subsection{Algorithms}
We decided to use collaborative filtering to develop our model. It makes personalized recommendations based on the preferences of similar users. \\
We represent each anime in the data-set as an embedding vector of the ratings it received from users. \\
We use a K-nearest neighbors model and decided to test it with different metrics, numbers of neighbors and algorithms.
\subsubsection{Algorithms}
We decided to test our model with 2 algorithms:
\begin{enumerate}
\item Brute
\item Auto
\end{enumerate}
Ball Tree and KD Tree do not work on sparse input (which is what our input is), so we decided to omit them.
\subsubsection{Neighbor number}
We decided to test our model with 5 different numbers of neighbors:
\begin{enumerate}
\item 5 - A popular starting point for small to medium data-sets
\item square root of available data - Usually helps to balance between under-fitting and over-fitting
\item half of available data - Usually more useful for checking the overall trend than specific nuances
\item logarithm of available data - Used for very large data-sets
\item n-1 neighbors - Usually leads to over-generalization, as we use all instances except one for prediction
\end{enumerate}
\subsubsection{Metrics}
For the brute algorithm we tested the following metrics:
\begin{enumerate}
\item Cosine - Measures the cosine of the angle between two vectors
\item Euclidean - Measures the straight-line distance between two points
\item Manhattan - Measures the sum of the absolute differences between the coordinates of two points
\end{enumerate}
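As a reminder of what each metric computes, the standard definitions can be written as follows. The notation is introduced only for this illustration: $x$ and $y$ are the rating vectors of two anime, and $x_i$, $y_i$ are the ratings given by user $i$.
\[
d_{\mathrm{cosine}}(x, y) = 1 - \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
\]
\[
d_{\mathrm{euclidean}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
\]
\[
d_{\mathrm{manhattan}}(x, y) = \sum_i |x_i - y_i|
\]
For a made-up example with $x = (8, 0, 6)$ and $y = (8, 6, 0)$, the cosine distance is $1 - 64/100 = 0.36$, the Euclidean distance is $\sqrt{72} \approx 8.49$ and the Manhattan distance is $12$.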
\section{Intermediate results}
\subsection{Results}
For the intermediate solution we implemented reading data from csv files, preprocessing it (optionally printing some information about the data), and a model/learner that performs the neighbour searches. \\ We implemented a full manual mode for our program and added parameters that are later used by the auto mode, such as the type of metric, the algorithm, the number of neighbors and the seed.
\subsection{Insights}
We found out that the solution we initially wanted to build on, \href{https://www.kaggle.com/code/chaitanya99/recommendation-system-cf-anime}{Kaggle code with tensorflow}, was too complicated and required too much computational time, so we changed to a simpler, faster one that was still sufficient to provide the solution. \\
We also found out when investigating the ratings that they are skewed towards higher values like 7, 8 and 9, so the average rating is well above 5.
\begin{figure}[H]
\caption{User rating count}
\includegraphics[width=\textwidth]{user_rating.png}
\end{figure}
\section{Using program}
\subsection{Arguments}
There are 12 parameters that can be modified; they influence either how the program behaves or how the model behaves. To see how they can be modified, the user needs to run the
\begin{lstlisting}[language=bash]
python main.py -h
\end{lstlisting}
command, which prints the following help text:
\begin{lstlisting}[language=bash]
options:
-h, --help show this help message and exit
--data_limit DATA_LIMIT, -dl DATA_LIMIT
Specify data limit,
Recommended at least 500k,
set to -1 for no limit
--seed SEED, -s SEED Specify seed
--debug DEBUG, -d DEBUG
Use debug (more information) prints
--database DATABASE, -db DATABASE
Specify database path
--metric {cosine,mahalanobis,euclidean},
-m {cosine,mahalanobis,euclidean}
Specify metric for NearestNeighbor learner
--algorithm {auto, brute},
-a {auto, brute}
Specify algorithm for
Nearest Neighbor learner
--anime ANIME, -an ANIME
Specify anime to choose
--neighbors NEIGHBORS, -n NEIGHBORS
Specify number of nearest neighbors
--user_threshold USER_THRESHOLD, -ut USER_THRESHOLD
Specify minimal number of votes required for
user to be included in the data, set to -1 for
no threshold
--anime_threshold ANIME_THRESHOLD, -at ANIME_THRESHOLD
Specify minimal number of votes required for
anime to be included in the data, set to -1 for
no threshold
--recommendation_amount RECOMMENDATION_AMOUNT,
-ra RECOMMENDATION_AMOUNT
Specify how much anime should be recommended
--auto AUTO, -au AUTO
Enable auto mode, no debug,
no user parameters, automatic testing and saving results
\end{lstlisting}
\subsubsection{Default arguments}
Default values of arguments are:
\begin{itemize}
\item Data Limit = -1 (means no limit, all data will be used)
\item Seed = 42
\item Debug = False (means no debug information will be shown)
\item Database = database
\item Metric = cosine
\item Algorithm = brute
\item Anime = RANDOM (program will randomly choose anime for which there should be recommendation)
\item Neighbors = 5
\item User Threshold = 500
\item Anime Threshold = 200
\item Recommendation Amount = 5
\item Auto mode = False
\end{itemize}
\subsubsection{Reproducing}
In order to reproduce test results user should use:
\begin{lstlisting}
python main.py -au True -dl 600000
\end{lstlisting}
This command runs auto mode with the data limited to 600 thousand entries.
\section{Final experimental results}
\subsection{Experiments}
All of our experiments were run on data limited to \textbf{600 thousand entries}; the remaining parameters were left at their default values. \\
We examined two things in our experiments: the precision of our algorithm, and the distances between the recommended anime and the input anime for different metrics and numbers of neighbors. \\
Precision was calculated from the rating of each recommended anime: if the recommended anime had a rating of 8 or higher, the recommendation was considered ``good'', so a true positive is any recommended anime with a rating of at least 8. \\
Unfortunately, we did not manage to calculate recall, and therefore also not the F1 score, which requires recall.
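
The precision rule described above can be sketched as follows. This is only an illustration under the stated threshold of 8, not the code used in the project; the function name and the sample ratings are hypothetical.
\begin{lstlisting}[language=Python]
def precision_at_threshold(recommended_ratings, threshold=8.0):
    """Return the fraction of recommended anime rated >= threshold."""
    if not recommended_ratings:
        return 0.0  # no recommendations, so no true positives
    good = sum(1 for rating in recommended_ratings if rating >= threshold)
    return good / len(recommended_ratings)


# Hypothetical ratings of five recommended anime
print(precision_at_threshold([8.6, 7.9, 8.0, 9.1, 6.5]))
\end{lstlisting}
Three of the five sample ratings reach the threshold, so the call returns 0.6.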
\subsection{Results}
\subsubsection{Precision}
\begin{figure}[H]
\caption{Precision from different metrics and auto algorithm}
\includegraphics[width=\textwidth]{precision_metric.png}
\end{figure}
\begin{figure}[H]
\caption{Precision from number of neighbors}
\includegraphics[width=\textwidth]{precision_neighbor.png}
\end{figure}
\subsubsection{Distance}
\begin{figure}[H]
\caption{Distance for Manhattan}
\includegraphics[width=\textwidth]{distance_manhattan.png}
\end{figure}
\begin{figure}[H]
\caption{Distance for Euclidean}
\includegraphics[width=\textwidth]{distance_euclidean.png}
\end{figure}
\begin{figure}[H]
\caption{Distance for Cosine}
\includegraphics[width=\textwidth]{distance_cosine.png}
\end{figure}
\begin{figure}[H]
\caption{Distance for Auto mode}
\includegraphics[width=\textwidth]{distance_auto.png}
\end{figure}
\subsection{Discussion}
\subsubsection{Precision}
As expected, the number of neighbors has no influence on precision. The metric, on the other hand, does: the cosine metric achieves the best precision compared to the Manhattan and Euclidean metrics. \\
This means that anime recommended by a model using the cosine metric usually have higher ratings than anime recommended by models using the Euclidean or Manhattan metric. \\
Auto mode seems to use the Euclidean metric, as the results of the two are exactly the same.
\subsubsection{Distance}
The smallest distance was always the same within a given metric, regardless of the number of neighbors used. \\
The only things that changed were the average distance and, consequently, the difference between the average distance and the smallest distance. \\
In general, a smaller number of neighbors meant a smaller average distance. Notice that for 5 neighbors and for a number of neighbors equal to the logarithm of the number of entries, the average distance is roughly the same in all charts, and lower than the average distance for a number of neighbors equal to the square root of the entries, half of the entries, or all entries but one. \\
We think this means that simply by taking more data into consideration when recommending anime, we increase the number of outliers that are radically ``far'' from the input anime. \\
In some cases the average distance was actually smaller than the smallest distance; we did not manage to understand why within the time limit of this project.
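
The relation between the smallest and the average distance can be illustrated with a small sketch. It assumes that both statistics are computed over the same list of neighbour distances, sorted in ascending order; the helper name and the sample distances are hypothetical. Under that assumption the average over the $k$ nearest neighbours can only grow with $k$ and can never fall below the smallest distance, so a check of this kind may help to locate where the two values diverge in the results.
\begin{lstlisting}[language=Python]
def distance_summary(sorted_distances, k):
    """Return (smallest, average) over the k nearest neighbours."""
    nearest = sorted_distances[:k]
    return nearest[0], sum(nearest) / len(nearest)


# Hypothetical distances, sorted ascending
distances = [0.11, 0.12, 0.13, 0.15, 0.40, 0.90]
for k in (2, 4, 6):
    smallest, average = distance_summary(distances, k)
    print(k, smallest, round(average, 3))
\end{lstlisting}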
\section{Challenges}
\subsection{Challenges themselves}
\paragraph{Precision}
The biggest challenge was implementing the algorithm that checks the precision of our model. We had to drop recall and the F1 score, since defining the baseline against which they would measure the correctness of a solution was not a trivial task.
\paragraph{Data Size}
Another challenge was the size of the database itself, which slowed down running and testing new changes.
\paragraph{Visualisation}
It was not easy to visualize the results of the experiments: most of the program output is plain text, and there is little continuous data that naturally lends itself to charts.
\subsection{Tackling challenges}
For precision, we settled on defining a good recommendation as any recommended anime with a rating of 8 or higher. \\
For data size, we introduced the data limit argument, which made it quick and easy to test new changes. \\
For visualization, we agreed on showing the precision of the algorithm for different thresholds and presenting the distances in histogram charts.
\section{Conclusions}
\paragraph{Best parameters}
Based on our precision measurements, we lean towards the brute algorithm with the cosine metric, probably with 5 neighbors or a logarithmic number of neighbors. \\
The cosine metric gives us the best precision, and a small number of neighbors gives a smaller average distance. It also behaves as expected, since the average distance is smaller than the distance between the input anime and the recommended anime.
\subsection{Solution satisfaction}
Our solution works: it manages to go through the entire data-set and return recommendations. \\
Based on manual inspection, those recommendations are usually reasonable; the model often behaves rationally, for example by recommending the sequel when given the first part of an anime. \\
We also managed to build a basic precision algorithm, which should be considered a success given how hard it is to define what a ``correct'' recommendation is. \\
Overall, we are content with the result, given the limited time, knowledge and resources at our disposal.
\subsection{Potential improvements}
We did not manage to introduce more parameters when embedding data into vectors, such as anime popularity or how controversial an anime is. Doing so would probably make the model at least more interesting, if not directly better.
\end{document}


@ -0,0 +1,244 @@
"""
Code for preprocessing data and creating a model that predicts and
recommends anime based on another anime entered by the user
"""
import argparse

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def get_data(limit_data=-1, data_folder_path="database"):
    """
    Reads anime from csv database
    """
    if limit_data > -1:
        # User can limit number of data taken into consideration,
        # model seems to work with limit_data value as low as 500,000
        rating_data = pd.read_csv(
            data_folder_path + "/animelist.csv", nrows=limit_data)
    else:
        rating_data = pd.read_csv(data_folder_path + "/animelist.csv")
    anime_data = pd.read_csv(data_folder_path + "/anime.csv")
    # used to fetch anime_id(MAL_ID)
    anime_data = anime_data.rename(columns={"MAL_ID": "anime_id"})
    anime_contact_data = anime_data[["anime_id", "Name"]]
    return rating_data, anime_contact_data


def merge_rating_anime_data(rating_data, anime_contact_data, debug=False):
    """
    Preprocesses the data used for rating
    """
    rating_data = rating_data.merge(
        anime_contact_data, left_on="anime_id", right_on="anime_id", how="left"
    )
    rating_data = rating_data[
        ["user_id", "Name", "anime_id", "rating",
         "watching_status", "watched_episodes"]
    ]
    if debug:
        print(rating_data.head())
        print(rating_data.shape)
    return rating_data


def split_data_below_thresholds(rating_data, data_name, threshold=-1, debug=False):
    """
    Removes data with data_name which is below given threshold
    """
    if threshold != -1:
        # Keep only rows whose data_name value occurs at least threshold times
        count = rating_data[data_name].value_counts()
        rating_data = rating_data[
            rating_data[data_name].isin(count[count >= threshold].index)
        ].copy()
    if debug:
        print(rating_data.shape)
    return rating_data


def combine_name_and_ratings(rating_data, debug=False):
    """
    Create table which holds name of the anime and number of its reviews
    then we merge this with rating_data
    """
    combine_movie_rating = rating_data.dropna(axis=0, subset=["Name"])
    movie_rating_count = (
        combine_movie_rating.groupby(by=["Name"])["rating"]
        .count()
        .reset_index()[["Name", "rating"]]
    )
    if debug:
        print(movie_rating_count.head())
    # Both tables have a "rating" column, so the merge produces
    # rating_x (score given by the user) and rating_y (rating count)
    rating_data = combine_movie_rating.merge(
        movie_rating_count, left_on="Name", right_on="Name", how="left"
    )
    return rating_data


def get_length_of_data(rating_data, data_name):
    """
    Returns the number of unique values in the data_name + "_id" column.
    As a side effect, adds a data_name column with ids encoded as 0..n-1.
    """
    # Encoding categorical data
    column_ids = rating_data[data_name + "_id"].unique().tolist()
    column_to_index = {x: i for i, x in enumerate(column_ids)}
    rating_data[data_name] = rating_data[data_name +
                                         "_id"].map(column_to_index)
    return len(column_to_index)


def get_top_ranked(rating_data, data_name, join_table=None, top_data_taken=20):
    """
    Get rows of join_table that belong to the top_data_taken ids
    (users or anime) with the highest number of ratings
    """
    if join_table is None:
        join_table = rating_data
    group_data_by_rating = rating_data.groupby(
        data_name + "_id")["rating"].count()
    top_ids = group_data_by_rating.dropna().sort_values(ascending=False)[
        :top_data_taken]
    top_rated = join_table.join(top_ids, rsuffix="_r",
                                how="inner", on=data_name + "_id")
    return top_rated


def get_data_info(rating_data, debug=False):
    """
    Get some information about the data
    """
    users_number = get_length_of_data(rating_data, "user")
    animes_number = get_length_of_data(rating_data, "anime")
    top_rated = get_top_ranked(rating_data, "user")
    top_rated = get_top_ranked(rating_data, "anime", top_rated)
    pivot = pd.crosstab(top_rated.user_id, top_rated.anime_id,
                        top_rated.rating, aggfunc=np.sum)
    pivot.fillna(0, inplace=True)
    smallest_rating = min(rating_data["rating"])
    highest_rating = max(rating_data["rating"])
    if debug:
        print(pivot)
        print(f"Num of users: {users_number}, Num of animes: {animes_number}")
        print(
            f"Min total rating: {smallest_rating}, Max total rating: {highest_rating}")


def preprocessing(rating_data, anime_contact_data, debug=False, user_threshold=500, anime_threshold=200):
    """
    Preprocesses data for making model more accurate and/or faster
    """
    # debug is passed on so that the helper functions print their information
    rating_data = merge_rating_anime_data(
        rating_data, anime_contact_data, debug)
    rating_data = split_data_below_thresholds(
        rating_data, "user_id", user_threshold, debug)
    rating_data = split_data_below_thresholds(
        rating_data, "anime_id", anime_threshold, debug)
    rating_data = combine_name_and_ratings(rating_data, debug)
    # rating_x is the score given by the user, rating_y is the number of
    # ratings of the anime; only rating_y is kept, under the name "rating"
    rating_data = rating_data.drop(columns="rating_x")
    rating_data = rating_data.rename(columns={"rating_y": "rating"})
    if debug:
        print(rating_data)
    get_data_info(rating_data, debug)
    pivot_table = rating_data.pivot_table(
        index="Name", columns="user_id", values="rating"
    ).fillna(0)
    if debug:
        print(pivot_table)
    return pivot_table


def predict(prediction_model, pivot_table, seed=42, anime="RANDOM", recommendation_number=6):
    """
    Chooses an anime (a random one unless a name is given) and prints
    the anime that prediction_model finds most similar to it.
    """
    np.random.seed(seed)
    print(pivot_table)
    if anime == "RANDOM":
        chosen_anime = np.random.choice(pivot_table.shape[0])
        query = pivot_table.iloc[chosen_anime, :].values.reshape(1, -1)
        chosen_anime_name = pivot_table.index[chosen_anime]
    else:
        query = pivot_table.loc[anime].values.reshape(1, -1)
        chosen_anime_name = anime
    distances, suggestions = prediction_model.kneighbors(
        query, n_neighbors=recommendation_number)
    distances = distances.flatten()
    suggestions = suggestions.flatten()
    print(f"Recommendations for {chosen_anime_name}:\n")
    # Index 0 is skipped: the nearest neighbour is normally the query itself
    for i in range(1, len(distances)):
        print(
            f"{i}: {pivot_table.index[suggestions[i]]}, with distance of {distances[i]}:"
        )


def create_model(pivot_table, metric="cosine", algorithm="brute", neighbors=5):
    """
    Creates model based on nearest neighbor for anime prediction
    """
    pivot_table_matrix = csr_matrix(pivot_table.values)
    model = NearestNeighbors(n_neighbors=neighbors,
                             metric=metric, algorithm=algorithm)
    model.fit(pivot_table_matrix)
    return model


def handle_arguments():
    """
    Parses command line arguments and returns them as a tuple
    """
    parser = argparse.ArgumentParser(description='Example script with pyargs')
    parser.add_argument('--data_limit', '-dl',
                        help='Specify data limit, Recommended at least 500k, set to -1 for no limit',
                        required=False, type=int, default=-1)
    parser.add_argument('--seed', '-s', help='Specify seed',
                        type=int, required=False, default=42)
    # type=bool would turn any non-empty string (even "False") into True,
    # so the value is compared against accepted spellings of "true" instead
    parser.add_argument('--debug', '-d', help='Use debug (more information) prints',
                        type=lambda value: value.lower() in ("true", "1", "yes"),
                        required=False, default=False)
    parser.add_argument('--database', '-db', help='Specify database path',
                        required=False, default="database")
    allowed_metric = ["cosine", "mahalanobis", "euclidean"]
    parser.add_argument('--metric', '-m', help='Specify metric for NearestNeighbor learner',
                        required=False, default="cosine", choices=allowed_metric)
    allowed_algorithms = ['auto', 'ball_tree', 'kd_tree', 'brute']
    parser.add_argument('--algorithm', '-a', help='Specify algorithm for Nearest Neighbor learner',
                        required=False, default="brute", choices=allowed_algorithms)
    parser.add_argument('--anime', '-an', help='Specify anime to choose',
                        required=False, default="RANDOM")
    # type=int is needed, otherwise a value given on the command line stays a string
    parser.add_argument('--neighbors', '-n', help='Specify number of nearest neighbors',
                        required=False, type=int, default=5)
    parser.add_argument('--user_threshold', '-ut', help='Specify minimal number of votes required for user to be included in the data, set to -1 for no threshold',
                        required=False, type=int, default=500)
    parser.add_argument('--anime_threshold', '-at', help='Specify minimal number of votes required for anime to be included in the data, set to -1 for no threshold',
                        required=False, type=int, default=200)
    parser.add_argument('--recommendation_amount', '-ra', help='Specify how much anime should be recommended',
                        required=False, type=int, default=5)
    # Parse the command-line arguments
    args = parser.parse_args()
    # One extra neighbour is requested because predict() skips the first result
    args.recommendation_amount = args.recommendation_amount + 1
    # Access the values of the arguments
    return args.seed, args.debug, args.data_limit, args.database, args.metric, args.algorithm, args.anime, args.neighbors, args.user_threshold, args.anime_threshold, args.recommendation_amount


if __name__ == "__main__":
    seed, debug, data_limit, db, metric, algorithm, anime, neighbors, user_threshold, anime_threshold, recommendation_amount = handle_arguments()
    RATING_DATA, ANIME_CONTACT_DATA = get_data(data_limit, db)
    PIVOT_TABLE = preprocessing(
        RATING_DATA, ANIME_CONTACT_DATA, debug, user_threshold, anime_threshold)
    MODEL = create_model(PIVOT_TABLE, metric, algorithm, neighbors)
    predict(MODEL, PIVOT_TABLE, seed, anime, recommendation_amount)

@ -0,0 +1,4 @@
pandas
numpy
seaborn
matplotlib


@ -0,0 +1,97 @@
\documentclass[12pt]{article}
\usepackage{listings}
\usepackage{hyperref}
\usepackage{graphicx}
\title{EARIN project Midterm report}
\author{Krzysztof Rudnicki \\ Jakub Kliszko}
\begin{document}
\maketitle
\section{Progress}
We have implemented reading data from csv files and preprocessing it, with optional display of some information about the data, and we used a nearest-neighbors model/learner to implement neighbour searches \\
The program is very flexible and can be configured extensively through command line arguments \\
The full list is shown here:
\begin{lstlisting}[language=bash]
options:
  -h, --help            show this help message and exit
  --data_limit DATA_LIMIT, -dl DATA_LIMIT
                        Specify data limit, Recommended at least 500k,
                        set to -1 for no limit
  --seed SEED, -s SEED  Specify seed
  --debug DEBUG, -d DEBUG
                        Use debug (more information) prints
  --database DATABASE, -db DATABASE
                        Specify database path
  --metric {cosine,mahalanobis,euclidean}
  -m {cosine,mahalanobis,euclidean}
                        Specify metric for NearestNeighbor learner
  --algorithm {auto,ball_tree,kd_tree,brute}
  -a {auto,ball_tree,kd_tree,brute}
                        Specify algorithm for Nearest Neighbor learner
  --anime ANIME, -an ANIME
                        Specify anime to choose
  --neighbors NEIGHBORS, -n NEIGHBORS
                        Specify number of nearest neighbors
  --user_threshold USER_THRESHOLD, -ut USER_THRESHOLD
                        Specify minimal number of votes
                        required for user to be included in
                        the data, set to -1 for no threshold
  --anime_threshold ANIME_THRESHOLD, -at ANIME_THRESHOLD
                        Specify minimal number of votes
                        required for anime to be included
                        in the data, set to -1 for no threshold
\end{lstlisting}
\section{Results}
Currently, recommendations are displayed in the following way:
\begin{lstlisting}[language=bash]
Recommendations for Kill la Kill:
1: Shingeki no Kyojin, with distance of 0.11106648055176693:
2: Steins;Gate, with distance of 0.12104265014640536:
3: Toradora!, with distance of 0.12112848901274798:
4: Sword Art Online, with distance of 0.13046005032340824:
5: No Game No Life, with distance of 0.1306815843129835:
6: One Punch Man, with distance of 0.14848484728234945:
7: Angel Beats!, with distance of 0.15175709939974935:
8: Hataraku Maou-sama!, with distance of 0.15244674042590045:
9: Psycho-Pass, with distance of 0.15288022814590008:
\end{lstlisting}
Here we are given the name of the anime for which recommendations are created, followed by the list of recommended anime with their distance to the original anime (lower is better)
\subsection{Data size and execution time}
\begin{figure}
\caption{Chart showing how the size of the data impacts execution time}
\includegraphics[width=\textwidth]{execution_time.png}
\end{figure}
This data was collected using default parameters except for the increasing data size; each of the three runs uses a different seed
\paragraph{Seed} We added a seed to the predict function for choosing a random anime. Using the same seed always returns the same recommendations, and choosing a random anime is the only random part of our code \\
The user can specify their own seed with the \texttt{-s} or \texttt{-{}-seed} flag by entering in the command line:
\begin{lstlisting}
python main.py -s 42
\end{lstlisting}
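The effect of the seed can be sketched as below. The sketch uses Python's standard \texttt{random} module instead of \texttt{np.random}, and the function name and anime names are placeholders, so it only illustrates the idea and is not the code of the predict function.
\begin{lstlisting}[language=Python]
import random


def choose_anime(names, seed):
    """Pick one name; the same seed always yields the same pick."""
    rng = random.Random(seed)
    return rng.choice(names)


names = ["Anime A", "Anime B", "Anime C", "Anime D"]
print(choose_anime(names, 42) == choose_anime(names, 42))
\end{lstlisting}
Because the generator is re-created from the seed on every call, the comparison prints \texttt{True}.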
\section{Challenges}
\subsection{Failed attempts}
The biggest challenge was realizing how overcomplicated and unnecessarily difficult to implement the code we first based our work on was: \href{https://www.kaggle.com/code/chaitanya99/recommendation-system-cf-anime}{Kaggle code with tensorflow} \\
This solution runs for almost 10 minutes on Kaggle, and getting it to run on our local devices was a real chore that took us a good day and a half \\
This implementation is built around the very powerful Tensor Processing Unit from Google. While it is possible to change it to run on a local graphics card, doing so requires installing downgraded versions of both CUDA and cuDNN supported by TensorFlow (11.8) and downgrading the graphics card drivers \\
Running it on a CPU results in the model training for over 3 hours
\subsection{Corrections}
Surprisingly, even though we based our preliminary report on different example code, we did not need to make any corrections to the preliminary report \\
All of the functionality that we want to implement is available in sklearn and scipy
\subsection{Results and findings}
We can see that the ratings are skewed towards higher values: users tend to give ratings of 7, 8 or 9, which pushes the average rating well above 5
\begin{figure}
\caption{User rating count}
\includegraphics[width=\textwidth]{user_rating.png}
\end{figure}
\section{Finishing project}
\subsection{Embedding more data in user and anime}
Currently we only embed the raw rating values of users. We do not take into consideration popularity, ``controversy'', the studio which created the anime, the length of the anime (number of episodes and length of episodes), or when it was aired
\subsection{Evaluating our model accuracy}
We need to introduce a way to evaluate the accuracy of our model. We will try to introduce at least some of the measures mentioned in the preliminary report: precision, recall, F1 score and MAP
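
For reference, the standard definitions of the first three measures are given below, with $TP$, $FP$ and $FN$ denoting the numbers of true positives, false positives and false negatives. How a ``positive'' recommendation is defined for this project is still open at this stage, so these formulas are only the general textbook form.
\[
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]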
\subsection{More results representation}
We still need to present the results of our model in more ways, mainly how well it predicts similarity for different parameter values (different modes, arguments and so on) \\
We can already modify those values easily, both in the code itself and through arguments; we just need to run the program with those values and collect the results
\end{document}


@ -0,0 +1,281 @@
\documentclass[12pt]{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{hyperref}
\title{EARIN Preliminary Project \\
19 - Anime Recommender}
\author{Jakub Kliszko, Krzysztof Rudnicki }
\date{\today}
\begin{document}
\maketitle
The goal of this project is to develop a model for anime recommendation, which takes an anime name as an input and recommends a list of related anime based on this name.
\section{Algorithm description and examples}
We will be using the collaborative filtering approach to develop our model. \\
Collaborative filtering is a popular method for making personalized recommendations based on the preferences of other users with similar tastes. In this approach, the recommendation system analyzes a large data-set of user-item ratings to identify patterns and similarities between users and anime, and uses this information to make recommendations to new users.
We will represent the users and anime in the data-set as embedding vectors. Embeddings provide a way to transform high-dimensional data into a lower-dimensional space while preserving relevant relationships.
We have decided to use and test multiple metrics such as
\begin{itemize}
\item Cosine similarity
\item Mahalanobis Distance
\item Euclidean Distance
\end{itemize}
together with K-nearest neighbors algorithm. \\
\subsection{Embedding users and anime}
In order to embed users and anime we must choose which features of the user/anime are the most important for our model \\
For the user we are restricted to what the database offers, which is essentially only the ratings the user gave to anime. We will compare the performance of our algorithm when:
\begin{itemize}
\item Given ratings of the user for all anime that they gave rating to (no matter the watching status)
\item Given ratings of the user for all anime that they have \emph{completed}
\end{itemize}
For anime we will use data:
\begin{itemize}
\item User score
\item Popularity (this includes popularity itself, number of members and favourites)
\item Controversy (Whether the anime gets a lot of 1s and 10s or just 6s)
\end{itemize}
We have decided to combine the data with two different methods and compare the results \\
The first method is the dot product, which yields a single number from two vectors of equal length \\
The second is concatenation, which returns a new vector made of the first vector followed by the second \\
As input for all our metrics (like cosine similarity) we will use the vectors of anime and users
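
The two ways of combining vectors can be sketched as follows. This is a minimal illustration in plain Python; the function names and the sample embedding values are hypothetical and do not come from the data-set.
\begin{verbatim}
def dot_product(first, second):
    """Multiply two equal-length vectors into a single number."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(first, second))


def concatenate(first, second):
    """Join two vectors into one longer vector."""
    return list(first) + list(second)


user_vector = [1.0, 0.5, 2.0]   # hypothetical user embedding
anime_vector = [2.0, 4.0, 0.5]  # hypothetical anime embedding
print(dot_product(user_vector, anime_vector))
print(concatenate(user_vector, anime_vector))
\end{verbatim}
The dot product collapses the pair into one number (here 5.0), while concatenation keeps every component and doubles the vector length, which is the trade-off we intend to compare.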
\subsection{Metrics}
%\emph{Cosine similarity} is a metric used to compute the similarity between two users based on their ratings or preferences for anime. In our case, we will use cosine similarity to compute the similarity between each pair of users in our data-set based on their anime ratings. This will result in a similarity matrix, where each entry (i,j) represents the cosine similarity between users i and j.
\paragraph{Cosine similarity} is a measure used to calculate the similarity between two vectors representing items or users. It evaluates the cosine of the angle between the two vectors, indicating their directional similarity. In our use case, cosine similarity can help identify items or users with similar preferences or characteristics by comparing their embedding vectors. Higher cosine similarity values (closer to 1) indicate a stronger similarity between the vectors, suggesting that the items or users are more likely to have similar features or preferences.
\paragraph{Mahalanobis distance} is a metric that takes into account the correlations between variables when measuring the distance between two vectors. In our recommendation system, we can leverage Mahalanobis distance to quantify the dissimilarity between the embedding vectors of items or users. By considering the correlations within the embedding dimensions, Mahalanobis distance provides a more accurate measure of dissimilarity. It helps identify items or users that are dissimilar based on their characteristics or preferences, accounting for the relationships between different features.
\paragraph{Euclidean distance} is a widely used metric to calculate the straight-line distance between two points in a multi-dimensional space. In our use case, we can apply Euclidean distance to measure the dissimilarity between the embedding vectors of items or users. By comparing the corresponding dimensions of the vectors and calculating the square root of the sum of squared differences, Euclidean distance provides a simple and intuitive measure of dissimilarity. It helps identify items or users that are relatively far apart in terms of their characteristics or preferences, based on the geometric distance in the embedding space.
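
Two of the metrics above can be written down directly in a few lines. The sketch below is only illustrative: the function names and sample vectors are hypothetical, the vectors are assumed to be non-zero and of equal length, and Mahalanobis distance is left out because it additionally needs the inverse covariance matrix of the data.
\begin{verbatim}
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice as long
print(cosine_similarity(u, v), euclidean_distance(u, v))
\end{verbatim}
The two sample vectors point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is 5; this shows that the two metrics can rank the same pair very differently.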
%\subsubsection{Cosine similarity input}
%A DIFFERENT TEXT SHOULD GO HERE, NOT KNN
\paragraph{KNN} is a machine learning algorithm that we will use to find the K most similar users to a target user in the similarity matrix. The K most similar users will be our nearest neighbors, and we will use their ratings to generate recommendations for the target user. The number of nearest neighbors (K) that we choose will depend on the performance of our recommendation system on the validation set. We will experiment with different values of K and different weighting schemes for the ratings of the nearest neighbors to optimize the performance of our recommendation system. \\
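
The neighbour search itself can be sketched as below. This is a brute-force illustration with Euclidean distance, not the implementation we will use; the function name and the three sample rating vectors are hypothetical.
\begin{verbatim}
import math


def k_nearest(target, candidates, k):
    """Return names of the k candidates closest to target (Euclidean)."""
    def distance(vector):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(target, vector)))

    ranked = sorted(candidates, key=lambda name: distance(candidates[name]))
    return ranked[:k]


# Hypothetical rating vectors of three users
users = {"u1": [9, 8, 1], "u2": [2, 3, 9], "u3": [8, 9, 2]}
print(k_nearest([9, 9, 1], users, k=2))
\end{verbatim}
For the sample data the two nearest users are \texttt{u1} and \texttt{u3}, whose ratings follow the same pattern as the target.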
%\paragraph{Euclidean distance} measures distance between two points in a straight line. It is used to measure similarity between points. Points that are closer to each other are considered to be similar. We can use it for finding nearest neighbors and clusters.
%\paragraph{Mahalonobis distance} again measures distance between two points but this time it takes into consideration how different elements of data-set relate to each other. This should be an advantage over Euclidean distance which assumes that all inputs are independent of each other.
\subsection{Data used for recommendation}
In addition to users and their rating for anime, we will also use measures of:
\begin{itemize}
\item Controversy - An anime with lots of 1s and 10s might have exactly the same rating as an anime with mostly 6s, but it is much more controversial (hit or miss) and can be a good recommendation even though its rating might seem low
\item Combined data about popularity - Number of favorites, number of members in the group for fans of a given anime, and overall popularity on the site
\item Ranked - How good the anime is in comparison with other anime on the site
\end{itemize}
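
One possible way to turn controversy into a number is the standard deviation of the scores. This choice is our assumption for illustration only, since no formula has been fixed yet; the function name and the vote counts below are hypothetical.
\begin{verbatim}
import math


def controversy(score_counts):
    """Standard deviation of scores, given {score: number_of_votes}."""
    total = sum(score_counts.values())
    mean = sum(score * n for score, n in score_counts.items()) / total
    variance = sum(n * (score - mean) ** 2
                   for score, n in score_counts.items()) / total
    return math.sqrt(variance)


polarised = {1: 50, 10: 50}   # half 1s, half 10s
consistent = {5: 50, 6: 50}   # same mean, little spread
print(controversy(polarised), controversy(consistent))
\end{verbatim}
Both sample anime have a mean score of 5.5, but the polarised one has a spread of 4.5 against 0.5 for the consistent one, which is the distinction the controversy measure is meant to capture.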
%Collaborative filtering is a technique that recommends items based on the preferences of similar users or items. \\
%In our case, we will use the item-based collaborative filtering approach where we will recommend anime's based on their similarity to the input anime. \\
%Our model will use the cosine similarity metric to measure the similarity between anime's. \\
%We will also use a neighborhood-based approach to find the most similar anime's. \\
%Specifically, we will use the K-nearest neighbors (KNN) algorithm to identify the most similar anime's to the input anime.
\section{Selection and description of the data-sets}
The quality and size of the dataset may greatly affect the performance of the collaborative filtering algorithm. This is why we decided to choose a dataset that was newer and bigger than the one provided in the task.
We will be using the \href{https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020}{Anime Recommendations Database} from Kaggle. The data-set contains information about over 17,000 anime and over 300,000 users. \\
\subsection{Why this data set}
There are multiple reasons to use this specific data set over the other ones available on Kaggle (including the one proposed in the task description):
\begin{itemize}
\item It is considered to have 100\% usability by Kaggle
\begin{itemize}
\item It is tagged, has subtitle, description and cover image
\item It has a source, is public and was updated recently (2020)
\item It has the most permissive (CC0) license, good file format, and all files and columns are described
\end{itemize}
\item It is one of the biggest anime databases on Kaggle (605 MB)
\item It has over 60 code examples of use
\end{itemize}
\subsection{Data-set description}
The data-set contains:
\begin{itemize}
\item The anime list per user (dropped, complete, plan to watch, currently watching and on hold)
\item Ratings given by users to the anime that they have watched completely.
\item Information about anime (exactly 35 columns!), most importantly:
\begin{itemize}
\item Name
\item User score
\item Popularity
\item Members
\item Favourites
\item Number of each particular score (from 1 to 10)
\end{itemize}
\item HTML files containing reviews, synopsis, staff info etc.
\end{itemize}
There are 5 csv files in total and one HTML folder
\subsection{anime.csv}
It contains information about all anime scraped \\
In total it contains \textbf{35} columns \\
Information that, in our opinion, might be useful for our algorithm and is not contained in the rest of the files:
\begin{itemize}
\item Number of episodes - Some users might prefer shorter/longer anime
\item When it was aired - Some users might prefer older/newer anime styles
\item When it was premiered - Same as above
\item Studios - Some users might favour the style of a specific studio
\item Duration - Some users might prefer anime with a shorter single episode length
\item Ranked - Anime that might not be a hit with a certain user but is still considered to be very good can still be a good recommendation
\item Popularity - Anime that might not be a hit with a certain user but is very popular can still be a good recommendation
\item Members - Same as above
\item Favorites - Same as above
\item Number of each exact score from 1 to 10 - This can tell how ``controversial'' the anime is: an anime with lots of 1s and 10s might have exactly the same rating as an anime with mostly 6s, but it is much more controversial (hit or miss)
\end{itemize}
Information that might be useful to display:
\begin{itemize}
\item Type - Usually whether it is a series (TV or OVA) or movie
\item Number of episodes - How much time does it take to watch it
\item When it was aired - Start of anime being shown and end of it
\item When it was premiered - When the anime started
\item Studios - What studio produced it
\item Duration - How much time does it take to watch single episode
\item Ranked - How good it is in relation to all other anime on MyAnimeList
\end{itemize}
\subsection{anime\_with\_synopsis.csv}
CSV file containing anime info for humans \\
It contains 5 columns:
\begin{itemize}
\item (key) mal\_id - used to identify the anime
\item name - either English or R\=omaji version of the title
\item score - average anime score
\item genres - list of genres associated with this anime
\item synopsis - description of anime
\end{itemize}
This file will be useful for displaying the recommendation to the user with additional info
\subsection{animelist.csv}
This file contains information about all users' anime lists, regardless of their watching status \\
In total it contains over 105 million rows \\
There are over 60 million ratings given by users (compared with over 55 million for completed series only) \\
It contains 5 columns:
\begin{itemize}
\item (key) user\_id - random (but persistent through database) user id
\item (key) anime\_id - used to identify the anime
\item rating - what score this user set for this anime (zero is set if user did not set any score)
\item watching\_status - state ID for this anime in the anime list of the user (see Section \ref{watchingStatus}, watching\_status.csv)
\item watched\_episodes - how many episodes have been watched by the user
\end{itemize}
This file can potentially be useful, especially to determine what users with similar interests are planning to watch
\subsection{rating\_complete.csv}
This file contains information about all ratings given to anime by users who selected the watching\_status 2 option (completed) \\
In total it contains over 55 million ratings \\
It contains 3 columns:
\begin{itemize}
\item (key) user\_id - random (but persistent through database) user id
\item (key) anime\_id - used to identify the anime
\item rating - what score this user set for this anime (1-10 scale)
\end{itemize}
This is probably the most important file regarding our project
\subsection{watching\_status.csv}
\label{watchingStatus}
\begin{figure}[h]
\caption{Entire contents of watching\_status.csv file}
\centering
\begin{tabular}{| c | c |}
\hline
status & description \\
\hline
1 & Currently Watching \\
\hline
2 & Completed \\
\hline
3 & On Hold \\
\hline
4 & Dropped \\
\hline
6 & Plan to Watch \\
\hline
\end{tabular}
\end{figure}
The watching status file maps the numerical id of a watching status to its textual description \\
Status number 5 is missing from the file; it may correspond to a ``re-watching'' status, but the file does not confirm this \\
Meaning of particular descriptions:
\begin{itemize}
\item Currently Watching - Actively keeping up with the series
\item Completed - Watched the entire series/film
\item On Hold - Stopped watching it but possibly wanting to return to it
\item Dropped - Stopped watching it and decided not to return to it
\item Plan to Watch - Not watched it yet but plan to
\end{itemize}
For most purposes we will only be interested in status 2 (Completed), which tells us that the user marked the anime as already watched \\
Plan to Watch is also interesting, since it can be used to recommend anime based on what similar users are interested in
\subsection{HTML folder}
The HTML folder contains scraped HTML pages with information about specific anime. We will ignore this folder, as parsing HTML is a much bigger challenge and is outside the scope of our project
\subsection{Summary}
The two most important files are rating\_complete.csv, which drives the inner workings of the program, and anime\_with\_synopsis.csv, which is used to show the recommended anime to the user
\section{General plan of tests/experiments}
To evaluate and compare the performance of our collaborative filtering model with different parameter configurations, we will use grid search and 5-fold cross-validation.
Grid search will allow us to systematically test different combinations of hyperparameters, such as the number of nearest neighbors in KNN and the regularization strength in the matrix factorization algorithm, to find the best combination that maximizes the model's performance.
Cross-validation will help us to estimate the model's generalization performance and reduce the risk of overfitting to the training data.
% Note for us: Hyperparameters: Parameters that are set before training a machine learning model and cannot be learned from the data. Examples of hyperparameters include the learning rate, the number of hidden layers in a neural network, and the regularization strength.
% Note for us: Overfitting is a common problem in machine learning where a model fits the training data too closely and performs poorly on the testing or validation data. In other words, the model learns the patterns and noise in the training data to such a degree that it loses the ability to generalize and predict accurately on new data.
To measure the quality of our model's recommendations, we will use precision, recall, and F1-score metrics. Precision measures the proportion of relevant items among the recommended items, recall measures the proportion of relevant items that are recommended, and F1-score is the harmonic mean of precision and recall, providing a balanced measure of the model's performance. We will compute these metrics on the test set and report the average scores across the 5-fold cross-validation.
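As a sketch of the procedure described above (the symbols $\Theta$, $\theta$ and $M_j$ are our own notation, introduced only for illustration): let $\Theta$ be the grid of hyperparameter configurations and let $M_j(\theta)$ be the value of one metric, for example the F1-score, measured on held-out fold $j$ for the model trained with configuration $\theta$ on the remaining four folds. The reported score and the selected configuration are then
\[
\bar{M}(\theta) = \frac{1}{5} \sum_{j=1}^{5} M_j(\theta), \qquad \theta^{*} = \arg\max_{\theta \in \Theta} \bar{M}(\theta).
\]
For instance, a hypothetical grid with 4 candidate values of one hyperparameter and 3 candidate values of another would contain $4 \cdot 3 = 12$ configurations and require $12 \cdot 5 = 60$ training runs.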
%Our main goal is to evaluate the performance of the recommendation model. We will perform a 5-fold cross-validation to estimate the generalization performance of the model. We will also use precision, recall, and F1-score metrics to evaluate the model's performance. We will experiment with different values of K (number of neighbors) to find the optimal value.
\section{Methods of result visualization}
For the visualization, we chose three methods:
\begin{itemize}
\item Heatmaps: They are useful for visualizing the similarity between users or anime based on their ratings. They will allow us to identify clusters of similar users or anime and explore how their preferences are related to one another.
% https://rpubs.com/jeknov/movieRec
\begin{figure}
\centering
\caption{An exemplary heatmap}
\label{fig:heatmap}
\end{figure}
\item Histograms: They will be used to visualize the distribution of ratings and to identify any biases or patterns in the data. They can help us determine whether the ratings are normally distributed or skewed.
\begin{figure}
\centering
\caption{An exemplary histogram}
\label{fig:histogram}
\end{figure}
\item Precision-recall curves: They show the trade-off between precision and recall at different thresholds, allowing us to identify the optimal threshold for making recommendations. By visualizing the performance of the algorithm in this way, we can identify areas for improvement and generate new hypotheses for increasing the accuracy of the recommendations.
\end{itemize}
\begin{figure}
\centering
\caption{An exemplary precision-recall graph}
\label{fig:precission-recall}
\end{figure}
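The similarity measure shown in the heatmaps is not fixed by the description above; one common choice, used here only as an illustrative assumption, is cosine similarity. With $r_{u,i}$ denoting the rating of user $u$ for anime $i$ and $I_{uv}$ the set of anime rated by both users $u$ and $v$ (our own notation), the heatmap cell for the pair $(u, v)$ would show
\[
\mathrm{sim}(u, v) = \frac{\sum_{i \in I_{uv}} r_{u,i} \, r_{v,i}}{\sqrt{\sum_{i \in I_{uv}} r_{u,i}^{2}} \; \sqrt{\sum_{i \in I_{uv}} r_{v,i}^{2}}}.
\]
For example, ratings $(8, 6)$ and $(4, 3)$ on two shared anime give $\frac{32 + 18}{10 \cdot 5} = 1$: the two users rank the anime in the same proportion even though one of them scores consistently lower.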
\section{Definition of quality measures that will be used}
To evaluate the quality of our recommendation system, we will use a combination of precision, recall, F1-score, and mean average precision (MAP). Precision measures the proportion of relevant items among the recommended items, while recall measures the proportion of relevant items that were actually recommended. F1-score is the harmonic mean of precision and recall, providing a balanced measure of the system's performance.
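In symbols (a sketch in our own notation, not taken from any specific library): for a user $u$, let $R_u$ be the set of recommended anime and $T_u$ the set of anime that are relevant for $u$. Then
\[
\mathrm{Precision}_u = \frac{|R_u \cap T_u|}{|R_u|}, \qquad
\mathrm{Recall}_u = \frac{|R_u \cap T_u|}{|T_u|}, \qquad
F_{1,u} = \frac{2 \cdot \mathrm{Precision}_u \cdot \mathrm{Recall}_u}{\mathrm{Precision}_u + \mathrm{Recall}_u},
\]
where $F_{1,u}$ is taken to be $0$ when both precision and recall are $0$. For example, if 10 anime are recommended, 4 of them are relevant, and the user has 8 relevant anime in total, then precision is $4/10 = 0.4$, recall is $4/8 = 0.5$, and $F_1 = \frac{2 \cdot 0.4 \cdot 0.5}{0.4 + 0.5} \approx 0.444$.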
MAP is calculated by taking the average of the average precision (AP) for each user. AP is calculated as the mean of the precision values obtained at each relevant item position in the ranked list of recommended items.
To calculate MAP, we need to define a threshold for relevance, which can be either binary (relevant or not relevant) or graded (assigning a relevance score to each anime). In our case, we will use a binary threshold, where an anime is considered relevant if the user has rated it 8 or higher. We will also calculate MAP for different values of K (number of recommended items), as the quality of the recommendations may vary depending on the number of items displayed to the user.
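Written out (a sketch in our own notation; averaging over the relevant positions found in the list follows the description above, while other texts normalize differently): let $i_{u,1}, \dots, i_{u,K}$ be the top-$K$ recommendations for user $u$, let $\mathrm{rel}_u(k) = 1$ if the user's rating of $i_{u,k}$ is at least 8 and $0$ otherwise, and let $P_u(k) = \frac{1}{k} \sum_{j=1}^{k} \mathrm{rel}_u(j)$ be the precision of the first $k$ recommendations. With $m_u = \sum_{k=1}^{K} \mathrm{rel}_u(k)$,
\[
\mathrm{AP}_u@K = \frac{1}{m_u} \sum_{k=1}^{K} P_u(k) \, \mathrm{rel}_u(k), \qquad
\mathrm{MAP}@K = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP}_u@K,
\]
where $U$ is the set of evaluated users and $\mathrm{AP}_u@K$ is taken to be $0$ when $m_u = 0$. For example, for $K = 5$ and the relevance pattern $(1, 0, 1, 0, 0)$, the precision values at the two relevant positions are $1/1$ and $2/3$, so $\mathrm{AP}_u@5 = \frac{1}{2} \left( 1 + \frac{2}{3} \right) = \frac{5}{6} \approx 0.833$.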
%Quality Measures:
%We will use the following quality measures to evaluate the performance of the model:
% Precision: the ratio of true positive predictions to the total number of predicted positives.
% Recall: the ratio of true positive predictions to the total number of actual positives.
% F1-score: the harmonic mean of precision and recall.
%We will also use the mean average precision (MAP) as an additional quality measure to evaluate the model's performance.
\end{document}