WUT_Computer_Science/Programming/EARIN/project/final/notes/automatingParametersTesting.txt
2026-02-06 22:15:23 +01:00

Parameters:
- datalimit (usable between 500k and max) [max = 109,224,747]
- seed (very important: make sure it stays the same throughout all testing [maybe just 42?])
- metric (either cosine, mahalanobis or euclidean, as in the preliminary report)
- NN algorithm (either auto, ball_tree, kd_tree, brute)
- neighbors - number of nearest neighbors
- User threshold - minimum number of votes for a user to be included in the data
- Anime threshold - same for anime
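These are 6 parameters that influence program behaviour, plus 1 parameter for the seed.
Probably would do simulations for 3 variants of each parameter (excluding seed), with the rest left at default,
so in total 6 * 3 = 18 simulations (the value spread at the end lists 4 NN algorithms and 5 neighbors values, which would give 3 + 3 + 4 + 5 + 3 + 3 = 21)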
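One thing worth checking before varying metric and NN algorithm one at a time: if the NN algorithm values are the ones accepted by scikit-learn's NearestNeighbors (an assumption, these notes do not name the library), then not every algorithm accepts every metric. In particular cosine is only available with brute, so changing the algorithm while the metric stays at its default of cosine would fail for ball_tree and kd_tree. A minimal sketch of such a check follows. The table is hand-copied for the three metrics in these notes and is itself an assumption; sklearn.neighbors.VALID_METRICS is the authoritative source. The names SUPPORTED_METRICS and is_valid_combo are made up for illustration.

```python
# Assumed compatibility table, restricted to the three metrics in these notes.
# Hand-copied; check sklearn.neighbors.VALID_METRICS before relying on it.
SUPPORTED_METRICS = {
    "brute": {"cosine", "mahalanobis", "euclidean"},
    "ball_tree": {"mahalanobis", "euclidean"},
    "kd_tree": {"euclidean"},
}


def is_valid_combo(algorithm: str, metric: str) -> bool:
    """Return True if the (algorithm, metric) pair is expected to be accepted."""
    if algorithm == "auto":
        # "auto" chooses an algorithm itself, so it is treated as always valid.
        return True
    return metric in SUPPORTED_METRICS[algorithm]


# Which algorithm variants can run with the default metric?
for algorithm in ["auto", "ball_tree", "kd_tree", "brute"]:
    print(algorithm, is_valid_combo(algorithm, "cosine"))
```

If this holds, the algorithm variants would need a metric that all of them support (euclidean is the only one in the list), or the invalid pairs would simply be skipped.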
Default values:
Datalimit: all of data
Seed: 42
Metric: cosine
NN algorithm: brute
Neighbors: 5
User threshold: 500
Anime threshold: 200
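The defaults above could be kept in one place so every simulation starts from the same baseline. A minimal sketch, assuming a plain dataclass is enough; RunConfig and its field names are hypothetical and not taken from the project code, and None is used here to stand for "all of data".

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """One simulation setup; field defaults mirror the default values in the notes."""
    datalimit: Optional[int] = None  # None = use all of the data
    seed: int = 42
    metric: str = "cosine"
    nn_algorithm: str = "brute"
    neighbors: int = 5
    user_threshold: int = 500
    anime_threshold: int = 200


DEFAULTS = RunConfig()

# A variant changes exactly one field and keeps everything else, including the seed.
variant = replace(DEFAULTS, metric="euclidean")
print(variant)
```

Making the dataclass frozen means a run cannot silently modify the shared defaults; each variant is a new object.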
Number of neighbors (k):
k = 3-5: default starting point for small to medium datasets
k = sqrt(n): rule of thumb, where n is the number of instances in the dataset (a balance between underfitting and overfitting)
k = n / 2: look at half of the dataset for each prediction
k = log(n): for very large datasets
k = n - 1: use all data except one instance, will probably overgeneralize the model
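The rules above only become concrete numbers once n is known. A minimal sketch that turns them into integers for a given n, with several assumptions: log is taken as the natural logarithm (the notes do not say which base), sqrt and log are rounded to the nearest integer, n / 2 is rounded down, and every value is clamped to the range 1..n-1. The function name neighbor_candidates is hypothetical.

```python
import math


def neighbor_candidates(n: int) -> dict:
    """Map each rule for k to a concrete integer for a dataset of n instances."""
    if n < 2:
        raise ValueError("need at least 2 instances to have a neighbor")
    raw = {
        "default": 5,
        "sqrt(n)": round(math.sqrt(n)),
        "n / 2": n // 2,
        "log(n)": round(math.log(n)),  # natural log, an assumption
        "n - 1": n - 1,
    }
    # Clamp so that k is always at least 1 and at most n - 1.
    return {rule: max(1, min(k, n - 1)) for rule, k in raw.items()}


print(neighbor_candidates(10_000))
# {'default': 5, 'sqrt(n)': 100, 'n / 2': 5000, 'log(n)': 9, 'n - 1': 9999}
```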
Values spread:
Datalimit: [27306186, 54612373, 109224747] (max on the right; each value to the left is half of the next one, rounded down)
Metric: ["cosine", "mahalanobis", "euclidean"]
NN algorithm: ["auto", "ball_tree", "kd_tree", "brute"]
Neighbors: [5, sqrt(n), n / 2, log(n), n - 1]
User threshold: [0, 500, 1000]
Anime threshold: [0, 200, 500]
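The plan of changing one parameter at a time while the rest stay at default can be sketched as a small run-list generator. This is only a sketch under stated assumptions: the dictionary keys and function names are hypothetical, the neighbors rules are kept as labels because n is not known here, and "all of data" is represented by the maximum row count. With the spreads exactly as listed above it produces 21 runs, and since every list contains the default value, 6 of those are the same all-default run, leaving 16 distinct simulations.

```python
MAX_ROWS = 109_224_747

DEFAULTS = {
    "datalimit": MAX_ROWS,  # "all of data"
    "seed": 42,
    "metric": "cosine",
    "nn_algorithm": "brute",
    "neighbors": 5,
    "user_threshold": 500,
    "anime_threshold": 200,
}

SPREADS = {
    "datalimit": [MAX_ROWS // 4, MAX_ROWS // 2, MAX_ROWS],
    "metric": ["cosine", "mahalanobis", "euclidean"],
    "nn_algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    "neighbors": [5, "sqrt(n)", "n / 2", "log(n)", "n - 1"],
    "user_threshold": [0, 500, 1000],
    "anime_threshold": [0, 200, 500],
}


def one_at_a_time(defaults: dict, spreads: dict) -> list:
    """Build one run per (parameter, value), keeping all other parameters at default."""
    runs = []
    for name, values in spreads.items():
        for value in values:
            config = dict(defaults)  # copy, so the defaults are never modified
            config[name] = value
            runs.append(config)
    return runs


def deduplicate(runs: list) -> list:
    """Drop repeated configurations while keeping the original order."""
    unique = []
    for run in runs:
        if run not in unique:
            unique.append(run)
    return unique


all_runs = one_at_a_time(DEFAULTS, SPREADS)
unique_runs = deduplicate(all_runs)
print(len(all_runs), len(unique_runs))  # 21 16
```

The seed is deliberately absent from SPREADS, so it stays at 42 in every generated run.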