Are All Rejected Recommendations Equally Bad?: Towards Analysing Rejected Recommendations
Shir Frumerman, Guy Shani, Bracha Shapira, Oren Sar Shalom
ACM UMAP 2019
## Idea
When we recommend items to a user, some of the recommendations are not chosen. These rejected recommendations are usually treated as plain mistakes.
The authors argue that a rejected recommendation may still influence the user's choice even though it was not picked: for example, a user did not click on "Die Hard" but then watched another Bruce Willis movie. Such a recommendation is not so bad after all, and perhaps it should not be penalized as harshly as it usually is.
The ultimate goal is to design an offline metric that correlates well with real online performance.
## User study
The authors ran a user study, showing participants a set of 5 items: a watched movie, 3 rejected recommendations, and the item chosen after the recommendation. The rejected recommendations were generated according to 4 conditions:
- only high content similarity
- only high collaborative similarity
- only high popularity similarity
- all medium similarities
Participants were asked "**How good is this recommendation?**" on a 1-5 scale. Mean ratings per condition:
| Content | Collaborative | Popularity | All medium |
| ------- | ------------- | ---------- | ----- |
| 3.8 | 3.52 | 2.93 | 1.99 |
## Proposal
If the standard precision is
$$p_u = \frac{|c_u \cap r_u|}{|r_u|}$$
where $c_u$ is the set of items chosen by user $u$ and $r_u$ the set of items recommended to them, then a refined precision can be defined as
$$p_u^{sim} = p_u + \frac{\sum_{i \in r_u \setminus c_u}\ \max_{j \in c_u:\ t(u,j) > t(u,i)} \operatorname{sim}(i, j)}{|r_u|}$$
where $t(u,i)$ is the time when user $u$ interacted with item $i$, so a rejected recommendation earns credit only for the most similar item the user chose afterwards.
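Below is a minimal sketch of this metric, assuming simple per-user Python structures; the names `recommended`, `chosen`, `shown_at` and the `sim` callback are illustrative, not taken from the paper.

```python
def refined_precision(recommended, chosen, shown_at, sim):
    """Standard precision plus similarity credit for rejected recommendations.

    recommended: list of item ids shown to user u (r_u)
    chosen:      dict item_id -> time t(u, j) of the user's interaction (c_u)
    shown_at:    dict item_id -> time t(u, i) at which recommendation i was shown
    sim:         function (i, j) -> similarity score in [0, 1]
    """
    if not recommended:
        return 0.0
    hits = sum(1 for i in recommended if i in chosen)       # |c_u ∩ r_u|
    credit = 0.0
    for i in recommended:
        if i in chosen:
            continue                                         # credit only rejected items
        # chosen items the user interacted with after recommendation i was shown
        later = [j for j, t_j in chosen.items() if t_j > shown_at[i]]
        if later:
            credit += max(sim(i, j) for j in later)
    return (hits + credit) / len(recommended)                # p_u + refinement term
```

Since each rejected item contributes at most 1 to the credit, $p_u^{sim}$ stays in $[0, 1]$ as long as $\operatorname{sim}$ does.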
## Evaluation
The authors used the Xing dataset, which contains user interactions with a job-search platform, including logs of what was recommended and what was clicked.
### "Online" evaluation
For the recommender system whose recommendations are logged in the dataset, measure the correlation between each precision variant (content-refined, collaborative-refined, regular) and the actual user clicks; a sketch of this comparison follows the table.
| Content | Collaborative | Regular |
| ------- | ------------- | ------- |
| 0.615 | 0.197 | 0.184 |
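A sketch of the comparison, assuming the per-user precision values and click counts have already been computed (the variable names are hypothetical):

```python
import numpy as np

def metric_click_correlation(precision_per_user, clicks_per_user):
    """Pearson correlation between a per-user precision variant and observed clicks."""
    return float(np.corrcoef(precision_per_user, clicks_per_user)[0, 1])

# e.g. compare metric_click_correlation(content_refined, clicks)
#      with    metric_click_correlation(regular_precision, clicks)
```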
### Offline evaluation
Split the logs 70/30 by time, then measure the correlation between each metric computed on the train part and the number of clicks per user on the test part, simulating a setting where a model would be trained on the train part; a sketch of the protocol follows the table.
| Clicks on train | Content | Collaborative | Random |
| ------------ | ------- | ------------- | ------ |
| 0.5 | 0.35 | 0.16 | 0.087 |
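A sketch of this protocol, assuming the log is a pandas DataFrame with hypothetical `user`, `timestamp` and `clicked` columns (not the paper's actual schema):

```python
import numpy as np
import pandas as pd

def temporal_split(log: pd.DataFrame, train_frac: float = 0.7):
    """Send the earliest 70% of interactions to train and the rest to test."""
    cutoff = log["timestamp"].quantile(train_frac)
    return log[log["timestamp"] <= cutoff], log[log["timestamp"] > cutoff]

# train, test = temporal_split(log)
# metric_per_user = refined (or regular) precision computed from train recommendations
# test_clicks     = test.groupby("user")["clicked"].sum() restricted to the same users
# offline_corr    = np.corrcoef(metric_per_user, test_clicks)[0, 1]
```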
## Open question
What is the best way to calculate item similarity?
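Two common candidates, sketched here with numpy as an assumption rather than the paper's implementation, are cosine similarity over item content features and cosine similarity over the items' interaction vectors:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def content_sim(item_features, i, j):
    """Cosine over item content vectors (e.g. genres, actors, tags)."""
    return cosine(item_features[i], item_features[j])

def collaborative_sim(user_item_matrix, i, j):
    """Cosine between the interaction columns of items i and j."""
    return cosine(user_item_matrix[:, i], user_item_matrix[:, j])
```

How sensitive the refined precision is to this choice is exactly what the open question points at.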