Biases in learning to rank models and three approaches to deal with them

Search engines rely on models that rank the matching results for a given user query. These models optimize the order of items: they learn how to rank items in a result list, hence the name Learning-to-Rank (LTR) models.

Building the ground truth for these LTR models is challenging. Oftentimes, we data scientists “build” labels based on implicit user feedback: user interactions with items in a result list are interpreted as relevant, not relevant, or some degree of relevance in between. For example, a click or a long view time on an item may be considered a signal of relevance, while an impression without a click may indicate that the item was not relevant to the user. As a data scientist, you have to come up with a logic for how relevant an item was. This is difficult, because you can never be sure. I spent a lot of time reading papers and pondering how to assign labels.

However, these approaches to distinguishing relevant from non-relevant items were flawed, because relevance is not the only factor that influences user interaction. The resulting ground truth datasets always suffer from biases:

Position bias: Users have learned to click on the highest-ranked results. Thus, the top-ranked items are more likely to be clicked even if they are not the most relevant items. By looking at historic data, you cannot tell whether a user clicked because the result was truly relevant or because the item was ranked high. If you plot your CTR by position, you’re likely to see a steep decline after the first position.
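To check for that decline in your own data, a minimal sketch of the CTR-by-position calculation could look like this. The log format, a list of (position, clicked) tuples, and the numbers are assumptions made up for illustration.

```python
from collections import defaultdict


def ctr_by_position(impressions):
    """Return the click-through rate per rank position.

    impressions: iterable of (position, clicked) tuples, position starting at 1.
    This input format is a hypothetical one chosen for the example.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in impressions:
        shown[position] += 1
        clicked[position] += int(was_clicked)
    return {position: clicked[position] / shown[position] for position in sorted(shown)}


# Made-up impression log: four impressions each at positions 1, 2 and 3.
log = [
    (1, True), (1, True), (1, False), (1, True),
    (2, True), (2, False), (2, False), (2, False),
    (3, False), (3, False), (3, True), (3, False),
]
ctr = ctr_by_position(log)
for position, rate in ctr.items():
    print(position, rate)
```

A drop like the one between positions 1 and 2 in this toy log does not tell you on its own how much is due to position and how much to relevance, which is exactly the problem described above.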

Selection bias: Users can only interact with items that they saw. Oftentimes, users just look at the first result page and never make it to the second or third. How could your model learn that an item at position 100, which the user never saw, was actually relevant? It cannot.

The probability that a user clicks on an item at position k depends on two things: whether the user examined that position, and how relevant the item is. So what can you do?
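Before turning to the options, it may help to write that dependency down. A common simplification, which is an assumption here since the functional form is not spelled out above, is that the two factors multiply: the click probability is the probability of examining position k times the probability that the item is relevant. The numbers below are hypothetical.

```python
def click_probability(examination_prob, relevance_prob):
    """Assumed position-based model: P(click) = P(examined) * P(relevant)."""
    return examination_prob * relevance_prob


# Hypothetical examination probabilities for two positions.
examination_by_position = {1: 1.0, 5: 0.3}

# A mediocre item at position 1 versus a highly relevant item at position 5.
mediocre_item_on_top = click_probability(examination_by_position[1], 0.4)
relevant_item_lower = click_probability(examination_by_position[5], 0.9)

print(mediocre_item_on_top > relevant_item_lower)
```

Under these made-up numbers the mediocre item collects more clicks than the relevant one, simply because it is seen more often. Labels built from raw clicks would therefore rank the two items the wrong way round.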

Online: randomize order of items

If you are really bold, you create an AB-Test and change your search ranking in production: Group A sees the results ranked by your model, while users in group B see the items in random order. Group B will have irrelevant items on top. Nevertheless, users in group B will still click on items in the first position. Compare the CTR per position of both groups and you get an estimate of how big your position bias is. You can later weight the observations in your ground truth according to the bias. Let’s assume the first position is 10 times more likely to be clicked than the 20th position. You could then count an item shown at the first position as one impression, while an item shown at the 20th position counts as only one tenth of an impression, because the propensity to click at the first position is 10 times larger than at the 20th position.
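The weighting idea can be sketched in a few lines. This assumes that in the randomized group the average relevance is the same at every position, so that differences in CTR between positions reflect position bias only. The function names and all numbers are hypothetical.

```python
def relative_propensities(random_group_ctr):
    """Scale the per-position CTR of the randomized group so that position 1 equals 1.0."""
    top_ctr = random_group_ctr[1]
    return {position: ctr / top_ctr for position, ctr in random_group_ctr.items()}


def debiased_ctr(impressions, propensities):
    """CTR of one item with each impression weighted by its position's propensity.

    impressions: list of (position, clicked) tuples for a single item.
    """
    clicks = sum(int(clicked) for _, clicked in impressions)
    weighted_impressions = sum(propensities[position] for position, _ in impressions)
    return clicks / weighted_impressions


# Made-up CTRs from the randomized group: position 1 is clicked 10 times as often as position 20.
random_group_ctr = {1: 0.20, 20: 0.02}
propensities = relative_propensities(random_group_ctr)

# An item shown 20 times at position 20 and clicked once.
item_log = [(20, True)] + [(20, False)] * 19
naive = sum(int(clicked) for _, clicked in item_log) / len(item_log)
corrected = debiased_ctr(item_log, propensities)
print(naive, corrected)
```

Each impression at position 20 counts as roughly a tenth of an impression, so the corrected CTR is about ten times the naive one. The corrected value is best read as "the CTR this item might have had at position 1", and it gets noisy when an item has few impressions.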

Completely randomizing the order will likely reduce the quality of search results in group B severely and may reduce your company’s revenue. In most companies, management would not let you experiment like that for the sake of understanding a bias.

You could be less extreme and rerank only the top n positions or swap positions between a pair of items. If you swap pairs, you need a lot of users or patience or both. Swapping pairs or the top n positions will only help you understand your biases if you have few items to rerank, but would not let you estimate the bias for larger result sets. In addition, it will not help you understand whether your ranking model suffers from selection bias.
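To see why pair swaps need so much traffic, here is a rough sketch of how swap results could be chained into propensities. It assumes one swap experiment per pair of adjacent positions, in which the two items are swapped at random so that both positions show the same items on average. The function name and the CTR values are hypothetical.

```python
def chained_propensities(swap_ctr):
    """Chain CTR ratios from adjacent-position swap experiments.

    swap_ctr: list where entry i holds (ctr_upper, ctr_lower), the CTRs measured
    at positions i+1 and i+2 in the swap experiment for that pair.
    Returns propensities relative to position 1.
    """
    propensities = [1.0]
    for ctr_upper, ctr_lower in swap_ctr:
        # The CTR ratio within one experiment estimates the propensity ratio of the pair.
        propensities.append(propensities[-1] * ctr_lower / ctr_upper)
    return propensities


# Made-up results of two experiments: positions 1 vs 2, and positions 2 vs 3.
swap_ctr = [(0.20, 0.10), (0.10, 0.05)]
propensities = chained_propensities(swap_ctr)
print(propensities)
```

Every additional position needs its own experiment, and the estimation error of each ratio carries over into all positions below it. That is one way to see why this gets impractical for long result lists.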

Offline: model inverse propensities

Several papers estimate position bias in an offline setting, which avoids changing your search models in production. The authors estimate click propensities per position and then use Inverse Propensity Scoring (IPS) to correct the relevance estimates for position bias.
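As a minimal sketch of the IPS idea, assuming the propensities per position are already known: every click is divided by the propensity of the position where it happened, so clicks at rarely examined positions count more. The function name, the propensities and the logs are hypothetical and not taken from any of the papers.

```python
def ips_relevance(impressions, propensity):
    """Inverse-propensity-weighted relevance estimate for one query-item pair.

    impressions: list of (position, clicked) tuples.
    propensity: dict mapping position to the assumed examination propensity.
    """
    weighted_clicks = sum(int(clicked) / propensity[position] for position, clicked in impressions)
    return weighted_clicks / len(impressions)


# Hypothetical propensities: position 5 is examined a quarter as often as position 1.
propensity = {1: 1.0, 5: 0.25}

# Item A: always shown at position 1, clicked 2 out of 4 times.
item_a = [(1, True), (1, True), (1, False), (1, False)]
# Item B: always shown at position 5, clicked 1 out of 4 times.
item_b = [(5, True), (5, False), (5, False), (5, False)]

print(ips_relevance(item_a, propensity), ips_relevance(item_b, propensity))
```

By raw CTR, item A looks twice as good as item B. After the correction, item B comes out ahead. With few impressions and small propensities the estimate can exceed 1 and has high variance, so in practice the weights are often clipped.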

Do the items in your result set change over time, or is your model trained with time-dependent features? Do users repeat their search queries? This is often the case in e-commerce settings, where the availability of items changes over time: products are added to or removed from an online shop. In that case, you could estimate the bias by training a model on item pairs that appeared in different positions for the same search query.

In practice, this rarely works. Depending on your context, there are few identical queries, and novelty may itself be an important feature that makes an item relevant. You may also not have enough variation in ranks for the same query to train a model on item pairs. Wang et al. (2018)[2] showed that it is also possible to estimate propensities without using ranking features: they estimate the propensities directly by maximizing a likelihood function from which the relevance has been eliminated.

Offline: add randomness to your ground truth as “not relevant” items

We improved our models by simply adding randomness to the ground truth in the form of irrelevant items: we assumed that any random item is worse than an item the user clicked on. By pairing a random item with a relevant (clicked) item for 10 % of our ground truth pairs, we increased the diversity of search results in our offline evaluation and increased the CTR in an AB-Test.
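One possible reading of this step, sketched below, is to replace the non-clicked item in roughly 10 % of the training pairs with an item drawn at random from the catalog. The pair format, the function name and the choice to replace rather than append pairs are assumptions for illustration, not a description of our exact pipeline.

```python
import random


def add_random_negatives(pairs, catalog, share=0.10, seed=0):
    """Swap the negative item for a random catalog item in about `share` of the pairs.

    pairs: list of (query, clicked_item, non_clicked_item) tuples (hypothetical format).
    catalog: list of all item ids that can be sampled as a random negative.
    """
    rng = random.Random(seed)
    augmented = []
    for query, positive, negative in pairs:
        if rng.random() < share:
            # Never sample the clicked item itself as its own negative.
            candidates = [item for item in catalog if item != positive]
            if candidates:
                negative = rng.choice(candidates)
        augmented.append((query, positive, negative))
    return augmented


# Made-up data: 100 pairs for one query over a catalog of 50 items.
catalog = [f"item_{i}" for i in range(50)]
pairs = [("red shoes", "item_0", "item_1") for _ in range(100)]
augmented = add_random_negatives(pairs, catalog)
replaced = sum(1 for _, _, negative in augmented if negative != "item_1")
print(replaced)
```

A random item can occasionally be relevant to the query, so it may be worth excluding items that were clicked for the same query elsewhere in your logs.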

Summary

There are different approaches to reducing bias in LTR:

Randomization of the result list may help you estimate the biases, but comes at the cost of quality.
Modelling the bias offline could be a less risky yet challenging option. You would still need to AB-Test your model online to see if it indeed leads to more relevance for the user.
Adding randomness to your ground truth could be a quick and easy AB-Test you should try out.

Heike Maria (PhD)