Two-Tower Networks and Negative Sampling in Recommender Systems
One of the most important types of models in recommender systems today is the two-tower neural network. It is structured as follows: one part of the network (a tower) processes all the information about the query (user, context), while the other tower processes information about the object. The outputs of these towers are embeddings, which are then multiplied together (dot product or cosine similarity, as we already discussed here). One of the first mentions of applying such networks to recommendations can be found in a very good paper about YouTube. By the way, I would now call this article a classic and the most suitable entry point into the field of recommendations.
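For illustration, here is a minimal sketch of such a model in PyTorch; the feature dimensions, tower depths, and layer sizes are purely hypothetical:

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Minimal two-tower model: each tower maps its own inputs to an embedding,
    and relevance is the dot product of the two embeddings (late crossing)."""

    def __init__(self, query_dim: int, item_dim: int, emb_dim: int = 64):
        super().__init__()
        # Query tower: user and context features.
        self.query_tower = nn.Sequential(
            nn.Linear(query_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )
        # Item (document) tower: object features.
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, query_features: torch.Tensor, item_features: torch.Tensor) -> torch.Tensor:
        q = self.query_tower(query_features)  # [batch, emb_dim]
        d = self.item_tower(item_features)    # [batch, emb_dim]
        # The towers interact only here, via the dot product.
        return (q * d).sum(dim=-1)            # [batch]
```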

What characterizes such networks? They are very similar to matrix factorization, which is in fact a special case that takes only user_id and item_id as input. However, compared with arbitrary networks, the restriction to late crossing (inputs from different towers are not allowed to fuse until the very end) makes two-tower networks extremely efficient to apply. To build recommendations for a single user, we need to run the query tower once, and then multiply this embedding with the embeddings of documents, which are usually pre-calculated. This process is very fast. Moreover, these pre-calculated document embeddings can be organized into an ANN index (e.g., HNSW) to quickly find good candidates without scanning the entire database.
Similarity Search, Part 4: Hierarchical Navigable Small World (HNSW)
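To make the serving path concrete, here is a continuation of the hypothetical sketch above: the document embeddings are computed offline once, and each request only needs one pass through the query tower plus a scoring step (the tensor names are made up for illustration; in production the exhaustive matrix product would be replaced by an ANN lookup such as HNSW):

```python
import torch

with torch.no_grad():
    # Offline: pre-compute embeddings for the whole catalogue with the item tower.
    item_embs = model.item_tower(all_item_features)        # [num_items, emb_dim]

    # Online, per request: one pass through the query tower...
    query_emb = model.query_tower(user_context_features)   # [emb_dim]

    # ...then a single matrix-vector product scores every document,
    # and we keep the best candidates.
    scores = item_embs @ query_emb                          # [num_items]
    top_scores, top_items = torch.topk(scores, k=100)
```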
We can achieve even more efficiency by computing the user part not for every query, but asynchronously, with some regularity. However, this means sacrificing the consideration of real-time history and context.
The towers themselves can be quite sophisticated. For example, in the user part, we can use the self-attention mechanism to process history, which results in a transformer for personalization. But what is the cost of introducing the restriction on late crossing? Naturally, it affects quality. In the same attention mechanism, we cannot use the items that we currently wish to recommend. Ideally, we would like to focus on similar items in the user's history. Therefore, networks with early crossing are typically used in the later stages of ranking, when there are only a few dozen or hundred candidates left, while those with late crossing (two-tower) are used, conversely, in the early stages and for candidate generation.
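As an illustration of such a query tower, here is a minimal sketch that encodes the user's interaction history with self-attention and pools it into a single query embedding; the vocabulary size, mean pooling, and layer sizes are all hypothetical choices:

```python
import torch
import torch.nn as nn

class AttentionQueryTower(nn.Module):
    """Query tower that runs self-attention over the user's item history
    and pools it into one embedding (a sketch, not a production recipe)."""

    def __init__(self, num_items: int, emb_dim: int = 64, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, emb_dim)
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, history_item_ids: torch.Tensor) -> torch.Tensor:
        # history_item_ids: [batch, seq_len] ids of items the user interacted with.
        h = self.item_emb(history_item_ids)   # [batch, seq_len, emb_dim]
        h = self.encoder(h)                   # self-attention over the history only
        return h.mean(dim=1)                  # [batch, emb_dim] query embedding
```

Note that, because of late crossing, the candidate item never enters this attention, which is exactly the quality cost described above.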
(However, there is a purely theoretical argument that any reasonable ranking of documents for different queries can be encoded through embeddings of sufficient dimensionality. Moreover, decoders in NLP actually operate on the same principle, only recalculating the query tower for each token.)
Loss Functions and Negative Sampling
A particular point of interest is the loss function used to train two-tower networks. In principle, they can be trained with any loss function, targeting various outcomes and even having multiple different ones for different heads (with different embeddings in each tower). However, an interesting variant is training with a softmax loss on in-batch negatives: for each positive query-document pair in a mini-batch, the other documents in the same batch are used as negatives for that query in the softmax loss. This method is a highly effective form of hard negative mining.
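A minimal sketch of this loss, assuming the rows of the batch are aligned so that query i and document i form the positive pair (the temperature is a common extra knob, not something discussed above):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(query_embs: torch.Tensor, item_embs: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """Softmax loss with in-batch negatives.

    query_embs, item_embs: [batch, emb_dim]; row i is a positive (query, document)
    pair, and every other document in the batch serves as a negative for query i.
    """
    logits = query_embs @ item_embs.T / temperature              # [batch, batch]
    labels = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```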
But it's important to consider the probabilistic interpretation of such a loss function, which is not always well understood. In a trained network,

$$\exp\bigl(\mathrm{score}(q, d)\bigr) \propto \frac{P(d \mid q)}{P(d)}$$

That is, the exponent of the score is proportional not to the probability of the document given the query, but to the query-specific PMI (pointwise mutual information). More popular documents will not necessarily be recommended more often by such a model, because during training they proportionally appear more often as negatives. Using the score as a feature can be beneficial, but for final ranking and candidate generation this can lead to very specific, yet poor-quality, documents.
In a paper, Google suggested combating this issue with a logQ correction during training. We, on the other hand, typically addressed it at the application stage rather than during training, by simply multiplying by the a priori probability of the document, P(d). However, we never compared these approaches, which would indeed be an interesting comparison.
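For illustration, a sketch of how such a correction can look, assuming the sampling probability of each in-batch document (roughly, its frequency in the training data) has been estimated separately; its logarithm is subtracted from the corresponding logits before the softmax:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_with_logq(query_embs: torch.Tensor, item_embs: torch.Tensor,
                               item_log_q: torch.Tensor) -> torch.Tensor:
    """In-batch softmax loss with a logQ correction.

    item_log_q: [batch] log of each in-batch document's sampling probability
    (in the simplest case, the log of its empirical frequency).
    """
    logits = query_embs @ item_embs.T            # [batch, batch]
    # Correct each document's logit by the log of its sampling probability,
    # so frequent documents are not penalized just for being frequent negatives.
    logits = logits - item_log_q.unsqueeze(0)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```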
Implicit Regularization: Bridging ALS and Modern Neural Networks
There is a collaborative filtering algorithm known as Implicit ALS (IALS). I have already mentioned it before. In the pre-neural-network era, it was arguably one of the most popular algorithms. Its distinguishing feature is the effective 'mining' of negatives: all user-object pairs without a history of interaction are treated as negatives (though with less weight than actual interactions). Moreover, unlike in actual mining, these negatives are not sampled but are used in their entirety on every iteration. This approach is known as implicit regularization.
How is this possible? Given reasonable task sizes (numbers of users and objects), there are so many negatives that even listing them would take longer than the entire training process. The beauty of the algorithm lies in the fact that, by using the MSE loss and the method of least squares, it's possible to pre-calculate certain elements separately for all users and separately for all objects before each full iteration, and this is sufficient for performing implicit regularization. This way, the algorithm avoids quadratic complexity. (For more details, I refer you to one of my favorite papers from that time.)
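A rough sketch of the user-side update, assuming the standard confidence weighting of IALS (every pair gets unit weight, and observed pairs get an extra weight on top of it); the signature is hypothetical, but the key point is that the Gramian over all objects is computed once per sweep and shared by every user:

```python
import numpy as np

def ials_user_sweep(user_item_ids, user_extra_weights, item_factors, reg):
    """One user sweep of IALS, sketched to highlight the implicit-regularization trick.

    user_item_ids[u]: indices of items user u interacted with;
    user_extra_weights[u]: extra confidence weights for those interactions;
    item_factors: [num_items, k] current item embeddings.
    """
    k = item_factors.shape[1]
    # The Gramian over ALL items is computed once per sweep. It is exactly what lets
    # every unobserved (user, item) pair act as a low-weight negative without ever
    # enumerating those pairs explicitly.
    gram = item_factors.T @ item_factors + reg * np.eye(k)       # [k, k]

    user_factors = np.zeros((len(user_item_ids), k))
    for u, (items, extra) in enumerate(zip(user_item_ids, user_extra_weights)):
        Y_u = item_factors[items]                                # only the observed items
        A = gram + (Y_u.T * extra) @ Y_u                         # add the observed-pair correction
        b = (Y_u.T * (1.0 + extra)) @ np.ones(len(items))        # weighted sum of observed item vectors
        user_factors[u] = np.linalg.solve(A, b)
    return user_factors
```

The item-side sweep is symmetric, with the Gramian computed over the user factors instead.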
A couple of years ago, I pondered whether it's possible to combine this wonderful idea of implicit regularization with the more advanced technology of two-tower neural networks. It's a tricky question: there's stochastic optimization instead of full-batch optimization, and there's a reluctance to revert to the MSE loss (at least for the entire task; specifically for regularization it might be okay), as it tends to yield inferior results.
I thought long and hard, and finally came up with a solution! For a couple of weeks, I was exhilarated, eagerly anticipating how we would try this out in place of in-batch negatives.
And then, of course (as often happens in such cases), I read in a paper that everything had already been thought of three years earlier. Again, it was Google. Later, in that same paper about logQ correction, they demonstrated that softmax loss with in-batch negatives works better than implicit regularization.
That's how we were able to save time and not test this idea.