Evaluating Text Generation in Large Language Models

Recently, large language models (LLMs) have shown a tremendous ability to generate human-like text. There are many metrics for measuring how close a text generated by an LLM is to a reference human text; in fact, closing this gap is an active area of research.
In this post, we look into two well-known metrics for automatically evaluating machine-generated text.
BERTScore
Suppose you are given a human-written reference text and a candidate text generated by an LLM. To compute the semantic similarity between these two texts, BERTScore computes pairwise cosine similarities of their token embeddings. See the image below:

Here the reference text is "the weather is cold today" and the machine-generated candidate text is "it is freezing today". If we compute an n-gram similarity, these two texts receive a low score, even though we know they are semantically very similar. So BERTScore computes a contextual embedding for each token in both the reference text and the candidate text, and based on these embedding vectors it computes the pairwise cosine similarities.

Based on the pairwise cosine similarities, we can compute precision, recall, and an F1 score as follows (a small code sketch follows the list):
- Recall: for every token in the reference text, take the maximum cosine similarity over the candidate tokens, then average these maxima
- Precision: for every token in the candidate text, take the maximum cosine similarity over the reference tokens, then average these maxima
- F1 score: the harmonic mean of precision and recall
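
To make the greedy-matching step concrete, here is a minimal sketch in Python. It assumes the contextual token embeddings have already been produced by a model such as BERT; bertscore_prf is a hypothetical helper name, not part of any library.

```python
import numpy as np

def bertscore_prf(ref_emb, cand_emb):
    """Greedy-matching precision/recall/F1 from token embeddings.

    ref_emb:  (n_ref, d) array of contextual embeddings for reference tokens.
    cand_emb: (n_cand, d) array of contextual embeddings for candidate tokens.
    """
    # Normalize rows so that dot products become cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                  # (n_ref, n_cand) pairwise cosine similarities
    recall = sim.max(axis=1).mean()     # best match for each reference token
    precision = sim.max(axis=0).mean()  # best match for each candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy usage with random vectors standing in for BERT embeddings.
rng = np.random.default_rng(0)
p, r, f1 = bertscore_prf(rng.normal(size=(5, 8)), rng.normal(size=(4, 8)))
print(p, r, f1)
```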

BERTScore [1] also proposes a modification to the above score called "importance weighting". Importance weighting captures the fact that rare words shared between two sentences are more indicative of their similarity than common words. As a result, the authors use idf (inverse document frequency) as the importance weight of each token. They compute the idf for each token as follows:
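Reconstructing the formula from [1]: given $M$ reference sentences $\{x^{(i)}\}_{i=1}^{M}$, the idf of a word $w$ is

$$\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\left[w \in x^{(i)}\right]$$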

Here, M is the total number of sentences in the corpus and I[·] is an indicator function of whether word w appears in a given sentence. They apply plus-one smoothing in the numerator of the idf to handle unknown words.
Therefore, using idf weights, the formula for recall becomes the following:
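Reconstructing again from [1], with reference tokens $x_i \in x$, candidate tokens $\hat{x}_j \in \hat{x}$, and each token represented by its contextual embedding, the idf-weighted recall is

$$R_{\mathrm{BERT}} = \frac{\sum_{x_i \in x} \mathrm{idf}(x_i)\, \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j}{\sum_{x_i \in x} \mathrm{idf}(x_i)}$$

Because the embeddings are pre-normalized, the dot product $x_i^{\top}\hat{x}_j$ is exactly the cosine similarity.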

and the BERTScore for the above example is computed as follows:

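In practice, you would not compute these similarity matrices by hand: the bert-score package that accompanies [1] wraps the whole pipeline, including tokenization and embedding. A minimal usage sketch for the example above, assuming the `bert-score` pip package is installed:

```python
# pip install bert-score
from bert_score import score

refs = ["the weather is cold today"]
cands = ["it is freezing today"]

# Returns per-sentence precision, recall, and F1 as tensors.
P, R, F1 = score(cands, refs, lang="en")
print(P[0].item(), R[0].item(), F1[0].item())
```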
Mauve
Autoregressive language models define a distribution over the tokens of a vocabulary; coupled with a decoding mechanism, they sample from this distribution and output one token at a time.
If you are not familiar with decoding mechanisms, read this post.
To compare the text generated by an LLM to human text, Mauve [2] compares the distribution over language induced by the LLM to the true distribution of human text.
Let P be the true (human) distribution and Q be the distribution generated by an LLM.

To compare P and Q, we must consider two types of error:
- Type-I error (false positive error): there exists a text x for which Q(x) is large but P(x) is small. In other words, the LLM assigns a high probability to a sequence/token that does not resemble human-like text.
- Type-II error (false negative error): there exists a text x for which P(x) is large but Q(x) is small. This is the reverse of the above: the LLM fails to cover human-like text, assigning a small probability Q(x) to text that has a large probability P(x) under the human distribution.

Now, if you are familiar with KL divergence, you know that KL(Q||P) measures the Type-I error, while KL(P||Q) measures the Type-II error.
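To see why, recall the definition of the KL divergence for discrete distributions:

$$KL(Q \,\|\, P) = \sum_{x} Q(x) \log \frac{Q(x)}{P(x)}$$

The expectation is taken under Q, so this sum blows up exactly when Q(x) is large while P(x) is small, which is the Type-I error described above; swapping the roles of P and Q gives the Type-II error.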
To build intuition, consider the following example of discrete distributions P and Q:

We see that Q is a uniform distribution with probability 1/3, while P is a binomial distribution with probability 0.4. We can compute KL(P||Q) and KL(Q||P) as follows:

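As a concrete check, here is a small Python computation of the two divergences, assuming the support is {0, 1, 2} so that the uniform Q has probability 1/3 on each outcome and P is Binomial(n=2, p=0.4):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions on a shared support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = np.array([0.36, 0.48, 0.16])   # Binomial(n=2, p=0.4) over {0, 1, 2}
Q = np.array([1/3, 1/3, 1/3])      # uniform over the same support

print(kl(P, Q))   # ~0.085 nats
print(kl(Q, P))   # ~0.097 nats -- note the asymmetry
```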
Unfortunately, KL(P||Q) or KL(Q||P) is infinite if the two distributions do not share the same support, which is exactly the situation in natural language generation. To address this caveat, Mauve [2] uses mixture distributions as follows:
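Reconstructing from [2]: for a mixture weight $\lambda \in (0, 1)$,

$$R_{\lambda} = \lambda P + (1 - \lambda) Q, \qquad \text{soft Type-I error} = KL(Q \,\|\, R_{\lambda}), \qquad \text{soft Type-II error} = KL(P \,\|\, R_{\lambda})$$

Since $R_{\lambda}$ places positive probability wherever either $P$ or $Q$ does, both KL terms are always finite.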

These mixture distributions allow Mauve [2] to measure a soft Type-I error and a soft Type-II error.
Then they define a divergence curve specified by the following data points:
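Reconstructing from [2], with a scaling constant $c > 0$ the divergence curve is

$$\mathcal{C}(P, Q) = \left\{ \left( e^{-c\, KL(Q \,\|\, R_{\lambda})},\; e^{-c\, KL(P \,\|\, R_{\lambda})} \right) \;:\; R_{\lambda} = \lambda P + (1 - \lambda) Q,\ \lambda \in (0, 1) \right\}$$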

Note that C(P,Q) is a collection of data points, where the first coordinate of each point is a transform of the soft Type-I error and the second coordinate is a transform of the soft Type-II error. Together these points define the divergence curve, which captures the trade-off between the two errors.
Mauve(P,Q) is then the area under this divergence curve.
Larger values of Mauve indicate that Q is closer to P.

How to compute Mauve: To measure Mauve in practice, we need access to the true distribution P, which we do not have. To overcome this, the paper [2] proposes a Monte Carlo estimate in the following manner:
- Sample text instances x ~ P and x' ~ Q.
- Having enough samples, pass them through an encoder to obtain their embedding vectors.
- Quantize the distributions by running k-means on the embedding vectors, clustering them into k clusters, so that every embedding (whether it came from P or from Q) is assigned one of the same k cluster labels.
- Now the support of each distribution has size k, and we compute the probability of a cluster label as the fraction of data points that fall into that cluster.
Having access to the quantized distributions of P and Q, we can compute their Mauve score as an estimate of the Mauve score of the original distributions.
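To make the last step concrete, here is a minimal sketch, assuming we already have the two k-bin cluster histograms. mauve_from_histograms is a hypothetical helper, and the scaling constant c and the lambda grid are illustrative choices, not the exact settings of the official implementation.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mauve_from_histograms(p, q, c=5.0, num_lambdas=100):
    """Sketch of a Mauve-style score for two quantized (histogram) distributions.

    p, q : 1-D arrays over the same k cluster labels, each summing to 1.
    c    : scaling constant applied before exponentiating the soft errors.
    """
    # Start at one extreme point of the curve and traverse it as lambda goes 0 -> 1.
    xs, ys = [1.0], [0.0]
    for lam in np.linspace(1e-4, 1 - 1e-4, num_lambdas):
        r = lam * p + (1 - lam) * q        # mixture distribution R_lambda
        xs.append(np.exp(-c * kl(q, r)))   # exp(-c * soft Type-I error)
        ys.append(np.exp(-c * kl(p, r)))   # exp(-c * soft Type-II error)
    xs.append(0.0); ys.append(1.0)         # other extreme point of the curve
    xs, ys = np.array(xs), np.array(ys)
    # Area under the divergence curve via the trapezoid rule
    # (x is monotone along the traversal, so absolute differences suffice).
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.abs(np.diff(xs))))

# Toy usage: cluster histograms for human text (p) and machine text (q), k = 4.
p_hist = np.array([0.40, 0.30, 0.20, 0.10])
q_hist = np.array([0.25, 0.25, 0.25, 0.25])
print(mauve_from_histograms(p_hist, q_hist))         # < 1: the distributions differ
print(mauve_from_histograms(p_hist, p_hist.copy()))  # identical histograms -> about 1
```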
The authors run the following experiments comparing divergence curves for GPT-2 and Grover under different decoding mechanisms. They show that generations from larger models and from nucleus sampling are closer to human text [2].

Conclusion
In this post, we looked at two metrics for systematically computing the similarity between LLM-generated text and human-generated text. The first approach, BERTScore, compares the tokens of the two texts using pairwise cosine similarities of their contextual embeddings. The second approach, Mauve, considers the distributions over generated and human text and uses KL divergences against a mixture distribution to compute the similarity.
If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/