Short and Sweet: Enhancing LLM Performance with Constrained Chain-of-Thought
|LLM|PROMPT ENGINEERING|COT|REASONING|

Brevity is a great charm of eloquence. – Marcus Tullius Cicero
Brevity and conciseness are the parents of correction. – Hosea Ballou
Large language models (LLMs) have shown intriguing reasoning capabilities. Their use has also given rise to a new field of application: prompt engineering. Since we interact with these models through prompts, a range of prompting techniques has been developed to improve their reasoning capabilities.
Prompt Engineering to Leverage In-Context Learning in Large Language Models
One of the most intriguing of these techniques is chain-of-thought (CoT) prompting [1]. CoT prompts the model to arrive at the solution through intermediate reasoning steps (instead of generating the answer directly); this improves correctness on reasoning problems and shows how the model reaches the solution (or where its reasoning goes wrong).

Multimodal Chain of Thoughts: Solving Problems in a Multimodal World
This technique is intriguing [2–3] because it also works in zero-shot settings. Just by forcing the model to reason step by step (with the simple addition of "let's think step by step" to the prompt), the results on reasoning problems improve dramatically.
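As a concrete illustration, here is a minimal sketch of what zero-shot CoT prompting looks like in code. The `ask_llm` helper is a hypothetical placeholder for whatever client or local model you use, not an API from the cited papers.

```python
# Minimal sketch of zero-shot chain-of-thought prompting.
# `ask_llm` is a hypothetical placeholder for your LLM client of choice.

def build_zero_shot_cot(question: str) -> str:
    """Append the zero-shot CoT trigger phrase to a question."""
    return f"Q: {question}\nA: Let's think step by step."

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your API call or local model here")

question = "A farmer has 12 cows, buys 7 more, and sells 5. How many cows are left?"
prompt = build_zero_shot_cot(question)
# answer = ask_llm(prompt)  # the model now spells out intermediate reasoning steps
print(prompt)
```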
Of course, this technique also has disadvantages: the model produces long outputs, and system latency (the time it takes to complete the response) increases. This stems from the autoregressive nature of the model, which generates one token at a time. The additional computational cost and latency are undesirable when the model has to interact with users.
A Requiem for the Transformer?
Are all these reasoning steps really necessary? Can a model's verbosity be reined in?
Today's models are increasingly verbose. Whereas answers used to be much shorter, newer LLMs tend to produce longer and longer outputs. In part, this is desirable behavior: in theory, these responses are more complete and cover the question more thoroughly. On the other hand, the response is often unnecessarily verbose (especially when the question calls for a short answer). For a user, an overly long answer can be frustrating, especially in multi-turn settings. Nor is a longer response always better: it is often full of digressions and irrelevant details, and it carries a greater risk of hallucinations.
One of the problems is that there are no evaluation metrics that take into account the conciseness of the outputs, nor any that penalize excessively long chains of reasoning. Intuitively, the longer the chain of reasoning, the greater the risk that one of the intermediate steps is erroneous. An erroneous intermediate step is then difficult for an LLM to correct (again because of its autoregressive nature).
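A toy calculation makes the intuition concrete (under an idealized, independent-step assumption that real reasoning chains do not satisfy): if every intermediate step is correct with probability p, the whole chain is correct only with probability p^L, which shrinks quickly as the chain grows.

```python
# Toy illustration: probability that an L-step chain is fully correct,
# assuming each step is independently correct with probability p
# (a simplification; real CoT steps are not independent).
p = 0.95
for L in (3, 10, 30):
    print(L, round(p ** L, 2))  # 3 -> 0.86, 10 -> 0.6, 30 -> 0.21
```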
On what does the length of an LLM response depend?
Several factors account for the length of the generated response. The main ones are the question asked, the model architecture and size, pre- and post-processing steps, prompt engineering techniques, and any context added to the prompt.
As can be easily imagined, generation time increases as more tokens are generated. Furthermore, the larger the model, the longer it takes to produce the same response (a 70B-parameter model will take longer to generate the same number of tokens than a 7B one).

CoT increases the generation time of a model since intermediate reasoning steps must also be generated. A larger number of tokens therefore means a longer generation time per response.
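A back-of-the-envelope sketch of why this matters for latency: with autoregressive decoding, response time grows roughly linearly with the number of generated tokens, and the per-token cost grows with model size. The per-token times below are made-up illustrative values, not measurements from the paper.

```python
# Rough latency estimate for autoregressive decoding.
# Per-token times are illustrative assumptions, not benchmarks.
PER_TOKEN_SECONDS = {"7B": 0.02, "70B": 0.10}  # hypothetical values

def estimated_latency(num_tokens: int, model_size: str) -> float:
    """Latency grows ~linearly with generated tokens; bigger models cost more per token."""
    return num_tokens * PER_TOKEN_SECONDS[model_size]

print(estimated_latency(30, "70B"))   # ~3 s for a concise answer
print(estimated_latency(300, "70B"))  # ~30 s for a long reasoning chain
```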

Most studies so far have focused on accuracy and neglected efficiency, so we lack metrics that take efficiency into account. In [4], the authors propose three metrics that evaluate a model for both accuracy and conciseness (a sketch of how these could be computed follows the list):
- Hard-k Concise Accuracy. It measures the fraction of outputs that are correct and do not exceed a certain length k.
- Soft-k Concise Accuracy. Similar to the previous one, but it penalizes correct answers that exceed a certain length k.
- Consistent Concise Accuracy. A generalization of the previous metrics that takes into account the variation in length of all outputs.
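To make the definitions concrete, here is a minimal sketch of how the first two metrics could be computed. The hard-k version follows directly from the description above; for soft-k, the exact penalty used in [4] is not reproduced here, so the exponential decay below is only an assumed, illustrative choice, and Consistent Concise Accuracy (which also accounts for length variation across all outputs) is omitted.

```python
# Sketch of concise-accuracy metrics over (is_correct, answer_length_in_words) pairs.
# hard_k follows the definition above; the soft_k penalty is an assumed form,
# not necessarily the exact formula of [4].
import math

def hard_k_concise_accuracy(results, k):
    """Fraction of answers that are correct AND at most k words long."""
    return sum(1 for correct, n_words in results if correct and n_words <= k) / len(results)

def soft_k_concise_accuracy(results, k, alpha=10.0):
    """Like hard-k, but correct answers longer than k are penalized rather than discarded."""
    total = 0.0
    for correct, n_words in results:
        if correct:
            total += 1.0 if n_words <= k else math.exp(-(n_words - k) / alpha)
    return total / len(results)

results = [(True, 28), (True, 90), (False, 40), (True, 45)]
print(hard_k_concise_accuracy(results, k=45))  # 0.5
print(soft_k_concise_accuracy(results, k=45))  # slightly above 0.5: the 90-word answer is penalized, not dropped
```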
Now that we have a way to measure both accuracy and conciseness at the same time, we can try to limit the reasoning steps an LLM takes when we use CoT. The authors of [4] propose making this requirement explicit in the prompt to force the model to compress its reasoning: zero-shot CoT prompting with the added phrase "and limit the length of the answer to n words" (with n being the desired number of words).
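In code, the constrained CoT (CCoT) prompt is simply the zero-shot CoT prompt with the length request appended. The constraint phrase follows the one quoted above; the helper name is my own.

```python
# Constrained CoT (CCoT): zero-shot CoT plus an explicit word limit.
def build_ccot_prompt(question: str, n_words: int) -> str:
    return (
        f"Q: {question}\n"
        f"A: Let's think step by step "
        f"and limit the length of the answer to {n_words} words."
    )

print(build_ccot_prompt("If a train travels 60 km in 45 minutes, what is its speed in km/h?", 45))
```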

Once you have a prompt that forces the model to respond more succinctly, you may wonder whether this impacts the model's accuracy. In the study, the authors analyze five pre-trained models (Vicuna-13B, Falcon 7B and 40B, Llama2 7B and 70B), test them on the GSM8K benchmark (the most widely used dataset for reasoning problems), and try different values of n (15, 30, 45, 60, 100).
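Schematically, such an evaluation might look like the loop below, where `model_generate`, `gsm8k` (an iterable of question/gold-answer pairs), and `is_correct` are hypothetical placeholders rather than APIs from the paper.

```python
# Schematic evaluation loop over different word limits n, in the spirit of [4].
# `model_generate`, `gsm8k`, and `is_correct` are hypothetical placeholders.
WORD_LIMITS = [15, 30, 45, 60, 100]

def evaluate_ccot(model_generate, gsm8k, is_correct):
    scores = {}
    for n in WORD_LIMITS:
        hits, total = 0, 0
        for question, gold in gsm8k:
            prompt = (f"Q: {question}\nA: Let's think step by step "
                      f"and limit the length of the answer to {n} words.")
            answer = model_generate(prompt)
            hits += int(is_correct(answer, gold))
            total += 1
        scores[n] = hits / total
    return scores  # accuracy per word limit, to compare against unconstrained CoT
```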
The results show that constraining the number of tokens significantly decreases generation time (which is expected, since the model produces far fewer output tokens). This result would be meaningless if the model were wrong (we would have only a fast but incorrect answer). Surprisingly, for models such as Llama2–70B and Vicuna-13B, adding the length constraint increases accuracy (which is not the case for Falcon 7B and 40B).

According to the authors, this variability depends on factors intrinsic to the models, such as their size and training. Smaller models seem to benefit less from this approach (indeed, they perform worse). In addition, Llama2–70B (the model that benefits the most) was trained on a much larger and more varied dataset, and it starts from a higher reasoning baseline.
An example of a Llama2–70B response to a math problem, showing the base response, the CoT response, and responses under different length constraints. It is interesting how the model manages to arrive at a correct answer, along with its reasoning intermediates, even with few tokens available.

Analyzing the length distribution of the models' outputs shows some interesting results. The red line is the median, while the blue line is the number of tokens the LLMs should have respected according to the imposed length constraint. Without the length constraint the models produce longer responses, but even with the constraint they do not fully meet it (the median is above the blue line).

Earlier, we defined three metrics to evaluate models in light of both accuracy and conciseness. Under Hard-k concise accuracy, measured accuracy is lower than plain accuracy, since answers are also required to be short. If we choose values of k that are too low (k being the number of words beyond which an answer counts as wrong even if it is right), even constrained CoT scores poorly (partly because the models do not meet the desired length). For reasonable values of k, answers produced with the proposed constrained CoT are both more accurate and more concise.

These results are confirmed when we look at Soft-k Concise Accuracy (SCA). In this case, the value of α represents a tolerance for accepting answers longer than the desired limit k: in other words, whether or not we accept a correct answer that goes beyond a certain word limit. These results show that, even with constrained CoT, some correct answers exceed the limit. It could be that some questions simply require more reasoning steps and cannot be compressed beyond a certain threshold, or that the models struggle to meet a strict limit given their verbose nature.

Consistent Concise Accuracy, instead, looks at whether output lengths are consistent, and thus whether the models meet the constraint on average. Noticeably, with a larger length constraint the model has more freedom in how long its output can be (and it uses this freedom).

As we have seen in this article, LLMs are naturally verbose and tend to produce unnecessarily long responses. On the one hand, modern LLMs write richer and more complete answers; on the other, the answer of interest is buried in unsolicited details, you have to wait until generation finishes, and there is considerable latency. Reasoning chains are very useful for solving mathematical problems or when you have a system of agents. Because of the autoregressive nature of LLMs, these chains can be very long, and the model cannot easily correct wrong intermediate steps. In addition, an LLM can also get stuck while generating a reasoning chain (e.g., with ReAct prompting and agents). Therefore, forcing the model to adhere to a certain length has some appeal.
Forcing the CoT to a certain output length [4] not only does not reduce reasoning capabilities but even seems to improve performance for some models. It is not entirely clear why this happens (a mechanistic study would be interesting) or why it works better with larger models. It would also be interesting to study whether there is a link to hallucinations (the more tokens generated, the greater the risk, of course).
What do you think? Would you like to try a constrained CoT prompt? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects, and you can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.
GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…
or you may be interested in one of my recent articles:
AI Hallucinations: Can Memory Hold the Answer?
Can Generative AI Lead to AI Collapse?
Expanding Language, Expanding Thought: Vocabulary Size in LLM Scaling
References
Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.
- Wei, 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, link
- Fu, 2023, Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance, link
- Kojima, 2023, Large Language Models are Zero-Shot Reasoners, link
- Nayab, 2024, Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost, link