Chain-of-Thought Prompting Facilitates LLMs' Reasoning Abilities
Large language models (LLMs) have proven highly effective at solving a variety of tasks, from summarizing documents to writing code in different programming languages.
Moreover, they keep getting better with newly announced models such as ChatGPT and GPT-4, unlocking a world of opportunities for LLM-based applications.
Despite their extraordinary skills, LLMs sometimes fail to demonstrate very simple reasoning abilities and cannot solve questions that a fourth grader could easily tackle.
A lot of research in this area aims to understand why LLMs fail at such tasks and how to make them better.
One study that focuses on this particular issue is chain-of-thought prompting, introduced by the Google Research Brain team.
Chain-of-thought prompting
Chain-of-thought prompting is a prompting technique with a structure of {input, chain-of-thought, output}, where chain-of-thought is a series of intermediate natural language reasoning steps.
The model is given a few examples, each with an input, a chain of thought, and an output (few-shot learning), and is then asked a question that involves multi-step or arithmetic reasoning.
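To make the {input, chain-of-thought, output} structure concrete, here is a minimal sketch of how such examples could be stored and rendered into a few-shot prompt. The `Exemplar` class and the `build_prompt` function are hypothetical names I made up for illustration; they are not from the paper or from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One few-shot example: {input, chain-of-thought, output}."""
    question: str          # input
    chain_of_thought: str  # intermediate reasoning steps
    answer: str            # output


def build_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Render the exemplars in Q/A form, then append the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The final "A:" is left open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n".join(blocks)


exemplars = [
    Exemplar(
        question=(
            "If there are 3 cars in the parking lot and 2 more cars arrive, "
            "how many cars are in the parking lot?"
        ),
        chain_of_thought="There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5.",
        answer="5",
    ),
]

prompt = build_prompt(exemplars, "How many windows are there in John's houses?")
print(prompt)
```

Keeping the reasoning and the final answer in separate fields makes it easy to swap exemplars in and out without rewriting the prompt by hand.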
The key takeaways from the paper:
- Chain-of-thought prompting outperforms standard prompting
- The difference between chain-of-thought prompting and standard prompting becomes more evident on larger models. The performance gain from chain-of-thought prompting grows with the number of parameters of the model.
Chain-of-thought prompting simply means solving the problem step by step, with each step based on logical reasoning. Here is an example:
Question: John has 2 houses. Each house has 3 bedrooms and there are 2 windows in each bedroom. Each house has 1 kitchen with 2 windows. Also, each house has 5 windows that are not in the bedrooms or kitchens. How many windows are there in John's houses?
Answer: Each house has 3 bedrooms with 2 windows each, which makes it 6 bedroom windows. Each house has 1 kitchen with 2 windows, which makes it 2 kitchen windows. There are also 5 windows that are not in the kitchen or bedroom. So there are 6 + 2 + 5 = 13 windows per house. Since there are 2 houses, there are 2 x 13 = 26 windows in John's houses.
You can follow different steps to reach the correct answer. For such questions, there is almost always more than one path to the answer.
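As a quick sanity check, the two most obvious paths for the window question can be written out in Python. The variable names are my own; the numbers come straight from the question.

```python
houses = 2
bedrooms_per_house = 3
windows_per_bedroom = 2
kitchen_windows_per_house = 2  # 1 kitchen with 2 windows
other_windows_per_house = 5

# Path 1: count the windows in one house, then multiply by the number of houses.
windows_per_house = (
    bedrooms_per_house * windows_per_bedroom
    + kitchen_windows_per_house
    + other_windows_per_house
)
total_path_1 = houses * windows_per_house

# Path 2: count each window type across both houses, then add the totals.
bedroom_windows = houses * bedrooms_per_house * windows_per_bedroom
kitchen_windows = houses * kitchen_windows_per_house
other_windows = houses * other_windows_per_house
total_path_2 = bedroom_windows + kitchen_windows + other_windows

print(total_path_1, total_path_2)  # 26 26
```

Both paths land on 26, so a chain of thought that follows either one is equally valid.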
In standard prompting, we'd just pose the above question to an LLM and expect it to give the answer. Let's try it:
# model: text-davinci-003
prompt = """
John has 2 houses. Each house has 3 bedrooms and there are 2 windows in each bedroom.
Each house has 1 kitchen with 2 windows. Also, each house has 5 windows that are not in the bedrooms or kitchens.
How many windows are there in John's houses?
"""
# model response
"There are a total of 32 windows in John's houses."
Wrong answer!
We don't know how the model arrived at this answer, which brings us to another advantage of chain-of-thought prompting. With chain-of-thought prompting, the model responds with a step-by-step answer, which makes the debugging process easier. We can easily spot where things go wrong.
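To illustrate the debugging point, here is a rough sketch that scans a step-by-step response for arithmetic statements and checks each one. The `check_steps` and `evaluate` helpers are hypothetical, the response string is made up with a deliberate mistake, and the parser only understands whole numbers combined with +, -, x and *.

```python
import re
from typing import List, Tuple

# Matches statements such as "3 x 2 + 2 + 5 = 13" or "21 - 15 = 6".
STATEMENT = re.compile(r"(\d+(?:\s*[+\-x*]\s*\d+)+)\s*=\s*(\d+)")


def evaluate(expr: str) -> int:
    """Evaluate integers joined by +, -, x or * with the usual precedence."""
    tokens = re.findall(r"\d+|[+\-x*]", expr)
    total, sign, product = 0, 1, int(tokens[0])
    for op, num in zip(tokens[1::2], tokens[2::2]):
        if op in "x*":
            product *= int(num)
        else:
            # A + or - closes the current product term.
            total += sign * product
            sign = 1 if op == "+" else -1
            product = int(num)
    return total + sign * product


def check_steps(response: str) -> List[Tuple[str, bool]]:
    """Return each arithmetic statement in the text and whether it holds."""
    return [
        (f"{expr} = {claimed}", evaluate(expr) == int(claimed))
        for expr, claimed in STATEMENT.findall(response)
    ]


# Made-up response with a deliberate error in the last step.
response = (
    "So for each house, there are 3 x 2 + 2 + 5 = 13 windows. "
    "In total, there are 2 x 13 = 27 windows."
)
for step, ok in check_steps(response):
    print(step, "OK" if ok else "WRONG")
```

A check like this only catches arithmetic slips. A step that is computed correctly but rests on the wrong premise still needs a human read.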
With the chain-of-thought prompting technique, we add a few questions and their answers to the prompt to do few-shot prompting. The answers are in the form of a step-by-step solution (i.e. they demonstrate a chain of thought).
Here is the updated prompt and the response of the model:
# model: text-davinci-003
prompt = """
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot? 
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total? 
A: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they
had 74 - 35 = 39. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.
Q: John has 2 houses. Each house has 3 bedrooms and there are 2 windows in each bedroom.
Each house has 1 kitchen with 2 windows. Also, each house has 5 windows that are not in the bedrooms or kitchens.
How many windows are there in John's houses?
A: 
"""
# model response
"John has 2 houses. Each house has 3 bedrooms with 2 windows each, 
1 kitchen with 2 windows, and 5 windows that are not in the bedrooms or 
kitchens. So for each house, there are 3 x 2 + 2 + 5 = 13 windows. 
In total, there are 2 x 13 = 26 windows. The answer is 26."
Correct answer!
The response is a step-by-step explanation of the thought process of the model (chain-of-thought) just like the examples shown in the prompt.
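Because the examples in the prompt all end with "The answer is ...", the model tends to close its response the same way, which makes the final answer easy to pull out. Here is a small sketch; `extract_answer` is a hypothetical helper, and it assumes the response ends with that exact phrasing.

```python
import re
from typing import Optional

# Captures whatever follows "the answer is" at the very end of the response.
FINAL_ANSWER = re.compile(r"the answer is\s+(\S+?)\.?\s*$", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the final answer, or None if the closing phrase is missing."""
    match = FINAL_ANSWER.search(response.strip())
    return match.group(1) if match else None


response = (
    "John has 2 houses. Each house has 3 bedrooms with 2 windows each, "
    "1 kitchen with 2 windows, and 5 windows that are not in the bedrooms or "
    "kitchens. So for each house, there are 3 x 2 + 2 + 5 = 13 windows. "
    "In total, there are 2 x 13 = 26 windows. The answer is 26."
)
print(extract_answer(response))  # 26
```

Returning `None` instead of guessing keeps responses that drift from the expected format visible rather than silently miscounted.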
Let's go over another example. Here is the standard prompting version:
# model: text-davinci-003
prompt = """
We have a big meeting in 10 days, which is 5 days before our product 
launch on 08/24/2022. What is the date today in MM/DD/YYYY?
"""
# model response
"08/14/2022"
Wrong answer!
Let's try the same question with chain-of-thought prompting by providing a few input-output examples (few-shot learning):
# model: text-davinci-003
prompt = """
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot? 
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.
Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
A: If 2015 is coming in 36 hours, then it is coming in 2 days. 2 days before 01/01/2015 is 12/30/2014, 
so today is 12/30/2014. So one week from today will be 01/05/2015. So the answer is 01/05/2015.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.
Q: We have a big meeting in 10 days, which is 5 days before our product launch on 08/24/2022.
What is the date today in MM/DD/YYYY?
A: 
"""
# model response
"If the product launch is on 08/24/2022, then 10 days before that is
08/14/2022. 5 days before that is 08/09/2022. So today is 08/09/2022. The
answer is 08/09/2022."
Correct answer!
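We can verify the date with Python's `datetime` module. The sketch below follows the order given in the question: the launch is 5 days after the meeting, and the meeting is 10 days from today.

```python
from datetime import date, timedelta

launch = date(2022, 8, 24)
meeting = launch - timedelta(days=5)   # the meeting is 5 days before the launch
today = meeting - timedelta(days=10)   # the meeting is 10 days from today

print(meeting.strftime("%m/%d/%Y"))  # 08/19/2022
print(today.strftime("%m/%d/%Y"))    # 08/09/2022
```

Note that the model subtracted the 10 days first and the 5 days second. The order does not matter here because today is 15 days before the launch either way, but the intermediate date it mentions (08/14/2022) is not the meeting date.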
Final words
Chain-of-thought prompting clearly improves the performance of LLMs on certain tasks. I suggest reading the entire paper if you'd like to learn more about the experiments the authors conducted.
It is important to note that, as also mentioned in the paper, the benefits of chain-of-thought prompting only become evident in models with roughly 100 billion parameters or more, and it doesn't significantly enhance the performance of smaller models.
The experiment results lead to the conclusion that smaller models produce fluent but illogical chains of thought, which results in worse performance than standard prompting.
References
Wei, Jason, et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. ArXiv. https://arxiv.org/abs/2201.11903

