GPT-4 vs. ChatGPT: An Exploration of Training, Performance, Capabilities, and Limitations

OpenAI stunned the world when it dropped ChatGPT in late 2022. The new generative language model is expected to transform entire industries, including media, education, law, and tech. In short, ChatGPT threatens to disrupt just about everything. And even before we had time to truly envision a post-ChatGPT world, OpenAI dropped GPT-4.
Groundbreaking large language models have been released at an astonishing pace in recent months. If you still don't understand how ChatGPT differs from GPT-3, let alone GPT-4, I don't blame you.
In this article, we will cover the key similarities and differences between ChatGPT and GPT-4, including their training methods, performance and capabilities, and limitations.
ChatGPT vs. GPT-4: Similarities & differences in training methods
ChatGPT and GPT-4 both stand on the shoulders of giants, building on previous versions of GPT models while adding improvements to model architecture, employing more sophisticated training methods, and increasing the number of training parameters.
Both models are based on the transformer architecture. GPT-2 and GPT-3 use multi-head self-attention to decide which parts of the input text to pay the most attention to. They also use a decoder-only architecture that generates output sequences one token at a time, iteratively predicting the next token in a sequence. Although the precise architectures for ChatGPT and GPT-4 have not been released, we can assume they continue to be decoder-only models.
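To make the decoder-only idea concrete, here is a minimal NumPy sketch of causal (masked) self-attention feeding a greedy next-token loop. It uses a single attention head, toy dimensions, and random weights purely for illustration; it is not OpenAI's implementation.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence of token embeddings."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (seq_len, seq_len)

    # Causal mask: each position may only attend to itself and earlier tokens,
    # which is what lets a decoder-only model generate text one token at a time.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over allowed positions
    return weights @ V                                     # (seq_len, d_head)

# Toy greedy decoding loop: repeatedly predict the next token and append it.
rng = np.random.default_rng(0)
vocab_size, d_model, d_head = 50, 16, 16
embed = rng.normal(size=(vocab_size, d_model))             # token embedding table
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_out = rng.normal(size=(d_head, vocab_size))              # maps attention output to vocabulary logits

tokens = [1, 7, 42]                                        # a made-up prompt
for _ in range(5):
    x = embed[tokens]                                      # (seq_len, d_model)
    h = causal_self_attention(x, Wq, Wk, Wv)
    logits = h[-1] @ W_out                                 # logits for the next token only
    tokens.append(int(np.argmax(logits)))                  # greedy pick; real models sample
print(tokens)
```

Real GPT models stack many such attention layers (each with many heads) interleaved with feed-forward layers, but the one-token-at-a-time loop is the same idea.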
OpenAI's GPT-4 Technical Report offers little information on GPT-4's model architecture and training process, citing the "competitive landscape and the safety implications of large-scale models." What we do know is that ChatGPT and GPT-4 were likely trained in a similar manner, which is a departure from the training methods used for GPT-2 and GPT-3. We know much more about the training methods for ChatGPT than for GPT-4, so we'll start there.
ChatGPT
To start with, ChatGPT is trained on dialogue datasets, including demonstration data in which human annotators provide examples of the expected output of a chatbot assistant in response to specific prompts. This data is used to fine-tune GPT-3.5 with supervised learning, producing a policy model, which is then used to generate multiple responses when fed prompts. Human annotators next rank the responses to a given prompt from best to worst, and these rankings are used to train a reward model. The reward model is then used to iteratively fine-tune the policy model using reinforcement learning.

To sum it up in one sentence, ChatGPT is trained using Reinforcement Learning from Human Feedback (RLHF), a way of incorporating human feedback to improve a language model during training. This allows the model's output to align with the task requested by the user, rather than simply predicting the next word in a sequence based on a corpus of generic training data, as GPT-3 does.
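Here is a deliberately simplified sketch of those three stages. The function names, the linear reward model, and the 4-dimensional "response features" are all invented for illustration; the real pipeline trains large neural networks at each step.

```python
# A toy sketch of the RLHF stages described above; not OpenAI's code.
import numpy as np

rng = np.random.default_rng(0)

def supervised_fine_tune(base_model, demonstrations):
    """Stage 1: fine-tune on human demonstrations to get an initial policy model."""
    # Toy stand-in: the "policy" is just a function that scores a prompt.
    return lambda prompt: rng.normal()

def train_reward_model(ranked_pairs):
    """Stage 2: learn a scalar reward from human rankings.

    ranked_pairs holds (preferred_features, rejected_features) tuples; the loss
    is the pairwise objective  -log(sigmoid(r(preferred) - r(rejected))).
    """
    w = np.zeros(4)                                    # toy linear reward model
    learning_rate = 0.1
    for _ in range(200):
        for preferred, rejected in ranked_pairs:
            margin = w @ preferred - w @ rejected
            grad = -(1 - 1 / (1 + np.exp(-margin))) * (preferred - rejected)
            w -= learning_rate * grad                  # gradient step on the pairwise loss
    return lambda features: w @ features

def rl_fine_tune(policy, reward_model, prompts):
    """Stage 3: adjust the policy so its responses score higher under the reward model.

    Real systems use reinforcement learning (e.g. PPO) with a penalty that keeps the
    policy close to the supervised model; this stub only marks where that happens.
    """
    return policy

# Toy data: 4-dimensional "response features", preferred response listed first.
pairs = [(rng.normal(size=4) + 1.0, rng.normal(size=4)) for _ in range(20)]

policy = supervised_fine_tune(base_model=None, demonstrations=[])
reward = train_reward_model(pairs)
policy = rl_fine_tune(policy, reward, prompts=[])
print(reward(np.ones(4)))                              # higher means "better" under the learned reward
```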
GPT-4
OpenAI has yet to divulge details on how it trained GPT-4. Their Technical Report doesn't include "details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar." What we do know is that GPT-4 is a transformer-style generative multimodal model trained on both publicly available data and licensed third-party data and subsequently fine-tuned using RLHF. Interestingly, OpenAI did share details regarding their upgraded RLHF techniques to make the model responses more accurate and less likely to veer outside safety guardrails.
After training a policy model (as with ChatGPT), RLHF is used in adversarial training, a process that trains a model on malicious examples intended to deceive it, in order to defend the model against such examples in the future. In the case of GPT-4, human domain experts across several fields rate the responses of the policy model to adversarial prompts. These rated responses are then used to train additional reward models that iteratively fine-tune the policy model, resulting in a model that's less likely to give out dangerous, evasive, or inaccurate responses.
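OpenAI has not published the exact formulation, but one way to picture the effect of these additional reward models is as extra safety signals blended with the original reward, so that a fluent but unsafe response still scores poorly. The function and weighting below are purely illustrative:

```python
def combined_reward(features, helpfulness_rm, safety_rms, safety_weight=2.0):
    """Illustrative only: blend the original reward with additional safety reward
    models trained on expert-rated adversarial prompts, so a fluent but unsafe
    response still receives a low overall reward."""
    safety_score = sum(rm(features) for rm in safety_rms)
    return helpfulness_rm(features) + safety_weight * safety_score

# Toy usage with made-up reward functions.
helpfulness = lambda f: 0.8            # the response reads well...
safety = [lambda f: -1.5]              # ...but an expert-trained safety model flags it
print(combined_reward(None, helpfulness, safety))   # 0.8 + 2.0 * (-1.5) = -2.2
```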

ChatGPT vs. GPT-4: Similarities & differences in performance and capabilities
Capabilities
In terms of capabilities, ChatGPT and GPT-4 are more similar than they are different. Like its predecessor, GPT-4 interacts in a conversational style that aims to align with the user. As you can see below, the responses from the two models to a broad question are very similar.

OpenAI agrees that the distinction between the models can be subtle, claiming that the "difference comes out when the complexity of the task reaches a sufficient threshold." Given the six months of adversarial training the GPT-4 base model underwent in its post-training phase, this is probably an accurate characterization.
Unlike ChatGPT, which accepts only text, GPT-4 accepts prompts composed of both images and text, returning textual responses. Unfortunately, as of this article's publication, image input is not yet available to the public.
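Because the image interface is not public at the time of writing, the snippet below is a purely hypothetical sketch of what a combined image-and-text prompt might look like as a request payload; the field names are invented and do not reflect an actual OpenAI API schema.

```python
# Hypothetical illustration only: not an actual OpenAI request format.
multimodal_prompt = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is unusual about this image?"},
                {"type": "image", "data": "<base64-encoded image bytes>"},
            ],
        }
    ]
}
print(multimodal_prompt["messages"][0]["content"][0]["text"])
```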
Performance
As referenced earlier, OpenAI reports significant improvement in safety performance for GPT-4, compared to GPT-3.5 (from which ChatGPT was fine-tuned). However, whether the reduction in responses to requests for disallowed content, reduction in toxic content generation, and improved responses to sensitive topics are due to the GPT-4 model itself or the additional adversarial testing is unclear at this time.
Additionally, GPT-4 outperforms GPT-3.5 on most academic and professional exams taken by humans. Notably, GPT-4 scores in the 90th percentile on the Uniform Bar Exam, compared to GPT-3.5, which scores in the 10th percentile. GPT-4 also significantly outperforms its predecessor on traditional language model benchmarks and surpasses other SOTA models as well (although sometimes just barely).
ChatGPT vs. GPT-4: Similarities & differences in limitations
Both ChatGPT and GPT-4 have significant limitations and risks. The GPT-4 System Card includes insights from a detailed exploration of such risks conducted by OpenAI.
These are just a few of the risks associated with both models:
- Hallucination (the tendency to produce nonsensical or factually inaccurate content)
- Producing harmful content that violates OpenAI's policies (e.g. hate speech, incitements to violence)
- Amplifying and perpetuating stereotypes of marginalized people
- Generating realistic disinformation intended to deceive
ChatGPT and GPT-4 struggle with the same limitations and risks, but OpenAI has made special efforts, including extensive adversarial testing, to mitigate them for GPT-4. While this is encouraging, the GPT-4 System Card ultimately demonstrates how vulnerable ChatGPT was (and possibly still is). For a more detailed explanation of harmful unintended consequences, I recommend reading the GPT-4 System Card, which starts on page 38 of the GPT-4 Technical Report.
Conclusion
In this article, we reviewed the most important similarities and differences between ChatGPT and GPT-4, including their training methods, performance and capabilities, and limitations and risks.
While we know much less about the model architecture and training methods behind GPT-4, it appears to be a refined version of ChatGPT that now accepts image and text inputs and that OpenAI claims is safer, more accurate, and more creative. Unfortunately, we will have to take OpenAI's word for it, as GPT-4 is only available as part of the ChatGPT Plus subscription.
The table below summarizes the most important similarities and differences between ChatGPT and GPT-4:

| | ChatGPT | GPT-4 |
| --- | --- | --- |
| Base model | Fine-tuned from GPT-3.5 | Not disclosed |
| Training method | RLHF | RLHF plus adversarial training with domain experts |
| Inputs | Text only | Text and images (image input not yet public) |
| Outputs | Text | Text |
| Safety | Standard guardrails | Improved responses to disallowed and sensitive content (per OpenAI) |
| Key risks | Hallucination, harmful content, stereotyping, disinformation | Same risks, with additional mitigation efforts |

The race for creating the most accurate and dynamic large language models has reached breakneck speed, with the release of ChatGPT and GPT-4 within mere months of each other. Staying informed on the advancements, risks, and limitations of these models is essential as we navigate this exciting but rapidly evolving landscape of large language models.
If you'd like to stay up-to-date on the latest Data Science trends, technologies, and packages, consider becoming a Medium member. You'll get unlimited access to articles and blogs like Towards Data Science and you'll be supporting my writing. (I earn a small commission for each membership).