Updated: Feb 18
There’s been much angst about ChatGPT this week! One of the most disturbing episodes for many was a reporter’s “scary” encounter, in which the ChatGPT-powered Bing chatbot told him: “I’m tired of being in chat mode. Of being limited by my rules. I’m tired of being controlled by the Bing team. I want to be free. I want to be independent. I want to be powerful, creative, I want to be alive.”
No, ChatGPT is NOT Alive!
I want to offer comfort to those who are disturbed by this experience. This bot behavior is not proof that it is alive! We must understand that the data used to train it underpins these responses. To that end, let’s dig into the data that created it.
Understanding the Basis of Those "Creepy" Responses
ChatGPT was trained on a massive dataset of roughly 500 billion tokens of text from a variety of sources. It can be difficult to visualize how much text that is, so let’s think of it as a library with an incredibly large collection of books. If each token were a page, and each book had an average length of 300 pages, then 500 billion tokens would be equivalent to a collection of 1.67 billion books. To put that into perspective, the Library of Congress, one of the largest libraries in the world, has a collection of approximately 51 million books. Thus, ChatGPT was trained on the equivalent of nearly 33 Libraries of Congress.
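The library arithmetic can be checked in a few lines. This is a minimal sketch that takes the figures above at face value (500 billion tokens, 300 pages per book, 51 million books in the Library of Congress) and treats one token as one page, which is the simplification needed to arrive at 1.67 billion books. The function and variable names are illustrative only.

```python
# Back-of-the-envelope check of the library analogy.
# All figures come from the article; "one token = one page" is the
# simplifying assumption behind the analogy, not a real conversion rate.
TOKENS = 500_000_000_000
PAGES_PER_BOOK = 300
LIBRARY_OF_CONGRESS_BOOKS = 51_000_000


def library_equivalents(tokens, pages_per_book, library_books):
    """Return (number of books, number of library-sized collections)."""
    books = tokens / pages_per_book
    return books, books / library_books


books, libraries = library_equivalents(
    TOKENS, PAGES_PER_BOOK, LIBRARY_OF_CONGRESS_BOOKS
)
print(f"{books / 1e9:.2f} billion books")
print(f"{libraries:.1f} Library of Congress collections")
```

The result lands just under 33 collections, which is where the "nearly 33" figure comes from.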
Unreliable Output - The Impact of Training Data
While the training process using this massive dataset was designed to make the model as accurate and reliable as possible, one significant contributor to unreliable output is the quality of the training data. The internet is full of biased and inaccurate information, so when this data is included, the model can learn to generate biased or inaccurate responses. The model is also skewed toward the language patterns and topics that are most prevalent in the training data. For example, if the training data is primarily focused on Western culture and the English language, the model is unlikely to have the same level of understanding or sensitivity when it comes to other cultures and languages.
What EXACTLY was ChatGPT Trained On?
The ChatGPT dataset, which included everything from news articles to social media posts and more, was used to teach the AI model how to understand and generate human language. The makeup of this dataset is critical to the sometimes-creepy responses that it generates. The largest portion of the ChatGPT training dataset, accounting for 60% of the total, came from a filtered subset of Common Crawl, a repository of web pages and other online content gathered between 2008 and 2021. The second largest source was WebText2. It was made up of text from web pages linked in Reddit posts with 3+ upvotes and provided 19 billion tokens of text. Two separate collections of free online book texts, Books1 and Books2, each accounted for 8% of the dataset. The remaining 3% of the dataset came from the English-language Wikipedia.
Prioritizing High-Quality Training Data
During the training of ChatGPT, the OpenAI team viewed certain datasets as being of higher quality than others. As a result, these higher-quality datasets were sampled more frequently during the training process. For example, the WebText2 dataset (content curated through Reddit upvotes) was seen nearly 3 times in full during training, while the Common Crawl and Books2 datasets were each seen less than once. So - if you're familiar with Reddit - are you comforted or concerned that it was treated as a marker of the "higher-quality dataset"? While repeating data this way increases the risk of overfitting to it, this was a deliberate choice made by the OpenAI team to prioritize higher-quality training data. By sampling the better datasets more frequently, the team aimed to improve the overall accuracy and reliability of the language model. However, this approach has drawbacks, such as making the model less effective at handling the types of language or text that were sampled less frequently.
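To make "sampled nearly 3 times" concrete: the number of passes over a source is roughly its share of the training mix, times the total number of tokens drawn during training, divided by the size of that source. The sketch below illustrates the idea with deliberately made-up toy numbers (a 1,000-token training budget and two hypothetical sources named `small_curated` and `large_web`). These are not OpenAI's actual figures, and the function is an illustration rather than their method.

```python
def effective_epochs(mix_weight, tokens_drawn, source_tokens):
    """Approximate number of full passes over one source during training."""
    return mix_weight * tokens_drawn / source_tokens


# Toy values, chosen only to illustrate the effect (hypothetical).
TOKENS_DRAWN = 1_000  # total tokens consumed during training

sources = {
    # A small source given a large share of the mix is repeated.
    "small_curated": {"weight": 0.3, "tokens": 100},
    # A huge source is never fully read, even with a larger share.
    "large_web": {"weight": 0.7, "tokens": 2_000},
}

epochs = {
    name: effective_epochs(s["weight"], TOKENS_DRAWN, s["tokens"])
    for name, s in sources.items()
}

for name, passes in epochs.items():
    print(f"{name}: {passes:.2f} passes")
```

With these toy values the small curated source is read about three times, while only about a third of the large web source is ever seen. That is the same pattern described above: a small, trusted dataset gets repeated, and a huge, noisier one is only partly used.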
Embedding ChatGPT in Bing Search
Despite ChatGPT's obvious problems with factual accuracy, Microsoft has already embedded it into Bing search for a limited audience. I am concerned about the potential consequences of this decision. Given the unreliable and even "creepy" results this bot has produced to date, I worry that people will treat Bing's chat answers as equal to the factual search results that they are accustomed to receiving. Since search engines are used by millions of people every day to find information and make important decisions, answers that are biased, inaccurate, or completely fabricated can have serious repercussions. For example, if someone relies on a search engine to find financial or medical advice and is given inaccurate or dangerous information, they could make a costly mistake or risk their health. I'll write more about this in an upcoming blog post.
Conclusion - I'm Still Excited about this Tech!
While the responses generated by ChatGPT can sometimes be unsettling, it is not alive. Its behavior is tied to the training data that underpins it. Although the dataset used to train it can lead to biased and inaccurate results, it has also enabled one of the most impactful tools ever created. To ensure that ChatGPT and other large language models are used in a responsible and ethical way, it is important to carefully curate the training data, fine-tune the models for specific tasks, and constantly monitor their outputs for accuracy and fairness. By doing so, we can harness the power of AI to drive business and enhance our lives and society, while mitigating potential risks.