ChatGPT and the importance of data quality management
Willem Conradie, CTO of PBT Group
It has now been almost a year since OpenAI launched ChatGPT to the public. Since then, adoption has been exceptionally rapid. In February 2023, Reuters reported that ChatGPT had reached an estimated 100 million active users in January 2023, making it the fastest-growing consumer application in history at that point. To further illustrate the app's popularity, SimilarWeb estimated that the ChatGPT website received over 1.5 billion visits in September 2023, a 4.7% increase on the previous month.
This unprecedented adoption and heavy usage of ChatGPT has, as expected, raised major concerns across various industries. These concerns include fabricated information, biased information, question misinterpretation, inconsistent answers, lack of empathy, and security matters.
To address some of these concerns, a new approach to implementing Artificial Intelligence (AI) is gaining a lot of attention. This approach is referred to as Responsible AI, and it focuses on applying AI with fair, inclusive, secure, transparent, accountable, and ethical intent.
Addressing the concern around fabricated information
The concern of fabricated information presents itself when ChatGPT provides information that is incorrect or outdated. This risk can be lowered by applying good data quality management practices.
ChatGPT is not only available to the public; it can also be used in corporate environments to enhance many business processes. Some of these uses include handling customer service enquiries, writing emails, acting as a personal assistant, running keyword searches, and creating presentations.
For ChatGPT to be effective at the above tasks, it must provide accurate responses to user input. To be accurate, it must be trained on data that is relevant to the organisation, and that data must itself be accurate and up to date. This is where good data quality management practices become imperative.
To make this practical, let’s consider the following scenario: ChatGPT is used in a corporate environment to automatically service customer enquiries with the intent of enhancing the customer experience by making answers more relevant to each individual customer. If the data that ChatGPT is trained on is not of high quality, the response it provides to the customer may be factually incorrect.
This could be something as simple as getting the customer’s name wrong, or something more complex, like providing incorrect instructions for a “self-help” task on the company’s mobile application. When this happens, it will more than likely frustrate the customer and have a detrimental effect on the customer experience, which is the exact opposite of the intended outcome.
What data quality management entails
To practically manage the quality of data being used to train the ChatGPT model, the following data quality management practices should be considered.
- Data must be relevant to the intended use. When training a ChatGPT model, it is critically important to ensure the data being used as input is relevant to the business context in which the model must provide responses.
- Data must be timely. Training the model on data that is not up to date for the intended use may result in outdated responses from the model, leading to incorrect actions.
- Data must be complete. The dataset should be free of missing values, duplicates, and irrelevant values. Training the model on data that is incomplete for the intended use may result in incorrect responses from the model, again leading to incorrect actions.
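To make these three practices more concrete, the following is a minimal sketch of how candidate training records might be screened before use. It is an illustration only, not a description of how ChatGPT is actually trained: the record layout (`text`, `topic`, `updated_at`), the function name `filter_training_records`, and the thresholds are all assumptions made for this example.

```python
from datetime import date, timedelta

# Hypothetical record layout: each candidate training record is a dict with
# "text", "topic" and "updated_at" keys. These names are assumptions.
REQUIRED_FIELDS = ("text", "topic", "updated_at")


def filter_training_records(records, allowed_topics, max_age_days, today):
    """Return (kept, rejected); rejected pairs each record with a reason."""
    kept, rejected, seen_texts = [], [], set()
    cutoff = today - timedelta(days=max_age_days)
    for record in records:
        # Completeness: every required field must be present and non-empty.
        if any(not record.get(field) for field in REQUIRED_FIELDS):
            rejected.append((record, "incomplete"))
        # Completeness: reject exact duplicates of text already kept.
        elif record["text"] in seen_texts:
            rejected.append((record, "duplicate"))
        # Relevance: the topic must belong to the business context.
        elif record["topic"] not in allowed_topics:
            rejected.append((record, "irrelevant"))
        # Timeliness: the record must have been updated recently enough.
        elif record["updated_at"] < cutoff:
            rejected.append((record, "outdated"))
        else:
            seen_texts.add(record["text"])
            kept.append(record)
    return kept, rejected


# Made-up sample data for illustration.
sample = [
    {"text": "Reset your PIN in the app under Settings.",
     "topic": "mobile_app", "updated_at": date(2023, 9, 1)},
    {"text": "Reset your PIN in the app under Settings.",
     "topic": "mobile_app", "updated_at": date(2023, 9, 1)},
    {"text": "", "topic": "mobile_app", "updated_at": date(2023, 9, 1)},
    {"text": "Office party photos.",
     "topic": "social", "updated_at": date(2023, 9, 1)},
    {"text": "Old PIN reset steps.",
     "topic": "mobile_app", "updated_at": date(2021, 1, 1)},
]

kept, rejected = filter_training_records(
    sample, allowed_topics={"mobile_app"}, max_age_days=365,
    today=date(2023, 10, 1),
)
print(len(kept), [reason for _, reason in rejected])
# prints: 1 ['duplicate', 'incomplete', 'irrelevant', 'outdated']
```

In practice, relevance is rarely as simple as a topic label and would usually need business input to define. The value of a sketch like this is that every rejected record carries a reason, which makes data quality issues visible and measurable.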
Over and above this, it is important to continuously improve the model by incorporating user feedback, an approach commonly associated with reinforcement learning. In the case of ChatGPT and other Conversational AI models, the model can improve its responses when user feedback on the answers it provides is incorporated into model retraining cycles.
The above data quality management practices are not exhaustive, but they are a good practical starting point. They apply not only to ChatGPT but to Conversational AI in general, as well as to other uses of AI such as Generative AI.