AI in Contract Management: How to evaluate and trust answers provided by ChatGPT
With the arrival of ChatGPT, we have easy access to a powerful tool that can quickly provide us with answers in a practical, distilled format, saving us the time we would typically spend searching, browsing and comparing different information sources.
This is a great convenience, but as we hand the controls over to automation, we still need to stay vigilant about the accuracy of the information it provides. We have limited insight into how the tool vets its information sources and how it evaluates, interprets and compares their content, so we typically have few means of judging the trustworthiness of the end result.
In other words, tools such as ChatGPT pose a real challenge to online information literacy. In addition, generative models have a certain propensity to ‘hallucinate’, i.e., to generate plausible-sounding sentences and invent information rather than sticking to instructions or facts.
In this blog post, we explore these challenges and possible tactics to mitigate the issues. If you’re interested in using generative AI with your contracts, check out Zefort’s ChatGPT integration.
Why generative language models fail to be truthful
In a recent paper by Microsoft Research studying GPT-4’s fascinating and varied capabilities, the authors point out a number of areas where the model is still lacking, including its ability to judge “when it should be confident and when it is just guessing”. It can make up facts both when asked about information that should be in its training data and when asked to adhere to information provided in its prompt text. This seems to be a particular risk when it is asked questions for which little or no relevant information is available. Quite simply, it seems to be primed to please the user rather than to say “I don’t know”.
Fundamentally, the way generative language models are trained, by elaborately estimating the probabilities of sequences of words, makes them primarily good at reproducing text and information from their training data, with a certain level of generalization. The first phase of training, called pre-training, uses an abundant, heterogeneous mix of text, from which the model learns to produce fluent, coherent language and to pick up concepts about the world. At this point, however, there is no direct way to control the model: it simply produces what it estimates to be the most likely continuation of any given piece of text.
To teach a language model to follow the instructions in a prompt and actually work as an assistant, it is fine-tuned on dialogue data sets and manually ranked responses. Solving this problem has been key to the success of ChatGPT. Nevertheless, since this training data is more laborious to produce and relatively scarce (although it can be collected from users of the service), we should expect the model to fail more often in situations that require it to adhere to the instructions and facts in the prompt.
As the model is prompted, it generates a response word by word. Given the sequence of previous words, at each step it tries to fill in the most fitting, i.e., most probable, next word or number. If no really good alternative exists, it still tends to produce a best guess. Failing to recognize this can lead to over-generalization or outright invention of information. This is currently a challenge for ChatGPT, and a shortcoming of its fine-tuning and guardrails. It can also easily forget some of the instructions it was given and, as a consequence, start “freestyling” when it comes to facts. Finally, it is good to remember that language models are inherently not very good at common-sense reasoning (nor at arithmetic), as they only learn these skills indirectly through depictions of the real world in their training data.
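To make this mechanism concrete, here is a minimal sketch of greedy, token-by-token generation. It assumes the Hugging Face transformers library and uses the small GPT-2 model as a stand-in; the prompt is purely illustrative and not part of any Zefort workflow.

```python
# Minimal sketch of token-by-token (greedy) generation.
# Assumes the Hugging Face transformers library; GPT-2 is a small stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The term of this agreement is"              # illustrative prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(10):                                   # generate ten tokens
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]       # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    next_id = int(torch.argmax(probs))                # greedy: always pick a "best guess"
    print(repr(tokenizer.decode([next_id])), f"p={probs[next_id]:.2f}")
    input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=1)
```

Note that a token is emitted at every step, even when its probability is low: this is exactly the “best guess” behavior described above.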
How does this relate to contract management? A typical use case of ChatGPT in a contract management context is to have it answer questions about a particular agreement. Typically, the agreement text can be fed into the prompt as a whole, followed by a question. This is handy for quickly finding out anything from the term of the agreement to the obligations defined for the parties. A particular danger here is that the model will confidently make statements based not on the agreement text at hand, but on other, similar texts it has seen as part of its training data. This is one of the pitfalls we need to be on the lookout for.
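As a rough sketch of this use case, the example below packs an agreement text and a question into a single prompt. It assumes the OpenAI Python client (openai>=1.0); the file name, model name and instruction wording are placeholders rather than Zefort’s actual setup.

```python
# Sketch: question answering over a single agreement via a chat completion.
# Assumes the OpenAI Python client (openai>=1.0); names and wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

agreement_text = open("agreement.txt", encoding="utf-8").read()
question = "What is the term of the agreement?"

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "Answer only based on the agreement text provided by the user. "
                    "If the answer is not in the text, say that you do not know."},
        {"role": "user",
         "content": f"Agreement text:\n{agreement_text}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```

The system message nudges the model toward saying it does not know instead of guessing, but as discussed above, such instructions alone are no guarantee against hallucination.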
Improving model truthfulness and transparency
Considering the issues mentioned above, a healthy degree of skepticism is called for when interacting with ChatGPT or similar models. The source-criticism habits you would otherwise apply online still apply here, and in addition you need to keep in mind the risk of hallucinations, big and small: made-up information can come as complete sentences or just as slightly inaccurate numbers.
Luckily, ChatGPT leaves room for a number of tactics to modify its behavior and challenge its responses. First, the user can ask follow-up questions that request clarification or support for specific claims, for instance by asking the model to explain its reasoning. This is referred to as chain-of-thought prompting, and it can especially help the model solve tasks that involve more complex reasoning. Similarly, manually breaking up a more complex task and prompting the model step by step can help it avoid errors, for instance by first letting it find relevant supporting information and then asking it for the final interpretation. Sometimes repeating or slightly varying a prompt yields different responses, and if they are inconsistent, that can be a red flag.
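As a sketch of the step-by-step tactic, the example below first asks only for the relevant clauses and then, in a second turn, asks for an interpretation based on what was found. It reuses the hypothetical client and agreement_text from the previous sketch; the prompts themselves are illustrative.

```python
# Sketch: breaking a more complex question into two prompting steps.
# Reuses `client` and `agreement_text` from the previous sketch; prompts are illustrative.
messages = [
    {"role": "system",
     "content": "Answer only based on the agreement text provided by the user."},
    {"role": "user",
     "content": f"Agreement text:\n{agreement_text}\n\n"
                "Step 1: Quote the clauses that concern pricing or price changes."},
]
step1 = client.chat.completions.create(model="gpt-4", temperature=0, messages=messages)
messages.append({"role": "assistant", "content": step1.choices[0].message.content})

# Step 2: ask for the interpretation, grounded in the clauses found in step 1.
messages.append({"role": "user",
                 "content": "Step 2: Based only on the clauses you quoted, "
                            "are price increases restricted, and by how much?"})
step2 = client.chat.completions.create(model="gpt-4", temperature=0, messages=messages)
print(step2.choices[0].message.content)
```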
Often, though, simply asking it “Are you sure?” will cause it to revise, and possibly correct, an incorrect answer. Such interrogation allows the model to do a second pass over both the prompt and its previous response, effectively giving it a better overview as well as a chance to notice possible mistakes and calibrate its confidence. During the word-by-word generation process, the model may get stuck on an unfortunate path, as it has to make decisions sequentially with somewhat limited information. The process also involves a degree of randomness, which in itself can sometimes lead to suboptimal results that a second run might fix.
Example: ChatGPT revising an overly confident answer.
Q: What is historically the average wind speed in Turku?
A: [...] The average wind speed in Turku is typically around 4 to 5 meters per second (9 to 11 miles per hour). [...]
Q: Where did you get that number?
A: [...] the specific average wind speed I mentioned for Turku was based on typical wind conditions in coastal areas. [...]
Another approach is to challenge its use of sources, for instance by asking it for references supporting a claim. At Zefort, we are especially interested in generative models’ ability to answer questions about specific legal documents. A key part of our strategy is to use prompt engineering to make the model more careful when answering questions and to make it adhere to a provided source document. As an additional measure, we constrain the model to provide quotes from the document as support for its claims, which we match against the document so that the user can easily verify each claim.
Example: ChatGPT answering a question about a contract by providing a quote from the Finnish-language document and paraphrasing its meaning in English.
Q: Are price increases restricted under this agreement?
A: "Toimittajalla on oikeus vuosittain muuttaa palvelujen toistuvaismaksuja ja yksikköhintoja ainoastaan todellisten palvelutuotantokustannusten nousua vastaava määrä. Muutos voi olla korkeintaan viisi (5) prosenttia." Therefore, price increases are allowed, but they can only be in line with the actual cost increases and cannot exceed 5%.
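The quote-matching step mentioned above can be illustrated with a small sketch: check that the quoted passage really occurs in the source document before trusting the paraphrase. This is only a simplified illustration of the idea, not Zefort’s implementation; the normalization is deliberately naive and the agreement.txt file name is hypothetical.

```python
# Sketch: verifying that a quoted passage occurs verbatim in the source document.
# A simplified illustration of the idea; real matching would need to be more robust.
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_found(quote: str, document: str) -> bool:
    """Return True if the normalized quote appears in the normalized document."""
    return normalize(quote) in normalize(document)

document_text = open("agreement.txt", encoding="utf-8").read()  # hypothetical source file
quote = ("Toimittajalla on oikeus vuosittain muuttaa palvelujen toistuvaismaksuja ja "
         "yksikköhintoja ainoastaan todellisten palvelutuotantokustannusten nousua "
         "vastaava määrä. Muutos voi olla korkeintaan viisi (5) prosenttia.")

if quote_found(quote, document_text):
    print("Quote verified: it appears verbatim in the document.")
else:
    print("Warning: quote not found; the claim needs manual checking.")
```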
Give it a try yourself: Zefort’s ChatGPT integration is now available.
These techniques can mitigate some of the problematic tendencies and provide more transparency, thereby supporting the user in evaluating the information. Generative models can be incredibly useful tools, but even the best ones are imperfect and should not be blindly trusted. This generally holds true for any kind of model, as the statistician George E.P. Box famously expressed it: “All models are wrong, but some are useful”.
Learning to make good use of these tools will dramatically increase the speed at which we can process information, extract knowledge and apply it, and it will let us perform tasks that previously may not have been feasible. We can think of the arrival of these assistants as another step following the introduction of the web, Google Search, Wikipedia, etc. Much as the trustworthiness of Wikipedia was debated in its earlier days, we should do our best to challenge the models and retain a healthy skepticism, as we explore all the practical use cases they enable.
Samuel Rönnqvist is the Machine Learning Lead at Zefort, where he heads AI research and development related to document understanding for contracts. He holds a PhD in natural language processing and has been active as a researcher in the field for more than a decade. He has worked on deep learning since its early days and pioneered its use in financial stability monitoring through text mining of news. In 2018, he started working on neural language generation at the University of Turku, Finland, and Goethe University Frankfurt, Germany, collaborating on controlled news generation with the STT News Agency through the Google News Initiative, and contributing to a project that, in collaboration with Springer Publishing, produced the first fully generated, published scientific book.