AI in Contract Management: How to evaluate and trust answers provided by ChatGPT

With the arrival of ChatGPT, we have easy access to a powerful tool that can quickly provide us answers in a practical, distilled format, saving us the time we typically would spend searching, browsing and comparing different information sources.

While this is a great convenience, we still need to stay vigilant about the accuracy of the information as we hand over the controls to automation. However, we have limited insight into how the information sources are vetted and how content is evaluated, interpreted and compared, so we typically have few means of judging the trustworthiness of the end result.

In other words, tools such as ChatGPT pose a real challenge to online information literacy. In addition, generative models have a propensity to ‘hallucinate’, i.e., to generate plausible-sounding sentences and invent information rather than stick to instructions or facts.

In this blog post, we explore these challenges and possible tactics to mitigate the issues. If you’re interested in using generative AI with your contracts, check out Zefort’s ChatGPT integration.

Why generative language models fail to be truthful

In a recent paper by Microsoft Research studying GPT-4’s fascinating and varied capabilities, the authors point out a number of areas where the model is still lacking, including being able to judge “when it should be confident and when it is just guessing”. It can make up facts both when asked about information that should be in its training data and when asked to adhere to information provided in its prompt text. This seems to be a risk particularly when it is asked questions for which little or no relevant information is available. Quite simply, it seems to be primed to please the user, rather than saying “I don’t know”.

Fundamentally, generative language models are trained by elaborately estimating probabilities of sequences of words, which makes them primarily good at reproducing text and information from their training data, with a certain level of generalization. The first phase of training, called pre-training, uses an abundant, heterogeneous mix of text, from which the model learns to produce fluent and coherent language and learns about concepts of the world. At this point, however, the language model offers no direct means of control: it only produces what it estimates is the most likely continuation of any given piece of text.

In order to teach a language model to follow instructions in a prompt text and actually work as an assistant, it is fine-tuned using dialogue data sets and manual ranking of responses. Solving this problem has been key to the success of ChatGPT. Nevertheless, since this training data is more laborious to produce and relatively scarce (although it can be collected from users of the service), we should expect the model to fail more often in situations that require it to adhere to instructions and facts in the prompt.

As the model is prompted, it generates a response word by word. Given a sequence of previous words, it tries at each step to fill in the most fitting, i.e., most probable, next word or number. If no really good alternative exists, it still tends to produce a best guess. When the model fails to recognize that it is guessing, the result can be over-generalization or outright invented information. This currently seems to be a challenge for ChatGPT, and a shortcoming in its fine-tuning and guardrails. It can also easily forget some of the instructions it was given and, for instance, start “freestyling” with facts as a consequence. Finally, it is good to remember that language models are inherently not very good at common-sense reasoning (nor at arithmetic), as they learn it only indirectly through depictions of the real world in their training data.
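To make the word-by-word process concrete, here is a minimal sketch in Python. It is a toy, not how ChatGPT is implemented: the lookup table NEXT_WORD_PROBS and all its probabilities are invented for illustration, and a real model conditions on the whole preceding text rather than only the last word. The point it shows is that picking the most probable next word always yields an answer, even when no candidate is clearly better than the others.

```python
# Toy next-word table. All words and probabilities are invented for illustration.
NEXT_WORD_PROBS = {
    "the": {"agreement": 0.6, "term": 0.4},
    "agreement": {"expires": 0.55, "renews": 0.45},
    "expires": {"in": 0.9, "on": 0.1},
    # No year is clearly better than the others: the model is effectively guessing.
    "in": {"2025": 0.26, "2024": 0.25, "2026": 0.25, "2027": 0.24},
}


def generate(start, max_words=4):
    """Greedily append the most probable next word, one step at a time."""
    words = [start]
    confidences = []
    for _ in range(max_words):
        candidates = NEXT_WORD_PROBS.get(words[-1])
        if not candidates:
            break  # the toy table has no continuation for this word
        best = max(candidates, key=candidates.get)
        words.append(best)
        confidences.append(candidates[best])
    return words, confidences


words, confidences = generate("the")
print(" ".join(words))
print(confidences)
```

In this toy run the final word is chosen with a probability of only 0.26, yet it appears in the output looking just as certain as the words before it.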

How does this relate to contract management? A typical use case of ChatGPT in a contract management context is to have it answer questions related to a particular agreement. Typically, the agreement text can be fed as a whole in the prompt, followed by a question. This can be handy for quickly finding out anything from what the term of the agreement is to what obligations are defined for the parties. A particular danger in this case is that the model will confidently make statements not based on the agreement text at hand, but on other similar texts it has seen as part of its training data. This is one of the pitfalls we need to be on the lookout for.
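The prompt layout described above, with the agreement text first and the question after it, could be sketched as follows. The function name build_prompt, the marker lines and the instruction wording are all assumptions made for this example, not a documented format; the instruction to say when the agreement is silent is one plausible way to discourage answers drawn from other, similar contracts.

```python
def build_prompt(agreement_text: str, question: str) -> str:
    """Place the full agreement text first, then the question (hypothetical layout)."""
    return (
        "Answer the question using only the agreement between the markers.\n"
        "If the agreement does not address the question, say so.\n\n"
        "=== AGREEMENT START ===\n"
        f"{agreement_text.strip()}\n"
        "=== AGREEMENT END ===\n\n"
        f"Question: {question.strip()}"
    )


prompt = build_prompt(
    "This agreement is valid for 24 months from the date of signing.",
    "What is the term of the agreement?",
)
print(prompt)
```

The resulting string would then be sent to the model as the user's message. Very long agreements may not fit within a model's input limit, which this sketch does not handle.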

Improving model truthfulness and transparency

Considering the issues mentioned above, a healthy degree of skepticism is called for when interacting with ChatGPT or similar models. The approaches you would otherwise take online still apply, plus you need to keep in mind the risk of hallucinations, big and small – made-up information can come as complete sentences or just slightly inaccurate numbers.

Luckily, ChatGPT leaves room for a number of tactics to modify its behavior and challenge its responses. First, the user can ask follow-up questions to clarify or support certain claims, for instance, by asking it to explain its reasoning. This is referred to as chain-of-thought prompting, and it can especially help the model solve tasks that involve more complex reasoning. Similarly, manually breaking up a complex task and prompting the model step by step might help it avoid errors, for instance, by first letting it find relevant supporting information and then asking it for the final interpretation. Repeating or slightly varying a prompt can also yield different responses, and if they are inconsistent, that might be a red flag.
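The consistency check from repeated prompts can be partly automated. The sketch below assumes the answers have already been collected as plain strings, and it compares only the numbers they contain, since slightly different figures are an easy inconsistency to miss when reading. The function names and the example answers are made up for illustration; disagreements expressed purely in words would need a different comparison.

```python
import re


def extract_numbers(answer: str) -> frozenset:
    """Collect the numbers mentioned in an answer, e.g. '30' or '4.5'."""
    return frozenset(re.findall(r"\d+(?:\.\d+)?", answer))


def numbers_agree(answers) -> bool:
    """Return True when every answer mentions exactly the same set of numbers."""
    return len({extract_numbers(answer) for answer in answers}) <= 1


# Invented answers to the same question, asked three times.
answers = [
    "The notice period is 30 days.",
    "Either party may terminate with 30 days' notice.",
    "The agreement requires 60 days of notice.",
]
print(numbers_agree(answers[:2]))
print(numbers_agree(answers))
```

Here the first two answers agree on the figure, while the third introduces a different one, which is the kind of red flag worth checking against the source.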

Often, though, simply asking it “Are you sure?” will cause it to revise, and possibly correct, an incorrect answer. Such interrogation allows the model to do a second pass over both the prompt and its previous response, effectively giving it a better overview, as well as a chance to realize possible mistakes and calibrate its confidence. During the word-by-word generation process, the model may get stuck on an unfortunate path, as it has to make decisions sequentially with somewhat limited information. The process also has a degree of randomness, which in itself can sometimes lead to suboptimal results that a second run might fix.

Example: ChatGPT revising an overly confident answer.

Q: What is historically the average wind speed in Turku?
A: [...] The average wind speed in Turku is typically around 4 to 5 meters per second (9 to 11 miles per hour). [...]
Q: Where did you get that number?
A: [...] the specific average wind speed I mentioned for Turku was based on typical wind conditions in coastal areas. [...]

Another approach is to challenge its use of sources, for instance, by asking it for references to a claim. At Zefort, we are especially interested in generative models’ abilities to answer questions about specific legal documents. A key part of our approach is to use prompt engineering to make the model more careful when answering questions and to make it adhere to a provided source document. As an additional measure, we constrain the model so that it provides quotes from the document as support for its claims, and we match these quotes against the document so the user can easily verify each claim.
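Matching quotes against the source document can be illustrated with a simple check. This is a minimal sketch, not Zefort's actual implementation: it assumes the quotes have already been extracted from the model's answer, and it uses exact matching after normalizing whitespace and letter case. The function names and the sample text are invented for the example.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not block a match."""
    return " ".join(text.split()).casefold()


def find_unsupported_quotes(document: str, quotes) -> list:
    """Return the quotes that do not appear verbatim in the document."""
    normalized_document = normalize(document)
    return [quote for quote in quotes if normalize(quote) not in normalized_document]


# Invented sample agreement text, with a line break inside the relevant sentence.
document = """The supplier may adjust prices once per year.
The adjustment may not exceed
five (5) percent."""

quotes = [
    "The adjustment may not exceed five (5) percent.",  # present in the document
    "The adjustment may not exceed ten (10) percent.",  # not in the document
]
print(find_unsupported_quotes(document, quotes))
```

Any quote returned by this check has no verbatim support in the document and deserves a closer look. Exact matching is deliberately strict, so a quote that the model has slightly reworded will also be flagged.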

Example: ChatGPT answering a question about a contract, by providing a quote from the Finnish-language document and paraphrasing its meaning in English.

Q: Are price increases restricted under this agreement?
A: "Toimittajalla on oikeus vuosittain muuttaa palvelujen toistuvaismaksuja ja yksikköhintoja ainoastaan todellisten palvelutuotantokustannusten nousua vastaava määrä. Muutos voi olla korkeintaan viisi (5) prosenttia." Therefore, price increases are allowed, but they can only be in line with the actual cost increases and cannot exceed 5%.

Give it a try yourself: Zefort’s ChatGPT integration is now available.

These techniques can mitigate some of the problematic tendencies and provide more transparency, thereby supporting the user in evaluating the information. Generative models can be incredibly useful tools, but even the best ones are imperfect and should not be blindly trusted. This generally holds true for any kind of model, as the statistician George E.P. Box famously expressed it: “All models are wrong, but some are useful”.

Learning to make good use of these tools will dramatically increase the speed at which we can process information, extract knowledge and apply it, and it will let us perform tasks that previously may not have been feasible. We can think of the arrival of these assistants as another step following the introduction of the web, Google Search, Wikipedia, etc. Much as the trustworthiness of Wikipedia was debated in its earlier days, we should do our best to challenge the models and retain a healthy skepticism, as we explore all the practical use cases they enable.


Samuel Rönnqvist is the Machine Learning Lead at Zefort, where he heads AI research and development related to document understanding for contracts. He holds a PhD in natural language processing and has been active as a researcher in the field for more than a decade. He began working on deep learning at an early stage and pioneered its use in financial stability monitoring through text mining of news. In 2018, he started working on neural language generation at the University of Turku, Finland, and Goethe University Frankfurt, Germany, collaborating on controlled news generation with the STT News Agency through the Google News Initiative, as well as working on a project that produced the first fully generated, published scientific book in collaboration with Springer Publishing.
