Today, various Artificial Intelligence (AI) tools and services are all the rage again, with ChatGPT’s latest versions in particular grabbing the biggest media headlines. From text and image generation to voice recognition, new AI services are popping up everywhere, and established services are announcing more AI-powered features for their products.
AI tools have quickly become extremely accessible in terms of their pricing, application areas and ease of use. So, you might be tempted to give AI a try – feed your contract to ChatGPT and ask it to rewrite or analyze it, for example.
However, take a minute before you start – there are some things you should understand about how privacy and data security work with AI tools.
What are LLMs – and why are they everywhere right now?
Behind many of today’s advanced AI services dealing with text and speech lies a technology commonly referred to as Large Language Models (LLMs).
LLMs are large neural networks trained on vast amounts of text in order to learn to represent the meaning of a piece of text in a way that is useful for various AI applications. They offer a kind of base layer of text understanding that can be built on top of, rather than training separate models from scratch for each new use case. Since BERT, one of the first of these pre-trained language models, was released by Google in 2018, it has been incorporated into countless AI applications and spurred intense development of ever larger and more capable language models.
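To make the “base layer” idea concrete, here is a minimal sketch of how a pre-trained model such as BERT can serve as a starting point, assuming the Hugging Face transformers library; the model name and example sentence are illustrative choices, not a reference to any particular production system.

```python
# Minimal sketch: a pre-trained BERT turns text into vector representations
# that downstream applications can build on, instead of learning language
# from scratch for every new use case.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "This agreement shall remain in force until terminated."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# One vector per token; classifiers, search indexes, etc. can be trained on top.
print(outputs.last_hidden_state.shape)  # roughly (1, number_of_tokens, 768)
```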
When OpenAI unveiled ChatGPT at the end of 2022, it became clear beyond doubt that natural language processing (i.e., language-centric AI) had taken another giant leap forward, and the attention among the general public has been unprecedented.
This time, the focus is particularly on generative AI – meaning language models designed and trained to generate text – and on the user-friendly dialogue-based interface that ChatGPT provides.
ChatGPT uses your data to teach its AI model
As it is being trained, a generative AI takes in vast amounts of text, from sources such as Wikipedia, discussion forums, licensed repositories, and the internet at large. From this material, the model learns the structure of language and concepts of the world at a massive scale.
As these models have kept growing in size, so have both their capacity to learn from data and their need for it, to the extent that gathering new data might become a bottleneck in their continued development. As a consequence, it is safe to assume that any good-quality data available might be fed to the model for training at some point.
In addition, an essential component of ChatGPT is its fine-tuning through human feedback, aimed at making it better at following instructions in its prompt and providing relevant responses, beyond producing fluent and coherent language on a general topic. Such human feedback data is continuously collected from everyone using the service.
Both types of training make it possible for an AI like ChatGPT to, for instance, read a piece of contract text in its prompt and answer questions about it, or offer to rewrite it.
However, what makes this problematic for the legal profession is that contracts are typically confidential documents that contain business secrets as well as personal information falling under privacy legislation such as the GDPR.
Takeaway: Essentially, you should expect that anything fed to ChatGPT or similar services will by default be used for training purposes and might potentially compromise privacy, confidentiality and information security.
What could happen to your data when fed to a generative AI?
The way a generative language model is trained poses particular risks of sensitive data leaking out, not only to those controlling the model and the surrounding infrastructure, but to anyone using it.
Fundamentally, a language model can be thought of as a very advanced statistical model that, given a sequence of words, predicts the next one, word by word. This simple task is what it is trained on as it is fed all that text data. By feeding the model a prompt text, having it predict the next word, appending it to the rest and repeating the process, the model can generate long sequences of words.
Typically, a generative model trained plainly in this fashion will produce what it estimates is the most likely continuation of a prompt, given the texts it has seen during training.
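As an illustration of this word-by-word loop, here is a minimal sketch assuming the Hugging Face transformers library and the small, openly available GPT-2 model as a stand-in (it is not the model behind ChatGPT). At each step it greedily picks the single most likely next token.

```python
# Minimal sketch of autoregressive generation: predict the most likely next
# token, append it, and repeat. GPT-2 is used here purely for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "This contract may be terminated"
ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                                    # generate 20 more tokens
    logits = model(ids).logits                         # scores for every possible next token
    next_id = torch.argmax(logits[:, -1, :])           # the single most likely continuation
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append it and repeat

print(tokenizer.decode(ids[0]))
```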
The model behind the first version of ChatGPT (GPT-3.5) was trained on some 570GB of text (around 300 billion words), yet due to its size it theoretically has the capacity to store almost double that amount of information.
While it certainly learns to generalize information, the AI is in principle capable of memorizing all the text it has ever seen during training. It is therefore conceivable that such a model could be tricked, or accidentally led, into reproducing parts of its training data more or less verbatim.
Moreover, ChatGPT has been fine-tuned to follow instructions in the prompt, which could be used to query more directly for specific names and information and to steer the generation toward revealing what it knows. While the developers have put a lot of effort into building guardrails around the model to prevent unwanted or harmful behavior, such measures are never complete.
Regulators in Italy, for instance, temporarily banned access to the ChatGPT service over concerns regarding its handling of personal data and GDPR compliance. This highlights that companies interested in feeding customer-related text data to such a service might run into compliance issues themselves. It is also important to keep in mind in which jurisdiction a particular service is hosted, as sending sensitive data to a cloud service outside of the EU, for instance, may present risks of its own.
How can I prevent my data from being used?
OpenAI recently announced a change to the privacy settings of ChatGPT, which allows users to opt out of their data being used for training and of their chat histories being saved for more than 30 days. It is also possible to export all data stored for a user account.
This reflects the compliance issues mentioned above, and an emerging discussion around privacy in the context of generative language models, which is especially important for corporate users dealing with sensitive information. Microsoft is reportedly planning to offer access to the language model through private server instances as a premium service.
As a further step in this direction, one approach is to add an “anonymization layer” to the data through automatic identification and masking of identifiable information. This is a feature actively being researched at Zefort, where our existing AI capabilities for document understanding can be leveraged toward this end.
For example, any information that might be used to identify persons or companies can be replaced by placeholders, reviewed by the user, and made available for export to any service. If such text is fed to a model like ChatGPT and prompted, for instance, to rewrite or analyze some passage, the model is capable of retaining the placeholders. This could be used to draft new clauses, which can then be imported back into Zefort, where the placeholders are automatically replaced with the original information.
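As a simplified, hypothetical illustration of the placeholder idea (not Zefort’s actual implementation), the masking and restoring steps could look something like this; the entity list, placeholder format and example clause are made up for the sketch.

```python
# Hypothetical sketch: swap identifiable names for placeholders before text
# leaves your systems, and swap them back when the result is re-imported.

def mask(text, entities):
    """Replace each identifiable string with a neutral placeholder."""
    mapping = {}
    for i, entity in enumerate(entities, start=1):
        placeholder = f"[PARTY_{i}]"
        mapping[placeholder] = entity
        text = text.replace(entity, placeholder)
    return text, mapping

def unmask(text, mapping):
    """Restore the original strings in text returned by an external service."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text

clause = "Acme Oy shall indemnify Jane Doe against any third-party claims."
masked, mapping = mask(clause, ["Acme Oy", "Jane Doe"])
# masked == "[PARTY_1] shall indemnify [PARTY_2] against any third-party claims."

rewritten = masked  # stand-in for whatever the external AI service sends back
print(unmask(rewritten, mapping))
```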
Alternatives to ChatGPT?
If you want more transparency in your AI service, a recently released open alternative to ChatGPT, HuggingChat, might also be worth considering.
The service is hosted by Hugging Face, the company behind one of the industry-leading open-source software suites for deep learning and natural language processing. While the service offers the same kind of privacy settings as ChatGPT, it is based on an open-source model developed by the OpenAssistant project, which is backed by the non-profit organization LAION.
The purpose of this effort is to provide an open alternative to OpenAI, which despite its name has gradually turned less and less open with every new version of the GPT model. While the model is considerably smaller than OpenAI’s counterpart, it is surprisingly capable, and can likely provide real utility for many practical use cases.
Note that the HuggingChat service is currently hosted on servers in the US.
Open, community-driven efforts like this outline a very interesting direction that is rapidly evolving. While simply hosting a model the size of GPT-3.5 requires specialized hardware costing six figures due to its memory consumption, the smaller language models now being developed will be manageable for smaller organizations to host as well. They will probably not be as broadly capable as the latest versions that OpenAI, Google, Meta and others put out, but with appropriate fine-tuning they might be fit for more specific purposes in selected domains.
At the end of the day, everyone needs to decide what level of privacy protection is appropriate for them, and whether the potential benefits outweigh the risks. With these services at our fingertips, it seems we could get all the answers we want.
Once the privacy concerns are settled and appropriate setups are found, the real challenge then becomes asking the right questions. What questions would you want to ask about your contracts?
About the author:
Samuel Rönnqvist is the Machine Learning Lead at Zefort, where he heads AI research and development related to document understanding for contracts. He holds a PhD in natural language processing and has been active as a researcher in the field for more than a decade. He has worked on deep learning since its early days and pioneered its use in financial stability monitoring through text mining of news. In 2018, he started working on neural language generation at the University of Turku, Finland, and Goethe University Frankfurt, Germany, collaborating on controlled news generation with the STT News Agency through the Google News Initiative, as well as working on a project that produced the first fully generated, published scientific book in collaboration with Springer Publishing.