ChatGPT and privacy – what you need to know

Today, Artificial Intelligence (AI) tools and services are all the rage again, with the latest versions of ChatGPT grabbing the biggest headlines. From text and image generation to voice recognition, new AI services are popping up everywhere, and established services are adding AI-powered features to their products.

AI tools have quickly become extremely accessible in terms of their pricing, application areas and ease of use. So, you might be tempted to give AI a try – feed your contract to ChatGPT and ask it to rewrite or analyse it, for example.

However, take a minute before you start – there are some things you should understand about how privacy and data security work with AI tools.

What are LLMs – and why are they everywhere right now?

Behind many of today’s advanced AI services dealing with text and speech is something commonly referred to as a Large Language Model (LLM).

LLMs are large neural networks trained on vast amounts of text in order to learn to represent the meaning of a piece of text in a way that is useful for various AI applications. They offer a kind of base layer of text understanding on which applications can be built, rather than having to train separate models from scratch for each new use case. Google released one of the first of these pre-trained language models, BERT, in 2018. Since then, it has been incorporated in countless AI applications and has spurred intense development of ever larger and more capable language models.

When OpenAI unveiled ChatGPT at the end of 2022, it became clear that natural language processing (i.e., language-centric AI) had taken another giant leap forward, and the attention from the general public has been unprecedented.

This time, the focus is particularly on generative AI – meaning language models designed and trained to generate text – and on the user-friendly dialogue-based interface that ChatGPT provides.

ChatGPT uses your data to teach its AI model

As it is being trained, a generative AI takes in vast amounts of text, from sources such as Wikipedia, discussion forums, licensed repositories, and the internet at large. From this material, the model learns the structure of language and concepts of the world at a massive scale.

As these models have kept growing in size, so have their capacity to learn from data and their need for it, to the extent that gathering new data might become a bottleneck in their continued development. As a consequence, it is safe to assume that any good-quality data available might be fed to the model for training at some point.

In addition, an essential component of ChatGPT is its fine-tuning through human feedback, aimed at making it better at following the instructions in its prompt and providing relevant responses, beyond producing fluent and coherent language around a general topic. Such human feedback data is continuously being collected from everyone using the service.

Both types of training make it possible for an AI like ChatGPT to, for instance, read a piece of contract text in its prompt and answer questions about it, or offer to rewrite it.

However, what makes this problematic for the legal profession is that contracts are typically confidential documents that include business secrets as well as personal information that falls under privacy legislation such as the GDPR.

Takeaway: Essentially, you should expect that anything fed to ChatGPT or similar services will by default be used for training purposes and might compromise privacy, confidentiality and information security.

What could happen to your data when fed to a generative AI?

The way a generative language model is trained poses particular risks of sensitive data leaking out, not only to the ones controlling the model and surrounding infrastructure, but to anyone using it.

Fundamentally, a language model can be thought of as a very advanced statistical model that, given a sequence of words, predicts the next one. This simple task is what it is trained on as it is fed all the text data. By feeding the model a prompt, having it predict the next word, appending that word to the prompt and repeating the process, the model can generate long sequences of words.
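To make the predict-append-repeat loop concrete, here is a minimal sketch in Python. It is a toy, not how ChatGPT is implemented: it assumes a simple word-pair count table in place of a neural network, and the function names and the two-sentence training text are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    followers = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers

def generate(model, prompt, max_new_words=10):
    """Predict the most likely next word, append it, and repeat."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = model.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        next_word = candidates.most_common(1)[0][0]
        words.append(next_word)
    return " ".join(words)

# Hypothetical training text standing in for the model's training data.
corpus = (
    "the supplier shall deliver the goods . "
    "the supplier shall deliver the invoice ."
)
model = train_bigram_model(corpus)
print(generate(model, "the supplier", max_new_words=3))
# prints "the supplier shall deliver the"
```

Even this toy continues the prompt with wording lifted straight from its training text, which hints at why memorization of training data is a concern for much larger models.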

Typically, a generative model trained plainly in this fashion will produce what it estimates is the most likely continuation of a prompt, given the texts it has seen during training.

The model behind the first version of ChatGPT (GPT-3.5) was trained on 570GB of text (about 300 billion tokens, i.e. words or word pieces), but theoretically has the capacity to store almost double that amount of information due to its size.

While it certainly learns to generalize information, the AI would in principle have the capacity to memorize all the text it has seen during training. It is therefore conceivable that such a model could be tricked, or accidentally led, into reproducing parts of its training data more or less verbatim.

Moreover, ChatGPT has been fine-tuned to follow instructions in the prompt, which could be used to query more directly for specific names and information and steer the generation toward revealing what it knows. While the developers have put a lot of effort into building guardrails around the model in order to prevent unwanted or harmful behavior, such measures are never complete.

Regulators in Italy, for instance, temporarily banned access to the ChatGPT service over concerns regarding its handling of personal data and GDPR compliance. This highlights that companies interested in feeding customer-related text data to such a service might run into compliance issues themselves. It is also important to keep in mind the jurisdiction in which a particular service is hosted, as sending sensitive data to a cloud service outside of the EU, for instance, might present risks of its own.

How can I prevent my data from being used?

OpenAI recently announced a change to the privacy settings of ChatGPT, which allows users to opt out of having their data used for training and of having their chat histories saved for more than 30 days. It is also possible to export all data that is being stored for a user account.

This reflects the compliance issues mentioned above, and an emerging discussion around privacy in the context of generative language models, which is especially important for corporate users dealing with sensitive information. Microsoft is reportedly planning to offer access to the language model through private server instances as a premium service.

As a further step in this direction, one approach is to provide an “anonymization layer” for data, through automatic identification and masking of identifiable information. This is a feature actively being researched at Zefort, where our existing AI capabilities for document understanding can be leveraged toward this end.

For example, any information that might be used to identify persons or companies can be replaced by placeholders, reviewed by the user, and made available for export to any service. If such text is fed to a model like ChatGPT, which is then prompted to rewrite or analyze some passage, the model is capable of retaining the placeholders. This could be used to draft new clauses, which can be imported back into Zefort, where the placeholders are automatically replaced with the original information.
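The round trip of masking, external rewriting and restoring could be sketched roughly as follows. This is an illustrative Python example, not Zefort’s implementation: the function names and placeholder format are hypothetical, and the sensitive names are supplied by hand here, whereas the approach described above would identify them automatically.

```python
def mask_entities(text, entities):
    """Replace each listed sensitive string with a numbered placeholder.

    `entities` maps a label such as "COMPANY" to the strings to hide.
    Returns the masked text and the mapping needed to undo the masking.
    """
    mapping = {}
    masked = text
    for label, values in entities.items():
        for index, value in enumerate(values, start=1):
            placeholder = f"[{label}_{index}]"
            mapping[placeholder] = value
            masked = masked.replace(value, placeholder)
    return masked, mapping

def unmask(text, mapping):
    """Put the original strings back in place of their placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

# Hypothetical clause and entity list, for illustration only.
clause = "Acme Oy shall pay Jane Doe within 30 days."
entities = {"COMPANY": ["Acme Oy"], "PERSON": ["Jane Doe"]}

masked, mapping = mask_entities(clause, entities)
print(masked)  # prints "[COMPANY_1] shall pay [PERSON_1] within 30 days."

# Stand-in for the rewrite an external AI service would return.
rewritten = masked.replace("shall pay", "must pay")
print(unmask(rewritten, mapping))
# prints "Acme Oy must pay Jane Doe within 30 days."
```

Only the masked text would leave the organization; the mapping stays local. A plain string replacement like this breaks down when one name is contained in another or appears in varying forms, which is part of why the identification step needs real document understanding.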

Alternatives to ChatGPT?

If you want more transparency in your AI service, HuggingChat, a recently released open alternative to ChatGPT, might also be worth considering.

The service is hosted by Hugging Face, the company behind one of the industry-leading open-source software suites for deep learning and natural language processing. While the service offers the same kind of privacy settings as ChatGPT, it is based on an open-source model developed by the OpenAssistant project, which is backed by the non-profit organization LAION.

The purpose of this effort is to provide an open alternative to OpenAI, which despite its name has gradually turned less and less open with every new version of the GPT model. While the model is considerably smaller than OpenAI’s counterpart, it is surprisingly capable, and can likely provide real utility for many practical use cases.

Note that the HuggingChat service is currently hosted on servers in the US.

Open, community-driven efforts like this outline a very interesting direction that is rapidly evolving. While simply hosting a model the size of GPT-3.5 requires specialized hardware costing six figures due to its memory consumption, the smaller language models being developed will be manageable for smaller organizations to host as well. They will probably not be as widely capable as the latest versions that OpenAI, Google, Meta, etc. will be putting out there, but with the appropriate fine-tuning, they might be fit for more specific purposes in selected domains.
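A rough back-of-the-envelope calculation shows why memory is the limiting factor. The sketch below rests on assumptions that are not from this article: the 175 billion parameters published for GPT-3 (used as a stand-in, since no figure for GPT-3.5 is given here), 16-bit weights, accelerators with 80 GB of memory each, and a hypothetical 7-billion-parameter model as the smaller example. It counts the weights only and ignores the extra memory needed while the model is running.

```python
import math

def model_memory_gb(num_parameters, bytes_per_parameter=2):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

def devices_needed(memory_gb, device_memory_gb=80):
    """Number of accelerators needed to fit the weights (assumed 80 GB each)."""
    return math.ceil(memory_gb / device_memory_gb)

# Assumption: 175 billion parameters, the published size of GPT-3.
large_gb = model_memory_gb(175e9)
# Assumption: a hypothetical smaller open model with 7 billion parameters.
small_gb = model_memory_gb(7e9)

print(large_gb, devices_needed(large_gb))  # prints "350.0 5"
print(small_gb, devices_needed(small_gb))  # prints "14.0 1"
```

Under these assumptions, the large model needs several high-end accelerators working together just to be loaded, while the smaller one fits on a single device.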

At the end of the day, everyone needs to decide what level of privacy protection is appropriate for them, and whether the potential benefits outweigh the risks. With these services at our fingertips, it seems we could get all the answers we want.

Once the privacy concerns are settled and appropriate setups are found, the real challenge then becomes asking the right questions. What questions would you want to ask about your contracts?

About the author:

Samuel Rönnqvist is the Machine Learning Lead at Zefort, where he heads AI research and development related to document understanding for contracts. He holds a PhD in natural language processing and has been active as a researcher in the field for more than a decade. He has worked on deep learning since an early stage and pioneered its use in financial stability monitoring, through text mining of news. In 2018, he started working on neural language generation at the University of Turku, Finland, and Goethe University Frankfurt, Germany, collaborating on controlled news generation with the STT News Agency through the Google News Initiative, as well as working on a project that produced the first fully generated, published scientific book in collaboration with Springer Publishing.


