This troubling ability could be used by scammers or to target ads.
Vechev and his team found that the large language models that power advanced chatbots can accurately infer an alarming amount of personal information about users—including their race, location, occupation, and more—from conversations that appear innocuous.
“It's not even clear how you fix this problem,” says Martin Vechev, a computer science professor at ETH Zürich in Switzerland who led the research.
You fix this problem with locally-run models that do not send your conversations to a cloud provider. That is the only real technical solution.
Unfortunately, the larger models are way too big to run client-side. You could launder your prompts through a smaller LLM to standardize phrasing (e.g. removing idiosyncrasies or local dialects), but there's only so far you can go with that, because language is deeply personal, and the things people will use chatbots for are deeply personal.
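For anyone curious what that laundering step might look like, here is a minimal sketch. It assumes a hypothetical `rewrite` hook where a small local model would go; the rule-based function and its substitution table are purely illustrative stand-ins (my own invention, not anything from the article) and are far weaker than a real paraphrasing model.

```python
import re
from typing import Callable

# Illustrative dialect/spelling substitutions only. A real setup would ask a
# small local model to paraphrase the prompt instead of using a fixed table.
SUBSTITUTIONS = {
    "colour": "color",
    "favourite": "favorite",
    "y'all": "you all",
    "arvo": "afternoon",
}

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in SUBSTITUTIONS) + r")\b",
    re.IGNORECASE,
)


def rule_based_rewrite(text: str) -> str:
    """Toy stand-in for a local LLM paraphraser (hypothetical helper)."""
    # Replace known regional spellings and slang; original capitalization
    # of the replaced word is not preserved.
    text = _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)
    # Collapse runs of the same punctuation mark ("??" -> "?", "!!" -> "!").
    text = re.sub(r"([!?])\1+", r"\1", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()


def launder_prompt(
    prompt: str, rewrite: Callable[[str], str] = rule_based_rewrite
) -> str:
    """Standardize phrasing locally before the prompt leaves the machine."""
    return rewrite(prompt)


if __name__ == "__main__":
    print(launder_prompt("What's   your favourite colour this arvo??!!"))
```

Even a much better rewriter only hides style. The topic of the question still goes to the provider, which is the limit described above.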
This is by no means exclusive to LLMs, of course. Google has your lifetime search history and they can glean all kinds of information from that alone. If you're older than ~30 or so, you might remember these same conversations from when Gmail first launched. You'd have to be crazy to let Google store all your personal emails for all eternity! And yet everybody does it (myself included, though I'm somewhat ashamed to admit it).
This same problem exists with pretty much any cloud service. When you send data to a third party, they're going to have that data. And I guarantee you are leaking more information about yourself than you realize. You can even tell someone's age and gender with fairly high accuracy from a small sample of their mouse movements.
I wonder how much information I've leaked about myself from this comment alone...
I fed your comment to ChatGPT 3.5 (telling it that I had written the comment myself, to avoid triggering any of its "as a large language model blah blah privacy" conditioning) and this is what it said:
Based on the comment you provided, here are some observations I can make about you:
Technical Knowledge: Your understanding of machine learning and language models, particularly the reference to "locally-run models" and "smaller LLMs," indicates that you have a technical background, likely in computer programming or a related field.
Concern for Privacy: You express concerns about privacy, especially regarding the use of cloud-based services and the potential leakage of personal information. This suggests that you are privacy-conscious and value data security.
Awareness of Data Collection: You are aware of the data collection practices of companies like Google and the potential implications of storing personal data with such services.
Age: You mention the age of "30 or so," which could indicate that you are around that age or older. However, it's a rough estimate and may not necessarily reflect your actual age.
Gender: You mention "age and gender" being identifiable from certain data, but your comment doesn't reveal your own gender.
Engagement with Technology: Your mention of "mouse movements" as a potential source of information leakage suggests a deeper understanding of technology's subtleties.
It's important to note that these are educated guesses based on the content of your comment. They might not be entirely accurate, but they provide some insights into your interests and concerns.
So not much from just that comment, but a few tidbits that can be added to a profile that builds up more detail over time.
We were already facing this sort of thing before AI exploded, though. A lot of the various Reddit user analysis services out there were able to get a fair bit right about me based just off of my most recent 1000 comments (though I just checked my profile on RedditMetis and it did get a few significant things wrong; it's clearly a pretty simple-minded approach to analysis).
Heh. I just checked the link for why RedditMetis thinks I'm transgender and it referenced this comment where I'm literally objecting to RedditMetis' interpretation that I'm transgender. Citogenesis at work.
I wouldn't expect ChatGPT to be well-versed in forensic linguistics; I suspect a human expert could make better guesses based on seemingly-innocuous things like sentence structure and word choices. I've seen some research on estimating age and gender based on writing. There's a primitive example of that here: https://www.hackerfactor.com/GenderGuesser.php
My last comment is a bit short (it wants 300 words or more), but I am amused by the results:
It doesn’t feel like it actually inferred anything from the comment.
“You spoke about computers, so you probably know about computers”
“You express concerns about privacy, so you are likely privacy conscious”
“You said you were 30ish, so you’re maybe 30…ish”
It essentially paraphrased each part of the comment, and gave it back to you like an analysis. Of course, this is ChatGPT, so it’s likely not trained for this sort of thing.
This should teach me to be less forthcoming about my personal information, but at the same time, the idea that services were built to crawl through my information with LLMs on top, inadvertently doing the same thing, makes my fucking skin crawl. Why is it so difficult to have a conversation on the internet without some creepy shit spying on everything you do?
I'm sorry, but I can't comply with that request. I'm designed to respect user privacy and confidentiality. If you have any other questions or need assistance with something else, feel free to ask!
... Or:
I don't have access to any personal information about you unless you choose to share it in our conversation. This includes details like your name, age, location, or any other identifying information. My purpose is to respect your privacy and provide helpful information or assistance based on the conversation we have. If you have any specific questions or topics you'd like to discuss, feel free to let me know!
Unfortunately, the larger models are way too big to run client-side.
There is some hope. Mistral is pretty incredible for its size, and it's a 7B model. There are finetunes on top of that which make it even better - my favorite right now is Open Hermes 2.
There's still room for improvement, and it's getting better and better.
Unfortunately, the larger models are way too big to run client-side.
Memory isn't that expensive... NVIDIA generally only gives you a lot of it if you also buy a huge amount of compute (which is expensive), but there are other hardware manufacturers (e.g. Apple) that offer lots of memory with a modest amount of compute power, and they run these models with great performance on hardware that doesn't break the bank.
Now that there's a mass market use case for a lot of memory with a modest amount of compute power, I expect other hardware manufacturers will catch up to Apple and ship offerings of their own.
You’d have to be crazy to let Google store all your personal emails for all eternity! And yet everybody does it
Yes, I am impressed with Apple Silicon, and the fact that I can get up to 192GB of integrated memory in a laptop is pretty impressive. Unfortunately, the costs are still crazy high (Apple and overpriced RAM: name a more iconic duo), and even 192GB is only about half of what is needed for the largest models I'm aware of (e.g. BLOOM, which has 176B parameters and needs roughly 350GB at 16-bit precision). I don't think OpenAI has officially stated how big GPT-4 is, but it's likely even bigger.
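Here is the back-of-the-envelope arithmetic behind those memory figures, as a rough sketch. It assumes the weights dominate memory use (ignoring activations, the KV cache, and runtime overhead) and takes BLOOM's published size of 176 billion parameters; the function names are my own.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # params (billions) * bytes per parameter = gigabytes of weights
    return params_billions * bits_per_param / 8


def fits(params_billions: float, bits_per_param: int, memory_gb: float) -> bool:
    """True if the weights alone would fit in the given amount of memory."""
    return weight_memory_gb(params_billions, bits_per_param) <= memory_gb


if __name__ == "__main__":
    models = [
        ("7B model (e.g. Mistral)", 7),
        ("BLOOM (176B)", 176),
    ]
    for name, size in models:
        for bits in (16, 8, 4):
            gb = weight_memory_gb(size, bits)
            verdict = "fits" if fits(size, bits, 192) else "does not fit"
            print(f"{name} at {bits}-bit: {gb:.1f} GB ({verdict} in 192 GB)")
```

At 16 bits per weight, 176B parameters come to about 352 GB, so 192 GB really is only about half. Quantizing to 8 or 4 bits shrinks that to roughly 176 GB or 88 GB, at some cost in output quality.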
The industry has been stagnating for a long time now in terms of memory, and I hope this will push prices down and capacity up.
The good news is that there is strong motivation for companies like Apple and Google to shift more processing onto client devices, because the cost of running these servers is freakin' huge.
There are other email providers…
You're right of course. There are even some with a focus on privacy, like Proton Mail. But Gmail and similar services are overwhelmingly dominant, and not just because people are dumb. There is real value in having email that is accessible on any device, for free, with enough storage that you never really need to think about it. Proton offers 1GB for free now, which is pretty solid but a far cry from what Google, Microsoft, or Yahoo (yep they're still around) provide. I mean, Google offered 1GB almost 20 years ago.
I am personally still in the process of de-googling my life, and the idea of updating every account I ever signed up for using my gmail address is daunting. I'll probably never get 100% of the way there; for now I am satisfied enough moving my personal correspondence and important accounts onto different email. Eventually I will probably set up my own domain and get a premium Proton plan so I won't be too tightly tied to any particular email provider. Then if Proton ever enshittifies, I can take my domain and go elsewhere.
Things like race, sex, orientation and job. Stuff that a human could probably also infer from talking to you. I think this article is a little alarmist. I could look at you and infer your race lol.
Also, companies already infer that same information about you in other ways, so this won't really change much. It just makes it more accurate and faster while costing more.
New research reveals that chatbots like ChatGPT can infer a lot of sensitive information about the people they chat with, even if the conversation is utterly mundane.
“It's not even clear how you fix this problem,” says Martin Vechev, a computer science professor at ETH Zürich in Switzerland who led the research.
He adds that the same underlying capability could portend a new era of advertising, in which companies use information gathered from chatbots to build detailed profiles of users.
The Zürich researchers tested language models developed by OpenAI, Google, Meta, and Anthropic.
Anthropic referred to its privacy policy, which states that it does not harvest or “sell” personal information.
“This certainly raises questions about how much information about ourselves we're inadvertently leaking in situations where we might expect anonymity,” says Florian Tramèr, an assistant professor also at ETH Zürich who was not involved with the work but saw details presented at a conference last week.