She says the companies’ chatbots were trained on her book.
Comedian and author Sarah Silverman and authors Christopher Golden and Richard Kadrey are suing OpenAI and Meta, each in a US District Court, over dual claims of copyright infringement.
Interested to see how this plays out! Their argument that the only way an LLM could summarize their book is by ingesting the full copyrighted work seems a bit suspect, as it could've ingested plenty of reviews and summaries written by humans and combined that information.
I'm not confident that they'll be able to prove OpenAI or Meta infringed copyright, just as I'm not confident the companies will be able to prove that they didn't. I don't know if anyone really knows what these things are trained on.
We got to where we are now with fair use in search and online commentary because of a ton of lawsuits setting precedent, so it's not surprising we'll have to do the same with machine learning.
ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
I think this is where the crux of the case lies, since the article mentions these are only available illegally through torrents.
This is starting to touch on the root of why they keep calling this "AI", "training", etc. They aren't doing this strictly for marketing; they are attempting to skew public opinion. These companies know intimately how to do that.
They're going to argue that if torrents are legal for educational purposes (i.e., the loophole that all trackers use), and they're just "training" an "AI", then they're just engaging in education. And an ignorant public might buy it.
These kinds of cases will be viewed as landmark cases in the future and honestly I don't have huge hopes. The history of these companies is engineer first, excuse the lack of ethics later. Or the philosophy of "it's easier to apologize than ask".
Even if they did train the model on the entire text of the book, that's still not necessarily copyright violation. I would think not, since the resulting model doesn't actually have a copy of the book embedded within it.
It’s difficult to tell to what extent books are encoded into the model. The data might be there in some abstract form or another.
During training it is kind of instructed to plagiarize the text it’s given. The instruction is basically “guess the next word of this unfinished excerpt”. It probably won’t memorize all input it’s given, but there’s a nonzero chance it manages to memorize some significant excerpts.
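To make the "guess the next word" point concrete, here is a toy sketch in Python. It is not how OpenAI or Meta train anything: it assumes a plain lookup table over two-word contexts instead of a neural network, and the names `train_next_word` and `generate` and the sample sentence are made up for illustration. It only shows that a model fit to next-word prediction can end up reproducing a training excerpt verbatim when prompted with its opening words.

```python
from collections import Counter, defaultdict

def train_next_word(text, context_size=2):
    """Count which word follows each run of `context_size` words."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        table[context][words[i + context_size]] += 1
    return table

def generate(table, prompt, max_words=20):
    """Greedily extend the prompt with the most frequent next word."""
    out = prompt.split()
    context_size = len(next(iter(table)))
    for _ in range(max_words):
        context = tuple(out[-context_size:])
        if context not in table:
            break  # the toy model has never seen this context
        out.append(table[context].most_common(1)[0][0])
    return " ".join(out)

# Hypothetical stand-in for a copyrighted excerpt.
excerpt = "the quick brown fox jumps over the lazy dog near the river bank"
model = train_next_word(excerpt)
completion = generate(model, "the quick")
print(completion == excerpt)
```

With a single short excerpt every context has exactly one continuation, so the toy model memorizes all of it. A real model trained on a huge corpus is far less likely to do that for any given passage, which is why the amount of memorization is an open question rather than a given.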
But the server used to calculate the model would have a copy of it. If training an AI model is not fair use, then the mere act of loading a book you don't have a license for into the server would be copyright infringement. Textbook case: it's an unauthorized digital copy. It's all very untested legal ground, and it seems like lots of people want to be the first to test it. Not everyone has a great case, but if the courts interpret things a certain way there are going to be lots of payouts, so maybe it's best to get in line early?
It may be that no one currently knows exactly what these things are trained on, but it could be determined. If you know the methodology you can figure out what data is being used. The companies involved are going to resist letting anyone find out, but I'm hoping a court case will break that black box open.
One of the many problems with this form of AI is the degree to which we don't know where it's getting its information from. Without that, there is no way to determine the reliability of the results. They can sound perfectly reasonable and be entirely untrue.
It's very rare for me to want Facebook to win a lawsuit. It's just as rare for me to want to see Sarah Silverman not succeed. But in this case, I think the Internet needs to see Facebook win.
All you will get from this will be more bots and advertising everywhere for whichever brand paid the most. It's all about ads. Just like in TV shows, the ads will be integrated into the content and be indistinguishable from the original content.
A place like this one will be treasured as a great thing from the past. Is that really what you want?
Honestly? A little bit, yeah. More automated tools with greater function will help as long as we can moderate their use.
My real concern is more related to the fact that this will probably lead to a massive crackdown on sources and shadow libraries that have been used as training data for AIs. If this goes through, I see a lot of ML/AI/bots being forced into an audit, and whenever "potentially infringing" content is found, they won't just remove it, there will be an aggressive push against the shadow library hosting it.
As we’ve said on The Vergecast every time someone gets Nilay going on copyright law, we’re going to see lawsuits centered around this stuff for years to come.
I can't wait to watch and/or listen to The Vergecast this week. If Nilay is on the podcast this time (he wasn't this past week), he's going to talk about it. I, for one, am eager to hear what he has to say.