The New York Times sues OpenAI and Microsoft for copyright infringement::The New York Times has sued OpenAI and Microsoft for copyright infringement, alleging that the companies’ artificial intelligence technology illegally copied millions of Times articles to train ChatGPT and other services to provide people with information – technology that now competes with the Times.
There is something wrong when search and AI companies extract all of the value produced by journalism for themselves. Sites like Reddit and Lemmy also have this issue. I’m not sure what the solution is. I don’t like the idea of a web full of paywalls, but I also don’t like the idea of all the profit going to the ones who didn’t create the product.
It's a product where the value curve is heavily weighted towards recency.
In theory, the greatest value theft is when the AP writes a piece and two dozen other 'journalists' copy the thing, changing the text just enough not to get sued. That is completely legal, but it is what effectively killed investigative journalism.
An LLM taking years-old articles and predicting them until it can effectively learn relationships between language itself and the events described in those articles isn't some inherent value theft.
It's not the training that's the problem, it's the application of the models that needs policing.
Like if someone took an LLM, fed it recently published news stories in the prompts with RAG, and had it rewrite them just differently enough that no one needed to visit the original publisher.
Even if we keep it legal for humans to do that (which we might really want to revisit, or at least create an industry-specific restriction for), maybe we should have different rules for the models.
Except no one is claiming that LLMs are the problem; they're claiming GPT, or more specifically GPT's training data, is the problem. Transformer models still have a lot of potential, but the question the NYT is asking is "can you just take anyone else's work to train them?"
The solution is imposing on these companies the responsibility of tracking their profit per media outlet, taxing them, and redistributing that money based on the tracking info. They're able to track all the pages you visit, so it's complete bullshit when they say they don't know how much they make from each place their ads are displayed.
My question is: how is an AI reading a bunch of articles any different from a human doing it? With this logic no one would legally be able to write an article, as they are using bits of other people's work that they read and learnt from to write a good article.
They are both making money with parts of other people's work.
It was thought that the LLM wouldn’t keep the actual data internally verbatim. If you can memorize an article, and recite it to everyone free of charge, technically it’s plagiarism. Same if you sing a song to a crowd when you don’t have the rights.
The Google research (and other discoveries) proved that you can actually extract verbatim training data from an LLM, which has a lot of implications for copyright.
The physical limitations are an important difference. A human can only read and remember so much material. With AI, you can scale that exponentially with more compute resources. Frankly, IP law was not written with this possibility in mind and needs to be updated to find a balance.
Let me ask you this: when have you ever seen ChatGPT cite its sources and give appropriate credit to the original author?
If I were to just read the NYT and make money by simply summarizing articles and posting those summaries on my own website without adding anything to it like my own commentary and without giving credit to the author, that would rightfully be considered plagiarism.
This is a really interesting conundrum though. I would argue that AI isn't capable of original thought the way that humans are and therefore AI creators must provide due compensation to the authors and artists whose data they used.
AI is only giving back some amalgamation of words and concepts that it has been trained on. You might say that humans do the same, but that isn't exactly true. The human brain is a funny thing. It can forget, it can misremember. It can manipulate. It can exaggerate. It can plan. It can have irrational or emotional responses. AI can't really do those things on its own. It's just mimicking human behavior at best.
Most importantly to me though, AI is not capable of spontaneous thought. It is only capable of providing information that it has been trained on and only when prompted.
I'm pretty sure there is copyright infringement going on by the letter of the law. But I also think the world would be better off if copyright laws were a bit more loose. Not wild-west anything-goes libertarianism, but more open than the current state.
Let me ask you this: when have you ever seen ChatGPT cite its sources and give appropriate credit to the original author?
Bing chat now does that by default. Normally you have to prompt that manually.
If I were to just read the NYT and make money by simply summarizing articles and posting those summaries on my own website without adding anything to it like my own commentary and without giving credit to the author, that would rightfully be considered plagiarism.
No. It would be considered journalism. If you read the news a bit, you will find that they reference the output of other news corporations quite a bit. If your preferred news source does not do that, then they simply don't cite their sources.
I think the important difference in this case is like the difference between a human enjoying a song that they hear being performed vs a company recording a song that someone is performing and then replaying that song on demand for paying customers.
not even in their entirety. It's taking a few notes from here and there, arranging them in a way that makes sense, and effectively performing a "new" song - which isn't all that different from a human artist who is "inspired" by the works of other artists and produces a new work in the same genre.
The main difference is the volume. An example I like is how Google trained its gaming AI to play StarCraft 2. This AI was able to beat highly ranked professional gamers. It was trained by watching a century of games.
ChatGPT didn't read a few articles, it read years of them, maybe a couple of decades.
Reminds me of Nokia suing Apple (two waves), Blockbuster suing Netflix, and Yahoo suing Facebook. Threatened, declining company suing a disruptor is what we can expect will always happen I guess. Will be nice to see this stuff finally tested in court though.
Except the news still needs to come from somewhere. While GPT can "create" things, it's not a journalist. It's just the next step in aggregation skimming money from the actual sources.
This person seems not to know very much about what they are talking about, despite their confidence in saying it.
It looks like they think the reason AI output can't be copyrighted is that it's been "ruled a derivative work", but that's not the reasoning provided. The reasoning is that copyright can only protect human creativity, and thus machine output without human involvement can't be copyrighted, with the judge noting that the line for what proportion of human contribution is needed is unclear.
The other suits trying to claim the models are derivative works are either yet to be settled or in some cases have been thrown out.
Even in one of the larger suits on whether training is infringement regarding LLMs, the derivative claim has been thrown out:
Chhabria, in his ruling, called this argument “nonsensical,” adding, “There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.”
Additionally, Chhabria threw out the plaintiffs’ argument that every LLaMA output was “an infringing derivative” work and “constitutes an act of vicarious copyright infringement”; that LLaMA was in violation of the Digital Millennium Copyright Act; and that LLaMA “unjustly enriched Meta” and “breached a duty of care ‘to act in a reasonable manner towards others’ by copying the plaintiffs’ books to train LLaMA.”
Social media has really turned into a confirmation bias echo chamber where misinformation can run rampant when people make unsourced broad claims that are successful because they "feel right" even if they aren't.
Perhaps the reason hallucination is such a problem for LLMs is that in the social media data that's a large chunk of their training everyone is so full of shit?
Perhaps the reason hallucination is such a problem for LLMs is that in the social media data that’s a large chunk of their training everyone is so full of shit?
Heh. I think it simply shows us that the fundamental principle of artificial neural nets really captures how the brain works.
Social media has really turned into a confirmation bias echo chamber where misinformation can run rampant
Honestly, this can easily be overstated in the case of social media relative to anything else humanity does. By and large, no one knows anything and everyone is happy talking and speculating as they do. It was true before social media and it will be after.
The fun part is trying to make sense of it all, thus why I said “interesting”.
I personally have thought the copyright dimension one of the more interesting aspects of AI in the short and medium term and have thought so for years. Happy to hear takes and opinions on the issue, especially as I’m not plugged into the space any more.