Technology @lemmy.world GlitzyArmrest @lemmy.world 6 mo. ago

OpenAI claims The New York Times tricked ChatGPT into copying its articles

www.theverge.com

OpenAI insists training AI models on copyrighted data is fair use.

OpenAI has publicly responded to a copyright lawsuit by The New York Times, calling the case “without merit” and saying it still hoped for a partnership with the media outlet.

In a blog post, OpenAI said the Times “is not telling the full story.” It took particular issue with claims that its ChatGPT AI tool reproduced Times stories verbatim, arguing that the Times had manipulated prompts to include regurgitated excerpts of articles. “Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts,” OpenAI said.

OpenAI claims it’s attempted to reduce regurgitation from its large language models and that the Times refused to share examples of this reproduction before filing the lawsuit. It said the verbatim examples “appear to be from year-old articles that have proliferated on multiple third-party websites.” The company did admit that it took down a ChatGPT feature, called Browse, that unintentionally reproduced content.


The advances in LLMs and diffusion models over the past couple of years are remarkable technological achievements that should be celebrated. We shouldn't be stifling scientific progress in the name of protecting intellectual property; we should be keen to develop the next generation of systems that mitigate hallucination and achieve new capabilities, such as those proposed in Yann LeCun's Autonomous Machine Intelligence concept.

I can sorta sympathise with those whose work is "stolen" for use as training data, but really whatever you put online in any form is fair game to be consumed by any kind of crawler or surveillance system, so if you don't want that then don't put your shit in the street. This "right" to be omitted from training datasets directly conflicts with our ability to progress a new frontier of science.

The actual problem is that all this work is undertaken by a cartel of companies with a stranglehold on compute power and the resources to crawl and clean all that data. As with all natural monopolies (transportation, utilities, etc.) it should be undertaken for the public good, in such a way that we can all benefit from the profits.

And the millionth argument quibbling about whether LLMs are "truly intelligent" is a totally orthogonal philosophical tangent.
- I understand your point, but disagree.
  
  We tend to think of these models as agents or persons with a right to information. They "learn like we do" after all. I think the right way to see them is emulating machines.
  
  A company buys an empty emulating machine and then puts in the type of information it would like to emulate or copy. Copyright already prevents companies from doing this in the classic sense of direct emulation.
  
  LLM companies are trying to push the view that their emulating machines are different enough from previous methods of copying that they should be immune to copyright. They tend to also claim that their emulating machines are in some way learning rather than emulating, but this is tenuous at best and has not yet been proven in a meaningful sense.
  
  I think you'll see that if you only feed an LLM art or text from only one artist you will find that most of the output of the LLM is clearly copyright infringement if you tried to use it commercially. I personally don't buy the argument that just because you're mixing several artists or writers that it's suddenly not infringement.
  
  As far as science and progress, I don't think that's hampered by the view that these companies are clearly infringing on copyright. Copyright already has several relevant exemptions for educational and private use.
  
  As far as "it's on the internet, it's fair game". I don't agree. In Western countries your works are still protected by copyright. Most of us do give away those rights when we post on most platforms, but only to one entity, not anyone/ any company who can read or has internet access.
  
  I personally think IP laws as they are hold us back significantly. Using copyright against LLMs is one of the first modern cases where I think it will protect society rather than hold us back. We can't just give up all our works and all our ideas to a handful of companies to copy for profit just because they can read and view them and feed them en masse into their expensive emulating machines.
  
  We need to keep the right to profit from our personal expression. LLMs and other AI as they currently exist are a direct threat to our right to benefit from our personal expression.
  
  > We tend to think of these models as agents or persons with a right to information. They “learn like we do” after all.
  
  This is again a similar philosophical tangent that's not germane to the issue at hand (albeit an interesting one).
  
  > I think you’ll see that if you only feed an LLM art or text from only one artist you will find that most of the output of the LLM is clearly copyright infringement if you tried to use it commercially.
  
  This is not a feasible proposition in any practical sense. LLMs are necessarily trained on VAST datasets that comprise all kinds of text. The only type of network that could be trained on just one artist's corpus is a tiny pedagogical tool like Karpathy's minGPT (https://github.com/karpathy/minGPT), trained solely on the works of Shakespeare. But this is not a "Large" language model, it's a teaching exercise for ML students. One artist's work could never practically train a network that could be considered "Large" in the sense of LLMs. So it's pointless to dwell on a contrived scenario like that.
  
  In more practical terms, it's not controversial to state that deep networks with lots of degrees of freedom are capable of overfitting and memorizing training data. However, if they have additional capabilities besides memorization then this may be considered an acceptable price to pay for those capabilities. It's trivial to demonstrate that chatbots can perform novel tasks, like writing a rap song about Spongebob going to the moon on a rocket powered by ice cream, which surely does not exist in any training data, yet any contemporary chatbot is able to produce it.
  
  > As far as science and progress, I don’t think that’s hampered by the view that these companies are clearly infringing on copyright.
  
  As an example, one open research question concerns the scaling relationships of network performance as dataset size increases. In this sense, any attempt to restrict the pool of available training data hampers our ability to probe this question. You may decide that this is worth it to prioritize the sanctity of copyright law, but you can't pretend that it's not impeding that particular research question.
  
  > As far as “it’s on the internet, it’s fair game”. I don’t agree. In Western countries your works are still protected by copyright. Most of us do give away those rights when we post on most platforms, but only to one entity, not anyone/ any company who can read or has internet access.
  
  I wasn't making a claim about law, but about ethics. I believe it should be fair game, perhaps not for private profiteering, but for research. Also this says nothing of adversary nations that don't respect our copyright principles, but that's a whole can of worms.
  
  > We can’t just give up all our works and all our ideas to a handful of companies to copy for profit just because they can read and view them and feed them en masse into their expensive emulating machines.
  
  As already stated, that's where I was in agreement with you - It SHOULDN'T be given up to a handful of companies. But instead it SHOULD be given up to public research institutes for the furtherance of science. And whatever you don't want to be included you should refrain from posting. (Or perhaps, if this research were undertaken according to transparent FOSS principles, the curated datasets would be public and open, and you could submit the relevant GDPR requests to get your personal information expunged if you wanted.)
  
  Your whole response is framed in terms of LLMs being purely a product for commercial entities, who shadily exaggerate the learning capabilities of their systems, and couches the topic as a "people vs. corpos" battle. But web-scraped datasets (such as ImageNet) have been powering deep learning research for over a decade, long before AI captured the public imagination the way it has currently, and long before it became a big money spinner. This view neglects that language modelling, image recognition, speech transcription, etc. are also ongoing fields of academic research. Instead of vainly trying to cram the cat back into the bag, and throttling research, we should be embracing the use of publicly available data, with legislation that ensures it's used for public benefit.
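
For anyone who wants to see the memorization point from this exchange in concrete form, here is a minimal sketch. It is not a deep network: it swaps in a character-level n-gram lookup table as a stand-in for a heavily overfit model, and the one-line corpus, the context length of 8, and the names `train_char_ngram` and `generate` are all made up for illustration. With a long context and a single tiny source text, greedy generation can only replay that text, and a prompt it has never seen gets nothing at all, which is roughly the difference between pure memorization and the "additional capabilities" argued for above.

```python
from collections import Counter, defaultdict


def train_char_ngram(text: str, order: int) -> dict:
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def generate(counts: dict, seed: str, order: int, max_len: int) -> str:
    """Greedy decoding: always append the most frequent next character."""
    out = seed
    while len(out) < max_len:
        context = out[-order:]
        if context not in counts:
            # Context never seen in training: the table has nothing to say.
            break
        out += counts[context].most_common(1)[0][0]
    return out


# Hypothetical single-author "corpus": one line, chosen only for the demo.
corpus = (
    "Shall I compare thee to a summer's day? "
    "Thou art more lovely and more temperate."
)
order = 8  # assumed context length; long enough that every context is unique

model = train_char_ngram(corpus, order)

# Prompting with the opening characters replays the training text verbatim.
sample = generate(model, corpus[:order], order, len(corpus))
print(sample == corpus)

# A prompt outside the training text produces no continuation.
novel = generate(model, "Spongebob", order, 40)
print(repr(novel))
```

The sketch only says something about a lookup table trained on one text; whether and how much a real LLM behaves this way on a web-scale dataset is exactly what the thread is arguing about.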
