A group of authors filed a lawsuit against Meta, alleging the unlawful use of copyrighted material in developing its Llama 1 and Llama 2 large language models....
“To the extent a response is deemed required, Meta denies that its use of copyrighted works to train Llama required consent, credit, or compensation,” Meta writes.
The authors further stated that, as far as their books appear in the Books3 database, they are referred to as “infringed works”. This prompted Meta to respond with yet another denial. “Meta denies that it infringed Plaintiffs’ alleged copyrights,” the company writes.
When you compare the attitudes on this to how people treated The Pirate Bay, it becomes pretty fucking clear that we live in a society with an entirely different set of rules for established corporations.
The main reason they were able to prosecute TPB admins was the claim they were making money. Arguably, they made very little, but the copyright cabal tried to prove that they were making just oodles of money off of piracy.
Meta knew that these files were pirated. Everyone did. The page where you could download Books3 literally referenced Bibliotik, the private torrent tracker where they were all downloaded. Bibliotik also provides tools to strip DRM from ebooks, something that is a DMCA violation.
This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1)
They knew full well the provenance of this data, and they didn't give a flying fuck. They are making money off of what they've done with the data. How are we so willing to let Meta get away with this while we were literally willing to let US lawyers turn Swedish law upside-down to prosecute a bunch of fucking nerds with hardly any money? Probably because money.
Trump wasn't wrong, when you're famous enough, they let you do it.
You see, if you pirate a couple textbooks in college because you don't have resources, but you want to earn your right to participate in society and not starve, it's called theft.
But if one of the top 10 companies in the world does the same with thousands of books just to get even richer, it's called fair use.
I'll say this: If Meta and Facebook are prosecuted and their domains seized in the same way pirate sites are, for Meta's use of illegitimately obtained copyrighted material for profit, then I'll believe that anti-piracy laws are fair and just.
If Meta win this lawsuit, does it mean I can download some open source AI and claim that "These million 4k Blu-ray ISOs I torrented were just used to train my AI model"?
Heck, if how you use the downloaded stuff is a factor, I can claim that I just torrented those files and never looked at them. It is more believable than Meta's argument too, because, as a human, I do not have enough time to consume a million movies in my lifetime (probably, didn't do the math) unlike AIs.
But who am I kidding, I fully expect to be sued to hell and back if I were actually to do that.
Oh so when I pirate something I get a legal notice in my mailbox and a strike against me but when Meta does it they get rewarded with H A L L U C I N A T I O N S
Fair use covers research, but creating a training database for your commercial product is distinctly different from research. They're not publishing scientific papers, along with their data, which others can verify; they are developing a commercial product for profit. Even compared to traditional R&D this is markedly different, as they aren't building a prototype - the test version will eventually become the finished product.
The way fair use works is that a judge first decides whether it fits into one of the categories - news, education, research, criticism, or comment. This does not really fit into the category of "research", because it isn't research, it's the final product in an interim stage. However, even if it were considered research, the next step in fair use is the nature, in particular whether it is commercial. AI is highly commercial.
AI should not even be classified in a fair use category, but even if it were, it should not be granted any exemption because of how commercial it is.
They use other people's work to profit. They should pay for it.
Facebook steals the data of individuals. They should pay for that, too. We don't exchange our data for access to their website (or for access to some 3rd party Facebook pays to put a pixel on); the website is provided free of charge, and they try to shoehorn another transaction into the fine print of the terms and conditions where the user gives up their data free of charge. It is not proportionate, and the user's data is taken without proper consideration (i.e. payment, in terms of the core principles of contract law).
Frankly, it is unsurprising that an entity like Facebook, which so egregiously breaks the law and abuses the rights of every human being who uses the internet, would try to abuse content creators in such a fashion. Their abuse needs to be stopped, in all forms, and they should be made to pay for all of it.
Given how LLMs work and how nearly everything of value is under copyright until at least the old age of the creator's grandchildren, LLMs would probably be pretty useless if they couldn't disregard copyright for their purposes.
Not that I have any sympathy for the likes of Meta and OpenAI in any of this.
The profit margins in AI are fleeting at best. There's no point in squabbling over who's paying for what training data. Very, very soon it's all going to be free anyway.
I love how everybody here goes from "yay piracy" and "screw copyright" to "I can't believe they violated copyright laws" the second it's somebody they dislike.
ITT: A hilarious combination of people who have no clue what copyright covers and people who think providing a tool that allows a user to generate potentially copyrighted material is a violation of the aforementioned.
Google literally does this in every image search, but go off I guess...