28 comments
  • That's like saying smartphones are fundamentally a surveillance technology. There's truth to it, but it's not inherent to the technology. It's a deliberate act by people using the tech that we allow for whatever reason.

    • Right, you can still do traditional advertising without the targeted metrics provided by smartphones, but....

      AI LLMs literally require a corpus of language to learn from. Thus the "Large Language" part of "LLM." The amount of data these models need to function is so staggeringly huge there is no way they can compile all that data without scraping the entire internet and pirating a bunch of copyrighted books.

      It's fundamentally a surveillance technology, because the technology cannot function without that large dataset of language to begin with. It needs massive amounts of data that can only be gathered through surveillance, because unless you're Reddit or Facebook, your own site probably doesn't contain enough data to meet the needs of an LLM. Thus you need to scrape the internet for more data in hopes of filling the gap.

      Books3 is widely used as part of "The Pile" and is clearly the entire contents of the private torrent tracker Bibliotik. People theorize Books2 is all of the books from Library Genesis. To make their models work at all, they have to scrape the internet and pirate thousands of books.

      This is also fundamentally why AI starts to fail so quickly. These tools have been used to flood the internet with AI-generated pages, which in turn become training data for AI. That means the training data is tainted with AI-generated garbage, which will further degrade the LLM. The plus side, I guess, is that if they keep using this kind of business model, they will unintentionally make their AI pretty useless within a few years by flooding the internet with useless, incorrect data.
