Hilariously, unless ALL Lemmy instances do this, anyone that federates with you will have to block it too, or any communities they sync with you will be available on their instances...
Is it possible that they offloaded the scraping to a different company to avoid direct litigation now that they're out in the open? That way they can say, "We didn't scrape your website, and you can't prove it."
Like how DDG, Ecosia, and Qwant use Bing for their data, or how the feds buy data from data brokers. Outsource the dirty job like every tech company does, and shift the blame if caught doing something unlawful.
It seems they are trying to garner some positive PR after they scraped through everything without anyone noticing.
I absolutely believe a lot of companies outsource simply because they don't want to build an internal organ to do it. Even in government, despite what Conservatives believe, most organization heads are pretty focused on core competency and press to use outsourced resources. The latter is also promoted by heavy lobbying from the companies selling the services.
This is a situation of "never attribute to malice that which can be easily explained by stupidity." Sure, some are motivated by malice or subterfuge, but most are probably just buying services because they have other things they'd rather focus on.
Why would they be concerned about litigation? As far as I know, scraping is completely legal in most/all countries (including the US, which I'm more familiar with and they're headquartered out of), as long as you're respecting copyright and correctly handling PII (which they claim to be making an effort on).
Yeah, I always assumed robots.txt only told them to hide it from search results, but Google still scrapes everything they can from you. It just gives the illusion that they skipped over you.
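For what it's worth, robots.txt is only a per-crawler allow/deny list that a well-behaved bot is expected to check before fetching a page; nothing enforces it. Here is a minimal sketch of that check using Python's standard `urllib.robotparser`. The robots.txt contents, the `allowed` helper, and the example URL are all made up for illustration, and `GPTBot` is assumed here as the crawler's user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, for illustration only: deny one crawler
# (assumed to identify itself as "GPTBot") and allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is involved.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, allowed(agent, "https://example.com/post/1"))
```

This only shows what a compliant crawler would decide. A scraper that never reads the file, or ignores the answer, is not stopped by any of it.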
But for large website operators, the choice to block large language model (LLM) crawlers isn't as easy as it may seem. Making some LLMs blind to certain website data will leave gaps of knowledge that could serve some sites very well (such as sites that don't want to lose visitors if ChatGPT supplies their information for them), but it may also hurt others. For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future. As a thought experiment, imagine an online business declaring that it didn't want its website indexed by Google in the year 2002—a self-defeating move when that was the most popular on-ramp for finding information online.
That's an interesting point that I hadn't considered. The comparison to Google indexing in the early 2000s may prove to be very apt, given the number of people I've seen using ChatGPT as a search engine.
It was enough to make me try Bing... which lasted all of about ten seconds (one search) before I ran screaming for the hills back to DuckDuckGo.
So no, I don't think this can make people use Bing - that product has so many problems I'm not sure it will ever be good enough.
Having said that - ChatGPT is really good at interpreting a user search term and equally good at understanding the contents of an arbitrary webpage. It's a perfect tool to build a search engine around, and I can't wait for someone more competent than Bing to do just that.
I'd rather like it if they train it on stuff I say. I want the AI of tomorrow to reflect my thoughts.
Seriously, I would much prefer that gold-tier journalism and news sites let it crawl, so when people use it to make choices in the future they're guided to better choices.
It is honestly so hard to know what will happen, though. It's so complicated that it's virtually guaranteed we're not correctly anticipating the consequences of any of this. I'm not really even talking about the AI; I'm talking about the effects on society, which are a lot more complex.