The newly announced "Public Content Policy" will now join Reddit's existing privacy policy and content policy to guide how Reddit's data is accessed and used by commercial entities and other partners.
Well, that's part of the problem: a policy can't actually stop web scraping. They can ban your IP or any accounts you have, but they can't prevent the scraping itself. It's why projects like NewPipe and Invidious don't care about YouTube's cease-and-desist letters.
I tried that on my desktop. As long as you're not actually logged in, you can't see communities that are too small to have been reviewed, or that were flagged as adult after a review.
Parsing absolutely comes with a lot more overhead. Many websites rely heavily on JavaScript interactivity nowadays, so depending on the site, the HTML that comes back from a plain HTTP request often doesn't contain the content you're after; it only gets filled in once the scripts run.
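To make the difference concrete, here's a minimal Python sketch against a hypothetical JS-heavy page (the URL is a placeholder): a plain HTTP fetch returns only the initial HTML, while a headless browser (Playwright here) executes the scripts and yields the rendered DOM, at a much higher cost per page.

```python
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com/js-heavy-page"  # hypothetical placeholder

# 1) Plain HTTP request: returns only the initial HTML. If the page
#    builds its content client-side, the data you want may simply not
#    be in this string yet.
raw_html = requests.get(URL, timeout=10).text

# 2) Headless browser: actually executes the JavaScript, so the
#    rendered DOM includes dynamically inserted content -- at the cost
#    of far more CPU, memory, and wall-clock time per page.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL)
    page.wait_for_load_state("networkidle")  # wait for JS-driven fetches
    rendered_html = page.content()
    browser.close()

print(f"{len(raw_html)} bytes raw vs {len(rendered_html)} bytes rendered")
```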
We use Akamai where I work for security, CDN, etc. Their services make it largely trivial to identify traffic from bots. They can classify requests in real time as coming from known bots like Googlebot, from HTTP clients in frameworks like Python and Java, from bots that impersonate Googlebot, or from virtually any other automated traffic from unknown bots.
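One check of that kind is easy to sketch: Google documents that genuine Googlebot IPs reverse-resolve to a googlebot.com or google.com hostname that forward-resolves back to the same IP. The Python below implements just that check; a service like Akamai's layers far more on top (TLS and header fingerprinting, behavioral scoring), none of which is reproduced here.

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Reverse-then-forward DNS check for Googlebot impersonation."""
    try:
        # Reverse DNS: genuine Googlebot IPs have a PTR record ending
        # in googlebot.com or google.com.
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same
        # IP, otherwise anyone controlling their own netblock's PTR
        # records could fake the first step.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:  # DNS lookup failed either direction
        return False

# A request claiming to be Googlebot from Google's crawl ranges
# (e.g. 66.249.66.1) passes; the same User-Agent from a random VPS
# fails the forward confirmation.
print(is_real_googlebot("66.249.66.1"))
```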
If Reddit were smart, they'd leverage something like that to let Google, Bing, etc. crawl their data and block everyone else, or poison the rest with bogus data. But we're talking about Reddit here…
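A rough sketch of what that gating could look like, as a hypothetical Flask endpoint: verified crawlers get the real content, anything that merely looks automated gets decoy data, and ordinary browsers are untouched. The looks_automated heuristic and the decoy strings are stand-ins for a real bot-management verdict and real poisoning logic.

```python
import socket
from flask import Flask, request

app = Flask(__name__)

def is_real_googlebot(ip: str) -> bool:
    # Reverse-then-forward DNS check, as in the previous sketch.
    try:
        host = socket.gethostbyaddr(ip)[0]
        return (host.endswith((".googlebot.com", ".google.com"))
                and ip in socket.gethostbyname_ex(host)[2])
    except OSError:
        return False

def looks_automated(req) -> bool:
    # Placeholder heuristic only; a real deployment would take the
    # verdict from a bot-management service, not a User-Agent sniff.
    ua = req.headers.get("User-Agent", "").lower()
    return any(token in ua for token in ("python", "java", "curl", "bot"))

@app.route("/post/<post_id>")
def post(post_id):
    ua = request.headers.get("User-Agent", "").lower()
    if "googlebot" in ua and is_real_googlebot(request.remote_addr):
        return f"real post {post_id}"                 # verified crawler
    if looks_automated(request):
        return f"plausible but bogus post {post_id}"  # unknown bot: poisoned
    return f"real post {post_id}"                     # ordinary browsers
```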