As far as we know, Google is not giving up any data. The crawler still must store a copy of the text for the index. The only certainty we have is that Google is no longer sharing it.
Here’s the heart of the not-so-obvious problem:
Websites treat the Google crawler like a first-class citizen. Paywalled sites give Google free, junk-free access, then Google's search results direct people to a website that treats humans worse. So Google users are led to sites they cannot access. The heart of the problem is access inequality: Google effectively serves to refer people to sites that are not publicly accessible.
I do not want to see search results I cannot access. The Google cache was the equalizer that neutralized that problem. Now the problem is back in our faces.
From the article:
“was meant for helping people access pages when way back, you often couldn’t depend on a page loading. These days, things have greatly improved. So, it was decided to retire it.” (emphasis added)
Bullshit! The web gets increasingly enshittified and content is less accessible every day.
For now, you can still build your own cache links even without the button, just by going to “https://webcache.googleusercontent.com/search?q=cache:” plus a website URL, or by typing “cache:” plus a URL into Google Search.
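For example, a minimal sketch of building such a cache link yourself, just by concatenating the prefix quoted above (assuming Google keeps serving these URLs, which is not guaranteed):

```python
# Sketch: build a Google cache link by prepending the webcache prefix quoted above.
# Assumes Google still serves these URLs; that is not guaranteed.
CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"

def google_cache_url(page_url: str) -> str:
    """Return the cache-lookup URL for a given page URL."""
    return CACHE_PREFIX + page_url

print(google_cache_url("https://example.com/some/article"))
# https://webcache.googleusercontent.com/search?q=cache:https://example.com/some/article
```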
You can also use 12ft.io.
Cached links were great if the website was down or quickly changed, but they also gave some insight over the years about how the “Google Bot” web crawler views the web. … A lot of Google Bot details are shrouded in secrecy to hide from SEO spammers, but you could learn a lot by investigating what cached pages look like.
Okay, so there’s a more plausible theory about the real reason for this move. Google may be trying to increase the secrecy of how its crawler functions.
The pages aren’t necessarily rendered the way you would expect.
More importantly, they don’t render the way authors expect. And that’s a fucking good thing! It’s how caching helps give us some escape from enshittification. From the 12ft.io FAQ:
“Prepend 12ft.io/ to the webpage URL, and we’ll try our best to remove the popups, ads, and other visual distractions.”
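In other words, the whole trick is just URL prepending; a sketch (assuming 12ft.io accepts the full original URL straight after the slash):

```python
# Sketch: prepend 12ft.io/ to a page URL, per the FAQ quoted above.
# Assumes 12ft.io accepts the full original URL right after the slash.
def twelve_ft_url(page_url: str) -> str:
    return "https://12ft.io/" + page_url

print(twelve_ft_url("https://example.com/paywalled-article"))
# https://12ft.io/https://example.com/paywalled-article
```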
It also circumvents #paywalls. No doubt there is legal pressure on Google from angry website owners who want to force their content to come with the garbage.
The death of cached sites will mean the Internet Archive has a larger burden of archiving and tracking changes on the world’s webpages.
The possibly good news is that Google’s role shrinks a bit. Any Google shrinkage is a good outcome overall. But there is a concerning relationship between archive.org and Cloudflare. I depend heavily on archive.org largely because Cloudflare has broken ~25% of the web. The day #InternetArchive becomes Cloudflared itself, we’re fucked.
We need several non-profits to archive the web in parallel redundancy with archive.org.
Bingo. When I read that part of the article, I felt insulted. People see the web getting increasingly enshittified and less accessible. The increased need for cached pages is what has justified the existence of 12ft.io.
~40% of my web access is now dependent on archive.org and 12ft.io.
So yes, Google is obviously bullshitting. Clearly there is a real reason for nixing cached pages and Google is concealing that reason.
This is probably an attempt to save money on storage costs.
That’s in fact what the article claims as Google’s reason, but it seems irrational. Google still needs to index websites for the search engine, so the storage is still needed because the data collection is still needed. The only difference (AFAICT) is that Google is simply no longer sharing that data. Also, there are bigger pots of money in play than piddly storage costs.
tl;dr:
I’ve been using Lemmy for years, since back when there were only 2 or 3 nodes and federation did not exist yet. It’s a shit show: extremely buggy web clients and no useful, proper desktop clients. I must say it’s sensible that the version numbers are still 0.x. It’s also getting worse: 0.19.3 was more usable than 0.19.5, which introduced serious bugs that make it unusable in some Chromium-based browsers.
mBin has been plagued with serious bugs, but it’s also very young. It was not ready for prime time when it got rolled out, but I think it (or kbin) was pushed out early because many Redditors were jumping ship and those refugees needed a place to go. IMO mBin will outpace Lemmy and take the lead. mBin is bad at searching: you can search for mags that are already federated, but if a community does not appear in a search I’m not even sure if or how a user can create the federated relationship.
The running goat fuck with Lemmy in recent years has been the shitty JavaScript web client. There’s only so much blame you can fairly put on those devs, though, because they need to focus on a working server. The shitty JavaScript web client should just be considered a proof-of-concept experimental sandbox; JavaScript is unfit for this kind of purpose. It’s really on the FOSS community to produce a decent, proper client. What has happened instead is that effort has gone into a dozen or so different phone apps (wtf?) and no real effort into a desktop app.
Cloudflare filters lacking
Both Lemmy and mBin lack the ability to filter out or block Cloudflare nodes. They both only give a way to block specific forums. So you get swamped in LemmyWorld’s walled garden, and getting LemmyWorld out of sight takes a big manual effort of blocking hundreds of communities. It’s a never-ending game of whack-a-mole.