
that's the expected cost of publishing something to the public internet. People are going to access it. No one has a right to complain when people access something that was put there for the public to see.

What is your opinion on spam email?



Spam is a problem, but a very different one. Spam is often malicious/harmful. Accessing a publicly accessible website is usually benign even if the content being accessed gets saved to disk.

Some people don't publicly publish their email address, they instead selectively give it out only to those they want to get email from, but their address gets leaked/sold and abused. Ultimately people who do publicly publish a contact address (email or even a physical mailing address) are basically on the hook for deciding what to do with whatever people send them.

The spam situation got out of hand pretty fast though. The only thing that kept email spam from reaching the level of a DoS attack were blacklists and server-side filtering, and even with those things (plus client-side filters) spam is still a huge problem today. Spam is just a much bigger problem than web scraping. Even the junkmail the mailman delivers to my door has an environmental cost that's much worse than the "harm" of a web scraper's http GET requests.

We have many alternative ways to contact each other online that aren't as vulnerable to spam, but for all of its shortcomings email continues to be widely used because at the end of the day people think giving strangers the ability to reach out to them uninvited is valuable. Anyone can set up a whitelist and trash everything that comes into their mailbox unless it's from an approved sender, but almost nobody does because they want to be more reachable than that.


That's like comparing web scraping with DDoS


There are some factors that might lead to this line of thought.

- It seems a lot of web developers / product managers either do not know or do not care about robots.txt.

- Some web applications are so badly optimized that they cannot handle more than 1 hit per second at a sustained rate, which admittedly has worked fine so far. But crawlers are persistent, so even normal crawling activity ends up causing a denial of service for regular users.
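For what it's worth, robots.txt is easy for a crawler to honor. A minimal sketch using only the Python standard library (the URL and rules here are made up for illustration):

```python
# Sketch: how a polite crawler could consult robots.txt before fetching.
# The rules and example.com URLs are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/search"))   # False: disallowed
print(rp.can_fetch("*", "https://example.com/article"))  # True: allowed
print(rp.crawl_delay("*"))                               # 10 (seconds between hits)
```

Note that Crawl-delay is a non-standard extension: some crawlers honor it, others ignore it entirely, which is part of why badly behaved bots still overwhelm slow sites.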


I hold the same view as the parent poster on public data online, and my opinion on spam e-mail is that it's a consequence of naivete bordering on faulty design. E-mail should have been set up with strong authentication (proof that a message is from whom it says it is) and explicit consent (you can only message me if I allow you to message me). The latter could be as simple as rate-limiting mail from addresses you have not explicitly allowed, or to which you have not previously written, to one message per week or so, with all such mail automatically going to the spam folder to be purged in a month.




