
This reminds me of the Nepenthes tarpit [1], an endless source of generated gibberish pages that link back to themselves over and over.

Probably more effective at poisoning the dataset if one has the resources to run it.

[1]: https://zadzmo.org/code/nepenthes/
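For anyone curious what a tarpit like this boils down to, here is a minimal Python sketch. It is not Nepenthes' actual code, only an illustration under assumptions: the tiny `CORPUS`, the `/maze/` URL prefix, and the function names are all made up for the example. Each page is generated from a Markov chain and seeded by the request path, so the same URL always returns the same nonsense, and every page links to more generated pages.

```python
import hashlib
import random
from collections import defaultdict

# Hypothetical seed corpus; a real tarpit would train on far more text.
CORPUS = (
    "the crawler follows every link and the link leads to another page "
    "and the page holds another link so the crawler never leaves the maze"
)

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def babble(chain, rng, length=40):
    """Random-walk the chain to produce `length` words of nonsense."""
    word = rng.choice(sorted(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (a word with no successor): restart from a random word.
        word = rng.choice(followers) if followers else rng.choice(sorted(chain))
        out.append(word)
    return " ".join(out)

def render_page(path, chain, n_links=5):
    """Build an HTML page whose content depends only on the request path."""
    # Seeding from the path keeps each URL stable across visits.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = babble(chain, rng)
    # Hypothetical /maze/ prefix: every link points at another generated page.
    links = "".join(
        f'<a href="/maze/{rng.getrandbits(48):012x}">more</a>\n'
        for _ in range(n_links)
    )
    return f"<html><body><p>{text}</p>\n{links}</body></html>"
```

Calling `render_page(request_path, build_chain(CORPUS))` from any web handler would be enough to serve the maze. Nothing needs to be stored, because the page is recomputed from the path on every request.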



I'm running Iocaine [1], which is essentially the same thing, on my tiny $3/mo VPS, and it handles crawlers bombarding the honeypot at ~12 requests per second just fine. It uses about 30 MB of RAM.

[1]: https://iocaine.madhouse-project.org/


Odorless, tasteless, and among the more deadly poisons known to crawlers!


Unfortunately they will spend the next several years building up an immunity.


Do we know if LLM scrapers are running JavaScript on the pages? If they are, maybe it's worth offloading the Markov model to the client side.
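One rough way to check this from your own logs, sketched in Python under assumptions: the page includes a small script that requests a beacon URL (the `/js-beacon` path and the `classify_clients` name are hypothetical), and you then look at which clients ever fetched it.

```python
from collections import defaultdict

# Hypothetical URL that is only requested by a script embedded in the page.
BEACON_PATH = "/js-beacon"

def classify_clients(requests):
    """Split clients by whether they ever fetched the script-triggered beacon.

    `requests` is an iterable of (client_id, path) pairs, for example
    parsed from an access log.
    """
    paths_by_client = defaultdict(set)
    for client_id, path in requests:
        paths_by_client[client_id].add(path)

    runs_js, no_js = set(), set()
    for client_id, paths in paths_by_client.items():
        (runs_js if BEACON_PATH in paths else no_js).add(client_id)
    return runs_js, no_js
```

This is only a heuristic. A crawler that extracts URLs from script source without executing it could still hit the beacon, so building the beacon URL at runtime in the script would make the signal more reliable.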



