
By the way, the technical side of this is very interesting. If you look at the tools mentioned (the Wayback Machine, but also perma.cc and other archival solutions), almost all of them rely on a single semi-modern tech stack that produces WARCs (Web ARChive files, ISO 28500:2017: https://iipc.github.io/warc-specifications/specifications/wa...).

The main crawler still seems to be heritrix3 (https://github.com/internetarchive/heritrix3), but there's a great little ecosystem with tools such as webrecorder and warcprox.
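
For a sense of how lightweight the format is at its core, here's a minimal sketch of writing a single response record with warcio, the Python library from the webrecorder ecosystem. The URL and payload are placeholders, not a real capture:

    # minimal WARC writing sketch with warcio (pip install warcio)
    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open('example.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders('200 OK',
                                        [('Content-Type', 'text/html')],
                                        protocol='HTTP/1.1')
        # placeholder URI and body, just to show the record structure
        record = writer.create_warc_record(
            'http://example.com/', 'response',
            payload=BytesIO(b'<html><body>hello</body></html>'),
            http_headers=http_headers)
        writer.write_record(record)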

Still, I've read through the code of these tools, and my feeling is that they are failing in the face of the modern web: single-page apps, mobile apps, and walled gardens. Even newer iterations built on browser automation are increasingly throttled, blocked, and shut out of walled gardens.

Perhaps the time has come for a coordinated, decentralized but omnipresent approach to archival.



WARC can record and replay single-page apps, but it struggles with knowing where a "page" begins and ends.

There was a time when I was furious with the web going to hell, and I investigated the possibility of a "web without browsers": it started with making a WARC capture of a page and putting pages through extensive filtering and classification before the user sees anything.

With interactive capturing you can push a button to indicate that a page is done "loading", but with automated capturing you can't really know that the page is done or that you got a good capture. That ended the project right there.
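
For what it's worth, the usual heuristic in browser automation today is "network idle", and it has exactly the failure mode described above. A rough sketch with Playwright (the URL is a placeholder); "networkidle" just means no network connections for roughly 500 ms, which polling or lazy-loading SPAs defeat:

    # "is the page done?" heuristic via Playwright's network-idle signal
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # waits until there have been no network connections for ~500 ms,
        # or gives up after 30 s; neither outcome proves the capture is good
        page.goto('https://example.com/', wait_until='networkidle',
                  timeout=30_000)
        html = page.content()
        browser.close()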


I, too, was fascinated by a "web without browsers" (or with other kinds of browsers, really) until I stepped into the community and realized that community was a proto-nazi cesspit full of misogynistic attitudes.

Maybe now that time has passed, people have died whose preference for white supremacy came out posthumously (and who funded it heavily with bitcoin), and whole projects have been renamed, there can be a more inclusive community built around a browserless web? For those who haven't followed, I'm referring to Woob (previously Weboob). I'd be interested in other people's feedback about that community lately; the ideas are great!


> I, too, was fascinated by a "web without browsers" (or with other kinds of browsers, really) until I stepped into the community and realized that community was a proto-nazi cesspit full of misogynistic attitudes.

Could you expand on that? It seems a bit out of (anti-)left field, so to speak.


I had this idea too. I wrote some code to scrape data from my school's badly designed website, and it significantly improved my quality of life. It really made me think: what if we had a huge library of scrapers for every single website out there? We could build custom clients and have full control over everything. If people can maintain absurdly huge ad-blocking filter lists, surely something like this would also be possible.
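
To make the idea concrete, a per-site scraper can be tiny. This is a made-up example (the URL and CSS selector are hypothetical); the selector is the site-specific part such a shared library would have to maintain, much like adblock filter lists:

    # minimal per-site scraper sketch (requests + BeautifulSoup)
    import requests
    from bs4 import BeautifulSoup

    def fetch_announcements(base_url):
        """Extract announcement titles from a hypothetical school site."""
        html = requests.get(base_url, timeout=10).text
        soup = BeautifulSoup(html, 'html.parser')
        # the CSS selector below is the fragile, site-specific part
        return [el.get_text(strip=True)
                for el in soup.select('.announcement h2')]

    print(fetch_announcements('https://school.example/news'))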

Nice to know about Weboob. No idea what the community was like, but it's reassuring that I'm not insane for thinking about stuff like this.


I'm completely out of the loop on something like this, but could you in theory apply some kind of ML to identify the end of pages to assist with good page captures?


Probably. Certainly the more you spent on it, the better you could do.

At the time I was most bothered by the slow load times of web pages, and I blamed this phenomenon:

https://www.sjsu.edu/faculty/watkins/samplemax4.htm

particularly that if you take the max of N random variables, its expected value gets worse as N increases -- that is, the page isn't done loading until the slowest HTTP request completes.
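
A quick simulation makes the point; exponential latencies are an assumption chosen for illustration, but any i.i.d. latency distribution shows the same trend:

    # expected max of N i.i.d. request latencies grows with N
    import random

    def expected_max(n, trials=10_000, mean_latency=1.0):
        return sum(max(random.expovariate(1 / mean_latency)
                       for _ in range(n))
                   for _ in range(trials)) / trials

    for n in (1, 10, 100):
        print(n, round(expected_max(n), 2))
    # for exponentials the expected max is the Nth harmonic number
    # times the mean: roughly 1.0, 2.93, 5.19 -- worse as N grows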

So I saw the "knowing when the page is done" problem as particularly core, and it would be if the goal were to "win the race" against a conventional web browser.

If you were (say) preloading all the links submitted to Hacker News, you might be able to tolerate the system taking 5 minutes to process an incoming page. (See archive.is)

Today I've noticed that sites like Wired have given up on complaining about my anti-tracker and ad-blocker and just load the page partially, which would drive me crazy if I were serious about debugging.


> increasingly throttled, blocked, and shut out of walled gardens

I keep thinking back to Jacob Appelbaum's stance that "Facebook and the other walled gardens are the real dark web."


Honestly, it would be a better use of surplus resources than crypto mining.

If only there were a way to algorithmically tie the proof of work for a new cryptocurrency to archival of the internet, in a way that couldn't be easily gamed (by people archiving easy-to-access content, or by highly redundant archival of trivia).



I think “right to be forgotten” is important and I’m generally against everlasting social media posts, but for copyrighted works we really need a centralized Library of Congress that acts to archive them. In order for that to happen there needs to be an equivalent “publishing” mechanism for the web, where the user says: I created something and I want it to be archived. This would cover things that exist behind a paywall or are only delivered as newsletters.


WARC is a work of genuine genius, and a very real and valuable contribution, in large part from the Internet Archive.



