Internet Archive as a default host-of-record for startups (twitter.com/id_aa_carmack)
356 points by bpierre on Dec 21, 2021 | hide | past | favorite | 178 comments


Hi,

I manage the Wayback Machine at the Internet Archive.

Very happy so many people here care about preserving, and making available, our cultural heritage!

Please know a dedicated, and talented, team of engineers works every day to do a better job of archiving more of the public Web, and making it available via the Wayback Machine.

As noted the Internet Archive is experimenting with filecoin.io and storj.io and is always open to suggestions about how we might do our jobs better, and improve our service. We also host regular meetups (and have hosted summits and a camp) related to the Decentralized Web. See: https://blog.archive.org/tag/dweb/

The Internet Archive also offers archive-it.org, a subscription service, for those who want a higher level of support and more features.

We appreciate any support you can offer, financial and otherwise. Please share any bug reports, feature suggestions and other feedback with us via email to info@archive.org

Oh, and… check out the new PDF Search feature we just launched at the bottom of web.archive.org. More like that to come in 2022.

Finally, you might also find some of the things I wrote here of interest: https://gijn.org/2021/05/05/tips-for-using-the-internet-arch...


I worked with y'all as a volunteer back when trump was pushing to drop global warming sites from gov websites. I also use the wayback machine professionally on a regular basis. The work y'all do is genuinely appreciated to say the least.

That said... damn, I really wish y'all would revisit some of your fundamentals, like recursive scraping and making sure your scrapes are whole and complete, before working on Filecoin and other needlessly flashy systems. I'm genuinely worried y'all are digging yourselves into a technical pit that you can't get out of, and that it will hurt or even kill your goals.

Cheers.


Recursive scraping:

1) can consume a lot of storage really fast

2) makes IA a bot rather than a user service

3) is already done on a case-by-case basis by ArchiveTeam, allowing IA to avoid the previous two problems.


There's an old school website I wanted to access that's long gone. It's only partially preserved in the Wayback Machine. I'm glad it's there at all, but some of the specific information I was looking for was never captured, and now there's no way to recover it.

I realize it's a hard problem, but I really wish there were a way to automate more of it. Some of these communities are too small for anyone to bother preserving the pages manually, and I don't imagine we'd even show up on ArchiveTeam's radar. But they're not large pages, and basically static. I don't think they'd be a huge burden to store and maintain. It seems like some sort of coverage + size metric would be pretty effective at guiding an automated scan, such that you'd be able to preserve things like this without needing humans to go and manually archive each and every page.


https://wiki.archiveteam.org/index.php/ArchiveBot is available if you would like to help save things that are still up.


I never got turned down sending an IRC message "x is going down in a few days" to the AT channels. https://wiki.archiveteam.org/index.php/Archiveteam:IRC

AT just asks to be given a few weeks/months of notice depending on how many TBs (MBs?) the crawl needs to be and not to throttle/ban their clients.


Pretty much. Your idea of getting coverage could probably work well to make sure IA doesn't "turn into a bot" (though it's practically the same, just without the permanent downloading).


What’s recursive scraping (presumably, following internal links), and what’s broken about it?


Yep, that's exactly it. Recursive scraping, in theory, would close a lot of gaps on IA. It's really hard to do generically and "right" though. Lots of one-off... strangeness... especially from older sites and sites that auto-generate URLs at render time.


Does the wayback machine offer a JSON/REST API to check for URLs, in the sense of sending a URL and getting back a map of specific crawl datetimes that are available in the cache?

I know about the web.archive.org/https://... "hack", but it's probably draining your servers unnecessarily when there are a lot of 301 redirects that weren't archived, and on top of that it can only be validated client-side after receiving the whole HTML response.

A REST API would help third-party clients know about this in advance, and the API documentation I found was super unclear about whether something like this exists or not.

Context: I'm building a web browser and I'm trying to offer a feature for error cases when the server or URL isn't available anymore, so that users can see the web archived version of it.

On top of that, I have no idea how to "un-UI" the web archived versions. I know that the wget user agent somehow leads to this, but it's also kind of undocumented how IA's webserver does this in the background, and when exactly the UI is injected and the URLs are rewritten. Something like an HTTP request header to get the raw actual source would be nice.


> Does the wayback machine offer a JSON/REST API to check for URLs, in the sense of sending a URL and getting back a map of specific crawl datetimes that are available in the cache?

Have you looked at the headers that they send?

  GET /web/20100330210402/https://arxiv.org/abs/0911.1112 HTTP/2
  [...]

  HTTP/2 200 OK
  [...]
  link: <http://arxiv.org/abs/0911.1112>; rel="original",
    <https://web.archive.org/web/timemap/link/http://arxiv.org/abs/0911.1112>;
    rel="timemap"; type="application/link-format",
    <https://web.archive.org/web/http://arxiv.org/abs/0911.1112>;
    rel="timegate",
    <https://web.archive.org/web/20100330210402/http://arxiv.org/abs/0911.1112>;
    rel="first memento"; datetime="Tue, 30 Mar 2010 21:04:02 GMT",
    <https://web.archive.org/web/20100330210402/http://arxiv.org/abs/0911.1112>;
    rel="memento"; datetime="Tue, 30 Mar 2010 21:04:02 GMT",
    <https://web.archive.org/web/20110101203756/http://arxiv.org/abs/0911.1112>;
    rel="next memento"; datetime="Sat, 01 Jan 2011 20:37:56 GMT",
    <https://web.archive.org/web/20211123040625/https://arxiv.org/abs/0911.1112>;
    rel="last memento"; datetime="Tue, 23 Nov 2021 04:06:25 GMT"
  [...]
The Wayback Machine implements RFC 7089. <http://mementoweb.org/guide/quick-intro/>
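For a third-party client, the `link` header above can be parsed mechanically into a rel → URL/datetime map. A minimal sketch (Python; the header below is an abbreviated copy of the example above, and this is a toy parser, not an official client library):

```python
import re

def parse_link_header(header: str) -> list[dict]:
    """Parse an RFC 8288-style Link header into a list of
    {'url': ..., 'rel': ..., 'datetime': ...} entries."""
    entries = []
    # Split on commas that start a new <url> entry; commas inside
    # quoted datetime values are left alone by the lookahead.
    for part in re.split(r',\s*(?=<)', header):
        m = re.match(r'<([^>]+)>', part)
        if not m:
            continue
        entry = {'url': m.group(1)}
        for key, value in re.findall(r'(\w+)="([^"]*)"', part):
            entry[key] = value
        entries.append(entry)
    return entries

header = ('<http://arxiv.org/abs/0911.1112>; rel="original", '
          '<https://web.archive.org/web/20100330210402/http://arxiv.org/abs/0911.1112>; '
          'rel="first memento"; datetime="Tue, 30 Mar 2010 21:04:02 GMT", '
          '<https://web.archive.org/web/20211123040625/https://arxiv.org/abs/0911.1112>; '
          'rel="last memento"; datetime="Tue, 23 Nov 2021 04:06:25 GMT"')

mementos = {e['rel']: e for e in parse_link_header(header)}
# mementos['first memento'] and mementos['last memento'] give the
# earliest and latest known crawl datetimes for the URL.
```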


Regarding your last point, adding "id_" to the end of the timestamp in the URL produces the original downloaded file. Adding "if_" produces the page with rewritten links but without the Wayback Machine header. Additionally, if you look at the HTTP response headers on any version of the archived page, they include copies of the original response headers.
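The modifier slots in directly after the 14-digit timestamp in the archived URL. A small sketch of that URL surgery (Python; assumes the standard `/web/<timestamp>/<url>` layout described above):

```python
import re

def wayback_variant(archived_url: str, modifier: str) -> str:
    """Insert a modifier after the 14-digit timestamp in a Wayback
    Machine URL: 'id_' for the raw original bytes, 'if_' for
    rewritten links without the Wayback Machine header."""
    return re.sub(r'(/web/\d{14})/', r'\g<1>' + modifier + '/',
                  archived_url, count=1)

url = 'https://web.archive.org/web/20100330210402/http://arxiv.org/abs/0911.1112'
raw = wayback_variant(url, 'id_')
# raw fetches the original file exactly as it was downloaded.
```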


> I'm building a web browser and I'm trying to offer a feature for error cases when the server or URL isn't available anymore, so that users can see the web archived version of it.

Brave browser does that, and I've seen Firefox add-ons that do it. Maybe you can look at their code.


Thank you for your service!


Just wanted to say thank you for all the times you've let me continue researching when the original sources became dead!


Thank you!


>>” I wonder if there could be a world where the IA acts as a default host-of-record for startups, with a super-easy CDN relationship such that the content”

Doesn’t IA already partner with Cloudflare to do exactly what Carmack is suggesting?

https://blog.cloudflare.com/cloudflares-always-online-and-th...


You still need to provide a separate origin server (i.e. host-of-record) when using Always Online. AO is designed to use IA as a backstop when your origin server goes down.


The feature I most want from the Internet Archive is the ability to donate them an old domain name and enough cash to renew it for the next hundred years such that they can keep an archived version of a site available (without breaking any incoming links) for a very long time.

They would also need to be able to handle legal administration costs of things like DMCA take-down notices, but I assume they already have to deal with that for the rest of the archive so hopefully that's not an extra complexity for them.


It costs the Internet Archive $2/GB to host content in perpetuity. They have a tool, Archive-It, that will periodically crawl your site for archival purposes if you are not technical.

For my needs, I run a report monthly for the content I’ve archived using my IA account to determine archived GBs, and then donate the amount needed to cover those costs.

Consider reaching out to their patron services email address with any questions.

Edit: $2/GB citation: https://help.archive.org/hc/en-us/articles/360014755952-Arch...


> It costs the Internet Archive $2/GB to host content in perpetuity.

Do you have source/more info than that?

Let's say the Internet Archive is 100 PB [1]; that's 100,000,000 GB [2], and at that rate it comes out to $200 million [3] for the whole thing, forever. That's a lot of money, but also a lot less than I was expecting for something like that.

[1] https://www.protocol.com/internet-archive-preserving-future: "The web archive alone is about 45 petabytes — 4,500 terabytes — and the Internet Archive itself is about double that size (the group has other collections, like a huge database of educational films, music and even long-gone software programs)."

[2] https://www.google.com/search?q=100+petabytes+to+gb: "100 petabyte = 1e+8 gigabytes"

[3] https://www.google.com/search?q=1e%2B8+*+%242: "1e+8 * (US$ 2) = 200 million US$"


I love that you have a citation for 2 * 100 million = 200 million. I've seen papers where the use of "=" sweeps a lot of non-trivial equalities under the rug, but this is the first time I've seen something go the other way this far.

I suppose the claim is rather shocking and warrants citations - $200m to host the entire internet archive forever? I don't blame you for the excessive citation.


Assume $11.88/TB [1], 5W per 4TB disk, and $0.2/kWh. We’re talking $1.19mm NRC (non-recurring cost) + $18.3k MRC (monthly recurring cost).

With a 9% discount rate, that’s only $3.63 million in present value to pay for infinity months.

Of course there are other costs (and cheaper, more efficient disks; and cheaper power; and your discount rate might be less aggressive; and servers aren’t free, though you only need about 1 server per 100 disks with SAS expanders, since most data is never read; and maintenance), but the $200 million number you got seems very reasonable to me.

1: https://diskprices.com/
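The arithmetic above can be checked directly. A quick sketch (Python; the 730 hours/month figure and treating the 9% annual rate as 0.75%/month are both simplifying assumptions):

```python
TOTAL_TB = 100_000           # 100 PB of content
DISK_TB = 4                  # capacity per disk
DISK_PRICE_PER_TB = 11.88    # $/TB, from diskprices.com
DISK_WATTS = 5               # power draw per 4 TB disk
POWER_PRICE = 0.2            # $/kWh
MONTHLY_RATE = 0.09 / 12     # 9% annual discount rate, compounded monthly

capex = TOTAL_TB * DISK_PRICE_PER_TB                 # non-recurring: buy the disks
disks = TOTAL_TB // DISK_TB
kw = disks * DISK_WATTS / 1000                       # total draw in kW
monthly_power = kw * 730 * POWER_PRICE               # monthly recurring: electricity
present_value = capex + monthly_power / MONTHLY_RATE # PV of a perpetuity of MRC

# capex ~= $1.19M, monthly_power ~= $18.3k, present_value ~= $3.6M
```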


No humans? :)


Ah I was only thinking of marginal costs. I’m pretty sure the $2/GB figure is the cost of adding a little more data to the IA, not the total cost.


Re: [1] 45 PB = 45,000 TB


I'd pay twice that to host my static blog with IA. ;)

I guess archiving a static blog is already trivial for their system, but I'd pay $100 a year to have IA host my static blog. The overhead I would consider as a donation to a worthy cause.

Then again, I can just give them $100 a year and find some free static hosting, like GitHub Pages, and call it a day.


$2/GB in perpetuity is really cheap. They should write a paper about how they did it, if they haven't already.

Edit: I'm assuming they can deliver reliability and durability similar to modern cloud standards, like AWS S3.


I'd also like to see the rest of the assumptions baked in. Is there some trust fund involved -- which requires that we now consider the risk associated with how the $2 is invested to get an adequate return?


It's based on the assumption that storage costs per GB will continue to decrease for their definition of "perpetuity." If you assume a consistent fall in costs, the total cost becomes a convergent series [1] and there is an upper limit even with an infinite number of years.

Suffice it to say, physics is probably going to have a lot to say about that assumption in the coming decades.

[1] https://en.wikipedia.org/wiki/Convergent_series
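If the per-GB cost falls by a constant factor each year, the lifetime cost is a geometric series with a finite sum. A sketch (Python; the $0.40 first-year cost and 20%/year decline are illustrative assumptions chosen to land on the $2/GB figure, not IA's published model):

```python
def perpetual_cost(first_year_cost: float, annual_decline: float,
                   years: int = 1000) -> float:
    """Sum of a yearly cost that shrinks geometrically.
    As years -> infinity this converges to
    first_year_cost / annual_decline."""
    r = 1 - annual_decline
    return sum(first_year_cost * r**n for n in range(years))

# $0.40/GB in year one, declining 20%/year:
# the infinite sum converges to 0.40 / 0.20 = $2.00/GB.
total = perpetual_cost(0.40, 0.20)
```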


I too am a bit confused about that $2/GB for perpetuity. I'm assuming that's a loaded cost representing maintenance, backups, power. A TCO kind of number. Maybe that's a bad assumption?


Self built storage nodes and software for storage system, with an expectation that storage costs continue to decline per GB into the future.


Beowulf cluster


Is that $/GB per year?


Last I read, that was a one-time cost [1]. Of course, if you have the means, consider a bit more per GB.

[1] https://help.archive.org/hc/en-us/articles/360014755952-Arch...


A full-text search for the Wayback Machine would be my top feature request. It's not uncommon to lose the URL of a site and for active webpages to not have the URL of the old website. Plus I'm sure there are many interesting archived webpages I could find with a full-text search.

I understand they've tried this or things like it a few times but they haven't ever kept the feature.


I imagine that the costs to run this would outweigh the potential marketing benefits, but it'd be amazing to see Algolia take this project on to benefit everyone.

The Wayback Machine's data is ~20PB..? What is approximately the size of the indexable text (i.e. the text content of html pages, sans tags)? And what would the index size be like, approximately?

I imagine that creating (and maintaining, of course) the index would be the most time-consuming part? Is it at all possible to imagine hosting this index... somewhere... and doing sqlite http range-like queries on it..?

Would it be enough to have an index consist of a list of found words, and the related "document ids"? i.e. "apple" is in doc ids 1000, 2000, 3000, "banana" is in doc ids 2000, 4000, etc.?

And have separate docid -> archive.org url mapping?
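The word → document-id mapping described above is exactly an inverted index. A toy sketch (Python; real engines add term positions, index compression, and ranking on top of this):

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercase word to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1000: "apple pie", 2000: "apple banana", 4000: "banana split"}
index = build_index(docs)
# index["apple"] -> {1000, 2000}; index["banana"] -> {2000, 4000}
# A separate docid -> archive.org URL table then resolves ids to captures.
```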


> Would it be enough to have an index consist of a list of found words, and the related "document ids"? i.e. "apple" is in doc ids 1000, 2000, 3000, "banana" is in doc ids 2000, 4000, etc.?

The problem, as with any search engine, is the ranking algorithm. Without a sufficient one, the search results are useless. What use is a list of every page in the Wayback Machine containing the word "apple"?

The Wayback Machine possibly would need a much larger index than any normal search engine: not only the present websites, but all the historical versions (though I don't know what proportion of the web they've indexed).


No you are right of course. I was imagining that a full text Wayback Machine search engine would mostly be useful to look for words unique enough that sifting through a lot of results (even if those were not ranked "well") could still be useful..?

I was very naively "back of the envelope" prototyping a search engine. I realize that this is not the way these things are built, but I would really like to have that (or any) search engine to look through the archive..! :-)

As for the index, I agree -- I was trying to guesstimate its size, starting from the ~20PB total Wayback Machine size (which includes all historical versions). Is it 1% of 20PB (for the size of the text content), and then another ~10% of that for the index size? So 20TB...?
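Spelling out that guesstimate (both the 1% text fraction and the 10% index overhead are back-of-the-envelope assumptions, not measured figures):

```python
PB = 10**15
total_bytes = 20 * PB             # Wayback Machine, all versions
text_bytes = total_bytes * 0.01   # assume ~1% is extractable text
index_bytes = text_bytes * 0.10   # assume index is ~10% of the text
# index_bytes ~= 2e13 bytes, i.e. about 20 TB
```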


I may have given the wrong impression: it sounds like a good idea to me.

> I was imagining that a full text Wayback Machine search engine would mostly be useful to look for words unique enough that sifting through a lot of results (even if those were not ranked "well") could still be useful..?

If we think about use cases, users may often search specific domains. In that case, results ordered by frequency and/or date might be sufficient and even desirable.

> I was very naively "back of the envelope" prototyping a search engine. I both realize that this is not the way that these things are built

It's often the first step!

> So 20TB...?

That doesn't sound so bad.


I agree & appreciate your response.

I have 0 time for this, but I also can't easily let go ;-)

Want to collab on this?


Wow, great offer and a great project. I also have 0 time and I'm dealing with extra obligations anyway. I'm afraid I need to be disciplined to keep my primary life goals - things that take a decade or more - on course. I am really thrilled by the energy you are showing, though, and hate to contribute anything negative. Please don't stop for me!

You made my day.


Three cheers, thank you for the nice exchange! Happy upcoming Holidays. I'll (obviously) post on HN if anything comes out of this.


I'd think you'd want something with at least the features of Lucene, whose index is 20%-30% the size of the data.

https://lucene.apache.org/core/features.html


A far cheaper solution is a browser extension that looks up DNS differently based on the age of the link.

It wouldn't be hard to maintain a hand-crafted database of when domains are reused for something completely different, or even when the same conceptual website has breakages, and use that to choose between the Internet Archive or the live web accordingly. When one is browsing from an Internet Archive page, the date is known; when someone is browsing from a live website, heuristics can be used, along with "bisecting" dates when the link is dead.

Ultimately we want more content addressing to avoid this problem entirely (see below), or DNS -> PubKey, PubKey -> latest content, with some law that pubkeys shall not be reused for unrelated things, vs. DNS, which is mere ephemeral Huffman encoding. So see below for the stuff on IPFS. But the trick above is a good stop-gap, and indeed the database itself used to back the extension could be on IPFS.
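The "bisecting dates when the link is dead" step is a binary search over the snapshot timeline. A hypothetical sketch (Python; `is_good` stands in for whatever liveness or similarity check the extension would use, and the one-boundary assumption, good snapshots before bad, is the simplification that makes bisection valid):

```python
from typing import Callable, Optional

def last_good_snapshot(timestamps: list[str],
                       is_good: Callable[[str], bool]) -> Optional[str]:
    """Binary-search a sorted list of snapshot timestamps for the
    last one where is_good(ts) holds, assuming all good snapshots
    precede the bad ones (e.g. before a domain was repurposed)."""
    lo, hi, best = 0, len(timestamps) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_good(timestamps[mid]):
            best = timestamps[mid]   # candidate; look later
            lo = mid + 1
        else:
            hi = mid - 1             # too late; look earlier
    return best

snaps = ["20150101", "20170601", "20190301", "20210101"]
# Suppose the domain changed hands at the start of 2020:
best = last_good_snapshot(snaps, lambda ts: ts < "20200101")
# best == "20190301"
```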


Trying to figure out the age of links by referencing a hand-crafted database (or trying to figure out automatically if the current version is "too different" based on age) using a browser extension only creates an unreliable solution for the few using the extension. Cheaper, sure, but it's also doing a whole lot less.

Alternative content-addressed systems may also work better for finding the content than someone hosting your DNS records for a long time, but the bulk of the problem space is guaranteeing that hosting stays active, and viewable to readers, for many years; it isn't in addressing the content. On top of hosting and addressing, the Internet Archive offers the ability to view old content on modern browsers even if modern browsers no longer support such content (or never supported it at all). Forward compatibility isn't something solved by a protocol.


> using a browser extension only serves to create an unreliable solution for a few using the extension. Cheaper sure but it's also doing a whole lot less.

This is rather pessimistic thinking. The same money that goes into buying up domains could go into lobbying browsers to add this functionality by default.

> bulk of the problem space is in guaranteeing active hosting in a way viewable to viewers of the age will be available for many years not addressing the content.

You're moving the goalposts. I am not saying "IPFS means we don't need the Internet Archive". We absolutely do need the Internet Archive. Content addressing helps by making archiving transparent, so the archival copy is no worse than the original.

Fundamentally, consumers, producers, or archivists may be the party most interested in the continued existence of some information at different moments in its lifespan. Location-based addressing forces the producers to shoulder the burden of hosting, but content-based addressing allows the work to be distributed among those three however we see fit. Of course the burden must still be borne! That doesn't mean the flexibility isn't extremely useful.

> On top of hosting and addressing the Internet Archive offers the ability to view old content on modern browsers even if modern browsers have 0 support for such content anymore (or if browsers ever had support at all even). Forward compatibility isn't something solved by a protocol.

Yeah that's great too, and again not something I am arguing is not good, or not necessary.


I made something very similar, though instead of age, it enforces page authorship for links (using PGP signatures). https://webverify.jahed.dev/


I agree. I'm a passionate photographer, and I would pay good money to know that my pictures could be seen long after my death. Maybe startups exist that do this, but they will die; I need something with enough critical mass that I can trust it.


Perhaps have a look at Arweave (https://www.arweave.org).


That doesn't address the concern re: something needing critical mass to increase its chance of survival over a longer term.

Really, the way I see it, outside of a few large banking firms it's kind of hard to be sure any provider of digital services will be around in the 50+ year term for this kind of public archive.

I hope the Internet Archive manages it.

EDIT: I do worry the IA has a bit of a lightning-rod effect, skirting issues re: the legality of archiving content. IMO it's no guarantee it survives any significant time span either.


> Really the way I see it outside a few large banking firms, its kind of hard to be sure any provider of digital services would be around in the 50+ year term for this kind of public archive.

A library could do it. Perhaps leading institutions like the British Library or Library of Congress. I've thought that IA should be a Library of Congress project, and may eventually end up under their auspices.


Why not use MaidSAFE or IPFS for that?


With regards to IPFS, you'd need to find a pinning service that offers long-term contracts. Moreover, I have more confidence in the Internet Archive in being around in 10, 20, 30 years compared to any existing pinning service. That's not meant to be a slight against said pinning services, just that the IA is well established in their role.


Check out Freenet

They drop the stuff that is accessed the least, so if you want to keep something online you have to pay someone to keep accessing it. Makes sense, but I'd argue that it should be a market, meaning the price should go up as more people try to access something, but anyone can start to seed popular content and collect revenue for hosting it (this is better than wasting electricity on accessing stuff or doing proof of work).

But isn't FileCoin exactly that for IPFS?

MaidSAFE goes a step further and has nodes rebalance autonomously and earn the most safecoin, as something gets more popular it gets seeded more.


That doesn’t solve the ‘everyone you know is dead’ problem.


> The feature I most want from the Internet Archive

The feature I most want from IA is a streamlined system to delete content they have archived on domains that I own, including a proper privacy law compliance effort on their part. They have intentionally made it a difficult, manual process to get content removed. They operate as a de facto malicious crawler.

They massively violate GDPR with how they operate and few seem to care about that fact, including all the commenters on HN (which universally give them a free pass on being malicious and violating GDPR very aggressively).

When IA has to comply with laws like GDPR, that's the end of IA.


Shouldn't it be hard to delete things from a library and historical-archive?

If you had to choose between the GDPR and an accurate historical record, which would you prefer?


We won't. Pretty much the only way things leave archives of record (which is what IA is trying to be) are through Acts of God (if the archive burns down/all of IA's servers are taken out in a mass alien EMP attack).

An author suggesting that the LoC remove their copy of a book/other work (including digital works) because they want to unpublish it would not fly.

The parent comment has an issue with anything on the Web not hidden in some way being considered 'public' and 'published', but that would be something that would require international cooperation to hash out.


I can't speak to GDPR specifically because I'm not European, but a fair number of laws have leeway for preservation purposes. (For example, section 108 of the Copyright Act in the US functionally exempts archives from being punished for copying, provided they are doing so for preservation purposes.)

There are very good reasons that archives will not destroy or alter information outside of very clear, difficult, and manual processes.

And actually, looking at it, I don't think they're necessarily in violation of GDPR [0].

Point 3 says: "Where personal data are processed for archiving purposes in the public interest, Union or Member State law may provide for derogations from the rights referred to in Articles 15, 16, 18, 19, 20 and 21 subject to the conditions and safeguards referred to in paragraph 1 of this Article in so far as such rights are likely to render impossible or seriously impair the achievement of the specific purposes, and such derogations are necessary for the fulfilment of those purposes."

According to GDPR, national law of EU parties overrules GDPR when it comes to personal data being used in archival context. I don't know every EU country's stance, but most of the bigger economies would allow for this.

There is also a difference between deleting the data and rendering it inaccessible to the public. Keeping something under wraps is generally more 'acceptable', but active destruction of the item (digital or not) and its provenance is much more limited. Also, there's a difference between personally identifying data (covered by GDPR), your content (which would be covered under copyright, not GDPR), and connections people can make if that content is available (not covered at all, because it's not anybody else's issue if you write something terrible and people keep recognizing you over it, so long as you did actually write it).

[0] https://gdpr-info.eu/art-89-gdpr/


> When IA has to comply with laws like GDPR, that's the end of IA.

Will you be happy when you've burned down that library?


https://web.archive.org/web/20200813235643/http://slawsonand...

> Article 3(2), a new feature of the GDPR, creates extraterritorial jurisdiction over companies that have nothing but an internet presence in the EU and offer goods or services to EU residents[1]. While the GDPR requires these companies[2] to follow its data processing rules, it leaves the question of enforcement unanswered. Regulations that cannot be enforced do little to protect the personal data of EU citizens.

> This article discusses how U.S. law affects the enforcement of Article 3(2). In reality, enforcing the GDPR on U.S. companies may be almost impossible. First, the U.S. prohibits enforcing of foreign-country fines. Thus, the EU enforcement power of fines for noncompliance is negligible. Second, enforcing the GDPR through the designated representative can be easily circumvented. Finally, a private lawsuit brought by in the EU may be impossible to enforce under U.S. law.

[snip]

> Currently, there is a hole in the GDPR wall that protects European Union personal data. Even with extraterritorial jurisdiction over U.S. companies with only an internet presence in the EU, the GDPR gives little in the way of tools to enforce it. Fines from supervisory authorities would be stopped by the prohibition on enforcing foreign fines. The company can evade enforcement through a representative simply by not designating one. Finally, private actions may be stalled on issues of personal jurisdiction. If a U.S. company completely disregards the GDPR while targeting customers in the EU, it can use the personal data of EU citizens without much fear of the consequences. While the extraterritorial jurisdiction created by Article 3(2) may have seemed like a good way to solve the problem of foreign companies who do not have a physical presence in the EU, it turns out to be practically useless.


I don't understand why Carmack thinks blockchain should be a component of this. Anyone care to elaborate on how that would make this easier/better?


I don't know why he suggests this, but while I would like to read some old content that now cannot be found, I am also respectful of people who don't want their content on the internet anymore.

So every time someone suggests putting content on a blockchain, I wonder if they realize that there are people who want to erase/remove their content from the internet. I also think it is dangerous to keep everything someone or some company created on the internet. It is too easy now to internet-judge an adult for things they did while young, or to hold people accountable for mistakes for which they have already paid their debt to society.

I think if we ever build this feature on a blockchain, I hope it is opt-in and people realize what that means.


I think he's referring to something like IPFS.

https://en.wikipedia.org/wiki/InterPlanetary_File_System

http://ipfs.io

You can put the storage costs on the nodes because storage at archive.org's scale adds up, especially when it's run by volunteers.
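The core idea behind IPFS-style content addressing fits in a few lines: the address is derived from the bytes themselves, so any node holding those bytes can serve them, and an archived copy is indistinguishable from the original. A simplified sketch (Python; real IPFS CIDs use multihash/multibase encodings and chunked Merkle DAGs, not a bare hex digest):

```python
import hashlib

def content_address(data: bytes) -> str:
    """Address content by the hash of its bytes (simplified CID)."""
    return hashlib.sha256(data).hexdigest()

page = b"<html>my old website</html>"
addr = content_address(page)

# Any node with the bytes can serve them, and a fetcher can verify
# what it received against the address it asked for:
assert content_address(page) == addr          # same bytes, same address
assert content_address(b"tampered") != addr   # changed bytes, new address
```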


It looks like you could use IPFS to accomplish this without using a blockchain.


Isn't IPFS pretty closely tied with FileCoin?


It seems more accurate to say Filecoin is tied to IPFS. IPFS itself is just another protocol. Maybe it is better suited to blockchain applications than https? But it doesn't require a blockchain at all to function. Filecoin does require a blockchain.


"Immutable and existing in perpetuity" are good qualities for an archive service, and that's at least the idea with a blockchain.


It is interesting, though: what happens if you put this much data into a blockchain? I guess only a couple of nodes would want to verify the validity of the chain (because you need all the data to do it). And would those nodes really be more likely to keep the data than the situation we are in now?

I guess after some time the nodes would agree on the hash and throw away the data because it would cost too much to store.


I stopped reading when blockchain was mentioned.


I thought it was otherwise a reasonable idea, but yes-- it put me off a bit when he mentioned blockchain without further elaboration.

I see blockchain as a technology that may develop useful applications, but-- in terms of current day usage-- I'm extremely skeptical when it's referenced in conjunction with applications that might achieve the same goals without it.


I sincerely hope that we’re only witnessing

“Any sufficiently long Internet discussion will propose blockchain as a solution.”

rather than

“Blockchain is eating the world.”


Why?


because it shows a lack of understanding of the basics of distributed computing, especially on top of the web we have today (which is how the thread started, "IA as a default host-of-record", which implies said records must be reachable by any tech-illiterate lawyer today)

car analogy time: it is the same as reading a post about "how to lift my car to do work in the garage", and the second paragraph starts with "using energy harvested from my perpetual motion machine"


.... He says it right after "to make internet applications that could outlive companies". If it's on the blockchain, it doesn't matter if the company storing all of the archives shuts down; the content would still exist, forever, as long as there is a network running the chain. I suppose something like BitTorrent could be used?


He has a point, but it's that private companies shouldn't be archives of record.

I actually think using blockchain for things like ensuring provenance is interesting, since in archives being able to have a clean record of what happened to a piece is VERY useful. It just won't earn a ton of money, so we'll need to wait for the capitalism to burn off to see more not-for-profit uses.


Similar to how GitHub is a blockchain. I think he means the ease of version control by this.


Github is a software development tooling provider, not a blockchain


In his defense, he probably meant Git and typed too quickly. It's still obvious what he meant.


Honestly, Git was not at all obvious to me from that. And I fully admit that it could be a failing on my part not to read that into his post, but nonetheless I didn't see it.


Filecoin is a mechanism for someone to pay to ensure that data stays available. So if a group of people wanted to ensure that myoldwebsite.com stays available through IPFS and the IA, they could spend Filecoin. The difference between it and paying some central provider with fiat is that anyone can provide the data availability. So if you don't trust IA to be the long-term provider of content or want a more decentralized provider, Filecoin lets you do that. See also Arweave.


Correct me if I'm wrong, but isn't this problem the ideal use case for projects like IPFS? Anyone interested in preserving the content can join as a node to balance the load, right? And if so, why don't we see widespread adoption?


It is -- and Brewster Kahle and the Archive have been thinking about this for a long while (see this talk from him five years ago: https://archive.org/details/LockingTheWebOpen_2016 ). The model you can think of for this would be to have the Archive as the "node of last resort" of content-addressable storage, making sure there's always one node up with the content you want.

The incentive challenges are making sure that the average number of nodes is more than one, because, as Brewster likes to say, "libraries burn; it's what they do", plus all the traditional challenges of maintaining a commons at high levels of resilience. Once you have data on a network like IPFS, we can use a number of incentive models to make sure it stays there, including charitable projects like the Archive, government support (archives are traditionally state projects -- if every country's archive was pinning this content, it would be far more resilient), and decentralized incentive frameworks like Filecoin.

(Disclosure: I work for the Filecoin Foundation; in our decentralized preservation work, we've funded the Internet Archive's work in this area, though I should emphasise that IA works with a lot of different decentralizing technologies through their https://getdweb.net/ community.)


I was involved with planning https://nlnet.nl/project/SoftwareHeritage-P2P/ for just this reason --- hopefully we will finally be able to start work on it sometime not too far off.

Indeed, the real challenge of archival is not losing the stuff, but making sure that people can still find the stuff. "Orphaned" information that no one knows exists, or is bothering to interact with, isn't that valuable compared to resources that are actively being used and still "live" in the culture.

Of course, the archive can never serve the same amount of bandwidth, but the goal is a) interested parties can mirror the stuff they care about in a higher-bandwidth way after some huge disruption, and b) random viewers never notice something going down, nor who is serving the info, but just a temporary drop in connection quality.

Ultimately, location-based addressing is a stupid way to run society, needlessly fragile by baking in property claims (IPs, DNS, etc.) that are incidental to the task at hand. Content-based addressing, with location-based hints to avoid trying to solve really hard problems all at once, is the only way to make culture more robust.
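The difference can be sketched in a few lines (a toy illustration in Python; real IPFS CIDs wrap the hash in multihash/multibase encoding, so this shows the principle, not the actual format):

```python
import hashlib

def content_address(data: bytes) -> str:
    # Name content by a digest of its bytes, not by where it lives.
    # (Real IPFS CIDs are more elaborate; this is a toy.)
    return "sha256-" + hashlib.sha256(data).hexdigest()

page = b"<html><body>my old homepage</body></html>"
addr = content_address(page)

# Any node holding the bytes can answer a request for `addr`;
# the name stays valid even if the original host disappears.
node_a = {addr: page}
node_b = {addr: page}
assert node_a[addr] == node_b[addr] == page
```

The location hint ("ask node_a or node_b") can rot freely; the address itself never does.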


The great thing about location-based addressing is that an archive of the set of known locations is not subject to the same ownership rules as the canonical live version of those addresses. A document listing all Geocities URLs can be placed in content-addressed storage without needing geocities.com to be owned by the party that emplaces that document. And a chain can be maintained such that people are incentivized to remember that document into the far future. Coupled with archival of the actual content, you bypass the exclusivity of domain ownership.

Of course, ensuring that there's persistence of attention as well is a tougher problem. But one only needs to look at sites like https://reddit.com/r/tumblr to realize that there is immense societal interest in "meme archaeology." Reducing the barriers to entry to would-be archaeologists, giving them a "chain" of breadcrumbs that lead to content, and building communities that will socially reward people for their archaeology work, is the best thing we can possibly do.


> The great thing about location-based addressing is that an archive of the set of known locations is not subject to the same ownership rules as the canonical live version of those addresses.

Erm, to me this sounds like putting up with link rot as a hack around bad IP law? There are already IP exceptions for preservation. And if content-addressing were the norm, geocities-type sites might bow to market pressure to not "own" the content, but merely have some sort of license for being the exclusive pinning service and running the ads or whatever. This is like the problem where the rent on your current apartment doesn't fall as much as the market writ large because your landlord knows moving is not free.


IPFS only does addressability, it doesn't provide storage. You could use a decentralized storage network like Arweave, Filecoin, or Sia.

https://www.arweave.org/

https://www.filecoin.com/

http://sia.tech/


Or "centralized" ones like Fleek, Textile, Pinata, etc.

https://fleek.co/hosting/

https://docs.textile.io/buckets/

https://www.pinata.cloud/


IPFS does the opposite, right? It doesn't guarantee the archive is available, which is what Carmack is asking for. Incentives to scale bandwidth with need already exist as long as you have the data at all.

That is to say, IPFS doesn't help if the desire blooms after the nodes dry up. Things could still be lost.


The point of IPFS is not to keep the data archived, but to allow users to not care who does the archiving.

Concretely, this would be to skip the "many people on encountering a dead URL don't bother to try the internet archive" problem.


There are many projects that go open source when they fail. The extra step here is that IA would become the A record for the project (at least temporarily?).


Why is there only one IA?

Why is IA not globally distributed, like a CDN?

I use IA for "problem" websites, e.g., ones that rely on SNI, i.e., ones hosted at certain CDNs. I simply add these sites to a list and the local proxy does the rest.

      http-request set-uri https://web.archive.org/web/1if_/http://%[req.hdr(host)]%[pathq] if { hdr(host) -m str -f list }
IA "hosts" an enormous number of sites without the need for SNI (plaintext hostnames sent over the wire).

EDIT: @sebow the way they (re)format the HTML is less friendly to the text-only browser I use.


What’s your issue with SNI/threat model? If you use a non-SNI site, anyone can tell which site you are visiting since there’s only one domain on that IP.


Seconding that curiosity. Pretty much every web server I've built uses SNI (at least if it's hosting sites under multiple domains), and the only "downside" of which I'm aware is the lack of IE6 support.


There is a difference between making something "impossible" and making something "easier". Performing reverse DNS lookups, or otherwise trying to maintain a global table of 1:1 domain:IP mappings and perform lookups in real-time, is nowhere near as easy nor reliable as sniffing SNI. IME, it is neither easy nor reliable, nor worth the effort. SNI is the preferred method. SNI is easier. SNI is 100% reliable for detecting what hostname the user is trying to access.

What is the point of so-called "DNS privacy/Private DNS" if "anyone can tell which site you are visiting" simply by observing IP addresses, without any need to see domain names?

If SNI (plaintext hostnames sent over the wire) is a non-issue, then why are people working on Encrypted Client Hello in TLS 1.3?


> 1if_

This is a neat shortcut to simply get the very first archived version! I often have to go to /*/ and manually click on one of them, which is very tiring.

Is there one to get the latest?


There might be, although I would not be surprised if it was slower than the first-capture shortcut, because the link to the latest is dynamic. A two-step "shortcut" is to use Lua with the proxy to retrieve the link to the latest, then follow that link.

To get the link to the latest, one can use Memento. For example,

   usage: echo example.com | 1.sh

   #! /bin/sh
   # The CDX API with limit=-1 returns the most recent capture as
   # "TIMESTAMP ORIGINAL-URL"; tr turns the space (octal \40) into "/"
   # so the result forms a /web/TIMESTAMP/URL path, which is then fetched.
   read x
   curl -A "" "https://web.archive.org/web/$(curl -A "" -s "https://web.archive.org/cdx/search/cdx?url=$x&fl=timestamp,original&limit=-1" | tr '\40' /)"


The Tor network might be a good alternative to IA for hiding SNI (and also hiding destination IP addresses).


archive.today/.ph/etc. is vastly superior for the intended purpose here: archiving (given that you have the URL, obviously).

IA is more of a curated internet archive + explorer (which, granted, is very good).


I think it's interesting to think about what we have lost because we couldn't keep everything from a 100 years ago and what society 100 years from now will be grateful we preserved.

Off the top of my head, we lost a lot of common wisdom in dealing with the flu pandemic of 1918 because personal letters and most newspapers were not preserved. I think 100 years from now they might wish we had preserved more from marginal and/or world communities. What folk wisdom is being lost? Perhaps we need to expand our definition of what is worth saving.


An excellent place to start is talking to your parents and grandparents and recording their history and stories online.


That's a good suggestion I've been following


Hard to know what will be of interest for future historians. Some things in which we place great value can be considered irrelevant, while some of our junk can become historical gold.


> some of our junk can become historical gold

That's what interests me. For example, there's a cool repository of 12 step speaker meeting talks hosted in Iceland [1] and frankly, some of the talks are junk, but there's a lot of wisdom. What I find interesting is how it showcases how ordinary citizens talk to each other. The words they use, the accents, the little gems of folk wisdom contained, along with some uncommon stories.

This will be valuable 100 years from now if, for example, you want to build a virtual world based in the mid to late 20th century and you want to get the accents correct. What phrases did people use? What were some common misconceptions? Maybe 100 years from now addiction will no longer be a problem. If my virtual world is to be accurate I need to know what it was like for ordinary people when it was a problem. Etc...

[1] https://xa-speakers.org/


That's a great example, thanks. Funny enough, I got that insight from Bill & Ted's Excellent Adventure. In the end, the most important thing in the future was some 80s rock song.


The mundane of today is very insightful for tomorrow's historians.

It's fascinating when you start looking into any historical time period (you wouldn't even need to go far back); before long, a lot of details are educated guesses, since no one chose to record the mundane in detail or it failed to survive.


> The mundane of today is very insightful for tomorrow's historians.

To add an example I know well.

I grew up on a small farm.

We have plenty of images of Christmas parties etc, but almost none showing actual work being done which is what I think my kids would appreciate the most.

Luckily YouTube, for all its warts, exists, and I can look up the "motorized tea spoon", the U-9 Motostandard, for them when I need to explain it: https://www.youtube.com/results?search_query=motostandard+u9...

(We had the one with front-mounted wagon and a stick for steering. And yes we had another slightly larger tractor as well, the AEBI Transporter TP50: )


I'd love it if the Wayback Machine were less touchy/more reliable.

The failure mode I see very often is that the frontend apparently doesn't know what the backend's doing: The part which ingests URLs and tells you what URLs have been archived does not know what archives the backend has, so it will tell you a page has been archived and give you a link to the archive, but when you click the link, it tells you it does not have the page archived, oh, look, it exists online, would you like to archive it now? Archive it again, and it will tell you that you can only archive a page once every 45 minutes. If you're a weird little obsessive like myself, you go through this process a half-dozen times for one page before it acknowledges that, yes, it does have the page archived (once, mind you) and you can actually see it.
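(For scripting around this, the Wayback Machine does expose a public Availability API that reports the closest snapshot it believes it has, which makes the frontend/backend mismatch easy to demonstrate. The endpoint is real; the helper names below are mine -- a minimal sketch:)

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_url(page: str) -> str:
    # Build the query for the Availability API.
    return API + "?" + urllib.parse.urlencode({"url": page})

def closest_snapshot(page: str):
    """Return the URL of the closest snapshot IA claims to have, or None."""
    with urllib.request.urlopen(availability_url(page), timeout=30) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# print(closest_snapshot("example.com"))  # network call, rate-limited
```

Comparing what this returns against what the playback frontend actually serves is a cheap way to reproduce the "it's archived / no it isn't" loop described above.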

While I'm filing bitch reports...

The Wayback Machine apparently loves setting cookies. It will set cookies until it has exceeded its own ability to accept cookies, at which point it will give you a blank page and you have to look in the developer console to figure out that it sent you a "too many cookies" error in the response header. I've had to force my browsers to not accept any cookies from the Internet Archive to fix this.


I wrote a little about this and the alternatives here, especially for sites with user-generated content which may be impossible to find/regenerate from elsewhere.

I call it “Beating the Samson Option”, of pulling the temple down upon your head.

https://blog.eutopian.io/beating-the-samson-option/


Personally, I think eternally archiving everything and infinitely available public data has been not-so-great. If this was an "archive with consent" sort of system, then sure. My response may be better summarized as, "Does IA support robots.txt, and if not why?"


Public data is... public. No one should be stopped from saving a public page, and nothing should stop the Internet Archive, be it robot or human. There is of course a need for removal of archived content infringing on someone's rights, whatever that might be, but "archive with consent" will fail for the goal of preserving culture. I think it's worrying that some online newspapers have enacted archive blockers, or that IA needs DMCA exemptions just so companies can't DMCA anything with their name on it. To preserve journalistic integrity and to save culture, even if it collides with intellectual property rights, "archive with consent" won't cut it.


Content posted on a web site is NOT “public” (domain), it is (in the US) automatically copyrighted to the author, unless they specifically waive those rights. Just because you can see it through a browser doesn’t in any way mean you can make it yours and do what you want with it.


It doesn't matter.

Archives are exempt from being forbidden to create copies due to copyright infringement. The Library of Congress can make all the copies it wants, it just can't SELL them.

Now, there is a question whether a private company should legally be able to BE an archive of record, but as of now there's no legal reason they can't be, I believe. So it's legal.


Absolutely true. "unless they specifically waive those rights" - if archiving entailed contacting the owner with a legal archive request, we would have archived basically nothing. Luckily there are exceptions for Internet Archive in place. My point is, if "by consent" was the requirement to archive information, we would have archived nothing.


We store books in public libraries even if they aren't public domain and authors can do nothing to prevent them from doing so.


First sale doctrine in the US. If I buy a physical book, I can give it, loan it, throw it in the trash, etc. This doesn't apply to making electronic copies--see what happened with Google Books for example.


If you buy the book, you have authorization to do so. But you can't distribute copies of it.


You do if you're an archive, actually.


Throughout human history, records have been forgotten, rewritten, changed, mutated, degraded, eroded away to nothingness. "The internet is forever" has always struck me as inhumane. Make a mistake or expose a weakness on the internet and it will always accompany you.

It turns out that the internet is not always forever. I find that comforting.


Depends on how you think of "forever". If you post an embarrassing video and someone saves and reposts it with your name attached, odds are that video isn't going to be around in 200 years. But what about the next 10, 20 or 40 years? In the context of your overall professional adult life, that's a long time.

"The Internet is forever" isn't some natural law by which all content abides as though it can never disappear. It's a warning that you don't control the content once it's accessible on the Internet.


I think there's inhumanity of a kind on both sides of this question. "Everything you've done will be forgotten, and no one will remember your name" is the sort of thing the bad guys say in movies. But that's what happens to most of us in the end. I think it's natural not to want that.


Maybe humans should become more accommodating of past mistakes.


Suggest that to them. I'm sure they'll get right on it.


You're consenting by posting it on public internet in the first place.


Posting something on the public internet is not consent for you to scrape it and post it on your own site forever.

And requiring an explicit opt-in would basically mean no IA.

To be clear, the IA is a positive, maybe even a great one. But it skirts by because most people don't care. (They did, as you say, post whatever on the public internet.) Add the facts that they're a non-profit, aren't trying to monetize their hosting, and will generally take things down if the owner asks.

Libraries and other archives have some very limited special rights (which mostly relate to making physical backups of physical books). But invoking "library" isn't some general get out of jail free card with respect to copyright.


It can be. The reason libraries and other archives have special rights is because they fought for them against the express wishes of people who sold paper. There are no arguments made against archive.org that weren't also made against libraries.


Thank you.

These rights are also under constant attack: It's normal to charge libraries exorbitant prices for digital materials compared to their analogue counterparts, for example.


> Posting something on the public internet is not consent for you to scrape it and post it on your own site forever.

It effectively is. Your consent is not required, and people are doing far worse than just keeping it available (Clearview; there are also reports of people hoovering up encrypted data to crack in the coming decades when we're post-quantum).

This is no different than demanding people not keep track of anything else, and attacking archive.org might make you feel better, but that won't make anyone else stop.


People have recorded others without their consent for millennia. Whether it's telling a story about what someone said, a photo, or now a screenshot of a twitter post, that is reality. You'll never be able to stop someone from telling another that you said X.


I strongly believe that there's no freedom of speech without the freedom to replicate that speech.


Do you think most people who post publicly on the internet would agree if asked? I think most people would like to have a choice to make old stuff disappear. If most people think so, THAT should be the rule. You might not like that and argue that there is no way to enforce it, but that does not mean it is a good rule to assume consent.


There are two things:

1.) Most people won't opt-in because a significant majority accept defaults and don't opt into most things.

2.) For people like yourself probing a bit deeper, you might well ask whether you really want to give up your ability to decide you don't want something you thought was so funny when you wrote it at 20 now that you're a politician running for office or up for a political appointment.


I mean even honoring the opt-out of robots.txt would be fantastic. As another commenter pointed out they willfully ignore it: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

It's fairly unethical.


How about this? Archives will still exist regardless of consent. Does that make them digital rapists?


Let's say someone posted a private video and some zero-day made it public; what's the take on this scenario? Somehow robots crawled it.


There is some public discussion about why IA does not strictly adhere to robots.txt:

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
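For contrast, honoring robots.txt takes only a few lines in a crawler; a minimal sketch using Python's stdlib urllib.robotparser (the policy text is illustrative, though "ia_archiver" is the user-agent string IA has historically crawled with):

```python
import urllib.robotparser

# Illustrative policy a site might publish to opt out of archiving
# while allowing everyone else.
policy = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(policy.splitlines())

# A crawler that honors the policy simply skips disallowed URLs.
assert not rp.can_fetch("ia_archiver", "https://example.com/page")
assert rp.can_fetch("some-other-bot", "https://example.com/page")
```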


They're basically saying they're choosing to ignore a web convention that explicitly states that people don't want their websites archived or searchable, simply because they want them to be. Sounds pretty unethical to me.


When those people die and quit paying for hosting, the information on their website doesn't magically become useless.

Perhaps other people still have a need for it.

Thank goodness the IA doesn't blindly obey robots.txt.


There are clearly some things we, and the author, want to persist, and yet the internet fails to do so.

Trying to muddle in privacy concerns with the accumulation of public knowledge undermines the whole concept of a shared society, or there being even the potential of accumulating "progress" in the first place.

Also, historically speaking, people with means used to save their letters for posterity, which proved to be a very valuable resource for future academics, so the idea of, what, deleting all your proton mails and signal messages as encouraged is arguably overshooting the return to some pre-internet norm.


> Also, historically speaking, people with means used to save their letters for posterity, which proved to be a very valuable resource for future academics

Your example is an example of choice or consent. They also had the option to burn their hand written books and scrolls down periodically. Systems like IA take that choice away.


There is no reason to do personal communication with static-ish websites. Don't foist the problems of social media onto the Internet Archive.


I'm not sure I understand. robots.txt was a matter of consent and was standardized well over twenty years ago. The IA willfully ignores it because they believe it interferes with their mission. This was long before social media, when static websites were more dominant than dynamic ones.

Source: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


This is a fair philosophical quibble, but the tiny amount of harm done in the real world makes this a waste of time to discuss.

You made it public on the damn web, people doing what they wanted with it is fair game per the original ethos of the web. If you wanted it to be private, you should have some auth or encryption.

The actual real-world harm is in actually-commonly-used means of personal communication, where network effects preclude people from carrying out their business with technology appropriate to desired privacy levels. Chatting on my custom website was always niche, and therefore the network-effects argument doesn't carry water.

We should refrain from worrying about robots.txt minutiae until the elephant in the room is put to rest.


Issues are not single track. I can rightfully claim that IA is founded on unethical behavior while I can also say other things in the world are bad.


"You shouldn't have privacy because it undermines the concept of a shared society, and also future historians might find your life interesting."

Thanks.


Ridiculous strawman.


They do respect robots.txt, as a matter of fact: they will remove the website outright, and most of its archives will be hidden until that website's robots.txt is offline.

Their documentation about it is rather crap right now, but it's in several of their FAQs.


Respecting robots.txt would mean not saving pages blocked by robots.txt. Not just hiding.


I'm old enough to remember when the host-of-record for failed startups was fuckedcompany.com ...

I do wonder how many startups actually want to be archived, rather than just ditch everything with unseemly speed as soon as they get acquishutdown.


There are huge, valuable lessons to be had in failure. Through some coordination they could perhaps get compensation for sharing, although this goes against it all being open and free.


An interesting use case for IA is to go back to the early days when a startup first launched, to see its pitch before it raised real money and added tons of nonsensical bs to its homepage.

eg.

http://web.archive.org/web/20180901110658/https://www.snowfl...

http://web.archive.org/web/20140701061721/https://databricks...


There's nothing about his suggestion that requires use of a Blockchain. IA could certify authenticity easily without it.


The idea is more interesting when you think about scrapers. Take any ecommerce website: there are several scrapers that download all pages every hour. It would be more efficient if a provider had a live copy of the website and then served the requests to the scrapers, or could even send webhooks.

A website could handle tons of scrappers without having high bandwidth, only the provider will need high bandwidth.

The issue is that scrapers often play with cookies and dynamic websites, and such a solution wouldn't work in those cases.


(in the nicest way and since your post is recent, please edit s/scrappers/scrapers! It very much changed how I read it first time around -- I thought you were referring to a type of failed startup!)


Thanks, I need to improve my english haha


Scrapers. A scrAper is the thing you use to remove ice from windscreens or scrape data from websites. A scraPPer is someone who likes getting into fights or collecting scrap metal.


Is scraper one “p” or two? My inclination would be scrapper is “scrap”-er rather than “scrape”-er.


scraper*


What compression does IA use to store websites? Using 2x better compression would allow them to store 2x more websites/content.

I am doing some compression research, and would love to help IA in any way I can. There are some amazing SOTA compression algorithms available now.

And if IA ignores images/video, and focuses only on text, they can store an insane amount of websites at a very low cost.


right now, most of it is coming in as ZSTD with a central index. older stuff was gzip
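For a rough sense of what codec choice buys on markup-heavy text, here's a stdlib-only comparison (zstd isn't in Python's stdlib, so bz2 and lzma stand in; ratios on real WARC data will differ):

```python
import bz2
import gzip
import lzma

# A crude stand-in for archived HTML: repetitive markup compresses well.
sample = (b"<html><head><title>page</title></head>"
          b"<body><p>hello archive</p></body></html>") * 500

for name, compress in [("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    out = compress(sample)
    print(f"{name}: {len(sample)} -> {len(out)} bytes "
          f"({len(sample) / len(out):.0f}x)")
```

Real archival corpora are far less repetitive than this sample, so the gap between codecs is smaller in practice, which is part of why a fast codec like zstd wins for ingest.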


I am working on ML-based compression which will be slow to compress/decompress but will give much better compression than zstd/gzip (even 2-3x better). Do you think it's a useful algorithm for archival purposes or for storing huge amounts of data in the cloud?


it sounds like he wants to be able to archive and re-spin-up/run things like MMORPG game servers as easily as static content can be archived and served today.

that would be a huge and expensive paradigm shift for internet service backends that traditionally have been only designed to be run by one entity and typically are a mix of custom, open source and proprietary software that is run in a specific way.

i suppose it could be done, but there hasn't been any reason to make that investment. service backends also tend to be more "living software" where part of the system is the team that continuously builds, updates and operates it.

basically, it would look something like java applets, but for entire internet service backends. one step and all the services, databases, everything would spin up and start serving. that would be great but is probably a ways out.


Well there's other static things like FAQs, news feeds, even hosted game content that a game might rely on. It's not just dynamic services. I think Carmack is referring to that.


..."something like this could combine with a blockchain style technology to make internet applications that could outlive companies. A niche multiuser game that couldn't meet company revenue goals could still be "fed" by anyone that wanted to push resources at it, since the "

that's not static assets, that's a multiuser game server.

i think he's envisioning entire internet service backends that can be packaged up like java applets and re-run on demand paired with some kind of decentralized serving infrastructure that any user can insert coins and resurrect a sophisticated web service from the past.

more likely i suspect we'll see more efforts by hobbyists to resurrect these things and more releases of backends from failed projects into the public domain.

with so much physical gear that requires service backends being made today, we may even see regulation that requires release of the source for a service when it is shut down. crazy to think that if the company who made your car or tractor fails, your perfectly good car or tractor could cease to function when they shut down the service backend.


Imagine someone building this for SaaS hosting--a perma-Heroku, or something like it. That's actually a huge value-add. Suddenly tinyStartupA doesn't need to convince largeCorpB that it's going to be around for forever. The service can exist in perpetuity without the company.

Complex repercussions obviously around acquisition, IP, and other business dimensions however. Maybe unworkable even. But I think there's a world where this actually exists and lowers the barrier to building business-critical software and selling to companies that need a 50-year commitment to risk you.


I wonder if it's worth it for these platforms (like Heroku) to simply add a donation portal. It's not as future-proof as something fully open, but it wouldn't need the design, implementation and maintenance of a brand new fully open PaaS ecosystem.


IA needs to respect robots.txt, and they need to make it easier to request data to be removed.

Not everyone fully supports the IA mission and they need to respect that view as much as they respect their supporters.


They clearly explained that they don’t consider archival to be a robotic activity when a person clicks “archive this page”. It’s closer to saving a page with Ctrl+S and uploading it to IA.

Removing data from the web in 2021? Hmm… https://web.archive.org/web/20211222032633/https://news.ycom... oops!


Are you indicating that everything in the IA was added via a person clicking on “archive this page”?

I don’t think that is correct. A lot of it was added via automated methods?


Not necessarily via a button but all of the pages were submitted to them by 3rd parties. So, they don't seem to crawl to discover new URIs eagerly, see https://news.ycombinator.com/item?id=29643506

For example, I host my own ArchiveBox at home (you get fulltext search as a bonus) and it is configured to submit every URL I save to IA: https://imgur.com/a/Yhnxo1W IA considers that to be manual submission not subject to robots.txt rules.

https://archivebox.io
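(For anyone wanting the same without ArchiveBox: the anonymous "Save Page Now" endpoint is just https://web.archive.org/save/ followed by the URL. It's rate-limited, and the authenticated SPN2 API is the supported route for volume; a minimal sketch, with a made-up user-agent string:)

```python
import urllib.request

def save_page_now_url(page: str) -> str:
    # GET https://web.archive.org/save/<url> asks IA to capture the page.
    return "https://web.archive.org/save/" + page

req = urllib.request.Request(
    save_page_now_url("https://example.com/"),
    headers={"User-Agent": "archive-sketch/0.1"},  # hypothetical UA
)
# urllib.request.urlopen(req)  # uncomment to actually trigger a capture
```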


I genuinely didn't understand the point he's trying to make.

Could someone ELI5?


How about a default way of building iterative, versioned static web sites, where dynamic content is separated and used only for dynamic data? Meaning, most sites could easily be stored offline.


I love this idea. I am working on a mini project for a decentralized game which can be self hosted or snapshots - so you can share / modify a portion of it and provide a custom experience


I was with him up until “blockchain”.


Is it really worth archiving?


It's difficult to know beforehand! A startup might be a total flop that delivers nothing of value to us now, but it might be a valuable datapoint for future historians to understand how startup culture changed and evolved. Or it may be interesting to future founders - I've seen some startups with perfectly fine ideas fail, and then a few years later, someone succeeds doing something very similar.

This may not be the best example, but while I'm sure the Rosetta Stone, being a treaty, likely would have seemed like something worth preserving, would anyone have imagined that it would be the pivotal document in understanding ancient Egyptian? That it would be one of the most important documents of all time?


That's optimistic, and I could see the potential. But I've become a bit suspicious of the content on the internet and the costs of archiving it. Maybe the Internet Archive could have a curation threshold.

I would rather revert to the internet where quality content was published openly on the web. But presently web content, at least that upranked by Google, is very low quality (seo, clickbait, biased, trivial)



