In particular, it gets into the heart of the matter: What does the user want to happen when they click the Refresh button?
It does seem worthwhile to try to change the default behavior of the Refresh button to mean "refresh the page" instead of "fix the page" (what it currently does), which would make this "immutable" proposal unnecessary, AFAICT.
IIRC this is exactly what the reload button used to do. You had to hold down what I believe was the Control key while pressing the reload button to do a "force refresh". Now it would seem it's the default behaviour. That, or maybe a normal refresh does the revalidation checks (which return 304), while a Control-refresh does a full download of all resources?
Generally speaking, browsers behave the same as they always have (though they've acquired additional nuance since the HTTP/1.0 and earlier days). IE had a cache-control bug for many years that made it impossible to force a reload in some circumstances, but it was fixed in IE 6.
The change is on the server side, not the browser. Modern single page applications do all kinds of janky things, and a lot of them break caching, either explicitly (with cache-control headers) or accidentally (with uncacheable URLs).
As far as I know, every major browser has standards compliant cache-control implementations, and all have some way to force a full reload.
Source: I worked on cache-control browser compatibility in Squid many years ago. The browsers took a while, but did get it right eventually.
In at least recent versions of Firefox and Chrome, a reload includes `Cache-Control: max-age=0` with the request. During a forced reload (e.g. Shift-Reload), both the legacy `Pragma: no-cache` (HTTP/1.0) and the more modern `Cache-Control: no-cache` (HTTP/1.1) headers are sent.
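For illustration, here's a minimal sketch (Python standard library only, with a made-up asset URL) of roughly what those two kinds of reload put on the wire; the exact header set varies by browser and version:

```python
import urllib.request

URL = "https://example.com/app.css"  # hypothetical asset URL, purely illustrative

# Roughly what a normal reload asks for: revalidate with the origin.
normal = urllib.request.Request(URL, headers={"Cache-Control": "max-age=0"})

# Roughly what a forced reload (Shift-Reload) asks for: bypass caches entirely.
forced = urllib.request.Request(URL, headers={
    "Cache-Control": "no-cache",  # HTTP/1.1
    "Pragma": "no-cache",         # legacy HTTP/1.0 equivalent
})

print(normal.header_items())
print(forced.header_items())
```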
Judging by the comments here, there seems to be some confusion.
This is exactly like long-lived cache settings today. Right now browsers send a request on basic reloads and get back a 304 from the server which states that nothing has changed. All this setting does is let the server tell the browser to skip that check/roundtrip instead of wasting the time/bandwidth on confirming with a 304 after the initial load.
The browser is still completely in control here and can do a full reload or just reload all the time if it wants to. Web scrapers and other HTTP clients are unaffected.
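For the server side, here's a minimal sketch of what opting a hashed asset into this behaviour might look like (Python's standard-library `http.server`, with a made-up `/assets/` path convention; not any real site's configuration):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class AssetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical convention: anything under /assets/ is content-hashed
        # and its bytes never change for a given URL.
        if self.path.startswith("/assets/"):
            body = b"/* contents of the hashed asset */"
            self.send_response(200)
            self.send_header("Content-Type", "text/css")
            # Long max-age plus the proposed immutable extension: the browser
            # may skip conditional revalidation (the 304 round trip) on reload.
            self.send_header("Cache-Control", "public, max-age=31536000, immutable")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), AssetHandler).serve_forever()
```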
We use webpack and all filenames are just SHA hashes of the contents in the production builds. There is no need for the browser to ever ask anything about that file again (unless it's purged from the cache...).
This has been the standard way of serving assets in Rails for the last 5+ years. I don't think it was invented there, as if you are using a CDN it's basically required.
Invalidating edge caches takes time and/or is expensive (e.g. CloudFront), so adding the content hash to the file name is a good trick to ensure users always get the correct version of the asset.
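A rough sketch of that naming scheme, assuming a SHA-256 content hash and a made-up helper name (this is not webpack's or Sprockets' actual code):

```python
import hashlib
from pathlib import Path

def hashed_name(path: str) -> str:
    """Return e.g. 'app.3f5a9c1b2d4e.css' for 'app.css' (illustrative only)."""
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()[:12]
    return f"{p.stem}.{digest}{p.suffix}"

# Any change to the file's bytes changes the digest, so the URL changes too,
# and the old URL can safely be cached forever (or marked immutable).
```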
I didn't mean to imply it was invented by webpack or anything (and I used the scare quotes around "modern" because I'm not sure how long it's been widely used).
I was just pointing out how and with what I use it.
It really works beautifully. One of our applications coming up needs to work offline, and we found that because of the naming we were using, appcache was almost completely work-free to implement.
If the appcache.manifest changes, it rechecks all files (or in my case, would only pointlessly re-check those which haven't changed, and download new ones), and the appcache.manifest will change the second a single byte anywhere in the program changes.
What happens if a random WordPress blog's front page (/) is compromised and has malware injected, setting the immutable keyword? Cloudflare and Let's Encrypt mean most sites will be HTTPS sooner rather than later, so the HTTPS part will be "taken care of". (At least that's better than nothing; imagine the power granted to captive wifi portals if not!)
I think it would be bad practice to use this keyword on endpoint URLs that are advertised in search engines or API documentation.
You would want to use it for resources to which base pages and manifests point, such as JS, CSS, JPG, PNG, etc.
The browser could enforce that, sort of. It could ignore immutable cache status on the object that is actually in the browser location bar and revalidate it with If-Modified-Since, but it could allow it on referenced objects.
The idea is that referenced objects can simply stop being referenced, and a fresh object is referenced.
I think the point is that it would be renamed /parallax-plugin2.js and HTML would be set to reference that instead? That is why immutable cache shouldn't work for the page in the address bar.
The concern here is that even after recovery from the compromise, the site could then never use the name "parallax-plugin.js" again, because a browser might have the cached malware under that name instead of the correct version.
On top of that, you as a developer would have to understand what's happening, and that it's happening at all. Might not be easy, as we have a habit of clearing our caches all too often :)
There's max-age support, the ability to preload pins in the browser, and certificate transparency to work around this; see section 4.5: https://tools.ietf.org/html/rfc7469#page-21
As to this original point, it would be best if this didn't apply to the address bar URL / main document request. But it's a good point, worth considering. Perhaps the UA should set a timer, and two or three refreshes in a row would be equivalent to the prior refresh behaviour.
Or simply the domain is resold, but old visitors still see a page from a year back. Immutable is useful, but the max-age should be limited to a few hours, which is an acceptable timeframe for internet disruptions (e.g. DNS).
With these, you fetch data by its hash. You provide a primary URL where the item is known to exist, but the browser is free to fetch the data from any proxy (or local cache) with the same hash.
This could replace package mirroring, git clones, parts of bittorrent, CDNs and more.
It does assume that you use a hash with enough bits that collisions are extremely unlikely, and also that your hash is cryptographically strong (else a rogue proxy can inject data).
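A sketch of that verification step, assuming a SHA-256 content address and an untrusted mirror URL (both the function name and the URL here are made up for illustration):

```python
import hashlib
import urllib.request

def fetch_by_hash(url: str, expected_sha256: str) -> bytes:
    """Fetch from any mirror/proxy/cache and accept the bytes only if they
    match the content address we already trust (illustrative sketch)."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("mirror returned data that does not match its hash")
    return data
```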
Exactly! IPFS already has a web proxy to their content-addressed network (e.g. https://ipfs.io/ipfs/QmXZnH2WVmFoiE7tRJQk9QstLGhSKpVyEQ4Rywx... ). And hopefully browsers will learn to speak the protocol natively, so then there's no need for an HTTP proxy at all.
"Immutable" with "max age" is an oxymoron. If it expires it isn't immutable. Use another word.
What you need is an absolute date and time in the cache header which says "we promise this page does not change before this date and time". This could be treated as a "lease" and automatically extended in some configurable intervals. For instance, if it is 30 days, then the file is good for 30 days since its modification time stamp. When that time passes, this is renewed automatically: it is now good for 60 days since its modification time stamp. Basically, it is always good for N*30 days since its modification time stamp, where N is the smallest N required for that time to be in the future.
When the webmaster publishes a new version of the file, he or she knows precisely when browsers that have cached the previous version will start picking up the new one. Changes can be co-ordinated with the expiry time to minimize the refresh lag: the time between when the earliest new client sees the new page and when the last old client stops seeing the old one. If we know that a page expires for everyone on June 1, 2016 at noon, we can update that page in the morning on June 1. By afternoon, everyone sees the new one.
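A sketch of the "lease" arithmetic described above, assuming a 30-day interval measured from the file's modification time (the function name is just for illustration):

```python
from datetime import datetime, timedelta

def lease_expiry(modified: datetime, now: datetime, interval_days: int = 30) -> datetime:
    """Smallest multiple of the interval past the modification time that is
    still in the future: good for N*30 days, renewed automatically."""
    interval = timedelta(days=interval_days)
    n = 1
    while modified + n * interval <= now:
        n += 1
    return modified + n * interval

# Example: a file modified 2016-04-20, checked on 2016-05-25, is "good"
# until 2016-06-19 (N = 2).
print(lease_expiry(datetime(2016, 4, 20), datetime(2016, 5, 25)))
```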
Yes I think it would still be a good idea to require expiry date for "immutable" content, just as a safety net if something is misconfigured somewhere etc - then when you fix a bug, you will know the precise time at which it will be gone for everyone (hopefully the expiry date was not set to 10 years).
However, I wonder what the typical cache lifetime of resources on the current web is. IIRC someone on HN posted about a week ago that, according to their study, it's rather short - stuff is evicted from cache quite rapidly if not used. So fast that getting a cache hit for jQuery from a CDN is actually quite unlikely.
> I've learned to press Enter on the URL line
> instead of reload for exactly this reason.
Yes, this is what I do too.
The browser can do one of three things:
1. Serve the file from cache. This is what happens when you put the cursor in the address bar and hit Enter. Well, at least for ancillary files. It will likely still ask whether the main HTML document has changed. But it will load CSS and JavaScript files from its local cache, provided the webmaster properly set the HTTP headers, like Expires, to tell the browser that it can cache the files.
2. Ask the server whether the file has changed. This is what happens when you click the Reload button. This is the area of dispute. The article is saying it would probably be better if the browser acted just like it does when you put the cursor in the address bar and hit Enter. Instead the browser seems to check not only the main document file but also every single CSS, JavaScript, image, whatever, file. It doesn't redownload them all, but it sends an If-Modified-Since header, to ask whether they have changed, and then requests the whole file only for ones that the server says have changed. The payload back and forth is usually just a few hundred bytes for files that have not changed. But the network requests take a noticeable slice of time, because it's one request per file.
3. Ask the server for the whole file, regardless. This is usually when you hit Shift and Reload.
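To make case 2 concrete, here's a minimal sketch of a conditional request (Python standard library, made-up URL; a real browser would send the Last-Modified value it cached rather than the current time):

```python
import urllib.error
import urllib.request
from email.utils import formatdate

URL = "https://example.com/app.js"  # hypothetical asset URL

# A conditional GET: if the server answers 304 Not Modified, the cached copy
# is reused; only a 200 response carries a new body.
req = urllib.request.Request(URL, headers={
    "If-Modified-Since": formatdate(usegmt=True),  # stand-in for the cached Last-Modified
})
try:
    with urllib.request.urlopen(req) as resp:
        print("changed, got", resp.status, "with", len(resp.read()), "bytes")
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("not modified, serve from cache")
    else:
        raise
```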
I apologize if someone already mentioned this, but there's a way to eliminate the penalty to the user without changing HTTP at all. Browsers can simply check all the non-expired resources after everything else. Now the latency is the same, but we still do the checks, just after everything else, while we're already rendering the page. Only if one of those resources actually did change do we re-render the page.
The immutable solution is cleaner, and doesn't load the server as much, but it's not backward compatible and requires the people who run the server to know what they're doing. Maybe the two solutions could be combined?
The biggest potential drawback I see is that maybe most resources, including the HTML, don't expire, so every page will be rendered and then re-rendered, giving little benefit and making the rendering choppier. Some of that could maybe be mitigated by starting the rendering in the background and not displaying it until a certain percentage of requests return, or by special-casing the "page" itself as opposed to page resources.
What you're suggesting won't work. The problem is that a lot of pages require their resources to be loaded in a specific order. The C in CSS stands for cascading, and means that rules that are loaded later override earlier ones (if selector specificities match). The same goes for JS, since later scripts might depend on the frameworks or libraries loaded earlier, unless the script has the async attribute. And then there is the content loaded by the CSS and JS themselves, which in most modern web apps makes up the majority of the content.
Nonsense. You have the CSS and JavaScript files. You just aren't certain that they're the most recent version. So you go ahead and render the page, using either the version you have or, failing that, one requested from the server, still doing everything in order. Then you validate in the background your assumption that the stale versions you used are still the most current. If your assumption is right (and mostly it will be), nothing happens. If it's wrong, you re-render the page, again all in order.
You have a point. What about CSS? At first I think it would just render funny, but some javascript actually interacts with the CSS, e.g. jQuery selectors based on style classes or something. So it has the same problem really.
Images would be safe.
Or an alternate plan, start rendering, but do not run the javascript until the resource checks for css+js files return. It would slow things down some more, but not as much as waiting for everything.
Is there a danger here of getting a corrupt resource, and then no matter how many times you mash reload it never gets fixed? What do we have to stop this? I don't think CSS files have a SHA checksum header by default, do they?
Correcting possible corruption (e.g. shift reload in Firefox) never uses conditional revalidation and still makes sense to do with immutable objects if you're concerned they are corrupted.
And subresource integrity would ensure that the value in the URL matches the hash of the content. Files named by their hash can be treated as immutable with confidence.
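A sketch of computing the kind of integrity value subresource integrity checks against (the helper name and the asset path in the comment are made up; the format is "sha384-" plus the base64 of the raw digest):

```python
import base64
import hashlib

def sri_value(data: bytes) -> str:
    """Compute a subresource-integrity string like 'sha384-...' for an asset."""
    digest = hashlib.sha384(data).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")

# The page would then reference the asset with something like
#   <script src="/js/app.3f5a9c1b.js" integrity="sha384-..." crossorigin="anonymous">
# so a corrupted or tampered copy is rejected no matter where it was cached.
print(sri_value(b"console.log('hello');"))
```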
How do I do shift reload on my smartphone browser? And how do I even know what shift reload is (most people won't) or that the site is corrupted (it still shows something - how do I know that it isn't the most recent stuff)?
And for my general understanding of this proposal: even if the current domain owner might guarantee that the content never changes, the domain can switch to another site which might use the same paths but of course wants to put different content there. Is this somehow covered?
> How do I do shift reload on my smartphone browser?
For some reason, I have the intuition that it's by tapping the address bar (to get text focus) then go/enter/return. But I have no idea if that actually does the equivalent of shift reload!
You can just empty your cache completely - not ideal but still easily done on mobile.
Your 2nd question is confusing - are you asking what happens if you have the same exact path but from another domain? Then it depends on what that server responds with. This is just a HTTP response header, nothing more.
Yeah, I could clean the cache. But most people won't know how - and what a cache even is.
The second question was about my expectation that the new server won't even get queried if the immutable caching policy from the old server prevented this. And so it doesn't have a chance to signal that its content changed.
The potential for a domain to transfer ownership and still use the same paths, yet have different content seems incredibly unlikely. Like, it feels like you were trying to come up with potential issues for the sake of finding a way to say "see, this won't work!" :P
The biggest reason to use this is for versioned resources. Things that will never change. Say I create a minified JavaScript file. Its MD5 hash is 123456789abcdef...., and so in the output file, the filename is "foo.123456789abcdef.js". If the file changes, the hash changes. If I request the version of the file with "123456789abcdef", I should get that one. Ignoring the unlikely potential for hash collisions, everything in this scenario is working as intended. There is no conceivable reason to ever want to change the content while keeping the same hash.
Now, let's say that file, somehow, gets corrupted AND cached in your browser. I can't say I've ever seen something like this happen, but I suppose it's possible? I'd be very interested to hear if something like this is possible, to be honest. It seems like between TCP retransmissions and Content-Length, you would need some sort of subtle corruption that flips a bit and isn't corrected?
EDIT: As Klathmon points out, Subresource Integrity is probably a better solution to "corrupted file in cache" scenarios. As it stands, if a file was corrupted on disk, let's say, but the ETag and/or the Last-Modified values were accurate, the origin would only ever respond back saying "nope, no changes! you seem to have the latest copy" and you'd still be stuck with the corrupted file. Only a hard reload/cache clear solves that.
I don't want to come up with potential issues just for the sake of preventing this. But it's my job as an engineer to think about all potential issues and to avoid them as far as possible. And I'm not directly involved in this topic here or in the web at large, but I just read this, wondered whether it is fully thought through or not, and therefore asked.
Of course domain changes are unlikely. But nevertheless they are possible in our system and we have to cope with them. I just googled subresource integrity and it doesn't seem like an appropriate solution for this scenario. It would mean a new domain owner would need to generate those for ALL his links, just to be sure that the previous site didn't mess anything up. That means, first, extra work, and second, you wouldn't even know for how long you need this (until all previous users have visited the new site).
There would even be possibilities for major annoyances, if a previous site owner put that feature on things like index.html before the ownership change, just to keep visitors from seeing the new page for as long as possible.
I mean, you could say the same thing about HSTS and key pinning. Domain changes hands, but "oops", HSTS was set and the old keys were pinned.
Is that actually a problem? No, it's not. Similarly, as the owner of a new domain, why would I want the old content? The only reason I can think of is that I bought a company outright, or something. In that situation, if I don't want to change the content, everything still works. If I want to change it, and they did something stupid, like unversioned paths using this proposed flag, then yeah, I'm in a weird spot. That seems like the most trivial and unlikely of scenarios, though. It requires such a complex chain of events to occur.
I think it's safe to say that malicious usage of the flag is entirely out of scope when considering the validity of it, again, because it requires a contrived situation.
Of course. But no one sane would put 'Cache-Control: immutable' on `index.html`. It's to be used on `/js/lib/jquery-1.7.1.min.js` or `/js/mystuff-<sha1here>.js` or `/photos/mnbvcxzasdfghjklqqwertyuio1234567890.jpg`
Nobody sane would, intentionally. But I'd bet the house on it happening by accident quite a bit.
At a technical level, I like this idea. When used well, it makes sense to allow. It's hard to fault it without bringing in human error, politics, or economics.
At a practical level...
I can't wait to see what happens when a bug allows Facebook to serve this header on all pages, even for a few minutes. The most Facebook-dependent folks around, those checking their phones every five minutes, will be stuck in a perpetual time freeze, unable to move forward ;).
I also can't wait to see what happens when a government tries to ban a cache-control: immutable page.
Or even what happens when, someday, Google is selling its assets and someone else gets "google.com". (Someday it'll happen - Google won't exist for all eternity.)
If you can construct an attack out of it, relying on people doing sane things is dangerous... (I'm not sure this is interesting enough as an attack vector, but "but nobody would do that" is a bad answer a lot of the time)
> The potential for a domain to transfer ownership and still use the same paths, yet have different content seems incredibly unlikely.
More generally, many protocols (including basic email verification) break horribly when the assumption that domains last forever gets broken. Ideally, domains shouldn't expire, ever.
EDIT: reading the firefox bug for this test implementation, I'm not sure if it is intended to be applied to pages, or only to sub-resources (early posts mention the distinction, later ones don't)
I'm pretty sure that in the past, at least, Firefox would skip the request even on a refresh for resources that had the appropriate Cache-Control headers and which did not have any of the various conditional-GET-related headers, e.g. ETag, Date, etc. Did this change?
> Facebook, like many sites, uses versioned URLs - these URLs are never updated to have different content and instead the site changes the subresource URL itself when the content changes. This is a common design pattern...