Cooling related failure (in Google London DC) (cloud.google.com)
221 points by tardismechanic on July 20, 2022 | 142 comments


Oracle also had issues around the same time: https://ocistatus.oraclecloud.com/#/incidents/ocid1.oraclecl...

Obviously the weather is easy to blame, but I wonder if the underlying cause is the same datacenter. It's kind of annoying that clouds treat availability zones as such an opaque thing: it's not possible to map a zone from one cloud to another, so your failure domains may overlap without you knowing.

Shameless self promotion: I have a map of all the cloud regions (but not going into as much detail as availability zones): https://cloud-regions.bodge.cloud -- the clouds just don't publish the data.


For the AZs that's slightly complicated by the fact that, at least for AWS, they are randomised (your eu-west-1a isn't necessarily the same as mine), so it might be pretty much impossible to check for overlap even if you know one DC is used by both Google and Oracle. Spreading across regions seems more appropriate for increasing redundancy.


Amazon does provide a way to map to the real availability zone: https://docs.aws.amazon.com/ram/latest/userguide/working-wit...
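
For reference, a minimal sketch (boto3, assuming AWS credentials are configured) of dumping your account's mapping; the ZoneId is the physical identifier that stays consistent across accounts, unlike the per-account ZoneName:

    import boto3

    # ZoneName (e.g. eu-west-2a) is randomised per account; ZoneId is the physical AZ.
    ec2 = boto3.client("ec2", region_name="eu-west-2")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])   # e.g. eu-west-2a -> euw2-az2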

It's also not appropriate in all cases to spread out, for example the nearest zones to London are around ~7ms away in mainland Europe.


The provided info is only relative to AWS. If you want to map geographical locations across clouds you're out of luck unless you have reports with precise locations available. Generally, AWS already provides certain assurances that AZs are spread out.


I didn't know this, but it makes perfect sense and randomizing the a/b/c is a tried and true solution - e.g. in the electrical grid, L1/2/3 are rotated for each customer, because people tend to connect more stuff to L1 on average.


Most people only have single-phase power here in the US, with two 120 V legs 180 degrees out of phase. The power company steps the voltage down from a single phase on the distribution wires. Circuit breaker panels are set up with alternating legs so that the neutral current is approximately balanced whether you fill up the panel on one side first or from the top down.


> people tend to connect more stuff to L1 on average.

I just plug into the wall socket and it works.

How could I choose to connect things to L2?


If your property has 3 phase power, chances are your single phase breakers are just on some random phase (picked by your electrician) ie phase 1.

Because phase 1 is picked so often, they actually rotate the phases at the street, i.e. swap 1 2 3 with 3 1 2, to avoid ending up with, say, 300% more load on phase 1 and having the grid just die.


> If your property has 3 phase power, chances are your single phase breakers are just on some random phase (picked by your electrician) ie phase 1.

That's bollocks, or speaks to a completely incompetent electrician. Usually you have three phase rails coming out of your GFCI that distribute the phases sequentially to each breaker [1], precisely to avoid that scenario, as well as to avoid a tripped main breaker because the customer overloads one phase.

The only case where great care is taken to avoid random assignment of phases is in event stage technology: you do not want your lighting dimmer packs on the same phase as your amplifiers, because dimmer packs inject extreme amounts of EM noise, and you also have to take care not to overload any single phase anywhere. However, it's questionable how long this will stay relevant, given that most lighting load is moving to LEDs.

[1] https://www.amazon.de/-/en/Busch-Jaeger-Hage-Phase-KDN380A-3...


Guessing you're from the UK, where you seem to have fancy pre-configured sub-boards. In Australia the standard, and 99% of house installs, is just hand-wired single wires from breaker to breaker, twisted together and rammed into the bottom of the breakers :)


Germany, and I've worked on construction sites and in stage tech for a time in my career.

What you describe is... beyond awful and an absurd fire risk.


Lmao, I moved from NZ to the UK and the house wiring here has very often not been touched since anywhere from the '60s to the '80s.

They are very slowly modernising here, but they seem to have a cultural resistance to change/modernisation; they're in love with the old days.

Very different to Aus/NZ where we're eager to get the latest stuff/NZ is often used as a test market.


Ours are called red, green and blue, which helps avoid putting everything on P1.


UK phases used to be coloured as well. The wiring regs changed to make it much less confusing (and cause fewer issues for colourblind people), so they're now L1, L2, L3.

The GP is also a bit misleading. Most residential properties in the UK are only fed one phase - either the street only has one phase or the grid connects houses up L1, L2, L3 in sequence. Of course I'm speaking about detached and semi-detached houses; in larger groups of houses (e.g. blocks of flats) you usually get 3 phases, and one of the jobs of the electrician is phase balancing.


I'm from Yeovil, Somerset! I have a fair idea about 'leccy here.


Then you'd know they're Red, Yellow and Blue, and that colouring went out in 2006:

https://www.newfound-energy.co.uk/electrical-three-phase-wir...


I wonder if there can still be a statistical discrepancy due to the electrician's favorite color.

It seems blue is strongly favored by psychology students, for example: https://www.livescience.com/34105-favorite-colors.html


Hopefully not 123 -> 312 but e.g. 231.

AFAIK phase order matters for e.g. three-phase motors: they will spin in the wrong direction if wired with the phases in the wrong order.


312 is also a clockwise direction; the anticlockwise ones are 132 213 321.
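
A quick way to convince yourself (the helper is ad hoc, nothing standard): a sequence drives a three-phase motor in the same direction iff it is a cyclic rotation of the original.

    def same_rotation(a, b):
        # True iff b is a cyclic rotation of a
        return len(a) == len(b) and b in a + a

    print(same_rotation("123", "312"))   # True  - same direction as 1-2-3
    print(same_rotation("123", "132"))   # False - two phases swapped, reversed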


I'm not following; aren't 123, 231 and 312 all in the same order?


Your own mappings are available in the console and via the API, so it's definitely still possible to tell.


You have some of the empty lat, lon data (e.g. AWS GovCloud (US)) interpreted as at (lat,lon) = (0,0), putting them incorrectly on Null Island.
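
One way to guard against that when building the map (the field names here are hypothetical, just to illustrate): treat missing coordinates as missing rather than as numeric zero.

    def parse_latlon(region):
        lat, lon = region.get("lat"), region.get("lon")
        if lat in (None, "") or lon in (None, ""):
            return None  # unknown location: skip it, don't plot at (0, 0)
        return float(lat), float(lon)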


Thanks, hidden.


UK infra - physical and electronic - is typically rated to around 30C. So higher temps are going to cause failures in multiple locations.

If yesterday's temps lasted for a week instead of a day or two a lot of essential stuff would stop working.


Carrier-grade (i.e. critical) network equipment should be rated for 40C ambient air and then up to 55C for some short length of time (3 days??) to allow for air conditioning failure. This is what we design and test for. Cheap stuff, including Google's software solutions running on 'commodity hardware', won't handle that. You get what you pay for.


I have not seen a lot of carrier equipment in the DC/CO rated to 55C. Optical transport nodes maybe, but not the switches, routers, servers, etc. Temp ratings like that are normal for field systems in passively cooled cabinet shelters or tower sites; industrial Ethernet gear and CCTV camera housings are even built to handle 65C. The Uptime Institute Tier 1/2/3/4 designs build overcapacity and redundancy into the cooling plant itself to avoid failures like this, and I don't recall seeing anything NEBS-related that goes beyond that either. This looks like a case of the intake air/water to the cooling system going outside design parameters, so even with extra capacity they couldn't meet the temperature/humidity set points.


It very much depends on the equipment. We have multiple tiers of what we support at each type of facility, with traditional data centres having less stringent requirements (maybe w.r.t. noise/filters rather than temp) compared to the facilities housing big routers, where -5/55C is the spec to meet.

You are right, and it's arguably harder for those smaller boxes sitting in cabinets / on telegraph poles, despite being much lower power systems to start with. It might be 50C+ in there before they even turn on! Those things might have a two-stage boot process where they just run their fans for a bit to cool things down before actually booting the main system. It must be a real nightmare for entirely passive stuff, I have no experience with that.


Google doesn't run its networking on commodity hardware; that's been the case for over a decade (source: Open Networking Summit talks). The issue here isn't network equipment temperature ratings, but rather the whole datacenter losing cooling.


I was under the impression that they still did some of their own software-defined routing, maybe at lower bandwidths? That could be an outdated view. I do know they also buy traditional equipment from the big vendors for high-bandwidth stuff. P4 is very cool.

I was trying to point out that high-end hardware should survive a couple of days despite a data centre losing its cooling. It is designed for exactly that situation.


A typical Google data center has a networking room which will often have large numbers of standard commercial networking devices. That room has extra redundancy, is locked off from the cattle, etc.


Does the UK even design with the assumption of air conditioning being necessary and present, let alone designing for it failing?


Yes. You must be able to control the temperature of a building densely packed full of hot radiators. You might be able to avoid active AC within the Arctic Circle, but you'd still want to filter that incoming (very) cold air, at which point you might as well basically have an AC system. And yes, you must ensure you can survive a failure, because these systems do fail (normally when you need them most) and otherwise everything inside cooks.


The Arctic is warming much faster than other places. See e.g. this recent article about record 38C temps in Siberia: https://www.sciencenews.org/article/siberia-verkhoyansk-reco...


Datacenters always have A/C I thought?


Thanks :)


Google and other companies do make risk assessments including temperature scenarios.


> Cheap stuff, including Google's software solutions running on 'commodity hardware', won't handle that. You get what you pay for.

Sounds like the downside of the "pets versus cattle" methodology is that natural phenomena will wipe out your entire herd, while the pets survive.


Why does your map have AWS's eu-north-1 in Estonia? Officially it's said to be in Stockholm, and I can find no references to it being in Estonia, nor know of it existing there.


The eu-north-1 AZs are in Västerås, Eskilstuna and Katrineholm, all ~100 km from Stockholm and ~40-80 km from each other.


Strange, I'm not sure where I got that from, will fix.


You have some locations simply pinned to the central squares of cities when that is definitely not where the facilities are. Maybe include a +/- error range on the provided data?


Wow, this map is really cool. I'm (very) idly curious how accurate the reporting for the Sydney region is, because I just looked up the lat/lon, and found myself in the middle of an urban mall environment that I've walked past many times when I've been in the city.

Being able to look up at the buildings there and know there are indeed 5 different clouds somewhere up above my head in that specific location would be really cool. Being able to point at a specific building would be even cooler.

I do of course (sadly) appreciate the flip side of this coin which is one of (many of) the reasons precise data is not published. So I guess I'm just wondering out loud, probably rhetorically :), how I might find out one day. "<-- That building" is enough resolution for me :)

EDIT: A quick Google found baxtel.com (among other websites) has address-level locations for most providers. The buildings are all so boring, haha! (Understandably so though.)


It's pretty far off, at least for Azure. It shows a number of Azure regions as being located in downtown urban areas, which is typically not the case.


It would be possible to infer the locations if you had a lot of time.

AWS, at least in the past, used to co-locate with other third-party servers in the data centre. So if you had such a server, you could ping AWS endpoints to triangulate where those servers might physically be.
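
Something like this rough sketch, assuming you can run it from a box colocated in the suspect facility; TCP connect times to the regional EC2 endpoints are only indicative, since they may sit behind shared front ends:

    import socket, time

    def rtt_ms(host, port=443, samples=5):
        # Best-of-N TCP connect time, as a rough stand-in for ping
        best = float("inf")
        for _ in range(samples):
            t0 = time.perf_counter()
            with socket.create_connection((host, port), timeout=2):
                pass
            best = min(best, (time.perf_counter() - t0) * 1000)
        return best

    for region in ("eu-west-1", "eu-west-2", "eu-central-1"):
        print(region, round(rtt_ms(f"ec2.{region}.amazonaws.com"), 2), "ms")

A sub-millisecond result to one region and multi-millisecond results to the others would be a strong hint you share a building (or at least a metro) with it.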


I'm pretty sure a majority of AWS AZs globally are still not in AWS-owned-and-operated facilities.


At this point I would be quite surprised if they were not majority AWS owned and operated.


Of the ones that I know details about in Europe and Asia, not a single one of them are AWS owned and operated. I'm sure the situation is different in the US, and wouldn't be surprised if they own and operate all of their US DCs.


I also think it would be nice if Google prominently stated which cloud regions are hosted in their real datacenters and which are in third-party facilities. It's not just that you run the risk of some off-brand piece of junk overheating, but also that they so loudly trumpet their PUE and carbon neutrality while reselling cloud capacity in other people's facilities, where you have no idea what the carbon impact is.


They do tell you the carbon impact right on the region selector: https://cloud.google.com/blog/topics/sustainability/pick-the... and also https://cloud.withgoogle.com/region-picker/ is quite nice.


Google doesn't really have a concept of datacenters internally. In simplified terms everything runs on a Borg cell, which usually but not always exists in a single building. Within a region one of your GCP services could be running on one cell, with another service running on another cell in a different building. If a failure happens or maintenance needs to be done, a cell can be drained to another one in the same region.


Google has "data centers" internally. People argue about the various definitions but a DC is usually one or a few buildings containing several clusters, each of which is fairly "close" to all its components.


Really, Google doesn't have an entire org called "dcops"?


Not in a way that is relevant to a cloud customer asking what building their jobs are running in.


That's not how carbon impact works. You don't buy "clean" products. You buy products from vendors who have an overall average mix of production that meets a "clean" threshold, and/or who buy credits from someone else who exceeds the threshold.


In Google's case (and Scaleway's too, for instance), the point is that some of their DCs use renewable energy only and thus are "clean". If however you use the "wrong" Google region, hosted in Equinix's DCs, which happen to be powered by coal, it's not even remotely close. However, Google makes the distinction and you can easily check which region is "clean" and which isn't.


Electricity generation is also affected by weather in two ways. Firstly, by the sag in the wires, which are built to weather tolerances much as railways are; if you get outside those tolerances, the risk of problems increases.

Secondly, within a limit of my understanding, the efficiency of a turbine system hot-to-cold is affected by the climate it operates in. The ambient temperature, humidity and pressure affects the final stage.

Both things might mean that in times of high heat and humidity, the electricity supply system is least able to cope with increased demands for cooling systems, which will themselves draw more power fighting the weather.

Separately the HVAC systems for the DCs will have been designed for a specific climate, with margins. I guess the sustained change in night and daytime temps and humidity has hurt their efficiency too, in this window of time. They'll be fine when the weather system passes through, as will the supply network.

The Met Office says both overnight and daytime peak temps for the inland south have been records. That's where a lot of ICT infrastructure is.


> Separately the HVAC systems for the DCs will have been designed for a specific climate, with margins.

Exactly right. For a Tier IV data centre that margin is ASHRAE N=20. For London City until recently that was only 34.5C (dry bulb).


Interesting rabbit hole! Does that mean the margin is defined to contain once-in-twenty-years extremes?


That’s right. However these estimates are now being regularly exceeded. Going forward, it would probably be wiser to do what the mining industry does here in Australia - still design to a reasonable extreme temp, but also specify that the equipment must run at a much higher temperature at a derated capacity. Here that’s 50C ambient.


Even disregarding climate change, a 5% chance each year that your equipment will fail to handle the weather seems lacking for a high-availability facility. (You'll be combining many other figures of this type, so your total risk would be higher still.)
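
To put rough numbers on it (assuming independent years), the chance of at least one exceedance compounds quickly over a facility's lifetime:

    p_year = 0.05                      # 1-in-20-year design condition
    for years in (5, 10, 15, 20):
        p_any = 1 - (1 - p_year) ** years
        print(years, "years:", f"{p_any:.0%}")
    # -> 5 years: 23%, 10 years: 40%, 15 years: 54%, 20 years: 64%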


> Secondly, within a limit of my understanding, the efficiency of a turbine system hot-to-cold is affected by the climate it operates in. The ambient temperature, humidity and pressure affects the final stage.

Usually this is only partially true. Systems have enough margin to account for such conditions (cooling pumps have to pump more water to compensate for higher temps). Note also that the output of a turbine can be kept constant; only the efficiency will come down slightly.


I wasn't sure about the "slightly". What I read elsewhere indicated that for safety reasons, gas- and coal-fired steam turbines back off load when the ambient heat rises. I had assumed it was the output cooling/evaporation efficiency, but it might be something else.


Sounds like yet another argument for investing in that low-hanging fruit of energy storage: thermal reservoirs. Datacenter cooling does not need a huge delta from ambient; the advantage of pumping against nighttime ambient rather than daytime ambient would be huge. Datacenter cooling should be dispatchable demand.


I feel like we could do this for homes & businesses too...

How big of a thermal battery (assume we are freezing water) would you need to store the cooling capacity of a 5 ton HVAC system running for 1 summer day?

Perhaps we could design a new generation of heat pumps with this approach in mind.
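
A rough back-of-the-envelope for the question above, using only ice's latent heat of fusion (~334 kJ/kg) and assuming the 5-ton unit runs flat out for 24 hours with no losses:

    btu_per_hour = 5 * 12_000             # "5 ton" unit = 60,000 BTU/h
    joules_per_day = btu_per_hour * 24 * 1055
    ice_kg = joules_per_day / 334_000     # latent heat of fusion ~334 kJ/kg
    print(f"{ice_kg:.0f} kg of ice")      # ~4,500 kg, roughly 4.5 m^3 of water

That ~4.5 tonnes is no accident: a refrigeration "ton" is defined as the cooling you get from melting one short ton of ice over 24 hours, so a 5-ton unit running for a day needs roughly 5 short tons of ice.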


If you (as a person) can tolerate cooler temperatures well, you can already do this to some extent using the house itself as a thermal battery.

Run your A/C overnight when power is cheaper and the air conditioner is more efficient (and rolling blackouts are less likely...). Cool your house to say 4-8F cooler than you'd normally keep it (close off some vents in your bedroom if you need to, though personally I prefer sleeping in the cold). If your house is well insulated you may be able to make it through a significant portion of the next day without the air conditioner needing to cut in again.

In the grand scheme of things wood doesn't have a lot of thermal mass, but if there's a lot of it it still adds up. Even as the air begins to warm, the floor and walls still feel noticeably cooler.


Key difference: homes and small businesses buy electricity at a fixed rate, while large datacenters are big enough for dynamic prices. In a market that doesn't completely abandon fixed-rate prices this isn't really possible for smaller customers, because offering dynamic rates requires some trust that the consumer doesn't abuse the fixed rate by dynamically selecting whichever is cheaper at the time. That trust can only be established for consumers large enough to justify people checking the numbers, and perhaps even the occasional audit.


This is the case now, but it wasn't always.

In Texas, we used to have residential plans where you paid wholesale rates for electricity (e.g. Griddy). I used to be on one of these plans and would constantly fantasize about being able to accumulate HVAC capacity at night, when you would sometimes be paid to consume electricity.


Did the "least cost router" scenario ever come up? I'd imagine a certain kind of person would feel very much entitled to, e.g., share consumption with a neighbor, switching both houses to a shared meter on either the variable or the fixed rate based on time of day, with power retailers frantically balancing between chasing them down for suspicious patterns and pretending it never happened to avoid inspiring others.

(I do believe that switching to variable rates would be a good idea: in the grand scheme of things, even the poorest of the poor would be better off if they occasionally had to disconnect, which is less bad than the alternative of the entire grid occasionally browning out, taking down services that might matter more to them than their home consumption. And peak prices wouldn't even be as high as they are now if they weren't propped up by an army of fixed-rate consumers who, almost by definition, show no demand flexibility.)


> I do believe that switching to variable rates would be a good idea

Completely agree. Any time we try to control/subsidize costs we introduce instabilities and bad incentives into the marketplace.

Texas grid was about as close as you could get to reality for a while. I would much prefer a situation where everyone is impacted by the cost in the same way. At grid scale, nothing can be stored, so financial arbitrage is effectively a scam.


What does that mean? Could you expand on that?


At a guess, it'd mean building large slabs of concrete, or tanks of water, or other cheap item with large thermal inertia. Ideally you'd cool these down overnight, or at a period with cheap power, then under regular external temperatures you'd use these to reduce the need for cooling at periods of high power cost. Or, when it's very warm outside, these would work in tandem with the cooling systems to cope with a hot day, with the hope being that overnight you'd catch up, without needing to interrupt workloads.


Thirdly, there's also drought related issues - here in Switzerland, a nuclear power station had to reduce its capacity because it could not take in enough water from the local river without endangering the fish stock therein.


We’re also just under a month from the summer solstice (in N. Hemisphere: June 21). London is at 51 degrees North, so the day is pretty long at 16 hours right now.

I.e., less of that relieving overnight low.


Fortunately it rained last night and the temperature dropped pretty quickly.


So many governments, agencies, orgs, etc. are getting caught with their pants down as the changing climate brings more extreme weather, because capitalism favors maximizing profit and treats overly wide tolerances as an expense.

It's cathartic seeing the capitalists' lack of foresight burn them time and again, but also sad, knowing the response to this will just be buying slightly better HVAC and other systems, which will no doubt be globally sourced, burn carbon to produce, and probably damage some ecosystems in the resource-gathering process. Then when temperatures get even hotter it's time to buy new systems, etc.

Once again capitalism favors this outcome, because it allows multiple sales opportunities over time with the potential to increase monetization at every one, versus buying one system that can handle wider temperature swings and being done with it potentially forever if it's made to be repairable/upgradeable. Even if you invented such a thing and sold it out of your garage, GE or whoever your competitors would be would ensure you have difficulty sourcing necessary parts, coming to market, or getting word out to your potential customer base. The investors you'd need in order to afford to grow under these conditions would expect you to play the game, start cutting costs and engage in the rat race, or be replaced.

It's going to be hard to work our way out of climate change with the degenerate, consumptive nature of capitalism mining the planet while sucking resources from the wider economy where this salvation is to be invented, to the top that will always be able to afford to hole up and hide away from whatever disasters befall the working people.


You can blame capitalism all you want, but the issue is deeper. This is a well studied market failure called tragedy of the commons. People and markets have not been paying the true cost of their consumption, which has resulted in overconsumption of fossil fuels.

Your whole premise ignores the billions of people who happily use electricity and fuel every day and happily ignore the consequences. These same people then (literally) riot when prices go up by a relatively small amount.

People are greedy. They want more for less. We can disagree about the best form of economic association, but it’s disingenuous to ignore the reality that climate change is driven by the masses.


> but it’s disingenuous to ignore the reality that climate change is driven by the masses.

And the answer to a "tragedy of the commons" scenario is regulation: either intervene directly by banning undesired behaviors (e.g. banning flights on routes that are served by rail, as France does) or tax them to make undesired behaviors unprofitable.

The problem with the latter is that you will always have people rich enough to simply pay whatever tax is asked and social resentment will grow as a result ("why should the lower classes bear the load of climate change and the rich enjoy three-minute flights to save a 40 minute road trip [1]?").

[1] https://www.buzzfeednews.com/article/stephaniesoteriou/kylie...


Market failures are a feature of free-market capitalism! Most of the solutions to market failures are some kind of regulation, restriction or tax, which tend to be opposed by proponents of "pure capitalism".


> because capitalism favors maximizing profit and considering tolerances too wide as an expense.

Lack of foresight isn't exclusive to capitalism. Would socialist data centres be built to handle temperatures several degrees above the highest ever recorded temperature?

Even in a socialist economy, building systems with excessively high tolerances would be seen as a poor allocation of resources.

It could be equally argued that capitalism favours private companies like Google ensuring their DCs are as fault tolerant as possible, to ensure they have a competitive advantage. There's also plenty of cases where companies sell unnecessary and excessive goods and services to maximise their own profit.

> and being done with it potentially forever if its made to be repairable/upgradeable

DC cooling systems are repairable and upgradable. They're a far cry from a residential split system AC unit.


See https://en.m.wikipedia.org/wiki/OGAS. The book "How Not to Network a Nation: The Uneasy History of the Soviet Internet", linked from that article, discusses many of these topics.


> Even in a socialist economy, building systems with an excessively high tolerances would be seen as a poor allocation of resources.

Huh? GDR- and Soviet-made machinery, vehicles, even glassware for pubs [1] were made with sometimes ridiculous margins and tolerances to ensure longevity and easy repairability, and were famous for it. Even pre-reunification Western-made products from the likes of Bosch, AEG, Hilti or Siemens were famous for being built so well that they sometimes outlasted their owners (such as my 80s Hilti drill, which has served three generations of my family and will likely still work when I have children of my own).

[1] https://de.wikipedia.org/wiki/Superfest


> Soviet made machinery, vehicles, even glassware for pubs [1] was made with sometimes ridiculous margins and tolerances to ensure longevity

Bollocks. Soviet machinery is simple and crude, typically many decades behind its western counterparts.


Yes, but that was neither my point nor the point of the person I replied to.

"Keep it simple and stupid" is a tried and true engineering principle. The higher tolerances a design uses, the easier it is to manufacture and to repair, and the less likely it is to fail from wear and tear in the first place.

A socialist economy, where waiting for a new car could take anywhere from five to twenty years (!), definitely has to prioritize simpler, more (fault-)tolerant designs, even if that takes a bit more resources to account for said tolerance. For example, a modern car heavily using fibreglass and plastic in the chassis may weigh a good deal less than your average all-metal Lada, but the Lada could easily be repaired by your average farmer using tools they had in their shed.

Random side fact: this is a major reason why farmers are paying record prices for tractors nearly half a century old [1]. Or why the Russians are currently using so much ages-old stuff in the Ukraine war - modern tanks require a lot of logistics for repair and spare parts, but those old Russian clunkers? You can piss into the tank and it will probably drive on it. (Yes, I know, the Russians haven't been maintaining their tanks properly, which is a major factor in why they were not able to take Kyiv.)

[1] https://www.startribune.com/for-tech-weary-midwest-farmers-4...


Simple and crude is not necessarily in opposition to robustness.


Modern capitalism does have a built-in short-term view though. Companies can try planning ahead for decades of use, but if someone comes in and designs for a shorter term, they'll be a lower bidder and thus more likely to get the job. By the time the shortcomings are discovered, the lower bidder is long gone. The stock market also provides an incentive to just aim for short-term gains at the expense of the long term.

Socialism can prioritise quick fixes or long term solutions, but it depends on the wisdom of the people involved as to whether they'll build in enough capacity to allow for future climate changes.


Like most people making that commentary about "short-termism", you'd rather buy cheap clothes made in China than pay double for something local that lasts 5 times longer.


You have nothing on which to base that observation, and as such you're not even wrong; your comment is meaningless.


How are non-capitalist countries dealing with extreme temperatures?


I'd argue there are no non-capitalist countries


No Google data center issues or melting railways in non-capitalist countries, because Google data centers and railways are the products of capitalist economies.


My main confusion with this downtime is that neither their Cloud SQL nor their Redis offering managed to complete a failover, despite my org having high availability enabled on both of those plans. Is there something I'm missing here? I would've expected failover to kick in for high-availability instances and cause minimal downtime; however, it's been almost 24 hours and our Cloud SQL instance is still stuck attempting to fail over, not to mention that HA comes at a premium. Wondering if anyone can help me understand what I'm missing, or whether the failover behaviour is simply not working. We've made our own workarounds in the meantime.

Relevant docs I've checked for behaviour:

https://cloud.google.com/memorystore/docs/redis/high-availab...

https://cloud.google.com/sql/docs/mysql/high-availability

EDIT: Have found out from our ops team that the SQL instance recovered around 3am so it was down for approximately 9 hours -- which is still totally useless for something deemed HA.


Seems like they need to start issuing some refunds/credits.


IIUC, the HA setting only fails over across *zones in the same region*. If the whole region is down, HA won't help. In this case, the London data center is the region.


The region wasn't down though. Only one zone was down?

From earlier in the incident history:

> Cloud SQL:

> Impact/Diagnosis: Non-HA instances backed by europe-west2-a are hard-down in europe-west2-a. HA instances that were in europe-west2-a when the incident started, are down with stuck failovers.


That's expected; Cloud SQL is not multi-region. Clouds define HA as being multi-zonal, which you were.

Try Spanner if one region is not enough.


The whole region wasn’t down though, only zone europe-west2-a so AFAICT high availability should’ve covered this particular instance of outage.


That’s pretty terrible!


This incident page really needs a diff mode or just less boilerplate. It's really difficult to tell at a glance what's changed with each update.


Add to that all timestamps being US/Pacific for an outage in Europe...


Came here to say this. Each update almost subtracts value. Updates should only contain information that has changed.

Very frustrating when you're anxiously awaiting new information, and you have to do a word-by-word mental diff.


I really like this. I think a change to the input questions could solve this — clearer, more specific questions like "What's changed?", "Is it worse/better?".


It's funny how HN always complains about every status page.

I think Google Cloud has one of the only status pages that is always up to date and very forthcoming in giving as much detail as possible. Personally I couldn't ask for more.


It's casual feedback, sure. But it is (by and large) specific and actionable.

I think "you couldn't ask for more" is disingenuous at best, and an actively harmful outlook at worst.


Totally agree. I literally piped updates to diff last night to see whether GCP were actually making progress fixing anything or just spamming the same update message. Would make a useful twitter bot…
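
A minimal sketch of that approach (the URL and polling interval are placeholders, and in practice you'd want to strip the page down to the update text before diffing):

    import time, urllib.request, difflib

    URL = "https://status.cloud.google.com/incidents/<incident-id>"  # placeholder
    previous = []
    while True:
        page = urllib.request.urlopen(URL).read().decode("utf-8", "replace")
        current = page.splitlines()
        print("\n".join(difflib.unified_diff(previous, current, lineterm="")))
        previous = current
        time.sleep(300)  # poll every 5 minutes (arbitrary)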


If you are really that interested in the content of a status page of one cloud service provider you should redesign your infra.

The value proposition of "cloud" is not "I can haz cloud of big corp as my own" but "I can haz many cloudz to make resilient infra!".

If you rely on one cloud provider you are doing it wrong.


There was a concurrent incident affecting a large number of GCP services: https://status.cloud.google.com/incidents/fmEL9i2fArADKawkZA....

It would have been rather useful had GCP linked these (presumably related) incidents. The first mention of cooling in the concurrent incident was at 14:39 PDT (over 5 hours after the first status update, and 4 hours after the cooling incident was created)... This is what was said:

> Description: A cooling related failure in one of our buildings that hosts zone europe-west2-a for region europe-west2 is affecting multiple Cloud services.


I originally read this title as "Cooking related failure". It reminded me of when I worked for a US research university in the late 90s and one of the data center operators accidentally heated up a hot dog wrapped in aluminum foil in a microwave. A microwave that was stupidly plugged into a circuit shared by some of the network equipment and servers. The microwave allegedly blew up or something, causing a multi-hour outage of the campus network.


I had a portable 14,000 BTU unit running flat out, and it couldn't keep up. 40C is hard to deal with here.


I’m a little confused. 40C is fairly common in India where I live. My air conditioning works fine here even in relatively high humidity. Is there something special about how a/c units work in the UK? Are they rated for lower ambient temperature or something?


It's less about the AC units, and more about the buildings. In places like Spain and Portugal, white painted outside walls, shading, and shutters on the outside of windows all help keep the heat out. My house has none of that, it just absorbs most of the heat from the sun.


Ah! That explains it. I'm sure if my house were in the UK, I would freeze to death in winter.


UK houses are totally fucking useless for cold weather too.


Maybe the new builds which are just brick skins, but there's plenty of older houses that are built like tanks (I live in one).


There are also plenty of older houses that are built very cheaply and not much good in heat or cold either (eg single-brick-skin Victorian terraces)...


OP probably has a small AC unit with a pipe that you send out of the window. We don't have proper AC over here in most houses, because it's normally only hot a few weeks out of the whole year.


Portable units have terrible efficiency. Unfortunately that is often the only option in apartments.


I'm a huge fan of wondering openly whether what people commonly do is best, or if it's just best for some people and everyone else does it because they haven't really thought about it.

For instance, the idea that an expensive, thousand plus dollar rackmount server is only able to run in a special place where the temperature is just right and might fail if a fan fails or the temperature is a wee bit higher than usual is utter bollocks.

I build my own rackmount servers that can run at 100º temperatures, even with fan failures. I know this because that's how I test them. I have the OS aggressively throttle on the most egregious failures, but fan failure is much less common in general when you're using 80mm Noctua fans instead of 40mm fans that have to run at many thousands of RPM to keep their zones cool.

So maybe people need to rethink the idea that datacenters have to be kept at 70º or below, and instead should insist on better thought out hardware.


Google datacenters run hot, as do many other cloud providers. They are happy to use high temperatures in their DCs to improve efficiency. The problem is that they have also removed the equipment to handle super-hot days for "efficiency."


Your servers can run at 100℃ ambient temperature? Oh, I guess you mean F?

Yeah, servers can run just fine at 40℃. Well… unless they fail because of it. :-) That is: The ones that don't fail work just fine.

If you have a DC with 10,000 of your servers and maybe 30,000 hard drives, what percentage of them will fail on any given day at 25℃ vs 40℃?

But it's not just your servers. Can your AC equipment/evaporators work at 40℃? And if your ACs start failing you could be looking at a cascading failure where it's actually more like 60-70℃, or just plain "a fire", in your DC.

Can your generators work?

And the answer also isn't "every component in my DC must be milspec extended temperature range". It's actually fine to build a DC in Iceland that's not specced for outside temperature of 50℃. In fact it would be a ridiculous waste to do so.

Google of course measures this. E.g. https://www.techrepublic.com/article/google-research-tempera...

But do keep in mind that 40℃ outside may mean 50℃ inside. Or indeed just 60℃ hotspots on the DC floor.

Hell, your network cables may not even be rated for 60℃. They usually aren't.

Your server may be fine with a 40℃ inlet, but the air going out may disconnect it by melting the network cable.


I would like to present AWS and Google Cloud Mumbai: heat wave and COVID wave simultaneously! 40 degrees centigrade is merely warm here.


Everywhere in the world you design for what the local conditions are.

I bet Mumbai doesn't mandate winter tyres in winter, right? Sweden does. If suddenly one winter you see -10℃ in Mumbai, would you appreciate Swedes mocking you for not being able to drive in a car not designed for it, with tyres not designed for it, on roads not designed for it, etc…?

Nothing in the UK is designed for 40℃. Buildings, the type of steel made for train tracks, ventilation in tube tunnels, the asphalt, the walls in the building, the windows (no double glazing in Mumbai, I assume?). I would expect everything in Mumbai is designed to handle high temperatures. But not cold.


And I would actually expect that it is equipped to handle 40-degree outside temperatures year-round (same for zones in the southwestern US - I mean 105-degree heat, because freedom units!). London didn't historically experience this kind of heat though, only briefly peaking around 37 degrees.


Who the heck measures electronics/ambient temps in freedom units? All the datasheets are exclusively in C. (40C is 104F, for the record.)


I'm curious whether all of those companies migrating to "the cloud" are thinking about these issues. It will only become more impactful, especially with power outages on the horizon.

The company I work at now completely does not care, at least judging by the response when I raised those concerns to management.


Unfortunate I guess, but I still remember having to pull an overnighter at work because there was a leak in the server room :)


Cooling seems to be one of the first things to creak in hot weather. Our fridges at home were struggling, and some of the local supermarkets had fridges out.

Maybe that's an obvious observation, but I would have expected them to have a little more operating margin right at the point you need them.


Ex HVAC tech here. Also was on a commissioning team for a Google DC in SE Asia.

Heat waves indeed kill equipment. The system is attempting to deal with a higher heat load on the home/conditioned space, while also having to operate with high(er) ambient temperatures.

High ambient temps -> High heat load -> warm(er) return refrigerant temps/high(er) refrigerant pressures -> less cooling for compressor / higher mechanical load on compressor -> Elevated power consumption as compressor is working harder -> higher power/heat levels stress electrical insulation & components.

The system struggles to keep up as the duty cycle is elevated as well. Putting a sprinkler next to the condenser (outside) unit is a hack.


> Putting a sprinkler next to the condensor (outside) unit is a hack.

However, water companies have been telling people to reduce water usage during the heat wave since demand is higher than usual, so this may not be a great idea.


Worth noting that most home fridges have no kind of indication that they can't keep up.

Your fridge, rather than being the 5C it should be to keep your meat safe to eat, might have been up at 12C. You wouldn't be aware (it still feels cold), but you'd end up eating possibly dangerous food.

I really wish fridges had an alarm in that case (i.e. the fridge has an indicator saying 'too hot, food is now unsafe to eat').


That alarm is available on the more decent models.

There are also stickers for inside your fridge that can indicate the temperature. There is also a variant for specific temperatures, like 0, 5 or 7 degrees, that changes colour if the temperature has risen above that point, giving a (non-reversible) indication your fridge has been too warm.

Edit:

Did a quick search for you: https://www.tiptemp.com/Products/Rising-Time-Temperature-Ind...


You can self solve this with a fridge thermometer, e.g. https://www.amazon.co.uk/Thermometer-Refrigerator-INRIGOROUS... or if you’re into that smart home biz, I also have an Aqara zigbee temperature and humidity sensor for automated alarms.

EDIT: sibling comments about irreversible temperature monitors are also a great idea I hadn’t thought of! Time to buy some of those too


Ours was making a strange noise, and we noticed things were wet when we took them out, which indicated it was struggling to keep temperature. Agree some kind of alarm or indicator is probably a good idea if it's not keeping up!


Wet things inside is actually an indication that the door isn't properly closed and sealing.


From a business point of view, if you're looking to move your infrastructure to the cloud, should you now be factoring in global warming?

If so, this will only play into the hands of those setting up data centres in the far north of the northern hemisphere (e.g. Iceland).


I don't think this really has anything to do with the cloud.

On-premise hardware still needs cooling, and is arguably harder to cool as there are fewer economies of scale on cooling infrastructure. Dedicated "bare-metal" machines are just in regular data centres so no difference to the cloud there.

I think data centre locations will still be chosen on two factors: distance to customers, and cost of energy. It's just that operators will be looking for cheap energy. Iceland is good because they have a lot of geothermal energy, not because it's cold.


This reads like:

"Oh, tanks by that foreign army are rolling into our cities - should we start thinking about a defense system?"

Global warming will affect every aspect not only of your business, but also of your life, maybe a little bit later when you are rich and can afford to live in a self-created bubble, but it will.

Yes, you should think about it. Hard. Now.


Very much - YES.


GCP sure does seem to have a lot of outages that span across multiple availability zones (and occasionally across multiple regions). It sure does seem like there is a disconnect between the expectations of isolation that they set and what they are able to deliver.

It's also interesting that the status page's attempt to spin the scope of impact actually makes it look worse, given that it was effectively a full-region outage (they said, "There is a cooling related failure in one of our buildings that hosts a portion of capacity for zone europe-west2-a for region europe-west2 ...").


The number of outages happening on GCP is mind-boggling. I don't know how people trust their business to Google. It's an ad company, not an infrastructure company. If you trust an ad company with your business, I guess that's on you.


Great point, let's trust a retail company instead.


Rest assured, if the agenda ever u-turns into "global cooling", google will follow with "heating related failure" reports.


What?



