Hacker News
Apache Benchmarks for Calxeda’s 5-Watt Web Server (armservers.com)
72 points by mariuz on June 24, 2012 | hide | past | favorite | 30 comments


There are lies, damned lies, and benchmarks.

The Xeon machine is running at under 15% utilization due to the gigabit Ethernet bottleneck, yet they're using the system TDP (that's the worst-case thermal spec you design your cooling around).

If you put a 10 gigabit NIC in there, the advantage would be closer to 2x performance/watt, since the Xeon would be processing 6.66x as many requests (1/0.15). This is also a last-generation Xeon - Ivy Bridge will be out by the time this CPU is.

That they're not just taking the power at the wall of both systems probably means doing so would tip the balance closer to the Xeon - and at-the-wall power is what you pay for, so that's what matters.

In addition, their choice of 4 GB for the platform is worrying; it's almost certainly a 32-bit machine, which rules out an almost perfect use case for it: a memcache server.

ARM in servers is coming, to be sure. 64-bit ARM will bring unprecedented gigabytes of RAM per watt. But toe to toe on CPU-heavy tasks, they will still lose out badly on performance/watt at full load.

At least from the numbers in the post, I'd stick with virtualisation.


Re: RAM: the platform (if it's real) is a little more interesting - they have cards with 16GB RAM per four 4-core CPUs (SoCs). These cards are meant to be used together by the tens, connected with a high-speed switched fabric on the host board, so it's something like a server-cluster-on-a-board design. http://www.calxeda.com/technology/products/energycards/quadn...


Now that's interesting. Why on earth aren't they leading with it?

I can understand these kinds of faux benchmarks and whitepapers to convince pointy haired bosses to switch between mature, mainstream technologies.

But the early adopters of fringe technology like this are going to be companies with specific needs, and those making the decisions are going to be highly technical. Not to generalize, but they (me included) love details and possibilities, and abhor marketing puffery.

Calxeda should be sketching out novel ways to use the interconnect bandwidth to solve hard problems, with power efficiency x86 can't touch. Not running ab against an ARM core and misinterpreting the results. Sheesh.


Well, there is a promising HP project, Moonshot, which was going to build on this platform, but it seems they recently decided to switch to an Intel solution. http://www.engadget.com/2012/06/20/project-moonshot-take-two...


Calxeda: five XAUI links (50 Gbps) to a 4-core 1.1 GHz 5 W SoC. 200 ns per hop.

A ToR switch will slow you down to 2000-3500 ns.

Tilera TILEPro64: 64 cores at 866 MHz, 22 W, 1.7 Tbps core-to-core, 46 ns for an L1/L2 miss found in an adjacent core, four XAUI links (40 Gbps).

When compared to Tilera, Calxeda has a high bandwidth-to-compute ratio. But 200 ns core-to-core is not as fast as 46 ns.

Tilera also has a higher cores-per-watt ratio.


I think you misread that page. I think they mean that each SoC has one memory slot, which supports a max of 4 GB of RAM. So that 16 gigs is the total amount for all four CPUs. This still strongly points towards a 32-bit design.


Doesn't 16 GB per 4 CPUs mean 4 GB per CPU? Sorry for my English.. :)


16 GB shared among 4 CPUs is different from 4 CPUs which each access 4 GB of private memory.


The page says:

Support for 16GB DDR3/3L memory via four (4) mini-DIMM sockets – one socket dedicated per SoC supports up to 4GB of ECC memory (supplied separately)

Which really doesn't sound as if the memory is shared, to me.


What is the performance difference between 32 bit and 64 bit for memcached? I know redis has a lot of overhead with the 64 bit version [1], does memcached suffer the same fate?

[1] http://serverfault.com/questions/221695/deploying-memcached-...
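Much of the 64-bit overhead in caches like this comes from per-item metadata being pointer-heavy (LRU links, hash chain pointers, and so on), so doubling the pointer size inflates the fixed cost of every item. A back-of-envelope sketch — the field counts below are illustrative assumptions, not memcached's actual struct layout:

```python
def item_overhead(pointer_size, num_pointers=5, fixed_bytes=16):
    """Rough per-item metadata cost for a cache whose item header
    holds num_pointers pointers plus some fixed-width fields.
    Field counts are illustrative, not memcached's real struct."""
    return num_pointers * pointer_size + fixed_bytes

overhead_32 = item_overhead(4)  # 32-bit build: 36 bytes of metadata
overhead_64 = item_overhead(8)  # 64-bit build: 56 bytes of metadata

# For small values the pointer bloat dominates: with a 64-byte
# value, the 64-bit build spends a larger fraction of RAM on
# metadata instead of cached data.
value = 64
waste_32 = overhead_32 / (overhead_32 + value)
waste_64 = overhead_64 / (overhead_64 + value)
```

The flip side on a 32-bit box is the ~4 GB address-space cap, so you'd be running one instance per SoC anyway.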


They don't need to beat Intel "at the high-end" for "CPU intensive tasks". They just need to find their own market that requires low energy usage over high performance, and of course better performance/Watt, which I do think ARM beats Intel in.

But Cortex A9 was never really meant to be a server CPU, so I would wait at least until Cortex A15, or even better for the 64 bit chips that are coming in 2014 (Cortex A15's successor, Nvidia's Project Denver, etc).


I have to say a big [citation needed] to the claim of ARM beating Intel's high end chips on performance per watt, at least on general workloads.

I think it's common to extrapolate Atom vs ARM to Xeon vs ARM in HPC, without thinking through the implications. We may well get higher performance/watt for single threads under ARM - I'm not disputing that, especially for integer work.

However, Amdahl's law is going to rear its head. In the same machine, a higher number of lower-performance threads is going to cause lock contention. You'll also have to split computations over more boxes, since the absolute performance of an Intel server will remain far higher (by 2014, we're talking 64-core/128-thread Haswell). Both of these are likely to be a massive tax on performance.

To fight this, ARM performance per core is likely to see a substantial rise, both in clock frequencies and in single-core complexity. However, this will work directly against the two things that make ARM performance/watt so impressive currently.

Also, Intel's entire company is built around making those 100-watt-scale processors fast. They really stumbled entering the Atom market, both because of a weak design (the chipset drew more power than the CPU itself!) and because of a lack of commitment (using 2-4 year old process nodes).

I think we're likely to see similar teething pains with companies trying to enter the server market for ARM. The institutional knowledge just won't be there. Making a cache architecture that effectively feeds 64 cores? Way different from improving power drain on a mobile CPU for the seventh generation. I expect it will be at least a few generations before design teams are fully up to speed.

Remember that AMD is reasonably well funded, also focused around server CPUs, and often stumbles. AMD is competent; Intel just makes them look incompetent by comparison.

I'm not saying we won't see certain workloads that are better off under ARM; memcached and static HTTP serving are both likely to do well, since they're effectively just shuffling bits around, aren't particularly CPU intensive, and are embarrassingly parallel. But I believe they'll turn out to be the exception, not the rule.

Which is to say, there's nothing magic about ARM that will let them beat x86 at the high end. They'll have to fight for it, and against Intel on their own turf no less.


"There's nothing magic about ARM" is right. Atom, for example, seems competitive as a low-power architecture now that Intel's really trying to get into the mobile market. The open question is whether cheap power-focused cores, from any maker, can compete against big server chips, and I think they do have a niche.

Low-power server makers probably admit, in their hearts, that the workload their stuff works best with is specialized. CPU-munching apps in scripting languages, or very CPU-intensive data work (full-text searches, say), are not what they're good for. Static content, memcache, and boxes that basically broker between other nodes and do very little 'thinking' themselves are candidates. As some Googlers pointed out, any work where the CPU causes much of the user-visible latency is right out.

I'd add that good low-power-CPU servers don't just look like regular servers with a low-power CPU slotted in. You need low-power storage, i.e., Flash. You probably want lots of cores to amortize the energy cost of memory, etc., so that means it works best with a future uarch like Cortex-A15 that supports that. You want low-power memory. Then servers are probably easily sub-1U, so you get a blade-like physical layout, with some resources shared among nodes.

Calxeda probably would rather not hear this, but you might have to cut price, not just improve power use and density, to secure ARM a niche against big, fast Intel chips in the DC. I think that can be done over time, because the premium Intel charges on top of chip manufacturing cost is a lot, and ARM IP is relatively cheap. But it may make it hard for Calxeda to make back their initial R&D costs unless 1) some early-adopter customers pay a big premium (possible--Facebook, you into this?), 2) the market eats it up with surprising speed and soon everyone's got some ARM nodes in their racks -- seems unlikely, or 3) investors put in enough to outlast a long, slow growth period (and I could see an ARM manufacturer like Qualcomm or Nvidia doing that with their own ARM architectures, but that may not help Calxeda or its investors).

Any comparison's a stretch, but consider that ARM consumer devices aren't only more mobile than high-end computers, they're cheaper too.

I look forward to eating these words in a few years.


This is a very poor (and misleading) comparison.

First, the author admits that the Gigabit ethernet link was the bottleneck for the Xeon system, capping it at a mere 15% CPU utilization. However, he goes on to use published TDP numbers as the system power draw. TDP numbers are only approached under the most demanding of loads and 100% CPU usage, which this clearly was not. At bare minimum, the author needs some sort of power measurement device to make a reasonable comparison.

Second, serving static web pages is not a difficult task. The Xeon system is overkill for such a task, so of course it will be less energy efficient. In the real world, the extra capacity on this server could be used to perform more difficult tasks or run other processes.

Third, if we assume the Xeon server would scale linearly without being bottlenecked at 15% CPU usage by the Gigabit ethernet link, then it would be serving approximately 46,300 pages per second, or almost 8.5X that of the ARM server. Take into account that the actual power consumed in the real world will be less than the TDP, and the efficiency gap between the ARM and Xeon servers becomes very narrow. Even if the Xeon TDP numbers are accurate, the new margin is still less than 2X.
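The linear-scaling estimate above can be sanity-checked with quick arithmetic. A sketch of it, where the concrete inputs are assumptions read off the post and this comment, not measured values:

```python
xeon_measured = 6945   # pages/sec at the 1 GbE ceiling (assumed from the post)
xeon_util = 0.15       # reported CPU utilization at that ceiling
arm_rate = 5450        # ARM server pages/sec (approximate)
xeon_tdp = 95.0        # Xeon TDP in watts (worst case, assumed)
arm_power = 5.0        # ARM SoC wattage from the headline

# Linear extrapolation to 100% CPU
xeon_scaled = xeon_measured / xeon_util    # ~46,300 pages/sec
speed_ratio = xeon_scaled / arm_rate       # ~8.5x the ARM server

# Perf/watt even while charging the Xeon its full TDP
xeon_ppw = xeon_scaled / xeon_tdp
arm_ppw = arm_rate / arm_power
efficiency_gap = arm_ppw / xeon_ppw        # roughly 2x, nowhere near 15x
```

And that still bills the Xeon at TDP; a wall-socket measurement would shrink the gap further.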

Finally, the Total Cost of Ownership calculations in the conclusion are based (as the author admits) on the flawed benchmark numbers. If they can only achieve a 77% TCO reduction by completely handicapping the Xeon system, then the ARM system may not be that advantageous. Especially when you consider that without the bottleneck, or with a more demanding workload, you might need as many as 8 times as many ARM servers to replace the performance of the Xeon server, which isn't going to help real-world TCO.

I'm a big fan of the ARM platform, and I'm very excited to see ARM servers enter the marketplace. However, false benchmarks like these aren't going to help anyone. ARM servers will certainly have their place for a lot of different workloads, but to suggest that they are 15X more energy efficient and have a 77% lower TCO based on these numbers is disingenuous.


We've all savaged Calxeda's blogpost now, but it's worth noting the positive sides too.

• It looks like the real world power savings will be something like 66%. I know hosting facilities that base your bill on your watts. That looks like a powerful incentive.[1]

• Virtualization is nice (#1), but if you aren't a big enough fish to own all the slices you are at the mercy of your box mates and the financial pressures of your hosting company. If you have your own ARM server you get to live in a predictable world.

• Virtualization is nice (#2), but isn't there an embargoed Xen interdomain security flaw right now? How long have bad people known about it? Is your hosting provider in on the loop to get the fixes before they become public?

• For small sites, it doesn't matter what the efficiency of a Xeon at full load is. You won't get there with a dedicated machine, and you don't want to be on a virtual server that goes there.

• It looks like the boards actually have 10 Gbit interfaces. The 1 Gbit limit was either architectural, to get to the client machines, or deliberate, to keep the Xeon in the same ballpark. Either way, it is reasonable for sites that aren't going to have more than a 1 Gbit drop anyway.

• 48 of these quad core ARM systems fit in a 2U box.

I'd much rather have a dedicated ARM than the tiny slices of Xeons that I use now. I don't need a random performance problem brought on by anyone other than myself.

EOM

[1] It may be that if your workload is network bound you won't be offered wattage based pricing. That might eliminate this savings for the people that could best use it.


I think I might have invested $19 in a Kill-a-watt meter before I published that benchmark.

Using TDP for the Xeon that is operating at a small fraction of capacity is going to mislead. Leaving out the disk drive (let's say 5 watts) helps make a great multiplier, but is divorced from reality.

Their Performance/Watt number in the table is, I think, actually transactions per unit of energy. The watt multiplier would be about 19.


Another thing to consider beyond simply the fact that your particular benchmark ran into a network bandwidth bottleneck: web server benchmarks should not be conducted using ApacheBench until Apache makes AB multi-threaded.

Use a multithreaded benchmark tool such as weighttp ( http://redmine.lighttpd.net/projects/weighttp/wiki ). weighttp is essentially identical in behavior to ApacheBench, but with a -t argument to specify the number of threads.

You can approximate the same behavior in ApacheBench by kicking off multiple AB instances in parallel, but then it is up to you to aggregate the results.
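The aggregation itself is straightforward if you parse each instance's summary: throughput adds across instances, while mean latency should be weighted by request count. A minimal sketch, with hypothetical numbers standing in for parsed ab output:

```python
def aggregate_ab(runs):
    """Combine results from N concurrent ApacheBench instances.

    Each run is a dict with 'requests', 'req_per_sec', and
    'mean_latency_ms' as parsed from ab's summary output.
    Throughput sums across instances; latency is averaged
    weighted by request count.
    """
    total_reqs = sum(r["requests"] for r in runs)
    total_rps = sum(r["req_per_sec"] for r in runs)
    mean_lat = sum(r["mean_latency_ms"] * r["requests"] for r in runs) / total_reqs
    return {"requests": total_reqs,
            "req_per_sec": total_rps,
            "mean_latency_ms": mean_lat}

# Hypothetical output of four parallel `ab -n 10000 -c 50` runs
runs = [
    {"requests": 10000, "req_per_sec": 1500.0, "mean_latency_ms": 33.0},
    {"requests": 10000, "req_per_sec": 1450.0, "mean_latency_ms": 34.5},
    {"requests": 10000, "req_per_sec": 1520.0, "mean_latency_ms": 32.8},
    {"requests": 10000, "req_per_sec": 1480.0, "mean_latency_ms": 33.7},
]
combined = aggregate_ab(runs)
```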


In the next benchmark, use SSL/TLS connections, and we'll see..


An actual Xeon is not going to draw its TDP in power under a simple workload like serving static files. Even if you get it up to 100% CPU utilization, it will probably not be drawing its TDP. Modern Intel processors have a bunch of mechanisms built in for managing power draw (and similarly, heat output) - they can clock up and down in response to workloads, bringing a single core up above standard clock to speed up single-threaded loads and bringing all the cores down when the machine is doing less intensive things (like running a message pump or waiting for socket connections).

If anything, I'd expect static file serving on a Xeon to produce no more than say 40% of TDP. If you're lucky, serving up all the static files will load all the cores fairly evenly and get the CPU close to '100%', but none of the floating point or integer logic units will be remotely loaded - it'll be almost exclusively branch/copy work, which isn't going to put much load on the CPU itself or draw much power or generate much heat. It's also going to be spending tons of time waiting (on the NIC, etc) instead of actually doing computation, which can generate a lot less heat if the waits are done using the modern busy wait instructions instead of a spin loop.

EDIT: A comment in the OP provides a conservative estimate of 43W for the actual draw of the entire Xeon-based system (not just the CPU) in the benchmark. He also points out that it has more RAM (which will increase power draw).


Serving static files is a kind of thing that could be done with very little CPU involvement. Once the file is cached (or memory-mapped), just point the NIC processor to it and tell it to pipe the memory block through the network connection and head off to nobler jobs.

BTW, are there NICs this clever around?


Again with these silly benchmarks. Let's just throw numbers out there and see what people can make from them.

Obviously whoever ran this benchmark doesn't know what the real bottleneck is here. Hint: it's not the CPU.

6k req/sec is nothing to be proud of. A gevent Python web server can handle 10k requests/sec, and a vanilla nginx can do 24k reqs/sec on a commodity machine.


Really, "ab" as a web benchmark? I had no idea anybody used this anymore. How about a benchmark tool that at least supports HTTP/1.1?


Aside from the questionable benchmark, can you actually buy one of these right now?


I'd ask: has anyone seen this machine in real life? It's funny they're still promoting it with unpopulated PCBs :) http://www.calxeda.com/wp-content/uploads/2012/05/Capture387...


It's more energy efficient without all those pesky components :)


We saw similar claims for Atom-based servers.

If this were accurate, Google would have adopted it immediately; the savings on their power bill would be astronomical.

The only time Atom and ARM are 1500% more efficient is at idle.


http://research.google.com/pubs/archive/36448.pdf

"So why doesn’t everyone want wimpy-core systems? Because in many corners of the real world, they’re prohibited by law—Amdahl’s law. Even though many Internet services benefit from seemingly unbounded request- and data-level parallelism, such systems aren’t above the law. As the number of parallel threads increases, reducing serialization and communication overheads can become increasingly difficult. In a limit case, the amount of inherently serial work performed on behalf of a user request by slow single-threaded cores will dominate overall execution time."
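The limit case the paper describes is just Amdahl's law. A quick sketch of why a fleet of wimpy cores hits a wall (the 5% serial fraction is an illustrative assumption, not a figure from the paper):

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Amdahl's law: speedup from n_threads parallel workers when
    serial_fraction of the work per request is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# With even 5% serial work per request, throwing cores at the
# problem saturates quickly: 64 wimpy cores give ~15x, and no
# number of cores can ever exceed 1/0.05 = 20x.
s64 = amdahl_speedup(0.05, 64)
s_inf = 1.0 / 0.05
```

And since each wimpy core also runs the serial portion slower than a brawny core would, the absolute per-request latency floor is higher, which is exactly the paper's point.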


I imagine most computers in the world are idle most of the time, so that figure would make a _very_ significant difference if it were true.


Ok, where can I get one?


Maybe Intel paid them to publish these numbers? Otherwise I have no idea why they were stupid enough to post it on the net.



