Inside Amazon's Cloud Computing Infrastructure
1sockchuck writes: As Sunday's outage demonstrates, the Amazon Web Services cloud is critical to many of its more than 1 million customers. Data Center Frontier looks at Amazon's cloud infrastructure, and how it builds its data centers. The company's global network includes at least 30 data centers, each typically housing 50,000 to 80,000 servers. "We really like to keep the size to less than 100,000 servers per data center," said Amazon CTO Werner Vogels. Like Google and Facebook, Amazon also builds its own custom server, storage and networking hardware, working with Intel to produce processors that can run at higher clockrates than off-the-shelf gear.
What Does This Mean (Score:5, Interesting)
working with Intel to produce processors that can run at higher clockrates than off-the-shelf gear.
What does this mean? They have custom chips? Custom mods at the chip fab level? Or are they taking advantage of designed-in features that are locked out for normal chip users? Are they simply over-clocking? Or are there features that can be unlocked with money?
Re: (Score:3, Interesting)
They must get chips that have been tested for overclocking.
Re:What Does This Mean (Score:5, Insightful)
Probably means they buy in bulk, so they get to pick the more overclock-able chips.
Say a Core i7 xxxx runs at 3.0 GHz and an i7 yyyy runs at 3.4 GHz. They make a batch of i7s and test them at 3.4 GHz. Some barely pass QC and are sold as retail i7 yyyy. Some fail at 3.4 GHz, so they're marked as i7 xxxx at 3.0 GHz. Some pass at 3.4 GHz with flying colors; these are the ones overclockers want the most. Retail buyers like us don't get to pick which ones we get when we buy the i7 yyyy, but Amazon might.
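To make the binning idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the two bin names and clocks are the placeholders from the comment above, the test results are made up, and real QC involves far more than one frequency check.

```python
# Hypothetical speed bins taken from the comment: "yyyy" rated 3.4 GHz, "xxxx" rated 3.0 GHz.
BINS_GHZ = [("i7 yyyy", 3.4), ("i7 xxxx", 3.0)]  # highest bin first

def assign_bin(max_stable_ghz):
    """Return the highest bin whose rated clock the chip passes, or None."""
    for name, rated in BINS_GHZ:
        if max_stable_ghz >= rated:
            return name
    return None  # passes neither bin

def headroom(max_stable_ghz):
    """Margin (GHz) between a chip's max stable clock and its bin's rated clock."""
    name = assign_bin(max_stable_ghz)
    if name is None:
        return None
    rated = dict(BINS_GHZ)[name]
    return round(max_stable_ghz - rated, 2)

# Made-up max stable clocks (GHz) for one batch of dies.
batch = [3.41, 3.95, 3.2, 3.05, 4.1, 2.8]
for chip in batch:
    print(chip, assign_bin(chip), headroom(chip))

# A bulk buyer allowed to choose would take the "yyyy" parts with the most headroom.
cherry_picked = sorted(
    (c for c in batch if assign_bin(c) == "i7 yyyy"), key=headroom, reverse=True
)
print(cherry_picked)
```

The retail buyer gets a random "yyyy" part from the batch; the sorted list is what a customer with pick of the batch could ask for.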
Re: (Score:2)
I remember the last time I overclocked a computer. I had a 300A running at 924 and stable. I used two Peltier pads and encased the heat sinks to pipe in water. It was a really fun build. Before that I had a 233 MMX up to 405, stable.
That was years ago. Good memories.
Re: (Score:3)
But of course these are all Xeon processors. Those normally have a lower clock rate the more cores the chip has, to limit heat density. The 10-core processors run a bit more than half the speed of the 2-core (IIRC, but I could be way off). You don't need to overclock these in the way you do enthusiast parts, when they're underclocked to begin with. You do need prodigious cooling.
I has more better questions (Score:3, Informative)
“Every day, Amazon enough new server capacity to support all of Amazon’s global infrastructure when it was a $7 billion annual revenue enterprise,” said James Hamilton, Distinguished Engineer at Amazon, who described the AWS infrastructure at the Re:Invent conference last fall. “There’s a lot of scale. That volume allows us to reinvest deeply into the platform and keep innovating.”
Did they use AWS for translation on this paragraph? How do you have "a lot of scale"? One can scale up or down, but is this like a computer hokey pokey? Scale is a verb!
Really, I skimmed this one pretty lightly. It looks like a marketing article, not a technical article. Buzzwords aplenty, so I'm guessing your question is answered by "marketing".
Re:I has more better questions (Score:5, Funny)
Did they use AWS for translation on this paragraph? How do you have "a lot of scale"? One can scale up or down, but is this like a computer hokey pokey? Scale is a verb!
Any verb can be nouned.
Re: (Score:1)
> Scale is a verb!
No. Scale was originally both a verb and a noun.
Scale as a verb (with respect to sizing) actually came later.
Re:I has more better questions (Score:5, Funny)
Scale is a verb!
As I weigh this fish scale on my scale, before cleaning the scale off my kettle, while listening to my neighbor play scales, I wonder about the scale of your intoxication: on a scale of one to potato, how high are you right now? Oh well, I'm off to work: I was hoping for better, but it pays scale.
Intel will make you custom chips (Score:2)
They are expensive and you have to buy a lot, but they'll do custom. Oracle also buys custom Intel chips. There are limits to what they'll customize, obviously writing a whole new ISA wouldn't be possible (at least not without a shit ton of resources) but they can customize things like cache sizes and configurations.
In terms of clock rate I imagine what Amazon is doing is more or less having Intel raise the TDP for the chips and run them harder. All the Xeons cap out at about the same TDP for the high end, re
Re: (Score:1)
Or they want a slightly special version. Say the CPU supports 30 different features across the entire line. For cloud services maybe Amazon only really cares about 15 of them. So they could ask Intel to permanently disable the other 15 features, which saves power, and use that saved power to run the chips a bit faster without burning them up. I am sure that if you buy enough at the right price they would do it. It's just a question of price and volume.
Re:What Does This Mean (Score:5, Insightful)
They are building custom hardware and a lot of it so they get a bit of special treatment from Intel.
You engineer the thermal paths and better control how you get rid of heat. You tweak the board layout for the best performance of the chipset and CPU and run closer tolerances on voltages and clock frequencies while keeping it small. Buying in bulk also lets you customize the chipset and CPU packaging to get you better performance/watt and higher density by eliminating all the "fluff" stuff you really don't want on the cloud machine. Who needs all those USB controllers, PCI-e buses, and sound cards you find in your average server chassis in a high density server farm that just take up space and suck power? Just give me a couple of NICs, a SATA connection and a serial console and a way to reset an individual system and I have what I need to stand up an OS and grant somebody external access to it.
Re: (Score:2)
You fail to address the question. We are not talking about the board or chipset. The quote was specific:
working with Intel to produce processors that can run at higher clockrates than off-the-shelf gear.
We are talking specifically about the CPU. Is it "custom"? Or is it just a high-end "overclocked" chip that I could buy if I had the cash?
Re: (Score:2)
sata? ok for a small DOM for the os. pci-e based storage / pci-e for fiber channel / other stuff is needed.
The compute nodes need a os + fast network / storage links.
Re: (Score:2)
Basically, if you commit to buying a lot of chips, Intel will fab you modified versions of their existing product lines.
Remember back in the early days of Intel Macs, and Apple managed to get Intel chips that supported hardware virtualization, even th
When will AWS get IPv6 ability? (Score:2)
Re: (Score:2)
Probably when they don't have a choice
IPv6 costs more money to run. The packets are larger, so there is an increase in power consumption and bandwidth demand.
Re: (Score:3)
I like your comment, it is quite funny, but to address the question:
The packets are larger (more bits) so take longer to transmit, and more memory to store. Also, ASICs are built for IPv4, they don't work for IPv6, so much of IPv6 traffic is done in CPU rather than ASICs which is less efficient in power usage.
I doubt the power difference is terribly high, but at an Amazon level, it would likely be noticeable.
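For a rough sense of how much larger the packets are: the minimum IPv4 header is 20 bytes and the fixed IPv6 header is 40 bytes. The Python sketch below works out what that 20-byte difference means per packet. The payload sizes are arbitrary illustrative values, and it ignores IPv4 options, IPv6 extension headers, and link-layer framing.

```python
IPV4_HEADER = 20  # bytes, minimum IPv4 header (no options)
IPV6_HEADER = 40  # bytes, fixed IPv6 header (no extension headers)

def header_share(header_bytes, payload_bytes):
    """Fraction of the IP packet taken up by the IP header."""
    return header_bytes / (header_bytes + payload_bytes)

def size_increase(payload_bytes):
    """Relative growth of the whole packet when moving from IPv4 to IPv6."""
    return (IPV6_HEADER - IPV4_HEADER) / (IPV4_HEADER + payload_bytes)

# Illustrative payload sizes: a tiny packet, a mid-size one, a near-full Ethernet frame.
for payload in (40, 512, 1460):
    print(
        f"{payload:5d}-byte payload: "
        f"IPv4 header {header_share(IPV4_HEADER, payload):.1%} of packet, "
        f"IPv6 header {header_share(IPV6_HEADER, payload):.1%} of packet, "
        f"packet grows {size_increase(payload):.1%}"
    )
```

The relative cost is largest for small packets and shrinks toward a percent or so for full-size ones, which fits the guess that the difference is small per packet but could add up at Amazon's volume.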
Re: (Score:2)
Yeah, it is funny how little you know of security APK, and that you feel the need to troll everything I post acting like you won, when you just don't get security in the least.
Re: (Score:2)
Enhance usually means increase, unfortunately as it is a security nightmare, destroy might be a more appropriate word to use.
Only kings refer to themselves in the third person. Do you think you are a king?
Re: (Score:2)
Amazon doesn't exactly "not turn a profit"; they dump all the profit they earn into growth and research, so that they have no taxable profit. It is an optimization technique, not really an "OMG, we aren't making a profit" type of issue.
Re: (Score:1)
As someone that participated in their beta test, while Amazon might be ready for IPv6, the apps most of their customers run are not. For example, we couldn't get Tomcat to accept IPv4 connections on Linux when IPv6 is enabled. It binds to the IPv6 port by default, but not to the IPv4 port. I don't think there's a way to get it to bind to both. We have a support contract with Kippdata, and they said they didn't think it was possible.
Re: (Score:2)
Actually, I don't think it's Tomcat that made that decision. From what I recall, the JVM itself defaulted to IPv6 on certain releases.
Re: (Score:2)
Is it possible to run two instances of Tomcat, one binding to each "interface"?
Re: (Score:2)
It's possible to run as many Tomcats as you like, each and every one of which can listen on either or both protocol types and on multiple ports.
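As an illustration of one listener serving both protocol types, here is a minimal Python sketch. It is not Tomcat or JVM configuration, just the general dual-stack socket idea using the standard library (Python 3.8+); the helper name `open_listener` is made up for this example, and whether dual-stack works depends on the OS and its settings.

```python
import socket

def open_listener(port=0):
    """Open one listening socket that accepts IPv6 and, where possible, IPv4.

    Falls back to an IPv4-only socket if dual-stack is unavailable.
    port=0 asks the OS to pick a free port.
    """
    if socket.has_dualstack_ipv6():
        try:
            # "::" is the IPv6 wildcard address. dualstack_ipv6=True clears
            # IPV6_V6ONLY so IPv4 clients are accepted on the same socket.
            return socket.create_server(
                ("::", port), family=socket.AF_INET6, dualstack_ipv6=True
            )
        except OSError:
            pass  # IPv6 exists but cannot be bound here; fall back to IPv4
    return socket.create_server(("0.0.0.0", port), family=socket.AF_INET)

if __name__ == "__main__":
    with open_listener() as srv:
        print(srv.family, srv.getsockname())
```

Running two separate listeners, one per address family, is the other option raised in the thread; the single dual-stack socket just avoids the second process.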
All you need is enough RAM and CPU and network bandwidth to hold them all.
Re: (Score:1)
Although it doesn't seem to have hit the news, today was pretty blinky in Amazon Virginia as well.
AWS' problem is not the infrastructure... (Score:1, Informative)
It's the fact that they only focus on infrastructure. IaaS is their bread and butter and it's what keeps them running and going with companies that don't know anything better than servers and storage, to migrate their workloads (the peaks and valleys kind) into the cloud to save money and be agile.
The next generation is a step beyond that, and it's what Microsoft, SalesForce and Google are building for -- PaaS. The idea that you manage fleets of servers is an archaic one, and the next generation will be wri
Re:AWS' problem is not the infrastructure... (Score:4, Insightful)
"It's the fact that they only focus on infrastructure"
You are not looking carefully enough.
"In that sense, Microsoft is far, far ahead of the others"
You know what happens to the ones too far, far ahead of others? In the future, people raise statues honoring them, but they usually die poor and/or too young.
It's quite funny you talk about Microsoft since, back in the day, Novell was the one far, far ahead of Microsoft on PC-based client/server deployments. And know what? Microsoft not only didn't give a damn but they mocked Novell as too complex. And they were right: most people weren't ready for Novell forests and inherited/nested permissions, and Windows for Workgroups was everything they could cope with. Then they grew up to "classic" domains, still a tad simpler than Novell while still being "good enough" for their customer base (in fact, being not only "good enough" but "top notch", since for most of them it was all they knew; in practical terms it was Microsoft itself "educating" them).
Eventually, Novell died and, who would have thought it!? the very next day Microsoft came up with their new and shiny Active Directory domains that were basically what Novell had been doing for ten years: now, somehow, that wasn't "too complex" anymore but the only true way.
I'd say Amazon is on exactly the same track today: on one hand, most people, as you say, are not ready yet for higher abstraction levels like PaaS; IaaS is good enough and growing strongly. On the other hand, the PaaS market is far from mature enough: code written against any public API today is guaranteed to need rewriting even before the provider gets to declare it non-beta.
And there's even more: it's said that in the gold rush, the only ones consistently making money were the shovel shops, not the miners. Nowadays, the "hardware store" is Amazon, and it is the people building on top of AWS who are taking the real risks of doing business. And Amazon is not just watching time go by: a few years back they offered pretty simple virtual machines; now they offer quite a complex landscape with databases, routing, DNS, load balancing, tiered persistent storage... They are the Microsoft of today, mocking the ones too far, far ahead while, at the same time, cultivating their own customer base to make them ready for their future products and services.
Re: (Score:1)
He is completely wrong. AWS is far, far ahead of Microsoft with PaaS; it's not even comparable because AWS is so far ahead. Anyone who has used both knows this. He read a page or two on EC2 on AWS and thinks that is the only thing they have; it's about 5% of what they offer.
He is obviously an MS paid shill hoping no one calls him out. Azure works fine, at least from what I hear it does now, but back when I was using it they were having serious problems. Went to Rackspace, similar to Azure (his comments wo
Re: (Score:1)
I think that AWS' IaaS picture is more complete than Microsoft's, no doubt... as for deprecating APIs well, I'll have to put my tin foil hat on that because since .NET 1.0 they have managed somehow, to maintain most of their APIs with little deprecation. I don't imagine it would bode well for their business if they deprecated it for Azure, but you're free to believe that. On the point of PaaS being far from mature enough, I'd likely agree; except if you look at modern startups, most of those are written in
Re: (Score:2)
"However for new development efforts where we look to write in a microservice architecture, then AWS is simply not an option and I'm looking at Apache Mesos, Heroku, Service Fabric and AppEngine. Now you may disagree with"
Not at all. Either I didn't explain myself well enough or you misunderstood. All these are well and good, but I bet you'll either fail in your next application (and therefore it doesn't matter) or you'll have to rewrite it in the not so distant future because one of your Mesos, Heroku, F
Re: (Score:1, Interesting)
It's a matter of risk vs reward. Yes, I might be locked into a platform but at the level I develop, MS and other enterprise cloud vendors can't just arbitrarily raise the price. There are enterprise agreements that have liabilities, timelines, penalties and a lot more in order to ensure that there aren't runaway costs. I know, because I've negotiated them with both AWS and Microsoft. Funny thing is, AWS does not agree to terms for large organizations that are any different for a startup, and that's great fo
Re: (Score:2)
The nightmare comes when you try to figure out which of your API vendors has brought your application down and you are left carrying the can because a problem with the billing system left you with only 100,000 API calls instead of the 1,000,000 you expected. Still, it could be fun having the accounts dept on call to respond to outages.
How about an editor at Data Center Frontier? (Score:2)
What AWS outage demonstrates .. (Score:3)
I thought the outage demonstrated the relative unreliability of Amazon cloud Services. What are the legally binding terms of services that AWS provide in relation to uptime.
Re: (Score:2, Insightful)
I thought the outage demonstrated the relative unreliability of Amazon cloud Services.
Incorrect. What was demonstrated was the inability of AWS customers to design fault tolerant systems. Any system that cannot tolerate any downtime should be multi-region.
Re:What AWS outage demonstrates .. (Score:5, Insightful)
Well, it's their second major outage in the ~10 years of AWS. Far better than any in-house IT department I've ever seen.
Re: (Score:2)
False. They have had multiple outages in the last 5 years:
On April 20, 2011, some parts of Amazon Web Services suffered a major outage.
On June 29, 2012, several websites that rely on Amazon Web Services were taken offline
On October 22, 2012, a major outage occurred
On December 24, 2012, AWS suffered another outage
On September 20, 2015, AWS suffered another outage
Some of these were limited to one day, some were multiple days.
Re: (Score:1)
"What are the legally binding terms of services that AWS provide in relation to uptime."
https://aws.amazon.com/s3/sla/
Re: (Score:2)
I tend to think that it's not a question of their unreliability but the inherent complexity of providing high availability and scale that works 100% of the time.
As a consultant, I love AWS/Azure/O365 outages. They bring most customers back to reality with regard to the infallibility of "the cloud" and to the exponential increase in complexity required when chasing the "never goes down" dream.
If those guys, with unlimited money and unlimited talent, can't make their systems not have outages, then some ran
What does this mean? (Score:2)
“Every day, Amazon enough new server capacity ..."? Editors at datacenterfrontier.com please!!!
Aww you didn't build out a Multi AZ solution? (Score:2)
You didn't build out a Multi AZ solution for your critical app? You relied on AWS services for critical load balancing and fail-over? You shoved everything into US-EAST-1 where it can sometimes take 5 minutes for a reboot? You're doing it wrong.
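The arithmetic behind the multi-AZ argument can be sketched in a few lines of Python. This is a back-of-the-envelope model, not AWS's numbers: the 99.9% per-zone availability is an assumed figure, and the formula assumes zones fail independently, which is optimistic if everything still depends on one shared load balancer or control plane.

```python
HOURS_PER_YEAR = 24 * 365

def combined_availability(per_zone, zones):
    """Probability that at least one of `zones` independent zones is up."""
    return 1 - (1 - per_zone) ** zones

def expected_downtime_hours(availability):
    """Expected hours per year with no zone available."""
    return (1 - availability) * HOURS_PER_YEAR

# Assumed per-zone availability, purely for illustration.
PER_ZONE = 0.999

for zones in (1, 2, 3):
    a = combined_availability(PER_ZONE, zones)
    print(f"{zones} zone(s): availability {a:.7f}, "
          f"~{expected_downtime_hours(a):.4f} h/year down")
```

Under these assumptions each added zone cuts expected downtime by roughly three orders of magnitude; correlated failures in real deployments make the gain smaller, which is the point about not leaning on a single region's services for fail-over.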