95178099
submission
kipperstem77 writes:
The Top 500 supercomputer list is in, but the rankings look weaker than ever despite strong growth. As Timothy Prickett Morgan at The Next Platform writes: "Much will be made of the geographic distribution of the machines on the Top 500 list, and if this leads to actual concern about increasing investments in actual HPC, then we suppose that is a good thing. But if it just leads to more Linpack tests being run on machines that are not actually doing HPC, this is obviously not ideal. In any event, China is now top gun on the Top 500, with 202 systems, up from 160 on the June list, and the United States now has only 143 systems, down from 169 in June. There are now 251 systems located in Asia, up from 210 six months ago, with 35 of these in Japan. Europe has 93 systems across the region, down from 106 in June. Germany has 21 of these, followed by France with 18 and the United Kingdom with 15."
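The movement between the June and November lists is easy to tabulate; a minimal Python sketch using only the counts quoted above:

```python
# System counts from the November Top 500 list, with June counts for
# comparison (all numbers as quoted in the summary above).
counts = {
    "China":          (202, 160),   # (November, June)
    "United States":  (143, 169),
    "Asia (total)":   (251, 210),
    "Europe (total)": ( 93, 106),
}

for region, (nov, jun) in counts.items():
    # Positive delta means the region gained systems since June.
    print(f"{region:>15}: {nov:3d} systems ({nov - jun:+d} since June)")
```

China's gain of 42 systems almost exactly mirrors the United States' loss of 26 plus Europe's loss of 13.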
93887341
submission
kipperstem77 writes:
NUDT, according to James Lin, vice director of the Center for HPC at Shanghai Jiao Tong University, who divulged the plans last year, is building one of the three pre-exascale machines, in this case a kicker to the Tianhe-1A CPU-GPU hybrid that was deployed in 2010 and that put China on the HPC map. This pre-exascale system will be installed at the National Supercomputer Center in Tianjin, not the one in Guangzhou, according to Lin. This machine is expected to use ARM processors, and we think it will very likely use Matrix2000 DSP accelerators as well, but this has not been confirmed. The second pre-exascale machine will be an upgrade to the TaihuLight system using a future Shenwei processor, and it will be installed at the National Supercomputing Center in Jinan. The third pre-exascale machine being funded by China is being architected in conjunction with AMD, using licensed server processor technology, and everyone now thinks it will be based on Epyc processors, possibly with Radeon Instinct GPU coprocessors.
92452801
submission
kipperstem77 writes:
Amin Vahdat, Google Fellow and technical lead for networking at the company, recently walked The Next Platform through Google's implementation of Espresso routing on the public Internet, the fourth pillar of networking that the company has divulged thus far. Vahdat also talked about how networking technology, in terms of both hardware and software, is starting to see the pace of innovation pick up again after a tough slog over the past decade.
92151495
submission
invalidpath writes:
ARM, traditionally known as the processor of choice for mobile phones, hard drives, and car control systems, is now well on its way to taking on the high performance computing market. New high performance computing installations at one point had a variety of processors to choose from; today there are two main choices for a CPU that can run an operating system: Intel and IBM. ARM is an IP company, and as such it can enable multiple other vendors to produce chips for this market. Vendors license either the ISA alone, or the ISA plus micro-architectural components, to build their own branded chips. The potential is huge: driving down costs by enabling more competition while expanding the room to innovate at the micro-architectural level. Having ARM in common lets these vendors share a common software ecosystem; without this ecosystem, the cost of entry into the market would be prohibitive. At the International Supercomputing Conference (ISC) this year (2017), ARM-compatible hardware was out in force, on display from multiple vendors including HPE, Cavium, and Gigabyte. Many booths even had these systems available to try out with live demos. To cap the week of displays and shows, the GoingARM workshop provided insight from users and HPC centers on their experiences with the ARM platform. Sitting through the presentations gave the audience some insight into how monumental a task porting the entire HPC stack to a new ISA can be. A fairly comprehensive summary was written by The Next Platform.
91252975
submission
kipperstem77 writes:
Google designed the TPU2 specifically to accelerate focused deep learning workloads behind its core consumer-facing software, such as search, maps, and voice recognition, and behind research projects such as autonomous vehicle training. Our rough translation of Google's goals for the TensorFlow Research Cloud (TRC) is that Google wants to recruit the research community to find workloads that will scale well with a TPU2 hyper-mesh. Google says the TRC program will start small but expand over time. The rest of us will not be able to directly access a TPU2 until Google's research outreach finds more general applications and Google offers a TensorFlow hardware instance as infrastructure in its Google Cloud Platform public cloud.
91146485
submission
kipperstem77 writes:
This morning at Google's I/O event, the company stole Nvidia's recent Volta GPU thunder by releasing details about its second-generation tensor processing unit (TPU), which will handle both training and inference in a rather staggering 180 teraflops package, complete with a custom network to lash several together into "TPU pods" that can deliver Top 500-class supercomputing might at up to 11.5 petaflops of peak performance.
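The pod-level figure is consistent with simple multiplication of the per-device number; a quick back-of-the-envelope check in Python, assuming 64 TPU2 devices per pod (a pod size not stated in the summary above):

```python
# Sanity check on the TPU2 pod figures quoted above.
# The 64-device pod size is an assumption for illustration,
# not a number taken from the text.
tflops_per_tpu2 = 180          # peak teraflops per second-generation TPU
devices_per_pod = 64           # assumed devices lashed together per pod

pod_pflops = tflops_per_tpu2 * devices_per_pod / 1000
print(f"Pod peak: {pod_pflops:.2f} petaflops")  # lines up with the ~11.5 quoted
```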
90271323
submission
kipperstem77 writes:
Four years ago, Google started to see the real potential for deploying neural networks to support a large number of new services. During that time it was also clear that, given the existing hardware, if people did voice searches for three minutes per day or dictated to their phone for short periods, Google would have to double the number of datacenters just to run machine learning models.
The need for a new architectural approach was clear, Google distinguished hardware engineer Norman Jouppi tells The Next Platform, but it required some radical thinking...
90240899
submission
kipperstem77 writes:
The tick-tock-clock three-step dance that Intel will be using to progress its Core client and Xeon server processors in the coming years is on full display now that the Xeon E3-1200 v6 processors based on the "Kaby Lake" core have been unveiled.
The Kaby Lake chips are Intel's third generation of Xeon processors based on its 14 nanometer technologies, and as our naming convention for Intel's new way of rolling out chips suggests, they are a refinement of both the architecture and the manufacturing process. By and large, this lets Intel ramp up the clock speed relative to the prior generation of devices and, in the case of Kaby Lake specifically, support faster DDR4 main memory than prior Xeon E3 chips, as well as the shiny new non-volatile 3D XPoint Optane memory sticks and Optane SSDs, both of which debuted in March.
89779337
submission
kipperstem77 writes:
It is a good time to be the maker of a machine that excels at large-scale optimization problems for cybersecurity and defense. And it is even better to be the only maker of such a machine at a time when demand for a post-Moore's Law system is high.
We have already described the U.S. Department of Energy's drive to place a novel architecture at the heart of one of the future exascale supercomputers, and we have also explored the range of options that might fall under that novel processing umbrella. From neuromorphic chips, deep learning PIM-based architectures, and ultra-hybrid machines combining FPGA, GPU, and non-X86 elements, to, of course, quantum computers, there is a rich set of options. While these are important possibilities for the world's top supercomputing sites, the defense and intelligence space is watching keenly as well, with an eye on systems that can target their exact workloads.
89670505
submission
kipperstem77 writes:
If the 32-core Naples chip comes out, and works, it will be competitive with the 28-core Skylake Xeon due around the middle of the year. Everyone who might buy lots of either chip has seen both, has had them running in their labs for a long time, and has already made their purchasing decisions. All we are arguing about now is the price that the rest of us might be charged if we can get either processor.
The die is already cast, even if it is too hot to touch.
89282143
submission
kipperstem77 writes:
With all of those CPUs and GPUs, Tsubame 3.0 will have 12.15 petaflops of peak double precision performance; it is rated at 24.3 petaflops at single precision and, importantly, at 47.2 petaflops at the half precision that matters for the neural networks employed in deep learning applications. When added to the existing Tsubame 2.5 machine and the experimental immersion-cooled Tsubame-KFC system, TiTech will have a total of 6,720 GPUs to bring to bear on workloads, adding up to 64.3 aggregate petaflops at half precision. (This is interesting to us because it means Nvidia has worked with TiTech to get half precision working on Kepler GPUs, which did not formally support half precision.)
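The precision figures above follow the expected scaling, and a quick check of the ratios is instructive; our reading of the small shortfall in the half-precision number is an inference, not something stated in the text:

```python
# Sanity check on the Tsubame 3.0 peak numbers quoted above.
dp, sp, half = 12.15, 24.3, 47.2   # petaflops at double, single, half precision

# Single precision is exactly 2x double, as expected for Pascal GPUs.
assert abs(sp / dp - 2.0) < 0.01

# Half precision comes in just under 2x single (ratio ~1.94); a plausible
# explanation is that the CPUs contribute to the double- and single-precision
# totals but not to the half-precision one.
print(f"half/single ratio: {half / sp:.3f}")
```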
89149477
submission
kipperstem77 writes:
A recent conversation we had with Intel turned up a surprising new addition to the machine learning conversation: an emphasis on neuromorphic devices and what Intel is openly calling "cognitive computing," a term used primarily, and heavily, for IBM's Watson-driven AI technologies. This is the first time we've heard the company make any definitive claims about where neuromorphic chips might fit into a strategy to capture machine learning, and it marks a bold grab for a term that has long been an umbrella for Big Blue's AI business.
89081417
submission
kipperstem77 writes:
We know what you are thinking. This might be a good thing for IBM, but it might not be a good thing for Nvidia, Xilinx, and Mellanox, the three key hardware partners in the OpenPower consortium that IBM formed with the help of hyperscale datacenter operator Google back in August 2013. Fair enough. All three companies seem to be doing fine against their respective competition, and the OpenPower effort might be a tight enough coupling to get interesting and innovative systems to market. But, we might argue, this effort to build a flexible platform (for that is what the OpenPower consortium is ultimately about) could be significantly enhanced and accelerated by a tighter coupling of the core technologies created by all four of these companies. The fourth is, of course, the Power family of processors created by IBM, which would be married to Nvidia Tesla compute GPUs, Mellanox InfiniBand and Ethernet switching, and Xilinx UltraScale Virtex and Kintex FPGAs.
87443369
submission
kipperstem77 writes:
Just as Intel has had to go head-to-head with GPU maker Nvidia for HPC accelerator share with its first-generation Xeon Phi product (a co-processor, which has given way to the self-hosted, non-offload Knights Landing appearing in its first wave of supercomputers this year), a new war with the same enemy is brewing. This one is for machine learning share, an area where Nvidia's GPUs dominate the training portion of the growing workload set. While Intel has often proclaimed that they "power 97 percent of the datacenter servers running AI workloads," we have to take issue with that statement. Not because it isn't true, but because it means a CPU is always part of the mix in all of these workloads (training and inference alike), and given the lack of diversity in the CPU ecosystem, it is natural that 97 percent is X86 from Intel. For training, however, the GPU is the real workhorse, and breaking out the percentage of value for that specific workload is more important, albeit a more difficult analysis.
87304505
submission
kipperstem77 writes:
The Saturn V cluster comprises 124 of Nvidia's own DGX-1 hybrid CPU-GPU systems, which the company launched last April and which were explicitly created to foster the use of its high-end "Pascal" Tesla P100 GPU accelerators, which also debuted at that time. The Tesla P100 comes in two flavors: the original, which mounts directly on the motherboard and links to the processing complex over NVLink interconnects, called the SXM2 form factor; and another with roughly the same performance and HBM2 stacked memory capabilities (and, unfortunately, the same name) that hooks into the CPU complex over normal PCI-Express 3.0 x16 links.
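A rough accounting of the cluster's GPU count follows directly from the node count; the eight-P100s-per-node figure is the standard DGX-1 configuration, assumed here rather than stated in the summary above:

```python
# Back-of-the-envelope GPU accounting for the Saturn V cluster.
# Eight Tesla P100s per DGX-1 is the standard configuration; it is an
# assumption here, not a number taken from the text.
dgx1_nodes = 124
gpus_per_node = 8

total_gpus = dgx1_nodes * gpus_per_node
print(f"Total P100 GPUs across the cluster: {total_gpus}")  # 992
```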