
Intel Cloud Day 2016 Live Blog, 9AM PT

I’m here at Intel’s Cloud Day at the Nasdaq Center in San Francisco, ready for a live blog of the keynote talk from Diane Bryant, SVP and GM of Intel’s Data Center Group. It is set to start at 9AM PT.

Host-Independent PCIe Compute: Where We’re Going, We Don’t Need Nodes

The typical view of a cluster or supercomputer that uses a GPU, an FPGA or a Xeon Phi type device is that each node in the system requires a host CPU to communicate through the PCIe root complex with one to four coprocessors. In some circumstances the CPU/host model adds complexity when all you really need is more coprocessors. This is where host-independent compute comes in.

The CPU handles the network transfer and, combined with the south bridge, manages the IO and other features. Some configurations allow the coprocessors to talk directly to each other, and the CPU/host side allows large datasets to be held in local host DRAM. However, for some compute workloads all you need is more coprocessor cards. Storage and memory might be decentralized, and adding in hosts creates cost and complexity: a host that seamlessly has access to 20 coprocessors is easier to handle than 20 hosts with one coprocessor each. This is the goal of EXTOLL as part of the DEEP (Dynamical Exascale Entry Platform) Project.

At SuperComputing 15, one of the academic posters on display, from Sarah Neuwirth and her team at the University of Heidelberg, covered the development of the hardware and software stacks that allow for host-independent PCIe coprocessors through a custom fabric. This in theory would allow compute nodes in a cluster to be split specifically into CPU and PCIe compute nodes, depending on the needs of the simulation, but it also allows for failover or multiple-user access. All of this is built around their EXTOLL network interface chip, which has subsequently been spun out into a commercial entity.

A side note: in academia it is common enough that the best ideas, if they’re not locked down by funding terms and conditions, are spun out into commercial enterprises. With enough university or venture capital in exchange for a percentage of ownership, an academic team can hire external experts to turn their ideas into a commercial product. These ideas either work or fail, or sometimes the intellectual property is sold up the chain to a tech industry giant.

The concept of EXTOLL is to act as a mini-host that initializes the coprocessor, but it also handles the routing and memory address translation such that it is transparent to all parties involved. A coprocessor equipped with EXTOLL can be connected into a fabric of other compute, storage and host nodes and yet be accessible to all of them. Multiple hosts can connect into the fabric, and coprocessors in the fabric can communicate directly with each other without needing to go out to a host. This is all controlled via MPI command extensions for which the interface is optimised.

The top-level representation of the EXTOLL chip shows seven external ports, supporting cluster architectures up to a 3D torus plus one extra link. The internal switch manages which network port is in use, derived from the translation layer provided by the IP blocks: VELO, the Virtualized Engine for Low Overhead, deals with MPI and in particular small messages; RMA, the Remote Memory Access unit, implements put/get with one-copy or zero-copy operations and zero CPU interaction; and the SMFU, the Shared Memory Function Unit, exports segments of local memory to remote nodes. This all communicates with the PCIe coprocessor via the host interface, which supports either PCIe 3.0 or HyperTransport 3.0.
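
To put the RMA unit in context, here is a minimal sketch using standard MPI one-sided operations. This is generic MPI rather than EXTOLL’s own interface; it only illustrates the put/get style of transfer described above, where data lands directly in a remote buffer without the remote CPU handling the message.

```c
/* Generic MPI one-sided (RMA) example: rank 0 puts data directly into a
 * memory window exposed by rank 1. Compile with an MPI wrapper (e.g. mpicc)
 * and run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf;
    MPI_Win win;

    /* Each rank exposes a buffer as a remotely accessible window. */
    MPI_Win_allocate(N * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double payload[N];
        for (int i = 0; i < N; i++)
            payload[i] = (double)i;
        /* One-sided put: the data lands in rank 1's window directly. */
        MPI_Put(payload, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received buf[42] = %f\n", buf[42]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```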

From a topology point of view, EXTOLL is not meant to act as a replacement for a regular network fabric; it adds a separate fabric layer. In the diagram above, the exploded view shows compute/host nodes (CN) offering the standard fabric options, booster interface nodes (BI) that sit on both the standard fabric and the EXTOLL fabric, and then booster nodes (BN) which are just a PCIe coprocessor and an EXTOLL NIC. With this there can be a one-to-many or a many-to-many arrangement depending on what is needed, and in most cases the BI and BN can be combined into a single unit. From the end user’s perspective, this should all be seamless.
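
For a sense of how that cluster/booster split might look from the application side, here is a short sketch using plain MPI communicators. The even/odd role assignment is purely an assumption for illustration; how ranks actually map onto CN, BI and BN hardware is down to the runtime and launcher.

```c
/* Splitting MPI_COMM_WORLD into "cluster" and "booster" communicators.
 * The role test below is a stand-in for whatever the real launcher or
 * resource manager would provide. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Assumption for illustration: odd ranks play the booster role. */
    int is_booster = (world_rank % 2 == 1);

    MPI_Comm role_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_booster, world_rank, &role_comm);

    int role_rank;
    MPI_Comm_rank(role_comm, &role_rank);
    printf("world rank %d acts as %s rank %d\n",
           world_rank, is_booster ? "booster" : "cluster", role_rank);

    MPI_Comm_free(&role_comm);
    MPI_Finalize();
    return 0;
}
```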

I discussed this and was told that several users could each allocate themselves a certain number of coprocessors, or an administrator can set limits depending on the login or on the other workloads queued.

On the software side, EXTOLL presents itself to the coprocessor driver as a virtual PCI layer. This communicates with the hardware through the EXTOLL driver, telling the hardware to perform the required address translation, MPI messaging and so on. The driver provides the tools to do the necessary translation of PCI commands across its own API.
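
As an illustration of the address translation idea only (this is not the real EXTOLL driver API, and every name below is hypothetical), a thin layer like this maps a local bus address, as seen by the coprocessor driver, onto a remote fabric node and offset:

```c
/* Hypothetical sketch of fabric address translation -- not EXTOLL's API.
 * The coprocessor driver issues what it believes are ordinary PCI memory
 * accesses; a translation table maps each local range to a remote node. */
#include <stdint.h>
#include <stdio.h>

struct xlat_entry {
    uint64_t local_base;   /* start of the range seen by the driver    */
    uint64_t size;         /* length of that range in bytes            */
    uint16_t remote_node;  /* fabric node that owns the memory         */
    uint64_t remote_base;  /* offset of the region on the remote node  */
};

/* Translate a local bus address into fabric coordinates.
 * Returns 0 on success, -1 if the address is not mapped. */
static int translate(const struct xlat_entry *table, int entries,
                     uint64_t local_addr,
                     uint16_t *node, uint64_t *remote_addr)
{
    for (int i = 0; i < entries; i++) {
        if (local_addr >= table[i].local_base &&
            local_addr <  table[i].local_base + table[i].size) {
            *node        = table[i].remote_node;
            *remote_addr = table[i].remote_base +
                           (local_addr - table[i].local_base);
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    /* Example mapping: a 16 MB window that really lives on fabric node 3. */
    struct xlat_entry table[] = {
        { 0xD0000000ULL, 16 << 20, 3, 0x0ULL },
    };

    uint16_t node;
    uint64_t raddr;
    if (translate(table, 1, 0xD0001000ULL, &node, &raddr) == 0)
        printf("local 0xD0001000 -> node %u, remote offset 0x%llx\n",
               (unsigned)node, (unsigned long long)raddr);
    return 0;
}
```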

The goal of something like EXTOLL is to be part of the PCIe coprocessor itself, similar to how Omni-Path will be on Knights Landing, either as a custom IC on the package or integrated into the die. That way EXTOLL-connected devices can be developed in a different physical format to the standard PCIe coprocessor cards, perhaps with integrated power and cooling to make the design more efficient. The first generation of this was built on an FPGA and used as an add-in to a power and data only PCIe interface. The second generation is similar, but has moved to a TSMC 65nm ASIC, reducing power and increasing performance. The latest version is the Tourmalet card, using upgraded IP blocks and featuring 100 GB/s per direction and 1.75 TB/s switching capacity.


Early hardware in the DEEP Project, of which EXTOLL is a key part

Current tests with the second-generation card, the Galibier, in a dual-node design gave a 57% speed-up in LAMMPS (a molecular dynamics package).

The concept of host-less PCIe coprocessors is one of the next steps towards exascale computing, and EXTOLL are now straddling the line between selling commercial products and presenting their research as part of academic endeavours, even though there is the goal of external investment, similar to a startup. I am told they already have interest and a proof-of-concept deployment with two partners, but this sort of technology needs to be integrated into the coprocessor itself: having something the size of a motherboard with several coprocessors talking via EXTOLL without external cables should be part of the endgame here, as long as power and cooling can be controlled. The other factor is ease of integration with software. If it fits easily into current MPI-based codes and libraries in C++ and Fortran, and it can be supported as new hardware is developed with new use cases, then it is a positive step. Arguably EXTOLL thus needs to be brought into one of the large tech firms, most likely as an IP purchase, or others will develop something similar depending on patents. Arguably the best fit for that position would be Intel with its Omni-Path, but consider that FPGA vendors have close ties to InfiniBand, so there could be potential there.

Relevant Paper: Scalable Communication Architecture for Network-Attached Accelerators

Workstation Love at SuperComputing 15

One of the interesting angles at SuperComputing 15 was workstations. In a show where high performance computing is paramount, most situations involve offloading work onto a small cluster or an off-site, heavily funded behemoth of an environment. However, there were a few interesting under-the-desk systems offered by system integrators outside the usual Dell/HP/Lenovo conglomerates to tempt the taste buds of workstation aficionados, covering high performance computing, networking and visualization.

First is a system that a few of my Twitter followers saw: a pre-overclocked box from BOXX. Not only was it odd to see an overclocked system at a supercomputing conference (although some financial environments crave low latency), but here BOXX had pushed it near the extreme.

To get more than a few extra MHz, the system needs to move away from Xeon, which this one does, at the expense of ECC memory for the usual reasons. But here is an i7-5960X overclocked to a modest 4.1 GHz, 128GB of DDR4-2800 memory, and four GTX Titan X cards. Each of these cards is overclocked by around 10-15%, and both the CPU and GPUs are water cooled inside a massive custom case. The side window panel is optional. Getting the absolute best of everything still falls a little short on the PCIe NVMe SSD front as well, but the case offers enough storage capacity for a set of drives in a mix of RAID 0/1/5/10 or JBOD.

They had the system attached to a 2×2 wall of 4K monitors, all running various 3D graphics and rendering workloads. Of course, the system aims to be a workstation wet dream, so the goal here is to show off what BOXX can do with off-the-shelf parts before going fully custom with that internal water loop. I discussed the range of overclocks with the company representative, and they said it has to balance speed against repair: if a part fails, it needs to do so less often as well as be quick and easy to replace, which is why the CPU and GPUs were not pushed to the bleeding edge. It makes sense.

Microway also had a beast on show. What actually drew me to Microway in the first place was that they were the first company I saw showing a new Knights Landing (link) motherboard and chip, but right next to it was essentially a proper server in a workstation.

Initially I thought it was an AMD setup, as I had seen quad Magny Cours/Istanbul/Abu Dhabi based systems in workstation-like cases before. But no, that is a set of four E5-4600 v3 series CPUs in a SuperMicro motherboard. The motherboard is aimed at servers, but Microway has put it in a case, and each socket has a closed-loop liquid cooler. Because these are v3 CPUs, there is scope for up to 72 cores / 144 threads using E7 v3 processors, which is more similar to what you would see in a 4U rack-based arrangement. Because this is in a case, and the board is laid out the way it is, PCIe coprocessor support varies based on which PCIe root complex each slot hangs off, but I was told that it will be offered with the standard range of PCIe devices as well as Intel’s Omni-Path when those cards come to retail. Welcome to the node under the desk. You need a tall desk. I’m reaching out to Microway to get one in for review, if only out of perverse curiosity about workstation CPU compute power.

So there’s one node in a case – how about seven? In collaboration with Intel, International Computer Concepts Inc has developed an 8U half-width chassis that will take any half-width server unit up to a specific length.

Each 1U has access to a 10GBase-T port and an internal custom 10G switch with either copper or fibre outputs depending on how you order it. In the example shown to us, each 1U held dual Xeon D nodes, which will offer up to 64 threads x 16 when fully populated with the next Xeon-D generation. Of course, some parts of the system can be replaced with storage nodes, or full-fat Xeon nodes. Cooling wasn’t really discussed here, but I was told that the system should be populated with noise in mind, so giving each 1U a pair of GPUs probably isn’t a good idea. The system carries its own power backplane as well, with dual redundant supplies up to 1200W if I remember correctly.

With this amount of versatility, particularly when testing for a larger cluster (or even as an SMB deployment), it certainly sounds impressive. I’m pretty sure Ganesh wants one for his NAS testing.
