Unpacking AMD's Zen Benchmark: Is Zen actually 2% Faster than Broadwell?

At a satellite event to Intel’s Developer Conference last week, AMD held an evening soiree with invited analysts and press to talk about their new upcoming Zen microarchitecture. This was intended to be a preview of tomorrow’s Hot Chips presentation, and we’ve already covered the juicier parts of the presentation in terms of microarchitecture announcements as well as diving deep into the Server-based Naples implementation and what the motherboards told us from memory and IO support. 

You can read both here:

AMD Zen Microarchitecture: Dual Schedulers, Micro-op Cache and Memory Hierarchy Revealed
AMD Server CPUs and Motherboard Analysis

There was one other element to the presentation that requires further discussion and analysis, if only to clean up some of the misinformation already present online and to label what was shown with a long list of potential caveats which most people seem to have passed by almost entirely. As part of the show, AMD compared the performance of their Zen engineering sample to an Intel Broadwell-E processor. 

In this test, they told the audience that each system was running eight cores and sixteen threads, with all cores set to 3 GHz (implying no turbo). Memory arrangements were not disclosed, nor storage: we were told to assume comparable setups.

We were too busy trying to decipher what was on screen (and take note of the results) to actually photograph the benchmark as it happened (there are videos online), but the benchmark they showed was Blender, an open source rendering engine, with a custom multithreaded workload. The test was to render a mockup of a Zen based desktop CPU, with an effective workload of 50 seconds for these chips. I’ve seen numerous reports about this result saying the difference was 1 or 2 seconds, but rarely with a mention of the benchmark length, which is just as important. The overall results were:

Blender Time to Render (sec)

Processor                          Configuration               Time (sec)
Intel Broadwell-E Core i7-6900K    8C / 16T, 3 GHz all-core    49.05
AMD Zen Engineering Sample         8C / 16T, 3 GHz all-core    48.07 (-0.98 sec, 1.998%)

All things being equal (we’ll get to that in a second), this would suggest that an 8-core AMD Zen part has a ~2% advantage over Broadwell-E at the same clock speeds. Despite this result, there are a lot of unverifiable parts to the claim, which make analysis of such a result difficult. I want to go through each of them one by one to ensure everyone understands what was presented.
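As a quick sanity check on where the ~2% comes from, the arithmetic can be sketched as below. The two render times are the ones quoted above; the function names are purely illustrative. Note that the same pair of numbers gives slightly different percentages depending on whether you quote time saved relative to the slower result or the ratio of the two times.

```python
def relative_time_saving(baseline_s: float, candidate_s: float) -> float:
    """Fraction of the baseline render time that the candidate saves."""
    return (baseline_s - candidate_s) / baseline_s


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Ratio of baseline time to candidate time (above 1.0 means faster)."""
    return baseline_s / candidate_s


# Render times in seconds, as shown at the event.
broadwell_e = 49.05
zen_es = 48.07

saving = relative_time_saving(broadwell_e, zen_es)
ratio = speedup(broadwell_e, zen_es)

print(f"time saved: {broadwell_e - zen_es:.2f} s ({saving:.3%} of the baseline)")
print(f"speedup:    {ratio:.4f}x")
```

Either way the gap is roughly one second in a test of roughly fifty, which is why the benchmark length matters when quoting the difference.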

I’ll preface this analysis with two things though: one is that AMD was very careful in what they said at the event, and only said as much as they needed to. This is why the string of caveats for this benchmark test is fairly long. But also, AMD has to set expectations here: if they choose an environment and test that represents the peak, or relies on something special, users will feel burned again after Bulldozer. AMD has to temper those expectations but still represent a methodology that is effective to them. Leaving so many cards on the table can be either a help or a hindrance.

But given the measured and calm, professional nature of the event, as opposed to the wild and wacky AMD events of the past, it was clear (either by design or chance) that the words used said only as much as they needed to. Along with the microarchitecture discussions, it was designed to provide a good stepping stone on to the Hot Chips presentation a few days later.

So, caveats. Or disclaimers not readily provided. Let’s start at the top. 

1) The Results Are Not Externally Verifiable At This Time, As Expected

We were told the setups of the systems being used, but were unable to confirm the results manually. This is typically the case with a high level, early look at performance, and other companies do this all the time.

This being said, it would look bad on reports if it turns out that there is a chasm between pre-launch and launch data, so understanding this caveat is fundamental when reporting this data. The basis of publishing scientific papers is repeatability and verification – while this wasn’t a scientific presentation, it is important to keep it in the back of your mind when you hear any benchmark numbers (AnandTech included – our numbers are designed to be verifiable and we want to have a history of honesty with our readers, especially when it comes to custom software/workloads we cannot disclose).

2) No Memory or TDP Numbers Were Provided

We were able to determine that the AMD-based systems were running 2×8 GB of DDR4-2400, although we did not get a look at Intel’s memory arrangement. Similarly, due to the ES nature of the CPU, TDP numbers were not shared; however, we did see all the AMD systems use either the AMD Wraith cooler (which is rated at 125W) or the new near silent cooler (95W). That tends to put an upper limit on the peak power consumption of the system, although some of AMD’s current competitive parts actually use a cooler designed for the bracket above in TDP (e.g. A10-7860K at 65W uses the 95W cooler, A10-7890K at 95W uses the 125W cooler).

3) Blender Is an Open Source Platform

One of the issues of using open source is that the code is editable by everyone and anyone. Any user can adjust the code to optimize for their system and run a test to get a result. That being the case, it is practically impossible to determine from the outside exactly which Blender code base was compiled for this test.

Even in the base code, there could be CPU vendor specific optimizations in either the code or the compiler that influence how the code manipulates the cache hierarchy with the workload and adjusts appropriately. It also doesn’t help that Blender has elements in the code called ‘AMD’, which relate to a series of internal rendering features not related to the company. Going down the route of optimizing for specific CPU microarchitectures leads on to another more philosophical issue…

4) Did It Actually Measure IPC? (The Philosophical Debate) 

In the purest sense, measuring the number of instructions per clock (IPC) that a processor sustains on a given set of instructions can determine the efficiency of a design. However, the majority of highly optimized code bases do not run the same general-purpose code everywhere: if the software detects a particular microarchitecture, it can arrange threads and loops to take advantage of that design. The main question is how IPC should be measured. Should it use identical code bases, which are easier to understand but are often non-real-world compiler targets, or highly optimized code that shows the best of what the processor can do (which means that the IPC result is limited to that benchmark)? With the results we saw, if the difference of about a second in just under fifty seconds translates into a 2% difference, is it accurate to say that this is a 2% IPC increase, or does it rely on optimized/non-optimized code? Optimizing code, or profiling compilers for specific code targets, is nothing new. In the holistic view, most analysts use SPEC benchmarks for this, as they are well-known code structures, even though most of the benchmarks are compiler targets. While SPEC is not particularly relevant for real world workloads, it does give an indication about performance for unknown architectures/microarchitectures.
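To make the distinction concrete, here is a minimal sketch of how a render-time ratio maps to an IPC ratio. At equal clock speeds the frequency cancels out, so the IPC ratio is simply the ratio of instructions retired per second. The render times are the ones from the demonstration; the relative instruction counts are hypothetical, since nothing about the instruction streams was disclosed, and the 3% figure is invented purely for illustration.

```python
def implied_ipc_ratio(time_a_s: float, time_b_s: float,
                      instr_a: float, instr_b: float) -> float:
    """IPC of chip B relative to chip A, assuming both run at the same clock.

    IPC = instructions / (time * frequency); with equal frequency the
    ratio reduces to (instr_b / time_b) / (instr_a / time_a).
    """
    rate_a = instr_a / time_a_s
    rate_b = instr_b / time_b_s
    return rate_b / rate_a


BROADWELL_E_S = 49.05  # seconds, from the demo
ZEN_ES_S = 48.07       # seconds, from the demo

# Case 1: both chips execute an identical instruction stream.
same_code = implied_ipc_ratio(BROADWELL_E_S, ZEN_ES_S, 1.0, 1.0)

# Case 2 (hypothetical): the build takes a code path on Zen that
# retires 3% fewer instructions for the same rendered frame.
fewer_instr = implied_ipc_ratio(BROADWELL_E_S, ZEN_ES_S, 1.0, 0.97)

print(f"identical code:        IPC ratio {same_code:.4f}")
print(f"3% fewer instructions: IPC ratio {fewer_instr:.4f}")
```

In the first case the time difference is an IPC difference; in the second, the same render times would correspond to a slightly lower IPC on the faster chip. Without knowing the code paths, the result is a performance number rather than a clean IPC number.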

5) The Workload Is Custom 

One of the benefits of software like SPEC, or canned benchmarks like Cinebench, is that anyone (with a license) can pick up the workload and run with it. Those workloads are typically well known, and we can get performance numbers out that have known qualities in their interpretation. With a custom workload, that is not always the case. It comes down to experience – an unknown workload can lean heavily on certain branches of code, and that bias is not known when it comes to interpreting the results. This is why rendering one scene in a film can take a vastly different time to another, and why the results for the ‘benchmark’ can be significantly different depending on the architecture (one prefers lighting, another prefers tessellation, etc.). Using known or standard workloads over long periods of time can offer insights into the results, whereas new workloads cannot, especially with so few results on offer.

6) It Is Only One Benchmark

There is a reason for AMD only showing this benchmark – it’s either a best case scenario, or they are pitching expectations exactly where they want them to be. By using a custom workload on open source software, the result is very specific and cannot be extrapolated in any meaningful way. This is why a typical benchmark suite offers 10-20 tests with different workloads, and even enterprise standard workloads like SPEC come with over a dozen tests in play, to cater for whichever single thread, multi-thread, large cache, memory or pixel pushing bottlenecks may occur. Single benchmarks on their own are very limited in scope as a result.

7) There’s Plenty about the Microarchitecture and Chip We Don’t Know Yet, e.g. Uncore

One of the more difficult elements on a processor is managing cross-core communication, as well as cross-core cache snooping. This problem scales poorly: the number of direct connections needed to link every core to every other rises much faster than the core count itself. Intel has historically used a ring interconnect between cores to do this, with their large multi-core chips using a dual ring bus with communication points between the two. We suspect AMD is also using a ring bus in a similar methodology, but this has not been discussed at this time. There’s also the interconnect fabric between the cores and other parts of the chip, such as the Northbridge/Southbridge or the memory controllers. Depending on the test, the core-to-core communication and the interconnect can have different performance effects on the CPU.
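To put rough numbers on that scaling concern, the sketch below counts point-to-point links for two idealized topologies: a full mesh, where every core connects directly to every other, and a single ring. These are textbook models used only for illustration and say nothing about what AMD or Intel actually implement; the function names are made up for this example.

```python
def full_mesh_links(cores: int) -> int:
    """Links needed to connect every core directly to every other core."""
    return cores * (cores - 1) // 2


def ring_links(cores: int) -> int:
    """Links needed to join the cores in a single closed ring."""
    if cores < 2:
        return 0
    if cores == 2:
        return 1  # two cores need only one link between them
    return cores


for n in (2, 4, 8, 16, 32):
    print(f"{n:>2} cores: full mesh {full_mesh_links(n):>3} links, "
          f"ring {ring_links(n):>2} links")
```

The full mesh grows with the square of the core count while the ring grows linearly, which is the trade-off behind ring designs: far fewer wires, at the cost of extra hops between distant cores.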

8) Clock Speeds Are Not Final, Efficiency Not Known

Performance of a CPU is typically limited by the power draw – there is no point getting a small amount of performance for a large increase in power such that efficiency has decreased. AMD has stated that power consumption and efficiency were a primary goal as this microarchitecture was developed.

At the demonstration, we were told that the frequency of the engineering samples was set at 3 GHz for all-core operation. We were told explicitly that these are not the final clock speeds, but at the very least this puts a lower bound on the highest end processor. In reality, 3 GHz could be a walk in the park for the final retail units, depending on how much difference there is between the chips on display and what people will be able to buy. We are under the impression that the CPUs will have turbo modes involved, and those could be staggered based on the cores being used.

But this is why I said that 3 GHz is the lower bound of the high-end processor. We know from these results (assuming point 1 in this list) that the best processor from AMD will do at least 3 GHz. There’s no indication of power, and thus there’s no indication of efficiency either, which is another important metric left in the ether.

9) We Will Have to Wait to Test

Everyone wants the next technology yesterday, so the ‘gimme gimme gimme’ feeling of new stuff is always there. AMD has already stated that general availability for Zen and Summit Ridge will be Q1, which puts the launch at four months away at a minimum. At this stage of the game, while AMD is trying to be competitive with Intel, they don’t want to generate too much hype and give the game away in case it goes incredibly pear-shaped. There’s the added element of the hardware and software being finalized or updated.

Since I’ve been reviewing, no CPU manufacturer has handed out review units four months before launch (in all honesty, we’re lucky to get a launch date sample a week in advance these days). In fact we’d have to go back to Nehalem and Conroe to find something that was sampled early; however Conroe just passed its 10th birthday and in that case, Intel knew they were on to a clear winner ahead rather than just ‘meeting expectations’. Also, early samples of a great product will mean users will wait for it to come out, which results in revenue loss (the Osborne effect) unless you have zero stock and/or an uncompetitive product that no-one is buying. In this decade, no x86 CPU manufacturer has offered samples this far out. I’d be more than happy for that to change and I would encourage companies to do so, but I understand the reasons why. 

Some Final Words

Taking an IQ test strictly tells you how good you are at an IQ test, but it is typically also an indication that you are good/bad at other things as well (most well-engineered IQ tests go through a lot of spatial reasoning, for example). In the same way, a CPU performing a Blender test is only as good as a Blender test, but given what we know about the Zen microarchitecture, it is probably also good at other things. Just how good, in what metric and to what extent, is almost impossible to say.

AMD has given a glimpse of performance, and they’ve only said as much as they needed to in order to get the message across. However it has been up to the media to understand the reasons why and explain what those caveats are.

Early AMD Zen Server CPU and Motherboard Details: Codename ‘Naples’, 32-cores, Dual Socket Platforms, Q2 2017

At the AMD Zen microarchitecture announcement event yesterday, the lid was lifted on some of the details of AMD’s server platform. The 32-core CPU, codename Naples, will feature simultaneous multithreading similar to the desktop platform we wrote about earlier, allowing for 64 threads per processor. Thus, in a dual socket system, up to 128 threads will be available. These development systems are currently in the hands of select AMD partners for qualification and development.

AMD was clear that we should expect to hear more over the coming months (SuperComputing 2016 is in November 2016, International SuperComputing is in June 2017) with a current schedule to start providing servers in Q2 2017.

 

Analysing AMD’s 2P Motherboard

AMD showed off a dual socket development motherboard, with two large AMD sockets using eight phase power for each socket as well as eight DDR4 memory slots.

It was not stated if the CPUs supported quad-channel memory at two DIMMs per channel or eight channel memory at this time, and there’s nothing written on the motherboard to indicate which is the case – typically the second DIMM slot in a 2DPC environment is a different color, which would suggest that this is an eight-channel design, however that is not always the case as some motherboard designs use the same color anyway.

However, it is worth noting that each bank of four memory slots on each side of each CPU has four chokes and four heatsinks (probably VRMs) in two sets. Typically we see one per channel (or one per solution), but the fact that each socket seems to have eight VRMs for the memory would also lean into the eight-channel idea. To top it off, each socket has a black EPS 12V (most likely for the CPU), which is isolated and clearly for CPU power, but also a transparent EPS 12V and a transparent 6-pin PCIe connector. These transparent connectors are not as isolated, so are not for low power implementation, but each socket does have one attached, perhaps suggesting that the memory interfaces are powered independently to the CPU. More memory channels would require more power, and four-channel interfaces have been done and dusted before via the single EPS 12V, so requiring even more power raises questions. I have had word in my ear that this may be as a result of support for future high energy memory, such as NVDIMM, although I have not been able to confirm this.

Edit: The transparent EPS 12V could be a PCIe 8-pin in retrospect, but still seems excessive for the power it can provide.
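Returning to the quad-channel versus eight-channel question, a rough sketch of what is at stake in theoretical peak bandwidth is below. It assumes a standard 64-bit (8-byte) data path per DDR4 channel, and it borrows DDR4-2400 as the memory speed purely for illustration: that is the speed seen in the desktop demo systems, while the memory speed of the server platform was not stated.

```python
BYTES_PER_TRANSFER = 8  # standard 64-bit DDR4 channel, ECC bits excluded


def peak_bandwidth_gb_s(mega_transfers_per_s: int, channels: int) -> float:
    """Theoretical peak memory bandwidth per socket in GB/s (decimal)."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * channels / 1000


ASSUMED_SPEED = 2400  # MT/s; an assumption, not a disclosed Naples figure

for channels in (4, 8):
    bandwidth = peak_bandwidth_gb_s(ASSUMED_SPEED, channels)
    print(f"{channels} channels of DDR4-{ASSUMED_SPEED}: {bandwidth:.1f} GB/s peak")
```

Under these assumptions an eight-channel design would double the theoretical peak per socket over a quad-channel one, which is the sort of jump that could plausibly need more power delivery on the memory side.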

Unfortunately, we could not remove the heatsinks to see the CPUs or the socket, but chances are this demo system would not have CPUs equipped in the first place. Doing some basic math based on the length of a DDR4 module, our calculations show that the socket area (as delineated by the white line beyond the socket) is 7.46 cm x 11.877 cm, to give an area of 88.59 cm². By comparison, the heatsink has an active fin floor plan area of 62.6 cm² based on what we can measure. Unfortunately this gives us no indication of package area or die area, both of which would be more exciting numbers to have.
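For anyone wanting to reproduce the area estimate, the arithmetic is sketched below using the dimensions quoted above. Multiplying the rounded dimensions gives about 88.60 cm², a hair above the quoted 88.59 cm²; the small gap is presumably down to rounding in the measurements. The variable names are illustrative only.

```python
# Socket outline as estimated from the photo, in centimetres.
socket_width_cm = 7.46
socket_length_cm = 11.877

# Active fin floor plan area of the heatsink, in square centimetres.
heatsink_fin_area_cm2 = 62.6

socket_area_cm2 = socket_width_cm * socket_length_cm
coverage = heatsink_fin_area_cm2 / socket_area_cm2

print(f"socket area:     {socket_area_cm2:.2f} cm^2")
print(f"heatsink fins:   {heatsink_fin_area_cm2:.1f} cm^2")
print(f"fin area/socket: {coverage:.1%}")
```

On these figures the fin floor plan covers roughly 70% of the marked socket area.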

Putting the CPU, memory and sockets aside, the motherboard has a number of features worth pointing out. There is no obvious chipset or southbridge in play here. Where we would normally expect a chipset, we have a Xilinx Spartan FPGA without a heatsink, although I would doubt this is the chipset based on the fact that there is an ‘FPGA Button’ right above it and this is most likely to aid in some of the debugging elements on the system.

Further to this, the storage options for the motherboard are all located on the left hand side (as seen) right next to one of the CPUs. Eight SATA style ports are here, all in blue, which usually indicates that these are part of the same head controller. Part of the text on the motherboard also states ‘ALL SATA CONNS CONNECTED TO P1’, which indicates that the first processor (from the main image, left to right, although P1 is actually the ‘second processor’) has direct control.

Other typical IO on the rear panel such as a 10/100 network port (for the management) and the USB 3.0 ports are next to the second processor, which might indicate that this processor has IO control over these parts of the system. However the onboard management control, provided by an ASpeed AST2500 controller with access to Elpida memory, is nearer the PCIe slots and the Xilinx FPGA.

The lack of an obvious chipset, and the location of the SATA ports, would point to Naples having the southbridge integrated on die, and creating an SoC rather than a pure CPU. Bringing this on die, to 14nm FinFET, will allow the functions to be in a lower power process (historically chipsets are created at a larger lithography node to the CPU) as well as adjustments in bandwidth and utility, although at the expense of modularity and die area. If Naples has an integrated chipset, it makes some of the findings on the AM4 platform we saw at the show very interesting. Either that or the FPGA is actually used for the developers to change southbridge operation on the fly (or that chipsets are actually becoming more like FPGAs, which is more realistic as chipsets move to PCIe switch mechanisms).

There are a lot of headers and jumpers on board which won’t be of much interest to anyone except platform testers, but the PCIe layout needs a look. On this board we have four PCIe slots below one of the CPUs, each a full-length x16 slot. By careful inspection of the pins we can tell that the slots are each x16 electrical.

However the highlighted box gives some insight into the PCIe lane allocation. The text says:

“Slot 3 has X15 PCIe lanes if MGMT PCIe Connected
Slot 3 has X16 PCIe lanes if MGMT PCIe Disconnected”

This would indicate that slot three has a full x16 lane connection for data, or in effect we have 64 lanes of PCIe bandwidth in the PCIe slots. That’s about as far as we can determine here. We have seen motherboards in the past that take PCIe lanes from both CPUs, so at best we can say that in this configuration the Naples CPU has between 32 lanes and 64 lanes for a dual processor system. The board traces, as far as we were able to look at the motherboard, did not make this clear, especially as this is a multi-layer motherboard (qualification samples are typically over-engineered anyway). There is an outside chance that the integrated southbridge/IO is able to supply a combined x16 PCIe link, however there is no obvious way to determine if this is the case (and it is not something we’ve seen historically).
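The lane arithmetic above can be laid out explicitly. This is only a sketch of the counting, not a statement about how the board is actually wired: the even split between two CPUs used for the lower bound is our assumption, and the function and variable names are ours.

```python
SLOTS = 4             # PCIe slots visible on the board
LANES_PER_SLOT = 16   # each slot appears to be x16 electrical
MGMT_LANES = 1        # slot 3 drops to x15 when the management link is connected


def slot_lanes(mgmt_connected):
    """Lanes available to each slot, in slot order 1 to 4."""
    lanes = [LANES_PER_SLOT] * SLOTS
    if mgmt_connected:
        lanes[2] -= MGMT_LANES  # index 2 is slot 3
    return lanes


total_disconnected = sum(slot_lanes(False))
total_connected = sum(slot_lanes(True))

# Assumption: the slots are fed either evenly by both CPUs or entirely by one.
per_cpu_low = total_disconnected // 2
per_cpu_high = total_disconnected

print(total_disconnected, total_connected, per_cpu_low, per_cpu_high)
```

With the management link connected, one lane of slot 3 is taken away from data, which is why the silkscreen text distinguishes the two cases.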

AM4 Desktop Motherboards

Elsewhere on display for Zen, we also saw some of the internal AM4 motherboards in the base units at the event.

These were not typical motherboard manufacturer boards from the usual names like ASUS or GIGABYTE, and were very clearly internal use products. We weren’t able to open up the cases to see the boards better, but on closer inspection we saw a number of things.

First, there were two different models of motherboards on show, both ATX but varying a little in the functionality. One of the boards had twelve SATA ports, some of which were in very odd locations and colors, but we were unable to determine if any controllers were on board.

Second, each of the boards had video outputs. This would be because we already know that the AM4 platform has to cater for both Bristol Ridge and Summit Ridge, with the former being APU based with integrated graphics and the updated Excavator v2 core design. On one of the motherboards we saw two HDMI outputs and a DisplayPort output, suggesting a full 3-digital display pipeline for Bristol Ridge.

The motherboards were running 2x8GB of Micron memory at DDR4-2400. As for the CPU coolers, AMD was using both its 125W AMD Wraith cooler and the new 95W near silent cooler across the four or five systems on display. This puts an upper bound on the TDP of these engineering samples, but if recent APU and FX product announcements are anything to go by, AMD is happy to put a 125W cooler on a 95W CPU, or a 95W cooler on a 65W CPU if required.

I will say one thing that has me confused a little. AMD has been very quiet on the chipset support for AM4, and what IO the south bridge will have on the new platform (and if that changes depending on whether a Bristol Ridge or Summit Ridge CPU is in play at the time). In the server platform, we concluded above that the chipset is likely integrated into the CPU – if that is true on the consumer platform as well, then I would point to the chipset-looking device on these motherboards and start asking questions. Typically the chipset on a motherboard is cooled by a passive heatsink, but these chips had low z-height fans on them and were running at quite the rate. I wonder if they were like this so that there is more space to plug in testing tools when the engineers use the motherboards, or if it is for another purpose entirely. As expected, AMD said to expect more information closer to launch.

Wrap Up

To anyone who says motherboards are boring, well I think AMD has given a number of potential aspects of the platform away in merely showing a pair of these products for server and desktop. Sure, they answer some questions and cause a lot more of my hair to fall out trying to answer the questions that arise, but at this point it means we can start to have a fuller understanding of what is going on beyond the CPU.

As for server based Zen, Naples, depending on PCIe counts and memory support, along with the cache hierarchy we discussed in the previous piece, the prospect of it playing an active role in enterprise seems very real. Unfortunately, it is still a year away from launch. There are lots of questions about how the server parts will be different, and how the 32 cores on the SKUs that were talked about will be arranged in order to shuffle memory around at a reasonable rate – one of the problems with large core count parts is being able to feed the beast. AMD even used that term in their presentation, meaning that it’s clearly a topic they believe they have addressed.

Early AMD Zen Server CPU and Motherboard Details: Codename ‘Naples’, 32-cores, Dual Socket Platforms, Q2 2017

At the AMD Zen microarchitecture announcement event yesterday, the lid was lifted on some of the details of AMD’s server platform. The 32-core CPU, codename Naples, will feature simultaneous multithreading similar to the desktop platform we wrote about earlier, allowing for 64 threads per processor. Thus, in a dual socket system, up to 128 threads will be available. These development systems are currently in the hands of select AMD partners for qualification and development.
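The thread counts follow from simple multiplication. A minimal sketch, assuming two threads per core (which is what 64 threads on a 32-core part implies); the function name is ours, purely for illustration:

```python
def total_threads(cores_per_socket, threads_per_core, sockets):
    """Hardware threads exposed by a multi-socket system."""
    return cores_per_socket * threads_per_core * sockets


# Figures from the announcement: 32 cores per CPU, with SMT giving 64 threads per CPU.
per_socket = total_threads(32, 2, 1)
dual_socket = total_threads(32, 2, 2)

print(per_socket, dual_socket)
```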

AMD was clear that we should expect to hear more over the coming months (SuperComputing 2016 is in November 2016, International SuperComputing is in June 2017), with a current schedule to start providing servers in Q2 2017.

 

Analysing AMD’s 2P Motherboard

AMD showed off a dual socket development motherboard, with two large AMD sockets using eight phase power for each socket as well as eight DDR4 memory slots.

It was not stated at this time if the CPUs supported quad-channel memory at two DIMMs per channel or eight-channel memory, and there’s nothing written on the motherboard to indicate which is the case. Typically the second DIMM slot in a 2DPC environment is a different color, and the slots here all match, which would suggest that this is an eight-channel design. However, that is not always the case, as some motherboard designs use the same color anyway.
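The two candidate memory arrangements are simply the ways of dividing eight slots per socket into channels. A small sketch of that enumeration, assuming eight slots per socket and no more than two DIMMs per channel; the helper name is hypothetical:

```python
def channel_layouts(slots_per_socket, max_dimms_per_channel=2):
    """Return (channels, dimms_per_channel) pairs that fill every slot exactly."""
    layouts = []
    for dpc in range(1, max_dimms_per_channel + 1):
        if slots_per_socket % dpc == 0:
            layouts.append((slots_per_socket // dpc, dpc))
    return layouts


# Eight DDR4 slots per socket on the development board.
layouts = channel_layouts(8)
for channels, dpc in layouts:
    print(f"{channels} channels at {dpc} DIMM(s) per channel")
```

Nothing in the count alone separates the two options, which is why the slot colors and the power delivery around the slots are the only hints available.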

However, it is worth noting that each bank of four memory slots on each side of each CPU has four chokes and four heatsinks (probably VRMs) in two sets. Typically we see one per channel (or one per solution), but the fact that each socket seems to have eight VRMs for the memory would also lean into the eight-channel idea. To top it off, each socket has a black EPS 12V connector, which is isolated and clearly for CPU power, but also a transparent EPS 12V and a transparent 6-pin PCIe connector. These transparent connectors are not as isolated, so are not for low power implementation, but each socket does have one attached, perhaps suggesting that the memory interfaces are powered independently of the CPU. More memory channels would require more power, and four-channel interfaces have been done and dusted before via the single EPS 12V, so requiring even more power raises questions. I have had word in my ear that this may be a result of support for future high energy memory, such as NVDIMM, although I have not been able to confirm this.

Edit: The transparent EPS 12V could be a PCIe 8-pin in retrospect, but still seems excessive for the power it can provide.

Unfortunately, we could not remove the heatsinks to see the CPUs or the socket, but chances are this demo system would not have CPUs equipped in the first place. Doing some basic math based on the length of a DDR4 module, our calculations show that the socket area (as delineated by the white line beyond the socket) is 7.46 cm x 11.877 cm, to give an area of 88.59 cm². By comparison, the heatsink has an active fin floor plan area of 62.6 cm² based on what we can measure. Unfortunately this gives us no indication of package area or die area, both of which would be more exciting numbers to have.
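For anyone wanting to check the figures, the area calculation is a single multiplication. The sketch below uses the rounded dimensions quoted above, so its product comes out at about 88.6 cm², fractionally above the 88.59 cm² we quote, presumably because our original measurements carried more digits. The variable names are ours.

```python
# Socket outline as measured against the length of a DDR4 module (rounded values).
width_cm = 7.46
length_cm = 11.877
socket_area_cm2 = width_cm * length_cm

# Active fin floor plan of the heatsink, as measured.
heatsink_fin_area_cm2 = 62.6
fin_fraction = heatsink_fin_area_cm2 / socket_area_cm2

print(f"socket outline: {socket_area_cm2:.2f} cm^2")
print(f"fin floor plan covers {fin_fraction:.0%} of the outline")
```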
