

NVIDIA Announces Jetson TX2: Parker Comes To NVIDIA’s Embedded System Kit


For a few years now, NVIDIA has been offering their line of Jetson embedded system kits. Originally launched using Tegra K1 in 2014, the first Jetson was designed to be a dev kit for groups looking to build their own Tegra-based devices from scratch. Instead, what NVIDIA found, somewhat to their surprise, was that groups would use the Jetson board as-is and build their devices around it. This unexpected market led NVIDIA to pivot a bit on what Jetson would be, resulting in the second-generation Jetson TX1, a proper embedded system board that can be used for both development purposes and production devices.

This relaunched Jetson came at an interesting time for NVIDIA, which was right when their fortunes in neural networking/deep learning took off in earnest. Though the Jetson TX1 and underlying Tegra X1 SoC lack the power needed for high-performance use cases – these are after all based on an SoC designed for mobile applications – they have enough power for lower-performance inferencing. As a result, the Jetson TX1 has become an important part of NVIDIA’s neural networking triad, offering their GPU architecture and its various benefits for devices doing inferencing at the “edge” of a system.

Now, about a year and a half after the launch of the Jetson TX1, NVIDIA is giving the Jetson platform a significant update in the form of the Jetson TX2. This updated Jetson is not as radical a change as the TX1 was before it – NVIDIA seems to have found a good place in terms of form factor and the platform’s core feature set – but NVIDIA is looking to take what worked with the TX1 and further ramp up the platform’s performance.

The big change here is the upgrade to NVIDIA’s newest-generation Parker SoC. While Parker never made it into third-party mobile designs, NVIDIA has been leveraging it internally for the Drive system and other projects, and now it will finally become the heart of the Jetson platform as well. Relative to the Tegra X1 in the previous Jetson, Parker is a bigger and better version of the SoC. The GPU architecture is upgraded to NVIDIA’s latest-generation Pascal architecture, and on the CPU side NVIDIA adds a pair of Denver 2 CPU cores to the existing quad-core Cortex-A57 cluster. Equally important, Parker finally goes back to a 128-bit memory bus, greatly boosting the memory bandwidth available to the SoC. The resulting SoC is fabbed on TSMC’s 16nm FinFET process, giving NVIDIA a much-welcomed improvement in power efficiency.

Paired with Parker on the Jetson TX2 as supporting hardware is 8GB of LPDDR4-3733 DRAM, a 32GB eMMC flash module, a 2×2 802.11ac + Bluetooth wireless radio, and a Gigabit Ethernet controller. The resulting board is still 50mm x 87mm in size, with NVIDIA intending it to be drop-in compatible with Jetson TX1.
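
As a quick sanity check on those specs, the peak theoretical bandwidth of LPDDR4-3733 on Parker's 128-bit bus works out as follows. This is a back-of-envelope sketch; the per-pin transfer rate and bus-width interpretation are the standard ones, not figures NVIDIA has broken out here:

```python
# Peak theoretical DRAM bandwidth for the Jetson TX2's memory subsystem.
# Assumptions (standard interpretations, not from the announcement):
# LPDDR4-3733 means 3733 MT/s per pin, and the 128-bit bus moves 16 bytes
# per transfer.
transfers_per_sec = 3733e6   # 3733 MT/s
bus_width_bytes = 128 // 8   # 128-bit bus -> 16 bytes per transfer

bandwidth_gbs = transfers_per_sec * bus_width_bytes / 1e9
print(f"Peak bandwidth: {bandwidth_gbs:.1f} GB/s")  # ~59.7 GB/s
```

That ~60GB/s figure is roughly double what a 64-bit LPDDR4 configuration could deliver, which is why the return to a 128-bit bus matters so much for GPU-heavy workloads.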

Given these upgrades to the core hardware, unsurprisingly NVIDIA’s primary marketing angle with the Jetson TX2 is on its performance relative to the TX1. In a bit of a departure from the TX1, NVIDIA is canonizing two performance modes on the TX2: Max-Q and Max-P. Max-Q is the company’s name for TX2’s energy efficiency mode; at 7.5W, this mode clocks the Parker SoC for efficiency over performance – essentially placing it right before the bend in the power/performance curve – with NVIDIA claiming that this mode offers 2x the energy efficiency of the Jetson TX1. In this mode, TX2 should have similar performance to TX1 in the latter’s max performance mode.

Meanwhile the board’s Max-P mode is its maximum performance mode. In this mode NVIDIA sets the board TDP to 15W, allowing the TX2 to hit higher performance at the cost of some energy efficiency. NVIDIA claims that Max-P offers up to 2x the performance of the Jetson TX1, though as GPU clockspeeds aren’t double TX1’s, it’s going to be a bit more sensitive on an application-by-application basis.

NVIDIA Jetson TX2 Performance Modes
                       Max-Q    Max-P                Max Clocks
GPU Frequency          854MHz   1122MHz              1302MHz
Cortex-A57 Frequency   1.2GHz   Stand-Alone: 2GHz    2GHz+
                                w/Denver: 1.4GHz
Denver 2 Frequency     N/A      Stand-Alone: 2GHz    2GHz
                                w/A57: 1.4GHz
TDP                    7.5W     15W                  N/A

In terms of clockspeeds, NVIDIA has disclosed that in Max-Q mode, the GPU is clocked at 854MHz while the Cortex-A57 cluster is at 1.2GHz. Going to Max-P increases the GPU clockspeed further to 1122MHz, and allows for multiple CPU options; either the Cortex-A57 cluster or Denver 2 cluster can be run at 2GHz, or both can be run at 1.4GHz. Though when it comes to all-out performance, even Max-P mode is below the TX2’s limits; the GPU clock can top out at just over 1300MHz and CPU clocks can reach 2GHz or better. Power states are configurable, so customers can dial in the TDPs and clockspeeds they want; however, NVIDIA notes that using the maximum clocks pushes further outside of the Parker SoC’s efficiency range.
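
The clock figures alone illustrate why the 2x performance claim is application-sensitive. A quick comparison, noting that the TX1's ~998MHz GPU clock is an assumption drawn from its published specs rather than from this announcement:

```python
# Ratios from the disclosed TX2 GPU clocks. The TX1 figure is an assumed
# reference point (~998MHz), not part of the TX2 announcement.
tx2_max_q, tx2_max_p, tx2_max = 854, 1122, 1302  # MHz
tx1_gpu = 998  # MHz (assumed TX1 max GPU clock)

print(f"Max-P over Max-Q:      {tx2_max_p / tx2_max_q:.2f}x")  # ~1.31x
print(f"Max-P over TX1 (est.): {tx2_max_p / tx1_gpu:.2f}x")    # ~1.12x
```

With only a ~12% clockspeed advantage over the TX1 under these assumptions, the bulk of any 2x gain would have to come from the Pascal architecture, the extra memory bandwidth, and workload-specific factors rather than raw frequency.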

Finally, along with announcing the Jetson TX2 module itself, NVIDIA is also announcing a Jetson TX2 development kit. The dev kit will actually ship first – it ships next week in the US and Europe, with other regions in April – and contains a TX2 module along with a carrier board to provide I/O breakout and interfaces to various features such as USB, HDMI, and Ethernet. Judging from the pictures NVIDIA has sent over, the TX2 carrier board is very similar (if not identical) to the TX1 carrier board, so like the TX2 itself it should be familiar to existing Jetson developers.

With the dev kit leading the charge for Jetson TX2, NVIDIA will be selling it for $599 retail/$299 education, the same price the Jetson TX1 dev kit launched at back in 2015. Meanwhile the stand-alone Jetson TX2 module will be arriving in Q2’17, priced at $399 in 1K unit quantities. In the case of the module, this means prices have gone up a bit since the last generation; the TX2 is hitting the market at $100 higher than where the TX1 launched.

Samsung Announces Exynos 8895 SoC: 10nm, Mali G71MP20, & LPDDR4x


Even though Mobile World Congress doesn’t kick off for another few days, Samsung isn’t wasting any time in getting started. This morning the company is announcing their latest generation high-end ARM SoC, the Exynos 8895. It’s the company’s first in-house 10nm SoC, and while Samsung isn’t saying what it will go into, based on the context of the announcement it’s a safe bet we’re looking at the SoC for at least some SKUs of the next Galaxy S phone.

While Samsung has been in the SoC game with the Exynos series for a number of years now, it’s been in the last few years that they’ve really cemented their position as a market leader at the high-end. Thanks in part to the company’s 14nm process, the Exynos 7420 proved to be a very capable and powerful SoC from the company. Last year Samsung followed that up with the Exynos 8890, which among other firsts marked Samsung’s entry into designing their own CPU cores with the M1.

Now for 2017 Samsung wants to repeat their success over the past couple of years with the Exynos 9 Series 8895. As you can likely infer from the name, it’s not meant to be radically different from the preceding 8890, but there are still some pretty important changes here that should affect performance across the board.

Samsung Exynos SoCs Specifications
                 Exynos 8895           Exynos 8890             Exynos 7420
CPU              4x Exynos M2(?)       4x Exynos M1 @ 2.3GHz   4x A57 @ 2.1GHz
                 4x A53                4x A53 @ 1.6GHz         4x A53 @ 1.5GHz
GPU              Mali G71MP20          Mali T880MP12           Mali T760MP8
                                       @ 650MHz                @ 770MHz
Memory           2x 32-bit(?)          2x 32-bit               2x 32-bit
Controller       LPDDR4x               LPDDR4 @ 1794MHz        LPDDR4 @ 1555MHz
                                       (28.7GB/s b/w)          (24.8GB/s b/w)
Storage          eMMC 5.1, UFS 2.1     eMMC 5.1, UFS 2.0       eMMC 5.1, UFS 2.0
Modem            Down: LTE Cat16       Down: LTE Cat12         N/A
                 Up: LTE Cat13         Up: LTE Cat13
ISP              Rear: 28MP            Rear: 24MP              Rear: 16MP
                 Front: 28MP           Front: 13MP             Front: 5MP
Mfc. Process     Samsung 10nm LPE      Samsung 14nm LPP        Samsung 14nm LPE

The big deal for Samsung of course is that the Exynos 8895 is their first 10nm SoC, designed by Samsung LSI and fabbed by Samsung. Semantics of what is or isn’t 10nm aside, Samsung’s 10nm LPE process is cutting-edge for a mobile SoC, and relative to the current 14nm process offers better density and better performance characteristics. Samsung has talked about the process a bit in the past, and for the Exynos 8895 announcement they are reiterating that the 10nm LPE process offers “up to 27% higher performance while consuming 40% less power” relative to 14nm. However this may be an error in phrasing on Samsung’s part, as last year the claim was “27-percent higher performance or 40-percent lower power consumption”, which is a more realistic statement. Either way, for the 8895 in particular, Samsung isn’t talking about performance quite yet.

Diving into the specs, the CPU situation looks a great deal like the previous 8890. Samsung has gone with 8 cores – 4 high-power, 4 low-power – with a mix of custom and licensed silicon. The high-power cores are composed of what Samsung is calling a “2nd generation” custom CPU core. This would presumably be a newer iteration of the M1 (so the M2?), but Samsung isn’t offering up much in the way of details at this time on what’s changed from the M1. What we do know is that Samsung is touting that it offers both better performance and improved energy efficiency. Meanwhile low-power work is once again being provided by ARM’s Cortex-A53. (ed: which, on 10nm, must be absolutely tiny, considering that a core was sub-1mm² on 14nm)

Meanwhile on the GPU side, Samsung has significantly upgraded their graphics capabilities by tapping ARM’s latest-generation Mali-G71 GPU in an MP20 configuration. Based on ARM’s new Bifrost GPU architecture, the G71 radically overhauls the internal workings of the GPU to match the contemporary thread level parallelism (TLP)-centric nature of desktop GPUs and modern workloads. ARM has previously discussed that they expect G71-based devices to offer around 50% better graphics performance than T880 devices, and Samsung is going one step further by touting it as 60% faster performance.

In another first for Samsung, the 8895 is also their first Heterogeneous System Architecture (HSA) compliant SoC. This requires that the CPU, GPU, and interconnect all support HSA, and indeed all of the necessary pieces have come together for 8895. We’ve previously seen that the Mali-G71 GPU is HSA-compliant, and meanwhile for the 8895 Samsung has rolled out a new version of their interconnect (the Samsung Coherent Interconnect) to support HSA. This isn’t a development that I expect will have immediate ramifications, but HSA is ultimately at the core of making it easier for developers to program applications that use the GPU in a compute context, thanks to the common (and common-sense) architecture rules for HSA.

To feed the resulting beast, Samsung has added support for LPDDR4x memory. An extension of the original LPDDR4 standard, LPDDR4x is designed to reduce DRAM power consumption by up to 20% by reducing the output driver power (I/O VDDQ voltage) by 45%, from 1.1 V to 0.6 V. LPDDR4x memory has just started shipping, so along with the previously announced Snapdragon 835, the Exynos 8895 is the other high-performance SoC coming out this year to support the new memory.
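
The voltage figures check out, and if anything they understate the saving on the output drivers themselves, since CMOS driver switching power scales roughly with the square of the supply voltage. The V² scaling below is a standard first-order approximation, not a JEDEC or Samsung figure:

```python
# Checking the LPDDR4x numbers quoted above: dropping I/O VDDQ from 1.1V
# to 0.6V. The quadratic power scaling is a rough CMOS approximation
# (P ~ C * V^2 * f), used here only for illustration.
vddq_lpddr4, vddq_lpddr4x = 1.1, 0.6  # volts

v_reduction = 1 - vddq_lpddr4x / vddq_lpddr4            # voltage drop
p_reduction = 1 - (vddq_lpddr4x / vddq_lpddr4) ** 2     # approx. driver power drop

print(f"Voltage reduction:            {v_reduction:.0%}")  # ~45%
print(f"Approx. driver power savings: {p_reduction:.0%}")  # ~70%
```

The overall "up to 20%" DRAM power saving is smaller than the driver-level figure because the I/O drivers are only one contributor to total DRAM power.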

The Exynos 8895 is also getting an upgraded ISP. The latest ISP supports 28MP for both the front and rear cameras, while a bit more nebulously, Samsung’s spec sheet also lists support for “28MP+16MP Dual Camera” mode, an unsurprising development given the recent popularity of dual camera phone designs. Diving a bit deeper, we find that the 8895’s ISP is actually two ISPs: a high-performance ISP and a low-power ISP, with the low-power ISP presumably providing the aforementioned 16MP capability. Samsung is touting this combination as allowing them to offer dual camera functionality while still keeping power consumption in check.

On the flip side of the coin, the Exynos 8895 also gets a new version of Samsung’s video decode block, which the company calls their Multi-Format Codec (MFC). This latest MFC supports all the bells and whistles you’d expect, with both HEVC and VP9 decoding up to 4Kp120. Samsung’s press release also briefly mentions a “video processing technology that enables a higher quality experience by enhancing the image quality” that’s capable of “enhancing the image quality of a specific portion that is perceived more sensitive to the human eye.” Given the VR applications – and Samsung wants to be able to do 4K VR – this sounds a bit like a variation on the idea of foveated rendering, but there aren’t any further details on the technology at this time.

Also appearing for the first time on the Exynos 8895 is Samsung’s Cat16 LTE modem design. With their modem Samsung is using 5x Carrier Aggregation to achieve up to 1Gbps down, while uploading is rated at LTE Cat 13, using 2 carriers to get 150Mbps up. What’s notable here is that, as best as I can tell, this is the first modem using 5x CA; Qualcomm’s equivalent modem, the X16, uses 3 or 4x CA depending on the scenario. Unfortunately with the limited details Samsung offers right now, I’m not sure whether they have to use 5x CA to get Cat 16 bandwidth, or this is just another optional mode.
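
As a rough sketch of how 5x CA reaches the ~1Gbps neighborhood: the per-carrier figure below assumes 20MHz carriers with 2x2 MIMO and 256-QAM, which Samsung has not confirmed for this modem:

```python
# Back-of-envelope aggregate downlink for 5x carrier aggregation.
# Assumption (not from Samsung): each carrier is 20MHz with 2x2 MIMO,
# which peaks around 150Mbps at 64-QAM; 256-QAM carries 8 bits/symbol
# vs 6, scaling that to ~200Mbps per carrier.
per_carrier_64qam = 150                          # Mbps
per_carrier_256qam = per_carrier_64qam * 8 / 6   # ~200 Mbps
carriers = 5

peak_down = carriers * per_carrier_256qam
print(f"Aggregate downlink: {peak_down:.0f} Mbps")  # ~1000 Mbps
```

Under these assumptions the math lands almost exactly on the Cat 16 ~1Gbps figure, which is why Qualcomm's X16 gets there with fewer carriers only by using wider aggregated bandwidth or 4x4 MIMO on some of them.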

Finally, the Exynos 8895 also includes what Samsung is calling an “enhanced security sub-system with a separate security processing unit” for use with user authentication, mobile payments, and the like. Based on Samsung’s description this sounds a heck of a lot like Apple’s Secure Enclave, which would be a very welcome development, as in Apple’s case it has made their phones a lot harder to break into.

Wrapping things up, along with today’s product announcement of the Exynos 8895, Samsung is also announcing that the SoC is in mass production; and indeed I would be surprised if this isn’t the SoC they announced back in October, which would mean it’s been in production for some time now. We still don’t know when we’re going to see the next Samsung Galaxy S phone, but given how Samsung is announcing the SoC in this fashion, clearly it’s going to be sooner rather than later. In the meantime, hopefully we’ll get some additional SoC details next week at MWC.


Semi-Critical Intel Atom C2000 SoC Flaw Discovered, Hardware Fix Required


Last week, Paul Alcorn over at Tom’s Hardware picked up on an interesting statement made by Intel in their Q4 2016 earnings call. The company, whose Data Center group’s profits had slipped a bit year-over-year, was “observing a product quality issue in the fourth quarter with slightly higher expected failure rates under certain use and time constraints.” As a result the company had set up a reserve fund as part of their larger effort to deal with the issue, which would include a “minor” design (i.e. silicon) fix to permanently resolve the problem.

A bit more digging by Paul further turned up that the problem was with Intel’s Atom C2000 family, better known by the codenames Avoton and Rangeley. As a refresher, the Silvermont-based server SoCs were launched in Q3 of 2013 – about three and a half years ago – and are offered with 2, 4, and 8 cores. These chips are, in turn, meant for use in lower-power and reasonably highly threaded applications such as microservers, communication/networking gear, and storage. As a result the C2000 is an important part of Intel’s product lineup – especially as it directly competes with various ARM-based processors in many of its markets – but it’s a name that’s better known to device manufacturers and IT engineers than it is to consumers. Consequently, an issue with the C2000 family doesn’t immediately raise any eyebrows.

Jumping a week into the present, since their earnings call Intel has posted an updated spec sheet for the Atom C2000 family. More importantly, device manufacturers have started posting new product errata notices; and while they are keeping their distance from naming the C2000 directly, all signs point to the affected products being C2000 based. As a result we finally have some insight into what the issue is with the C2000. And while the news isn’t anywhere close to dire, it’s certainly not good news for Intel. As it turns out, there’s a degradation issue with at least some (if not all) parts in the Atom C2000 family, which over time can cause chips to fail only a few years into their lifetimes.

The Problem: Early Circuit Degradation

To understand what’s going on and why C2000 SoCs can fail early, let’s start with Intel’s updated spec sheet, which contains the new errata for the problem.

AVR54. System May Experience Inability to Boot or May Cease Operation

Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.

Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot.

Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.

At a high-level, the problem is that the operating clock for the Low Pin Count bus can stop working. Essentially a type of legacy bus, the LPC bus is a simple bus for simple peripherals, best known for supporting legacy devices such as serial and parallel ports. It is not a bus that’s strictly necessary for the operation of a computer or embedded device, and instead its importance depends on what devices are being hung off of it. Along with legacy I/O devices, the second most common device type to hang off of the LPC is the boot ROM/BIOS – owing to the fact that it’s a simple device that needs little bandwidth – and this is where the C2000 flaw truly rears its head.

As Intel’s errata succinctly explains, if the LPC bus breaks, then any system using it to host the boot ROM will no longer be able to boot, as the system would no longer be able to access said boot ROM. The good news is that Intel has a workaround (more on that in a second), so it’s an avoidable failure, but it’s a hardware workaround, meaning the affected boards have to be reworked to fix them. Complicating matters, since the Atom C2000 is a BGA chip being used in an embedded fashion, an LPC failure means that the entire board (if not the entire device) has to be replaced.

Diving deeper, the big question of course is how the LPC bus could break in this fashion. To that end, The Register reached out to Intel and has been able to get a few more details. As quoted by The Register, Intel is saying that the problem is “a degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service.”

Though we tend to think of solid-state electronics as just that – solid and unchanging – circuit degradation is a normal part of the lifecycle of a complex semiconductor like a processor. Quantum tunneling and other effects on a microscopic scale will wear down processors while they’re in use, leading to eventual performance degradation or operational failure. However even with modern processors the effect should take a decade or longer, much longer than the expected service lifetime of a chip. So when something happens to speed up the degradation process, if severe enough it can cut the lifetime of a chip to a fraction of what it was planned for, causing a chip (or line of chips) to fail while still in active use. And this is exactly what’s happening with the Atom C2000.

For Intel, this is the second time this decade that they’ve encountered a degradation issue like this. Back in 2011 the company had to undertake a much larger and more embarrassing repair & replacement program for motherboards using early Intel 6-series chipsets. On those boards an overbiased (overdriven) transistor controlling some of the SATA ports could fail early, disabling those SATA ports. And while Intel hasn’t clarified whether something similar to this is happening on the Atom C2000, I wouldn’t be too surprised if it was. Which isn’t to unnecessarily pick on Intel here; given the geometries at play (bear in mind just how small a 22nm transistor is) transistor reliability is a significant challenge for all players. Just a bit too much voltage on a single transistor out of billions can be enough to ultimately break a chip.

The Solution: New Silicon & Reworked Motherboards

Anyhow, the good news is that Intel has developed both a silicon workaround and a platform workaround. The long-term solution is of course rolling out a new revision of the C2000 silicon that incorporates a fix for the issue, and Intel has told The Register they’ll be doing just that. This will actually come somewhat late in the lifetime of the processor, as the current B0 revision was launched three and a half years ago and will be succeeded by Denverton this year. At the same time though, as an IT-focused product Intel will still need to offer the Atom C2000 series to customers for a number of years to come, so even with the cost of a new revision of the silicon, it’s in Intel’s long-term interest.

More immediately, the platform fix can be used to prevent the issue on boards with the B0 silicon. Unfortunately Intel isn’t disclosing just what the platform fix is, but if it is a transistor bias issue, then the fix is likely to involve reducing the voltage to the transistor, essentially bringing its degradation back to expected levels. Some individual product vendors are also reporting that the fix can be reworked into existing (post-production) boards, though it sounds like this can only prevent the issue, not fix an already-unbootable board.

Affected Products: Routers, Servers, & NASes

As a result of the nature of the problem, the situation is a mixed bag for device manufacturers and owners. First and foremost, while most manufacturers have used the LPC bus to host the boot ROM, not all of them have. For the smaller number of manufacturers who are using SPI Flash, this wouldn’t impact them unless they were using the LPC bus for something else. Otherwise, for those manufacturers who are impacted, transistor degradation is heavily dependent on ambient temperature and use: the hotter a chip is and the harder it’s run, the faster a transistor will degrade. Consequently, while all C2000 chips have the flaw, not all C2000 chips will have their LPC clock fail before a device reaches the end of its useful lifetime. And certainly not all C2000 chips will fail at the same time.

Cisco, whose routers are impacted, estimates that while issues can occur as early as 18 months in, they don’t expect a meaningful spike in failures until 3 years (36 months) in. This of course happens to be just a bit shorter than the age of the first C2000 products, which is likely why this issue hasn’t come to light until now. Failures would then become increasingly likely as time goes on, and accordingly Cisco will be replacing the oldest affected routers first, as they’re the most vulnerable to the degradation issue.
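
This pattern, where failures stay rare early and then ramp up with age, is the classic wear-out shape that a Weibull model with shape parameter k > 1 captures: the failure rate rises over time, so the oldest units are always the most at risk. The parameters below are invented purely for illustration and are not derived from Cisco's or Intel's data:

```python
import math

# Illustrative Weibull wear-out model. Shape k > 1 gives an increasing
# failure rate with age; the shape and scale values here are arbitrary
# example parameters, not fitted to any vendor's failure data.
k, lam = 3.0, 6.0  # shape (dimensionless), scale (years)

def weibull_failed_fraction(t_years: float) -> float:
    """Fraction of units failed by time t under Weibull(k, lam)."""
    return 1 - math.exp(-((t_years / lam) ** k))

for t in (1.5, 3.0, 5.0):
    print(f"{t:.1f} years: {weibull_failed_fraction(t):.1%} failed")
```

Even with made-up numbers the qualitative behavior matches Cisco's guidance: failures are possible but rare at 18 months, start to become meaningful around 3 years, and accelerate from there, which is why replacement programs prioritize the oldest hardware.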

As for other vendors shipping Atom C2000-based products, those vendors are setting up their own support programs. Patrick Kennedy over at ServeTheHome has already started compiling a list of vendor responses, including Supermicro and Netgate. However as it stands a lot of vendors are still developing their response to the issue, so this will be an ongoing process.

Finally, what’s likely to be most affected on the consumer side of matters is the Network Attached Storage front. As pointed out to me by our own Ganesh TS, Seagate, Synology, ASRock, Advantronix, and other NAS vendors have all shipped devices using the flawed chips, and as a result all of these products are vulnerable to early failures. These vendors are still working on their respective support programs, but for covered devices the result is going to be the same: the affected NASes will need to be swapped for models with fixed boards/silicon. So NAS owners will want to pay close attention here, as while these devices aren’t necessarily at risk of immediate failure, they are at risk of failure in the long term.

Sources: Tom’s Hardware, The Register, & ServeTheHome

Semi-Critical Intel Atom C2000 SoC Flaw Discovered, Hardware Fix Required

Last week, Paul Alcorn over at Tom’s Hardware picked up on an interesting statement made by Intel in their Q4 2016 earnings call. The company, whose Data Center group’s profits had slipped a bit year-over-year, was “observing a product quality issue in the fourth quarter with slightly higher expected failure rates under certain use and time constraints.” As a result the company had set up a reserve fund as part of their larger effort to deal with the issue, which would include a “minor” design (i.e. silicon) fix to permanently resolve the problem.

A bit more digging by Paul further turned up that the problem was with Intel’s Atom C2000 family, better known by the codenames Avoton and Rangeley. As a refresher, the Silvermont-based server SoCs were launched in Q3 of 2013 – about three and a half years ago – and are offered with 2, 4, and 8 cores. These chips are, in turn, meant for use in lower-power and reasonably highly threaded applications such as microservers, communication/networking gear, and storage. As a result the C2000 is an important part of Intel’s product lineup – especially as it directly competes with various ARM-based processors in many of its markets – but it’s a name that’s better known to device manufacturers and IT engineers than it is to consumers. Consequently, an issue with the C2000 family doesn’t immediately raise any eyebrows.

Jumping a week into the present, since their earnings call Intel has posted an updated spec sheet for the Atom C2000 family. More importantly, device manufacturers have started posting new product errata notices; and while they are keeping their distance from naming the C2000 directly, all signs point to the affected products being C2000 based. As a result we finally have some insight into what the issue is with C2000. And while the news isn’t anywhere close to dire, it’s certainly not good news for Intel. As it turns out, there’s a degradation issue with at least some (if not all) parts in the Atom C2000 family, which over time can cause chips to fail only a few years into their lifetimes.

The Problem: Early Circuit Degradation

To understand what’s going on and why C2000 SoCs can fail early, let’s start with Intel’s updated spec sheet, which contains the new errata for the problem.

AVR54. System May Experience Inability to Boot or May Cease Operation

Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.

Implication: If the LPC clock(s) stop functioning, the system will no longer be able to boot.

Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.

At a high-level, the problem is that the operating clock for the Low Pin Count bus can stop working. Essentially a type of legacy bus, the LPC bus is a simple bus for simple peripherals, best known for supporting legacy devices such as serial and parallel ports. It is not a bus that’s strictly necessary for the operation of a computer or embedded device, and instead its importance depends on what devices are being hung off of it. Along with legacy I/O devices, the second most common device type to hang off of the LPC bus is the boot ROM/BIOS – owing to the fact that it’s a simple device that needs little bandwidth – and this is where the C2000 flaw truly rears its head.

As Intel’s errata succinctly explains, if the LPC bus breaks, then any system using it to host the boot ROM will no longer be able to boot, as the system would no longer be able to access said boot ROM. The good news is that Intel has a workaround (more on that in a second), so it’s an avoidable failure, but it’s a hardware workaround, meaning the affected boards have to be reworked to fix them. Complicating matters, since Atom C2000 is a BGA chip being used in an embedded fashion, an LPC failure means that the entire board (if not the entire device) has to be replaced.

Diving deeper, the big question of course is how the LPC bus could break in this fashion. To that end, The Register reached out to Intel and has been able to get a few more details. As quoted by The Register, Intel is saying that the problem is “a degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service.”

Though we tend to think of solid-state electronics as just that – solid and unchanging – circuit degradation is a normal part of the lifecycle of a complex semiconductor like a processor. Quantum tunneling and other effects on a microscopic scale will wear down processors while they’re in use, leading to eventual performance degradation or operational failure. However even with modern processors the effect should take a decade or longer, much longer than the expected service lifetime of a chip. So when something happens to speed up the degradation process, if severe enough it can cut the lifetime of a chip to a fraction of what it was planned for, causing a chip (or line of chips) to fail while still in active use. And this is exactly what’s happening with the Atom C2000.

For Intel, this is the second time this decade that they’ve encountered a degradation issue like this. Back in 2011 the company had to undertake a much larger and more embarrassing repair & replacement program for motherboards using early Intel 6-series chipsets. On those boards an overbiased (overdriven) transistor controlling some of the SATA ports could fail early, disabling those SATA ports. And while Intel hasn’t clarified whether something similar to this is happening on the Atom C2000, I wouldn’t be too surprised if it was. Which isn’t to unnecessarily pick on Intel here; given the geometries at play (bear in mind just how small a 22nm transistor is) transistor reliability is a significant challenge for all players. Just a bit too much voltage on a single transistor out of billions can be enough to ultimately break a chip.

The Solution: New Silicon & Reworked Motherboards

Anyhow, the good news is that Intel has developed both a silicon workaround and a platform workaround. The long-term solution is of course rolling out a new revision of the C2000 silicon that incorporates a fix for the issue, and Intel has told The Register they’ll be doing just that. This will actually come somewhat late in the lifetime of the processor, as the current B0 revision was launched three and a half years ago and will be succeeded by Denverton this year. At the same time though, as an IT-focused product Intel will still need to offer the Atom C2000 series to customers for a number of years to come, so even with the cost of a new revision of the silicon, it’s in Intel’s long-term interest.

More immediately, the platform fix can be used to prevent the issue on boards with the B0 silicon. Unfortunately Intel isn’t disclosing just what the platform fix is, but if it is a transistor bias issue, then the fix is likely to involve reducing the voltage to the transistor, essentially bringing its degradation back to expected levels. Some individual product vendors are also reporting that the fix can be reworked into existing (post-production) boards, though it sounds like this can only prevent the issue, not fix an already-unbootable board.

Affected Products: Routers, Servers, & NASes

Given the nature of the problem, the situation is a mixed bag for device manufacturers and owners. First and foremost, while most manufacturers have used the LPC bus to host the boot ROM, not all of them have. For the smaller number of manufacturers who are using SPI Flash, this wouldn’t impact them unless they were using the LPC bus for something else. Otherwise, for those manufacturers who are impacted, transistor degradation is heavily dependent on ambient temperature and use: the hotter a chip and the harder it’s run, the faster a transistor will degrade. Consequently, while all C2000 chips have the flaw, not all C2000 chips will have their LPC clock fail before a device reaches the end of its useful lifetime. And certainly not all C2000 chips will fail at the same time.
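To give a rough sense of why temperature matters so much here, reliability engineers typically model thermally-driven wear-out with an Arrhenius acceleration factor. The sketch below is purely illustrative – the activation energy is an assumed placeholder value common in the literature, as Intel has published no figures for this particular mechanism – but it shows how a modest rise in operating temperature can multiply the degradation rate several times over.

```python
import math

# Boltzmann constant in eV/K
K_B = 8.617e-5

def acceleration_factor(t_cool_c, t_hot_c, ea_ev=0.7):
    """Arrhenius acceleration factor between two operating temperatures.

    ea_ev is an assumed activation energy (0.7 eV is a common textbook
    placeholder for silicon wear-out mechanisms); the real value for the
    C2000 flaw is not public.
    """
    t_cool = t_cool_c + 273.15  # convert Celsius to kelvin
    t_hot = t_hot_c + 273.15
    return math.exp((ea_ev / K_B) * (1.0 / t_cool - 1.0 / t_hot))

# Under these assumptions, a chip running at 85C degrades roughly
# 8x faster than the same chip at 55C
af = acceleration_factor(55, 85)
```

The exact multiplier depends entirely on the assumed activation energy, but the qualitative point holds for any wear-out mechanism: identical chips in a hot, busy appliance and a cool, idle one will fail on very different schedules.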

Cisco, whose routers are impacted, estimates that while issues can occur as early as 18 months in, they don’t expect a meaningful spike in failures until 3 years (36 months) in. This of course happens to be just a bit shorter than the age of the first C2000 products, which is likely why this issue hasn’t come to light until now. Failures would then become increasingly likely as time goes on, and accordingly Cisco will be replacing the oldest affected routers first, as they’re the most vulnerable to the degradation issue.
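Cisco’s timeline – few failures before 18 months, a meaningful spike by 36 – is the classic signature of a wear-out mechanism, which is usually modeled with a Weibull distribution. The numbers below are invented for illustration only (Cisco has not published its failure-rate parameters); the point is the shape of the curve, not the values.

```python
import math

def weibull_cdf(t_months, scale, shape):
    """Cumulative probability that a unit has failed by time t,
    for a Weibull life distribution."""
    return 1.0 - math.exp(-((t_months / scale) ** shape))

# Illustrative parameters only: a shape parameter well above 1 models
# wear-out (failure rate rising with age); the scale sets the
# characteristic life. These are NOT Cisco's or Intel's figures.
scale, shape = 60.0, 4.0

p18 = weibull_cdf(18, scale, shape)  # failures are rare at 18 months
p36 = weibull_cdf(36, scale, shape)  # an order of magnitude more by 36
```

With a steep shape parameter, doubling the age multiplies the cumulative failure probability many times over, which is why replacing the oldest boards first is the sensible triage order.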

As for other vendors shipping Atom C2000-based products, those vendors are setting up their own support programs. Patrick Kennedy over at ServeTheHome has already started compiling a list of vendor responses, including Supermicro and Netgate. However as it stands a lot of vendors are still developing their response to the issue, so this will be an ongoing process.

Finally, what’s likely to be the most affected on the consumer side of matters will be on the Network Attached Storage front. As pointed out to me by our own Ganesh TS, Seagate, Synology, ASRock, Advantronix, and other NAS vendors have all shipped devices using the flawed chips, and as a result all of these products are vulnerable to early failures. These vendors are still working on their respective support programs, but for covered devices the result is going to be the same: the affected NASes will need to be swapped for models with fixed boards/silicon. So NAS owners will want to pay close attention here, as while these devices aren’t necessarily at risk of immediate failure, they are at risk of failure in the long term.

Sources: Tom’s Hardware, The Register, & ServeTheHome