

CEVA Launches Fifth-Generation Machine Learning Image and Vision DSP Solution: CEVA-XM6


Deep learning, neural networks and image/vision processing are already large fields, but many of the applications that rely on them are still in their infancy. Automotive is the prime example that draws on all of these areas, and solutions to the automotive ‘problem’ require significant understanding and development in both hardware and software – the ability to process data with high accuracy in real time opens up a number of doors for other machine learning workloads, and all that remains afterwards is cost and power. The CEVA-XM4 DSP was aimed at being the first programmable DSP to support deep learning, and the new XM6 IP (along with its software ecosystem) is being launched today under the heading of stronger efficiency, more compute, and new patents covering power-saving features.

Playing the IP Game

When CEVA launched the XM4 DSP, with the ability to run inference on pre-trained networks in fixed-point math to within ~1% of the accuracy of the full floating-point algorithms, it won a number of awards from analysts in the field, who cited high performance and power efficiency over competing solutions as well as the initial progress on a software framework. The IP announcement was back in Q1 2015, with licensees coming on board over the following year and the first production silicon using the IP rolling off the line this year. Since then, CEVA has announced its CDNN2 platform, a one-button compilation tool for converting trained networks into code suitable for CEVA’s XM IPs. The new XM6 integrates the previous XM4 features with improved configurations and access to hardware accelerators (including new ones), and it retains compatibility with the CDNN2 platform, such that code written for the XM4 can run on the XM6 with improved performance.
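The ~1% accuracy claim comes from quantizing trained floating-point weights into fixed-point values. CEVA has not published the details of its quantization scheme, so the sketch below is only a generic illustration of the idea: map float weights onto a 16-bit integer grid via a per-tensor scale factor and measure the error introduced. All names here are illustrative.

```python
import numpy as np

# Hypothetical sketch (not CEVA's scheme): quantize float weights to
# 16-bit fixed point with a per-tensor scale, then measure the error.
def quantize_int16(w):
    scale = np.max(np.abs(w))                      # per-tensor scale factor
    q = np.round(w / scale * 32767).astype(np.int16)
    return q, scale

def dequantize_int16(q, scale):
    return q.astype(np.float32) / 32767 * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int16(weights)
restored = dequantize_int16(q, scale)

# Worst-case error relative to the largest weight magnitude.
rel_err = np.abs(restored - weights).max() / np.abs(weights).max()
print(f"max relative error: {rel_err:.5%}")
```

With 16 bits the per-weight rounding error is tiny; the ~1% figure for real networks comes from these small errors accumulating across many layers, not from any single weight.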

CEVA is in the IP business, like ARM, and works with semiconductor licensees that then sell to OEMs. This typically results in a long time-to-market, especially when industries such as security and automotive are moving at a rapid pace. CEVA is promoting the XM6 as a programmable DSP that can scale across markets with a single code base, while also using additional features to improve power, cost and performance.

The announcement today covers the new XM6 DSP, CEVA’s new set of imaging and vision software libraries, a set of new hardware accelerators, and integration into the CDNN2 ecosystem. CDNN2 is a one-button compilation tool that detects convolution layers and applies the best methodology for transferring data across the logic blocks and accelerators.

XM6 will support OpenCL and C++ development tools, and the software elements include CEVA’s computer vision, neural network and vision processing libraries, along with third-party tools. The hardware implements an AXI interconnect so that the processing parts of the standard XM6 core can interact with the accelerators and memory. Alongside the XM6 IP there are hardware accelerators for convolution (CDNN assistance), allowing lower-power fixed-function hardware to cope with the difficult parts of neural networks such as GoogLeNet; De-Warp, for adjusting images taken with fish-eye or otherwise distorted lenses (once the distortion of an image is known, the math for the transform is fixed-function friendly); as well as other third-party hardware accelerators.
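The reason de-warping suits fixed-function hardware is worth spelling out: once the lens distortion is calibrated, correcting a frame reduces to a fixed per-pixel lookup table, so no per-frame math is needed beyond a gather. The sketch below illustrates this with a simple radial distortion model; the coefficient `k` and the single-coefficient model are illustrative assumptions, not CEVA's implementation.

```python
import numpy as np

# Sketch (not CEVA's De-Warp block): with a calibrated radial distortion
# coefficient k, build a remap table once, then de-warp every frame with
# a pure gather - which is why the operation maps well to fixed hardware.
def build_remap(h, w, k):
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    cx, cy = w / 2.0, h / 2.0
    x, y = (xx - cx) / cx, (yy - cy) / cy          # normalized coordinates
    r2 = x * x + y * y
    factor = 1.0 + k * r2                          # simple radial model
    src_x = np.clip(x * factor * cx + cx, 0, w - 1).astype(np.int32)
    src_y = np.clip(y * factor * cy + cy, 0, h - 1).astype(np.int32)
    return src_y, src_x

def dewarp(img, remap):
    src_y, src_x = remap
    return img[src_y, src_x]                       # gather only, no math per frame

img = np.arange(64 * 64, dtype=np.uint8).reshape(64, 64)
remap = build_remap(64, 64, k=-0.2)                # k calibrated offline
out = dewarp(img, remap)
print(out.shape)  # (64, 64)
```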

The XM6 promotes two new specific hardware features that will aid the majority of image processing and machine learning algorithms. The first is scatter-gather, or the ability to read values from 32 addresses in L1 cache into vector registers in a single cycle. The CDNN2 compilation tool identifies serial code loading and implements vectorization to allow this feature, and scatter-gather improves data loading time when the data required is distributed through the memory structure. As the XM6 is configurable IP, the size/associativity of the L1 data store is adjustable at the silicon design level, and CEVA has stated that this feature will work with any size L1. The vector registers for processing at this level are 8-wide VLIW implementations, meaning ‘feed the beast’ is even more important than usual.
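The transformation CDNN2 performs on serial loads can be pictured in software terms: a loop that reads one scattered address per iteration becomes a single gather of all the indices at once. The sketch below uses NumPy fancy indexing to stand in for the hardware gather; sizes and index values are illustrative.

```python
import numpy as np

# Illustration of the gather pattern: serial loads from scattered
# addresses versus one vectorized gather. On the XM6 the hardware reads
# all 32 L1 addresses in a single cycle; here NumPy's fancy indexing
# plays the same role. The data and indices are stand-ins.
l1_data = np.arange(1024, dtype=np.int16)              # stand-in for L1 contents
indices = np.array([7, 130, 12, 999, 64, 3, 511, 88] * 4)  # 32 scattered addresses

# Serial form that the compiler would detect:
serial = np.empty(32, dtype=np.int16)
for i, addr in enumerate(indices):
    serial[i] = l1_data[addr]

# Vectorized (gathered) form - one operation instead of 32:
gathered = l1_data[indices]

assert np.array_equal(serial, gathered)
```

The win is latency, not arithmetic: both forms load the same 32 values, but the gathered form keeps the 8-wide VLIW vector units fed instead of stalling on one load per cycle.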

The second feature is called ‘sliding-window’ data processing, and this specific technique for vision processing has been patented by CEVA. There are many ways to work through an image for either enhancement or intelligence, and typically an algorithm will operate on a block or tile of pixels at a time. For the intelligence part, a number of these blocks will overlap, resulting in areas of the image being reused at different stages of the computation. CEVA’s method is to retain that data, so fewer bits need to be fetched in the next step of the analysis. If this sounds straightforward (I was doing something similar with 3D differential equation analysis back in 2009), it is, and I was surprised that it had not been implemented in vision/image processing before. Reusing old data (assuming you have somewhere to store it) saves time and saves energy.
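A one-dimensional analogue makes the reuse concrete: adjacent windows overlap in all but one element, so each step only needs to add the new element and drop the old one rather than reloading the whole window. This is a generic sketch of the data-reuse idea, not CEVA's patented implementation.

```python
import numpy as np

# Sketch of sliding-window data reuse: each window overlaps the previous
# one, so the running result can be updated instead of recomputed.
def window_sums_naive(x, w):
    # Reloads the full window every step.
    return np.array([x[i:i + w].sum() for i in range(len(x) - w + 1)])

def window_sums_reuse(x, w):
    # Full load only once; afterwards, add the new element, drop the old.
    out = np.empty(len(x) - w + 1, dtype=x.dtype)
    out[0] = x[:w].sum()
    for i in range(1, len(out)):
        out[i] = out[i - 1] - x[i - 1] + x[i + w - 1]
    return out

x = np.arange(10)
assert np.array_equal(window_sums_naive(x, 3), window_sums_reuse(x, 3))
```

The naive version touches w elements per window; the reuse version touches two, which is where the time and energy savings come from once the retained data has somewhere to live.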

CEVA is claiming up to a 3x performance gain in heavy vector workloads for XM6 over XM4, with an average of 2x improvement for like-for-like ported kernels. The XM6 is also more configurable than the XM4 from a code perspective, offering ‘50% more control’.

With the specific CDNN hardware accelerator (HWA), CEVA cites that convolution layers in networks such as GoogLeNet consume the majority of cycles. The CDNN HWA takes this code and implements it in fixed hardware with 512 MACs using 16-bit support, for up to an 8x performance gain (at 95% utilization). CEVA mentioned that a 12-bit implementation would save die area and cost for a minimal reduction in accuracy; however, a number of developers are requesting full 16-bit support for future projects, hence the choice.
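The inner operation those 512 units replicate is the multiply-accumulate (MAC): multiply a 16-bit activation by a 16-bit weight and add into a wide accumulator so intermediate sums never overflow. The sketch below shows one such dot product serially; the hardware simply performs 512 of these multiplies in parallel per cycle. Sizes here are illustrative.

```python
import numpy as np

# Toy fixed-point MAC loop: 16-bit operands accumulated into a 64-bit
# register - the pattern a convolution accelerator replicates in parallel.
def conv_mac_int16(activations, weights):
    acc = np.int64(0)                        # wide accumulator avoids overflow
    for a, w in zip(activations, weights):
        acc += np.int64(a) * np.int64(w)     # one MAC per step
    return acc

rng = np.random.default_rng(1)
a = rng.integers(-32768, 32767, size=512, dtype=np.int16)
w = rng.integers(-32768, 32767, size=512, dtype=np.int16)

# The serial loop matches a vectorized dot product over the same data.
assert conv_mac_int16(a, w) == np.dot(a.astype(np.int64), w.astype(np.int64))
```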

Two of CEVA’s big competitors in this space, for automotive image/visual processing, are Mobileye and NVIDIA, with the latter promoting the TX1 for both training and inference of neural networks. With the TX1 (built on TSMC’s 20nm planar process, running at 690 MHz) as the reference, CEVA states that its internal simulations put a single XM6-based platform at 25x the power efficiency and 4x the speed on AlexNet and GoogLeNet (with the XM6 also modeled at 20nm, even though it will most likely be implemented at 16nm FinFET or 28nm). Extrapolating from the single-batch TX1 data published, this would mean that the XM6 running AlexNet at FP16 can process 268 images a second compared to 67, at around 800 mW compared to 5.1 W. At 16FF this power number is likely to be significantly lower (CEVA told us that its internal metrics were initially done at 28nm/16FF, but were redone at 20nm for an apples-to-apples comparison with the TX1). It should be noted that the TX1 numbers provided were for multi-batch operation, which offers better efficiency than single batch; however, other comparison numbers were not provided. CEVA also implements power gating with a DVFS scheme that allows low-power modes when various parts of the DSP or accelerators are idle.
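The quoted multipliers follow directly from the published figures: throughput gives the speed ratio, and images per joule (throughput divided by power) gives the efficiency ratio. A quick check of the arithmetic:

```python
# Checking the comparison arithmetic from the published figures:
# XM6 (simulated, 20nm) vs TX1 single-batch AlexNet at FP16.
xm6_ips, xm6_w = 268, 0.8    # images/sec, watts
tx1_ips, tx1_w = 67, 5.1

speedup = xm6_ips / tx1_ips                          # throughput ratio, ~4x
efficiency = (xm6_ips / xm6_w) / (tx1_ips / tx1_w)   # images/joule ratio, ~25x
print(f"speed: {speedup:.1f}x, efficiency: {efficiency:.1f}x")
```

Both numbers line up with CEVA's 4x and 25x claims, so the 268 images/sec figure is consistent with a straight extrapolation rather than an independent measurement.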

Obviously the advantage that NVIDIA has with its solution is availability and CUDA/OpenCL software development, both of which CEVA is attempting to address with one-button software platforms like CDNN2 and improved hardware such as the XM6. It will be interesting to see which semiconductor partners combine this image processing with machine learning in future implementations. CEVA states that smartphones, automotive, security and commercial (drones, automation) applications are prime targets.

Samsung Announces Exynos 7570 14nm Budget SoC


Today Samsung LSI launches a new 14nm SoC aimed at the budget and IoT segments. The new Exynos 7570 seems to be a successor to the Exynos 3470 and Exynos 3475 found in a few of Samsung’s budget lineups such as the Galaxy S5 Mini. The chip is a quad-core A53 setup which is said to improve performance by 70% and power efficiency by 30% over its predecessor, although exact clocks weren’t publicized. Also unspecified was the GPU, although given the advertised maximum 1280×800 resolution of the display pipeline, I wouldn’t expect anything more than a very small implementation of a low-end Mali series.

The more interesting aspect of this new SoC is its connectivity: not only does it feature an integrated UE Category 4 modem with 2x carrier aggregation capabilities, but it also – as far as I know, for the first time for S.LSI – integrates WiFi, Bluetooth, FM and GNSS capabilities, features that have traditionally been left to dedicated solutions such as Broadcom’s WiFi SoCs (Broadcom, following its acquisition by Avago, has exited that business segment).

| SoC | Exynos 7570 | Exynos 7580 | Exynos 7870 |
|---|---|---|---|
| CPU | 4x A53 @ ?GHz | 8x A53 @ 1.6GHz | 8x A53 @ 1.6GHz |
| GPU | ? | Mali T720MP3 @ 600MHz | Mali T830MP2 @ 700MHz |
| Encode/Decode | 1080p60 H.264 | | |
| Camera/ISP | Dual ISP, 21MP / (13+8) | Dual ISP, 16MP / (8+8) | Dual ISP, 16MP / (8+8) |
| Integrated Modem | Cat. 4, 150Mbps DL, 2x20MHz C.A. | Cat. 6, 300Mbps DL, 2x20MHz C.A. | Cat. 6, 300Mbps DL, 2x20MHz C.A. |
| Connectivity | WiFi, BT, FM, GNSS | N/A | N/A |
| Mfc. Process | 14nm | 28nm HKMG | 14nm |

Earlier this year we were surprised to see Samsung announce a 14nm mid-range product, as we weren’t expecting the economics of the new FinFET manufacturing processes to work out for anything other than flagship SoCs, but soon after Qualcomm followed suit with their own 14nm mid-range SoC. Today’s announcement seems to further dispel the doom & gloom over FinFET manufacturing cost, as the Exynos 7570’s strictly low-end and IoT orientation makes it the first budget SoC on the new process – something I see as a positive development as the industry continues to shift to full FinFET lineups.


Investigating Cavium’s ThunderX: The First ARM Server SoC With Ambition

Today we’re taking a look at Cavium’s ThunderX, a highly integrated ARMv8 server SoC with 48 cores and lots of I/O. With the ThunderX, Cavium is looking to challenge Intel and their market-dominating Xeon lineup. Meanwhile Intel hasn’t been standing still and has further improved the Xeon D line. To that end we’ve put Cavium’s new server SoC to the test and compared it against the latest and greatest twelve- and sixteen-core Xeon Ds. We also got hold of the new Xeon E5-2640 v4, which holds the promise of being a SKU with an excellent balance between performance and power.

Can the ThunderX compete with the best Intel SKUs? Let’s find out!
