ARM Reveals Cortex-A72 Architecture Details

Today in London, as part of ARM’s TechDay 2015 event, we had the pleasure of getting a better insight into ARM’s new Cortex-A72 CPU. ARM announced the Cortex-A72 at the beginning of February – leaving a lot of questions to be asked and a sense of mystery in the air. The Cortex-A72 is a direct successor to the Cortex-A57 – taking the predecessor as a baseline in order to iterate on and improve it.

On the naming side of the equation, regarding the move from ‘A57’ to ‘A72’ rather than ‘A59’ or similar, ARM explains that it is purely a marketing decision, as the company wanted to better differentiate its higher-performance cores from its mid-tier and low-power cores. There seemed to be some confusion between the more power-efficient A53 and the more powerful A57, whereby users would assume they are similar, and thus ARM is moving its new big core into the A7x series.

We saw some absolute targeted performance numbers back during the February announcement, which promised some very interesting gains over the A57. The problem was that it was not clear how much of the performance and power efficiency improvement came from the architectural changes and how much came from the process on which these targeted performance data points were estimated. It’s clear that on the high end ARM is promoting the A72 on the new FinFET processes from Samsung/GlobalFoundries and TSMC, which are referred to as 14nm and 16nm in the slides. Generally, due to the design and the node, the A72 will be able to achieve higher clocks than the A57, and the target seems to be around 2.5GHz on the 14/16nm nodes where high-end smartphones are concerned. Higher clocks may be present in server applications, at which the A72 is also aimed.

Probably the most interesting slide next to the actual performance metrics of the A72 is the apples-to-apples comparison of the A57 to the A72 on the same process node. On the 28nm node, we see the A72 achieving a respectable 20% power reduction compared to the A57. As a reminder – we’re talking about absolute power at the same clock speed, which does not consider performance and thus is not a representation of efficiency.

Notably, ARM is aiming for the A72 to be capable of extensive sustained performance at its target frequency. This is something that smaller form factor A57 designs (e.g. phones) have struggled with due to just how power-hungry the A57 is, which has led to more bursty designs that can only run the A57 at its top speeds for short periods of time. We are presented with figures such as sustained 750mW operation per core on 16FF+ at clocks of ~2.5GHz.

While the power numbers are interesting, we also have to put them into the context of the work achieved. ARM has made several optimizations to the architecture to improve performance compared to the A57. We’ll get into more detail in just a bit – but what we are looking at is a general 16-30% increase in IPC depending on the kind of workload. Together with the power reduction, we now see how ARM is able to advertise such large efficiency gains for the same fixed workload.
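
To make the efficiency argument concrete, here is a rough sketch that combines the two figures quoted above. It assumes the ~20% power reduction at the same clock and the 16-30% IPC uplift can simply be multiplied together, which is a simplification: the power figure was quoted for 28nm, and real workloads will not scale this cleanly. The function and constant names are illustrative only.

```python
# Rough energy-per-task estimate from the figures quoted in the text.
# Assumption: the ~20% power reduction (same clock, same node) and the
# 16-30% IPC uplift can simply be combined; real workloads will differ.

def relative_energy(power_ratio: float, ipc_gain: float) -> float:
    """Energy for a fixed workload relative to the A57 baseline (1.0).

    At the same clock, runtime scales with 1 / IPC, and
    energy = power * time.
    """
    runtime_ratio = 1.0 / (1.0 + ipc_gain)
    return power_ratio * runtime_ratio


A72_POWER_RATIO = 0.80  # 20% lower power than the A57 at the same clock

for ipc_gain in (0.16, 0.30):
    energy = relative_energy(A72_POWER_RATIO, ipc_gain)
    print(f"IPC +{ipc_gain:.0%}: energy {energy:.2f}x of A57 "
          f"({1 - energy:.0%} less)")
```

Under these assumptions, the same fixed workload would cost roughly 0.62x to 0.69x of the A57’s energy, before any gains from a newer process node are counted.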

A72 Architecture – The Upgrades Over A57

ARM seems to have managed to achieve an improvement in all three areas of the PPA metric: Performance, Power and Area – the trifecta of semiconductor design goals. This was achieved by re-optimizing (almost) every logical block from the A57. There has been some considerable redesign in the CPU’s architecture, including a new branch predictor and improvements in the decoder pipeline to allow for better throughput.

At the level of the instruction fetch block we see a brand new branch predictor that follows a new, more sophisticated algorithm. It improves performance and reduces power by cutting mispredictions by 50% and speculation by 25% compared to the A57. Superfluous branch-predictor accesses have also been suppressed – in workloads where the predictor is not able to do its job efficiently, it is bypassed completely. There has also been general power optimization in the RAM organization by coupling the different IP blocks together more tightly, something ARM looks to provide with its own physical IP.

Moving down the pipeline, A72’s decoder/rename capabilities have seen their own set of improvements. The decoder itself is still a 3-wide decoder, but ARM has gone through it to try to improve both performance and power consumption in other ways. To improve performance, the effective decode bandwidth has been increased, and the decoder has received some AArch64 instruction-fusion enhancements. Meanwhile, power consumption has been tempered at multiple levels, including optimizing decoding directly, as well as other power optimizations to the buffers and flow-control hardware that work around the decoder.

However, it’s at the dispatch/retire stage that the architecture sees the biggest improvements to performance. Going hand-in-hand with the decoder’s ability to fuse instructions, ARM’s dispatch unit can then break those ops back down into more granular micro-ops for feeding into the execution units, transforming it from a 3-wide to an effective 5-wide machine at the dispatch stage. The net result is increased decoder throughput (by reducing the number of individual instructions decoded) while also increasing the total number of micro-ops created by the dispatcher and eventually executed per cycle. ARM is quoting an average of 1.08 micro-ops per instruction in code, which will aid the cases where the A57’s 3-wide dispatch unit was dispatch limited. Also at the dispatch level, ARM has done more extensive work on the register file, reducing the number of read ports by introducing port sharing and further reducing superfluous accesses.
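
As a back-of-the-envelope illustration of why a wider dispatch stage helps, the sketch below multiplies the 3-wide decode rate by the quoted 1.08 micro-ops per instruction. It assumes the decoder sustains its full width every cycle and that the same micro-op ratio applies to both dispatch widths; neither is claimed by ARM, and the names are illustrative.

```python
# Back-of-the-envelope dispatch demand, using the figures quoted in the text.
# Assumption: decode sustains its full 3 instructions per cycle, and the
# 1.08 micro-ops/instruction average applies in both cases.

DECODE_WIDTH = 3             # instructions decoded per cycle
UOPS_PER_INSTRUCTION = 1.08  # ARM's quoted average


def dispatch_demand(decode_width: int, uops_per_instruction: float) -> float:
    """Average micro-ops per cycle the dispatch stage would need to accept."""
    return decode_width * uops_per_instruction


demand = dispatch_demand(DECODE_WIDTH, UOPS_PER_INSTRUCTION)

for label, width in (("3-wide dispatch", 3), ("5-wide dispatch", 5)):
    status = "dispatch limited" if demand > width else "has headroom"
    print(f"{label}: demand {demand:.2f} micro-ops/cycle -> {status}")
```

The average demand of about 3.24 micro-ops per cycle already exceeds a 3-wide dispatch stage, while a 5-wide stage leaves room for bursts of instructions that expand into several micro-ops.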

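ARM CPU Core Comparison

| | Cortex-A15 | Cortex-A57 | Cortex-A72 |
|---|---|---|---|
| ARM ISA | ARMv7 (32-bit) | ARMv8 (32/64-bit) | ARMv8 (32/64-bit) |
| Decoder Width | 3 ops | 3 ops | 3 ops |
| Maximum Pipeline Length | 19 stages | 19 stages | 16 stages |
| Integer Pipeline Length | 14 stages | 14 stages | 14 stages |
| Branch Mispredict Penalty | 15 cycles | 15 cycles | 15 cycles |
| Integer Add | 2 | 2 | 2 |
| Integer Mul | 1 | 1 | 1 |
| Load/Store Units | 1 + 1 (Dedicated L/S) | 1 + 1 (Dedicated L/S) | 1 + 1 (Dedicated L/S) |
| Branch Units | 1 | 1 | 1 |
| FP/NEON ALUs | 2×64-bit | 2×128-bit | 2×128-bit |
| L1 Cache | 32KB I$ + 32KB D$ | 48KB I$ + 32KB D$ | 48KB I$ + 32KB D$ |
| L2 Cache | 512KB – 4MB | 512KB – 2MB | 512KB – 4MB |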

On the side of the execution units we see the introduction of new, next-generation FP/Advanced SIMD units. The new units allow for much lower instruction latency as the FP pipeline length is reduced from 9 to 6 stages. FMUL is reduced from 5 cycles down to 3, FADD goes from 4 to 3, FMAC from 9 to 6, and CVT goes from 4 to 2 cycles. The reduction of the FP pipeline length brings the maximum pipeline length of the architecture down from 19 to 16.
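
The latency figures above are easier to compare as relative reductions. The short sketch below tabulates them; it assumes all four figures are in cycles (the CVT figure is read that way here), and the dictionary layout is just for illustration.

```python
# FP latencies quoted in the text as (A57 cycles, A72 cycles).
# Assumption: the CVT figure is also a latency in cycles.
FP_LATENCY = {
    "FMUL": (5, 3),
    "FADD": (4, 3),
    "FMAC": (9, 6),
    "CVT": (4, 2),
}


def reduction(before: int, after: int) -> float:
    """Fractional latency reduction going from `before` to `after` cycles."""
    return (before - after) / before


for op, (a57, a72) in FP_LATENCY.items():
    print(f"{op:5s} {a57} -> {a72} cycles ({reduction(a57, a72):.0%} lower)")
```

On these numbers the reductions range from 25% (FADD) to 50% (CVT), with FMAC’s 9-to-6 change matching the shortening of the FP pipeline itself.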

The integer units also see an improvement, as the Radix-16 divider has seen its bandwidth doubled, while the CRC unit now becomes a pipelined block with just 1-cycle latency, a 3x increase in bandwidth over the A57. Again, we see a repeating pattern here as ARM claims it tried to squeeze the most power efficiency from all the units by improving the physical implementation.

Another large performance improvement over the A57 is found in the Load/Store unit. Here, ARM claims that bandwidth to L1/L2 has been improved by up to 30%. This was achieved by introducing a sophisticated L1/L2 data prefetcher, and the unit is at the same time more power efficient, as improvements in the L1-hit pipeline, forwarding network, and way-predictor reduce the power needed.

We’ve been generally impressed with what the A72 brings to the table. It’s clear that the new architecture is an evolutionary upgrade to the A57, and the improvements in performance, power, and area, when looked at in aggregate, add up to a substantial upgrade over the A57. With the A57 having come to market in Q3 of last year and now shipping in high-volume SoCs such as the Snapdragon 810 and Exynos 7420, we are looking at the possibility of seeing its successor come to market in shipping devices in less than a year’s time. The obvious partners that might ramp production the soonest are MediaTek and Qualcomm, at least if they are able to hit their target schedules. There should presumably still be unannounced parts from other ARM partners as well. It’s clear that ARM has increased the cadence of releasing refreshes of its IP portfolio, and the quick succession of the A72 seems to be part of that.

The A72 looks to be a logical update to the A57, addressing some weak points such as peak power and power efficiency, combined with an ~10% area reduction. We already saw MediaTek showing off an A72 package at MWC, so it will be interesting to see how the IP actually performs in silicon, what ARM’s partners will be able to do with the core, and how quickly it reaches the market.

Cher Wang Replaces Peter Chou as CEO of HTC

Today, Cher Wang, formerly chairwoman of HTC, will replace Peter Chou as CEO of HTC.

This move may seem unprecedented, but in practice it seems that Cher has taken on increased responsibilities in running HTC from day to day as Peter became more focused on product development and R&D. Given HTC’s change in focus from smartphones to connected devices, this appears to be one of the organizational changes deemed necessary to expand into new segments of the industry. In the near term it’s unlikely that anything will be noticeably different, but further out we may see a distinct shift in how HTC works.

MediaTek at MWC 2015: A72 In Silicon, Multi-Standard Wireless Charging & More

As part of our MWC coverage we had the pleasure to have a guided tour through MediaTek’s booth to see what kind of new technologies the company has in its pipeline. MediaTek has seen some enormous momentum over the last few years and we’re quickly seeing the Taiwanese company becoming a serious competitor to be reckoned with.

What was by far the biggest surprise for us was the showing of MT8173 hardware, a mid-range tablet SoC employing ARM’s new Cortex A72. It’s only been a few weeks since ARM officially announced the Cortex A72, and while we still don’t know much about the micro-architectural nuances of the core, having MediaTek already displaying silicon is a sharp departure from ARM’s usual announcement-to-release cadence. This puts the number of A72 licensees with announced products at two, with Qualcomm being the other one in the form of the Snapdragon 618 and 620.

The MT8173 employs two Cortex A72 CPUs at up to 2.4GHz and two Cortex A53 CPUs in a big.LITTLE configuration. On the GPU side we find a PowerVR GX6250 GPU, which if MediaTek’s clocking strategy continues should run north of 700MHz. The SoC is still powered by LPDDR3 as the preferred memory interface, undoubtedly a cost decision as we’re only starting to see LPDDR4 in flagship devices. On the multimedia side there’s MediaTek’s new display pipeline capable of 120Hz operation, 4K H.264/HEVC(10-bit)/VP9 video decoders and an ISP capable of 20MP sensors.

Also part of the MWC announcements was a (re)branding of MediaTek’s SoC lineup. Beginning with the MT6795, which is now called the Helio X10, MediaTek will in future name its new chips under the Helio brand (after the Greek word for sun, “helios”). We’ll be seeing the P-line targeting the premium performance segment while the X-line targets the high-end and the best MediaTek has to offer.

MT3188: PMA, WPC and A4WP Wireless Charging Solution In 1

As part of the booth demos, MediaTek showed off the MT3188 wireless charging IC, which supports all three currently available standards: PMA, WPC and the newly emerged A4WP standard. While the IC was announced early last year, it is still impressive to see the real thing in hardware.

WPC (Wireless Power Consortium) is currently by far the most widely available standard in the form of Qi, which has seen large adoption in the mobile space. PMA (Power Matters Alliance) remains the competing standard, but it hasn’t seen as wide an adoption with its Powermat/Duracell chargers. Both WPC and PMA rely on inductive charging, which limits the spatial freedom between the transmitter and receiver coils to a few mm.

A4WP on the other hand is the new standard which is based on resonance charging, giving devices the freedom in x, y and z directions around the emitter coil. The charging area can be much larger than in the inductive charging technologies and also allows for one charger to simultaneously charge multiple devices. The advantage of inductive charging over resonance charging remains in the efficiency and EMI aspects.

The demonstration of the A4WP standard was impressive as it allows for an enormous amount of flexibility in terms of integrating charging pads into furniture. Besides multi-device charging bases, we also saw charging through relatively thick wooden tables where the charger was hidden underneath, instead of having to integrate it into the table or having a mat lying on top.

The MT3188 joins other unified charging solutions such as Broadcom’s BCM59350, which we also saw demonstrated at MWC this year.

MiraVision Display Pipeline Processing

Another interesting demo was MiraVision, which is integrated in MediaTek’s SoCs. Basically, MiraVision is a fixed-function post-processor that sits in the display pipeline and has full control of the image data being sent to the display. One of the use-cases MediaTek demonstrated was colour gamut manipulation, something which is already done in products such as the Meizu MX4 with the MT6595. What was impressive to me was the dynamic analysis of image content in dark environments and the subsequent adjustment of backlight and pixel data to allow for better visibility. Think of it as a dynamic gamma-curve adjustment.
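
To give a feel for what a dynamic gamma-curve adjustment means, here is a minimal software sketch. It is not MediaTek’s algorithm, which is fixed-function hardware and has not been detailed; the threshold, the minimum gamma, and all function names are hypothetical, and the sketch only touches pixel data, not the backlight.

```python
# Minimal sketch of content-adaptive gamma on 8-bit luma values.
# Assumption: a simple mean-brightness policy; the real hardware's
# analysis and its backlight control are not public and not modelled.

def mean_luma(pixels):
    """Average of 8-bit luma values (0-255), normalised to 0.0-1.0."""
    return sum(pixels) / (255.0 * len(pixels))


def pick_gamma(avg, dark_threshold=0.25, min_gamma=0.6):
    """Hypothetical policy: the darker the frame, the lower the gamma.

    Frames at or above the threshold are left untouched (gamma 1.0).
    """
    if avg >= dark_threshold:
        return 1.0
    # Linear ramp from min_gamma (black frame) up to 1.0 at the threshold.
    return min_gamma + (1.0 - min_gamma) * (avg / dark_threshold)


def apply_gamma(pixels, gamma):
    """Apply out = in ** gamma on normalised values; gamma < 1 lifts shadows."""
    return [round(255.0 * (p / 255.0) ** gamma) for p in pixels]


def enhance(pixels):
    """Analyse the frame, then adjust it with the chosen gamma."""
    return apply_gamma(pixels, pick_gamma(mean_luma(pixels)))


dark_frame = [5, 10, 20, 40]
bright_frame = [120, 180, 200, 250]
print(enhance(dark_frame))    # shadows are lifted
print(enhance(bright_frame))  # unchanged, frame is bright enough
```

The difference from fixed display profiles is that the curve is chosen per frame from the content itself, so a bright scene passes through untouched while a dark one is brightened.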

There were a few demos, including a third-person shooter where the effect on the viewing experience was considerable. We’ve seen Samsung employ similar technology, called mDNIe (Mobile digital natural image enhancement), in their Exynos and television SoCs, where it is used among other things to change between display profiles on Galaxy devices. MediaTek’s solution seems to one-up that, as it allows for more dynamic settings as opposed to simply having fixed programmed profiles.

MediaTek also demonstrated a frame-interpolation function for video playback through MiraVision. The result is similar to what SVP achieves in the PC space via software, but here it’s again implemented through fixed-function hardware to achieve high performance at very low power. Video content that is sourced at 24fps is interpolated to 60fps on the screen. The result is remarkable on a small screen, as it suffers less from the “fake motion” that one associates with such techniques (or perhaps MediaTek’s implementation is just really good).
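
The timing side of a 24fps to 60fps conversion can be sketched in a few lines. The example below only works out which source frames each output frame falls between and uses a naive linear blend as a stand-in; real interpolators, presumably including MiraVision, estimate motion rather than cross-fading, and the function names here are illustrative.

```python
from fractions import Fraction


def interpolation_plan(src_fps=24, dst_fps=60, n_out=5):
    """For each output frame, return (previous source frame, blend weight).

    A weight of 0 means the output equals the source frame exactly;
    otherwise the frame is synthesised between source frames i and i + 1.
    """
    step = Fraction(src_fps, dst_fps)  # source frames advanced per output frame
    plan = []
    for k in range(n_out):
        pos = k * step                       # position on the source timeline
        idx = pos.numerator // pos.denominator
        plan.append((idx, pos - idx))
    return plan


def blend(frame_a, frame_b, weight):
    """Naive linear cross-fade; real interpolators estimate motion instead."""
    w = float(weight)
    return [(1 - w) * a + w * b for a, b in zip(frame_a, frame_b)]


for k, (idx, weight) in enumerate(interpolation_plan()):
    kind = "original" if weight == 0 else f"between {idx} and {idx + 1}"
    print(f"output frame {k}: {kind} (weight {weight})")
```

With a 24:60 ratio the pattern repeats every five output frames, and only one of those five is an untouched source frame, which is why the quality of the synthesised frames matters so much.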

The demo unit was again a Meizu MX4, which means the hardware and products are already out there and merely need to be officially adopted by the vendors in software.

MediaTek’s Modem Progression

Modems are an important part of MediaTek’s strategy, and we more or less got a status update on how things are progressing. LTE products from MediaTek are still few and far between, but that seems to be changing as we’re seeing quick progression from the current Cat. 4 modem IP to Cat. 6 solutions in 2015 and Cat. 10 in 2016.

An interesting addition is the adoption and field testing of C2K, better known as CDMA2000. Other vendors such as Intel and Samsung chose not to adopt the technology, as it will be phased out in favour of LTE-only networks in the future. For MediaTek to adopt it, even though it makes no sense in the long term, puts them in a unique position against Qualcomm in markets such as the US and China.