Intel’s Skylake GPU – Analyzing the Media Capabilities
At IDF in San Francisco last week, Intel provided us with lots of insights into Skylake, the microarchitecture behind the 6th generation Core series processors. Skylake marks the introduction of the Gen9 Intel HD Graphics technology. In advance of our full Skylake architecture analysis (coming soon), I wanted to get a head start and explain the media side (including Quick Sync and the image processing pipeline) of Skylake in a separate piece.
Media Capabilities and Quick Sync in Intel HD Graphics – A Brief History
Quick Sync has evolved through the last five years, starting with limited hardware acceleration and usage of the programmable EU array in Sandy Bridge. The second generation engine in Ivy Bridge moved to a hybrid hardware / software solution with rate control, motion estimation and intra estimation as well as mode decision happening in the programmable EU array. Usage of the EU array enabled tuning of the algorithms. Motion compensation, intra prediction, forward quantization and entropy coding were done in hardware in the MFX (multi-format codec engine). Haswell added JPEG / MJPEG decode to the MFX, a dedicated VQE (video quality engine) for low power video processing and a faster media sampler.
Around the time Broadwell was introduced, major transitions were taking place on the video codec front: HEVC adoption was picking up, and VP8 / VP9 were also gaining support. To address these changes and act on consumer feedback, Intel made major updates to the media block / Quick Sync engine late last year.
Broadwell was also the first microarchitecture to support two BSDs (bit stream decoder) in the GT3 variants. Each BSD allows a set of commands to decode one video stream.
Broadwell's updates (when compared to Haswell) are summarized in the slide below.
The detailed discussion of Broadwell's media capabilities above is relevant to the improvements made in Skylake.
Skylake's Gen9 Graphics
The Gen9 graphics engine comes in multiple sizes for different power budgets. There are three main variants, GT2, GT3/GT3e and GT4e. In the slide below, the important aspect to note is that the media processing hardware (Media FF – Media Fixed Function) resides in the 'Unslice'. While the GT2 comes with the minimum possible Media FF logic, the GT3 and GT3e come with additional hardware capabilities. This strategy is similar to what was adopted in Broadwell.
The Unslice can operate at a different voltage and frequency compared to the Slices. This is especially important for video decoding / processing where the Media FF can run at higher clocks for better performance while ensuring minimal power consumption. From the viewpoint of tools such as GPU-Z and HWiNFO, it will be interesting to see if real-time statistics on voltage and clocks can be gathered for both the Unslice and the Slices. For additional power saving, power gating can be used at the Slices level or the EU group level.
Amongst the media improvements made in Skylake, we have:
- An additional fixed function video encoder in the Quick Sync engine
- Additional codec support (both decode and encode): HEVC, VP8, MJPEG
- RAW imaging capabilities
Quick Sync in Skylake
Intel classifies the Quick Sync modes in Broadwell and previous generations as 'PG-Mode' (Processor Graphics). PG-Mode is optimized for faster-than-real-time encoding and flexibility. The new mode, 'FF-Mode' (Fixed Function), is optimized for real-time H.264 encoding, with a focus on lowering latency and reducing power consumption. Except for programmable rate control, all aspects of the encoding algorithm are handled in the MFX itself. Since rate control is in the hands of the application software, it is possible to do a 2-pass adaptive mode even with the FF hardware.
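To illustrate what application-side rate control in a 2-pass scheme can look like, here is a minimal sketch in Python. It is not Intel's algorithm and does not use the Media SDK or any driver API; the function name `allocate_bits` and the sample frame sizes are hypothetical. The idea is simply that a first pass measures how expensive each frame is, and the second pass splits the target bit budget in proportion to that cost.

```python
def allocate_bits(first_pass_bits, target_total_bits):
    """Split a total bit budget across frames in proportion to first-pass cost.

    first_pass_bits: bits each frame consumed in a fixed-quantizer first pass
                     (hypothetical measurements, used as a complexity proxy).
    target_total_bits: bit budget the second pass should hit overall.
    """
    if not first_pass_bits:
        return []
    total = sum(first_pass_bits)
    if total == 0:
        # No complexity information available: fall back to an even split.
        even = target_total_bits / len(first_pass_bits)
        return [even] * len(first_pass_bits)
    return [target_total_bits * bits / total for bits in first_pass_bits]


# Hypothetical first-pass sizes for four frames (one complex, two simple, one medium).
first_pass = [400_000, 100_000, 100_000, 200_000]
budget = allocate_bits(first_pass, target_total_bits=400_000)
print(budget)  # [200000.0, 50000.0, 50000.0, 100000.0]
```

A real encoder would translate each per-frame budget into quantizer decisions and correct for drift as it goes; the sketch only shows the allocation step that the application would own.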
The new mode could enable a better user experience with features such as WiDi and screen recording. Note that Skylake offers developers the flexibility to use either the PG mode or the FF mode in their applications. PG mode still retains the TUx (Target Usage level) discussed in one of the above slides.
Skylake's MFX engine adds HEVC Main profile decode support (4Kp60 at up to 240 Mbps). Main10 decoding can be done with GPU acceleration. The Quick Sync PG Mode supports HEVC encoding (again, Main profile only, with support for up to 4Kp60 streams).
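To put the 240 Mbps ceiling in perspective, the short calculation below works out the average bit budget per frame and the implied compression ratio against uncompressed video. It assumes that "4K" means 3840x2160 and that the source is 8-bit 4:2:0 (12 bits per pixel); neither figure comes from Intel's material, and the function name `stream_budget` is just for illustration.

```python
def stream_budget(width, height, fps, bitrate_bps, bits_per_pixel=12):
    """Return (bits per frame, uncompressed bits per second, compression ratio).

    bits_per_pixel=12 assumes 8-bit 4:2:0 sampling (an assumption, not a spec value
    from the article).
    """
    bits_per_frame = bitrate_bps / fps
    raw_bps = width * height * bits_per_pixel * fps
    return bits_per_frame, raw_bps, raw_bps / bitrate_bps


bits_per_frame, raw_bps, ratio = stream_budget(3840, 2160, 60, 240_000_000)
print(f"{bits_per_frame / 1e6:.1f} Mbit per frame")   # 4.0 Mbit per frame
print(f"{raw_bps / 1e9:.2f} Gbps uncompressed")       # 5.97 Gbps uncompressed
print(f"{ratio:.1f}:1 compression")                   # 24.9:1 compression
```

Under those assumptions, a stream at the maximum supported bitrate still averages about 4 Mbit per frame, which is far above typical consumer 4K streaming rates.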
The DXVA Checker screenshot (taken on an i7-6700K, a part with Intel HD Graphics 530 / GT2) for Skylake with driver version 10.18.15.4248 is reproduced below. HEVC_VLD_Main10 has a DXVA profile, but it is done partially in the GPU (as specified in the slide above). The VP8 DXVA profile doesn't seem to be activated yet. There are new DXVA profiles (enabled) for the SVC (scalable video coding) extension to H.264.
Video Post Processing & Miscellaneous Aspects
Additional improvements include a scaler and format converter (SFC) that can work with MFX and VQE (without using the EUs or the media sampler). This enables power-efficient rotation and color space conversion during media playback.
Yet another power-saving trick introduced in Skylake is the media memory bandwidth compression. The compression is lossless and managed at the driver level.
Skylake's VQE also brings about new features with RAW image processing support (16-bit image pipeline), spatial denoising and local adaptive contrast enhancement (LACE). Power efficiency is also improved, with claims of the VQE consuming less than 50mW during operation.
The new fixed function hardware in the performance-sensitive stages enables even low power mobile Skylake parts to support 4Kp60 RAW video processing. LACE support is not available for 4K resolution on the Y-series Skylake parts, though.
Display Capabilities
In terms of display support, Skylake can drive up to three simultaneous displays. The supported resolutions are provided in the table below. At IDF, Intel was showing off the Skylake platform driving three 4K monitors simultaneously.
One of the disappointing aspects is the absence of a native HDMI 2.0 port with HDCP 2.2 support. Intel's solution is to add an LSPCon (Level Shifter – Protocol Converter) in the DP 1.2 path. Various solutions such as the MegaChips MCDP28 family of products exist for this purpose. According to one of the leaked Intel slides from earlier this year, the Alpine Ridge Thunderbolt 3 controller can also act as an LSPCon and provide an HDMI 2.0 output. At IDF, Intel indicated that we could see Alpine Ridge supporting HDMI 2.0 towards the end of the year (something corroborated unofficially by a few motherboard manufacturers).
The display sub-system also provides hardware support for Multi-plane Overlay (MPO) that allows alpha blending of multiple layers. This saves power by selectively disabling planes that are not needed. Usage applications include certain video playback scenarios and HUD (heads-up display) gaming. The table below lists the updated support for MPO as one moves from Broadwell to Skylake. The NV12 feature is particularly interesting from a media playback perspective: it is a video format that avoids conversion as video data moves between the decoder, post processing and the display blocks. With Skylake, post-decoded NV12 content can also be provided directly to an MPO display plane, and there is no need for the video post processor to do an NV12 to RGB conversion.
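For readers unfamiliar with NV12, the sketch below shows how an 8-bit NV12 frame is laid out in memory: a full-resolution luma (Y) plane followed by a single plane of interleaved chroma (U, V) samples at half resolution in each dimension. The comparison against a 32-bit RGB surface is my own illustration rather than something from Intel's slides, and the code assumes tightly packed buffers; real surfaces usually carry pitch and alignment padding.

```python
def nv12_layout(width, height):
    """Return (y_size, uv_size, total) in bytes for a tightly packed 8-bit NV12 frame."""
    if width % 2 or height % 2:
        raise ValueError("NV12 requires even width and height")
    y_size = width * height                      # one luma byte per pixel
    uv_size = (width // 2) * (height // 2) * 2   # one U and one V byte per 2x2 block
    return y_size, uv_size, y_size + uv_size


def rgb32_size(width, height):
    """Bytes for a tightly packed surface at 4 bytes per pixel (assumed RGB format)."""
    return width * height * 4


y_size, uv_size, nv12_total = nv12_layout(1920, 1080)
rgb_total = rgb32_size(1920, 1080)
print(y_size, uv_size, nv12_total)   # 2073600 1036800 3110400
print(rgb_total)                     # 8294400
```

At 1.5 bytes per pixel, a 1080p NV12 frame is well under half the size of the same frame at 4 bytes per pixel, so skipping the conversion also avoids writing and re-reading the larger surface.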
Intel indicated that the new Skylake MPO feature could save as much as 1.1W when playing back 1080p24 video on a 1440p panel – which is a substantial amount when mobile devices are considered. Power savings are also achieved by altering the core display clock based on the display configuration, number of displays and the resolution of each display.
Systems utilizing eDP with Windows 8.1 or later can also take advantage of hardware support for reducing refresh rate based on video content frame rate (for example, 24 fps video streams can be played after reducing the panel refresh rate to 48 Hz – eliminating 3:2 pull-down issues while also providing power savings). Obviously, the panel and TCON should support this.
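The benefit of matching the refresh rate to the content can be seen by counting how many refresh cycles each video frame stays on screen. The sketch below is a simplified model that assumes integer frame and refresh rates (so 23.976 fps content is not covered); the function name `cadence` is hypothetical.

```python
import math
from fractions import Fraction


def cadence(fps, refresh_hz, frames=4):
    """Refresh cycles for which each of the first `frames` source frames is displayed.

    Simplified model: frame i occupies the refreshes between
    floor(i * refresh_hz / fps) and floor((i + 1) * refresh_hz / fps).
    """
    per_frame = Fraction(refresh_hz, fps)  # exact arithmetic, no float drift
    edges = [math.floor(i * per_frame) for i in range(frames + 1)]
    return [later - earlier for earlier, later in zip(edges, edges[1:])]


print(cadence(24, 60))  # [2, 3, 2, 3]  uneven hold times (3:2 pull-down judder)
print(cadence(24, 48))  # [2, 2, 2, 2]  every frame held for the same duration
```

At 60 Hz, frames alternate between two and three refreshes, which is the source of pull-down judder. At 48 Hz, every frame is held for exactly two refreshes, and the panel also refreshes less often.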
Additional power saving can also be achieved on supported panels using Panel Self Refresh Media Buffer Optimization (PSR MBO). It is an Intel-developed optimization on top of the Panel Self Refresh feature of eDP 1.3.
Concluding Remarks
The media-related changes in Skylake's Gen9 GPU are best summarized by the slide below.
Skylake brings a lot of benefits to content creators, particularly in terms of improvements to Quick Sync and additional image processing options (including real-time 4Kp60 RAW import). However, it is a mixed bag for HTPC users. While the additional video post processing options (such as LACE for adaptive contrast enhancement) can improve the quality of video playback, and the increase in graphics prowess can possibly translate to better madVR capabilities, two glaring aspects prove to be dampeners. The first is the absence of full hardware acceleration for HEVC Main10 decode. Netflix has opted to go with HEVC Main10 for its 4K streams. When Netflix finally enables 4K streaming on PCs, Skylake is unfortunately not going to be as power-efficient a platform as it could have been. The second is the absence of a native HDMI 2.0 / HDCP 2.2 video output. Even though an LSPCon solution is suggested by Intel, it undoubtedly increases the system cost. Sinks supporting this standard have become quite affordable. For less than $600, one can get a 4K Hisense TV with HDMI 2.0 / HDCP 2.2 capability. Unfortunately, Skylake is not going to deliver the most cost-effective platform to utilize the full capabilities of such a display.