Metal
Upon hearing Apple’s Metal announcement, perhaps the greatest surprise was that iOS developers were in a position where they needed and could benefit from a low-level API like Metal. In the PC space we’ve been seeing low-level APIs rolled out as a solution to the widening gap between CPU and GPU performance; however, the SoC-class processors in Apple’s iOS devices are a very different beast. As one would expect for a mobile product, neither the CPU nor the GPU is high performance by PC standards, so why should a low-level API be necessary?
The answer is that while SoCs are lower-performance devices, the same phenomenon that has driven low-level APIs on the PC has driven them on mobile devices, just on a smaller scale. GPU performance is outgrowing CPU performance at the SoC level just as it has at the PC level, and even worse, SoC-class CPUs are so slow that even small amounts of driver overhead can have a big impact. While we take 4000 draw calls for granted on desktop hardware – overhead and all – it’s something of a sobering reminder that this isn’t possible on even a relatively powerful SoC like the A7 with OpenGL ES, and that it took Metal for Crytek to get that many draw calls in motion, never mind other CPU savings such as precompiled shaders.
Update (2014-06-13): RenderingPipeline:
We can see that a lot of complexity is hidden from the programmer and that a lot of tricks (probably way more than I have mentioned here) have to be performed to hide what is actually going on. Some of those tricks make the life of the developer simpler, others force him/her to find ways to trick the driver (e.g. the mentioned “useless” draw calls to force the driver to cache states early on) or to learn the possible side effects of API calls (for example, which calls can stall the GPU and how to force a stall to reduce latency).
Some graphics APIs now try to remove most of these tricks by exposing more of the actual complexity – in some cases by leaving it to the program to solve the resulting problems. It’s been said that the graphics API of the PS3 went in this direction (as I’m not a PS3 developer I don’t get access to the documents to check, and even if I were one, the NDA prohibits all devs from describing any details), Mantle is going in this direction (we will see more about how it’s done by AMD when the documents get released), as will Microsoft with DirectX 12, and now Apple is doing the same with Metal.
[…]
Each draw costs some time on the CPU and some time on the GPU. The Metal API reduces the time spent on the CPU by making state handling simpler and thus reducing the driver’s checks of whether the state combination is valid. Precomputing states also helps: not only can the error check be done at state build time, the state change itself requires fewer API calls. Being able to build command buffers in parallel also increases the number of possible draw calls if the application is CPU bound.
Rendering on the GPU, on the other hand, is not faster: an application that only makes a few calls to draw very large meshes will not benefit.
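The trade-off described above can be put into a rough cost model. The sketch below is illustrative only: the function names and every timing figure are made-up assumptions, not measurements of OpenGL ES or Metal. It assumes CPU submission and GPU rendering overlap, so the slower side sets the frame time.

```python
# All numbers are hypothetical and chosen only to make the arithmetic easy.
BUDGET_US = 16_000        # roughly one frame at 60 Hz, in microseconds
HIGH_OVERHEAD_US = 40     # assumed CPU cost per draw with heavy driver checks
LOW_OVERHEAD_US = 4       # assumed CPU cost per draw with prevalidated state
GPU_WORK_US = 10_000      # assumed GPU time for the scene, same in both cases


def frame_time_us(draws, cpu_us_per_draw, gpu_us_total):
    """Frame time when CPU and GPU work overlap: the slower side wins."""
    cpu_us = draws * cpu_us_per_draw
    return max(cpu_us, gpu_us_total)


def max_draws_per_frame(budget_us, cpu_us_per_draw):
    """Largest number of draws whose CPU cost still fits in the budget."""
    return budget_us // cpu_us_per_draw


# Many small draws: the frame is CPU bound, so lower overhead helps a lot.
many_small_high = frame_time_us(4000, HIGH_OVERHEAD_US, GPU_WORK_US)
many_small_low = frame_time_us(4000, LOW_OVERHEAD_US, GPU_WORK_US)

# A few huge draws: the frame is GPU bound, so lower overhead changes nothing.
few_huge_high = frame_time_us(10, HIGH_OVERHEAD_US, GPU_WORK_US)
few_huge_low = frame_time_us(10, LOW_OVERHEAD_US, GPU_WORK_US)
```

With these assumed figures, cutting the per-draw CPU cost by a factor of ten raises the draw-call budget by the same factor, while the frame made of a few huge draws takes exactly as long as before.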
Update (2014-06-14): Guy English:
The issue with OpenGL is that an incredibly complicated state machine is being addressed in an atomic fashion on a per-function basis. By that I mean that each call into OpenGL requires a check to make sure the state is valid for the device. OpenGL error reporting is such that no functions return an error code. Instead they set a GL local variable that’s accessed through the glGetError() function. The lack of immediate error returns, as well as other features of the API, derives from its earlier days, when it was envisioned as a largely asynchronous affair. Commands would be buffered and fired off to the renderer and there would be only occasional synchronization points.
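The error model described here can be mimicked with a small toy. This is a sketch under stated assumptions: `ToyContext` and its methods are hypothetical stand-ins, not real OpenGL bindings, and only the numeric error codes match the real GL constants. Calls return nothing, a failure records a flag, and reading the flag resets it.

```python
GL_NO_ERROR = 0
GL_INVALID_ENUM = 0x0500
GL_INVALID_VALUE = 0x0501


class ToyContext:
    """Toy stand-in for a GL-style context; not a real OpenGL binding."""

    VALID_CAPS = {"DEPTH_TEST", "BLEND"}  # hypothetical capability names

    def __init__(self):
        self._error = GL_NO_ERROR
        self.enabled = set()
        self.width = 1.0

    def _record(self, code):
        # Only the first error since the last query is kept.
        if self._error == GL_NO_ERROR:
            self._error = code

    def enable(self, cap):
        # Every call re-validates its arguments and returns nothing.
        if cap not in self.VALID_CAPS:
            self._record(GL_INVALID_ENUM)
            return
        self.enabled.add(cap)

    def line_width(self, width):
        if width <= 0:
            self._record(GL_INVALID_VALUE)
            return
        self.width = width

    def get_error(self):
        # Reading the flag also resets it to GL_NO_ERROR.
        code, self._error = self._error, GL_NO_ERROR
        return code


ctx = ToyContext()
ctx.enable("DEPTH_TEST")
ctx.enable("NOT_A_CAP")    # fails silently; the caller sees nothing
ctx.line_width(-1)         # also fails, but the earlier error is the one kept
first = ctx.get_error()    # GL_INVALID_ENUM
second = ctx.get_error()   # GL_NO_ERROR, because the first read cleared it
```

The caller learns that something went wrong only by asking afterwards, and cannot tell from the flag alone which of the earlier calls caused it.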
While that’s a laudable abstraction, that’s not what we’ve ended up with in the real world. Our ability to render 3D graphics is coming closer to the CPU rather than further. Intel has been shipping respectable (yes, gamers, I know) parts for a couple of years that perform well. Meanwhile, Apple’s A7 has a tremendous amount of capability that hasn’t been, and can’t be, unlocked via the OpenGL API. Integrated memory and graphics processing are not what OpenGL was designed for.
[…]
Metal turns that upside down. Rather than making discrete state changes (flipping switches) directly in the driver, Metal allows you to order up a set of state that you’d want applied. If it can’t be done you’ll know. If it can then that set of state is good and can be applied, without further checking, to other rendering operations and contexts. Metal turns a set of many tiny decisions into an opportunity to green light a plan.
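The "order up a set of state" idea can be sketched as a descriptor that is validated once and then reused. Everything below is a hypothetical illustration: the names `PipelineDescriptor`, `PipelineState`, `build_pipeline_state`, and `Encoder`, as well as the format strings, are assumptions made for this example and are not Metal’s actual API.

```python
from dataclasses import dataclass

SUPPORTED_FORMATS = {"bgra8", "rgba16f"}  # hypothetical device capabilities


@dataclass
class PipelineDescriptor:
    """Mutable request: the set of state the application would like."""
    vertex_fn: str = ""
    fragment_fn: str = ""
    color_format: str = "bgra8"
    blending: bool = False


@dataclass(frozen=True)
class PipelineState:
    """Immutable result: a state combination already known to be valid."""
    vertex_fn: str
    fragment_fn: str
    color_format: str
    blending: bool


def build_pipeline_state(desc):
    """Validate the whole set of state once; raise if it cannot be honoured."""
    if not desc.vertex_fn or not desc.fragment_fn:
        raise ValueError("both shader functions are required")
    if desc.color_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported color format: {desc.color_format}")
    return PipelineState(desc.vertex_fn, desc.fragment_fn,
                         desc.color_format, desc.blending)


class Encoder:
    def __init__(self):
        self.commands = []

    def draw(self, state, vertex_count):
        # No per-draw validation: the state is assumed to come from
        # build_pipeline_state, where it was checked up front.
        self.commands.append((state, vertex_count))


state = build_pipeline_state(PipelineDescriptor("vs_main", "fs_main"))
enc = Encoder()
for n in (3, 6, 36):
    enc.draw(state, n)      # the same validated state, reused three times

try:
    build_pipeline_state(PipelineDescriptor("vs_main", "fs_main", "r5g6b5"))
    rejected = False
except ValueError:
    rejected = True         # the bad combination is refused at build time
```

The failure surfaces at the point where the plan is proposed rather than somewhere in the middle of rendering, and the draw path itself carries no checks.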