GPUS DON’T CHANGE architectures very often, but when they do, it is big news. AMD’s new Radeon HD6900 GPUs are the largest single change from that company in three and a half years, so let’s dive into the architecture.
VLIW5 vs VLIW4
Today, AMD launches the HD6900 line, the latest in the ‘Northern Islands’ family of chips. Due to a lot of problems not related to the architecture, this family has gone through more major changes than any chip we have heard of to date. The end result is a card that is quite different from its smaller brethren, and quite different from the preceding four generations of cards.
What changed? The biggest thing was the core of the GPU, the shaders. Since the HD2900 launched in May of 2007, ATI’s shader architecture has been VLIW5 (Very Long Instruction Word, five slots wide). It processed a bundle of five instructions per clock, which ATI counted as five shaders.
The old way, VLIW5
Each of these shader clusters consisted of four ‘simple’ and one ‘complex’ unit. If the scheduling gods allowed for it, they could execute five instructions per clock cycle, but that rarely happened. Instead, you would normally get either four simple operations or one complex one, with the other shaders going idle. If this sounds inefficient, it is.
This architecture was found in the 2900, 3000, 4000, 5000, and 6000-6899 cards, and ATI rode it from the depths of the 2900 to the triumphs of the recent 6800 line. This is not to say the shaders were not changed, they were, from adding DirectX 11 support to many compute features, and in general, too many other functions to list. The uncore, basically the parts around the shaders, went through even larger upheavals in this time, but the broad brush shader overview looks the way it did on the 2900.
VLIW4 in all its glory
Enter the HD6900 series, also known as Cayman. Cayman massively changes the uncore from Barts, but doesn’t stop there. The new VLIW4 architecture takes the old shaders and throws them out, starting over from scratch. Instead of a 4+1 shader architecture, Cayman has 4 identical ones in a cluster, hence the architecture’s name, VLIW4.
When designing a new architecture, AMD has a large list of shader ops drawn from various sources that it tests any new ideas against. This simulation is key to any architectural choices that the gnomes of Markham are thinking of making.
This time around, the simulation showed that going from VLIW5 to VLIW4 had somewhere between no and a negligible performance penalty. Basically, the end result was the same no matter which architecture you picked.
In that case, why change? Die area, the bane of all chips. A VLIW4 cluster is about 10% smaller than a VLIW5 cluster. Each shader is tiny, fractions of a square millimeter, but when you have 1536 of them on a die, it adds up. Shaders take up about 1/3 to 1/2 of the die space of the chip, so 10% of that is quite a lot of area. Alternatively, you can look at it as getting 10% more performance for the same die budget.
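To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It takes the figures above at face value (shaders at 1/3 to 1/2 of the die, a cluster that is 10% smaller) and, purely as an assumption for illustration, uses Cayman’s 389mm^2 die size as the starting point. The function name is ours, not AMD’s.

```python
def shader_area_saved(die_mm2, shader_fraction, cluster_shrink=0.10):
    """Rough area saved when the shader array shrinks by cluster_shrink.

    die_mm2 and shader_fraction are ballpark inputs, not measured values.
    """
    return die_mm2 * shader_fraction * cluster_shrink


if __name__ == "__main__":
    # Assumption: use the 389mm^2 Cayman die as the reference size.
    for fraction in (1 / 3, 1 / 2):
        saved = shader_area_saved(389.0, fraction)
        print(f"shaders at {fraction:.0%} of the die: ~{saved:.1f} mm^2 saved")
```

Even at the low end of that range, the savings land in the low double digits of square millimeters, which is real money on a die this size.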
The savings in area are much more than just the shaders themselves. Scheduling four instruction slots, all the same, is much easier than scheduling five asymmetric ones. This translates to a more efficient scheduler that takes up less area and power, all while delivering higher performance. It is a win/win any way you look at it.
VLIW4 doesn’t lose any of the flexibility, it simply lets you gang shaders together to look like a single complex unit. For example, transcendentals, a type of complex operation, are sent to 3 or 4 shaders at once, and they act as one big virtual complex shader. If the particular op only uses three shaders, an additional simple op can be scheduled in the fourth slot.
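A toy model makes the slot math a little more concrete. The Python sketch below counts how many bundles each layout needs for a pile of independent ops. It is a deliberate simplification, not AMD’s scheduler: it ignores dependencies between ops entirely, assumes the VLIW5 complex unit can also run a simple op, and assumes a transcendental occupies 3 of the 4 slots on VLIW4 unless told otherwise. All function and parameter names are made up for this example.

```python
def bundles_vliw5(n_simple, n_trans):
    """Bundles needed with four simple slots plus one complex slot per bundle."""
    bundles = 0
    while n_simple > 0 or n_trans > 0:
        # The complex slot takes a transcendental if one is waiting,
        # otherwise it runs a simple op (an assumption of this toy model).
        if n_trans > 0:
            n_trans -= 1
        elif n_simple > 0:
            n_simple -= 1
        # The four simple slots only take simple ops.
        n_simple = max(0, n_simple - 4)
        bundles += 1
    return bundles


def bundles_vliw4(n_simple, n_trans, slots_per_trans=3):
    """Bundles needed with four identical slots per bundle."""
    if slots_per_trans not in (3, 4):
        raise ValueError("a transcendental gangs 3 or 4 slots together")
    bundles = 0
    while n_simple > 0 or n_trans > 0:
        free = 4
        # Gang several slots together to act as one virtual complex unit.
        if n_trans > 0:
            n_trans -= 1
            free -= slots_per_trans
        # Fill whatever slots are left with simple ops.
        n_simple -= min(free, n_simple)
        bundles += 1
    return bundles


if __name__ == "__main__":
    for simple, trans in [(8, 0), (4, 1), (1, 1), (0, 4)]:
        print(simple, trans, bundles_vliw5(simple, trans), bundles_vliw4(simple, trans))
```

Keep in mind this only counts slots. It says nothing about die area or scheduler complexity, which is where the text above says VLIW4 earns its keep.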
That is the heart of the beast, but what about the rest of it? It looks fairly similar to the 6800/Barts design, with one major change, the ‘graphics engine’. If you look at the block diagram of Cayman, you can see that there are now two graphics engines. These units sit between the command processor and the dispatch processor, and do, well, a ton of the work.
The big picture
This is where the tessellation, rasterization, and Z-buffer functions happen. Up until Cayman, AMD GPUs could only process a theoretical max of one triangle per clock. The big advance of Nvidia’s Fermi architecture was that it split the geometry processing up into multiple chunks, 8 in the GF100/GF110, two in all others.
The problem with doing it in multiple chunks is that you have to keep everything very closely synchronized for it all to work. This means that a GF100 has 8 senders each sending to 7 other receivers every clock, a huge amount of data. This costs maximum clock, power and yields. According to no less than Jen-Hsun Huang, it basically sank the Fermi architecture.
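The scaling problem is easy to see with a quick count. This Python snippet assumes, as described above, that every unit sends to every other unit each clock. The function name is illustrative only and does not come from any Nvidia documentation.

```python
def sync_links(units):
    """Sender-to-receiver links when every unit talks to every other unit."""
    return units * (units - 1)


if __name__ == "__main__":
    for units in (8, 2):
        print(f"{units} geometry units -> {sync_links(units)} links per clock")
```

Eight units need 56 links to stay in sync, two units need only 2. The traffic grows roughly with the square of the unit count, so every extra chunk costs more than the last.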
The big advance from GF100 to GF104 was going from 8 units needing to be synchronized down to 2. This cost Nvidia performance in the very area that they touted as their killer advantage over AMD, but who are we to let facts get in the way of PR? It also allowed them to get functional parts out with moderate yields.
Really seeing double this time
This time around, AMD took a similar but more sensible approach from the start and split the single Graphics Engine into two. The net result is a doubling of geometry throughput over a single Graphics Engine. 6800/Barts doubled the Tessellation/geometry throughput over the 5800 series with a more efficient tessellator and geometry engine. Cayman improves on this with a new 8th generation tessellator, and puts two of them in. Cayman can process two primitives per clock, including transform and backface culling, splitting things up with a tile based load balancer.
While it doesn’t quite rise to the level of 4x 5870 performance or 2x 6870, it is easily faster than its older siblings. Barts was about 1.5x the triangle throughput of Cypress, Cayman is more like 2x as fast. This doubling of everything also extends to the rasterizers, Cayman can do 32 pixels per clock.
Another notable uncore change is the Render Back Ends (RBEs), which have been massively upgraded. The changes are not high level ones like their count. Instead, each one has more area devoted to it and is much more efficient. AMD claims that 16-bit integer ops are twice as fast, and 32-bit FP ops are between 2x and 4x as fast, compared to the older Evergreen/5xxx RBEs. This should significantly increase performance too.
Overall, what Cayman doesn’t entirely re-do from the ground up is significantly massaged with huge performance benefits. The split into two halves of the chip that was started with Cypress has now been completed, top to bottom. Everything else is more efficient clock for clock, basically everything is better.
These changes are not limited to the graphics side, GPU compute is once again massively enhanced. The biggest change is asynchronous dispatch, meaning you can execute multiple compute kernels simultaneously. While this may not seem like a big deal unless you are calculating physics or processing data, there are huge benefits that will be seen from this. Eventually.
The idea is simple enough, if you can run multiple things at once, you can essentially multitask and take advantage of unused units. The tools to take advantage of this are not fully there, DX11 doesn’t expose them, but the hardware supports the ability for a programmer to carve off shaders and dedicate them to a task. DX12 should support this, as will custom APIs, but that doesn’t do much right now. Theoretically, you could run a game with ‘only’ 1280 shaders dedicated to it, while the remaining 256 are transcoding a movie. It also has huge implications for virtualization, cloud computing, and remote gaming à la OnLive.
Each of the compute kernels runs its own thread, and has its own virtual memory, so each kernel is protected from its cohorts and sloppy programming. Nvidia’s architecture does not hard separate kernels, so you can run into a lot of interesting threading bugs on something that is already a mess to debug. Not fun.
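As a rough illustration of that 1280/256 split, the Python sketch below carves Cayman’s shaders up in whole SIMD clusters. The shader and cluster counts are the ones from the spec list (1536 shaders in 24 SIMD clusters). Whether the hardware really partitions at SIMD granularity is our assumption, and `carve` is a hypothetical helper, not a real driver or API call.

```python
TOTAL_SHADERS = 1536   # Cayman shader count from the specs
SIMD_CLUSTERS = 24     # Cayman SIMD cluster count from the specs
SHADERS_PER_SIMD = TOTAL_SHADERS // SIMD_CLUSTERS  # 64 shaders per cluster


def carve(requests):
    """Map {task: shaders wanted} to {task: whole SIMD clusters reserved}."""
    plan = {}
    for task, shaders in requests.items():
        # Round up to whole SIMD clusters (ceiling division).
        plan[task] = -(-shaders // SHADERS_PER_SIMD)
    if sum(plan.values()) > SIMD_CLUSTERS:
        raise ValueError("requested more shaders than the chip has")
    return plan


if __name__ == "__main__":
    print(carve({"game": 1280, "transcode": 256}))
```

Under those assumptions the game gets 20 clusters and the transcode job gets the remaining 4, with nothing left over.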
Cayman also has dual bidirectional DMA engines, so two threads can push and pull independently from system memory without stepping on each other. This should pay huge dividends in the GPGPU arena. Shader read ops are also coalesced for increased efficiency, and they can fetch directly from local memory. If that isn’t enough, flow control is also improved.
One last bit, in case you didn’t catch it above, the DP FP rate has been improved from 1/5 of the SP FP rate to 1/4. The already seriously fast DP performance of Cypress just got 25% faster.
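The 25% figure falls straight out of the two ratios. Here is the arithmetic as a tiny Python check, assuming the single precision rate is held constant. The helper name is ours.

```python
from fractions import Fraction


def dp_gain(old_ratio, new_ratio):
    """Relative DP throughput at the same SP rate."""
    return new_ratio / old_ratio


if __name__ == "__main__":
    gain = dp_gain(Fraction(1, 5), Fraction(1, 4))
    print(f"DP throughput at the same SP rate: {float(gain):.2f}x")
```

Going from 1/5 to 1/4 is a 5/4 ratio, or 1.25x, before any differences in shader count or clock are taken into account.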
What do you do with all that compute power? Barts added Morphological Antialiasing (MOO), and Cayman adds a new mode, EQAA. The short story is that EQAA allows you to set color and coverage sampling separately, theoretically getting the benefits of a full MSAA sample with far less overhead. AMD claims the same memory footprint with better quality, or less memory with the same quality. We will leave the evaluation of this feature to Max’s review of the cards themselves.
Moving on to the memory controller, Cayman has an enhanced GDDR5 controller with better training methods. It is capable of 5.5GT/s where Cypress was stuck at ‘only’ 4.8GT/s, and Barts could only go to 4.2GT/s. The difference between them is once again down to area devoted to squeezing that last bit of performance out of the memory bus. Cayman devoted more area to the problem, and gets more speed as a result. Barts lowered the area, and thus the speed, but also doubled the width. End result, Cayman is faster, Barts cheaper.
The last feature is perhaps the most widely misunderstood, power containment. This is not a consumer GPU feature as much as a bullet aimed straight at the compute world. It will be the single biggest killer feature for supercomputers, and is an advance on what AMD and Intel have been doing on the server front for a while. AMD calls this PowerTune Technology.
PowerTune allows you to set a cap for how much power the board can consume, and dynamically but gently moves clocks down if you hit that cap. It is just one of several power management features on Cayman, and it is unquestionably a good thing.
Different power strategies
If you recall, the ATI 4000 line had some rather gross power savings strategies. When the temperature got too hot, it hammered the clock rate. When temps went down to an acceptable level, they popped back up to full speed. While this works, if it happens during a game, it can lead to sub-optimal user experiences and sometimes player death. Basically it is REALLY annoying. That is the red line above. This is basically what Nvidia has finally added to the GF110/GTX580.
Scaled power, the purple line above, does away with this by lowering the entire power use so that the peaks never hit the danger zone. Basically it throws away a lot of average performance to make sure the outliers never cause problems. Once again, this is not an ideal way to do things.
PowerTune is subtle
PowerTune puts dozens of sensors around the chip that do not measure power draw or temperature, they just measure activity. The PowerTune controller is smart enough to know how much power each unit’s activity level pulls, and when it hits a certain threshold, the clocks are smoothly ratcheted down until the power sits at the cap. It basically hugs the curve.
Please note that this is in addition to the other temperature and power sensors already on the chip. They are still in place, and should PowerTune not catch something, or you manage to flash the BIOS with something silly, they will still step in and save you from letting the magic smoke out.
There are a couple of misconceptions floating around the net about PowerTune right now. First is that PowerTune cuts clocks down during normal operation. This isn’t true, it ONLY operates when the GPU hits its programmed limit. If you are at the max power state of the GPU, basically the opposite of 2D/idle/desktop mode, and you are running a very outlier application like Furmark, this is the only time you will see PowerTune in operation.
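To show the general idea of an activity-based power cap, here is a minimal Python sketch. It is not AMD’s algorithm. The per-unit wattages, the minimum clock, the 10MHz step, and the assumption that power scales linearly with clock are all invented for illustration. Only the 880MHz top clock and the 190W figure come from this article.

```python
def estimate_power(activity, weights, clock, max_clock):
    """Estimate power from activity counters alone, with no power or temperature reading.

    activity: per-unit activity levels from 0.0 to 1.0
    weights:  hypothetical watts each unit pulls at full activity and max clock
    """
    at_full_clock = sum(a * w for a, w in zip(activity, weights))
    # Assumption: power scales linearly with clock (real chips are messier).
    return at_full_clock * clock / max_clock


def next_clock(clock, activity, weights, cap_w, max_clock=880, min_clock=500, step=10):
    """Nudge the clock one small step so estimated power hugs the cap."""
    if estimate_power(activity, weights, clock, max_clock) > cap_w:
        return max(min_clock, clock - step)
    # Under the cap: step back up only if that still fits under the cap.
    if clock < max_clock:
        candidate = min(max_clock, clock + step)
        if estimate_power(activity, weights, candidate, max_clock) <= cap_w:
            return candidate
    return clock


def settle(clock, activity, weights, cap_w, steps=50):
    """Run the controller for a number of steps and return the final clock."""
    for _ in range(steps):
        clock = next_clock(clock, activity, weights, cap_w)
    return clock


if __name__ == "__main__":
    weights = [120.0, 60.0, 40.0]  # made-up watts for shaders, memory, everything else
    print("game-like load:", settle(880, [0.7, 0.6, 0.5], weights, cap_w=190.0), "MHz")
    print("power virus:   ", settle(880, [1.0, 1.0, 1.0], weights, cap_w=190.0), "MHz")
```

With these made-up numbers, the game-like load never touches the cap and stays at full clock, while the everything-pegged load walks down in small steps and then sits at the highest clock that fits under the cap. That is the ‘hug the curve’ behavior, as opposed to slamming the clocks down and popping them back up.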
In initial testing of the HD6970, Max and I could not get PowerTune to activate. If it does, AMD thoughtfully provides you with a slider in the Catalyst Control Center to move the PowerTune limits up and down by 20%. The TDP of the 6970 card is 190W, and PowerTune lets you push that to 250W, hence some very erroneous claims of a 250W TDP. For the overwhelming majority of users, you will never see PowerTune in action.
A really nice feature, perhaps a killer app for HTPC machines, is the ability to set the limit down. If you want a quiet or cool box, or just know you won’t need the horsepower, you can set the PowerTune cap down by 20%. Silent PCs, meet the 6900 line and PowerTune.
We will end with the specs. The GPU itself, Cayman, is built on the same TSMC 40nm process as Cypress. It is about 17% larger, 389mm^2 vs 334mm^2, but adds much more performance and features. It does this while only adding 2W to the TDP, which now stands at 190W, and idle power drops by nearly a third to 20W.
The raw specs
The raw shader count drops from 1600 in Cypress to 1536, but they are VLIW4 instead of VLIW5. This is equivalent to 1920 of the older shaders. Not coincidentally, this is the same ratio as the SIMD cluster increase, 20 in Cypress to 24 in Cayman. Clocks top out at 880MHz, and most of the rest is the same as the 6800/Barts chips.
How well does it perform? That is a question to be answered in our performance testing and review, something that is being wrapped up as we speak. The short story is that AMD took the unequivocal high end lead with dual Cypress/Hemlock/HD5970 and hasn’t looked back. Even the recent Nvidia GTX580 can’t beat that year-old card, and the dual Cayman/Antilles/HD6990 is just around the corner. Game over until 28nm late next year. S|A
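The shader equivalence is simple cluster math, sketched here in Python using only the numbers in this article. The function name is ours.

```python
from fractions import Fraction


def vliw5_equivalent(vliw4_shaders):
    """Count VLIW4 clusters, then ask how many shaders that many VLIW5 clusters would hold."""
    clusters = vliw4_shaders // 4
    return clusters * 5


if __name__ == "__main__":
    equivalent = vliw5_equivalent(1536)
    print("VLIW5-equivalent shaders:", equivalent)
    print("vs Cypress shaders:", Fraction(equivalent, 1600))
    print("SIMD cluster ratio:", Fraction(24, 20))
```

1536 shaders is 384 clusters of four, and 384 clusters of five would be 1920 shaders. 1920 over 1600 reduces to 6/5, the same 6/5 you get from 24 SIMD clusters over 20.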