INTEL HAS TALKED about Becton, now called Nehalem EX, without going into many technical details. At Hot Chips 21, it is starting to talk about the guts of the chip, and it is very different from the EX-free Nehalems.
On the surface, Becton looks like a simple mashing together of two 4-core Nehalems. The specs are 8 cores, 16 threads, 4 DDR3 memory channels, 4 QPI links and 24MB of L3 cache all stuffed into a mere 2.3 billion transistors. If you take a Lynnfield i3/i5/i7, add a little more cache, and weld a second one on, it looks a lot like Becton, but that is where the similarity ends.
Nehalem is a modular architecture, and about the only thing the two chips have in common is the cores themselves. Just about everything else is different between high-end Bloomfield (i7) Nehalem and Becton. Because of the modular architecture of the base chip, this could be done without enormous amounts of pain.
Becton block diagram showing the ring
The biggest difference between the two is the ring bus. At 24MB, the cache is far too big to run at an acceptable speed, and making 24MB of fast 8-ported cache RAM was a basically impossible task. Instead, Intel split the cache up into eight 3MB chunks called slices, and assigned one per core. A cache of that size is easy enough to design, and each slice ended up inclusive, with 4 ports.
Eight independent caches are not all that useful compared to a single large 24MB cache, so Intel put a large, bidirectional ring bus in the middle of the cache to shuttle the data around. If any core needs a byte from any other cache, it is no more than 4 ring hops to the right cache slice.
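The 4-hop ceiling falls straight out of the geometry of an eight-stop bidirectional ring. A minimal sketch, assuming the stops are simply numbered 0 through 7 around the ring (the numbering and the function name `ring_hops` are illustrative, not Intel's):

```python
def ring_hops(src: int, dst: int, stops: int = 8) -> int:
    """Fewest hops between two stops when traffic can go either way round."""
    clockwise = (dst - src) % stops
    counterclockwise = (src - dst) % stops
    return min(clockwise, counterclockwise)


# Worst case over every pair of stops on an 8-stop ring.
worst_case = max(ring_hops(s, d) for s in range(8) for d in range(8))
print(worst_case)  # 4
```

With only one direction available, the same worst case would be 7 hops, which is why the second direction matters so much for latency.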
The ring bus is actually four rings, with the data ring being 32 bytes wide in each direction. It looks a lot like the ring in Larrabee, though Intel has not announced the width of that part yet. That said, the Larrabee ring is 1024 bits wide, 512b times two directions. There are eight stops on the Becton ring, and data moves across it at one stop per clock.
To pull data off the ring, each stop can only pull one request off per clock. With two rings, this could be problematic if data comes in from both directions at the same time. The data flow would have to stop or the packets would have to go around again. Neither scenario is acceptable for a chip like this.
Intel solved this by putting some smarts into the ring stops and adding polarity to the rings. Each ring stop has a given polarity, odd or even, and can only pull from the ring that matches that polarity. The ring changes polarity once per clock, so which stop can read from which ring on a given cycle is easy to figure out.
Since a ring stop knows which other stop it is going to talk to, what the receiver polarity is, and what the hop count is between them, the sender can figure out when to send so that the receiver can actually read it. By delaying for a maximum of one cycle, a sender can ensure the receiver can read the packet, and the case of two packets arriving at the same time never occurs.
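To make the timing trick concrete, here is a toy model of it. Intel did not spell out the exact rule, so the alternation used here is an assumption: a stop of polarity p may pull from one direction on cycles whose parity matches p and from the other direction on the remaining cycles. The names `readable_ring` and `send_delay` are hypothetical.

```python
CW, CCW = 0, 1  # the two directions of the data ring


def readable_ring(stop_polarity: int, cycle: int) -> int:
    """Direction a stop may pull from on this cycle (assumed alternation rule)."""
    return CW if cycle % 2 == stop_polarity else CCW


def send_delay(now: int, hops: int, ring: int, receiver_polarity: int) -> int:
    """Cycles (0 or 1) the sender waits so the packet arrives when readable."""
    for delay in (0, 1):
        arrival = now + delay + hops  # one stop per clock
        if readable_ring(receiver_polarity, arrival) == ring:
            return delay
    raise AssertionError("unreachable: the readable ring flips every cycle")


# A packet sent at cycle 0, three hops away on the CW ring, to an even stop.
print(send_delay(now=0, hops=3, ring=CW, receiver_polarity=0))  # 1
```

Because a stop in this model reads from exactly one direction on any given cycle, two packets scheduled this way can never land on the same stop in the same clock, and no sender ever waits more than one cycle.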
In the end, the ring has four times the bandwidth of a similar width unidirectional ring, half the latency, and never sends anything that the receiver can’t read. The raw bandwidth available is over 250GBps, and that scales nicely with the number of stops. You could safely speculate that Eagleton will have a 375GBps ring bus if the clocks don’t change much.
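The 375GBps guess is just linear scaling. A quick sketch of the arithmetic, treating "over 250GBps" as a round 250 and assuming bandwidth grows in direct proportion to stop count; the 12-stop figure is inferred from the ratio, not something Intel has stated:

```python
BECTON_STOPS = 8
BECTON_RING_GBPS = 250  # "over 250GBps", rounded down for the estimate

per_stop_gbps = BECTON_RING_GBPS / BECTON_STOPS


def scaled_ring_gbps(stops: int) -> float:
    """Ring bandwidth if it scales linearly with stops at the same clock."""
    return per_stop_gbps * stops


print(per_stop_gbps)         # 31.25
print(scaled_ring_gbps(12))  # 375.0
```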
Moving on to QPI, there is a second controller to enable four links per socket. In addition to allowing Becton to scale to eight sockets gluelessly, the chip can do two independent transactions over QPI at the same time. There are two functional blocks to assist with this, and Intel calls them QPI Home Agents (HA).
The home agents have much deeper caches and request queues than a normal QPI controller on a Lynnfield or Bloomfield part. The HAs support 256 outstanding requests, with up to 48 from one single source. For an eight socket system, this is not just nice but somewhat mandatory for scaling.
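The two limits work together: a total budget of 256 and a per-source cap of 48 mean no single requester can starve the rest. A minimal sketch of that admission rule follows; the class and method names are hypothetical, and nothing here reflects how the hardware actually implements its queues.

```python
from collections import Counter


class RequestTracker:
    """Toy admission control using the two limits quoted for a home agent."""

    MAX_TOTAL = 256
    MAX_PER_SOURCE = 48

    def __init__(self) -> None:
        self.outstanding = Counter()  # source id -> requests in flight

    def try_accept(self, source) -> bool:
        """Admit a request unless either limit is already reached."""
        if sum(self.outstanding.values()) >= self.MAX_TOTAL:
            return False
        if self.outstanding[source] >= self.MAX_PER_SOURCE:
            return False
        self.outstanding[source] += 1
        return True

    def complete(self, source) -> None:
        """Retire one outstanding request from the given source."""
        if self.outstanding[source] > 0:
            self.outstanding[source] -= 1


tracker = RequestTracker()
accepted = sum(tracker.try_accept("socket0") for _ in range(60))
print(accepted)  # 48
```

Even a source that hammers the agent tops out at 48 entries, leaving at least 208 for everyone else.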
HAs don’t just track QPI requests, they can also track memory requests, and do some prefetching and write posting. On top of that, they control a lot of the cache coherency between sockets, something Intel calls a hybrid coherency protocol.
Augmenting the HAs are the QPI Caching Agents, two per chip, one per HA. The Caching Agents do what they sound like they do, cache QPI requests and data. Additionally, they can go directly to local memory, not just QPI, and send results directly to the correct core as well. QPI handling in Becton is not just more intelligent, but also much better buffered.
The Nehalem family is the first modern Intel part to have memory controllers on die, so the memory controller count scales with socket count. Becton has two memory controllers per die, two channels per controller, and two memory buffers per channel. With four DDR3 DIMMs per buffer, that means 2 x 2 x 2 x 4, or 32 for the math averse, per socket. On an eight socket system, that means 256 DIMMs, 2TB of memory per box. That is almost enough for running Vista at tolerable speeds.
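For anyone who wants to check the multiplication, here it is spelled out. The factor labels are one reading of the 2 x 2 x 2 x 4 product (controllers, channels, buffers, DIMMs), and the per-DIMM capacity is derived from the 2TB total rather than quoted by Intel:

```python
controllers_per_socket = 2
channels_per_controller = 2
buffers_per_channel = 2
dimms_per_buffer = 4
sockets = 8

dimms_per_socket = (controllers_per_socket * channels_per_controller
                    * buffers_per_channel * dimms_per_buffer)
dimms_per_box = dimms_per_socket * sockets

total_memory_tb = 2
gb_per_dimm = total_memory_tb * 1024 // dimms_per_box

print(dimms_per_socket)  # 32
print(dimms_per_box)     # 256
print(gb_per_dimm)       # 8
```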
In case you didn’t notice, there was something new in the memory hierarchy: memory buffers. The idea is simple. The earlier FB-DIMMs put a complex buffer onto the DIMM itself. It was expensive, hot, and generally unloved, but brought a ton of useful features to the memory subsystem.
Intel was a bit shy when it came to talking about what these new buffers do, but we hear it started with FB-DIMM AMB buffers and evolved things from there. If the new buffers kept the RAS features and other similar technologies, they will be a net plus for the Nehalem EX platform.
Eight socket Nehalem EX systems are complex
With 4 QPI links, 8 memory channels, 8 cores, 8 cache slices, 2 memory controllers, 2 cache agents, 2 home agents and a pony, this chip is getting quite complex. The transistor count of 2.3 billion backs that up. To make it all work, the center of the chip has a block called the router. It is a crossbar switch that connects all internal and external channels, up to eight at a time.
The router is fully programmable, so what it does in Becton is not the only thing it can do. You are unlikely to see anything different in this generation of product, but it could be done if needed. Eagleton should have a few new tricks, especially when it comes to routing using external glue chips for high socket counts.
With that many available inputs and outputs, you start to understand why the focus of Becton was on the uncore, and how things get moved around the die and the system in general. Without all the effort put in, just doubling up a Bloomfield or Lynnfield wouldn’t scale at all, much less to the 2,000-plus cores Intel is claiming Becton will hit. S|A
Updated: 2 x 2 x 2 x 4 x 8 = 256, not 512 as originally written.