The Next Generation of Microprocessors

by Paul Hsieh
Update: 09/07/99 I have now started a seventh generation architecture web page.

At the 1998 Microprocessor Forum, three eagerly anticipated x86-compatible CPUs were announced, and shortly thereafter a patent which revealed a great deal about another became public. After writing my "6th generation processors" comparison page I was looking forward to writing a "7th generation processors" page, but so far I have not found an x86 CPU that separates itself in a distinctive enough way to be called a 7th generation processor.

The closest thing would be the TransMeta and Merced ideas of emulating the x86 instruction set -- but that hardly deserves the moniker "7th generation", since it is likely to run a lot slower than ordinary hardware x86s built in the same process. Thus all I am left with is "the next generation". Of course, everything said here is totally speculative, since none of these processors are products yet.

One noticeable absence is a follow-on to Intel's Pentium II. That's because, so far, Intel has not disclosed details of anything worthy of being called "next generation". My understanding is that Intel will take their P-!!! architecture (which is a P-II with some 3DNow!-like instructions) and bring it to a 0.18 micron process by the end of 1999 in a processor codenamed Coppermine, reaching around 733MHz. (Update: Intel has slipped, however they are still likely to deliver a 600MHz+ Coppermine in their 0.18 micron process by the end of the year.) That's not very exciting -- it almost looks as though they are leaving a wide open hole for their competitors. Foster, Willamette and Deerfield are processors that have barely been described at all, and are not slated to be seen before 2000. I think Intel's drive towards Merced may have left them without a solid x86 follow-up, which seems quite odd considering it's their main revenue source.

Background

The marketplace for x86 processors has really started to heat up recently. Start-ups building their own brands of x86 processor have crawled out of the woodwork (IDT, Rise, probably TransMeta, and at least one other company I cannot mention because of NDA.) While Cyrix has seen better times, they are still taking a credible slice of the pie. But the real success story (other than Intel) has been AMD, which has managed to create a new x86 market (a $900-$1500 sweet spot for both price and performance -- especially for games -- whereas Cyrix CPUs have been mostly about price.)

But now that AMD (and to a lesser extent Cyrix) has proven the alternate x86 market, how will they survive the continued onslaught of Intel? For now, Intel has been triple-teaming with the P-II (now P-!!!) for ordinary desktops, the Xeon for servers (a market untouched by the alternate x86s) and the Celeron for the low end. Intel will also be introducing Katmai and a new low cost socket for their CPUs, both of which have obvious value adds aimed squarely at the features that make the K6-2 processor so attractive (3DNow! and Super Socket 7.)

The rather complex design of these processors has left little room for ordinary value add, or for architectural improvements within the constraints of the current PC market. So where do they go from here? Is Intel destined to march their technology forward, undeterred by their recent loss of market share?

I don't think so, and the reason is simple. As described below, both Cyrix and AMD "get it" as far as the future of CPUs is concerned. The way to win the performance war (performance is still the primary feature of these CPUs) is to go after clock rate. Anyone who understands how "throughput" versus "latency" affects performance can appreciate why accelerating clock rates will have a larger impact than overly aggressive architectures. Critical to designing a processor for high clock rates is a deeply pipelined architecture. So while neither AMD nor Cyrix will publicly nail down the intended clock rates of their CPUs, you can bet that they are not planning to just keep lagging behind Intel.
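
To make the throughput-versus-latency point concrete, here is a minimal C sketch (my own illustration, not from any vendor material). The first loop is a dependent chain, so its speed is set by per-operation latency; the second has independent operations, so its speed is set by issue throughput -- and throughput is exactly what a deeply pipelined, higher clocked design improves most directly, even if each individual operation spends more stages in the pipe.

  /* Illustration only: the same number of additions, very different behavior. */
  #include <stdio.h>

  #define N 100000000U

  int main(void)
  {
      unsigned i, a = 1, b = 1, c = 1, d = 1;

      /* Latency bound: each add needs the result of the previous add, so the
         dependent chain exposes the full add latency on every iteration.     */
      for (i = 0; i < N; i++)
          a = a + b;

      /* Throughput bound: four independent chains can keep a pipelined ALU
         (or several ALUs) busy every clock, so raw clock rate dominates.     */
      for (i = 0; i < N; i++) {
          a += 1; b += 2; c += 3; d += 4;
      }

      printf("%u\n", a + b + c + d);   /* keep the results live */
      return 0;
  }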

Coupled with this realization on the part of these companies is an apparent complacency on the part of Intel. The P-II should be nearing the end of its architectural lifetime; they can't keep cranking up its clock rate forever. But Intel has announced nothing even mildly interesting to follow the P-II (as the P-!!! demonstrated.)

K7
Jalapeno
TransMeta
WinChip 4

The AMD K7 CPU

The AMD K7 presentation at Microprocessor Forum

AMD has claimed that the processor will be available in H1'99 at more than 500MHz in a 0.25 micron process, and with a memory bus speed of up to 200MHz. (Update: AMD has announced initial clock speeds of 500, 550 and 600MHz, available to the public in August 1999. The K7 now also has a distinct product name: the Athlon.)

Update: 09/07/99 AMD released the Athlon at a whopping 650MHz! As a result I have written up a seventh generation web page.

I've seen the Microprocessor Forum presentation on the K7 and all I can say is -- this thing looks like a monster. It has three symmetrical decoders (as opposed to the P-II's three asymmetric decoders) that issue 3 macro-ops, which are then decomposed into up to 6 risc-ops feeding a very deep 72-entry scheduler; the scheduler in turn feeds a 9-issue out-of-order engine that can target 3 nearly symmetric integer ALUs, 3 address generators (and therefore three loads) or 3 FPU ALUs. The FPU is more than just a correction to the K6's non-pipelined x87 unit; it is a superscalar FPU, meaning that just like the K7's 3DNow! units, it can issue adds and multiplies simultaneously (as well as loads/stores to the x87.)
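
To illustrate what a superscalar FPU buys (my own example, not AMD's): in a loop like the one below, each element needs one floating point multiply and one add. With separate, fully pipelined add and multiply pipes that can issue in the same clock, the add for one element can overlap the multiply for the next, instead of the two serializing through a single non-pipelined x87 unit as on the K6.

  /* Illustration only: one FP multiply and one FP add per element.          */
  #include <stddef.h>

  void saxpy(float *y, const float *x, float a, size_t n)
  {
      size_t i;
      for (i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];   /* the multiplies and adds of different
                                       elements are independent, so a
                                       superscalar FPU can keep both pipes busy */
  }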

The K6-2

  • Structured dual instruction decoder
  • Scheduler window depth of 6 clocks
  • 2 - Integer ALUs (shift/mul/simple, simple)
  • 2 - Address ALUs (load, store/lea)
  • 1 - FPU non-pipelined
  • 2 - MMX ALUs
  • 2 - 3DNow! ALUs
  • 64K L1 cache (Harvard architecture)
The K7

  • Symmetrical triple instruction decoder
  • Much larger scheduler window (72 riscop entries in total)
  • 3 - Integer ALUs (1xgeneral+mul, 2xgeneral)
  • 3 - Address ALUs (general)
  • 3 - FPUs fully pipelined (load/mul/complex/MMX/3DNow-Mul, load/add/sub/comp/MMX/3DNow-ALU, store/misc)
  • 128K L1 cache (Harvard architecture)

On the surface it appears that, since the K7 has only about as many units as the K6-2, there is little to be gained. But the K7's units are more general, and can realistically be fed at higher rates than the K6's ALUs. Sharing the FPU and 3DNow! units makes perfect sense: they perform similar kinds of calculations, and x87 and 3DNow! code are guaranteed never to execute at the same time.

The K7 architecture looks like it addresses a problem that was beyond the scope of the K6, and that Intel didn't really solve correctly. The K7 is a true 3-way execution engine. Just as with the K6 core, the back end is beefy enough to sustain even the most taxing load that the decoders might send its way. The unit partitioning also represents a significant improvement over the K6 (which has separate FPU and MMX pipes, even though they can never be fed simultaneously.) The K7 should do a much better job of saturating its execution units than the K6 did.

As an analogue to the P-II's decoders, which can produce micro-ops in a 4:1:1 pattern, the K7's decoders can produce rops in 2:2:2 configurations, and if a K7 rop is anything like a K6 risc86 op, then the K7's decode rate will easily exceed that of the P-II. One flavor of risc86 op that was revealed is the load-op-store operation, which tracks the memory address in a single operand, affording the greatest possible efficiency for memory operands -- the x86's natural alternative to registers, of which it has so few.
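
To make the load-op-store point concrete, here is a hedged example of my own: a read-modify-write of memory, which compilers typically emit as a single x86 instruction of the form "add [mem], reg". According to the presentation, such an instruction can travel through the K7 as one risc86 op carrying the memory address in a single operand, rather than being broken into separate load, ALU and store operations.

  /* Illustration only: the increment below is commonly compiled to a single
     x86 read-modify-write instruction, e.g. "add dword ptr [eax+ebx*4], 1",
     which (per the K7 presentation) can be carried as one load-op-store op. */
  void histogram(unsigned *bins, const unsigned char *data, unsigned n)
  {
      unsigned i;
      for (i = 0; i < n; i++)
          bins[data[i]] += 1;
  }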

The out-of-order window is also significantly larger than the P-II's (72 rops versus the P-II's 21 micro-ops), allowing much deeper instruction reordering and therefore greater parallelism. The K7 also has more execution units than the P-II, which means it can handle more actual computational work.

So what does this all mean? Unlike the P-II, the K7 may actually be able to execute, and possibly sustain, a rate of 3 instructions per clock (or at least get a lot closer to it than the P-II does).

Swwweeeeet!

The one big detracting factor that I see is that the K7 requires a totally new motherboard. It will be very important for AMD to get the Super 7 motherboard vendors to produce high quality motherboards for the K7. I call this a detracting factor only in that it is an additional hurdle that must be overcome. At the same time it represents an opportunity for AMD's designers to have a hand in designing a high quality motherboard, and they may find many more performance opportunities there. (A test motherboard with CPU was demoed at COMDEX in a closed-door suite.)

While nothing about the architecture is radically new in the way of architectural concepts (which makes me wonder what it means to be a seventh generation technology), it is clearly the embodiment of all the best features of modern microprocessors. AMD says it will come out at 500MHz+, but when it comes out and how far above 500MHz it actually ends up will be crucial to winning mindshare away from Intel. Intel should also exceed 500MHz for at least the second half of 1999. (Update: K7's at 500, 550, and 600MHz are reportedly shipping and will be generally available in August.)

If AMD can deliver high clock rates in a timely manner, the K7 will be the processor to beat.

Advanced Micro Devices
The 7th generation of CPUs

Cyrix's Jalapeno architecture

The Cyrix Jalapeno presentation at Microprocessor Forum

UPDATE: As of 05/05/99, National Semiconductor has announced that they will not be continuing PC processor development. However, on 06/30/99, after I and others had pretty much written off Cyrix, VIA signed a letter of intent indicating they would purchase Cyrix. Cyrix will likely lose some of their engineers, and their schedules may slip a bit.

Cyrix has claimed that the processor will be available in Q4'99 at more than 600MHz in a 0.18 micron process, with an on-chip 3D accelerator, and supporting memory bandwidths of up to 3.2 GB/s.

The presentation linked above says something that doesn't make any sense to me: that 3-issue buys little performance because applications are limited by OS performance. First of all, I dispute that applications are limited by OS performance (they may be limited by device or memory performance, but the OS, no matter how badly written, should not factor into the performance of any reasonably written application.) Secondly, Cyrix did not explain how they arrived at their 3-issue data, especially since they don't make a 3-decode processor -- they could just be using P-II data, which is not representative of the K7's or even Rise's mP6 decode rate.

The 6x86MX

  • 1 FPU non-pipelined (slow, but asynchronous)
  • 2 Load or Stores in EX units
  • Native x86 execution => 2 entry out-of-order window.
  • L2 cache on motherboard -- direct mapped
  • 64KB L1 cache unified
Jalapeno

  • 2 FPUs fully pipelined
  • 1 Load/Store unit
  • Translation to "nodes" => 16x6 entry out-of-order window.
  • 256KB L2 cache on chip -- 8-way set associative. pipelined to core frequency
  • 32KB L1 cache, Harvard architecture (16K I-cache, 16K D-cache), non-blocking with a pending-miss mechanism

It is clear that Cyrix has made some compromises in moving from the 6x86MX to the Jalapeno core. However, they have clearly identified the areas where the 6x86MX had problems. The on-chip L2 cache will probably help memory bandwidth more than halving and de-unifying the L1 cache will hurt. But going with a single load/store unit seems like too much of a compromise to me. The superscalar, pipelined FPUs will easily pull them ahead of the P-II's floating point performance, and probably not too far off the K7's. But for ordinary performance, my gut feel is that this processor will perform roughly like a K6 with significantly higher memory bandwidth (possibly like a K6-3 at a high clock rate.)

But the real differentiator of the Jalapeno from other x86 CPUs is that, like the MediaGX, it includes a graphics core. The 256K on-chip L2 cache is shared between texture caching and ordinary CPU caching. On the face of it, the problem is that the bandwidth requirements and contents of the CPU and graphics processor are entirely different, and such sharing would tend to have the two thrash each other. But Cyrix has solved that problem by partitioning the L2 cache into graphics and non-graphics sections (by allowing only one of the "ways" of the cache to be dedicated to graphics.) So the value add of the graphics core will come not from integration per se, but rather from a core frequency and (possibly) memory bandwidth advantage.
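
A rough sketch of the way-partitioning idea, with all structure and names invented here purely for illustration: if graphics (texture) fills are only ever allowed to replace one way of the 8-way set-associative L2, streaming texture traffic can never evict more than one-eighth of the CPU's working set, so the two kinds of traffic cannot thrash each other.

  /* Sketch only -- not Cyrix's design; every name here is hypothetical.     */
  #include <stdbool.h>

  #define WAYS     8
  #define GFX_WAY  0               /* the one way dedicated to graphics data */

  struct cache_set {
      unsigned long tag[WAYS];
      unsigned char age[WAYS];     /* larger value = less recently used      */
  };

  /* Choose which way a newly fetched line will replace within a set.        */
  int pick_victim(const struct cache_set *set, bool graphics_fill)
  {
      int way, victim;

      if (graphics_fill)
          return GFX_WAY;          /* texture data is confined to one way    */

      victim = 1;                  /* CPU fills never touch the graphics way */
      for (way = 2; way < WAYS; way++)
          if (set->age[way] > set->age[victim])
              victim = way;
      return victim;
  }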

On the one hand, including a graphics core on chip may prove to be the greatest value add (in terms of total system cost.) But on the other, they are pissing off all the graphics vendors, who are intensely competitive and will be highly motivated to outperform the Cyrix core. I don't have any reason to believe that Cyrix has graphics designers that can compete with nVidia, 3DFX, ATI, Matrox, or even S3 (all of whom have experts previously employed at SGI or other workstation graphics companies.) On the flip side, Cyrix already has experience with the MediaGX, and so should have a reasonable idea of where they went wrong with it in terms of a performance target (the MediaGX, although not a complete failure, did not quite deliver on its promise in terms of its potential market.)

I have a hunch that Cyrix may have bought a deferred rendering graphics core from GigaPixel or Stellar graphics -- the bullet points they include suggest that they are avoiding the term "edge anti-aliasing", saying just the minimum that is both true and non-revealing. They also spent no time describing the kind of technology the graphics core is composed of, which leads me to believe they may not understand it too well themselves. This is a total guess though.

Like the K7, the Jalapeno will also need a new motherboard to accommodate the complete integrated system concept. But Cyrix has already done this once with the MediaGX, and thus should have some good knowledge of how to do it.

600MHz+ in Q4'99 sounds reasonable but not earth-shattering to me. As I commented about AMD above, crucial to its success will be delivering clock rates that are competitive with what Intel will be offering. Intel's roadmap suggests that they will be shipping well above 600MHz by that time frame.

Cyrix

The Transmeta CPU

TransMeta CPU speculation
TransMeta Patent

The following discussion occurred on USENET in the comp.arch group:

Subject: Comments on Transmeta patent?
Newsgroups: comp.arch

Transmeta has been issued a patent, "US5832205: Memory controller for
a microprocessor for detecting a failure of speculation on the
physical nature of a component being addressed":

  http://www.patents.ibm.com/details?pn=US05832205__&s_clms=1

that describes a VLIW processor optimized for efficient emulation of
other (i.e., x86) processors, that they call a "morph host".

In the text, they claim that an "embodiment" (presumably an existing
Transmeta device) with 1/4 the gates of a Pentium Pro runs x86 code
faster than any existing x86 processor.  This "embodiment" has 64
integer and 32 fp registers, and has a 2MB "translation buffer" for
caching dynamic translations of x86 code to native code.  Individual
pages are marked as having translations in the buffer, and the TLB
uses this information to support invalidating translations for
self-modifying code.  It also allows memory locations to be aliased
with registers, via a "load and protect" instruction: this causes the
loaded address to be kept in a special register, and any subsequent
load or store to that address will cause an exception.

The patent describes in a lot of detail how a sample sequence of
target x86 instructions can be translated and optimized to native
instructions.

Do any other processors have a "load and protect" type of instruction
for dealing with aliasing issues?

Also maybe it's just that I'm not that familiar with patents, but it
seems like they disclose an awful lot of stuff that is tangentially
related to the central claims of the patent?

-- Dave Hinds

I have not yet read through the patent in detail. But from what I've gleaned, it's a VLIW processor that uses an FX!32-like on-the-fly x86 translation mechanism with significant hardware support. As a bonus, the patent itself seems to refer to a concept that sounds like speculative memory operations.
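
For readers unfamiliar with FX!32-style translation, here is a toy sketch of the general shape of such a scheme; every name and data structure in it is invented here for illustration and is not taken from the patent. Translated blocks are cached by their x86 address and reused on later visits, so the expensive translation cost is paid once per block rather than once per execution.

  /* Toy sketch only: the general shape of translate-and-cache dispatch.     */
  #include <stdint.h>
  #include <stddef.h>

  typedef void (*native_block)(void);      /* pointer to generated host code */

  #define TCACHE_SLOTS 4096

  struct tcache_entry {
      uint32_t     x86_pc;                 /* guest (x86) address of a block */
      native_block code;                   /* cached native translation      */
  };

  static struct tcache_entry tcache[TCACHE_SLOTS];

  static void stub_block(void) { }         /* stand-in for emitted host code */

  /* A real system would decode x86 instructions and emit VLIW code here.    */
  static native_block translate_block(uint32_t x86_pc)
  {
      (void)x86_pc;
      return stub_block;
  }

  /* Dispatch: reuse a cached translation if we have one, else translate.    */
  native_block lookup_or_translate(uint32_t x86_pc)
  {
      struct tcache_entry *e = &tcache[x86_pc % TCACHE_SLOTS];

      if (e->code == NULL || e->x86_pc != x86_pc) {
          e->x86_pc = x86_pc;
          e->code   = translate_block(x86_pc);
      }
      return e->code;
  }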

It sounds like great technology, but I am sure we are only seeing the briefest glimpses of what it's really about. Although "JC" believes that this processor will debut in Q4'99 (see JC's processor roadmap), I don't know where he got that information, or how credible it is. (His site also contains information implying that the Transmeta CPU will be a lot smaller than a P6 core -- but so far that has not helped other Intel competitors such as IDT.)

The challenge for these guys will be to leverage the idea of VLIW (which enables more instruction parallelism) as a performance booster that offsets whatever translation overhead there is, in a well defined way. Discussion of clock rate will not be a very good indicator until the translation performance ratio is known.

Update: There are rumours floating around that the Transmeta CPU will be used for the new Amigas. While the story fits, the (new) Amiga has already changed direction once (going with QNX, then changing their mind and deciding to use Linux.) So I would not consider this information too solid.

Story on C|Net
TransMeta

The Centaur WinChip 4 CPU

The Centaur WinChip 4 presentation at Microprocessor Forum

Update: On 07/??/99 IDT announced that they will leave the x86 processor business, their intent being to sell off their "Centaur" division. Shortly thereafter VIA announced intentions of acquiring IDT as well!! Don't ask me what's going on over there!

Just as the original WinChip designs seemed to be a cross between a 486 and a RISC CPU, the WinChip 4 looks like a cross between a Pentium and a RISC CPU. Centaur claims that the processor will be available in 2H'99 at 400-500MHz in a 0.25 micron process, with a 500-700MHz version available in 1H'00 at 0.18 micron.

These timetables and clock rates do not seem all that impressive considering where Intel, AMD and Cyrix will be in a similar time frame; Centaur may be relegated to competing on price. I don't want to write them off yet, though, since they appear to be at least within a stone's throw in terms of clock rate.

Like Cyrix, they have, for some reason, also decided against 3-decode. They may be targeting Rise (another x86 start-up that has a 3-decode rate, but only executes in order) rather than the K7 with this comment, since they have no pretensions of beating AMD or Intel from an architecture point of view.

Instead of going for an on-die L2 cache, they have decided that a larger L1 cache (128k) would be a better idea.

After going over their slides, I've decided that the most interesting part of the architecture is that memory loads are performed in a pipeline stage that occurs before execution. This makes L1 cache access essentially free in all cases (as compared to costing an additional clock in the K6 or P-II architectures -- though it won't be an advantage compared to an aggressive out-of-order architecture like the K7), at the relatively minor expense of increasing the latency of all instructions by one clock.
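
A worked illustration of that trade-off, based only on the sentence above and my own reading of it: for an instruction whose operand comes from memory, doing the D-cache access in a stage ahead of execute means the ALU already has the data when the instruction reaches it, at the cost of one extra stage that everything, memory instruction or not, must flow through.

  /* Illustration only (my own reading of the slide, not Centaur's diagram):
   *
   *   classic ordering:  fetch  decode  execute  mem  writeback
   *                      an ALU op that uses loaded data waits on the mem
   *                      stage, costing an extra clock even on an L1 hit
   *
   *   load-before-EX:    fetch  decode  LOAD  execute  writeback
   *                      the loaded value is already there when the ALU
   *                      stage runs, so "add reg,[mem]" pays no extra clock;
   *                      the price is one more stage of latency for every
   *                      instruction (e.g. on a branch mispredict).
   */
  unsigned sum16(const unsigned *a)
  {
      unsigned s = 0;
      int i;
      for (i = 0; i < 16; i++)
          s += a[i];          /* x86: add s, [a + i*4] -- the load folds into
                                 the dedicated load stage ahead of the ALU   */
      return s;
  }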

If they played games with their decoder, they could make "mov reg,[mem]" instructions essentially equivalent to prefixes for non-memory operations. They make no mention of this, and I doubt that they would do it -- it's a bit too complicated, and the Centaur people seem to like simplicity in their CPUs.

WinChip

Links
JC's Processor Roadmap
My 6th generation comparisons
Beyond RISC - The POST-RISC Architecture
Kenneth Eckman's Intel Roadmap

Glossary