THERE ARE A lot of curious things coming out of TSMC lately, and they all seem to center around dodging real questions. The problem is yields on their 40nm process, but TSMC will not address half of the reported problems publicly.
The history of the TSMC 40nm process has been long and painful on both the capacity (wafer starts) and yield fronts. Chips were supposed to roll off the line in October of 2008, and almost 18 months later, things still appear to be a mess. What went wrong, and what is still going wrong?
SemiAccurate has been following the debacle closely, and the author was one of the first to sound the alarm months before problems were public. Since then, we have been tracking what internal sources, customers, and TSMC executives say about the various problems, don’t say about others, and how the various fixes are purported to work. Not only do the officially claimed problems not match what external sources are saying, but the publicly named solutions don’t match what is coming out of the fabs.
The first problem is a contamination issue. This is the one that is officially talked about, and the one that is officially solved. From everything we hear, the fab where almost all 40nm wafers are run had a contamination problem in 2009, and it was quite problematic. When you hear about TSMC executives saying “yield rates on the process have improved after a two-quarter period with the defect density dropping from 0.3-0.4 to only 0.1-0.3”, it is true, but only part of the story.
Insiders told SemiAccurate that during phase 2 of bringing up TSMC’s 40nm Fab 12, “clean room certification level was way off”, and for that phase, was the cause of a major yield drop. From everything we hear, that has been very successfully dealt with.
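To put the quoted defect densities in perspective, here is a rough sketch using the simple Poisson yield model, a common first-order approximation and not necessarily what TSMC or its customers use. The quote does not state units, so defects per square centimeter is an assumption here, and the die areas are hypothetical round numbers, not figures for any real chip.

```python
import math

def poisson_yield(defect_density_per_cm2, die_area_cm2):
    """Expected fraction of dice with zero killer defects (Poisson model)."""
    return math.exp(-defect_density_per_cm2 * die_area_cm2)

# Hypothetical die areas for illustration only, not real products.
DIE_AREAS_CM2 = {"small die (1 cm^2)": 1.0, "large die (5 cm^2)": 5.0}

# Defect densities spanning the range in the quote; units are an assumption.
for label, area in DIE_AREAS_CM2.items():
    for d0 in (0.4, 0.3, 0.1):
        estimate = poisson_yield(d0, area)
        print(f"{label}, D0={d0}: estimated yield {estimate:.0%}")
```

Under these assumptions, the same change in defect density moves the estimate far more for a large die than for a small one. The model only covers random defects, so it says nothing about the process-level problems discussed further down.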
Another problem, the so-called ‘chamber mismatch’, is one that simply should not have happened, much less been allowed to continue for so long. Chamber mismatch is a fancy term of art for a calibration problem, basically a machine that is set wrong or reads wrong. It is a classic bring-up problem, and was almost assuredly the reason for the delays in the second 40nm line.
Several semiconductor engineers questioned by SemiAccurate could not explain how such a problem was not caught immediately by TSMC’s metrology. Most fabs would notice a problem within a pod (4-5 wafers) or two, and raise alarm bells within a day at most. If the problem was a plasma etch chamber out of calibration, it should have been easy enough to detect with an electron microscope. That is exactly what metrology is there for. Metrology tests results and changes in production runs, and an etch chamber that is off calibration comes down to a simple measurement: how thick is the layer that comes out?
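The kind of check the engineers describe can be sketched in a few lines. This is only an illustration of the idea, not how any fab's metrology software works. The function name, the target, the tolerance, and the measurements are all made up.

```python
def first_out_of_tolerance(thicknesses_nm, target_nm, tolerance_nm):
    """Return the index of the first wafer whose measured layer thickness
    falls outside target +/- tolerance, or None if every wafer is in range."""
    for index, thickness in enumerate(thicknesses_nm):
        if abs(thickness - target_nm) > tolerance_nm:
            return index
    return None

# Made-up measurements: the chamber drifts out of calibration at wafer 3.
measurements = [100.2, 99.7, 100.4, 107.9, 108.3, 108.1]

# Hypothetical target and tolerance, chosen only for the example.
print(first_out_of_tolerance(measurements, target_nm=100.0, tolerance_nm=2.0))
# prints 3
```

With a check this simple, a miscalibrated chamber shows up on the first bad wafer that gets measured, which is why a delay of months puzzled the engineers we spoke to.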
Once again, that problem appears to have been solved, and chips were coming off the second 40nm line in mid-December 2009. The fact that it took months rather than hours to catch and rectify the chamber mismatch problem is worrisome. If the metrology program doesn’t work for the basics, how is TSMC going to get feedback on the more troublesome and esoteric problems? You cannot debug deeper problems by randomly pressing buttons and hoping things work out three months later.
The problems of contamination and chamber mismatch were based on the tools or facilities that manufacture the chips, and do not have anything to do with the process itself. Think of them as the process being correct, but not being done right rather than the process having something inherently wrong with it.
The final two problems are much more serious in nature, and they are simply not addressed by TSMC executives in public. Via fracturing and transistor variability are problems with the process itself, not with how it is carried out. Either one can lead to many more dead chips than a little dirt here and there, especially at the defect rates quoted by TSMC executives.
First up are the via bonding/failure problems. Failing vias were specifically called out as a problem by Nvidia’s John Chen, who publicly rebuked TSMC over them. ATI dealt with the problem by working around it at the design stage, not because the problem itself was resolved. Nvidia didn’t have the foresight to fix the problem at the design level, so its chips are a good indicator of TSMC’s progress in resolving the via problems. By all accounts, the problem is simply not fixed yet, and it is much more serious than the first two.
The last problem, transistor variability, is also a process technology problem. Once again, this is a fancy way of saying that the transistors are not exactly what was promised: some are too big, some are too small. Even if it is only by a single nanometer, this can cause huge problems with speed, leakage, and just about every single metric that matters.
If you think of variability as a bell curve, you want it to be as close to a single narrow peak as possible. TSMC’s variability seems to be wider and flatter than they would like it to be. Once again, the problem is manifested in Nvidia’s inability to make GF100/Fermi/GTX480 chips that run at the promised speeds and wattages. Those parts keep getting pushed out by months or quarters at a crack, and are woefully short of the promised specs.
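The bell curve picture can be made concrete with a small sketch. It assumes transistor dimensions follow a normal distribution around the nominal value, and it uses a hypothetical spec window of plus or minus one nanometer. The two sigma values are invented to contrast a narrow curve with a wide one and are not measurements of any real process.

```python
import math

def fraction_in_spec(sigma_nm, window_nm):
    """Fraction of a zero-mean normal distribution that falls within
    +/- window_nm of nominal, for a given standard deviation."""
    return math.erf(window_nm / (sigma_nm * math.sqrt(2.0)))

# Hypothetical spec window: +/- 1 nm deviation from the nominal dimension.
WINDOW_NM = 1.0

# Invented sigma values: a narrow curve versus a wider, flatter one.
narrow = fraction_in_spec(sigma_nm=0.5, window_nm=WINDOW_NM)
wide = fraction_in_spec(sigma_nm=1.0, window_nm=WINDOW_NM)

print(f"narrow curve: {narrow:.1%} of transistors in spec")
print(f"wide curve:   {wide:.1%} of transistors in spec")
```

Under these assumptions, doubling the spread drops the in-spec share from roughly 95% to roughly 68% per transistor. A chip needs a very large number of transistors to land in the window, so even a modest widening of the curve hurts badly.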
Both the via defects and transistor variability are fundamental problems with the technology of the process, not external defects like bad equipment or contamination. If you execute perfectly on your manufacturing, which TSMC may now be doing after about a year of trying, both will still happen, and both will still kill chips.
The anecdotal evidence says that TSMC has yields that are yo-yoing without any real explanation. While this may be reading a bit too much into barely correlated data points, it seems like TSMC’s 40nm process is still an experiment in progress. Yields going up and down tend to be a sign of things being tried on the fly, something that should not be happening in a high-tech production environment like this.
Let’s hope that this is not the case. If it is, it would point to some very serious problems with TSMC’s fundamental technology, their ability to get feedback via metrology, and their ability to interpret the data and implement changes based on it. I doubt we will ever hear the complete story of what is going on internally, but promised resolutions are still months away. With luck, things will go smoothly from Q3 on. S|A