Kumar works to improve computing reliability
As transistors decrease in size, their reliability also decreases. That decrease may lead to computing applications running incorrectly. While it may be possible to guarantee correct execution at the expense of area, power, or performance, the cost will become prohibitive for future processors. ECE Assistant Professor Rakesh Kumar is working to solve the problem of computing in the face of errors that future transistors may produce.
“In order for computing to be more pervasive, you really want the cost of computing to go down,” Kumar said. “The easiest way to make the cost of computing go down is to make the transistors smaller.”
Kumar is researching stochastic processors, which are not guaranteed to produce correct results every time. The processor is designed for the average case, meaning it consumes much less power and delivers more performance, but may not always produce correct results.
Kumar and his research team have been working on this project, which Kumar considers his flagship project, for about one year. It is supported by Intel, the National Science Foundation’s EAGER program, and most recently, the GigaScale Systems Research Center (GSRC). The University is also supporting the project through an Arnold O. Beckman Research Award.
“You now have to do computing very differently because all the computing that people have done so far assumes that when you run a programmable chip, it’s going to give you correct results,” Kumar said.
Kumar is pursuing three research directions for computing on stochastic processors.
The first is creating a soft architecture. In a typical hard architecture, the processor either functions fully or fails fully. With a soft architecture, when the voltage of a processor is reduced or frequency is increased and errors are produced, the processor degrades gradually.
“If your architecture was not soft, then you reduce the voltage, and suddenly, every circuit in the processor starts producing errors,” Kumar said. “In that case, neither would your application be able to tolerate errors, nor would your hardware-level error tolerance mechanisms be able to tolerate errors.”
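The contrast between the two behaviors can be pictured with a toy model (an illustrative sketch only, not Kumar's actual design; the threshold and ramp values are invented): a hard architecture's error rate jumps from zero to total failure at a voltage cliff, while a soft architecture's error rate rises gradually as the voltage drops.

```python
# Illustrative toy model (not from Kumar's work): error rate vs. supply voltage
# for a "hard" architecture (cliff-edge failure) and a "soft" one (gradual).

def hard_error_rate(voltage, v_min=0.9):
    """Hard architecture: correct above a threshold, total failure below it."""
    return 0.0 if voltage >= v_min else 1.0

def soft_error_rate(voltage, v_nominal=1.0):
    """Soft architecture: errors grow gradually as voltage drops below nominal."""
    shortfall = max(0.0, v_nominal - voltage)
    return min(1.0, shortfall * 2.0)  # hypothetical linear ramp

for v in (1.0, 0.95, 0.9, 0.85, 0.8):
    print(f"V={v:.2f}  hard={hard_error_rate(v):.2f}  soft={soft_error_rate(v):.2f}")
```

In this caricature, the soft architecture gives the application and the hardware a usable middle ground, a small, tolerable error rate, that the hard architecture never offers.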
The second research direction looks at what to do when the soft architecture produces errors. An increase in reliability is typically achieved only at increased power consumption. Similarly, if the power is reduced by reducing the voltage, the reliability will decrease, and errors will occur.
To address the competing concerns of reliability and power, error-tolerance mechanisms need to be created that keep errors from reaching applications. Kumar is looking at error-tolerance mechanisms that account for application characteristics and environmental conditions to reduce the area and power overhead of tolerating errors.
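One way such application awareness could pay off (a hypothetical illustration, not Kumar's mechanism; the cost figures are arbitrary): if only the fraction of operations the application marks as critical pays for redundant, checked execution, the protection overhead scales with that fraction rather than with the whole program.

```python
# Hypothetical illustration: selective redundancy. Only "critical" operations
# pay the energy cost of duplicated execution; the rest run unprotected.

UNIT_COST = 1.0          # arbitrary energy units per operation
REDUNDANCY_FACTOR = 2.0  # duplicated execution roughly doubles the cost

def protection_cost(n_ops, critical_fraction):
    """Total energy when only the critical fraction of ops is duplicated."""
    critical = n_ops * critical_fraction
    unprotected = n_ops - critical
    return critical * UNIT_COST * REDUNDANCY_FACTOR + unprotected * UNIT_COST

# Blanket redundancy vs. protecting only the 10% of ops that must be correct:
print(protection_cost(1000, 1.0))   # 2000.0 units
print(protection_cost(1000, 0.1))   # 1100.0 units
```

Under these made-up numbers, application-aware protection cuts the overhead of blanket redundancy roughly in half, which is the kind of saving that motivates tailoring error tolerance to the application.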
The third research direction is application hardening, or robustification. When errors reach applications, the applications crash. There needs to be a way for applications to survive when errors do reach them.
“The idea is: can you develop a black box that takes as input an application and outputs an error-tolerant version of it,” Kumar said.
Instead of causing the application to crash, errors would only cause reduced application quality. For example, a video application’s full-resolution images become half-resolution images.
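The black-box idea can be sketched in miniature (a hypothetical sketch, not Kumar's tool; the function names are invented, and a raised exception stands in for a hardware error reaching the application): wrap the application so that a fault triggers a lower-quality fallback instead of a crash.

```python
# Hypothetical sketch of robustification: a wrapper that turns a crash into a
# graceful drop in output quality.

def robustify(full_quality_fn, degraded_fn):
    """Return an error-tolerant version of full_quality_fn.

    If full_quality_fn raises (standing in for a hardware error reaching the
    application), the degraded fallback runs instead of crashing.
    """
    def tolerant(*args, **kwargs):
        try:
            return full_quality_fn(*args, **kwargs)
        except Exception:
            return degraded_fn(*args, **kwargs)
    return tolerant

# Toy "video frame": a full-resolution row of pixels, plus a half-resolution
# fallback that keeps every other pixel.
def render_full(frame):
    if any(p < 0 for p in frame):       # stand-in for a fault during rendering
        raise RuntimeError("bit flip")
    return frame

def render_half(frame):
    return [abs(p) for p in frame[::2]]

tolerant_render = robustify(render_full, render_half)
print(tolerant_render([1, 2, 3, 4]))    # no fault: full resolution [1, 2, 3, 4]
print(tolerant_render([1, -2, 3, 4]))   # fault: half resolution [1, 3]
```

The wrapper is the "black box": it never inspects how the application works, it only guarantees that an error degrades quality rather than terminating the program.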
“If I manage to have a black box that can take any application and output an error-tolerant version of it that runs at almost the same quality or same speed as the original non-error-tolerant version of the application, but at significantly lower power, then I would consider it as a success,” Kumar said.
Kumar thinks this project will have deep implications for the future of computing.
“The problem is hard, and hard problems excite me,” Kumar said. “The cost of doing computation correctly is really high, and this cost is actually going to increase in the future. If one can figure out how to do gainful computing without paying that cost, it will be great.”