Design for reliability in the era of the computing continuum

The CLERECO collaborative project proposes a scalable, cross-layer methodology and supporting suite of tools for accurate and fast estimations of computing systems’ reliability

Dec 19, 2016 -- As we enter the era of nanoscale devices, reliability is becoming a key challenge for the semiconductor industry. The now atomic dimensions of transistors make them vulnerable to variations in the manufacturing process and can dramatically increase the effect of environmental stress on correct circuit behaviour. Failure to assess a computing system’s reliability early may lead to excessive redesign costs, which can have severe consequences for the success of a product.

Current practice involves a worst-case design approach with large guard bands. Unfortunately, application of this approach is reaching its limit in terms of economic sustainability with regard to performance, size and energy costs. Coordinated by Dr Stefano Di Carlo of the Politecnico di Torino in Italy, the CLERECO (Cross Layer Early Reliability Evaluation for the Computing cOntinuum) project aims to address this challenge.

The CLERECO project brings together four academic institutions and three manufacturers to create a new approach and new tools for electronic design automation (EDA) support in reliability assessment. The project addresses a key challenge faced by designers: taking early measures to prevent hardware faults from propagating to the software layers of the system stack and reaching the system output. Different protection mechanisms must be employed at each layer in order to provide ‘cross-layer’ reliability enhancement. This challenge becomes more serious when coupled with the concept of the ‘computing continuum’. This phrase was coined by Intel in 2011 and describes the erosion of the traditional separation between the market segments of embedded systems and high performance computing systems. CLERECO aims to reflect this merging of market segments at the EDA level by providing tools that can be applied, with small differences, to different computer architectures and across a range of areas. CLERECO has been in operation since 2013, with key milestones including theoretical models, tool design, adaptation of tools for industry use, and commercialisation of the created tools by partner organisations.

A new methodology and approach

The CLERECO project focuses on reliability analysis in the early phases of the design. Early assessment within the design cycle provides the freedom to modify the design if the estimated reliability level does not meet the requirements. Traditional tests typically take place at the end of the design stage, when a prototype of the system is available. Physical stress tests are generally used to evaluate the final reliability of a system. When working with models of the design, register-transfer-level or gate-level fault injection (FI) is the most accurate standard method for performing reliability analysis. The designer creates a detailed model of the system hardware and is able to simulate the execution of software applications on this model. To perform reliability analysis, faults are artificially injected into the model, thus simulating what could happen in a real environment, and the effect of the faults is evaluated by comparing the behaviour of the faulty system with a fault-free version. Theoretically, by performing a statistically significant number of simulations, a very precise estimation of the reliability of the system can be obtained.
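The principle of a fault injection campaign can be illustrated with a minimal Python sketch. It is not a gate-level simulation and does not represent the CLERECO tools: the toy workload (`checksum`), the fault model (a single bit flip in an 8-bit accumulator) and every name in it are assumptions made for illustration only.

```python
import math
import random


def checksum(data, fault=None):
    """Toy workload (hypothetical): sum bytes in an 8-bit accumulator
    and return only the upper four bits.

    fault is an optional (step, bit) pair: just before processing the
    byte at index `step`, bit `bit` of the accumulator is flipped,
    imitating a transient upset in a register.
    """
    acc = 0
    for step, byte in enumerate(data):
        if fault is not None and fault[0] == step:
            acc ^= 1 << fault[1]
        acc = (acc + byte) & 0xFF
    # The lower four bits are discarded, so some faults are masked.
    return acc >> 4


def run_campaign(data, n_injections, seed=0):
    """Inject n_injections random single-bit faults and compare each
    faulty run against the fault-free (golden) run.

    Returns the observed failure fraction and the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    golden = checksum(data)
    failures = 0
    for _ in range(n_injections):
        fault = (rng.randrange(len(data)), rng.randrange(8))
        if checksum(data, fault) != golden:
            failures += 1
    p = failures / n_injections
    half_width = 1.96 * math.sqrt(p * (1 - p) / n_injections)
    return p, half_width
```

Calling `run_campaign([0x12, 0xA7, 0x3C, 0x55, 0x9E, 0x01], 2000)` returns the fraction of injected faults that changed the output together with its confidence half-width. The half-width shrinks only with the square root of the number of injections, which hints at why a precise estimate on a detailed model of a complex system can require a very large number of simulations.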

Di Carlo identifies problems with this approach: ‘These detailed gate-level models, especially for complex hardware blocks such as microprocessors are not always available, and the simulation time in case of complex systems is prohibitive. Usually this results in the use of simplified simulations and models.’ This leads, in Di Carlo’s opinion, to ‘inaccurate and usually pessimistic estimations’.

The CLERECO methodology addresses this challenge by providing dedicated tools to separately analyse the technology, the hardware components (at the microarchitecture level) and the software modules of a complex system, and to recombine the characteristics of the individual components into a statistical Bayesian model that can be used to reason about the reliability of the system as a whole. As part of the project, reliability estimations obtained with CLERECO models and tools on a set of benchmarks and real systems were compared with results from an accurate gate-level FI campaign and from a simplified FI approach, both performed using state-of-the-art commercial tools. The results revealed an equivalent level of estimated reliability, with a significant reduction in the simulation time required by the CLERECO estimation approach compared with an industry-standard gate-level FI campaign.
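The idea of recombining per-layer characteristics can be sketched in Python under strong simplifying assumptions. The sketch treats each component as a three-step chain (technology, hardware, software) and assumes components fail independently, which is far simpler than a general Bayesian model. The `Component` class, its field names and the function below are hypothetical and are not taken from the CLERECO tools.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One hardware component seen through three layers (hypothetical)."""
    name: str
    p_raw_fault: float     # technology layer: P(raw fault during a run)
    p_hw_propagate: float  # hardware layer: P(visible error | raw fault)
    p_sw_propagate: float  # software layer: P(output failure | visible error)

    def p_failure(self):
        # Chain rule along the layers: a failure needs a raw fault that
        # escapes hardware masking and then escapes software masking.
        return self.p_raw_fault * self.p_hw_propagate * self.p_sw_propagate


def system_failure_probability(components):
    """P(at least one component causes an output failure), assuming
    the components fail independently of each other."""
    p_all_ok = 1.0
    for component in components:
        p_all_ok *= 1.0 - component.p_failure()
    return 1.0 - p_all_ok
```

Each layer can be characterised separately and only the resulting conditional probabilities are combined, so a change in one layer (for example a different manufacturing technology) does not require re-analysing the others.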

Industrial challenges

The project faced challenges relating to the stability of the industrial team: ‘Within three years there were three changes in the composition of the industrial consortium. The project remained intact but the turbulence created by such changes represented a significant overhead for the project coordinator and in some cases represented a threat to the continuation of the project. The ability to weather such turbulence relied on the preventive definition of effective backup plans for these situations and on the research strength and dedication of all the academic partners, who guaranteed the smooth continuation and the successful completion of the project.’

This instability of the industrial consortium is not unique to the CLERECO project, however. The current state of the global economy, as well as the speed at which the IT business changes, makes a three-year commitment to a research project increasingly difficult for industrial partners. In contrast, there were few major scientific challenges, and collaboration among all partners was good and very productive.

Market-ready results

Industrial partners have been very important during the project despite the changes mentioned above. CLERECO worked in collaboration with them to build tools that could be inserted directly into industrial design flows. Many academic tools never find real application in industry because they are not built to be integrated into a real industrial design flow. Industrial partners provided guidelines on the use of tools and models, thus creating a key area of value for the project. The complexity of the new technologies requires a deeper collaboration between industry and academia. We could not have achieved our results without the strong support of our industrial partners.

Industrial partners also provided systems on which the tools and models were tested. Benchmarks are effective instruments for evaluation of the capability of a tool but only when working with real cases can the practical limits of an approach be understood. This academic and industrial partnership allowed a move from laboratory-based benchmark analysis to real systems in order to identify system-based bottlenecks. These have now been corrected, allowing delivery of mature products.

Industrial partners also provided the tools and knowledge to perform a fair comparison of CLERECO methods with existing gold-standard commercial products, showing how early reliability analysis can complement precise fault-injection-based validation at the end of the design flow. CLERECO partners also provide a channel for exploitation and presentation of the project’s results. In 2016 alone, the CLERECO team presented at the HiPEAC Conference, DATE (Design Automation and Test in Europe), ISPASS (International Symposium on Performance Analysis of Systems and Software), VTS (IEEE VLSI Test Symposium), ICCD (IEEE International Conference on Computer Design), IOLTS (IEEE International Symposium on On-Line Testing and Robust System Design), PATMOS (International Workshop on Power And Timing Modeling, Optimization and Simulation), DCIS (Design of Circuits and Integrated Systems Conference) and CTC (China Test Conference).

‘The complexity of the new technologies requires a deeper collaboration between industry and academia. We could not have achieved our results without the strong support of our industrial partners.’

Stefano Di Carlo (Project Coordinator)

Control and Computer Engineering Department
Politecnico di Torino

T: +39 011 090 7080
