Zaiq Technologies












 

Evaluating and improving emulator performance

By Damian Deneault and Lauro Rizzatti, EEdesign
Jan 3, 2004 (2:09 PM)


With functional verification becoming the largest task in the development of an ASIC or large FPGA, many engineers are looking at simulation acceleration or emulation as a way to speed development. Engineers today can choose from a number of different platforms.

Specifications or performance metrics from the various vendors have been expressed as gate capacities, speed-up ratios over simulation, bandwidth and/or clock frequency of the acceleration hardware. These metrics are not easily compared across vendors, and terminology is not always consistent. On a given emulator, actual performance varies widely according to the specific target design. Even for a given design, mapped to a given emulator platform, execution times for different tests vary widely.

Figure 1 shows the performance relative to an HDL simulator for a number of different tests, applied to a single network processor on an emulator [1]. Although the vendor may quote an average speedup factor relative to simulation, the wide variations shown here mean that an end user cannot reliably use that speedup factor to predict actual performance. The actual performance of the test suite on the emulator would not be known until after incurring the time and cost of implementing the testbench and target design on the emulator, by which point the project's schedule and budget may already have been affected.

Figure 1 -- Emulation speed relative to HDL simulation

Simulation limits
Execution performance in HDL simulation depends on the size of the combined design and testbench. As the size of the design increases, the number of signal events or task evaluations increases and simulation time increases accordingly. Design size and level of abstraction (gate netlist, register transfer level, or behavioral code) are appropriate predictors of design simulation performance.

Testbench simulation time must also be factored into the prediction. For testbenches written in HDL, testbench simulation time often accounts for 25% to 50% of the total run-time. Testbenches that use a high-level hardware verification language (such as Vera, "e" or C/C++ with Testbuilder or SystemC) and communicate with the HDL design under test (DUT) at the event or even clock-cycle level will often take a similar fraction of the run-time for execution and communication.

For testbench environments written efficiently in C, C++ or SystemC and communicating with the simulation at a transaction level, the total testbench effect on run-time can be drastically reduced to a tiny fraction, possibly less than 0.1%. Table 1 provides a description of event, clock, and transaction communication.

Events: Involves individual signals, pins or nets. Transitions of these signals are generated or monitored.
Cycles: Focuses on behavior of each chip, bus or interface clock. Creates or monitors the activity at each clock.
Transactions: A complete interchange of data or status. The transaction may span several clock cycles, involve many signals and events. Examples: a bus read or write cycle, a bus burst, a DMA transfer, an interrupt, a packet or cell or frame of data.

Table 1 -- Test abstraction levels

Figure 2 shows the relationship between cycle and transaction.

Figure 2 -- Relationship between cycle and transaction
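
The testbench fractions quoted above matter most once the DUT is accelerated. The sketch below applies Amdahl's law, which is our framing rather than something the article states: it assumes the testbench portion keeps running at its original speed while only the DUT portion speeds up. The function name and the 1,000,000x DUT speedup are illustrative assumptions.

```python
def overall_speedup(testbench_fraction: float, dut_speedup: float) -> float:
    """Overall run-time speedup when only the DUT portion is accelerated.

    testbench_fraction: share of the original run-time spent in the testbench
    (0.0 to 1.0), assumed to stay at its original speed.
    dut_speedup: factor by which the DUT portion alone is accelerated.
    """
    if not 0.0 <= testbench_fraction <= 1.0:
        raise ValueError("testbench_fraction must be between 0 and 1")
    if dut_speedup <= 0:
        raise ValueError("dut_speedup must be positive")
    dut_fraction = 1.0 - testbench_fraction
    return 1.0 / (testbench_fraction + dut_fraction / dut_speedup)


# Fractions taken from the text: 50%, 25%, and 0.1% of run-time in the testbench.
# The DUT speedup of 1e6 is a made-up placeholder.
for fraction in (0.50, 0.25, 0.001):
    print(f"testbench share {fraction:.1%}: "
          f"{overall_speedup(fraction, 1e6):,.1f}x overall")
```

Under these assumptions the overall gain can never exceed 1 / testbench_fraction, which is why a testbench taking 25% to 50% of the run-time caps the benefit at a small multiple, while a transaction-level testbench at 0.1% leaves room for roughly three orders of magnitude.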

Transaction-based emulation
In transaction-based emulation, the functional verification tests are implemented on the host workstation, in C and C++, and are often augmented by performance modes and a software co-simulation environment. Test stimulus and responses are passed, in the form of transactions, between the test environment on the host and the DUT in the emulator.

A transactor provides the abstraction for a given protocol that can be standard or proprietary. A transactor contains the knowledge to transform the data between two interfaces:

  • On the hardware side, it matches the specific protocol interface.
  • On the software side, it provides high-level commands to perform specific actions.

A transactor is the combination of two pieces of code:

  • Software adapter: C++ model to send/receive transactions.
  • Hardware adapter: Verilog bus functional model (BFM) to play cycle-accurate transactions.

The hardware and software parts of the transactor exchange data through an API known as the Standard Co-Emulation Modeling Interface (SCE-MI), a standard sponsored by Accellera. Figure 3 shows the composition of a transactor.

Figure 3 -- Composition of a transactor
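
The two halves of a transactor can be illustrated with a toy model. This is a minimal sketch, not the SCE-MI API and not the authors' code: the article's software adapter is C++ and its hardware adapter is a Verilog BFM, whereas here both are Python stand-ins. Every name (`Message`, `MockHardwareAdapter`, `SoftwareAdapter`) and the four-cycle access cost are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for one message crossing the host/emulator link."""
    opcode: str
    address: int
    data: int = 0


class MockHardwareAdapter:
    """Stands in for the BFM: plays each message as bus cycles on a toy memory."""

    CYCLES_PER_ACCESS = 4  # assumed protocol cost, for illustration only

    def __init__(self) -> None:
        self.memory = {}
        self.cycles = 0

    def handle(self, msg: Message) -> int:
        # Each transaction expands into several clock cycles on the hardware side.
        self.cycles += self.CYCLES_PER_ACCESS
        if msg.opcode == "WRITE":
            self.memory[msg.address] = msg.data
            return 0
        if msg.opcode == "READ":
            return self.memory.get(msg.address, 0)
        raise ValueError(f"unknown opcode {msg.opcode!r}")


class SoftwareAdapter:
    """Test-facing side: high-level commands in, messages out."""

    def __init__(self, hardware: MockHardwareAdapter) -> None:
        self.hardware = hardware
        self.messages_sent = 0

    def _send(self, msg: Message) -> int:
        self.messages_sent += 1
        return self.hardware.handle(msg)

    def write(self, address: int, data: int) -> None:
        self._send(Message("WRITE", address, data))

    def read(self, address: int) -> int:
        return self._send(Message("READ", address))


bus = SoftwareAdapter(MockHardwareAdapter())
bus.write(0x10, 0xAB)
value = bus.read(0x10)
print(value, bus.messages_sent, bus.hardware.cycles)
```

The test code only sees `write` and `read`; the cycle-level detail stays inside the hardware-side class, which mirrors the split between software adapter and BFM described above.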

Figure 4 shows the coordination of software and hardware during test execution. Execution proceeds on the C side until a SIM function is called. The C/C++ side of that transactor then halts while its BFM handles the corresponding opcode. Other transactors continue to run concurrently.

Figure 4 -- Transactor execution flow

The verification environment and emulator used in the performance example are shown in Figure 5.

Figure 5 -- Verification environment used in the performance benchmark

The transaction based environment has the following benefits:

  • Efficiency of the higher level of abstraction. Test writers work at the transaction level of abstraction when writing tests and debugging them. Cycle execution detail and handling is encapsulated in the BFM and lower level efficiency routines. Test development is more productive at this higher level of abstraction.
  • Correctness. By concentrating on the transaction level of abstraction, test writers are more likely to thoroughly exercise the device and get the right sequence of operations and data types. They aren't distracted by the cycle and signal protocol specifics, which instead are developed and tested once in the BFM.
  • Re-use. The transaction level promotes re-use of verification code, both within a project (tests written for an external chip interface may be applied to an internal chip bus for module test) and from project to project (tests may be applied to the next version of the bus interface for the next-generation chip).
  • Execution performance. Transaction-based C environments incur less CPU and memory overhead, compared with testbenches connected at the pin or cycle level, or implemented in HDL.
  • Portability. The transaction communication in Figure 3 is based on the Standard Co-Emulation Modeling Interface (SCE-MI) standard from Accellera. This standard interface helps ensure portability of the environment forward to new emulation platforms, and portability of third-party verification intellectual property (IP) into the environment.

Emulation limits
Emulation limits include the capacity of the platform, the unconstrained performance of the platform, and the communication with the host workstation.

The DUT and BFMs are synthesized and mapped into the hardware emulation platform shown in Figure 3, limited by several critical resources:

  • The design must fit within the overall logic capacity of the platform. Predicting the size of a design not yet implemented is a challenge, but one familiar to designers, with tools and engineering experience available.
  • The memory in the DUT and BFMs can either be mapped into the FPGA internals, or to dedicated memories in the emulator, or stored on the host workstation. Internal and on-board memories may have constraints such as limited data widths, fixed depths, requirements for registered inputs and outputs that don't match the intended DUT implementation, and these must be accommodated.
  • The DUT and BFM clocks must be implemented in the actual hardware clock resources available in the emulator and FPGA(s) on which it is based. This takes place in a synthesis and mapping step, and is subject to the actual clock drivers available, and hardware issues of clock skew, setup and hold timing. Limitations on the number and types of clocks can usually be overcome with the EDA tools, coding practices, and experience from the design community.

The underlying clock rate of the emulators is often close to the full capabilities of the FPGAs they are based on, on the order of 10^6 to 10^8 Hz. The more FPGAs the emulator uses, the lower the speed tends to be. In contrast, leading-edge chip or complete system designs in HDL simulation may only run at 10^-1 to 10^2 Hz, or less.

A design may not fit into a particular emulator because of the limits described above. For designs that fit, however, the size of the design has little or no impact on the performance of the emulator. This binary dependency is in contrast to the gradual performance degradation of HDL simulation with size.

With the hardware emulator evaluating the DUT state up to 10^8 times faster than simulation, and with an efficiently implemented high-level testbench or co-simulation environment, communication between the two becomes an important performance predictor. Test stimulus and responses are passed, in the form of transactions, between the test environment on the host and the DUT in the emulator.

The communication channel includes a hardware interface between the host and emulator, often a high-speed cable or bus, and in this case a PCI interface. Each BFM communicates with an independent C/C++ test thread. Transactions are mapped into SCE-MI messages and communicated through the shared SCE-MI software and hardware interface.

Figure 6 shows the throughput, in terms of SCE-MI messages per second, of the network processor test suite.

Figure 6 -- Throughput measured in transactions per second

The message throughput is much more consistent than the speedup ratios shown in Figure 1. This rate is dependent on the workstation CPU, so the consistent performance metric is Messages / Second / GHz-CPU (M/S/G). This throughput is a result of several factors, including the raw bandwidth of the hardware interface, the latency of the hardware interface, the throughput of the communication software layer, and the software thread switch time. None of these factors alone can give a meaningful prediction.
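
The M/S/G metric can be computed from a measured run and then used to estimate the message rate on a different workstation. This is a sketch under a stated assumption: that throughput scales linearly with CPU clock rate, which the per-GHz normalization suggests but the article does not prove. The function names and all numbers are hypothetical.

```python
def messages_per_second_per_ghz(messages: int, elapsed_seconds: float,
                                cpu_ghz: float) -> float:
    """Normalize a measured message count to Messages / Second / GHz-CPU."""
    if elapsed_seconds <= 0 or cpu_ghz <= 0:
        raise ValueError("elapsed_seconds and cpu_ghz must be positive")
    return messages / elapsed_seconds / cpu_ghz


def predicted_messages_per_second(msg: float, cpu_ghz: float) -> float:
    """Estimate raw throughput on another host.

    Assumes linear scaling with CPU clock rate (an assumption, not a measurement).
    """
    return msg * cpu_ghz


# Placeholder measurement: 1,200,000 messages in 60 s on a 2.0 GHz host.
msg = messages_per_second_per_ghz(1_200_000, 60.0, 2.0)
print(msg, predicted_messages_per_second(msg, 3.0))
```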

Mapping and prediction
A user considering or comparing emulator platforms can go through the following steps:

Targeting design decomposition. Determine the configurations and modes of operation required in emulation to meet the test goals. Identify the transaction level interface in the test environment, and the required BFMs.

Checking platform capacities. Estimate gate size, memory requirements, and clock domains in the DUT and BFMs, and determine how they will map to platform resources by hand or with design tools.
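
A first-pass capacity check can be as simple as comparing estimates against platform limits. The sketch below is illustrative only: the `Resources` fields, the function name, and the numbers are assumptions, and a real mapping also depends on memory shapes and clocking constraints that a simple comparison does not capture.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical summary of either design needs or platform limits."""
    gates: int
    memory_bits: int
    clock_domains: int


def capacity_shortfalls(required: Resources, available: Resources) -> list:
    """Return a description of every resource the design exceeds.

    An empty list means the design fits on these three counts.
    """
    shortfalls = []
    for name in ("gates", "memory_bits", "clock_domains"):
        need = getattr(required, name)
        have = getattr(available, name)
        if need > have:
            shortfalls.append(f"{name}: need {need}, have {have}")
    return shortfalls


# Placeholder estimates for DUT plus BFMs, and placeholder platform limits.
design = Resources(gates=3_000_000, memory_bits=64_000_000, clock_domains=6)
platform = Resources(gates=4_000_000, memory_bits=32_000_000, clock_domains=8)
print(capacity_shortfalls(design, platform))
```

A memory shortfall reported here is not necessarily fatal, since the article notes that DUT memory can also be held in dedicated emulator memories or on the host workstation.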

Analyzing transaction traffic. This step is based on the DUT and the test plan. For each interface on the DUT, determine the type of transaction traffic, the number of DUT cycles it represents, and the number required by the test suite.

  • In system-level functional verification, for example, it is desirable to know how long the test suite will take to run. For a device that has I ports, each tested with J flows per port, K packets per flow, L SCE-MI messages per packet, M loopback trips, and N tests in the test suite, on a platform that provides X messages per second, the test suite will take (I x J x K x L x M x N) / X seconds to run in regression.
  • For co-development of hardware and software, understanding the amount of software that can be debugged requires knowing how many emulated processor cycles/second will be delivered. For a device that has a processor bus on which a burst transaction represents L clocks, a data port with transactions that represent M clocks, and a DMA port with transactions that represent N clocks, the emulator will yield X / (1/L + 1/M + 1/N) emulated clock cycles/second, because each emulated clock requires 1/L + 1/M + 1/N messages when all three interfaces are active.
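
The two estimates above can be written as a short calculation. This is a sketch of the formulas as stated in the text; the function and parameter names are ours, and every number in the usage lines is a made-up placeholder rather than a result from the benchmark.

```python
def regression_seconds(ports: int, flows_per_port: int, packets_per_flow: int,
                       messages_per_packet: int, loopback_trips: int,
                       tests: int, messages_per_second: float) -> float:
    """Regression run-time: (I x J x K x L x M x N) / X."""
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    total_messages = (ports * flows_per_port * packets_per_flow *
                      messages_per_packet * loopback_trips * tests)
    return total_messages / messages_per_second


def emulated_clocks_per_second(messages_per_second: float,
                               clocks_per_transaction: list) -> float:
    """Emulated clock rate: X / (1/L + 1/M + 1/N).

    clocks_per_transaction holds one entry per active interface: the number
    of DUT clocks that a single transaction on that interface represents.
    """
    if not clocks_per_transaction:
        raise ValueError("at least one interface is required")
    messages_per_clock = sum(1.0 / clocks for clocks in clocks_per_transaction)
    return messages_per_second / messages_per_clock


# Placeholder inputs: 4 ports, 8 flows, 1000 packets, 2 messages per packet,
# 1 loopback trip, 50 tests, 10,000 messages per second.
print(regression_seconds(4, 8, 1000, 2, 1, 50, 10_000.0))
# Placeholder transaction lengths of 8, 64 and 256 clocks on three interfaces.
print(emulated_clocks_per_second(10_000.0, [8, 64, 256]))
```

Note how the shortest transaction dominates the second estimate: the interface whose transactions represent the fewest clocks needs the most messages per emulated clock, so lengthening its transactions has the largest effect on the emulated clock rate.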

This analysis requires that the project has produced a target design architecture and target test architecture. It gives results more accurate than comparisons of raw emulator specifications. Because performance can be predicted in advance of the implementation and porting effort, the use of emulation and selection of a platform can be an informed and low-risk decision.

Reference
[1] Test suite developed by Zaiq Technologies in C and HDL for a network processor SoC, mapped to a Zebu FPGA-based emulator from Emulation and Verification Engineering (EVE).

Lauro Rizzatti is EVE-USA CEO and marketing vice president. He has more than 30 years of successful experience in EDA and ATE, where he held various positions in companies such as Get2Chip, Synopsys, Mentor Graphics, Teradyne, Alcatel, and Italtel.

Damian Deneault is the vice president of engineering at Zaiq Technologies Inc., responsible for design and verification of SoCs, ASICs, FPGAs and systems. He has been developing and managing hardware development for over 18 years and has held various management positions at PictureTel Corp, Alliant Computer Systems Corp and Raster Technologies.
