CHAPTER 11

BRING UP AND
DEBUG: THE
PROTOTYPE IN THE
LAB

We have come a long way. We have chosen or built our FPGA platform; we have manipulated our design to make it compatible with FPGA technology and we are ready with bitstreams to load into the devices on the board. Of course, we will find bugs, but how do we separate the real design bugs from those we may have accidentally introduced during FPGA-based prototyping? This chapter introduces a methodical approach to bringing up the board first, followed by the design, in order to remove the classic uncertainty between results and measurement.

We can then use instrumentation and other methods to gain visibility into the design and quickly proceed to using our proven and instrumented platform for debugging the design itself.

During debug we will need to find and fix bugs and then re-build the prototype in as short an iteration time as possible. We will therefore also explore in this chapter more powerful bus-based debug and board configuration tools and running incremental synthesis and place & route tools to save runtime.

11.1. Bring-up and debug–two separate steps?

It is worth reminding ourselves of the benefits of prototyping our designs on FPGA. We use prototypes in order to apply high-speed, real-world stimulus to a design, to verify its functionality and then to debug and correct the design as errors are discovered. The latter debug-and-correct loop is where the value of a prototyping project is realized and so we would prefer to spend the majority of our time there. It is tempting to jump straight to applying the whole FPGA-ready design into the prototyping platform, but this is often a mistake because there are many reasons why the design may not run first time, as will be discussed later in this chapter. It is very difficult in this situation to determine what prevents a design from running first time, so a more methodical approach is recommended.

When the design does not work on a prototyping board, it could be because of two broad reasons. It could be because of problems related to the prototyping board setup, or due to problems in the design that is run on the board. Separating the debugging process for these two separate problems will lessen the whole debugging time.

For effective debug-and-correct activity it is critical to make sure that the bugs we see are real design issues and not manifestations of a faulty FPGA board or mistakes in the prototyping methodology. We therefore should bring up our design step-by-step in order to discover bugs in turn as we test first the board, then the methodology and finally the completed design in pieces and as a whole. This will take more time than rushing the whole design onto the boards, but in the long run will save time and help prevent wasted effort.

These bring-up steps can be summarized as follows:

So, it is always necessary to make sure that the FPGA board setup is correct before testing the real design on board. The next step is to bring up the design on board by making sure that the clock and reset signals are correctly applied to the design. After the initial bring up, the actual design validation stage would start. In this design validation stage, debugging the issues becomes easy when there is enough visibility to the design internals. The necessary visibility can be brought into the design using different instrumentation methodologies which will be discussed in detail in the later part of this chapter.

11.2. Starting point: a fault-free board

We should like to start this chapter from the assumption that the FPGA board or system itself does not have any functional errors, but is that a safe assumption in real life? As mentioned in chapter 5, this book is not intended to be a manual on high-speed board design and debug so we are not intending on going into depth on board fault finding. We assume that those who have developed their own FPGA boards in house will know their boards very well and would have advanced to the point where the boards are provided, fully working to the lab.

There is an advantage at this time for those using commercial boards because these would have already been through a full-production test and sign-off at the vendor’s factory. For added confidence, board vendors should also supply user-executable test utilities or equipment in order to repeat much of the factory testing in the lab. For example, a special termination connector or automatic test program to run either on the board or on a connected host. It is not paranoid to recommend that these bare-board tests are repeated from time-to-time in order to reinforce confidence, especially when the prototype is showing some new or unexpected behavior. A few minutes running self-test might save many hours of misdirected analysis.

As well as self-test, board vendors will often provide small test designs for checking different parts of the base boards. Even if in-house boards are in use, it is still advisable to create and collect a library of small test designs in order to help commission all the parts of the boards before starting the real design debugging work.

The kinds of tests applicable to a base board vary a great deal in their sophistication and scope. As an example of the low-end tests, the Synopsys® HAPS® boards have for many years employed a common interconnect system, called HapsTrak®. Each board is supplied with a termination block, called a STB2_1×1 and is pictured in Figure 133. The STB2_1×1 can be placed on each HapsTrak connector in turn in order to test for open and short circuits and signal continuity.

Figure 133 : STB2_1×1. An example of a test header for FPGA Boards

Image

Testing the board from external connectors in this way would be a minimum requirement. Beyond this, we might also want to use any available built-in scan techniques, additional test-points, external host-based routines etc. All of these are vendor- and board-specific and apart from the simple above example, we leave it to the reader to explore the particular capability of their chosen platform. For the purpose of this manual, we will start form the previously mentioned assumption that the board itself is functional before we introduce the design. However, there are other aspects of the board and lab set-up that may be novel and worth checking before the design itself is introduced.

Typical issues that would present in the prototyping board setup are problems in the main prototyping board and daughter boards themselves, connection between these different boards through connectors and cables, or issues related to external sources like clock and reset sources. These all come down to good lab practice and we do not intend to cover those in detail here.

11.3. Running test designs

The best way to check whether the FPGA board setup is correct is to run different test designs on the board and check the functionality. By running a familiar small design which has well-known results, the correct setup of the boards can be established. Experienced prototypers can often reuse small test designs in this way and in a matter of hours know that the boards are ready for the introduction of the real SoC design.

Board vendors may also be able to supply small designs for this purpose but in-house boards will require their own tests and be tested separately before delivery to the prototypers. Use of custom-written test designs to test these boards in the context of the overall prototype can then be performed. This also applies to custom daughter boards for mounting onto a vendor-supplied baseboard. For example, if a custom board is made to drive a TFT display then a simple design might be written to display some text or graphics on a connected display in order to test the custom board’s basic functionality and its interface with the main board.

In either case, it is recommended to connect all the parts of the prototype platform including the main board, daughter boards, power supply, external clocks, resets, etc. as they will be used during the rest of the project. Then configure the assembled boards to the exact setting with which the actual design will employ during runtime, including any dip switches and jumpers. Sometimes these items can be controlled via a supervisory function which will access certain dedicated registers on the board under command of a remote-host program. It is especially important to configure any VCO or PLL settings and the voltage settings for each of the different IO banks of the FPGAs. Off-the-shelf prototyping boards like the HAPS board provide an easy way of configuring this using dip switches and on-board supervisory programs.

We are then ready to run some pre-design tests to ensure correct configuration and operation of the platform. Here are some typical test designs and procedures which the authors have seen run on various prototyping boards.

11.3.1. Filter test design for multiple FPGAs

One example of a simple-to-partition design which tests many aspects of a multi-FPGA design configuration is the FIR shown in Figure 134 which would serve to test four FPGAs on a board.

Figure 134 : A 4-tap pre-loadable transposed FIR filter test design

Image

An FIR, being a fully synchronous single-clock design helps to check that reset timing is correct and that clock skew is within an acceptable range.

Figure 135 : RTL code for FIR filter test design

Image

The RTL for this design is seen in Figure 135 and this may be synthesized and then split across four FPGAs either manually or by using one of the automated partitioning methods discussed earlier. The values of the coefficients, although arbitrary, should all be different.

The design is partitioned so that each tap (i.e., multiplier with its coefficient and adder) is partitioned into its own FPGA, as shown in Figure 136. Care should be taken with FPGA pinout and use of on-board traces to ensure that connectivity between the FPGAs is maintained.

Figure 136 : FIR filter test design partitioned across 4 FPGAs

Image

A logic analyzer is connected at the output and a pattern generator is connected at the input. Once the design is configured into the FPGAs then it can be tested by applying an impulse input as shown in Figure 137. The pattern generator can be used to apply the impulse input. An impulse input consists of a “1” sample followed by many “0” samples. i.e., for 8-bit positive integer values, then the input would be 01h for one clock followed by 00h for every clock thereafter. As an alternative to using an external pattern generator, a synthesizable “pattern generator” can be designed in one of the FPGA partitions to apply the impulse input. This is a simple piece of logic to provide a logic ‘1’ on the least significant bit of the input bus for a single clock period after reset.

The expected output from FPGA4 should be the same as the chosen filter coefficients and they should appear sequentially at FPGA4’s output pins as the pulse has been clocked through the filter (i.e., one clock after impulse input has applied in this 4-tap example), as shown in Figure 137.

11.3.2. Building a library of bring-up test designs

As the team becomes more familiar with FPGA-based prototyping, common test designs will be reused for different prototypes. It is advisable to create a way to share test designs amongst team members and across different projects. Creating a reusable test infrastructure with a library of known good tests usable on different boards is an investment that will benefit multiple prototyping projects and increase efficiency of our FPGA-based prototyping. This library could be considered in advance, or gradually built up with each new prototyping project.

Figure 137 : Expected FIR behavior across four FPGAs

Image

By running such simple designs on the prototyping board setup, the whole setup is tested for its basic functionality and we get comfortable with the FPGA-based prototyping flow. After running few designs on the board, we would know how to partition a design, synthesize and place & route them, create bit files, program the FPGAs with the bit files and do some basic debugging on the board. This knowledge would help us while debugging the real design on the board setup.

By testing this design on board, we become familiar with the complete flow of partitioning a design across multiple FPGAs, managing the connections between them and running them on board. In fact, during early adoption of FPGA-based prototyping by any project team, this type of simple design is good for pipe-cleaning methods to be used in the whole SoC design later.

11.4. Ready to go on board?

At this stage, some teams will decide to introduce the SoC design onto the FPGA board. It may have seemed like a long time to finally reach this stage but an experienced prototyping team will perform these test steps discussed in section 10.1 in a few hours or days at most. As mentioned, a piecemeal approach to bring-up may save us a great deal of false debug effort.

There is one further step which is recommended for first-time prototypers or any team using a new implementation flow and that is to inspect the output of the implementation flow back in the original verification environment. Our SoC design has probably undergone a number of changes in order to make it FPGA-ready, not least, partitioning into multiple devices. There may be other more subtle changes that may have crept in during the kind of tasks described in chapter 7 of this book. How can we check that the design is still functionally the same as the original SoC, or at least only includes intentional changes? The answer is to reuse the verification environment already in use for the SoC design itself.

11.4.1. Reuse the SoC verification environment

As mentioned in chapter 4, the RTL should have been verified to an acceptable degree using the simulation methods and signed off as ready for prototyping by the RTL verification team. Their “Hello World” test will probably have already been run upon the RTL and it is very helpful if this same test harness can be reused on the FPGA-ready version of the design.

We have seen that by reusing the existing SoC test bench we can identify differences between the behavior of the design before and after the design is made ready for the prototype. It is important that any functional differences between the SoC RTL and the FPGA-ready version can be accounted for as either intentional or expected. Unexpected differences should be investigated as they may have been introduced while we were preparing the prototype, often in the implementation process.

11.4.2. Common FPGA implementation issues

Despite all our best efforts, initial FPGA implementations can be different to the intended implementation. The following list describes common issues with initial FPGA implementations:

Let’s consider each of these types of implementation issues in more detail.

11.4.3. Timing violations

The design may have timing violations within an FPGA or between FPGAs. It is always advisable to run the static timing analysis with proper constraints on the post place & route netlist and make sure that all timing constraints are met. Even though FPGA timing reports consider worst case process, voltage and temperature (PVT) conditions, if timing constraints as reported by the timing analysis are not met, there is no guarantee the design will work on board.

Here are some common timing issues we might face while analyzing timing on a design and some tips on how to handle them:

Figure 138 : Typical FF arrangement in an FPGA IO cell

Image

We can see from these examples above that thorough timing analysis of the whole board is very useful and can discover many possible causes of non-operation of a prototype before we get to the stage of introducing the design onto the board. Any of these timing problems might manifest themselves obscurely on the board itself and potentially take days to uncover in the lab.

11.4.4. Improper inter-FPGA connectivity

Another difficult-to-find problem is improper connectivity between the FPGAs on the board. That is, signals between FPGAs which are misplaced so that the source and destination pins are not on the same board trace. Keeping correct contiguity between FPGA pins should be a matter of disciplined specification and automation. However, as the excerpt from a top-level partition view in Figure 139 shows us (or rather scares us), there will be thousands of signal-carrying pins on the FPGAs in a typical sized SoC prototype. If any one of these pins is misplaced (i.e., the signal is placed on the wrong pin) then the design will probably not behave as expected in the lab and the reason might prove very hard to find.

If this happens then it is most likely during a design’s first implementation pass or after major RTL modification or addition of new blocks of RTL at the top level.

Figure 139 : Certify® Partition View illustrating complexity of signals between FPGAs

Image

Pin location constraints are passed through different tools in the flow, usually from the partition tool to the synthesis tool to the place & route tool. If any of the different tools in the flow drops a pin location constraint for some reason then the place & route tool will randomly assign a pin location with only low probability that it will be in the correct place. The idea of pin placements being “lost” might seem unlikely but when a design stops working, especially after only a small design change, then this is a good place to look first.

Possible reasons for pin misplacement include accidentally setting a tool control to not forward annotate location constraints. Another mistake is to rely on defaults that change from time-to-time or over tool revisions. A further example happens when we add some new IO functionality but neglect to add all of the extra pin locations. In every case, however, we will notice the missing pin constraints if we look at the relevant reports.

We should carefully examine the pin location report generated by the place & route tools for each FPGA and compare against the intended pin assignments. In the case of the Xilinx® place & route tools, this is called a PAD report. Human inspection of such reports, which can be many thousands of lines long, might eventually lead to an error so some scripting is recommended. The script would open and read the pin location reports and look for messages of unplaced pins but while doing so, could also be looking for other key messages, for example, missed timing constraints or over-used resources. In the case of the pin locations, we might maintain a “golden” pin location reference for all the FPGAs so that the script can automatically compare this against the created PAD reports after each design iteration. This would add an extra layer of confidence on top of looking for the relevant “missing constraint” message. In addition, we could set up other scripted checks in order to compare any pin location information at intermediate steps in the flow, for example, between the location constraints passed forward by partition/synthesis to the place & route, and those present in the final PAD files.

It is worth noting that if a commercial partitioning tool is used then inter-FPGA pin location constraints should be generated automatically and consistently by the tool, perhaps on top of any user-specific locations. In the case of Certify, if the connectivity in the raw board description files are correct and the assignments of all the inter FPGA signals to the board traces are complete then there is a very minimal risk that the pin locations will be lost of misplaced by the end of the flow.

On the other hand, if we manually partition the design, then we need to carefully assign the pin location constraints for all the individual inter-FPGA connections as well as make sure that these are propagated correctly to the place & route stage, which can become a tedious and therefore error-prone task if not scripted or otherwise automated.

If all this attention to setting and checking pin locations seems rather paranoid then it is worth remembering that there might be thousands of signals crossing between FPGAs. In the final SoC, these signals will be buried within the SoC design and the verification team will go to great lengths to ensure continuity both in the netlist and in the actual physical layout. A single pin location misplacement on one FPGA would be as damaging as a single broken piece of metallization in an SoC device and perhaps harder to find (albeit easier to fix). Therefore it pays to be confident of our pin locations before we load the design into the FPGAs on the board.

11.4.5. Improper connectivity to the outside world

As well as between FPGAs, we need the proper connectivity between the FPGAs and any external interfaces. Some of the simplest but most crucial external interfaces are clock and reset sources, configuration ports and debug ports. The more sophisticated connections would be interfaces to external components like DDR SDRAM, FLASH and other external interfaces.

Once again, the implementation tools rely on there being correct and complete information about physical on-board traces between the FPGAs and from the FPGAs to the external daughter cards which may house the DDR etc. On a modular system, with different boards being connected together, the board description will be a hierarchy of smaller descriptions of the sub-systems and wit h a little care, we can easily ensure that the hierarchy is consistent and that the sub-boards are themselves correctly described. It may be worthwhile to methodically step through the board description and compare it to the physical connection of the boards, however, some systems will also allow this to be done automatically using scan techniques to interrogate each device on the boundary scan chain and check that they appear where they should, according to the board description.

A commercial board vendor will be able to provide complete connection information for the FPGA boards and for their connection to daughter boards for this purpose. In such a board description, inter-FPGA connections on the board will be labeled generically because they might conceivably carry any signal (or signals) in the design.

For connections to dedicated external interfaces, however, the signal on each connection is probably fixed e.g., a design signal called “acknowledge” must go to the “ack” pin on the daughter card’s test chip. In those cases, it can help to also give the traces in the board description the same meaningful name to make it is easier to verify the connectivity to external interfaces and match signals to traces. So for example, we can easily check that a trace called “ack” on the board is connected to a pin called “ack” on the daughter card and is carrying a signal called “ack” from the design. This naming discipline is especially useful for designs with wide bus signals.

Some advanced partitioning tools will be able to recognize that signals and traces, or signals and daughter-card pins, have the same name and therefore quickly make an automatic assignment of the signal to the correct trace, which if the board description is correct, will automatically also assign the signal to the correct FPGA pin(s). All of our pin assignments would therefore be correct by construction.

11.4.6. Incorrect FPGA IO pad configuration

When connecting various components at the board level, care must be paid to the logic levels of all interconnecting signals and be sure they are all compatible at the interface points. As well as having the correct physical connection, the inter-FPGA signals and those between the FPGAs and other components must be swinging between the correct voltage levels. As we saw in chapter 3, FPGA cores run at a common internal voltage but their IO pins have great flexibility and can support multiple voltage standards. The required IO voltage is configured at each FPGA pin (or usually for banks of adjacent pins) to operate with required IO standards and is controlled by applying the correct property or attribute during synthesis or place & route.

This degree of flexibility must be controlled because a pin driving to one voltage standard may be misinterpreted if interfacing with a pin set to receive in a different standard. This may seem obvious but as with the physical connection of the pins, the scale of dealing with voltage standards on thousands of signals adds its own challenge. For all the inter-FPGA connections, the IO standards of the driving FPGA pin and the driven FPGA pin should be the same.

The partition tools should automatically take care of assigning the same IO standards for connected pins, or warning when they are not. The default IO standard for the synthesis and place & route may also suffice for inter-FPGA connections but it is not recommended to rely only on default operation. In addition, care should be taken when the design is manually partitioned.

The board-level environment may also constrain the voltage requirement and we will need to move away from the default IO voltage settings. Then there are special considerations for differential signals compared with single-ended.

Let’s look at some of these board-level voltage issues here:

11.4.6.1. NOTE: run scripts to check IO consistency

Although most of the inter-FPGA IO considerations above will be managed by the partitioning tools and therefore consistent by construction, we should maintain a “golden” reference for IO standard, voltage, drive current and placement for the critical pins on the FPGA and certainly between the FPGAs and external peripherals. We can then compare the golden reference against the created PAD report for each FPGA, however, it is wrong to have to make repetitive manual checks on every iteration of the design, so it is worth taking the time to script these kinds of checks.

We have mentioned how scripts can be used to automate these post-implementation, pre-board checks. Tools will have their own commands for generating reports on a large number of details, including top-level ports. A script can make mult iple checks on such reports in the same pass, for example verifying that the IO standards could be combined with the pin location constraints, checking that every top-level port or partition boundary signal has an equivalent entry in the FPGA location constraints. Automatically running these scripts in a larger makefile process will prevent running on into long tool passes using data that is incomplete, and wasting a lot of time as a result.

11.5. Introducing the design onto the board

Once we are confident that the platforms/boards are functional and we are confident of the results of our implementation flow then we can focus on introducing the FPGA-ready version of the SoC design onto the board.

This first introduction should be done in stages because it is unlikely that every aspect of the design’s configuration will be correct first time. Some configuration errors have an effect of masking others and so bringing up the prototype in stages helps to reduce the impact of such masking errors.

After checking the prototyping board setup with some test designs and verifying that there are no FPGA implementation issues, the actual design to be prototyped can be taken to the board. Here are some useful steps to follow:

11.5.1. Note: incorrect startup state for multiple FPGAs

A common issue with multi-FPGA systems is the improper startup state. In particular, if the FPGAs do not all start operating at the same time, even if they are only one clock edge apart, then the design is very unlikely to work.

Consider the simple case of a register-to-register path sharing a common clock. If the source and destination registers are partitioned into separate FPGAs and the receiving FPGA becomes operational just one clock cycle after the sending FPGA, then the first register-to-register transmission will be lost. Conversely, if the sending FPGA becomes operational one clock later than the receiving register, then the first register-to-register data may be random and yet still be clocked through the receiving FPGA as valid data. This simple example shows how there might be coherency problems across FPGA boundaries right from the start. These startup problems can also occur between FPGAs and external components such as synchronous memories or sequenced IOs.

A number of factors determine the time it takes for an FPGA to become operational after power up or after an external hard reset is applied, and it can vary from one FPGA to another. Prime examples are clocking circuitry (PLL/DCM) that use an internal feedback circuit and can take a variable amount of time to lock and provide a stable clock at the desired frequency. In addition, any hard IO such as Ethernet PHY or other static hardware typically becomes operational much sooner than the FPGAs as they do not need to be configured.

When the design shows signs of life but not as we know it, then one place to look is at this startup synchronization.

The remedy for the synchronized startup condition issue is a global reset management, where all system-wide “ready” conditions are brought to a single point and are used to generate a global reset. This is covered in detail in chapter 8.

11.6. Debugging on-board issues

After giving ourselves confidence that the FPGA board is fully functional and configured correctly, and also having checked the implementation and timing reports for common errors, we can consider that our platform is implemented correctly. From here on, any functional faults in the operation of the design will probably be bugs in the design itself that are becoming visible for the first time. The prototype is starting to pay us back for all our hard work.

The severity and number of the bugs discovered will depend upon the maturity of the RTL and how much verification has already been performed upon it. We shall explore in the remainder of this chapter some common sources of faults.

11.6.1. Sources of faults

Every design is different and we cannot hope to offer advice in this manual on which parts of the design to test first or which priority to place on their debug. However, we can offer some guidance on ways to gain visibility into the behavior and some often-seen on-board problems.

Assuming the design was well verified before its release to use, we should be looking for new faults to become evident because the design on the bench is exposed to new stimulus, not previously provided by the testbench during simulation. The cause of such faults can be found in three main areas, as listed in Table 29.

Table 29: Three main kinds of SoC bug discovered by FPGA-based Prototyping

Image

Type 1 bugs should be caught by the normal RTL verification plan and as mentioned previously, an FPGA-based prototype is not the most efficient way to discover RTL bugs, especially when compared to an advanced verification methodology like VMM. Nevertheless, RTL bugs do creep into the prototype either because there is an unforeseen error exposed by real-world data or because the RTL issued to the prototyping team is not fully tested.

Type 2 bugs are related to the SoC’s interface with external hardware. All SoC’s are specified to be used in a final product or system but there is a diminishing return on making specifications ultra-complete. Eventually there may be some combination of external components that do not fit within the spec and the design needs to be altered to compensate. This is often the case when extensive use of IP means that the SoC is the first platform in which a certain combination of IP has ever been used together. Alternatively, the design itself may be an IP block that is being specified to run in a number of different SoC designs for different end-users. Predicting every possible use of the IP is hard and corner cases are often discovered during the prototyping stage.

Type 3 bugs are the most valuable for the prototyping team to find. The prototype is the first place where the majority of the embedded software runs on the hardware. The interface between software and hardware is very complex having time dependency, data dependency and environmental dependency. These dependencies are difficult to model properly in the normal software development and validation flows, therefore on introduction to real hardware running at (or near) real-speed, the software will tend to display a whole new set of behaviors.

Any particular fault may be a combination of any two or indeed all three types of bug, so where do we start in debugging the source of a fault?

11.6.2. Logical design issues

Assuming that our newly found bug is not a known RTL problem, already discovered by the verification team since delivering the RTL for our prototype (always worth checking), we now need to capture the bug. We need to trace enough of the bug-induced behavior in order to inform the SoC designers of the problem and guide their analysis. This means progressively zeroing-in on the fault to isolate it in an efficient and compact form for analysis, away from the prototype if necessary.

The first step in identifying the fault is to isolate it to the FPGA level. To accomplish this, we need visibility into the design as it is running in the FPGAs. As mentioned in chapters 5 and 6, it is good practice to provide physical access to as many FPGA pins as possible so that they can be probed with normal bench instruments such as scopes and logic analyzers. Test points or connectors at key pins will hopefully have been provided by the board’s designers for checking key signals, resets, clocks, etc. However, most FPGA pins will usually be unreachable, being obscured as they are by the FPGAs ball grid array (BGA) package and therefore we need some other form of instrumentation.

The simplest type of instrumentation is to sample the pins of the FPGA using their built-in boundary scan chains, often referred to by the name of the original industry body that developed boundary scan techniques, JTAG (Joint Test Action Group). FPGAs have included JTAG chains for many years, primarily to assist test during manufacture and ensuring sound connection for BGAs. JTAG is most commonly known to FPGA users as one of the means for configuring the FPGA devices via a download cable. JTAG allows the FPGA’s built-in scan chains to be used to drive values from the FPGA pins internally to the device and externally to the surrounding board. Through intelligent use of the JTAG chains and intelligent triggering of the sample by visible on-board events, a JTAG chain can capture a series of static snapshots of the pin values, helping to add some visibility to a debug process.

Greater visibility at FPGA boundaries or at critical internal nodes is provided by instrumentation tools such as ChipScope and Identify as described in chapter 3. As with any of these tools, there is an inverse correlation between how selective we are in our sampling ‘vs’ the amount of sample data. Since data is usually kept in any unused RAM resources available in the FPGA, we should expect that we will not be able to capture more than a few thousand samples of a few thousand nodes in the FPGA. Therefore we need to use some of our own intelligence and debug skills in order focus the instrumentation on the most likely sites of the fault and its causes.

A good Design-for-Prototyping technique is for the RTL writers to create a list of the key nodes in their part of the design i.e., a “where would you look first” list. This list would be a useful starting point for applying our default instrumentation.

11.6.3. Logic debug visibility

There is a traditional perception that FPGA-based prototyping has low productivity as a verification environment because it is hard to see what is happening on the board. Furthermore, the perception has been that even when we can access the correct signals, it is difficult to relate that back to the source design.

It is certainly true that FPGA-based prototypes have far lower visibility than a pure RTL simulator, however, that may be the wrong comparison. The prototype is acting in place of the final silicon and as such it actually offers far greater visibility into circuit behavior than can be provided from a test chip or the silicon itself. Furthermore the focus of any visibility enhancement circuits can be changed, sometimes dynamically.

As a short recap on the debug tools explored in chapter 3, we can gain visibility into the prototype in a number of ways; by extracting internal signals in real-time and also by collecting samples for later extraction and analysis.

Real-time signal probing: in this simplest method of probing designs’ internal nodes, we directly modify the design in order to bring internal nodes to FPGA pins for real-time probing on bench instruments such as logic analyzers or oscilloscopes. This is a common debugging practice and offers the benefit that, in addition to viewing signals’ states, it is also easier to link signal behavior with other real-time events in the system.

Embedded trace extraction: This approach generally requires EDA tool support to add instrumentation logic and RAM in order to sample internal nodes and store them temporarily for later extraction and analysis. This method consumes little or no logic resource, very little routing resource and only those pins that are used to probe the signals.

We shall also look at two other ways of expanding upon debug capability, especially for software:

Bus-based instrumentation: some teams instantiate instrumentation elements into the design in order to read back values from certain key areas of the design, read or load memory contents or even to drive test stimulus and read back results. These kinds of approaches are often in-house proprietary standards but can also be built upon existing commercial tools.

Custom debuggers: parts of the design are “observed” by extra elements which are added into the design expressly for that purpose. For example, a MicroBlaze™ embedded CPU is connected onto the SoC internal bus in order to detect certain combinations of data or error conditions. These are almost always user-generated and very application-specific.

11.6.4. Bus-based design access and instrumentation

The most commonly requested enhancements to standard FPGA-based prototyping platforms are to increase user visibility and access to the system, including remote access. We have seen how tools like Identify® and Xilinx® ChipScope tools offer good visibility into the prototype but these communicate with their PC-hosted control and analysis programs via the FPGA’s JTAG port. As mentioned, the bandwidth of the JTAG channel can limit the maximum rate that information can be passed into or out of the prototype. If we had a very high bandwidth channel into the prototype, what extra debug capability would that give us?

Debug situations where higher bandwidth would be useful include:

We can provide a higher bandwidth interface by using a faster serial connection or by using a parallel interface, or even a combination of the two. Very fast serial interfaces from a host directly into FPGA pins on a board is a difficult proposition but we could envisage a USB 2.0 connection and embedded USB IP in the design which might be used for faster communication, replacing the standard JTAG interface.

A simpler approach used by a number of labs is the instantiation in the design of blocks which can receive data in a fast channel and distribute it directly into parts of the design via dedicated fast buses. The instantiated block could be a simple register bank into which values can be written which then override existing signals in the design. Another use of a bus-based debug block might be to act as an extra port into important RAMs in the design, for example, changing a single-port RAM into a dual-port RAM so that the extra port can be used to pre-load the RAM.

The advantages of this kind of approach are clear, but even with the use of our chosen hardware description language’s cross-module references this might mean some changes to the RTL. However, much of this change could be automated, or inserted after synthesis by a netlist editor. It might even be adopted as a company-wide default standard, much as certain test or debug parts are added into SoC design for other purposes during silicon fabrication.

These advanced communication and debug ports are traditionally the reserve of tools which work on emulator systems but are starting to become more common in FPGA-based prototypes as well. One example of this is the Universal Multi Resource bus (UMRBus®) interface originally developed by one of the authors of this book, René Richter, along with his development director at Synopsys, Heiko Mauersberger (in fact, their names were the original meaning of the M and the R of UMRBus).

UMRBus, as the new name suggests, is a multi-purpose channel for high-bandwidth commutation with the prototype. It works by placing extra blocks into the design and linking them together via a bus-based protocol which also communicates back to PC-host. There are a number of blocks and other details which we will not cover here, but at the heart of the UMRBus is a simple block called a client application interface module, or CAPIM, a schematic of which is shown in Figure 140.

As we can see in the diagram, a CAPIM appears as a node on a ring communication channel, the UMRBus itself, which carries up to 32 bits of traffic at a time. We can choose a width which best suits the amount of traffic we want to pass. For example, for fast upload of a many megabytes of software image, we might use a wider bus but for setting and reading status registers in the design a 4-bit bus might suffice, saving FPGA resources. Each CAPIM is connected into the design, often using XMRs to save boundary changes, to any point of interest.

Figure 140 : Client application interface module (CAPIM) for UMRBus®

Image

In Figure 141, we can see the use of three CAPIMs on the same UMRBus, each offering access and control of a different part of the prototype. In this example, UMRBus is allowing read and load of a RAM, of some simple registers and test points or to allow reprogramming of the book code in the design.

These functions might all be in the same FPGA or spread across the board across different segments of a UMRBus. Similar in-house proprietary bus-based access systems should also be designed to allow for multi-chip access and cross-triggering.

Figure 141 : UMRBus with three CAPIMs connected into various design blocks

Image

11.6.5. Benefits of a bus-based access system

Some labs have developed their own variations on this bus-based approach and each will have its own details of operation, however, the general aims and benefits are the same as those listed above, so let’s explore how this extra capability can help the FPGA-based prototyping project.

In very advanced cases, the bus-based access can become a transport medium for protocol layers but this might be a large investment for most project-based labs, therefore we might expect these types of advanced use modes to be bought in as proven solutions from commercial board and tool vendors.

Once we start to employ the most sophisticated methods for debug and configuration of the system, we might begin to explore other use modes including:

11.6.6. Custom debug using an embedded CPU

Most of the SoC designs which have embedded processors would also have built-in custom debuggers. These custom debuggers would be used to connect to the processor bus present in the SoC designs, usually to load software programs to be run on the processor, debug the software programs by single stepping or breakpoints and access all the memory space of the processor bus. These custom debuggers will be present inside the design which will be connected to the external world through some dedicated IO ports. Usually a debugging software utility running on a laptop will be connected to these FPGA IO ports using some hardware connected through USB or other IO ports as shown in Figure 142. Using the software utility we can load programs into the program memory, debug the software and access the memory space in the SoC.

Figure 142: Software debugger linked into FPGA-based prototype

Image

If the design to be prototyped has this debugger, then it can be used to debug the design if it doesn’t work on the board. We should take the dedicated IO ports coming from the inbuilt debugger out of the prototyping board using IO boards or test points. Then the debugging software utility can be connected to these IO ports. After programming the FPGAs, the software utility can be asked to connect to the internal debugger present in the design. Upon successful connection to the internal debugger, we can try to read and write to all the different memory spaces in the processor bus. Then a simple program can be loaded and run on the processor present in the design. Finally the actual software debugging can happen over the real design running on the prototyping board.

11.7. Note: use different techniques during debug

Some might say that on-the-bench debug of any design, not just FPGA-based prototypes, is more intuition that invention. Debug skills certainly are hard to capture in a step-by-step methodology, however, it is a good start to put a range of tools in the hands of experienced engineers. On the bench, it is common to find a combination of these different debug tools in use, such as real-time probing, non real-time signal tracing and customer debuggers.

For example, if a certain peripheral like a TFT display driver in a mobile SoC chip is functioning improperly then we might try to debug the design though the custom debugger first. We could read and write a certain register embedded in the specific peripheral through the customer debugger to check basic access to the hardware. Then we can try to tap onto the internal processor bus through Identify instrumentor and debugger by setting the trigger point as the access to the address corresponding to the register. With the traced signals we can identify whether proper data is read and written into the register. Then we can connect the oscilloscope or logic analyzer on the signals which are connected to the TFT display to check whether the signal are coming appropriately when a certain register is read and written. Of course, all of these steps might be performed in a different order as long as we are successful. With this combined debug approach we can debug most of the problems in the design.

11.8. Other general tips for debug

One of the aims of the FPMM web forum which accompanies this book is to share questions and answers on the subject. The authors fully expect some traffic in the area of bug hunting in the lab and to get us started, here is a miscellaneous list, in no particular order, of some short but typical bug scenarios and their resolutions.

Many thanks to those who have already contributed to this list.

These are a few of the ideas that Synopsys support staff and end-users have shared over many years of successful prototyping and at time of writing, the authors look forward to hearing many more as the FPMM website goes live.

11.9. Quick turn-around after fixing bugs

Debugging can become an iterative process and given that the time taken to process a large SoC design through synthesis and place & route might be many hours, if we find ourselves wasting a lot of time just waiting for the latest build in order to test a fix, then we are probably doing something wrong.

Here are three approaches we recommended in order to avoid this kind of painful “debug-by-iteration.”

A debug plan for a prototype is much like a verification plan for the SoC design as a whole. As a full rerun of the prototype tool chains for large designs might take more than a whole day, we need to have clear objectives for the build, which we should aim to meet before re-running the next build. A debug plan sets out the parts of the design which will be exercised for each build and also a schedule of forthcoming builds. This is particularly useful when multiple copies of the prototype are created and used in parallel. Revision control for each build and documentation of included bug fixes and other changes are also critical. This is all good engineering advice, of course, but a little discipline when chasing bugs, especially when time is short, can sometimes be a rare commodity. With a good debug plan in place and a firm understanding of the aims and content of each build, a regime of daily builds can be very productive.

A day’s turn-around is an extreme example and in some cases it would be much less than that. For example, a bug fix that requires a change to only one FPGA will not require the entire flow to be rerun. The partitioner may work on the design top-down, but for small changes, the previous partition scripts and commands will still be valid. As mentioned in chapter 4, a pre-synthesis partitioning approach helps to decrease runtime in this case because synthesis and place & route take place only on the FPGA that holds the bug fix.

To achieve a faster turn-around on small changes we can use incremental flows and we shall take a closer look at these now.

11.9.1. Incremental synthesis flow

As mentioned in chapter 3, both synthesis and place & route tools have the ability to reuse previous results and only process new or changed parts of the design. Considering synthesis first, the incremental flow in most tools was created to reduce the overall runtime taken to re-synthesize a design when only a small portion of the design is modified. The synthesis will compare parts of the design to the same parts during the previous run. If they are identical then the tools will simply read in the previous results rather than recreate them. The tool then maps any remaining parts of the design into the FPGA elements instead of the entire design, combining the results with those saved from the previous runs of the unchanged parts in order to complete the whole FPGA. This incremental approach can save a great deal of time and takes relatively small effort to set up.

In Synopsys FPGA tools, incremental synthesis is supported by the compile point synthesis flow. In the compile point flow, we manually divide the design into a number of smaller sub-designs or compile points (CPs) that can be processed separately. This does not require any RTL changes but is controlled by small changes in the project and constraint files only, and can even be driven from a GUI.

A CP covers the subtree of the design from that point down although CPs can also be nested as we can see in Figure 143.

Figure 143: possible arrangement for Compile Points during FPGA Synthesis

Image

The design can have any number of compile points, and we do not have to place everything into CPs because the tool makes the top-level a CP by default. Therefore the simplest approach might be to define CPs on all the modules which are frozen or are not expected to change, allowing the tools to simply reuse the previous results for that CP.

Since each CP will be synthesized separately, we must define a separate constraint file for each one which will include many of the same clock and boundary constraints that we would need if the CP block was being assembled in a bottom-up flow, or as a standalone FPGA. Details are of constraints available in the references but the aim is that for a small up-front effort in time budgeting or by limiting our CP boundaries to registered signals, we can dramatically reduce our turn-around time.

The tool needs to be able to spot when the contents of a CP have changed it achieves this by noticing when the CP’s RTL source code logic or its constraints have changed. Changes to some implementation options (such as retiming) will also trigger all CPs to be re-synthesized.

There is no such thing as a free lunch, however, and in the case of incremental flows the downside is that there may be a negative impact on overall device performance and resource usage. This is because cross-boundary optimizations may be arrested at CP boundaries and so some long paths may not be as efficiently mapped as they would have been had the whole design been synthesized top-down. In Synopsys FPGA synthesis this can be mitigated to some degree and the amount of boundary optimizations across the CP boundary can be controlled by setting the CP’s type as either “so ft,” “hard” or “locked.”

Another reason that incremental flows might not yield the highest performance results is that the individual timing constraints for each CP may not be accurate as they rely on estimated timing budgeting or user intervention. Conversely, this might be seen as an advantage since we might purposely choose to tighten or relax the constraints for each CP in order to focus the tool’s effort on more difficult-to-achieve results for certain parts of the design, while relaxing others to save area or runtime.

Readers might be struck by the similarity between CP incremental synthesis and traditional bottom-up synthesis of a design block-by-block. However, the scripting, partitioning and resource management involved in a traditional bottom-up flow is sometimes seen as too much of an investment for a prototyping team. There is also the problem of tracking exactly which files are to be re-synthesized in the new build. In a design of thousands of RTL files, automation is very desirable.

11.9.2. Automation and parallel synthesis

If the workstation upon which the FPGA synthesis is running has multiple processors then the synthesis task can be spread across the available processors, as shown in Figure 144.

With such multiprocessing enabled, multiple CPs can be synthesized simultaneously further reducing the overall runtime of synthesis even for the first run or complete re-builds when all CPs are re-synthesized. By using the automatic CPs and multiprocessing options together, we can reduce the overall runtime of the synthesis during the first time as well as the subsequent incremental iterations.

Figure 144: Automatic compile points and synthesis on multiple CPUs

Image

A typical SoC design would need a number of CPs to be defined with individual constraint files in order to achieve a significant reduction in the synthesis runtime. Again, if the up-front investment for an incremental flow is too great or requires too much maintenance (e.g., in scripts) then the flow becomes less attractive. Indeed, defining a large number of CPs and creating constraints files for each may even be seen as too time consuming. To overcome this apparent hurdle, Synopsys FPGA synthesis is able to create and use CPs automatically.

When automatic CPs are used, the tool can analyze a design and identify modules that can be defined as CPs. The CP’s timing constraints can also be automatically budgeted from the top-level constraints. This eliminates the need for us to define a separate constraint file for each defined CP.

This is a relatively new technology and is a useful weapon for trading off runtime against overall performance.

11.9.3. Incremental place & route flow

Incremental flow in place & route can be achieved using design preservation flow in the Xilinx® back-end tool, the flow for which is seen in Figure 145.

Design preservation is one of the hierarchical design flows supported in the Xilinx® back-end tool. In the design preservation flow, the designs are broken into blocks referred to as partitions. Partitions are the building blocks of all the hierarchical design flows supported in the Xilinx® back-end tool. Partitions create boundaries

Figure 145: Design preservation flow in Xilinx® ISE® tools around the hierarchical module instances so that they are isolated from other parts of the design.

Image

A partition can either be implemented (mapped, placed and routed) or its previous preserved implementation can be retained, depending on the current state of the partition. If a partition’s state is set as “implement,” then the partition will be implemented by force. And if the partition’s state is set as “import,” then the implementation of that partition from the previous preserved implementation will be retained.

When a design with multiple partitions is implemented in the Xilinx® back-end tool for the first time, the state of all the partitions must be set to “implement.” For the subsequent iterations, the state of the partition should be set either to “import” if the partition has not changed or “implement” if it has changed. This way, only the changed modules are re-implemented and the implementations of all the partitions that have not changed are retained. This saves significant time in the implementation of the entire design.

11.9.4. Combined incremental synthesis and P&R flow

Figure 146: Combined synthesis and place & route incremental flow

Image

By combining incremental flows for synthesis and for place & route we can gain our maximum runtime reduction. In fact, since place & route runtime is typically much longer than synthesis runtime, setting up incremental synthesis without incremental place & route would not really improve turn-around time that much.

With this in mind, Synopsys FPGA synthesis has been created to interleave its compile point flow automatically with the Xilinx® design preservation flow, as shown in Figure 146. In this integrated flow, the type of CP in the synthesis flow should be set to “locked, partition,” so that it will be marked as a partition during the place & route stage. When the integrated flow is enabled, the Synplify® Premier software automatically writes the states of the partitions as “implement” or ‘import” as appropriate.

By using this integrated flow, as summarized in Figure 146, the overall turnaround time to synthesis, place & route and design can be significantly reduced.

11.10. A bring-up and debug checklist

Table 30: A step-by-step approach to bring-up and debug of the prototype in the lab

Image

We have covered a lot of ground in this chapter but there will be very many on-the- bench debug scenarios not covered here. Every design and lab setup is different and we can only attempt to cover the more common patterns in the projects that we know.

We hope that there will be a great deal of discussion and “I found this, try that…” experience swapped between prototypers on the FPMM online forum, from which we can all learn to become better debuggers. In the meantime, Table 30 gives a rough checklist of tasks in the lab during bring-up and debug.

11.11. Summary

After giving ourselves confidence that the FPGA board is fully functional and configured correctly and also having checked the implementation and timing reports for common errors, we can consider that functional faults in the operation of the prototype might actually be bugs in the design itself that are becoming visible for the first time.

As we find and correct bugs and move towards a fully functional prototype we are being paid back with interest for all our hard work in getting this far. The prototype has also become an extraordinarily powerful tool for validating the integration of the software with the SoC. Replicating the prototype for use by a wider community of software engineers is relatively quick and simple at this stage.

The authors gratefully acknowledge significant contribution to this chapter from

Ramanan Sanjeevi Krishnan of Synopsys Bangalore