Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.02

  • Volume 11
  • Issue 04
  • Published November 15, 2007

  Section 3 of 8  

Parallelization Made Easier with Intel® Performance-Tuning Utility

NEW PERFORMANCE ANALYSIS MODELS

Performance tuning is like debugging: you'd like to avoid it but you cannot. And similar to the debugging process, you cannot do anything effectively unless you have a reliable tool that can save you a lot of time and effort. The importance of good debuggers and performance analyzers becomes critical as we move into the all-parallel world of microprocessing.

Intel® PTU is meant to become such a time-saving tool. We do not pretend though that the tool can fully automate the tuning process. We simply believe that it is more important to put the burden of routine work on the tool, and let engineers think about their performance problems rather than about the tool itself.

To accomplish this, we focus on the following:

  • Provide an easy way to perform repetitive data collection and analysis tasks.
  • Provide effective and reliable methods of data collection and analysis that are relevant to both sequential and parallel programs.

The rest of this section describes in detail how the above goals are addressed by Intel PTU. We explicitly indicate product features that are especially valuable in the case of parallel analysis.

Projects, Configurations, and Experiments

To be effective in repetitive performance-tuning tasks, Intel PTU introduces the concepts of projects and profile configurations.

A project contains information about the application, working directory, input arguments, maximum data collection time, and so on; in other words, it specifies what should be analyzed.

Profile configurations organize collection methods into convenient shortcuts that can be reused for any project. Configurations can be predefined or user-defined. The predefined configurations for Intel PTU are as follows:

  • Basic Statistical Call Graph
  • Basic Sampling
  • Basic Call Count
  • Basic Data Access Profiling

A single profile configuration can be defined for multiple Intel processors, which generalizes its use. For example, the predefined Basic Sampling configuration collects two performance events, "number of cycles" and "instructions retired," which are mapped to different hardware events on different processors.

Creating a project is the first thing a user does. Once the project is created, the list of predefined configurations can be invoked by right-clicking the project in the navigator, hovering the mouse over the "Profile As" option, and selecting one of the profiling options (Figure 1).



Figure 1: Launching a predefined data collection

Alternatively, right-clicking the project and selecting the "Profile..." option invokes the configuration editor, which allows users to select one of their own existing configurations or to create and invoke a new one.

After a profile configuration is applied to a project, the data are collected into an experiment. The basic visualization of the experiment data in Intel PTU is a tabular spreadsheet. The rows correspond to the currently chosen aggregation unit: module, function, basic block, or single address. The columns display the metrics for that unit. The granularity of the aggregation unit can be selected through pull-down menus (Figure 2).



Figure 2: Intel® PTU tabular data view

For parallel programs, Intel PTU includes the current thread identifier (TID) and the CPU identifier in the program state information, so that for each collection point it is clear which thread was executing on which processor at that moment. The data can be filtered for a specific thread, process, or CPU by using pull-down filter menus. This is especially useful for analyzing how evenly work is balanced across threads or CPUs.
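As a rough illustration of this kind of filtering, the sketch below works on a hypothetical list of sample records. The record layout `(tid, cpu, function)` and the helper names are assumptions made for this example, not the tool's internal format.

```python
from collections import Counter

# Hypothetical sample records: (thread id, CPU id, function name).
samples = [
    (101, 0, "f1"), (101, 0, "f1"), (101, 0, "f2"),
    (102, 1, "f1"), (102, 1, "f3"),
    (103, 1, "f3"),
]

def filter_samples(samples, tid=None, cpu=None):
    """Keep only the samples that match the given thread and/or CPU."""
    return [s for s in samples
            if (tid is None or s[0] == tid) and (cpu is None or s[1] == cpu)]

def share_by(samples, field):
    """Fraction of all samples per thread (field=0) or per CPU (field=1)."""
    counts = Counter(s[field] for s in samples)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

per_thread = share_by(samples, 0)  # uneven shares hint at thread imbalance
per_cpu = share_by(samples, 1)     # uneven shares hint at CPU imbalance
```

In this made-up data set, thread 101 receives half of all samples while the two CPUs are evenly loaded, which is the kind of pattern the thread and CPU filters are meant to expose.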

Now, let's discuss the predefined analysis methods Intel PTU suggests.

Statistical Call Graph Analysis

Statistical Call Graph (a.k.a. Stack Sampling) collects its data by interrupting the program execution periodically (100 times per second) and capturing the current instruction pointer (IP) and the call stack. Using the IP value, it calculates how many samples occurred in a given function. This number is called Self Samples, because it corresponds to the number of samples that occurred in the function itself, not in the functions called by this function (callees). Places where a significant portion of samples occur are called hotspots.

The sample data can be aggregated by different units: function, module, basic block, or address. In the Intel PTU GUI the aggregation unit concept is exposed as "granularity." The function granularity is still the most popular, so it was made the default one in Hotspot view. By default, functions (rows) are sorted by the number of Self Samples, so the most active functions are displayed at the top.
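The aggregation of sampled IP values into per-function Self Samples can be sketched as follows. This is a minimal illustration that assumes a hypothetical symbol table of non-overlapping address ranges; the addresses, function layout, and helper names are invented for the example.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start, end, name), sorted by start address,
# with end exclusive.
SYMBOLS = [
    (0x1000, 0x1040, "main"),
    (0x1040, 0x10C0, "f1"),
    (0x10C0, 0x1100, "f2"),
]
STARTS = [start for start, _, _ in SYMBOLS]

def function_for_ip(ip):
    """Map an instruction pointer to the function whose range contains it."""
    i = bisect.bisect_right(STARTS, ip) - 1
    if i >= 0:
        _, end, name = SYMBOLS[i]
        if ip < end:
            return name
    return "<unknown>"

def self_samples(ips):
    """Count samples per function (function granularity)."""
    return Counter(function_for_ip(ip) for ip in ips)

def hotspots(ips):
    """Functions sorted by Self Samples, most active first."""
    return self_samples(ips).most_common()

sampled_ips = [0x1044, 0x1050, 0x10C4, 0x1044, 0x2000]
```

Other granularities follow the same pattern with a different key: the module containing the IP, the basic block containing it, or the IP itself.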

A second metric for each function, Total Samples, can be defined as the number of samples in the function plus all the samples that have the function in the call stack. Thus Total Samples measures the time in the function and everything the function calls.

To illustrate this, we used a simple program.

The time spent in each function is proportional to the iteration count of its loops. The loops in f1, f2, f3, and f4 are defined to split the total execution time of the application, and thus the expected number of samples, in a known manner: 30% in f1, 20% in f2, 40% in f3, and 10% in f4, while main and foo are negligible.
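The relationship between Self Samples and Total Samples can be sketched with a small calculation over sampled call stacks. The stacks below are hypothetical: they follow the 30/20/40/10 split described above and assume that main calls f1, f1 calls f2 and f3, and f3 calls f4. They are not the data behind Figure 3.

```python
from collections import Counter

def self_and_total(stacks):
    """Compute Self and Total Samples from sampled call stacks.

    Each stack lists functions from the outermost caller to the function
    that was executing when the sample was taken.
    """
    self_counts, total_counts = Counter(), Counter()
    for stack in stacks:
        self_counts[stack[-1]] += 1
        # Count each function once per sample, even if it recurses.
        for fn in set(stack):
            total_counts[fn] += 1
    return self_counts, total_counts

# Ten hypothetical samples split 3:2:4:1 across f1, f2, f3, f4.
stacks = (
    [["main", "f1"]] * 3
    + [["main", "f1", "f2"]] * 2
    + [["main", "f1", "f3"]] * 4
    + [["main", "f1", "f3", "f4"]] * 1
)

self_counts, total_counts = self_and_total(stacks)
```

Here main has no Self Samples but appears in every stack, so its Total Samples equal the whole sample count, while f3's Total Samples are its own samples plus those of f4.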



Figure 3: Statistical Call Graph results

The top hotspot display shows that the self samples are in the ratio of 4:3:2:1 (Figure 3). The call stack expanded from f2 shows f1 and main as its callers. The four hotspots are all highlighted, and the total sample count for them is shown below as 4124 samples. Note that the 829 samples shown for main here do not mean that 829 samples occurred in main() itself. They are all samples in f2: there were 829 samples in f2, and for all of those samples f1 and main were the callers.

Note the pull-down menus for process, thread, and module filtering of the data displayed in the hotspot view. This greatly simplifies use of the view. This technique is common to hotspot displays for all the collection modes.

The four functions were highlighted, one by one, by left-clicking while holding down the CTRL key. As the last function selected was f1, it is displayed in the Caller/Callee view. The call chain is expanded in both directions around f1, with callers of f1 shown above it and its callees shown below it. The total number of samples for f1 is equal to its self samples plus the total number of samples for the functions it calls, f2 and f3. So we get 3703 total samples for f1 as 1230 plus 1644 plus 829. The total for f3 is equal to the total number of self samples for f3 and f4. As main is the caller of f1, it inherits the self and total times associated with f1.

Functions foo and main are not visible in the hotspot view because no samples occurred within their code ranges. Statistical Call Graph doesn't capture every call the program made. There is another collection technique called Exact Call Graph that instruments all functions in the program and can collect information about each call. However, this method has a much higher overhead. It is also very intrusive and distorts the execution of parallel programs making it impossible to map the results of this analysis to the behavior of the original program. Hence, it has a limited scope of applicability and is not always relevant for parallelization tasks. Exact Call Graph has the advantage of providing function call counts; to provide this useful information Intel PTU has a special Basic Call Count configuration.

Profile-Guided Loop Analysis

An important step in program parallelization is deciding which parts of the program should be parallelized. Since loops are often good candidates for parallelization, Intel PTU treats loops in a very special way.

Loops are identified by analysis of the binary. This information is then used to generate entries in the hint column. The hints tell you if there is a hot loop in the function and if a hot function was called from a loop. Hot functions called from a loop can be considered for parallelization.
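The text does not say how Intel PTU finds loops in the binary. One common technique, shown here purely as an assumption, is to look for back edges in the control flow graph of basic blocks and then gather the blocks that form each loop. The graph, block names, and sample counts below are hypothetical, and the sketch assumes a reducible control flow graph.

```python
def find_back_edges(cfg, entry):
    """Return edges (tail, header) that point back to a block still on the DFS path."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    back_edges = []

    def visit(block):
        color[block] = GREY
        for succ in cfg.get(block, []):
            state = color.get(succ, WHITE)
            if state == GREY:
                back_edges.append((block, succ))
            elif state == WHITE:
                visit(succ)
        color[block] = BLACK

    visit(entry)
    return back_edges

def loop_body(cfg, tail, header):
    """Blocks of the natural loop defined by the back edge tail -> header."""
    preds = {}
    for block, succs in cfg.items():
        for succ in succs:
            preds.setdefault(succ, []).append(block)
    body = {header}
    work = [tail]
    while work:
        block = work.pop()
        if block not in body:
            body.add(block)
            work.extend(preds.get(block, []))
    return body

def loop_sample_share(cfg, entry, block_samples):
    """Fraction of a function's samples that fall inside any loop."""
    in_loops = set()
    for tail, header in find_back_edges(cfg, entry):
        in_loops |= loop_body(cfg, tail, header)
    total = sum(block_samples.values())
    return sum(block_samples.get(b, 0) for b in in_loops) / total

# Hypothetical function: entry -> head -> (body -> head | exit)
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
block_samples = {"entry": 1, "head": 40, "body": 55, "exit": 4}
```

A function whose loop share is high, as in this example, would be a natural candidate for a "hot loop" hint.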

Event-based Sampling

Intel® processors have a powerful performance monitoring unit (PMU) that can count and interrupt execution for sampling on a wide variety of performance-critical signals (e.g., CPU cycles, instructions retired, and last-level cache misses). Intel PTU makes it easier to use the hundreds of performance events by displaying the sampling data in a logical and convenient manner. The event-based sampling hotspot view shows an ordered spreadsheet of all functions, in all modules and processes by default. The spreadsheet can be sorted by any of the collected events. The granularity can be set to module, function (default), basic block, or instruction. A histogram of samples vs. IP can be viewed for any event through a right-click option.



Figure 4: Source view in Intel® PTU

Double-clicking on a row opens a source view display that includes a source view spreadsheet, a disassembly spreadsheet organized into units of basic blocks, and a control flow graph for the basic blocks (Figure 4). The disassembly spreadsheet can be sorted by the sample totals for the basic blocks to ease the identification of hotspots. For analysis of large, complex functions, the disassembly view can be collapsed to show only the basic block summary rows.

It is also possible to compare two event-based sampling experiments. This is particularly useful for identifying the performance differences between two binaries that have been compiled differently.
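The idea behind such a comparison can be sketched as a per-function difference of sample counts for one event. The function names and counts below are hypothetical, and the sketch assumes both experiments used the same sampling interval so that raw counts are comparable.

```python
def compare_experiments(baseline, candidate):
    """Compare per-function sample counts from two experiments.

    Returns rows of (function, baseline count, candidate count, delta),
    with the largest absolute changes first.
    """
    rows = []
    for fn in sorted(set(baseline) | set(candidate)):
        before = baseline.get(fn, 0)
        after = candidate.get(fn, 0)
        rows.append((fn, before, after, after - before))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows

# Hypothetical cycle samples for two builds of the same program.
baseline = {"f1": 1200, "f2": 800, "f3": 1600}
candidate = {"f1": 1180, "f2": 300, "f3": 1620, "f2_vec": 150}

report = compare_experiments(baseline, candidate)
```

Functions that exist in only one of the two binaries are treated as having zero samples in the other, so newly generated or removed code still shows up in the report.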

As there are hundreds of events, their use must be organized into a methodology. An introduction to the use of the Intel® Core™2 processor PMU is given in [6]. A detailed discussion of the cycle accounting methodology on that processor is offered in [7]. The same Web site [8] also contains a number of articles about the use of the Itanium® processor PMU. A variety of performance issues are associated with parallel execution; their identification with the Intel Core 2 processor PMU is discussed in [9].

Data Access Analysis

Data access tends to dominate application performance, even in single-threaded execution. Parallel execution only exacerbates this, as the number of execution units available has increased faster than the memory access capability. The actions of the processor, in response to data access requests, can be monitored with performance events counting last-level cache misses, bus traffic, and the like. What has not been generally available is the ability to analyze the application memory access behavior in terms of the data address patterns.

To collect and present performance metrics for accessed data addresses, Intel PTU uses advanced features of Intel processors. The Intel Itanium processor supports capturing the data access address and access latency directly. On Intel Core 2, Xeon® and Pentium® 4 processors, the tool uses precise performance events that capture the values of all registers at a known IP value. Combined with the disassembly of the function, this allows the target addresses of load and store operations to be reconstructed. This feature is unique for mass-market CPUs. For end users, the details of the collection mechanism are hidden behind a predefined data access configuration that is used the same way on any of the supported processors. It is also possible to define custom data access configurations using any combination of memory-related events.
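The address reconstruction step can be illustrated with the standard x86 effective-address formula, base + index × scale + displacement. The sketch below assumes the memory operand has already been decoded from the disassembly; the `MemOperand` type and the register values are hypothetical, and segment bases and RIP-relative addressing are ignored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemOperand:
    """A decoded x86 memory operand: [base + index*scale + disp]."""
    base: Optional[str]
    index: Optional[str]
    scale: int
    disp: int

def effective_address(regs, op, bits=64):
    """Rebuild the accessed address from captured register values."""
    addr = op.disp
    if op.base is not None:
        addr += regs[op.base]
    if op.index is not None:
        addr += regs[op.index] * op.scale
    return addr & ((1 << bits) - 1)  # wrap to the address width

# Hypothetical register snapshot captured with the sample.
regs = {"rbx": 0x7F0000001000, "rsi": 3, "rbp": 0x7FFC00000100}

# mov rax, [rbx + rsi*8 + 0x10]
load = MemOperand(base="rbx", index="rsi", scale=8, disp=0x10)
# mov [rbp - 8], rax
store = MemOperand(base="rbp", index=None, scale=1, disp=-8)
```

With these values, the load resolves to `0x7F0000001028` and the store to `0x7FFC000000F8`.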

Some of the more obvious objectives of address profiling would be identifying the following:

  1. Cachelines that are only partially consumed, increasing memory bandwidth and wasting cache space for no benefit.
  2. Cachelines shared by multiple threads unnecessarily (false sharing).
  3. Variables (and cachelines) that are being thrashed during synchronization.
  4. Arrays of structures that are not organized by usage, resulting in 1 above.
  5. Cachelines and variable access resulting in disproportionate access latency.

Today, data access analysis is most helpful in pinpointing items 2 and 5, while easy identification of the remaining items depends on future development of the technology.

For data access analysis, Intel PTU provides two hotspot views, one by IP and one by data address (Figure 5). The IP hotspot view is similar to the other hotspot views but has columns associated with data access metrics (average and total latency, reference count, page access count, etc.). The address hotspot view uses a granularity of 64-byte aligned address ranges for IA-32 and Intel® 64 Architecture-based processors and 128-byte aligned ranges on Itanium processors. These correspond to cachelines even though we use virtual rather than physical addresses. Similarly, "pages" are usually defined as 4 KB or 8 KB aligned ranges, depending on the architecture.

The address hotspots can be expanded to show which offsets into the lines were accessed, and which threads and functions accessed each offset. This makes it easy to identify lines that are falsely shared by multiple threads. As a result of the automatic analysis, the tool highlights such lines in pink (note the pink lines in the address hotspot view in Figure 5). For now, false positives are still possible, although we hope to minimize their number in the future.
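A simple heuristic in the spirit of this analysis is sketched below: group accesses by cacheline and flag lines that several threads touch at different offsets only. This is an assumption about how such a check could work, not a description of the tool's actual algorithm, and like the tool it can produce false positives (for example, it does not distinguish loads from stores).

```python
from collections import defaultdict

LINE_SIZE = 64  # bytes; the text uses 128 for Itanium processors

def cacheline(addr, line_size=LINE_SIZE):
    """Start of the aligned address range that contains addr."""
    return addr & ~(line_size - 1)

def false_sharing_candidates(accesses, line_size=LINE_SIZE):
    """Find lines touched by several threads with no offset in common.

    accesses is an iterable of (thread id, virtual address) pairs.
    """
    by_line = defaultdict(lambda: defaultdict(set))  # line -> tid -> offsets
    for tid, addr in accesses:
        line = cacheline(addr, line_size)
        by_line[line][tid].add(addr - line)

    candidates = []
    for line, per_thread in by_line.items():
        if len(per_thread) < 2:
            continue  # only one thread uses this line
        offsets = [o for used in per_thread.values() for o in used]
        if len(offsets) == len(set(offsets)):  # no offset used by two threads
            candidates.append(line)
    return sorted(candidates)

# Hypothetical accesses: (thread id, address).
accesses = [
    (1, 0x1000), (2, 0x1008),  # same line, different offsets
    (1, 0x1040), (2, 0x1040),  # same line, same offset (true sharing)
    (3, 0x2010),               # line used by a single thread
]
```

Only the line at `0x1000` is reported here, because it is the only one that two threads use without ever touching the same offset.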



Figure 5: Data access analysis views

The two hotspot views (IP and data address) are coupled and a selection in one can be used to filter the display of the other (with control buttons indicated in Figure 5). Thus the user can select a single function, identify which lines it accesses heavily, select a set of those lines, and then see which other functions also access those same lines. This filtering extends down to the source views.
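The coupled filtering workflow can be sketched in a few lines over hypothetical access records. The record layout `(function, address)`, the function names, and the addresses are all invented for this illustration.

```python
from collections import Counter

def line_of(addr, line_size=64):
    """Start of the aligned address range that contains addr."""
    return addr - (addr % line_size)

def lines_for_function(accesses, function, line_size=64):
    """Access counts per line, filtered to one function (IP view -> address view)."""
    return Counter(line_of(a, line_size) for fn, a in accesses if fn == function)

def functions_for_lines(accesses, lines, line_size=64):
    """Access counts per function, filtered to selected lines (address view -> IP view)."""
    selected = set(lines)
    return Counter(fn for fn, a in accesses if line_of(a, line_size) in selected)

# Hypothetical access records: (function name, virtual address).
accesses = [
    ("update", 0x1000), ("update", 0x1008), ("update", 0x2000),
    ("reduce", 0x1010),
    ("init", 0x3000),
]

# Select a function, pick its most heavily accessed line,
# then see which functions touch that line.
hot_lines = [line for line, _ in lines_for_function(accesses, "update").most_common(1)]
sharers = functions_for_lines(accesses, hot_lines)
```

In this example the hottest line of `update` is also accessed by `reduce`, which is the kind of relationship the coupled views are designed to reveal.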

We went through the most important features of Intel PTU. We learned the important concepts provided by the tool (project, profile configuration, and experiment). We found out which collection and analysis capabilities are supported and identified which of them are specifically applicable for parallelization tasks. Now it's time to discuss how what we learned can be applied to solving real-world parallelization problems.
