Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.02

  • Volume 11
  • Issue 04
  • Published November 15, 2007

Parallelization Made Easier with Intel® Performance-Tuning Utility

INTRODUCTION

Parallel processing has been in common use for decades, but only recently, with the advent of multi-core processors, has it become available on virtually every computer. Historically, mainstream performance analysis tools [1, 2, 3, 4] have generally lacked features designed to help identify parallel execution opportunities or many of the common parallel execution bottlenecks. The Intel® Performance Tuning Utility (Intel® PTU), externally available at [5], provides many of these features in a single tool on Intel® Architecture.

Building on the experience of the Intel VTune™ Performance Analyzer, Intel PTU was designed to significantly improve on the available data collection and display features and to add the capabilities needed for enabling and analyzing parallel execution. The instrumentation-based control flow analysis that was initially supported (Exact Call Graph) suffers from excessive overhead and the resulting data distortion. In Intel PTU it was replaced with a statistical approach to data collection based on call stack sampling. The new statistical call stack sampling is supplemented with precise call count data collection that can be used when required. Binary analysis was added to improve the disassembly displays by using basic blocks as the underlying execution units, and to generate a control flow graph for the disassembly to simplify its interpretation. The binary analysis also enables the identification of loops, which, coupled with the performance data, allows for the identification of parallel execution opportunities. The full use of the Precise Event Based Sampling (PEBS) mechanism, available only on Intel® processors, enables simultaneous profiling by both Instruction Pointer (IP) and data address, and a graphical filtering interface facilitates the analysis and identification of performance bottlenecks due to data access and layout issues.
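To make the idea of combining loop identification with performance data more concrete, the following is a minimal sketch, not Intel PTU's actual algorithm. It assumes a reducible control flow graph given as a dictionary of basic-block successors, finds loop-closing (back) edges with a depth-first search, builds the natural loop for each back edge, and ranks the loops by the number of samples attributed to their blocks. All names (`find_back_edges`, `natural_loop`, `rank_loops`, the block labels, and the sample counts) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_back_edges(cfg, entry):
    """Return edges (tail, head) whose head is still on the DFS stack.

    For a reducible control flow graph these are the loop-closing edges.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every block starts WHITE (0)
    back_edges = []

    def visit(block):
        color[block] = GREY
        for succ in cfg.get(block, []):
            if color[succ] == GREY:
                back_edges.append((block, succ))
            elif color[succ] == WHITE:
                visit(succ)
        color[block] = BLACK

    visit(entry)
    return back_edges


def natural_loop(cfg, tail, head):
    """Return the set of blocks in the natural loop of back edge tail -> head."""
    preds = defaultdict(list)
    for block, succs in cfg.items():
        for succ in succs:
            preds[succ].append(block)

    body = {head}
    stack = [tail]
    while stack:
        block = stack.pop()
        if block not in body:
            body.add(block)
            stack.extend(preds[block])  # walk backwards until the header
    return body


def rank_loops(cfg, entry, samples):
    """Return (header, total samples, sorted blocks) per loop, hottest first."""
    loops = []
    for tail, head in find_back_edges(cfg, entry):
        body = natural_loop(cfg, tail, head)
        total = sum(samples.get(block, 0) for block in body)
        loops.append((head, total, sorted(body)))
    return sorted(loops, key=lambda loop: loop[1], reverse=True)


# Hypothetical example: B1 heads an outer loop, B2 heads an inner loop.
example_cfg = {
    "B0": ["B1"],
    "B1": ["B2", "B4"],
    "B2": ["B3"],
    "B3": ["B2", "B1"],
    "B4": [],
}
# Hypothetical sample counts per basic block, as a sampling profiler might report.
example_samples = {"B0": 5, "B1": 40, "B2": 600, "B3": 300, "B4": 55}

for header, total, blocks in rank_loops(example_cfg, "B0", example_samples):
    print(header, total, blocks)
```

In this sketch, a loop that accounts for a large share of the samples is a candidate for closer inspection. Whether it can actually be parallelized still depends on its data dependences, which this ranking does not examine.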

All Intel PTU features are thread and CPU aware and can display data specific to either. Intel PTU works on a wide range of Windows* and Linux* operating system flavors and provides the same look and feel on all of them. It can be used from the command line or from a GUI that integrates into the Eclipse* IDE.

In this paper, we first describe the new features of Intel PTU in detail, as well as the analysis models facilitated by those features. We then illustrate the process of parallel software analysis and parallel execution discovery using Intel PTU on real program examples. We continue with an outline of areas for further development such as the quality of analysis and data representation, and finally we look at modern hardware performance monitoring capabilities.

Reading this paper requires some experience in parallel program design, as well as some knowledge of parallel performance monitoring and analysis. The sections below should not be viewed as a final recipe for efficient parallel software development or as a description of methods for automated parallelization. Our goal, rather, is to illustrate the information that may be of use when dealing with parallel software, and how that information may be collected, presented, and best interpreted with Intel PTU to ease the tasks of exploiting parallelization opportunities and tuning parallel performance.
