Multi-Core Software
Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors
INTEL® CORE™ MICRO-ARCHITECTURE
Intel® Core™ micro-architecture is the foundation for all new Intel® architecture-based desktop, mobile, and server multi-core processors. The micro-architecture is optimized for multi-core designs and delivers a number of innovative features that have set new standards for energy-efficient performance. In this section we outline the few innovations that are relevant to this paper. A more detailed description can be found in the Intel® literature [4].

Figure 1: Quad-core processor schematic
Figure 1 shows a schematic of the Intel® Core™2 Quad processor. Two independent cores, each with its own private L1 cache, reside on a single die. The processor has two shared Level 2 (L2) caches, referred to as the Intel® Advanced Smart Cache. Because an L2 cache is shared between cores, data are stored in one place that is accessible to all of the cores sharing it. Sharing also enables a single core to dynamically use up to 100% of the available L2 cache, thus optimizing the use of cache resources.
The quad-core processor is equipped with Intel® Smart Memory Access, which boosts system performance by optimizing the use of available data bandwidth from the memory subsystem and by hiding the latency of memory accesses. It does so through two techniques: memory disambiguation and an instruction pointer-based prefetcher. The prefetcher fetches memory contents into the shared L2 cache and then into each private L1 cache before they are requested. It can detect strided memory access patterns and use them to make accurate predictions about future load addresses.
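To make the term "strided memory access pattern" concrete, the following minimal C sketch shows a loop in which a single load instruction reads addresses that are a constant number of bytes apart. The function name sum_strided and its parameters are hypothetical and chosen only for illustration. The code illustrates the access pattern itself; it does not control or demonstrate the hardware prefetcher.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums every `stride`-th element of `data`,
   starting at index 0. The load of data[i] is issued by one
   instruction whose successive addresses differ by a constant
   stride * sizeof(float) bytes, i.e. a strided access pattern. */
float sum_strided(const float *data, size_t n, size_t stride)
{
    float sum = 0.0f;

    assert(data != NULL);
    if (stride == 0) {
        return sum;            /* a zero stride would never advance */
    }
    for (size_t i = 0; i < n; i += stride) {
        sum += data[i];        /* addresses: base, base + s, base + 2s, ... */
    }
    return sum;
}
```

With a stride of 1 the loop walks memory contiguously; larger strides touch fewer elements per cache line, which is where predicting future load addresses matters most.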
Another key feature of Intel® Core™ micro-architecture is the Intel® Advanced Digital Media Boost, which can issue 128-bit SSE instructions with a throughput of one per clock cycle. Previous-generation Intel processors had a sustained throughput of one such instruction per two clock cycles, typically one cycle for the lower 64 bits followed by another cycle for the upper 64 bits. By widening the execution units to the full 128 bits, Intel Core micro-architecture effectively doubles the throughput of a series of 128-bit SSE instructions relative to previous-generation Intel processors. In addition, the latency of various individual 128-bit SSE instructions has been reduced, and SSSE3 has been added to extend the instruction set. As a result, larger overall performance improvements can be expected from vectorization (i.e., transforming sequential code into SIMD instructions).
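As a rough illustration of what vectorization means, the sketch below places a sequential loop next to a hand-written equivalent that uses 128-bit SSE intrinsics from xmmintrin.h. The function names add_scalar and add_sse are hypothetical, and the code is a simplified sketch of this kind of transformation, not the code that the Intel compilers actually generate. Each 128-bit instruction processes four single-precision values, so the SIMD loop runs one quarter as many iterations. If SSE is not available at compile time, add_sse falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

/* Use SSE intrinsics only when the compiler targets an SSE-capable CPU. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Sequential version: one single-precision addition per iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* Vectorized version: four additions per 128-bit SSE instruction. */
void add_sse(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && c != NULL);
#ifdef HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* 4 adds, then store */
    }
#endif
    /* Remainder loop for the last n % 4 elements (or all of them
       when SSE is unavailable). */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The unaligned load and store intrinsics are used here so that the sketch works for any input pointers; a vectorizer that can prove 16-byte alignment may use the aligned forms instead.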
