Willersbau
Raum A104
Tel. +49 351 - 463 - 42483
Robert Schöne ([email protected])
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
Performance Analysis of Computer Systems
Comparing systems using sample data
- BenchIT -
Robert Schöne
Contributions
Jupp Müller
Daniel Molka
Jens Domke
Dr. Stefan Pflüger
Daniel Reiche
BenchIT Team
Agenda
Implementation Guidelines and Feature Overview
BenchIT GUI – Measuring and Plotting
BenchIT Website
Case Study – Optimizing STREAM for Intel Core 2
Implementation Guidelines
Platform independent
– POSIX conformance
– ANSI C conformance
Uses only sh and cc
No makefiles
Minimized size of the sources
Plain text for
– Configuration data
– Results
GPL licence model
The BenchIT Concept – From Measurement to Analysis
[Figure: measurement and analysis workflow. Recoverable labels: Measurement, Analysis; users and user groups; result files (columns of numbers); a server with Database and WWW; result plots with X axis and Y axis.]
BenchIT – Step by Step
[Figure, built up over seven slides. Final state:]
Console: use an editor to edit the kernel sources and the LOCALDEFS, then start the steps below
– compile: Kernel Sources → Executable
– run: Executable → Result File
– create: Result File → eps, png, ...
– upload: Result File → BenchIT Database
BenchIT-Website: compare results from the BenchIT Database
BenchIT-GUI: edit, start, view/plot, create, compare results
BenchIT – Different Solutions for Specialized Purposes
BenchIT measurement
– Scripts (COMPILE.sh, RUN.sh, reference_run.sh)
– BenchIT-GUI for
• Local Measurement
• Remote Measurement
- Compile and run on the remote system
- Cross-compile on the host system and only run on the remote system
BenchIT visualization of results and comparison of different runs
– BenchIT-Website
– BenchIT-GUI
Examples: Memory Latency
Measuring the latency of the different memory levels
Problem size: size of the memory used
Benchmark: pointer chasing
    ptr = first;
    do {
        ptr = (void **) *ptr;   /* the next address depends on this load */
    } while (ptr != first);
Examples: MPI Latency
Measuring the latency between different MPI nodes
Problem size: ID of the sender-receiver pair
Benchmark: ping-pong
    if (myRank == receiver(ID)) {
        MPI_Recv();   /* arguments omitted */
        MPI_Send();
    }
    if (myRank == sender(ID)) {
        MPI_Send();
        MPI_Recv();
    }
Examples: Floating Point Performance
Measuring the floating point performance when operating on data in different memory levels
Problem size: memory size
Benchmark: matrix multiplication
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
Examples: Bandwidth
Measuring the bandwidth of different memory levels
Problem size: memory size
Benchmark: STREAM-like
    for (i = 0; i < N; i++)
        c[i] = a[i];
    …
Writing a measurement kernel
Naming convention
category.name.language.parallelLibs.otherLibs.ID
– numerical.matmul.C.0.0.double
– memory.latency.C.0.0.pointerchasing
Clear Interface to program against:
– bi_getinfo
Used by benchit to get information about the measurement kernel
– bi_init
Called by benchit to initialize data for the measurement kernel
– bi_entry
Called n times by benchit to generate results
– bi_cleanup
Called by benchit to free allocated resources
bi_getinfo
Passes the info struct defined in the interface
The kernel should fill in the following information:
– X/Y axis settings
– Legend texts
– Outlier direction
– "maxproblemsize" (not the real problem size, but the number of bi_entry calls)
– Usage of parallel libraries
– Number of functions
– Definition of the "best" result
bi_init / bi_cleanup
bi_init
Called once before the measurements start
"maxproblemsize" is passed
Should allocate large data fields; bi_entry may use only parts of them
Should initialize the libraries, devices, … it uses
May return ONE pointer to its data
bi_cleanup
Called once after the measurement
Pointer returned by bi_init passed
Should free resources
bi_entry
Called several times
The pointer returned by bi_init and an ID are passed
The ID is the number of the measurement; it may also serve as the problem size
A result vector is passed (double[number of functions + 1])
Should perform the measurement
Can use:
– bi_gettime() gets current time in seconds as double
– dTimerOverhead means overhead for bi_gettime()
– dTimerGranularity means granularity of bi_gettime()
Results should be stored in result vector
If there's so much to write …
Why should I use BenchIT?
BenchIT stores information about the compile-time and run-time environment
BenchIT makes batch systems transparent to use
BenchIT selects the "best" result
BenchIT allows easy comparison
BenchIT provides tools for remote measurement
…
BenchIT GUI – Measuring and Plotting
BenchIT GUI – Start
BenchIT GUI – definition of local system
BenchIT GUI – select a kernel
BenchIT GUI – run kernel …
BenchIT GUI – run kernel … finished
BenchIT GUI – show result
BenchIT GUI – result with default settings
BenchIT GUI – changing settings (before)
BenchIT GUI - changing settings (after)
BenchIT GUI – result plot with new settings
BenchIT GUI – running on a remote machine
BenchIT GUI – define a remote machine
BenchIT GUI
BenchIT GUI – automatic generation of definitions
BenchIT GUI – switching local definitions
BenchIT GUI – loading definitions from remote machine
BenchIT GUI – new definitions loaded
BenchIT GUI
BenchIT GUI – changing some settings
BenchIT GUI – running pointerchasing on remote system
BenchIT GUI – selecting the target system
BenchIT GUI - pointerchasing running remote …
BenchIT GUI - pointerchasing running remote … done
BenchIT GUI – getting results from remote machine
BenchIT GUI – result from remote machine
BenchIT GUI – comparing both results
BenchIT GUI – comparing both results, better layout
BenchIT GUI - connecting to web server
BenchIT GUI – selecting results from web server
BenchIT GUI – getting results for Pentium M
BenchIT GUI – results from web server
BenchIT GUI - putting all together …
BenchIT GUI - … and another one
BenchIT GUI – exported to png
BenchIT Website
Analysis/Plot: 3 Different Analyse Paths, Stored Plots
Compare Different Architectures
Compare Different Processors
Kernels which run on both Systems
Compare their Memory Access Time
Select Additional Information
Compared Results
Compare a specific Kernel
Compare Memory Latencies (Pointerchasing)
Compare a Larger Set of Systems
Not Satisfying?
Compare Different Implementations
Compare Different Compilers
Compare Different Compiler Flags
Compare Different Processor Generations
Compare Different Libraries
Share ...
Share with specific user groups
Case Study – Optimizing STREAM for Intel Core 2
Intel Core 2 Duo Processor
[Block diagram: two cores (Core 0, Core 1) sharing one L2 cache. Recoverable labels:]
– Caches: 32 KiB L1 Instruction Cache, 32 KiB L1 Data Cache, 4 MiB shared (dynamically allocated) L2 Cache
– TLBs: ITLB, DTLB
– Front end: Branch Predict, Fetch and Predecode, Instruction Queue (18 x86 instructions), Decode (4+1 x86 instructions: one complex decoder with 4 µops, three simple decoders with 1 µop each), Microcode ROM
– Out-of-order engine: Rename/Alloc, Reorder Buffer (96 entries), Reservation Station (32 entries), physical registers (alloc/free)
– Execution units on ports 0 to 5: Int ALU, Int SIMD, FP MUL; Int ALU, Int SIMD, FP ADD; Int ALU, Int SIMD; Load; Store address; Store data
– Memory: Memory Order Buffer, Load/Store Buffers
– External: Bus Interface Unit, FSB
– Path widths labelled in the figure: 16 Byte, 6 x86, 4+1 x86, 128 Bit (four times), 256 Bit
The STREAM Benchmark – Source Code Excerpt
# define N 2000000
# define NTIMES 10
# define OFFSET 0
...
static double a[N+OFFSET],
b[N+OFFSET],
c[N+OFFSET];
...
for (k=0; k<NTIMES; k++)
{
    times[0][k] = mysecond();
#pragma omp parallel for
    for (j=0; j<N; j++)
        c[j] = a[j];
    times[0][k] = mysecond() - times[0][k];
    ...
}
First Measurements
Unsatisfactory results, imprecise for small problem sizes
– STREAM is designed for large memory accesses
– STREAM is very simplistic
Only a single problem size is measured per run
– Recompilation for every measurement
– For cache accesses, compiling takes longer than measuring
Reimplementation in BenchIT
First Measurements - Reimplementation
Design of the benchmark untouched, but
– Dynamic memory allocation
– Variable problem size
– Using RDTSC
No optimizations done
Offset still 0 (STREAM default)
Derived STREAM Benchmark
[Plots with the L1 Cache and L2 Cache regions marked. Annotation: bandwidth in L2 cache approx. 20 GB/s]
Optimizations – Reduce Overhead
Results in the L1 cache are still unsatisfactory
Too much overhead due to OpenMP
Solution:
Move time measurement into parallel region
Repeat every operation
Only increased timer accuracy
BUT:
Loops are moved into parallel regions too!
[Plot annotation: repetitions for every single operation, not for the whole loop]
Optimizations – Align Memory for SSE Access
Still relatively low cache performance
Previous measurements have shown
– 16 byte alignment important for performance
– The compiler directive #pragma vector aligned lets the compiler exploit the alignment
Solution:
– Vectors are now 16 byte aligned
– Both parts of each vector have a length that is a multiple of 2 (2 doubles = 16 Byte)
– Compiler directive was introduced
Optimizations – Align Memory for Better Cache Access
Still unstable behavior for small problem sizes
Better performance for vector lengths that are a multiple of 16 (8 when single threaded)
8 * 8 Byte (double precision floating point) = 64 Byte (cache line length)
Solution:
Aligning vectors at a 128 Byte boundary for 2 threads
Examination of Other Multicore-CPUs
                         Intel Xeon 5160    Intel Core Duo T2600   Intel Xeon 5060   AMD Opteron 285
Codename                 Woodcrest          Yonah                  Dempsey           Italy
Compiler                 icc 9.1-em64t      icc 9.1                icc 9.1-em64t     icc 9.1-em64t
Clock rate               3.0 GHz            2.167 GHz              3.2 GHz           2.6 GHz
L1 D-Cache per Core      32 KiByte          32 KiByte              16 KiByte         64 KiByte
L2 Cache                 4 MiByte shared    2 MiByte shared        2*2 MiByte        2*512 kByte