Open Trace Format (OTF) Tutorial - ParaTools, Inc.

Open Trace Format (OTF) Tutorial

Wolfgang E. Nagel, Holger Brunst, T.U. Dresden, Germany
Sameer Shende, Allen D. Malony, ParaTools, Inc.
http://www.vampir.eu
[email protected]
2006 Wolfgang E. Nagel, TU Dresden, ZIH

Outline
- An overview of OTF, TAU and Vampir/VNG
- OTF
  - Tools
  - API
  - Building trace conversion tools
- TAU
  - Instrumentation
  - Measurement
  - Analysis
- Scalable Tracing
  - Vampir
  - VNG
  - OTF

Tutorial Goals
This tutorial is intended as an introduction to OTF tools. Today you should leave here with a better understanding of:
- OTF API and tools
- Steps involved in building a trace conversion tool to target OTF
- How to instrument your programs with TAU to generate OTF
  - Automatic instrumentation at the routine level and outer loop level
  - Manual instrumentation at the loop/statement level
- Measurement options provided by TAU
  - Environment variables used for choosing metrics, generating performance data
- How to use the Vampir and VNG tools
  - Nature and types of visualization that VNG provides for visualizing OTF traces

Vampir: Technical Components

[Figure: Traces 1..N, tools, and a server made up of a master and workers 1..m. The numbered components are listed below.]
1. Trace generator
2. Classical Vampir viewer and analyzer
3. Vampir client viewer
4. Parallel server engine
5. Conversion and analysis tools

Many Trace Formats to choose from

OTF Features
- Fast and efficient sequential and parallel access
- Platform independent
- Selective access to
  - Processes
  - Time intervals
- API / Interfaces
  - High level interface for analysis tools
    - Read/write complete traces with multiple files
    - Supports filtering and parallel I/O
  - Low level interface for trace libraries

[Figure: Relative File Size. Bar chart of relative size for the STF, VTF, OTF and OTFZ formats on SMG 98 (18 MB), IRS (1.8 GB) and SMG2000 (2.3 GB).]

[Figure: Read Performance. Bar chart of read rate in Mevents/s for the STF, VTF, OTF and OTFZ formats on SMG 98 (18 MB), IRS (1.8 GB) and SMG2000 (2.3 GB).]

[Figure: Performance Scalability. Read rate in Mevents/s (axis 1 to 1000) for VTF, STF, OTF and OTFZ at 1, 4, 16, 64 and 256.]

Vampir Server Workflow
[Figure: Labels: Parallel Program; Monitor System (TAU/Kojak); File System with Traces 1..N; Analysis Server with Master and Workers 1..m; Event Streams; Parallel I/O; Message Passing; Merged Traces; Classic Analysis: monolithic, sequential; Process; Visualization Client connected over the Internet, showing a timeline with 16 visible traces, a segment indicator, and a thumbnail of 768 processes.]

Organization of Parallel Analysis
[Figure: Labels: Master with N Session Threads; Workers 1..m, each with a Session Thread, Analysis Module, Event Databases and Trace Format Driver; Analysis Merger; Endian Conversion; Traces; Message Passing between master and workers; Socket Communication to the Visualization Client.]

Scalability: sPPM Analyzed on Origin 2000
- sPPM
  - ASCI Benchmark
  - 3D Gas Dynamic
- Data to be analyzed
  - 16 Processes
  - 200 MByte Volume
[Figure: Speedup (axis 0,00 to 18,00) over Number of Workers (axis 0 to 40) for Com. Matrix, Timeline, Summary Profile, Process Profile, Stack Tree and Load Time.]

Number of Workers   Load Time   Timeline   Summary Profile   Process Profile   Com. Matrix   Stack Tree
 1                  47,33       0,10       1,59              1,32              0,06          2,57
 2                  22,48       0,09       0,87              0,70              0,07          1,39
 4                  10,80       0,06       0,47              0,38              0,08          0,70
 8                   5,43       0,08       0,30              0,26              0,09          0,44
16                   3,01       0,09       0,28              0,17              0,09          0,25
32                   3,16       0,09       0,25              0,17              0,09          0,25

A Fairly Large Test Case
- IRS
  - ASCI Benchmark
  - Implicit Radiation Solver
- Data to be analyzed:
  - 64 Processes in 8 Streams
  - Approx. 800.000.000 Events
  - 40 GByte Data Volume
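The speedup plotted for the sPPM case follows from the Load Time column of the table above: the single-worker time divided by the time with n workers. The helpers below are a minimal sketch of that arithmetic; the function names are hypothetical, and the sketch assumes both times are given in the same unit.

```c
#include <assert.h>

/* Speedup of a run with n workers relative to the single-worker run:
 * speedup = t_one_worker / t_n_workers. Both times must be positive
 * and expressed in the same unit. */
double speedup(double t_one_worker, double t_n_workers)
{
    assert(t_one_worker > 0.0 && t_n_workers > 0.0);
    return t_one_worker / t_n_workers;
}

/* Parallel efficiency: speedup divided by the number of workers. */
double efficiency(double t_one_worker, double t_n_workers, unsigned workers)
{
    assert(workers > 0);
    return speedup(t_one_worker, t_n_workers) / (double)workers;
}
```

With the table values, speedup(47.33, 3.01) gives the load-time speedup for 16 workers, which is close to 16, while the 32-worker row (3,16) shows no further gain.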

- Analysis Platform: Jump.fz-juelich.de
  - 41 IBM p690 nodes (32 processors per node)
  - 128 GByte per node
- Visualization Platform: Remote Laptop
[Figure: Processing Times in Seconds (axis 0,00 to 10,00) for Timeline, Summary Prof., Process Prof., Com. Matrix and Stack Tree, with 16 workers and with 32 workers. Bar labels: 9,11; 4,65; 3,62; 0,02; 3,84; 0,16; 0,02; 5,59; 4,67; 0,09.]

OTF Trace Generation and Analysis Tools

OTF Contents
- Definition records
  - Map event ids to interval (begin/end) event names
  - Symbols for atomic events
  - Process groups
- Performance events
  - Timestamped events for entering or leaving a state
  - Timestamped counter events (monotonically increasing or not)
- Global master file
  - Mapping processes to streams
- Statistical Summaries
  - Overview over a whole interval of time
- Snapshots
  - Callstack, list of pending messages, etc. at a point in time

OTF File Hierarchy

OTF Streams

otfmerge
- Changes the number of streams of an existing OTF trace
- Adds snapshots or statistics to the merged trace file

    otfmerge - converter program of OTF library.
    otfmerge [Options]
    options:
      -h, --help   show this help message
      -n           set number of streams for output
      -f           set max number of filehandles available
      -o           namestub of the output file (default out)
      -rb          set buffersize of the reader
      -wb          set buffersize of the writer
      -stats       cover statistics too
      -snaps       cover snapshots too
      -V           show OTF version

OTF Tools: otfaux
- Adds auxiliary snapshot and/or statistics information to the trace file
- Snapshots include callstack, pending messages, current counter values
- Statistics include number of calls, exclusive/inclusive time
- Statistics are monotonically increasing, unlike profiles
- Original event trace is unmodified
- Auxiliary data is generated at breakpoints, either periodically or at given ticks

    otfaux - append snapshots and statistics to existing otf traces at given break time stamps
    otfaux [Options]
    Options:
      -h, --help     show this help message
      -b             buffer size for read and write operations
      -n             number of breaks (distributed regularly);
                     if -p and -t are not set, the default for -n is 200 breaks
      -p             create break every p ticks (if both -n and -p are
                     specified, the one producing more breaks wins)
      -t             define (additional) break at given time stamp
      -F             force overwrite old snapshots and statistics
      -R             delete existing snapshots and statistics only
      -f             max number of filehandles output
      ...

otfaux (contd.)

      -g             create functiongroup summaries instead of function summaries
      -v             verbose mode, print break time stamps
      -V             show OTF version
      -a             show advancing progress during operation
      --snapshots    write ONLY snapshots but NO statistics
      --statistics   write ONLY statistics but NO snapshots
      -s a[,b]*      regard given streams only when computing statistics.
                     Expects a single token or comma separated list.
                     This implies the --statistics option!
      -l             list existing stream tokens

tau2otf
Converts TAU traces to OTF

    tau2otf [-n streams] [-nomessage] [-z] [-v]
      -n          : Specifies the number of output streams (default 1)
      -nomessage  : Suppress printing of message information in the trace
      -z          : Enable compression of trace files. By default it is uncompressed.
      -v          : Verbose
    The output trace format is OTF

    % tau2otf merged.trc tau.edf app.otf

vtf2otf
Converts VTF traces to OTF format

    vtf2otf [Options]
    Options:
      -o   output file
      -f   max count of filehandles
      -n   output stream count
      -b   size of the writer buffer
      -V   show OTF version

otf2vtf
Converts OTF trace files to VTF format

    otf2vtf [Options]
    Options:
      -o   output file
      -b   size of the reader buffer
      -A   write VTF3 ASCII sub-format (default)
      -B   write VTF3 binary sub-format
      -V   show OTF version

Building Trace Analysis Tools

- Writing OTF traces in trace conversion tools
  - High level API writes multiple streams
  - Low level API writes a single stream
  - Each OTF file has a prefix (e.g., app.otf)
- Parallel reading and searching in OTF analysis tools
  - Each process in the tool reads local and global event definitions
  - Each process reads a subset of events
  - Read summary information to select interesting spots in the trace
  - Tool might read a selected time interval for analysis
  - OTF supports efficient binary search
  - Tool may support compressed or uncompressed OTF traces
  - Tool may support single or multi-stream OTF traces

OTF Trace Writer API - OTF_FileManager_open
Generates a new file manager with a maximum number of files that are allowed to be open simultaneously

    OTF_FileManager* OTF_FileManager_open( uint32_t number );

    #include <otf.h>
    OTF_FileManager *manager;
    manager = OTF_FileManager_open(256);

OTF_FileManager_close
Closes the file manager

    void OTF_FileManager_close( OTF_FileManager* m );

    #include <otf.h>
    OTF_FileManager_close(manager);

OTF_Writer_open
Defines the file control block for the output trace file

    OTF_Writer* OTF_Writer_open( char* fileNamePrefix,
                                 uint32_t numberOfStreams,
                                 OTF_FileManager* fileManager );

    #include <otf.h>
    void *fcb = (void *) OTF_Writer_open(out_file, num_streams, manager);

OTF_Writer_setCompression
Enables compression if specified by the user

    int OTF_Writer_setCompression( OTF_Writer* writer, OTF_FileCompression compression );

    #include <otf.h>
    OTF_Writer_setCompression((OTF_Writer *)fcb, OTF_FILECOMPRESSION_COMPRESSED);

OTF_Writer_writeDefCreator
Specifies a comment about the creator (trace conversion tool)

    int OTF_Writer_writeDefCreator( void* userData,
                                    uint32_t stream, /* stream = 0 means global definition */
                                    const char* creator );

    #include <otf.h>
    OTF_Writer_writeDefCreator(fcb, 0, "MyTool2otf ver 2.42");

OTF_Writer_writeDefProcess
Writes a process definition record

    int OTF_Writer_writeDefProcess( OTF_Writer* writer,
                                    uint32_t stream,
                                    uint32_t process,
                                    const char* name,
                                    uint32_t parent );

    #include <otf.h>
    OTF_Writer_writeDefProcess((OTF_Writer *)fcb, 0, cpuid, name, 0);

OTF_Writer_writeDefTimerResolution
Provides the timer resolution. All timestamps are interpreted based on this resolution. By default it is 1 microsecond.

    int OTF_Writer_writeDefTimerResolution( void* userData,
                                            uint32_t stream,
                                            uint64_t ticksPerSecond );

    #include <otf.h>
    OTF_Writer_writeDefTimerResolution((OTF_Writer*) userData, 0, getTicksPerSecond());
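Because every timestamp is interpreted against ticksPerSecond, a conversion tool typically needs to map between seconds and ticks. The following is a minimal sketch of that arithmetic; seconds_to_ticks and ticks_to_seconds are hypothetical helper names and are not part of the OTF library.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a time in seconds to timer ticks, rounding to the nearest
 * tick. ticks_per_second is the value given in the timer resolution
 * definition record (1000000 for a 1 microsecond resolution). */
uint64_t seconds_to_ticks(double seconds, uint64_t ticks_per_second)
{
    assert(seconds >= 0.0 && ticks_per_second > 0);
    return (uint64_t)(seconds * (double)ticks_per_second + 0.5);
}

/* Convert a tick count back to seconds. */
double ticks_to_seconds(uint64_t ticks, uint64_t ticks_per_second)
{
    assert(ticks_per_second > 0);
    return (double)ticks / (double)ticks_per_second;
}
```

With a resolution of 1000000 ticks per second, an event at 1.5 seconds would be written with timestamp 1500000.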

OTF_Writer_writeDefFunction
Provides a function definition and specifies an event id to name mapping

    int OTF_Writer_writeDefFunction( void* userData,
                                     uint32_t stream,
                                     uint32_t func,
                                     const char* name,
                                     uint32_t funcGroup,
                                     uint32_t source ); /* specify source code location */

    #include <otf.h>
    OTF_Writer_writeDefFunction((OTF_Writer*)userData, 0, eventID, (const char *) name, groupID, 0);

OTF_Writer_writeDefFunctionGroup
Provides a function group definition

    int OTF_Writer_writeDefFunctionGroup( void* userData,
                                          uint32_t stream,
                                          uint32_t funcGroup,
                                          const char* name );

    #include <otf.h>
    OTF_Writer_writeDefFunctionGroup((OTF_Writer*)userData, 0, groupId, GroupName);

OTF_Writer_writeEnter
Writes a function entry record

    int OTF_Writer_writeEnter( OTF_Writer* writer,
                               uint64_t time,
                               uint32_t function,
                               uint32_t process,
                               uint32_t source );

    #include <otf.h>
    OTF_Writer_writeEnter((OTF_Writer*)userData, GetClockTicksInGHz(time), stateid, cpuid, 0);

OTF_Writer_writeSendMsg
Writes a send message record

    int OTF_Writer_writeSendMsg( OTF_Writer* writer,
                                 uint64_t time,
                                 uint32_t sender,
                                 uint32_t receiver,
                                 uint32_t procGroup,
                                 uint32_t tag,
                                 uint32_t length,
                                 uint32_t source );

OTF_Writer_writeRecvMsg
Writes a receive message record

    int OTF_Writer_writeRecvMsg( OTF_Writer* writer,
                                 uint64_t time,
                                 uint32_t receiver,
                                 uint32_t sender,
                                 uint32_t procGroup,
                                 uint32_t tag,
                                 uint32_t length,
                                 uint32_t source );

OTF Trace Reader API
- Similar to trace writer API
- Instead of Write, create a Handler for callbacks, e.g.,

    int OTF_Handler_DefFunction( void* userData,
                                 uint32_t stream,
                                 uint32_t func,
                                 const char* name,
                                 uint32_t funcGroup,
                                 uint32_t source );

OTF Trace Reader API
- Similar to trace writer API
- Instead of Write, create a Handler for callbacks
- Specify the parameters to the handler routine
- After setting up handlers, read events, snapshots, definitions...
- The library invokes the appropriate handlers
- Close the file manager and exit cleanly

Global Read Operations
- Open array handler
- Open OTF reader
- Control the buffer size
- Set handler and arguments
- Read definitions
- Read snapshots
- Read events
- Close reader

OTF_HandlerArray_open/close
To open a new array of handlers and then fill in the callback routines:

    OTF_HandlerArray *OTF_HandlerArray_open( void );

    #include <otf.h>
    OTF_HandlerArray *handlers;
    handlers = OTF_HandlerArray_open();

To close the array, use OTF_HandlerArray_close:

    OTF_HandlerArray_close(handlers);

A Sample Handler

    int OTF_handleDefinitionComment( void* fcb,
                                     uint32_t streamid,
                                     const char* comment )
    {
      /* written by user; called by OTF */
    }

The first argument is a file control block. We need to pass this argument and the callback function's address to the OTF reader.

OTF_HandlerArray_setHandler

    int OTF_HandlerArray_setHandler( OTF_HandlerArray *handlers,
                                     OTF_FunctionPointer *pointer,
                                     uint32_t recordtype );

    int OTF_HandlerArray_setFirstHandlerArg( OTF_HandlerArray* handlers,
                                             void* firsthandlerarg,
                                             uint32_t recordtype );

OTF_HandlerArray_setFirstHandlerArg specifies a user-defined pointer that is passed to the handler as its first argument. Useful for keeping track of location.

OTF_HandlerArray_setHandler

    #include <otf.h>
    OTF_HandlerArray *handlers = OTF_HandlerArray_open();
    ...
    /* put the callback routine's address in the array */
    OTF_HandlerArray_setHandler( handlers,
        (OTF_FunctionPointer*) OTF_handleDefinitionComment,
        OTF_DEFINITIONCOMMENT_RECORD );
    /* specify the file position/any address (of say the trace writer) as the first argument */
    OTF_HandlerArray_setFirstHandlerArg( handlers, &fcb,
        OTF_DEFINITIONCOMMENT_RECORD );
    /* invokes OTF_handleDefinitionComment routine */

OTF_Reader_open
Opens a master control file and returns an OTF_Reader

    OTF_Reader *OTF_Reader_open( char *name, OTF_FileManager *manager );

    #include <otf.h>
    OTF_FileManager *manager = OTF_FileManager_open(256);
    OTF_Reader *reader = OTF_Reader_open(inputfile, manager);

User defined handlers for definitions

    int handleDeftimerresolution( void* firsthandlerarg, uint32_t streamid,
        uint64_t ticksPerSecond ) { ... }
    int handleDefprocess( void* firsthandlerarg, uint32_t streamid,
        uint32_t deftoken, const char* name, uint32_t parent ) { ... }
    int handleDefprocessgroup( void* firsthandlerarg, uint32_t streamid,
        uint32_t deftoken, const char* name, uint32_t n, uint32_t* array ) { }
    int handleDeffunction( void* firsthandlerarg, uint32_t streamid,
        uint32_t deftoken, const char* name, uint32_t group, uint32_t scltoken ) { }
    int handleDefcounter( void* firsthandlerarg, uint32_t streamid,
        uint32_t deftoken, const char* name, uint32_t properties,
        uint32_t countergroup, const char* unit ) { }
    ...

User defined handlers for timestamped events

    int handleCounter( void* firsthandlerarg, uint64_t time,
        uint32_t process, uint32_t token, uint64_t value ) { ... }
    int handleRecvmsg( void* firsthandlerarg, uint64_t time,
        uint32_t receiver, uint32_t sender, uint32_t communicator,
        uint32_t msgtype, uint32_t msglength, uint32_t scltoken )
    int handleSendmsg( void* firsthandlerarg, uint64_t time,
        uint32_t sender, uint32_t receiver, uint32_t communicator,
        uint32_t msgtype, uint32_t msglength, uint32_t scltoken )
    int handleEnter( void* firsthandlerarg, uint64_t time,
        uint32_t statetoken, uint32_t cpuid, uint32_t scltoken ) { ... }
    int handleLeave( void* firsthandlerarg, uint64_t time,
        uint32_t statetoken, uint32_t cpuid, uint32_t scltoken ) { ... }

OTF_Reader_readDefinitions

    int OTF_Reader_readDefinitions( OTF_Reader *r, OTF_HandlerArray *handlers );

    #include <otf.h>
    OTF_HandlerArray *handlers = OTF_HandlerArray_open();
    OTF_FileManager *manager = OTF_FileManager_open(100);
    OTF_Reader *reader = OTF_Reader_open(inputFile, manager);
    /* set up handlers */
    OTF_Reader_readDefinitions(reader, handlers);
    /* OTF invokes handlers for process, functions, groups and counters here */

OTF_Reader_readEvents

    int OTF_Reader_readEvents( OTF_Reader *reader, OTF_HandlerArray *handlers );

    #include <otf.h>
    OTF_Reader_readEvents(reader, handlers);
    /* invokes handlers for timestamped message communication,
       routine entry/exit, counter events */
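The reader API above is a callback table: the tool registers one routine and one user pointer per record type, and the library calls the routine for each matching record. The sketch below models that dispatch pattern in plain C without linking against the OTF library. All names in it (HandlerTable, table_set_handler, REC_ENTER and so on) are hypothetical, and it assumes a single shared callback signature, whereas the real library casts each handler to a generic function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record types; the OTF library defines its own constants. */
enum { REC_ENTER = 0, REC_LEAVE = 1, REC_TYPE_COUNT = 2 };

/* Simplification: one callback signature shared by all record types. */
typedef int (*Handler)(void *first_arg, uint64_t time, uint32_t token);

typedef struct {
    Handler handler[REC_TYPE_COUNT];  /* callback per record type */
    void *first_arg[REC_TYPE_COUNT];  /* user pointer per record type */
} HandlerTable;

/* Start with no handlers registered. */
void table_init(HandlerTable *t)
{
    for (size_t i = 0; i < REC_TYPE_COUNT; ++i) {
        t->handler[i] = NULL;
        t->first_arg[i] = NULL;
    }
}

/* Register the callback for one record type. */
void table_set_handler(HandlerTable *t, Handler h, uint32_t type)
{
    assert(type < REC_TYPE_COUNT);
    t->handler[type] = h;
}

/* Register the pointer passed as the callback's first argument. */
void table_set_first_arg(HandlerTable *t, void *arg, uint32_t type)
{
    assert(type < REC_TYPE_COUNT);
    t->first_arg[type] = arg;
}

/* Called once per record by the (modelled) reader loop. Records
 * without a registered handler are skipped and report success. */
int table_dispatch(const HandlerTable *t, uint32_t type,
                   uint64_t time, uint32_t token)
{
    if (type >= REC_TYPE_COUNT || t->handler[type] == NULL)
        return 0;
    return t->handler[type](t->first_arg[type], time, token);
}

/* Example user handler: counts records through its first argument. */
int count_record(void *first_arg, uint64_t time, uint32_t token)
{
    (void)time;
    (void)token;
    ++*(uint64_t *)first_arg;
    return 0;
}
```

Registering count_record for REC_ENTER with a pointer to a counter, then dispatching one enter and one leave record, increments the counter once; the leave record has no handler and is ignored.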

Building OTF Analysis Tools
- Header files are in /include directory
- Libraries are in //lib directory
- Support for Zlib (v1.2.3) is included in libotf.a

    % g++ -c tool.cpp -I/include
    % g++ tool.o -o tool -L//lib -lotf

TAU Parallel Performance System
http://www.cs.uoregon.edu/research/tau/
- Multi-level performance instrumentation
  - Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid
- Integration in complex software, systems, applications

Using TAU: A brief Introduction
- To instrument source code, choose a measurement module:
    % setenv TAU_MAKEFILE /usr/tau-2.16/x86_64/lib/Makefile.tau-mpi-pdt-trace-pgi
- And use tau_f90.sh, tau_cxx.sh or tau_cc.sh as Fortran, C++ or C compilers:
    % mpif90 foo.f90
  changes to
    % tau_f90.sh foo.f90
- Execute the application and then run:
    % tau_treemerge.pl
    % tau2otf tau.trc tau.edf app.otf
    % vampir app.otf

TAU Performance System Architecture
[Figure: architecture diagram; the only surviving label is "event selection".]

TAU Performance System Architecture
[Figure: architecture diagram.]

Program Database Toolkit (PDT)
[Figure: Application / Library sources pass through the C / C++ parser and the Fortran parser (F77/90/95) to IL, then through the C / C++ IL analyzer and the Fortran IL analyzer into Program Database Files, which are accessed through DUCTAPE by the tools below.]
- PDBhtml: Program documentation
- SILOON: Application component glue
- CHASM: C++ / F90/95 interoperability
- TAU_instr: Automatic source instrumentation

TAU Instrumentation Approach
- Support for standard program events
  - Routines
  - Classes and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/End events (user-defined timers)
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of semantic entities for mapping
- Support for event groups
- Instrumentation optimization (eliminate instrumentation in lightweight routines)

TAU Instrumentation
- Flexible instrumentation mechanisms at multiple levels
  - Source code
    - manual (TAU API, TAU Component API)
    - automatic
      - C, C++, F77/90/95 (Program Database Toolkit (PDT))
      - OpenMP (directive rewriting (Opari), POMP spec)
  - Object code
    - pre-instrumented libraries (e.g., MPI using PMPI)
    - statically-linked and dynamically-linked
  - Executable code
    - dynamic instrumentation (pre-execution) (DynInstAPI)
    - virtual machine instrumentation (e.g., Java using JVMPI)
    - Python interpreter based instrumentation at runtime
  - Proxy Components

TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to Open Trace Format (OTF)
  - Trace streams and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Performance measurement for CCA component software

Using TAU
- Configuration
- Instrumentation
  - Manual
  - MPI: Wrapper interposition library
  - PDT: Source rewriting for C, C++, F77/90/95
  - OpenMP: Directive rewriting
  - Component based instrumentation: Proxy components
  - Binary Instrumentation
    - DyninstAPI: Runtime Instrumentation/Rewriting binary
    - Java: Runtime instrumentation
    - Python: Runtime instrumentation
- Measurement
- Performance Analysis

TAU Measurement System Configuration
configure [OPTIONS]

    {-c++=, -cc=}                           Specify C++ and C compilers
    {-pthread, -sproc}                      Use pthread or SGI sproc threads
    -openmp                                 Use OpenMP threads
    -jdk=                                   Specify Java instrumentation (JDK)
    -opari=                                 Specify location of Opari OpenMP tool
    -papi=                                  Specify location of PAPI
    -pdt=                                   Specify location of PDT
    -dyninst=                               Specify location of DynInst Package
    -mpi[inc/lib]=                          Specify MPI library instrumentation
    -shmem[inc/lib]=                        Specify PSHMEM library instrumentation
    -python[inc/lib]=                       Specify Python instrumentation
    -tag=                                   Specify a unique configuration name
    -epilog=                                Specify location of EPILOG
    -slog2                                  Build SLOG2/Jumpshot tracing package
    -otf=                                   Specify location of OTF trace package
    -arch= (bgl, xt3, ibm64, ibm64linux)    Specify architecture explicitly

TAU Measurement System Configuration
configure [OPTIONS]

    -TRACE               Generate binary TAU traces
    -PROFILE (default)   Generate profiles (summary)
    -PROFILECALLPATH     Generate call path profiles
    -PROFILEPHASE        Generate phase based profiles
    -PROFILEMEMORY       Track heap memory for each routine
    -PROFILEHEADROOM     Track memory headroom to grow
    -MULTIPLECOUNTERS    Use hardware counters + time
    -COMPENSATE          Compensate timer overhead
    -CPUTIME             Use usertime+system time
    -PAPIWALLCLOCK       Use PAPI's wallclock time
    -PAPIVIRTUAL         Use PAPI's process virtual time
    -SGITIMERS           Use fast IRIX timers
    -LINUXTIMERS         Use fast x86 Linux timers

TAU Measurement Configuration Examples

    ./configure -pdt=/opt/ALTIX/pkgs/pdtoolkit-3.9 -mpi

  Configure using PDT and MPI with GNU compilers

    ./configure -papi=/usr/local/packages/papi -pdt=/usr/local/pdtoolkit-3.9
        -mpiinc=/usr/local/include -mpilib=/usr/local/lib -MULTIPLECOUNTERS
        -c++=icpc -cc=icc -fortran=intel -tag=intel91039; make clean install

  Use PAPI counters (one or more) with C/C++/F90 automatic instrumentation. Also instrument the MPI library. Use Intel compilers.
- Typically configure multiple measurement libraries
- Each configuration creates a unique /lib/Makefile.tau stub makefile. It corresponds to the configuration options used, e.g.,
    /opt/tau-2.15.5/x86_64/lib/Makefile.tau-icpc-mpi-pdt
    /opt/tau-2.15.5/x86_64/lib/Makefile.tau-icpc-mpi-pdt-trace

TAU Measurement Configuration Examples

    % cd /usr/tau-2.16/x86_64/lib; ls Makefile.*pgi
    Makefile.tau-pdt-pgi
    Makefile.tau-mpi-pdt-pgi
    Makefile.tau-callpath-mpi-pdt-pgi
    Makefile.tau-mpi-pdt-trace-pgi
    Makefile.tau-mpi-compensate-pdt-pgi
    Makefile.tau-pthread-pdt-pgi
    Makefile.tau-papiwallclock-multiplecounters-papivirtual-mpi-papi-pdt-pgi
    Makefile.tau-multiplecounters-mpi-papi-pdt-trace-pgi
    Makefile.tau-mpi-pdt-epilog-trace-pgi
    Makefile.tau-papiwallclock-multiplecounters-papivirtual-papi-pdt-openmp-opari-pgi

- For an MPI+F90 application, you may want to start with:
    Makefile.tau-mpi-pdt-trace-pgi
  - Supports MPI instrumentation and PDT for automatic source instrumentation for PGI with tracing

Configuration Parameters in Stub Makefiles
- Each TAU stub Makefile resides in //lib directory

- Each stub makefile encapsulates the parameters that TAU was configured with
- Variables:

    TAU_CXX              Specify the C++ compiler used by TAU
    TAU_CC, TAU_F90      Specify the C, F90 compilers
    TAU_DEFS             Defines used by TAU. Add to CFLAGS
    TAU_LDFLAGS          Linker options. Add to LDFLAGS
    TAU_INCLUDE          Header files include path. Add to CFLAGS
    TAU_LIBS             Statically linked TAU library. Add to LIBS
    TAU_SHLIBS           Dynamically linked TAU library
    TAU_MPI_LIBS         TAU's MPI wrapper library for C/C++
    TAU_MPI_FLIBS        TAU's MPI wrapper library for F90
    TAU_FORTRANLIBS      Must be linked in with C++ linker for F90
    TAU_CXXLIBS          Must be linked in with F90 linker
    TAU_INCLUDE_MEMORY   Use TAU's malloc/free wrapper lib
    TAU_DISABLE          TAU's dummy F90 stub library
    TAU_COMPILER         Instrument using tau_compiler.sh script

- Each stub makefile represents a specific instance of the TAU libraries. TAU scripts use stub makefiles to identify what performance measurements are to be performed.

Using TAU
- Install TAU
    % configure [options]; make clean install
- Instrument application manually/automatically
- Typically modify application makefile

  - Select TAU's stub makefile, change name of compiler in Makefile
- Set environment variables
  - TAU Profiling API
  - TAU_MAKEFILE: stub makefile
  - directory where profiles/traces are to be stored
- Execute application
    % mpirun -np a.out;
- Analyze performance data
  - paraprof, vampir, pprof, paraver

TAU's MPI Wrapper Interposition Library
- Uses standard MPI Profiling Interface
  - Provides name shifted interface
    - MPI_Send = PMPI_Send
    - Weak bindings
- Interpose TAU's MPI wrapper library between MPI and TAU
  - -lmpi replaced by -lTauMpi -lpmpi -lmpi
- No change to the source code! Just re-link the application to generate performance data
    setenv TAU_MAKEFILE //lib/Makefile.tau-mpi-[options]
  - Use tau_cxx.sh, tau_f90.sh and tau_cc.sh as compilers

Instrumenting MPI Applications
- Under Linux you may use tau_load.sh to launch un-instrumented programs under TAU
  - Without TAU:
      % mpirun -np 4 ./a.out
  - With TAU:
      % ls /usr/tau/x86_64/lib/libTAU*pgi*
      % mpirun -np 4 tau_load.sh ./a.out
      % mpirun -np 4 tau_load.sh -XrunTAUsh-mpi-pdt-trace-pgi.so a.out
    loads the //lib/libTAUsh-mpi-pdt-trace-pgi.so shared object
- Under AIX, use tau_poe instead of poe
  - Without TAU:
      % poe a.out -procs 8
  - With TAU:
      % tau_poe a.out -procs 8
      % tau_poe -XrunTAUsh-mpi-pdt-trace.so a.out -procs 8
    chooses //lib/libTAUsh-mpi-pdt-trace.so
- No change to source code or executables! No need to re-link!
- Only instruments MPI routines. To instrument user routines, you may need to parse the application source code!

Integration with Application Build Environment
- Try to minimize impact on the user's application build procedures

- Handle process of parsing, instrumentation, compilation, linking
- Dealing with Makefiles
  - Minimal change to application Makefile
  - Avoid changing compilation rules in application Makefile
  - No explicit inclusion of rules for process stages
- Some applications do not use Makefiles
  - Facilitate integration in whatever procedures are used
- Two techniques:
  - TAU shell scripts (tau_[cxx,cc,f90].sh)
    - Invoke the PDT parser, TAU instrumentor, and compiler
  - TAU_COMPILER

Using Program Database Toolkit (PDT)
1. Parse the program to create foo.pdb:
    % cxxparse foo.cpp -I/usr/local/mydir -DMYFLAGS
   or
    % cparse foo.c -I/usr/local/mydir -DMYFLAGS
   or
    % f95parse foo.f90 -I/usr/local/mydir
    % f95parse *.f -omerged.pdb -I/usr/local/mydir -R free
2. Instrument the program:
    % tau_instrumentor foo.pdb foo.f90 -o foo.inst.f90 -f select.tau
3. Compile the instrumented program:
    % ifort foo.inst.f90 -c -I/usr/local/mpi/include -o foo.o

tau_[cxx,cc,f90].sh Improves Integration in Makefiles

    # set TAU_MAKEFILE and TAU_OPTIONS env vars
    CC = tau_cc.sh
    F90 = tau_f90.sh
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o fn.o
    app: $(OBJS)
        $(F90) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .c.o:
        $(CC) $(CFLAGS) -c $<
    .f90.o:
        $(F90) $(FFLAGS) -c $<

AutoInstrumentation using TAU_COMPILER
- $(TAU_COMPILER) stub Makefile variable

- Invokes PDT parser, TAU instrumentor, compiler through tau_compiler.sh shell script
- Requires minimal changes to application Makefile
  - Compilation rules are not changed
  - User adds $(TAU_COMPILER) before compiler name
    - F90=mpxlf90 changes to F90= $(TAU_COMPILER) mpxlf90
- Passes options from TAU stub Makefile to the four compilation stages
- Use tau_cxx.sh, tau_cc.sh, tau_f90.sh scripts OR $(TAU_COMPILER)
- Uses original compilation command if an error occurs

Automatic Instrumentation
- We now provide compiler wrapper scripts
  - Simply replace mpxlf90 with tau_f90.sh
  - Automatically instruments Fortran source code, links with TAU MPI Wrapper libraries.
- Use tau_cc.sh and tau_cxx.sh for C/C++

Before:

    CXX = mpCC
    F90 = mpxlf90_r
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o fn.o
    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

After:

    CXX = tau_cxx.sh
    F90 = tau_f90.sh
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o fn.o
    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

TAU_COMPILER: Improving Integration in Makefiles

    include /usr/tau-2.15.5/x86_64/Makefile.tau-icpc-mpi-pdt
    CXX = $(TAU_COMPILER) mpicxx
    F90 = $(TAU_COMPILER) mpif90
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o fn.o
    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

TAU_COMPILER Commandline Options
- See //bin/tau_compiler.sh -help
- Compilation:
    % mpxlf90 -c foo.f90
  changes to
    % f95parse foo.f90 $(OPT1)
    % tau_instrumentor foo.pdb foo.f90 -o foo.inst.f90 $(OPT2)
    % mpxlf90 -c foo.f90 $(OPT3)

- Linking:
    % mpxlf90 foo.o bar.o -o app
  changes to
    % mpxlf90 foo.o bar.o -o app $(OPT4)
- The default values of options OPT[1-4] may be overridden by the user:
    F90 = $(TAU_COMPILER) $(MYOPTIONS) mpxlf90

TAU_COMPILER Options
Optional parameters for $(TAU_COMPILER): [tau_compiler.sh -help]

    -optVerbose               Turn on verbose debugging messages
    -optDetectMemoryLeaks     Turn on debugging memory allocations/de-allocations to track leaks
    -optPdtGnuFortranParser   Use gfparse (GNU) instead of f95parse (Cleanscape) for parsing Fortran source code
    -optKeepFiles             Does not remove intermediate .pdb and .inst.* files
    -optPreProcess            Preprocess Fortran sources before instrumentation
    -optTauSelectFile=""      Specify selective instrumentation file for tau_instrumentor
    -optLinking=""            Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
    -optCompile=""            Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optPdtF95Opts=""         Add options for Fortran parser in PDT (f95parse/gfparse)
    -optPdtF95Reset=""        Reset options for Fortran parser in PDT (f95parse/gfparse)
    -optPdtCOpts=""           Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optPdtCxxOpts=""         Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    ...

Overriding Default Options: TAU_COMPILER

    include /usr/tau/x86_64/lib/Makefile.tau-icpc-mpi-pdt-trace
    # Fortran .f files in free format need the -R free option for parsing
    # Are there any preprocessor directives in the Fortran source?
    MYOPTIONS= -optVerbose -optPreProcess -optPdtF95Opts="-R free"
    F90 = $(TAU_COMPILER) $(MYOPTIONS) ifort
    OBJS = f1.o f2.o f3.o
    LIBS = -Lappdir -lapplib1 -lapplib2
    app: $(OBJS)
        $(F90) $(OBJS) -o app $(LIBS)
    .f.o:
        $(F90) -c $<

Overriding Default Options: TAU_COMPILER

    % cat Makefile
    F90 = tau_f90.sh
    OBJS = f1.o f2.o f3.o
    LIBS = -Lappdir -lapplib1 -lapplib2
    app: $(OBJS)
        $(F90) $(OBJS) -o app $(LIBS)
    .f90.o:
        $(F90) -c $<
    % setenv TAU_OPTIONS '-optVerbose -optTauSelectFile=select.tau -optKeepFiles'
    % setenv TAU_MAKEFILE /x86_64/lib/Makefile.tau-icpc-mpi-pdt

Optimization of Program Instrumentation
- Need to eliminate instrumentation in frequently executing lightweight routines
- Throttling of events at runtime:

    % setenv TAU_THROTTLE 1
  Turns off instrumentation in routines that execute over 10000 times (TAU_THROTTLE_NUMCALLS) and take less than 10 microseconds of inclusive time per call (TAU_THROTTLE_PERCALL)
- Selective instrumentation file to filter events
    % tau_instrumentor [options] -f
  OR
    % setenv TAU_OPTIONS -optTauSelectFile=tau.txt
- Compensation of local instrumentation overhead
    % configure -COMPENSATE

Selective Instrumentation File
- Specify a list of routines to exclude or include (case sensitive)
- # is a wildcard in a routine name. It cannot appear in the first column.

    BEGIN_EXCLUDE_LIST
    Foo
    Bar
    D#EMM
    END_EXCLUDE_LIST

- Specify a list of routines to include for instrumentation

    BEGIN_INCLUDE_LIST
    int main(int, char **)
    F1
    F3
    END_INCLUDE_LIST

- Specify either an include list or an exclude list!

Selective Instrumentation File
- Optionally specify a list of files to exclude or include (case sensitive)
- * and ? may be used as wildcard characters in a file name

    BEGIN_FILE_EXCLUDE_LIST
    f*.f90
    Foo?.cpp
    END_FILE_EXCLUDE_LIST

- Specify a list of files to include for instrumentation

    BEGIN_FILE_INCLUDE_LIST
    main.cpp
    foo.f90
    END_FILE_INCLUDE_LIST
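The wildcard rules above (# in routine names, * and ? in file names, all case sensitive) can be illustrated with a small matcher. This is a sketch of one plausible reading, not TAU's implementation: it assumes # and * each match any run of characters, including an empty one, and that ? matches exactly one character. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Match text against pattern. any_seq matches any run of characters
 * (possibly empty); any_one matches exactly one character, or is
 * disabled when passed as '\0'. The comparison is case sensitive. */
static int wildcard_match(const char *p, const char *s,
                          char any_seq, char any_one)
{
    if (*p == '\0')
        return *s == '\0';
    if (*p == any_seq) {
        /* Try the rest of the pattern at every suffix of s. */
        for (;; ++s) {
            if (wildcard_match(p + 1, s, any_seq, any_one))
                return 1;
            if (*s == '\0')
                return 0;
        }
    }
    if (*s == '\0')
        return 0;
    if ((any_one != '\0' && *p == any_one) || *p == *s)
        return wildcard_match(p + 1, s + 1, any_seq, any_one);
    return 0;
}

/* Routine names: '#' is the only wildcard. */
int match_routine(const char *pattern, const char *name)
{
    assert(pattern != NULL && name != NULL);
    return wildcard_match(pattern, name, '#', '\0');
}

/* File names: '*' matches a run of characters, '?' a single one. */
int match_file(const char *pattern, const char *name)
{
    assert(pattern != NULL && name != NULL);
    return wildcard_match(pattern, name, '*', '?');
}
```

Under these assumptions, match_routine("D#EMM", "DGEMM") succeeds, and match_file("Foo?.cpp", "Foo12.cpp") fails because ? consumes only one character.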

Selective Instrumentation File User instrumentation commands are placed in INSTRUMENT section ? and * used as wildcard characters for file name, # for routine name \ as escape character for quotes Routine entry/exit, arbitrary code insertion Outer-loop level instrumentation BEGIN_INSTRUMENT_SECTION loops file=foo.f90 routine=matrix# file=foo.f90 line = 123 code = " print *, \" Inside foo\""

exit routine = int foo() code = "cout <<\"exiting foo\"<[-o ] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f ] For selective instrumentation, use f option % tau_instrumentor foo.pdb foo.cpp o foo.inst.cpp f selective.dat % cat selective.dat # Selective instrumentation: Specify an exclude/include list of routines/files. BEGIN_EXCLUDE_LIST void quicksort(int *, int, int) void sort_5elements(int *) void interchange(int *, int *) END_EXCLUDE_LIST BEGIN_FILE_INCLUDE_LIST Main.cpp Foo?.c *.C

END_FILE_INCLUDE_LIST
# Instruments routines in Main.cpp, Foo?.c and *.C files only
# Use BEGIN_[FILE]_INCLUDE_LIST with END_[FILE]_INCLUDE_LIST

Automatic Outer Loop Level Instrumentation
BEGIN_INSTRUMENT_SECTION
loops file="loop_test.cpp" routine="multiply"
# it also understands # as the wildcard in routine name
# and * and ? wildcards in file name.
# You can also specify the full
# name of the routine as is found in profile files.
#loops file="loop_test.cpp" routine="double multiply#"
END_INSTRUMENT_SECTION

% pprof
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0         0.12       25,162           1           1   25162827 int main(int, char **)
100.0        0.175       25,162           1           4   25162707 double multiply()
 90.5       22,778       22,778           1           0   22778959 Loop: double multiply() [ file = line,col = <23,3> to <30,3> ]
  9.3        2,345        2,345           1           0    2345823 Loop: double multiply() [ file = line,col = <38,3> to <46,7> ]

TAU_REDUCE
Reads profile files and rules
Creates selective instrumentation file
Specifies which routines should be excluded from instrumentation
Diagram: profile and rules are the inputs to tau_reduce, which outputs a selective instrumentation file

Optimizing Instrumentation Overhead: Rules
#Exclude all events that are members of TAU_USER
#and use less than 1000 microseconds
TAU_USER:usec < 1000
#Exclude all events that have less than 100
#microseconds and are called only once
usec < 100 & numcalls = 1
#Exclude all events that have less than 1000 usecs per
#call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5
Scientific notation can be used
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
Usage:
% pprof -d > pprof.dat
% tau_reduce -f pprof.dat -r rules.txt -o select.tau
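The rule semantics above (conditions joined by & on one line must all hold, separate lines act as alternatives, and an optional GROUP: prefix restricts a rule to one group) can be sketched in Python as follows. This is a minimal illustration under those assumptions, not the tau_reduce implementation. The helper names and the event dictionary layout (a "groups" set plus the fields usec, numcalls, usecs/call and percent) are hypothetical.

```python
import operator
import re

# Comparison operators that appear in the rules shown above.
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}
CONDITION = re.compile(r"\s*([A-Za-z/]+)\s*([<>=])\s*([0-9.eE+-]+)\s*$")


def parse_rule(line):
    """Split 'GROUP:cond & cond' into (group or None, [(field, op, value), ...])."""
    group = None
    if ":" in line:
        group, line = line.split(":", 1)
        group = group.strip()
    conditions = []
    for part in line.split("&"):
        match = CONDITION.match(part)
        if match is None:
            raise ValueError(f"cannot parse condition: {part!r}")
        field, op, value = match.groups()
        # float() accepts scientific notation such as 4e5.
        conditions.append((field, OPS[op], float(value)))
    return group, conditions


def load_rules(text):
    """Parse rules text, skipping blank lines and '#' comment lines."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            rules.append(parse_rule(line))
    return rules


def is_excluded(event, rules):
    """An event is excluded as soon as any one rule matches it."""
    for group, conditions in rules:
        if group is not None and group not in event["groups"]:
            continue  # rule is restricted to a group this event is not in
        if all(op(event[field], value) for field, op, value in conditions):
            return True
    return False
```

With the rules from the slide loaded through load_rules, an event in TAU_USER that used 500 microseconds in total would be excluded by the first rule, while an event outside that group with the same total would be kept unless one of the other rules matched.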

TAU Tracing Enhancements
Configure TAU with -TRACE -otf= option
% configure -TRACE -otf=
Generates tau_merge, tau2vtf, tau2otf tools in //bin directory
Instrument and execute application
% tau_f90.sh app.f90 -o app
% mpirun -np 4 app
Merge and convert trace files to OTF format
% tau2otf tau.trc tau.edf app.otf [-z] [-n ]
% vampir app.otf
OR use VNG to analyze OTF/VTF trace files

Environment Variables
Configure TAU with -TRACE -otf= option
% configure -TRACE -otf= -MULTIPLECOUNTERS -papi= -mpi -pdt=dir
Set environment variables
% setenv TRACEDIR /p/gm1//traces
% setenv COUNTER1 GET_TIME_OF_DAY (required)
% setenv COUNTER2 PAPI_FP_INS
% setenv COUNTER3 PAPI_TOT_CYC
Execute application
% mpirun -np 32 ./a.out [args]
% tau_treemerge.pl; tau2otf/tau2vtf ...

Outline
An overview of OTF, TAU and Vampir/VNG
OTF
Tools
API
Building trace conversion tools
TAU
Instrumentation
Measurement
Analysis
Scalable Tracing
Vampir
VNG
OTF

Using Vampir Next Generation (VNG)

VNG Timeline Display

VNG Calltree Display

VNG Timeline Zoomed In

VNG Grouping of Interprocess Communications

VNG Process Timeline with PAPI Counters

OTF/VNG Support for Counters

VNG Communication Matrix Display

VNG Message Profile

VNG Process Activity Chart

VNG Preferences

Support Acknowledgements
Lawrence Livermore National Laboratory (LLNL)
Department of Energy (DOE)
Office of Science contracts
LLNL ParaTools/GWT contract
University of Oregon
T.U. Dresden, GWT
Research Centre Juelich
