Optimizing an application program can involve modifying the build process, modifying the source code, or both.
In many instances, optimizing an application program can result in major improvements in run-time performance. Two preconditions should be met, however, before you begin measuring the run-time performance of an application program and analyzing how to improve the performance:
Check the software on your system to ensure that you are using the latest versions of the compiler and the operating system to build your application program. Newer versions of a compiler often perform more advanced optimizations, and newer versions of the operating system often operate more efficiently.
Test your application program to ensure that it runs without errors. Whether you are porting an application from a 32-bit system to Linux Alpha or Tru64 UNIX, or developing a new application, never attempt to optimize an application until it has been thoroughly debugged and tested. (If you are porting an application written in C, compile your program using the C compiler's -message_enable questcode option, and/or use lint with the -Q option to help identify possible portability problems that you may need to resolve.)
After you verify that these conditions have been met, you can begin the optimization process.
The process of optimizing an application can be divided into two separate, but complementary, activities:
Tuning your application's build process so that you use, for example, an optimal set of preprocessing and compilation optimizations
Analyzing your application's source code to ensure that it uses efficient algorithms and that it does not use programming language constructs that can degrade performance
The following sections provide details that relate to these two aspects of the optimization process.
Opportunities to improve an application's run-time performance exist in all phases of the build process. The following sections identify some of the major opportunities that exist in the areas of compiling, linking and loading, preprocessing and postprocessing, and library selection.
Compile your application with the highest optimization level possible, that is, the level that produces the best performance and the correct results. In general, applications that conform to language-usage standards should tolerate the highest optimization levels, and applications that do not conform to such standards may have to be built at lower optimization levels. For details, see cc(1) or Chapter 1.
If your application will tolerate it, compile all of the source files together in a single compilation. Compiling multiple source files increases the amount of code that the compiler can examine for possible optimizations. This can have the following effects:
To take advantage of these optimizations, use the -ifo option together with either -O3 or -O4.
To determine whether the highest level of optimization benefits your particular program, compare the results of two separate compilations of the program, with one compilation at the highest level of optimization and the other compilation at the next lower level of optimization. Some routines may not tolerate a high level of optimization; such routines will have to be compiled separately.
Other compilation considerations that can have a significant impact on run-time performance include the following:
For C applications with numerous floating-point operations, consider using the -fp_reorder option if a small difference in the result is acceptable.

If your C application uses a lot of char, short, or int data items within loops, you may be able to use the C compiler's highest-level optimization option to improve performance. (The highest-level optimization option (-O4) implements byte vectorization, among other optimizations, for Alpha systems.)
[Tru64] For C applications that are thoroughly debugged and that do not generate any exceptions, consider using the -speculate option. When a program compiled with this option is executed, values associated with a variety of execution paths are precomputed so that they are immediately available if they are needed. This "work ahead" operation uses idle machine cycles, so it has no negative effect on performance. Performance is usually improved whenever a precomputed value is used.

The -speculate option can be specified in two forms:

-speculate all
-speculate by_routine

Both options result in exceptions being dismissed: the -speculate all option dismisses exceptions generated in all compilation units of the program, and the -speculate by_routine option dismisses only the exceptions in the compilation unit to which it applies. If speculative execution results in a significant number of dismissed exceptions, performance will be degraded. The -speculate all option is more aggressive and may result in greater performance improvements than the other option, especially for programs doing floating-point computations. The -speculate all option cannot be used if any routine in the program does exception handling; however, the -speculate by_routine option can be used when exception handling occurs outside the compilation unit on which it is used. Neither -speculate option should be used if debugging is being done.

To print a count of the number of dismissed exceptions when the program does a normal termination, specify the following environment variable:

% setenv _SPECULATE_ARGS -stats

The statistics feature is not currently available with the -speculate all option.

Use of the -speculate all and -speculate by_routine options disables all messages about alignment fixups. To generate alignment messages for both speculative and nonspeculative alignment fixups, specify the following environment variable:

% setenv _SPECULATE_ARGS -alignmsg

Both options can be specified as follows:

% setenv _SPECULATE_ARGS -stats -alignmsg
You can use the following compilation options together or individually to improve run-time performance:
Option | Description
-ansi_alias | Specifies whether source code observes ANSI C aliasing rules. ANSI C aliasing rules allow for more aggressive optimizations.
-ansi_args | Specifies whether source code observes ANSI C rules about arguments. If ANSI C rules are observed, special argument-cleaning code does not have to be generated.
-fast | Turns on a set of other optimization options for increased performance.
-feedback | [Tru64] Specifies the name of a previously created feedback file. Information in the file can be used by the compiler when performing optimizations.
-fp_reorder | Specifies whether certain code transformations that affect floating-point operations are allowed.
-G | Specifies the maximum byte size of data items in the small data sections (sbss or sdata).
-inline | Specifies whether to perform inline expansion of functions.
-ifo | Provides improved optimization (interfile optimization) and code generation across file boundaries that would not be possible if the files were compiled separately.
-O | Specifies the level of optimization that is to be achieved by the compilation.
-om | [Tru64] Performs a variety of code optimizations for programs compiled with the -non_shared option.
-preempt_module | Supports symbol preemption on a module-by-module basis.
-speculate | [Tru64] Enables work (for example, load or computation operations) to be done in running programs on execution paths before the paths are taken.
-tune | Selects processor-specific instruction tuning for specific implementations of the Alpha architecture.
-unroll | Controls loop unrolling done by the optimizer at levels -O2 and above.

Using the preceding options may cause a reduction in accuracy and adherence to standards. See cc(1) for details on these options.
For C applications, the compilation option in effect for handling floating-point exceptions can have a significant impact on execution time as follows:
Default exception handling (no special compilation option)

With the default exception-handling mode, overflow, divide-by-zero, and invalid-operation exceptions always signal the SIGFPE exception handler. Also, any use of an IEEE infinity, an IEEE NaN (not-a-number), or an IEEE denormalized number will signal the SIGFPE exception handler. By default, underflows silently produce a zero result, although the compilers support a separate option that allows underflows to signal the SIGFPE handler.
The default exception-handling mode is suitable for any portable program that does not depend on the special characteristics of particular floating-point formats. The default mode provides the best exception-handling performance.
Portable IEEE exception handling (-ieee)

With the portable IEEE exception-handling mode, floating-point exceptions do not signal unless a special call is made to enable the fault. This mode correctly produces and handles IEEE infinity, IEEE NaNs, and IEEE denormalized numbers. This mode also provides support for most of the nonportable aspects of IEEE floating point: all status options and trap enables are supported, except for the inexact exception. (See ieee(3) for information on the inexact exception feature (-ieee_with_inexact). Using this feature can slow down floating-point calculations by a factor of 100 or more, and few, if any, programs have a need for its use.)
The portable IEEE exception-handling mode is suitable for any program that depends on the portable aspects of the IEEE floating-point standard. This mode is usually 10-20 percent slower than the default mode, depending on the amount of floating-point computation in the program. In some situations, this mode can increase execution time by more than a factor of two.
If your application does not use many large libraries, consider linking it nonshared. This allows the linker to optimize calls into the library, which decreases your application's startup time and improves run-time performance (if calls are made frequently). Nonshared applications, however, can use more system resources than call-shared applications. If you are running a large number of applications simultaneously and the applications have a set of libraries in common (for example, libX11 or libc), you may increase total system performance by linking them as call-shared. See Chapter 3 for details.

For applications that use shared libraries, ensure that those libraries can be quickstarted. Quickstarting is a Tru64 UNIX capability that can greatly reduce an application's load time. For many applications, load time is a significant percentage of the total time that it takes to start and run the application. If an object cannot be quickstarted, it still runs, but startup time is slower. See Section 3.7 for details.
9.1.2.1 [Tru64] Using the Postlink Optimizer

You perform postlink optimizations by using the -om option on the cc command line. This option must be used with the -non_shared option and must be specified when performing the final link. For example:

% cc -om -non_shared prog.c

The postlink optimizer performs the following code optimizations:

Removal of nop (no operation) instructions, that is, those instructions that have no effect on machine state.

Removal of .lita data; that is, that portion of the data section of an executable image that holds address literals for 64-bit addressing. Using available options, you can remove unused .lita entries after optimization and then compress the .lita section.

Reallocation of common symbols according to a size you determine.

When you use the -om option, you get the full range of postlink optimizations. To specify a specific postlink optimization, use the -WL compiler option, followed by one of the following options:

This option removes unused .lita entries after optimization, then compresses the .lita section.

This option removes dead code (unreachable code) generated after optimizations have been applied. The .lita section is not compressed by this option.

This option directs the compiler to use the pixie-produced information in file.Counts and file.Addrs to reorganize the instructions to reduce cache thrashing.

This option turns off instruction scheduling.

This option turns off alignment of labels. Normally, the -om option will align the targets of all branches on quadword boundaries to improve loop performance.

This option sets the size threshold of "common" symbols. Every "common" symbol whose size is less than or equal to num will be allocated close together.

For more information, see cc(1).

9.1.3 [Tru64] Preprocessing and Postprocessing Considerations

Preprocessing options and postprocessing (run-time) options that can affect performance include the following:

Use the Kuck & Associates Preprocessor (KAP) tool to gain extra optimizations. The preprocessor uses final source code as input and produces an optimized version of the source code as output. KAP is especially useful for applications with the following characteristics on both symmetric multiprocessing systems (SMP) and uniprocessor systems:

To take advantage of the parallel processing capabilities of SMP systems, the KAP preprocessors support automatic and directed decomposition for C programs. KAP's automatic decomposition feature analyzes an existing program to locate loops that are candidates for parallel execution. Then, it decomposes the loops and inserts all necessary synchronization points. If more control is desired, the programmer can manually insert directives to control the parallelization of individual loops. On Tru64 UNIX systems, KAP uses DECthreads to implement parallel processing.

For C programs, KAP is invoked with the kapc command (which invokes separate KAP processing) or the kcc command (which invokes combined KAP processing and Compaq C compilation). For information on how to use KAP on a C program, see the KAP for C for Tru64 UNIX User Guide.

KAP is available for Tru64 UNIX systems as a separately orderable layered product.

Use the cord utility (-cord cc command option) to improve the instruction cache behavior for C applications. This utility uses data in a feedback file from an actual run of your application to improve your application's use of the instruction cache. Section 7.4.2.3 shows how to create a feedback file and use the cord utility. (If you have produced a feedback file and you are going to compile your program with the -non_shared option, it is better to use the feedback file with the -om option than with the -cord option. See Section 9.1.2.1 for details on the om utility.)

To improve compiler optimizations, try recompiling your C programs with a feedback file. The C compilers can make use of data from an actual run of the program to fine-tune their optimizations. The feedback information is most useful at the highest two levels of optimization (-O3 or -O4). See Section 7.4.2.2 for information on how to create and use feedback files for profile-directed optimization. If you are compiling a program with a feedback file and with the -non_shared option, it is better to use the -prof_use_om_feedback option than the -prof_use_feedback or -feedback options. (See Section 9.1.2.1 for details on the om utility.)
Library routine options that can affect performance include the following:
Use the Compaq Extended Math Library (CXML) for applications that perform numerically intensive operations. CXML is a collection of mathematical routines that are optimized for Alpha systems -- both SMP systems and uniprocessor systems. The routines in CXML are organized in the following four libraries:
BLAS -- A library of basic linear algebra subroutines
LAPACK -- A linear algebra package of linear system and eigensystem problem solvers
Sparse Linear System Solvers -- A library of direct and iterative sparse solvers
Signal Processing -- A basic set of signal-processing functions, including one-, two-, and three-dimensional fast Fourier transforms (FFTs), group FFTs, sine/cosine transforms, convolution functions, correlation functions, and digital filters
By using CXML, applications that involve numerically intensive
operations may run significantly faster on Tru64 UNIX systems, especially
when used with KAP.
CXML routines can be called explicitly from your program
or, in certain cases, from KAP (that is, when KAP recognizes opportunities
to use the CXML routines).
You access CXML by specifying the
-ldxml
option on the compilation command line.
For details on CXML, see the Compaq Extended Mathematical Library for Tru64 UNIX Systems Reference Manual.
The CXML routines are written in Fortran. For information on calling Fortran routines from a C program, see the Compaq Fortran (formerly Digital Fortran) user manual for Tru64 UNIX. (Information about calling CXML routines from C programs is also provided in the TechAdvantage C/C++ Getting Started Guide.)
[Tru64] If your application does not require
extended-precision accuracy, you can use math library routines that are faster
but slightly less accurate.
Specifying the
-D_FASTMATH
option on the compilation command causes the compiler to use faster
floating-point routines at the expense of three bits of floating-point accuracy.
See
cc
(1)
for details.
[Tru64]
Consider compiling your C programs with
the
-D_INTRINSICS
and
-D_INLINE_INTRINSICS
options; this causes the compiler to inline calls to certain standard
C library routines.
If you are willing to modify your application, use the profiling tools to determine where your application spends most of its time. Many applications spend most of their time in a few routines. Concentrate your efforts on improving the speed of those heavily used routines.
Several profiling tools that work for programs written in C and other languages are available. See the following for more details:
gprof(1).

[Tru64] Chapter 6, Chapter 7, Chapter 8, prof_intro(1), hiprof(1), pixie(1), prof(1), third(1), uprofile(1), and atom(1).
After you identify the heavily used portions of your application, consider the algorithms used by that code. Is it possible to replace a slow algorithm with a more efficient one? Replacing a slow algorithm with a faster one often produces a larger performance gain than tweaking an existing algorithm.
When you are satisfied with the efficiency of your algorithms, consider making code changes to help the compiler optimize the object code that it generates for your application. High Performance Computing by Kevin Dowd (O'Reilly & Associates, Inc., ISBN 1-56592-032-5) is a good source of general information on how to write source code that maximizes optimization opportunities for compilers.
The following sections identify performance opportunities involving data types, I/O handling, cache usage and data alignment, and general coding issues.
Data type considerations that can affect performance include the following:
The smallest unit of efficient access on Alpha systems is 32 bits. A 32- or 64-bit data item can be accessed with a single, efficient machine instruction. If your application's performance on older implementations of the Alpha architecture (processors earlier than EV56) is critical, you may want to consider the following points.
Avoid using integer and logical data types that are less than 32 bits, especially for scalars that are used frequently.
In C programs, consider replacing char and short declarations with int and long declarations.
Division of integer quantities is slower than division of floating-point quantities. If possible, consider replacing such integer operations with equivalent floating-point operations.
Integer division operations are not native to the Alpha processor and must be emulated in software, so they can be slow. Other non-native operations include transcendental operations (for example, sine and cosine) and square root.
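The data-type guidance above can be sketched in C. This is an illustrative example (the function names are hypothetical, not from this manual): both routines compute the same sum, but the first uses sub-32-bit scalars, which pre-EV56 Alpha processors must access with multi-instruction sequences, while the second uses the natural 64-bit long.

```c
/* Loop counter and bound declared short: may be slower on older
 * (pre-EV56) Alpha processors, which lack byte/word load-store. */
int sum_bytes_short(const char *buf, short n)
{
    short i;                       /* sub-32-bit scalar */
    int sum = 0;

    for (i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Same loop with a long counter, the natural size on Alpha. */
int sum_bytes_long(const char *buf, long n)
{
    long i;                        /* 64-bit scalar */
    int sum = 0;

    for (i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}
```

Both versions return identical results; only the width of the loop scalars differs.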
Cache usage patterns can have a critical impact on performance:
If your application has a few heavily used data structures, try to allocate these data structures on cache line boundaries in the secondary cache. Doing so can improve the efficiency of your application's use of cache. See Appendix A of the Alpha Architecture Reference Manual for additional information.
Look for potential data cache collisions between heavily used data structures. Such collisions occur when the distance between two data structures allocated in memory is equal to the size of the primary (internal) data cache. If your data structures are small, you can avoid this by allocating them contiguously in memory. You can use the uprofile tool to determine the number of cache collisions and their locations. See Appendix A of the Alpha Architecture Reference Manual for additional information on data cache collisions.
Data alignment can also affect performance. By default, the C compiler aligns each data item on its natural boundary; that is, it positions each data item so that its starting address is an even multiple of the size of the data type used to declare it. Data not aligned on natural boundaries is called misaligned data. Misaligned data can slow performance because it forces the software to make necessary adjustments at run time.
In C programs, misalignment can occur when you type cast a pointer variable from one data type to a larger data type; for example, type casting a char pointer (1-byte alignment) to an int pointer (4-byte alignment) and then dereferencing the new pointer may cause unaligned access. Also in C, creating packed structures using the #pragma pack directive can cause unaligned access. (See Chapter 2 for details on the #pragma pack directive.)
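The casting pitfall just described can be sketched as follows. This is an illustrative example (the function names are hypothetical): dereferencing an int pointer formed from an odd char-pointer offset may cause an unaligned access, while memcpy reads the same bytes without any alignment requirement.

```c
#include <string.h>

/* Read an int from an arbitrary (possibly misaligned) byte offset.
 * The commented-out cast is the pattern that can fault or trigger
 * an alignment fixup; memcpy is alignment-neutral. */
int read_int_at_offset(const char *buf, int offset)
{
    int v;

    /* Unsafe when buf + offset is not 4-byte aligned:
     *     v = *(const int *)(buf + offset);
     */
    memcpy(&v, buf + offset, sizeof v);   /* safe for any offset */
    return v;
}

/* Helper that stores an int at an odd address and reads it back. */
int roundtrip_misaligned(int x)
{
    char buf[16];

    memcpy(buf + 1, &x, sizeof x);        /* misaligned store */
    return read_int_at_offset(buf, 1);
}
```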
To correct alignment problems in C programs, you can use the -align option or you can make necessary modifications to the source code. If instances of misalignment are required by your program for some reason, use the __unaligned data-type qualifier in any pointer definitions that involve the misaligned data. When data is accessed through the use of a pointer declared __unaligned, the compiler generates the additional code necessary to copy or store the data without generating alignment errors. (Alignment errors have a much more costly impact on performance than the additional code that is generated.)
Warning messages identifying misaligned data are not issued during the compilation of C programs.
During execution of any program, the kernel issues warning messages ("unaligned access") for most instances of misaligned data. The messages include the program counter (PC) value for the address of the instruction that caused the misalignment.
You can use either of the following two methods to access code that causes the unaligned access fault:
By using a debugger to examine the PC value presented in the "unaligned access" message, you can find the routine name and line number for the instruction causing the misalignment. (In some cases, the "unaligned access" message results from a pointer passed by a calling routine. The return address register (ra) contains the address of the calling routine -- if the contents of the register have not been changed by the called routine.)
By turning off the -align option on the command line and running your program in a debugger session, you can examine your program's stack and variables at the point where the debugger stops due to the unaligned access.
For additional information on data alignment, see Appendix A in the Alpha Architecture Reference Manual. See cc(1) for details on alignment-control options that you can specify on compilation command lines.
General coding considerations specific to C applications include the following:
Use libc functions (for example: strcpy, strlen, strcmp, bcopy, bzero, memset, memcpy) instead of writing similar routines or your own loops. These functions are hand coded for efficiency.
Use the unsigned data type for variables wherever possible because:

The variable is always greater than or equal to zero, which enables the compiler to perform optimizations that would not otherwise be possible.

The compiler generates fewer instructions for all unsigned divide operations.
Consider the following example:
int long i; unsigned long j;
.
.
.
return i/2 + j/2;
In the example, i/2 is an expensive expression; however, j/2 is inexpensive. The compiler generates three instructions for the signed i/2 operation:

addq $1, 1, $28
cmovge $1, $1, $28
sra $28, 1, $2

The compiler generates only one instruction for the unsigned j/2 operation:

srl $3, 1, $4
Also, consider using the -unsigned option to treat all char declarations as unsigned char.
If your application uses large amounts of data for a short period of time, consider allocating the data dynamically with the malloc function instead of declaring it statically. When you have finished using the memory, free it so it can be used for other data structures later in your program. Using this technique to reduce the total memory usage of your application can substantially increase the performance of applications running in an environment in which physical memory is a scarce resource.
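A minimal sketch of this allocate-use-free pattern (the function name and workload are hypothetical): the scratch buffer exists only while the computation needs it, and is freed immediately afterward so the memory can be reused.

```c
#include <stdlib.h>

/* Allocate a large temporary buffer dynamically, use it, and free it
 * promptly instead of declaring it as static storage. */
double scratch_sum(size_t n)
{
    size_t i;
    double sum = 0.0;
    double *tmp = malloc(n * sizeof *tmp);   /* dynamic, not static */

    if (tmp == NULL)
        return -1.0;                         /* allocation failure */
    for (i = 0; i < n; i++)
        tmp[i] = (double)i;                  /* fill scratch data   */
    for (i = 0; i < n; i++)
        sum += tmp[i];
    free(tmp);                               /* release for later use */
    return sum;
}
```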
If an application uses the malloc function extensively, you may be able to improve the application's performance (processing speed, memory utilization, or both) by using malloc's control variables to tune memory allocation. See malloc(3) for details.
If your application uses local arrays whose sizes are unknown at compile time, you can gain a performance advantage by allocating them with the alloca function, which uses very few instructions and is very efficient. Storage allocated by the alloca function is automatically reclaimed when an exit is made from the routine in which the allocation is made. You can also use variable length arrays.

The alloca function allocates space on the stack, not the heap, so you must make sure that the object being allocated does not exhaust all of the free stack space. If the object does not fit in the stack, a core dump is issued.

Programs that issue calls to the alloca function should include the alloca.h header file. If the header file is not included, the program will execute properly, but it will run much slower.
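A short sketch of the alloca pattern (the function name is hypothetical): a run-time-sized temporary array is placed on the stack and reclaimed automatically when the routine returns.

```c
#include <alloca.h>   /* include this; see the note above on alloca.h */
#include <string.h>

/* Count nonzero entries using a stack-allocated scratch copy.
 * The alloca storage needs no free(); it vanishes on return.
 * Keep n small enough not to exhaust the free stack space. */
int count_nonzero(const int *src, int n)
{
    int i, count = 0;
    int *tmp = alloca(n * sizeof *tmp);   /* stack, not heap */

    memcpy(tmp, src, n * sizeof *tmp);
    for (i = 0; i < n; i++)
        if (tmp[i] != 0)
            count++;
    return count;
}
```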
Minimize type casting, especially type conversion from integer to floating point and from a small data type to a larger data type.
To avoid cache misses, make sure that multidimensional arrays are traversed in natural storage order; that is, in row major order with the rightmost subscript varying fastest and striding by 1. Avoid column major order (which is used by Fortran).
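The natural storage order described above can be sketched as follows (the function names are hypothetical): the rightmost subscript varies in the inner loop, so successive iterations touch adjacent memory with stride 1.

```c
#define ROWS 4
#define COLS 8

/* Row-major traversal: leftmost subscript outer, rightmost inner. */
long sum_row_major(int a[ROWS][COLS])
{
    long sum = 0;
    int i, j;

    for (i = 0; i < ROWS; i++)        /* outer: leftmost subscript  */
        for (j = 0; j < COLS; j++)    /* inner: stride-1 accesses   */
            sum += a[i][j];
    return sum;
}

/* Driver that fills the array with 1s and sums it. */
long sum_of_ones(void)
{
    int a[ROWS][COLS];
    int i, j;

    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            a[i][j] = 1;
    return sum_row_major(a);
}
```

Reversing the loop order (j outer, i inner) would give the same result but stride through memory in COLS-sized jumps, defeating the cache.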
[Tru64] If your application fits in a 32-bit address space and allocates large amounts of dynamic memory by allocating structures that contain many pointers, you may be able to save significant amounts of memory by using the -xtaso option. To use the option, you must modify your source code with a C-language pragma that controls pointer size allocations. See cc(1) and Chapter 1 for details.
Do not use indirect calls in C programs (that is, calls that use routines or pointers to functions as arguments). Indirect calls introduce the possibility of changes to global variables. This effect reduces the amount of optimization that can be safely performed by the optimizer.
Use functions to return values instead of reference parameters.
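A minimal sketch of the two styles (the function names are hypothetical): the value-returning form keeps its result in a register, while the reference-parameter form forces the compiler to assume *out may alias other memory.

```c
/* Preferred: return the result by value; no aliasing is introduced. */
int add_by_value(int a, int b)
{
    return a + b;
}

/* Discouraged: a reference (pointer) out-parameter that the compiler
 * must treat as a possible alias of other objects. */
void add_by_reference(int a, int b, int *out)
{
    *out = a + b;
}
```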
Use do while instead of while or for whenever possible. With do while, the optimizer does not have to duplicate the loop condition in order to move code from within the loop to outside the loop.
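A small sketch of the preference for do while (the function name is hypothetical): because the body is known to execute at least once, the optimizer need not duplicate the loop test when hoisting code out of the loop.

```c
/* Sum 1..n with a do-while loop; the caller must ensure n >= 1,
 * which is exactly the guarantee do-while expresses. */
int sum_to(int n)
{
    int sum = 0;
    int i = 1;

    do {
        sum += i;
        i++;
    } while (i <= n);
    return sum;
}
```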
Use local variables and avoid global variables. Declare any variable outside of a function as static, unless that variable is referenced by another source file. Minimizing the use of global variables increases optimization opportunities for the compiler.
Use value parameters instead of reference parameters or global variables. Reference parameters have the same degrading effects as pointers.
Write straightforward code. For example, do not use ++ and -- operators within an expression. When you use these operators for their values instead of their side-effects, you often get bad code. For example, the following coding is not recommended:

while (n--) {
.
.
.
}

The following coding is recommended:

while (n != 0) {
n--;
.
.
.
}
Avoid taking and passing addresses (that is, & values). Using & values can create aliases, make the optimizer store variables from registers to their home storage locations, and significantly reduce optimization opportunities.
Avoid creating functions that take a variable number of arguments. A function with a variable number of arguments causes the optimizer to unnecessarily save all parameter registers on entry.
Declare functions as static unless the function is referenced by another source module. Use of static functions allows the optimizer to use more efficient calling sequences.
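A brief sketch of the static-function guideline (the function names are hypothetical): the helper is invisible outside its source module, so the optimizer is free to inline it or use a cheaper calling sequence.

```c
/* File-local helper: not referenced by any other source module. */
static int square(int x)
{
    return x * x;
}

/* Externally visible entry point that uses the static helper. */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}
```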
Also, avoid aliases where possible by introducing local variables to store dereferenced results. (A dereferenced result is the value obtained from a specified address.) Dereferenced values are affected by indirect operations and calls, whereas local variables are not; local variables can be kept in registers. Example 9-1 shows how the proper placement of pointers and the elimination of aliasing enable the compiler to produce better code.
Source Code:

int len = 10;
char a[10];

void zero()
{
    char *p;
    for (p = a; p != a + len; )
        *p++ = 0;
}
Consider the use of pointers in Example 9-1. Because the statement *p++ = 0 might modify len, the compiler must load it from memory and add it to the address of a on each pass through the loop, instead of computing a + len in a register once outside the loop.
Two different methods can be used to increase the efficiency of the code used in Example 9-1:
Use subscripts instead of pointers. As shown in the following example, the use of subscripting in the azero procedure eliminates aliasing; the compiler keeps the value of len in a register, saving two instructions, and still uses a pointer to access a efficiently, even though a pointer is not specified in the source code:

Source Code:

char a[10];
int len;

void azero()
{
    int i;
    for (i = 0; i != len; i++)
        a[i] = 0;
}
Use local variables. As shown in the following example, specifying len as a local variable or formal argument ensures that aliasing cannot take place and permits the compiler to place len in a register:

Source Code:

char a[10];

void lpzero(len)
int len;
{
    char *p;
    for (p = a; p != a + len; )
        *p++ = 0;
}