Writing Efficient Code Feb 08
Insight
Understand the Machine to Write Efficient Code
How many programmers can actually write assembly programs? With the rising popularity
of high-level languages (like Java, VB.Net, etc), there is rarely any need for programmers
to learn assembly or low-level programming. However, there are domains where writing
efficient code is very important, for example, game programming and scientific computing.
When can we write highly efficient code? It is when we understand how the underlying
machine works and make the best use of that knowledge. One well-known way to write highly
efficient code is to write it in assembly. There are many disadvantages with this; for
example, we cannot port the programs easily, it is difficult to maintain the code, etc. So,
we need to look for alternatives to write efficient code. We can write code in low-level
programming languages like C, whose efficiency is often comparable to that of the
equivalent code written in assembly. For this reason, C is often referred to as a
‘high-level assembler’. In this article, we’ll look at various programming constructs from
the perspective of efficiency. We’ll consider general machine architecture for
illustration; and for specific
12 February 2008 | LINuX For you | www.openITis.com
examples, x86 architecture will be used. A word of caution before we proceed: the
techniques and issues covered are not for general-purpose programming.

Basic types

The machine, in general, understands only three types of values: address, integer and
floating-point values. For representation and manipulation, here is the correspondence
between the types that the machine can understand and what C supports. Addresses correspond
to the pointer construct; integers, both signed and unsigned, correspond to short, int,
long, long long, char (yes, a char is represented as an int internally!) etc;
floating-point types correspond to float, double, long double, etc.

The most efficient data-type that a processor handles is a ‘word’, which corresponds to the
‘int’ type in C. For floating-point types, all computation is typically done in a larger
floating-point type. For example, in x86 machines, all floating-point computation is done
in ‘extended precision’, which is 80 bits in size and usually corresponds to ‘long double’
in C; if floating-point expressions are used in the code, they are internally converted to
extended precision by the processor and the results are converted back to float (which
occupies 32 bits). The processor does floating-point computations in a separate
‘co-processor’ unit.

Address computation (such as array index access) is done using pointers in C. They directly
correspond to memory access operations in the underlying machine (such as index addressing,
in this case).

The ‘int’ type is the most efficient for computations. Unsigned types and operations on
them are as efficient as signed types. Floating-point types and operations are slow
compared to integral types. For example, in an imaginary processor, if integer division
takes four cycles, a floating-point division operation might take 50 to 100 cycles. Memory
access operations are very slow: if the desired memory location is not in cache, then it
might take hundreds of cycles to fetch the data from the main memory.

Operators

C supports a rich set of operators. There are a few operators that are directly supported
by the processor and there are a few that are simulated in the software.

For integral types, bit-manipulation operators are faster compared to other operators like
arithmetic, logical or relational operators. One of the ways to write efficient code is to
write code using bitwise operations instead of other slower operations. Here is a
well-known example: shifting a non-negative integer value right by one bit with the ‘>>’
operator is more efficient than dividing it by 2. We’ll look at a different example for
illustration here.

A typical code segment for toggling a character’s case is to use relational operators, as
in:

    // precondition: the char ch provided is in range ['a'-'z'] or ['A'-'Z']
    char toggle_ascii_char_case(char ch) {
        if ((ch >= 'a') && (ch <= 'z'))        // lower case
            ch = ch - 0x20;
        else if ((ch >= 'A') && (ch <= 'Z'))   // upper case
            ch = ch + 0x20;
        return ch;
    }

The code works on the following assumption: the given char ch is within the range [a-z] or
[A-Z]. If the char is in [A-Z], it returns the corresponding char in [a-z] and vice versa,
that is, it toggles the case of the character. The value 0x20 is added or subtracted based
on the fact that the upper-case and lower-case forms of an alphabetic character are
separated by the hex value 0x20 in the ASCII table.

But this function is slow. If this is a library function used for toggling the case of
characters in a string, then it is not a good implementation.

The code can be improved as follows. The comparison operators are not required, since the
function precondition says that the ch passed to the function is in the given range
[‘a’-‘z’] or [‘A’-‘Z’]. Based on C tradition, we need not check to ensure that the given
char is in fact in this range. Also, the upper-case and lower-case forms of a letter differ
only in the single bit with the value 0x20 (for example, ‘A’ is 0x41 and ‘a’ is 0x61), so
instead of performing either the ‘-’ or the ‘+’ arithmetic operation, we can toggle that
bit using the bit-wise exclusive-or (‘^’) operator. With this the code becomes efficient
and simple:

    // precondition: the char ch provided is in range ['a'-'z'] or ['A'-'Z']
    char toggle_ascii_char_case(char ch)
    {
        return ch ^ 0x20;   // flip the one bit that distinguishes the two cases
    }

This example is just for illustration purposes. Using bit-wise operators obscures the code,
but it usually significantly improves the efficiency of the code.

Control flow

C has various conditional and looping constructs. A C compiler transforms such code
constructs to branching (also known as ‘jump’) instructions. So, goto is the construct that
maps most directly to the machine. It is possible to take any C program and create an
equivalent program by replacing all conditional and looping constructs with tests and goto
statements. Though all of these constructs ultimately become branching instructions, there
are subtle differences between them when it comes to efficiency.

Which one is more efficient: nested if conditions or switch statements? In general, a
switch is more efficient than nested if statements. If both are implemented using
branching, why is switch more efficient than nested if conditions? Recall that, in a switch
statement, all cases are constants. So, a
compiler can transform a switch statement to a range check or a look-up table, which is
more efficient than a long list of jumps.

Note that executing jump instructions is not costly, but unpredictable jumps can result in
considerably slower execution. For example, frequent jumps can result in the flushing of
the pipeline. Similarly, processors typically look ahead in the instruction stream,
pre-fetch the memory locations that will be needed and put them in cache. So, unpredictable
jumps can result in memory faults (cache misses), which will result in wasting hundreds of
cycles since the processor has to wait for the memory value to be available for it to
continue execution. In general, a program with fewer branches is faster than one that has a
large number of branches.

Memory access

As said earlier, memory access is a costly operation. Let us take a specific example to
illustrate this. Typically, it takes the same time to access global data or local (stack
allocated) data. However, it is preferable to use local data instead of global data because
of the well-known ‘principle of locality’. If a memory location is accessed, the processor
doesn’t fetch the value in just that memory location; it fetches many values adjacent to
that memory location since the program is likely to access variables that are located near
that memory location. Fetching a block of data and putting it in cache is not time
consuming, but if there is a memory fault (the data is not in cache), it can lead to the
processor waiting for hundreds of cycles for that memory access to happen. If there are
large numbers of global variables and their accesses are spread throughout the program,
then the program execution becomes considerably slower.

Let us consider another (well-known) example to illustrate this important ‘principle of
locality’. Of the following two loops, which one is more efficient?

    for (i = 0; i < 50; i++)
        for (j = 0; j < 50; j++)
            for (k = 0; k < 50; k++)
                printf("%d ", a[k][j][i]);

    for (i = 0; i < 50; i++)
        for (j = 0; j < 50; j++)
            for (k = 0; k < 50; k++)
                printf("%d ", a[i][j][k]);

C has arrays implemented in row-major order, that is, the last subscript varies fastest, so
a[i][j][k] and a[i][j][k+1] are adjacent in memory. The second loop is more efficient
because it accesses memory locations sequentially, in row-major order. The processor will
fetch the memory blocks into cache, and since the memory access is also sequential, this is
efficient. However, in the first loop, the memory access is not sequential and hence there
might be a lot of memory faults and hence it will be considerably slower than the second
loop.

From these examples we learn that it is important to keep in mind that memory faults are
costly and we need to minimise such memory faults to write efficient code.

Compound types

C supports compound types like structs, unions and bit-fields that are implemented in terms
of other primitive or compound types. The hardware does not understand any compound types,
and all processing is done on primitive types only.

There are many aspects, such as padding and alignment, in using compound types that can
affect the performance. Here, we’ll look at an example of bit-fields to understand how they
are supported by the hardware.

In C, we can manipulate and access bits using bit-fields. It is a compile-time error to
attempt taking the address of a bit-field member. Why? The granularity of addressing in
modern computers is in bytes and it is not possible to take the address of specific bits in
a byte. Though it is space-efficient to use bit-fields, it is not time-efficient since it
is not possible to access individual bits directly; so the compiler emits code to access
‘word’s and then does bit-manipulation to access individual bit-field member values.

The following bit-field struct is to represent time in a day in HH:MM:SS format. For the
hour, the range is 0-23 and for minutes and seconds, the range is 0-59. We can use the
following struct:

    struct time {
        unsigned int hour   : 5;
        unsigned int minute : 6;
        unsigned int second : 6;
    } tm1, tm2;

To access tm1.minute, the compiler has to generate code to access the word (4 bytes) and do
bit-manipulation to extract only the six bits that hold the minute (bits 5 to 10, counting
from 0, in a typical layout) in that word, which is slow. So, avoid using bit-fields
extensively if performance is important for your software. A better option in this case is
to use a struct (without bit-fields) with a byte each for the hour, minute and second,
respectively.

In this article, we explored some fundamental issues to understand how various programming
constructs can affect the efficiency of programs. There are many other issues, such as
unaligned memory access, cost of I/O operations, etc, that are not covered in this article.
This article is just a starting point to understand such problems. If you are interested,
you can read books on assembly language, computer architecture and compiler optimisation to
get a better understanding of the issues related to writing efficient programs.

By: S G Ganesh. The author is a research engineer at Siemens (Corporate Technology),
Bangalore. His latest book is ‘60 Tips on Object Oriented Programming’, published by Tata
McGraw-Hill in December 2007. You can reach him at sgganesh@gmail.com