1. The C programming language
Why we love to hate C
and hate that we love C
Bent Thomsen
Institut for Datalogi, AAU
InfinIT Embedded Systems Engineering interest group meeting
21.11.2018
2. “C is quirky, flawed, and a tremendous success”
Dennis Ritchie, inventor of C
#include <stdio.h>
int main() {
/* my first program in C */
printf("Hello, World! \n");
return 0;
}
4. The IT world runs on C
• v6UNIX About 9,000 LoC – mainly C and assembly
• L4 micro kernel About 6,400 LoC – mainly C and assembly
• seL4 About 10,200 LoC – mainly C and assembly
• Linux 2.6 About 5.6M LoC – mainly C and assembly
• Current Linux About 16M LoC – mainly C and assembly
• Current Mac OS X About 13M LoC – some C, but mainly Objective-C
• Current iOS About 12M LoC – some C, but mainly Objective-C
• Android About 15M LoC – C, C++, Java and others
• NT 3.1 About 5M LoC – C, C++ and assembly
• XP About 48M LoC - C, C++ and assembly
• Windows 10 About 60M LoC - C, C++ and assembly
Source: Various more or less reliable sources on the web
6. C Language History
• Dennis Ritchie developed the language C in years 1969-1973
• C was developed as a high level language for writing the Unix OS
• C was based on CPL, BCPL and B – also influenced by Algol
• In 1978 Dennis Ritchie and Brian Kernighan wrote a book on C:
• The C Programming Language, Prentice-Hall. ISBN 0-13-110163-3.
• includes a definition of the C language in BNF
• instrumental in removing syntax ambiguities between different versions of the language
• In 1983 the American National Standards Institute (ANSI)
established a committee with the goal of producing a C standard
• In 1989 ANSI published the first C standard (C89), which ISO adopted in 1990 (C90)
• The C standard has been revised in 1999 (C99), 2011 (C11) and 2018 (C18)
8. C99
• C99 introduced several new features, many of which had already been implemented as extensions in several compilers:[4]
• inline functions
• intermingled declarations and code: variable declaration is no longer restricted to file scope or the start of a compound statement (block), facilitating
static single assignment form
• several new data types, including long long int, optional extended integer types, an explicit boolean data type, and a complex type to represent
complex numbers
• variable-length arrays (although subsequently relegated in C11 to a conditional feature that implementations are not required to support)
• flexible array members
• support for one-line comments beginning with //, as in BCPL, C++ and Java
• new library functions, such as snprintf
• new headers, such as <stdbool.h>, <complex.h>, <tgmath.h>, and <inttypes.h>
• type-generic math (macro) functions, in <tgmath.h>, which select a math library function based upon float, double, or long double arguments, etc.
• improved support for IEEE floating point
• designated initializers
• compound literals (for instance, it is possible to construct structures in function calls: function((struct x) {1, 2}))[5]
• support for variadic macros (macros with a variable number of arguments)
• restrict qualification allows more aggressive code optimization, removing compile-time array access advantages previously held by FORTRAN over
ANSI C[6]
• universal character names, which allow user variables to contain characters outside the standard character set
• keyword static in array indices in parameter declarations[7]
9. C11
• The standard includes several changes to the C99 language and library specifications, such as:[9]
• Alignment specification (_Alignas specifier, _Alignof operator, aligned_alloc function, <stdalign.h> header file)
• The _Noreturn function specifier and the <stdnoreturn.h> header file
• Type-generic expressions using the _Generic keyword. For example, the following macro cbrt(x) translates to cbrtl(x), cbrt(x) or cbrtf(x) depending on the type of x:
#define cbrt(x) _Generic((x), long double: cbrtl, \
                              default: cbrt,      \
                              float: cbrtf)(x)
• Multi-threading support (_Thread_local storage-class specifier, <threads.h> header including thread creation/management functions, mutex, condition variable and thread-specific storage functionality, as well as <stdatomic.h>[10] for atomic operations supporting the C11 memory model).
• Improved Unicode support based on the C Unicode Technical Report ISO/IEC TR 19769:2004 (char16_t and char32_t types for storing UTF-16/UTF-32 encoded data, including conversion functions in <uchar.h> and the corresponding u and U string literal prefixes, as well as the u8 prefix for UTF-8 encoded literals).[11]
• Removal of the gets function, deprecated in the previous C language standard revision, ISO/IEC 9899:1999/Cor.3:2007(E), in favor of a new safe alternative, gets_s.
• Bounds-checking interfaces (Annex K).[12]
• Analyzability features (Annex L).
• More macros for querying the characteristics of floating point types, concerning subnormal floating point numbers and the number of decimal digits the type is able to store.
• Anonymous structures and unions, useful when unions and structures are nested, e.g. in struct T { int tag; union { float x; int n; }; };.
• Static assertions, which are evaluated during translation at a later phase than #if and #error, when types are understood by the translator.
• An exclusive create-and-open mode ("…x" suffix) for fopen. This behaves like O_CREAT|O_EXCL in POSIX, which is commonly used for lock files.
• The quick_exit function as a third way to terminate a program, intended to do at least minimal deinitialization if termination with exit fails.[13]
• Macros for the construction of complex values (partly because real + imaginary*I might not yield the expected value if imaginary is infinite or NaN).[14]
10. Towards the C2x standard
• A clarification of the restrict keyword.
• The restrict keyword is used to inform the compiler that a given object in memory can only be accessed by that pointer, as an
optimization. In C2x, there will be more detailed examples of how restrict ought to behave, the better to ensure compilers don’t end up
making unsafe or unnecessary optimizations.
• Making static_assert behave the same in C as it does in C++.
• The static_assert declaration, found in both C and C++, is used to ensure that a given constant expression is valid at compile time, but it is
implemented differently in the two languages. With this change, the C2x version will behave the same as the C++ version, making it easier to
share header code between the languages and to translate between C and C++.
• Better definitions for behavior of unions.
• Different implementations of C have different behavior when it comes to anonymous unions, a feature added in C11. C2x clarifies how this
works so it’s not dependent on the implementation.
11. The IT world when C was invented
• The programmer’s view of the computer:
• This model was pretty accurate in 1970
(and lasted until 1985)
• C was as expressively time-accurate as a
language could be: almost all C operators
took one or two cycles.
• Compiler needed to run on PDP-11 with
24KB of memory
12. Programming Languages Genealogy
Fortran (1954)
Algol (1958)
LISP (1957)
Scheme (1975)
CPL (1963), U Cambridge
Combined Programming Language
BCPL (1967), MIT
Basic Combined Programming Language
B (1969), Bell Labs
C (1970), Bell Labs
C++ (1983), Bell Labs
Java (1995), Sun
Objective C
Simula (1967)
13. C as a language
• C is an imperative block structured programming language
• C has variables and assignments (thus a C program manipulates a state)
• C has structured commands like while and for loops
• But also goto
• Blocks of commands are enclosed within { } which defines a scope
• Scopes can be nested
• Flat block structure for functions
• Simplifies compilation and run-time stack management
• Function definitions
• the main() function is where program execution begins
• Function parameters in C are passed by value to the function
• By reference is achieved through pointers
• Functions are usually given a prototype line
• either at the top of the source file, or in a separate file called a header file (.h extension)
• Facilitates separate compilation and mutual recursion in single pass compiler
14. Basic types and type modifiers
• character char
• integer int
• floating point float
• double floating point double
• valueless void
• signed
• unsigned
• short
• long
17. C was designed for a single pass compiler
Mutual recursion problem:
– Every identifier must be declared
before it is used.
– How to handle mutual recursion then?
void ping(int x)
{
pong(x-1); ...
}
void pong(int x)
{
ping(x); ...
}
void pong(int x);   /* prototype: declares pong before its first use */
void ping(int x)
{
    pong(x-1); ...
}
void pong(int x)
{
    ping(x); ...
}
OK!
18. Criteria in a good language design and C
• Readability
– understand and comprehend a computation easily and
accurately
• Write-ability
– express a computation clearly, correctly, concisely, and
quickly
• Reliability
– assures a program will not behave in unexpected or
disastrous ways
• Orthogonality
– A relatively small set of primitive constructs can be
combined in a relatively small number of ways
– Every possible combination is legal
– Lack of orthogonality leads to exceptions to rules
• Readability/Writability
– Structured commands:
• if, while, for, etc.
• Even goto requires label not line no.
– Pointers can be hard to read
– { } vs. begin .. end
– = vs. :=
• Reliability
– Type checking, but weak
– No exception mechanism
– No bounds check
– No required variable initialization
• Orthogonality
– No return of arrays
– No nested functions
19. Tennent’s Language Design principles
• C has two main syntactic categories
– Expressions
• Functions
– Commands
• Procedures – void functions
• Thus C satisfies the Principle of Abstraction
• Some violation of the Principle of Data Type Completeness
– E.g. no array return
20. Principle of correspondence
• Example in Pascal:
var i : integer;
begin
i := -j;
write(i)
end
and
procedure p(i : integer);
begin
write(i)
end;
begin p(-j) end
• Are equivalent
• Example in C
int i;
int main(void) {
    i = -j;
    printf("%d", i);
}
and
void p(int i) {
    printf("%d", i);
}
int main(void) {
    p(-j);
}
• Are equivalent
21. Example of missing correspondence
In Pascal:
procedure inc(var i : integer);
begin
i := i + 1
end;
var x : integer;
begin
x := 1;
inc(x);
writeln(x);
end
Pascal has no declaration that corresponds to a var parameter.
C, however, has correspondence:
void inc(int *i) {
    *i = *i + 1;
}
int x = 1;
inc(&x);
printf("%d", x);
and
int x = 1;
{
    int *i = &x;
    *i = *i + 1;
}
printf("%d", x);
22. Pointers in C and the C memory model
int main (int argc, char *argv[]) {
    int x = 4;
    int *y = &x;
    int *z[4] = {NULL, NULL, NULL, NULL};
    int a[4] = {1, 2, 3, 4};
    ...

Program Memory
Address   Value   Name
0x3dc     4       x
0x3d8     0x3dc   y
0x3d4     NA
0x3d0     NA
0x3cc     0       z[3]
0x3c8     0       z[2]
0x3c4     0       z[1]
0x3c0     0       z[0]
0x3bc     4       a[3]
0x3b8     3       a[2]
0x3b4     2       a[1]
0x3b0     1       a[0]

Note: The compiler converts z[1] or *(z+1) to
Value at address (Address of z + sizeof(int));
In C you would write the byte address as:
(char *)z + sizeof(int);
or letting the compiler do the work for you
(int *)z + 1;
23. Beware of pointers
int f (void) {
    int s = 1;
    int t = 1;          /* s == 1, t == 1 */
    int *ps = &s;
    int **pps = &ps;
    int *pt = &t;
    **pps = 2;          /* s == 2, t == 1 */
    pt = ps;
    *pt = 3;            /* s == 3, t == 1 */
    t = s;              /* s == 3, t == 3 */
}
24. Parameter Passing in C (and pointers)
• Actual parameters are transferred by value
void swap (int a, int b) {
int tmp = b; b = a; a = tmp;
}
int main (void) {
int i = 3;
int j = 4;
swap (i, j);
…
}
The value of i (3) is passed, not its location!
swap does nothing
25. Parameter Passing in C
• Can pass addresses around
void swap (int *a, int *b) {
int tmp = *b; *b = *a; *a = tmp;
}
int main (void) {
int i = 3;
int j = 4;
swap (&i, &j);
…
}
The value of &i is passed, which is the address of i
28. Beware!
int *value (void)
{
int i = 3;
return &i;
}
void callme (void)
{
int x = 35;
}
int main (void) {
int *ip;
ip = value ();
printf ("*ip == %d\n", *ip);
callme ();
printf ("*ip == %d\n", *ip);
}
*ip == 3
*ip == 35
But it could really be anything!
29. Manipulating Addresses
char s[6];
s[0] = 'h';
s[1] = 'e';
s[2] = 'l';
s[3] = 'l';
s[4] = 'o';
s[5] = '\0';
printf ("s: %s\n", s);
s: hello
expr1[expr2] in C is just syntactic sugar for
*(expr1 + expr2)
31. In C, undefined behavior means anything can happen
• With undefined behavior, “Anything at all can happen; the Standard
imposes no requirements. The program may fail to compile, or it may
execute incorrectly (either crashing or silently generating incorrect results),
or it may fortuitously do exactly what the programmer intended.” [C FAQ]
• “If any step in a program’s execution has undefined behavior, then the
entire execution is without meaning.
– This is important: it’s not that evaluating (1<<32) has an unpredictable result, but
rather that the entire execution of a program that evaluates this expression is
meaningless.
– Also, it’s not that the execution is meaningful up to the point where undefined
behavior happens: the bad effects can actually precede the undefined operation.”
[Regehr]
[Regehr] Regehr, John. A Guide to Undefined Behavior in C and C++, Parts 1-3. http://blog.regehr.org/archives/213
32. Examples of undefined behaviors
• Pointer
– Dereferencing a NULL pointer
– Using pointers to objects whose lifetime has ended
– Dereferencing a pointer that has not yet been definitely initialized
– Performing pointer arithmetic that yields a result outside the
boundaries (either above or below) of an array.
– Dereferencing the pointer at a location beyond the end of an array
• Buffer overflows
– Reading or writing to an object or array at an offset that is negative, or
beyond the size of that object (stack/heap overflow)
• Integer Overflows
– Signed integer overflow
– Evaluating an expression that is not mathematically defined
– Shifting values by a negative amount (right-shifting a negative
value is implementation defined)
– Shifting values by an amount greater than or equal to the number of
bits in the number
http://stackoverflow.com/questions/367633/what-are-all-the-common-undefined-behaviours-that-a-c-programmer-should-know-a
33. What really happens with undefined code?
• C compilers are allowed to assume that undefined behaviors cannot
happen
– Market pressures for performance encourage this
– Ever-more-aggressive optimizations are increasingly adding dependencies on
this assumption
– Leads to faster code but also turns previously-working code into broken code
• C compilers are no longer “high level assemblers”
– They implement a complex model
34. A trivial division by 0 example
• Trivial example from Linux kernel lib/mpi/mpi-pow.c:
if (!msize)
msize = 1 / msize; /* provoke a signal */
• On gcc with x86, generated signal as expected
• On gcc with PowerPC, does not generate exception
• On clang, no code generated at all
– Division by 0 is undefined behavior
– Since this “can’t” happen, the compiler presumes that on this branch msize != 0
– Since this branch only occurs when msize == 0, it must be impossible, and
compiler removes everything as dead code
[Wang2012] Wang, Xi, et al., Undefined Behavior: What Happened to My Code?,
APSys ‘12, 2012, ACM, https://pdos.csail.mit.edu/papers/ub:apsys12.pdf
35. Optimizer reordering creates debugging snares
void bar (void);
int a;
void foo3(unsigned y, unsigned z)
{
bar();
a = y%z;
}
void bar(void)
{
setlinebuf(stdout);
printf ("hello!\n");
}
int main(void)
{
foo3(1,0);
return 0;
}
On many compilers, when optimizing,
this will crash without printing “hello!”
first, because the compiler moves the y%z
computation in foo3 ahead of the call to bar().
C compilers are allowed to reorder
anything without side effects, and
undefined behaviors don’t need to be
considered as side effects. Source: [Regehr]
36. Clang-compiled program executes function never called (undef issue)
#include <cstdlib>
typedef int (*Function)();
static Function Do;
static int EraseAll() {
return system("rm -rf /");
}
void NeverCalled() {
Do = EraseAll;
}
int main() {
return Do();
}
main:
movl $.L.str, %edi
jmp system
.L.str:
.asciz "rm -rf /"
Source: “Why undefined behavior may call a never-called function”, Krister Walfridsson, September 4,
2017, https://kristerw.blogspot.com/2017/09/why-undefined-behavior-may-call-never.html
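This always runs “rm -rf /”, even though
NeverCalled() is never called.
Do is never assigned before main() calls it, so it is still a
null function pointer and the call is undefined behavior. The
compiler assumes the only value Do can hold at the call is EraseAll.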
37. Seeding random numbers in BSD libc
struct timeval tv;
unsigned long junk; /* XXX left uninitialized on purpose */
gettimeofday(&tv, NULL);
srandom((getpid() << 16) ^ tv.tv_sec ^ tv.tv_usec ^ junk);
Seed inputs: process ID, time, junk value
38. A C programmer’s view of computers
This model was pretty accurate for the PDP-11 in 1973 and worked fine
until 1985. Processors (386, ARM, MIPS, SPARC) all ran at 1–10MHz clock
speed and could access external memory in 1 cycle; and most
instructions took 1 cycle.
Indeed the C language was as expressively time-accurate as a
language could be: almost all C operators took one or two cycles.
But this model is no longer accurate!
39. A modern view of memory timings
So what happened?
On-chip computation (clock-speed) sped up faster (1985–2005) than off-chip
communication (with memory) as feature sizes shrank.
The gap was filled by spending transistor budget on caches which
(statistically) filled the mismatch until 2005 or so.
Techniques like caches, deep pipelining with bypasses, and
superscalar instruction issue burned power to preserve our illusions.
2005 or so was crunch point as faster, hotter, single-CPU Pentiums
were scrapped. These techniques had delayed the inevitable.
40. The Current Mainstream Processor
Will scale to 2, 4 maybe 8 processors.
But ultimately shared memory becomes the bottleneck (1024 processors?!?).
… introduce NUMA (Non Uniform Memory Access) …
42. Concurrency on Multiprocessors
Initially x = y = 0
thread 1     thread 2
x = 1        print y   → 1
y = 1        print x   → 0
Output not consistent with any interleaved execution!
• can be the result of out-of-order stores
• can be the result of out-of-order loads
• improves performance
43. Conclusions
• “C is quirky, flawed, and a tremendous success”
• C is not a bad language (for its time)
• C programming requires (very) skilled programmers
• C programmers often abuse language features
• C undefined behaviour is problematic
• But the C standard is moving in the right direction
• C’s memory model is problematic in the multi-core world
• C is no longer a low level language
• I hate and love C!