SlideShare uma empresa Scribd logo
1 de 13
Baixar para ler offline
The tyranny of averages
Author: Mikhail Nikishin
Date: 12.01.2015
Let us begin with mentioning that this article is completely non-serious. New Year is coming, holidays
are almost there and there is no reason to do anything deliberate. That is why we decided to write an
article about, suddenly, statistics.
This article indirectly connected with one of the discussion we were participating in few weeks ago. It
was about the possibility of some consistent patterns in errors in this or that line in duplicated code. We
often refer to the article "The Last Line Effect" - according to our observations, lines of code of the same
type and structure generated by copy and paste technique is more likely to be erroneous in the last line.
The discussion was about the fact of error possibility in other places in duplicating blocks. Unfortunately,
it is difficult to gather statistics about places when the error occurs in these examples. However, it gave
us an idea to make small statistic study on our examples base.
We should mention that we wrote this article in jocose style, because we had not found any real
consistent patterns. Many people can remember that "There are three kinds of lies: lies, damned lies,
and statistics" and distrust any statistic researches. This may be a valid point, because statistics oriented
to mass media is usually used to find relations in cases when there are no connection at all. One of the
most widely known example is "Mars effect". However, that is not our case. We claim from the
beginning that this statistical study does not pretend to be serious. Any statistical dependencies in this
article either obvious, "does not prove any causation" or does not surpass statistical significance due to
small sample size.
Well, let us begin. While Google tries to gather statistics about what people hate, we are trying to gather
statistics about what analyzer hate.
Assumption 1. Some words are more frequent than others
Really? You have got to be kidding, have not you?
Anyone who is familiar with any programming language can tell for sure that some words and symbols
occur in source code more frequently than others do. Even in Brainfuck code symbol '+' is more frequent
than symbol '.'. The only arguable "programming language" used to write real programs is even not an
Assembler but a machine code itself. Experts can also remember other counterexamples from esoteric
languages like Malbolge etc. However, what about C++? It is expected that "int" keyword should be
more frequent than "float", "public" should be more frequent than "protected" and "class" should be
more frequent than "struct" and, all the more so, "union". Still, which words are the most frequent ones
in pieces of C++ code containing errors? We counted word frequency by evaluating the number of
words in all the examples, i.e. if one example contained two occurrences of "if" keyword, then program
counted it twice. Words is comments was omitted. List of the most frequent words is presented below
(number before colon is a number of occurrences in all the examples):
 1323 : if
 798 : int
 699 : void
 686 : i
 658 : const
 620 : return
 465 : char
 374 : static
 317 : else
 292 : sizeof
 258 : bool
 257 : NULL
 239 : s
 223 : for
 194 : unsigned
 187 : n
 150 : struct
 146 : define
 137 : x
 133 : std
 121 : c
 121 : new
 115 : typedef
 113 : j
 107 : d
 105 : a
 102 : buf
 102 : case
"Conclusion": "if" causes many errors.
Next words gives us a light of hope; not the words by itself, but their frequency compared to "if" and
even "case":
 15 : goto
 13 : static_cast
 6 : reinterpret_cast
It looks like not everything is that bad with structure of Open Source applications.
However, words like "auto" are not frequent at all (less than five occurrences) as well as "constexpr", as
"unique_ptr" etc. On the one hand, it was to be expected, because we started collecting examples long
ago, when nobody even thought about implementing C++11 standard. On the other hand, there is
another subtext: language extensions are introduced to decrease the probability of making a mistake.
Let us recall that our base contains only code with errors that was found by PVS-Studio static code
analyzer.
We gathered similar statistics on numbers.
 1304 : 0
 653 : 1
 211 : 2
 120 : 4
 108 : 3
 70 : 8
 43 : 5
 39 : 16
 36 : 64
 29 : 6
 28 : 256
It is curious that number 4 in examples of erroneous code is more frequent then 3; moreover, this fact
is not related to 64-bit diagnostics - even if there are present some errors from 64-bit diagnostics, they
are small in quantity (not more than one or two code example). Majority of examples (at least 99%) are
general analysis errors.
It is likely that four is more frequent than three are, however insignificantly, because four is a "round
number" while three is not (do you understand me?). This is why 8, 16, 64, 256 are in breakaway too.
This is the reasons behind oddity of distribution.
Next is a short test for wits and knowledge. Do you think where those numbers came from, 4996 and
2047?
 6 : 4996
 5 : 2047
The answer is in the end of the next paragraph.
Assumption 2. The most frequent letter is a letter 'e'
According to this statistics, the most frequent letter in formal English is 'e'. Ten most frequent letters in
English are e, t, a, o, i, n, s, h, r, d. We asked ourselves about the frequency of letters in C++ source code
fragments. Then we made another experiment. Approach was even more brutal and heartless than
previous one. We simply counted every symbol in every example. Case did not matter, i.e. 'K' = 'k'. The
results are presented below:
 82100 :
 28603 : e
 24938 : t
 19256 : i
 18088 : r
 17606 : s
 16700 : a
 16466 : .
 16343 : n
 14923 : o
 12438 : c
 11527 : l
The most frequent symbol is a space. In formal English space symbol is slightly more frequent than 'e'
letter, but that is not our case. Space is used widely for indentation, which provides solid first place in
terms of frequency at least in our examples, because we had replaced all the tabs to spaces to ease
formatting. In addition, what about the rest? Letters 'i' (leader on a counters name market since
19XX), 'r' (our assumption - used in names like run, rand, vector, read, write, and, most of all, error)
and 's' (std::string s) are much more frequent. However, due to the large sample size we can claim that
letters 'e' and 't' is also the most frequent letters in C++ source code as well as in formal English.
Some words about the dot. Of course, in real examples dot is not as frequent as it is in the list above.
The thing is our database omits a lot of excess code that is not required to understand errors, and four
dots are used for omitted code. That is why dot is probably not one of the most frequent symbols of C++
language.
Does anybody mentioned entropy encoding?
Okay, let us check it from another point of view. Which symbol is the least frequent one?
 90 : ?
 70 : ~
 24 : ^
 9 : @
 1 : $
In addition, another strange result that amazed us. Look at the quantity of these symbols. It is almost
coincide (somewhere it coincides exactly!). That is weird. How could this happen?
 8167 : (
 8157 : )
 3064 : {
 2897 : }
 1457 : [
 1457 : ]
Ah, well, the promised answer to the question from previous paragraph. 2047 = 2048 - 1, and number
4996 came from lines like
#pragma warning (disable:4996)
Assumption 3. There are dependency between occurrences of some
words
It reminds correlation analysis in some way. The problem was set like that: is there any dependency
between occurrences of some pair of words?
What is the reason behind words "in some way" in previous sentence? We decided to evaluate relative
value that resembles correlation coefficient, but it is not actually correlation coefficient, because it can
change only between 0 and 1 inclusively and is measured for every pair of words (a,b) this way. For
instance, word a was occurred in Na examples, word b - in Nb examples, both a and b in Nab examples.
Given that, Rab = Nab / Na, Rba = Nab / Nb. Using the fact that 0 <= Nab <= Na, Nb; Na, Nb > 0 it is
possible to prove that, obviously, 0 <= Rab, Rba <= 1.
How does it work? Let us make an assumption that word 'void' was encountered in 500 examples, word
'int' in 2000 examples, and both 'void' and 'int' were encountered in 100 examples. Then Rvoid,int = 100
/ 500 = 20%, Rint,void = 100 / 2000 = 5%. Yes, this coefficient is asymmetrical (Rab in general is not
equal to Rba); however, it is hardly an obstacle.
Perhaps, it is possible to talk about an even slightest statistical dependency when R >= 50%. Why 50%?
Just because we wanted to. Actually, thresholds are usually chosen approximately and there are no clear
recommendations. 95% value should, perhaps, indicate strong dependency. Perhabs.
Well, using correlation analysis, we were able to find out these stunning, unorthodox facts:
 In examples with usage of 'else' keyword 'if' keyword is also usually (95.00%) used! (Where
are the rest 5%?)
 In examples with usage of 'public' keyword 'class' keyword is also usually (95.12%) used!
 In examples with usage of 'typename' keyword 'template' keyword is also usually (90.91%)
used!
Et cetera. Here are some "obvious" blocks below.
 100.00% ( 18 / 18) : argc -> argv
 100.00% ( 18 / 18) : argc -> int
 94.44% ( 17 / 18) : argc -> char
 90.00% ( 18 / 20) : argv -> argc
 90.00% ( 18 / 20) : argv -> char
 90.00% ( 18 / 20) : argv -> int
 75.00% ( 12 / 16) : main -> argv
 60.00% ( 12 / 20) : argv -> main
At least it proves that program works, and by 'work' we mean meaningless operations to find all the
dependencies between 'main', 'argc' and 'argv'.
 100.00% ( 11 / 11) : disable -> pragma
 100.00% ( 11 / 11) : disable -> default
 100.00% ( 11 / 11) : disable -> warning
 91.67% ( 11 / 12) : warning -> pragma
 91.67% ( 11 / 12) : warning -> default
 91.67% ( 11 / 12) : warning -> disable
 78.57% ( 11 / 14) : pragma -> warning
 78.57% ( 11 / 14) : pragma -> disable
 78.57% ( 11 / 14) : pragma -> default
 57.89% ( 11 / 19) : default -> warning
 57.89% ( 11 / 19) : default -> disable
 57.89% ( 11 / 19) : default -> pragma
Compiler directives insanity. Analysis has found all the dependencies between words 'disable', 'pragma',
'warning' and 'default'. It seems like all these examples came from V665 database - take a note that
there are eleven examples. By the way, these dependencies may be unclear for a non-programmer, but
should be obvious for programmer.
Let us continue.
 100.00% ( 24 / 24) : WPARAM -> LPARAM
 92.31% ( 24 / 26) : LPARAM -> WPARAM
 91.30% ( 21 / 23) : wParam -> WPARAM
 91.30% ( 21 / 23) : lParam -> LPARAM
 91.30% ( 21 / 23) : wParam -> LPARAM
 87.50% ( 21 / 24) : WPARAM -> wParam
 86.96% ( 20 / 23) : wParam -> lParam
 86.96% ( 20 / 23) : lParam -> wParam
 86.96% ( 20 / 23) : lParam -> WPARAM
 83.33% ( 20 / 24) : WPARAM -> lParam
 80.77% ( 21 / 26) : LPARAM -> wParam
 80.77% ( 21 / 26) : LPARAM -> lParam
This probably may be left without commenting at all. Strong dependencies between WPARAM and
LPARAM types and their default names lParam and wParam. By the way, this words are coming from
16-bit versions of Windows, moreover, it seems like their origin is Windows 3.11. That is a
demonstrative proof that Microsoft makes a lot of work in terms of compatibility from year to year.
However, there were interesting results too.
 100.00% ( 12 / 12) : continue -> if
 100.00% ( 13 / 13) : goto -> if
 68.25% ( 43 / 63) : break -> if
First two elements of this list imply that, probably, there are no example with unconditional continue or
goto. Third one does not imply anything, because break can be used not only in cycle, but also in switch
operator, which by itself replaces bunches of 'if' operators. Or does it? Is 'if' operator indicates that
'goto' or 'continue' are conditional? Does anybody mentioned V612 diagnostics? In my defense,
however, I can tell that there are no single 'goto' and 'continue' in in V612 examples at all! Nevertheless,
situation with 'break' is not that pleasant.
 85.00% ( 17 / 20) : vector -> std
The authors of the real code try to avoid "using namespace std;" construction in headers, which is
certainly good for code reviewers, let it is sometimes not convenient for programmers (of course, we
are talking about five symbols!).
 94.87% ( 74 / 78) : memset -> 0
 82.05% ( 64 / 78) : memset -> sizeof
Most frequently, memory is filled with zeroes, at least in our examples. Yes, of course, diagnostics
V597 had a huge impact on that, as well as V575, V512 etc.
By the way, memory is filled by zeroes more frequently than sizeof is used, which is strange and justified
only in case when programmer fills an array of bytes with known size. The other case is an error like
V512, when sizeof is missing in third argument of memset.
 76.80% ( 139 / 181) : for -> 0
In majority of cases cycles starts from zero. Well, that is not a phrase to stress differences between C++
and Pascal or, for instance, Mathematica. Of course, many cycles count from zero. This is may be the
reason why foreach operator was introduced in C++11, that can also deal not only with the classes with
redefined begin(), end() etc, but with a usial arrays too (but not with pointers to arrays). Additionally, it
is much harder to make an error in foreach cycle than in for cycle.
So it goes. In addition, this analysis had been taking one hour and seven minutes in release mode on
eight-core processor.
Assumption 4. There are dangerous function names in which errors are
more likely
Strictly speaking, title of this paragraph should speak of its own. There was a suspicion that
programmers tend to make errors with some caption. This suspicion was shattered into pieces when it
met with reality - functions are called very differently, and the same function in different projects may
be called ReadData(), readData(), read_data(), ReAdDaTa() etc. So the first idea was to write additional
subprogram that would split function names into words, such as 'read' and 'data' in first three cases, and
would try to burn the fourth case with fire.
After splitting all the function names with errors, we got this distribution.
 159 : get
 69 : set
 46 : init
 44 : create
 44 : to
 38 : on
 37 : read
 35 : file
 34 : is
 30 : string
 29 : data
 29 : operator
 26 : proc
 25 : add
 25 : parse
 25 : write
 24 : draw
 24 : from
 23 : info
 22 : process
 22 : update
 20 : find
 20 : load
It seems like mistakes are more likely in 'get' functions than in 'set' functions. Alternatively, maybe,
our analyzer finds more errors in 'get' functions than in 'set' functions. Maybe, 'get' functions are more
frequent then 'set' functions.
Analysis fully similar to the previous one was conducted on a set of function words. This time results is
not that big and can be shown fully. There are no clear correlations in function names. However, we
were able to find something.
 77.78% ( 14 / 18) : dlg -> proc
 70.59% ( 12 / 17) : name -> get
 53.85% ( 14 / 26) : proc -> dlg
 43.48% ( 10 / 23) : info -> get
Significance of this magnificent result is comparable with this correlation:
Assumption 5. Some diagnostics warns more frequently than the others
Again, this assumption is in obvious style. No one from analyzer development team set an objective to
make every diagnostic pop up with nearly the same frequency. In addition, even if this task would been
set, some errors would have shown themselves almost on the spot (like V614). They are usually made to
speed up the development with advice 'on the fly'. Some errors, however, can remain unnoticed until
the end of product lifecycle (like V597). Our database contains errors found after Open Source
application analysis (at least the most part of it); moreover, it is usually stable version. Do I need to
mention that we find errors of the second class much more frequently than errors of the first class?
Again, the methodology is simple. Let us illustrate it on an example. Database contains an error like this:
NetXMS
V668 There is no sense in .... calltip.cpp 260
PRectangle CallTip::CallTipStart(....)
{
....
val = new char[strlen(defn) + 1];
if (!val)
return PRectangle();
....
}
Identical errors can be found in some other places:
V668 There is no sense in .... cellbuffer.cpp 153
V668 There is no sense in .... document.cpp 1377
V668 There is no sense in .... document.cpp 1399
And 23 additional diagnostic messages.
First record is a short name of project. We shall use it, but not now. Next record contains information
about an error - number of a diagnostic rule, its description, and relevant .cpp file name with line
number. Next record contains code; we are not interested in it for now. Next database contain records
containing additional places with another information string. This information may be absent. Last
record hosts number of errors that were skipped to shorten error description. After processing, we
should receive an information that V668 diagnostics found 1 + 3 + 23 = 27 errors. We can proceed to the
next entry.
Now then, the most frequent diagnostics are:
 1037 : 668
 1016 : 595
 311 : 610
 249 : 547
 231 : 501
 171 : 576
 143 : 519
 141 : 636
 140 : 597
 120 : 512
 89 : 645
 83 : 611
 81 : 557
 78 : 624
 67 : 523
Two diagnostics related to work with memory are leading. This is not surprising, because C/C++
languages implement "unsafe" memory management. V595 diagnostics searches for cases when it is
possible to dereference null pointer, V668 diagnostics warns about the fact that checking a pointer
received from new operator against null does not have any sense, because new throws an exception if
memory cannot be allocated. Yes, 9X.XX% programmers make errors while working with memory in
C/C++.
Next idea was to check which projects are the most prone to errors and to which ones. Well, no sooner
said than done.
 640 : Miranda NG :
 --- V595 : 165 (25.8%)
 --- V645 : 84 (13.1%)
 --- V668 : 83 (13%)
 388 : ReactOS :
 --- V595 : 213 (54.9%)
 --- V547 : 32 (8.25%)
 280 : V8 :
 --- V668 : 237 (84.6%)
 258 : Geant4 :
 --- V624 : 71 (27.5%)
 --- V668 : 70 (27.1%)
 --- V595 : 31 (12%)
 216 : icu :
 --- V668 : 212 (98.1%)
Assumption 6. Error density in the beginning of file is larger than in the
end
Last assumption is not very graceful either. The idea is simple. Is there a line or a group of lines (like, for
example, from 67 to 75), where programmers tend to make errors more frequently? Obvious fact:
programmers rarely mistake in first ten lines (usually it is about #pragma once or #include "file.h"). It is
also obvious that programmers rarely mistake in lines from 30000 to 30100. It is because there are
usually no files that large in real projects.
Strictly speaking, method was quite simple. Every diagnostic message contain number of line of source
file. However, not every error has an information about source line. It is possible to extract only four line
numbers from the example above out of 27, because the rest 23 are not detailed at all. Nevertheless,
even this tool can extract many errors from database. The only problem is that there is no total size of
.cpp file in database, so it is impossible to normalize results to make them relatival. In other words, one
does not simply check the hypothesis that 80% of errors occurs in last 20% of file.
This time we present histogram instead of text.
Figure 1 - Error density histogram
Let us clarify how we made our evaluations in application to first column. We counted all the errors
made in lines from 11 to 20 and then divided it into the number of lines from 11 to 20 inclusive (i.e. into
10). In summary, on average in all projects there were slightly less than one error in lines from 11 to 20.
This result is shown on histogram. Let us remind that we have not done any normalization - it was more
important for us not to show precise values that would barely represent dynamics due to small sample
size anyway, but to show the approximate form of distribution.
Despite the fact that histogram contains sharp derivations from trending line (and it slightly reminds log-
normal distribution), we decided not to prove that the mistakes are made the most frequently from
lines 81 to 90. Still, draw a plot is one kind of problem, to proof something based on it - another kind of
problem that is much more difficult. We decided to leave only generic phrase. "Unfortunately, it looks
like all derivations does not exceed statistical threshold value". That is all.
Conclusion
In this article, we managed to show how it is possible to make money by making nonsense.
Speaking seriously, there are two problems related to data mining on error database. First one - what
should we search for? "The Last Line Effect" can be proved manually (and should be, because automatic
search of similar blocks is ungrateful), and the rest runs up with an absence of ideas. Second problem - is
sample size large enough? It is possible that sample size for letter frequency analysis is large enough,
but we cannot tell for sure about other statistics. Similar words can be said about statistical significance.
Moreover, after collecting larger database it is not enough to simply re-run experiments. To proof
statistical hypothesis one should make many mathematical calculations to, for instance, choose the
most fitting distribution function and apply Pearson's chi-squared test. Of course, in case when
dependency is assumed as strong as astrologist prediction, these tests are senseless.
We made this article to find directions where one can look in terms of statistics on error database. If we
had spotted significant deviation, we would have thought about this and would have made experiments
that are more detailed. However, this was not the case.

Mais conteúdo relacionado

Semelhante a The tyranny of averages

Searching for bugs in Mono: there are hundreds of them!
Searching for bugs in Mono: there are hundreds of them!Searching for bugs in Mono: there are hundreds of them!
Searching for bugs in Mono: there are hundreds of them!PVS-Studio
 
Beyond the Style Guides
Beyond the Style GuidesBeyond the Style Guides
Beyond the Style GuidesMosky Liu
 
How to find 56 potential vulnerabilities in FreeBSD code in one evening
How to find 56 potential vulnerabilities in FreeBSD code in one eveningHow to find 56 potential vulnerabilities in FreeBSD code in one evening
How to find 56 potential vulnerabilities in FreeBSD code in one eveningPVS-Studio
 
An SEO’s Intro to Web Dev PHP
An SEO’s Intro to Web Dev PHPAn SEO’s Intro to Web Dev PHP
An SEO’s Intro to Web Dev PHPTroyfawkes
 
Data Science Your Vacation
Data Science Your VacationData Science Your Vacation
Data Science Your VacationTJ Stalcup
 
The D language comes to help
The D language comes to helpThe D language comes to help
The D language comes to helpPVS-Studio
 
Data Science Your Vacation
Data Science Your VacationData Science Your Vacation
Data Science Your VacationTJ Stalcup
 
Inheritance Versus Roles - The In-Depth Version
Inheritance Versus Roles - The In-Depth VersionInheritance Versus Roles - The In-Depth Version
Inheritance Versus Roles - The In-Depth VersionCurtis Poe
 
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in TwitterTwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in TwitterSubhabrata Mukherjee
 
Stringing Things Along
Stringing Things AlongStringing Things Along
Stringing Things AlongKevlin Henney
 
Top 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docxTop 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docxSurendra Gusain
 
Top 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docxTop 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docxSurendra Gusain
 
60 terrible tips for a C++ developer
60 terrible tips for a C++ developer60 terrible tips for a C++ developer
60 terrible tips for a C++ developerAndrey Karpov
 
Lesson 03 python statement, indentation and comments
Lesson 03   python statement, indentation and commentsLesson 03   python statement, indentation and comments
Lesson 03 python statement, indentation and commentsNilimesh Halder
 

Semelhante a The tyranny of averages (20)

Tf dsyv
Tf dsyvTf dsyv
Tf dsyv
 
Searching for bugs in Mono: there are hundreds of them!
Searching for bugs in Mono: there are hundreds of them!Searching for bugs in Mono: there are hundreds of them!
Searching for bugs in Mono: there are hundreds of them!
 
Beyond the Style Guides
Beyond the Style GuidesBeyond the Style Guides
Beyond the Style Guides
 
How to find 56 potential vulnerabilities in FreeBSD code in one evening
How to find 56 potential vulnerabilities in FreeBSD code in one eveningHow to find 56 potential vulnerabilities in FreeBSD code in one evening
How to find 56 potential vulnerabilities in FreeBSD code in one evening
 
An SEO’s Intro to Web Dev PHP
An SEO’s Intro to Web Dev PHPAn SEO’s Intro to Web Dev PHP
An SEO’s Intro to Web Dev PHP
 
FinalReport
FinalReportFinalReport
FinalReport
 
Data Science Your Vacation
Data Science Your VacationData Science Your Vacation
Data Science Your Vacation
 
The D language comes to help
The D language comes to helpThe D language comes to help
The D language comes to help
 
Data Science Your Vacation
Data Science Your VacationData Science Your Vacation
Data Science Your Vacation
 
Inheritance Versus Roles - The In-Depth Version
Inheritance Versus Roles - The In-Depth VersionInheritance Versus Roles - The In-Depth Version
Inheritance Versus Roles - The In-Depth Version
 
Gaps in the algorithm
Gaps in the algorithmGaps in the algorithm
Gaps in the algorithm
 
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in TwitterTwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
 
Stringing Things Along
Stringing Things AlongStringing Things Along
Stringing Things Along
 
Rubykin
Rubykin Rubykin
Rubykin
 
Top 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docxTop 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docx
 
Top 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docxTop 10 Interview Questions for Coding Job.docx
Top 10 Interview Questions for Coding Job.docx
 
60 terrible tips for a C++ developer
60 terrible tips for a C++ developer60 terrible tips for a C++ developer
60 terrible tips for a C++ developer
 
Writeissues
WriteissuesWriteissues
Writeissues
 
Lesson 03 python statement, indentation and comments
Lesson 03   python statement, indentation and commentsLesson 03   python statement, indentation and comments
Lesson 03 python statement, indentation and comments
 
Grounded Pointers
Grounded PointersGrounded Pointers
Grounded Pointers
 

Último

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 

Último (20)

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 

The tyranny of averages

  • 1. The tyranny of averages Author: Mikhail Nikishin Date: 12.01.2015 Let us begin with mentioning that this article is completely non-serious. New Year is coming, holidays are almost there and there is no reason to do anything deliberate. That is why we decided to write an article about, suddenly, statistics. This article indirectly connected with one of the discussion we were participating in few weeks ago. It was about the possibility of some consistent patterns in errors in this or that line in duplicated code. We often refer to the article "The Last Line Effect" - according to our observations, lines of code of the same type and structure generated by copy and paste technique is more likely to be erroneous in the last line. The discussion was about the fact of error possibility in other places in duplicating blocks. Unfortunately, it is difficult to gather statistics about places when the error occurs in these examples. However, it gave us an idea to make small statistic study on our examples base. We should mention that we wrote this article in jocose style, because we had not found any real consistent patterns. Many people can remember that "There are three kinds of lies: lies, damned lies, and statistics" and distrust any statistic researches. This may be a valid point, because statistics oriented to mass media is usually used to find relations in cases when there are no connection at all. One of the most widely known example is "Mars effect". However, that is not our case. We claim from the beginning that this statistical study does not pretend to be serious. Any statistical dependencies in this article either obvious, "does not prove any causation" or does not surpass statistical significance due to small sample size. Well, let us begin. While Google tries to gather statistics about what people hate, we are trying to gather statistics about what analyzer hate.
  • 2. Assumption 1. Some words are more frequent than others Really? You have got to be kidding, have not you? Anyone who is familiar with any programming language can tell for sure that some words and symbols occur in source code more frequently than others do. Even in Brainfuck code symbol '+' is more frequent than symbol '.'. The only arguable "programming language" used to write real programs is even not an Assembler but a machine code itself. Experts can also remember other counterexamples from esoteric languages like Malbolge etc. However, what about C++? It is expected that "int" keyword should be more frequent than "float", "public" should be more frequent than "protected" and "class" should be more frequent than "struct" and, all the more so, "union". Still, which words are the most frequent ones in pieces of C++ code containing errors? We counted word frequency by evaluating the number of words in all the examples, i.e. if one example contained two occurrences of "if" keyword, then program counted it twice. Words is comments was omitted. List of the most frequent words is presented below (number before colon is a number of occurrences in all the examples):  1323 : if  798 : int  699 : void  686 : i  658 : const  620 : return  465 : char  374 : static  317 : else  292 : sizeof  258 : bool  257 : NULL  239 : s  223 : for  194 : unsigned  187 : n  150 : struct  146 : define  137 : x
  • 3.  133 : std  121 : c  121 : new  115 : typedef  113 : j  107 : d  105 : a  102 : buf  102 : case "Conclusion": "if" causes many errors. Next words gives us a light of hope; not the words by itself, but their frequency compared to "if" and even "case":  15 : goto  13 : static_cast  6 : reinterpret_cast It looks like not everything is that bad with structure of Open Source applications. However, words like "auto" are not frequent at all (less than five occurrences) as well as "constexpr", as "unique_ptr" etc. On the one hand, it was to be expected, because we started collecting examples long ago, when nobody even thought about implementing C++11 standard. On the other hand, there is another subtext: language extensions are introduced to decrease the probability of making a mistake. Let us recall that our base contains only code with errors that was found by PVS-Studio static code analyzer. We gathered similar statistics on numbers.  1304 : 0  653 : 1  211 : 2  120 : 4  108 : 3  70 : 8  43 : 5  39 : 16  36 : 64  29 : 6  28 : 256 It is curious that number 4 in examples of erroneous code is more frequent then 3; moreover, this fact is not related to 64-bit diagnostics - even if there are present some errors from 64-bit diagnostics, they are small in quantity (not more than one or two code example). Majority of examples (at least 99%) are general analysis errors. It is likely that four is more frequent than three are, however insignificantly, because four is a "round number" while three is not (do you understand me?). This is why 8, 16, 64, 256 are in breakaway too. This is the reasons behind oddity of distribution.
  • 4. Next is a short test for wits and knowledge. Do you think where those numbers came from, 4996 and 2047?  6 : 4996  5 : 2047 The answer is in the end of the next paragraph. Assumption 2. The most frequent letter is a letter 'e' According to this statistics, the most frequent letter in formal English is 'e'. Ten most frequent letters in English are e, t, a, o, i, n, s, h, r, d. We asked ourselves about the frequency of letters in C++ source code fragments. Then we made another experiment. Approach was even more brutal and heartless than previous one. We simply counted every symbol in every example. Case did not matter, i.e. 'K' = 'k'. The results are presented below:  82100 :  28603 : e  24938 : t  19256 : i  18088 : r  17606 : s  16700 : a  16466 : .  16343 : n  14923 : o  12438 : c  11527 : l The most frequent symbol is a space. In formal English space symbol is slightly more frequent than 'e' letter, but that is not our case. Space is used widely for indentation, which provides solid first place in terms of frequency at least in our examples, because we had replaced all the tabs to spaces to ease formatting. In addition, what about the rest? Letters 'i' (leader on a counters name market since 19XX), 'r' (our assumption - used in names like run, rand, vector, read, write, and, most of all, error)
  • 5. and 's' (std::string s) are much more frequent. However, due to the large sample size we can claim that letters 'e' and 't' is also the most frequent letters in C++ source code as well as in formal English. Some words about the dot. Of course, in real examples dot is not as frequent as it is in the list above. The thing is our database omits a lot of excess code that is not required to understand errors, and four dots are used for omitted code. That is why dot is probably not one of the most frequent symbols of C++ language. Does anybody mentioned entropy encoding? Okay, let us check it from another point of view. Which symbol is the least frequent one?  90 : ?  70 : ~  24 : ^  9 : @  1 : $ In addition, another strange result that amazed us. Look at the quantity of these symbols. It is almost coincide (somewhere it coincides exactly!). That is weird. How could this happen?  8167 : (  8157 : )  3064 : {  2897 : }  1457 : [  1457 : ] Ah, well, the promised answer to the question from previous paragraph. 2047 = 2048 - 1, and number 4996 came from lines like #pragma warning (disable:4996) Assumption 3. There are dependency between occurrences of some words It reminds correlation analysis in some way. The problem was set like that: is there any dependency between occurrences of some pair of words? What is the reason behind words "in some way" in previous sentence? We decided to evaluate relative value that resembles correlation coefficient, but it is not actually correlation coefficient, because it can change only between 0 and 1 inclusively and is measured for every pair of words (a,b) this way. For instance, word a was occurred in Na examples, word b - in Nb examples, both a and b in Nab examples. Given that, Rab = Nab / Na, Rba = Nab / Nb. Using the fact that 0 <= Nab <= Na, Nb; Na, Nb > 0 it is possible to prove that, obviously, 0 <= Rab, Rba <= 1. How does it work? Let us make an assumption that word 'void' was encountered in 500 examples, word 'int' in 2000 examples, and both 'void' and 'int' were encountered in 100 examples. Then Rvoid,int = 100 / 500 = 20%, Rint,void = 100 / 2000 = 5%. Yes, this coefficient is asymmetrical (Rab in general is not equal to Rba); however, it is hardly an obstacle. Perhaps, it is possible to talk about an even slightest statistical dependency when R >= 50%. Why 50%? Just because we wanted to. Actually, thresholds are usually chosen approximately and there are no clear recommendations. 95% value should, perhaps, indicate strong dependency. Perhabs.
  • 6. Well, using correlation analysis, we were able to find out these stunning, unorthodox facts:  In examples with usage of 'else' keyword 'if' keyword is also usually (95.00%) used! (Where are the rest 5%?)  In examples with usage of 'public' keyword 'class' keyword is also usually (95.12%) used!  In examples with usage of 'typename' keyword 'template' keyword is also usually (90.91%) used! Et cetera. Here are some "obvious" blocks below.  100.00% ( 18 / 18) : argc -> argv  100.00% ( 18 / 18) : argc -> int  94.44% ( 17 / 18) : argc -> char  90.00% ( 18 / 20) : argv -> argc  90.00% ( 18 / 20) : argv -> char  90.00% ( 18 / 20) : argv -> int  75.00% ( 12 / 16) : main -> argv  60.00% ( 12 / 20) : argv -> main At least it proves that program works, and by 'work' we mean meaningless operations to find all the dependencies between 'main', 'argc' and 'argv'.  100.00% ( 11 / 11) : disable -> pragma  100.00% ( 11 / 11) : disable -> default  100.00% ( 11 / 11) : disable -> warning  91.67% ( 11 / 12) : warning -> pragma  91.67% ( 11 / 12) : warning -> default  91.67% ( 11 / 12) : warning -> disable  78.57% ( 11 / 14) : pragma -> warning  78.57% ( 11 / 14) : pragma -> disable  78.57% ( 11 / 14) : pragma -> default  57.89% ( 11 / 19) : default -> warning  57.89% ( 11 / 19) : default -> disable  57.89% ( 11 / 19) : default -> pragma Compiler directives insanity. Analysis has found all the dependencies between words 'disable', 'pragma', 'warning' and 'default'. It seems like all these examples came from V665 database - take a note that
  • 7. there are eleven examples. By the way, these dependencies may be unclear for a non-programmer, but should be obvious for programmer. Let us continue.  100.00% ( 24 / 24) : WPARAM -> LPARAM  92.31% ( 24 / 26) : LPARAM -> WPARAM  91.30% ( 21 / 23) : wParam -> WPARAM  91.30% ( 21 / 23) : lParam -> LPARAM  91.30% ( 21 / 23) : wParam -> LPARAM  87.50% ( 21 / 24) : WPARAM -> wParam  86.96% ( 20 / 23) : wParam -> lParam  86.96% ( 20 / 23) : lParam -> wParam  86.96% ( 20 / 23) : lParam -> WPARAM  83.33% ( 20 / 24) : WPARAM -> lParam  80.77% ( 21 / 26) : LPARAM -> wParam  80.77% ( 21 / 26) : LPARAM -> lParam This probably may be left without commenting at all. Strong dependencies between WPARAM and LPARAM types and their default names lParam and wParam. By the way, this words are coming from 16-bit versions of Windows, moreover, it seems like their origin is Windows 3.11. That is a demonstrative proof that Microsoft makes a lot of work in terms of compatibility from year to year. However, there were interesting results too.  100.00% ( 12 / 12) : continue -> if  100.00% ( 13 / 13) : goto -> if  68.25% ( 43 / 63) : break -> if First two elements of this list imply that, probably, there are no example with unconditional continue or goto. Third one does not imply anything, because break can be used not only in cycle, but also in switch operator, which by itself replaces bunches of 'if' operators. Or does it? Is 'if' operator indicates that 'goto' or 'continue' are conditional? Does anybody mentioned V612 diagnostics? In my defense, however, I can tell that there are no single 'goto' and 'continue' in in V612 examples at all! Nevertheless, situation with 'break' is not that pleasant.  85.00% ( 17 / 20) : vector -> std The authors of the real code try to avoid "using namespace std;" construction in headers, which is certainly good for code reviewers, let it is sometimes not convenient for programmers (of course, we are talking about five symbols!).  94.87% ( 74 / 78) : memset -> 0  82.05% ( 64 / 78) : memset -> sizeof Most frequently, memory is filled with zeroes, at least in our examples. Yes, of course, diagnostics V597 had a huge impact on that, as well as V575, V512 etc. By the way, memory is filled by zeroes more frequently than sizeof is used, which is strange and justified only in case when programmer fills an array of bytes with known size. The other case is an error like V512, when sizeof is missing in third argument of memset.  76.80% ( 139 / 181) : for -> 0
  • 8. In majority of cases cycles starts from zero. Well, that is not a phrase to stress differences between C++ and Pascal or, for instance, Mathematica. Of course, many cycles count from zero. This is may be the reason why foreach operator was introduced in C++11, that can also deal not only with the classes with redefined begin(), end() etc, but with a usial arrays too (but not with pointers to arrays). Additionally, it is much harder to make an error in foreach cycle than in for cycle. So it goes. In addition, this analysis had been taking one hour and seven minutes in release mode on eight-core processor. Assumption 4. There are dangerous function names in which errors are more likely Strictly speaking, title of this paragraph should speak of its own. There was a suspicion that programmers tend to make errors with some caption. This suspicion was shattered into pieces when it met with reality - functions are called very differently, and the same function in different projects may be called ReadData(), readData(), read_data(), ReAdDaTa() etc. So the first idea was to write additional subprogram that would split function names into words, such as 'read' and 'data' in first three cases, and would try to burn the fourth case with fire. After splitting all the function names with errors, we got this distribution.  159 : get  69 : set  46 : init  44 : create  44 : to  38 : on  37 : read  35 : file  34 : is  30 : string  29 : data  29 : operator  26 : proc  25 : add  25 : parse  25 : write  24 : draw  24 : from  23 : info  22 : process  22 : update  20 : find  20 : load It seems like mistakes are more likely in 'get' functions than in 'set' functions. Alternatively, maybe, our analyzer finds more errors in 'get' functions than in 'set' functions. Maybe, 'get' functions are more frequent then 'set' functions.
  • 9. Analysis fully similar to the previous one was conducted on a set of function words. This time results is not that big and can be shown fully. There are no clear correlations in function names. However, we were able to find something.  77.78% ( 14 / 18) : dlg -> proc  70.59% ( 12 / 17) : name -> get  53.85% ( 14 / 26) : proc -> dlg  43.48% ( 10 / 23) : info -> get Significance of this magnificent result is comparable with this correlation: Assumption 5. Some diagnostics warns more frequently than the others Again, this assumption is in obvious style. No one from analyzer development team set an objective to make every diagnostic pop up with nearly the same frequency. In addition, even if this task would been set, some errors would have shown themselves almost on the spot (like V614). They are usually made to speed up the development with advice 'on the fly'. Some errors, however, can remain unnoticed until the end of product lifecycle (like V597). Our database contains errors found after Open Source application analysis (at least the most part of it); moreover, it is usually stable version. Do I need to mention that we find errors of the second class much more frequently than errors of the first class? Again, the methodology is simple. Let us illustrate it on an example. Database contains an error like this: NetXMS V668 There is no sense in .... calltip.cpp 260 PRectangle CallTip::CallTipStart(....) { .... val = new char[strlen(defn) + 1]; if (!val) return PRectangle();
  • 10. .... } Identical errors can be found in some other places: V668 There is no sense in .... cellbuffer.cpp 153 V668 There is no sense in .... document.cpp 1377 V668 There is no sense in .... document.cpp 1399 And 23 additional diagnostic messages. First record is a short name of project. We shall use it, but not now. Next record contains information about an error - number of a diagnostic rule, its description, and relevant .cpp file name with line number. Next record contains code; we are not interested in it for now. Next database contain records containing additional places with another information string. This information may be absent. Last record hosts number of errors that were skipped to shorten error description. After processing, we should receive an information that V668 diagnostics found 1 + 3 + 23 = 27 errors. We can proceed to the next entry. Now then, the most frequent diagnostics are:  1037 : 668  1016 : 595  311 : 610  249 : 547  231 : 501  171 : 576  143 : 519  141 : 636  140 : 597  120 : 512  89 : 645  83 : 611  81 : 557  78 : 624  67 : 523 Two diagnostics related to work with memory are leading. This is not surprising, because C/C++ languages implement "unsafe" memory management. V595 diagnostics searches for cases when it is possible to dereference null pointer, V668 diagnostics warns about the fact that checking a pointer received from new operator against null does not have any sense, because new throws an exception if memory cannot be allocated. Yes, 9X.XX% programmers make errors while working with memory in C/C++.
  • 11. Next idea was to check which projects are the most prone to errors and to which ones. Well, no sooner said than done.  640 : Miranda NG :  --- V595 : 165 (25.8%)  --- V645 : 84 (13.1%)  --- V668 : 83 (13%)  388 : ReactOS :  --- V595 : 213 (54.9%)  --- V547 : 32 (8.25%)  280 : V8 :  --- V668 : 237 (84.6%)  258 : Geant4 :  --- V624 : 71 (27.5%)  --- V668 : 70 (27.1%)  --- V595 : 31 (12%)  216 : icu :  --- V668 : 212 (98.1%) Assumption 6. Error density in the beginning of file is larger than in the end Last assumption is not very graceful either. The idea is simple. Is there a line or a group of lines (like, for example, from 67 to 75), where programmers tend to make errors more frequently? Obvious fact: programmers rarely mistake in first ten lines (usually it is about #pragma once or #include "file.h"). It is also obvious that programmers rarely mistake in lines from 30000 to 30100. It is because there are usually no files that large in real projects. Strictly speaking, method was quite simple. Every diagnostic message contain number of line of source file. However, not every error has an information about source line. It is possible to extract only four line numbers from the example above out of 27, because the rest 23 are not detailed at all. Nevertheless, even this tool can extract many errors from database. The only problem is that there is no total size of .cpp file in database, so it is impossible to normalize results to make them relatival. In other words, one does not simply check the hypothesis that 80% of errors occurs in last 20% of file. This time we present histogram instead of text.
  • 12. Figure 1 - Error density histogram Let us clarify how we made our evaluations in application to first column. We counted all the errors made in lines from 11 to 20 and then divided it into the number of lines from 11 to 20 inclusive (i.e. into
  • 13. 10). In summary, on average in all projects there were slightly less than one error in lines from 11 to 20. This result is shown on histogram. Let us remind that we have not done any normalization - it was more important for us not to show precise values that would barely represent dynamics due to small sample size anyway, but to show the approximate form of distribution. Despite the fact that histogram contains sharp derivations from trending line (and it slightly reminds log- normal distribution), we decided not to prove that the mistakes are made the most frequently from lines 81 to 90. Still, draw a plot is one kind of problem, to proof something based on it - another kind of problem that is much more difficult. We decided to leave only generic phrase. "Unfortunately, it looks like all derivations does not exceed statistical threshold value". That is all. Conclusion In this article, we managed to show how it is possible to make money by making nonsense. Speaking seriously, there are two problems related to data mining on error database. First one - what should we search for? "The Last Line Effect" can be proved manually (and should be, because automatic search of similar blocks is ungrateful), and the rest runs up with an absence of ideas. Second problem - is sample size large enough? It is possible that sample size for letter frequency analysis is large enough, but we cannot tell for sure about other statistics. Similar words can be said about statistical significance. Moreover, after collecting larger database it is not enough to simply re-run experiments. To proof statistical hypothesis one should make many mathematical calculations to, for instance, choose the most fitting distribution function and apply Pearson's chi-squared test. Of course, in case when dependency is assumed as strong as astrologist prediction, these tests are senseless. We made this article to find directions where one can look in terms of statistics on error database. If we had spotted significant deviation, we would have thought about this and would have made experiments that are more detailed. However, this was not the case.