Rao probability theory with applications

PROBABILITY THEORY
WITH APPLICATIONS
Second Edition

Mathematics and Its A~~lications
Managing Editor:
M. HAZEWINKEL
Centrefor Mathematics and Computer Science, Amsterdam, The Netherlands
Volume 582

PROBABILITY THEORY
WITH APPLICATIONS
Second Edition
M.M. RAO
University of California, Riverside, California
R.J. SWIFT
California State Polytechnic University, Pomona, California
Springer-

Library of Congress Control Number: 2005049973
Printed on acid-free paper.
AMS Subject Classifications: 60Axx, 60Exx, 60Fxx, 60Gxx, 62Bxx, 62Exx, 62Gxx, 62Mxx, 93Cxx
02006 Springer Science+Business Media, Inc.
All rights rcscrvcd. This work may not bc translatcd or copicd in wholc or in part without thc writtcn
permission of the publisher (Springer Science+Business Media, Tnc., 233 Spring Street, New York, NY
10013, USA), cxccpt for bricf cxccrpts in conncction with rcvicws or scholarly analysis. Usc in
conncction with any form of information storagc and rctricval, clcctronic adaptation, computcr software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not idcntificd as such, is not to bc takcn as an cxprcssion of opinion as to whcthcr or not thcy arc subjcct
to proprietary rights.
Printed in the United States of America.

To the memory of my brother-in-law,
Raghavayya V. Kavuri
M.M.R.
To the memory of my parents,
Randall and Julia Swift
R.J.S.

Contents
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Preface to Second Edition ix
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Preface to First Edition xv
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .List of Symbols xvii
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Part I .Foundations 1
1 Background Material and Preliminaries . . . . . . . . . . . . . . . . . . . . 3
1.1 What is Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Random Variables and Measurability Results . . . . . . . . . . . . . . . 7
1.3 Expectations and the Lebesgue Theory . . . . . . . . . . . . . . . . . . . . . 12
1.4 Image Measure and the Fundamental Theorem of Probability . 20
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Exercises 28
2 Independence and Strong Convergence . . . . . . . . . . . . . . . . . . . . 33
2.1 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Coilvergelice Concepts, Series and Inequalities . . . . . . . . . . . . . . . 46
2.3 Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .58
2.4 Applications to Empiric Distributions. Densities. Queueing.
andRandom Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3 Conditioning and Some Dependence Classes . . . . . . . . . . . . . . . 103
3.1 Conditional Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.2 Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.3 MarkovDependeiice 140
3.4 Existelice of Various Random Families . . . . . . . . . . . . . . . . . . . . .158
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.5 Martingale Sequences 174
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Exercises 203
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Part I1 .Analytical Theory 221
4 Probability Distributions and Characteristic Functions . . . .223
4.1 Distribution Functioiis and the Selection Principle . . . . . . . . . . .223
4.2 Characteristic Functions. Inversion. aiid Lkvy's Continuity
Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.3 Cramkr's Theorem on Fourier Transforms of Signed Measures . 251
4.4 Bochner's Theorem on Positive Definite Functions . . . . . . . . . . . 256

viii Contents
4.5 Some Multidimensional Extensions . . . . . . . . . . . . . . . . . . . . . . . . 265
4.6 Equivalence of Convergences for Sums of Independent
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Random Variables 274
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Exercises 276
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 Weak Limit Laws 291
. . . . . . . . . . . . . . . . . . . . . . . . . .5.1 Classical Central Limit Theorems 291
5.2 Infinite Divisibility and the Lkvy-Khintchine Formula . . . . . . . .304
. . . . . . . . . . . . . . . . . . . .5.3 General Limit Laws, Including Stability 318
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5.4 Invariance Principle 341
. . . . . . . . . . . . . . . .5.5 Kolmogorov's Law of the Iterated Logarithm 364
. . . . . . . . . . . . .5.6 Application to a Stochastic Difference Equation 375
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .386
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Part I11 .Applications 409
. . . . . . . . . . .6 Stopping Times. Martingales. and Convergences 411
. . . . . . . . . . . . . . . . . . . . . . . .6.1 Stopping Times and Their Calculus 411
. . . . . . . . . . . . . . . . . . . . . . .6.2 Wald's Equation and an Application 415
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6.3 Stopped Martingales 420
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .427
7 Limit Laws for Some Dependent Sequences . . . . . . . . . . . . . . . . 429
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7.1 Central Limit Theorems 429
7.2 Limit Laws for a Random Number of Random Variables . . . . . .436
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7.3 Ergodic Sequences 449
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
. . . . . . . . . . . . . . . . . . . . . . . . .8 A Glimpse of Stochastic Processes 459
. . . . . . . . . . . . . .8.1 Brownian Motion: Definition and Construction 459
. . . . . . . . . . . . . . . . . . . . . .8.2 Some Properties of Brownian Motion 463
8.3 Law of the Iterated Logarithm for Brownian Motion . . . . . . . . . 467
. . . . . . . . . . . . . . . . . . .8.4 Gaussian and General Additive Processes 470
..................................8.5 Second-Order Processes 493
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .498
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .References 509
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Author Index 519
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .523

Preface to Second Edition
The following is a revised aiid somewhat enlarged account of Probability
Theory with Applications, whose basic aim as expressed in the preface to the
first edition (appended here) is maintained. In this revision, the material and
presentation is better highlighted with several (small and large) alterations
made to each chapter. We believe that these additions make a better text for
graduate students and also a reference work for a later study. We now discuss
in some detail the subject of this text, as modified here. It is hoped that this
will provide an appreciation for the view-point of this edition, as well as the
earlier one, published over two decades ago.
In the present setting, the work is organized into three parts, the first
being on the foundations of the subject, consists of Chapters 1-3. The second
part concentrates on the analytical aspects of probability in relatively large
chapters 4-5. The final part in Chapters 6-8 treats some serious and deep
applications of the subject. The point of view presented here has the following
focus. Parts I and I1 can be essentially studied independently with only cursory
cross-references. Each part could easily be used for a quarter or semester
long beginning graduate course in Probability Theory. The prerequisite is
a graduate course in Real Analysis, although it is possible to study the two
subjects concurrently. Each of these parts of this text also has applications and
ideas some of which are discussed as problems that illustrate as well as extend
the basic subject. The final part of the text can be used for a follow-up course
on the preceding material or for a seminar thereafter. Numerous suggestions
for further study aiid even several research problems are pointed out. We now
detail some of these points for a better view of the treatment which is devoted
to the mathematical content, avoiding nonmathematical views and concepts.
To accommodate the new material and not to substantially increase the
size of the volume, we had to omit most of the original Chapter 6 and part
of Chapter 7. Thus this new version has eight chapters, but it is still well

x Preface to Second Edition
focused and the division into parts makes the work more useful. We now turn
to explaining the new format.
The first part, on foundations, treats the two fundamental ideas of prob-
ability, independelice and conditioning. In Chapter 1 we recall the necessary
results from Real Analysis which we recoininelid for a perusal. It is also iin-
portant that readers take a careful look at the fundamental law of probability
and the basic uniform continuity of characteristic functions.
Chapter 2 undertakes a serious study of (statistical) independence, which
is a distinguishing feature of Probability Theory. Independence is treated in
considerable detail in this chapter, both the basic strong and weak laws, as
well as the convergence of series of random variables. The applications consid-
ered here illustrate such results as the Glivenko-Cantelli Theorem for empiric
and density estimation, random walks, and queueing theory. There are also
exercises (with hints) of special interest aiid we recommend that all readers
pay particular attention to Problems 5 and 6, aiid also 7, 15 aiid 21 which
explain the very special nature of the subject and the concept of independence
itself.
The somewhat long third chapter is devoted to the second fundamental
idea, namely conditioning. As far as we know, no other graduate text in prob-
ability has treated the subject of coiiditioiial probability in such detail aiid
specificity. To mention some noteworthy points of our presentation, we have
included: (i) the unsuspected, but spectacular, failure of the Vitali convergence
theorem for conditional probabilities. This is a consequence of an interesting
theorem of Blackwell and Dubins. We include a discussion and imposition of a
restriction for a positive conclusion to prevail, (ii) the basic problem (still un-
resolved) of calculating coiiditioiial expected values (probabilities) when the
conditioning is relative to random variables taking uiicountably inany values,
particularly when the random variables arise from continuous distributions.
In this setting, multiple answers (all natural) for the same question are ex-
hibited via a Gaussian family. The calculations we give follow some work
by Kac aiid Slepian, leading to paradoxes. These difficulties arise from the
necessary calculation of the Radon-Nikod9m derivative which is fuiidaineiital
here, and for which no algorithmic procedure exists in the literature. A search
through E. Bishop's text on the foundations of constructivism (in the way
of L.E.J. Brower) shows that we do not yet have a solution or a resolution
for the problems discussed. Thus our results are on existence and hence use
"idealistic methods", which present, to future researchers in Bishop's words,
"a challenge to find a coiistructive version aiid to give a coiistructive proof."
Until this is fulfilled, we have to live with subjectively chosen solutions, for
applications of our work in practice.
It is in this context, we detail in chapter 3, the Jessen-Kolmogorov-
Bochner-Tulcea theorems on existence of arbitrary families of random vari-
ables on (suitable) spaces. We also iiiclude here the basic martingale limit
theorems with applications to U-statistics, likelihood ratios, Markov processes
and quasi-martingales. Several exercises, (about 50) add complements to the

Preface to Second Edition xi
theory. These exercises include the concept of sufficiency, a martingale proof of
the Radon-Nikod9m theorem, aspects of Markov kernels, ergodic-martingale
relations and many others. Thus here and throughout the text one finds that
the exercises contain a large amount of additional information on the subject
of probability. Many of these exercises can be omitted in a first reading but we
strongly urge our readers to at least glance through them all and then return
later for a serious study. Here and elsewhere in the book, we follow the lead
of Feller's classics.
The classical as well as modern aspects of the so called analytical theory
of probability is the subject of the detailed treatment of Part 11. This part
coiisists of the two chapters 4 aiid 5 , with the latter being the longest in the
text. These chapters can be studied with the basic outline of chapter 1and just
the notion of independence translated to analysis. The main aim of Chapter 4
is to use distribution theory (or image probabilities using random variables) on
Euclidean spaces. This fully utilizes the topological structure of their ranges.
Thus the basic results are on characteristic fuiictioiis including the Lkvy-
Bochner-Cram& theorems and their multidimensional versions. The chapter
concludes with a proof of the equivalence of convergences-pointwise a.e., in
probability and in distribution-for sums of independent random variables.
Regarding some characterizations, we particularly recoininelid Problems 4,
16, 26, and 33 in this chapter.
The second longest chapter of the text, is chapter 5 and is the heart of
analytical theory. This chapter contains the customary central limit theory
with Berry-Essen error estimation. It also contains a substantial introduction
to infinite divisibility, iiicludiiig the Lkvy-Khintchine representation, stable
laws, aiid the Donsker invariance principle with applications to Kolmogorov-
Sinirnov type theorems. The basic law of the iterated logarithm, with H. Te-
icher's (somewhat) simplified proof, is presented. This chapter also contains
interesting applications in several exercises. Noteworthy are Bochner's gen-
eralization of stable types (without positive definiteness) in Exercises 26-27
aiid Wendel's "elementary" treatment of Spitzer's identity in Exercise 33. We
recommend that these exercises be completed by filling in the details of the
proofs outlined there. We have included the m-dependent central limit theo-
rem and an illustration to exemplify the applicability and limitations of the
classical invariance principle in statistical theory. Several additional aspects
of infinite divisibility and stability are also discussed in the exercises. These
problems are recommended for study so that certain interesting ideas arising
in applicatioiis of the subject can be learned by such an effort. These are also
useful for the last part of the book.
The preceding parts I & 11, prepare the reader to take a serious look
at Part 111, which is devoted to the next stage of our subject. This part is
devoted to what we consider as very important in modern applications, both
new aiid significant, in the subject. Chapters 6 aiid 7 are relatively short,
but are concerned with the limit theory of nonindependent random sequences
which demand new techniques. Chapter 6 introduces and uses stopping time

xii Preface to Second Edition
techniques. We establish Wald's identities, which play key roles in sequential
analysis, and the Doob optional stopping and sampling theorems, which are
essential for key developments in martingale theory. Chapter 7 contains central
limit theorems for a raiidoin number of raiidoin variables and the Birkhoff
ergodic theorem. The latter shows a natural setting for strict stationarity of
families of random variables and sets the stage for the last chapter of the text.
Chapter 8 presents a glimpse of the panorama of stochastic processes with
some analysis. There is a significant increase and expansion of the last chapter
of the first edition. It can be studied to get a sense of the expanding vistas
of the subject which appear to have great prospects aiid potential for further
research. The following items are considered to exhibit just a few of the many
new and deep applications.
The chapter begins with a short existence proof of Brownian motion di-
rectly through (random) Fourier series aiid then establishes the continuity,
nondifferentiability of its sample paths, statioiiarity of its increments, as well
as the iterated logarithm law for it. These ideas lead to a study of (general)
additive processes with independent, stable and strictly stationary increments.
The Poisson process plays a key role very similar to Brownian motion, and
points to a study of random measures with independent values on disjoint
sets. We indicate some modern developments following the work of Kahane-
Marcus-Pisier, geiieraliziiig the classical Paley-Zygmund analysis of raiidoin
Fourier series. This opens up many possibilities for a study of sample continu-
ity of the resulting (random) functions as sums, with just 0 < a! 5 2 moments.
These ideas lead to an analysis of strongly stationary classes (properly) con-
tained in strictly stationary families. The case n = 2 is special since Hilbert
space geometry is available for it. Thus the (popular) weakly stationary case
is coilsidered with its related (but more general) classes of weakly, strictly aiid
strongly harmonizable processes. These are outlined along with their integral
representations, giving a picture of the present state of stochastic analysis.
Again we include several complements as exercises with hints in the way pi-
oneered by Feller, aiid strongly recommend to our readers to at least glance
through them to have a better view of the possibilities and applications that
are opened up here. In this part therefore, problems 6 and 7 of Chapter 6; and
problems 2, 6, and 10 of Chapter 7, and problems 8, 12, 15, and 16 in chapter
8 are interesting as they reveal the unfolding areas shown by this work.
This book gives our view of how Probability Theory could be presented aiid
studied. It has evolved as a collaboratioii resulting from decades of research
experience aiid lectures prepared by the first author and the experieiices of
the second author who, as a student, studied and learned the subject from the
first edition and then subsequently used it as a research reference. His notes
and clarifications are implemented in this edition to improve the value of the
text. This project has been a satisfying effort resulting in a newer text that is
offered to the public.
In the preparation of the present edition we were aided by some colleagues,
friends and students. We express our sincere gratitude to Mary Jane Hill for

Preface to Second Edition xiii
her assistance and diligence with aspects of typesetting and other technical
points of the manuscript. Our colleague Michael L. Green offered valuable
comments, and Kunthel By, who read drafts of the early chapters with a
student's perspective, provided clarifications. We would like to thank our wives
Durgamba Rao and Kelly Swift, for their love, support, aiid understanding.
We sincerely thank all these people, and hope that the new edition will
serve well as a graduate text as well as a reference volume for many aspiring
and working mathematical scientists. It is our hope that we have succeeded,
at least to some extent, to convey the beauty aiid magnificence of probability
theory aiid its manifold applications to our audience.
Riverside, CA
Pomona, CA
M.M.Rao,
R.J. Swzft

Preface to First Edition
The material in this book is designed for a standard graduate course on prob-
ability theory, including some important applications. It was prepared from
the sets of lecture notes for a course that I have taught several times over the
past 20 years. The present version reflects the reactions of my audiences as
well as some of the textbooks that I used. Here I have tried to focus on those
aspects of the subject that appeared to me to add interest both pedagogi-
cally and methodologically. In this regard, I inelltioil the following features
of the book: it emphasizes the special character of the subject and its prob-
lems while eliminating the mystery surrounding it as much as possible; it
gradually expands the content, thus showing the blossoming of the subject;
it indicates the need for abstract theory even in applications aiid shows the
inadequacy of existing results for certain apparently simple real-world prob-
lems (See Chapter 6); it attempts to deal with the existence problems for
various classes of random families that figure in the main results of the sub-
ject; it contains a more complete (and I hope more detailed) treatment of
conditional expectatioiis and of conditional probabilities than any existing
textbook known to me; it shows a deep internal relation among the Lkvy
coiitiiiuity theorem, Bochner's theorem on positive definite functions, aiid the
Kolmogorov-Bochner existence theorem; it makes a somewhat more detailed
treatment of the invariance principles and of limit laws for a random number
of (ordered) random variables together with applications in both areas; aiid it
provides an unhurried treatment that pays particular attention to inotivatioii
at every stage of development.
Since this is a textbook, essentially all proofs are given in complete de-
tail (even at the risk of repetition), and some key results are given multiple
proofs when each argument has something to contribute. On the other hand,
generalization for its own sake is avoided, aiid as a rule, abstract-Banach-

xvi Preface to First Edition
space-valued random variables have not been included (if they have been, the
demands on the reader's preparation would have had to be much higher).
Regarding the prerequisites, a knowledge of the Lebesgue integral would
be ideal, aiid at least a concurrent study of real analysis is recommended.
The necessary results are reviewed in Chapter 1, aiid some results that are
generally not covered in such a course, but are essential for our work, are given
with proofs. In the rest of the book, the treatment is detailed and complete,
in accordance with the basic purpose of the text. Thus it can be used for
self-study by mature scientists having no prior knowledge of probability.
The main part of the book consists of Chapters 2-5. Even though I regard
the order presented here to be the most natural, one can start, after a review
of the relevant part of Chapter 1, with Chapter 2, 3 or 4, and with a little
discussion of independence, Chapter 5 can be studied. The last four chapters
concern applications and problems arising from the preceding work aiid partly
generalizing it. The material there indicates some of the inany directions along
which the theory is progressing.
There are several exercises at the end of each chapter. Some of these
are routine, but others demand more serious effort. For many of the latter
type, hints are provided, and there are a few that complement the text (e.g.,
Spitzer's identity and aspects of stability in Chapter 5); for them, essentially
complete details are given. I present some of these not only as good illustra-
tions but also for reference purposes.
I have included in the list of references only those books and articles that
influenced my treatment; but other works can be obtained from these sources.
Detailed credits and priorities of discovery have not been scrupulously as-
signed, although historical accounts are given in the interest of motivation.
For cross-referencing purposes, all the items in the book are serially nuin-
bered. Thus 3.4.9 is the ninth item of Section 4 of Chapter 3. In a given
section (chapter) the corresponding section (and chapter) number is omitted.
The material presented here is based on the subject as I learned it from
Professor A/I. D. Donsker's beautiful lectures many years ago. I feel it is appro-
priate here to express my gratitude to him for that opportunity. This book has
benefited from my experience with generations of participants in my classes
and has been read by Derek K. Chang from a student's point of view; his
questions have resolved several ambiguities in the text. The manuscript was
prepared with partial support from an Office of Naval Research contract aiid
a University of California, Riverside, research grant. The difficult task of con-
verting my handwritten copy into the finished typed product was ably done
by Joyce Kepler, Joanne McIntosh, and Anna McDermott, with the care and
interest of Florence Kelly. Both D. Chang and J. Sroka have aided me in proof-
reading and preparation of the Index. To all these people and organizations I
wish to express my appreciation for this help aiid support.
(M.M. RUO)

List of Symbols
a.a.
a.e.
ch.f.(s)
d.f.(s)
iff
i.i.d.
r.v.(s)
m.g.f.
AAB
0
(a,b?
(a,c,P)
P(fl ( A ? )
X A
A
v
R
c
N
alinost all
alinost everywhere
characteristic function(s)
distribution function(s)
if and only if
independent identically distributed
random variable(s)
moment generating function
symmetric difference of A and B
empty set
open interval
a probability space
= P[f E A](= (Po fpl(A))
indicator of A
minimum symbol
maximum symbol
reals
complex iiuinbers
natural iiuinbers (=positive integers)
sigma algebra generated by the r.v.s Xi,
i = 1 , ...,n
variance of the r.v. X
correlation of X and Y
the set of scalar r.v.s on
( 0 7 C,P?
the set of pth power
integrable r.v.s. on (0,C,P)
the Lebesgue space of
equivaleiice classes of
r.v.s from C"
usually a partition of a set
the Lebesgue space on R
with Lebesgue measure
u is absolutely continuous
relative to p (measures)
u is singular relative to p

xviii List of Symbols
= [ E ( I X ~ ) ] ' / ~
= [JQI x I ~ ~ P ] ~ / ~
=p-norm of X
integral part of the real
number n > 0
topological equivalence
means an/bn + 1 as n + cc
boundary of the set A
distinguished logarithm
of P
signum function
convolution of f l and f 2 in
L1(R)
the kth binomial coefficient

Part I Foundations
The mathematical basis of probability, namely real analysis, is sketched
with essential details of key results, including the fundamental law of prob-
ability, and a characterization of uniform integrability in Chapter 1 which is
used frequently through out the book. Most of the important results on inde-
pendence, the laws of large numbers, convergence of series as well as some key
applications on random walks and queueing are treated in Chapter 2, which
also contains some important complements as problems. Then a quite detailed
treatment of conditional probabilities with applications to Markovian fami-
lies, martingales, and the Kolmogorov-Bochner-Tulcea existence theorems on
processes are included in Chapter 3. Also important additional results are
in a long problems section. The basic foundations of modern probability are
detailed in this part.

Chapter I
Background Material and Preliminaries
In this chapter, after briefly discussing the begiiiiiiiigs of probability theory
we shall review some standard background material. Basic concepts are intro-
duced and immediate consequences are noted. Then the fundamental law of
probability and some of its implications are recorded.
1.1What Is Probability?
Before considering what probability is or what it does, a brief historical discus-
sion of it will be illuminating. In a general sense, one can think of a probability
as a long-term average, or (in a combinatorial sense) as the proportion of the
number of favorable outcomes to the number of possible and equally likely
ones (all being finite in number in a real world). If the last condition is not
valid, one may give certain weights to outcomes based on one's beliefs about
the situation. Other concepts can be similarly formulated. Such ideas are still
seriously discussed in different schools of thought on probability.
Basically, the concept originates from the recognition of the uncertainty of
outcome of an action or experiment; the assignment of a numerical value arises
in determining the degree of uncertainty. The need for measuring this degree
has been recognized for a very long time. In the Indian Jaiiia philosophy the
uiicertaiiity was explicitly stated as early as the fifth century B.C., and it was
classified into seven categories under the name syadvadasystem. Applications
of this idea also seem to have been prevalent. There are references in medieval
Hindu texts to the practice of giving alms to religious mendicants without
ascertaining whether they were deserving or not. It was noted on observation
that "only ten out of a hundred were undeserving," so the public (or the

4 1 Background Material and Preliminaries
donors) were advised to continue the practice. This is a clear forerunner of
what is now known as the frequency interpretation of probability.
References related to gambling may be found throughout recorded history.
The great Indian epic, the Mahabharata, deals importantly with gambling.
Explicit iiuinerical assignment, as in the previous example, was not always
recorded, but its implicit recognition is discernible in the story. The Jaina case
was discussed with source material by Mahalanobis (1954), and an interesting
application of the syudvuda system was illustrated by Haldane (1957).
On the other hand, it has become customary among a section of historians
of this subject to regard probability as having its roots in calculatioiis based
on the assuinptioii of equal likelihood of the outcomes of throws of dice. This
is usually believed to start with the correspondence of Fermat and Pascal in
the 1650s or (occasionally) with Cardano in about 1550 and Galileo a little
later. The Fermat-Pascal correspondence has been nicely dramatized by R h y i
[seehis book (1970) for references] to make it more appealing and to give the
impression of a true begiiiiiing of probabilistic ideas.
Various reasons have been advanced as to why the concept of probability
could not have started before. Apparently an unwritten edict for this is that
the origins of the subject should be coupled approximately with the Industrial
Revolutioii in Europe. Note also that the calculations made in this period
with regard to probability, assume equal likelihood. However, all outcomes
are iiot always equally likely. Thus the true starting point must come much
later-perhaps with E. Borel, A. Liapounov, and others at the end of the
nineteenth century, or even only with Kolmogorov's work of 1933, since the
presently accepted broad based theory started only then! Another brief
personal viewpoint is expressed in the elementary text by Neuts (1973). We
cannot go into the merits of all these historical formulatioils of the subject
here. A good scholarly discussion of such a (historical) basis has been given in
Maistrov's book (1974). One has to keep in mind that a considerable amount
of subjectivity appears in all these treatments (which may be inevitable).
Thus the preceding sketch leads us to conclude that the concepts of un-
certainty aiid prediction, aiid hence probabilistic ideas, started a long time
ago. Perhaps they can be placed 2500 years ago or more. They may have
originated at several places in the world. The methods of the subject have
naturally been refined as time went on. Whether there has been cross- fertil-
ization of ideas due to trade and commerce among various parts of the world
in the early developinelit is iiot clear, although it cannot be ruled out. But the
sixteenth-seventeenth century "beginning" based on gambling and problems
of dice cannot be taken as the sole definitive starting point of probability.
With these generalities, let us turn to the present-day concept of probability
that is the foundation for our treatment of the subject.
As late as the early 1920s, R. von hlises summed up the situation, no doubt in
despair, by saying, "Today, probability theory is not a mathematical science."

1.1 What Is Probability? 5
As is clear from the preceding discourse, probability is a numerical mea-
sure of the uncertainty of outcomes of an action or experiment. The actual
assignment of these values must be based on experience and should generally
be verifiable when the experiment is (if possible) repeated under essentially
the same conditions. From the modern point of view, therefore, we consider all
possible outcomes of an experiment and represent them by (distinct) points of
a nonempty set. Since the collection of all such possibilities can be infinitely
large, various interesting combinations of them, useful to the experiments,
have to be considered. It is here that the modern viewpoint distinguishes it-
self by introducing an algebraic structure into the coinbiiiatioiis of outcomes,
which are called events. Thus one coiisiders an algebra of events as the pri-
mary datum. This is evidently a computational convenience, though a decisive
one, and it must and does include everything of conceivable use for an exper-
iment. Then each event is assigned a iiuinerical measure corresponding to
the "amount" of uncertainty in such a way that this assignment has natural
additivity and coiisistency properties. Once this setup is accepted, an ax-
iomatic formulation in the style of twentieth-century mathematics in general
becomes desirable as well as inevitable. This may also be regarded as building
a mathematical model to describe the experiment at hand. A precise and sat-
isfactory formulatioil of the latter has beeii given by Kolmogorov (1933), and
the resulting analytical structure is almost universally accepted. In its mani-
fold applications, some alterations have beeii proposed by de Finetti, R&iyi,
Savage, and others. However, as shown by the first author (Rao 1981) in a
monograph on the modern foundations of the subject, the analytical structure
of Kolmogorov actually takes care of these alterations when his work is inter-
preted from an abstract point of view. This is especially relevant in the case
of coiiditioiial probabilities, which we discuss in detail in Chapter 3. Thus we
take the Kolmogorov setup as the basis of this book and develop the theory
while keeping in contact with the phenomenological origins of the subject as
much as possible. Also, we illustrate each concept as well as the general theory
with concrete (but not necessarily numerical) examples. This should show the
importance and definite utility of our subject.
The preceding account implies that the methods of real analysis play a key
role in this treatment. Indeed they do, and the reader should ideally be already
familiar with them, although concurrent study in real analysis should suffice.
Dealing with special cases that are immediately applicable to probability is
not necessary. In fact, experience indicates that it can distort the general
coinpreheiisioii of both subjects. To avoid misunderstanding, the key results
are recalled below for reference, mostly without proofs.
With this preamble, let us start with the axiomatic formulation of Kol-
mogorov. Let fl be a noiieinpty point set representing all possible outcomes
of an experiment, and let C be an algebra of subsets of fl. The members
of C, called events, are the collections of outcomes that are of interest to
the experimenter. Thus C is nonempty and is closed under finite unions and
complements, hence also under differences. Let P : C + R+ be a mapping,

called a probability, defined for all elements of C so that the following rules
are satisfied.
(I) For each A E C,0 < P(A) and P(a)= 1.
(2) A, B E C, A n B = 0,implies P ( A U B) = P(A) +P(B).
From these two rules, we deduce immediately that (i) (taking B = 0)
P(0)= 0 and (ii) A > B, A, B E C, implies P ( A - B) = P(A) - P(B).In
particular, P(AC)= 1 - P(A) for any A E C, where AC= a - A.
Such a P is called a "finitely additive probability." At this stage, oiie
strengthens (2) by introducing a continuity condition, namely, countable ad-
ditivity, as follows:
(2') If Al, A2,. . . are disjoint events of 0 such that A = U:!& is also
an event of R,then P(A) = Cr?lP(Ak).
Clearly (2') implies (2), but trivial examples show that (2), is strictly
weaker than (2'). The justification for (2') is primarily operational in that a
very satisfactory theory emerges that has ties at the deepest levels to many
branches of mathematics. There are other cogent reasons too. For instance,
a good knowledge of the theory with this "countably additive probability"
enables oiie to develop a finitely additive theory. Indeed, every finitely addi-
tive probability fuiiction can be made to correspond uniquely to a countably
additive one on a "nice" space, according to an isomorphism theorem that
depends on the Stone space representation of Boolean algebras. For this and
other reasons, we are primarily concerned with the countably additive case,
and so henceforth a probability function always stands for one that satisfies
rules or axioms (1) and (2'). The other concept will be qualified "finitely
additive," if it is used at all.
If P : C + R+ is a probability in the above sense and C is an algebra,
it is a familiar result from real analysis that P can be uniquely extended to
the a-algebra (i.e., algebra closed under countable unions) generated by C
(i.e., the smallest 0-algebra containing C). Hence we may and do assume for
conveiiieiice that C is a a-algebra, and the triple (a,C,P) is then called a
probability space. Thus a probability space, in Kolmogorov's model, is a fi-
nite measure space whose measure function is normalized so that the whole
space has measure one. Consequently several results from real analysis can be
employed profitably in our study. However, this does not imply that probabil-
ity theory is just a special case of the standard measure theory, since, as we
shall see, it has its own special features that are absent in the general theory.
Foremost of these, is the concept of probabilistic (or statistical) independence.
With this firmly in hand, several modifications of the concept have evolved,
so that the theory has been enriched and branched out in various directions.
These developments, some of which are considered in Chapter 3, attest to the
individuality and vitality of probability theory.
A concrete example illustrating the above discussion is the following:

1.2 Random Variables 7
Example 1. Let R, = {0,1} be a two-point space for each i = 1,2,....
This space corresponds to the ith toss of a coin, where 0 represents its tail
and 1 its head and is known as a Bernoulli trial. Let Ci = (0,{0}, {I}, Qi}
aiid Pi({O))= q aiid Pi({l))= p, O < p = 1 - q < 1. Then (ai,Ci,Pi),
i = 1,2,..., are identical copies of the same probability space. If (fl,C,P)[=
@i21(R,, C,,Pi)]is the product measure space, then 0 = {x : x =
(21x2,. . .),xi = 0 , l for all i}, and C is the a-algebra generated by the semir-
ing C = {Inc R : Inconsists of those x E 0 whose first n components
have a prescribed pattern). For instance, I2can be the set of all x in fl
whose first two components are 1. If In(€ C) has the first n components
coiisistiiig of k 1's aiid n - k O's, then P(I,) = pQn-" aiid P ( R ) = 1.
[Recall that a semiring is a nonempty class C which is closed under inter-
sections and if A, B E C,A c B , then there are sets Ai E C such that
A = A1 c ... c An = B with Ai+1 - Ai E C.]
The reader should verify that C is a semiriiig and that P satisfies coii-
ditioiis (I) aiid (2'), so that it is a probability on C with the above-stated
properties. We use this example for some other illustrations.
1.2 Random Variables and Measurability Results
As the definition implies, a probability space is generally based on an abstract
point set 0 without any algebraic or topological properties. It is therefore use-
ful to consider various mappings of R into topological spaces with finer struc-
ture in order to make available several mathematical results for such spaces.
We thus coiisider the simplest aiid most familiar space, the real line R.To
reflect the structure of C ,we start with the a-algebra B of R, generated by all
open intervals. It is the Borel a-algebra. Let us now introduce a fundamental
concept:
Definition 1 A random variable f on fl is a finite real-valued measur-
able function. Thus f : fl + R is a raiidoin variable if fP1(B) c C, where
B is the Borel a-algebra of R; or fP1(A) = {w : f (w) E A} E C, for
A = (-oo,x),x E R. (Also written f-'(-oo,x), [f < x] for fP1(A).)
Thus a raiidoin variable is a fuiictioii and each outcome w E R is assigned
a real number f (w) E R.This expresses the heuristic notion of "randomness"
as a mathematical concept. A fundamental nature of this formulation will be
seen later (cf. Problem 5 (c) of Chapter 2). The point of this concept is that
it is of real interest when related to a probability function P. Its relation is
obtained in terms of image probabilities, also called distribution functions in
our case. The latter coiicept is given in the following:

Definition 2 If f : R + R is a random variable, then its distribution
function is a mapping Ff: R + R+ given by
Evidently P and f uniquely determine F f . The converse implication is
slightly involved. It follows from definitions that Ffis a nonnegative nonde-
creasing left continuous [i.e.,Ff(x - 0) = Ff(x)]bounded mapping of R into
[O, I] such that lim,,-, F(z)= Ff(-oo) = 0, Ff(+oo) = lim,,+, F(z)=
1. Now any function F with these properties arises from some probability
space; let fl = R, C = B, f = identity, and P(A) = JAdF,A E B. The
general case of several variables is considered later. First let us present some
elementary properties of random variables.
In the definition of a raiidoin variable, the probability measure played no
part. Using the measure function, we can make the structure of the class of all
random variables richer than without it. Recall that (fl,C,P) is complete if
for any null set A E C [i.e.,P(A) = 0] every subset B of A is also in C, so that
P(B) is defined and is zero. It is known and easy to see that every probability
space (indeed a measure space) can always be completed if it is not already
complete. The need for coinpletioii arises from simple examples. In fact, let
fl,fi, ... be a sequeiice of raiidoin variables that forms a Cauchy sequeiice in
measure, so that for E > 0, we have lim,,,,, P [f, - fml > E] = 0. Then
there may not be a unique random variable f such that
However, if (0,C,P) is complete, then there always exists such an f , and if
f' is another limit function, then P{w : f ( w ) # fl(w)} = 0; i.e., the limit is
unique outside a set of zero probability. Thus if, Lo is the class of random
variables on (R,C,P), a complete probability space, then Lo is an algebra
and contains the limits of sequences of random variables that are Cauchy in
measure. (See Problem 3 on the structure of Lo.) The following measurability
result on fuiictioiis of random variables is useful in this study. It is due to Doob
and, in the form we state it, to Dynkin. As usual, B is the Bore1 0-algebra of R.
Proposition 3 Let (R,C) and (S,A) be measurable spaces and f : 0 +
S be measurable, i.e., fP1(A) c C. Then a function g : 0 + R is measurable
relative to the a-algebra fpl(A) [i.e., gpl(B) c fpl(A)] iff ( = if and only
if) there is a measurable function h : S + R such that g = h o f . (This result
is sometimes refered to, for coiivience, as the "Doob-Dynkin lemma.")
Proof One direction is immediate. For g = h o f : R + R is measurable
implies gpl (B) = (h o f)-l(B) = f p l(hpl(B)) c f -'(A), since hpl(B) c A
For the converse, let g be fpl(A)-measurable. Clearly fpl(A) is a a-
algebra contained in C. It suffices to prove the result for g simple, i.e., g =
Czla i x ~ ,,A, E f (A). Indeed, if this is proved, then the general case is

obtained as follows. Since g is measurable for the a-algebra fP1(A), by the
structure theorem of measurable functions there exist simple functions g,,
measurable for fpl(A), such that gn(w) + g(w) as n + oo for each w E 0.
Using the special case, there is an A-measurable h, : S + R, g, = h, o f ,
for each n > 1. Let So= {s E S : hn(s) + h(s),n + 00). Theii S o E A, and-
g(R) c S.Let h(s) = h(s) if s E So,= 0 if s E S-So. Then h is A-measurable
and g(w) = h(f (w)),w E 0. Consequently, we need to prove the special case.
Thus let g be simple: g = Cr=l aixA,, and Ai = fP1(Bi)E f -'(A), for a
Bi E A. Define h = Cy=l a i x ~ ,.Theii h : S + R is A-measurable aiid simple.
[Herethe B j need not be disjoint even if the A, are. To have symmetry in the
definitions, we may replace Bi by C,, where C1 = B1 aiid Ci = B, - ~ 5 1 : ~ ~
for i > 1. So Ci E A, disjoint, fpl(C,) = Ai, aiid h = Cy=l a,xc, is the same
function.] Thus
aiid h o f = g. This completes the proof.
A number of specializations are possible from the above result. If S = Rn
and A is the Borel a-algebra of Rn, then by this result there is an h : Rn + R,
(Borel) measurable, which satisfies the requirements. This yields the following:
Corollary 4 Let (R,C) and (Rn,A) be measurable spaces, and f : fl +
Rn be measurable. Then g : fl + R is fpl(A)-measurable iff there is a Borel
measurable function h : Rn + R such that g = h(fl, fi, . .. ,f,) = h o f where
If A is replaced by the larger a-algebra of all (completion of A) Lebesgue
measurable subsets of Rn, then h will be a Lebesgue measurable function.
The above result will be of special interest in studying, among other things,
the structure of conditional probabilities. Some of these questions will be con-
sidered in Chapter 3. The mapping f in the above corollary is also called a
multidimensional random variable and f of the theorem, an abstract random
variable. We state this concept for reference.
Definition 5 Let (0,C ) be a measurable space and S be a separable
metric space with its Borel a-algebra. (E.g., S = Rn or Cn or R?). Then a
mapping f : R + S is called a generalized (or abstract) random variable (and
random vector if S = Rn or Cn) whenever fpl(B)E C for each open (or
closed) set B c S, and it is a random variable if S = R. [See Problem 2b for

an alternative definition if S = Rn.]
As a special case, we get f : 0 + C,where f = f l +if2, fj : R + R,j =
1,2, is a complex random variable if its real and imaginary parts fl,f 2 are
(real) random variables. To illustrate the above ideas, consider the following:
Example 6 Let (R,C,P) be the space as in the last example, and
f, : 0 + R be given as fn(w) = n if the first 1 appears on the nth com-
ponent (the preceding are zeroes), = 0 otherwise. Siiice C = 2Q = ?(a),it is
clear that f, is a random variable aiid in fact each fuiictioii on fl is measurable
for C. This example will be further discussed in illustrating other concepts.
Resuming the theme, it is necessary to discuss the validity of the results
on 0-algebras generated by certain simple classes of sets aiid functions. In this
connection the monotone class theorem aiid its substitute, as introduced by
E. B. Dynkin, called the (T, A)-classes will be of some interest. Let us state
the concept and the result precisely.
Definition 7 A nonempty collection C of subsets of a nonempty set 0 is
called (i) a monotone class if {A,, n > 1) c C, An moiiotoiie +- limn A, E C,
(ii) a T-(or product) class if A, B E C +- A n B E C, (iii) a A- (or latticial) class
if (a) A , B E C , A ~ B= B + A u B E C , (b) A , B E C , A B + A - B E
C, 0 E C, and (c) A, E C, A, c A,+l + uA, E C; (iv) the smallest class
of sets C containing a given collection A having a certain property (e.g., a
monotone class, or a a-algebra) is said to be generated by A.
The following two results relate a given collectioii and its desirable gener-
ated class. They will be needed later on. Note that a A-class which is a T-class
is a a-algebra. We detail some nonobvious (mathematical) facts.
Proposition 8 (a) IfA is an algebra, then the monotone class generated
by A is the same as the a-algebra generated by A.
(b) If A is a A-class and B is a T-class, A > B, then A also contains the
a-algebra generated by B.
Proof The argument is similar for both parts. Siiice the proof of (a) is in
most textbooks, here we prove (b).
The proof of (b) is not straightforward, but is based on the followiiig idea.
Consider the collection A1 = {A c 0 : A n B E A. for all B E B}. Here
we take A. > B, and A. is the smallest A-class, which is the intersection of
all such collectioiis containing B. The class A1 is not empty. In fact B c dl.
We observe that A1 is a A-class. Clearly R E All Ai E All Al n A, = 8 +-
A, n B,i = 1,2, are disjoint for all B E B, and Ai n B E do.Since A. is a
A-class, (A1 u A,) n B = (A1 n B) u (A, n B) E Ao, so that Al U A, E A1.
Similarly A1 > A2 +AlnB-A2nB = (Al-A2)nB E A. andA1-A2 E A1.

The monotonicity is similarly verified. Thus A1 is a A-class. Since A. is the
smallest A-class, A1 > A. > B. Hence A E A0 cA1, B E B +A n B E Ao.
Next consider A2 = {A c R : A n B E Ao, all B E Ao) By the pre-
ceding work, A2 3 B and, by an entirely similar argument, we can coiiclude
that Aa is also a A-class. Hence Az 3 A0 3 B. This means with A, B E Ao,
A n B E A. c Aa, and hence A. is a T-class. But by Definition 7, a collection
which is both a T- and a A-class is a a-algebra. Thus A. is a a-algebra > B.
Then a(B) c Ao, where a(B) is the generated a-algebra by B. Since A. c A,
the proposition is proved.
The next result, containing two assertions, is of interest in theoretical ap-
plications.
Proposition 9 Let B(R) be the space of real bounded functions on R and
'Ft c B(fl) be a linear set containing constants and satisfying (i) f, E 'Ft, f, +
f uniformly +- f E 'Ft, or (4) f E 'Ft +- f* E 'Ft, where f + = max(f,O) and
f - = f + - f , a n d ( i i ) O < f , ~ ' F t , f , T f , f ~ B ( Q ) + f ~ ' F t . I f C c ' F t
is any set which is closed under multiplication and C = a(C) is the small-
est a-algebra relative to which every element of C is measurable, then every
f (E B(R)) which is C-measurable belongs to 'Ft. The same conclusion holds
if C c C is not necessarily closed under multiplication, but 'Ft satisfies (i')
[instead of (i)]C is a linear set closed under infima, and f E C +f A 1E C.
Proof The basic idea is similar to that of the above result. Let A. be an
algebra, generated by C and 1,which is closed under uniform convergence and
is contained in 'Ft. Clearly A. exists. Let dl be the largest such algebra. The
existelice of A1 is a consequence of the fact that the class of all such A. is
closed under unions and hence is partially ordered by inclusion. The existence
of the desired class A1 follows from the maximal principle of Hausdorff.
If f E A1, then there is a k > 0, such that if f l 5 k and if p(.) is
any polynomial on [-k, k], then p(f) E dl.Also by the classical Weierstrass
approximation theorem the function h : [-k, k] + R,h(z) = 1x1, is the
uniform limit of polynomials p, on [-k, k]. Hence p,(f) + If uniformly so
that (by the uniform closure of A1) If 1 E A1 and A1 is a vector lattice.
Observe that A1 automatically satisfies (ii), since if 0 < g, E Al, g, 1'
g E B(fl), then g E 'Ft, aiid if A2 is generated by dl aiid g (as Ao), then by
the inaximality of All Az = d l . Thus A1 satisfies (i) and (ii) aiid is a vector
lattice. The second part essentially has this conclusion as its hypothesis. Let us
verify this. By (i'), iff E 'Ft, then f* E 'Ft, so that f ++f - = f 1 E 'Ft. Hence
if f ,g E 'Ft, then f V g = +(if - gl +f +g) E 'Ft, since f - g E 'Ft (because
'Ft is a vector space). Thus 'Ft is a vector lattice. Coiisequently we coiisider
vector lattices containing C and 1which are subsets of 'Ft. Next one chooses
a maximal lattice (as above). If this is A;, then it has the same properties as
Az. Thus it suffices to consider Az and prove that each f in B(R) which is
C-measurable is in A2 (C 'Ft ).

Let S = {A c R : XA E A2) Since A2 is an algebra, S is a T-class. Also
S is closed under disjoint unions and monotone limits. Thus it is a A-class
as well, and by the preceding proposition it is a a-algebra. If 0 < g [ ~B(R)]
is S-measurable, then there exist 0 < g, 1' g: g, is an S-measurable simple
function. But then gn E A2 and so g E A2 also. Since A2 is a lattice, this result
extends to all g E B ( 0 ) which are S-measurable. To complete the proof, it is
only necessary to verify C = a(C) c S . Let 0 < f E C and B = [ f > 11 E C.
We claim that B E S . In fact, let g = f A 1. Then g E A2 and 0 < g < 1.
Now [g = 11 = B aiid [g < 11 = Be. Thus gn E A2 aiid gn J 0 on Be,
or 1 - gn 1 on Be. Since 1 - gn E A2, and it is closed under bounded
monotone limits, we have 1 - gn 1' XB? E A2 + BCE S, so that B E S. If
0 < f E C,B, = [ f > a ] = [ f l u > 11 for a > 0, then f l u E A2 and by the
above proof B, E S for each a. But such sets as B, clearly generate C, so
that C c S. This completes the result in the algebra case.
In the lattice case A, B E S + XAXB = m i i i ( ~ ~ ,xB) E A;, SO that
A n B E S . Thus S is a n-class again. That it is a A-class is proved as before,
so that S is a a-algebra. The rest of the argument holds verbatim. Since with
each f E C, one has f A 1 E C we do not need to go to A;, and the proof is
simplified. This establishes the result in both cases.
1.3 Expectations and the Lebesgue Theory
If X : R + R is a random variable (r.v.) on C1, then X is said to have an
expected value iff it is integrable in Lebesgue's sense, relative to P . This means
X is also integrable. It is suggestively denoted
E ( X ) = E p ( X ) = XdP,
Lthe integral on the right being the (absolute) Lebesgue integral. Thus E ( X )
exists, by definition, iff E(IX1) exists. Let C1 be the class of all Lebesgue
integrable functions on (R,C,P). Then E : C1 + R is a positive linear
mapping since the integral has that property. Thus for X, Y E C1 we have
and E(1) = 1 since P ( R ) = 1, E ( X ) > 0 if X > 0 a.e. The operator E is
also called the (mathematical) expectation on C1. It is clear that the standard
results of Lebesgue integration are thus basic for the following work. In the
next section we relate this theory to the distribution function of X .
To fix the notation and terminology, let us recall the key theorems of
Lebesgue's theory, the details of which the reader can find in any standard

1.3 Expectations and the Lebesgue Theory 13
text on real analysis [see,e.g., Royden (1968, 1988),Sion (1968),or Rao (1987,
2004)l.
The basic Lebesgue theorems that are often used in the sequel, are the
following:
Theorem 1 (Monotone Convergence) Let 0 < XI < X2 < .. . be a
sequence of random variables on (0,C ,P ) . ThenX = limn Xn is a measurable
(extended) real valued function (or a "defective" random variable) and
lim E(X,) = E ( X )
n-00
holds, where the right side can be infinite.
A result of equal importaiice is the following:
Theorem 2 (Dominated Convergence) Let {X,, n > 1) be a sequence
of random variables on (R,C ,P) such that (2) limn,, Xn = X exists at
all points of R except for a set N c fl, P ( N ) = 0, (written Xn + X
a.e.), and (ii) X,I < Y, an r.v., with E(Y) < oo. Then X is an r.v. and
limn E(X,) = E(X) holds, all quantities being finite.
The next statement is a consequence of Theorem 1.
Theorem 3 (Fatou's Lemma) Let {X,,n > 1) be any sequence of
nonnegative random variables on (R,C,P ) . Then we have E(1iminf, X,) <
lim inf, E(X,) .
In fact, if Yk = inf{Xn, n > k), then Theorem 1 applies to {Yk,k > 1).
Note that these theorems are valid if P is replaced by a noiifiiiite measure.
Many of the deeper results in analysis are usually based on inequalities.
We present here some of the classical inequalities that occur frequently in our
subject. First recall that a mapping q5 : R + R is called convex if for any
a , p > O , a + P = l , o n e has
From this definition, it follows that if {&, n > 1) is a sequence of convex
functions a, E R', then CT=la,& is also convex on R, and if $, + $, then
$ is convex. Further, from elementary calculus, we know that each twice-
differentiable function q5 is convex iff its second derivative $/' is nonnegative.
It can be shown that a measurable convex function on an open interval is nec-
essarily coiitiiiuous there. These facts will be used without comment. Hereafter
"convex function" always stands for a measurable convex function on R.
Let $(x) = - logx, for x > 0. Then $'/(x) > 0, so that it is convex. Hence
(3) becomes

Since log is an increasing function, this yields for a >0,,C? >0,z > 0,y > O
For any pair of random variables X, Y on (R,C,P),and p > 1,q = p/(p - l),
we define I X l p = [E(xI")]~/",1 < p < oo, and IIX, (= essential supre-
mum of X I ) = inf{k > 0 : P[IXI > k] = 0). Then I . I p , 1 < p < cm,
is a positively homogeneous illvariant metric, called the p-norm; i.e., if
d(X,Y) = IX - Y I I., then d(.,.)is a metric, d(X + 2,Y + 2) = d(X,Y)
aiid d(aX,0) = la d(X,0),a E R.We have
Theorem 4 Let X, Y be random variables on (0,C ,P ) . Then
(i) Holder's Inequality
(ii) Minkowski's Inequality
Proof (i) If IXIIP = 0 , or IYI, = 0 , t h e n X = O a.e., or Y =Oa.e., so
that (5) is true and trivial. Now suppose I X I 1. > 0, and I Y 1 , > 0. If p = 1,
then q = cm, and we have I Yl loo = ess sup YI, by definition (= k, say),
Thus (5) is true in this case. Let then p > 1, so that q = p/(p - 1)> 1. In (4)
set a = l/p,,C?= l/q,z = (lX/IXI,)P(w), aiid y = (IY/IY1,)q(w). Then it
becomes
Applying the (positive) operator E to both sides of (7) we get
This proves (5) in this case also, and hence it is true as stated.
(ii) Since X +Y P < 2Pmax(Xlp,IYlP) <2P[XlP+IYlP],the linearity of
E implies E(IX +YIP) < cm, so that (6) is meaningful. If p = 1, the result
follows from X +Y < X I +IYI. If p = oo, X I < IIX,, YI < I Y , , a.e.
Hence IX +YI < 1x1, + IYl,, a.e., so that (6) again holds in this case.

1.3 Expectations and the Lebesgue Theory 15
Now let 1 < p < oo. If I X +YIP = 0, then (6) is trivial and true. Thus let
IIX+YIl, > 0 also.
Consider
Since (p - 1) > 0, let q = p/(p - 1).Then (p - l)q = p, and
Hence applying (5) to the two terms of (8) separately we get
or
I IX +YI I. 1x1I. + I lYl 1..
This completes the proof.
Some specializatioiis of the above result, which holds for any measure
space, in the context of probability spaces are needed. Taking Y = 1 a.e. in
Hence writing 4(x) = ZIP, (9) says that q5(E(XI)) IE(q5(X)).We prove
below that this is true for any contiiiuous convex fuiictioii 4, provided the
respective expectations exist. The significance of (9) is the following. If X is
an r.v., s > 0, and E ( X I S ) < oo, then X is said to have the sth moment
finite. Thus if X has pth moment, p > 1, then its expectation exists. More is
true, namely, all of its lower-order moments exist, as seen from
Corollary 5 Let X be an r .v., on a probability space, with sth moment
finite. If 0 < r < s, then ( E ( X I ~ ) ) ~ / ~< ( E ( x I ~ ) ) ~ / ~ .More generally, for
any 0 <r,, i = 1,2,3, if ifr, = E(IXIrf),we have the Liapounov inequality:

so that all lower-order moments exist. The inequality holds if we show that
&" is a noiidecreasing fuiictioii of r > 0. But this follows from (9) if we let
p = s/r > 1 and replace X by X I Tthere. Thus
which is the desired result on taking the sth root.
For the Liapounov inequality (lo), note that Po = 1, and on the open
interval (0,s), p, is twice differentiable if P, < ce [use the dominated con-
vergence (Theorem ) for differentiation relative to r under the integral sign],
and
Let y, = logp,. If X $ 0 a.e., then this is well defined and
because
by the Holder inequality with exponent 2. Thus y, is also convex in r. Taking
a = r3/(r2 +r3),P = r~/(ra+r3) and x = rl,y' = rl +r 2 +r3 in (4) with
4(r)= y,, one gets ax +Py' = rl +rz, so that
which is (10). Note that the coiivexity of y, can also be proved with a direct
application of the Holder inequality. This completes the proof.
The special case of (5) with p = 2 is the classical Cauchy-Buniakowski-
Schwarz (or CBS) inequality. Due to its great applicational potential, we state
it as
Corollary 6 (CBS Inequality) If X , Y have two moments finite, then
X Y is integrable and
Proof Because of its interest we present an independent proof. Since X , Y
have two moments, it is evident that t X +Y has two moments for any t E R,
and we have

1.3 Expectations and the Lebesgue Theory
This is a quadratic equation in t which is never negative. Hence it has no
distinct real roots. Thus its discriminant must be nonpositive. Consequently,
This is (11),and the proof is complete.
Remark The conditions for equality in (5), (6), (lo), and (11) can be
obtained immediately, and will be left to the reader. We invoke them later
when necessary.
One can now present the promised generalization of (9) as
Proposition 7 (Jensen's Inequality) If $ : R + R is convex and X
is an r.v. on (R,C ,P ) such that E ( X ) and E(4(X)) exist, then
Proof Let xo, x1 be two points on the line and x be an intermediate point
so that x = ax1 +pxo, where 0 Ia I 1,a +p = 1. Then by (3)
For definiteness, let zo < x < xl so that with n = (z - xo)/(zl - xo),,C? =
(21 - x)/(xl - xo), we get x. Hence the above inequality becomes
so that
(x - xo)($(x) - 4(x1)) I (x1 - x)(4(xo)- 4(x)).
By setting y = 21, yo = x, this becomes
In this inequality, written as 4(y) > 4(yo)+g(yo)(y- yo),the right side is
called the support line of 4 at y = yo. Let X(w) = y, and yo = E ( X ) in (13).
Then 4(X) is an r.v., and taking expectations, we get
This is (12), and the result holds. [Note: ti < tz +-g(t1) <g ( t ~ ) . ~ ]
This is not entirely trivial. Use (3) in different forms carefully. [See, e.g., G.H.
Hardy, J.E. Littlewood, and G. PolyA (1934, p. 93).]

In establishing (10) we first showed that &/' = [ E ( I X I ~ ) ] ~ I ~is an increas-
ing function of r. This has the following consequence:
Proposition 8 For any random variable X , lim,,,, (E[IX~'])~I'=
1x1I'm
Proof If 1x1, = 0,X = 0 a.e., the result is true. So let 0 < k = 1x1, <
cm.Then, by definition, P[IXI > k] = 0. Hence
so that for any 0 < t < k,
Letting r + cc in (14), we get k > l i m , , , ( ~ ( I ~ ~ ) ) ~ / ~> t. Since t < k is
arbitrary, the result follows on letting t 1' k.
Let X , Y be two random variables with finite second moments. Then we
can define (a) the variance of X as
which always exists since a 2 ( X ) < E(X2) < cm;and (b) the covariance of
X. Y as
This also exists since by the CBS inequality,
The normalized covariance, called the correlation, between X and Y, denoted
p(X, Y), is then
cov(X,Y)
p(X,Y) =
a(X)a(Y)'
(18)
where a(X),o(Y) are the positive square roots of the corresponding variances.
Thus Ip(X,Y ) < 1 by (17). The quantity 0 < o(X) is called the standard
deviation of X . Note that if E ( X ) = 0, then p2 = a2(X), and generally
Pz > a2(X),by (15).
Another simple but very useful inequality is given by
Proposition 9 (i) (Markov's Inequality) If < : R + R+ is a Bore1
function and X is an r.v. on (a,C ,P), then for any X > 0,

1.3Expectations and the Lebesgue Theory
(ii) ( ~ e b ~ ~ e v ' sInequality) If X has a finite variance, then
Proof For (i) we have
(ii). In (19), replace X by X - E(X), [(x) by x2 aiid X by X2. Then 6
being one-to-one on R', [IX - E(X)l2 > X2] = [IX - E(X)I > X],and the
result follows from that inequality.
Another interesting consequence is
Corollary 10 If X I , . ..,Xn are n random variables each with two mo-
ments finite, then we have
and if they are uncorrelated [i.e., p(Xi,XI) = 0 for i # j] then
This follows immediately from definitions. The second line says that for
uncorrelated random variables, the variance of the sum is the sum of the
variances. We later strengthen this concept into what is called "independence"
and deduce several results of great importance in the subject.
For future use, we iiiclude two fundamental results on multiple integration
aiid differentiation of set functions.
Theorem 11 (i) (Fubini-Stone) Let (Qi,Ci,pi)i = 1,2, be a pair of
measure spaces and (R,C,p) be their product. I f f : Q + R is a measurable
and p-integrable function, then
Llf (wi, .)pl(dwl) is p2 - measurable,
L2f (., ~z)pz(dwz)is p1 - measurable,
and, moreover,

-+ .
(ii) (Tonelli) If i n the above p1, pz are a-finite and f : R + R zs measur-
able, or pC1,are arbitrary measures but there exists a sequence of p-integrable
simple functions fn : 0 + R+ such that fn 1' f a.e. ( p ) , then again (21) holds
even though both sides may now be infinite.
The detailed arguments for this result are found in most standard texts
[cf.,e.g., Zaaiien (1967), Rao (1987, 2004)l. The other key result is the follow-
ing:
Theorem 12 (i) (Lebesgue Decomposition) Let p and u be two finite
or 0-finite measures on (a,C), a measurable space. Then u can be uniquely
expressed as u = y +u2, where ul vanishes on p-null sets and there is a set
A E C such that p(A) = 0 and u2(AC)= 0. Thus u2 is diflerent from zero only
on a p-null set. (Here u2 is called singular or orthogonal to p and denoted
p 1u2 Note also that ul 1u2 is written.)
(ii) (Radon-Nikodgm Theorem) If p is a 0-finite measure on (R,C)
and u : C + @ is a-additive, and vanishes on p-null sets (denoted u << p),
then there exists a p-unique function (or density) f : 0 + such that
This important result is also proved in the above-stated references.
1.4 Image Measure and the Fundamental Theorem of
Probability
As noted in the beginning of Section 2, the basic probability spaces often in-
volve abstract sets without any topology. However, when a random variable
(or vector) is defined on such (R,C,P),we can associate a distribution fuiic-
tion on the range space which usually has a nice topological structure, as in
Definition 2.2. Evidently the same probability space can generate numerous
image measures by using different measurable mappings, or random variables.
There is a fuiidaineiital relation between the expectation of a fuiictioii of a
random variable on the original space and the integral on its image space.

1.4 Fundamental Theorem of Probability 21
The latter is often more convenient in evaluation of these expectations than
to work on the original abstract spaces.
A comprehensive result on these ideas is contained in
Theorem 1 (i) (Image Measures). Let (a,C,p) be a measure space
with (S,A) as a measurable space, and f : R + S be measurable [i.e.,
-+ .
f f l ( A ) c C]. If u = p o f f l : A + R 2s the image measure, then for
each g : S + R measurable, we have
in the sense that if either side exists, so does the other and equality holds.
(ii) (Fundamental Law of Probability). If (a,C , p ) is a probability
space and X : fl + R is a random variable with distribution function Fx,
and g : R + R is a Bore1 function, Y = g(X), then
in the sense that if either side exists, so does the other with equality holding.
(iii) I n particular, for any p > 0 ,
Proof (i) This very general statement is easily deduced from the definition
of the image measure. Indeed, if g(s) = xA(s),A E A, theii the left side of (I)
becomes
Thus (1) is true, and by the linearity of the integral and the (a-additivity of
v) the same result holds if g = Cr=la i x ~ ,, a simple function with ai > 0. If
g > 0 is measurable, theii there exist simple functioiis 0 <g, 1' g, so that (I)
holds by the Lebesgue monotone convergence theorem. Since any measurable
g = g + g f with g* >0 aiid measurable, the last statement implies the truth
of (1)in general for which g+ or g- is integrable.
(ii) Taking S = R, ( p is probability) we get v(-m,z) = Fx(z),the
distribution function of X . Thus (1)is simply (2). If Y = g(X) : L?+ R, then
clearly Y is a random variable. Replace X by Y, g by identity, aiid S by R in
which establishes all parts of (2).

(iii) This is just a useful application of (ii), stated in a convenient form.
In fact, the first part of (3) being (2), for the last equation consider, with
Y = 1x1,and writing P for p:
Hence (2) becomes
(by integrating by parts and making a change of variable)
= ypP1(1+Fx(-y) - Fx(y))dy (by Thmrein 1). (4)
This is (3), and the proof is complete. In the last equality, Fx(-oo) = 0 and
Fx(+oo) = 1 are substituted.
In the above theorem, it is clear that g can be complex valued since the
stated result applies to g = gl +iga, where gl,g2 are real measurable func-
tions. We use this fact to illustrate the followiiig important concept on Fourier
transform of real random variables. Indeed if X : R + R is any random vari-
able, g : R + C is a Bore1 function, then g o X : R + C is a complex random
variable. If gt(x) = costx +isintx = eitx, then gt : R + C is a bounded
continuous function and gt(X) is a bounded complex random variable for all
t € R.
Thus the followiiig definition is meaningful:
q5x(t) = E(gt(X)) = E(costX) +iE(si1itX),t E R. ( 5 )
The mapping q5x : R + C, defined for each random variable X , is called the
characteristic function of X . It exists without any moment restrictions on X ,
and q5x(0) = 1, Iq5x(t)1 < 1. As an application of the above theorem we have
Proposition 2 The characteristic function 4x of a random variable X
is uniformly continuous on R.
Proof By Theorem 1ii, we have the identity

1.4 Fundamental Theorem of Probability
cjx (t) = IC(eitX) = 1e i t " d ~ x(z).
R
Hence given E > 0, choose L, > 0 such that Fx(L,) - Fx(-LC) > 1- ( ~ 1 4 ) .
If tl < t2, consider, with the elementary properties of Stieltjes integrals,
+ i l > L E l
(eatl' - eit2x)d~x(z)
If 6,= E / ( ~ L , )and It2 - tl < S,, then (6) implies
& E
cjx(t1) - cjx(t2)l < 5+5 = &-
This completes the proof.
This result shows that many properties of random variables on abstract
probability spaces can be studied through their image laws and their char-
acteristic functions with nice continuity properties. We make a deeper study
of this aspect of the subject in Chapter 4. First, it is necessary to introduce
several concepts of probability theory aiid establish its individuality as a sep-
arate discipline with its own innate beauty and elegance. This we do in part
in the next two chapters, and the full story emerges as the subject develops,
with its manifold applications, reaching most areas of scientific significance.
Before closing this chapter we present a few results on uniform integrabil-
ity of sets of random variables. This concept is of importance in applications
where an integrable dominating function is not available to start with. Let us
state the concept.
Definition 3 An arbitrary collection {Xt,t E T) of r.v.s on a probability
space (0,C,P) is said to be uniformly integrable if (i) E(IXtI) Iko < oo,
t E T, aiid (ii) liinp(A),o JA X t ldP = 0 uniformly in t E T.

The earliest occasion on which the reader may have encountered this con-
cept is perhaps in studying real analysis, in the form of the Vitali theorem,
which for finite measure spaces is a generalization of the dominated conver-
gence criterion (Theorem 2.2). Let us recall this result.
Theorem 4 (Vitali) Let XI,X2,... be a sequence of random variables
on a probability space (0,C,P) such that X, + X a.e. (or only in measure).
If {X,, n > 1) is a uniformly integrable set, then we have
Actually the conclusion holds if only E(IX,) < oo,n > 1, and (ii) of Defini-
tion 3 is satisfied for {X,, n > 1).
Note that if X,I 5 Y and Y is integrable, then {X,, n > 1) is trivially
uniformly integrable. The point of the above result is that there may be no
such dominating function Y. Thus it is useful to have a characterization of this
important concept, which is given by the next result. It contains the classical
all-important de La Valle'e Poussin criterion obtained in about 1915. It was
brought to light for probabilistic applicatioiis by Meyer (1966).
Theorem 5 Let K = {Xt,t E T ) be a set of integrable random variables
on a probability space. Then the following conditions are equivalent [ ( i ) e(iii)
is due to de la Vallee Poussin]:
(i) K is uniformly integrable.
(ii)
liin IXtdP = 0 uniformly i n t E T.
a+cc
(7)
(iii)There exists a convex function 4 : R + R', 4(O) = 0, 4(-z) = $(z),
and 4(z)/z/' cc as z /' oo,such that SUPtt~E(4(Xt)) < cc.
Proof (i) +-(ii) By Proposition 3.9 (Markov's inequality) we have
uniformly in t E T. Thus by the second condition of Definition 3, given E > 0,
there is a 6, > 0 such that for any A E C ,P(A) < 6, + JA jaXtldP < E
uniformly in t E T. Let At = [lXt1 > a] and choose a > ko/6, so that
P(At) < 6, by ( 8 ) ,and hence JAL X t l d P < E , whatever t E T is. This is (7),
and (ii) holds.
(ii) + (iii) Here we need to construct explicitly a convex function 4 of the
desired kind. Let 0 5 a, < a,+l /' oo be a sequence of iiuinbers such that by
(7) we have

sup/ X~ d~ < 2-"-17 n > 1.
t [ X t > a , , ]
(9)
The sequence {an,n > 1) is determined by the set K: but not the individual
Xt. Let N(n) = the number of a k in [n,n +I), = 0 if there is no ak in this
set, aiid put ((n) = N(k), with N(0) = 0. Then ((n) /' oo. Define
where [(t)is a constant on [k,k+ 1) and increases only by jumps. Clearly $(.)
is convex, q5-x) = q5(x), 4(0) = 0, $(x)/x > <(k)((x- k)/x) 1' oo, for k < x
aiid x, k /' oo. We claim that this function satisfies the requirements of (iii).
Indeed, let us calculate E(4(Xt)).We have
However,
Summing over n, we get with (9)
Thus (10) and (11)imply sup, E(4(Xt))5 1,aiid (iii) follows.
(11)
(iii) + (i) is a consequence of the Holder inequality for Orlicz spaces since
4(.) can be assumed here to be the so-called Young function. The proof is
similar to the case in which 4(x) = xl",p > 1.By the support line property,
the boundedness of E(q5(Xt))Ik < oo implies that of E(IXt) < kl < oo.
The second condition follows from [q = p/(p - I)]

as P(A) + 0. The general Young function has the same argument. How-
ever, without using the Orlicz space theory, we follow a little longer but an
alternative and more elementary route, by proving (iii) + (ii) + (i) now.
Thus let (iii) be true. Then set i&= sup, E(q5(Xt)) < oo.Given E > 0, let
0 < b, = &/E and choose a = a, such that 1x1 > a, +- q5(x) > lxlb,, which is
possible since q5(x)/x 7cc as x 7cc.Thus w E [IXtl > a,] + b,Xtl(w) <
4(Xt(w??7 and
This clearly implies (ii).
Finally, (ii) +- (i). It is evident that (7) implies that if E = 1,then there
is a1 > 0 such that
So there is a k(> 1+al) < oo such that sup, E ( I X t ) <k < oo.To verify the
second condition of Definition 3, we have for A t C
Given E > 0, choose a = a, > 0, so that by (ii) the first integral is < E
uniformly in t. For this a,, (12) becomes
Since E > 0 is arbitrary this integral is zero, and (i) holds. This completes the
demonstration.
The following is an interesting supplement to the above, called Scheff4's
lemma, it is proved for probability distributions on the line. We present it in
a slightly more general form.
Proposition 6 (Scheff6) Let X , X, > 0 be integrable random variables
on a probability space (0,C,P) and X, + X a.e. (or in measure). Then
E(X,) + E ( X ) as n + cc iff {X,, n > 1) is uniformly integrable, which is
equivalent to saying that lim,,, E(IX, - XI) = 0.

Proof If {X,, n > 1) is uniformly integrable, then E(X,) + E(X) by the
Vitali theorem (Theorem 4) even without positivity. Since {IXn - XI, n > 1)
is again uniformly integrable and X n - XI + 0 a.e. (or in measure), the last
statement follows from the above theorem. Thus it is the converse which is of
interest, and it needs the additional hypothesis.
Thus let X, X, >0 and be integrable. Then the equation
is employed in the argument. Since min(Xn,X ) < X , and min(Xn,X ) + X
a.e., the dominated convergence theorem implies E(min(Xn,X ) ) + E(X) as
n + oo. Hence taking expectations on both sides of (14) and letting n + oo,
we get E(max(X,, X ) ) + E(X) as well. On the other hand,
IXn - X = max(Xn,X ) - min(Xn,X). (15)
Applying the operator E to both sides of (151, and using the preceding facts
on the limits to the right-side expressions, we get E(IX, - XI) + 0. This
implies for each E > 0 that there is an n, such that for all n > n, aiid all
A E C,
It follows that, because each finite set of integrable random variables is always
uniformly integrable,
lim Lx,~P<lim
P(A)-0 P(A)-0
(16)
uniformly in n. Thus, because E > 0 is arbitrary, {X,,n > 1) is uniformly
integrable, as asserted.
In Scheffk's original version, it was assumed that d P = fdp, where p
is a a- finite measure. Thus f is called the density of P relative to p. If
Sn = f .Xn >- 0, then JQgndp = SOXn . fdp = JO XndP is taken as unity,
so that g, itself is a probability density relative to p. In this form {g,, n > 1)
is assumed to satisfy 1 = JQg,dp + gdp = 1 aiid g, + g a.e. (or in
measure). It is clear that the preceding result is another form of this result,
and both are essentially the same statements. These results can be further
generalized. (See, e.g., Problems 7-9.)
One denotes by CP(fl,C,P),or CP, the class of all pth-power integrable
random variables on (fl,C,P).By the Holder and Minkowski inequalities, it
follows that Cp is a vector space, p > 1, over the scalars. Thus f E Cp iff
I f 1Ip = [E(f lp)ll/p < oo, and I . lp is the p-norm, i.e., I f 11, = 0 iff f = 0
a.e., a f +91, < a l I f l p + 1 9 p , a E (or a E c?.When If - 911, = 0,
so that f = g a.e., one identifies the equivalence classes (f g iff f = g
a.e.1. Then the quotient LP = CP/ -is a normed linear space. Moreover, if
{f,,n > 1) c Cp, I fm - f n l p + 0, as n , m + oo then it is not hard to

see that there is a P-unique f E C" such that If - f,llP + 0, so that C" is
complete. The space of equivalence classes (L",I . I ,),>1 is thus a complete
normed linear (or Banach) space, called the Lebesgue space, for 1 < p < cm.
It is customary to call the elements of LP functioiis when a member of its
equivaleiice class is meant. We also follow this custom.
Exercises
1. Let 0 be a nonempty set and A c 0. Then XA, called the indicator
(or "characteristic," in older terminology) function, which is 1 on A, 0 on
f l A = A", is useful in some calculations on set operations. We illustrate its
uses by this problem.
(a) If A, c R,z = 1,2, and AIAAa is the symmetric difference, show that
X A ~ A A ~= I X A ~ - X A ~ .
(b) If A, c R,n = 1,2,..., is a sequence, A = lim sup, A, ( = the
set of points that belong to infinitely many A,, = n g l U,)I, A,) and B =
liminf, A, (= the set of points that belong to all but finitely many A,, =
00
Uk=, nnykA,), show that XA = limsup, XA,,,XB = liminf, A,, and A = B
(this common set is called the limit and denoted limn A,) iff XA = limn XA,,.
(c) (E.Bishop) If A, c 0,n = 1,2,... , define C1 = A1, C2 = ClAA2,. . .,
C, = CnPlAAn. Show that limn C, = C exists [in the sense of (b) above] iff
lim, A, = 0.[Hint:Use the indicator fuiictioiis and the results of (a) and (b).
Verify that Ixc,,+I - XC,, I = XA,,+l .I
(d) If (0,E,P) is a probability space, and {A,, n > 1) c C , suppose
that limn A, exists in the sense of (b). Then show that limn P(A,) exists and
equals P(lim, A,).
2. (a) Let (fl,C,P) be a probability space and {Ai, 1 5 i 5 n} c C,
n > 2. Prove (Poincare's formula) that
n
- x P(A, n Aj) + x P(A, nAj n A,)
l<i<j<n l<i<j<k<n
Thus the first two terms usually underestimate the probability of Uy=lAi.
(b) Let (R,, A,),i = 0,1,.. .,n, be measurable spaces, f : Go + x,"=,Ri
be a mapping. Establish that f is measurable iff each component of f =
(fl,...,f,) is. [Hint:Verify fpl(a(C)) = o(fpl(C)) for a collection C of sets.]

Exercises 29
3. (a) Let {X,, n > 1) be a sequence of random variables on a probability
space (0,C ,P). Show that X, + X , a random variable, in probability iff
(b) If X, Y are any pair of ra,ndomvariables, and Lo is the set of all random
variables, define
and verify that d(., .) is a metric on Lo and that Lo is an algebra of random
variables.
(c) If X N Y denotes X = Y a.e., and Lo = Lo/ N, show that (Lo,d(.,.))
is a complete linear metric space, in the sense that it is a vector space and
each Cauchy sequence for d(.,.) converges in Lo.
(d) Prove that (Lp, I . I Ip) introduced in the last paragraph of Section 1.4
is complete.
4. Consider the probability space of Example 2.6. If f : 0 + R is the ran-
dom variable defined there, verify that E(f) = l i p and a2(f) = (1- p)/p2.
In particular, if p = 1/2, then E(f) = 2, 0 2 ( f )= 2, so that the expected
number of tosses of a fair coin to get the first head is 2 but the variance is
also 2, which is "large."
5. (a) Let X be an r.v. on ( 0 ,C ,P). Prove that E(X) exists iff
for some a > 0, and hence also for all a > 0.
(b) If E ( X ) exists, show that it can be evaluated as
[SeeTheorem 4.1iii.l
6. (a) Let X be a bounded random variable on (R,C ,P).Then for any
E > 0 and any r > 0, verify that E ( I X r ) <E~ +a r P [ X >E], where a is the
bound on X I . In particular, if a = 1, we have E ( x ~ )- E~ 5 P [ X I >E].
(b) Obtain an improved version of the one-sided ceby:ev7s inequality as
follows:
Var(X)
PIX > E ( X ) +&I 5 &, + Var(X).
[Hint:Let Y = X - E(X) and a2 = Var X. Set

1 Background Material and Preliminaries
Then if B = [Y > E], verify that E(f(Y)) > P ( B ) and E(f(Y)) = -.I
7. Let {X,, n > 1) be a sequence of r.v.s on (0,C,P) such that X, + X
a.e., where X is an r.v. If 0 < p < cm and E(IX,") < cm,n > 1, then
{lXnI", n > 1) is uniformly integrable iff E ( X n - XI") + 0 as n + cm.
The same argument applies to a more general situation as follows. Suppose
4 : R+ + R+ is a symmetric function, 4(0) = 0, and either 4 is coiitin-
uous concave increasing function on R+ or is a convex fuiiction satisfying
4(2x) < ccj(x),x > 0, for some 0 < c < cm. If E(4(X,)) < k < cc
and E(cj(Xn)) + E(4(X)),then E(4(Xn - X ) ) + 0 as n + cm and
{4(Xn),n > 1) is uniformly integrable. [Hint: Observe that there is a con-
stant 0 < 2 < cm such that in both the above convex and concave cases,
~ ( Z + Y )I c[4(z)+4(y)l,z,yE R.Hence ~ [ 4 ( X ~ I ) + 4 ( X ) 1 - 4 - X )>0
a.e.1
8. (Doob) Let (0,C,P) be a probability space, C > Fn> Fn+lbe a-
subalgebras, and Xn : R + R be Fn-measurable (hence also measurable for
C). Suppose that v,(A) = JA X,dP, A E F, satisfies for each n > 1,v,(A) >
u,+l (A),A E 3,.Such sequences exist, as we shall see in Chapter 3. (A trivial
example satisfying the above conditions is the following: 3,+1= 3, = C all
n, and Xn > Xn+1 a.e. for all n > 1.) Show that {X,, n > 1) is uniformly
integrable iff (*) limn vn(R) > G O . In the special example of a decreasing
sequence for which (*) holds, deduce that there is a random variable X such
that E(IX, - XI) + 0 as n + GO. [m:If A; = [X,I > A], verify that
P(A;) + 0 as X 1' cc uniformly in n, after noting that sA jaX,dP < ul(0) +
2v,(fl) for all 1 < n Irn. Finally, verify that
9. [This is an advanced problem.] (a) Let (R,C,p) be a measure space,
X, : 0 + R, n > 1, be random variables such that (i) X, + X a.e., n + cm,
and (ii) Xn = Yn + Zn,n > 1, where the random variables Yn,Zn satisfy
(a)&+ Z a.e. and sQZndp + SOZdp E R, n + cm, (P)limn,, SAYndp
exists, A E C, and (iii)
liin liin
1..Y,dp = 0
m+cx n i o o
for any A, J, 0,A, E C. Then limn,, SOXndp = JQXdp. If p ( 0 ) < w,
(iii) may be omitted here. [Hints:If X : A H limn,, JA Yndp, then X : C + R
is additive aiid vanishes on p-null sets. (P) aiid (iii) + X is also 0-additive,
so that X(A) = SAY1dpfor a p-unique r.v. Y1,since the Y,, being integrable,

Exercises 31
vanish outside a fixed a-finite set, and p may thus be assumed a-finite. It may
be noted that (iii) is a consequence of (p)if p(Q) < oo.Next, (p)also implies
so that it is "weakly coiivergent" to Y'. Let F E C , p ( F ) < oo.Then by the
Vitali-Hahn-Saks theorem (cf. Dunford-Schwartz, III.7.2),
uniformly in n. Also YnxF = (X, - Zn)xF -- ( X - Z)xF a.e. Let Y = X - Z.
These two imply SOIY, - Y x ~ d p+ 0. Deduce that Y = Y' a.e., and then
YnxF + YxF = Y'xF in measure on each F E C, with p(F) < 00. Hence
by another theorem in Dunford-Schwartz (III.8.12), Jn IY, - Y ldp + 0. Thus
using (a),this implies the result. The difficulty is that the hypothesis is weaker
than the dominated or Vitali convergence theorems, and the X,,n > 1, are
not uniformly integrable. The result can be extended if the X, are vector
valued.]
(b) The following example shows how the hypotheses of the above part
caii be specialized. Let X,, g,, h, be random variables such that (i)X, + X
a . e . , g , + g a . e . , a n d h , + h a . e . a s n - - o o , ( i i ) g , < X , < h , , n > l ,
and (iii) SOgndp -- sQgdp E R, SOh,dp -- SOhdp E R , n -- oo.Then
limn,, SOX,dp = SOXdp E R. [Let Y, = X, - g,, 2, = g,. Then (i) and
(ii a ) of (a) hold.
Now 0 < Y, < h, - g, and Sn(h, - g,)dp + sQ(h- g)dp by hypoth-
esis. Since h, - g, > 0 and we may assume that these are finite after some
n, let us take n = 1 for convenience. As shown in Proposition 4.6, this im-
plies the uniform integrability of {h, - g,, n > I}, and (ii p) and (iii) will
hold, since SQ(h, - g,) - (h - g ) d p + 0 is then true. Note that no order
relation of the range is involved in (a), while this is crucial in the present
formulation.] Observe that if g, < 0 < h,, we may take g, = -h,, replacing
h, by max(h,, -g,) if necessary, so that IX, < h, and SQh,dp -- SQhdp
implies the h, sequence, and hence the X, sequence, is uniformly integrable
as in Proposition 4.6. The result of (b) (proved differently) is due to John
W. Pratt. The problem is presented here to show how uniform integrability
caii appear in different forms. The latter are neither more natural nor elegant
than the ones usually given.
10. This is a slight extension of the Fubini-Stone theorem. Let (ai,Ci),
i = 1,2, be two measurable spaces and R = R1 x R2, C = C1 @ C2 their
products. Let P(.,.): fll x C2 + R+ be such that P(wl,.) : C2 + R+ is
a probability, wl E Ql and P(.,A) : Ql -- R+ be a El-measurable function
for each A E C2.Prove that the mapping Q : (A,B) H SAP(w1,B)p(dwl)
for any probability p : C1 -- R+ uniquely defines a probability measure on

(R,C),sometimes called a mixture relative to p, and if X : R + R+ is any
random variable, then the mapping wl H Ju2X(w1,w ~ ) P ( w ~ ,dw2) is Q(.,R2)-
measurable and we have the equation
[If P(w1:) is independent of wl then this reduces to Theorem 3.11(ii) and the
proof is a modification of that result.]
11. (Skorokhod) For a pair of mixtures as in the preceding problem, the
Radon-Nikod9m theorem can be extended; this is of interest in probabilistic
and other applications. Let (Qi,Ci),i = 1,2, be two measurable spaces and
Pi : fll xC2 + R+,pi : C1+ R', aiid Qi : (A,B ) H JA Pi(wl, B)pi(dwl),i =
1,2, be defined as in the above problem satisfying the same conditions there.
Then Q1 << Qa on (fl,C), the product measurable space iff p1 << pa aiid
PI(wl, .) << Pa (wl,.) for a.a.(wl). When the hypothesis holds (i.e., Q1 << Qz),
deduce that
[Hints:If Q1 << Qa, then observe that, by considering the marginal measures
Q,(., Ra), we also have p1 << pa. Next note that for a.a.(wl),
is a inoliotolie class and an algebra. Deduce that PI(wl, .) << Pa(w1, .),
a.a,(wl). The converse is simpler, and then the above formula follows. Only a
careful application of the "chain rule" is needed. Here the proof can be simpli-
fied and the application of monotone class theorem avoided if C2is assumed
countably generated as was originally done.]

Chapter 2
Independence and Strong Convergence
This chapter is devoted to the fundamental concept of independence and to
several results based on it, including the Kolmogorov strong laws and his three
series theorem. Some applications to einpiric distributions, densities, queueing
sequences and random walk are also given. A number of important results,
included in the problems section, indicate the profound impact of the concept
of independence on the subject. All these facts provide deep motivation for
further study and development of probability theory.
2.1 Independence
If A and B are two events of a probability space (R,C,P),it is natural to
say that A is independent of B whenever the occurrence or nonoccurrence of
A has no influence on the occurrence or nonoccurrence of B. Consequently
the uncertainty about joint occurrence of both A and B must be higher than
either of the individual events. This means that the probability of a joint
occurrence of A and B should be "much smaller" than either of the individual
probabilities. This intuitive feeling can be formalized mathematically by the
equation
for a pair of events A, B. How should intuition translate for three events
A, B,C if every pair among them is independent? The following ancient ex-
ample, due to S. Bernstein, shows that, for a satisfactory mathematical ab-
straction, more care is necessary. Thus if R = {dl,w2,w3,w4), C = P(fl),the
power set, let each point carry the same weight, so that
Let A = {dl,w2), B = {wl,w3), and C = {w4,wl). Then clearly P ( A nB) =
P(A)P(B)= i,P ( B n C ) = P ( B ) P ( C )= i,and P(CnA) = P(C)P(A) = i .

34 2 Independence and Strong Convergence
But P(A n B n C) = i,and P(A)P(B)P(C) = i.Thus A, B , C are not
independent. Also A, ( B nC) are not independent, and similarly B , ( C nA)
and C, (A nB ) are not independent.
These coiisideratioiis lead us to introduce the precise coiicept of mutual
iiidependeiice of a collection of events by not pairwise but by systems of equa-
tions so that the above anomaly cannot occur.
Definition 1 Let (R,C,P)be a probability space and {Ai,i E I ) c P(0)
be a family of events. They are said to be pairwise independent if for each
distinct i,j in I we have P(Ai n Aj) = P(A,)P(Aj). If A,,, ...,A,", are n
(distinct) events, n > 2, then they are mutually independent if
holds simultaneously for each rn = 2,3, ...,n. The whole class {Ai,i E I ) is
said to be mutually independent if each finite subcollectioii is mutually inde-
pendent in the above sense, i.e., equations (1) hold for each n > 2. Similarly if
{Ai,i E I) is a collection of families of events from C then they are mutually
independent if for each n, Ai, E Ai, we have the set of equations (1)holding
for Ai,, k = 1,...,m, 1< m 5 n. Thus if Ai E Ai then {Ai,i E I ) is a mutually
independent family. [Followingcustom, we usually omit the word "mutually" .]
It is clear that the (mutual) independence concept is given by a system
of equations (1)which can be arbitrarily large depending on the richness of
C. Indeed for each n events, (1)is a set of 2" - n - 1 equations, whereas the
n
pairwise case needs only ( ) equations. Similarly %-wisen iiidependeiice has
2
n
(rn)
equations, and it does not imply other independences if 2 < m < n is
a fixed number m. It is the strength of the (mutual) concept that allows all
n > 2. This is the mathematical abstraction of the intuitive feeling of inde-
pendence that experience has shown to be the best possible one. It seems to
give a satisfactory approxiination to the heuristic idea of iiidepeiideiice in the
physical world. In addition, this mathematical formulation has been found
successful in applications to such areas as number theory, and Fourier analy-
sis. The notion of independence is fundamental to probability theory
and distinguishes it from measure theory. The coiicept translates itself
to random variables in the following form.
Definition 2 Let (0,C ,P) be a probability space and {Xi,i E I ) be
abstract random variables on R into a measurable space (S,A). Then they
are said to be mutually independent if the class {B, i E I ) of 0-algebras in C
is mutually independent in the sense of Definition 1, where Bi = x i 1 ( A ) ,the
a-algebra generated by X,, i E I. Pairwise independence is defined similarly.

2.1 Independence 35
Taking S = R(orRn) and A as its Bore1 a-algebra, one gets the corresponding
concept for real (or vector) random families.
It is perhaps appropriate at this place to observe that inany such (inde-
pendent) families of events or raiidoin variables on an (fl, C,P)need not exist
if (0,C) is not rich enough. Since 0 and 0 are clearly independent of each
event A E R, the set of equations (1) is non vacuous. Consider the trivial
example R = (0, 11,C = P(0)= (0, {0), {I), R), P({O))= p = 1- P({1)),
0 <p < 1.Then, omitting the 0,fl, there are no other independent events, and
if X, : fl + R,i = l , 2 , defined as X1(0) = 1= X2(1)aiid XI(1) = 2 = X2(0),
then XI,X2 are distinct raiidoin variables, but they are not independent. Any
other random variables defined on R can be obtained as functions of these two,
and it is easily seen that there are no nonconstant independent random vari-
ables on this R. Thus (R,C,P) is not rich enough to support nontrivial (i.e.,
nonconstant) independent raiidoin variables. We show later that a probability
space can be enlarged to have more sets, so that one can always assume the
existence of enough independent families of events or random variables. We
now consider some of the profound consequences of this mathematical formal-
ization of the natural concept of mutual independence. It may be noted that
the latter is also termed statistical (stochastic or probabilistic) independence
to contrast it with other concepts such as linear independence aiid functional
independence. [The functions XI,X2 in the above illustration are linearly in-
dependent but not mutually (or statistically) independent! See also Problem
1.1
To understand the implications of equations (I), we coiisider different
forms (or consequences) of Defiiiitioiis 1 aiid 2. First note that if {A,, i E
I} c C is a class of mutually iiidepeiideiit events, then it is evident that
{a(A,),i E I) is an independent class. However, the same cannot be said
if the singleton Ai is replaced by a bigger family Gi = {A:, j E Ji) c C,
where each Ji has at least two elements, i E I,as simple examples show. Thus
{o(Gi), i E I}need not be independent. On the other hand, we can make the
following statements.
Theorem 3 (a) Let {A, Bi, i E I) be classes of events from (R,C,P)
such that they are all mutually independent in the sense of Definition 1. If
each Bi, i E I , is a n-class, then for any subset J of I , the generated a-algebra
a(B,, i E J) and A are independent of each other.
(b) Definition 2 with S = R reduces to the statement that for each fi-
nite subset il, .. .,in of I and random variables x,,,. .. ,Xi,,, the collection
of events {[Xi, < X I , . .. ,Xi,, < xn],xj E R,j = 1,...,n,n > 1) forms an
independent class.
Proof (a) Let B = a(B,, i E J),J c I . If A E A, Bj E Bj1

36 2 Independence and Strong Convergence
are independeiit by hypothesis, i.e., (1)holds. We need to show that
If B is of the form B1 n . . . n B,, where Bi E Bi, i E J, then (2) holds
by (1). Let D be the collection of all sets B which are finite intersections of
sets each belonging to a Bj,j E J. Since each Bj is a T-class, it follows that
2) is also a T-class, and by the preceding observation, (2) holds for A aiid
Dl so that they are independent. Also it is clear that Bj c D,j E J. Thus
a(Bj,j E J) c a(D). We establish (2) for A and a(D) to complete the proof
of this part, and it involves another idea often used in the subject in similar
arguments.
Define a class G as follows:
B = {B E a(D) : P ( A nB ) = P(A)P(B),A E A). (3)
Evidently D c G. Also R E G, and if B1, Bz E with B1 nBz = 0,then
P((B1 u Bz) nA) = P(B1 nA) +P(B2nA) (since the B, nA are disjoint)
= P(B1)P(A) +P(B2)P(A) [ by definition of (3)]
Hence B1 U B2 E G. Similarly if B1 > Bz, B, E G, then
P((B1 - B2)nA) = P(Bl nA) - P(B2nA) (since El nA 3 B2 nA)
Thus El - B2 E G. Finally, if B, E G, B, c we can show, from the
fact that P is 0-additive, that limn En = U,>lBn E G. Hence G is a A-class.
Since B > D, by Proposition 1.2.8b, G > ;(Dl. But (3) implies G and A
are independent. Thus A and a(D) are independent also, as asserted. Note
that since J c I is an arbitrary subset, we need the full hypothesis that
{A,B,, i E I) is a mutually independeiit collection, aiid not a mere two-by-
two independence.
(b) It is clear that Definition 2 implies the statement here. Conversely, let
B1 be the collection of sets {[X,, < XI, x E R), and

2.1 Independence
It is evident that B1 and B2 are T-classes. Indeed,
and similarly for B2. Hence by (a), B1 and a(&) are independent. Since
B1 is a T-class, we also get, by (a) again, that o(B1)and 0(B2)are inde-
pendent. But a(&) = x ; ' ( ~ ) [ =o(Xi,)],and 0(B2)= o ( u ~ = ~ x ~ , ' ( R ) ) [ =
a(Xi,, ...,X,,,)],where R is the Borel a-algebra of R.
Heiice if A1 c a(Xi,),Aj c xcl(R)(=o ( X z J ) )c o(Bz),then Al and
{ A z ,...,A,} are independent. Thus
p(Al n ...nA,) = P ( A l ). P(A2n ... nA,). (4)
Next consider Xi, and (Xi,,. .. ,Xi,,).The above argument can be applied to
get
P(A2n ... nA,) = P(A2). P(A, n ... nA,).
Coiitinuiiig this finitely inany times aiid substituting in (4), we get (1).Heiice
Definition 2 holds. This completes the proof.
The above result says that we can obtain (1)for random variables if we
assume the apparently weaker coiiditioii in part (b) of the above theorem.
This is particularly useful in computations. Let us record some consequences.
Corollary 4 Let {Bi,i E I ) be an arbitrary collection of mutually inde-
pendent T-classes in (R,C ,P ) , and Ji c I , Jl nJ2 = 0.If
then B1 and Ga are independent. The same is true if ';fi= n(Bj1j c J,), i =
1,2, are the generated T-classes.
If X , Y are independent random variables, f , g are any pair of real Borel
fuiictioiis on R, then f o X , g o Y are also independent random variables.
This is because ( f o X)-'(R) = X p l ( f p ' ( R ) )c X p l ( R ) , and similarly
( g o y ) ' (R)c Y p l(R);aiid X p l (R),Y p l(R)are independent 0-subalgebras
of C . The same argument leads to the following:
Corollary 5 If X I ,.. .,X, are mutually independent random variables
on (a,C ,P ) and f : R h R, g : R n p b R are any Borel functions, then
the random variables f ( X I ,...,X k ) , g(Xk+',. ..,X,) are independent; and
a ( X 1 , ...,X k ) , a(Xk+',.. .,X,) are independent a-algebras, for any k > 1.
Another consequence relates to distribution functions and expectations
when the latter exist.

Rao probability theory with applications

Rao probability theory with applications

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Rao probability theory with applications

Similar to Rao probability theory with applications (20)

Recently uploaded

Recently uploaded (20)

Rao probability theory with applications