ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE
School of Architecture, Civil and Environmental Engineering
Semester Project in Civil Engineering
Enhancing the Serial Estimation of Discrete
Choice Models Sequences
by
Youssef Kitane
Under the direction of Prof. Michel Bierlaire and supervision of Nicola
Ortelli and Gael Lederrey in the Transport and Mobility Laboratory
Lausanne, June 2020
1 Introduction
Discrete Choice Models (DCMs) have played an essential role in transportation modeling
for the last 25 years [1]. Discrete choice modeling is a field designed to capture in detail the
underlying behavioral mechanisms at the foundation of the decision-making process that
drives consumers [2]. Because they must be behaviorally realistic while properly fitting
the data, appropriate utility specifications for discrete choice models are hard to develop.
In particular, modelers usually start by including a number of variables that are seen as
"essential" in the specification; these originate from their domain knowledge or intuition.
Then, small changes are tested sequentially so as to improve the goodness of fit of the
model while ensuring its behavioral realism. As a result, many model specifications are
usually tested before the modeler is satisfied with the outcome, which leads to extensive
computation time, since each model has to be estimated separately. A faster estimation
procedure would allow researchers to test many more specifications in the same amount of time.
In this project, the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm
is used to estimate the parameters of each DCM. Three techniques are implemented to
accelerate the process of estimating a sequence of DCMs:
• Standardization (ST) of the variables: the goal is to bring the values of the numeric
columns in the dataset to a common scale, without distorting differences in the ranges
of values.
• Warm Start (WS): this technique uses the knowledge acquired from the previous model
to initialize the values of the parameters for the estimation of the next model.
• Early Stopping (ES): this consists in stopping the estimation of a model earlier than
expected, based on how promising the improvement in log likelihood looks in the last
iterations of the optimization algorithm.
The next section reviews existing methods that speed up an optimization process. Then,
in Section 3, the three techniques are presented in detail for a sequence of DCMs.
Section 4 presents the data considered in this project, as well as the sequences of models
that we use to measure the effectiveness of the three techniques. Section 5 gathers the
results obtained by the implemented methods. The last section summarizes the findings of
this project and highlights possible improvements and directions for future research.
2 Literature Review
In large-scale convex optimization, first-order methods are the methods of choice due to
their cheap iteration cost [3]. While second-order methods, such as Newton's method, make
use of curvature information, the cost of computing the Hessian can become prohibitive.
Quasi-Newton methods are thus a good compromise between curvature information and low
computation time: they use an approximation of the Hessian instead of its exact
computation. The BFGS algorithm, named after its inventors Broyden, Fletcher, Goldfarb
and Shanno [4], is one of the most well-known quasi-Newton methods. A new method for
solving linear systems is proposed in [5]. The algorithm is specialized to invert
positive definite matrices in such a way that all iterates (approximate solutions)
generated by the algorithm are positive definite matrices themselves, which opens the way
for many applications in the field of optimization. The accelerated matrix inversion
algorithm was then incorporated into an optimization framework to develop both
accelerated stochastic and deterministic BFGS which, to the best of the authors'
knowledge, are the first accelerated quasi-Newton updates. Under a careful choice of the
parameters of the method, and depending on the problem structure and conditioning,
acceleration can result in significant speedups both for the matrix inversion problem and
for the stochastic BFGS algorithm. It is confirmed experimentally that these accelerated
methods can lead to speedups compared to the classical BFGS algorithm, but no convergence
analysis has yet been provided.
The increase in the size of choice modeling datasets in recent years has led to a growing
research interest in accelerating the estimation of DCMs. Researchers have used Machine
Learning (ML) techniques to speed up the estimation of a single DCM [6]. This is achieved
by proposing new efficient stochastic optimization algorithms and extensively testing
them alongside existing approaches. These algorithms are developed based on three main
contributions: the use of a stochastic Hessian, the modification of the batch size, and a
change of optimization algorithm depending on the batch size. This paper shows that the
use of a second-order method and a small batch size is a good starting point for DCM
estimation. It also shows that BFGS is an algorithm that works particularly well once
said starting point has been found.
The problem of initializing the parameters of a model is central in ML. One particularly
common scenario is where an ML algorithm must be constantly updated with new data. This
situation generally occurs in finance, online advertising, recommendation systems, fraud
detection, and many other domains where machine learning systems are used for prediction
and decision making in the real world [7]. When new data arrive, the model needs to be
updated so that it can be as accurate as possible. While the majority of existing methods
start the configuration process of an algorithm from scratch by randomly initializing the
parameters, it is possible to exploit previously learned information in order to "warm
start" the configuration on new data.
With most common optimization algorithms, and more precisely in ML, the modeler often
decides to stop the optimization procedure before reaching the required tolerance in the
solution [8]. Stopping an optimization process early is a trick used to control the
generalization performance of the model during the training phase and to avoid
over-fitting in the test phase. In discrete choice modeling, the main objective is not to
have the highest accuracy but parameters that are behaviorally realistic.
3 Methodology
This section briefly introduces the principles underlying the BFGS algorithm used to
estimate a sequence of DCMs, before presenting the techniques used to speed up the
estimation of such a sequence.
As a reminder, the iterates {x_j} of a line search optimization method following a descent
direction d_j and a step size α_j are defined as follows:

x_{j+1} = x_j + α_j d_j    (1)

where the direction of descent is obtained by preconditioning the gradient and is defined
as:

d_j = −D_j^{−1} ∇f(x_j)    (2)

assuming that the matrix D_j at the iterate x_j is positive semi-definite.
For quasi-Newton methods, D_j is an approximation of the Hessian. A slightly different
version of BFGS consists in approximating the inverse of the Hessian directly. The
BFGS^{−1} algorithm uses the following approximation [9]:

D_{j+1}^{−1} = D_j^{−1} + (s_j^T y_j + y_j^T D_j^{−1} y_j)(s_j s_j^T) / (s_j^T y_j)^2 − (D_j^{−1} y_j s_j^T + s_j y_j^T D_j^{−1}) / (s_j^T y_j)    (3)

where s_j = x_{j+1} − x_j and y_j = ∇f(x_{j+1}) − ∇f(x_j).
The step size is calculated with an inexact line search method, based on the two Wolfe
conditions (Wolfe, 1969, 1971). The first condition, also known as the Armijo rule,
guarantees that the step gives a sufficient decrease in the objective function. The
second condition, known as the curvature condition, prevents the step length from being
too short.
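As an illustration, the inverse-Hessian update of Eq. (3) can be sketched in a few lines of NumPy. This is a minimal didactic version, not the estimation code used in the project; the function name and the guard on the curvature term are ours.

```python
import numpy as np

def bfgs_inverse_update(D_inv, s, y):
    """One BFGS^-1 step: update the inverse-Hessian approximation (Eq. 3).

    s = x_{j+1} - x_j and y = grad f(x_{j+1}) - grad f(x_j).
    """
    sy = float(s @ y)                       # curvature term s_j^T y_j
    if sy <= 0.0:                           # Wolfe conditions normally guarantee sy > 0;
        return D_inv                        # skip the update otherwise
    Dy = D_inv @ y                          # D_j^{-1} y_j
    term1 = (sy + float(y @ Dy)) * np.outer(s, s) / sy**2
    term2 = (np.outer(Dy, s) + np.outer(s, Dy)) / sy
    return D_inv + term1 - term2
```

By construction, the updated matrix satisfies the secant equation D_{j+1}^{−1} y_j = s_j, which is easy to check numerically.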
3.1 Standardization
The concept of standardization is relevant when continuous independent variables are
measured at different scales. Standardization is a technique often applied as part of
data preparation for ML: the goal is to bring the values of the numeric columns in the
dataset to a common scale, without distorting differences in the ranges of values. More
formally, let us suppose that a variable x takes values from the set S = {x_1, x_2, ..., x_n}.
The standardization of one value x_i in S is applied as follows:

x_i ← (x_i − x̄) / σ    (4)

where x̄ is the mean of the values in S and σ is the corresponding standard deviation.
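Eq. (4) translates directly into code. The sketch below standardizes one column, assuming a plain NumPy array as input; the guard for constant columns is our addition.

```python
import numpy as np

def standardize(column):
    """Rescale a numeric column to zero mean and unit variance (Eq. 4)."""
    x = np.asarray(column, dtype=float)
    sigma = x.std()
    if sigma == 0.0:                 # constant column: only center it
        return x - x.mean()
    return (x - x.mean()) / sigma
```

For example, standardizing travel times of [10, 20, 30] minutes yields values centered at 0 with unit standard deviation, so travel time and cost end up on a comparable scale.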
3.2 Warm Start
A method commonly used in the field of ML is the warm start, which consists in
initializing a set of parameters with non-arbitrary values. In our case, we initialize
the parameters of a model with the parameters estimated by the BFGS algorithm for the
previous model.
Formally, we define the set of parameters in model j as x_j ∈ R^{N_j}, where N_j
corresponds to the number of parameters in model j. The set of parameters for the
following model is defined similarly, i.e. x_{j+1} ∈ R^{N_{j+1}}, where N_{j+1}
corresponds to the number of parameters of that model. To generate the initial parameters
of model j+1, i.e. x^0_{j+1}, we use the optimized parameters of the previous model, i.e.
x*_j. We thus define the initialization of x^0_{j+1} for each index i ∈ {1, ..., N_{j+1}}
such that:

x^0_{j+1,i} = x*_{j,i} if i ∈ {1, ..., N_j}, and 0 otherwise.
In the case where a variable is transformed using a non-linear function, such as a
Box-Cox transformation, we propose to use a slightly updated version of the warm start.
First, we define a boolean array B ∈ {0, 1}^{N_{j+1}} such that:

B_i = True if x_{j+1,i} has been transformed non-linearly, and False otherwise.

The initialization then becomes:

x^0_{j+1,i} = x*_{j,i} if i ∈ {1, ..., N_j} and B_i is False, and 0 otherwise.

This allows us to reset the value of a parameter associated with a non-linear
transformation to 0 instead of using the previously optimized value.
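The two initialization rules above can be sketched as follows. This is a schematic version under the assumption that parameters shared with model j occupy the first N_j positions; `transformed` plays the role of the boolean array B.

```python
import numpy as np

def warm_start_parameters(x_prev_opt, n_next, transformed=()):
    """Initialize x^0_{j+1} from the optimum x*_j of the previous model.

    `transformed` lists the indices i for which B_i is True, i.e. the
    parameters whose variable underwent a non-linear (e.g. Box-Cox) change.
    """
    x0 = np.zeros(n_next)                   # new parameters start at 0
    n_shared = min(len(x_prev_opt), n_next)
    x0[:n_shared] = x_prev_opt[:n_shared]   # reuse the optimized values
    for i in transformed:                   # reset non-linearly transformed ones
        x0[i] = 0.0
    return x0
```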
The same procedure is used for the initialization of the Hessian between model j and the
following model j+1. We thus define the initialization of H^0_{j+1} for each combination
of indices i, k ∈ {1, ..., N_{j+1}} such that:

H^0_{j+1,(i,k)} = H*_{j,(i,k)} if i, k ∈ {1, ..., N_j}; 1 if i = k; and 0 otherwise.
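The Hessian initialization follows the same pattern: the block shared with model j is copied, and the rows and columns of new parameters are taken from the identity. A minimal sketch, under the same positional-indexing assumption as above:

```python
import numpy as np

def warm_start_hessian(H_prev_opt, n_next):
    """Initialize H^0_{j+1}: copy the shared block of H*_j, identity elsewhere."""
    H0 = np.eye(n_next)                     # 1 on the diagonal, 0 elsewhere
    n_shared = min(H_prev_opt.shape[0], n_next)
    H0[:n_shared, :n_shared] = H_prev_opt[:n_shared, :n_shared]
    return H0
```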
3.3 Early Stopping
The early stopping method consists in stopping the estimation process before convergence
is achieved. Because the objective is to select the best model among a sequence of DCMs,
the log likelihood evaluation f(x_i) obtained at an epoch is compared to the lowest log
likelihood LL_best of all the previous models. For a given model, if at a certain epoch
f(x_i) is lower than LL_best, the optimization is pursued until the end, and the new best
log likelihood becomes the value LL_opt estimated by the BFGS algorithm. If at a certain
epoch f(x_i) is higher than LL_best, the optimization process is stopped based on a
criterion that estimates the relative evolution of the function in order to detect a
plateau, i.e. a region where the function is no longer experiencing a significant
improvement. Three evaluations of the function are considered in order to be confident in
the convergence of the log likelihood.
Let us suppose that we have access to the last three evaluations of the log likelihood,
f(x_i), f(x_{i−1}) and f(x_{i−2}), during the estimation of one model. It is possible to
assess this stagnation by evaluating the two following relative changes and comparing
them to a predefined threshold ε:

|f(x_{i−1}) − f(x_i)| / |f(x_i)| < ε    and    |f(x_{i−2}) − f(x_{i−1})| / |f(x_{i−1})| < ε
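A sketch of the stopping test, assuming the negative log likelihood is being minimized (so "lower is better") and using the relative change between consecutive evaluations as the plateau criterion; function and variable names are ours.

```python
def should_stop_early(history, best_ll, eps=2e-5):
    """Decide whether to stop the current estimation.

    `history` holds the successive log likelihood evaluations f(x_i) of the
    current model; `best_ll` is the lowest value reached by any previous
    model in the sequence. The model is abandoned only if it is worse than
    `best_ll` AND the last three evaluations form a plateau.
    """
    if len(history) < 3:
        return False
    f2, f1, f0 = history[-3], history[-2], history[-1]
    if f0 <= best_ll:                 # still promising: estimate to the end
        return False
    rel = lambda a, b: abs(a - b) / max(abs(b), 1e-12)
    return rel(f1, f0) < eps and rel(f2, f1) < eps
```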
Even though the goal of early stopping is to reduce the estimation time of DCMs, it is
important to ensure that no important difference arises between the solution obtained by
applying early stopping to the BFGS algorithm and the one obtained with the standard
BFGS. For example, Figure 1 shows the value of the log likelihood during the estimation
of a model. As we can see in this example, there is a stagnation in the middle of the
estimation. We do not want to stop early at this moment, since the estimation is far from
being finished. We thus have to be careful with the threshold and conduct a sensitivity
analysis on this parameter.
Figure 1: Example of the difference between a possible stagnation of the log likelihood
and the real convergence
4 Case Study
4.1 Dataset
The Swissmetro dataset (Bierlaire et al., 2001) corresponds to survey data collected on
trips between St. Gallen and Geneva, Switzerland, during March 1998. It was used to study
the market penetration of the Swissmetro, a revolutionary mag-lev underground system.
Three alternatives (train, car and Swissmetro) were generated for each of the 1192
respondents. A sample of 10'728 observations was obtained by generating nine types of
situations. Some of the pre-selected attributes of the alternatives are categorical
(travel card ownership, gender, type of luggage, etc.) while others are continuous
(travel time, cost and headway).
4.2 Sequence of Discrete Choice Models
For the purpose of this project, two sequences of one hundred DCMs, respectively denoted
by S1 and S2, are considered. Each sequence starts with a given choice model; then, a
random perturbation is applied at each step. These small modifications correspond to the
typical elementary perturbations that are used to move from one model to another. Six
types of modifications are considered:
• Adding a non-selected variable to the utility of an alternative
• Removing a variable from the utility of an alternative
• Incrementing the Box-Cox parameter of a given variable
• Decrementing the Box-Cox parameter of a given variable
• Interacting a variable with a socioeconomic variable
• Deactivating the interaction of the considered variable with a socioeconomic variable
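The six elementary perturbations can be sketched on a toy specification; here a model is a dict mapping each selected variable to its Box-Cox parameter and interaction flag. The representation and the operator names are hypothetical, for illustration only; the real operators act on a full utility specification.

```python
import random

OPS = ("add", "remove", "boxcox_up", "boxcox_down", "interact", "uninteract")

def perturb(spec, candidates, rng):
    """Apply one random elementary modification to a toy model specification."""
    op = rng.choice(OPS)
    unused = sorted(set(candidates) - spec.keys())
    if op == "add" and unused:
        spec[rng.choice(unused)] = {"boxcox": 0, "interaction": False}
    elif spec:  # the remaining operators need at least one selected variable
        v = rng.choice(sorted(spec))
        if op == "remove":
            del spec[v]
        elif op == "boxcox_up":
            spec[v]["boxcox"] += 1
        elif op == "boxcox_down":
            spec[v]["boxcox"] -= 1
        elif op == "interact":
            spec[v]["interaction"] = True
        elif op == "uninteract":
            spec[v]["interaction"] = False
    return spec
```

Starting from a given model and applying such a perturbation one hundred times yields a sequence in the spirit of S1 or S2.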
The first sequence S1 begins with an alternative-specific-constants model and its
complexity increases along the sequence, while the second sequence S2 starts with a
random model and its complexity remains approximately constant along the hundred models.
The number of parameters for each sequence of DCMs is shown in Figure 2.
Figure 2: Number of parameters for the two sequences S1 and S2.
5 Results
In order to avoid misunderstandings, abbreviations are given to the different methods.
The base method estimates the parameters without applying any warm start and is denoted
by Base. The warm start of the parameters is denoted by WSx, the warm start that resets
only the non-linearly transformed variables by WSbc, the warm start of the Hessian by
WSh, and the combination of the warm starts of the Hessian and the parameters by WS.
A benchmark of ten estimations of each of the methods Base, WS, WSx, WSh, WSbc and ST is
conducted for the sequences S1 and S2. Tables 1 and 2 present a summary of the statistics
for these methods: the lowest, highest and mean estimation times, as well as the standard
deviation among the ten estimations, are reported. The speedup corresponds to the ratio
between the mean time of the Base method and the mean time of each method.
Among the warm start variants WS, WSx, WSh and WSbc, WS is the most efficient and
reliable: it speeds up the estimation by a factor of 3.84 for S1 and 4.5 for S2. The
standard deviation observed for WS is also the lowest of all methods, with values of 0.18
and 0.35 for S1 and S2 respectively. WSh is also efficient, as it accelerates the
estimation by factors of 2.2 and 2.56 for S1 and S2 respectively. WSx does not perform as
well as WS and WSh; it reduces the estimation time by 19% and 15% for S1 and S2
respectively. WSbc does not reduce the estimation time compared to the Base method.
Finally, the ST method is effective: for the first sequence S1, the mean estimation time
is reduced from 229.51 s to 201.72 s, and for the sequence S2, a reduction of 11% is
obtained. The standardization of the variables is not as effective as the WS method, but
it yields interesting results and should be applied beforehand to every sequence of DCMs
whose variables present differences in their ranges of values.
Table 1: Summary of statistics for 10 estimations by method for the sequence S1
Statistics Base WS WSx WSh WSbc ST
Minimum 224.90 60.86 187.74 104.99 225.80 198.10
Maximum 231.32 61.45 189.49 106.07 231.85 203.28
Mean 229.51 61.07 188.73 105.44 229.80 201.72
Standard Deviation 1.78 0.18 0.5 0.29 1.65 1.35
Speedup 1.0 3.84 1.21 2.2 0.99 1.15
Table 2: Summary of statistics for 10 estimations by method for the sequence S2
Statistics Base WS WSx WSh WSbc ST
Minimum 534.58 119.18 459.26 210.35 529.25 478.36
Maximum 539.44 120.35 461.36 212.46 538.44 485.93
Mean 536.23 119.74 460.34 211.27 535.84 480.96
Standard Deviation 1.36 0.35 0.72 0.60 2.47 2.16
Speedup 1.0 4.5 1.17 2.56 1.01 1.12
A sensitivity analysis is conducted for the ES method. A sequence of 20 thresholds
ranging from 10^{-7} to 5·10^{-4} is used in order to test the performance of the ES
method compared to the Base method. Figures 3 and 4 present the estimation time of the ES
method for the 20 thresholds, relative to the Base method, for S1 and S2 respectively.
The black line corresponds to the mean time observed for the Base method over 10
estimations, and the grey lines represent a 95% confidence interval around this mean. A
box plot with a 95% confidence interval is plotted for every threshold. For the sequence
S1, a restrictive threshold of 10^{-7} leads to a speedup of 3%, while for S2 a speedup
of approximately 4% is observed. It is possible to obtain a better speedup by increasing
the value of the threshold: a reduction of 35% and 15% of the optimization time is
obtained for S1 and S2 respectively when a less restrictive threshold of 0.0005 is used.
Figure 3: Sensitivity analysis of the threshold parameter for S1
Figure 4: Sensitivity analysis of the threshold parameter for S2
In order to select the best threshold, the improvement of the estimation time is not the
only criterion that should be taken into account. Even though the number of models
stopped early increases with the threshold, and the total optimization time decreases
accordingly, the main drawback is that the method could stop at a plateau that is far
away from the real convergence plateau of the log likelihood. These models are falsely
stopped early and should be distinguished from the models that have reached the real
convergence plateau, as explained in Figure 1. Figures 5 and 6 show that, beyond a
certain threshold, some models are falsely stopped. Indeed, for the sequence S1, a
threshold of 0.001 leads to 6 of the 76 models stopped early not reaching the real
convergence of the log likelihood. Concerning the sequence S2, a less restrictive
threshold of 0.003 falsely stops 3 of the 90 models stopped early. Even though the main
objective is to speed up the optimization of a sequence of models, and higher thresholds
lead to lower optimization times, the modeler has to be careful with models that are
falsely stopped early. A threshold of 2·10^{-5} is acceptable in the sense that no model
is falsely stopped early for either S1 or S2, while giving a speedup equivalent to that
of less restrictive thresholds.
Figure 5: Number of models falsely stopped early for S1
Figure 6: Number of models falsely stopped early for S2
A final benchmark combining the methods that speed up the estimation of both S1 and S2 is
run. Among the warm start variants, WS is the most efficient; the ST method is also
included, and the ES method with a threshold of 2·10^{-5} has shown an interesting
reduction of time. The results obtained by combining all these methods for both S1 and S2
are compared to the Base method in Tables 3 and 4. The combination of the WS, ES (with a
threshold of 2·10^{-5}) and ST methods leads to an improvement by a factor of 5.26 and
6.67 compared to the Base method for S1 and S2 respectively.
Table 3: Summary of statistics for 10 estimations for the sequence S1: comparison between
the combined performing methods and the Base method
Statistics Base Final
Minimum 224.90 44.25
Maximum 231.32 44.83
Mean 229.51 44.45
Standard Deviation 1.78 0.18
Speedup 1.0 5.26
Table 4: Summary of statistics for 10 estimations for the sequence S2: comparison between
the combined performing methods and the Base method
Statistics Base Final
Minimum 534.58 83.77
Maximum 539.44 84.55
Mean 536.23 84.22
Standard Deviation 1.36 0.23
Speedup 1.0 6.67
6 Conclusion
Enhancing the estimation of a sequence of DCMs is a subject that had not yet been
explored. The objective of this project was to propose different methods to improve the
total estimation time of a sequence of DCMs. The BFGS^{−1} algorithm is used to estimate
the two sequences of DCMs, S1 and S2. The first approach was to implement a method
commonly used in ML, which reuses previously acquired knowledge for a new task: the WS
method speeds up the estimation by a factor of 3.84 and 4.5 compared to the Base method
for S1 and S2 respectively. The standardization of the variables accelerates the
estimation only slightly, but should be used at the beginning of every optimization task
because of its simplicity. The last approach is the ES method, which has shown an
interesting improvement of the estimation time; however, the applied threshold has to be
carefully chosen in order not to stop at a bad convergence plateau of the log likelihood,
and a threshold of 2·10^{-5} is chosen here. The combination of all the methods that
speed up the estimation of both S1 and S2 leads to an improvement by a factor of 5.26 and
6.67 respectively compared to the Base method. For the future, I would like to work on
two improvements. The first one concerns the ES method, which tends to stop at a
convergence plateau that can be far away from the real convergence of the log likelihood;
a more robust ES method could be implemented by finding an efficient way to detect these
regions. The second possible improvement concerns the warm start: a more detailed
analysis could be conducted, since even though the total estimation time is reduced by
the WS method, some models to which it is applied have a higher optimization time than
without any warm start.
References
[1] Bierlaire M. (1998) Discrete Choice Models. In: Labbé M., Laporte G., Tanczos K.,
Toint P. (eds) Operations Research and Decision Aid Methodologies in Traffic and Trans-
portation Management. NATO ASI Series (Series F: Computer and Systems Sciences),
vol 166. Springer, Berlin, Heidelberg
[2] Ben-Akiva M., Bierlaire M. (1999) Discrete Choice Methods and their Applications to
Short Term Travel Decisions. In: Hall R.W. (eds) Handbook of Transportation Science.
International Series in Operations Research & Management Science, vol 23. Springer,
Boston, MA
[3] Devolder, O., Glineur, F. & Nesterov, Y. First-order methods of smooth convex opti-
mization with inexact oracle. Math. Program. 146, 37–75 (2014).
https://doi.org/10.1007/s10107-013-0677-5
[4] Hennig, P. and Kiefel, M. (2013). Quasi-Newton methods: A new direction. The Journal
of Machine Learning Research, 14(1):843–865
[5] Robert M. Gower, Filip Hanzely, Peter Richtárik and Sebastian Stich (2018).
Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules
for Faster Second-Order Optimization.
[6] Lederrey G., Lurkin V., Hillel T. and Bierlaire M. (2018). Estimation of Discrete
Choice Models with Hybrid Stochastic Adaptive Batch Size Algorithms.
[7] Jordan T. Ash and Ryan P. Adams (2019). On the Difficulty of Warm-Starting Neural
Network Training.
[8] Prechelt L. (2012). Early Stopping — But When?
[9] Fletcher, R. (1987). Practical Methods of Optimization (2nd ed.). Wiley-Interscience,
New York, NY, USA.