Workshop Organization
Organizers: Lisa Wainer, University College London
Hanna Wallach, University of Cambridge
Jennifer Wortman, University of Pennsylvania
Faculty advisor: Amy Greenwald, Brown University
Additional reviewers: Maria-Florina Balcan, Melissa Carroll, Kimberley Ferguson,
Katherine Heller, Julia Hockenmaier, Rebecca Hutchinson, Kristina Klinkner,
Bethany Leffler, Ozgur Simsek, Alicia Peregrin Wolfe, Elena Zheleva
Thanks to our generous sponsors:
Schedule
October 3, 2006
19:30 Workshop dinner
October 4, 2006
08:45 Registration and poster set-up
09:00 Welcome
09:15 Invited talk: A General Class of No-Regret Learning Algorithms and Game-
Theoretic Equilibria
Amy Greenwald, Brown University
09:45 On a Theory of Learning with Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University
10:00 Matrix Tile Analysis
Inmar Givoni, University of Toronto
10:15 Towards Bayesian Black Box Learning Systems
Jo-Anne Ting, University of Southern California
10:30 Coffee break
10:45 Invited talk: Clustering High-Dimensional Data
Jennifer Dy, Northeastern University
11:15 Efficient Bayesian Algorithms for Clustering
Katherine Heller, Gatsby Unit, University College London
11:30 Hidden Process Models
Rebecca Hutchinson, Carnegie Mellon University
11:45 Invited talk: Recent advances in near-neighbor learning
Maya Gupta, University of Washington
12:15 Spotlight talks:
Correcting sample selection bias by unlabeled data
Jiayuan Huang, University of Waterloo
Decision Tree Methods for Finding Reusable MDP Homomorphisms
Alicia Peregrin Wolfe, University of Massachusetts, Amherst
Evaluating a Reputation-based Spam Classification System
Elena Zheleva, University of Maryland, College Park
Improving Robot Navigation Through Self-Supervised Online Learning
Ellie Lin, Carnegie Mellon University
12:30 Lunch
13:00 Poster session 1
13:45 Invited talk: SRL: Statistical Relational Learning
Lise Getoor, University of Maryland, College Park
14:15 Generalized statistical methods for fraud detection
Cecile Levasseur, University of California, San Diego
14:30 Kernels for the Predictive Regression of Physical, Chemical and Biological
Properties of Small Molecules
Chloe-Agathe Azencott, University of California, Irvine
14:45 Invited talk: Modeling and Learning User Preferences for Sets of Objects
Marie desJardins, University of Maryland, Baltimore County
15:15 Coffee break
15:30 Efficient Exploration with Latent Structure
Bethany Leffler, Rutgers University
15:45 Efficient Model Learning for Dialog Management
Finale Doshi, MIT
16:00 Transfer in the context of Reinforcement Learning
Soumi Ray, University of Maryland, Baltimore County
16:15 Spotlight talks:
Simultaneous Team Assignment and Behavior Recognition from Spatio-
temporal Agent Traces
Gita Sukthankar, Carnegie Mellon University
An Online Learning System for the Prediction of Electricity Distribution
Feeder Failures
Hila Becker, Columbia University
Classification of fMRI Images: An Approach Using Viola-Jones Features
Melissa K Carroll, Princeton University
Fast Online Classification with Support Vector Machines
Seyda Ertekin, Penn State University
16:30 Poster session 2
17:15 Open discussion
17:45 Closing remarks and poster take-down
18:00 End of workshop
Invited Talks
A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
Amy Greenwald, Brown University
No-regret learning algorithms have attracted a great deal of attention in the game-
theoretic and machine learning communities. Whereas rational agents act so as to
maximize their expected utilities, no-regret learners are boundedly rational agents that act
so as to minimize their "regret". In this talk, we discuss the behavior of no-regret learning
algorithms in repeated games.
Specifically, we introduce a general class of algorithms called no-Φ-regret learning, which
includes common variants of no-regret learning such as no-external-regret and no-internal-
regret learning. Analogously, we introduce a class of game-theoretic equilibria called Φ-
equilibria. We show that no-Φ-regret learning algorithms converge to Φ-equilibria. In
particular, no-external-regret learning converges to minimax equilibrium in zero-sum
games; and no-internal-regret learning converges to correlated equilibrium in general-sum
games. Although our class of no-regret algorithms is quite extensive, no algorithm in this
class learns Nash equilibrium.
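The external-regret member of this family is easy to sketch. The following toy (an illustration, not the speaker's own implementation) runs the multiplicative-weights (Hedge) learner on random loss vectors and checks that its average external regret, measured against the best fixed action in hindsight, is small:

```python
import math
import random

def hedge(loss_seqs, eta):
    """Multiplicative-weights (Hedge) learner: returns the mixed strategy
    played on each round, updating weights from the observed losses."""
    n = len(loss_seqs[0])
    w = [1.0] * n
    plays = []
    for losses in loss_seqs:
        total = sum(w)
        plays.append([wi / total for wi in w])
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, losses)]
    return plays

def external_regret(loss_seqs, plays):
    """Learner's expected cumulative loss minus that of the best fixed action."""
    n = len(loss_seqs[0])
    learner = sum(sum(p * l for p, l in zip(play, losses))
                  for play, losses in zip(plays, loss_seqs))
    best_fixed = min(sum(losses[a] for losses in loss_seqs) for a in range(n))
    return learner - best_fixed

random.seed(0)
T = 2000
# action 0 is slightly better on average; the learner must discover this
seqs = [[0.9 * random.random(), random.random()] for _ in range(T)]
plays = hedge(seqs, eta=math.sqrt(math.log(2) / T))
print(external_regret(seqs, plays) / T)  # average regret shrinks toward zero
```

With learning rate eta = sqrt(ln n / T), the average external regret decays at the O(sqrt(log n / T)) rate that underlies convergence results of this kind.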
Speaker biography:
Dr. Amy Greenwald is an assistant professor of computer science at Brown University in
Providence, Rhode Island. Her primary research area is the study of economic interactions
among computational agents. Her primary methodologies are game-theoretic analysis
and simulation. Her work is applicable in areas ranging from dynamic pricing to
autonomous bidding to transportation planning and scheduling. She was awarded a Sloan
Fellowship in 2006; she was nominated for the 2002 Presidential Early Career Award for
Scientists and Engineers (PECASE); and she was named one of the Computing Research
Association's Digital Government Fellows in 2001. Before joining the faculty at Brown, Dr.
Greenwald was employed by IBM's T.J. Watson Research Center, where she researched
Information Economies. Her paper entitled "Shopbots and Pricebots" (joint work with Jeff
Kephart) was named Best Paper at IBM Research in 2000.
Clustering High-Dimensional Data
Jennifer Dy, Northeastern University
Creating effective algorithms for unsupervised learning is important because vast amounts
of data preclude humans from manually labeling the categories of each instance. In
addition, human labeling is expensive and subjective. Therefore, a majority of existing
data is unsupervised (unlabeled). The goal of unsupervised learning or cluster analysis is
to group "similar" objects together. "Similarity" is typically defined by a metric or a
probability model. These measures are highly dependent on the features representing the
data. Many clustering algorithms assume that relevant features have been determined by
the domain experts. But, not all features are important. Moreover, many clustering
algorithms fail when dealing with high-dimensions. We present two approaches for dealing
with clustering in high-dimensional spaces: 1. Feature selection for clustering, through
Gaussian mixtures and the maximum likelihood and scatter separability criteria, and 2.
Hierarchical feature transformation and clustering, through automated hierarchical
mixtures of probabilistic principal component analyzers.
Speaker biography:
Dr. Jennifer G. Dy has been an assistant professor in the Department of Electrical and
Computer Engineering at Northeastern University, Boston, MA, since 2002. She obtained her MS and
PhD in 1997 and 2001 respectively from the School of Electrical and Computer
Engineering, Purdue University, West Lafayette, IN, and her BS degree in 1993 from the
Department of Electrical Engineering, University of the Philippines. She received an NSF
Career award in 2004. She has been an editorial board member for the journal Machine Learning
since 2004, and was publications chair for the International Conference on Machine Learning in
2004. Her research interests include Machine Learning, Data Mining, Statistical Pattern
Recognition, and Computer Vision.
Recent advances in near-neighbor learning
Maya R. Gupta, University of Washington
Recent advances in nearest-neighbor learning are shown for adaptive neighborhood
definitions, neighborhood weighting, and estimation given the nearest neighbors. In particular,
it is shown that weights that solve linear interpolation equations minimize the first-order
learning error, and this is coupled with the principle of maximum entropy to create a
flexible weighting approach. Different approaches to adaptive neighborhoods are
contrasted, the focus being on neighborhoods that form a convex hull around the test
point. Standard weighted nearest-neighbor estimation is shown to maximize likelihood,
and it is shown that minimizing expected Bregman divergence instead leads to optimal
solutions in terms of expected misclassification cost. Applications may include the testing
of pipeline integrity, custom color enhancements, and estimation for color management.
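As a baseline for the weighting schemes discussed here, a minimal distance-weighted nearest-neighbor classifier looks as follows; the linear-interpolation and maximum-entropy weights of the talk refine this inverse-distance heuristic, which is shown only for orientation:

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=3):
    """Distance-weighted k-nearest-neighbor vote: each of the k closest
    training points votes for its label with weight 1/distance."""
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        votes[label] += 1.0 / (1e-9 + math.dist(x, query))  # closer points count more
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
print(weighted_knn_predict(train, (0.2, 0.1)))  # -> a
```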
Speaker biography:
Maya Gupta completed her Ph.D. in Electrical Engineering in 2003 at Stanford University as
a National Science Foundation Graduate Fellow. Her undergraduate studies led to a BS in
Electrical Engineering and a BA in Economics from Rice University in 1997. From 1999 to
2003 she worked for Ricoh's California Research Center as a color image processing
research engineer. In the fall of 2003, she joined the EE faculty of the University of
Washington as an Assistant Professor where she also serves as an Adjunct Assistant
Professor of Applied Mathematics. More information about her research is available at her
group's webpage: idl.ee.washington.edu.
Modeling and Learning User Preferences for Sets of Objects
Marie desJardins, University of Maryland, Baltimore County
Most work on preference learning has focused on pairwise preferences or rankings over
individual items. In many application domains, however, when a set of items is presented
together, the individual items can interact in ways that increase (via complementarity) or
decrease (via redundancy or incompatibility) the quality of the set as a whole.
In this talk, I will describe the DD-PREF language that we have developed for specifying
set-based preferences. One problem with such a language is that it may be difficult for
users to explicitly specify their preferences quantitatively. Therefore, we have also
developed an approach for learning these preferences. Our learning method takes as
input a collection of positive examples; that is, one or more sets that have been identified
by a user as desirable. Kernel density estimation is used to estimate the value function for
individual items, and the desired set diversity is estimated from the average set diversity
observed in the collection.
Since this is a new learning problem, I will also describe our new evaluation methodology
and give experimental results of the learning method on two data collections: synthetic
blocks-world data and a new real-world music data collection.
Joint work with Eric Eaton and Kiri L. Wagstaff.
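A minimal sketch of the kernel-density step, assuming hypothetical one-dimensional item features (DD-PREF itself learns per-feature value functions and trades them off against a diversity term):

```python
import math

def kde_value(positive_items, x, bandwidth=0.5):
    """Gaussian kernel density estimate of an item-value function, fit to the
    items of a single user-approved set (1-D item features for illustration)."""
    n = len(positive_items)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
               for xi in positive_items) / norm

def set_diversity(items):
    """Average pairwise distance: a simple stand-in for a set-diversity estimate."""
    pairs = [(a, b) for i, a in enumerate(items) for b in items[i + 1:]]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

liked = [1.0, 1.2, 0.9, 1.1]  # hypothetical features of items in a desirable set
print(kde_value(liked, 1.0) > kde_value(liked, 3.0))  # near-set items score higher
print(set_diversity(liked))
```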
Speaker biography:
Dr. Marie desJardins is an assistant professor in the Department of Computer Science and
Electrical Engineering at the University of Maryland, Baltimore County. Prior to joining the
faculty in 2001, Dr. desJardins was a senior computer scientist at SRI International in Menlo
Park, California. Her research is in artificial intelligence, focusing on the areas of machine
learning, multi-agent systems, planning, interactive AI techniques, information
management, reasoning with uncertainty, and decision theory.
SRL: Statistical Relational Learning
Lise Getoor, University of Maryland, College Park
A key challenge for machine learning is mining richly structured datasets describing
objects, their properties, and links among the objects. We would like to be able to learn
models that can capture both the underlying uncertainty and the logical relationships in
the domain. Links among the objects may demonstrate certain patterns, which can be
helpful for many practical inference tasks and are usually hard to capture with traditional
statistical models. Recently there has been a surge of interest in this area, fueled largely
by interest in web and hypertext mining, but also by interest in mining social networks,
security and law enforcement data, bibliographic citations and epidemiological records.
Statistical Relational Learning (SRL) is a newly emerging research area, which attempts to
represent, reason and learn in domains with complex relational and rich probabilistic
structure. In this talk, I'll begin with a short SRL overview. Then, I'll describe some of my
group's recent work, including our work on entity resolution in relational domains.
Joint work with Indrajit Bhattacharya, Mustafa Bilgic, Louis Licamele and Prithviraj Sen.
Speaker biography:
Prof. Lise Getoor is an assistant professor in the Computer Science Department at the
University of Maryland, College Park. She received her PhD from Stanford University in
2001. Her current work includes research on link mining, statistical relational learning and
representing uncertainty in structured and semi-structured data. Her work in these areas
has been supported by NSF, NGA, KDD, ARL and DARPA. In June 2006, she co-organized
the 4th in a series of successful workshops on statistical relational learning,
http://www.cs.umd.edu/srl2006. She has published numerous articles in machine learning,
data mining, database and AI forums. She is a member of the AAAI Executive Council, is on
the editorial board of the Machine Learning Journal and JAIR and has served on numerous
program committees including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.
Talks
On a Theory of Learning with Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University
Kernel functions have become an extremely popular tool in machine learning. They have
an attractive theory that describes a kernel function as being good for a given learning
problem if data is separable by a large margin in a (possibly very high-dimensional)
implicit space defined by the kernel. This theory, however, has a bit of a disconnect with
the intuition of a good kernel as a good similarity function. In this work we develop an
alternative theory of learning with similarity functions more generally (i.e., sufficient
conditions for a similarity function to allow one to learn well) that does not require
reference to implicit spaces, and does not require the function to be positive semi-definite.
Our results also generalize the standard theory in the sense that any good kernel function
under the usual definition can be shown to also be a good similarity function under our
definition. In this way, we provide the first steps towards a theory of kernels that describes
the effectiveness of a given kernel function in terms of natural similarity-based properties.
Joint work with Avrim Blum.
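The construction suggested by this theory can be sketched directly: map every example to its vector of similarities to a handful of landmark examples, then run any linear learner in that space. The similarity below is an arbitrary illustration (clipped, and never verified to be positive semi-definite):

```python
import random

def sim(x, y):
    """An illustrative pairwise similarity in [-1, 1]; nothing here requires
    it to be a positive semi-definite kernel."""
    return 1.0 - min(2.0, abs(x[0] - y[0]) + abs(x[1] - y[1]))

def featurize(x, landmarks):
    """Map an example to its vector of similarities to the landmark examples."""
    return [sim(x, l) for l in landmarks]

def perceptron(data, epochs=20):
    """Plain perceptron trained in the similarity-feature space."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

random.seed(1)
pos = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(40)]
neg = [(random.gauss(3, 0.3), random.gauss(3, 0.3)) for _ in range(40)]
landmarks = pos[:5] + neg[:5]
data = ([(featurize(p, landmarks), +1) for p in pos]
        + [(featurize(n, landmarks), -1) for n in neg])
w = perceptron(data)
errors = sum(1 for x, y in data if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0)
print(errors)  # few or no training errors
```

The linear learner separates the mapped data even though `sim` was never checked for kernel validity, which is the point of the similarity-function view.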
Matrix Tile Analysis
Inmar Givoni, University of Toronto
Many tasks require finding groups of elements in a matrix of numbers, symbols or class
likelihoods. One approach is to use efficient bi- or tri-linear factorization techniques
including PCA, ICA, sparse matrix factorization and plaid analysis. These techniques are
not appropriate when addition and multiplication of matrix elements are not sensibly
defined. More directly, methods like bi-clustering can be used to classify matrix elements,
but these methods make the overly restrictive assumption that the class of each element
is a function of a row class and a column class. We introduce a general computational
problem, "matrix tile analysis" (MTA), which consists of decomposing a matrix into a set of
non-overlapping tiles, each of which is defined by a subset of usually nonadjacent rows
and columns. MTA does not require an algebra for combining tiles, but must search over an
exponential number of discrete combinations of tile assignments. We describe a loopy BP
(sum-product) algorithm and an ICM algorithm for performing MTA. We compare the
effectiveness of these methods to PCA and the plaid method on hundreds of randomly
generated tasks. Using double-gene-knockout data, we show that MTA finds groups of
interacting yeast genes that have biologically-related functions.
Joint work with Vincent Cheung and Brendan J. Frey.
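ICM itself is a simple coordinate-wise scheme: repeatedly set each discrete variable to the value that minimizes its local energy given its neighbors. The sketch below applies that scheme to a toy grid-denoising objective rather than to tile assignments, purely to illustrate the optimization style:

```python
def icm_denoise(noisy, beta=1.0, sweeps=5):
    """Iterated conditional modes: sweep the grid, setting each label to the
    value that minimizes its local energy (data term + smoothness term)."""
    H, W = len(noisy), len(noisy[0])
    x = [row[:] for row in noisy]
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                best, best_e = x[i][j], float("inf")
                for v in (0, 1):
                    e = float(v != noisy[i][j])  # data term: stay near the observation
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        if 0 <= i + di < H and 0 <= j + dj < W:
                            e += beta * (v != x[i + di][j + dj])  # smoothness term
                    if e < best_e:
                        best, best_e = v, e
                x[i][j] = best
    return x

noisy = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]  # one flipped cell
print(icm_denoise(noisy))  # -> [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```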
Towards Bayesian Black Box Learning Systems
Jo-Anne Ting, University of Southern California
A long-standing dream of machine learning is to create black box learning systems that
can operate autonomously in home, research and industrial applications. While it is well
understood that a universal black box may not be possible, significant progress can be
made in specific domains. In particular, we address learning problems in sensor-rich and
data-rich environments, as provided by autonomous vehicles, surveillance systems,
biological or robotic systems. In these scenarios, the input data has hundreds or thousands
of dimensions and is used to make predictions (often in real-time), resulting in a learning
system that learns to "understand" the environment.
The goal of machine learning in this domain is to devise algorithms that can efficiently deal
with very high dimensional data, usually contaminated by noise, redundancy and
irrelevant dimensions. These algorithms must learn nonlinear functions, potentially in an
incremental and real-time fashion, for robust classification and regression. In order to
achieve black box quality, manual tuning parameters (e.g. as in gradient descent or
structure selection) need to be minimized or, ideally, avoided.
Bayesian inference, when combined with approximation methods to reduce computational
complexity, suggests a promising route to achieve our goals, since it offers a principled
way to eliminate open parameters. In past work, we have started to create a toolbox of
methods to achieve our goal of black box learning. In (Ting et al., NIPS 2005), we
introduced a Bayesian approach to linear regression. The novelty of this algorithm comes
from a Bayesian and EM-like formulation of linear regression that robustly performs
automatic feature detection in the inputs in a computationally efficient way. We applied
this algorithm to the analysis of neuroscientific data (i.e. the problem of prediction of
electromyographic (EMG) activity in the arm muscles of a monkey from spiking activity of
neurons in the primary motor and premotor cortex). The algorithm achieves results that
are orders of magnitude faster, and of higher quality, than previously applied methods.
More recently, we introduced a variational Bayesian regression algorithm that is able to
perform optimal prediction, given noise-contaminated input and output data (Ting, D'Souza
& Schaal, ICML 2006). Traditional linear regression algorithms produce biased estimates
when input noise is present and suffer numerically when the data contains irrelevant
and/or redundant inputs. Our algorithm is able to effectively handle datasets with both
characteristics. On a system identification task for a robot dynamics model, we achieved
from 10 to 70% better results than traditional approaches.
Current work focuses on developing a Bayesian version of nonlinear function
approximation with locally weighted regression. The challenge is to determine the size of
the neighborhood of data that should contribute to the local regression model, a typical
bias-variance trade-off problem. Preliminary results indicate that a full Bayesian treatment
of this problem can achieve impressive robust function approximation performance without
the need for tuning meta parameters. We are also interested in extending this locally
linear Bayesian model to an online setting, in the spirit of dynamic Bayesian networks, to
offer a parameter-free alternative to incremental learning.
Joint work with Aaron D'Souza, Stefan Schaal, Kenji Yamamoto, Toshinori Yoshioka, Donna
Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato, Peter Strick, Michael
Mistry, Jan Peters, and Jun Nakanishi.
This work will also be in Poster Session 1.
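The appeal of a conjugate Bayesian treatment can be seen in miniature with a one-dimensional linear-regression sketch (a deliberately simplified stand-in: the algorithms above use EM and variational machinery and, unlike this toy, handle input noise and irrelevant dimensions):

```python
import random

def bayes_linreg_1d(xs, ys, noise_var=0.25, prior_var=10.0):
    """Conjugate Bayesian regression for y = w*x + noise with prior w ~ N(0, prior_var):
    returns the exact posterior mean and variance of w (no tuning, no iteration)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision

random.seed(0)
true_w = 2.0
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [true_w * x + random.gauss(0, 0.5) for x in xs]
m, v = bayes_linreg_1d(xs, ys)
print(m, v)  # posterior concentrates near the true weight
```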
Efficient Bayesian Algorithms for Clustering
Katherine Ann Heller, Gatsby Unit, University College London
One of the most important goals of unsupervised learning is to discover meaningful
clusters in data. There are many different types of clustering methods that are commonly
used in machine learning including spectral, hierarchical, and mixture modeling. Our work
takes a model-based Bayesian approach to defining a cluster and evaluates cluster
membership in this paradigm. We use marginal likelihoods to compare different cluster
models, and hence determine which data points belong to which clusters. If we have
models with conjugate priors, these marginal likelihoods can be computed extremely
efficiently.
Using this clustering framework in conjunction with non-parametric Bayesian methods, we
have proposed a new way of performing hierarchical clustering. Our Bayesian Hierarchical
Clustering (BHC) algorithm takes a more principled approach to the problem than the
traditional algorithms (e.g. allowing for model comparisons and the prediction of new data
points) without sacrificing efficiency. BHC can also be interpreted as performing
approximate inference in Dirichlet Process Mixtures (DPMs), and provides a combinatorial
lower bound on the marginal likelihood of a DPM.
We have also explored the task of "clustering on demand" for information retrieval. Given
a query consisting of a few examples of some concept, we have proposed a method that
returns other items belonging to the concept exemplified by the query. We do this by
ranking all items using a Bayesian relevance criterion based on marginal likelihoods, and
returning the items with the highest scores. In the case of binary data, all scores can be
computed with a single matrix-vector product. We can also use this method as the basis
for an image retrieval system. In our most recent work this framework has served as
inspiration for a new approach to automated analogical reasoning.
Joint work with Zoubin Ghahramani and Ricardo Silva.
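For binary data with conjugate Beta priors, the marginal likelihoods are available in closed form, which is what makes this kind of scoring cheap. Below is a minimal Bayesian-Sets-style relevance score, assuming uniform Beta(1,1) priors and toy binary items:

```python
import math

def log_marginal_bernoulli(column, alpha=1.0, beta=1.0):
    """Closed-form log marginal likelihood of a binary column under a
    Beta(alpha, beta)-Bernoulli model (integrating out the Bernoulli rate)."""
    n, k = len(column), sum(column)
    lbeta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return lbeta(alpha + k, beta + n - k) - lbeta(alpha, beta)

def relevance_score(item, query):
    """Bayesian-Sets-style score: log p(item, query) - log p(item) - log p(query),
    summed over independent binary features."""
    s = 0.0
    for j, x in enumerate(item):
        col = [q[j] for q in query]
        s += (log_marginal_bernoulli(col + [x])
              - log_marginal_bernoulli([x])
              - log_marginal_bernoulli(col))
    return s

query = [[1, 1, 0], [1, 0, 0], [1, 1, 0]]  # a few exemplars of some concept
print(relevance_score([1, 1, 0], query) > relevance_score([0, 0, 1], query))
```

Because the per-feature log-ratios are linear in the item's bits, scoring every item of a binary dataset reduces to one matrix-vector product, as the abstract notes.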
Hidden Process Models
Rebecca Hutchinson, Carnegie Mellon University
We introduce the Hidden Process Model (HPM), a probabilistic model for multivariate time
series data. HPMs assume the data is generated by a system of partially observed, linearly
additive processes that overlap in space and time. While we present a general formalism
for any domain with similar modeling assumptions, HPMs are motivated by our interest in
studying cognitive processes in the brain, given a time series of functional magnetic
resonance imaging (fMRI) data. We use HPMs to model fMRI data by assuming there is an
unobserved series of hidden, overlapping cognitive processes in the brain that
probabilistically generate the observed fMRI time series.
Consider for example a study in which subjects in the scanner repeatedly view a picture
and read a sentence and indicate whether the sentence correctly describes the picture. It
is natural to think of the observed fMRI sequence as arising from a set of hidden cognitive
processes in the subject’s brain, which we would like to track. To do this, we use HPMs to
learn the probabilistic time series response signature for each type of cognitive process,
and to estimate the onset time of each instantiated cognitive process occurring throughout
the experiment.
There are significant challenges to this learning task in the fMRI domain. The first is that
fMRI data is high dimensional and sparse. A typical fMRI dataset measures approximately
10,000 brain locations over 15-20 minutes (features), with only a few dozen trials (training
examples). A second challenge is due to the nature of the fMRI signal: it is a highly noisy
measurement of an indirect and temporally blurred neural correlate called the
hemodynamic response. The hemodynamic response to a short burst of less than a second
of neural activity lasts for 10-12 seconds. This temporal blurring in fMRI makes it
problematic to model the time series as a first-order Markov process. In short, our problem
is to learn the parameters and timing of potentially overlapping, partially observed
responses to cognitive processes in the brain using many features and a small number of
noisy training examples.
The modeling assumptions that HPMs make to deal with the challenges of the fMRI domain
are: 1) the latent time series is modeled at the level of processes rather than individual
time points; 2) processes are general descriptions that can be instantiated many times
over the course of the time series; 3) we can use prior knowledge of the form “process
instance X occurs somewhere inside the time interval [a, b].” HPMs could apply to any
domain in which these assumptions are valid.
HPMs address a key open question in fMRI analysis: how can one learn the response
signatures of overlapping cognitive processes with unknown timing? There is no
competing method to HPMs available in the fMRI community. In our ICML paper, we give
the HPM formalism, inference and learning algorithms, and experimental results on real
and synthetic fMRI datasets.
Joint work with Tom Mitchell and Indrayana Rustandi.
This work will also be in Poster Session 1.
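The generative assumption is easy to state in code. A toy version (hypothetical process names and signatures) adds each instantiated process's response signature into the series at its onset time:

```python
import random

def generate_hpm_series(length, signatures, instances, noise=0.0):
    """Toy generative model in the HPM spirit: each process instance
    (process_id, onset_time) adds its response signature, shifted to its
    onset, into a single observed time series."""
    y = [0.0] * length
    for pid, onset in instances:
        for t, value in enumerate(signatures[pid]):
            if onset + t < length:
                y[onset + t] += value  # linearly additive, overlapping in time
    return [v + random.gauss(0, noise) for v in y]

signatures = {"read": [0, 1, 2, 1, 0], "view": [0, 2, 2, 0]}
series = generate_hpm_series(12, signatures, [("view", 1), ("read", 3)])
print(series)  # overlapping responses sum where the processes coincide
```

Inference in HPMs runs this story backwards: given the observed sum, recover the signatures and the onset times.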
Generalized statistical methods for fraud detection
Cecile Levasseur, University of California, San Diego
Many important risk assessment system applications depend on the ability to accurately
detect the occurrence of key events given a large data set of observations. For example,
this problem arises in drug discovery (“Do the molecular descriptors associated with
known drugs suggest that a new, candidate drug will have low toxicity and high
effectiveness?”); and credit card fraud detection (“Given the data for a large set of credit
card users, does the usage pattern of this particular card indicate that it might have been
stolen?”). In many of these domains, no or little a priori knowledge exists regarding the
true sources of any causal relationships that may occur between variables of interest. In
these situations, meaningful information regarding the circumstances of the key events
must be extracted from the data itself, a problem that can be viewed as an important
application of data-driven pattern recognition or detection.
The problem of unsupervised data-driven detection or prediction is one of relating
descriptors of a large unlabeled database of “objects” to measured properties of these
objects, and then using these empirically determined relationships to infer or detect the
properties of new objects. This work considers measured object properties that are
nongaussian (and comprised of continuous and discrete data), very noisy, and highly
nonlinearly related. Data comprised of measurements of such disparate properties are said
to be hybrid or of mixed type. As a consequence, the resulting detection problem is very
difficult. The difficulties are further compounded because the descriptor space is of high
dimension. While many domains lack accurate labels in their database, others like credit
card fraud exhibit tagged data. Therefore, the problem of supervised data-driven
detection, one relating to a labelled database of objects, is also examined. In addition, by
utilizing tagged data, a performance benchmark can be set, enabling meaningful
comparisons of supervised and unsupervised approaches.
Statistical approaches to fraud detection are mostly based on modelling the data relying
on their statistical properties and using this information to estimate whether a new object
comes from the same distribution or not. The statistical modelling approach proposed here
is a generalization and amalgamation of techniques from classical linear statistics (logistic
regression, principal component analysis and generalized linear models) into a framework
referred to as generalized linear statistics (GLS). It is based on the use of exponential
family distributions to model the various types (continuous and discrete) of data
measurements. A key aspect is that the natural parameter of the exponential family
distributions is constrained to a lower dimensional subspace to model the belief that the
intrinsic dimensionality of the data is smaller than the dimensionality of the observation
space. The proposed constrained statistical modelling is a nonlinear methodology that
exploits the split that occurs for exponential family distributions between the data space
and the parameter space as soon as one leaves the domain of purely Gaussian random
variables. Although the problem is nonlinear, it can be solved by using classical linear
statistical tools applied to data that has been mapped into the parameter space that still
has a natural, flat Euclidean structure. This approach provides an effective way to exploit
tractably parameterized latent-variable exponential-family probability models for data-
driven learning of model parameters and features, which in turn are useful for the
development of effective fraud detection algorithms.
The fraud detection techniques proposed here are performed in the parameter space
rather than in the data space as has been done in more classical approaches. In the case
of a low level of contamination of the data by fraudulent points, a single lower dimensional
subspace is learned by using the GLS-based statistical modelling on a training set. A new
data point is projected to its image on the lower dimensional subspace, and fraud
detection is performed by comparing its distance from the training-set mean image to a
threshold. An example is presented showing that there are domains in which classical linear
techniques, such as principal component analysis, applied in the data space perform far
from optimally compared with the proposed parameter-space techniques. For
cases of data with roughly as many fraudulent as non-fraudulent points, an unsupervised
approach to the linear Fisher discriminant is proposed. The GLS-based framework enables
unsupervised learning of a lower dimensional subspace in the parameter space that
separates fraudulent from non-fraudulent data. Fraud detection is performed as in the
previous case. In both cases, an ROC curve is generated to assess the performance of the
proposed fraud detection methods.
Joint work with Kenneth Kreutz-Delgado and Uwe Mayer.
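The detection recipe (project a point into the learned subspace, then threshold the distance of its image from the training-set mean image) can be sketched in the classical data-space setting using an ordinary first principal component; the GLS method carries out the analogous steps in the exponential-family parameter space:

```python
import random

def first_pc(data, iters=100):
    """Leading principal direction of the data via power iteration on X^T X
    (a classical data-space stand-in for the GLS parameter-space subspace)."""
    d = len(data[0])
    mean = [sum(col) / len(data) for col in zip(*data)]
    X = [[row[i] - mean[i] for i in range(d)] for row in data]
    v = [1.0] * d
    for _ in range(iters):
        Xv = [sum(r[i] * v[i] for i in range(d)) for r in X]                  # X v
        v = [sum(X[n][i] * Xv[n] for n in range(len(X))) for i in range(d)]   # X^T (X v)
        norm = sum(vi * vi for vi in v) ** 0.5
        v = [vi / norm for vi in v]
    return mean, v

def fraud_score(x, mean, v):
    """Distance of the point's 1-D subspace image from the training mean's image."""
    return abs(sum((x[i] - mean[i]) * v[i] for i in range(len(x))))

random.seed(2)
normal = [[random.gauss(0, 1), random.gauss(0, 0.1)] for _ in range(300)]
mean, v = first_pc(normal)
# a far-out point scores higher than a typical one; thresholding the score flags fraud
print(fraud_score([5.0, 0.0], mean, v) > fraud_score([0.1, 0.0], mean, v))
```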
Kernels for the Predictive Regression of Physical, Chemical and Biological Properties of
Small Molecules
Chloe-Agathe Azencott, University of California, Irvine
Small molecules, i.e. molecules composed of a few hundred atoms, play a
fundamental role in biology, chemistry and pharmacology. Their uses range from the
design of new drugs to a better understanding of biological systems; however,
establishing their physical, chemical and biological properties through physical
experimentation can be very costly. It is therefore essential to develop efficient
computational methods to predict these properties.
Kernel methods, and among them support vector machines, appear particularly
appropriate for chemical data, for they involve similarity measures that make it possible
to embed the data in a high-dimensional feature space where linear methods can be used.
Spectral kernels can be derived from various descriptions of the molecules; we
study representations whose dimensionality ranges from 1 to 4, thus obtaining 1D, 2D,
2.5D, 3D and 4D kernels.
Using cross-validation and redundancy reduction techniques on various datasets of small
and medium size from the literature, we test the kernels for the prediction of boiling
points, melting points, aqueous solubility and octanol/water partition coefficient and
compare them against state-of-the-art results.
Spectral kernels derived from the rich and reliable two-dimensional representation of the
molecules outperform the other methods on most of the datasets. They seem to be the
method of choice, given their simplicity, computational efficiency and prediction accuracy.
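A one-dimensional string kernel conveys the flavor of such spectral kernels: represent each molecule as a string and count shared length-k substrings. The SMILES-like strings below are toy inputs, and the talk's 2D-4D kernels are substantially richer:

```python
from collections import Counter

def spectrum_kernel(s1, s2, k=2):
    """k-mer spectrum kernel: inner product of the substring-count vectors
    of two molecule strings."""
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    return sum(c1[m] * c2[m] for m in c1)

ethanol, methanol, benzene = "CCO", "CO", "c1ccccc1"  # toy SMILES-like strings
# structurally closer molecules share more substrings and get a larger kernel value
print(spectrum_kernel(ethanol, methanol) > spectrum_kernel(ethanol, benzene))
```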
Efficient Exploration with Latent Structure
Bethany Leffler, Rutgers University
Developing robot control using a reinforcement-learning (RL) approach involves a number
of technical challenges. In our work, we address the problem of learning an action model.
Classical RL approaches assume Markov decision process (MDP) environments, which do
not support the critical idea of generalization between states. For an agent to learn the
results of its actions for each state, it would have to visit each state and perform each
action in that state at least once. In a robot setting, however, it is unrealistic to assume
there will be sufficient time to learn about every state of the environment independently;
so richer models of environmental dynamics are needed. Our technique for developing
such a model is to assume that each state is not unique. In most environments, there will
be states that have the same transition dynamics. By developing models where similar
states have similar dynamics, it becomes possible for a learner to reuse its experience in
one state to more quickly learn the dynamics of other parts of the environment. However,
it also introduces an additional challenge: determining which states are similar.
To evaluate the viability of this approach, we constructed an experiment using a four-
wheeled Lego Mindstorm robot as the agent. The state space consisted of discretized
vehicle locations with a hidden variable of slope (flat or incline), which correlated directly
with the action model. The agent had to learn which throttling action to perform in each
state to maintain a target speed. In this scenario, the actions did not affect the transitions
between states.
To determine similarity between states, the agent executed a selected action several times
in each of the vehicle locations. The outcomes of these actions were used to hierarchically
cluster the states. Once the states were clustered, the agent then started learning an
action model for each state cluster. The advantage of this approach over one that learned
a separate action model for each state is that information gathered in several different
states can be pooled together. In common environments, there are many more states than
state-types; therefore, learning based on clusters drastically reduces learning time. In
fact, we were able to prove a worst-case learning time result that formalizes and validates
this claim.
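The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each state's probe-action outcomes are summarized by their mean and groups states greedily by outcome similarity, where the paper uses hierarchical clustering.

```python
import numpy as np

def cluster_states_by_outcomes(outcomes, threshold):
    """Group states whose observed outcomes for a probe action are
    similar, so that experience can be pooled within each cluster.
    `outcomes` maps state -> list of observed outcome values."""
    means = {s: float(np.mean(v)) for s, v in outcomes.items()}
    clusters = []
    for state in sorted(means, key=means.get):
        # join the previous cluster if this state's mean outcome is close
        if clusters and abs(means[state] - np.mean([means[s] for s in clusters[-1]])) <= threshold:
            clusters[-1].append(state)
        else:
            clusters.append([state])
    return clusters

# Two hidden "slopes" (flat vs. incline) produce two outcome regimes.
obs = {"s1": [1.0, 1.1], "s2": [0.95], "s3": [0.30, 0.35], "s4": [0.32]}
print(cluster_states_by_outcomes(obs, threshold=0.2))
```

Once states are clustered, a single action model is fit per cluster from the pooled observations, which is what reduces learning time relative to per-state models.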
If the environment does not have many similar states or if the clustering algorithm groups
the states incorrectly, then the benefit of this approach will be minimized. Even in this
worst case, however, it is important to note that this algorithm is no more costly than
exploring each state individually.
Some limitations of this algorithm arise when states have semi-similar action models. For
instance, if two states behave similarly when one action is performed, but not for all the
actions, it is possible that the agent would learn incorrectly when following our proposed
algorithm. In most robotic environments, however, using our algorithm will greatly reduce
the time taken by the agent to determine its action model in all states, thereby increasing
the efficiency of the robot.
Joint work with Michael L. Littman, Alexander L. Strehl, and Thomas Walsh.
Efficient Model Learning for Dialog Management
Finale Doshi, MIT
Intelligent planning algorithms such as the Partially Observable Markov Decision Process
(POMDP) have succeeded in dialog management applications because of their robustness
to the inherent uncertainty of human interaction. Like all dialog planning systems,
however, POMDPs require an accurate model of the user (such as the user's possible
states and what the user might say). POMDPs are generally specified using a
large probabilistic model with many parameters; these parameters are difficult to specify
from domain knowledge, and gathering enough data to estimate the parameters
accurately a priori is expensive.
In this paper, we take a Bayesian approach to learning the user model simultaneously with
solving the dialog management problem. First we show that the policy that maximizes the expected
reward is the solution of the POMDP taken with the expected values of the parameters. We
update the parameter distributions after each test, and incrementally update the previous
POMDP solution. The update process has a relatively small computational cost, and we
test various heuristics to focus computation in circumstances where it is most likely to
improve the dialog. We are able to demonstrate a robust dialog manager that learns from
interaction data, out-performing a hand-coded model in simulation and in a robotic
wheelchair application.
Joint work with Nicholas Roy.
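The Bayesian parameter update described above can be sketched with Dirichlet counts; this is an illustrative reconstruction under assumed names, not the authors' code.

```python
import numpy as np

class DirichletTransitionModel:
    """Dirichlet pseudo-counts over each row of the user's transition
    model; the POMDP is re-solved with the posterior-mean (expected)
    parameter values after each observed dialog."""

    def __init__(self, n_states, prior=1.0):
        self.alpha = np.full((n_states, n_states), prior)

    def observe(self, s, s_next):
        """Incorporate one observed transition from a dialog."""
        self.alpha[s, s_next] += 1.0

    def expected_model(self):
        """Expected transition matrix used for planning."""
        return self.alpha / self.alpha.sum(axis=1, keepdims=True)

m = DirichletTransitionModel(n_states=2)
for _ in range(3):
    m.observe(0, 1)
print(m.expected_model()[0])  # posterior-mean row for state 0
```

The cheap incremental update is what makes re-solving the POMDP with the expected parameters after each dialog affordable.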
Transfer in the context of Reinforcement Learning
Soumi Ray, University of Maryland, Baltimore County
We are investigating the problem of transferring knowledge learned in one domain to
another related domain. Transfer of knowledge from simple domains to more complex
domains can reduce the total training time in the complex domains. We are doing transfer
in the context of reinforcement learning. In the past, knowledge transfer has been
accomplished between domains with the same state and action spaces. Work has also
been done where the state and action spaces of the two domains are different but a
mapping has been provided by humans. We are trying to automate the mapping from the
old domain to the new domain when the state and action spaces are different.
We have two domains D1 and D2, with corresponding state spaces S1 and S2 and action
spaces A1 and A2 where |S1| = |S2| and |A1| = |A2|. Our goal is to transfer a policy learned
in D1 to D2 so as to speed learning in D2. We first run Q-learning in D1 to produce Q-table
Q1. Then we train for a limited time in D2 and generate Q2. Our test bed consists of two
domains in a 16x16 grid world with four actions: North, South, East and West. In the first
domain we trained for 500 iterations and in the second for 20 iterations. The two
approaches that we have used are as follows.
Our goal is to find the mapping between the state spaces S1 and S2 and action spaces A1
and A2. In the first approach we compute the difference between matrices Q1 and Q2 and
greedily find a mapping that minimizes the difference calculated above. With this mapping
we can transfer the Q-values from the completely trained domain D1 to the partially
trained domain D2 to speed up learning in domain D2. We find that it takes fewer steps to
learn completely in the second domain when the Q-values are transferred than learning
from scratch. Our second approach finds the mapping that assigns the highest Q-values of
the states in domain one to the highest Q-values of the states in domain two. This
approach is an improvement over the first approach. It takes many fewer steps to learn in
the second domain using transfer.
We are also interested in finding the mapping when S1 and A1 are subsets of S2 and A2
respectively, i.e. |S1|<|S2| and |A1|<|A2|. This can be handled by allowing a single
state/action in S1/A1 to map to multiple states/actions in S2/A2.
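The first approach can be sketched as a greedy assignment over Q-table rows. This is an illustration under the assumption that states are matched one-to-one by minimal Q-value difference; the authors' exact search may differ.

```python
import numpy as np

def greedy_state_mapping(Q1, Q2):
    """Map each state of the partially trained domain (rows of Q2) to a
    distinct state of the fully trained domain (rows of Q1), greedily
    minimizing the Q-value difference."""
    unused = list(range(Q1.shape[0]))
    mapping = {}
    for s2 in range(Q2.shape[0]):
        best = min(unused, key=lambda s1: np.abs(Q1[s1] - Q2[s2]).sum())
        mapping[s2] = best
        unused.remove(best)
    return mapping

Q1 = np.array([[1.0, 0.0], [0.0, 1.0]])   # fully trained domain
Q2 = np.array([[0.0, 0.9], [0.8, 0.0]])   # partially trained domain
print(greedy_state_mapping(Q1, Q2))  # → {0: 1, 1: 0}
```

The Q-values transferred through this mapping then initialize learning in D2, which is what reduces the number of steps to learn the second domain.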
Joint work with Tim Oates.
This work will also be in Poster Session 2.
Spotlights (Session 1)
Correcting sample selection bias by unlabeled data
Jiayuan Huang, University of Waterloo
The default assumption in many learning scenarios is that training and test data are
independently and identically drawn from the same distribution. When the training and
test distributions do not match, we face the problem commonly referred to as sample
selection bias or covariate shift. This problem occurs in many real-world applications,
including surveys, sociology, biology and economics. It is not hard to see that, given a
skewed selection of the training data, it is impossible to derive a good model for
accurate predictions on the general target, as the training set might not be
representative of the complete population from which the test set is drawn. The
predictions are therefore biased, potentially increasing the errors. Although there exists
previous work addressing this problem, sample selection bias is typically ignored in
standard estimation algorithms. In this work, we utilize the availability of unlabeled
data to direct a sample selection de-biasing procedure for various learning methods.
Unlike most previous algorithms, which first recover the sampling distributions and then
make corrections based on the distribution estimates, our method infers the re-sampling
weights directly by matching the training and test distributions in feature space in a
non-parametric manner. We require no estimates of the biased densities or selection
probabilities, and no knowledge of the class probabilities. Because the matching is done
in feature space, the method handles high-dimensional data. Our experimental results on
many benchmark datasets demonstrate that the method works well in practice. It also
shows good performance in tumor diagnosis using microarrays, suggesting that it promises
to be a valuable tool for cross-platform microarray classification.
Joint work with Alex Smola, Arthur Gretton, Karsten Borgwardt, Bernhard Scholkopf.
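The direct distribution-matching idea can be sketched as follows. This is a minimal illustration assuming an RBF kernel and simple projected gradient descent in place of the quadratic program usually used for kernel mean matching; the constants and function names are ours, not the authors'.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def kmm_weights(X_train, X_test, gamma=1.0, lr=0.5, steps=500):
    """Nonnegative re-sampling weights beta chosen so that the weighted
    kernel mean of the training set matches the kernel mean of the
    unlabeled test set."""
    n, m = len(X_train), len(X_test)
    K = rbf_kernel(X_train, X_train, gamma)
    kappa = rbf_kernel(X_train, X_test, gamma).sum(axis=1) * (n / m)
    beta = np.ones(n)
    for _ in range(steps):
        grad = (K @ beta - kappa) / n             # gradient of the matching objective
        beta = np.maximum(beta - lr * grad, 0.0)  # project onto beta >= 0
    return beta

# Test points lie near x=0, so the training point there gets up-weighted.
X_tr = np.array([[0.0], [5.0]])
X_te = np.array([[0.0], [0.1]])
w = kmm_weights(X_tr, X_te)
print(w)
```

The resulting weights can be plugged into any learner that accepts per-example weights, which is what makes the de-biasing procedure applicable to various learning methods.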
Decision Tree Methods for Finding Reusable MDP Homomorphisms
Alicia Peregrin Wolfe, University of Massachusetts, Amherst
State abstraction is a useful tool for agents interacting with complex environments. Good
state abstractions are compact, reusable, and easy to learn from sample data. This paper
combines and extends two existing classes of state abstraction methods to achieve these
criteria. The first class of methods searches for MDP homomorphisms (Ravindran 2004),
which produce models of reward and transition probabilities in an abstract state space.
The second class of methods, like the UTree algorithm (McCallum 1995), learn compact
models of the value function quickly from sample data. Models based on MDP
homomorphisms can easily be extended such that they are usable across tasks with
similar reward functions. However, value based methods like UTree cannot be extended in
this fashion. We present results showing a new, combined algorithm that fulfills all three
criteria: the resulting models are compact, can be learned quickly from sample data, and
can be used across a class of reward functions.
Joint work with Andrew Barto.
Evaluating a Reputation-based Spam Classification System
Elena Zheleva, University of Maryland, College Park
Over the past several years, spam has been a growing problem for the Internet
community. It interferes with valid e-mail and burdens both e-mail users and ISPs. While
there are various successful automated e-mail filtering approaches that aim at reducing
the amount of spam, there are still many challenges to overcome.
Reactive spam filtering approaches classify a piece of e-mail as spam if it has been
reported as such by a large volume of e-mail users. Unfortunately, by the time the system
responds by blocking the message or automatically placing it in future recipients' spam
folders, the spam campaign has already affected a lot of users. The challenge that we
consider is whether we can reduce the response time, recognizing a spam campaign at an
earlier stage, thus reducing the cost that users and systems incur. Specifically, we are
evaluating the predictive power of a reputation-based spam filtering system, which uses
the feedback only from trustworthy e-mail users.
In a reputation-based or trust-based spam filtering system, the system identifies a set of
users who report spam reliably and trusts their spam reports more than the spam reports
of other users. A message coming into the system is classified as spam if enough reliable
users report it. This automatic spam filtering approach is vulnerable to malicious users
when any anonymous person can subscribe and unsubscribe to the e-mail service. This is
the case with most free e-mail providers such as AOL, Hotmail and Yahoo. We show how to
overcome this problem in this work.
There are two well-known open-source projects which operate in this framework: Vipul's
Razor and Distributed Checksum Clearinghouse. Unfortunately, their reputation systems
work only as a part of their commercially available software counterparts and, due to trade
secrets, it is not clear how the design characteristics such as reputation definition and
metrics affect the system performance. More importantly, the spam reports they receive
are mostly from authorized users (such as business partner company employees), which
reduces the risk of abuse by anonymous users.
The effectiveness of a reputation-based spam filtering system is based on evaluating the
following properties: 1) automatic maintenance of a reliable user set over time, 2) timely
and accurate recognition of a spam campaign, and 3) having a set of guarantees on the
system vulnerability. In our work, we present results from simulating a reputation-based
spam filtering system over a period of time. The evaluation dataset includes all the spam
reports received during that period of time for a particular free e-mail provider. We show
how our algorithms effectively reduce spam campaign response time, while minimizing
system vulnerability.
Joint work with Lise Getoor and Alek Kolcz.
Improving Robot Navigation Through Self-Supervised Online Learning
Ellie Lin, Carnegie Mellon University
In mobile robotics, there are often features that, while potentially powerful for improving
navigation, prove difficult to profit from as they generalize poorly to novel situations.
Overhead imagery data, for instance, has the potential to greatly enhance autonomous
robot navigation in complex outdoor environments. In practice, reliable and effective
automated interpretation of imagery from diverse terrain, environmental conditions, and
sensor varieties proves challenging. Similarly, fixed techniques that successfully interpret
on-board sensor data across many environments begin to fail past short ranges as the
density and accuracy necessary for such computation quickly degrade, and the features that
can be computed from distant data are very domain-specific. We introduce an
online, probabilistic model to effectively learn to use these scope-limited features by
leveraging other features that, while perhaps otherwise more limited, generalize reliably.
We apply our approach to provide an efficient, self-supervised learning method that
accurately predicts traversal costs over large areas from overhead data. We present
results from field-testing on-board a robot operating over large distances in off-road
environments. Additionally, we show how our algorithm can be used offline with overhead
data to produce a priori traversal cost maps and detect misalignments between overhead
data and estimated vehicle positions. This approach can significantly improve the
versatility of many unmanned ground vehicles by allowing them to traverse highly varied
terrains with increased performance.
Joint work with B. Sofman, J. Bagnell, N. Vandapel and A. Stentz.
Spotlights (Session 2)
Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent
Traces
Gita Sukthankar, Carnegie Mellon University
This research addresses the problem of activity recognition for physically embodied agent
teams. We define team activity recognition as the process of identifying team behaviors
from traces of agent positions over time; for many physical domains, military or athletic,
coordinated team behaviors create distinctive spatio-temporal patterns that can be used
to identify low-level action sequences. We focus on the novel problem of recovering agent-
to-team assignments for complex team tasks where team composition, the mapping of
agents into teams, changes over time. Without a priori knowledge of current team
assignments, the behavior recognition problem is challenging since behaviors are
characterized by the aggregate motion of the entire team and cannot generally be
determined by observing the movements of a single agent in isolation.
To handle this problem, we introduce a new algorithm, Simultaneous Team Assignment and
Behavior Recognition (STABR) that generates behavior annotations from spatio-temporal
agent traces. STABR leverages information from the spatial relationships of the team
members to create sets of potential team assignments at selected time-steps. These
spatial relationships are efficiently discovered using a randomized search technique,
RANSAC, to generate potential team assignment hypotheses. Sequences of team
assignment hypotheses are evaluated using dynamic programming to derive a
parsimonious explanation for the entire observed spatio-temporal trace. To prune the
number of hypotheses, potential team assignments are fitted to a parameterized team
behavior model; poorly fitting hypotheses are eliminated before the dynamic programming
phase. The proposed approach is able to perform accurate team behavior recognition
without exhaustive search over the partition set of potential team assignments, as
demonstrated on several scenarios of simulated military maneuvers.
STABR does not simply assume that agents within a certain proximity should be assigned
to the same team; instead it relies on matching static snapshots of agent position against
a database of team formation templates to produce a candidate pool of agent-to-team
assignments. This candidate pool of assignments is verified by running a local spatio-
temporal behavior detector. The intuition is that the aggregate agent movement for an
incorrect team assignment will generally fail to match any behavior model. STABR
significantly outperforms agglomerative clustering on the agent-to-team assignment
problem for traces with dynamic agent composition (95% accuracy).
The scenarios presented here illustrate the operation of STABR in environments that lack
the external cues used by other multi-agent plan recognition approaches, such as
landmarks, cleanly clustered agent teams, and extensive domain knowledge. We believe
that when such cues are available they can be directly incorporated into STABR, both to
improve accuracy and to prune hypotheses. STABR provides a principled framework for
reasoning about dynamic team assignments in spatial domains.
Joint work with Katia Sycara.
An Online Learning System for the Prediction of Electricity Distribution Feeder Failures
Hila Becker, Columbia University
We are using machine learning techniques for constructing a failure-susceptibility ranking
of feeder cables that supply electricity to the boroughs of New York City. The electricity
system is inherently dynamic, and thus our failure-susceptibility ranking system must be
able to adapt to the latest conditions in real time, having to update its ranking accordingly.
The feeders have a significant failure rate, and many resources are devoted to monitoring,
maintenance and repair of feeders. The ability to predict failures allows the shifting from
reactive to proactive maintenance, thus reducing costs.
The feature set for each feeder includes a mixture of static data (e.g. age and composition
of each feeder section) and dynamic data (e.g. electrical load data for a feeder and its
transformers). The values of the dynamic features are captured at the time of training and
therefore lead to different models depending on the time and day at which each model is
trained. Previously, a framework was designed to train models using a new variant of
boosting called Martingale Boosting, as well as Support Vector Machines. However, in this
framework, an engineer had to decide whether to use the most recent data to build a new
model, or use the latest model instead for future predictions.
To avoid the need for human intervention, we have developed an “online” system that
determines what model to use by monitoring past performance of previously trained
models. In our new framework, we treat each batch-trained model as an expert, and use a
measurement of its performance as the basis for reward or penalty of its quality score. We
measure performance as a normalized average rank of failures. For example, in a ranking
of 50 items with actual failures ranked #4 and #20, the performance is: 1 – (4 + 20) /
(2*50) = 0.76.
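The performance measure can be written directly; the worked example from the text serves as a check.

```python
def ranking_performance(list_size, failure_ranks):
    """1 minus the normalized average rank of the actual failures;
    1.0 means all failures were ranked at the very top."""
    return 1.0 - sum(failure_ranks) / (len(failure_ranks) * list_size)

# Failures ranked #4 and #20 in a ranking of 50 items.
print(ranking_performance(50, [4, 20]))  # → 0.76
```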
Our approach builds on the notion of learning from expert advice as formulated in the
continuous version of the Weighted Majority algorithm. Since each model is analogous to
an expert and our system runs live, gathering new data and generating new models, we
have to keep adding new experts to the existing ensemble throughout the algorithm’s
execution. To avoid having to monitor an ever-increasing set of experts, we drop poorly
performing experts after each prediction. We had to address the following key issues in our
solution: (1) how often and with what weight do we add new experts, and (2) what experts
do we drop. Our simulations suggest that using the median of all current models’ weights
for new models works best. To drop experts we use a combination of age of the model and
past performance. Finally, to make predictions we use a weighted average of the top-
scoring experts.
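The ensemble bookkeeping described above can be sketched compactly. The multiplicative update, entry weight, and drop rule here are illustrative choices, not the deployed system's exact parameters.

```python
import statistics

class ExpertEnsemble:
    """Batch-trained models as experts: weights are multiplicatively
    updated from each round's normalized-average-rank performance, new
    experts enter at the median weight, and the worst expert is dropped
    when the ensemble grows too large."""

    def __init__(self, beta=0.9, max_experts=5):
        self.beta = beta              # penalty base for imperfect rankings
        self.max_experts = max_experts
        self.weights = {}             # model name -> quality score

    def add_expert(self, name):
        # New models enter at the median of the current weights.
        w = statistics.median(self.weights.values()) if self.weights else 1.0
        self.weights[name] = w

    def update(self, performances):
        """performances: name -> normalized average rank in [0, 1]."""
        for name, perf in performances.items():
            self.weights[name] *= self.beta ** (1.0 - perf)
        if len(self.weights) > self.max_experts:
            worst = min(self.weights, key=self.weights.get)
            del self.weights[worst]

ens = ExpertEnsemble()
ens.add_expert("june_model")   # hypothetical model names
ens.add_expert("july_model")
ens.update({"june_model": 0.76, "july_model": 0.40})
print(ens.weights)
```

Predictions would then combine the rankings of the top-scoring experts, weighted by these quality scores.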
Our system is currently deployed and being tested by New York City’s electricity
distribution company. Results are highly encouraging, with 75% of the failures in the
summer of 2005 being ranked in the top 26%, and 75% of failures in 2006 being ranked in
the top 36%.
Joint work with Marta Arias.
Classification of fMRI Images: An Approach Using Viola-Jones Features
Melissa K. Carroll, Princeton University
There has been growing interest in using Functional Magnetic Resonance Imaging (fMRI)
for “mind reading,” particularly in applying machine learning methods to classifying fMRI
brain images based on the subject’s instantaneous cognitive state. For instance, Haxby et
al. (2001) perform fMRI scans while subjects are viewing images of one of seven classes of
objects with the goal of discriminating the brain images based on the class of image being
viewed at the time.
Most machine learning approaches used to date for fMRI classification have treated
individual voxels as features and ignored the spatial correlation between voxels (Norman
et al., 2006). We present a novel method for searching this feature space to generate
features that capture spatial information, derived from the Viola and Jones (2001)
algorithm for 2D object detection, and apply it to 2D representations of the images. In this
method, features are computed corresponding to absolute and relative intensities over
regions of varying size and shape, and used by AdaBoost (Schapire and Singer, 1999) to
generate a classifier. Figure 1 (http://www.cs.princeton.edu/~mkc/wiml06/Figure1.jpg)
shows examples of these features overlaid on an actual 2D representation of the 3D fMRI
image. Mean intensities in white regions are subtracted from mean intensities in gray
regions to compute each feature; these features are combined to form the feature vector. One-,
two-, three- and four-rectangle features of all 100 size combinations between 1x1 and
10x10 are computed for all positions in the image.
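Such rectangle features can be computed in constant time per feature with an integral image, as in Viola-Jones. The following is a sketch of a two-rectangle feature; the paper's exact region layouts may differ.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[r, c] = sum of img[:r+1, :c+1]."""
    return np.cumsum(np.cumsum(np.asarray(img, float), axis=0), axis=1)

def region_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) from the integral image."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rectangle_feature(img, r, c, h, w):
    """Mean intensity of an h-by-w 'white' region minus the mean of the
    same-size 'gray' region immediately to its right."""
    ii = integral_image(img)
    white = region_sum(ii, r, c, r + h, c + w) / (h * w)
    gray = region_sum(ii, r, c + w, r + h, c + 2 * w) / (h * w)
    return white - gray

img = np.arange(16).reshape(4, 4)  # toy stand-in for a 2D fMRI slice
print(two_rectangle_feature(img, 0, 0, 2, 1))  # → -1.0
```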
As Figure 2 (http://www.cs.princeton.edu/~mkc/wiml06/Figure2.jpg) shows, including richer
features than the standard one-pixel features can result in improved classification of the
Haxby et al. dataset. One potential limitation of the method is that the large feature set it
produces conflicts with computational limitations; however, Figure 2 shows that even
selecting a small random subset of the richer features can result in an increase in
classification accuracy by 5% or more, although performance varies across subjects. In
addition, the performance of this subset of features can be used to target subsequent
feature selection. Future work needs to be performed to develop reliable and valid
methods for rating feature importance.
Finally, Figure 3 (http://www.cs.princeton.edu/~mkc/wiml06/Figure3.jpg) shows that
confusion among predicted classes occurs most often between classes that are most
similar and for which previous classifiers have encountered difficulty, e.g. male faces and
female faces. This target space similarity structure could be exploited in future work to
improve classification.
1. J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. (2001).
Distributed and overlapping representations of faces and objects in ventral
temporal cortex. Science, (293) 2425-2429.
2. K. A. Norman, S. M. Polyn, G. J. Detre and J. V. Haxby. (2006). Beyond mind-reading:
multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, in press.
3. R.E. Schapire and Y. Singer. (1999). Improved boosting algorithms using confidence-
rated predictions. Machine Learning, 37(3): 297-336.
4. P. Viola and M. Jones. (2001). Rapid object detection using a boosted cascade of
simple features. CVPR 2001.
Joint work with Kenneth A. Norman, James V. Haxby and Robert E. Schapire.
Fast Online Classification with Support Vector Machines
Seyda Ertekin, Penn State University
In recent years, we have witnessed a significant increase in the amount of data in digital
format, due to the widespread use of computers and advances in storage systems. As the
volume of digital information increases, people need more effective tools to better find,
filter and manage these resources. Classification, the assignment of instances (i.e.
pictures, text documents, emails, Web sites etc.) to one or more predefined categories
based on their content, is an important component in many information organization and
management tasks. Support Vector Machines (SVMs) are a popular machine learning
algorithm for classification problems due to their theoretical foundation and good
generalization performance. However, SVMs have not yet seen widespread adoption in
communities working with very large datasets because of the high computational cost of
solving the quadratic programming (QP) problem in the training phase. This research
presents an online SVM learning algorithm, LASVM, which matches the classification
accuracy of state-of-the-art SVM solvers while requiring fewer computational resources:
it needs much less main memory and has a much faster training phase. We also show
that not all the examples are equally informative in the training set. We present methods
to select the most informative examples and exploit those to reduce the computational
requirements of the learning algorithm. We uncover the properties of active learning
algorithms to select the informative examples efficiently from very large-scale training
sets. We will also show the benefits of using a non-convex loss function in SVMs for faster
speeds and lower computational requirements.
Joint work with Leon Bottou, Antoine Bordes and Jason Weston.
Posters (Session 1)
Using Decision Trees for Gabor-based Texture Classification of Tissues in Computed
Tomography
Alia Bashir, DePaul University
This research is aimed at developing an automated imaging system for classification of
tissues in CT images. Classification of tissues in CT scans using shape or gray level
information is challenging due to the changing shape of organs in a stack of images and
the gray level intensity overlap in soft tissues. However, healthy organs are expected to
have a consistent texture within tissues across slices. Given a large enough set of normal-
tissue images, and a good set of texture features, machine learning techniques can be
applied to create an automatic classifier. Previous work from one of the authors explored
texture descriptors based on wavelet, ridgelets, and curvelets for the classification of
tissues from normal chest and abdomen CT scans. These texture descriptors were able to
classify tissues with an accuracy range of 85 - 98%, with curvelet-based texture
descriptors performing the best. In this paper we bridge the gap to perfect accuracy by
focusing on texture features based on a bank of Gabor filters. The approach consists of
three steps: convolution of the regions of interest with a bank of 32 Gabor filters (4
frequencies and 8 orientations), extraction of two Gabor texture features per filter (mean
and standard deviation), and creation of a classifier that automatically identifies the
various tissues. The data set consists of 2D DICOM images from five normal chest and
abdomen CT studies from Northwestern Medical Hospital. The following regions of interest
were segmented out and labeled by an expert radiologist: liver, spleen, kidney, aorta,
trabecular bone, lung, muscle, IP fat, and SQ fat for a total of 1112 images. For each
image, the feature vector consists of the mean and standard deviation of the 32 filtered
images, totaling 64 descriptors. The classification step is carried out using a Classification
and Regression decision tree classifier. A decision tree predicts the class of an object
(tissue) from values of predictor variables (texture descriptors), and generates a set of
decision rules. These sets of rules are then used for the classification of each region of
interest. Both the cross-validation and the random split of the data set into a training set
(~65%) and testing set (~35%) techniques were applied but no significant difference was
observed. The optimal tree had a depth of 20, parent node value set at 10 and child node
value set at 1. To evaluate the performance of each classifier, specificity, sensitivity,
precision, and accuracy rates are calculated from each misclassification matrix. Results
show that this set of texture features is able to perfectly classify the 9 regions of interest.
The Gabor filters’ ability to isolate features at different scales and directions allows for a
multi-resolution analysis of texture essential when dealing with, at times, very subtle
differences in the texture of tissues in CT scans. Given the strong performance in the
classification of healthy tissues, we plan to apply Gabor texture features to the
classification of abnormal tissues.
Joint work with Julie Hasemann and Lucia Dettori.
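The feature-extraction step above can be sketched as follows. The filter parameterization, kernel size, and frequencies here are illustrative choices, not the study's exact bank.

```python
import numpy as np

def gabor_kernel(freq, theta, sigma=2.0, size=9):
    """Real Gabor filter: Gaussian envelope times an oriented cosine."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def filter_region(region, kern):
    """Direct 'valid' cross-correlation (slow but dependency-free)."""
    kh, kw = kern.shape
    H, W = region.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (region[i:i + kh, j:j + kw] * kern).sum()
    return out

def gabor_features(region, freqs, n_orients):
    """Mean and std of the region filtered by each bank member: with
    4 frequencies and 8 orientations this yields 64 descriptors."""
    feats = []
    for f in freqs:
        for k in range(n_orients):
            resp = filter_region(region, gabor_kernel(f, np.pi * k / n_orients))
            feats.extend([resp.mean(), resp.std()])
    return feats

rng = np.random.default_rng(0)
roi = rng.random((16, 16))  # stand-in for a segmented region of interest
fv = gabor_features(roi, [0.1, 0.2, 0.3, 0.4], 8)
print(len(fv))  # → 64
```

The resulting 64-element vectors are what the decision tree classifier consumes.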
VOGUE: A Novel Variable Order-Gap State Machine for Modeling Sequences
Bouchra Bouqata, Rensselaer Polytechnic Institute (RPI)
In this paper we present VOGUE, a new state machine that combines two separate
techniques for modeling long range dependencies in sequential data: data mining and
data modeling. VOGUE relies on a novel Variable-Gap Sequence mining method (VGS), to
mine frequent patterns with different lengths and gaps between elements. It then uses
these mined sequences to build the state machine. We applied VOGUE to the task of
protein sequence classification on real data from the PROSITE protein families. We show
that VOGUE yields significantly better scores than higher-order Hidden Markov Models.
Moreover, we show that VOGUE's classification sensitivity outperforms that of HMMER, a
state-of-the-art method for protein classification.
Joint work with Christopher Carothers, Boleslaw K. Szymanski and Mohammed J. Zaki.
GroZi: a Grocery Shopping Assistant for the Blind
Carolina Galleguillos, U.C. San Diego
Grocery shopping is a common activity that people all over the world perform on a regular
basis. Unfortunately, grocery stores and supermarkets are still largely inaccessible to
people with visual impairments, as they are generally viewed as "high cost" customers. We
propose to develop a computer vision based grocery shopping assistant based on a
handheld device with haptic feedback that can detect different products inside of a store,
thereby increasing the autonomy of blind (or low vision) people to perform grocery
shopping.
Our solution makes use of new computer vision techniques for the task of visual
recognition of specific products inside of a store as specified in advance on a shopping list.
These techniques can take advantage of complementary resources such as RFID, barcode
scanning, and sighted guides. We also present a challenging new dataset of images
consisting of different categories of grocery products that can be used for object
recognition studies.
The use of the system consists of the creation of a shopping list followed by in-store
navigation. In order to create a shopping list we will develop a website accessible to
visually impaired people that stores data and images of different products. The website will
be augmented with new image templates from the community of users that shop with the
device, in addition to images of the same product that are taken in different stores by
different users. This will increase the system's ability to recognize products that change
appearance due to seasonal or promotion reasons. The navigational task includes finding
the correct aisle for the products (based on text detection and character recognition),
avoiding obstacles, finding products and checking out.
A typical grocery store carries around 30,000 items, so recognizing a single object is a
nontrivial task. Assuming a shopping list is generally shorter than 1/1000th of this
amount (i.e., fewer than 30 items), recognition can be constrained to two phases:
detection of objects on a possibly cluttered shelf, and verification of each detected
object against the shopping list. For this task, we intend to use state-of-the-art
object recognition algorithms and develop new approaches for fast identification.
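As an illustration, the list-constrained verification idea can be sketched in a few lines; the function and product names here are hypothetical, not part of the GroZi system.

```python
def constrain_to_list(scores, shopping_list, threshold=0.5):
    """Keep only detections for items on the shopping list, above a confidence
    threshold, and return them ranked by detector confidence.

    scores: dict mapping product id -> detector confidence for the shelf image.
    """
    wanted = set(shopping_list)
    candidates = {p: s for p, s in scores.items() if p in wanted and s >= threshold}
    # A full verification phase would re-check each candidate; here we just rank.
    return sorted(candidates, key=candidates.get, reverse=True)
```

Restricting recognition to the list shrinks the effective label space from ~30,000 products to under 30, which is what makes per-shelf detection tractable.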
Applications of Kernel Minimum Enclosing Ball
Cristina Garcia C., Universidad Central de Venezuela
The minimum enclosing ball (MEB) is a well-studied problem in computational geometry. In
this work we describe a generalization of a simple approximate MEB construction,
introduced by M. Badoiu and K. L. Clarkson, to a feature space MEB using the kernel trick.
The simplicity of the methodology is itself surprising: the MEB algorithm is based only on
geometrical information extracted from a sample of data points, and just two parameters
need to be tuned, the constant of the kernel and the tolerance in the radius of the
approximation. The applicability of the method is demonstrated on anomaly detection and
on less traditional scenarios such as 3D object modeling and path planning. Results are
encouraging and show that even an approximate feature space MEB is able to induce
topology preserving mappings on arbitrary dimensional noisy data as efficiently as other
machine learning approaches.
Joint work with Jose Ali Moreno.
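A minimal sketch of the Badoiu–Clarkson iteration lifted to feature space, assuming a Gaussian kernel and a center maintained as a convex combination of the data points (the abstract does not give implementation details):

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gaussian kernel matrix; gamma is the kernel constant to be tuned."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def kernel_meb(K, n_iter=100):
    """Approximate MEB in feature space: the center is the convex combination
    sum_i alpha_i phi(x_i); each round moves it toward the farthest point."""
    n = K.shape[0]
    alpha = np.zeros(n)
    alpha[0] = 1.0                               # start at an arbitrary point
    for t in range(1, n_iter + 1):
        # squared feature-space distance from the center to every point
        d2 = np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha
        f = int(np.argmax(d2))                   # farthest point
        step = 1.0 / (t + 1)
        alpha *= 1.0 - step                      # convex step toward it
        alpha[f] += step
    d2 = np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha
    return alpha, float(np.sqrt(d2.max()))       # weights and radius estimate
```

The tolerance of the approximation is controlled here by the iteration count; the original algorithm runs O(1/ε²) rounds for a (1+ε)-approximate ball.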
Classification With Cumular Trees
Claudia Henry, Antilles-Guyane
The accurate combination of decision trees and linear separators has been shown to
provide some of the best off-the-shelf classifiers. We describe a new type of such
combination, which we call Cumular (Cumulative Linear) Trees. Cumular Trees are midway
between Oblique Decision Trees and Alternating Decision Trees: more expressive than the
former, and simpler than the latter. We provide an induction algorithm for Cumular Trees,
which is, as we show, a boosting algorithm in the original sense. Experiments
against AdaBoost, C4.5 and OC1 display very good results, especially when dealing with
noisy data.
Joint work with Richard Nock and Franck Nielsen.
Transient Memory in Reinforcement Learning: Why Forgetting Can be Good for You
Anna Koop, University of Alberta
The vast majority of work in machine learning is concerned with algorithms that converge
to a single solution. It is not clear that this is always the most appropriate aim. Consider a
sailor adapting to the ship's motion. She may learn two conditional models: one for walking
when at sea, and another for walking when on land. She may, when memory resources are
limited, learn a best-on-average policy that settles on a compromise among all situations
she has encountered. A more flexible approach might be to quickly adapt the walking
policy to new situations, rather than seeking one final solution or set of solutions.
We explore two cases of transient memory. In the first case, the rate at which individual
parameters change is controlled by meta-parameters. These meta-parameters allow the
agent to ignore irrelevant or random features, to converge where features are consistent
throughout its experience, and otherwise to adapt quickly to changes in the environment.
This approach requires no commitment to the number of parameter sets necessary in a
given environment, but makes the best use of available resources.
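The first case is in the spirit of per-parameter step-size adaptation such as Sutton's IDBD algorithm; the sketch below is one standard formulation, not necessarily the authors' exact method.

```python
import numpy as np

def idbd_step(w, h, beta, x, y, theta=0.01):
    """One IDBD step: weight w[i] has its own step size exp(beta[i]), raised when
    feature i has recently been useful (traced by h) and lowered otherwise."""
    delta = y - w @ x                              # prediction error
    beta = beta + theta * delta * x * h            # adapt per-feature step sizes
    alpha = np.exp(beta)
    w = w + alpha * delta * x                      # per-feature gradient step
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w, h, beta
```

Irrelevant or noisy features drive their step sizes down (effectively ignored), while features that consistently predict the target keep large step sizes, letting the agent re-adapt quickly when the environment changes.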
In the second case, a single solution is stored in long-term parameters, but this solution is
used only as the starting point for learning about a specific situation. This is currently
being applied to the game of Go. At the beginning of a game, the agent's value function
parameters are initialized according to the long-term memory. During the course of a game
these parameters are updated by simulating, from each state, thousands of self-play
games. The short-term parameters learned in this way are used both for action selection
and as the starting point for learning on the next turn, after the opponent has moved.
Actual game-play moves are used to update both the short- and long-term memory. At the
end of the game, the short-term memory is forgotten and the value function parameters
are initialized to the long-term values. This allows the agent to store general knowledge in
long-term memory while adapting quickly to the specific situations encountered in the
current game.
Predicting Task-Specific Webpages for Revisiting
A. Twinkle E. Lettkeman, Oregon State University
Most web browsers track the history of all pages visited, with the intuition that users are
likely to want to return to pages that they have previously accessed. However, the history
viewers in web browsers are ineffective for most users, because of the overwhelming glut
of webpages that appear in the history. Not only does the history represent a potentially
confusing interleaving of many of a user's different tasks, but it also includes many
webpages that would provide minimal or no utility to the user if revisited. This paper
reports on a technique used to dramatically reduce web browsing histories down to pages
that are relevant to the user's current task context and have a high likelihood of being
desirable to revisit. We briefly describe how the TaskTracer system maintains an awareness
of a user's tasks and semi-automatically segments the web browsing history by task. We
then present a technique that is used to predict whether webpages previously visited on a
task will be of future value to the user and are worth displaying in the history user
interface. Our approach uses a combination of heuristics and machine learning to evaluate
the content of a page and interactions with the page to learn a predictive model of
webpage relevance for each user task. We show the results of an empirical evaluation of
this technique based on user data. This approach could be applied to systems that include
tracking of webpage resources to predict future value of resources and to lower costs of
finding and reusing webpages to the user. Our findings suggest that prediction of web
pages is highly user- and task-specific, and that the choice of prediction algorithms is not
obvious. In future work we aim to refine the features used to predict revisitability. We will
analyze the effect of better text feature extraction in conjunction with user interest
indicators such as reading time, scrolling behavior, and text selection. Preliminary analysis
indicates that applying these refinements may increase the accuracy of our prediction
models.
Joint work with Simone Stumpf, Jed Irvine and Jonathan Herlocker.
Hyper-parameters auto-setting using regularization path for SVM
Gaëlle Loosli, INSA de Rouen
In the context of classification tasks, Support Vector Machines are now very popular.
However, their utilization by neophyte users is still hampered by the need to supply values
for control parameters in order to get the best attainable results. Mainly, given clean data,
SVM users must make three choices: the type of kernel, its bandwidth, and the
regularization parameter. It would be convenient to provide users with a push-button SVM
that would be able to auto-set to the best possible values. This paper presents a new
method that approaches this goal. Given the importance of this problem for reaping all
the potential benefits of the use of SVM, many research works have been dedicated to
ways of helping the setting of the parameters. Most rely either on outer measures, such as
cross-validation, to guide the selection, or on measures embedded in the learning method
itself. In place of empirical approaches to the setting of the control parameters,
regularization paths have been proposed and widely studied these past years since they
provide a smart and fast way to access all the optimal solutions of a problem according to
all compromises between bias and variance for regression or compromises between bias
and regularity in classification. For instance, in the case of classification tasks, as studied
in this paper, soft-margin SVMs deal with non-separable problems thanks to slack variables
parametrized by a slack trade-off (usually noted C, the regularization
parameter). In the usual formulation of the soft-margin SVM, this trade-off takes its
value between 0 (random) and infinity (hard margins). The nu-SVM technique reformulates
the SVM problem so that C is replaced by a nu parameter taking values in [0,1]. This
normalized parameter has a more intuitive meaning: it represents the minimal proportion
of points in the solution and the maximal proportion of misclassified points.
However, having the whole regularization path is not enough. Indeed, the end user still
needs to retrieve from it the best values for the regularization parameters. Instead of
selecting these values by k-fold cross-validation or leave-one-out, or other approximations,
we propose to include the leave-one-out estimator inside the regularization path in order
to have an idea of the generalization error at each step. We explain why it is less
expensive than selecting the best parameter a posteriori and give a method to stop
learning before reaching the end of the path, saving useless effort. Contrary to what is
usually done for regularization paths, our method does not start with all points as support
vectors. In doing so, we avoid computing the whole Gram matrix at the first step.
Then, since the proposed method stops on the path, this extreme non-sparse solution is
never attained and thus the whole Gram matrix is never required. One of the main
advantages of this is that the setting can be used for large databases.
The Influence of Ranker Quality on Rank Aggregation Algorithms
Brandeis Marshall, Rensselaer Polytechnic Institute
The rank aggregation problem has been studied extensively in recent years with a focus
on how to combine several different rankers to obtain a consensus aggregate ranker. We
study the rank aggregation problem from a different perspective: how the individual input
rankers impact the performance of the aggregate ranker. We develop a general statistical
framework based on a model of how the individual rankers depend on the ground truth
ranker. Within this framework, one can study the performance of different aggregation
methods. The individual rankers, which are the inputs to the rank aggregation algorithm,
are statistical perturbations of the ground truth ranker. With rigorous experimental
evaluation, we study how noise level and the misinformation of the rankers affect the
performance of the aggregate ranker. We introduce and study a novel Kendall-tau rank
aggregator and a simple aggregator called PrOpt, which we compare to other
well-known rank aggregation algorithms such as average, median and Markov chain
aggregators. Our results show that the relative performance of aggregators varies
considerably depending on how the input rankers relate to the ground truth.
Joint work with Sibel Adali and Malik Magdon-Ismail.
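For concreteness, the Kendall-tau distance and a simple average-rank aggregator (one of the baselines mentioned above) can be written as:

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """Number of item pairs ordered differently by two rankings of the same
    items (lists of item ids, best first)."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

def average_rank_aggregate(rankings):
    """Rank items by their mean position across all input rankings."""
    items = rankings[0]
    avg = {it: sum(r.index(it) for r in rankings) / len(rankings) for it in items}
    return sorted(items, key=lambda it: avg[it])
```

Within the statistical framework above, each input ranking would be a perturbation of a ground-truth ranking, and aggregator quality is measured by the Kendall-tau distance of the aggregate back to the ground truth.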
Learning for Route Planning under Uncertainty
Evdokia Nikolova, Massachusetts Institute of Technology
We present new complexity results and efficient algorithms for optimal route planning in
the presence of uncertainty. We employ a decision theoretic framework for defining the
optimal route: for a given source S and destination T in the graph, we seek an ST-path of
lowest expected cost where the edge travel times are random variables and the cost is a
nonlinear function of total travel time. Although this is a natural model for route planning
on real-world road networks, results are sparse due to the analytic difficulty of finding
closed form expressions for the expected cost, as well as the computational/combinatorial
difficulty of efficiently finding an optimal path, which minimizes the expected cost.
We identify a family of appropriate cost models and travel time distributions that are
closed under convolution and physically valid. We obtain hardness results for routing
problems with a given start time and cost functions with a global minimum, in a variety of
deterministic and stochastic settings. In general the global cost is not separable into edge
costs, precluding classic shortest-path approaches. However, using partial minimization
techniques, we exhibit an efficient solution via dynamic programming with low polynomial
complexity.
We then consider an important special case of the problem, in which the goal is to
maximize the probability that the path length does not exceed a given threshold value
(deadline). We give a surprising exact n^{Θ(log n)} algorithm for the case of normally distributed
edge lengths, which is based on quasi-convex maximization. We then prove average and
smoothed polynomial bounds for this algorithm, which also translate to average and
smoothed bounds for the parametric shortest path problem, and extend to a more general
non-convex optimization setting. We also consider a number of other edge length
distributions, giving a range of exact and approximation schemes.
Our offline algorithms can be adapted to give online learning algorithms via the Kalai-
Vempala approach of converting an offline optimization solution into an efficient online one.
Joint work with Matthew Brand, David Karger, Jonathan Kelner and Michael Mitzenmacher.
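For the special case of normally distributed edge lengths, evaluating the objective for a fixed path is straightforward, since the path length is itself normal; a sketch (edge travel times are assumed independent):

```python
import math

def on_time_probability(path_edges, deadline):
    """P(total travel time <= deadline) for a path whose edge times are
    independent normals, given as (mean, variance) pairs."""
    mu = sum(m for m, v in path_edges)
    var = sum(v for m, v in path_edges)
    z = (deadline - mu) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
```

The hard part is maximizing this quantity over exponentially many paths; for a fixed path the evaluation is closed-form in the path's total mean and variance, which is what the quasi-convex maximization exploits.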
A Neurocomputational Model of Impaired Imitation
Biljana Petreska, Ecole Polytechnique Federale de Lausanne
This abstract addresses the question of human imitation through convergent evidence
from neuroscience, using tools from machine learning. In particular, we consider a deficit
in imitation of meaningless gestures (i.e., hand postures relative to the head) following
callosal brain lesion (i.e., disconnected hemispheres). We base our work on the rationale
that looking at how imitation is impaired in apraxic patients can unveil its underlying
neural principles. We ground the functional architecture and information flow of our model
in brain imaging studies. Finally, findings from monkey brain neurophysiological studies
drive the choice of implementation of our processing modules. Our neurocomputational
model of visuo-motor imitation is based on self-organizing maps receiving sensory input
(i.e., visual, tactile or proprioceptive) with associated activities [1]. We train the
connections between the maps with anti-Hebbian learning to account for the
transformations required to translate the observed visual stimulus into the
corresponding tactile and proprioceptive information that will guide the imitative
gesture. Patterns of impairment of the model, realized by adding uncertainty in the
transfer of information between the networks, reproduce the deficits found in a clinical
examination of visuo-motor imitation of meaningless gestures [2]. The model makes
hypotheses on the type of representation used and the neural mechanisms underlying
human visuo-motor imitation. The model also helps to better understand the
occurrence and nature of imitation errors in patients with brain lesions.
[1] B. Petreska, and A.G. Billard. A Neurocomputational Model of an Imitation Deficit
following Brain Lesion. In Proceedings of 16th International Conference on Artificial Neural
Networks (ICANN 2006), Athens (Greece). To appear.
[2] G. Goldenberg, K. Laimgruber, and J. Hermsdörfer. Imitation of gestures by
disconnected hemispheres. Neuropsychologia, 39:1432–1443, 2001.
Joint work with A. G. Billard.
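A common form of the anti-Hebbian rule mentioned above weakens connections between co-active units, decorrelating the two maps' activities over time; the model's exact variant may differ:

```python
import numpy as np

def anti_hebbian_step(W, pre, post, eta=0.05):
    """Anti-Hebbian update: the weight between a co-active pre/post unit pair
    is decreased in proportion to the product of their activities."""
    return W - eta * np.outer(post, pre)
```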
Bayesian Estimation for Autonomous Object Manipulation Based on Tactile Sensors
Anya Petrovskaya, Stanford University
We consider the problem of autonomously estimating position and orientation of an object
from tactile data. When initial uncertainty is high, precisely estimating all six parameters
is computationally expensive. We propose an efficient Bayesian approach that is
able to estimate all six parameters in both unimodal and multimodal scenarios. The
approach is termed Scaling Series sampling as it estimates the solution region by samples.
It performs the search using a series of successive refinements, gradually scaling the
precision from low to high. Our approach can be applied to a wide range of manipulation
tasks. We demonstrate its portability on two applications: (1) manipulating a box and (2)
grasping a door handle.
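The abstract does not give the algorithmic details of Scaling Series sampling; the following is only a generic one-dimensional successive-refinement sampler that conveys the broad-to-fine idea:

```python
import numpy as np

def scaling_series_estimate(loglik, bounds, n_samples=200, n_rounds=6, keep=0.25, seed=0):
    """Successive-refinement sampling: sample the region broadly, keep the
    best-scoring samples, shrink the region around them, and repeat at
    progressively higher precision."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    for _ in range(n_rounds):
        samples = rng.uniform(lo, hi, n_samples)
        scores = np.array([loglik(s) for s in samples])
        best = samples[np.argsort(scores)[-int(keep * n_samples):]]
        lo, hi = best.min(), best.max()           # refine the solution region
    return float(best.mean())
```

The real problem is six-dimensional and possibly multimodal, so the actual algorithm must track multiple solution regions rather than a single shrinking interval.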
Joint work with Oussama Khatib, Sebastian Thrun and Andrew Y. Ng.
Therapist Robot Behavior Adaptation for Post-stroke Rehabilitation Therapy
Adriana Tapus, University of Southern California
Research into Human-Robot Interaction (HRI) for socially assistive applications is in its
infancy. Socially assistive robotics, which focuses on the social interaction, rather than the
physical interaction between the robot and the human user, has the potential to enhance
the quality of life for large populations of users. Post-stroke rehabilitation is one of the
largest potential application domains, since stroke is a dominant cause of severe disability
in the growing ageing population. In the US alone, over 750,000 people suffer a new stroke
each year, with the majority sustaining some permanent loss of movement [Institute06].
This loss of function, termed "learned disuse", can improve with rehabilitation therapy
during the critical post-stroke period. One of the most important elements of any
rehabilitation program is carefully directed, well-focused and repetitive practice of
exercises, which can be passive and active.
Our work focuses on hands-off therapist robots that assist, encourage, and socially interact
with patients during their active exercises. Our previous research demonstrated, through
real world experiments with stroke patients [Tapus06b, Eriksson05, Gockley06], that the
physical embodiment (including shared physical context and physical movement of the
robot), the encouragements, and the monitoring play key roles in patient compliance with
rehabilitation exercises.
In the current work we investigate the role of the robot’s personality in the hands-off
therapy process. We focus on the relationship between the level of
extroversion/introversion (as defined in the Eysenck model of personality [Eysenck91]) of the
robot and the user, addressing the following research questions: 1. How should we model
the behavior and encouragement of the therapist robot as a function of the personality of
the user and the number of exercises performed? 2. Is there a relationship between the
extroversion-introversion personality spectrum based on the Eysenck model and the
challenge-based vs. nurturing style of patient encouragement?
To date, little research into human-robot personality matching has been performed. Some
of our recent results showed the preference for personality matching between users and
socially assistive robots [Tapus06a]. Our therapist robot behavior adaptation system
monitors the number of exercises/minute performed by the human/patient, indicating the
level of engagement and/or fatigue, and changes the robot’s behavior in order to
maximize this level. The socially assistive therapist robot (see Figure 1) is equipped with a
basis set of behaviors that will explicitly express its desires and intentions in a physical and
verbal way that is observable to the user/patient. These behaviors involve the control of
physical distance, gestural expression, and verbal expression (tone and content). The
number of exercises/minute is therefore used as a reward that maximizes the response of
the system.
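As a toy illustration of using exercises-per-minute as a reward, a hill-climbing adaptation of a single behavior parameter might look as follows; the noise-free reward function and parameter name are assumptions for the sketch, not the authors' system:

```python
import random

def adapt_behavior(measure_rate, level=0.5, step=0.1, n_trials=30, seed=0):
    """Nudge the robot's extroversion level up or down and keep changes that
    raise the observed exercises-per-minute reward."""
    rng = random.Random(seed)
    best_rate = measure_rate(level)
    for _ in range(n_trials):
        candidate = min(1.0, max(0.0, level + rng.choice([-step, step])))
        rate = measure_rate(candidate)
        if rate > best_rate:                 # keep only improvements
            level, best_rate = candidate, rate
    return level
```

In practice the reward is noisy and non-stationary (engagement and fatigue change within a session), so the real adaptation must average over observations rather than trust single measurements.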
Hands-off robot post-stroke rehabilitation therapy holds great promise of improving patient
compliance in the recovery program. Our work aims toward developing and testing a
model of compatibility between human and robot personality in the assistive context,
based on the PEN theory of personality and toward building a customized therapy protocol.
Examining and answering these issues will begin to address the role of assistive robot
personality in enhancing patient compliance.
[Eriksson05] Eriksson, J., Matarić, M., J., and Winstein, C. "Hands-off assistive robotics for
post-stroke arm rehabilitation", In Proceedings of the International Conference on
Rehabilitation Robotics (ICORR-05), Chicago, Illinois, June 2005.
[Eysenck91] Eysenck, H., J. "Dimensions of personality: 16, 5 or 3? Criteria for a taxonomic
paradigm", In Personality and individual differences, vol. 12, pp.773-790, 1991.
[Gockley06] Gockley, R., and Matarić, M., J. "Encouraging Physical Therapy Compliance
with a Hands-Off Mobile Robot", In Proceedings of the First International Conference on
Human Robot Interaction (HRI-06), Salt Lake City, Utah, March 2006.
[Institute06] "Post-Stroke Rehabilitation Fact Sheet", National Institute of Neurological
Disorders and Stroke, January 2006.
[Tapus06a] Tapus, A. and Matarić, M., J. (2006) "User Personality Matching with Hands-Off
Robot for Post-Stroke Rehabilitation Therapy", In Proceedings of the 10th International
Symposium on Experimental Robotics (ISER), Rio de Janeiro, Brazil, July 2006.
[Tapus06b] Tapus, A. and Matarić, M., J. (2006) "Towards Socially Assistive Robotics",
International Journal of the Robotics Society of Japan (JRSJ), 24(5), pp. 576-578, July 2006.
Joint work with Maja J. Matarić.
Learning How To Teach
Cynthia Taylor, University of California, San Diego
The goal of the RUBI project is to develop a social robot (RUBI) that can interact with
children and teach them in an autonomous manner. As part of the project we are currently
focusing on the problem of teaching 18- to 24-month-old children skills targeted by the
California Department of Education as appropriate for this age group.
In particular we are focusing on teaching the children to identify objects, shapes and
colors. We have seven RFID-tagged stuffed toys, in the shapes of common objects like a
slice of watermelon or a waffle. RUBI says the name of the object and shows a picture of it
on her touch screen, and the children hand her a toy, which she identifies as correct or
incorrect. She keeps track of the right and wrong answers for each toy.
RUBI has a touch screen on her stomach that she can use to show short videos and play games
with the children. By recording when the children touch her stomach, the screen also
provides important information about whether or not the children are engaged. She has
two Apple iSight cameras for eyes, and runs machine learning software that lets her detect
both faces and smiles. The smile detection lets her gauge people’s moods during social
interaction, and respond accordingly. She has an RFID reader in her right hand, letting her
identify RFID-tagged toys.
The machine learning aspect of this problem is deciding how to use the information from her
perceptual primitives so as to teach the material effectively. After each
question/answer, RUBI has to decide whether to continue playing her current learning
game or switch to another activity, and what question to ask next if she continues playing
the game. She also has to decide what to do in situations where she asks a question and
does not get an answer for a long period of time. Unlike many standard AI problems like
chess, RUBI works in continuous time, with no discrete turns.
We are approaching the problem from the point of view of control theory. Exact solutions to
the optimal teaching problem exist for some simple models of learning, such as the Atkinson
and Bower learning model. We are planning to find approximate solutions to this control
problem using reinforcement learning methods. We will complement formal and
computational analysis with an ethnographic study of how teachers teach children
the same task. Our focus will be on understanding both the timing and the sources of
information teachers use to adapt their teaching strategies.
Joint work with Paul Ruvolo, Ian Fasel, Javier R. Movellan.
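The Atkinson and Bower all-or-none model mentioned above is simple to state: an item is either learned or not; an unlearned item is answered correctly only by guessing (probability g), and each presentation teaches it with probability c. A simulation sketch:

```python
import random

def simulate_all_or_none(n_trials, c=0.3, g=0.25, seed=0):
    """Simulate one item under the all-or-none learning model; returns the
    per-trial correctness sequence."""
    rng = random.Random(seed)
    learned = False
    outcomes = []
    for _ in range(n_trials):
        correct = learned or (rng.random() < g)   # guess if not yet learned
        outcomes.append(correct)
        if not learned and rng.random() < c:      # presentation may teach it
            learned = True
    return outcomes
```

Because the model has so little state, the optimal presentation schedule can be computed exactly, which is the sense in which exact solutions to the teaching control problem exist.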
Strategies for improving face recognition in video using machine learning methods
Deborah Thomas, University of Notre Dame
Surveillance cameras are a common feature in many stores and public places. There are
many applications for face recognition from video streams in the area of law enforcement.
However, while face recognition from high-quality still images has been very successful,
face recognition from video is a relatively new area with significant room for
improvement. Furthermore, when using video as our data, we can exploit the fact that
there are multiple frames to choose from to improve recognition performance. So, instead
of representing subjects using a single high quality image, they can be represented using a
set of frames chosen from the frames in a video clip. However, we want to select as many
distinct frames for an individual as possible. This allows for the diversity in the training
space, thereby improving the generalization capacity of the learned face recognition
classifier.
In this work, we consider two different approaches. The commonality between the two
approaches is Principal Component Analysis. Given the high dimensionality of the data,
PCA is often warranted, not only to reduce the dimensionality but also to construct
uncorrelated dimensions. In our first approach, we use a nearest neighbor algorithm with
Mahalanobis Cosine (MahCosine) distance measure. A pair of images in which the faces
differ from each other in pose and expression will have a larger MahCosine distance
between them, so we can use this as a measure of difference between frames. In the
second approach, we project the images into PCA space and then use K-means clustering
to group all the frames from one subject and pick one image per cluster to make up the
representation set. Here again, images that are similar to each other will be in the same
cluster, while more distinct images will fall in different clusters. Beyond the difference
between frames, we also incorporate a quality metric of the face when picking the frames,
and this yields a higher recognition rate.
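The second approach can be sketched as follows, assuming plain k-means on the PCA projection with one representative frame per cluster (details such as the quality metric are omitted):

```python
import numpy as np

def select_frames(frames, k, n_components=10, n_iter=20):
    """Pick k mutually distinct frames: project with PCA, cluster with k-means,
    and keep the frame nearest each cluster centroid."""
    X = frames - frames.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:n_components].T                 # PCA projection
    idx = [0]                                   # farthest-point seeding spreads centers
    for _ in range(k - 1):
        d = ((Z[:, None] - Z[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d.argmax()))
    centers = Z[idx].copy()
    for _ in range(n_iter):                     # plain k-means
        labels = ((Z[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(0)
    d = ((Z[:, None] - centers[None]) ** 2).sum(-1)
    return sorted({int(d[:, j].argmin()) for j in range(k)})
```

Each cluster groups frames with similar pose and expression, so taking one frame per cluster yields the diverse representation set the abstract describes.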
We demonstrate our approach using two different datasets. First, we compare our
approach to the approach used by Lee et al. in 2003 (Video-based Face Recognition Using
Appearance Manifolds) and 2005 (Visual Tracking and Recognition using Probabilistic
Appearance Manifolds). They use appearance manifolds to represent their subjects and
use planes in PCA space for the different poses. We show that our approach performs