Machine Learning:
Theory, Applications, Experiences
A Workshop for Women in Machine Learning



            October 4, 2006
             San Diego, CA



    http://www.seas.upenn.edu/~wiml/
Workshop Organization

         Organizers: Lisa Wainer, University College London
                     Hanna Wallach, University of Cambridge
                     Jennifer Wortman, University of Pennsylvania


     Faculty advisor: Amy Greenwald, Brown University


Additional reviewers: Maria-Florina Balcan     Kristina Klinkner
                      Melissa Carroll          Bethany Leffler
                      Kimberley Ferguson       Ozgur Simsek
                      Katherine Heller         Alicia Peregrin Wolfe
                      Julia Hockenmaier        Elena Zheleva
                      Rebecca Hutchinson




                  Thanks to our generous sponsors:




Schedule

October 3, 2006

     19:30 Workshop dinner

October 4, 2006

     08:45 Registration and poster set-up

     09:00 Welcome

     09:15 Invited talk: A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
           Amy Greenwald, Brown University

     09:45 On a Theory of Learning with Similarity Functions
           Maria-Florina Balcan, Carnegie Mellon University

     10:00 Matrix Tile Analysis
           Inmar Givoni, University of Toronto

     10:15 Towards Bayesian Black Box Learning Systems
           Jo-Anne Ting, University of Southern California

     10:30 Coffee break

     10:45 Invited talk: Clustering High-Dimensional Data
           Jennifer Dy, Northeastern University

     11:15 Efficient Bayesian Algorithms for Clustering
           Katherine Heller, Gatsby Unit, University College London

     11:30 Hidden Process Models
           Rebecca Hutchinson, Carnegie Mellon University

     11:45 Invited talk: Recent advances in near-neighbor learning
           Maya Gupta, University of Washington

     12:15 Spotlight talks:

            Correcting sample selection bias by unlabeled data
            Jiayuan Huang, University of Waterloo

            Decision Tree Methods for Finding Reusable MDP Homomorphisms
            Alicia Peregrin Wolfe, University of Massachusetts, Amherst

            Evaluating a Reputation-based Spam Classification System
            Elena Zheleva, University of Maryland, College Park

            Improving Robot Navigation Through Self-Supervised Online Learning
            Ellie Lin, Carnegie Mellon University

     12:30 Lunch

13:00 Poster session 1

13:45 Invited talk: SRL: Statistical Relational Learning
      Lise Getoor, University of Maryland, College Park

14:15 Generalized statistical methods for fraud detection
      Cecile Levasseur, University of California, San Diego

14:30 Kernels for the Predictive Regression of Physical, Chemical and Biological
      Properties of Small Molecules
      Chloe-Agathe Azencott, University of California, Irvine

14:45 Invited talk: Modeling and Learning User Preferences for Sets of Objects
      Marie desJardins, University of Maryland, Baltimore County

15:15 Coffee break

15:30 Efficient Exploration with Latent Structure
      Bethany Leffler, Rutgers University

15:45 Efficient Model Learning for Dialog Management
      Finale Doshi, MIT

16:00 Transfer in the context of Reinforcement Learning
      Soumi Ray, University of Maryland, Baltimore County

16:15 Spotlight talks:

            Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent Traces
      Gita Sukthankar, Carnegie Mellon University

      An Online Learning System for the Prediction of Electricity Distribution
      Feeder Failures
      Hila Becker, Columbia University

      Classification of fMRI Images: An Approach Using Viola-Jones Features
      Melissa K Carroll, Princeton University

      Fast Online Classification with Support Vector Machines
      Seyda Ertekin, Penn State University

16:30 Poster session 2

17:15 Open discussion

17:45 Closing remarks and poster take-down

18:00 End of workshop




Invited Talks

A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
Amy Greenwald, Brown University

      No-regret learning algorithms have attracted a great deal of attention in the game
      theoretic and machine learning communities. Whereas rational agents act so as to
      maximize their expected utilities, no-regret learners are boundedly rational agents that act
      so as to minimize their "regret". In this talk, we discuss the behavior of no-regret learning
      algorithms in repeated games.

      Specifically, we introduce a general class of algorithms called no-Φ-regret learning, which
       includes common variants of no-regret learning such as no-external-regret and no-internal-
      regret learning. Analogously, we introduce a class of game-theoretic equilibria called Φ-
      equilibria. We show that no-Φ-regret learning algorithms converge to Φ-equilibria. In
      particular, no-external-regret learning converges to minimax equilibrium in zero-sum
      games; and no-internal-regret learning converges to correlated equilibrium in general-sum
      games. Although our class of no-regret algorithms is quite extensive, no algorithm in this
      class learns Nash equilibrium.
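
       As a rough illustration of one member of this family, the sketch below implements
       unconditional regret matching, a no-external-regret learner, in rock-paper-scissors;
       the payoff matrix, horizon, and random seed are illustrative choices, not details
       from the talk.

       import numpy as np

       rng = np.random.default_rng(0)
       A = np.array([[ 0., -1.,  1.],
                     [ 1.,  0., -1.],
                     [-1.,  1.,  0.]])   # row player's payoffs in rock-paper-scissors

       def play(A, T=20000):
           n, m = A.shape
           r_row, r_col = np.zeros(n), np.zeros(m)   # cumulative external regrets
           counts = np.zeros(n)
           for _ in range(T):
               # Mix proportionally to positive regret; play uniformly if none.
               p = np.maximum(r_row, 0.); p = p / p.sum() if p.sum() > 0 else np.full(n, 1. / n)
               q = np.maximum(r_col, 0.); q = q / q.sum() if q.sum() > 0 else np.full(m, 1. / m)
               i, j = rng.choice(n, p=p), rng.choice(m, p=q)
               # External regret: payoff each fixed action would have earned instead.
               r_row += A[:, j] - A[i, j]
               r_col += -A[i, :] + A[i, j]
               counts[i] += 1
           return counts / T

       print(play(A))   # empirical play approaches the minimax strategy (1/3, 1/3, 1/3)

       In the talk's terminology, the fixed-action comparators here correspond to one
       particular choice of Φ; richer choices of Φ yield no-internal-regret learning and
       convergence to correlated equilibria.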

      Speaker biography:

      Dr. Amy Greenwald is an assistant professor of computer science at Brown University in
      Providence, Rhode Island. Her primary research area is the study of economic interactions
      among computational agents. Her primary methodologies are game-theoretic analysis
      and simulation. Her work is applicable in areas ranging from dynamic pricing to
      autonomous bidding to transportation planning and scheduling. She was awarded a Sloan
      Fellowship in 2006; she was nominated for the 2002 Presidential Early Career Award for
      Scientists and Engineers (PECASE); and she was named one of the Computing Research
      Association's Digital Government Fellows in 2001. Before joining the faculty at Brown, Dr.
      Greenwald was employed by IBM's T.J. Watson Research Center, where she researched
      Information Economies. Her paper entitled "Shopbots and Pricebots" (joint work with Jeff
      Kephart) was named Best Paper at IBM Research in 2000.



Clustering High-Dimensional Data
Jennifer Dy, Northeastern University

      Creating effective algorithms for unsupervised learning is important because vast amounts
      of data preclude humans from manually labeling the categories of each instance. In
      addition, human labeling is expensive and subjective. Therefore, a majority of existing
      data is unsupervised (unlabeled). The goal of unsupervised learning or cluster analysis is
      to group "similar" objects together. "Similarity" is typically defined by a metric or a
      probability model. These measures are highly dependent on the features representing the
      data. Many clustering algorithms assume that relevant features have been determined by
       the domain experts. However, not all features are important. Moreover, many clustering
       algorithms fail when dealing with high-dimensional data. We present two approaches to
       clustering in high-dimensional spaces: 1. Feature selection for clustering, through
      Gaussian mixtures and the maximum likelihood and scatter separability criteria, and 2.
      Hierarchical feature transformation and clustering, through automated hierarchical
      mixtures of probabilistic principal component analyzers.




Speaker biography:

       Dr. Jennifer G. Dy has been an assistant professor in the Department of Electrical and
       Computer Engineering at Northeastern University, Boston, MA, since 2002. She obtained
       her MS and PhD in 1997 and 2001, respectively, from the School of Electrical and Computer
       Engineering, Purdue University, West Lafayette, IN, and her BS degree in 1993 from the
       Department of Electrical Engineering, University of the Philippines. She received an NSF
       CAREER award in 2004. She has served on the editorial board of the journal Machine
       Learning since 2004 and was publications chair for the International Conference on
       Machine Learning in 2004. Her research interests include Machine Learning, Data Mining,
       Statistical Pattern Recognition, and Computer Vision.



Recent advances in near-neighbor learning
Maya R. Gupta, University of Washington

       Recent advances in nearest-neighbor learning are presented for adaptive neighborhood
       definitions, neighborhood weighting, and estimation given the nearest neighbors. In particular,
      it is shown that weights that solve linear interpolation equations minimize the first-order
      learning error, and this is coupled with the principle of maximum entropy to create a
      flexible weighting approach. Different approaches to adaptive neighborhoods are
      contrasted, the focus being on neighborhoods that form a convex hull around the test
      point. Standard weighted nearest-neighbor estimation is shown to maximize likelihood,
      and it is shown that minimizing expected Bregman divergence instead leads to optimal
      solutions in terms of expected misclassification cost. Applications may include the testing
      of pipeline integrity, custom color enhancements, and estimation for color management.
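
       A minimal sketch of the interpolation-weight idea: choose neighbor weights that satisfy
       the linear interpolation equations, here taking the minimum-norm solution as a simple
       stand-in for the maximum-entropy choice described in the talk (all data are illustrative).

       import numpy as np

       def interpolation_weights(neighbors, x0):
           # Solve sum_i w_i * x_i = x0 together with sum_i w_i = 1; among the
           # many solutions, lstsq returns the minimum-norm one.
           A = np.vstack([neighbors.T, np.ones(len(neighbors))])
           b = np.append(x0, 1.0)
           w, *_ = np.linalg.lstsq(A, b, rcond=None)
           return w

       neighbors = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
       x0 = np.array([0.25, 0.5])
       w = interpolation_weights(neighbors, x0)
       print(w, w @ neighbors)   # the weighted neighbors reconstruct the test point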

      Speaker biography:

      Maya Gupta completed her Ph.D. in Electrical Engineering in 2003 at Stanford University as
      a National Science Foundation Graduate Fellow. Her undergraduate studies led to a BS in
      Electrical Engineering and a BA in Economics from Rice University in 1997. From 1999-
      2003 she worked for Ricoh's California Research Center as a color image processing
      research engineer. In the fall of 2003, she joined the EE faculty of the University of
       Washington as an Assistant Professor, where she also serves as an Adjunct Assistant
      Professor of Applied Mathematics. More information about her research is available at her
      group's webpage: idl.ee.washington.edu.



Modeling and Learning User Preferences for Sets of Objects
Marie desJardins, University of Maryland, Baltimore County

      Most work on preference learning has focused on pairwise preferences or rankings over
      individual items. In many application domains, however, when a set of items is presented
      together, the individual items can interact in ways that increase (via complementarity) or
      decrease (via redundancy or incompatibility) the quality of the set as a whole.

      In this talk, I will describe the DD-PREF language that we have developed for specifying
      set-based preferences. One problem with such a language is that it may be difficult for
      users to explicitly specify their preferences quantitatively. Therefore, we have also
      developed an approach for learning these preferences. Our learning method takes as
      input a collection of positive examples―that is, one or more sets that have been identified
      by a user as desirable. Kernel density estimation is used to estimate the value function for
       individual items, and the desired set diversity is estimated from the average set diversity
       observed in the collection.
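
       A toy sketch of this estimation step (the two-feature items and the Euclidean diversity
       measure below are invented for illustration; DD-PREF itself is richer): kernel density
       estimates give per-feature value functions, and the target diversity is the average
       diversity of the positive example sets.

       import numpy as np
       from scipy.stats import gaussian_kde

       # Positive examples: two user-selected sets, one item per row.
       sets = [np.array([[0.10, 0.90], [0.20, 0.80], [0.15, 0.95]]),
               np.array([[0.90, 0.10], [0.80, 0.20], [0.85, 0.15]])]
       items = np.vstack(sets)

       # Per-feature value functions via kernel density estimation.
       kdes = [gaussian_kde(items[:, j]) for j in range(items.shape[1])]

       def item_value(x):
           # Average estimated density across an item's features.
           return float(np.mean([kde(x[j])[0] for j, kde in enumerate(kdes)]))

       def set_diversity(s):
           # Mean pairwise Euclidean distance among a set's items.
           d = [np.linalg.norm(a - b) for k, a in enumerate(s) for b in s[k + 1:]]
           return float(np.mean(d))

       # Desired diversity: average diversity observed in the positive sets.
       target_diversity = np.mean([set_diversity(s) for s in sets])
       print(item_value(items[0]), target_diversity)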

       Since this is a new learning problem, I will also describe our new evaluation methodology
       and give experimental results of the learning method on two data collections: synthetic
       blocks-world data and a new real-world music data collection.

       Joint work with Eric Eaton and Kiri L. Wagstaff.

       Speaker biography:

       Dr. Marie desJardins is an assistant professor in the Department of Computer Science and
       Electrical Engineering at the University of Maryland, Baltimore County. Prior to joining the
       faculty in 2001, Dr. desJardins was a senior computer scientist at SRI International in Menlo
       Park, California. Her research is in artificial intelligence, focusing on the areas of machine
       learning, multi-agent systems, planning, interactive AI techniques, information
       management, reasoning with uncertainty, and decision theory.



SRL: Statistical Relational Learning
Lise Getoor, University of Maryland, College Park

       A key challenge for machine learning is mining richly structured datasets describing
       objects, their properties, and links among the objects. We would like to be able to learn
       models that can capture both the underlying uncertainty and the logical relationships in
       the domain. Links among the objects may exhibit certain patterns, which can be
       helpful for many practical inference tasks and are usually hard to capture with traditional
       statistical models. Recently there has been a surge of interest in this area, fueled largely
       by interest in web and hypertext mining, but also by interest in mining social networks,
       security and law enforcement data, bibliographic citations and epidemiological records.

       Statistical Relational Learning (SRL) is an emerging research area that attempts to
       represent, reason, and learn in domains with complex relational and rich probabilistic
       structure. In this talk, I'll begin with a short SRL overview. Then, I'll describe some of my
       group's recent work, including our work on entity resolution in relational domains.

       Joint work with Indrajit Bhattacharya, Mustafa Bilgic, Louis Licamele and Prithviraj Sen.

       Speaker biography:

       Prof. Lise Getoor is an assistant professor in the Computer Science Department at the
       University of Maryland, College Park. She received her PhD from Stanford University in
       2001. Her current work includes research on link mining, statistical relational learning and
       representing uncertainty in structured and semi-structured data. Her work in these areas
       has been supported by NSF, NGA, KDD, ARL and DARPA. In June 2006, she co-organized
       the 4th in a series of successful workshops on statistical relational learning,
       http://www.cs.umd.edu/srl2006. She has published numerous articles in machine learning,
       data mining, database and AI forums. She is a member of the AAAI Executive Council, is on
       the editorial board of the Machine Learning Journal and JAIR and has served on numerous
       program committees including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.




Talks

On a Theory of Learning with Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University

       Kernel functions have become an extremely popular tool in machine learning. They have
       an attractive theory that describes a kernel function as being good for a given learning
       problem if data is separable by a large margin in a (possibly very high-dimensional)
       implicit space defined by the kernel. This theory, however, has a bit of a disconnect with
       the intuition of a good kernel as a good similarity function. In this work we develop an
       alternative theory of learning with similarity functions more generally (i.e., sufficient
       conditions for a similarity function to allow one to learn well) that does not require
       reference to implicit spaces, and does not require the function to be positive semi-definite.
       Our results also generalize the standard theory in the sense that any good kernel function
       under the usual definition can be shown to also be a good similarity function under our
       definition. In this way, we provide the first steps towards a theory of kernels that describes
       the effectiveness of a given kernel function in terms of natural similarity-based properties.

       Joint work with Avrim Blum.
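
       One way to picture the landmarking construction this theory suggests (the data,
       similarity, and least-squares separator below are illustrative choices, not details
       from the paper): represent each point by its similarities to a few randomly drawn
       landmarks and learn a linear separator in that explicit space.

       import numpy as np

       rng = np.random.default_rng(1)

       def sim(a, b):
           # Any bounded pairwise similarity; it need not be positive semi-definite.
           return np.exp(-np.linalg.norm(a - b))

       X = rng.normal(size=(200, 2))
       y = np.sign(X[:, 0] + X[:, 1])                  # toy labels
       landmarks = X[rng.choice(len(X), size=20, replace=False)]

       # Explicit similarity-based features: phi(x) = (K(x, l_1), ..., K(x, l_d)).
       F = np.array([[sim(x, l) for l in landmarks] for x in X])
       w, *_ = np.linalg.lstsq(F, y, rcond=None)
       print("train accuracy:", np.mean(np.sign(F @ w) == y))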



Matrix Tile Analysis
Inmar Givoni, University of Toronto

       Many tasks require finding groups of elements in a matrix of numbers, symbols or class
       likelihoods. One approach is to use efficient bi- or tri-linear factorization techniques
       including PCA, ICA, sparse matrix factorization and plaid analysis. These techniques are
       not appropriate when addition and multiplication of matrix elements are not sensibly
       defined. More directly, methods like bi-clustering can be used to classify matrix elements,
       but these methods make the overly restrictive assumption that the class of each element
       is a function of a row class and a column class. We introduce a general computational
       problem, "matrix tile analysis" (MTA), which consists of decomposing a matrix into a set of
       non-overlapping tiles, each of which is defined by a subset of usually nonadjacent rows
       and columns. MTA does not require an algebra for combining tiles, but must search over an
       exponential number of discrete combinations of tile assignments. We describe a loopy BP
       (sum-product) algorithm and an ICM algorithm for performing MTA. We compare the
       effectiveness of these methods to PCA and the plaid method on hundreds of randomly
       generated tasks. Using double-gene-knockout data, we show that MTA finds groups of
       interacting yeast genes that have biologically-related functions.

       Joint work with Vincent Cheung and Brendan J. Frey.



Towards Bayesian Black Box Learning Systems
Jo-Anne Ting, University of Southern California

       A long-standing dream of machine learning is to create black box learning systems that
       can operate autonomously in home, research and industrial applications. While it is well
       understood that a universal black box may not be possible, significant progress can be
       made in specific domains. In particular, we address learning problems in sensor-rich and
       data-rich environments, as provided by autonomous vehicles, surveillance systems,
       biological or robotic systems. In these scenarios, the input data has hundreds or thousands
       of dimensions and is used to make predictions (often in real-time), resulting in a learning
       system that learns to "understand" the environment.

       The goal of machine learning in this domain is to devise algorithms that can efficiently deal
       with very high dimensional data, usually contaminated by noise, redundancy and
       irrelevant dimensions. These algorithms must learn nonlinear functions, potentially in an
       incremental and real-time fashion, for robust classification and regression. In order to
       achieve black box quality, manual tuning parameters (e.g. as in gradient descent or
       structure selection) need to be minimized or, ideally, avoided.

       Bayesian inference, when combined with approximation methods to reduce computational
       complexity, suggests a promising route to achieve our goals, since it offers a principled
       way to eliminate open parameters. In past work, we have started to create a toolbox of
       methods to achieve our goal of black box learning. In (Ting et al., NIPS 2005), we
       introduced a Bayesian approach to linear regression. The novelty of this algorithm comes
       from a Bayesian and EM-like formulation of linear regression that robustly performs
       automatic feature detection in the inputs in a computationally efficient way. We applied
       this algorithm to the analysis of neuroscientific data (i.e. the problem of prediction of
       electromyographic (EMG) activity in the arm muscles of a monkey from spiking activity of
       neurons in the primary motor and premotor cortex). The algorithm runs orders of
       magnitude faster and produces higher-quality results than previously applied methods.
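
       To give a flavor of Bayesian automatic feature detection in linear regression, here is a
       generic automatic-relevance-determination sketch on made-up data; it is not the exact
       algorithm of the paper.

       import numpy as np

       rng = np.random.default_rng(2)
       n, d = 100, 10
       w_true = np.zeros(d); w_true[:3] = [2.0, -1.0, 0.5]   # only three relevant inputs
       X = rng.normal(size=(n, d))
       y = X @ w_true + 0.1 * rng.normal(size=n)

       alpha = np.ones(d)     # per-input weight precisions ("relevance" parameters)
       sigma2 = 0.01          # observation noise variance, assumed known here
       for _ in range(50):
           # Gaussian posterior over the weights given the current precisions.
           S = np.linalg.inv(np.diag(alpha) + X.T @ X / sigma2)
           mu = S @ X.T @ y / sigma2
           # EM-style update: alpha grows for irrelevant inputs, pruning them.
           alpha = 1.0 / (mu ** 2 + np.diag(S))
       print(np.round(mu, 2))  # weights of the seven irrelevant inputs shrink toward zero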

       More recently, we introduced a variational Bayesian regression algorithm that is able to
       perform optimal prediction, given noise-contaminated input and output data (Ting, D'Souza
       & Schaal, ICML 2006). Traditional linear regression algorithms produce biased estimates
       when input noise is present and suffer numerically when the data contains irrelevant
       and/or redundant inputs. Our algorithm is able to effectively handle datasets with both
       characteristics. On a system identification task for a robot dynamics model, we achieved
       results 10 to 70% better than those of traditional approaches.

       Current work focuses on developing a Bayesian version of nonlinear function
       approximation with locally weighted regression. The challenge is to determine the size of
       the neighborhood of data that should contribute to the local regression model―a typical
       bias-variance trade-off problem. Preliminary results indicate that a full Bayesian treatment
       of this problem can achieve impressive robust function approximation performance without
       the need for tuning meta parameters. We are also interested in extending this locally
       linear Bayesian model to an online setting, in the spirit of dynamic Bayesian networks, to
       offer a parameter-free alternative to incremental learning.

       Joint work with Aaron D'Souza, Stefan Schaal, Kenji Yamamoto, Toshinori Yoshioka, Donna
       Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato, Peter Strick, Michael
       Mistry, Jan Peters, and Jun Nakanishi.

       This work will also be in Poster Session 1.



Efficient Bayesian Algorithms for Clustering
Katherine Ann Heller, Gatsby Unit, University College London

       One of the most important goals of unsupervised learning is to discover meaningful
       clusters in data. There are many different types of clustering methods that are commonly
       used in machine learning, including spectral, hierarchical, and mixture modeling. Our work
       takes a model-based Bayesian approach to defining a cluster and evaluates cluster
       membership in this paradigm. We use marginal likelihoods to compare different cluster
       models, and hence determine which data points belong to which clusters. If we have
       models with conjugate priors, these marginal likelihoods can be computed extremely
      efficiently.

      Using this clustering framework in conjunction with non-parametric Bayesian methods, we
      have proposed a new way of performing hierarchical clustering. Our Bayesian Hierarchical
      Clustering (BHC) algorithm takes a more principled approach to the problem than the
      traditional algorithms (e.g. allowing for model comparisons and the prediction of new data
      points) without sacrificing efficiency. BHC can also be interpreted as performing
      approximate inference in Dirichlet Process Mixtures (DPMs), and provides a combinatorial
      lower bound on the marginal likelihood of a DPM.

      We have also explored the task of "clustering on demand" for information retrieval. Given
      a query consisting of a few examples of some concept, we have proposed a method that
      returns other items belonging to the concept exemplified by the query. We do this by
      ranking all items using a Bayesian relevance criterion based on marginal likelihoods, and
      returning the items with the highest scores. In the case of binary data, all scores can be
      computed with a single matrix-vector product. We can also use this method as the basis
      for an image retrieval system. In our most recent work this framework has served as
      inspiration for a new approach to automated analogical reasoning.
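
       A compact sketch of such a binary-data retrieval score (in the spirit of this framework,
       with Beta-Bernoulli models; the data and hyperparameters are invented): the log
       marginal-likelihood ratio reduces to a single matrix-vector product.

       import numpy as np

       rng = np.random.default_rng(3)
       X = (rng.random((1000, 50)) < 0.2).astype(float)   # items-by-features, binary
       query = X[:5]                                      # a few exemplars of a concept

       alpha = beta = np.full(X.shape[1], 2.0)            # Beta prior hyperparameters
       N, s = len(query), query.sum(axis=0)
       a_post, b_post = alpha + s, beta + N - s           # posterior given the query

       # Per-feature log-odds weights; ranking all items is one matrix-vector product.
       # (Additive constants that do not affect the ranking are dropped.)
       q = np.log(a_post / alpha) - np.log(b_post / beta)
       scores = X @ q
       print(np.argsort(-scores)[:10])                    # indices of top-ranked items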

      Joint work with Zoubin Ghahramani and Ricardo Silva.



Hidden Process Models
Rebecca Hutchinson, Carnegie Mellon University

      We introduce the Hidden Process Model (HPM), a probabilistic model for multivariate time
      series data. HPMs assume the data is generated by a system of partially observed, linearly
      additive processes that overlap in space and time. While we present a general formalism
      for any domain with similar modeling assumptions, HPMs are motivated by our interest in
      studying cognitive processes in the brain, given a time series of functional magnetic
      resonance imaging (fMRI) data. We use HPMs to model fMRI data by assuming there is an
      unobserved series of hidden, overlapping cognitive processes in the brain that
      probabilistically generate the observed fMRI time series.

       Consider for example a study in which subjects in the scanner repeatedly view a picture
      and read a sentence and indicate whether the sentence correctly describes the picture. It
      is natural to think of the observed fMRI sequence as arising from a set of hidden cognitive
      processes in the subject’s brain, which we would like to track. To do this, we use HPMs to
      learn the probabilistic time series response signature for each type of cognitive process,
      and to estimate the onset time of each instantiated cognitive process occurring throughout
      the experiment.

       There are significant challenges to this learning task in the fMRI domain. The first is that
      fMRI data is high dimensional and sparse. A typical fMRI dataset measures approximately
      10,000 brain locations over 15-20 minutes (features), with only a few dozen trials (training
      examples). A second challenge is due to the nature of the fMRI signal: it is a highly noisy
      measurement of an indirect and temporally blurred neural correlate called the
      hemodynamic response. The hemodynamic response to a short burst of less than a second
      of neural activity lasts for 10-12 seconds. This temporal blurring in fMRI makes it
      problematic to model the time series as a first-order Markov process. In short, our problem
      is to learn the parameters and timing of potentially overlapping, partially observed
      responses to cognitive processes in the brain using many features and a small number of
       noisy training examples.

       The modeling assumptions that HPMs make to deal with the challenges of the fMRI domain
       are: 1) the latent time series is modeled at the level of processes rather than individual
      time points; 2) processes are general descriptions that can be instantiated many times
      over the course of the time series; 3) we can use prior knowledge of the form “process
      instance X occurs somewhere inside the time interval [a, b].” HPMs could apply to any
      domain in which these assumptions are valid.
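
       A generative caricature of these assumptions (all shapes, sizes, and onsets below are
       invented): the observed series is the sum of linearly additive process responses, each
       instantiated at its own onset, plus noise.

       import numpy as np

       rng = np.random.default_rng(4)
       T, V, dur = 60, 5, 12                   # time points, voxels, response length

       # One response signature per process type (voxels x time), illustrative shapes.
       signatures = {"picture": np.outer(rng.random(V), np.hanning(dur)),
                     "sentence": np.outer(rng.random(V), 0.5 * np.hanning(dur))}

       def generate(onsets, noise=0.1):
           # Observed data = sum of overlapping, linearly additive process responses.
           Y = np.zeros((V, T))
           for name, t0s in onsets.items():
               for t0 in t0s:
                   Y[:, t0:t0 + dur] += signatures[name][:, :T - t0]
           return Y + noise * rng.normal(size=(V, T))

       Y = generate({"picture": [0, 30], "sentence": [8, 38]})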

       HPMs address a key open question in fMRI analysis: how can one learn the response
      signatures of overlapping cognitive processes with unknown timing? There is no
      competing method to HPMs available in the fMRI community. In our ICML paper, we give
      the HPM formalism, inference and learning algorithms, and experimental results on real
      and synthetic fMRI datasets.

      Joint work with Tom Mitchell and Indrayana Rustandi.

      This work will also be in Poster Session 1.



Generalized statistical methods for fraud detection
Cecile Levasseur, University of California, San Diego

      Many important risk assessment system applications depend on the ability to accurately
      detect the occurrence of key events given a large data set of observations. For example,
      this problem arises in drug discovery (“Do the molecular descriptors associated with
      known drugs suggest that a new, candidate drug will have low toxicity and high
      effectiveness?”); and credit card fraud detection (“Given the data for a large set of credit
      card users does the usage pattern of this particular card indicate that it might have been
       stolen?”). In many of these domains, little or no a priori knowledge exists regarding the
      true sources of any causal relationships that may occur between variables of interest. In
      these situations, meaningful information regarding the circumstances of the key events
      must be extracted from the data itself, a problem that can be viewed as an important
      application of data-driven pattern recognition or detection.

      The problem of unsupervised data-driven detection or prediction is one of relating
      descriptors of a large unlabeled database of “objects” to measured properties of these
      objects, and then using these empirically determined relationships to infer or detect the
      properties of new objects. This work considers measured object properties that are
      nongaussian (and comprised of continuous and discrete data), very noisy, and highly
      nonlinearly related. Data comprised of measurements of such disparate properties are said
      to be hybrid or of mixed type. As a consequence, the resulting detection problem is very
      difficult. The difficulties are further compounded because the descriptor space is of high
      dimension. While many domains lack accurate labels in their database, others like credit
      card fraud exhibit tagged data. Therefore, the problem of supervised data-driven
      detection, one relating to a labelled database of objects, is also examined. In addition, by
      utilizing tagged data, a performance benchmark can be set, enabling meaningful
      comparisons of supervised and unsupervised approaches.

      Statistical approaches to fraud detection are mostly based on modelling the data relying
      on their statistical properties and using this information to estimate whether a new object
      comes from the same distribution or not. The statistical modelling approach proposed here
      is a generalization and amalgamation of techniques from classical linear statistics (logistic
      regression, principal component analysis and generalized linear models) into a framework
      referred to as generalized linear statistics (GLS). It is based on the use of exponential
      family distributions to model the various types (continuous and discrete) of data
      measurements. A key aspect is that the natural parameter of the exponential family
       distributions is constrained to a lower dimensional subspace to model the belief that the
       intrinsic dimensionality of the data is smaller than the dimensionality of the observation
      space. The proposed constrained statistical modelling is a nonlinear methodology that
      exploits the split that occurs for exponential family distributions between the data space
      and the parameter space as soon as one leaves the domain of purely Gaussian random
      variables. Although the problem is nonlinear, it can be solved by using classical linear
      statistical tools applied to data that has been mapped into the parameter space that still
      has a natural, flat Euclidean structure. This approach provides an effective way to exploit
      tractably parameterized latent-variable exponential-family probability models for data-
      driven learning of model parameters and features, which in turn are useful for the
      development of effective fraud detection algorithms.
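
       As a crude caricature of working in the parameter space (synthetic data; the actual GLS
       framework fits the subspace by exponential-family maximum likelihood rather than this
       naive logit-then-PCA shortcut):

       import numpy as np

       rng = np.random.default_rng(5)
       X = (rng.random((500, 20)) < 0.3).astype(float)   # Bernoulli-type training data
       eps = 1e-2

       def to_params(B):
           # Naive per-entry map from the data space to the natural-parameter space.
           P = (B + eps) / (1.0 + 2.0 * eps)             # keep probabilities off 0 and 1
           return np.log(P / (1.0 - P))

       Theta = to_params(X)
       mean = Theta.mean(axis=0)
       _, _, Vt = np.linalg.svd(Theta - mean, full_matrices=False)
       W = Vt[:3]                                        # low-dimensional parameter subspace

       def score(x):
           # Distance of the point's subspace image from the training-set mean image;
           # comparing this to a threshold yields the detector.
           return np.linalg.norm(W @ (to_params(x) - mean))

       print(score(X[0]))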

      The fraud detection techniques proposed here are performed in the parameter space
      rather than in the data space as has been done in more classical approaches. In the case
      of a low level of contamination of the data by fraudulent points, a single lower dimensional
      subspace is learned by using the GLS based statistical modelling on a training set. Given a
      new data point, it is projected to its image on the lower dimensional subspace and fraud
      detection is performed by comparing its distance from the training set mean-image to a
       threshold. An example is presented showing that there are domains for which classical
       linear techniques used in the data space, such as principal component analysis, perform
       far from optimally compared to the newly proposed parameter-space techniques. For
      cases of data with roughly as many fraudulent as non-fraudulent points, an unsupervised
      approach to the linear Fisher discriminant is proposed. The GLS based framework enables
      unsupervised learning of a lower dimensional subspace in the parameter space that
      separates fraudulent from non-fraudulent data. Fraud detection is performed as in the
      previous case. In both cases, an ROC curve is generated to assess the performance of the
      proposed fraud detection methods.

      Joint work with Kenneth Kreutz-Delgado and Uwe Mayer.



Kernels for the Predictive Regression of Physical, Chemical and Biological Properties of
Small Molecules
Chloe-Agathe Azencott, University of California, Irvine

       Small molecules, i.e. molecules composed of a few hundred atoms, play a
       fundamental role in biology, chemistry and pharmacology. Their uses range from the
       design of new drugs to the better understanding of biological systems; however,
       establishing their physical, chemical and biological properties through physical
       experimentation can be very costly. It is therefore essential to develop efficient
      computational methods to predict these properties.

       Kernel methods, and among them support vector machines, appear particularly
       appropriate for chemical data, for they involve similarity measures that embed
       the data in a high-dimensional feature space where linear methods can be used. Machine
      learning spectral kernels can be derived from various descriptions of the molecules; we
       study representations whose dimensionality ranges from 1 to 4, thus obtaining 1D, 2D,
      2.5D, 3D and 4D kernels.
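
       As a schematic of the kernel-plus-regression pipeline (random counts stand in for real
       path-count fingerprints, and the MinMax similarity below is one common choice for such
       count data, not necessarily the kernels of the paper):

       import numpy as np
       from sklearn.svm import SVR

       rng = np.random.default_rng(6)
       F = rng.integers(0, 5, size=(30, 100)).astype(float)  # molecule-by-path counts
       y = rng.normal(size=30)                               # property to regress

       def minmax_kernel(A, B):
           # MinMax (Tanimoto-style) similarity between count fingerprints.
           K = np.empty((len(A), len(B)))
           for i, a in enumerate(A):
               K[i] = np.minimum(a, B).sum(axis=1) / np.maximum(a, B).sum(axis=1)
           return K

       model = SVR(kernel="precomputed").fit(minmax_kernel(F, F), y)
       print(model.predict(minmax_kernel(F[:3], F)))         # three molecules' predictions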

      Using cross-validation and redundancy reduction techniques on various datasets of small
      and medium size from the literature, we test the kernels for the prediction of boiling
      points, melting points, aqueous solubility and octanol/water partition coefficient and
      compare them against state-of-the art results.

      Spectral kernels derived from the rich and reliable two-dimensional representation of the
       molecules outperform the other methods on most of the datasets. They seem to be the
       method of choice, given their simplicity, computational efficiency and prediction accuracy.



Efficient Exploration with Latent Structure
Bethany Leffler, Rutgers University

      Developing robot control using a reinforcement-learning (RL) approach involves a number
      of technical challenges. In our work, we address the problem of learning an action model.

      Classical RL approaches assume Markov decision process (MDP) environments, which do
      not support the critical idea of generalization between states. For an agent to learn the
      results of its actions for each state, it would have to visit each state and perform each
      action in that state at least once. In a robot setting, however, it is unrealistic to assume
       there will be sufficient time to learn about every state of the environment independently,
       so richer models of environmental dynamics are needed. Our technique for developing
      such a model is to assume that each state is not unique. In most environments, there will
      be states that have the same transition dynamics. By developing models where similar
      states have similar dynamics, it becomes possible for a learner to reuse its experience in
      one state to more quickly learn the dynamics of other parts of the environment. However,
      it also introduces an additional challenge―determining which states are similar.

      To evaluate the viability of this approach, we constructed an experiment using a four-
      wheeled Lego Mindstorm robot as the agent. The state space consisted of discretized
      vehicle locations with a hidden variable of slope (flat or incline), which correlated directly
      with the action model. The agent had to learn which throttling action to perform in each
      state to maintain a target speed. In this scenario, the actions did not affect the transitions
      between states.

      To determine similarity between states, the agent executed a selected action several times
      in each of the vehicle locations. The outcomes of these actions were used to hierarchically
      cluster the states. Once the states were clustered, the agent then started learning an
      action model for each state cluster. The advantage of this approach over one that learned
      a separate action model for each state is that information gathered in several different
      states can be pooled together. In common environments, there are many more states than
      state-types; therefore, learning based on clusters drastically reduces learning time. In
      fact, we were able to prove a worst-case learning time result that formalizes and validates
      this claim.
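
       A small sketch of this cluster-then-pool idea under the abstract's assumptions (the two
       hidden state-types, probe outcomes, and cluster count are invented):

       import numpy as np
       from scipy.cluster.hierarchy import fcluster, linkage

       rng = np.random.default_rng(7)
       n_states, n_trials = 20, 5
       true_type = rng.integers(0, 2, size=n_states)        # hidden: flat or incline

       # Execute a probe action a few times per state; similar states yield
       # similar outcome statistics (e.g. resulting speeds).
       outcomes = rng.normal(true_type[:, None] * 2.0, 0.3, size=(n_states, n_trials))
       stats = np.column_stack([outcomes.mean(axis=1), outcomes.std(axis=1)])

       # Hierarchically cluster states, then learn one action model per cluster,
       # pooling the experience of all member states.
       labels = fcluster(linkage(stats, method="average"), t=2, criterion="maxclust")
       for c in np.unique(labels):
           pooled = outcomes[labels == c].ravel()
           print(f"cluster {c}: mean outcome {pooled.mean():.2f} from {pooled.size} samples")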

      If the environment does not have many similar states or if the clustering algorithm groups
       the states incorrectly, then the benefit of this approach will be minimized. Even in this
      worst case, however, it is important to note that this algorithm is no more costly than
      exploring each state individually.

      Some limitations of this algorithm arise when states have semi-similar action models. For
      instance, if two states behave similarly when one action is performed, but not for all the
      actions, it is possible that the agent would learn incorrectly when following our proposed
      algorithm. In most robotic environments, however, using our algorithm will greatly reduce
      the time taken by the agent to determine its action model in all states, thereby increasing
      the efficiency of the robot.

      Joint work with Michael L. Littman, Alexander L. Strehl, and Thomas Walsh.




Efficient Model Learning for Dialog Management
Finale Doshi, MIT

      Intelligent planning algorithms such as the Partially Observable Markov Decision Process
      (POMDP) have succeeded in dialog management applications because of their robustness
      to the inherent uncertainty of human interaction. Like all dialog planning systems,
      however, POMDPs require an accurate model of the user (such as the user's different
      states of the user and what the user might say). POMDPs are generally specified using a
      large probabilistic model with many parameters; these parameters are difficult to specify
      from domain knowledge, and gathering enough data to estimate the parameters
      accurately a priori is expensive.

       In this paper, we take a Bayesian approach to learning the user model while simultaneously solving the
      dialog management problem. First we show that the policy that maximizes the expected
      reward is the solution of the POMDP taken with the expected values of the parameters. We
      update the parameter distributions after each test, and incrementally update the previous
      POMDP solution. The update process has a relatively small computational cost, and we
      test various heuristics to focus computation in circumstances where it is most likely to
      improve the dialog. We are able to demonstrate a robust dialog manager that learns from
       interaction data, outperforming a hand-coded model in simulation and in a robotic
      wheelchair application.
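
       The certainty-equivalent loop described here can be caricatured as follows (sizes and
       the single observed exchange are invented; solving the POMDP itself is elided):

       import numpy as np

       n_states, n_obs = 3, 4
       # Dirichlet counts over the user model, e.g. P(observation | hidden user state).
       counts = np.ones((n_states, n_obs))                # uninformative prior

       def expected_model(counts):
           # Plan with the expected values of the parameters (then solve the POMDP
           # for this model; the solver is omitted here).
           return counts / counts.sum(axis=1, keepdims=True)

       def observe(counts, state, obs):
           counts[state, obs] += 1.0                      # posterior update per exchange
           return counts

       counts = observe(counts, state=0, obs=2)
       print(expected_model(counts))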

      Joint work with Nicholas Roy.



Transfer in the context of Reinforcement Learning
Soumi Ray, University of Maryland, Baltimore County

      We are investigating the problem of transferring knowledge learned in one domain to
      another related domain. Transfer of knowledge from simple domains to more complex
      domains can reduce the total training time in the complex domains. We are doing transfer
      in the context of reinforcement learning. In the past, knowledge transfer has been
      accomplished between domains with the same state and action spaces. Work has also
      been done where the state and action spaces of the two domains are different but a
      mapping has been provided by humans. We are trying to automate the mapping from the
      old domain to the new domain when the state and action spaces are different.

      We have two domains D1 and D2, with corresponding state spaces S1 and S2 and action
      spaces A1 and A2 where |S1| = |S2| and |A1| = |A2|. Our goal is to transfer a policy learned
      in D1 to D2 so as to speed learning in D2. We first run Q-learning in D1 to produce Q-table
      Q1. Then we train for limited time in D2 and generate Q2. The test bed we have used is a
      16x16 grid world. We have taken two domains in a 16x16 grid world with four actions:
      North, South, East and West. In the first domain we have trained for 500 iterations and in
      the second domain we have trained for 20 iterations. The two approaches that we have
      used are as follows.

       Our goal is to find the mapping between the state spaces S1 and S2 and the action spaces
       A1 and A2. In the first approach, we compute the difference between matrices Q1 and Q2
       and greedily find a mapping that minimizes this difference. With this mapping
      we can transfer the Q-values from the completely trained domain D1 to the partially
      trained domain D2 to speed up learning in domain D2. We find that it takes fewer steps to
      learn completely in the second domain when the Q-values are transferred than learning
      from scratch. Our second approach finds the mapping that assigns the highest Q-values of
      the states in domain one to the highest Q-values of the states in domain two. This
       approach is an improvement over the first approach. It takes many fewer steps to learn in
       the second domain using transfer.
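
       A sketch of the second approach on made-up tables (the action-space alignment is elided
       for brevity): states are aligned by the rank of their highest Q-values, and the source
       values warm-start the target.

       import numpy as np

       rng = np.random.default_rng(8)
       n_states, n_actions = 16, 4
       Q1 = rng.random((n_states, n_actions))          # fully trained source table
       perm = rng.permutation(n_states)                # unknown relabeling of states
       Q2 = Q1[perm] + 0.05 * rng.random((n_states, n_actions))   # briefly trained target

       # Align states by the rank of their highest Q-values.
       rank1, rank2 = np.argsort(Q1.max(axis=1)), np.argsort(Q2.max(axis=1))
       mapping = np.empty(n_states, dtype=int)
       mapping[rank2] = rank1                          # target state -> source state
       Q2_init = Q1[mapping]                           # transferred warm-start values

       print(np.mean(mapping == perm))                 # fraction of states recovered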

       We are also interested in finding the mapping when S1 and A1 are subsets of S2 and A2,
       respectively, i.e. |S1| < |S2| and |A1| < |A2|. This can be handled by allowing a single
       state/action in S1/A1 to map to multiple states/actions in S2/A2.

       Joint work with Tim Oates.

       This work will also be in Poster Session 2.




Spotlights (Session 1)

Correcting sample selection bias by unlabeled data
Jiayuan Huang, University of Waterloo

      The default assumption in many learning scenarios is that training and test data are
      independently and identically drawn from the same distribution. When the distributions on
       training and test sets do not match, we face the problem commonly referred to as
       sample selection bias or covariate shift. This problem occurs in many real-world
       applications, including surveys, sociology, biology and economics. It is not hard to see
       that, given a skewed selection of the training data, it is impossible to derive a good
       model that makes accurate predictions on the general target, as the training set might
       not be representative of the complete population from which the test set is drawn. The
       predictions are therefore biased, potentially increasing the errors. Although there exists
       previous work addressing this problem, sample selection bias is typically ignored in
       standard estimation algorithms. In this work, we utilize the availability of unlabeled data
       to direct a sample selection de-biasing procedure for various learning methods. Unlike
       most previous algorithms, which try to first recover the sampling distributions and then
       make appropriate corrections based on the distribution estimates, our method infers the
       re-sampling weights directly by matching the distributions of the training and testing sets
       in feature space in a non-parametric manner. We require neither the estimation of biased
       densities or selection probabilities nor any assumption that the class probabilities are
       known. Because it matches distributions in feature space, the method can handle high-
       dimensional data. Our experimental results on many benchmark datasets demonstrate
       that the method works well in practice. It also shows good performance in tumor diagnosis
       using microarrays, suggesting that it promises to be a valuable tool for cross-platform
       microarray classification.
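
       A drastically simplified sketch of the weighting step (matching only the means of the
       raw input features, with a made-up regularizer toward uniform weights; the actual method
       matches kernel means in a reproducing kernel Hilbert space under constraints):

       import numpy as np

       rng = np.random.default_rng(9)
       X_train = rng.normal(loc=0.5, size=(200, 3))    # sample drawn with selection bias
       X_test = rng.normal(loc=0.0, size=(300, 3))     # unlabeled, unbiased sample

       # Re-sampling weights w >= 0 so that the weighted training mean matches the
       # test mean, regularized toward uniform weights.
       n, lam = len(X_train), 0.1
       A = np.vstack([X_train.T / n, np.sqrt(lam) * np.eye(n)])
       b = np.concatenate([X_test.mean(axis=0), np.sqrt(lam) * np.ones(n)])
       w, *_ = np.linalg.lstsq(A, b, rcond=None)
       w = np.clip(w, 0.0, None)
       w *= n / w.sum()                                # normalize to average weight one
       print(float(w.min()), float(w.max()))           # over-sampled regions are down-weighted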

      Joint work with Alex Smola, Arthur Gretton, Karsten Borgwardt, Bernhard Scholkopf.



Decision Tree Methods for Finding Reusable MDP Homomorphisms
Alicia Peregrin Wolfe, University of Massachusetts, Amherst

      State abstraction is a useful tool for agents interacting with complex environments. Good
      state abstractions are compact, reusable, and easy to learn from sample data. This paper
      combines and extends two existing classes of state abstraction methods to achieve these
       criteria. The first class of methods searches for MDP homomorphisms (Ravindran 2004),
      which produce models of reward and transition probabilities in an abstract state space.
      The second class of methods, like the UTree algorithm (McCallum 1995), learn compact
      models of the value function quickly from sample data. Models based on MDP
      homomorphisms can easily be extended such that they are usable across tasks with
       similar reward functions. However, value-based methods like UTree cannot be extended in
      this fashion. We present results showing a new, combined algorithm that fulfills all three
      criteria: the resulting models are compact, can be learned quickly from sample data, and
      can be used across a class of reward functions.

      Joint work with Andrew Barto.




Evaluating a Reputation-based Spam Classification System
Elena Zheleva, University of Maryland, College Park

      Over the past several years, spam has been a growing problem for the Internet
      community. It interferes with valid e-mail and burdens both e-mail users and ISPs. While
      there are various successful automated e-mail filtering approaches that aim at reducing
      the amount of spam, there are still many challenges to overcome.

      Reactive spam filtering approaches classify a piece of e-mail as spam if it has been
      reported as such by a large volume of e-mail users. Unfortunately, by the time the system
      responds by blocking the message or automatically placing it in future recipients' spam
      folders, the spam campaign has already affected a lot of users. The challenge that we
      consider is whether we can reduce the response time, recognizing a spam campaign at an
      earlier stage, thus reducing the cost that users and systems incur. Specifically, we are
      evaluating the predictive power of a reputation-based spam filtering system, which uses
      the feedback only from trustworthy e-mail users.

      In a reputation-based or trust-based spam filtering system, the system identifies a set of
      users who report spam reliably and trusts their spam reports more than the spam reports
      of other users. A message coming into the system is classified as spam if enough reliable
      users report it. This automatic spam filtering approach is vulnerable to malicious users
      when any anonymous person can subscribe and unsubscribe to the e-mail service. This is
      the case with most free e-mail providers such as AOL, Hotmail and Yahoo. We show how to
      overcome this problem in this work.

      There are two well-known open-source projects which operate in this framework: Vipul's
      Razor and Distributed Checksum Clearinghouse. Unfortunately, their reputation systems
      work only as a part of their commercially available software counterparts and, due to trade
      secrets, it is not clear how the design characteristics such as reputation definition and
      metrics affect the system performance. More importantly, the spam reports they receive
      are mostly from authorized users (such as business partner company employees), which
       reduces the risk of abuse by anonymous users.

       The effectiveness of a reputation-based spam filtering system is evaluated in terms of the
      following properties: 1) automatic maintenance of a reliable user set over time, 2) timely
      and accurate recognition of a spam campaign, and 3) having a set of guarantees on the
       system vulnerability. In our work, we present the results from simulating a reputation-
       based spam filtering system over a period of time. The evaluation dataset includes all the spam
      reports received during that period of time for a particular free e-mail provider. We show
      how our algorithms effectively reduce spam campaign response time, while minimizing
      system vulnerability.

      Joint work with Lise Getoor and Alek Kolcz.



Improving Robot Navigation Through Self-Supervised Online Learning
Ellie Lin, Carnegie Mellon University

      In mobile robotics, there are often features that, while potentially powerful for improving
      navigation, prove difficult to profit from as they generalize poorly to novel situations.
      Overhead imagery data, for instance, has the potential to greatly enhance autonomous
      robot navigation in complex outdoor environments. In practice, reliable and effective
      automated interpretation of imagery from diverse terrain, environmental conditions, and
      sensor varieties proves challenging. Similarly, fixed techniques that successfully interpret
       on-board sensor data across many environments begin to fail past short ranges, as the
density and accuracy necessary for such computation quickly degrade and the features
that can be computed from distant data are very domain-specific. We introduce an
online, probabilistic model to effectively learn to use these scope-limited features by
leveraging other features that, while perhaps otherwise more limited, generalize reliably.
We apply our approach to provide an efficient, self-supervised learning method that
accurately predicts traversal costs over large areas from overhead data. We present
results from field-testing on-board a robot operating over large distances in off-road
environments. Additionally, we show how our algorithm can be used offline with overhead
data to produce a priori traversal cost maps and detect misalignments between overhead
data and estimated vehicle positions. This approach can significantly improve the
versatility of many unmanned ground vehicles by allowing them to traverse highly varied
terrains with increased performance.

Joint work with B. Sofman, J. Bagnell, N. Vandapel and A. Stentz.




Spotlights (Session 2)

Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent
Traces
Gita Sukthankar, Carnegie Mellon University

     This research addresses the problem of activity recognition for physically embodied agent
     teams. We define team activity recognition as the process of identifying team behaviors
     from traces of agent positions over time; for many physical domains, military or athletic,
     coordinated team behaviors create distinctive spatio-temporal patterns that can be used
     to identify low-level action sequences. We focus on the novel problem of recovering agent-
     to-team assignments for complex team tasks where team composition, the mapping of
     agents into teams, changes over time. Without a priori knowledge of current team
     assignments, the behavior recognition problem is challenging since behaviors are
     characterized by the aggregate motion of the entire team and cannot generally be
     determined by observing the movements of a single agent in isolation.

     To handle this problem, we introduce a new algorithm, Simultaneous Team Assignment and
      Behavior Recognition (STABR), that generates behavior annotations from spatio-temporal
      agent traces. STABR leverages information from the spatial relationships of the team
     members to create sets of potential team assignments at selected time-steps. These
     spatial relationships are efficiently discovered using a randomized search technique,
      RANSAC, to generate potential team assignment hypotheses. Sequences of team
     assignment hypotheses are evaluated using dynamic programming to derive a
     parsimonious explanation for the entire observed spatio-temporal trace. To prune the
     number of hypotheses, potential team assignments are fitted to a parameterized team
     behavior model; poorly fitting hypotheses are eliminated before the dynamic programming
     phase. The proposed approach is able to perform accurate team behavior recognition
     without exhaustive search over the partition set of potential team assignments, as
     demonstrated on several scenarios of simulated military maneuvers.

     STABR does not simply assume that agents within a certain proximity should be assigned
      to the same team; instead it relies on matching static snapshots of agent position against
     a database of team formation templates to produce a candidate pool of agent-to-team
     assignments. This candidate pool of assignments is verified by running a local spatio-
     temporal behavior detector. The intuition is that the aggregate agent movement for an
      incorrect team assignment will generally fail to match any behavior model. STABR
     significantly outperforms agglomerative clustering on the agent-to-team assignment
     problem for traces with dynamic agent composition (95% accuracy).

     The scenarios presented here illustrate the operation of STABR in environments that lack
     the external cues used by other multi-agent plan recognition approaches, such as
     landmarks, cleanly clustered agent teams, and extensive domain knowledge. We believe
     that when such cues are available they can be directly incorporated into STABR, both to
     improve accuracy and to prune hypotheses. STABR provides a principled framework for
     reasoning about dynamic team assignments in spatial domains.

     Joint work with Katia Sycara.




An Online Learning System for the Prediction of Electricity Distribution Feeder Failures
Hila Becker, Columbia University

      We are using machine learning techniques for constructing a failure-susceptibility ranking
      of feeder cables that supply electricity to the boroughs of New York City. The electricity
      system is inherently dynamic, and thus our failure-susceptibility ranking system must be
       able to adapt to the latest conditions in real time, updating its ranking accordingly.
      The feeders have a significant failure rate, and many resources are devoted to monitoring,
      maintenance and repair of feeders. The ability to predict failures allows the shifting from
      reactive to proactive maintenance, thus reducing costs.

      The feature set for each feeder includes a mixture of static data (e.g. age and composition
      of each feeder section) and dynamic data (e.g. electrical load data for a feeder and its
      transformers). The values of the dynamic features are captured at the time of training and
      therefore lead to different models depending on the time and day at which each model is
      trained. Previously, a framework was designed to train models using a new variant of
      boosting called Martingale Boosting, as well as Support Vector Machines. However, in this
      framework, an engineer had to decide whether to use the most recent data to build a new
      model, or use the latest model instead for future predictions.

       To avoid the need for human intervention, we have developed an “online” system that
      determines what model to use by monitoring past performance of previously trained
      models. In our new framework, we treat each batch-trained model as an expert, and use a
      measurement of its performance as the basis for reward or penalty of its quality score. We
      measure performance as a normalized average rank of failures. For example, in a ranking
      of 50 items with actual failures ranked #4 and #20, the performance is: 1 – (4 + 20) /
      (2*50) = 0.76.

      Our approach builds on the notion of learning from expert advice, as formulated in the
      continuous version of the Weighted Majority algorithm. Since each model is analogous to
      an expert, and our system runs live, gathering new data and generating new models, we
      must keep adding new experts to the ensemble throughout the algorithm’s execution. To
      avoid having to monitor an ever-increasing set of experts, we drop poorly performing
      experts after each prediction. Our solution had to address two key issues: (1) how often,
      and with what initial weight, to add new experts, and (2) which experts to drop. Our
      simulations suggest that initializing new models with the median of all current models’
      weights works best. To drop experts, we use a combination of the model’s age and its
      past performance. Finally, to make predictions we use a weighted average of the top-
      scoring experts.
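
      A minimal sketch of this expert framework, assuming a multiplicative (continuous
      Weighted-Majority-style) update; the learning rate and the exact reward form are
      illustrative assumptions, not the deployed system:

```python
import numpy as np

# Illustrative sketch, not the deployed system. Performance is the
# normalized average rank of failures; weights get a multiplicative
# reward/penalty (the exact update rule here is an assumption).

def performance(ranking, failed_items):
    """E.g. failures ranked #4 and #20 out of 50: 1 - 24/100 = 0.76."""
    ranks = [ranking.index(item) + 1 for item in failed_items]
    return 1.0 - sum(ranks) / (len(ranks) * len(ranking))

def update_weights(weights, perfs, eta=0.5):
    """Reward experts that rank failures high, penalize the rest."""
    w = weights * np.exp(eta * (np.asarray(perfs) - 0.5))
    return w / w.sum()

def add_expert(weights):
    """New experts enter with the median of the current weights."""
    return np.append(weights, np.median(weights))
```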

      Our system is currently deployed and being tested by New York City’s electricity
      distribution company. Results are highly encouraging, with 75% of the failures in the
      summer of 2005 being ranked in the top 26%, and 75% of failures in 2006 being ranked in
      the top 36%.

      Joint work with Marta Arias.



Classification of fMRI Images: An Approach Using Viola-Jones Features
Melissa K. Carroll, Princeton University

      There has been growing interest in using Functional Magnetic Resonance Imaging (fMRI)
      for “mind reading,” particularly in applying machine learning methods to classifying fMRI
      brain images based on the subject’s instantaneous cognitive state. For instance, Haxby et
      al. (2001) perform fMRI scans while subjects are viewing images of one of seven classes of
      objects with the goal of discriminating the brain images based on the class of image being
      viewed at the time.

      Most machine learning approaches used to date for fMRI classification have treated
      individual voxels as features and ignored the spatial correlation between voxels (Norman
      et al., 2006). We present a novel method, derived from the Viola and Jones (2001)
      algorithm for 2D object detection, for searching this feature space to generate features
      that capture spatial information, and apply it to 2D representations of the images. In this
      method, features corresponding to absolute and relative intensities over regions of
      varying size and shape are computed and used by AdaBoost (Schapire and Singer, 1999)
      to generate a classifier. Figure 1 (http://www.cs.princeton.edu/~mkc/wiml06/Figure1.jpg)
      shows examples of these features overlaid on an actual 2D representation of the 3D fMRI
      image. To compute each feature, the mean intensity in the white regions is subtracted
      from the mean intensity in the gray regions; the features are combined to form the
      feature vector. One-, two-, three- and four-rectangle features of all 100 size combinations
      between 1x1 and 10x10 are computed for all positions in the image.
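
      For concreteness, here is a sketch of how a two-rectangle feature of this kind can be
      evaluated in constant time with an integral image, in the style of Viola and Jones; the
      side-by-side region layout is one illustrative choice.

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r+1, :c+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) from the integral image."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(img, r, c, h, w):
    """Mean of the gray (left) rectangle minus mean of the white (right)
    rectangle, for two h-by-w rectangles side by side at (r, c)."""
    ii = integral_image(img)
    gray = rect_sum(ii, r, c, r + h, c + w) / (h * w)
    white = rect_sum(ii, r, c + w, r + h, c + 2 * w) / (h * w)
    return gray - white
```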

      As Figure 2 (http://www.cs.princeton.edu/~mkc/wiml06/Figure2.jpg) shows, including richer
      features than the standard one-pixel features can result in improved classification of the
      Haxby et al. dataset. One potential limitation of the method is that the large feature set it
      produces strains computational resources; however, Figure 2 shows that even selecting a
      small random subset of the richer features can increase classification accuracy by 5% or
      more, although performance varies across subjects. In addition, the performance of this
      subset of features can be used to target subsequent feature selection. Future work is
      needed to develop reliable and valid methods for rating feature importance.

      Finally, Figure 3 (http://www.cs.princeton.edu/~mkc/wiml06/Figure3.jpg) shows that
      confusion among predicted classes occurs most often between classes that are most
      similar and for which previous classifiers have encountered difficulty, e.g. male faces and
      female faces. This target space similarity structure could be exploited in future work to
      improve classification.

         1. J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini (2001).
            Distributed and overlapping representations of faces and objects in ventral
            temporal cortex. Science, 293, 2425-2429.
         2. K. A. Norman, S. M. Polyn, G. J. Detre, and J. V. Haxby (2006). Beyond mind-reading:
            multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, in press.
         3. R. E. Schapire and Y. Singer (1999). Improved boosting algorithms using confidence-
            rated predictions. Machine Learning, 37(3), 297-336.
         4. P. Viola and M. Jones (2001). Rapid object detection using a boosted cascade of
            simple features. CVPR 2001.

      Joint work with Kenneth A. Norman, James V. Haxby and Robert E. Schapire.



Fast Online Classification with Support Vector Machines
Seyda Ertekin, Penn State University

      In recent years, we have witnessed a significant increase in the amount of data in digital
      format, due to the widespread use of computers and advances in storage systems. As the
      volume of digital information increases, people need more effective tools to find, filter
      and manage these resources. Classification, the assignment of instances (e.g. pictures,
      text documents, emails, Web sites) to one or more predefined categories based on their
      content, is an important component of many information organization and management
      tasks. Support Vector Machines (SVMs) are a popular machine learning algorithm for
      classification problems due to their theoretical foundation and good generalization
      performance. However, SVMs have not yet seen widespread adoption in communities
      working with very large datasets, owing to the high computational cost of solving the
      quadratic programming (QP) problem in the training phase. This research presents an
      online SVM learning algorithm, LASVM, which matches the classification accuracy of
      state-of-the-art SVM solvers while requiring fewer computational resources: it needs much
      less main memory and has a much faster training phase. We also show that not all
      examples in the training set are equally informative, and we present methods to select
      the most informative examples and exploit them to reduce the computational
      requirements of the learning algorithm, drawing on properties of active learning to select
      informative examples efficiently from very large-scale training sets. We will also show the
      benefits of using a non-convex loss function in SVMs for faster training and lower
      computational requirements.
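
      As one illustration of the idea (not necessarily LASVM's exact selection scheme), an
      active learner can scan a small random candidate pool and pick the example closest to
      the current decision boundary:

```python
import numpy as np

# Margin-based active selection sketch: among a random candidate pool,
# the example with the smallest |f(x)| is typically the most informative.

def select_informative(decision_fn, X_pool, rng, n_candidates=50):
    idx = rng.choice(len(X_pool), size=min(n_candidates, len(X_pool)),
                     replace=False)
    margins = np.abs(decision_fn(X_pool[idx]))
    return idx[np.argmin(margins)]          # index into X_pool

# usage sketch: i = select_informative(svm.decision_function, X_pool,
#                                      np.random.default_rng(0))
```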

Joint work with Leon Bottou, Antoine Bordes and Jason Weston.




Posters (Session 1)

Using Decision Trees for Gabor-based Texture Classification of Tissues in Computed
Tomography
Alia Bashir, DePaul University

      This research is aimed at developing an automated imaging system for the classification
      of tissues in CT images. Classifying tissues in CT scans using shape or gray-level
      information is challenging due to the changing shape of organs in a stack of images and
      the gray-level intensity overlap in soft tissues. However, healthy organs are expected to
      have a consistent texture within tissues across slices. Given a large enough set of normal-
      tissue images and a good set of texture features, machine learning techniques can be
      applied to create an automatic classifier. Previous work by one of the authors explored
      texture descriptors based on wavelets, ridgelets, and curvelets for the classification of
      tissues from normal chest and abdomen CT scans. These texture descriptors classified
      tissues with accuracies of 85-98%, with curvelet-based descriptors performing best. In
      this paper we bridge the gap to perfect accuracy by focusing on texture features based
      on a bank of Gabor filters.

      The approach consists of three steps: convolution of the regions of interest with a bank
      of 32 Gabor filters (4 frequencies and 8 orientations), extraction of two Gabor texture
      features per filter (mean and standard deviation), and creation of a classifier that
      automatically identifies the various tissues. The data set consists of 2D DICOM images
      from five normal chest and abdomen CT studies from Northwestern Medical Hospital. The
      following regions of interest were segmented out and labeled by an expert radiologist:
      liver, spleen, kidney, aorta, trabecular bone, lung, muscle, IP fat, and SQ fat, for a total of
      1112 images. For each image, the feature vector consists of the mean and standard
      deviation of the 32 filtered images, totaling 64 descriptors. The classification step is
      carried out using a Classification and Regression decision tree classifier. A decision tree
      predicts the class of an object (tissue) from the values of predictor variables (texture
      descriptors) and generates a set of decision rules; these rules are then used to classify
      each region of interest. Both cross-validation and a random split of the data into a
      training set (~65%) and a testing set (~35%) were applied, with no significant difference
      observed. The optimal tree had a depth of 20, with the parent node value set to 10 and
      the child node value set to 1.

      To evaluate the performance of each classifier, specificity, sensitivity, precision, and
      accuracy rates are calculated from each misclassification matrix. Results show that this
      set of texture features is able to perfectly classify the 9 regions of interest. The Gabor
      filters’ ability to isolate features at different scales and directions allows for a multi-
      resolution analysis of texture, essential when dealing with the at-times very subtle
      differences in the texture of tissues in CT scans. Given the strong performance in the
      classification of healthy tissues, we plan to apply Gabor texture features to the
      classification of abnormal tissues.
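
      A sketch of this pipeline is shown below; the frequency values are assumed for
      illustration (the abstract does not list them), and scikit-image/scikit-learn merely stand
      in for whatever tools were actually used.

```python
import numpy as np
from skimage.filters import gabor
from sklearn.tree import DecisionTreeClassifier

FREQS = [0.05, 0.1, 0.2, 0.4]               # assumed, not from the paper
THETAS = [k * np.pi / 8 for k in range(8)]  # 8 orientations

def gabor_features(roi):
    """Mean and std of each of the 32 filter responses: 64 descriptors.
    (Whether the magnitude or the real part was used is not specified;
    the magnitude is taken here.)"""
    feats = []
    for f in FREQS:
        for t in THETAS:
            real, imag = gabor(roi, frequency=f, theta=t)
            mag = np.hypot(real, imag)
            feats += [mag.mean(), mag.std()]
    return np.array(feats)

# X: feature vectors of the labeled regions of interest; y: tissue labels.
# clf = DecisionTreeClassifier().fit(X, y)
```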

      Joint work with Julie Hasemann and Lucia Dettori.



VOGUE: A Novel Variable Order-Gap State Machine for Modeling Sequences
Bouchra Bouqata, Rensselaer Polytechnic Institute (RPI)

      In this paper we present VOGUE, a new state machine that combines two separate
      techniques for modeling long range dependencies in sequential data: data mining and
      data modeling. VOGUE relies on a novel Variable-Gap Sequence mining method (VGS), to
      mine frequent patterns with different lengths and gaps between elements. It then uses
      these mined sequences to build the state machine. We applied VOGUE to the task of
      protein sequence classification on real data from the PROSITE protein families. We show
      that VOGUE yields significantly better scores than higher-order Hidden Markov Models.
      Moreover, we show that VOGUE’s classification sensitivity outperforms that of HMMER, a
      state-of-the-art method for protein classification.

       Joint work with Christopher Carothers, Boleslaw K. Szymanski and Mohammed J. Zaki.



GroZi: a Grocery Shopping Assistant for the Blind
Carolina Galleguillos, UC San Diego

       Grocery shopping is a common activity that people all over the world perform on a regular
       basis. Unfortunately, grocery stores and supermarkets are still largely inaccessible to
       people with visual impairments, as they are generally viewed as "high cost" customers. We
       propose to develop a computer vision based grocery shopping assistant based on a
       handheld device with haptic feedback that can detect different products inside of a store,
       thereby increasing the autonomy of blind (or low vision) people to perform grocery
       shopping.

       Our solution makes use of new computer vision techniques for the task of visual
       recognition of specific products inside of a store as specified in advance on a shopping list.
       These techniques can avail of complementary resources such as RFID, barcode scanning,
      and sighted guides. We also present a challenging new dataset of images consisting of
      different categories of grocery products that can be used for object recognition studies.

      Using the system consists of creating a shopping list followed by in-store navigation. To
      support list creation, we will develop a website, accessible to visually impaired people,
      that stores data and images of different products. The website will be augmented with
      new image templates from the community of users who shop with the device, in addition
      to images of the same product taken in different stores by different users. This will
      increase the system's ability to recognize products whose appearance changes for
      seasonal or promotional reasons. The navigational task includes finding the correct aisle
      for the products (based on text detection and character recognition), avoiding obstacles,
      finding products, and checking out.

      A typical grocery store carries around 30,000 items, so recognizing a single object is a
      nontrivial task. Assuming a shopping list is generally shorter than 1/1000th of this amount
      (i.e., fewer than 30 items), recognition can be constrained to two phases: detection of
      objects on a possibly cluttered shelf, and verification of the detected objects against the
      shopping list. For this task, we intend to use state-of-the-art object recognition algorithms
      and develop new approaches for fast identification.



Applications of Kernel Minimum Enclosing Ball
Cristina Garcia C., Universidad Central de Venezuela

      The minimum enclosing ball (MEB) is a well-studied problem in computational geometry.
      In this work we describe a generalization of a simple approximate MEB construction,
      introduced by M. Badoiu and K. L. Clarkson, to a feature-space MEB using the kernel trick.
      The simplicity of the methodology is itself surprising: the MEB algorithm is based only on
      geometric information extracted from a sample of data points, and just two parameters
      need to be tuned, the kernel constant and the tolerance on the radius of the
      approximation. The applicability of the method is demonstrated on anomaly detection
      and on less traditional scenarios such as 3D object modeling and path planning. Results
      are encouraging and show that even an approximate feature-space MEB is able to induce
      topology-preserving mappings on noisy data of arbitrary dimension as efficiently as other
      machine learning approaches.
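
      A minimal sketch of the Badoiu-Clarkson construction lifted to feature space, assuming
      the standard fixed 1/eps^2 iteration schedule (the authors' variant may differ in detail);
      the center is kept as a convex combination of the data, so only kernel evaluations are
      needed.

```python
import numpy as np

def kernel_meb(K, eps=0.1):
    """Approximate feature-space MEB from an (n, n) kernel Gram matrix K.
    Returns the center coefficients alpha and the approximate radius."""
    n = K.shape[0]
    alpha = np.zeros(n)
    alpha[0] = 1.0                           # start the center at one point
    diag = np.diag(K)

    def sq_dists(a):
        # ||phi(x_i) - c||^2 with c = sum_j a_j phi(x_j)
        return diag - 2.0 * (K @ a) + a @ K @ a

    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        far = int(np.argmax(sq_dists(alpha)))
        step = 1.0 / (i + 1)                 # move center toward farthest point
        alpha *= (1.0 - step)
        alpha[far] += step
    return alpha, float(np.sqrt(sq_dists(alpha).max()))
```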

      Joint work with Jose Ali Moreno.



Classification With Cumular Trees
Claudia Henry, Antilles-Guyane

      The combination of decision trees and linear separators has been shown to provide some
      of the best off-the-shelf classifiers. We describe a new type of such combination, which
      we call Cumular (Cumulative Linear) Trees. Cumular Trees are midway between Oblique
      Decision Trees and Alternating Decision Trees: more expressive than the former, and
      simpler than the latter. We provide an induction algorithm for Cumular Trees which is, as
      we show, a boosting algorithm in the original sense. Experiments against AdaBoost, C4.5
      and OC1 show very good results, especially when dealing with noisy data.

      Joint work with Richard Nock and Franck Nielsen.



Transient Memory in Reinforcement Learning: Why Forgetting Can be Good for You
Anna Koop, University of Alberta

      The vast majority of work in machine learning is concerned with algorithms that converge
      to a single solution. It is not clear that this is always the most appropriate aim. Consider a
      sailor adapting to the ship's motion. She may learn two conditional models: one for walking
      when at sea, and another for walking when on land. She may, when memory resources are
      limited, learn a best-on-average policy that settles on a compromise among all situations
      she has encountered. A more flexible approach might be to quickly adapt the walking
      policy to new situations, rather than seeking one final solution or set of solutions.

      We explore two cases of transient memory. In the first case, the rate at which individual
      parameters change is controlled by meta-parameters. These meta-parameters allow the
      agent to ignore irrelevant or random features, to converge where features are consistent
      throughout its experience, and otherwise to adapt quickly to changes in the environment.
      This approach requires no commitment to the number of parameter sets necessary in a
      given environment, while making the best use of available resources.
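
      One concrete instantiation of such per-parameter meta-parameters is Sutton's IDBD rule
      for linear prediction; the sketch below illustrates the idea and is not necessarily the
      authors' algorithm.

```python
import numpy as np

def idbd_step(w, h, log_alpha, x, target, theta=0.01):
    """One IDBD update: each weight w_i has its own step size
    exp(log_alpha_i), which grows where gradients correlate over time
    (consistent features) and shrinks on noisy, irrelevant ones."""
    err = target - w @ x
    log_alpha = log_alpha + theta * err * x * h
    alpha = np.exp(log_alpha)
    w = w + alpha * err * x                  # delta rule, per-weight rates
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * err * x
    return w, h, log_alpha
```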

      In the second case, a single solution is stored in long-term parameters, but this solution is
      used only as the starting point for learning about a specific situation. This is currently
      being applied to the game of Go. At the beginning of a game, the agent's value function
      parameters are initialized according to the long-term memory. During the course of a game
      these parameters are updated by simulating, from each state, thousands of self-play
      games. The short-term parameters learned in this way are used both for action selection
      and as the starting point for learning on the next turn, after the opponent has moved.
      Actual game-play moves are used to update both the short- and long-term memory. At the
      end of the game, the short-term memory is forgotten and the value function parameters
      are initialized to the long-term values. This allows the agent to store general knowledge in
      long-term memory while adapting quickly to the specific situations encountered in the
      current game.




Predicting Task-Specific Webpages for Revisiting
A. Twinkle E. Lettkeman, Oregon State University

      Most web browsers track the history of all pages visited, with the intuition that users are
      likely to want to return to pages that they have previously accessed. However, the history
      viewers in web browsers are ineffective for most users, because of the overwhelming glut
      of webpages that appear in the history. Not only does the history represent a potentially
      confusing interleaving of many of a user's different tasks, but it also includes many
      webpages that would provide minimal or no utility to the user if revisited. This paper
      reports on a technique used to dramatically reduce web browsing histories down to pages
      that are relevant to the user's current task context and have a high likelihood of being
      desirable to revisit. We briefly describe how the TaskTracer system maintains an awareness
      of a user's tasks and semi-automatically segments the web browsing history by task. We
      then present a technique that is used to predict whether webpages previously visited on a
      task will be of future value to the user and are worth displaying in the history user
      interface. Our approach uses a combination of heuristics and machine learning to evaluate
      the content of a page and interactions with the page to learn a predictive model of
      webpage relevance for each user task. We show the results of an empirical evaluation of
      this technique based on user data. This approach could be applied to systems that track
      webpage resources, to predict the future value of those resources and lower the user's
      cost of finding and reusing webpages. Our findings suggest that prediction of web
      pages is highly user- and task-specific, and that the choice of prediction algorithms is not
      obvious. In future work we aim to refine the features used to predict revisitability. We will
      analyze the effect of better text feature extraction in conjunction with user interest
      indicators such as reading time, scrolling behavior, and text selection. Preliminary analysis
      indicates that applying these refinements may increase the accuracy of our prediction
      models.

      Joint work with Simone Stumpf, Jed Irvine and Jonathan Herlocker.



Hyper-parameters auto-setting using regularization path for SVM
Gaëlle Loosli, INSA de Rouen

      In the context of classification tasks, Support Vector Machines are now very popular.
      However, their utilization by neophyte users is still hampered by the need to supply values
      for control parameters in order to get the best attainable results. Mainly, given clean data,
      SVM users must make three choices: the type of kernel, its bandwidth, and the
      regularization parameter. It would be convenient to provide users with a push-button SVM
      able to auto-set its parameters to the best possible values. This paper presents a new
      method that approaches this goal. Given the importance of this problem for reaping the
      full potential benefits of SVMs, much research has been dedicated to ways of helping set
      these parameters. Most rely either on outer measures, such as cross-validation, to guide
      the selection, or on measures embedded in the learning method itself. In place of
      empirical approaches to setting the control parameters, regularization paths have been
      proposed and widely studied in recent years, since they provide a smart and fast way to
      access all the optimal solutions of a problem, across all compromises between bias and
      variance in regression, or between bias and regularity in classification. For instance, in
      the classification setting studied in this paper, soft-margin SVMs deal with non-separable
      problems thanks to slack variables parametrized by a slack trade-off (usually denoted C;
      this is the regularization parameter). Within the usual formulation of the soft-margin
      SVM, this trade-off takes values between 0 (random) and infinity (hard margin). The
      nu-SVM technique reformulates the SVM problem so that C is replaced by a nu parameter
      taking values in [0,1]. This normalized parameter has a more intuitive meaning: it
      represents the minimal proportion of support vectors in the solution and the maximal
      proportion of misclassified points.
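
      For illustration, the nu parameterization is exposed by modern libraries; a hypothetical
      usage with scikit-learn's NuSVC (a tool choice of ours, not the paper's):

```python
from sklearn.svm import NuSVC

# nu in (0, 1] lower-bounds the fraction of support vectors and
# upper-bounds the fraction of margin errors; values are illustrative.
clf = NuSVC(nu=0.2, kernel="rbf", gamma=0.5)
# clf.fit(X_train, y_train)
```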

      However, having the whole regularization path is not enough: the end user still needs to
      retrieve from it the best values for the regularization parameters. Instead of selecting
      these values by k-fold cross-validation, leave-one-out, or other approximations, we
      propose to include the leave-one-out estimator inside the regularization path in order to
      have an estimate of the generalization error at each step. We explain why this is less
      expensive than selecting the best parameter a posteriori, and give a method to stop
      learning before reaching the end of the path, saving useless effort. Contrary to what is
      usually done for regularization paths, our method does not start with all points as support
      vectors; in doing so we avoid computing the whole Gram matrix at the first step. Then,
      since the proposed method stops on the path, this extreme non-sparse solution is never
      attained and the whole Gram matrix is never required. One of the main advantages is
      that this setting can be used for large databases.



The Influence of Ranker Quality on Rank Aggregation Algorithms
Brandeis Marshall, Rensselaer Polytechnic Institute

       The rank aggregation problem has been studied extensively in recent years with a focus
       on how to combine several different rankers to obtain a consensus aggregate ranker. We
       study the rank aggregation problem from a different perspective: how the individual input
       rankers impact the performance of the aggregate ranker. We develop a general statistical
       framework based on a model of how the individual rankers depend on the ground truth
       ranker. Within this framework, one can study the performance of different aggregation
       methods. The individual rankers, which are the inputs to the rank aggregation algorithm,
      are statistical perturbations of the ground truth ranker. Through rigorous experimental
      evaluation, we study how the noise level and misinformation of the rankers affect the
      performance of the aggregate ranker. We introduce and study a novel Kendall-tau rank
       aggregator and a simple aggregator called PrOpt, which we compare to some other well
       known rank aggregation algorithms such as average, median and Markov chain
       aggregators. Our results show that the relative performance of aggregators varies
       considerably depending on how the input rankers relate to the ground truth.
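
      A sketch of the simple average and median aggregators, with a Kendall-tau distance for
      comparing an aggregate against the ground truth; the PrOpt and Markov chain
      aggregators are not reproduced here.

```python
import numpy as np
from itertools import combinations

def aggregate(rank_lists, how="average"):
    """rank_lists[j][i] = rank of item i under ranker j (m x n array).
    Returns the aggregate ordering, best item first."""
    stat = np.mean if how == "average" else np.median
    return np.argsort(stat(np.asarray(rank_lists), axis=0))

def kendall_tau_distance(order_a, order_b):
    """Number of item pairs on which the two orderings disagree."""
    pos_a = {item: r for r, item in enumerate(order_a)}
    pos_b = {item: r for r, item in enumerate(order_b)}
    return sum((pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) < 0
               for u, v in combinations(list(order_a), 2))
```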

       Joint work with Sibel Adali and Malik Magdon-Ismail.



Learning for Route Planning under Uncertainty
Evdokia Nikolova, Massachusetts Institute of Technology

       We present new complexity results and efficient algorithms for optimal route planning in
       the presence of uncertainty. We employ a decision theoretic framework for defining the
       optimal route: for a given source S and destination T in the graph, we seek an ST-path of
       lowest expected cost where the edge travel times are random variables and the cost is a
       nonlinear function of total travel time. Although this is a natural model for route planning
       on real-world road networks, results are sparse due to the analytic difficulty of finding
       closed form expressions for the expected cost, as well as the computational/combinatorial
       difficulty of efficiently finding an optimal path, which minimizes the expected cost.

       We identify a family of appropriate cost models and travel time distributions that are
       closed under convolution and physically valid. We obtain hardness results for routing
       problems with a given start time and cost functions with a global minimum, in a variety of
       deterministic and stochastic settings. In general the global cost is not separable into edge
      costs, precluding classic shortest-path approaches. However, using partial minimization
      techniques, we exhibit an efficient solution via dynamic programming with low polynomial
       complexity.

      We then consider an important special case of the problem, in which the goal is to
      maximize the probability that the path length does not exceed a given threshold value
      (deadline). We give a surprising exact n^{Θ(log n)} algorithm for the case of normally
      distributed edge lengths, which is based on quasi-convex maximization. We then prove
      average and smoothed polynomial bounds for this algorithm, which also translate to
      average and smoothed bounds for the parametric shortest path problem, and extend to a
      more general non-convex optimization setting. We also consider a number of other edge-
      length distributions, giving a range of exact and approximation schemes.
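
      For a fixed path with independent normally distributed edge lengths, the on-time
      probability being maximized has a closed form, since the total travel time is itself
      normal; a small sketch with illustrative numbers:

```python
from math import sqrt
from scipy.stats import norm

def on_time_probability(edge_means, edge_vars, deadline):
    """P(total path length <= deadline) for independent normal edges."""
    mu = sum(edge_means)
    sigma = sqrt(sum(edge_vars))
    return norm.cdf((deadline - mu) / sigma)

# e.g. a 3-edge path with mean 30 and variance 14:
print(on_time_probability([10, 15, 5], [4, 9, 1], deadline=35))
```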

      Our offline algorithms can be adapted to give online learning algorithms via the Kalai-
      Vempala approach of converting an offline optimization solution into an efficient online
      one.

       Joint work with Matthew Brand, David Karger, Jonathan Kelner and Michael Mitzenmacher.



A Neurocomputational Model of Impaired Imitation
Biljana Petreska, Ecole Polytechnique Federale de Lausanne

      This abstract addresses the question of human imitation through convergent evidence
      from neuroscience, using tools from machine learning. In particular, we consider a deficit
      in the imitation of meaningless gestures (i.e., hand postures relative to the head)
      following callosal brain lesion (i.e., disconnected hemispheres). We base our work on the
      rationale that looking at how imitation is impaired in apraxic patients can unveil its
      underlying neural principles. We ground the functional architecture and information flow
      of our model in brain imaging studies. Finally, findings from monkey neurophysiology
      studies drive the choice of implementation of our processing modules. Our
      neurocomputational model of visuo-motor imitation is based on self-organizing maps
      receiving sensory input (i.e., visual, tactile or proprioceptive) with associated activities
      [1]. We train the connections between the maps with anti-Hebbian learning to account
      for the transformations required to map the observed visual stimulus to the
      corresponding tactile and proprioceptive information that will guide the imitative gesture.
      Patterns of impairment of the model, realized by adding uncertainty to the transfer of
      information between the networks, reproduce the deficits found in a clinical examination
      of visuo-motor imitation of meaningless gestures [2]. The model makes hypotheses about
      the type of representation used and the neural mechanisms underlying human
      visuo-motor imitation. It also helps us better understand the occurrence and nature of
      imitation errors in patients with brain lesions.
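
      A minimal sketch of an anti-Hebbian connection update between two map activations,
      with the lesion modeled as noise added to inter-map transfer; the learning rate and
      noise model are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def anti_hebbian_update(W, x, y, eta=0.01):
    """Weaken connections between co-active units: dW = -eta * y x^T."""
    return W - eta * np.outer(y, x)

def lesioned_transfer(W, x, noise_std, rng):
    """Model callosal damage as noise in the inter-map transmission."""
    return W @ x + rng.normal(0.0, noise_std, size=W.shape[0])
```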

      [1] B. Petreska and A. G. Billard. A Neurocomputational Model of an Imitation Deficit
      following Brain Lesion. In Proceedings of the 16th International Conference on Artificial
      Neural Networks (ICANN 2006), Athens, Greece. To appear.

       [2] G. Goldenberg, K. Laimgruber, and J. Hermsdörfer. Imitation of gestures by
       disconnected hemispheres. Neuropsychologia, 39:1432–1443, 2001.

       Joint work with A. G. Billard.




Bayesian Estimation for Autonomous Object Manipulation Based on Tactile Sensors
Anya Petrovskaya, Stanford University

      We consider the problem of autonomously estimating the position and orientation of an
      object from tactile data. When the initial uncertainty is high, estimating all six parameters
      precisely is computationally expensive. We propose an efficient Bayesian approach that is
      able to estimate all six parameters in both unimodal and multimodal scenarios. The
      approach is termed Scaling Series sampling, as it estimates the solution region by
      sampling. It performs the search using a series of successive refinements, gradually
      scaling the precision from low to high. Our approach can be applied to a wide range of
      manipulation tasks. We demonstrate its portability on two applications: (1) manipulating
      a box and (2) grasping a door handle.
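
      A generic sketch of a scaling-series-style refinement loop, with a placeholder
      measurement model and an assumed precision schedule; the actual Scaling Series
      algorithm differs in its details.

```python
import numpy as np

def scaling_series(log_lik, init_samples, scales, keep_frac=0.2, rng=None):
    """Successively refine pose samples (n x 6: position + orientation),
    keeping the high-likelihood region and resampling at finer scales."""
    rng = rng or np.random.default_rng(0)
    samples = init_samples
    for scale in scales:                     # e.g. [1.0, 0.5, 0.25, 0.125]
        ll = np.array([log_lik(s) for s in samples])
        k = max(1, int(len(samples) * keep_frac))
        keep = samples[np.argsort(ll)[-k:]]  # survivors of this round
        parents = keep[rng.integers(len(keep), size=len(samples))]
        samples = parents + rng.normal(0.0, scale, size=parents.shape)
    return samples
```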

      Joint work with Oussama Khatib, Sebastian Thrun, and Andrew Y. Ng.



Therapist Robot Behavior Adaptation for Post-stroke Rehabilitation Therapy
Adriana Tapus, University of Southern California

      Research into Human-Robot Interaction (HRI) for socially assistive applications is in its
      infancy. Socially assistive robotics, which focuses on the social interaction, rather than the
      physical interaction between the robot and the human user, has the potential to enhance
      the quality of life for large populations of users. Post-stroke rehabilitation is one of the
      largest potential application domains, since stroke is a dominant cause of severe disability
      in the growing ageing population. In the US alone, over 750,000 people suffer a new stroke
      each year, with the majority sustaining some permanent loss of movement [Institute06].
      This loss of function, termed "learned disuse", can improve with rehabilitation therapy
      during the critical post-stroke period. One of the most important elements of any
      rehabilitation program is carefully directed, well-focused and repetitive practice of
      exercises, which can be passive and active.

      Our work focuses on hands-off therapist robots that assist, encourage, and socially interact
      with patients during their active exercises. Our previous research demonstrated, through
      real world experiments with stroke patients [Tapus06b, Eriksson05, Gockley06], that the
      physical embodiment (including shared physical context and physical movement of the
      robot), the encouragements, and the monitoring play key roles in patient compliance with
      rehabilitation exercises.

      In the current work we investigate the role of the robot’s personality in the hands-off
      therapy process. We focus on the relationship between the level of
      extroversion/introversion (as defined in Eysenck Model of personality [Eysenck91]) of the
      robot and the user, addressing the following research questions: 1. How should we model
      the behavior and encouragement of the therapist robot as a function of the personality of
      the user and the number of exercises performed? 2. Is there a relationship between the
      extroversion-introversion personality spectrum based on the Eysenck model and the
      challenge based vs. nurturing style of patient encouragement?

      To date, little research into human-robot personality matching has been performed. Some
      of our recent results showed the preference for personality matching between users and
      socially assistive robots [Tapus06a]. Our therapist robot behavior adaptation system
      monitors the number of exercises/minute performed by the human/patient, indicating the
      level of engagement and/or fatigue, and changes the robot’s behavior in order to
      maximize this level. The socially assistive therapist robot (see Figure 1) is equipped with a
      basis set of behaviors that will explicitly express its desires and intentions in a physical and
      verbal way that is observable to the user/patient. These behaviors involve the control of
      physical distance, gestural expression, and verbal expression (tone and content). The
      number of exercises per minute is therefore used as the reward signal that the system
      seeks to maximize.

      Hands-off robotic post-stroke rehabilitation therapy holds great promise for improving
      patient compliance in the recovery program. Our work aims toward developing and
      testing a model of compatibility between human and robot personality in the assistive
      context, based on the PEN theory of personality, and toward building a customized
      therapy protocol. Examining and answering these questions will begin to address the role
      of assistive robot personality in enhancing patient compliance.

       [Eriksson05] Eriksson, J., Matarić, M., J., and Winstein, C. "Hands-off assistive robotics for
       post-stroke arm rehabilitation", In Proceedings of the International Conference on
       Rehabilitation Robotics (ICORR-05), Chicago, Illinois, June 2005.

       [Eysenck91] Eysenck, H., J. "Dimensions of personality: 16, 5 or 3? Criteria for a taxonomic
       paradigm", In Personality and individual differences, vol. 12, pp.773-790, 1991.

       [Gockley06] Gockley, R., and Matarić, M., J. "Encouraging Physical Therapy Compliance
       with a Hands-Off Mobile Robot", In Proceedings of the First International Conference on
       Human Robot Interaction (HRI-06), Salt Lake City, Utah, March 2006.

      [Institute06] "Post-Stroke Rehabilitation Fact Sheet", National Institute of Neurological
      Disorders and Stroke, January 2006.

       [Tapus06a] Tapus, A. and Matarić, M., J. (2006) "User Personality Matching with Hands-Off
       Robot for Post-Stroke Rehabilitation Therapy", In Proceedings of the 10th International
       Symposium on Experimental Robotics (ISER), Rio de Janeiro, Brazil, July 2006.

       [Tapus06b] Tapus, A. and Matarić, M., J. (2006) "Towards Socially Assistive Robotics",
       International Journal of the Robotics Society of Japan (JRSJ), 24(5), pp. 576- 578, July, 2006.

       Joint work with Maja J. Matarić.



Learning How To Teach
Cynthia Taylor, University of California, San Diego

       The goal of the RUBI project is to develop a social robot (RUBI) that can interact with
       children and teach them in an autonomous manner. As part of the project we are currently
       focusing on the problem of teaching 18-24 month old children skills targeted by the
       California Department of Education as appropriate for this age group.

       In particular we are focusing on teaching the children to identify objects, shapes and
       colors. We have seven RFID-tagged stuffed toys, in the shapes of common objects like a
       slice of watermelon or a waffle. RUBI says the name of the object and shows a picture of it
       on her touch screen, and the children hand her a toy, which she identifies as correct or
       incorrect. She keeps track of the right and wrong answers for each toy.

      RUBI has a touch screen on her stomach that she can use to play short videos and games
      with the children. By recording when the children touch her stomach, the screen also
      provides important information about whether or not the children are engaged. She has
      two Apple iSight cameras for eyes, and runs machine learning software that lets her
      detect both faces and smiles. The smile detection lets her gauge people’s moods during
      social interaction, and respond accordingly. She has an RFID reader in her right hand,
      letting her identify RFID-tagged toys.

      The machine learning aspect of this problem is how to use the information from her
      perceptual primitives to teach the material in an effective manner. After each
      question/answer, RUBI has to decide whether to continue playing the current learning
      game or switch to another activity, and, if she continues, which question to ask next. She
      also has to decide what to do when she asks a question and does not get an answer for a
      long period of time. Unlike many standard AI problems such as chess, RUBI works in
      continuous time, with no discrete turns.

      We are approaching the problem from the point of view of control theory. Exact solutions
      to the optimal teaching problem exist for some simple models of learning, such as the
      Atkinson and Bower learning model. We plan to find approximate solutions to this control
      problem using reinforcement learning methods. We will complement formal and
      computational analysis with an ethnographic study of how human teachers teach
      children the same task, focusing on both timing and the sources of information teachers
      use to adapt their teaching strategies.

      Joint work with Paul Ruvolo, Ian Fasel, Javier R. Movellan.



Strategies for improving face recognition in video using machine learning methods
Deborah Thomas, University of Notre Dame

      Surveillance cameras are a common feature in many stores and public places, and there
      are many law-enforcement applications for face recognition from video streams.
      However, while face recognition from high-quality still images has been very successful,
      face recognition from video is a relatively new area with huge room for improvement.
      When using video as our data, we can exploit the fact that there are multiple frames to
      choose from to improve recognition performance: instead of representing subjects by a
      single high-quality image, we can represent them by a set of frames chosen from a video
      clip. We want to select as many distinct frames per individual as possible, which allows
      for diversity in the training space and thereby improves the generalization capacity of
      the learned face recognition classifier.

      In this work, we consider two different approaches, both built on Principal Component
      Analysis (PCA). Given the high dimensionality of the data, PCA is often warranted, not
      only to reduce the number of dimensions but also to construct more independent
      dimensions. In our first approach, we use a nearest neighbor algorithm with the
      Mahalanobis Cosine (MahCosine) distance measure. A pair of images in which the faces
      differ in pose and expression will have a larger MahCosine distance between them, so we
      can use this distance as a measure of difference between frames. In the second
      approach, we project the images into PCA space and then use K-means clustering to
      group all the frames from one subject, picking one image per cluster to make up the
      representation set. Here again, images that are similar to each other will fall in the same
      cluster, while more dissimilar images will fall in different clusters. Beyond the difference
      between frames, we also incorporate a quality metric of the face when picking frames,
      which yields a higher recognition rate.
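
      A sketch of the second approach, assuming scikit-learn for PCA and K-means; the
      number of clusters and the nearest-to-centroid rule for choosing representatives are
      illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def select_frames(frames, n_frames=10):
    """frames: (n, d) vectorized face images from one subject's video.
    Returns indices of one representative frame per cluster."""
    Z = PCA(n_components=min(50, len(frames) - 1)).fit_transform(frames)
    km = KMeans(n_clusters=n_frames, n_init=10).fit(Z)
    reps = []
    for c in range(n_frames):
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(Z[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(d)])
    return np.array(reps)
```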

      We demonstrate our approach on two different datasets. First, we compare our approach
      to that used by Lee et al. in 2003 (Video-based Face Recognition Using Appearance
      Manifolds) and 2005 (Visual Tracking and Recognition using Probabilistic Appearance
      Manifolds), who represent their subjects with appearance manifolds and use planes in
      PCA space for the different poses. We show that our approach performs


Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mais de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Machine Learning: Theory, Applications, Experiences

Schedule (continued)

     16:15 Spotlight talks:

            Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent Traces
            Gita Sukthankar, Carnegie Mellon University

            An Online Learning System for the Prediction of Electricity Distribution Feeder Failures
            Hila Becker, Columbia University

            Classification of fMRI Images: An Approach Using Viola-Jones Features
            Melissa K. Carroll, Princeton University

            Fast Online Classification with Support Vector Machines
            Seyda Ertekin, Penn State University

     16:30 Poster session 2

     17:15 Open discussion

     17:45 Closing remarks and poster take-down

     18:00 End of workshop
Invited Talks

A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
Amy Greenwald, Brown University

No-regret learning algorithms have attracted a great deal of attention in the game-theoretic and machine learning communities. Whereas rational agents act so as to maximize their expected utilities, no-regret learners are boundedly rational agents that act so as to minimize their "regret". In this talk, we discuss the behavior of no-regret learning algorithms in repeated games. Specifically, we introduce a general class of algorithms called no-Φ-regret learning, which includes common variants of no-regret learning such as no-external-regret and no-internal-regret learning. Analogously, we introduce a class of game-theoretic equilibria called Φ-equilibria. We show that no-Φ-regret learning algorithms converge to Φ-equilibria. In particular, no-external-regret learning converges to minimax equilibrium in zero-sum games, and no-internal-regret learning converges to correlated equilibrium in general-sum games. Although our class of no-regret algorithms is quite extensive, no algorithm in this class learns Nash equilibrium.

Speaker biography: Dr. Amy Greenwald is an assistant professor of computer science at Brown University in Providence, Rhode Island. Her primary research area is the study of economic interactions among computational agents. Her primary methodologies are game-theoretic analysis and simulation. Her work is applicable in areas ranging from dynamic pricing to autonomous bidding to transportation planning and scheduling. She was awarded a Sloan Fellowship in 2006; she was nominated for the 2002 Presidential Early Career Award for Scientists and Engineers (PECASE); and she was named one of the Computing Research Association's Digital Government Fellows in 2001. Before joining the faculty at Brown, Dr. Greenwald was employed by IBM's T.J. Watson Research Center, where she researched information economies. Her paper "Shopbots and Pricebots" (joint work with Jeff Kephart) was named Best Paper at IBM Research in 2000.
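The convergence claim for zero-sum games is easy to see empirically. Below is a minimal, illustrative sketch (not code from the talk) of no-external-regret learning via multiplicative weights (Hedge) in self-play on matching pennies; the step size eta, horizon T, and asymmetric initialization are arbitrary choices of ours.

    import numpy as np

    def hedge_play(payoff, T=5000, eta=0.05):
        """Two Hedge learners play a repeated zero-sum matrix game.
        payoff[i, j] is the row player's payoff; the column player gets
        -payoff[i, j]. Returns the row player's average external regret
        and both players' empirical average strategies."""
        n, m = payoff.shape
        w_row = np.linspace(1.0, 2.0, n)   # asymmetric start, away from equilibrium
        w_col = np.linspace(2.0, 1.0, m)
        cum_row = np.zeros(n)              # cumulative payoff of each fixed row action
        realized = 0.0                     # row player's cumulative expected payoff
        avg_row, avg_col = np.zeros(n), np.zeros(m)
        for _ in range(T):
            p = w_row / w_row.sum()        # current mixed strategies
            q = w_col / w_col.sum()
            avg_row += p / T
            avg_col += q / T
            u_row = payoff @ q             # expected payoff of each row action vs q
            u_col = -(p @ payoff)          # column player's expected payoffs
            realized += p @ u_row
            cum_row += u_row
            w_row *= np.exp(eta * u_row)   # multiplicative-weights update
            w_col *= np.exp(eta * u_col)
        return (cum_row.max() - realized) / T, avg_row, avg_col

    # Matching pennies: the unique minimax equilibrium is (1/2, 1/2) for each player.
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    regret, p_bar, q_bar = hedge_play(A)
    print(regret)          # external regret shrinks toward 0
    print(p_bar, q_bar)    # average strategies approach [0.5, 0.5]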
Clustering High-Dimensional Data
Jennifer Dy, Northeastern University

Creating effective algorithms for unsupervised learning is important because vast amounts of data preclude humans from manually labeling the categories of each instance. In addition, human labeling is expensive and subjective. Therefore, a majority of existing data is unsupervised (unlabeled). The goal of unsupervised learning or cluster analysis is to group "similar" objects together. "Similarity" is typically defined by a metric or a probability model. These measures are highly dependent on the features representing the data. Many clustering algorithms assume that relevant features have been determined by domain experts, but not all features are important, and many clustering algorithms fail when dealing with high dimensions. We present two approaches for dealing with clustering in high-dimensional spaces: 1. feature selection for clustering, through Gaussian mixtures and the maximum likelihood and scatter separability criteria; and 2. hierarchical feature transformation and clustering, through automated hierarchical mixtures of probabilistic principal component analyzers.

Speaker biography: Dr. Jennifer G. Dy has been an assistant professor in the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, since 2002. She obtained her MS and PhD in 1997 and 2001 respectively from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, and her BS degree in 1993 from the Department of Electrical Engineering, University of the Philippines. She received an NSF CAREER award in 2004. She has been an editorial board member for the journal Machine Learning since 2004, and was publications chair for the International Conference on Machine Learning in 2004. Her research interests include machine learning, data mining, statistical pattern recognition, and computer vision.

Recent advances in near-neighbor learning
Maya R. Gupta, University of Washington

Recent advances in nearest-neighbor learning are shown for adaptive neighborhood definitions, neighborhood weighting, and estimation given nearest neighbors. In particular, it is shown that weights that solve linear interpolation equations minimize the first-order learning error, and this is coupled with the principle of maximum entropy to create a flexible weighting approach. Different approaches to adaptive neighborhoods are contrasted, the focus being on neighborhoods that form a convex hull around the test point. Standard weighted nearest-neighbor estimation is shown to maximize likelihood, and it is shown that minimizing expected Bregman divergence instead leads to optimal solutions in terms of expected misclassification cost. Applications may include the testing of pipeline integrity, custom color enhancements, and estimation for color management.

Speaker biography: Maya Gupta completed her Ph.D. in Electrical Engineering in 2003 at Stanford University as a National Science Foundation Graduate Fellow. Her undergraduate studies led to a BS in Electrical Engineering and a BA in Economics from Rice University in 1997. From 1999 to 2003 she worked for Ricoh's California Research Center as a color image processing research engineer. In the fall of 2003, she joined the EE faculty of the University of Washington as an Assistant Professor, where she also serves as an Adjunct Assistant Professor of Applied Mathematics. More information about her research is available at her group's webpage: idl.ee.washington.edu.
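As a concrete illustration of the interpolation-weight idea, here is a small sketch (an assumption-laden reading of the abstract, not the speaker's code): choose neighbor weights that solve the linear interpolation equations (the weighted neighbors reproduce the test point, and the weights sum to one), taking the minimum-norm solution when the system is underdetermined, and predict with the weighted average of the neighbor labels. Such weights reproduce any linear target exactly, which is the first-order-error point.

    import numpy as np

    def interpolation_weights(neighbors, x):
        """Weights w solving sum_i w_i * neighbors[i] = x and sum_i w_i = 1.
        neighbors: (k, d) array of the k nearest neighbors of x.
        Returns the least-squares / minimum-norm solution of the
        (possibly under- or over-determined) interpolation equations."""
        k, d = neighbors.shape
        # Stack the d interpolation equations with the sum-to-one constraint.
        A = np.vstack([neighbors.T, np.ones((1, k))])   # (d + 1, k)
        b = np.append(x, 1.0)                           # (d + 1,)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)
        return w

    def predict(neighbors, labels, x):
        """Weighted nearest-neighbor estimate using interpolation weights."""
        return interpolation_weights(neighbors, x) @ labels

    # Toy usage: 4 neighbors in the plane, linear target y = 2*x0 + 3*x1.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = 2 * X[:, 0] + 3 * X[:, 1]
    print(predict(X, y, np.array([0.25, 0.5])))  # -> 2.0 exactly: linear targets are recovered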
Modeling and Learning User Preferences for Sets of Objects
Marie desJardins, University of Maryland, Baltimore County

Most work on preference learning has focused on pairwise preferences or rankings over individual items. In many application domains, however, when a set of items is presented together, the individual items can interact in ways that increase (via complementarity) or decrease (via redundancy or incompatibility) the quality of the set as a whole. In this talk, I will describe the DD-PREF language that we have developed for specifying set-based preferences. One problem with such a language is that it may be difficult for users to explicitly specify their preferences quantitatively. Therefore, we have also developed an approach for learning these preferences. Our learning method takes as input a collection of positive examples: one or more sets that have been identified by a user as desirable. Kernel density estimation is used to estimate the value function for individual items, and the desired set diversity is estimated from the average set diversity observed in the collection. Since this is a new learning problem, I will also describe our new evaluation methodology and give experimental results of the learning method on two data collections: synthetic blocks-world data and a new real-world music data collection.

Joint work with Eric Eaton and Kiri L. Wagstaff.

Speaker biography: Dr. Marie desJardins is an assistant professor in the Department of Computer Science and Electrical Engineering at the University of Maryland, Baltimore County. Prior to joining the faculty in 2001, Dr. desJardins was a senior computer scientist at SRI International in Menlo Park, California. Her research is in artificial intelligence, focusing on the areas of machine learning, multi-agent systems, planning, interactive AI techniques, information management, reasoning with uncertainty, and decision theory.
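To make the learning step concrete, here is a rough sketch under stated assumptions (the actual DD-PREF scoring function differs in its details, and the mixing weight alpha below is our invention): per-item values are estimated with a Gaussian kernel density over items from the user's example sets, the desired diversity is taken as the average pairwise distance within those sets, and a candidate set is scored by combining its average item value with how closely its diversity matches the target.

    import numpy as np

    def kde_value(item, examples, bandwidth=0.5):
        """Per-item value: Gaussian KDE over items seen in desirable sets."""
        d2 = np.sum((examples - item) ** 2, axis=1)
        return np.mean(np.exp(-d2 / (2 * bandwidth ** 2)))

    def diversity(S):
        """Average pairwise Euclidean distance within a set of item vectors."""
        n = len(S)
        dists = [np.linalg.norm(S[i] - S[j]) for i in range(n) for j in range(i + 1, n)]
        return np.mean(dists)

    def score_set(candidate, example_sets, alpha=0.5):
        """Trade off item quality against matching the user's preferred
        diversity; alpha is a hypothetical mixing weight, not a DD-PREF
        parameter."""
        all_items = np.vstack(example_sets)
        target_div = np.mean([diversity(S) for S in example_sets])
        quality = np.mean([kde_value(x, all_items) for x in candidate])
        div_match = np.exp(-abs(diversity(candidate) - target_div))
        return alpha * quality + (1 - alpha) * div_match

    sets = [np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([[0.2, 0.1], [0.9, 1.1]])]
    cand = np.array([[0.1, 0.0], [1.0, 0.9]])
    print(round(score_set(cand, sets), 3))  # higher scores indicate better sets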
SRL: Statistical Relational Learning
Lise Getoor, University of Maryland, College Park

A key challenge for machine learning is mining richly structured datasets describing objects, their properties, and links among the objects. We would like to be able to learn models that capture both the underlying uncertainty and the logical relationships in the domain. Links among the objects may exhibit patterns that are helpful for many practical inference tasks but are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records. Statistical Relational Learning (SRL) is a newly emerging research area that attempts to represent, reason and learn in domains with complex relational and rich probabilistic structure. In this talk, I'll begin with a short SRL overview. Then, I'll describe some of my group's recent work, including our work on entity resolution in relational domains.

Joint work with Indrajit Bhattacharya, Mustafa Bilgic, Louis Licamele and Prithviraj Sen.

Speaker biography: Prof. Lise Getoor is an assistant professor in the Computer Science Department at the University of Maryland, College Park. She received her PhD from Stanford University in 2001. Her current work includes research on link mining, statistical relational learning and representing uncertainty in structured and semi-structured data. Her work in these areas has been supported by NSF, NGA, KDD, ARL and DARPA. In June 2006, she co-organized the 4th in a series of successful workshops on statistical relational learning, http://www.cs.umd.edu/srl2006. She has published numerous articles in machine learning, data mining, database and AI forums. She is a member of the AAAI Executive Council, is on the editorial boards of the Machine Learning Journal and JAIR, and has served on numerous program committees including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.

Talks

On a Theory of Learning with Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University

Kernel functions have become an extremely popular tool in machine learning. They have an attractive theory that describes a kernel function as being good for a given learning problem if data is separable by a large margin in a (possibly very high-dimensional) implicit space defined by the kernel. This theory, however, has a bit of a disconnect with the intuition of a good kernel as a good similarity function. In this work we develop an alternative theory of learning with similarity functions more generally (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semi-definite. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition. In this way, we provide the first steps towards a theory of kernels that describes the effectiveness of a given kernel function in terms of natural similarity-based properties.

Joint work with Avrim Blum.
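One way to picture learning with a general similarity function, loosely in the spirit of this line of work (the details below are illustrative assumptions of ours, not the paper's construction or guarantees): map each example to its vector of similarities to a few randomly drawn "landmark" examples, then train an ordinary linear classifier in that explicit space. No positive semi-definiteness is needed, since the similarity is used only as a feature map.

    import numpy as np

    def similarity(a, b, gamma=1.0):
        # Any bounded similarity works here; it need not be a PSD kernel.
        return np.exp(-gamma * np.abs(a - b).sum())

    def landmark_features(X, landmarks):
        """Explicit feature map: similarities of each example to each landmark."""
        return np.array([[similarity(x, l) for l in landmarks] for x in X])

    def train_linear(F, y, lr=0.1, epochs=200):
        """Plain perceptron-style training on the similarity features."""
        w = np.zeros(F.shape[1])
        for _ in range(epochs):
            for f, label in zip(F, y):
                if label * (w @ f) <= 0:
                    w += lr * label * f
        return w

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # a nonlinear concept
    landmarks = X[rng.choice(len(X), size=20, replace=False)]
    F = landmark_features(X, landmarks)
    w = train_linear(F, y)
    print(np.mean(np.sign(F @ w) == y))               # training accuracy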
Matrix Tile Analysis
Inmar Givoni, University of Toronto

Many tasks require finding groups of elements in a matrix of numbers, symbols or class likelihoods. One approach is to use efficient bi- or tri-linear factorization techniques, including PCA, ICA, sparse matrix factorization and plaid analysis. These techniques are not appropriate when addition and multiplication of matrix elements are not sensibly defined. More directly, methods like bi-clustering can be used to classify matrix elements, but these methods make the overly restrictive assumption that the class of each element is a function of a row class and a column class. We introduce a general computational problem, "matrix tile analysis" (MTA), which consists of decomposing a matrix into a set of non-overlapping tiles, each of which is defined by a subset of usually nonadjacent rows and columns. MTA does not require an algebra for combining tiles, but must search over an exponential number of discrete combinations of tile assignments. We describe a loopy BP (sum-product) algorithm and an ICM algorithm for performing MTA. We compare the effectiveness of these methods to PCA and the plaid method on hundreds of randomly generated tasks. Using double-gene-knockout data, we show that MTA finds groups of interacting yeast genes that have biologically related functions.

Joint work with Vincent Cheung and Brendan J. Frey.

Towards Bayesian Black Box Learning Systems
Jo-Anne Ting, University of Southern California

A long-standing dream of machine learning is to create black box learning systems that can operate autonomously in home, research and industrial applications. While it is well understood that a universal black box may not be possible, significant progress can be made in specific domains. In particular, we address learning problems in sensor-rich and data-rich environments, as provided by autonomous vehicles, surveillance systems, and biological or robotic systems. In these scenarios, the input data has hundreds or thousands of dimensions and is used to make predictions (often in real time), resulting in a learning system that learns to "understand" the environment. The goal of machine learning in this domain is to devise algorithms that can efficiently deal with very high-dimensional data, usually contaminated by noise, redundancy and irrelevant dimensions. These algorithms must learn nonlinear functions, potentially in an incremental and real-time fashion, for robust classification and regression. In order to achieve black box quality, manual tuning parameters (e.g., as in gradient descent or structure selection) need to be minimized or, ideally, avoided. Bayesian inference, when combined with approximation methods to reduce computational complexity, suggests a promising route to achieve our goals, since it offers a principled way to eliminate open parameters.

In past work, we have started to create a toolbox of methods to achieve our goal of black box learning. In (Ting et al., NIPS 2005), we introduced a Bayesian approach to linear regression. The novelty of this algorithm comes from a Bayesian and EM-like formulation of linear regression that robustly performs automatic feature detection in the inputs in a computationally efficient way. We applied this algorithm to the analysis of neuroscientific data (i.e., the problem of predicting electromyographic (EMG) activity in the arm muscles of a monkey from spiking activity of neurons in the primary motor and premotor cortex). The algorithm achieves results that are faster by orders of magnitude, and of higher quality, than previously applied methods. More recently, we introduced a variational Bayesian regression algorithm that is able to perform optimal prediction given noise-contaminated input and output data (Ting, D'Souza & Schaal, ICML 2006). Traditional linear regression algorithms produce biased estimates when input noise is present and suffer numerically when the data contains irrelevant and/or redundant inputs. Our algorithm is able to effectively handle datasets with both characteristics. On a system identification task for a robot dynamics model, we achieved 10 to 70% better results than traditional approaches.

Current work focuses on developing a Bayesian version of nonlinear function approximation with locally weighted regression. The challenge is to determine the size of the neighborhood of data that should contribute to the local regression model, a typical bias-variance trade-off problem. Preliminary results indicate that a full Bayesian treatment of this problem can achieve impressively robust function approximation performance without the need for tuning meta-parameters. We are also interested in extending this locally linear Bayesian model to an online setting, in the spirit of dynamic Bayesian networks, to offer a parameter-free alternative to incremental learning.

Joint work with Aaron D'Souza, Stefan Schaal, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato, Peter Strick, Michael Mistry, Jan Peters, and Jun Nakanishi. This work will also be in Poster Session 1.
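The flavor of Bayesian feature detection in linear regression can be sketched with a small automatic relevance determination (ARD) style EM loop. This is a generic textbook construction, offered only to illustrate how Bayesian updates can prune irrelevant inputs without hand-tuned parameters; it is not the NIPS 2005 algorithm itself.

    import numpy as np

    def ard_regression(X, y, iters=100):
        """Evidence maximization for Bayesian linear regression with one
        precision hyperparameter alpha_d per input dimension (ARD).
        Irrelevant dimensions get large alpha_d, shrinking their weight."""
        n, d = X.shape
        alpha = np.ones(d)       # per-dimension weight precisions
        beta = 1.0               # observation noise precision
        for _ in range(iters):
            # Posterior over weights given current hyperparameters.
            S = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)
            mu = beta * S @ X.T @ y
            # MacKay-style fixed-point hyperparameter updates.
            gamma = 1.0 - alpha * np.diag(S)          # effective d.o.f. per dim
            alpha = gamma / (mu ** 2 + 1e-12)
            resid = y - X @ mu
            beta = (n - gamma.sum()) / (resid @ resid + 1e-12)
        return mu, alpha

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # dims 2-4 irrelevant
    mu, alpha = ard_regression(X, y)
    print(np.round(mu, 2))    # weights on irrelevant dims shrink toward 0
    print(np.round(alpha, 1)) # large precisions flag irrelevant inputs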
Efficient Bayesian Algorithms for Clustering
Katherine Ann Heller, Gatsby Unit, University College London

One of the most important goals of unsupervised learning is to discover meaningful clusters in data. There are many different types of clustering methods commonly used in machine learning, including spectral, hierarchical, and mixture modeling. Our work takes a model-based Bayesian approach to defining a cluster and evaluates cluster membership in this paradigm. We use marginal likelihoods to compare different cluster models, and hence determine which data points belong to which clusters. If we have models with conjugate priors, these marginal likelihoods can be computed extremely efficiently.

Using this clustering framework in conjunction with non-parametric Bayesian methods, we have proposed a new way of performing hierarchical clustering. Our Bayesian Hierarchical Clustering (BHC) algorithm takes a more principled approach to the problem than the traditional algorithms (e.g., allowing for model comparisons and the prediction of new data points) without sacrificing efficiency. BHC can also be interpreted as performing approximate inference in Dirichlet Process Mixtures (DPMs), and provides a combinatorial lower bound on the marginal likelihood of a DPM.

We have also explored the task of "clustering on demand" for information retrieval. Given a query consisting of a few examples of some concept, we have proposed a method that returns other items belonging to the concept exemplified by the query. We do this by ranking all items using a Bayesian relevance criterion based on marginal likelihoods, and returning the items with the highest scores. In the case of binary data, all scores can be computed with a single matrix-vector product. We can also use this method as the basis for an image retrieval system. In our most recent work this framework has served as inspiration for a new approach to automated analogical reasoning.

Joint work with Zoubin Ghahramani and Ricardo Silva.
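For intuition about why conjugacy makes this cheap: with a Beta-Bernoulli model, the marginal likelihood of a set of binary vectors forming one cluster is available in closed form from per-dimension counts. The snippet below is a generic conjugate-model computation with arbitrarily chosen hyperparameters, not BHC itself; it scores the hypothesis that two groups of points belong to one cluster versus two.

    import numpy as np
    from scipy.special import betaln

    def log_marginal(X, a=1.0, b=1.0):
        """Log marginal likelihood of binary data X (n x d) under a
        Beta(a, b)-Bernoulli model with independent dimensions."""
        n = X.shape[0]
        ones = X.sum(axis=0)
        return np.sum(betaln(a + ones, b + (n - ones)) - betaln(a, b))

    def log_merge_score(X1, X2):
        """log p(X1 and X2 as one cluster) - log p(X1) - log p(X2).
        Positive values favor merging the two groups into one cluster."""
        return log_marginal(np.vstack([X1, X2])) - log_marginal(X1) - log_marginal(X2)

    rng = np.random.default_rng(0)
    A = (rng.random((40, 30)) < 0.9).astype(int)   # one concept: mostly-on features
    B = (rng.random((20, 30)) < 0.1).astype(int)   # another: mostly-off features
    print(log_merge_score(A[:20], A[20:]))  # > 0: two halves of the same concept
    print(log_merge_score(A[:20], B))       # < 0: different concepts, keep apart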
Hidden Process Models
Rebecca Hutchinson, Carnegie Mellon University

We introduce the Hidden Process Model (HPM), a probabilistic model for multivariate time series data. HPMs assume the data is generated by a system of partially observed, linearly additive processes that overlap in space and time. While we present a general formalism for any domain with similar modeling assumptions, HPMs are motivated by our interest in studying cognitive processes in the brain, given a time series of functional magnetic resonance imaging (fMRI) data. We use HPMs to model fMRI data by assuming there is an unobserved series of hidden, overlapping cognitive processes in the brain that probabilistically generate the observed fMRI time series. Consider for example a study in which subjects in the scanner repeatedly view a picture and read a sentence and indicate whether the sentence correctly describes the picture. It is natural to think of the observed fMRI sequence as arising from a set of hidden cognitive processes in the subject's brain, which we would like to track. To do this, we use HPMs to learn the probabilistic time series response signature for each type of cognitive process, and to estimate the onset time of each instantiated cognitive process occurring throughout the experiment.

There are significant challenges to this learning task in the fMRI domain. The first is that fMRI data is high dimensional and sparse. A typical fMRI dataset measures approximately 10,000 brain locations over 15-20 minutes (features), with only a few dozen trials (training examples). A second challenge is due to the nature of the fMRI signal: it is a highly noisy measurement of an indirect and temporally blurred neural correlate called the hemodynamic response. The hemodynamic response to a short burst of less than a second of neural activity lasts for 10-12 seconds. This temporal blurring in fMRI makes it problematic to model the time series as a first-order Markov process. In short, our problem is to learn the parameters and timing of potentially overlapping, partially observed responses to cognitive processes in the brain using many features and a small number of noisy training examples.

The modeling assumptions that HPMs make to deal with the challenges of the fMRI domain are: 1) the latent time series is modeled at the level of processes rather than individual time points; 2) processes are general descriptions that can be instantiated many times over the course of the time series; 3) we can use prior knowledge of the form "process instance X occurs somewhere inside the time interval [a, b]." HPMs could apply to any domain in which these assumptions are valid. HPMs address a key open question in fMRI analysis: how can one learn the response signatures of overlapping cognitive processes with unknown timing? There is no competing method to HPMs available in the fMRI community. In our ICML paper, we give the HPM formalism, inference and learning algorithms, and experimental results on real and synthetic fMRI datasets.

Joint work with Tom Mitchell and Indrayana Rustandi. This work will also be in Poster Session 1.
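The core generative assumption (linearly additive, overlapping process responses) is easy to state in code. The sketch below is a simplified illustration rather than the paper's model: it synthesizes one voxel's time series as the sum of process response signatures placed at their onset times, plus noise. Learning then amounts to recovering the signatures and onsets from such data.

    import numpy as np

    def synthesize_voxel(T, onsets, signatures, noise_sd=0.5,
                         rng=np.random.default_rng(0)):
        """One voxel under an HPM-style generative assumption:
        y(t) = sum over process instances of signature[t - onset] + noise.
        onsets: list of (process_id, onset_time) pairs.
        signatures: dict process_id -> 1-D response signature array.
        Overlapping instances add linearly, as HPMs assume."""
        y = np.zeros(T)
        for proc, t0 in onsets:
            sig = signatures[proc]
            end = min(T, t0 + len(sig))
            y[t0:end] += sig[: end - t0]
        return y + noise_sd * rng.normal(size=T)

    # Two toy "cognitive processes" with blurred, 12-step responses.
    t = np.arange(12)
    signatures = {
        "view_picture": np.exp(-((t - 4.0) ** 2) / 6.0),        # early peak
        "read_sentence": 0.8 * np.exp(-((t - 7.0) ** 2) / 8.0), # later, weaker peak
    }
    onsets = [("view_picture", 5), ("read_sentence", 9)]         # overlapping in time
    print(np.round(synthesize_voxel(40, onsets, signatures)[:16], 2))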
Generalized statistical methods for fraud detection
Cecile Levasseur, University of California, San Diego

Many important risk assessment applications depend on the ability to accurately detect the occurrence of key events given a large data set of observations. For example, this problem arises in drug discovery ("Do the molecular descriptors associated with known drugs suggest that a new, candidate drug will have low toxicity and high effectiveness?") and credit card fraud detection ("Given the data for a large set of credit card users, does the usage pattern of this particular card indicate that it might have been stolen?"). In many of these domains, little or no a priori knowledge exists regarding the true sources of any causal relationships that may occur between variables of interest. In these situations, meaningful information regarding the circumstances of the key events must be extracted from the data itself, a problem that can be viewed as an important application of data-driven pattern recognition or detection.

The problem of unsupervised data-driven detection or prediction is one of relating descriptors of a large unlabeled database of "objects" to measured properties of these objects, and then using these empirically determined relationships to infer or detect the properties of new objects. This work considers measured object properties that are nongaussian (comprised of continuous and discrete data), very noisy, and highly nonlinearly related. Data comprised of measurements of such disparate properties are said to be hybrid or of mixed type. As a consequence, the resulting detection problem is very difficult. The difficulties are further compounded because the descriptor space is of high dimension. While many domains lack accurate labels in their databases, others, like credit card fraud, exhibit tagged data. Therefore, the problem of supervised data-driven detection, relating to a labelled database of objects, is also examined. In addition, by utilizing tagged data, a performance benchmark can be set, enabling meaningful comparisons of supervised and unsupervised approaches.

Statistical approaches to fraud detection are mostly based on modelling the statistical properties of the data and using this information to estimate whether a new object comes from the same distribution or not. The statistical modelling approach proposed here is a generalization and amalgamation of techniques from classical linear statistics (logistic regression, principal component analysis and generalized linear models) into a framework referred to as generalized linear statistics (GLS). It is based on the use of exponential family distributions to model the various types (continuous and discrete) of data measurements. A key aspect is that the natural parameter of the exponential family distributions is constrained to a lower dimensional subspace, to model the belief that the intrinsic dimensionality of the data is smaller than the dimensionality of the observation space. The proposed constrained statistical modelling is a nonlinear methodology that exploits the split that occurs for exponential family distributions between the data space and the parameter space as soon as one leaves the domain of purely Gaussian random variables. Although the problem is nonlinear, it can be solved by using classical linear statistical tools applied to data that has been mapped into the parameter space, which still has a natural, flat Euclidean structure. This approach provides an effective way to exploit tractably parameterized latent-variable exponential-family probability models for data-driven learning of model parameters and features, which in turn are useful for the development of effective fraud detection algorithms.

The fraud detection techniques proposed here operate in the parameter space rather than in the data space, as has been done in more classical approaches. In the case of a low level of contamination of the data by fraudulent points, a single lower dimensional subspace is learned by applying the GLS-based statistical modelling to a training set. A new data point is projected to its image on the lower dimensional subspace, and fraud detection is performed by comparing its distance from the training set mean-image to a threshold. We present an example showing that there are domains for which classical linear techniques such as principal component analysis, applied in the data space, perform far from optimally compared to the proposed parameter-space techniques. For cases of data with roughly as many fraudulent as non-fraudulent points, an unsupervised analogue of the linear Fisher discriminant is proposed. The GLS-based framework enables unsupervised learning of a lower dimensional subspace in the parameter space that separates fraudulent from non-fraudulent data. Fraud detection is performed as in the previous case. In both cases, an ROC curve is generated to assess the performance of the proposed fraud detection methods.

Joint work with Kenneth Kreutz-Delgado and Uwe Mayer.
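A stripped-down version of the detection rule, with ordinary PCA standing in for the full GLS subspace learning (an acknowledged simplification: GLS fits the subspace in the exponential-family parameter space, not the raw data space): learn a low-dimensional subspace from training data, project new points onto it, and flag points whose projected distance from the training mean exceeds a threshold.

    import numpy as np

    def fit_subspace(X, k=2):
        """Learn a k-dimensional subspace from training data (PCA here;
        GLS would fit the subspace in the natural-parameter space instead)."""
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu, Vt[:k]          # mean and k principal directions

    def suspicion_score(x, mu, V):
        """Distance of the point's subspace image from the training
        mean-image; detection compares this score to a threshold."""
        return np.linalg.norm(V @ (x - mu))

    rng = np.random.default_rng(0)
    train = rng.normal(size=(500, 10))      # mostly legitimate records
    mu, V = fit_subspace(train, k=2)
    legit = rng.normal(size=10)
    odd = rng.normal(size=10) + 4.0         # anomalous usage pattern
    for x in (legit, odd):
        # The shifted point typically scores higher and would be flagged.
        print(round(suspicion_score(x, mu, V), 2))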
Kernels for the Predictive Regression of Physical, Chemical and Biological Properties of Small Molecules
Chloe-Agathe Azencott, University of California, Irvine

Small molecules, i.e., molecules composed of up to a few hundred atoms, play a fundamental role in biology, chemistry and pharmacology. Their uses range from the design of new drugs to the better understanding of biological systems; however, establishing their physical, chemical and biological properties through physical experimentation can be very costly. It is therefore essential to develop efficient computational methods to predict these properties. Kernel methods, and among them support vector machines, appear particularly appropriate for chemical data, for they involve similarity measures which allow one to embed the data in a high-dimensional feature space where linear methods can be used. Machine learning spectral kernels can be derived from various descriptions of the molecules; we study representations whose dimensionality ranges from 1 to 4, thus obtaining 1D, 2D, 2.5D, 3D and 4D kernels. Using cross-validation and redundancy reduction techniques on various datasets of small and medium size from the literature, we test the kernels for the prediction of boiling points, melting points, aqueous solubility and octanol/water partition coefficient, and compare them against state-of-the-art results. Spectral kernels derived from the rich and reliable two-dimensional representation of the molecules outperform the other methods on most of the datasets. They seem to be the method of choice, given their simplicity, computational efficiency and prediction accuracy.

Efficient Exploration with Latent Structure
Bethany Leffler, Rutgers University

Developing robot control using a reinforcement-learning (RL) approach involves a number of technical challenges. In our work, we address the problem of learning an action model. Classical RL approaches assume Markov decision process (MDP) environments, which do not support the critical idea of generalization between states. For an agent to learn the results of its actions for each state, it would have to visit each state and perform each action in that state at least once. In a robot setting, however, it is unrealistic to assume there will be sufficient time to learn about every state of the environment independently, so richer models of environmental dynamics are needed.

Our technique for developing such a model is to assume that each state is not unique. In most environments, there will be states that have the same transition dynamics. By developing models in which similar states have similar dynamics, it becomes possible for a learner to reuse its experience in one state to more quickly learn the dynamics of other parts of the environment. However, this also introduces an additional challenge: determining which states are similar.

To evaluate the viability of this approach, we constructed an experiment using a four-wheeled Lego Mindstorms robot as the agent. The state space consisted of discretized vehicle locations with a hidden variable of slope (flat or incline), which correlated directly with the action model. The agent had to learn which throttling action to perform in each state to maintain a target speed. In this scenario, the actions did not affect the transitions between states. To determine similarity between states, the agent executed a selected action several times in each of the vehicle locations. The outcomes of these actions were used to hierarchically cluster the states. Once the states were clustered, the agent then started learning an action model for each state cluster. The advantage of this approach over one that learns a separate action model for each state is that information gathered in several different states can be pooled together. In common environments, there are many more states than state types; therefore, learning based on clusters drastically reduces learning time. In fact, we were able to prove a worst-case learning time result that formalizes and validates this claim.

If the environment does not have many similar states, or if the clustering algorithm groups the states incorrectly, then the benefit of this approach will be minimized. Even in this worst case, however, it is important to note that this algorithm is no more costly than exploring each state individually. Some limitations of this algorithm arise when states have semi-similar action models. For instance, if two states behave similarly when one action is performed, but not for all the actions, it is possible that the agent would learn incorrectly when following our proposed algorithm. In most robotic environments, however, using our algorithm will greatly reduce the time taken by the agent to determine its action model in all states, thereby increasing the efficiency of the robot.

Joint work with Michael L. Littman, Alexander L. Strehl, and Thomas Walsh.
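A compressed illustration of the clustering step, under assumptions of ours (scalar action outcomes, a fixed similarity threshold, greedy grouping) rather than the authors' exact hierarchical procedure: execute a probe action several times per state, group states whose mean outcomes are close, then pool experience within each cluster to estimate one action model per cluster.

    import numpy as np

    def cluster_states(outcomes, tol=0.5):
        """Greedy grouping of states by mean probe-action outcome.
        outcomes: dict state -> array of observed outcomes for the probe action.
        Returns dict state -> cluster id. (A stand-in for full hierarchical
        clustering; tol is an assumed similarity threshold.)"""
        means = {s: np.mean(o) for s, o in outcomes.items()}
        clusters, rep = {}, []            # rep[c] = representative mean of cluster c
        for s, m in sorted(means.items(), key=lambda kv: kv[1]):
            for c, r in enumerate(rep):
                if abs(m - r) <= tol:
                    clusters[s] = c
                    break
            else:
                clusters[s] = len(rep)
                rep.append(m)
        return clusters

    def pooled_models(outcomes, clusters):
        """One action model (here just an outcome mean) per cluster,
        estimated from the pooled experience of all states in the cluster."""
        pools = {}
        for s, o in outcomes.items():
            pools.setdefault(clusters[s], []).extend(o)
        return {c: np.mean(o) for c, o in pools.items()}

    rng = np.random.default_rng(0)
    # Hidden structure: states 0-2 are "flat", states 3-5 are "incline".
    outcomes = {s: rng.normal(0.0 if s < 3 else 3.0, 0.3, size=5) for s in range(6)}
    clusters = cluster_states(outcomes)
    print(clusters)                       # two groups recovered from outcomes alone
    print(pooled_models(outcomes, clusters))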
Efficient Model Learning for Dialog Management
Finale Doshi, MIT

Intelligent planning algorithms such as the Partially Observable Markov Decision Process (POMDP) have succeeded in dialog management applications because of their robustness to the inherent uncertainty of human interaction. Like all dialog planning systems, however, POMDPs require an accurate model of the user (for example, the user's different states and what the user might say). POMDPs are generally specified using a large probabilistic model with many parameters; these parameters are difficult to specify from domain knowledge, and gathering enough data to estimate the parameters accurately a priori is expensive. In this paper, we take a Bayesian approach to learning the user model while simultaneously solving the dialog management problem. First, we show that the policy that maximizes the expected reward is the solution of the POMDP taken with the expected values of the parameters. We update the parameter distributions after each test, and incrementally update the previous POMDP solution. The update process has a relatively small computational cost, and we test various heuristics to focus computation in circumstances where it is most likely to improve the dialog. We demonstrate a robust dialog manager that learns from interaction data, outperforming a hand-coded model in simulation and in a robotic wheelchair application.

Joint work with Nicholas Roy.
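The Bayesian ingredient here can be made concrete with standard conjugate bookkeeping (a generic sketch, not the paper's system): keep Dirichlet counts over, say, user-state transitions, plan against the expected transition probabilities, and update the counts after each observed interaction.

    import numpy as np

    class DirichletModel:
        """Dirichlet posterior over each row of a transition matrix
        P(next user state | state, action). Planning uses the expected model."""

        def __init__(self, n_states, n_actions, prior=1.0):
            # counts[s, a, s'] starts at the symmetric Dirichlet prior.
            self.counts = np.full((n_states, n_actions, n_states), prior)

        def expected_model(self):
            """E[P] under the Dirichlet posterior: normalized counts.
            The policy would be re-solved against this matrix after updates."""
            return self.counts / self.counts.sum(axis=2, keepdims=True)

        def update(self, s, a, s_next):
            """Conjugate update after observing one dialog transition."""
            self.counts[s, a, s_next] += 1.0

    model = DirichletModel(n_states=3, n_actions=2)
    for s, a, s_next in [(0, 1, 2), (0, 1, 2), (2, 0, 0)]:  # observed interactions
        model.update(s, a, s_next)
    print(np.round(model.expected_model()[0, 1], 2))  # row shifts toward state 2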
Transfer in the context of Reinforcement Learning
Soumi Ray, University of Maryland, Baltimore County

We are investigating the problem of transferring knowledge learned in one domain to another related domain. Transfer of knowledge from simple domains to more complex domains can reduce the total training time in the complex domains. We are doing transfer in the context of reinforcement learning. In the past, knowledge transfer has been accomplished between domains with the same state and action spaces. Work has also been done where the state and action spaces of the two domains are different but a mapping has been provided by humans. We are trying to automate the mapping from the old domain to the new domain when the state and action spaces are different.

We have two domains D1 and D2, with corresponding state spaces S1 and S2 and action spaces A1 and A2, where |S1| = |S2| and |A1| = |A2|. Our goal is to transfer a policy learned in D1 to D2 so as to speed learning in D2. We first run Q-learning in D1 to produce Q-table Q1. Then we train for a limited time in D2 and generate Q2. The test bed we have used is a 16x16 grid world. We have taken two domains in a 16x16 grid world with four actions: North, South, East and West. In the first domain we trained for 500 iterations, and in the second domain we trained for 20 iterations.

Our goal is to find the mapping between the state spaces S1 and S2 and the action spaces A1 and A2. In the first approach, we compute the difference between matrices Q1 and Q2 and greedily find a mapping that minimizes this difference. With this mapping we can transfer the Q-values from the completely trained domain D1 to the partially trained domain D2 to speed up learning in domain D2. We find that it takes fewer steps to learn completely in the second domain when the Q-values are transferred than when learning from scratch. Our second approach finds the mapping that assigns the highest Q-values of the states in domain one to the highest Q-values of the states in domain two. This approach is an improvement over the first: it takes many fewer steps to learn in the second domain using transfer. We are also interested in finding the mapping when S1 and A1 are subsets of S2 and A2 respectively, i.e., |S1| < |S2| and |A1| < |A2|. This can be handled by allowing a single state/action in S1/A1 to map to multiple states/actions in S2/A2.

Joint work with Tim Oates. This work will also be in Poster Session 2.
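A small sketch of the second mapping approach as we read it (rank states by their best Q-values in each domain and align them in order; the greedy difference-minimizing variant would substitute a different matching rule):

    import numpy as np

    def rank_based_state_mapping(Q1, Q2):
        """Map states of D1 to states of D2 by aligning Q-value ranks:
        the state with the highest max-Q in D1 maps to the state with the
        highest max-Q in D2, and so on. Q1, Q2: (n_states, n_actions)."""
        order1 = np.argsort(-Q1.max(axis=1))   # D1 states, best first
        order2 = np.argsort(-Q2.max(axis=1))   # D2 states, best first
        mapping = np.empty(len(order1), dtype=int)
        mapping[order1] = order2
        return mapping                          # mapping[s1] = corresponding s2

    def transfer(Q1, Q2, mapping):
        """Seed the partially trained Q2 with values from the fully trained Q1."""
        Q2_new = Q2.copy()
        Q2_new[mapping] = Q1                    # copy each D1 row to its D2 image
        return Q2_new

    rng = np.random.default_rng(0)
    Q1 = rng.random((256, 4))                   # fully trained 16x16 grid world
    Q2 = 0.05 * rng.random((256, 4))            # 20 iterations' worth of training
    Q2_seeded = transfer(Q1, Q2, rank_based_state_mapping(Q1, Q2))
    # Q2_seeded is the starting point for continued Q-learning in D2.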
Spotlights (Session 1)

Correcting sample selection bias by unlabeled data
Jiayuan Huang, University of Waterloo

The default assumption in many learning scenarios is that training and test data are independently and identically drawn from the same distribution. When the distributions of the training and test sets do not match, we face the problem commonly referred to as sample selection bias or covariate shift. This problem occurs in many real-world applications, including surveys, sociology, biology and economics. It is not hard to see that, given a skewed selection of the training data, it is impossible to derive a good model for accurate predictions on the general target, as the training set might not be representative of the complete population from which the test set is drawn. The predictions are thus biased, potentially increasing the errors. Although there exists previous work addressing this problem, sample selection bias is typically ignored in standard estimation algorithms.

In this work, we utilize the availability of unlabeled data to direct a sample selection de-biasing procedure for various learning methods. Unlike most previous algorithms, which try to first recover the sampling distributions and then make appropriate corrections based on the distribution estimate, our method infers the re-sampling weights directly by distribution matching between the training and testing sets in feature space, in a non-parametric manner. We do not require the estimation of biased densities or selection probabilities, or any assumption of knowing the probabilities of different classes. Because our method matches distributions between training and testing sets in feature space, it can handle high-dimensional data. Our experimental results on many benchmark datasets demonstrate that the method works well in practice. The method also shows good performance in tumor diagnosis using microarrays, promising to be a valuable tool for cross-platform microarray classification.

Joint work with Alex Smola, Arthur Gretton, Karsten Borgwardt, Bernhard Scholkopf.
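The distribution-matching idea can be previewed with a drastic simplification (matching only feature means, whereas the actual work matches distributions in a richer feature space): choose non-negative training weights, averaging to one, whose weighted feature mean is close to the unlabeled test mean.

    import numpy as np

    def mean_matching_weights(X_train, X_test, iters=2000, lr=0.01):
        """Re-sampling weights beta >= 0 with mean(beta) = 1, minimizing
        || (1/n) sum_i beta_i x_i - mean(X_test) ||^2 by projected gradient.
        A toy stand-in for full distribution matching in feature space."""
        n = len(X_train)
        target = X_test.mean(axis=0)
        beta = np.ones(n)
        for _ in range(iters):
            gap = X_train.T @ beta / n - target   # current mean discrepancy
            beta -= lr * (X_train @ gap) / n      # gradient step
            beta = np.clip(beta, 0.0, None)       # enforce beta_i >= 0
            beta *= n / beta.sum()                # renormalize to mean 1
        return beta

    rng = np.random.default_rng(0)
    X_train = rng.normal(0.0, 1.0, size=(500, 3))   # biased training sample
    X_test = rng.normal(1.0, 1.0, size=(500, 3))    # shifted test population
    beta = mean_matching_weights(X_train, X_test)
    # The weighted training mean moves toward the test mean.
    print(np.round((X_train * beta[:, None]).mean(axis=0), 2))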
Decision Tree Methods for Finding Reusable MDP Homomorphisms
Alicia Peregrin Wolfe, University of Massachusetts, Amherst

State abstraction is a useful tool for agents interacting with complex environments. Good state abstractions are compact, reusable, and easy to learn from sample data. This paper combines and extends two existing classes of state abstraction methods to achieve these criteria. The first class of methods searches for MDP homomorphisms (Ravindran 2004), which produce models of reward and transition probabilities in an abstract state space. The second class of methods, like the UTree algorithm (McCallum 1995), learns compact models of the value function quickly from sample data. Models based on MDP homomorphisms can easily be extended such that they are usable across tasks with similar reward functions; value-based methods like UTree cannot be extended in this fashion. We present results showing a new, combined algorithm that fulfills all three criteria: the resulting models are compact, can be learned quickly from sample data, and can be used across a class of reward functions.

Joint work with Andrew Barto.

Evaluating a Reputation-based Spam Classification System
Elena Zheleva, University of Maryland, College Park

Over the past several years, spam has been a growing problem for the Internet community. It interferes with valid e-mail and burdens both e-mail users and ISPs. While there are various successful automated e-mail filtering approaches that aim at reducing the amount of spam, there are still many challenges to overcome. Reactive spam filtering approaches classify a piece of e-mail as spam if it has been reported as such by a large volume of e-mail users. Unfortunately, by the time the system responds by blocking the message or automatically placing it in future recipients' spam folders, the spam campaign has already affected many users. The challenge that we consider is whether we can reduce the response time by recognizing a spam campaign at an earlier stage, thus reducing the cost that users and systems incur. Specifically, we are evaluating the predictive power of a reputation-based spam filtering system, which uses feedback only from trustworthy e-mail users.

In a reputation-based or trust-based spam filtering system, the system identifies a set of users who report spam reliably and trusts their spam reports more than the spam reports of other users. A message coming into the system is classified as spam if enough reliable users report it. This automatic spam filtering approach is vulnerable to malicious users when any anonymous person can subscribe and unsubscribe to the e-mail service, as is the case with most free e-mail providers such as AOL, Hotmail and Yahoo. We show how to overcome this problem in this work. There are two well-known open-source projects which operate in this framework: Vipul's Razor and Distributed Checksum Clearinghouse. Unfortunately, their reputation systems work only as part of their commercially available software counterparts and, due to trade secrets, it is not clear how design characteristics such as the reputation definition and metrics affect system performance. More importantly, the spam reports they receive are mostly from authorized users (such as business partner company employees), which reduces the risk of abuse by anonymous users.

The effectiveness of a reputation-based spam filtering system is based on evaluating the following properties: 1) automatic maintenance of a reliable user set over time, 2) timely and accurate recognition of a spam campaign, and 3) having a set of guarantees on the system's vulnerability. In our work, we present results from simulating reputation-based spam filtering over a period of time. The evaluation dataset includes all the spam reports received during that period for a particular free e-mail provider. We show how our algorithms effectively reduce spam campaign response time while minimizing system vulnerability.

Joint work with Lise Getoor and Alek Kolcz.
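A toy rendering of the trust-weighted report rule (our own simplification; the paper's reputation definition and maintenance scheme are more involved): each user carries a reputation score, a message is flagged once the total reputation of its reporters crosses a threshold, and reputations rise or fall with confirmed or incorrect reports.

    class ReputationFilter:
        """Trust-weighted spam reporting: a message is classified as spam
        once the summed reputation of its reporters exceeds a threshold."""

        def __init__(self, threshold=3.0):
            self.rep = {}              # user -> reputation score
            self.reports = {}          # message id -> total reporter reputation
            self.threshold = threshold

        def report(self, user, msg):
            self.reports[msg] = self.reports.get(msg, 0.0) + self.rep.get(user, 1.0)
            return self.reports[msg] >= self.threshold   # True: classify as spam

        def feedback(self, user, was_correct):
            """Maintain the reliable-user set: reward accurate reporters,
            penalize inaccurate ones (a hypothetical update rule)."""
            r = self.rep.get(user, 1.0)
            self.rep[user] = max(0.0, r + (0.5 if was_correct else -1.0))

    f = ReputationFilter()
    f.feedback("alice", True)
    f.feedback("alice", True)           # alice proves reliable: reputation 2.0
    print(f.report("alice", "msg42"))   # False: 2.0 < threshold
    print(f.report("bob", "msg42"))     # True: 2.0 + 1.0 reaches the threshold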
Improving Robot Navigation Through Self-Supervised Online Learning
Ellie Lin, Carnegie Mellon University

In mobile robotics, there are often features that, while potentially powerful for improving navigation, prove difficult to profit from, as they generalize poorly to novel situations. Overhead imagery data, for instance, has the potential to greatly enhance autonomous robot navigation in complex outdoor environments. In practice, reliable and effective automated interpretation of imagery from diverse terrain, environmental conditions, and sensor varieties proves challenging. Similarly, fixed techniques that successfully interpret on-board sensor data across many environments begin to fail past short ranges, as the density and accuracy necessary for such computation quickly degrade, and the features that can be computed from distant data are very domain-specific.

We introduce an online, probabilistic model to effectively learn to use these scope-limited features by leveraging other features that, while perhaps otherwise more limited, generalize reliably. We apply our approach to provide an efficient, self-supervised learning method that accurately predicts traversal costs over large areas from overhead data. We present results from field testing on board a robot operating over large distances in off-road environments. Additionally, we show how our algorithm can be used offline with overhead data to produce a priori traversal cost maps and to detect misalignments between overhead data and estimated vehicle positions. This approach can significantly improve the versatility of many unmanned ground vehicles by allowing them to traverse highly varied terrains with increased performance.

Joint work with B. Sofman, J. Bagnell, N. Vandapel and A. Stentz.
Spotlights (Session 2)

Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent Traces
Gita Sukthankar, Carnegie Mellon University

This research addresses the problem of activity recognition for physically embodied agent teams. We define team activity recognition as the process of identifying team behaviors from traces of agent positions over time; for many physical domains, military or athletic, coordinated team behaviors create distinctive spatio-temporal patterns that can be used to identify low-level action sequences. We focus on the novel problem of recovering agent-to-team assignments for complex team tasks where team composition, the mapping of agents into teams, changes over time. Without a priori knowledge of current team assignments, the behavior recognition problem is challenging, since behaviors are characterized by the aggregate motion of the entire team and cannot generally be determined by observing the movements of a single agent in isolation.

To handle this problem, we introduce a new algorithm, Simultaneous Team Assignment and Behavior Recognition (STABR), that generates behavior annotations from spatio-temporal agent traces. STABR leverages information from the spatial relationships of the team members to create sets of potential team assignments at selected time-steps. These spatial relationships are efficiently discovered using a randomized search technique, RANSAC, to generate potential team assignment hypotheses. Sequences of team assignment hypotheses are evaluated using dynamic programming to derive a parsimonious explanation for the entire observed spatio-temporal trace. To prune the number of hypotheses, potential team assignments are fitted to a parameterized team behavior model; poorly fitting hypotheses are eliminated before the dynamic programming phase. The proposed approach is able to perform accurate team behavior recognition without exhaustive search over the partition set of potential team assignments, as demonstrated on several scenarios of simulated military maneuvers.

STABR does not simply assume that agents within a certain proximity should be assigned to the same team; instead, it relies on matching static snapshots of agent positions against a database of team formation templates to produce a candidate pool of agent-to-team assignments. This candidate pool of assignments is verified by running a local spatio-temporal behavior detector. The intuition is that the aggregate agent movement for an incorrect team assignment will generally fail to match any behavior model. STABR significantly outperforms agglomerative clustering on the agent-to-team assignment problem for traces with dynamic agent composition (95% accuracy). The scenarios presented here illustrate the operation of STABR in environments that lack the external cues used by other multi-agent plan recognition approaches, such as landmarks, cleanly clustered agent teams, and extensive domain knowledge. We believe that when such cues are available, they can be directly incorporated into STABR, both to improve accuracy and to prune hypotheses. STABR provides a principled framework for reasoning about dynamic team assignments in spatial domains.

Joint work with Katia Sycara.
An Online Learning System for the Prediction of Electricity Distribution Feeder Failures
Hila Becker, Columbia University

We are using machine learning techniques to construct a failure-susceptibility ranking of the feeder cables that supply electricity to the boroughs of New York City. The electricity system is inherently dynamic, so our failure-susceptibility ranking system must be able to adapt to the latest conditions in real time and update its ranking accordingly. The feeders have a significant failure rate, and many resources are devoted to monitoring, maintenance and repair of feeders. The ability to predict failures allows a shift from reactive to proactive maintenance, thus reducing costs. The feature set for each feeder includes a mixture of static data (e.g., age and composition of each feeder section) and dynamic data (e.g., electrical load data for a feeder and its transformers). The values of the dynamic features are captured at the time of training and therefore lead to different models depending on the time and day at which each model is trained.

Previously, a framework was designed to train models using a new variant of boosting called Martingale Boosting, as well as Support Vector Machines. In this framework, however, an engineer had to decide whether to use the most recent data to build a new model or to use the latest model for future predictions. To avoid the need for human intervention, we have developed an "online" system that determines which model to use by monitoring the past performance of previously trained models. In our new framework, we treat each batch-trained model as an expert, and use a measurement of its performance as the basis for reward or penalty of its quality score. We measure performance as a normalized average rank of failures. For example, in a ranking of 50 items with actual failures ranked #4 and #20, the performance is: 1 - (4 + 20) / (2*50) = 0.76.

Our approach builds on the notion of learning from expert advice as formulated in the continuous version of the Weighted Majority algorithm. Since each model is analogous to an expert and our system runs live, gathering new data and generating new models, we have to keep adding new experts to the existing ensemble throughout the algorithm's execution. To avoid having to monitor an ever-increasing set of experts, we drop poorly performing experts after each prediction. We had to address the following key issues in our solution: (1) how often and with what weight do we add new experts, and (2) which experts do we drop. Our simulations suggest that using the median of all current models' weights for new models works best. To drop experts, we use a combination of the age of the model and past performance. Finally, to make predictions we use a weighted average of the top-scoring experts. Our system is currently deployed and being tested by New York City's electricity distribution company. Results are highly encouraging, with 75% of the failures in the summer of 2005 being ranked in the top 26%, and 75% of failures in 2006 being ranked in the top 36%.

Joint work with Marta Arias.
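The abstract's expert-ensemble loop and its performance metric translate naturally into a short sketch. Parameter choices here, such as the penalty rate eta and the ensemble cap, are our assumptions; the deployed system's dropping rule also factors in model age.

    import numpy as np

    def performance(ranking, failures):
        """Normalized average rank of failures: 1 - mean(rank) / n.
        E.g., failures at ranks 4 and 20 of 50: 1 - (4 + 20) / (2 * 50) = 0.76."""
        ranks = [ranking.index(f) + 1 for f in failures]
        return 1.0 - sum(ranks) / (len(ranks) * len(ranking))

    class ExpertEnsemble:
        """Weighted-majority-style ensemble of batch-trained ranking models."""

        def __init__(self, max_experts=10, eta=2.0):
            self.experts, self.weights = [], []
            self.max_experts, self.eta = max_experts, eta

        def add_expert(self, model):
            """New models enter with the median of the current weights."""
            w = float(np.median(self.weights)) if self.weights else 1.0
            self.experts.append(model)
            self.weights.append(w)

        def update(self, scores):
            """Multiplicative reward/penalty from each expert's measured
            performance, then drop the worst expert if the ensemble is full."""
            for i, perf in enumerate(scores):
                self.weights[i] *= np.exp(self.eta * (perf - 0.5))
            if len(self.experts) > self.max_experts:
                worst = int(np.argmin(self.weights))
                del self.experts[worst], self.weights[worst]

        def predict(self, feeder_scores_per_expert):
            """Weighted average of the experts' susceptibility scores."""
            W = np.array(self.weights) / np.sum(self.weights)
            return W @ np.array(feeder_scores_per_expert)

    print(performance(list(range(1, 51)), failures=[4, 20]))  # -> 0.76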
Classification of fMRI Images: An Approach Using Viola-Jones Features
Melissa K. Carroll, Princeton University

There has been growing interest in using Functional Magnetic Resonance Imaging (fMRI) for "mind reading," particularly in applying machine learning methods to classify fMRI brain images based on the subject's instantaneous cognitive state. For instance, Haxby et al. (2001) performed fMRI scans while subjects viewed images from one of seven classes of objects, with the goal of discriminating the brain images based on the class of image being viewed at the time. Most machine learning approaches used to date for fMRI classification have treated individual voxels as features and ignored the spatial correlation between voxels (Norman et al., 2006). We present a novel method for searching this feature space to generate features that capture spatial information, derived from the Viola and Jones (2001) algorithm for 2D object detection, and apply it to 2D representations of the images. In this method, features corresponding to absolute and relative intensities over regions of varying size and shape are computed and used by AdaBoost (Schapire and Singer, 1999) to generate a classifier. Figure 1 (http://www.cs.princeton.edu/~mkc/wiml06/Figure1.jpg) shows examples of these features overlaid on an actual 2D representation of the 3D fMRI image. Mean intensities in white regions are subtracted from mean intensities in gray regions to compute each feature, and the features are combined to form the feature vector. One-, two-, three-, and four-rectangle features of all 100 size combinations between 1x1 and 10x10 are computed for all positions in the image. As Figure 2 (http://www.cs.princeton.edu/~mkc/wiml06/Figure2.jpg) shows, including richer features than the standard one-pixel features can improve classification on the Haxby et al. dataset. One potential limitation of the method is that the large feature set it produces strains computational limits; however, Figure 2 shows that even selecting a small random subset of the richer features can increase classification accuracy by 5% or more, although performance varies across subjects. In addition, the performance of this subset of features can be used to target subsequent feature selection. Future work is needed to develop reliable and valid methods for rating feature importance. Finally, Figure 3 (http://www.cs.princeton.edu/~mkc/wiml06/Figure3.jpg) shows that confusion among predicted classes occurs most often between classes that are most similar and with which previous classifiers have had difficulty, e.g., male faces and female faces. This target-space similarity structure could be exploited in future work to improve classification.

1. J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, (293) 2425-2429.
2. K. A. Norman, S. M. Polyn, G. J. Detre, and J. V. Haxby. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, in press.
3. R. E. Schapire and Y. Singer. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3): 297-336.
4. P. Viola and M. Jones. (2001). Rapid object detection using a boosted cascade of simple features. CVPR 2001.

Joint work with Kenneth A. Norman, James V. Haxby and Robert E. Schapire.
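As a small illustration of Viola-Jones-style rectangle features on a 2D image, the sketch below uses an integral image so that each rectangle sum costs four lookups. This is illustrative NumPy code under assumed image sizes, not the authors' feature extractor.

```python
import numpy as np

def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, h, w):
    # sum of img[r0:r0+h, c0:c0+w] recovered from the integral image ii
    total = ii[r0 + h - 1, c0 + w - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c0 + w - 1]
    if c0 > 0:
        total -= ii[r0 + h - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(ii, r0, c0, h, w):
    # mean intensity of the left rectangle minus the adjacent right one,
    # mirroring the "white region minus gray region" features described above
    left = rect_sum(ii, r0, c0, h, w) / (h * w)
    right = rect_sum(ii, r0, c0 + w, h, w) / (h * w)
    return left - right

img = np.random.rand(64, 64)          # stand-in for a 2D fMRI slice
ii = integral_image(img)
f = two_rect_feature(ii, 10, 10, 5, 5)
```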
Fast Online Classification with Support Vector Machines
Seyda Ertekin, Penn State University

In recent years, we have witnessed a significant increase in the amount of data in digital format, due to the widespread use of computers and advances in storage systems. As the volume of digital information grows, people need more effective tools to find, filter, and manage these resources. Classification, the assignment of instances (e.g., pictures, text documents, emails, Web sites) to one or more predefined categories based on their content, is an important component of many information organization and management tasks. Support Vector Machines (SVMs) are a popular machine learning algorithm for classification problems, owing to their theoretical foundation and good generalization performance. However, SVMs have not yet seen widespread adoption in communities working with very large datasets, because of the high computational cost of solving the quadratic programming (QP) problem in the training phase. This research presents an online SVM learning algorithm, LASVM, which matches the classification accuracy of state-of-the-art SVM solvers while requiring fewer computational resources: LASVM tolerates a much smaller main memory and has a much faster training phase. We also show that not all examples in the training set are equally informative. We present methods to select the most informative examples and exploit them to reduce the computational requirements of the learning algorithm, and we use the properties of active learning algorithms to select informative examples efficiently from very large-scale training sets. We will also show the benefits of using a non-convex loss function in SVMs for faster speeds and lower computational requirements.

Joint work with Leon Bottou, Antoine Bordes and Jason Weston.
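A hedged sketch of the margin-based active selection idea mentioned above: at each online step, pick the candidate closest to the current decision boundary, since such examples tend to be the most informative. This is a simplification of the approach; LASVM itself maintains a kernel expansion and updates it with online optimization steps.

```python
import numpy as np

def select_most_informative(decision_values, candidate_ids):
    """decision_values[i]: current f(x_i) for candidate i (signed margin).
    Returns the candidate whose |f(x)| is smallest, i.e. nearest the margin."""
    scores = np.abs(np.asarray(decision_values))
    return candidate_ids[int(np.argmin(scores))]

# usage with made-up values: candidate 'c' lies closest to the boundary
ids = ["a", "b", "c", "d"]
print(select_most_informative([1.7, -0.9, 0.1, 2.3], ids))  # -> "c"
```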
Posters (Session 1)

Using Decision Trees for Gabor-based Texture Classification of Tissues in Computed Tomography
Alia Bashir, DePaul University

This research aims to develop an automated imaging system for the classification of tissues in CT images. Classifying tissues in CT scans using shape or gray-level information is challenging because organ shapes change across a stack of images and gray-level intensities overlap in soft tissues. However, healthy organs are expected to have a consistent texture within tissues across slices. Given a large enough set of normal-tissue images and a good set of texture features, machine learning techniques can be applied to create an automatic classifier. Previous work by one of the authors explored texture descriptors based on wavelets, ridgelets, and curvelets for the classification of tissues from normal chest and abdomen CT scans. Those descriptors classified tissues with accuracies of 85-98%, with curvelet-based texture descriptors performing best. In this paper we close the gap to perfect accuracy by focusing on texture features based on a bank of Gabor filters. The approach consists of three steps: convolution of the regions of interest with a bank of 32 Gabor filters (4 frequencies and 8 orientations), extraction of two Gabor texture features per filter (mean and standard deviation), and creation of a classifier that automatically identifies the various tissues. The data set consists of 2D DICOM images from five normal chest and abdomen CT studies from Northwestern Medical Hospital. The following regions of interest were segmented out and labeled by an expert radiologist: liver, spleen, kidney, aorta, trabecular bone, lung, muscle, IP fat, and SQ fat, for a total of 1112 images. For each image, the feature vector consists of the mean and standard deviation of the 32 filtered images, totaling 64 descriptors. The classification step is carried out using a Classification and Regression Tree decision tree classifier. A decision tree predicts the class of an object (tissue) from the values of predictor variables (texture descriptors) and generates a set of decision rules, which are then used to classify each region of interest. Both cross-validation and a random split of the data into a training set (~65%) and testing set (~35%) were applied, with no significant difference observed. The optimal tree had a depth of 20, with the parent node value set at 10 and the child node value set at 1. To evaluate the performance of each classifier, specificity, sensitivity, precision, and accuracy rates were calculated from each misclassification matrix. Results show that this set of texture features is able to classify the 9 regions of interest perfectly. The Gabor filters' ability to isolate features at different scales and orientations allows for a multi-resolution analysis of texture, which is essential when dealing with the sometimes very subtle differences in the texture of tissues in CT scans. Given the strong performance on healthy tissues, we plan to apply Gabor texture features to the classification of abnormal tissues.

Joint work with Julie Hasemann and Lucia Dettori.
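A sketch of the 64-descriptor feature extraction described above (4 frequencies x 8 orientations, mean and standard deviation of each filter response), using scikit-image's Gabor kernels. The specific frequency values are illustrative assumptions; the paper's exact filter-bank settings are not given here.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import gabor_kernel

def gabor_features(image):
    feats = []
    for frequency in (0.05, 0.1, 0.2, 0.4):            # 4 frequencies
        for k in range(8):                              # 8 orientations
            kernel = np.real(gabor_kernel(frequency, theta=k * np.pi / 8))
            response = ndi.convolve(image, kernel, mode="wrap")
            feats.extend([response.mean(), response.std()])
    return np.array(feats)                              # 64 descriptors

roi = np.random.rand(64, 64)     # stand-in for a segmented CT region
x = gabor_features(roi)          # feed to a CART-style decision tree
print(x.shape)                   # (64,)
```

The resulting vectors can be passed to any decision-tree learner, e.g. scikit-learn's DecisionTreeClassifier, in place of the CART implementation used in the paper.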
VOGUE: A Novel Variable Order-Gap State Machine for Modeling Sequences
Bouchra Bouqata, Rensselaer Polytechnic Institute (RPI)

In this paper we present VOGUE, a new state machine that combines two separate techniques for modeling long-range dependencies in sequential data: data mining and data modeling. VOGUE relies on a novel Variable-Gap Sequence mining method (VGS) to mine frequent patterns with different lengths and gaps between elements. It then uses these mined sequences to build the state machine. We applied VOGUE to the task of protein sequence classification on real data from the PROSITE protein families. We show that VOGUE yields significantly better scores than higher-order Hidden Markov Models. Moreover, we show that VOGUE's classification sensitivity outperforms that of HMMER, a state-of-the-art method for protein classification.

Joint work with Christopher Carothers, Boleslaw K. Szymanski and Mohammed J. Zaki.
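A toy illustration of variable-gap sequence mining in the spirit of VGS, not the authors' algorithm: count how often an ordered pair of symbols occurs with at most `max_gap` intervening symbols, then keep the pairs meeting a minimum support. Parameter names and thresholds are illustrative.

```python
from collections import Counter

def variable_gap_pairs(sequences, max_gap=2, min_support=2):
    counts = Counter()
    for seq in sequences:
        for i, a in enumerate(seq):
            # allow 0..max_gap symbols between the two elements
            for j in range(i + 1, min(i + 2 + max_gap, len(seq))):
                counts[(a, seq[j], j - i - 1)] += 1   # (first, second, gap)
    return {k: v for k, v in counts.items() if v >= min_support}

# ('A','B') adjacent twice and ('B','A') with gap 1 twice pass the support
print(variable_gap_pairs(["ABCA", "ABDA", "ACBA"], max_gap=1))
```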
GroZi: a Grocery Shopping Assistant for the Blind
Carolina Galleguillos, UC San Diego

Grocery shopping is a common activity that people all over the world perform on a regular basis. Unfortunately, grocery stores and supermarkets are still largely inaccessible to people with visual impairments, who are generally viewed as "high cost" customers. We propose to develop a computer-vision-based grocery shopping assistant, built on a handheld device with haptic feedback, that can detect different products inside a store, thereby increasing the autonomy of blind (or low-vision) people in grocery shopping. Our solution makes use of new computer vision techniques for the visual recognition of specific products inside a store, as specified in advance on a shopping list. These techniques can avail of complementary resources such as RFID, barcode scanning, and sighted guides. We also present a challenging new dataset of images of different categories of grocery products that can be used for object recognition studies. Use of the system consists of the creation of a shopping list followed by in-store navigation. To support shopping list creation, we will develop a website, accessible to visually impaired people, that stores data and images of different products. The website will be augmented with new image templates from the community of users who shop with the device, in addition to images of the same product taken in different stores by different users. This will increase the system's ability to recognize products whose appearance changes for seasonal or promotional reasons. The navigation task includes finding the correct aisle for the products (based on text detection and character recognition), avoiding obstacles, finding products, and checking out. A typical grocery store carries around 30,000 items, so recognizing a single object is a nontrivial task. Assuming a shopping list is generally shorter than 1/1000th of this amount (i.e., fewer than 30 items), recognition can be constrained to two phases: detection of objects on a possibly cluttered shelf, and verification of each detected object against the shopping list. For this task, we intend to use state-of-the-art object recognition algorithms and develop new approaches for fast identification.

Applications of Kernel Minimum Enclosing Ball
Cristina Garcia C., Universidad Central de Venezuela

The minimum enclosing ball (MEB) is a well-studied problem in computational geometry. In this work we describe a generalization of a simple approximate MEB construction, introduced by M. Badoiu and K. L. Clarkson, to a feature-space MEB using the kernel trick. The simplicity of the methodology is itself surprising: the MEB algorithm is based only on geometric information extracted from a sample of data points, and just two parameters need to be tuned, the kernel constant and the tolerance on the radius of the approximation. The applicability of the method is demonstrated on anomaly detection and on less traditional scenarios such as 3D object modeling and path planning. Results are encouraging and show that even an approximate feature-space MEB is able to induce topology-preserving mappings on noisy data of arbitrary dimension as efficiently as other machine learning approaches.

Joint work with Jose Ali Moreno.
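For reference, here is the geometric core of the Badoiu-Clarkson approximate MEB iteration in input space: repeatedly move the center a shrinking step toward the farthest point. The kernelized version used in the work above instead keeps the center as a weighted combination of points and evaluates distances through the kernel; this sketch shows only the plain-space case.

```python
import numpy as np

def approx_meb(points, eps=0.1):
    points = np.asarray(points, dtype=float)
    c = points[0].copy()                      # start at an arbitrary point
    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        d = np.linalg.norm(points - c, axis=1)
        p = points[np.argmax(d)]              # farthest point from the center
        c += (p - c) / (i + 1)                # shrinking step toward it
    radius = np.linalg.norm(points - c, axis=1).max()
    return c, radius

center, radius = approx_meb(np.random.rand(200, 3), eps=0.1)
```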
Classification With Cumular Trees
Claudia Henry, Antilles-Guyane

The accurate combination of decision trees and linear separators has been shown to provide some of the best off-the-shelf classifiers. We describe a new type of such combination, which we call Cumular (Cumulative Linear) Trees. Cumular Trees are midway between Oblique Decision Trees and Alternating Decision Trees: more expressive than the former, and simpler than the latter. We provide an induction algorithm for Cumular Trees which is, as we show, a boosting algorithm in the original sense. Experiments against AdaBoost, C4.5, and OC1 display very good results, especially when dealing with noisy data.

Joint work with Richard Nock and Franck Nielsen.

Transient Memory in Reinforcement Learning: Why Forgetting Can be Good for You
Anna Koop, University of Alberta

The vast majority of work in machine learning is concerned with algorithms that converge to a single solution. It is not clear that this is always the most appropriate aim. Consider a sailor adapting to a ship's motion. She may learn two conditional models: one for walking at sea, and another for walking on land. She may, when memory resources are limited, learn a best-on-average policy that settles on a compromise among all situations she has encountered. A more flexible approach might be to quickly adapt the walking policy to new situations, rather than seeking one final solution or set of solutions. We explore two cases of transient memory. In the first case, the rate at which individual parameters change is controlled by meta-parameters. These meta-parameters allow the agent to ignore irrelevant or random features, to converge where features are consistent throughout its experience, and otherwise to adapt quickly to changes in the environment. This approach requires no commitment to the number of parameter sets necessary in a given environment, but makes the best use of available resources. In the second case, a single solution is stored in long-term parameters, but this solution is used only as the starting point for learning about a specific situation. This is currently being applied to the game of Go. At the beginning of a game, the agent's value function parameters are initialized according to the long-term memory. During the course of a game these parameters are updated by simulating, from each state, thousands of self-play games. The short-term parameters learned in this way are used both for action selection and as the starting point for learning on the next turn, after the opponent has moved. Actual game-play moves are used to update both the short- and long-term memory. At the end of the game, the short-term memory is forgotten and the value function parameters are reinitialized to the long-term values. This allows the agent to store general knowledge in long-term memory while adapting quickly to the specific situations encountered in the current game.
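A minimal sketch of the second transient-memory scheme described above: long-term weights seed a short-term copy that adapts within an episode and is discarded afterward. The step sizes and the rule for blending real and simulated updates are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

class TransientValueFunction:
    def __init__(self, n_features, lt_step=0.001, st_step=0.1):
        self.long_term = np.zeros(n_features)
        self.lt_step, self.st_step = lt_step, st_step
        self.start_episode()

    def start_episode(self):
        # short-term memory starts from the long-term solution
        self.short_term = self.long_term.copy()

    def value(self, features):
        return float(self.short_term @ features)

    def update(self, features, td_error, real_move=False):
        # simulated self-play updates only the fast, short-term weights;
        # actual game-play moves also nudge the slow, long-term weights
        self.short_term += self.st_step * td_error * features
        if real_move:
            self.long_term += self.lt_step * td_error * features

    def end_episode(self):
        self.start_episode()   # forget the short-term adaptation
```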
Predicting Task-Specific Webpages for Revisiting
A. Twinkle E. Lettkeman, Oregon State University

Most web browsers track the history of all pages visited, with the intuition that users are likely to want to return to pages they have previously accessed. However, the history viewers in web browsers are ineffective for most users because of the overwhelming glut of webpages that appear in the history. Not only does the history represent a potentially confusing interleaving of many of a user's different tasks, but it also includes many webpages that would provide minimal or no utility to the user if revisited. This paper reports on a technique that dramatically reduces web browsing histories down to pages that are relevant to the user's current task context and have a high likelihood of being worth revisiting. We briefly describe how the TaskTracer system maintains an awareness of a user's tasks and semi-automatically segments the web browsing history by task. We then present a technique for predicting whether webpages previously visited on a task will be of future value to the user and are worth displaying in the history user interface. Our approach uses a combination of heuristics and machine learning to evaluate the content of a page and the user's interactions with it, learning a predictive model of webpage relevance for each user task. We show the results of an empirical evaluation of this technique based on user data. This approach could be applied to systems that track webpage resources, to predict the future value of resources and to lower the user's costs of finding and reusing webpages. Our findings suggest that prediction of webpage revisiting is highly user- and task-specific, and that the choice of prediction algorithm is not obvious. In future work we aim to refine the features used to predict revisitability. We will analyze the effect of better text feature extraction in conjunction with user interest indicators such as reading time, scrolling behavior, and text selection. Preliminary analysis indicates that applying these refinements may increase the accuracy of our prediction models.

Joint work with Simone Stumpf, Jed Irvine and Jonathan Herlocker.

Hyper-parameters auto-setting using regularization path for SVM
Gaëlle Loosli, INSA de Rouen

In the context of classification tasks, Support Vector Machines are now very popular. However, their use by neophytes is still hampered by the need to supply values for control parameters in order to get the best attainable results. Given clean data, SVM users must make three main choices: the type of kernel, its bandwidth, and the regularization parameter. It would be convenient to provide users with a push-button SVM able to auto-set these parameters to the best possible values; this paper presents a new method that approaches this goal. Given the importance of this problem for reaping the full benefits of SVMs, much research has been dedicated to helping set the parameters. Most of it relies either on outer measures, such as cross-validation, to guide the selection, or on measures embedded in the learning method itself. In place of empirical approaches to setting the control parameters, regularization paths have been proposed and widely studied in recent years, since they provide a smart and fast way to access all the optimal solutions of a problem, across all compromises between bias and variance in regression, or between bias and regularity in classification. For instance, in the case of classification tasks, as studied in this paper, soft-margin SVMs deal with non-separable problems through slack variables that are parametrized by a slack trade-off (usually noted C, the regularization parameter). In the usual formulation of soft-margin SVMs, this trade-off takes values between 0 (random) and infinity (hard margins). The nu-SVM technique reformulates the SVM problem so that C is replaced by a nu parameter taking values in [0,1]. This normalized parameter has a more intuitive meaning: it represents the minimal proportion of support points in the solution and the maximal proportion of misclassified points. However, having the whole regularization path is not enough: the end user still needs to retrieve from it the best values of the regularization parameters. Instead of selecting these values by k-fold cross-validation, leave-one-out, or other approximations, we propose to include the leave-one-out estimator inside the regularization path, to obtain an estimate of the generalization error at each step. We explain why this is less expensive than selecting the best parameter a posteriori, and give a method to stop learning before reaching the end of the path, saving useless effort. Contrary to what is usually done for regularization paths, our method does not start with all points as support vectors; we thereby avoid computing the whole Gram matrix at the first step. Moreover, since the proposed method stops on the path, this extreme non-sparse solution is never reached and the whole Gram matrix is never required. One of the main advantages is that this setting can therefore be used for large databases.
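A quick illustration of the nu parameter's meaning, using scikit-learn's NuSVC (a standard nu-SVM implementation, not the authors' path-following code): nu upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors. The dataset and nu values are arbitrary.

```python
from sklearn.svm import NuSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X, y)
    frac_sv = clf.support_.size / len(X)
    print(f"nu={nu:.1f}: fraction of support vectors = {frac_sv:.2f} (>= nu)")
```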
The Influence of Ranker Quality on Rank Aggregation Algorithms
Brandeis Marshall, Rensselaer Polytechnic Institute

The rank aggregation problem has been studied extensively in recent years, with a focus on how to combine several different rankers to obtain a consensus aggregate ranker. We study the rank aggregation problem from a different perspective: how the individual input rankers impact the performance of the aggregate ranker. We develop a general statistical framework based on a model of how the individual rankers depend on the ground-truth ranker; within this framework, one can study the performance of different aggregation methods. The individual rankers, which are the inputs to the rank aggregation algorithm, are statistical perturbations of the ground-truth ranker. With rigorous experimental evaluation, we study how the noise level and the misinformation of the rankers affect the performance of the aggregate ranker. We introduce and study a novel Kendall-tau rank aggregator and a simple aggregator called PrOpt, which we compare to other well-known rank aggregation algorithms such as average, median, and Markov chain aggregators. Our results show that the relative performance of aggregators varies considerably depending on how the input rankers relate to the ground truth.

Joint work with Sibel Adali and Malik Magdon-Ismail.
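To fix the notions used in this comparison, here is a small sketch of two ingredients: the Kendall-tau distance between two rankings (the number of pairwise disagreements) and a baseline average-rank aggregator. PrOpt and the authors' Kendall-tau aggregator are more involved; this code is only illustrative.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """rank_x[item] = position of item in ranking x (0 = best)."""
    items = list(rank_a)
    return sum(
        (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0
        for i, j in combinations(items, 2)
    )

def average_rank_aggregate(rankings):
    items = rankings[0].keys()
    mean = {i: sum(r[i] for r in rankings) / len(rankings) for i in items}
    return sorted(items, key=mean.get)

r1 = {"a": 0, "b": 1, "c": 2}
r2 = {"a": 1, "b": 0, "c": 2}
print(kendall_tau_distance(r1, r2))        # 1 pairwise disagreement
print(average_rank_aggregate([r1, r2]))    # ['a', 'b', 'c']
```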
Learning for Route Planning under Uncertainty
Evdokia Nikolova, Massachusetts Institute of Technology

We present new complexity results and efficient algorithms for optimal route planning in the presence of uncertainty. We employ a decision-theoretic framework for defining the optimal route: for a given source S and destination T in the graph, we seek an S-T path of lowest expected cost, where the edge travel times are random variables and the cost is a nonlinear function of the total travel time. Although this is a natural model for route planning on real-world road networks, results are sparse, due to the analytic difficulty of finding closed-form expressions for the expected cost, as well as the computational and combinatorial difficulty of efficiently finding a path that minimizes the expected cost. We identify a family of appropriate cost models and travel time distributions that are closed under convolution and physically valid. We obtain hardness results for routing problems with a given start time and cost functions with a global minimum, in a variety of deterministic and stochastic settings. In general the global cost is not separable into edge costs, precluding classic shortest-path approaches; however, using partial minimization techniques, we exhibit an efficient solution via dynamic programming with low polynomial complexity. We then consider an important special case of the problem, in which the goal is to maximize the probability that the path length does not exceed a given threshold value (deadline). We give a surprising exact n^Θ(log n) algorithm for the case of normally distributed edge lengths, which is based on quasi-convex maximization. We then prove average and smoothed polynomial bounds for this algorithm, which also translate to average and smoothed bounds for the parametric shortest path problem, and extend to a more general non-convex optimization setting. We also consider a number of other edge length distributions, giving a range of exact and approximation schemes. Our offline algorithms can be adapted to give online learning algorithms via the Kalai-Vempala approach of converting an offline solution into an efficient online optimization solution.

Joint work with Matthew Brand, David Karger, Jonathan Kelner and Michael Mitzenmacher.
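For the deadline objective above, a fixed path with independent normally distributed edge lengths has a total length that is normal with mean equal to the sum of edge means and variance equal to the sum of edge variances, so the probability of meeting a deadline t is Φ((t - μ)/σ). The toy comparison below (illustrative numbers, not the paper's algorithm) shows why this objective can prefer a higher-mean, lower-variance path.

```python
from math import erf, sqrt

def meets_deadline_prob(edges, deadline):
    """edges: list of (mean, variance) pairs along one path."""
    mu = sum(m for m, _ in edges)
    var = sum(v for _, v in edges)
    z = (deadline - mu) / sqrt(var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF

short_but_risky = [(10.0, 36.0)]               # mean 10, std 6
longer_but_safe = [(6.0, 1.0), (6.0, 1.0)]     # mean 12, std ~1.4
print(meets_deadline_prob(short_but_risky, 13.0))   # ~0.69
print(meets_deadline_prob(longer_but_safe, 13.0))   # ~0.76
```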
A Neurocomputational Model of Impaired Imitation
Biljana Petreska, Ecole Polytechnique Federale de Lausanne

This abstract addresses the question of human imitation through convergent evidence from neuroscience, using tools from machine learning. In particular, we consider a deficit in the imitation of meaningless gestures (i.e., hand postures relative to the head) following callosal brain lesion (i.e., disconnected hemispheres). We base our work on the rationale that examining how imitation is impaired in apraxic patients can unveil its underlying neural principles. We ground the functional architecture and information flow of our model in brain imaging studies, and findings from monkey neurophysiology drive the implementation choices for our processing modules. Our neurocomputational model of visuo-motor imitation is based on self-organizing maps receiving sensory input (i.e., visual, tactile, or proprioceptive) with associated activities [1]. We train the connections between the maps with anti-Hebbian learning to account for the transformations required to translate the observed visual stimulus into the corresponding tactile and proprioceptive information that will guide the imitative gesture. Patterns of impairment of the model, realized by adding uncertainty to the transfer of information between the networks, reproduce the deficits found in a clinical examination of visuo-motor imitation of meaningless gestures [2]. The model makes hypotheses about the type of representation used and the neural mechanisms underlying human visuo-motor imitation. It also helps us better understand the occurrence and nature of imitation errors in patients with brain lesions.

[1] B. Petreska and A. G. Billard. A Neurocomputational Model of an Imitation Deficit following Brain Lesion. In Proceedings of the 16th International Conference on Artificial Neural Networks (ICANN 2006), Athens, Greece. To appear.
[2] G. Goldenberg, K. Laimgruber, and J. Hermsdörfer. Imitation of gestures by disconnected hemispheres. Neuropsychologia, 39:1432–1443, 2001.

Joint work with A. G. Billard.
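As a one-step illustration of the anti-Hebbian rule mentioned above: where Hebbian learning strengthens weights between co-active units, the anti-Hebbian update weakens them, which can decorrelate the activities of connected maps. The learning rate and vector shapes are illustrative; this is not the model's full training procedure.

```python
import numpy as np

def anti_hebbian_step(W, pre, post, lr=0.01):
    """W: (n_post, n_pre) inter-map weights; pre/post: activity vectors."""
    return W - lr * np.outer(post, pre)   # the Hebbian rule would use '+'

W = np.zeros((3, 4))
W = anti_hebbian_step(W, pre=np.array([1.0, 0, 0, 1.0]),
                      post=np.array([0, 1.0, 0]))
```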
Bayesian Estimation for Autonomous Object Manipulation Based on Tactile Sensors
Anya Petrovskaya, Stanford University

We consider the problem of autonomously estimating the position and orientation of an object from tactile data. When the initial uncertainty is high, estimating all six parameters precisely is computationally expensive. We propose an efficient Bayesian approach that is able to estimate all six parameters in both unimodal and multimodal scenarios. The approach is termed Scaling Series sampling, as it estimates the solution region by samples. It performs the search using a series of successive refinements, gradually scaling the precision from low to high. Our approach can be applied to a wide range of manipulation tasks. We demonstrate its portability on two applications: (1) manipulating a box and (2) grasping a door handle.

Joint work with Oussama Khatib, Sebastian Thrun and Andrew Y. Ng.
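A schematic successive-refinement loop in the spirit of Scaling Series sampling, emphatically not the authors' algorithm: broaden the measurement model at first so coarse samples survive, keep the well-scoring region, and shrink both the sampling neighborhood and the annealing factor each round. All names, schedules, and constants are assumptions.

```python
import numpy as np

def scaling_series(log_lik, prior_samples, rounds=5, keep=0.25):
    samples = np.asarray(prior_samples, dtype=float)   # (n, d) pose hypotheses
    radius = samples.std(axis=0)                       # initial neighborhood
    for r in range(rounds):
        temper = 2.0 ** (rounds - 1 - r)               # precision: low -> high
        scores = np.array([log_lik(s) / temper for s in samples])
        top = samples[np.argsort(scores)[-max(1, int(keep * len(samples))):]]
        # resample around the surviving region with a shrinking radius
        idx = np.random.randint(len(top), size=len(samples))
        samples = top[idx] + np.random.randn(*samples.shape) * radius
        radius *= 0.5
    return samples

# toy 6-parameter "pose" posterior peaked at the origin
out = scaling_series(lambda s: -np.sum(s**2), np.random.randn(500, 6) * 5)
```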
Therapist Robot Behavior Adaptation for Post-stroke Rehabilitation Therapy
Adriana Tapus, University of Southern California

Research into Human-Robot Interaction (HRI) for socially assistive applications is in its infancy. Socially assistive robotics, which focuses on social rather than physical interaction between the robot and the human user, has the potential to enhance the quality of life for large populations of users. Post-stroke rehabilitation is one of the largest potential application domains, since stroke is a dominant cause of severe disability in the growing ageing population. In the US alone, over 750,000 people suffer a new stroke each year, with the majority sustaining some permanent loss of movement [Institute06]. This loss of function, termed "learned disuse", can improve with rehabilitation therapy during the critical post-stroke period. One of the most important elements of any rehabilitation program is carefully directed, well-focused, and repetitive practice of exercises, which can be passive or active. Our work focuses on hands-off therapist robots that assist, encourage, and socially interact with patients during their active exercises. Our previous research demonstrated, through real-world experiments with stroke patients [Tapus06b, Eriksson05, Gockley06], that the physical embodiment (including shared physical context and physical movement of the robot), the encouragement, and the monitoring play key roles in patient compliance with rehabilitation exercises. In the current work we investigate the role of the robot's personality in the hands-off therapy process. We focus on the relationship between the level of extroversion/introversion (as defined in the Eysenck model of personality [Eysenck91]) of the robot and of the user, addressing the following research questions: (1) How should we model the behavior and encouragement of the therapist robot as a function of the user's personality and the number of exercises performed? (2) Is there a relationship between the extroversion-introversion personality spectrum of the Eysenck model and a challenge-based versus nurturing style of patient encouragement? To date, little research into human-robot personality matching has been performed; some of our recent results showed a preference for personality matching between users and socially assistive robots [Tapus06a]. Our behavior adaptation system monitors the number of exercises per minute performed by the patient, which indicates the level of engagement and/or fatigue, and changes the robot's behavior in order to maximize this level. The socially assistive therapist robot (see Figure 1) is equipped with a basis set of behaviors that explicitly express its desires and intentions in a physical and verbal way observable to the user/patient. These behaviors involve the control of physical distance, gestural expression, and verbal expression (tone and content). The number of exercises per minute is thus used as a reward signal that the adaptation system seeks to maximize. Hands-off robot post-stroke rehabilitation therapy holds great promise for improving patient compliance in the recovery program. Our work aims toward developing and testing a model of compatibility between human and robot personality in the assistive context, based on the PEN theory of personality, and toward building a customized therapy protocol. Examining and answering these issues will begin to address the role of assistive robot personality in enhancing patient compliance.

[Eriksson05] Eriksson, J., Matarić, M. J., and Winstein, C. "Hands-off assistive robotics for post-stroke arm rehabilitation", In Proceedings of the International Conference on Rehabilitation Robotics (ICORR-05), Chicago, Illinois, June 2005.
[Eysenck91] Eysenck, H. J. "Dimensions of personality: 16, 5 or 3? Criteria for a taxonomic paradigm", In Personality and Individual Differences, vol. 12, pp. 773-790, 1991.
[Gockley06] Gockley, R., and Matarić, M. J. "Encouraging Physical Therapy Compliance with a Hands-Off Mobile Robot", In Proceedings of the First International Conference on Human Robot Interaction (HRI-06), Salt Lake City, Utah, March 2006.
[Institute06] "Post-Stroke Rehabilitation Fact Sheet", National Institute of Neurological Disorders and Stroke, January 2006.
[Tapus06a] Tapus, A. and Matarić, M. J. (2006) "User Personality Matching with Hands-Off Robot for Post-Stroke Rehabilitation Therapy", In Proceedings of the 10th International Symposium on Experimental Robotics (ISER), Rio de Janeiro, Brazil, July 2006.
[Tapus06b] Tapus, A. and Matarić, M. J. (2006) "Towards Socially Assistive Robotics", International Journal of the Robotics Society of Japan (JRSJ), 24(5), pp. 576-578, July 2006.

Joint work with Maja J. Matarić.
Learning How To Teach
Cynthia Taylor, University of California, San Diego

The goal of the RUBI project is to develop a social robot (RUBI) that can interact with children and teach them in an autonomous manner. As part of the project we are currently focusing on the problem of teaching 18-24 month old children skills targeted by the California Department of Education as appropriate for this age group; in particular, we focus on teaching the children to identify objects, shapes, and colors. We have seven RFID-tagged stuffed toys in the shapes of common objects, such as a slice of watermelon or a waffle. RUBI says the name of an object and shows a picture of it on her touch screen, and the children hand her a toy, which she identifies as correct or incorrect; she keeps track of the right and wrong answers for each toy. RUBI has a touch screen on her stomach that she can use to play short videos and play games with the children. By recording when the children touch her stomach, the screen also provides important information about whether or not the children are engaged. She has two Apple iSight cameras for eyes, and runs machine learning software that lets her detect both faces and smiles. The smile detection lets her gauge people's moods during social interaction and respond accordingly. She has an RFID reader in her right hand, letting her identify RFID-tagged toys. The machine learning aspect of this problem is how to use the information from her perceptual primitives to teach the materials effectively. After each question and answer, RUBI has to decide whether to continue playing the current learning game or switch to another activity, and, if she continues, what question to ask next. She also has to decide what to do when she asks a question and does not get an answer for a long period of time. Unlike many standard AI problems such as chess, RUBI works in continuous time, with no discrete turns. We approach the problem from the point of view of control theory. Exact solutions to the optimal teaching problem exist for some simple models of learning, such as the Atkinson and Bower learning model; we plan to find approximate solutions to this control problem using reinforcement learning methods. We will complement formal and computational analysis with an ethnographic study of how teachers actually teach children on the same task, focusing on both timing and the sources of information teachers use to adapt their teaching strategies.

Joint work with Paul Ruvolo, Ian Fasel, Javier R. Movellan.

Strategies for improving face recognition in video using machine learning methods
Deborah Thomas, University of Notre Dame

Surveillance cameras are a common feature in many stores and public places, and there are many applications for face recognition from video streams in the area of law enforcement. However, while face recognition from high-quality still images has been very successful, face recognition from video is a relatively new area with huge room for improvement. When using video data, we can exploit the fact that there are multiple frames to choose from to improve recognition performance: instead of representing subjects by a single high-quality image, they can be represented by a set of frames chosen from a video clip. We want to select as many distinct frames per individual as possible, since diversity in the training space improves the generalization capacity of the learned face recognition classifier. In this work, we consider two different approaches, both built on Principal Component Analysis; given the high dimensionality of the data, PCA is often warranted, not only to reduce the dimensions but also to construct mode-independent dimensions. In our first approach, we use a nearest-neighbor algorithm with the Mahalanobis Cosine (MahCosine) distance measure. A pair of images in which the faces differ in pose and expression will have a larger MahCosine distance between them, so we can use this as a measure of the difference between frames. In the second approach, we project the images into PCA space and then use K-means clustering to group all the frames from one subject, picking one image per cluster to make up the representation set (a sketch of this selection step appears below). Here again, images that are similar to each other will fall in the same cluster, while more distinct images will fall in different clusters. In addition to the difference between frames, we also incorporate a quality metric of the face when picking frames, which yields a higher recognition rate.
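A hedged sketch of the second frame-selection approach just described: project a subject's frames into PCA space, cluster with K-means, and keep the frame nearest each cluster center. The cluster count, dimensions, and array shapes are illustrative, and the quality metric mentioned in the abstract is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def select_representative_frames(frames, n_frames=5, n_components=20):
    """frames: (n, h*w) array of vectorized face images for one subject."""
    z = PCA(n_components=n_components).fit_transform(frames)
    km = KMeans(n_clusters=n_frames, n_init=10, random_state=0).fit(z)
    picks = []
    for c in range(n_frames):
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(z[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(d)])   # frame closest to the center
    return sorted(picks)

frames = np.random.rand(120, 32 * 32)         # stand-in for face frames
print(select_representative_frames(frames))   # indices of 5 diverse frames
```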
We demonstrate our approach using two different datasets. First, we compare our approach to the approach used by Lee et al. in 2003 (Video-based Face Recognition Using Appearance Manifolds) and 2005 (Visual Tracking and Recognition using Probabilistic Appearance Manifolds). They use appearance manifolds to represent their subjects and use planes in PCA space for the different poses. We show that our approach performs