DARPA MARS Robotic Vision 2020 program proposal: CACI
Cover page
Program Solicitation No.: Mobile Autonomous Robot Software BAA #02-15
Technical topic area: 1. Structured software modules, 2. learning and adaptation tools, 3. robot self-
monitoring, 4. software components, 5. sensor-based algorithms, and 6. behavior software
components and architecture structures.
Title of Proposal: CACI: Cross-Platform and Cross-Task Integrated Autonomous Robot Software
Submitted to:
DARPA/IPTO
ATTN: BAA 02-15
3701 North Fairfax Drive
Arlington, VA 22203-1714
Technical contact:
John Weng, Associate Professor
3115 Engineering Building
Embodied Intelligence Laboratory
Michigan State University, East Lansing, MI 48824
Tel: 517-353-4388
Fax: 517-432-1061
E-mail: weng@cse.msu.edu
http://www.cse.msu.edu/~weng/
Administrative contact:
Daniel T. Evon, Director
Contract and Grant Administration
301 Administration Building
Michigan State University
East Lansing, MI 48824 USA
Tel: 517-355-4727
FAX: 517-353-9812
E-mail: evon@cga.msu.edu
Contractor’s type of business: Other educational
Summary of cost:
July 1, 2002 – June 30, 2003: $1,619,139
July 1, 2003 – June 30, 2004: $1,674,006
Total: $3,293,145
A Innovative Claims for the Proposed Research
The major innovative claims include:
(1) Cross-platform integrated software. The proposed software is applicable to different platforms
using a uniform API level to achieve software “Plug and Play” capability for every plug-and-play
compliant robot body. This cross-platform capability is made practical by recent
advances in the biologically motivated “developmental methodology” and other related technologies.
Although only indoor platforms will be tested, the technology is not limited to indoor platforms.
(2) Cross-task integrated software. The team will develop various perceptual and behavioral
capabilities for a suite of skills for a wide range of tasks, performed by robots individually or
collectively. This cross-task capability is made practical by a combination of “developmental
methodology” and other related technologies.
(3) Highly perceptual robot systematically combining vision, speech, touch and symbolic inputs,
to perceive and “understand” the environment, including humans, other robots, objects and their
changes for behavioral generation. The proposed software will be able to detect, track, classify
and interact with objects such as human overseers, by-standers, other robots and other objects.
(4) “Robots keep alive” during overseer intervention. The robots continue to be fully “aware” of
their operating environment and incrementally improve their performance even when a human
intervenes. A human can interact with robots before, during or after task execution. This is
made practical by the recent advances in robot “developmental” software and a novel integration
with other machine learning techniques and the perceptual frame based servo-level controller.
(5) Multimodal human interventions in the robot’s physical environment: from wireless remote
control, to physical touch and block, to auditory and speech commands, to visual cues, to high-
level goal alteration. Such a “mode rich” robotic capability is made possible by a novel
integration of “automatic generation of representation” in developmental programs with the
controller level perceptual frame based force-control.
(6) Digital dense 3-D maps construction capability using both laser radars (ladars) and trinocular
stereo cameras, for robot access and human use. Using ladar sensing, the proposed effort will
use both range and intensity information to achieve real-time multi-layer scene representation and
object classification. A multi-look methodology will be developed to produce a 3-D map using a
low resolution pulse ladar sensor. An efficient multiple path ladar fusion algorithm will be
developed to produce multi-layer 3-D scene representation. Using trinocular stereo camera
sensing, the proposed effort will use radial image rectification, variable-step-size textured-light-
aided trinocular color stereoscopy, very fast sensor evidence cone insertion, multiple viewpoint
color variance sensor model learning, and coarse to fine resolution hierarchies.
(7) Systematic methodology for quantitative assessment and validation of specific techniques in
specific system roles. This team has access to a wide range of robot test beds to support
extensive assessment, from simple mobile platforms to the sophisticated Dav humanoid (its
combination of mobile and untethered features is unique among all existing humanoids in the
US).
B Proposal Roadmap
The main goal: The work is to develop CACI: a Cross-platform And Cross-task Integrated software
system for multiple perception-based autonomous robots to effectively operate in real-world
environments and to interact with humans and with other robots, as illustrated in Figure 1.
Figure 1: Future robot assistants for commanders? This schematic illustration is synthesized with
pictures of real SAIL and Dav robots.
Tangible benefits to end users:
1. Greatly reduced cost of software development due to the cross-platform nature of the proposed
software. Although each different robot platform needs to be trained using the proposed
software, the time of interactive training is significantly shorter than directly programming each
different robot platform for perception-based autonomy.
2. Greatly enhanced capability of robots to operate semi-autonomously in uncontrolled
environments, military and domestic. The proposed software is applicable, in principle, to
indoor and outdoor, on-road and off-road, and ground-, air-, sea-, and space-based platforms. However, in the
proposed effort, we will concentrate on a wide variety of indoor platforms, with some
extension to outdoor on-road applications.
3. Greatly increased variety of tasks that robots can execute semi-autonomously: not just
navigating according to range information while avoiding collisions, but also detecting, tracking,
recognizing, classifying and interacting with humans, robots and other objects. For example,
handing ammunition over to a soldier on request, warning of an incoming threat, and disposing of
explosive ordnance.
4. Greatly reduced frequency of required interventions by human overseers. Depending on the
task executed, the interval between human interventions can be as long as a few minutes.
Critical technical barriers: Most autonomous robots are “action cute but perception weak.” They
can either operate under human remote control or be programmed to perform a set of pre-designed actions in
largely controlled environments, e.g., following a red ball, playing robot soccer or navigating in a known
environment. However, their capability of responding to unknown environments (e.g., visual, auditory
and touch) is weak.
The main elements of the proposed approach:
1. The proposed project will integrate a set of the most powerful technologies that we have developed
in our past efforts in the DARPA MARS program as well as elsewhere, for unknown
environments, including various Autonomous Mental Development (AMD) techniques,
Markov Decision Process (MDP) based machine learning, supervised learning, reinforcement
learning, and the new communicative learning (including language acquisition and learning
through language).
2. We will also integrate techniques that take advantage of prior knowledge for partially known
environments, such as detecting, tracking and recognizing human faces. This allows the robot to
perform these more specialized tasks efficiently without requiring a long training process.
3. Our proposed innovative 3-D map construction takes advantage of photographic stereopsis,
structured light and ladars for the best quality and the widest applicability possible at this time.
4. The unique integration technology is characterized by a unified architecture for sensors, effectors
and internal states and by the “plug-and-play” methodology for various indoor and outdoor on-road
robot platforms.
The basis for confidence that the proposed approach will overcome the technical barriers:
We have successfully tested the proposed individual technologies in the previous DARPA MARS
program or other prior projects. The proposed integration is however truly challenging. Our integration
philosophy is to find the best merging boundary of each individual technology so that the capability of
the integrated system is not reduced to an intersection of the individual applicabilities, but instead
increased to a multiplication, or at least a summation, of them.
The nature of the expected results:
1. Unique: No other group that we know of has produced our scale of robot perception results
(vision, audition and touch integrated at fine sub-second time scales). No other team has as wide a
variety of indoor platforms as ours (e.g., all other humanoids in the US are immobile).
2. Novel: Our AMD approach and the associated human-robot interaction capabilities are truly
new, along with other novelties in component techniques.
3. Critical: No autonomous robot for an uncontrolled environment is possible without generating
representation from real-world sensory experience, which is the hallmark of AMD. Defense
environments are typically uncontrolled, very different from lab settings.
The risk if the work is not done: If the proposed work is not done, the ground mobile weapons of the
future combat system (FCS) will continue to rely on human operators, putting humans at full risk on
the battlefield. Further, the miniaturization of mobile weapons is limited by human size if a human
operator has to be carried inside.
The criteria for evaluating progress include the following major ones: (1) the frequency at which
human overseers need to intervene, (2) the scope of tasks that the technology can deal with, (3) the scope
of machine perception, (4) the flexibility of human robot interactions, and (5) the cost of the system.
The cost of the proposed effort for each year:
Year 1: $1,619,139
Year 2: $1,674,006
Total: $3,293,145
C Research Objectives
C.1 Problem Description
The research project will address mainly indoor mobile platforms, including non-humanoid and
humanoid mobile robots. However, an indoor robot needs to perceive not only indoor scenes, but also
outdoor ones. For example, an indoor robot must be able to “look out through a window” to perceive
danger from outside. Further, in order to verify the cross-platform cross-task capability of the proposed
software system, the domain of application to be tested will include not only complex indoor
environments, but also outdoor flat driving surfaces. However, in the proposed effort, indoor tests will
have a higher priority. We will evaluate the power and limitation of component technologies and the
integrated system.
C.2 Research Goals
We propose that the following robot capabilities be developed and integrated:
• Robotic perception in uncontrolled environment, including vision, audition and touch, for
various tasks. For example, detecting and recognizing humans, landmarks, objects, and body
parts.
• Robotic behaviors based on perception, including visual attention, autonomous navigation with
collision avoidance, autonomous object manipulation, and path planning for various tasks. For
example, guiding attention to moving parts or humans, navigating autonomously while avoiding
obstacles, and picking up and delivering objects from a location to a destination.
• Construction of 3-D world model and its application. We will integrate both laser-based direct
range sensing and stereo camera based range sensing. The constructed 3-D map with intensity
will be used by a human overseer for virtual walk-through and as an external 3-D digital map for
a robot to “read,” similar to a human consulting a map, for planning tasks.
• Human-robot interactions while keeping robot “awareness.” The integrated software enables
a human overseer to intervene at any time, to issue a command, to improve an action, or to issue a
warning. The real-time software is able to respond to human intervention within a fraction of a
second without terminating its “awareness.”
The proposed project will also reach the following integration and evaluation goals:
• Integration goal: Develop an integration technology that is cross-platform and cross-task.
• Evaluation goal: Develop a systematic method that is suited for quantitative assessment of the
power and limitation of specific robot capabilities, including the above four categories of
capabilities.
C.3 Expected Impact
Volumetric sensing. Ladar is an all-weather, day-and-night active sensor. Real-time 3-D
map generation and exploitation using ladar images can significantly simplify a robot’s route
planning, cross-platform cooperation, and information fusion missions. It will also provide
critical prior information for robot perception tasks and can significantly reduce
communication bandwidth and simplify human-robot interactions. It is a critical enabling
technology for cross-platform, cross-task, and cross-environment robot operations.
Our robots will navigate, employing a dense 3-D awareness of their surroundings, be
tolerant of route surprises, and be easily placed by ordinary workers in entirely new routes
or work areas. The 3-D maps built by our system, and 2-D plans derived from them, are
suitable for presentation to human overseers, who could designate operationally significant
locations and objects with point and click methods. Human overseers can also walk through
the robot’s 3-D experience for better human-robot interactions.
Automated detection and tracking of human faces, objects, landmarks, enemies and friends can greatly
enhance a robot’s awareness of the environment, which in turn is essential for generating context
sensitive actions. For example, identifying and understanding people dynamically will result in
successful interactions.
The “robot-keeping-alive” mode of human-robot interaction completely changes the way in which
humans and robots interact, as well as how robots interact with each other. Robots are no longer “dead”
during human intervention. Instead they continuously experience physical events, including human
interventions, and use them to improve their future performance. Autonomous robots will learn through
human interactions as well as their own practice.
Multimodal parallel perception will enable, for the first time, autonomous robots to sense and perceive
concurrently visual, auditory and touch multimodal environments. For the first time, these visual,
auditory and touch perceptual capabilities are highly integrated with the online generation of
context-appropriate behaviors. In other words, perception and behavior are not two separate processes.
These milestone advances mark the overcoming of major theoretical, methodological and technical
challenges in our past work.
The integration technology will greatly enhance the overall capability of autonomous robot operation in
an uncontrolled real-world environment, indoor and outdoor, in ways that are not possible with existing
task-specific and platform-specific technologies.
The proposed evaluation technology will provide rigorous quantitative data on the capability of the
proposed technologies, as well as comparison data against other existing technologies. With a clear
understanding of the strengths and limitations, the system proposed here will enable autonomous
perception-based robot technology to be available for FCS application by the year 2020.
D Technical Approach
D.1 Detailed Description of Technical Approach
D.1.1 System Architecture
The proposed CACI software framework is illustrated in Figure 2.
Figure 2: The software architecture of the CACI system.
The robot software contains three coarse layers distinguished by the time scale of their actions: planning
layer working on minute scale, perception-behavior layer working on second scale and servo control layer
working on millisecond scale.
The sensor inputs are available for every layer, depending on the need of each layer. The 3-D
range/intensity map is constructed from ladars and trinocular video cameras. It serves as a digital 3-D
site model (Chellappa, et al. 2001) available to the robot and the human overseer.
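The shared range/intensity map can be pictured, in much-simplified form, as a sparse voxel grid into which both ladar and stereo returns are fused. The voxel size, the dictionary representation and the function names below are illustrative assumptions, not the proposed system:

```python
# Minimal sketch of fusing range/intensity returns into a shared 3-D map.
# VOXEL size and the sparse-dict grid are assumptions for illustration only.

VOXEL = 0.25  # meters per voxel side (hypothetical choice)

def insert_return(grid, x, y, z, intensity):
    """Accumulate one range/intensity return into a sparse voxel grid."""
    key = (int(x // VOXEL), int(y // VOXEL), int(z // VOXEL))
    count, total = grid.get(key, (0, 0.0))
    grid[key] = (count + 1, total + intensity)

def mean_intensity(grid, x, y, z):
    """Mean observed intensity at a point, or None if never observed."""
    key = (int(x // VOXEL), int(y // VOXEL), int(z // VOXEL))
    count, total = grid.get(key, (0, 0.0))
    return total / count if count else None
```

Averaging per voxel lets returns from multiple sensors and viewpoints reinforce one another, which is the essence of fusing ladar and trinocular data into one site model.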
The state integrator is the place to post state information required for inter-layer communication. It is
divided into three sections, one for the state of each layer. Every layer can read the state of other layers
from the corresponding area in the state integrator. The action integrator records actions issued from each
layer to the next lower layer. A higher layer will issue action commands to be executed by only the next
lower layer, but all the layers can read action commands from other layers if needed. Due to our
decomposition of layers based on time scales, actions from different layers do not conflict. For example,
when the deliberative layer wants the reactive layer to move forward, the reactive layer will try to move
forward on a minute scale, although it might temporarily move sideways for a short period in order to
avoid an obstacle.
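The state/action integrator idea described above can be sketched as follows; all class and method names are hypothetical illustrations, not the actual CACI interfaces:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the state/action integrators for the three layers
# (planning, perception-behavior, servo). Names are illustrative only.

@dataclass
class StateIntegrator:
    """One section per layer; every layer may read every section."""
    sections: dict = field(default_factory=lambda: {
        "planning": {}, "perception_behavior": {}, "servo": {}})

    def post(self, layer, key, value):
        self.sections[layer][key] = value        # a layer posts its own state

    def read(self, layer, key):
        return self.sections[layer].get(key)     # any layer may read it

@dataclass
class ActionIntegrator:
    """Records the action each layer issues to the next lower layer."""
    commands: dict = field(default_factory=dict)

    def issue(self, from_layer, command):
        self.commands[from_layer] = command      # executed by next layer down

    def latest(self, from_layer):
        return self.commands.get(from_layer)     # readable by all layers
```

Because each layer writes only its own section and its own outgoing command, the time-scale decomposition keeps the layers from issuing conflicting actions.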
A major strength of the CACI framework is that it is designed to work not only with cheap,
low-dimensional sensors, such as sonar and infra-red sensors, but also with high-dimensional,
high-data-rate sensors, such as vision, speech and dense range map inputs. It also addresses several other challenges in
autonomous robots. For example, sensory perception, multisensory integration, and behavior generation
based on distributed sensory information are all extremely complex. Even given an intelligent human
overseer, the task of programming mobile robots to successfully operate in unstructured (i.e., unknown)
environments is tedious and difficult. One promising avenue towards smarter and easier-to-program
robots is to equip them with the ability to learn new concepts, behaviors and their association. Our
pragmatic approach to designing intelligent robots is one where a human designer provides the basic
learning mechanism (e.g. the overall software architecture), with a lot of details (features, internal
representations, behaviors, coordination among behaviors, values of behaviors under a context) being
filled in by additional training. The CACI framework is designed with this approach in mind.
The CACI architecture is also suited for distributed control among multiple robots. Each robot perceives
the world around it, including the commands from the human overseer or a designated robot leader. It
acts according to the perceptual and behavioral skills that it has learned. Collectively, multiple robots
acting autonomously can exhibit the desired group perceptual capabilities and context-appropriate
behaviors. A centralized control scheme has proven ineffective for multiple robots.
D.1.2 Integration approach
The proposed CACI is an integrated software system for a wide variety of robot platforms, a wide variety
of environments and a wide variety of tasks. Such challenging integration is not possible without the
methodology breakthroughs that have been achieved and demonstrated recently in MARS PI meetings
and other publications. A component technology that has a very limited applicability and yet is not
equipped with a suitable applicability checker is not suited for integration.
The team members have developed systematic technologies that are suited for integration. For
perception and perception-based behaviors, the thrust is the methodology of autonomous cognitive and
behavioral development. For longer-time behaviors and planning, multiple technologies to be used
include perception-based action chaining (PBAC), the Markov decision process (MDP) and the associated
learning methods. 3D site models are used as external digital maps, external to the robot “brain.”
Integrating these three methodologies as well as other well-proven techniques to be outlined in this
proposal, the proposed CACI software system will reach our goal of:
• cross-platform capability
• cross-task capability
for perception-based autonomous mobile robot software.
The “cross-platform capability” means that the software is applicable to different robot bodies:
• indoor and outdoor
• on-road and off-road
• ground, air, and under-water
• earth-bound, space flight and space station
• small, human and vehicle size.
A particular hardware robot platform is best suited only to a particular type of environment, due to its
hardware constraints, including sensors and effectors. A land robot cannot fly; a helicopter cannot dive under water.
However, different robot bodies do not mean that their software must also be ad hoc, based on totally
different principles. The same word processor can be used to write different articles. The same Windows
2000 operating system (OS) can be used for different computers, each with a different combination of
computation resources and peripherals. This is known as “plug-and-play.” Of course, “plug-and-play” for
autonomous robot software is much harder.
The “plug-and-play” or cross-platform capability, for autonomous robot software is based on the
following well-known basic idea: encapsulate platform-dependent parts behind an application
programming interface (API). From the software point of view, each robot contains three types of
resources, sensors, effectors and computational resource (including CPU, memory, battery etc). From
an application software point of view, different robot platforms simply mean different parameters for
these three types of resources. For example, a camera class has resolution as a parameter, and an arm
class has the degrees of freedom as a parameter.
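The resource-encapsulation idea can be sketched as follows; the class names, parameters and the `Robot` container are illustrative assumptions, not the actual CACI API:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch: platform-dependent resources behind a uniform API,
# so different robot bodies differ only in resource parameters.

class Sensor(ABC):
    @abstractmethod
    def read(self): ...

class Camera(Sensor):
    def __init__(self, width, height):
        self.resolution = (width, height)   # platform-specific parameter
    def read(self):
        # Placeholder frame: a height x width array of zeros.
        return [[0] * self.resolution[0] for _ in range(self.resolution[1])]

class Arm:
    def __init__(self, dof):
        self.dof = dof                      # degrees of freedom as a parameter
    def command(self, joint_angles):
        assert len(joint_angles) == self.dof, "command must match the DOF"

class Robot:
    """A p-n-p compliant body is just a bag of parameterized resources."""
    def __init__(self, sensors, effectors):
        self.sensors, self.effectors = sensors, effectors

indoor_bot = Robot(sensors=[Camera(320, 240)], effectors=[Arm(dof=6)])
```

Application software written against `Sensor`, `Arm` and `Robot` never touches platform specifics, which is what makes the same software usable across compliant bodies.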
The amount of work needed for us to achieve software “plug and play” for robots is large, but it should
be smaller than the counterpart for an OS. This is because the OS has already addressed most “plug and
play” problems for computational resources, sensors, and effectors. For CACI, “plug and play” only needs
to be done at the application program level, which requires the definition of object classes, including a
camera class, laser scanner class, robot limb class, robot wheel class, etc. A majority of these definitions have
been completed and tested in the MARS program by the members of this team. In the proposed work,
we will extend such a “plug and play” work to more sophisticated robots (e.g., Dav humanoid) and a
wider variety of robots available to the team members. As long as the API of a robot platform is
compliant with the “plug and play” specification (i.e., “plug-and-play” compliant, or p-n-p compliant),
the same CACI robot software can be used for many different robots through “plug and play.”
The “cross-task capability” means that the proposed CACI software is not restricted by a specific task or
a few pre-defined tasks. It is applicable to a wide variety of tasks. The type of tasks that the software
can accomplish depends on the three types of resources, the quality of CACI software design and how the
robot is trained (i.e., the Five Factors). The cross-task capability requires:
(1) cross-environments capability,
(2) cross-time-scales capability, and
(3) cross-goals capability.
The environment, time scale, and goal all vary greatly, and are too tedious and too difficult to model by
hand in terms of types and associated parameters.
The “cross environments” capability means that the technology is applicable to various immediate
worlds. In a typical defense setting, little is known about environment before the task execution. The
world around an autonomous robot changes all the time. A path along a corridor that was safe before can
become dangerous if a section of the wall along the corridor has been blasted away, exposing the path to
enemy forces.
The “cross time scales” capability requires the robots to reason at different time scales, from a fraction of
a second to several hours, and at different abstraction levels, from micro-steps about how to make a turn to
macro-steps about how to reach a destination.
The “cross goals” capability implies that the robot must be able to deal with different goals and quickly
adapt to new goals with minimal further training. For example, if the task requires the robot to move to a
site behind an enemy line, a short travel distance should not be the only goal. Avoiding exposure to
hostile forces during travel is a much more important goal. Very often, a longer but safer route is
preferred over a shorter but more dangerous one. Further, a robot must re-plan when new
information arrives that requires an adjustment of the current goal. For example, when the commander
says, “hurry up,” a different behavior pattern should be adopted, which may result in a partial
modification of the planned route.
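One way to picture the “cross goals” re-planning described above is as a re-weighting of the planner’s cost function; the function, weights and numbers below are purely illustrative assumptions:

```python
# Hypothetical illustration of "cross goals": the same planner, re-weighted
# when the goal changes (e.g., the commander says "hurry up").

def edge_cost(length_m, exposure_risk, w_time, w_risk):
    """Combined cost of one route segment; the weights encode the current goal."""
    return w_time * length_m + w_risk * exposure_risk

# The same exposed 100 m segment under two goal weightings:
cautious = edge_cost(100.0, 0.9, w_time=1.0, w_risk=500.0)
hurried = edge_cost(100.0, 0.9, w_time=1.0, w_risk=50.0)
# Under the cautious weighting an exposed segment is far more expensive, so a
# longer but safer route wins; "hurry up" shifts the balance toward speed.
```

Because only the weights change, the robot can adapt to a new goal without retraining the planner itself.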
The “cross-task capability” is much harder to accomplish than the “cross-platform capability.”
There is a very limited set of platform resources, in terms of the types of sensors, effectors and
computational resources. They can all be well defined in terms of type and the associated parameters.
For example, a video camera type has resolution as a parameter. However, robotic tasks involve much
wider variation in the environment, the time scales and the goals.
Lack of “cross-task” capability is a major reason for the “action cute and perception weak” phenomenon
of most existing humanoid robots and mobile autonomous robots in the US and Japan.
To be action cute, one way is to program carefully. If the limbs have redundant degrees of freedom,
however, direct programming becomes very difficult. Innovative work has been done on learning
methods for training robots to perform actions with redundant body parts (e.g., Grupen et al.
2000, Vijayakumar & Schaal 2000). These studies of action learning through doing follow an appropriate
methodology that has its roots in biological motor development. However, just as cross-platform
capability is not as difficult as cross-task capability, producing perception-strong robots is much more
challenging than producing a sequence of actions that does not require much perception.
However, an unknown environment poses a more challenging problem, compared with a known
redundant body, for the following major reasons:
• The model of the environment is unknown, while a model of a robot body is known.
• The degree of freedom of a little known environment is much larger than that of a redundant
robot body. The former is on the order of millions (sensory elements, e.g., pixels), while the
latter is on the order of dozens. Of course, these millions of sensory elements are not totally
independent, but we are unsure of their dependency, even when we have a range map.
• An unknown environment changes all the time but a robot body does not change its structure
even though it moves its limbs.
Perception is still the bottleneck of autonomous robots after decades of research in perception-based
autonomous robots.
Recent advances in a new direction called Autonomous Mental Development (AMD) (Weng et al. 2001)
have provided a powerful tool for dealing with the robot perception bottleneck. Battlefields are
uncontrolled environments, from lighting, to weather, to scenes, to objects. Why can human perception
deal with uncontrolled environments? As indicated by the neuroscientific studies cited in our recent Science paper
(Weng et al. 2001), the following is what a human developmental program does:
A. Derive processors with internal representation from physical experience: The representation
includes filters, their inter-connections, etc. The physical experience is sensed by sensory signals.
B. The processors process sensory signals and generate action outputs: Compute the response of
all filters from real time sensory input (physical experience).
The human brain performs (A) and (B) simultaneously and incrementally in real time (Flavell et al. 1993,
Kandel et al. 2000). (A) is done incrementally and accumulatively from real-time inputs, while (B) is
computed from each input in real time. Traditional approaches (knowledge-based, learning-based,
behavior-based and evolutional) rely on a human programmer to design representation in (A) and the
robot program does only (B). The traditional approaches are task-specific and environment-specific
since a human programmer can only competently think of a particular task and a particular environment.
Sometimes, human-designed representations do contain some parameters that will be determined by data,
and this process is known as machine learning. However, the representation is designed by the human
programmer for a specific task in a specific environment. Therefore, the learning-based approach is still
task-specific.
With the new developmental approach, a human designs a developmental program that performs (A) and
(B). The essence of the new approach is to enable a robot to generate representation automatically,
online and in real time, through interactions with the physical environment. Since the developmental
program can develop a complete set of filters for any environment while the robot is doing any task, the
developmental approach is the only approach that is not task-specific and can deal with any
environment. In other words, a properly designed developmental program can learn in any environment
for any task. In practice, of course, simple tasks are learned by robots before more sophisticated tasks
can be more effectively learned, like a new recruit in the army.
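The simultaneous, incremental performance of (A) and (B) described above can be pictured with a toy loop: every sensory input both updates the internal representation (here, running-mean “filters”, a crude stand-in for learned feature detectors) and produces an immediate response. The clustering rule and learning rate are illustrative assumptions, not the actual developmental program:

```python
# Toy sketch: (A) derive representation from experience and (B) generate a
# response, performed together on each input. All details are illustrative.

def develop_and_act(stream, n_filters=2, rate=0.1):
    filters = []   # representation derived from experience, not hand-designed
    actions = []
    for x in stream:
        if len(filters) < n_filters:
            filters.append(x)                                  # (A) recruit a filter
        else:
            i = min(range(len(filters)), key=lambda j: abs(filters[j] - x))
            filters[i] += rate * (x - filters[i])              # (A) incremental update
        i = min(range(len(filters)), key=lambda j: abs(filters[j] - x))
        actions.append(i)                                      # (B) respond in real time
    return filters, actions
```

The point of the sketch is that no representation exists before experience arrives; the filters emerge from the input stream itself, which is what distinguishes the developmental approach from hand-designed representations.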
A natural question to be raised here is: How much training does a robot need before it can execute a series
of tasks? According to our experience in the SAIL developmental project, the time spent training the SAIL
robot to demonstrate a series of breakthrough capabilities is approximately 5 to 7 hours, much less than
the time spent writing the SAIL developmental program (which is in turn a lot shorter than the time
needed to program traditional ad hoc methods). The overall time for developing a robot using AMD is much
less than any traditional perception method. Further, there is virtually no parameter hand-tuning
required.
It is worth noting that we do not require a robot user to train his robot, although it is allowed. Robot
training is done in the robot production stage, and it does not need to be done in the robot deployment
stage. A robot user receives well-trained robots, if he orders them from a commercial company.
Further, the AMD technology is most useful to autonomous robotic systems for FCS, but it is also very
useful for perceptual capabilities of any autonomous system, such as surveillance systems, target
detection systems, and intelligent human-computer interfaces.
The basic requirements of autonomous cognitive development for robots include:
1. Autonomously derive the most discriminating features from high-dimensional sensory signals
received online in real time (in contrast, traditional learning approaches use human-defined
features, such as colors, which are not sufficient for most tasks).
2. Automatically generate and update a representation or model of the world (clusters of sensory
vectors which form feature subspaces, the basis vectors of the subspaces, etc.), together with the
inter-connections of the feature spaces, incrementally and automatically (in contrast, traditional
machine learning approaches use a human-designed world representation, and learning only
adjusts predesigned parameters).
3. Real-time online learning with a large memory. For scaling up to a large number of real-world settings
and environments and for real-time speed, self-organize the representation of perception in a
coarse-to-fine way for very fast logarithmic time complexity (e.g., the Hierarchical Discriminant
Regression (HDR) tree (Hwang & Weng 2000) used in the SAIL developmental robot).
4. Support flexible learning modes: supervised learning, reinforcement learning and communicative learning can all be conducted interactively and concurrently while the robot stays “alive” during human intervention or instruction.
The realization of the above four basic requirements achieves the revolutionary “cross task” capability,
which includes “cross environments,” “cross perception-based behaviors,” “cross time scales” and “cross
goals” capabilities. For example, because the representation (or model) of the world is generated through online, real-time interactions between the robot and its environment, rather than hand-designed by a human programmer, the robot software is applicable to any environment. Weng and his coworkers have demonstrated that the SAIL robot can autonomously navigate through both indoor and outdoor environments (Weng et al. 2000, Zhang et al. 2001) guided by its vision using video cameras, a concrete demonstration of our cross-environment capability. No other robot software that we know of has demonstrated this capability for both indoor and outdoor navigation.
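The coarse-to-fine, logarithmic-time retrieval of requirement 3 above can be sketched as follows. This is a toy stand-in for an HDR-style tree, not the actual IHDR implementation; the class, names and distance metric are illustrative assumptions:

```python
import math

def dist(x, c):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

class Node:
    """One node of a coarse-to-fine retrieval tree (toy stand-in for HDR)."""
    def __init__(self, centers, children=None, output=None):
        self.centers = centers    # one cluster center per child
        self.children = children  # None at a leaf
        self.output = output      # stored action/label at a leaf

def retrieve(node, x):
    """Descend the tree, picking the nearest center at each level.
    For a balanced tree over n stored samples the depth is O(log n),
    which is what makes real-time retrieval from a large memory feasible."""
    while node.children is not None:
        i = min(range(len(node.centers)), key=lambda j: dist(x, node.centers[j]))
        node = node.children[i]
    return node.output
```

Each level refines the query into a smaller cluster, so the cost grows with tree depth rather than with the number of stored samples.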
The integrated 3-D map is not used as the robot’s internal “brain” representation, since such a monolithic internal representation is not as good as a distributed representation for robot perception, as discussed by Rodney Brooks (Brooks, 1991). Instead, 3-D maps are stored externally, outside the robot’s “brain,” where they are accessible to both robots and humans. The robot and the human overseer can refer to, index and update the digital 3-D map, and can retrace a trajectory to provide retrotraverse, route replay, “go to point X” and other capabilities.
D.1.3 Evaluation approach
The “cross task” capability does not mean that our software is able to do any task. As we discussed before, the five factors determine what tasks a robot can do and how well it does them. The proposed CACI system is applicable to various behaviors, including autonomous navigation, collision avoidance, and object recognition and manipulation. However, different tasks require different amounts of training. Currently, tasks that can be executed within a few seconds (e.g., 1 to 10 seconds), which include most robot perception tasks, can be trained using AMD methods. For tasks that take more time, such as path planning, AMD may require a considerable amount of training; for those, a hand-designed model, such as an MDP method, is more effective. The proposed CACI system will integrate existing technologies according to the nature of the tasks.
The evaluation approach includes the evaluation of the following components:
1. The performance of each component technology.
2. The environmental applicability of each component technology.
3. The effectiveness of integration in terms of degree of increased capabilities.
4. The limitations of the integrated software.
5. The future directions to go beyond such limitations.
The criteria for evaluating progress include:
(1) the frequency at which human overseers need to intervene,
(2) the scope of tasks that the technology can deal with,
(3) the scope of machine perception,
(4) the flexibility of human robot interactions, and
(5) the cost of the system.
D.1.4 3-D map generation from ladars
The robot platforms that we will experiment with are equipped with ladars for range sensing and 3-D site
map integration. Ladar sensors typically collect multiple returns (range gates) and intensity associated
with each valid return. The 3-D position of the surface can be computed based on the robot’s position,
laser beam orientation, and the timing of laser returns. Figure 3 shows examples of ladar intensity map
(left) and height map (right) generated from single path overhead ladar data. A better 3-D maps can be
generated by fusing multiple paths and after spatial interpolation. Based on multiple returns of a ladar
beam and/or multiple hits within a cell, we can generate multiple layer 3-D maps (average intensity, min/
max intensity, ground level height, canopy top level map, etc). Figure 3 shows examples of average
intensity map (upper-left), ground level height map (upper-right), canopy top level map (lower-left) and
color coded height map (lower-right).
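The per-return geometry described above can be sketched as follows. This is a minimal illustration assuming a yaw-only robot pose and beam angles given in the robot frame; the function name and conventions are ours, not the actual mapping code:

```python
import math

def ladar_point(robot_xyz, robot_yaw, beam_az, beam_el, range_m):
    """Return the world-frame 3-D hit point of one ladar return.
    beam_az/beam_el are the beam's azimuth/elevation in the robot frame;
    robot_yaw rotates the robot frame into the world frame (roll and
    pitch are omitted for brevity)."""
    # beam direction as a unit vector in the robot frame
    dx = math.cos(beam_el) * math.cos(beam_az)
    dy = math.cos(beam_el) * math.sin(beam_az)
    dz = math.sin(beam_el)
    # rotate about z by robot_yaw, scale by the measured range, translate
    wx = robot_xyz[0] + range_m * (dx * math.cos(robot_yaw) - dy * math.sin(robot_yaw))
    wy = robot_xyz[1] + range_m * (dx * math.sin(robot_yaw) + dy * math.cos(robot_yaw))
    wz = robot_xyz[2] + range_m * dz
    return (wx, wy, wz)
```

Binning such points into vertical cells then yields the intensity and height layers described above.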
Once we generate 3-D maps, we can use them as a common reference onto which information collected from different sensors and/or different platforms can be registered, forming a 3-D site model of the environment (e.g., Chellappa et al. 1997, 2001). Site-model-supported image exploitation techniques can then be employed to perform robot planning and other multi-robot cooperation tasks. For example, given a robot location and orientation, ground-view images can be generated from the site model. Such predicted images are very useful for accurate robot positioning and cross-sensor registration. Multi-layer 3-D representations are efficient for representing 3D scenes of large areas and for movable objects. Figure 4 shows examples of ground-view images projected from a ladar-generated site model for a hypothetical robot view. We expect such ground views can be generated online at a rate of several frames per second at 640x480 image size.
Figure 3: 3-D maps after fusing multiple paths and post-processing such as spatial interpolation. Upper-left: average intensity map; upper-right: ground-level height map; lower-left: canopy top-level height map; lower-right: color-coded height map.
Figure 4: Examples of projected ground-view images for sensor positioning and data fusion, with scene segmentation using both intensity and geometric information.
D.1.5 Range map from trinocular video cameras
Another sensing modality of 3-D map construction is stereo using parallax disparity. Our stereo system is
built around 3D grids of spatial occupancy evidence, a technique we have been developing since 1984,
following a prior decade of robot navigation work using a different method. 2D versions of the grid
approach found favor in many successful research mobile robots, but seem short of commercial
reliability. 3D grids, with at least 1,000 times as much world data, were computationally infeasible until
1992, when we combined increased computer power with 100x speedup from representational,
organizational and coding innovations. In 1996 we wrote a preliminary stereoscopic front end for our fast
3D grid code, and the gratifying results convinced us of the feasibility of the approach, given at least
1,000 MIPS of computer power. From 1999 to 2002, under MARS program funding, we completed a
first draft of a complete mapping program implementing many ideas suggested by the earlier results. The
system uses trinocular color stereo and textured light to range even blank walls, choosing up to 10,000
range values from each trinocular glimpse. Each stereo range is converted to a ray of evidence added to
the grid, generally negative evidence up to the range, and positive evidence at the range. The ray’s
evidence pattern is controlled by about a dozen parameters that constitute a sensor model. A learning
process that adjusts the parameters to minimize the color variance when scene images are projected onto
the occupied cells of result grids greatly improves the map’s quality. A side effect of the learning process
is an average color for visible occupied cells. Many internal images of our new grids, thus colored, can
truly be mistaken for photographs of a real location (see Figure 5), and are clearly superior for navigation
planning. This proposal would enable us to extend that start towards a universally convincing
demonstration of practical navigation, just as the requisite computing power arrives.
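The ray-of-evidence update described above can be sketched as follows. The voxel size and the miss/hit weights here are hypothetical placeholders for the roughly dozen tuned sensor-model parameters:

```python
def add_evidence_ray(grid, origin, direction, range_m, cell=0.1,
                     miss=-0.4, hit=2.0):
    """Accumulate one stereo range reading into a voxel grid of
    log-odds evidence: negative evidence along the ray (cells the
    camera saw through), positive evidence at the measured range.
    grid is a dict mapping integer voxel coordinates to evidence;
    direction is a unit vector from the sensor origin."""
    n = int(range_m / cell)
    for k in range(n):
        d = k * cell
        voxel = tuple(int((o + d * u) / cell) for o, u in zip(origin, direction))
        grid[voxel] = grid.get(voxel, 0.0) + miss   # free space along the ray
    end = tuple(int((o + range_m * u) / cell) for o, u in zip(origin, direction))
    grid[end] = grid.get(end, 0.0) + hit            # occupied at the range
```

Summing many such rays, with per-parameter tuning of the evidence pattern, is what turns thousands of stereo ranges per glimpse into a dense occupancy map.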
Our present good results were obtained from a carefully position-calibrated run of cameras through a 10
meter L-shaped corridor area. The next phase of the project will try to derive equally good maps from
image sequences collected by imprecisely traveling robots. We have tested direct sampled convolution and FFT-convolution-based approaches to registering each robot view, encoded as a local grid, to the
global map. Both work on our sample data; the dense FFT method gives smoother, more reliable matches, but is several times too slow at present. We will attempt to speed it up, possibly by applying it
to reduced-resolution grids, possibly by applying it to a subset of the map planes. When we are satisfied
with mapping of uncalibrated results, we will attempt autonomous runs, with new code that chooses paths
as it incrementally constructs maps. When the autonomous runs go satisfactorily, we will add code to
orchestrate full demonstration applications like patrol, delivery and cleaning.
Figure 5: Some views of the constructed 3-D dense map of the scene.
The suitability of the grid for navigation is probably best shown in plan view. The image above right was
created by a program that mapped each vertical column of cells in the grid to an image pixel. The color of
the pixel is the color of the topmost cell in the largest cluster of occupied cells in the column. The plants
in the scene are rendered dark because the low cameras saw mostly the dark shadowed underside of the
topmost leaves.
Using Data from Diverse Sensors. The evidence grid idea was initially developed to construct 2-D
maps from Polaroid sonar range data. A sensor model turned each sonar range into a fuzzy wedge of
positive and negative regions that were added to a grid in “weight of evidence” formulation. A later
experiment combined stereoscopic and sonar range information in a single grid. The 1996 version of our
3-D grid code was developed for two-camera stereo, but was used shortly thereafter to map data from a
scanning laser rangefinder, whose results were modeled as thin evidence rays. Grids can
straightforwardly merge data from different spatial sensors. To get good quality, however, not only must
the individual sensor models be properly tuned, but the combination of models must be tuned as well.
Our color-variance learning method is suitable for directing the adjustment, if the sensor mix contains at
least one camera. The spatial sensors, each with its own sensor model, build the grid, the sensor model
parameters are evaluated by coloring the grid from the images, and the process repeats with the
parameters adjusted in the direction of decreasing variance. Once good settings are found, they can be
retained to construct grids during navigation. Different types of environment may be better captured with
different parameter settings. A collection of parameter sets trained in different environments can be used
adaptively if a robot carries a camera. Each possible model can be used to construct a grid from the same
set of recent sensor data, and subjected to color variance evaluation. The model that gives the lowest
variance is the most suitable in the given circumstances.
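The adaptive model-selection loop described above can be sketched as follows. Here `build_grid` and `color_variance` are hypothetical callbacks standing in for the grid-construction and color-variance-evaluation code:

```python
def select_sensor_model(param_sets, sensor_data, build_grid, color_variance):
    """Pick the sensor-model parameter set best suited to the current
    surroundings: build a grid from the same recent sensor data under
    each candidate model and keep the one whose projected-image color
    variance is lowest."""
    best, best_var = None, float("inf")
    for params in param_sets:
        grid = build_grid(sensor_data, params)   # construct a candidate map
        var = color_variance(grid)               # evaluate it against the images
        if var < best_var:
            best, best_var = params, var
    return best
```

The same comparison can be rerun periodically during navigation, so a robot carrying a camera adapts its sensor models as the environment type changes.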
D.1.6 Integration of range maps from ladars and trinocular cameras
3-D maps from ladar and from trinocular stereo cameras will be integrated into a single map with intensity. Three types of information will be used for the integration: the registration between the two maps, and the resolution and uncertainty of each source. Generating 3D maps from multi-view ladar images requires accurate sensor location and orientation.
For indoor applications, we plan to use the 3D map built from 3D grids of spatial occupancy evidence to obtain accurate laser sensor location and orientation, and then update the 3D map using laser returns. Laser images can be generated at a much higher frame rate and with better spatial and range accuracies.
In particular, using our multi-look active vision approach, we can control the laser toward a region with
low confidence scores from trinocular cameras, and update the 3D map and associated confidence scores
in the region. Confidence of range obtained from ladar data can be measured from the relative strength of
laser beams.
The trinocular stereo has a measure of uncertainty at each volumetric cell. Integration of the volumetric information from the trinocular module will therefore use a Bayesian estimate for optimal integration.
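For two independent Gaussian range estimates, the optimal Bayesian integration reduces to inverse-variance weighting, as in this minimal sketch:

```python
def fuse_ranges(r1, var1, r2, var2):
    """Minimum-variance Bayesian fusion of two independent Gaussian
    range estimates, e.g. one from the ladar and one from the
    trinocular stereo module: weight each by its inverse variance."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    fused = (w1 * r1 + w2 * r2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)        # fused estimate is always more certain
    return fused, fused_var
```

The lower-variance sensor dominates the fused value, which matches the multi-look strategy of steering the laser toward cells where stereo confidence is low.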
D.1.7 Perception
The middle layer in the architecture shown in Figure 2 is the perception layer, which carries out the
development of perception capability, performs perception and generates perception-based behaviors.
An advantage of our approach is that perceptions for vision, audition and touch are all unified, guided by
a set of developmental principles.
We have extensive experience in computer vision, visual learning, robot construction, robot navigation, robot object manipulation, and speech learning, including sound source localization from microphone arrays and action chaining. Our decade-long effort in enabling a machine to grow its perceptual and behavioral capabilities has gone through four systems: Cresceptron (1991 – 1995), SHOSLIF (1993 – 2000), SAIL (1996 – present) and Dav (1999 – present).
Cresceptron is an interactive software system for visual recognition and segmentation. The major
contribution is a method to automatically generate (grow) a network for recognition from training images.
The topology of this network is a function of the content of the training images. Due to its general nature
in representation and learning, it turned out to be one of the first systems that have been trained to
recognize and segment complex objects of very different natures from natural, complex backgrounds
(Weng et al. 1997). Although Cresceptron is a general developmental system, its efficiency is low.
SHOSLIF (Self-organizing Hierarchical Optimal Subspace Learning and Inference Framework) was the next project, whose goal was to resolve the efficiency of self-organization. It automatically finds a set of Most Discriminating Features (MDF) using Principal Component Analysis (PCA) followed by Linear Discriminant Analysis (LDA), for better generalization. Its hierarchical structure is organized as a tree to reach logarithmic time complexity. Using it in an observation-driven Markov Decision Process
(ODMDP), SHOSLIF has successfully controlled the ROME robot to navigate in MSU’s large
Engineering Building in real-time using only video cameras, without using any range sensors (Chen &
Weng 1998). All the real-time computing was performed by a slow Sun SPARC Ultra-1 Workstation.
Therefore, SHOSLIF is very efficient for real-time operation. However, SHOSLIF is not an incremental
learning method.
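The PCA-then-LDA derivation of Most Discriminating Features can be illustrated with a toy two-class sketch. This is not the actual SHOSLIF code; the regularization term, dimensions and function name are illustrative assumptions:

```python
import numpy as np

def mdf_direction(X0, X1, n_pca=2):
    """Most Discriminating Feature sketch: project into a PCA subspace
    to discard degenerate dimensions, then compute a two-class Fisher
    (LDA) direction there, so between-class separation is maximized
    relative to within-class scatter."""
    X = np.vstack([X0, X1])
    mean = X.mean(axis=0)
    # PCA basis from the top right-singular vectors of the centered data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_pca]                               # n_pca x d projection
    Y0, Y1 = (X0 - mean) @ P.T, (X1 - mean) @ P.T
    m0, m1 = Y0.mean(axis=0), Y1.mean(axis=0)
    Sw = np.cov(Y0.T) + np.cov(Y1.T)             # within-class scatter
    w = np.linalg.solve(Sw + 1e-6 * np.eye(n_pca), m1 - m0)
    return P.T @ w                               # discriminant direction in input space
```

The resulting direction ignores high-variance nuisance factors (such as lighting) that do not separate the classes, which is the point of using LDA after PCA.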
SAIL (Self-organizing, Autonomous, Incremental Learner) is the next-generation robot after SHOSLIF. The objective of the SAIL project is to automate real-time, incremental development of robot perceptual and behavioral capabilities. The internal representation of the SAIL robot is generated autonomously by the robot itself, starting from a design of a coarse architecture. A self-organization engine called Incremental Hierarchical Discriminant Regression (IHDR) was the critical technology that achieved the stringent real-time, incremental, small-sample-size, large-memory and generalization requirements (Hwang & Weng 2000). IHDR automatically and incrementally grows and updates a tree (network) of nodes (remotely resembling cortical areas). Each node holds an incrementally updated feature subspace, derived from the most discriminating features for better generalization. Discriminating features disregard factors that are not related to perception or action, such as lighting in object recognition and autonomous navigation.
Figure 6: Partial internal architecture of a single level in the perception layer
The schematic architecture of a single level of the perception layer is shown in Figure 6. Three types of
perceptual learning modes have been implemented on SAIL: learning by imitation (supervised
learning), reinforcement learning and communicative learning. First, a human teacher pushed the
SAIL robot around the Engineering Building several times, using its body pressure sensors mounted on
its body corners. This is learning by imitation. The system generalizes by disregarding areas that are not
important to navigation, using the HDR real-time mapping engine. The system runs at about 10 Hz, i.e., 10 updates of navigation decisions per second; every 100 milliseconds, a different set of feature subspaces is used. At later stages, when the robot can explore more or less on its own, the
human teacher uses reinforcement learning by pressing its “good” or “bad” button to encourage and
discourage certain actions. These two learning modes are sufficient to conveniently teach the SAIL robot
to navigate autonomously in unknown environments.
Recently, we have successfully implemented the new communicative learning mode on the SAIL robot.
First, in the language acquisition stage, we taught SAIL simple verbal commands, such as “go ahead,”
“turn left,” “turn right,” “stop,” “look ahead,” “look left,” “look right,” etc., by speaking to it online while
guiding the robot to perform the corresponding action. In the next stage, teaching using language, we
taught the SAIL robot what to do in the corresponding context through verbal commands. For example,
when we wanted the robot to turn left (a fixed amount of heading increment), we told it to “turn left.” If we wanted it to look left (also a fixed amount of increment), we told it to “look left.” In this way, we did not need to physically touch the robot during training and could instead use much more sophisticated verbal commands, which makes training more efficient and more precise. Figure 7 shows the SAIL robot
navigating in real-time along the corridors of the Engineering Building, at a typical human walking speed,
controlled by the SAIL-3 perception development program.
Figure 7: Left: SAIL developmental robot custom built at Michigan State University. Middle and
right: SAIL robot navigates autonomously using its autonomously developed visual perceptual
behaviors. Four movies are available at http://www.egr.msu.edu/mars/
Figure 8 shows the graphic user interface for humans to monitor the progress of online grounded speech
learning.
Internal attention for vision, audition and touch is a very important mechanism for the success of multimodal sensing. A major challenge of perception with high-dimensional inputs such as vision, audition and touch is that often not all the lines in the input are related to the task at hand. Attention selection lets signals on only a bundle of relevant lines pass through, while others are blocked. Attention selection is an internal effector, since it acts on the internal structure of the “brain” instead of the external environment.
First, each sensing modality (vision, audition and touch) needs intra-modal attention to select a subset of internal output lines for further processing while disregarding unrelated lines. Second, inter-modal attention selects one or more modalities to attend to. Attention is necessary not only because our processors have limited computational power but, more importantly, because focusing on only related inputs enables powerful generalization.
Figure 8: The GUI of AudioDeveloper: (a) During online reinforcement learning, multiple actions
are generated; (b) After the online learning, only the correct action is generated.
We have designed and implemented a sensory mapping method, called "Staggered Hierarchical Mapping (SHM)," shown in the figure below, and its developmental algorithm. Its goals include: (1) to generate feature representations for receptive fields at different positions in the sensory space and with different sizes, and (2) to allow attention selection for local processing. SHM is a model motivated by the human early visual pathway, including the processing performed by the retina, the Lateral Geniculate Nucleus (LGN) and the primary visual cortex. A new Incremental Principal Component Analysis (IPCA) method is used to automatically develop orientation-sensitive and other needed filters. From sequentially sensed video frames, the proposed algorithm develops a hierarchy of filters whose outputs are uncorrelated within each layer, but with increasing receptive-field scale from low to high layers. To study the completeness of the representation generated by the SHM, we experimentally showed that the response produced at any layer is sufficient to reconstruct the corresponding "retinal" image to a great degree. This result indicates that the internal representations generated for receptive fields at different locations and sizes are nearly complete, in the sense that they do not lose important information. The attention selection effector is internal and thus cannot be guided from the "outside" by a human teacher; the behaviors for internal effectors can be learned through reinforcement learning and communicative learning.
Figure: The architecture of sensory mapping, whose output feeds the cognitive mapping (HDR). It allows not only a bottom-up response computation, but also a top-down attention selection; the oval indicates the lines selected by the attention selector.
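The covariance-free flavor of the IPCA update can be illustrated with a toy sketch for the first principal component. This simplified version omits the amnesic averaging used in practice, so it is a sketch of the idea rather than the actual filter-development code:

```python
import numpy as np

def first_component(samples):
    """Incremental PCA sketch: the estimate v of the first principal
    component is updated from each incoming sample x without ever
    forming a covariance matrix. Each step blends the old estimate
    with x weighted by x's projection on v."""
    v = None
    for n, x in enumerate(samples, start=1):
        x = np.asarray(x, dtype=float)
        if v is None:
            v = x.copy()                  # initialize with the first sample
        else:
            v = (n - 1) / n * v + (1 / n) * x * (x @ v) / np.linalg.norm(v)
    return v / np.linalg.norm(v)
```

Because each update touches only the current frame, the filters can be developed online from a video stream, as required for a developmental robot.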
D.1.8 Human and face tracking and recognition
Due to the central role that humans play in human-robot interaction, the perception layer
contains a dedicated face recognition subsystem for locate, track and recognize human
faces. When the main perception layer detects a possible presence of a human face, the
human face module is applied automatically. Prior knowledge about humans is used in
programming this subsystem. This is an example how a general perception system can
incorporate a special purpose subsystem.
Figure 9: The face recognition subsystem.
In the human face module, we plan to efficiently locate and track faces for authentication
in a dynamic scene by using skin color and temporal motions of human faces and body.
We propose a subsystem (see Figure 9) that detects and tracks faces based on skin color
and facial components (e.g., eyes, mouth, and face boundary), estimation of face and
body motions, and motion prediction of facial components and human body. In our
approach, video color is normalized by estimating reference-white color in each frame.
Detection of skin color is based on a parametric model in a nonlinearly transformed
chrominance space (Hsu et al. 2002). Motion is detected by both frame differencing and
background subtraction, where the background of a dynamic scene is smoothly and
gradually updated. Facial components are located using the information of luminance and
chrominance around the extracted skin patches and their geometric constraints.
Parametric representations of these facial components can then be generated for motion
prediction on the basis of Kalman filtering. Human bodies are detected based on the
match of human silhouette models. The detected bodies can provide the information of
human gaits and motions.
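The combination of frame differencing with a smoothly updated background model, as described above, can be sketched on flat grayscale frames. The thresholds and blending rate are hypothetical values, not the tuned ones:

```python
def moving_mask(frame, prev_frame, background, diff_thresh=20,
                bg_thresh=20, alpha=0.05):
    """Motion detection sketch combining frame differencing with
    background subtraction. The background is updated smoothly and
    gradually (running average) so that a dynamic scene does not
    corrupt it. Images are flat lists of grayscale values."""
    mask, new_bg = [], []
    for f, p, b in zip(frame, prev_frame, background):
        moving = abs(f - p) > diff_thresh or abs(f - b) > bg_thresh
        mask.append(1 if moving else 0)
        # blend only non-moving pixels into the background model
        new_bg.append(b if moving else (1 - alpha) * b + alpha * f)
    return mask, new_bg
```

Skin-color detection is then applied only inside the moving regions, which is what yields the moving skin patches used as face candidates.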
Human faces are detected and tracked based on the coherence of locations, shapes, and motions of
detected faces and detected bodies. The detected facial components are aligned with a generic face model
through contour deformation, and result in a face graph represented at a semantic level. Aligned facial
components are transformed to a feature space spanned by Fourier descriptors for face matching. The
semantic face graph allows face matching based on selected facial components, and also provides an
effective way to update a 3D face model based on 2D images (Hsu & Jain, 2002b). Figure 10 shows an
example of detection of motion and skin color (Hsu & Jain 2002a). Figure 11 gives an example of tracking results without prediction (Hsu & Jain 2002a). Figure 12 shows the construction of a 3D
face model (Hsu & Jain 2001) and face matching by using 2D projections of the 3D model and the
hierarchical discriminant regression algorithm (Hwang & Weng 2000).
Figure 10: An example of motion detection in a video frame: (a) A color video frame; (b) extracted
regions with significant motion; (c) detected moving skin patches shown in pseudocolor; (d)
extracted face candidates described by rectangles.
Figure 11: An example of tracking results on a sequence containing five frames of two subjects, shown in (a)-(e) every 2 sec. Each detected face is described by an ellipse and an eye-mouth triangle. Note that in (d) the two faces are close to each other; therefore, only face candidates are shown.
Figure 12: Face modeling and face matching: (a) input color image; (b) input range image; (c) face
alignment (a generic model shown in red, and range data shown in blue); (d) a synthetic face; (e)
the top row shows the 15 training images generated from the aligned 3D model; the bottom row
shows 10 test images of the subject captured from a CCD camera. All the 10 test images of the
subject shown in the bottom row were correctly matched to our face model.
Planner
We have designed and implemented a hierarchical architecture and its developmental program for
planning and reasoning at different levels. Symbolically, the perception based hierarchical planning
can be modeled as
C_c → C_s1 → A_s1 → C_s2 → A_s2 ⇒ C_c → A_s1 → A_s2
where C_c is a higher-level goal, and C_s1 and C_s2 are lower-level goals which lead to behaviors A_s1 and A_s2, respectively; → means “followed by,” and ⇒ means “develops into.” The robot was first taught to produce planned action A_s1 given subgoal C_s1, and planned action A_s2 given subgoal C_s2. The robot is then given only the higher-level goal C_c, without the subgoals C_s1 and C_s2, and is expected to produce behaviors A_s1 and A_s2 consecutively. Note that we previously called this capability action chaining; in planning, the command is the goal.
For planning, each goal has alternative actions and the action is only recalled “in the premotor cortex”
and is not actually executed. This new approach to planning can take into account rich context
information, such as:
• Context: when the time is not tight. Goal: go to landmark 1 from start. Plan: take action 1.
• Context: when the time is tight. Goal: go to landmark 1 from start. Plan: take action 1a.
• Context: when the time is not tight. Goal: go to landmark 2. Plan: take actions 1 and 2
consecutively.
• Context: when the time is tight. Goal: go to landmark 2. Plan: take actions 1a and 2a consecutively.
If each of the above lines has only one learned sequence, we say that the corresponding planning scheme has been learned. Otherwise, the program will evaluate the performance of each plan and select the best one. This is the planner learning stage. In other words, the goal of planner learning is to train the robot so that each context leads to a unique action sequence. When the condition associated with a plan changes, the planner will run again to reach the best plan.
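Once learned, the planner behaves like a (context, goal) → action-sequence association, which can be pictured as a simple lookup. This is a toy stand-in; the real association is learned through interaction, not hand-tabulated:

```python
def make_planner():
    """Toy stand-in for the learned planner: a (context, goal) ->
    action-sequence lookup mirroring the four example lines above."""
    plans = {
        ("time not tight", "landmark 1"): ["action 1"],
        ("time tight",     "landmark 1"): ["action 1a"],
        ("time not tight", "landmark 2"): ["action 1", "action 2"],
        ("time tight",     "landmark 2"): ["action 1a", "action 2a"],
    }
    def plan(context, goal):
        # once the context is fixed, the plan for a goal is unique
        return plans[(context, goal)]
    return plan
```

Note how the same goal ("landmark 2") yields different action sequences under different contexts, which is exactly the context sensitivity described above.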
A key feature of this type of planning is that it accommodates updated environmental conditions. Each of the above cases is basically equivalent to the action chaining mechanism that we have designed and implemented; the major difference is that the action is rehearsed but not executed. Given only a goal, multiple plans are possible; once the context (or new changes in the goal or performance measurements) is taken into account, the plan becomes unique. Figure 13 illustrates the information causality during planning. Figure 14 shows how the SAIL robot learns planning through abstract composite goals (equivalently, commands).
Figure 13: Internal mechanism of the two-level abstraction architecture.
Figure 14: Planning through abstraction of action sequences.
D.1.9 Servo control
The lowest layer in Figure 2 is the servo controller layer. Knowledge can be represented in the machine in
the forms of a connectionist model, such as a neural network or a differential equation, as well as in the
form of a symbolic model such as a rule-based system, a semantic network or a finite state machine.
Furthermore, the human commands may also consist of two types, logic decision and continuous control.
The key step of developing integrated human/machine systems is to develop a system model or
knowledge representation which is capable of combining symbolic and connectionist processing, as well
as logic decision and continuous control. In order to achieve the goal, the following specific problems
must be investigated and solved:
1. Developing a perceptive frame: A machine, specifically an autonomous system, has its own action reference. The tasks and actions of the system are synchronized and coordinated according to the given action reference. Usually, this action reference is time: a task schedule or action plan can be described with respect to time. Time is an understandable choice of action reference, since it is easy to obtain and to be referenced by different entities of a system. Humans, however, rarely act by referencing a time frame; human actions are usually based on human perceptions. These different action references make it very difficult to develop a human/machine cooperative control scheme. A unified action reference frame that matches human perceptions with the sensory information is the key to combining human reasoning/commands with
autonomous planning/control. The important elements of human perceptions are "force" and
"space" (geometric). They are directly related to human actions and interactions with the
environment. The space describes the static status of actions, and force represents the potential or
actual change in that status, which describes the dynamic part of the actions. Space and force are also fundamental elements of machine actions, and the essence of interactions between humans and machines can be embodied by these two physical quantities. Therefore, force and space can be used as essential action references for an integrated human/machine
system. A perceptive frame, which will be developed based on these action references, is directly
related to the cooperative action of a human/machine system in that it provides a mechanism to
match human perceptions and sensory information. As a result, the human/machine cooperative
tasks can be easily modeled, planned and executed with respect to this action reference frame.
2. Combining symbolic/connectionist representation and logic/continuous control by a Max-Plus Algebra model: A new system model based on the perceptive frame will be developed for
analysis and design of task planning and control of integrated human/machine systems. The
perceptive frame provides a platform to combine the symbolic/connectionist representation of the
autonomous plan and control with human logic/continuous commands. The logical dependency
of actions, and task coordination and synchronization can be modeled by a Max-Plus Algebra
model with respect to the perceptive frame. This will facilitate an analytical method for
modeling, analysis and design of human/machine cooperative systems. New analysis and design
tools are expected to be developed for integrated human/machine systems described in a
perceptive reference frame. As a result, the integrated human/machine systems will not only have
a stable and robust performance, but also have behaviors which are independent with the
operators of the systems. This is essential for an integrated human/machine system to achieve a
reliable and flexible performance.
3. Designing a computing and control architecture for integrated human/machine systems:
The proposed planning and control scheme is based on the perceptive frame model. Time is no
longer an action reference; therefore, system synchronization will depend entirely on sensory
information. This poses a new challenge for designing the computing architecture. A
distributed computing and control architecture will be designed based on a multi-threaded
architecture.
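The Max-Plus Algebra model of item 2 can be illustrated with a toy sketch. Everything below is invented for illustration (the dependency matrix and durations are not from the proposal's model); it shows the core idea that, in max-plus algebra, "addition" is max and "multiplication" is +, so a matrix-vector product expresses synchronization: an action starts only when the latest of its prerequisite actions completes.

```python
# Max-plus "zero" element: marks that event j is NOT a prerequisite of event i.
NEG_INF = float("-inf")

def maxplus_mv(A, x):
    """Max-plus matrix-vector product: x'[i] = max_j (A[i][j] + x[j])."""
    return [max(a_ij + x_j for a_ij, x_j in zip(row, x)) for row in A]

# Two concurrent actions (events 0 and 1) must both finish before a joint
# action (event 2) may start; their durations are 3 and 5 time units.
A = [
    [0,       NEG_INF, NEG_INF],   # event 0 depends only on itself
    [NEG_INF, 0,       NEG_INF],   # event 1 depends only on itself
    [3,       5,       NEG_INF],   # event 2 waits for 0 (dur. 3) and 1 (dur. 5)
]
x = [0, 0, NEG_INF]                # events 0 and 1 start at time 0
print(maxplus_mv(A, x)[2])         # earliest start of the joint action: 5
```

The max in each row is exactly the synchronization constraint: the joint action is released by the slower of its two prerequisites, which is why max-plus models capture logical dependency and coordination in one linear-looking equation.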
D.2 Comparison with Current Technology
D.2.1 Integration
The existing traditional technologies are not suited for integration, for the following major reasons.
1. Individual component technology cannot work in an uncontrolled environment because human
hand-designed features (such as color and texture) are not enough for them to distinguish
different objects in the world.
2. The applicability of such an integrated system is low. Since each component technology only works
in a special setting, the intersection of these settings is nearly a null set: there is almost no
environmental situation under which all of these component technologies can work.
3. In order for the integrated method to deal with an uncontrolled world, one must have an
applicability checker, which determines which component technologies work and which do not.
Unfortunately, no such applicability checker for uncontrolled environments exists. This is called
the applicability checker problem, discussed in (Weng & Chen 2000).
D.2.2 Evaluation
In the past, robot software has been written for a specific task. The evaluation work here is, for the first
time, for robot software that is cross-platform and cross-task. Therefore, the proposed evaluation work is
new and a significant advance of the state of the art.
D.2.3 3-D Dense Map construction
3-D map construction and terrain classification using ladar data collected from a helicopter is a relatively
new problem. Under the DARPA-funded PerceptOR program, we developed ladar mapping software
that is an order of magnitude faster than our competitors'. The proposed effort will benefit from this
previous experience.
Ladars have been used extensively in manufacturing and outdoor environments. For indoor applications,
low-power, eye-safe ladars can give accurate range measurements for quality control and robot
positioning. For outdoor applications, ladars provide a convenient and accurate ranging capability, but are
limited by atmospheric attenuation and some obscurants, so that accuracy suffers as range increases. Our
approach is geared toward a direct-detection ladar with high range resolution but relatively low
cross-range resolution (typical of robot/ground-vehicle-based ladar sensors).
There are impressive programs that use clever statistical methods to construct and maintain two-
dimensional maps from robot traverses. Most of these, unlike our 3D methods, run in real time on present
hardware. The most effective ones use high-quality laser range data (primarily from Sick AG scanners). Laser
ranges have far fewer uncertainties than sonar or stereoscopic range data, avoiding many of the
difficulties that our grid methods were developed to solve. Yet no 2D method has demonstrated the
reliability necessary to guide robots dependably through unknown facilities. A major problem is
hazards located outside the plane of the 2D laser scan. A secondary problem is the monotonous
appearance of indoor spaces when mapped on a constant-height plane: many areas strongly resemble
other areas, and the local map configuration characterizes global location only ambiguously.
Sebastian Thrun’s group at CMU has equipped a robot with a second Sick scanner oriented vertically.
The motion of the robot adds a third dimension to the laser’s 2D scan. Thrun’s group uses a surface-patch
model to characterize the traversed architecture, and projects a camera image onto the surface patches.
The system is able to provide a 3D map suitable for human consumption in real time. In its present form
the system navigates only by means of a second horizontal 2D scanner. The vertical scan provides 3D
information only of positions that the robot has already passed. It should be possible to use the idea for
navigation by, say, placing a scanner high, looking down ahead at 45 degrees. Yet we believe the
approach has weaknesses. The planar representation becomes increasingly expensive and uncertain as
objects become complex (e.g. plants). Since it has limited means to statistically process data and filter
noise, the system depends on the clean signal from a laser rangefinder, which requires light emission. By
contrast, our system benefits from textured illumination especially to range clean walls, but can work with
passive illumination, especially in natural or dirty surroundings, where surface roughness or dirt provides
texture. The following is an argument by analogy as to why we expect that grid methods will displace
surface methods with near-future increases in computer power.
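As a concrete illustration of the grid methods discussed above, the following sketch shows a standard log-odds evidence-grid cell update. The sensor-model probabilities are invented for illustration and are not the values used in our system; the point is that repeated, independently noisy readings accumulate into high confidence, which is exactly the statistical filtering that surface-patch models lack.

```python
import math

def logodds(p):
    """Convert a probability to log-odds form for additive fusion."""
    return math.log(p / (1.0 - p))

# Assumed sensor model: probability a reading says "occupied" given the cell
# is occupied (P_HIT) or free (P_MISS). Illustrative values only.
P_HIT, P_MISS = 0.7, 0.4

def update_cell(cell_logodds, hit):
    """Fuse one range reading into a cell; evidence simply adds in log-odds."""
    return cell_logodds + logodds(P_HIT if hit else P_MISS)

def probability(cell_logodds):
    """Recover the occupancy probability from the accumulated log-odds."""
    return 1.0 - 1.0 / (1.0 + math.exp(cell_logodds))

c = 0.0                       # prior: unknown (p = 0.5)
for _ in range(3):            # three consistent "occupied" readings
    c = update_cell(c, hit=True)
print(round(probability(c), 3))   # 0.927
```

Each cell update is a single addition, which is why grid methods scale with computer power: the cost is per-reading and per-cell, independent of scene complexity.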
D.2.4 Perception
Designing and implementing a developmental program is systematic and clearly understandable using
mathematical tools. Designing a perception program and its representation in a task-specific way using a
traditional approach, however, is typically very complex, ad hoc and labor intensive, and the resulting
system tends to be brittle. The design and implementation of a developmental program are of course not
easy. However, the new developmental approach is significantly more tractable than the traditional
approaches to programming a perception machine. Further, it is applicable to uncontrolled real-world
environments, and it is the only approach capable of this.
Due to its cross-environment capability, SAIL has demonstrated vision-guided autonomous navigation
capability in both complex outdoor and indoor environments. The Hierarchical Discriminant Regression
(HDR) engine played a central role in this success (Hwang & Weng 2000). Although ALVINN at CMU
(Pomerleau 1989) can in principle be applied indoors, the local-minima and memory-loss problems of its
artificial neural network make it very difficult to apply in complex indoor scenes.
SAIL has successfully developed real-time, integrated multimodal (vision, audition, touch, keyboard and
via wireless network) human-robot interaction capability, to allow a human operator to enter different
degrees of intervention seamlessly. A basic reason for achieving this extremely challenging capability is
that the SAIL robot is developed to associate over tens of thousands of multi-modal contexts in real-time
in a grounded fashion, which is another central idea of AMD. Some behavior-based robots, such as Cog
and Kismet at MIT, interact with humans online, but they are hand-programmed off-line. They
cannot interact with humans while learning.
Perception-based action chaining develops complex perception-action sequences (behaviors)
from simple ones through real-time, online human-robot interactions, all done in the same
continuous operational mode by SAIL. This capability is harder than it appears. The robot must infer
context in a high-dimensional perception vector space. It generates new internal representations and
uses them for later context prediction, which is central for scaling up in AMD. David Touretzky’s
skinnerbot (Touretzky & Saksida 1999) does action chaining, but it does so through preprogrammed
symbols, and thus the robot is not applicable to unknown environments.
D.2.5 Face recognition subsystem
Detecting and tracking human faces plays a crucial role in automating applications such as video
surveillance. According to the tracking features used, various approaches to face tracking (Hsu & Jain
2002a) can be categorized into three types: (i) the methods using low-level features such as facial
landmark points, (ii) the 2D template-based methods, and (iii) those using high-level models such as 2D
or 3D (deformable) models. Most tracking approaches focus on a single moving subject. Few methods
directly deal with tracking multiple faces in videos. Although it is straightforward to extend the task of
face tracking for a single subject to that for multiple subjects (e.g., finding the second-largest blob for the
second target), it is still challenging to track multiple interacting human faces across a wide range of
head poses, occlusions, backgrounds, and lighting conditions. We propose a new method to detect (Hsu et
al. 2002) and track (Hsu & Jain 2001a) faces based on the fusion of information derived from motion of
faces and bodies, skin-tone color, and locations of facial components. Tracked faces and their facial
components are used for face identification/recognition. The main challenge in face recognition is to be
able to deal with the high degree of variability in human face images, especially with variations in head
pose, illumination, and expression. We propose a pose-invariant approach (Hsu & Jain 2001) to face
recognition based on a 3D face model, and a semantic approach (Hsu & Jain 2002b) based on
semantic graph matching.
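The cue-fusion idea behind the proposed detector can be sketched abstractly. The cue names, weights, and threshold below are invented for illustration and are not the published method's values (Hsu et al. 2002 combines the evidence in a more sophisticated way); the sketch only shows why fusing several weak cues beats relying on any one of them.

```python
# Hypothetical per-cue weights for fusing face-detection evidence.
CUE_WEIGHTS = {"motion": 0.3, "skin_tone": 0.4, "facial_components": 0.3}

def face_score(cues):
    """Combine per-cue confidences in [0, 1] into a single detection score."""
    return sum(CUE_WEIGHTS[name] * cues.get(name, 0.0) for name in CUE_WEIGHTS)

def is_face(cues, threshold=0.5):
    """Accept a candidate region when the fused evidence clears a threshold."""
    return face_score(cues) >= threshold

# A region with strong skin tone and visible facial components but little
# motion (e.g., a still subject) still passes, because no single cue decides:
candidate = {"motion": 0.1, "skin_tone": 0.9, "facial_components": 0.8}
print(is_face(candidate))   # True: 0.03 + 0.36 + 0.24 = 0.63 >= 0.5
```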
D.2.6 Planner
Given a goal (consisting of a fixed destination and a fixed set of performance criteria), the currently
popular MDP-based planning methods for sequential decision making (see, e.g., Kaelbling et al. 1996)
require much exploration throughout the world model. When the destination is changed or the set of
performance criteria (e.g., the weights between safety and distance) is modified, the past planning
knowledge is not usable: a new, time-consuming iterative training procedure through the site model
must be redone. Online re-planning, i.e., modifying a plan after receiving new information about the
task, has been difficult.
Further, hierarchical planning is necessary for planning at different abstraction levels. For example, to
reach a target location, a robot needs to plan at a coarse level, such as reaching several landmarks.
Moving from one landmark to the next requires planning at a finer level: how to make a turn, how to
go straight, etc. Our perception-based action chaining work is designed precisely for such applications.
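The re-planning cost described above can be made concrete with a toy value-iteration example (a 1-D corridor; the grid size, cost, and sweep count are all invented for illustration). The computed value function is tied to one fixed goal, so changing the goal invalidates every value and forces a complete recomputation, which is exactly the weakness of MDP-style planners noted in the text.

```python
def value_iteration(n, goal, step_cost=1.0, sweeps=50):
    """Shortest cost-to-goal values on a 1-D corridor of n cells."""
    V = [0.0 if s == goal else float("inf") for s in range(n)]
    for _ in range(sweeps):                  # iterate until values converge
        for s in range(n):
            if s == goal:
                continue
            neighbors = [s2 for s2 in (s - 1, s + 1) if 0 <= s2 < n]
            V[s] = min(step_cost + V[s2] for s2 in neighbors)
    return V

V = value_iteration(10, goal=9)
print(V[0])    # 9.0: nine unit steps from cell 0 to the goal at cell 9

# Moving the goal invalidates the entire value function: every cell must be
# recomputed from scratch, even though the corridor itself never changed.
V2 = value_iteration(10, goal=0)
print(V2[9])   # 9.0 again, but only after a full re-sweep
```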
D.2.7 Servo controller
Human/machine cooperation has been studied for many years in robotics. In the past, the research
focused mainly on human knowledge acquisition and representation (Liu & Asada 1993, Tso & Liu 1993).
It includes (i) robot programming, which deals with the issue of how to pass a human command to a
robotic system (Mizoguchi et al. 1996, Kang & Ikeuchi 1994). This process usually happens off-line, i.e.,
before task execution, so humans have no real role during task execution; and (ii) teleoperation, in
which a human operator can pass action commands to a robotic system on-line (Kosugen et al. 1995). In
both cases, the human operator has a directing role and the robot is a slave system which executes
the human program/command received either off-line or on-line. Recently, there has been ongoing research on
involving human beings in the autonomous control process, such as human/robot coordinated control (Al-
Jarah & Zheng 1996, Yamamoto et al. 1996). The human, however, is introduced to the system in a
role similar to a robot's. In recent years, several new schemes have been developed for integrated human/
machine systems (Xi et al. 1996, Xi et al. 1999). In particular, the function-based sharing control scheme
(Brady et al. 1998) has been developed and successfully implemented in DOE's Modified Light Duty
Utility Arm (MLDUA), which has been tested in nuclear waste retrieval operations in Gunite Tanks at
Oak Ridge National Laboratory in 1997. In addition, the development of Internet technology has further
provided a convenient and efficient communication means for integrated human/machine systems. It
further enables humans and machines, at different locations, to cooperatively control operations (Xi &
Tarn 1999). The theoretical issues related to integrated human/machine systems have also been studied
(Xi & Tarn 1999).
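To convey the flavor of function-based sharing control, here is a minimal sketch. The function set and assignments below are invented for illustration and do not reproduce the MLDUA implementation; the idea shown is simply that a task is decomposed into functions, each assigned to either the human or the autonomous controller, and the per-function commands are merged.

```python
# Hypothetical decomposition of a manipulator task into shared functions:
# the human guides the in-plane motion while the controller handles depth
# and grasping. Assignments are illustrative, not from the MLDUA system.
ASSIGNMENT = {"x": "human", "y": "human", "z": "auto", "grip": "auto"}

def shared_command(human_cmd, auto_cmd):
    """Merge per-function commands according to the sharing assignment."""
    return {f: (human_cmd if who == "human" else auto_cmd)[f]
            for f, who in ASSIGNMENT.items()}

human = {"x": 0.2, "y": -0.1, "z": 0.0, "grip": 0.0}
auto = {"x": 0.0, "y": 0.0, "z": -0.3, "grip": 1.0}
print(shared_command(human, auto))
# {'x': 0.2, 'y': -0.1, 'z': -0.3, 'grip': 1.0}
```

Because each function has exactly one owner, the merged command is always well defined, which is one reason sharing by function (rather than blending whole commands) yields predictable cooperative behavior.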
E Statement of Work
E.1 Integration
The integration includes the following work:
1. Design and implement the API for plug-and-play.
2. Work with Maryland and CMU for integration of 3-D maps from ladars and from trinocular
stereo.
3. Integrate the combined 3-D map into the CACI software system.
4. The work proposed here will provide environmental perception of humans using a model-based
approach to perceive both static and moving subjects. Human subjects will be handled
exclusively by the face detection and recognition system because of the highly stringent
requirements for correctly recognizing humans. Other, more general objects will be recognized
by the perception level.
5. Work with the planning group, the perception group, the face recognition subsystem group and
the servo controller group to design and implement state integrator.
6. Work with the planning group, the perception group, the face recognition subsystem group and
the servo controller group to design and implement action integrator.
7. Design and implement the integration of the entire CACI software system.
E.2 Evaluation
The evaluation work includes:
1. Design of the specification of the evaluation criteria.
2. Design of the test specification for the performance of component technology.
3. Coordination of the tests for component technology.
4. Coordination of the test for overall system.
5. Collection of test data and report for the test.
6. Tools for robot self-monitoring to support the systematic assessment of perception and behavior
performance in terms of quantitative metrics.
E.3 3D map generation from ladars
Ground robots/vehicles generally have difficulty perceiving negative obstacles at a distance. In this
work, the UMD team proposes to develop 3-D map generation software using overhead as well as ground
laser range finders (ladars). Ladar sensors can capture both geometric and material information about the
scene/environment, and they operate day and night and in all weather conditions. The generated 3-D maps
can then serve as a site model of the environment through which cross-platform, cross-task, and cross-
environment fusion can be accomplished relatively easily. Using the site model, much prior site
information can be effectively incorporated into robot missions.
3-D map generation and exploitation is a key enabling technology in the proposed effort. The UMD team
proposes to develop the following technologies under this project:
1. Develop multi-look ladar sensor control to generate higher-resolution images of the field of
interest. A graphical user interface with a ladar simulator will be developed and demonstrated for
multi-look sensor control.
2. Develop real-time multi-path ladar fusion and multi-layer 3-D map generation algorithms.
3. Develop 3-D map-supported dynamic robot positioning and video/ladar image registration
algorithms.
4. Support system integration and technology demonstration.
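The multi-layer map generation of item 2 can be sketched as binning ladar returns into grid cells that keep separate layers. The cell size and the choice of a "floor"/"obstacle" layer pair below are illustrative assumptions, not the planned design; the sketch only shows how one pass over the point cloud yields per-cell layers.

```python
# Assumed grid resolution in meters per cell (illustrative value).
CELL = 0.5

def build_layers(points):
    """Bin (x, y, z) ladar returns into cells with floor/obstacle layers."""
    layers = {}   # (ix, iy) -> {"floor": min z seen, "obstacle": max z seen}
    for x, y, z in points:
        key = (int(x // CELL), int(y // CELL))
        cell = layers.setdefault(key, {"floor": z, "obstacle": z})
        cell["floor"] = min(cell["floor"], z)        # lowest return: ground
        cell["obstacle"] = max(cell["obstacle"], z)  # highest return: obstacle
    return layers

# A ground return at z = 0.0 and a chair seat at z = 0.45 in the same cell:
pts = [(0.2, 0.2, 0.0), (0.3, 0.1, 0.45)]
m = build_layers(pts)
print(m[(0, 0)])   # {'floor': 0.0, 'obstacle': 0.45}
```

A real system would add layers such as reflectivity and apply noise filtering, but the same single-pass binning structure applies.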
E.4 3D map generation from trinocular stereo
Using precision images collected in May 2001, we will continue to explore ideas to further improve and
accelerate our 3D mapping program, and work towards applications, including those requiring substantial
user interface. We've described a dozen pending developments in our most recent DARPA MARS report.
In parallel we will develop a camera head for collecting a greater variety of new test data. The head will
have trinocular cameras, a fourth training camera, a 100-line laser textured light generator, a 360 degree
pan mechanism and a high-end controlling laptop. It will be mounted on borrowed robots, and in the
second year be used to demonstrate autonomous and supervised navigation.
The 3-D maps from the ladars will be integrated with the counterpart from trinocular stereo, to take the
best possible result from both sensing methods.
The above two parts of the work will result in an integrated sensory-based algorithm that supports
path-referenced perception and behavior. It will provide a perception-based representation of the path at
various levels of abstraction by combining the 3-D map with the output of the perception level.
E.5 Perception
Perception involves vision, audition and touch, as well as the generation of perception-based
behaviors for all the effectors of the platform. We have demonstrated a series of perceptual capabilities
on the SAIL robot while allowing a rich set of modes of operator intervention. The proposed work includes
development of the following:
1. Tools that exploit operator intervention to enable the robot to fully “experience” its operating
environment even when the human operator intervenes. These tools will also allow operator
intervention at a number of different levels, including behavior selection and perception selection.
Higher-level planning intervention will be handled by the planner. These capabilities have
been demonstrated in our current MARS project; here we will make the tools more integrated
and more user-friendly.
2. Tools for machine learning and adaptation. They support behavior selection, behavior parameter
tuning, and perceptual classification. With the evaluation work discussed above, we will
quantitatively assess and validate specific techniques in specific system roles. We have already
provided some performance measurement of learning and adaptation techniques. In the proposed
work, we will make them applicable to multiple platforms.
29
30. DARPA MARS Robotic Vision 2020 program proposal: CACI
3. Software components for interaction between robots and humans. They enable interaction with
human operators as well as with other robots located in the robot’s physical environment. We will
also provide a high-level command interface to view a group of autonomous robots, in
cooperation with the planner and the controller levels.
E.6 Face detection and recognition
To fulfill the work of dynamic face identification, we plan to implement the face recognition subsystem in
three major modules: (i) face detection and tracking, (ii) alignment of face models, and (iii) face
matching. The detection and tracking module first finds the locations of faces and facial components and
the locations of human bodies and human gaits in a color image. Then it predicts the motions of faces and
bodies to reduce the search regions for the detected faces. The face alignment module includes the
estimation of head pose, the 2D projection (a face graph) of a generic 3D face model, and the alignment
of a face graph and the input face image. The estimation of head pose is based on the arrangement of
facial components inside a face region. The 2D projected face graph is generated by rotating a 3D face
model to the estimated view. The alignment of the face graph is based on contour deformation. In the
matching module, an aligned face graph is first transformed into a feature space using facial (Fourier)
descriptors, and then is compared with the descriptors of template graphs obtained from the face database.
The face comparison is based on derived component weights that take into account the distinctiveness
and visibility of human faces.
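The Fourier-descriptor step of the matching module can be sketched generically. The following is a minimal illustration using a plain DFT on an arbitrary closed contour (no face graphs; the contour and coefficient count are invented): dropping the zeroth coefficient, which encodes mean position, makes the descriptor magnitudes invariant to translation.

```python
import cmath

def fourier_descriptors(contour, k=3):
    """First k descriptor magnitudes of a closed contour [(x, y), ...]."""
    z = [complex(x, y) for x, y in contour]  # treat each point as x + iy
    n = len(z)
    coeffs = []
    for u in range(k + 1):
        c = sum(z[m] * cmath.exp(-2j * cmath.pi * u * m / n) for m in range(n))
        coeffs.append(c / n)
    # Coefficient 0 is the contour's mean position; dropping it removes all
    # dependence on where the contour sits in the image plane.
    return [abs(c) for c in coeffs[1:]]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(x + 5, y + 5) for x, y in square]
d1 = fourier_descriptors(square)
d2 = fourier_descriptors(shifted)
print(all(abs(a - b) < 1e-9 for a, b in zip(d1, d2)))   # True
```

A matcher then compares such descriptor vectors (with per-component weights, as described above) rather than raw pixel contours.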
E.7 Planner
We will develop the planner as an integration of the planning capability of our developmental method that
we have demonstrated with an interface that allows a human overseer to supply updated performance
criteria and a new goal to the planner for re-planning. The work includes:
1. Design the planner to integrate the SAIL planner with other existing technologies such as the
MDP based planner.
2. Implement the planner based on our prior work for DARPA.
3. Evaluate the strengths and weaknesses of the proposed planner in allowing more intimate,
real-time interactions with the operator.
4. Test the planner with real-world planning problems to study its performance.
E.8 Servo Controller
The work for the servo controller includes:
1. Developing a human/machine cooperative paradigm to optimally map a task to
heterogeneous human and machine functions.
2. Designing a perceptive action reference frame for modeling an integrated human/machine
system.
3. Developing a heterogeneous function-based cooperative scheme to combine autonomous
planning/control with human reasoning/commands in a compatible and complementary
manner.
4. Developing a user-friendly human/robot interface to implement the human/machine
cooperative planning and control methods.
F.2 Detailed Individual Effort Descriptions
Integration:
• Specification: Design the specification for the performance measurement of the entire system.
Translate the system-wise specification to component specifications.
• Improvement: Perform several design iterations, according to the limits, strengths, and costs of
the component technologies.
• Tests: Perform preliminary tests with component technologies to investigate the overall potential
for improvement.
• Demo: Finalize the integration scheme and make the demo.
Evaluation:
• Criteria: Investigate the criteria that the defense requires for real-world applications.
• Refinement: Refine the criteria from testing of several component technologies.
• Data: Test the performance of the components and of the overall system, and collect rigorous
performance evaluation data.
• Demo: Make integrated tests with evaluation and make the final demo.
3Dladars:
• Ladar S: Develop a ladar simulator for algorithm development and performance evaluation.
Develop algorithms to mosaic/fuse ladar images acquired from different locations and/or with
different sensor configurations.
• MapGen: Multi-layer 3D map generation (floor, obstacles, reflectivity, etc.). Object
classification based on intensity, geometry, and possibly video images.
• Posit: Autonomous robot-to-site-model positioning algorithm. Autonomous video/ladar image
registration algorithm.
• Integ: Support integration of the 3-D maps for cross-platform, cross-task, and cross-environment
operations.
• Demo: Participate in the final demonstration and software demonstration.
3Dstereo:
Year 1: Mapping program development. Year 2: Applications demonstration development.
• Prior: Prior Probe, Two Thresholds, Imaging by Ray. Interactive Visualization, Camera Head
Component Selection
• Vector: Vector Coding, Camera Head Fixture Design and Assembly. Code Integration, Mapping
Demo
• Local: FFT Localization, Camera Interface to Robot 1. Field Image Tuning, Robot 1 Camera
Head Data Collection Runs
• Navig: Navigation Demo Code, Robot 2 Interface, Controlled Run Testing
• Demo: Autonomous Runs, Code Cleanup, Final Report, Navigation Demo
Perception:
• Experience: Develop tools that exploit operator intervention to enable the robot to fully
“experience” its operating environment even when the human operator intervenes.
• Adaptation: Develop tools for machine learning and adaptation.
• Interaction: Software components for interaction between robots and humans.
• Integration: Integrate tools for machine “experience” intervention, machine learning and human
robot interaction.
• Demo: Modification, improvement and demonstration.
FaceSubsys:
• RealT: Speedup face detection for real-time applications
• Desg: Design the face tracking module using motion prediction.
• Inteff: Integration of face detection and face tracking. Construct face models
• Recog: Build face recognition module
• Inte: System integration
• Demo: System integration/demonstration
Planner:
• Design: Design the planner to integrate the SAIL planner with other existing technologies.
• Implement: Implement the planner based on our prior work for DARPA.
• Evaluate: Evaluate the strengths and weaknesses of the proposed planner.
• Improve: Test and improve the planner with real-world planning programs to study the
performance.
• Demo: Modification, improvement and demonstration.
Servo:
• Paradigm: Developing a human/machine cooperative paradigm to optimally map a task to
heterogeneous human and machine functions.
• Frame: Designing a perceptive action reference frame for modeling an integrated human/machine
system.
• Hetero: Developing a heterogeneous function-based cooperative scheme to combine autonomous
planning/control with human reasoning/command in a compatible and complementary manner.
• Interface: Developing a user-friendly human/robot interface to implement the human/machine
cooperative planning and control methods.
• Demo: System integration and demonstration.
G Deliverables Description
Integration:
• Design documentation
• Test data on our platforms, including Nomad 2000, SAIL and Dav.
• Integrated CACI software. It will be the first cross-platform, cross-task integrated software
system for autonomous robots.
Evaluation:
• Updated literature survey on performance evaluation.
• Documentation of the evaluation criteria.
• Complete evaluation data. Not just what a robot can do but also what it cannot do now and why.
3Dladars:
• VC++ based ladar simulator and multi-look automatic target recognition software
• VC++ based 3-D map generation and terrain classification software. VC++ based dynamic robot
positioning algorithm
3Dstereo:
• Year 1: High quality stereoscopic 3D spatial grid mapping code able to process 10,000 range
values from trinocular glimpses in under 10 seconds at 2,000 MIPS.
• Year 2: Autonomous and interactive robot navigation code using 3D grid maps able to drive a
robot at least one meter per five seconds at 5,000 MIPS.
Perception:
• Perceptual learning software for vision, audition, touch and behaviors.
• Test data for the software for vision, audition and touch.
• Documentation about the software and sample test data.
FaceSubsys:
• C++/C based face detection software
• C++/C based face tracking software
• C++/C based face modeling software
• C++/C based face matching software
Planner:
• Perception-based planner software for vision, audition and touch.
• Test data for the planner software.
• Documentation about the planner software.
Servo:
• Methodologies to optimally map a task to heterogeneous human and machine functions;
• Methods and algorithm to compute a perceptive action reference frame for modeling an
integrated human/machine system;
• Heterogeneous function-based cooperative schemes to combine autonomous
planning/control with human reasoning/command in a compatible and complementary
manner.
• Software for human/robot interfaces and related documentation
Patent: “Developmental Learning Machine and Method,” US patent No. 6,353,814, filed Oct. 7,
1998 and granted March 5, 2002. Inventor: J. Weng; Assignee: MSU.
H Technology Transition and Technology Transfer Targets
and Plans
We plan commercial development of the navigation code into highly autonomous and reliable industrial
products for factory transport, floor cleaning and security patrol. The enterprise would welcome the
opportunity to apply the techniques to DOD applications, should contracts materialize. Hans Moravec is
the point of contact for this commercialization.
We plan to commercialize the Dav humanoid robot platform. The intended users are research
institutions, universities, and industrial plants whose environments are not suited for humans to stay in
for long periods. John Weng is the contact person for this commercialization.
We also plan to commercialize the CACI software for all types of autonomous robots. The plug-and-play
feature is expected to attract many robot users. The cross-task capability of CACI will fundamentally
change the way software is written for autonomous robots. John Weng is the contact person for this
commercialization.