2. 2 NOVEMBER 2012
an expanding set of analytic services to help them gain
critical mission insights.
The Cloud Analytics Reference Architecture, which
is being adapted to the larger business and govern-
ment communities, removes the traditional con-
straints by bringing together innovations in two areas
of current technology. First, it uses the power of the
cloud to put an organization’s entire storehouse of
data into a common pool, or “data lake,” making all
of it easily accessible for the first time. It then uses
sophisticated computer analytics, such as machine
learning and natural language processing, to help ex-
tract the kind of knowledge and insight that creates
value, guides strategy, and drives business and mis-
sion success. Although the Cloud Analytics Refer-
ence Architecture builds upon current techniques, it is
not an incremental step forward. It is an entirely new
approach—one specifically designed for our new age
of data.
One way to understand how the Reference Archi-
tecture works is to view it in layers (see Figure 1). Its
foundation is the cloud computing and network infra-
structure, which supports the methods by which data is
managed—most notably, the data lake. The data lake, in
turn, supports a two-step process to analyze the data.
In the first step, special tools known as pre-analytics
filter information from the data lake, and give it an un-
derlying organization. That sets the stage for computer
analytics—in the next layer up—to search for valuable
knowledge. These elements support the final phase, the
visualization and interaction, where the human insights
and action take place.
THE POWER OF THE CLOUD ANALYTICS
REFERENCE ARCHITECTURE
The Reference Architecture opens up the enormous
potential of big data by allowing us to search for insight
in new ways. It enables us to look for overarching pat-
terns, and ask intuitive questions of all the data, rather
than limiting us to narrowly defined queries within
data sets. The Reference Architecture allows comput-
ers to take over much of the work humans are doing
now—freeing people to focus on the search for insight.
It makes it possible for non-computer experts, for the
first time, to frame the questions, look for patterns, and
follow hunches.
This is not some kind of magical solution—far from
it. The Reference Architecture is simply a new way of
looking at data, but one that revolutionizes our ability
to gain knowledge and insight. With conventional tech-
niques, the data and analytics are locked into stovepipes,
or silos. We can explore only limited amounts of data
at any one time—and then only with predetermined
questions that have already been built in. The Reference
Architecture removes these constraints by eliminating
the silos, and consolidating all the information in the
data lake. What results is not chaotic or overwhelming.
Rather, the rich diversity of information in the data lake
becomes a powerful force. The data lake is more than
a means of storage; it is a medium expressly designed
to foster connections in data. And the Reference
Architecture explores those connections to search for
valuable correlations and patterns. This actually reduces
the complexity of big data, making it manageable and
useful, and creating efficiencies.
Figure 1. Primary Elements of the Cloud Analytics Reference Architecture
Instead of using data to ask “canned” questions that
test what we may already know, the Reference Architec-
ture uses data to discover new possibilities—solutions
and answers that we have not even considered. The
power of the Reference Architecture is that it constant-
ly evolves and adapts as we search for insight, taking us
beyond the limits of our imagination.
WHAT THE CLOUD ANALYTICS REFERENCE
ARCHITECTURE DOES
The Cloud Analytics Reference Architecture re-
moves the constraints created by data silos. While
the rigid structures used in conventional techniques
provide ease of storage, they carry severe disadvan-
tages. They give us an artificial view of the world based
on data models, rather than on reality and meaning. It
is akin to reading a map through a tube—we can never
immerse ourselves in the diversity of big data, and in-
stead make decisions based on limited and constrained
information. Much of data science in the last ten years
has been devoted to improving access to the silos and
building bridges between them. But that does not solve
the underlying problem—that the data is regimented
and locked in.
Eliminating the need for silos gives us access to all
the data at once—including data from multiple outside
sources. Users no longer need to move from database
to database, pulling out specific information. And, be-
cause there are no data silos, there is no need to build
complex bridges between them.
If we want to know, for example, which parts of
our computer network are most vulnerable to attack in
the next six hours, we can take into account a wide va-
riety of data sources at the same time. We might look at
whether today is a holiday in certain foreign countries,
which means that the young hackers known as “script
kiddies” are more likely to be out of school and so have
time on their hands to launch an attack. If we determine
that a particular group is targeting us, we might
examine how its members are connected, asking whether
they had a common professor at a university, and if
so, what techniques he or she taught. The Reference
Architecture gives us the ability to ask a full suite of
questions rather than a pre-selected few.
The Cloud Analytics Reference Architecture al-
lows us to experiment more with the data. The Ref-
erence Architecture’s flexibility provides a new kind of
freedom—to follow hunches wherever they may lead,
to quickly shift direction to pursue promising avenues
of inquiry, to easily factor in new knowledge and in-
sights as they arise.
With the conventional approach, it is difficult to add
or switch variables that are not already part of a dataset
or database. That typically requires tearing apart and
rebuilding both the structure that the data is in and the
computer analytics that are custom-designed to handle
specific lines of inquiry. The process is expensive and
time-consuming, and consequently we tend to focus
instead on doing better analysis with the limited tools
available on our narrow slices of data.
With the Reference Architecture, we might decide,
in the network security example above, to add new vari-
ables to the mix, such as the current propagation speed
of commonly used viruses and botnets. Even if those
variables come from outside data sources, we do not
have to tear down and rebuild our data structures and
analytics to consider them—they seamlessly become
part of our inquiry.
The Cloud Analytics Reference Architecture al-
lows us to ask more intuitive questions. With the
conventional approach, we do not really ask questions
of the data—we create hypotheses, and then test the
data to see whether we are right. In order to pose these
hypotheses, we have to guess in advance what the an-
swers might be, often a difficult proposition.
To determine where our network is most vulnerable,
for example, we would need to start with a hypothe-
sis—say, that any attacks will occur through outdated
operating systems. That hypothesis, accurate or not,
would drive our initial line of inquiry.
With the conventional approach, we also need to
be familiar with the data we are considering, includ-
ing where it is (in what specific datasets or databases),
what format it is in, and even to a large extent what the
data itself contains. That level of knowledge might be
achievable when we are working with a limited number
of datasets or databases, but not with the vast amounts
of information now becoming available to us. We often
have to put aside, or assume away, factors that we might
actually believe are critical.
Add to these handicaps our inability to go beyond
the pre-selected questions or easily change variables,
and it becomes an impossible task. And so we never
try it. We end up settling for marginal questions, and
marginal answers.
With the Reference Architecture, however, we can
structure an inquiry around a single, intuitive, big-pic-
ture question: What part of our computer network is
most vulnerable to attack in the next six hours? We do
not need to know much about any of the data sources
we are consulting—the data will point us to the answer.
The Cloud Analytics Reference Architecture al-
lows us to more readily look for unexpected pat-
terns—it lets the data talk to us, so to speak. Even
if we could ask all the questions we want, the way we
want, there is simply too much data to formulate every
question that might be important. Our questions can
also be limited by our biases about the issues we are
researching. We may not know what areas to explore,
or what we should be looking at. To get the full picture,
and help guide our inquiries, we need to see what pat-
terns naturally emerge in the data.
While we can look for patterns with the convention-
al approach, there are two significant drawbacks. We
can only do such searches within our narrowly defined
datasets and databases, rather than with the entire range
of data available to us. We also must first guess what
those specific patterns might be, and then test them out
with hypotheses. But what about the patterns we do not
even know might exist? How do we get to the hidden
knowledge that often proves so valuable?
Because there are no limiting data and analytic struc-
tures in the Reference Architecture, we do not need to
pose hypotheses, and our search for patterns encom-
passes the entire range of data. For example, the U.S.
military is now using the Reference Architecture to
search for patterns in war zone intelligence data, to map
out convoy routes least likely to encounter improvised
explosive devices (IEDs).
The Cloud Analytics Reference Architecture
allows computers to take over much of the work
humans are doing now—enabling people to focus
on creating value. Conventional methods require that
people play a large role in processing the data—in-
cluding selecting samples to be analyzed, creating data
structures, posing hypotheses, and sifting through and
refining results. That intense level of effort may be
workable for small amounts of data, but no organiza-
tion has the personnel or resources to use that method
to process big data.
The Cloud Analytics Reference Architecture solves
this problem by giving a great deal of that work to the
computers, particularly tasks that are repetitive and
computationally intensive. This reduces human error,
and substantially speeds up the work.
When we use the Reference Architecture to pose
more intuitive questions, or to find patterns, we are es-
sentially asking the computer to take us as close as it can
to finding the answers we want. It is then up to us, using
our cognitive skills, to find meaning in those answers.
By separating out what the computer can do—the
analytics—and what only people can do—the actual
analysis—the Cloud Analytics Reference Architecture
greatly eases the human workload. It is a division of la-
bor that frees subject-matter experts to look at the larg-
er picture. At the same time, the Reference Architecture
rapidly highlights areas that analysts should not waste
their time exploring—enabling them to focus their time
and attention in the right direction.
For example, agencies that investigate consumer
complaints against financial institutions often do not
know which individual complaints are indicative of
broader patterns of consumer abuse, and so deserve
the most attention. Investigators rarely have the time to
sort through the vast array of sources that might pro-
vide valuable clues, such as blogs and social media sites
where consumers commonly air their grievances. With
a data lake that included all such available information,
the Reference Architecture’s analytics could quickly
identify patterns, such as consumer abuse affecting
large numbers of people. Investigators could then fo-
cus their resources on the most serious cases.
The Cloud Analytics Reference Architecture’s
analysis capability enables subject matter experts
to explore the data. If we are to drive business and
mission success, we must give direct access to the data
to the analysts, or subject matter experts, who under-
stand what that success might mean. However, be-
cause of the high level of computer expertise needed
to design custom data storage structures and analytics,
much of the analysis today is conducted by computer
scientists, computer engineers, and mathematicians act-
ing as agents for the subject matter experts. They are
typically the ones who translate the overall goals of the
business and government analysts into the language of
the machine. Whenever there is a middleman in any
field, things tend to get lost in the translation, and data
analysis is no exception. Here, it leads to a disconnect
between the people who need knowledge and insight
(the subject matter experts) and the data itself. It also
substantially slows the process.
In the top layers of the Reference Architecture, the
middleman syndrome goes away. The ability to ask in-
tuitive questions, and to look for patterns, provides the
analysts with direct access to the data. That gives them
the flexibility they need to experiment and explore,
and allows the system to reach maximum velocity. The
computer scientists, computer engineers, and mathematicians
still play a key role, but they are no longer the
ones who drive the inquiries into the data.
For example, investigators who suspect fraud may
be occurring are often hampered by the need to go
through computer experts to query the data. Their re-
quest may be one of many, and by the time they get
back the information they need to act, the criminals
have often long since committed the fraud and dis-
appeared. With the Reference Architecture, however,
investigators could query the data themselves, quickly
pinpoint the fraud, and take action in time to stop the
activity.
THE FOUNDATION OF THE REFERENCE
ARCHITECTURE: A NEW APPROACH TO
INFRASTRUCTURE
The Reference Architecture takes advantage of
the immense storage ability of the cloud, though in a
different way than in the past. With the conventional
approach, cloud storage does not eliminate the data si-
los—it simply makes them fatter. Organizations must
continually reinvest in infrastructure as analytic needs
change. Building bridges between silos, for example,
typically requires reconfiguring and even expanding the
infrastructure.
The Reference Architecture, by contrast, has an in-
herent flexibility that enables organizations to pursue
new analytical approaches with few if any changes to
the underlying infrastructure. One reason is that the
data lake is easily expandable. Because it stores information
so efficiently, it can accommodate both the
natural growth of an organization’s data and the
addition of data from multiple outside sources. At the
same time, the Reference Architecture replaces the cur-
rent, custom-built analytics with a new generation of
tools that are highly reusable for almost any number
of inquiries. With the Reference Architecture, organi-
zations do not need to rebuild infrastructure as their
levels of data and analytics increase. An organization’s
initial investment in infrastructure is therefore both en-
during and cost-effective.
HOW THE DATA LAKE WORKS
With the conventional approach, the computer is
able to locate the information it needs because it knows
precisely where it is—in one database or another. The
information is identified largely by its location. With the
data lake, information is still identified for use, but now
in a way other than by location. Specific pieces of infor-
mation are identified by “tags”—details that have been
embedded in them for sorting and identification.
For example, an investor’s portfolio balance (the
data) is generally stored with identifying information
such as the name of the investor, the account number,
one or more dates, the location of the account, the
types of investments, the country the investor lives in,
and so on. This “metadata” is what gets tagged, and is
located by the computer during inquiries.
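The idea of finding information by tag rather than by location can be illustrated with a minimal Python sketch. It is not the Reference Architecture's implementation; the DataLake class, its ingest and find methods, and the sample investors are all hypothetical names invented for illustration.

```python
from collections import defaultdict


class DataLake:
    """Toy in-memory pool where records are found by tag, not by location."""

    def __init__(self):
        self._records = []              # every record lives in one common pool
        self._index = defaultdict(set)  # (tag name, tag value) -> record ids

    def ingest(self, data, **tags):
        """Store one piece of data together with its identifying tags."""
        record_id = len(self._records)
        self._records.append({"data": data, "tags": dict(tags)})
        for name, value in tags.items():
            self._index[(name, value)].add(record_id)
        return record_id

    def find(self, **tags):
        """Return every record whose tags match all of the given values."""
        if not tags:
            return list(self._records)
        id_sets = [self._index.get(pair, set()) for pair in tags.items()]
        matching = set.intersection(*id_sets)
        return [self._records[i] for i in sorted(matching)]


# Made-up portfolio balances (the data) with their metadata (the tags).
lake = DataLake()
lake.ingest(250_000.00, investor="A. Example", country="FR", investment_type="bonds")
lake.ingest(90_500.00, investor="B. Example", country="FR", investment_type="equities")
lake.ingest(12_000.00, investor="C. Example", country="JP", investment_type="bonds")

french_bonds = lake.find(country="FR", investment_type="bonds")
```

Nothing in the lookup depends on which table or database a balance came from; any combination of tags can serve as the entry point.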
The process of tagging information is not new—it
is commonly done within specific datasets and databas-
es. What is new is using the technique to eliminate the
need for datasets and databases altogether.
The tags themselves are also a way of gaining
knowledge from the data. In the example above, they
might allow us to look for, say, connections between
investors’ countries and their types of investments. The
basic data—the portfolio balance—might not even be
part of the inquiry. Such connections can be made with
the conventional approach, but only if the custom-built
databases and computer analytics have already been de-
signed to take them into consideration. With the data
lake, all the data, metadata, and identifying tags are
available for any inquiry or search for patterns. And such
inquiries or searches can pivot on any one of those
pieces of information. This greatly expands the usability
of the data available to an organization. It actually
makes big data even bigger.
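As a rough illustration of an inquiry that pivots on tags alone, the Python sketch below counts how often each investor country appears alongside each investment type, without ever reading the portfolio balance. The record layout, the pivot function, and the sample values are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical records: "data" is the portfolio balance, "tags" is the metadata.
records = [
    {"data": 250_000.00, "tags": {"country": "FR", "investment_type": "bonds"}},
    {"data": 90_500.00, "tags": {"country": "FR", "investment_type": "equities"}},
    {"data": 41_000.00, "tags": {"country": "FR", "investment_type": "bonds"}},
    {"data": 12_000.00, "tags": {"country": "JP", "investment_type": "bonds"}},
]


def pivot(records, first_tag, second_tag):
    """Count how often each pair of tag values occurs together."""
    counts = Counter()
    for record in records:
        tags = record["tags"]
        if first_tag in tags and second_tag in tags:
            counts[(tags[first_tag], tags[second_tag])] += 1
    return counts


# The balance itself is never consulted; only the tags take part in the inquiry.
by_country_and_type = pivot(records, "country", "investment_type")
```

Because the function takes the tag names as arguments, the same few lines serve any other pair of tags, which is the kind of reuse the data lake is meant to allow.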
An important advantage of the data lake is that
there is no need to build, tear down, and rebuild rigid
data structures. For example, suppose we develop an
improved approach to translating English into Chi-
nese. With conventional techniques, the database is the
translation. To make major changes, we would have to
go back to the original data (the English and Chinese
words), and build a completely new structure. With the
Reference Architecture, however, we would simply pull
out the data in a new way, easily reusing it.
In addition, the data lake smoothly accepts every
type of data, including “unstructured” data: information
that has not been organized for inclusion in
a database. An example might be the doctors’ and
nurses’ notes that accompany a patient’s electronic
health records.
Two other critical emerging data types are batch and
streaming. Batch data is typically collected on an auto-
mated basis and then delivered for analysis en masse—
for example, the utility meter readings from homes.
Streaming data is information from a continuous feed,
such as video surveillance.
Most of the flood of big data is unstructured, batch
and streaming, and so it is essential that organizations
have the ability to make full use of all types. With the
data lake, there is no second-class or third-class data.
All of it, including structured, unstructured, batch and
streaming, is equally “ingested” into the data lake, and
available for every inquiry.
It is an environment that is not random and chaotic,
but rather is purposeful. The data lake is like a viscous
medium that holds the data in place, and at the same
time fosters connections. Because the data is all in one
place, it is, in a sense, all connected.
GATHERING INFORMATION FROM THE DATA
LAKE: THE PRE-ANALYTICS
In the first step in analyzing the data, the Reference
Architecture uses tools known as pre-analytics to filter
data from the data lake and then give it an underlying
organization. For example, a recent study by Booz Al-
len and a large hospital chain in the Midwest analyzed
the electronic medical records of hundreds of patients,
to track the progression of a life-threatening condition
known as severe sepsis. Pre-analytics were used to first
pull patients’ vital signs from a version of a data lake,
and—using the time-and-date stamps embedded in the
records—organize them in chronological order. Once
that was accomplished, computer analytics could then
search for patterns in the way the patients’ vital signs
changed over time.
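The filter-then-organize step described for the sepsis study can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the vital_sign_series function, the patient identifiers, and the readings are invented here and are not taken from the study.

```python
from datetime import datetime

# Hypothetical lake contents: vital signs and notes for several patients, unordered.
lake = [
    {"data": 118, "tags": {"patient": "P1", "kind": "heart_rate", "time": "2012-03-01T14:00"}},
    {"data": "Patient reports chills.", "tags": {"patient": "P1", "kind": "note", "time": "2012-03-01T09:30"}},
    {"data": 92, "tags": {"patient": "P1", "kind": "heart_rate", "time": "2012-03-01T08:00"}},
    {"data": 101, "tags": {"patient": "P2", "kind": "heart_rate", "time": "2012-03-01T10:00"}},
    {"data": 104, "tags": {"patient": "P1", "kind": "heart_rate", "time": "2012-03-01T11:00"}},
]


def vital_sign_series(lake, patient, kind):
    """Pre-analytic: pull one patient's readings and put them in time order."""
    # Step 1: filter on the tags to pull only the relevant records.
    selected = [
        record for record in lake
        if record["tags"].get("patient") == patient
        and record["tags"].get("kind") == kind
    ]
    # Step 2: organize the records by their embedded time-and-date stamps.
    selected.sort(key=lambda record: datetime.fromisoformat(record["tags"]["time"]))
    return [(record["tags"]["time"], record["data"]) for record in selected]


series = vital_sign_series(lake, "P1", "heart_rate")
```

The resulting series is what a downstream analytic would scan for patterns in how the readings change over time.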
Pre-analytics accomplish a number of tasks at once.
Using the tags, they locate and pull out the relevant data
from the data lake. They then prepare that data for the
analytics, sorting and organizing the information in any
number of ways. The pre-analytics allow great flexibil-
ity in the inquiries—for example, one such tool might
transliterate a name like Muhammad into every possible
spelling (e.g., Mohammad, Mahamed, Muhamet). This
would enable the computer to collect and analyze infor-
mation about a particular person, even if that person’s
name is spelled differently in different sources of data.
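A reusable name-matching pre-analytic might look something like the Python sketch below. Rather than generating every possible spelling, it swaps in a simpler technique: each name is reduced to a crude consonant "skeleton" so that variant spellings share one key. The name_key and group_by_person functions and the rule that treats t and d as interchangeable are assumptions made for this example; a production tool would need far more careful linguistic rules, since a key this coarse will also merge some unrelated names.

```python
import re
from collections import defaultdict

# Assumption for illustration: treat "t" and "d" as interchangeable.
_EQUIVALENT = str.maketrans({"t": "d"})


def name_key(name: str) -> str:
    """Reduce a name to a rough consonant skeleton shared by its spellings."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    first, rest = letters[:1], letters[1:]
    # Keep the first letter, drop later vowels, then merge equivalent letters.
    skeleton = (first + re.sub(r"[aeiouy]", "", rest)).translate(_EQUIVALENT)
    # Collapse doubled letters such as "mm".
    return re.sub(r"(.)\1+", r"\1", skeleton)


def group_by_person(mentions):
    """Group (name, source) pairs whose names reduce to the same key."""
    groups = defaultdict(list)
    for name, source in mentions:
        groups[name_key(name)].append((name, source))
    return dict(groups)


# Made-up mentions of names in different sources of data.
mentions = [
    ("Muhammad", "visa_records"),
    ("Mohammad", "flight_manifest"),
    ("Mahamed", "bank_transfers"),
    ("Muhamet", "news_feed"),
    ("Maria", "visa_records"),
]
groups = group_by_person(mentions)
```

The same function works for any name without modification, which is the contrast the text draws with conventional tools that must be rebuilt for each name.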
Although pre-analytical tools are commonly used
in the conventional approach, they are typically part of
the rigid structure that must be torn down and rebuilt
as inquiries change. Generally, they cannot be reused—
for example, each name to be transliterated would re-
quire an entirely new pre-analytic. Because such work is
resource-intensive, only a limited number of such tools
can be built, severely hampering an organization’s abil-
ity to make full use of its data. By contrast, the pre-
analytics in the Cloud Analytics Reference Architecture
are designed for use with the data lake, and so are not
part of a custom-built structure. They are both flex-
ible and reusable, giving organizations almost endless
windows into their data. Moreover, they are designed to
be interoperable from the moment they come on-line,
creating a set of easily shared services for all users of
the data.
THE POWER OF COMPUTER ANALYTICS
Once the data has been prepared, the search for
knowledge and insight can begin. As with the other ele-
ments of the Reference Architecture, computer analyt-
ics are used in an entirely new way.
An analogy might be the difference between the
smartphones of today and the separate telephones,
personal digital assistants, and computers of
the not-so-distant past. Smartphones do more than just
combine those functions—they create a new world of
possibilities. The computer analytics in the Cloud Ana-
lytics Reference Architecture do the same.
There are several types of analytics in the Reference
Architecture, including:
Ad hoc queries. These are the analytics that ask
questions of the data. While in the conventional ap-
proach the analytics are part of the narrow, custom-
built structure, here they are free to pursue any line of
inquiry. For example, a financial institution might want
to know which of its foreign investors are at greatest
risk of switching to another firm, based on dozens of
characteristics of current and former customers. Later,
analysts might want to change the question somewhat,
asking the extent to which the political turmoil in certain
countries plays a role. They can use the same analytic
to ask the second question, and any number of
other questions. Like the pre-analytics, ad hoc queries are
flexible and reusable. And they enable the kinds of improvised,
intuitive questions that can yield particularly valuable
results.
Machine learning. This is the search for patterns.
Because all of the data is available at once, and because
there is no need to hypothesize in advance what pat-
terns might exist, these analytics can look for patterns
that emerge anywhere across the data.
Alerting. This type of analytic sends an alert when
something unexpected appears in the patterns. Such
anomalies are often clues to the kind of hidden knowl-
edge that can provide business with a competitive ad-
vantage, and help government organizations achieve
their missions.
Pre-Computation. These analytics enable organiza-
tions to do much of the analyzing in advance, creating
efficiencies. For example, an auto insurance company
might pre-compute the policy price for every individual
vehicle in the U.S., so that, with a few additional details,
a potential customer can be given an instant quote.
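Two of the analytic types above, alerting and pre-computation, lend themselves to a short Python sketch. Both functions are illustrative assumptions rather than the Reference Architecture's own tools: find_anomalies uses a simple standard-deviation rule as a stand-in for an alerting analytic, and the pricing formula in precompute_quotes is invented purely to show the lookup pattern.

```python
from itertools import product
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Alerting: flag (index, value) pairs far from the mean of the series."""
    if len(values) < 2:
        return []
    center, spread = mean(values), stdev(values)
    if spread == 0:
        return []
    return [
        (index, value) for index, value in enumerate(values)
        if abs(value - center) / spread > threshold
    ]


def precompute_quotes(models, years, base_price, model_factor, current_year):
    """Pre-computation: price every (model, year) pair ahead of time.

    The formula is made up: a base price scaled by a per-model factor,
    plus two percent for each year of vehicle age.
    """
    table = {}
    for model, year in product(models, years):
        age = current_year - year
        table[(model, year)] = round(base_price * model_factor[model] * (1 + 0.02 * age), 2)
    return table


def instant_quote(table, model, year, driver_surcharge=0.0):
    """Answer a quote request with a lookup plus a few additional details."""
    return round(table[(model, year)] + driver_surcharge, 2)


# Made-up daily login counts with one unusual spike at the end.
alerts = find_anomalies([20, 22, 19, 21, 20, 23, 21, 95])

quotes = precompute_quotes(
    models=["sedan", "truck"],
    years=[2010, 2012],
    base_price=500.0,
    model_factor={"sedan": 1.0, "truck": 1.3},
    current_year=2012,
)
quote = instant_quote(quotes, "truck", 2012, driver_surcharge=25.0)
```

The expensive work in precompute_quotes happens once, ahead of time; answering a customer is then a dictionary lookup and one addition.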
PUTTING IT ALL TOGETHER: VISUALIZATION AND
INTERACTION
Decision-makers may be understandably concerned
that all this big data will be overwhelming, that remov-
ing the tube from the map will simply lead to informa-
tion overload. Quite the opposite is true. The Cloud
Analytics Reference Architecture addresses the issue
head-on by incorporating the visualization—how the
knowledge is presented to us—into the analytics from
the outset. That is, the analytics not only conduct the
inquiries, they help contextualize and focus the results.
At the visualization and interaction level of the Reference
Architecture, this focus enables the analysts to
more easily make sense of the information, to frame
better, more intuitive inquiries, and to gain deeper in-
sights. Building the visualization into the analytics has
another advantage—it provides the ability for quick and
effective feedback between the two layers, so that the
presentation of the findings can be continually refined
for the decision-maker.
With the Reference Architecture, the flood of infor-
mation is not overwhelming—it is readied for action as
never before. This breakthrough in visualization could
have as profound an effect on decision-making as bar
graphs and pie charts did in the 1950s and 1960s, when
statistics became widely used in business. Those visuals
presented all the essential information at a glance, chang-
ing the nature of decision-making. The Reference Ar-
chitecture will do the same—but this time with big data.
DELIVERING ON THE PROMISE
The possibilities of big data and the cloud are not
pipe dreams. But they will not be fulfilled on their
own—conscious effort and deliberate planning are
needed. Unless organizations make the right infra-
structure decisions, they cannot hope to build a data
lake. Unless they make the right data management de-
cisions, they will never break free from the rigid data
and analytic structures that are so limiting. The Cloud
Analytics Reference Architecture can be seen as a road
map for that decision-making, one that shows the importance
of a holistic approach rather than a piecemeal, haphazard
one. Each element is closely tied to each of the
other elements, and so all must be considered together.
The Cloud Analytics Reference Architecture is no
more expensive to build than the traditional approach, and
is considerably more cost-effective in the long run. Be-
cause the elements of the Cloud Analytics Reference
Architecture are largely reusable, they can scale an or-
ganization’s big data in an affordable way.
The Cloud Analytics Reference Architecture is al-
ready being used by the U.S. government to make our
nation safer, and it can help other organizations in gov-
ernment and business create value, solve real-world
problems, and drive success. The grand promise of big
data and the cloud is now within reach.
FOR MORE INFORMATION
Mark Jacobsohn
jacobsohn_mark@bah.com
301-497-6989
Joshua Sullivan, PhD
sullivan_joshua@bah.com
301-543-4611
www.boozallen.com/cloud
This document is part of a collection of papers developed by Booz Allen Hamilton to introduce new concepts and ideas spanning cloud
solutions, challenges, and opportunities across government and business. For media inquiries or more information on reproducing this
document, please contact:
James Fisher—Senior Manager, Media Relations, 703-377-7595, fisher_james_w@bah.com
Carrie Lake—Manager, Media Relations, 703-377-7785, lake_carrie@bah.com