Delivering on the Promise of Big Data and the Cloud


by Mark Jacobsohn, Senior Vice President, Booz Allen Hamilton
Joshua Sullivan, PhD, Vice President, Booz Allen Hamilton
November 2012

©2012 Booz Allen Hamilton Inc. All rights reserved. No part of this document may be reproduced without prior written permission of Booz Allen Hamilton.

WHY CAN’T WE SEEM TO DO MORE WITH BIG DATA?

We are living in an age inundated with information. Our world is increasingly instrumented: sensors are collecting data on everything from hospital patients’ vital signs, to the moment-by-moment navigation of commercial aircraft, to consumer behavior based on buying patterns and the use of membership cards. Waves of data are coming from social media sites, from radio-frequency tracking systems, from the use of UPC barcodes. Our modern society is wired for data.

Yet there is a growing belief in both business and government that we should be doing far more to take advantage of this wealth of information. We might use certain types of information for one purpose or another, but we nearly always view big data through multiple stovepipes, rather than treating it holistically. We do not appear to be able to tap the full potential of all the data available to us.

We have the technical ability. There have been significant innovations in computer technology in recent years, particularly with the advent of cloud computing. Yet like the promise of big data, the promise of the cloud (including unprecedented savings, much greater access to data, and better decision-making) still seems largely unfulfilled.

What holds us back is not technology, but a mindset. We are locked into an outmoded approach to data, one that relies on techniques created well before big data arrived on the scene. Those techniques give us access to only limited slices of information, and are not designed to easily connect an analyst with multiple sources of data. They were sufficient in their day, but are no longer enough. Ultimately, we are not doing more with big data because we do not have complete access to it. We are never able to use it all at once, and so we are unable to track overall trends, or see entire patterns, or ask complex questions that consider everything we know.

To meet this need, and take full advantage of both big data and cloud computing, a new approach has been invented. Known as the Cloud Analytics Reference Architecture, it is the result of an ongoing collaboration between Booz Allen Hamilton and the U.S. government to leverage big data to search for terrorists and other threats. Intelligence analysts are now using the Cloud Analytics Reference Architecture to paint a comprehensive picture that incorporates the full range of intelligence data at once, including both accumulated reports and those still coming in from the field. Unlike conventional techniques, this new approach makes it possible for analysts to use all available intelligence data, applying an expanding set of analytic services to help them gain critical mission insights.

The Cloud Analytics Reference Architecture, which is being adapted to the larger business and government communities, removes the traditional constraints by bringing together innovations in two areas of current technology. First, it uses the power of the cloud to put an organization’s entire storehouse of data into a common pool, or “data lake,” making all of it easily accessible for the first time. It then uses sophisticated computer analytics, such as machine learning and natural language processing, to help extract the kind of knowledge and insight that creates value, guides strategy, and drives business and mission success. Although the Cloud Analytics Reference Architecture builds upon current techniques, it is not an incremental step forward. It is an entirely new approach, one specifically designed for our new age of data.

One way to understand how the Reference Architecture works is to view it in layers (see Figure 1). Its foundation is the cloud computing and network infrastructure, which supports the methods by which data is managed, most notably the data lake. The data lake, in turn, supports a two-step process to analyze the data. In the first step, special tools known as pre-analytics filter information from the data lake and give it an underlying organization. That sets the stage for computer analytics, in the next layer up, to search for valuable knowledge. These elements support the final phase, the visualization and interaction, where the human insights and action take place.

THE POWER OF THE CLOUD ANALYTICS REFERENCE ARCHITECTURE

The Reference Architecture opens up the enormous potential of big data by allowing us to search for insight in new ways. It enables us to look for overarching patterns, and ask intuitive questions of all the data, rather than limiting us to narrowly defined queries within data sets.

The Reference Architecture allows computers to take over much of the work humans are doing now, freeing people to focus on the search for insight. It makes it possible for non-computer experts, for the first time, to frame the questions, look for patterns, and follow hunches.

This is not some kind of magical solution; far from it. The Reference Architecture is simply a new way of looking at data, but one that revolutionizes our ability to gain knowledge and insight. With conventional techniques, the data and analytics are locked into stovepipes, or silos. We can explore only limited amounts of data at any one time, and then only with predetermined questions that have already been built in. The Reference Architecture removes these constraints by eliminating the silos and consolidating all the information in the data lake.

[Figure 1. Primary Elements of the Cloud Analytics Reference Architecture]

What results is not chaotic or overwhelming. Rather, the rich diversity of information in the data lake becomes a powerful force. The data lake is more than a means of storage; it is a medium expressly designed to foster connections in data. And the Reference Architecture explores those connections to search for valuable correlations and patterns. This actually reduces the complexity of big data, making it manageable and useful, and creating efficiencies.

Instead of using data to ask “canned” questions that test what we may already know, the Reference Architecture uses data to discover new possibilities: solutions and answers that we have not even considered. The power of the Reference Architecture is that it constantly evolves and adapts as we search for insight, taking us beyond the limits of our imagination.

WHAT THE CLOUD ANALYTICS REFERENCE ARCHITECTURE DOES

The Cloud Analytics Reference Architecture removes the constraints created by data silos. While the rigid structures used in conventional techniques provide ease of storage, they carry severe disadvantages. They give us an artificial view of the world based on data models, rather than on reality and meaning. It is akin to reading a map through a tube: we can never immerse ourselves in the diversity of big data, and instead make decisions based on limited and constrained information. Much of data science in the last ten years has been devoted to improving access to the silos and building bridges between them. But that does not solve the underlying problem, which is that the data is regimented and locked in.

Eliminating the need for silos gives us access to all the data at once, including data from multiple outside sources. Users no longer need to move from database to database, pulling out specific information. And, because there are no data silos, there is no need to build complex bridges between them.

If we want to know, for example, which parts of our computer network are most vulnerable to attack in the next six hours, we can take into account a wide variety of data sources at the same time. We might look at whether today is a holiday in certain foreign countries, which means that the young hackers known as “script kiddies” are more likely to be out of school and so have time on their hands to launch an attack. If we determine that a particular group is targeting us, we might examine how its members are connected, asking whether they had a common professor at a university, and if so, what techniques he or she taught. The Reference Architecture gives us the ability to ask a full suite of questions rather than a pre-selected few.

The Cloud Analytics Reference Architecture allows us to experiment more with the data. The Reference Architecture’s flexibility provides a new kind of freedom: to follow hunches wherever they may lead, to quickly shift direction to pursue promising avenues of inquiry, to easily factor in new knowledge and insights as they arise.

With the conventional approach, it is difficult to add or switch variables that are not already part of a dataset or database. That typically requires tearing apart and rebuilding both the structure that the data is in and the computer analytics that are custom-designed to handle specific lines of inquiry. The process is expensive and time-consuming, and so we tend to focus instead on doing better analysis with the limited tools available on our narrow slices of data.

With the Reference Architecture, we might decide, in the network security example above, to add new variables to the mix, such as the current propagation speed of commonly used viruses and botnets. Even if those variables come from outside data sources, we do not have to tear down and rebuild our data structures and analytics to consider them; they seamlessly become part of our inquiry.

The Cloud Analytics Reference Architecture allows us to ask more intuitive questions. With the conventional approach, we do not really ask questions of the data; we create hypotheses, and then test the data to see whether we are right. In order to pose these hypotheses, we have to guess in advance what the answers might be, often a difficult proposition.

To determine where our network is most vulnerable, for example, we would need to start with a hypothesis: say, that any attacks will occur through outdated operating systems. That hypothesis, accurate or not, would drive our initial line of inquiry.

With the conventional approach, we also need to be familiar with the data we are considering, including where it is (in what specific datasets or databases), what format it is in, and even to a large extent what the data itself contains. That level of knowledge might be achievable when we are working with a limited number of datasets or databases, but not with the vast amounts of information now becoming available to us. We often have to put aside, or assume away, factors that we might actually believe are critical. Add to these handicaps our inability to go beyond the pre-selected questions or easily change variables, and it becomes an impossible task. And so we never try it. We end up settling for marginal questions, and marginal answers.

With the Reference Architecture, however, we can structure an inquiry around a single, intuitive, big-picture question: What part of our computer network is most vulnerable to attack in the next six hours? We do not need to know much about any of the data sources we are consulting; the data will point us to the answer.

The Cloud Analytics Reference Architecture allows us to more readily look for unexpected patterns. It lets the data talk to us, so to speak. Even if we could ask all the questions we want, the way we want, there is simply too much data to formulate every question that might be important. Our questions can also be limited by our biases about the issues we are researching. We may not know what areas to explore, or what we should be looking at. To get the full picture, and help guide our inquiries, we need to see what patterns naturally emerge in the data.

While we can look for patterns with the conventional approach, there are two significant drawbacks. We can only do such searches within our narrowly defined datasets and databases, rather than with the entire range of data available to us. We also must first guess what those specific patterns might be, and then test them out with hypotheses. But what about the patterns we do not even know might exist? How do we get to the hidden knowledge that often proves so valuable?

Because there are no limiting data and analytic structures in the Reference Architecture, we do not need to pose hypotheses, and our search for patterns encompasses the entire range of data. For example, the U.S. military is now using the Reference Architecture to search for patterns in war zone intelligence data, to map out convoy routes least likely to encounter improvised explosive devices (IEDs).

The Cloud Analytics Reference Architecture allows computers to take over much of the work humans are doing now, enabling people to focus on creating value.
Conventional methods require that people play a large role in processing the data, including selecting samples to be analyzed, creating data structures, posing hypotheses, and sifting through and refining results. That intense level of effort may be workable for small amounts of data, but no organization has the personnel or resources to use that method to process big data. The Cloud Analytics Reference Architecture solves this problem by giving a great deal of that work to the computers, particularly tasks that are repetitive and computationally intensive. This reduces human error, and substantially speeds up the work.

When we use the Reference Architecture to pose more intuitive questions, or to find patterns, we are essentially asking the computer to take us as close as it can to finding the answers we want. It is then up to us, using our cognitive skills, to find meaning in those answers.

By separating out what the computer can do (the analytics) and what only people can do (the actual analysis), the Cloud Analytics Reference Architecture greatly eases the human workload. It is a division of labor that frees subject-matter experts to look at the larger picture. At the same time, the Reference Architecture rapidly highlights areas that analysts should not waste their time exploring, enabling them to focus their time and attention in the right direction.

For example, agencies that investigate consumer complaints against financial institutions often do not know which individual complaints are indicative of a broader pattern of consumer abuse, and so deserve the most attention. Investigators rarely have the time to sort through the vast array of sources that might provide valuable clues, such as blogs and social media sites where consumers commonly air their grievances. With a data lake that included all such available information, the Reference Architecture’s analytics could quickly identify patterns, such as consumer abuse affecting large numbers of people.
Investigators could then focus their resources on the most serious cases.

The Cloud Analytics Reference Architecture’s analysis capability enables subject matter experts to explore the data. If we are to drive business and mission success, we must give direct access to the data to the analysts, or subject matter experts, who understand what that success might mean. However, because of the high level of computer expertise needed to design custom data storage structures and analytics, much of the analysis today is conducted by computer scientists, computer engineers, and mathematicians acting as agents for the subject matter experts. They are typically the ones who translate the overall goals of the business and government analysts into the language of the machine.

Whenever there is a middleman in any field, things tend to get lost in the translation, and data analysis is no exception. Here, it leads to a disconnect between the people who need knowledge and insight (the subject matter experts) and the data itself. It also substantially slows the process.

In the top layers of the Reference Architecture, the middleman syndrome goes away. The ability to ask intuitive questions, and to look for patterns, provides the analysts with direct access to the data. That gives them the flexibility they need to experiment and explore, and allows the system to reach maximum velocity. The computer scientists, computer engineers and mathematicians still play a key role, but now are no longer the ones who drive the inquiries into the data.

For example, investigators who suspect fraud may be occurring are often hampered by the need to go through computer experts to query the data. Their request may be one of many, and by the time they get back the information they need to act, the criminals have often long since committed the fraud and disappeared. With the Reference Architecture, however, investigators could query the data themselves, quickly pinpoint the fraud, and take action in time to stop the activity.

THE FOUNDATION OF THE REFERENCE ARCHITECTURE: A NEW APPROACH TO INFRASTRUCTURE

The Reference Architecture takes advantage of the immense storage ability of the cloud, though in a different way than in the past. With the conventional approach, cloud storage does not eliminate the data silos; it simply makes them fatter. Organizations must continually reinvest in infrastructure as analytic needs change. Building bridges between silos, for example, typically requires reconfiguring and even expanding the infrastructure.

The Reference Architecture, by contrast, has an inherent flexibility that enables organizations to pursue new analytical approaches with few if any changes to the underlying infrastructure. One reason is that the data lake is easily expandable. Because it stores information so efficiently, it can accommodate both the natural growth of an organization’s data and the addition of data from multiple outside sources. At the same time, the Reference Architecture replaces the current, custom-built analytics with a new generation of tools that are highly reusable for almost any number of inquiries. With the Reference Architecture, organizations do not need to rebuild infrastructure as their levels of data and analytics increase. An organization’s initial investment in infrastructure is therefore both enduring and cost-effective.

HOW THE DATA LAKE WORKS

With the conventional approach, the computer is able to locate the information it needs because it knows precisely where it is: in one database or another. The information is identified largely by its location. With the data lake, information is still identified for use, but now in a way other than by location. Specific pieces of information are identified by “tags,” details that have been embedded in them for sorting and identification.

For example, an investor’s portfolio balance (the data) is generally stored with identifying information such as the name of the investor, the account number, one or more dates, the location of the account, the types of investments, the country the investor lives in, and so on. This “metadata” is what gets tagged, and is located by the computer during inquiries.

The process of tagging information is not new; it is commonly done within specific datasets and databases. What is new is using the technique to eliminate the need for datasets and databases altogether.

The tags themselves are also a way of gaining knowledge from the data. In the example above, they might allow us to look for, say, connections between investors’ countries and their types of investments. The basic data (the portfolio balance) might not even be part of the inquiry. Such connections can be made with the conventional approach, but only if the custom-built databases and computer analytics have already been designed to take them into consideration. With the data lake, all the data, metadata and identifying tags are available for any inquiry or search for patterns. And such inquiries or searches can pivot off of any one of those pieces of information. This greatly expands the usability of the data available to an organization. It actually makes big data even bigger.

An important advantage of the data lake is that there is no need to build, tear down, and rebuild rigid data structures.
For example, suppose we develop an improved approach to translating English into Chinese. With conventional techniques, the database is the translation. To make major changes, we would have to go back to the original data (the English and Chinese words), and build a completely new structure. With the Reference Architecture, however, we would simply pull out the data in a new way, easily reusing it.

In addition, the data lake smoothly accepts every type of data, including “unstructured” data: information that has not been organized for inclusion in a database. An example might be the doctors’ and nurses’ notes that accompany a patient’s electronic health records. Two other critical emerging data types are batch and streaming. Batch data is typically collected on an automated basis and then delivered for analysis en masse, such as the utility meter readings from homes. Streaming data is information from a continuous feed, such as video surveillance.

Most of the flood of big data is unstructured, batch and streaming, and so it is essential that organizations have the ability to make full use of all types. With the data lake, there is no second-class or third-class data. All of it, including structured, unstructured, batch and streaming, is equally “ingested” into the data lake, and available for every inquiry.

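
The idea of finding information by tag rather than by location can be illustrated with a small sketch. This is not the Reference Architecture's actual implementation, which the paper does not describe at this level; the `DataLake` class, its `ingest`, `find`, and `pivot` methods, and all sample names and values are hypothetical and exist only to show the principle. Every item, structured or not, is ingested with whatever tags it carries, and any tag can later serve as the starting point of an inquiry.

```python
from collections import defaultdict


class DataLake:
    """Toy in-memory 'data lake' (hypothetical): items are found by tag, not by location."""

    def __init__(self):
        self._records = []               # every ingested item, of any type
        self._index = defaultdict(set)   # (tag name, tag value) -> record positions

    def ingest(self, payload, **tags):
        """Store a payload with its metadata tags and index each tag (values must be hashable)."""
        position = len(self._records)
        self._records.append({"payload": payload, "tags": tags})
        for item in tags.items():
            self._index[item].add(position)
        return position

    def find(self, **tags):
        """Return the records that carry all of the given tag values."""
        if not tags:
            return list(self._records)
        matches = set.intersection(
            *(self._index.get(item, set()) for item in tags.items())
        )
        return [self._records[i] for i in sorted(matches)]

    def pivot(self, group_by, collect):
        """Relate two tags to each other without looking at the payload at all."""
        result = defaultdict(set)
        for record in self._records:
            tags = record["tags"]
            if group_by in tags and collect in tags:
                result[tags[group_by]].add(tags[collect])
        return dict(result)


# Illustrative data only: portfolio balances (structured) and a free-text note (unstructured).
lake = DataLake()
lake.ingest(125000.0, investor="Investor A", country="BR", investment_type="bonds")
lake.ingest(40000.0, investor="Investor B", country="BR", investment_type="equities")
lake.ingest(98000.0, investor="Investor C", country="SG", investment_type="equities")
lake.ingest("Client asked about moving funds.", source="call_notes", investor="Investor A")

bonds_in_br = [r["payload"] for r in lake.find(country="BR", investment_type="bonds")]
types_by_country = lake.pivot("country", "investment_type")
```

Here `pivot` mirrors the investor example in the text: it connects countries to investment types while the portfolio balance itself never enters the inquiry, and the unstructured note sits in the same store as the balances.
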
The result is an environment that is not random and chaotic, but purposeful. The data lake is like a viscous medium that holds the data in place, and at the same time fosters connections. Because the data is all in one place, it is, in a sense, all connected.

GATHERING INFORMATION FROM THE DATA LAKE: THE PRE-ANALYTICS

In the first step in analyzing the data, the Reference Architecture uses tools known as pre-analytics to filter data from the data lake and then give it an underlying organization. For example, a recent study by Booz Allen and a large hospital chain in the Midwest analyzed the electronic medical records of hundreds of patients, to track the progression of a life-threatening condition known as severe sepsis. Pre-analytics were used to first pull patients’ vital signs from a version of a data lake and, using the time-and-date stamps embedded in the records, organize them in chronological order. Once that was accomplished, computer analytics could then search for patterns in the way the patients’ vital signs changed over time.

Pre-analytics accomplish a number of tasks at once. Using the tags, they locate and pull out the relevant data from the data lake. They then prepare that data for the analytics, sorting and organizing the information in any number of ways. The pre-analytics allow great flexibility in the inquiries. For example, one such tool might transliterate a name like Muhammad into every possible spelling (e.g., Mohammad, Mahamed, Muhamet). This would enable the computer to collect and analyze information about a particular person, even if that person’s name is spelled differently in different sources of data.

Although pre-analytical tools are commonly used in the conventional approach, they are typically part of the rigid structure that must be torn down and rebuilt as inquiries change. Generally, they cannot be reused; for example, each name to be transliterated would require an entirely new pre-analytic.
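
The two pre-analytic tasks described in this section, pulling tagged records into chronological order and matching differently spelled names, can be sketched as small reusable functions. This is a minimal illustration under stated assumptions, not the tooling used in the sepsis study: the record layout, the tag names (`kind`, `patient_id`, `recorded_at`), and both function names are hypothetical. In particular, `name_key` is a deliberately crude consonant-skeleton key standing in for real transliteration; it will also group some unrelated names together.

```python
from datetime import datetime


def chronological_vitals(records, patient_id):
    """Filter one patient's vital-sign records by tag, then sort by embedded timestamp."""
    selected = [
        r for r in records
        if r["tags"].get("kind") == "vital_signs"
        and r["tags"].get("patient_id") == patient_id
    ]
    return sorted(
        selected,
        key=lambda r: datetime.fromisoformat(r["tags"]["recorded_at"]),
    )


def name_key(name):
    """Crude spelling-insensitive key: drop vowels, treat 't' as 'd', collapse repeats."""
    consonants = [c for c in name.lower() if c.isalpha() and c not in "aeiou"]
    key = []
    for c in consonants:
        c = "d" if c == "t" else c
        if not key or key[-1] != c:
            key.append(c)
    return "".join(key)


# Illustrative records only; values are made up for the example.
records = [
    {"payload": {"heart_rate": 118},
     "tags": {"kind": "vital_signs", "patient_id": "p1", "recorded_at": "2012-03-01T14:00:00"}},
    {"payload": {"heart_rate": 92},
     "tags": {"kind": "vital_signs", "patient_id": "p1", "recorded_at": "2012-03-01T08:00:00"}},
    {"payload": "Patient appears confused.",
     "tags": {"kind": "nurse_note", "patient_id": "p1", "recorded_at": "2012-03-01T09:00:00"}},
    {"payload": {"heart_rate": 75},
     "tags": {"kind": "vital_signs", "patient_id": "p2", "recorded_at": "2012-03-01T07:00:00"}},
]

ordered = chronological_vitals(records, "p1")
heart_rates = [r["payload"]["heart_rate"] for r in ordered]

spellings = ["Muhammad", "Mohammad", "Mahamed", "Muhamet"]
keys = {name_key(s) for s in spellings}
```

Because `name_key` takes any name as input, one function serves every name, which is the reusability the text contrasts with building a new pre-analytic per name.
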
Because building such single-purpose tools is resource-intensive, only a limited number of them can be built, severely hampering an organization’s ability to make full use of its data. By contrast, the pre-analytics in the Cloud Analytics Reference Architecture are designed for use with the data lake, and so are not part of a custom-built structure. They are both flexible and reusable, giving organizations almost endless windows into their data. Moreover, they are designed to be interoperable from the moment they come on-line, creating a set of easily shared services for all users of the data.

THE POWER OF COMPUTER ANALYTICS

Once the data has been prepared, the search for knowledge and insight can begin. As with the other elements of the Reference Architecture, computer analytics are used in an entirely new way. An analogy might be the difference between the smartphones of today and the separate functions for telephones, personal digital assistants and computers of the not-so-distant past. Smartphones do more than just combine those functions; they create a new world of possibilities. The computer analytics in the Cloud Analytics Reference Architecture do the same.

There are several types of analytics in the Reference Architecture, including:

Ad hoc queries. These are the analytics that ask questions of the data. While in the conventional approach the analytics are part of the narrow, custom-built structure, here they are free to pursue any line of inquiry. For example, a financial institution might want to know which of its foreign investors are at greatest risk of switching to another firm, based on dozens of characteristics of current and former customers. Later, analysts might want to change the question somewhat, asking the extent to which the political turmoil in certain countries plays a role. They can use the same analytic to ask the second question, and any number of other questions; like the pre-analytics, they are flexible and reusable.
And they enable the kinds of improvised, intuitive questions that can yield particularly valuable results.

Machine learning. This is the search for patterns. Because all of the data is available at once, and because there is no need to hypothesize in advance what patterns might exist, these analytics can look for patterns that emerge anywhere across the data.

Alerting. This type of analytic sends an alert when something unexpected appears in the patterns. Such anomalies are often clues to the kind of hidden knowledge that can provide business with a competitive advantage, and help government organizations achieve their missions.

Pre-computation. These analytics enable organizations to do much of the analyzing in advance, creating efficiencies. For example, an auto insurance company might pre-compute the policy price for every individual vehicle in the U.S., so that, with a few additional details, a potential customer can be given an instant quote.

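
Two of the analytic types above, alerting and pre-computation, lend themselves to a short sketch. The paper does not specify how either is implemented, so everything here is an assumption made for illustration: the alert rule is a simple distance-from-the-mean check using Python's standard `statistics` module, and the pricing rule, the rate table, the surcharge, and all function and field names are hypothetical.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Alerting sketch: flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / spread > threshold]


# Hypothetical pricing inputs, for illustration only.
BODY_RATE = {"sedan": 500.0, "truck": 650.0}


def base_price(vehicle):
    """The expensive part of pricing, computed per vehicle ahead of time."""
    return BODY_RATE[vehicle["body"]] + 20.0 * vehicle["age_years"]


def precompute_prices(vehicles):
    """Pre-computation sketch: build the lookup table before any customer asks."""
    return {v["vin"]: base_price(v) for v in vehicles}


def instant_quote(price_table, vin, young_driver):
    """At quote time, only a cheap lookup plus the few additional details remain."""
    price = price_table[vin]
    return price * 1.25 if young_driver else price


daily_logins = [10, 11, 9, 10, 12, 10, 11, 50]
alerts = find_anomalies(daily_logins)

vehicles = [
    {"vin": "VIN-0001", "body": "sedan", "age_years": 3},
    {"vin": "VIN-0002", "body": "truck", "age_years": 1},
]
price_table = precompute_prices(vehicles)
quote = instant_quote(price_table, "VIN-0001", young_driver=True)
```

The split between `precompute_prices` and `instant_quote` is the design point: the per-vehicle work happens once in advance, so the customer-facing step needs only the "few additional details" the text mentions.
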
PUTTING IT ALL TOGETHER: VISUALIZATION AND INTERACTION

Decision-makers may be understandably concerned that all this big data will be overwhelming, that removing the tube from the map will simply lead to information overload. Quite the opposite is true. The Cloud Analytics Reference Architecture addresses the issue head-on by incorporating the visualization (how the knowledge is presented to us) into the analytics from the outset. That is, the analytics not only conduct the inquiries, they help contextualize and focus the results.

At the visualization and interaction level of the Reference Architecture, this focus enables the analysts to more easily make sense of the information, to frame better, more intuitive inquiries, and to gain deeper insights. Building the visualization into the analytics has another advantage: it provides the ability for quick and effective feedback between the two layers, so that the presentation of the findings can be continually refined for the decision-maker.

With the Reference Architecture, the flood of information is not overwhelming; it is readied for action as never before. This breakthrough in visualization could have as profound an effect on decision-making as bar graphs and pie charts did in the 1950s and 1960s, when statistics became widely used in business. Those visuals presented all the essential information at a glance, changing the nature of decision-making. The Reference Architecture will do the same, but this time with big data.

DELIVERING ON THE PROMISE

The possibilities of big data and the cloud are not pipe dreams. But they will not be fulfilled on their own; conscious effort and deliberate planning are needed. Unless organizations make the right infrastructure decisions, they cannot hope to build a data lake. Unless they make the right data management decisions, they will never break free from the rigid data and analytic structures that are so limiting.
The Cloud Analytics Reference Architecture can be seen as a road map for that decision-making, one that shows the importance of a holistic approach rather than a piecemeal, haphazard one. Each element is closely tied to each of the other elements, and so all must be considered together.

The Cloud Analytics Reference Architecture is no more expensive to build than a traditional approach, and is considerably more cost-effective in the long run. Because the elements of the Cloud Analytics Reference Architecture are largely reusable, they can scale an organization’s big data in an affordable way.

The Cloud Analytics Reference Architecture is already being used by the U.S. government to make our nation safer, and it can help other organizations in government and business create value, solve real-world problems, and drive success. The grand promise of big data and the cloud is now within reach.

FOR MORE INFORMATION

Mark Jacobsohn, jacobsohn_mark@bah.com, 301-497-6989
Joshua Sullivan, PhD, sullivan_joshua@bah.com, 301-543-4611
www.boozallen.com/cloud

This document is part of a collection of papers developed by Booz Allen Hamilton to introduce new concepts and ideas spanning cloud solutions, challenges, and opportunities across government and business.

For media inquiries or more information on reproducing this document, please contact:
James Fisher, Senior Manager, Media Relations, 703-377-7595, fisher_james_w@bah.com
Carrie Lake, Manager, Media Relations, 703-377-7785, lake_carrie@bah.com