
Massive Data Analytics and the Cloud

Cloud computing offers a very important approach to achieving lasting strategic advantages by rapidly adapting to complex challenges in IT management and data analytics. This paper discusses the business impact and analytic transformation opportunities of cloud computing. Moreover, it highlights the differences between two cloud architectures, Utility Clouds and Data Clouds, with illustrative examples of how Data Clouds are shaping new advances in Intelligence Analysis.

Published in: Technology, Business

  1. Massive Data Analytics and the Cloud: A Revolution in Intelligence Analysis
     by
     Michael Farber, farber_michael@bah.com
     Mike Cameron, cameron_mike@bah.com
     Christopher Ellis, ellis_christopher@bah.com
     Josh Sullivan, Ph.D., sullivan_joshua@bah.com
  2. Cloud computing offers a very important approach to achieving lasting strategic advantages by rapidly adapting to complex challenges in IT management and data analytics. This paper discusses the business impact and analytic transformation opportunities of cloud computing. Moreover, it highlights the differences between two cloud architectures, Utility Clouds and Data Clouds, with illustrative examples of how Data Clouds are shaping new advances in Intelligence Analysis.

Intelligence Analysis Transformation

Until the advent of the Information Age, the process of Intelligence Analysis was one of obtaining and combining sparse and often denied data to derive intelligence. Analysis needs were well-understood, with consistent enemies and smooth data growth patterns. With the advent of the Information Age, the Internet, and rapidly evolving and multi-layered methods of disseminating information, including wikis, blogs, e-mail, virtual worlds, online games, VoIP telephone, digital photos, instant messages (IM), and tweets, the raw data and reflections of data on nearly everyone and their every activity are becoming increasingly available. This availability and constant growth of such vast amounts of disparate data is turning the Intelligence Analysis problem on its head, transforming it from a process of “stitching together sparse data to derive conclusions” to a process of “extracting conclusions from aggregation and distillation of massive data and data reflections.” [1]

Cloud computing provides new capabilities for performing analysis across all data in an organization. It uses new technical approaches to store, search, mine, and distribute massive amounts of data. Cloud computing allows analysts and decision makers to ask ad-hoc analysis questions of massive volumes of data in very quick and precise ways. New cloud computing technologies are driving analytic transformation in the way organizations store, access, and process massive amounts of disparate data via massively parallel and distributed IT systems. These technologies include Hadoop, MapReduce, BigTable, and other emergent Data Cloud technologies.

Historically, the amount of processing power that could be applied constrained the ability to analyze intelligence. Consequently, data representation and processing were restricted by the amount of available processing power. Complex data models and normalization became common, and data was forced into models that did not fully represent all aspects of data. Understanding the emergence of “big data,” data reflection, and massively scalable processing can lead to new insights for Intelligence Analysis, as demonstrated by Google®, Yahoo!®, and Amazon®, which leverage cloud computing and Data Clouds to power their businesses. Booz Allen’s experience with cloud computing ideally positions the firm to support current and future clients in understanding and adopting cloud computing solutions.

Understanding the Differences Between Data Clouds and Utility Clouds

Cloud computing offers a wide range of technologies that support the transformation of data analytics. As outlined by the National Institute of Standards and Technology (NIST), cloud computing is a model “for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” [2] A number of cloud computing deployment models exist, such as public clouds, private clouds, community clouds, and hybrids. Moreover, two major cloud computing design patterns are emerging: Utility Clouds and Data Clouds. These distinctions are not mutually exclusive; rather,

[1] Discussions with US Government Computational Sciences Group.
[2] US National Institute of Standards. http://groups.google.com/group/cloudforum/web/nist-working-definition-of-cloud-computing.
  3. these designs can work cooperatively to provide economies of scale, resiliency, security, scalability, and analytics at world scale. As illustrated in Exhibit 1, fundamentally different mission objectives drive Utility Clouds and Data Clouds. Utility Clouds focus on offering infrastructure, platforms, and software as services that many users consume. These basic building blocks of cloud computing are essential to achieving real solutions at scale. Data Clouds can leverage those utility building blocks to provide data analytics, structured data storage, databases, and massively parallel computation, which allow analysts unprecedented access to mission data and shared analysis algorithms.

Exhibit 1 | Relationship of Utility Clouds and Data Clouds
  Cloud Computing
    Data Cloud
      • Massively Parallel Compute
      • Highly Scalable Multi-Dimensional Databases
      • Distributed Highly Fault-Tolerant Massive Storage
    Utility Cloud
      • Software as a Service (SaaS)
      • Platform as a Service (PaaS)
      • Infrastructure as a Service (IaaS)
  Source: Booz Allen Hamilton

Specifically, Utility Clouds focus on providing IT capabilities as a service; e.g., Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). This service model scales by allowing multiple constituencies to share IT assets, an arrangement called multi-tenancy. Key to enabling multi-tenancy is a security model that separates and protects data and processing. In fact, security (focused on ensuring data segmentation, integrity, and access control) is one of the most critical design drivers of Utility Cloud architectures.

In contrast, the main objective of Data Clouds is the aggregation of massive data. Data Clouds have an architectural approach of dividing processing tasks into smaller units distributed to servers across the cluster with co-location of computing and storage capabilities, allowing highly scaled computation across all of the data in the cloud. The Data Cloud design pattern typifies data-intensive and processing-intensive usage scenarios without regard to concepts used in the Utility Cloud model, such as virtualization and multi-tenancy.

Exhibit 2 contains a comparison of Data Clouds and Utility Clouds. The difference between these two architectures is seen in industry. Amazon is an excellent model for Utility Cloud computing, and Google is an excellent model for Data Cloud computing.
  4. Cloud Computing Can Solve Problems Previously Out of Reach

More than just a new design pattern, cloud computing comprises a range of technologies that enable new forms of analysis that were previously computationally impossible or too difficult to attempt. These business and mission problems are intractable in the context of traditional IT design and delivery models and have often ended in failed or underwhelming results. Cloud computing allows research into these problems, opening new business opportunities and new outreach to clients with large amounts of data.

Much of the current cloud computing discussion focuses on utility computing, economies of scale, and migration of current capabilities to the cloud. Refocusing the conversation to new business opportunities that empower decision makers and analysts to ask previously un-askable questions is the emerging power of the cloud.

What does the cloud allow us to do that we could not do before? Compute-intensive problems, such as large-scale image processing, sensor data correlation, social network analysis, encryption/decryption, data mining, simulations, and pattern recognition, are strong examples of problems that can be solved in the cloud computing domain.

Exhibit 2 | Comparison of Data and Utility Clouds
  Data Clouds
    • Computing architecture for large-scale data processing and analytics
    • Designed to operate at trillions of operations/day, petabytes of storage
    • Designed for performance, scale, and data processing
    • Characterized by run-time data models and simplified development models
  Utility Clouds
    • Computing services for outsourced IT operations
    • Concurrent, independent, multi-tenant user population
    • Service offerings such as SaaS, PaaS, and IaaS
    • Characterized by data segmentation, hosted applications, low cost of ownership, and elasticity
  Source: Booz Allen Hamilton

Consider a social network analysis problem from Facebook®. This problem was impossible to solve before the advent of cloud computing: [3]

With 400 to 500 million users and billions of page views every day, Facebook accumulates massive amounts of data. One of the challenges it has faced since its early days is developing a scalable way of storing and processing all these bytes, since historical data is a major driver of business innovation and the user experience on Facebook. This task can only be accomplished by empowering Facebook’s engineers and analysts with easy-to-use tools to mine and manipulate large data sets. At some point, there isn’t a bigger server to buy, and the relational database stops scaling. To drive the business forward, Facebook needed a way to query and correlate roughly 15 terabytes of new social network data each day, in different languages, at different times, often from mixed media (e.g., web, IM, SMS) streaming in from hundreds of different sources. Moreover, Facebook needed a massively parallel processing framework and a way to safely and securely store large volumes of data. Several computationally impossible problems were lingering at Facebook. One of the most interesting of these problems was the Facebook Lexicon, a data analysis program that would first allow a user to select a word and would

[3] Sarma, Joydeep Sen. “Hadoop.” Facebook, June 5, 2008. http://www.facebook.com/notes.php?id=9445547199#!/note.php?note_id=16121578919.
  5. then scan all available data at Facebook (which grows by 15 terabytes per day), calculate the frequency of the word’s occurrence, and graphically display the information over time. This task was not possible in a traditional IT solution because of the large number of users, size of data, and time to process. But the Data Cloud allows Facebook to leverage more than 8,500 Central Processing Unit (CPU) cores and petabytes of disk space to create rich data analytics on a wide range of business characteristics. The Data Cloud allows analysts and technical leaders at Facebook to rapidly write analytics in the software language of their choice. Subsequently, these analytics run over massive amounts of data, condensing down to small, personalized analysis results. These results can then be stored in a traditional relational database, allowing existing reporting and financial tools to remain unchanged while still benefiting from the computational power of the cloud. Facebook is now investigating a data warehousing layer that rides in the cloud and is capable of making decisions and courses of action based on millions of inputs.

The same capabilities inherent in the Facebook Data Cloud are available to other organizations in their own Data Cloud. Cloud computing is a transformative force addressing size, speed, and scale, with a low cost of entry and very high potential benefits. The first step toward exploring the power of cloud computing is to understand the taxonomic differences between Data Clouds and Utility Clouds.

Data Cloud Design Features

The Data Cloud model offers a transformational effect to Intelligence Analysis. Unlike current data warehousing models, the Data Cloud design begins by assuming an organization needs to rapidly store and process massive amounts of chaotic data that is spread across the enterprise, distressed by differing time sequences, and burdened with noise. Several key attributes of the Data Cloud design pattern are specifically remarkable for Intelligence Analysis and are discussed in the following sections.

Distributed Highly Fault-Tolerant Massive Storage

The foundation of Data Cloud computing is the ability to reliably store and process petabytes of data using non-specialized hardware and networking. In practice, this form of storage has required a new way of thinking about data storage. The Google File System (GFS) and Hadoop Distributed File System (HDFS) are two examples of proven approaches to creating distributed highly fault-tolerant massive storage systems. Several attributes of highly fault-tolerant massive storage systems are key innovations in the Data Cloud design pattern: [4]

• Is reliable, allowing distributed storage and replication of bytes across networks and hardware assumed to fail at any time
• Allows for massive, world-scale storage that separates metadata from data
• Supports a write-once, sporadic append, read-many usage structure
• Stores very large files, often each greater than 1 terabyte in size
• Allows compute cycles to be easily moved to the data store, instead of moving data to a processor farm.

These attributes are especially important for Intelligence Analysis. First, a large-scale distributed storage system, which leverages readily available commercial hardware, offers a method of replication across many physical sites for data redundancy. Vast amounts of data can be reliably stored across a distributed system without costly storage-network arrays, reducing the risk of storing mission-critical data in one central location. Second, because massive scale is achieved horizontally (by adding hardware), the ability to gauge and predict data growth rates across the enterprise becomes dramatically easier by provisioning hardware at the leading edge of a Data Cloud rather than in independent storage silos for different projects.

[4] Ghemawat, Sanjay; Gobioff, Howard; and Leung, Shun-Tak. “The Google File System.” Appeared in 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003. http://labs.google.com/papers/gfs.html.
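The storage attributes listed above (replication across hardware assumed to fail, metadata kept apart from data, write-once usage) can be made concrete with a toy in-memory model. This is a sketch under stated assumptions, not how GFS or HDFS are implemented: `BlockStore`, its methods, the node names, the replica count of three, and the hash-based placement rule are all hypothetical and chosen only for illustration.

```python
import hashlib


class BlockStore:
    """Toy write-once store that keeps several copies of each block on different nodes."""

    def __init__(self, node_names, replicas=3):
        if replicas > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.alive = set(node_names)
        self.replicas = replicas
        self.metadata = {}  # block_id -> nodes holding a copy, kept apart from the data

    def put(self, block_id, data):
        if block_id in self.metadata:
            raise ValueError("write-once: block already exists")
        names = sorted(self.nodes)
        # Illustrative placement rule: hash the id, then take consecutive nodes.
        start = int(hashlib.sha256(block_id.encode()).hexdigest(), 16) % len(names)
        holders = [names[(start + i) % len(names)] for i in range(self.replicas)]
        for name in holders:
            self.nodes[name][block_id] = data
        self.metadata[block_id] = holders

    def fail(self, name):
        """Simulate the loss of one node."""
        self.alive.discard(name)

    def get(self, block_id):
        # Read from the first replica that is still reachable.
        for name in self.metadata[block_id]:
            if name in self.alive:
                return self.nodes[name][block_id]
        raise RuntimeError("all replicas lost")


store = BlockStore(["n1", "n2", "n3", "n4", "n5"])
store.put("imagery-0001", b"raw bytes")
first, second, _third = store.metadata["imagery-0001"]
store.fail(first)
store.fail(second)
print(store.get("imagery-0001"))  # still readable from the remaining replica
```

With three copies, the block stays readable after two of its holders fail, which is the sense in which commodity hardware "assumed to fail" can still provide reliable storage. Real systems also re-replicate lost copies and spread replicas across racks or sites; the sketch leaves that out.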
  6. Highly Scalable Multi-Dimensional Databases

The distributed highly fault-tolerant massive storage system lacks the ability to fully represent structured data, providing only the raw data storage required in the Data Cloud. Moving beyond raw data storage to representing structured data requires a highly scalable database system. Traditional relational database systems are the ubiquitous design solution for structured data storage in conventional enterprise applications. Many relational database systems support multi-terabyte tables, relationships, and complex Structured Query Language (SQL) engines, which provide reliable, well-understood data schemas. It is very important to understand that relational database systems are the best solution for many types of problems, especially when data is highly structured and volume is less than 10 terabytes. However, a new class of problem is emerging when dealing with big data of volumes greater than 10 terabytes. Although relational database models are capable of running in a Data Cloud, many current relational database systems fail in the Data Cloud in two important ways. First, many relational database systems cannot scale to support petabytes or greater amounts of data storage, or the database requires highly specialized components to accomplish ever-diminishing increases in scale. Second, and critically important to Intelligence Analysis, is the impedance mismatch that occurs as complex data is normalized into a relational table format. When data is collected, often the first step is to transform the data, normalize the data, and insert a row into a relational database. Next, users query data based on keywords or pre-loaded search queries and wait for the results to return. Once returned, users sift through results.

Multi-dimensional databases offer a fundamentally different model. The distributed and highly scalable design of multi-dimensional databases means data is stored and searched as billions of rows with billions of columns across hundreds of servers. Even more interesting, these forms of databases do not require schemas or predefined definitions of their structure. When new data arrives in an unfamiliar format, it can be inserted into the database with little or no modification. The database is designed to hold many unique forms of data, allowing a query-time schema to be generated when data is retrieved, instead of an a priori schema defined when the database is first installed. This is extremely valuable because as new data arrives in a format not seen before, instead of lengthy modifications to the database schema or complicated parsing to force the data into a relational model, the data can simply be stored in the native format. By leveraging a multi-dimensional database system that treats all data as bytes, scales to massive sizes, and allows users to ask questions of the bytes at query time, organizations have a new choice. Multi-dimensional databases are in stark contrast to today’s highly normalized relational data model, which includes schemas, relationships, and pre-determined storage formats stored on one or several clustered database servers.

Moreover, the power of multi-dimensional databases allows computations to be pre-computed. For instance, several large Internet content providers mine millions of data points every hour to pre-compute the best advertisements to show users the next time they visit one of the provider’s sites. Instead of pulling a random advertisement or performing a search when the page is requested, these providers pre-compute the best advertisements to show everyone, at any time, on any of their sites based on historical data, advertisement click rates, probability of clicking, and revenue models. Similarly, a large US auto insurance firm leverages multi-dimensional databases to pre-compute the insurance quote for every car in the United States each night. If a caller asks for a quote, the firm can instantly tell the caller a price because it was already computed overnight based on many input sources. [5] The highly scalable nature of such databases allows for massive computational power.

There are some major analytic implications here. First, a computing framework such as a Data Cloud that, by design, can scale to handle petabytes of mission data and perform intensive computation across all the data, all the time means new insights and discoveries are now possible by looking at big data in

[5] Baker, Stephen. “The Two Flavors of Google.” Businessweek.com, December 13, 2007. http://www.businessweek.com/magazine/content/07_52/b4064000281756.htm.
  7. many different ways. Consider the parallels between Intelligence Analysis and the massive correlation/prediction engines at eHarmony® and Netflix®, which leverage regression algorithms across massive data sets to compute personality, character traits, hidden relationships, and group tendencies. These data-driven analytic systems leverage the attributes of Data Clouds but target specialized behavioral analysis to advance their particular business. Hulu®, a popular online television website, uses the Data Cloud to process many permutations of log files about the shows people are watching. [5]

Massively Parallel Compute (Analytic Algorithms)

Parallel computing is a well-adopted technology seen in processor cores and software thread-based parallelism. However, massively parallel processing (leveraging thousands of networked commodity servers constrained only by bandwidth) is now the emerging context for the Data Cloud.

If distributed file systems, such as GFS and HDFS, and column-oriented databases are employed to store massive volumes of data, there is then a need to analyze and process this data in an intelligent fashion. In the past, writing parallel code required highly trained developers, complex job coordination, and locking services to ensure nodes did not overwrite each other. Often, each parallel system would develop unique solutions for each of these problems. These and other complexities inhibited the broad adoption of massively parallel processing, meaning that building and supporting the required hardware and software was reserved for dedicated systems.

MapReduce, a framework pioneered by Google, has overcome many of these previous barriers and allows for data-intensive computing while abstracting the details of the Data Cloud away from the developer. This ability allows analysts and developers to quickly create many different parallelized analytic algorithms that leverage the capabilities of the Data Cloud. Consequently, the same MapReduce job crafted to run on a single node can as easily run on a group of 1,000 nodes, bringing extensive analytic processing capabilities to users in the enterprise.

Working in tandem with the distributed file system and the multi-dimensional database, the MapReduce framework leverages a master node to divide large jobs into smaller tasks for worker nodes to process. The framework, capable of running on thousands of machines, attempts to maintain a high level of affinity between data and processing, which means the framework intelligently moves the processing close to the data to minimize bandwidth needs. Moving the compute job to the data is easier than moving large amounts of data to a central bank of processors. Moreover, the framework handles slow and failed workers, noticing when a worker in the cloud is taking a long time on one of these tasks or has failed altogether and automatically tasking another node with completing the same task. All these details and abstractions are built into the framework. Developers are able to focus on the analytic value of the jobs they create and no longer worry about the specialized complexities of massively parallel computing. An intelligence analyst is able to write 10 to 20 lines of computer code, and the MapReduce framework will convert it into a massively parallel search, working against petabytes of data across thousands of machines, without requiring the analyst to know or understand any of these technical details.

Tasks such as sorting, data mining, image manipulation, social network analysis, inverted index construction, and machine learning are prime jobs for MapReduce. In another scenario, assume that terabytes of aerial imagery have been collected for intelligence purposes. Even with an algorithm available to detect tanks, planes, or missile silos, the task of finding these weapons could take days if run in a conventional manner. Processing 100 terabytes of imagery on a standard computer takes 11 days. Processing the same amount of data on 1,000 standard computers takes 15 minutes. By incorporating MapReduce, each image or part of an image becomes its own task and can be examined in a parallel manner. Distributing this work to many

[5] Baker, Stephen. “The Two Flavors of Google.” Businessweek.com, December 13, 2007. http://www.businessweek.com/magazine/content/07_52/b4064000281756.htm.
  8. computers drastically cuts the time for this job and other large tasks, scales the performance linearly by adding commodity hardware, and ensures reliability through data and task replication.

Programmatic Models for Scaling in the Data Cloud

Building applications and the architectures that run in the Data Cloud requires new thinking about scale, elasticity, and resilience. Cloud application architectures follow two key tenets: (1) only use computing resources when needed (elasticity) and (2) survive drastically changing data volumes (scalability). Much of this work is accomplished by designing cloud applications to scale dynamically by processing asynchronously queued events. As a result, any number of jobs can be submitted to the Data Cloud, and those jobs are persistently queued for resilience until completed. The jobs are then removed from a queue and distributed across any number of worker nodes, drawing on resources in the cloud on demand. When the work is complete, these jobs are closed and the resources returned to the cloud.

These features, working in concert, achieve scalability across computers and data volume, elasticity of resource utilization, and resilience for assured operational readiness. These forms of application architecture address the difficulties of massive data processing. The cloud abstracts the complexities of resource provisioning, error handling, parallelization, and scalability, allowing developers to trade sophistication for scale.

The following application types leverage the power of the Data Cloud: [6]

• Processing Pipelines: document or image processing, video transcoding, indexing data, and data mining
• Batch Processing Systems: log file analysis, nightly automated relationship matching, regression testing, and financial analysis
• Unpredictable Content: web portals or intelligence dissemination systems that vary widely in usage based on time of day, temporal content stores for conferences, or National Special Security Events.

Clearly, applications that can leverage the Data Cloud can take advantage of the ubiquitous infrastructure that already exists within the cloud, elastically drawing on resources as needed based on scale. Moreover, the efficient application of computing resources across the cloud and the rapid ability to decrease processing time through massive parallelization is a transformative force for many stages of Intelligence Analysis.

Conclusion

Cloud computing has the potential to transform how organizations use computing power to create a collaborative foundation of shared analytics, mission-centric operations, and IT management. Challenges to the implementation of cloud computing remain, but the new analytic capabilities of big data, ad-hoc analysis, and massively scalable analytics, combined with the security and financial advantages of switching to a cloud computing environment, are driving research across the cloud ecosystem. Cloud computing technology offers a very important approach to achieving lasting strategic advantage by rapidly adapting to complex challenges in IT management and data analytics.

[6] Varia, Jinesh. Cloud Architectures. Amazon, June 16, 2008. http://jineshvaria.s3.amazonaws.com/public/cloudarchitectures-varia.pdf.
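As a supplementary illustration of the queued-job model described under “Programmatic Models for Scaling in the Data Cloud,” the following Python sketch queues one small task per chunk of input, lets a pool of workers drain the queue, and then merges the partial results in a MapReduce-like reduce step. It is a single-machine sketch under stated assumptions: `map_task`, `run_job`, and the sample `reports` are hypothetical names and data, threads stand in for worker nodes, and the in-memory queue is not persistent, so the retry and re-tasking behavior the paper attributes to the framework is omitted.

```python
import queue
import threading
from collections import Counter


def map_task(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.split())


def run_job(chunks, workers=4):
    """Queue one task per chunk, let workers drain the queue, then reduce."""
    tasks = queue.Queue()
    for chunk in chunks:
        tasks.put(chunk)

    partials = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                chunk = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker gives its resources back
            result = map_task(chunk)
            with lock:
                partials.append(result)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    total = Counter()
    for partial in partials:  # reduce step: merge the partial counts
        total.update(partial)
    return total


# Hypothetical input: each string stands in for one chunk of a much larger data set.
reports = ["tank plane tank", "silo tank", "plane"]
totals = run_job(reports, workers=2)
print(totals.most_common())
```

The result does not depend on the number of workers, which is the property that lets the same job run on one node or many. On the scaling figures quoted earlier: 11 days is 15,840 minutes, and dividing that evenly across 1,000 machines gives about 15.8 minutes, roughly in line with the 15 minutes cited, assuming ideal linear scaling with no coordination overhead.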
  9. About Booz Allen

Booz Allen Hamilton has been at the forefront of strategy and technology consulting for nearly a century. Today, Booz Allen is a leading provider of management and technology consulting services to the US government in defense, intelligence, and civil markets, and to major corporations, institutions, and not-for-profit organizations. In the commercial sector, the firm focuses on leveraging its existing expertise for clients in the financial services, healthcare, and energy markets, and to international clients in the Middle East. Booz Allen offers clients deep functional knowledge spanning strategy and organization, engineering and operations, technology, and analytics, which it combines with specialized expertise in clients’ mission and domain areas to help solve their toughest problems.

The firm’s management consulting heritage is the basis for its unique collaborative culture and operating model, enabling Booz Allen to anticipate needs and opportunities, rapidly deploy talent and resources, and deliver enduring results. By combining a consultant’s problem-solving orientation with deep technical knowledge and strong execution, Booz Allen helps clients achieve success in their most critical missions, as evidenced by the firm’s many client relationships that span decades. Booz Allen helps shape thinking and prepare for future developments in areas of national importance, including cybersecurity, homeland security, healthcare, and information technology.

Booz Allen is headquartered in McLean, Virginia, employs more than 25,000 people, and had revenue of $5.59 billion for the 12 months ended March 31, 2011. Fortune has named Booz Allen one of its “100 Best Companies to Work For” for seven consecutive years. Working Mother has ranked the firm among its “100 Best Companies for Working Mothers” annually since 1999. More information is available at www.boozallen.com. (NYSE: BAH)

To learn more about the firm and to download digital versions of this article and other Booz Allen Hamilton publications, visit www.boozallen.com.

Contact Information:
  • Michael Farber, Senior Vice President, farber_michael@bah.com, 240-314-5671
  • Mike Cameron, Principal, cameron_mike@bah.com, 301-543-4432
  • Christopher Ellis, Principal, ellis_christopher@bah.com, 301-419-5147
  • Josh Sullivan, Ph.D., Principal, sullivan_joshua@bah.com, 301-543-4611
  10. Principal Offices

Huntsville, Alabama; Sierra Vista, Arizona; Los Angeles, California; San Diego, California; San Francisco, California; Colorado Springs, Colorado; Denver, Colorado; District of Columbia; Orlando, Florida; Pensacola, Florida; Sarasota, Florida; Tampa, Florida; Atlanta, Georgia; Honolulu, Hawaii; O’Fallon, Illinois; Indianapolis, Indiana; Leavenworth, Kansas; Aberdeen, Maryland; Annapolis Junction, Maryland; Hanover, Maryland; Lexington Park, Maryland; Linthicum, Maryland; Rockville, Maryland; Troy, Michigan; Kansas City, Missouri; Omaha, Nebraska; Red Bank, New Jersey; New York, New York; Rome, New York; Dayton, Ohio; Philadelphia, Pennsylvania; Charleston, South Carolina; Houston, Texas; San Antonio, Texas; Abu Dhabi, United Arab Emirates; Alexandria, Virginia; Arlington, Virginia; Chantilly, Virginia; Charlottesville, Virginia; Falls Church, Virginia; Herndon, Virginia; McLean, Virginia; Norfolk, Virginia; Stafford, Virginia; Seattle, Washington

The most complete, recent list of offices and their addresses and telephone numbers can be found on www.boozallen.com.

www.boozallen.com ©2011 Booz Allen Hamilton Inc. 08.250.11