2. Big Data in the News
Savings:
- American Health-Care: $300 Billion/Year
- European Public Sector: €250 Billion/Year
- Productivity Margins: 60% increase
Sources: McKinsey Global Institute
3. Topics
- What do we collect today?
- DBMS Landscape
- The Disconnect
- The Need
- What is Big Data? Characteristics, Approach
- Architectural Requirements
- Techniques, Challenges, Solutions, Issues
- Deep Dive – Practical Approaches to Big Data: Hadoop, Aster Data
- Case Studies
- Big Data @ HCL: Project Carbon, SMILe, CoE – Opportunities, Challenges
4. What do we collect?
- In 2010, people stored enough data to fill 60,000 Libraries of Congress (the LoC had collected 235 TB as of Apr 2011)
- YouTube receives 24 hours of video every minute
- 5 billion mobile phones were in use in 2010
- Tesco (British retailer) collects 1.5 billion pieces of information to adjust prices and promotions
- Amazon.com: 30% of sales come from its recommendation engine
- Planecast, Mobclix: track-and-target systems that deliver contextual promotions
- A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
Sources: Forrester, The Economist, McKinsey Global Institute
5. Collect More
- Business Operations: Transactions, Registers, Gateways
- Customer Information: CRM
- Product Information: Barcodes, RFID
- Web Pages: Web Repositories
- Unstructured Information: Social Media
- Signals: Mobile, GPS, GeoSpatial
6. DBMS Solutions
Legacy:
- Faster Retrieval
- Efficient Storage: Divide and Access
- Data Consolidation: Broader Tables, Access all as a row
- Fine-Grain Access Security: Rules and Policies
Problems:
- Data Growth, even when storage cost is not an issue
- Scalability Issues
- Performance Issues
- New types of requirements: deciding what to analyze, when, and how
- Cost of a change in the subject area to analyze
7. The Disconnect
- Old DBMS vs. New Data Types/Structures
- Old DBMS vs. New Volume
- Old DBMS vs. New Analysis
- Old DBMS vs. Data Retention
- Old DBMS vs. Data Element Striping
- Old DBMS vs. Data Infrastructure
- Old DBMS vs. One DB Platform for All
8. The Need
A system that can handle high-volume data and perform complex operations, and that is:
- Scalable
- Robust
- Highly Available
- Fault Tolerant
- Economical
In short: a new approach.
9. Big Data
“Tools and techniques to manage different types of data, in high volume, at high velocity, with varied requirements to mine them”
Characteristics:
- Size: scale up and scale out (Terabyte, Petabyte, …)
- Structure: Structured; Unstructured (Audio, Video, Text, GeoSpatial); Schema-less Structures
- Stream: a torrent of real-time information
- Operation: Massively Parallel Processing (MPP)
10. Approach
Hardware:
- Commodity Hardware, Appliance
- Dynamic Scaling
- Fault Tolerant, Highly Available
- No constraints on Storage
Cloud:
- Virtual Environment, Storage
Processing Models:
- In-memory, In-database
- Interfaces/Adapters
- Workload Management
- Distributed Data Processing
Software:
- Frameworks: Hadoop, MapReduce, Vrije, BOOM, Bloom
- Open Source and Proprietary
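The distributed-data-processing idea above follows one pattern regardless of framework: partition the data, aggregate each partition on a separate worker, then combine the partial results. A minimal sketch, with a Python thread pool standing in for a cluster of machines (`partial_sum` and `distributed_sum` are illustrative names, not any framework's API):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Worker step: aggregate one partition (on a real cluster, one node)."""
    return sum(chunk)

def distributed_sum(data, workers=4):
    """Split the data into one partition per worker, aggregate the
    partitions in parallel, then combine the partial results."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

total = distributed_sum(list(range(1000)))
# total == sum(range(1000)) == 499500
```

The combine step works because summation is associative; MPP systems exploit exactly this property to push work out to the nodes holding the data.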
11. Architectural Requirements
- Integration Framework
- Development Framework
- Management Framework
- Modeling Framework
- Processing Framework
- Data Management Framework
12. Challenges
- Volumetric Analysis
- Complexity
- Streaming Data/Real-Time Data
- Network Topology
- Infrastructure
- Pattern-Based Strategy
17. Hadoop
- Top-level Apache project
- Open-source software framework (Java)
- Inspired by Google’s white papers on Map/Reduce (MR), the Google File System (GFS), and BigTable
- Originally developed to support Apache Nutch
- Designed for large-scale data processing: batch processing, sophisticated analysis, and both structured and unstructured data
- A DB architect’s Hadoop: "Heck Another Darn Obscure Open-source Project"
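The Map/Reduce model the slide refers to has three steps: a map function emits (key, value) pairs, a shuffle groups the values by key, and a reduce function folds each group into a result. A single-process word-count sketch of the idea (not the real Hadoop Java API, where these run distributed across a cluster):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's list of values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data about data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["data"] == 3, counts["big"] == 2
```

Because map runs independently per document and reduce independently per key, both phases parallelize trivially, which is what lets Hadoop scale this pattern to petabytes.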
18. Why Hadoop?
- Runs on commodity hardware
- Portability across heterogeneous hardware and software platforms
- Shared-nothing architecture: scale hardware whenever you want; the system compensates for hardware scaling and issues (if any)
- Runs large-scale, high-volume data processes
- Scales well with complex analysis jobs
- (Hardware) “Failure is an option”
- Ideal to consolidate data from both new and legacy data sources
- Highly integrable
- Value to the business
19. Hadoop Ecosystem
- HDFS: Hadoop Distributed File System
- Map/Reduce: software framework for clustered, distributed data processing
- ZooKeeper: distributed coordination service
- Avro: data serialization
- Chukwa: data collection system to monitor distributed systems
- HBase: data storage for distributed large tables
- Hive: data warehouse
- Pig: high-level query language
- Scribe: log collection
- UDF: User-Defined Functions
20. Hadoop Flow (Example)
[Flow diagram] Web Servers → Scribe → Network Storage → Hadoop / Hive DWH → MySQL / Oracle → Apps, Feeds
21. HDFS (Hadoop Distributed File System)
- Master/Slave architecture
- Runs on commodity hardware
- Fault tolerant
- Handles large volumes of data
- Provides high throughput
- Streaming data access
- Simple file-coherency model
- Portable to heterogeneous hardware and software
- Robust: handles disk failures, replication (and re-replication); performs cluster rebalancing and data-integrity checks
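The replication and re-replication behaviour listed above can be sketched as a toy model: every block is copied to N distinct nodes, and when a node is lost the master re-copies that node's blocks elsewhere until the replication factor is restored. This is only an illustration of the idea, not the actual HDFS NameNode logic (which also considers rack topology):

```python
class ToyNameNode:
    """Toy model of HDFS-style block replication (not the real NameNode)."""

    def __init__(self, nodes, replication=3):
        self.nodes = set(nodes)
        self.replication = replication
        self.block_map = {}  # block id -> set of nodes holding a replica

    def store(self, block):
        """Place `replication` copies of a block on distinct nodes."""
        self.block_map[block] = set(sorted(self.nodes)[:self.replication])

    def node_failed(self, node):
        """Drop a node, then re-replicate any under-replicated blocks."""
        self.nodes.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            spares = sorted(self.nodes - holders)
            while len(holders) < self.replication and spares:
                holders.add(spares.pop(0))

nn = ToyNameNode(["n1", "n2", "n3", "n4"], replication=3)
nn.store("blk_1")          # replicas land on n1, n2, n3
nn.node_failed("n1")       # blk_1 is re-replicated onto n4
# nn.block_map["blk_1"] == {"n2", "n3", "n4"}
```

This is why "failure is an option" on the previous slide: losing a disk or node triggers re-replication rather than data loss, as long as enough healthy nodes remain.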
31. Aster Data
- Now part of Teradata
- Massively parallel SQL layer on MR (MapReduce)
- In-database analytics
- Appliance vs. software-stack model
- Cloud options
- nPath and statistical options
- Data integration
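Aster's nPath feature matches ordered patterns of events inside SQL (e.g. "search, then add-to-cart, then checkout" within a user's clickstream). A rough Python analogue of that idea, with hypothetical event names and a plain ordered-subsequence check rather than Aster's actual nPath syntax:

```python
def matches_path(events, pattern):
    """True if `pattern` occurs as an ordered subsequence of `events`
    (a rough analogue of an nPath pattern like A -> B -> C)."""
    it = iter(events)
    return all(step in it for step in pattern)

# Hypothetical clickstreams, one ordered event list per user.
clicks = {
    "u1": ["home", "search", "product", "cart", "checkout"],
    "u2": ["home", "product", "home"],
}
pattern = ["search", "cart", "checkout"]
converted = [user for user, events in clicks.items()
             if matches_path(events, pattern)]
# converted == ["u1"]
```

The `step in it` trick works because membership tests consume the iterator, so each pattern step can only match events after the previous step's match, preserving order.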
34. Aster nCluster @ Mobclix
[Architecture diagram] Smart-phone apps (Mobclix App Library) feed events, ads, and context through the App-Ad Fabric and Ad Exchange into Aster nCluster instances on Amazon EC2 for relevance/recommendation, targeting, BI, and analytics, at roughly 1-2 TB/hour. (Diagram IP of Mobclix.)
35. MYNA @ Yahoo
[Architecture diagram] Yahoo Front Page: ~12,000 web servers and ~3,500 app servers emit 12-15 TB/hour of logs; the Data Highway cuts, cleans, sprays, and pushes them into the MYNA fabric (access, collect, process, query) at 3-4 TB/hour, feeding the Atlantic DW. (Diagram IP of Yahoo Inc.)
36. Front Page Data Mart
[Architecture diagram] Web servers and app servers emit web logs (views, clicks, actions, registry) and ad/targeting events onto the Data Highway; property events, context, user, and ad data feed the FP and FY data marts and BI. (Diagram IP of Yahoo Inc.)
37. Big Data @ HCL
- Project Carbon: time-series analytics, HDF5 data
- SMILe: Hadoop framework; Buzzmetrics (Nielsen) 'sentiment' data
- CoE Opportunities: DW/BI expertise, grow talent organically, POCs, case studies
- Challenges: new technology, talent in parts, investments, evangelizing
38. Thank You
"You either scale to where your customer base takes you or you die"
(Jim Starkey, Founder and CTO, NimbusDB)
"Our philosophy is to build infrastructure using the best tools available for the job and we are constantly evaluating better ways to do things when and where it matters."
(Facebook)
"In any year we probably generate more data than the Walt Disney Co. did in the first 80 years of existence"
(Bud Albers, Disney)