How to Crunch Petabytes with
  Hadoop and Big Data using
  InfoSphere BigInsights and
  Streams


                  Tom Deutsch, IBM




Vladimir Bacvanski, Founder, SciSpike
vladimir.bacvanski@scispike.com
Stephen Brodsky, Technical Executive and Distinguished Engineer, IBM
sbrodsky@us.ibm.com
August 24, 2011                                    © 2011 IBM Corporation & SciSpike
Who are we?

 Dr. Vladimir Bacvanski
     – Consultant, trainer, and mentor focusing on making clients successful in
       adopting new data and software approaches
     – Over 20 years of experience
     – Founder of SciSpike – a training and consulting firm specializing in
       advanced software and data technologies


 Stephen Brodsky, Ph.D.
     – Distinguished Engineer and Technical Executive for IBM Big Data
       initiatives at the IBM Silicon Valley Laboratory
     – Previously led the architecture for the Optim Data Studio product line
       and pureQuery and was a member of the architecture team for DB2
       pureXML, Rational Application Developer (RAD), and WebSphere.



Agenda

  The “Big Data” challenge: smarter analytics for a smarter planet

  How to do it?
     – The big data challenge
     – Foundations of Big Data approaches
     – MapReduce and Hadoop
     – Real-time data and stream processing
     – Integration with existing systems




The “Big Data” Challenge




The World is Changing and Becoming More…


      INSTRUMENTED



      INTERCONNECTED



      INTELLIGENT



      The resulting explosion of information creates a need for
      a new kind of intelligence

              …to help build a Smarter Planet
Information is Growing at a Phenomenal Rate . . . .


      44x as much data and content over the coming decade
      80% of the world’s data is unstructured

      2009: 800,000 petabytes
      2020: 35 zettabytes (35 billion terabytes)

                    Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
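The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal (SI) storage units, which the slide does not state, and the variable names are illustrative only.

```python
# Cross-check the IDC growth figures, assuming decimal (SI) storage units.
TB_PER_PB = 1_000          # terabytes per petabyte
PB_PER_ZB = 1_000_000      # petabytes per zettabyte (PB -> EB -> ZB)

data_2009_pb = 800_000             # 2009 figure from the slide, in petabytes
data_2020_pb = 35 * PB_PER_ZB      # 35 zettabytes expressed in petabytes

growth_factor = data_2020_pb / data_2009_pb
data_2020_tb = data_2020_pb * TB_PER_PB

print(f"Growth factor 2009 -> 2020: {growth_factor:.2f}x")
print(f"2020 volume in terabytes: {data_2020_tb:,}")
```

Under these assumptions the ratio comes out at 43.75, which the slide rounds to 44x, and 35 zettabytes is indeed 35 billion terabytes.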
The BIG Data Challenge
    • Manage and benefit from massive and growing amounts of data
    • Handle varied data formats (structured, unstructured, semi-structured) and
      increased data velocity
    • Exploit BIG Data in a timely and cost effective fashion


    [Diagram: Collect, Manage, Integrate, Analyze]
What clients are saying . . .

  Lots of potentially valuable data is dormant or discarded due to size/performance considerations

  Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)

  Not clear what should be analyzed (exploratory, iterative)

  Information distributed across multiple systems and/or Internet

  Some information has a short useful lifespan

  Volumes can be extremely high

  Analysis needed in the context of existing information (not stand-alone)
Big Data Presents Big Opportunities
        Extract insight from a high volume, variety and velocity of data
        in a timely and cost-effective manner


                                   Variety: Manage and benefit from
                                            diverse data types and data
                                            structures

                                   Velocity: Analyze streaming data and
                                             large volumes of persistent
                                             data

                                   Volume: Scale from terabytes to
                                           zettabytes
Streams and Oceans of Information . . . .




     Information streams
 High-speed information flowing in real time, often transient
    • Information from sensors, instruments, etc.
    • Information flowing from real-time logs and activity monitors
    • Streaming content like audio and video
    • High-speed transactions like tickers, trades, or traffic systems

     Information oceans
 Information stored outside conventional systems. Data may originate from the Web or different internal systems
    • Collection of what has streamed
    • Information from social media, logs, click streams, emails, etc.
    • Unstructured or mixed schema documents like claims, forms, desktop applications, etc.
    • Structured data from disparate systems
Applications for Big Data Analytics

      Smarter Healthcare   Multi-channel sales    Finance




      Homeland security      Traffic Control     Telecom




        Manufacturing       Trading Analytics


                                                 Many more!




Use Case Example: Energy Company
     Business scenario
          Analyze large volumes of public and private weather data for alternative energy business
          Existing high-performance computing hardware, limited staff

     Technical challenges
          High data volume: 2+ PB
          Range of query types
           - Avg temp in given location? (Small result)
           - Geo pts where ice may form on wind turbines? (Large result, derived values – icing determined by humidity + temp.)
          Run on system with non-Hadoop apps
Use Case Example: Global Media Firm
     Business scenario
          Identify unauthorized content
         streaming in digital media (piracy)
           - Quantify annual revenue loss
           - Analyze trends
          Monitor social media sites to identify
         dissemination of pirated content. Time
         sensitive!

     Technical challenges

      High variety of unstructured and semi-structured data.

      Initial focus: text analytics over 1 year’s
     worth of social media data. Look for live
     streaming URLs, sentiment, event info, etc.

      Complex rules to qualify & classify info

      Future potential for video analysis

IBM Watson




IBM Watson is a breakthrough in analytic innovation, but it is only successful
     because of the quality of the information from which it is working.


Big Data and Watson
  Big Data technology is used to build Watson’s knowledge base
      Watson uses the Apache Hadoop open framework to distribute the workload for
      loading information into memory.
      Approx. 200M pages of text (to compete on Jeopardy!)
      [Diagram label: Watson’s Memory]

  Watson technology offers great potential for advanced business analytics
      [Diagram labels: CRM Data; POS Data; Social Media; InfoSphere BigInsights; Advanced search and analysis; Distilled Insight (spending habits, social relationships, buying trends)]
Customer Engagements
Use patterns
• Customer sentiment analysis (cross-sell, up-sell, campaign management)
• Integrated retail and web customer behavior modeling
• Predictive modeling (credit card fraud)
• System log analytics (reduce operational risk)

Common requirements
• Extract business insight from large volumes of raw data (often outside operational systems)
• Integrate with other existing software
• Ready for enterprise use

[Diagram labels: Operational system & streams data sources (Text, Blog, Weblog; Click streams; Log & transactions; Biological Sequences); Consumer Insight; Multi-channel sales; Next Gen Fraud Models; New Business Development; Text Analytics; Statistical Model Building]
The approach to crunching big data
How to approach Big Data analytics?
  InfoSphere BigInsights and InfoSphere Streams
      • Analytics for data in-motion and at-rest
      • Platform for processing large volumes of diverse data
      • Complements and integrates with existing software solutions




Addressing the Key Requirements
  1. Platform for V3 – Variety, Velocity, Volume
         Variety - manage data & content “As Is”
         Handle any velocity - low-latency streams and large volume batch
         Volume - huge volumes of at-rest or streaming data
  2. Analytics for V3
         Analyze Sources in their native format - text, data, rich content
         Analyze all of the data - not just a subset
         Dynamic analytics - automatic adjustments and actions

  3. Ease of Use for Developers and Users
         Developer UIs, common languages & automatic optimization
         End-user UIs & visualization

  4. Enterprise Class
         Failure tolerance, Security and Privacy
         Scale Economically

  5. Extensive Integration Capabilities
         Integrate wide variety of sources
         Leverage enterprise integration technologies




Big Data Initiative

[Diagram labels: Volumes of diverse, persistent data; InfoSphere BigInsights; Warehouse; InfoSphere Streams; Real-time streaming data; Analytic applications for “Big Data”; Traditional warehouse applications]
BigInsights Summary


  BigInsights = analytical platform for persistent “Big Data”
      – Based on open source & IBM technologies

  Distinguishing characteristics
      – Built-in analytics . . . . Enhances business knowledge
      – Enterprise software integration . . . . Complements and extends
        existing capabilities
      – Production-ready platform . . . . Speeds time-to-value; simplifies
        development and maintenance




Big Data Platform Vision
                     Bringing Big Data to the Enterprise

[Diagram labels: Big Data Solutions; Big Data User Environments (Developers, End Users, Administrators); Big Data Enterprise Engines (Streaming Analytics, Internet Scale Analytics); Open Source Foundational Components; Agents; Integration with Data Warehouse, Information Integration, Master Data Mgmt, Database, Content Analytics, Business Analytics, Marketing, Data Growth Management]
InfoSphere BigInsights v 1.1
  Platform for volume, variety, velocity -- V3
   Hadoop foundation
  Analytics for V3
   Text analytics & tooling
  Usability
   Web administrative console
   Integrated install
   Spreadsheet-style analytic tool
  Enterprise Class
   Storage, security, cluster management
  Integration
   Connectivity to DB2, Netezza

[Diagram: editions plotted by enterprise class against breadth of capabilities. Basic Edition (free download); Enterprise Edition (licensed). Labels: Apache Hadoop; 24 x 7 Web support; Web admin console, LDAP authentication; RDBMS, warehouse connectivity; Text analytics; Spreadsheet-style analytic tool; Flexible job scheduler]
BigInsights Platform: Key Ideas

  Flexible, enterprise-class support for processing large
   volumes of data
      – Based on Google’s MapReduce technology
      – Inspired by Apache Hadoop; compatible with its ecosystem and
        distribution
      – Well-suited to batch-oriented, read-intensive applications
      – Supports wide variety of data

  Enables applications to work with thousands of nodes and
   petabytes of data in a highly parallel, cost effective manner
      – CPU + disks = “node”
      – Nodes can be combined into clusters
      – New nodes can be added as needed without changing
        • Data formats
        • How data is loaded
        • How jobs are written

The MapReduce Programming Model
 "Map" step:
       – Input split into pieces

       – Worker nodes process individual pieces in parallel (under
         global control of the Job Tracker node)

       – Each worker node stores its result in its local file system
         where a reducer is able to access it

 "Reduce" step:
       – Data is aggregated (“reduced” from the map steps) by
         worker nodes (under control of the Job Tracker)

       – Multiple reduce tasks can parallelize the aggregation
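The map and reduce steps above can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the function names, the CRC32-based key partitioning, and the temperature readings are all assumptions made for the example. It shows how intermediate pairs are routed so that several reduce tasks can each aggregate a disjoint set of keys.

```python
from collections import defaultdict
import zlib

def map_phase(splits, mapper):
    """Run the mapper over each input split; each split keeps its own local output."""
    return [[pair for record in split for pair in mapper(record)] for split in splits]

def partition(map_outputs, num_reducers):
    """Route every intermediate (key, value) pair to one reduce task by hashing the key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for local_output in map_outputs:
        for key, value in local_output:
            # crc32 is stable across runs, unlike Python's built-in str hash
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions

def reduce_phase(partitions, reducer):
    """Each reduce task aggregates only the keys that were routed to it."""
    results = {}
    for part in partitions:
        for key, values in part.items():
            results[key] = reducer(key, values)
    return results

# Hypothetical input: (location, temperature) readings, already split into two pieces.
splits = [
    [("Oslo", 3.0), ("Lima", 19.0)],
    [("Oslo", 5.0), ("Lima", 21.0), ("Oslo", 4.0)],
]

def mapper(record):
    location, temperature = record
    yield location, temperature

def reducer(location, temperatures):
    return sum(temperatures) / len(temperatures)

averages = reduce_phase(partition(map_phase(splits, mapper), num_reducers=2), reducer)
print(averages)
```

In a real cluster the splits and partitions would live on different worker nodes and run in parallel; here they are ordinary lists processed one after another.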
What is Hadoop?

 Apache Hadoop = free, open source framework for data-
  intensive applications
      – Inspired by Google technologies (MapReduce, GFS)
      – Well-suited to batch-oriented, read-intensive applications
      – Originally built to address scalability problems of Nutch, an open source
        Web search technology

 Enables applications to work with thousands of nodes and
  petabytes of data in a highly parallel, cost effective manner
      – CPU + disks of commodity box = Hadoop “node”
      – Boxes can be combined into clusters
      – New nodes can be added as needed without changing
         • Data formats
         • How data is loaded
         • How jobs are written


Two Key Aspects of Hadoop


  MapReduce framework
      – How Hadoop understands and assigns work to the nodes
        (machines)

  Hadoop Distributed File System = HDFS
      – Where Hadoop stores data
      – A file system that spans all the nodes in a Hadoop cluster
      – It links together the file systems on many local nodes to
        make them into one big file system




Logical MapReduce Example: Word Count
 map(String key, String value):
 // key: document name
 // value: document contents
 for each word w in value:
   EmitIntermediate(w, "1");

 reduce(String key, Iterator values):
 // key: a word
 // values: a list of counts
 int result = 0;
 for each v in values:
   result += ParseInt(v);
 Emit(AsString(result));

 Content of Input Documents
   Hello World Bye World
   Hello IBM

 Map 1 emits:            < Hello, 1>  < World, 1>  < Bye, 1>  < World, 1>
 Map 2 emits:            < Hello, 1>  < IBM, 1>
 Reduce (final output):  < Bye, 1>  < IBM, 1>  < Hello, 2>  < World, 2>
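For readers who want to run the word count, here is a minimal Python rendering of the pseudocode, using the slide's two input documents. It is a single-process sketch, not Hadoop API code: the document names `doc1` and `doc2` are made up, counts are integers rather than the strings used in the pseudocode, and a dictionary stands in for the shuffle between map and reduce.

```python
from collections import defaultdict

def map_words(doc_name, contents):
    """Emit (word, 1) for every word in the document, as in the slide's map()."""
    for word in contents.split():
        yield word, 1

def reduce_counts(word, counts):
    """Sum the counts collected for one word, as in the slide's reduce()."""
    return sum(counts)

def word_count(documents):
    grouped = defaultdict(list)      # stands in for the shuffle between map and reduce
    for doc_name, contents in documents.items():
        for word, count in map_words(doc_name, contents):
            grouped[word].append(count)
    return {word: reduce_counts(word, counts) for word, counts in sorted(grouped.items())}

documents = {
    "doc1": "Hello World Bye World",   # hypothetical document names
    "doc2": "Hello IBM",
}
print(word_count(documents))
# {'Bye': 1, 'Hello': 2, 'IBM': 1, 'World': 2}
```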
How To Create MapReduce Jobs

 MapReduce development in Java
  – Low level, very flexible
  – Time consuming development

 Hive
  – Open source language / Apache sub-project
  – Provides a SQL-like interface to Hadoop

 Pig
  – Data flow language / Apache sub-project

 Jaql
  – A query language for JSON
  – Useful for loosely structured data
Management Tools: Web Console
  Graphically manage cluster, jobs, HDFS
  Sample administration tasks
      – Start/Stop Servers
      – Add/Remove Servers
      – Server Status Details (Log)




Spreadsheet-like Analysis Tool: BigSheets
 Web-based analysis and visualization tool

 Spreadsheet-like interface
      – Define and manage long running data collection jobs

      – Analyze content of the text on the pages that have been retrieved
Text Analytics
 • Distill structured info from unstructured data
    • Sentiment analysis
    • Consumer behavior
    • Illegal or suspicious activities
    • ...
 • Pre-built library of text annotators for common business entities
 • Rich language and tooling to build custom annotators
 • Support for Western languages (English, Dutch/Flemish, French, German, Italian,
       Portuguese, or Spanish) plus select Asian languages (Japanese, Chinese)

 Annotator types listed on the slide:
    "Acquisition", "Address", "Alliance", "AnalystEarningsEstimate", "City",
    "CompanyEarningsAnnouncement", "CompanyEarningsGuidance", "Continent", "Country",
    "County", "DateTime", "EmailAddress", "JointVenture", "Location", "Merger",
    "NotesEmailAddress", "Organization", "Person", "PhoneNumber", "StateOrProvince",
    "URL", "ZipCode"
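To make the idea of an annotator concrete, the sketch below pulls a few of the entity types named on the slide out of free text with regular expressions. It is not the BigInsights annotator library or its rule language: the patterns are deliberately simplified assumptions, and the sample sentence is invented.

```python
import re

# Simplified, illustrative patterns; production annotators need far more care.
ANNOTATORS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s,;]+"),
    "ZipCode": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def annotate(text):
    """Return structured records (type, matched text, offsets) found in free text."""
    annotations = []
    for entity_type, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            annotations.append({
                "type": entity_type,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(annotations, key=lambda a: a["start"])

# Invented sample text for the illustration.
sample = "Contact jane.doe@example.com or see http://example.com/offers, San Jose, CA 95141."
for annotation in annotate(sample):
    print(annotation["type"], "->", annotation["text"])
```

Each match becomes a structured record with a type and character offsets, which is the sense in which annotators distill structured information from unstructured text.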


Eclipse-based Text Analytics Development
So What Does This Result In?


   Easy To Scale

   Fault Tolerant and Self-Healing

   Data Agnostic

   Extremely Flexible




Working with streaming data: a new paradigm

  Conventional processing: static data


         Queries           Data     Results




  Real-time processing: streaming data



          Data            Queries    Results
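The contrast between the two diagrams can be sketched in a few lines of Python. In the conventional case a query is run over data at rest; in the streaming case the query stays in place and each arriving item flows through it. The function names, the price values, and the threshold are assumptions made for the illustration, not InfoSphere Streams code.

```python
def query_static(stored_data, predicate):
    """Conventional: the data is at rest; a query runs over it and returns results."""
    return [row for row in stored_data if predicate(row)]

def query_streaming(stream, predicate):
    """Streaming: the query is standing; each arriving item passes through it once."""
    for item in stream:
        if predicate(item):
            yield item      # a result is produced as soon as the item arrives

def ticker():
    """Hypothetical source of trade prices arriving one at a time."""
    for price in [101.5, 99.0, 103.2, 98.4, 104.1]:
        yield price

def above_100(price):
    return price > 100

print(query_static([101.5, 99.0, 103.2], above_100))
for alert in query_streaming(ticker(), above_100):
    print("alert:", alert)
```

The streaming version never holds the full data set; it only sees one item at a time, which is why it suits transient, high-rate sources.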




Real-Time Data with InfoSphere Streams
 Streaming analytic applications
  – Multiple input streams
  – Advanced streaming analytics
 Eclipse-based IDE
  – Define sources, apply operators, define intermediary and final
    output sinks
  – User defined operators in Java or C++
 Optimizing compiler automates deployment and connections
  – Extremely low latency
  – Cluster of up to 125 nodes

[Diagram labels: Source Adapters; Operator Repository; Sink Adapters; InfoSphere Streams Studio (IDE for Streams Processing Language); Automated, Optimized Deploy and Management (Scheduler)]
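The source, operator, and sink structure can be illustrated with chained Python generators. This is only a conceptual sketch, not Streams Processing Language: the operator names, the sentinel filter, and the sensor readings are assumptions, and everything runs in one process with no compiler or scheduler involved.

```python
from collections import deque

def sensor_source(readings):
    """Source adapter: turns an external feed (here a list) into a stream of tuples."""
    for sensor_id, value in readings:
        yield {"sensor": sensor_id, "value": value}

def filter_op(stream, predicate):
    """Operator: pass through only the tuples that satisfy the predicate."""
    return (t for t in stream if predicate(t))

def sliding_average_op(stream, window_size):
    """Operator: emit the average of the last `window_size` values seen."""
    window = deque(maxlen=window_size)
    for t in stream:
        window.append(t["value"])
        yield {"sensor": t["sensor"], "avg": sum(window) / len(window)}

def list_sink(stream, output):
    """Sink adapter: delivers final tuples to an external target (here a list)."""
    for t in stream:
        output.append(t)

# Hypothetical readings; -999.0 plays the role of a bad-sensor sentinel value.
readings = [("s1", 10.0), ("s1", -999.0), ("s1", 20.0), ("s1", 30.0)]
results = []
pipeline = sliding_average_op(
    filter_op(sensor_source(readings), lambda t: t["value"] > -100.0),
    window_size=2,
)
list_sink(pipeline, results)
print(results)
```

Because each stage only consumes the stage before it, operators can be swapped or reordered without touching the source or sink, which mirrors the define-sources, apply-operators, define-sinks workflow on the slide.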




Scalable stream processing
      InfoSphere Streams provides
        – A programming model and IDE for defining data sources and
          software analytic modules called operators that are fused into
          process execution units (PEs)
        – Infrastructure to support the composition of scalable stream
          processing applications from these components
        – Deployment and operation of these applications across distributed
          x86 processing nodes, when scaled processing is required
        – Stream connectivity between data sources and PEs of a stream
          processing application
Merging the Traditional and Big Data Approaches

 Traditional Approach: Structured & Repeatable Analysis
       Business Users determine what question to ask
       IT structures the data to answer that question
       Examples: Monthly sales reports, Profitability analysis, Customer surveys

 Big Data Approach: Iterative & Exploratory Analysis
       IT delivers a platform to enable creative discovery
       Business explores what questions could be asked
       Examples: Brand sentiment, Product strategy, Maximum asset utilization
BigInsights and the data warehouse: filtering and summarizing “Big Data”

[Diagram labels: BigInsights; Data warehouse]
                • Broader analytic coverage
                • Exploits IT investments while minimizing burden
BigInsights as a “queryable archive” for growing data warehouses

[Diagram labels: Data warehouse; BigInsights]
                 • Offload “cold” or dated warehouse info but
                 maintain access for further exploration
                 • Keep warehouse size manageable and focused
                 on well-known business analytic needs
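A toy illustration of the offloading idea follows, with in-memory lists standing in for the warehouse and the archive. The cutoff date, the row layout, and the function names are assumptions made for the sketch; nothing here reflects an actual BigInsights or warehouse interface.

```python
from datetime import date

def split_hot_cold(rows, cutoff):
    """Keep rows on or after the cutoff in the warehouse; offload older rows to the archive."""
    warehouse = [r for r in rows if r["order_date"] >= cutoff]
    archive = [r for r in rows if r["order_date"] < cutoff]
    return warehouse, archive

def query(store, predicate):
    """The same exploratory query can run against either store."""
    return [r for r in store if predicate(r)]

# Hypothetical order rows for the illustration.
rows = [
    {"order_id": 1, "order_date": date(2008, 3, 1), "amount": 120.0},
    {"order_id": 2, "order_date": date(2010, 11, 5), "amount": 80.0},
    {"order_id": 3, "order_date": date(2011, 6, 20), "amount": 45.0},
]
warehouse, archive = split_hot_cold(rows, cutoff=date(2010, 1, 1))

print(len(warehouse), "rows stay in the warehouse")
print(query(archive, lambda r: r["amount"] > 100.0))
```

The point of the sketch is that the dated rows leave the warehouse but remain reachable by query, which is what keeps the archive "queryable".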

Trends and directions
  Enterprise software integration
      –   Data warehouses, RDBMSs
      –   ETL platforms
                l tf
      –   Business intelligence tools
      –   Applications
      –   ...

  Diverse range of analytics
      – Text
      – Image / video (e.g., content based user profiling)
                      (e g content-based
      – Predictive modeling (e.g., ranking and classification based on
        machine learning)
      – ...

  Sophisticated, scalable infrastructure for processing
   massive data volumes
      – High-performance file system with full POSIX compliance, g
          g p                    y                           p , granular
        security
      – Fully recoverable and restartable workflows
      – Parallel, distributed indexing for text (“BigIndex”)
      – Read-optimized column store
      – Tooling for administrators, programmers, analysts
      – ...

Integrating Relational, Streams, and BigInsights

  [Diagram: three paths from data sources to results]

  Traditional warehouse
      – Traditional / relational data sources
      – Database & warehouse: at-rest data analytics
      – Results
  Streams
      – Non-traditional / non-relational data sources
      – In-motion analytics
      – Ultra-low-latency results
  Big Data
      – Varied data formats: semi-structured, unstructured, ...
      – InfoSphere BigInsights: batch-oriented data analytics at massive scale
      – Results

Typical Strategy for Analytics

  [Diagram: ETL pipeline feeding the warehouse]

  ETL: Sources → Extract → Transform/subset → Load
  Data warehouse / marts → SQL Analytics, Mining
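
A minimal sketch of the extract, transform/subset, load sequence, followed by SQL analytics on the loaded data. Everything specific here is an assumption made for illustration: the sample rows, the `sales` table, the EMEA filter, and the function names. An in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

def extract(source_rows):
    """Extract: read records from a source (here, an in-memory list)."""
    return iter(source_rows)

def transform(rows):
    """Transform/subset: keep only the rows and columns the warehouse needs."""
    for r in rows:
        if r["region"] == "EMEA":
            yield (r["region"], r["product"], r["amount"])

def load(conn, rows):
    """Load: write the transformed subset into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (region TEXT, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

# Hypothetical source data, invented for this sketch.
source = [
    {"region": "EMEA", "product": "widget", "amount": 10.0},
    {"region": "EMEA", "product": "widget", "amount": 5.5},
    {"region": "APAC", "product": "gadget", "amount": 7.0},
    {"region": "EMEA", "product": "gadget", "amount": 3.0},
]

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(source)))

# SQL analytics run against the warehouse, not against the raw sources.
totals = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
```

Note that the subset step discards the APAC row before loading, so it is no longer available for later analysis. That is the limitation the next slide addresses.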




Emerging requirements for analytics

  [Diagram: warehouse path and BigInsights path side by side]

  Structured sources → Extract → Transform/subset → Load → Warehouses / marts → SQL Analytics, Mining
  Other sources → BigInsights Repository → Transform, Analyze
  ETL, ELT (MR, BI, Mining)

  Explore large volumes of “raw” or diverse data.
  Discover, analyze new insights with BigInsights

Conclusions

      – Scale out to crunch petabytes


      – We need a mix of technologies
        • Data at rest: MapReduce, Hadoop and beyond
        • Data in motion: stream processing

      – To be successful, integrate with conventional
        technologies
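
The two processing styles named above can be contrasted in a small plain-Python sketch. It is an illustration only, not the Hadoop or InfoSphere Streams API: the function names, the `SlidingWindowCounter` class, and the sample inputs are all hypothetical. The first part runs map, shuffle, and reduce over a complete data set at rest; the second updates a result as each event arrives.

```python
from collections import Counter, defaultdict, deque

# --- Data at rest: a single-process sketch of the MapReduce pattern ---

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in a record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for rec in records for pair in map_phase(rec))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

# --- Data in motion: counts over a sliding window of the latest events ---

class SlidingWindowCounter:
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.counts = Counter()

    def push(self, event):
        """Add one event, evict the oldest if the window is full, return current counts."""
        self.window.append(event)
        self.counts[event] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return dict(self.counts)

batch_counts = map_reduce(["big data at rest", "big data in motion"])

stream = SlidingWindowCounter(size=3)
for event in ["error", "ok", "ok", "ok"]:
    latest = stream.push(event)
```

The batch job sees all records before it answers. The stream counter answers after every event and forgets data that has left the window.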




Getting in touch

  Stephen Brodsky – IBM
      – Email: sbrodsky@us.ibm.com
  InfoSphere BigInsights
      – http://www-01.ibm.com/software/data/infosphere/biginsights/
  InfoSphere Streams
      – http://www-01.ibm.com/software/data/infosphere/streams/

  Vladimir Bacvanski - SciSpike
      –   Email: vladimir.bacvanski@scispike.com
      –   Blog: http://www.OnBuildingSoftware.com/
      –   Twitter: http://twitter.com/OnSoftware
      –   LinkedIn: http://www.linkedin.com/in/VladimirBacvanski




 46                                                           © 2011 IBM Corporation & SciSpike

More Related Content

What's hot

Solution Centric Architectural Presentation - Implementing a Logical Data War...
Solution Centric Architectural Presentation - Implementing a Logical Data War...Solution Centric Architectural Presentation - Implementing a Logical Data War...
Solution Centric Architectural Presentation - Implementing a Logical Data War...Denodo
 
The Need to Know for Information Architects: Big Data to Big Information
The Need to Know for Information Architects: Big Data to Big InformationThe Need to Know for Information Architects: Big Data to Big Information
The Need to Know for Information Architects: Big Data to Big InformationDATAVERSITY
 
Telco Big Data 2012 Highlights
Telco Big Data 2012 HighlightsTelco Big Data 2012 Highlights
Telco Big Data 2012 HighlightsAlan Quayle
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
 
Regulation and Compliance in the Data Driven Enterprise
Regulation and Compliance in the Data Driven EnterpriseRegulation and Compliance in the Data Driven Enterprise
Regulation and Compliance in the Data Driven EnterpriseDenodo
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance Qubole
 
Big Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementBig Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementTony Bain
 
Analyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarAnalyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarDatameer
 
Core banking Closure bank day OSWA meetup 2018-Alexander Petrov Oslo
Core banking Closure bank day OSWA meetup 2018-Alexander Petrov OsloCore banking Closure bank day OSWA meetup 2018-Alexander Petrov Oslo
Core banking Closure bank day OSWA meetup 2018-Alexander Petrov OsloAlexander Petrov
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Tristan Baker
 
Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...
Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...
Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...Usama Fayyad
 
What is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesWhat is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesTony Pearson
 
An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...
An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...
An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...Neo4j
 
Crowdsourcing Data Governance
Crowdsourcing Data GovernanceCrowdsourcing Data Governance
Crowdsourcing Data GovernancePaul Boal
 
Introduction to open data in DataOps
Introduction to open data in DataOpsIntroduction to open data in DataOps
Introduction to open data in DataOpsDataops Ghent Meetup
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalHarvinder Atwal
 
TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...
TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...
TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...Vasu S
 
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...Usama Fayyad
 

What's hot (20)

Solution Centric Architectural Presentation - Implementing a Logical Data War...
Solution Centric Architectural Presentation - Implementing a Logical Data War...Solution Centric Architectural Presentation - Implementing a Logical Data War...
Solution Centric Architectural Presentation - Implementing a Logical Data War...
 
The Need to Know for Information Architects: Big Data to Big Information
The Need to Know for Information Architects: Big Data to Big InformationThe Need to Know for Information Architects: Big Data to Big Information
The Need to Know for Information Architects: Big Data to Big Information
 
Telco Big Data 2012 Highlights
Telco Big Data 2012 HighlightsTelco Big Data 2012 Highlights
Telco Big Data 2012 Highlights
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
Regulation and Compliance in the Data Driven Enterprise
Regulation and Compliance in the Data Driven EnterpriseRegulation and Compliance in the Data Driven Enterprise
Regulation and Compliance in the Data Driven Enterprise
 
Sgcp14dunlea
Sgcp14dunleaSgcp14dunlea
Sgcp14dunlea
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance
 
Big Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementBig Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data Management
 
Analyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarAnalyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop Webinar
 
Big Data Boom
Big Data BoomBig Data Boom
Big Data Boom
 
Core banking Closure bank day OSWA meetup 2018-Alexander Petrov Oslo
Core banking Closure bank day OSWA meetup 2018-Alexander Petrov OsloCore banking Closure bank day OSWA meetup 2018-Alexander Petrov Oslo
Core banking Closure bank day OSWA meetup 2018-Alexander Petrov Oslo
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
 
Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...
Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...
Keynote talk at Financial Times Forum - BigData and Advanced Analytics at SIB...
 
What is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use CasesWhat is big data - Architectures and Practical Use Cases
What is big data - Architectures and Practical Use Cases
 
An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...
An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...
An Overview of the Neo4j Cloud Strategy and the Future of Graph Databases in ...
 
Crowdsourcing Data Governance
Crowdsourcing Data GovernanceCrowdsourcing Data Governance
Crowdsourcing Data Governance
 
Introduction to open data in DataOps
Introduction to open data in DataOpsIntroduction to open data in DataOps
Introduction to open data in DataOps
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
 
TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...
TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...
TDWI Checklist - The Automation and Optimization of Advanced Analytics Based ...
 
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
 

Viewers also liked

Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascadingDataiku
 
Why choose VMware vCloud Suite Standard over vSOM
Why choose VMware vCloud Suite Standard over vSOMWhy choose VMware vCloud Suite Standard over vSOM
Why choose VMware vCloud Suite Standard over vSOMAnil Gupta (AJ) - vExpert
 
TOON Stephen Galsworthy
TOON Stephen GalsworthyTOON Stephen Galsworthy
TOON Stephen GalsworthyBigDataExpo
 
BVBA SOSIS van Jeroen Meus kent rustige start
BVBA SOSIS van Jeroen Meus kent rustige startBVBA SOSIS van Jeroen Meus kent rustige start
BVBA SOSIS van Jeroen Meus kent rustige startThierry Debels
 
Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...
Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...
Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...CA Technologies
 
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...DataWorks Summit
 
Technical Radar (Chinese version) 2014-06
Technical Radar (Chinese version) 2014-06Technical Radar (Chinese version) 2014-06
Technical Radar (Chinese version) 2014-06Freyr Lin
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015
Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015
Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015Sara Lerén
 
20170918 remiqz - big data expo - final
20170918   remiqz - big data expo - final20170918   remiqz - big data expo - final
20170918 remiqz - big data expo - finalBigDataExpo
 
Boston Devops Meetup June 22nd
Boston Devops Meetup June 22ndBoston Devops Meetup June 22nd
Boston Devops Meetup June 22ndmdilawari
 
Deploy, Monitor and Manage in Style with WebSphere Liberty Admin Center
Deploy, Monitor and Manage in Style with WebSphere Liberty Admin CenterDeploy, Monitor and Manage in Style with WebSphere Liberty Admin Center
Deploy, Monitor and Manage in Style with WebSphere Liberty Admin CenterWASdev Community
 
Pre-Con Ed: Learn What's New in CA Spectrum®
Pre-Con Ed: Learn What's New in CA Spectrum®Pre-Con Ed: Learn What's New in CA Spectrum®
Pre-Con Ed: Learn What's New in CA Spectrum®CA Technologies
 
Next Generation Data Center Strategies
Next Generation Data Center StrategiesNext Generation Data Center Strategies
Next Generation Data Center StrategiesVenkat Nambiyur
 
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceHow Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceAmazon Web Services
 
Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3
Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3
Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3Holger Mueller
 
Four Graphics credentials
Four Graphics credentialsFour Graphics credentials
Four Graphics credentialsEmile Melki
 
Cyberbullying in the Middle Years
Cyberbullying in the Middle YearsCyberbullying in the Middle Years
Cyberbullying in the Middle Yearselketeaches
 

Viewers also liked (20)

Oow2016 review--paas-microservices-
Oow2016 review--paas-microservices-Oow2016 review--paas-microservices-
Oow2016 review--paas-microservices-
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascading
 
Why choose VMware vCloud Suite Standard over vSOM
Why choose VMware vCloud Suite Standard over vSOMWhy choose VMware vCloud Suite Standard over vSOM
Why choose VMware vCloud Suite Standard over vSOM
 
TOON Stephen Galsworthy
TOON Stephen GalsworthyTOON Stephen Galsworthy
TOON Stephen Galsworthy
 
BVBA SOSIS van Jeroen Meus kent rustige start
BVBA SOSIS van Jeroen Meus kent rustige startBVBA SOSIS van Jeroen Meus kent rustige start
BVBA SOSIS van Jeroen Meus kent rustige start
 
Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...
Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...
Agile Operations Keynote: Redefine the Role of IT Operations With Digital Tra...
 
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...
 
Technical Radar (Chinese version) 2014-06
Technical Radar (Chinese version) 2014-06Technical Radar (Chinese version) 2014-06
Technical Radar (Chinese version) 2014-06
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015
Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015
Impact-driven Scrum Delivery at Scrum gathering Phoenix 2015
 
20170918 remiqz - big data expo - final
20170918   remiqz - big data expo - final20170918   remiqz - big data expo - final
20170918 remiqz - big data expo - final
 
Boston Devops Meetup June 22nd
Boston Devops Meetup June 22ndBoston Devops Meetup June 22nd
Boston Devops Meetup June 22nd
 
Deploy, Monitor and Manage in Style with WebSphere Liberty Admin Center
Deploy, Monitor and Manage in Style with WebSphere Liberty Admin CenterDeploy, Monitor and Manage in Style with WebSphere Liberty Admin Center
Deploy, Monitor and Manage in Style with WebSphere Liberty Admin Center
 
Pre-Con Ed: Learn What's New in CA Spectrum®
Pre-Con Ed: Learn What's New in CA Spectrum®Pre-Con Ed: Learn What's New in CA Spectrum®
Pre-Con Ed: Learn What's New in CA Spectrum®
 
Next Generation Data Center Strategies
Next Generation Data Center StrategiesNext Generation Data Center Strategies
Next Generation Data Center Strategies
 
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceHow Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
 
Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3
Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3
Oracle OpenWorld - A quick take on all 22 press releases of Day #1 - #3
 
Four Graphics credentials
Four Graphics credentialsFour Graphics credentials
Four Graphics credentials
 
Cyberbullying in the Middle Years
Cyberbullying in the Middle YearsCyberbullying in the Middle Years
Cyberbullying in the Middle Years
 
iOS and Android apps automation
iOS and Android apps automationiOS and Android apps automation
iOS and Android apps automation
 

Similar to How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)Mark Heid
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big DecisionsInnoTech
 
Big Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureBig Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureOdinot Stanislas
 
Intel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntelAPAC
 
Big data seminor
Big data seminorBig data seminor
Big data seminorberasrujana
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life RevolutionCapgemini
 
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)Ajay Ohri
 
big-data-8722-m8RQ3h1.pptx
big-data-8722-m8RQ3h1.pptxbig-data-8722-m8RQ3h1.pptx
big-data-8722-m8RQ3h1.pptxVaishnavGhadge1
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Ibm big data ibm marriage of hadoop and data warehousing
Ibm big dataibm marriage of hadoop and data warehousingIbm big dataibm marriage of hadoop and data warehousing
Ibm big data ibm marriage of hadoop and data warehousing DataWorks Summit
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond Rajesh Kumar
 
Building Data Science Ecosystems for Smart Cities and Smart Commerce
Building Data Science Ecosystems for Smart Cities and Smart CommerceBuilding Data Science Ecosystems for Smart Cities and Smart Commerce
Building Data Science Ecosystems for Smart Cities and Smart CommerceAlex Liu
 
Big data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You WantBig data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You WantStuart Miniman
 
Building Confidence in Big Data - IBM Smarter Business 2013
Building Confidence in Big Data - IBM Smarter Business 2013 Building Confidence in Big Data - IBM Smarter Business 2013
Building Confidence in Big Data - IBM Smarter Business 2013 IBM Sverige
 
IBM CDS Overview
IBM CDS OverviewIBM CDS Overview
IBM CDS OverviewJean Tan
 

Similar to How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams (20)

Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
Big Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureBig Data and Implications on Platform Architecture
Big Data and Implications on Platform Architecture
 
Big Data a big deal?
Big Data a big deal?Big Data a big deal?
Big Data a big deal?
 
Intel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick Knupffer
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life Revolution
 
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)Ibm big data    hadoop summit 2012 james kobielus final 6-13-12(1)
Ibm big data hadoop summit 2012 james kobielus final 6-13-12(1)
 
big-data-8722-m8RQ3h1.pptx
big-data-8722-m8RQ3h1.pptxbig-data-8722-m8RQ3h1.pptx
big-data-8722-m8RQ3h1.pptx
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Ibm big data ibm marriage of hadoop and data warehousing
Ibm big dataibm marriage of hadoop and data warehousingIbm big dataibm marriage of hadoop and data warehousing
Ibm big data ibm marriage of hadoop and data warehousing
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
On Big Data
On Big DataOn Big Data
On Big Data
 
Building Data Science Ecosystems for Smart Cities and Smart Commerce
Building Data Science Ecosystems for Smart Cities and Smart CommerceBuilding Data Science Ecosystems for Smart Cities and Smart Commerce
Building Data Science Ecosystems for Smart Cities and Smart Commerce
 
IBM Stream au Hadoop User Group
IBM Stream au Hadoop User GroupIBM Stream au Hadoop User Group
IBM Stream au Hadoop User Group
 
Big data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You WantBig data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You Want
 
Building Confidence in Big Data - IBM Smarter Business 2013
Building Confidence in Big Data - IBM Smarter Business 2013 Building Confidence in Big Data - IBM Smarter Business 2013
Building Confidence in Big Data - IBM Smarter Business 2013
 
IBM CDS Overview
IBM CDS OverviewIBM CDS Overview
IBM CDS Overview
 

More from DATAVERSITY

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...DATAVERSITY
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceDATAVERSITY
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data LiteracyDATAVERSITY
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for YouDATAVERSITY
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?DATAVERSITY
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling FundamentalsDATAVERSITY
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectDATAVERSITY
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?DATAVERSITY
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...DATAVERSITY
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsDATAVERSITY
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayDATAVERSITY
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise AnalyticsDATAVERSITY
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best PracticesDATAVERSITY
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?DATAVERSITY
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best PracticesDATAVERSITY
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageDATAVERSITY
 


How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights and Streams

  • 1. How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights and Streams
      Tom Deutsch, IBM
      Vladimir Bacvanski, Founder, SciSpike, vladimir.bacvanski@scispike.com
      Stephen Brodsky, Technical Executive and Distinguished Engineer, IBM, sbrodsky@us.ibm.com
      August 24, 2011 © 2011 IBM Corporation & SciSpike
  • 2. Who are we?
      ▪ Dr. Vladimir Bacvanski
          – Consultant, trainer, and mentor focusing on making clients successful in adopting new data and software approaches
          – Over 20 years of experience
          – Founder of SciSpike – a training and consulting firm specializing in advanced software and data technologies
      ▪ Stephen Brodsky, Ph.D.
          – Distinguished Engineer and Technical Executive for IBM Big Data initiatives at the IBM Silicon Valley Laboratory
          – Previously led the architecture for the Optim Data Studio product line and pureQuery and was a member of the architecture team for DB2 pureXML, Rational Application Developer (RAD), and WebSphere
  • 3. Agenda
      ▪ The “Big Data” challenge: smarter analytics for a smarter planet
      ▪ How to do it?
          – The big data challenge
          – Foundations of Big Data approaches
          – MapReduce and Hadoop
          – Real-time data and stream processing
          – Integration with existing systems
  • 4. The “Big Data” Challenge
  • 5. The World is Changing and Becoming More…
      ▪ INSTRUMENTED
      ▪ INTERCONNECTED
      ▪ INTELLIGENT
      The resulting explosion of information creates a need for a new kind of intelligence …to help build a Smarter Planet
  • 6. Information is Growing at a Phenomenal Rate . . . .
      ▪ 44x as much data and content over coming decade: 800,000 petabytes in 2009, 35 zettabytes (35 billion terabytes) in 2020
      ▪ 80% of world’s data is unstructured
      Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
  • 7. The BIG Data Challenge
      ▪ Manage and benefit from massive and growing amounts of data
      ▪ Handle varied data formats (structured, unstructured, semi-structured) and increased data velocity
      ▪ Exploit BIG Data in a timely and cost-effective fashion
      Diagram labels: Collect, Manage, Integrate, Analyze
  • 8. What clients are saying . . .
      ▪ Lots of potentially valuable data is dormant or discarded due to size/performance considerations
      ▪ Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)
      ▪ Not clear what should be analyzed (exploratory, iterative)
      ▪ Information distributed across multiple systems and/or Internet
      ▪ Some information has a short useful lifespan
      ▪ Volumes can be extremely high
      ▪ Analysis needed in the context of existing information (not stand alone)
  • 9. Big Data Presents Big Opportunities
      Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner
      ▪ Variety: Manage and benefit from diverse data types and data structures
      ▪ Velocity: Analyze streaming data and large volumes of persistent data
      ▪ Volume: Scale from terabytes to zettabytes
  • 10. Streams and Oceans of Information . . . .
      ▪ Information oceans: information stored outside conventional systems. Data may originate from the Web or different internal systems etc.
          – Collection of what has streamed
          – Information from social media, logs, click streams, emails, etc.
          – Unstructured or mixed schema documents like claims, forms, desktop applications, etc.
          – Structured data from disparate systems
      ▪ Information streams: high speed information flowing in real-time, often transient
          – Information from sensors, instruments, etc.
          – Information flowing from real-time logs and activity monitors
          – Streaming content like audio and video
          – High speed transactions like tickers, trades, or traffic systems
  • 11. Applications for Big Data Analytics
      ▪ Smarter Healthcare
      ▪ Multi-channel sales
      ▪ Finance
      ▪ Homeland security
      ▪ Traffic Control
      ▪ Telecom
      ▪ Manufacturing
      ▪ Trading Analytics
      ▪ Many more!
  • 12. Use Case Example: Energy Company
      Business scenario
      ▪ Analyze large volumes of public and private weather data for alternative energy business
      ▪ Existing high-performance computing hardware, limited staff
      Technical challenges
      ▪ High data volume: 2+ PB
      ▪ Range of query types
          – Avg temp in given location? (Small result)
          – Geo pts where ice may form on wind turbines? (Large result, derived values – icing determined by humidity + temp.)
      ▪ Run on system with non-Hadoop apps
  • 13. Use Case Example: Global Media Firm
      Business scenario
      ▪ Identify unauthorized content streaming in digital media (piracy)
          – Quantify annual revenue loss
          – Analyze trends
      ▪ Monitor social media sites to identify dissemination of pirated content. Time sensitive!
      Technical challenges
      ▪ High variety of unstructured and semi-structured data
      ▪ Initial focus: text analytics over 1 year’s worth of social media data. Look for live streaming URLs, sentiment, event info, etc.
      ▪ Complex rules to qualify & classify info
      ▪ Future potential for video analysis
  • 14. IBM Watson
      IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.
  • 15. Big Data and Watson
      ▪ Big Data technology is used to build Watson’s knowledge base
          – Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.
          – Diagram labels: Approx. 200M pages of text (to compete on Jeopardy!), InfoSphere BigInsights, Watson’s Memory
      ▪ Watson technology offers great potential for advanced business analytics
          – Diagram labels: CRM Data, POS Data, Social Media, InfoSphere BigInsights, Distilled Insight (spending habits, social relationships, buying trends), Advanced search and analysis
  • 16. Customer Engagements
      Use patterns
      ▪ Customer sentiment analysis (cross-sell, up-sell, campaign management)
      ▪ Integrated retail and web customer behavior modeling
      ▪ Predictive modeling (credit card fraud)
      ▪ System log analytics (reduce operational risk)
      Common requirements
      ▪ Extract business insight from large volumes of raw data (often outside operational systems)
      ▪ Integrate with other existing software
      ▪ Ready for enterprise use
      Diagram labels: Consumer Insight; Text, Blog, Weblog; Click streams; Multi-channel sales; Log & transactions; Next Gen Fraud Models; Text Analytics; Biological Sequences; Operational system & streams data sources; New Business Development; Statistical Model Building
  • 17. The approach to crunching big data
  • 18. How to approach Big Data analytics?
      InfoSphere BigInsights and InfoSphere Streams
      ▪ Analytics for data in-motion and at-rest
      ▪ Platform for processing large volumes of diverse data
      ▪ Complements and integrates with existing software solutions
  • 19. Addressing the Key Requirements (Big Data Platform)
      1. Platform for V3 – Variety, Velocity, Volume
          ▪ Variety - manage data & content “As Is”
          ▪ Handle any velocity - low-latency streams and large volume batch
          ▪ Volume - huge volumes of at-rest or streaming data
      2. Analytics for V3
          ▪ Analyze sources in their native format - text, data, rich content
          ▪ Analyze all of the data - not just a subset
          ▪ Dynamic analytics - automatic adjustments and actions
      3. Ease of Use for Developers and Users
          ▪ Developer UIs, common languages & automatic optimization
          ▪ End-user UIs & visualization
      4. Enterprise Class
          ▪ Failure tolerance, Security and Privacy
          ▪ Scale Economically
      5. Extensive Integration Capabilities
          ▪ Integrate wide variety of sources
          ▪ Leverage enterprise integration technologies
  • 20. Big Data Initiative
      ▪ InfoSphere BigInsights: volumes of diverse, persistent data; analytic applications for “Big Data”
      ▪ Warehouse: traditional warehouse applications
      ▪ InfoSphere Streams: real-time streaming data
      Diagram label: IBM Confidential
  • 21. BigInsights Summary
      ▪ BigInsights = analytical platform for persistent “Big Data”
          – Based on open source & IBM technologies
      ▪ Distinguishing characteristics
          – Built-in analytics . . . . Enhances business knowledge
          – Enterprise software integration . . . . Complements and extends existing capabilities
          – Production-ready platform . . . . Speeds time-to-value; simplifies development and maintenance
  • 22. Big Data Platform Vision: Bringing Big Data to the Enterprise
      Diagram labels:
      ▪ Big Data Solutions
      ▪ Big Data User Environments: Developers, End Users, Administrators
      ▪ Big Data Enterprise Engines: Streaming Analytics, Internet Scale Analytics
      ▪ Open Source Foundational Components
      ▪ Agents, Integration
      ▪ Connected enterprise systems: Data Warehouse, Information Integration, Master Data Mgmt, Database, Content Analytics, Business Analytics, Marketing, Data Growth Management
  • 23. InfoSphere BigInsights v 1.1
      ▪ Platform for volume, variety, velocity -- V3
          – Hadoop foundation
      ▪ Analytics for V3
          – Text analytics & tooling
      ▪ Usability
          – Web administrative console
          – Integrated install
          – Spreadsheet-style analytic tool
      ▪ Enterprise Class
          – Storage, security, cluster management
      ▪ Integration
          – Connectivity to DB2, Netezza
      Diagram (axes: enterprise class, breadth of capabilities): Basic Edition (free download) and Enterprise Edition (licensed). Feature labels shown: Apache Hadoop, integrated install, 24 x 7 Web support, Web admin console, LDAP authentication, RDBMS and warehouse connectivity, text analytics, spreadsheet-style analytic tool, flexible job scheduler
  • 24. BigInsights Platform: Key Ideas
      ▪ Flexible, enterprise-class support for processing large volumes of data
          – Based on Google’s MapReduce technology
          – Inspired by Apache Hadoop; compatible with its ecosystem and distribution
          – Well-suited to batch-oriented, read-intensive applications
          – Supports wide variety of data
      ▪ Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
          – CPU + disks = “node”
          – Nodes can be combined into clusters
          – New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written
  • 25. The MapReduce Programming Model
      ▪ "Map" step:
          – Input split into pieces
          – Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
          – Each worker node stores its result in its local file system where a reducer is able to access it
      ▪ "Reduce" step:
          – Data is aggregated (“reduced” from the map steps) by worker nodes (under control of the Job Tracker)
          – Multiple reduce tasks can parallelize the aggregation
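The two steps on this slide can be sketched in a single process. This is only an illustration, not Hadoop's API: `run_mapreduce`, the thread pool standing in for worker nodes, the hash-based routing of keys to reduce tasks, and the sample temperature readings are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(splits, mapper, reducer, num_reducers=2):
    """Toy, in-process imitation of the map and reduce steps (hypothetical helper)."""
    # Map step: every input split is handled independently, here by a thread pool
    # that stands in for the worker nodes.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(lambda split: list(mapper(split)), splits))

    # Shuffle: route each key to exactly one reduce task and group its values.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[hash(key) % num_reducers][key].append(value)

    # Reduce step: each partition could be aggregated by a separate reduce task.
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            results[key] = reducer(key, values)
    return results


def temperature_mapper(split):
    # Emit one (location, temperature) pair per record in the split.
    for location, temperature in split:
        yield location, temperature


def max_reducer(location, temperatures):
    # Aggregate all values seen for one key.
    return max(temperatures)


# Made-up readings, already divided into two input splits.
splits = [
    [("north", 3.5), ("south", 11.0)],
    [("north", -1.0), ("south", 14.5)],
]
result = run_mapreduce(splits, temperature_mapper, max_reducer)
print(result)
```

A real cluster differs mainly in where things run: map output is written to each worker's local disk and fetched by the reducers over the network, and a coordinator schedules and retries tasks.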
  • 26. What is Hadoop?
      ▪ Apache Hadoop = free, open source framework for data-intensive applications
          – Inspired by Google technologies (MapReduce, GFS)
          – Well-suited to batch-oriented, read-intensive applications
          – Originally built to address scalability problems of Nutch, an open source Web search technology
      ▪ Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
          – CPU + disks of commodity box = Hadoop “node”
          – Boxes can be combined into clusters
          – New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written
  • 27. Two Key Aspects of Hadoop
      ▪ MapReduce framework
          – How Hadoop understands and assigns work to the nodes (machines)
      ▪ Hadoop Distributed File System = HDFS
          – Where Hadoop stores data
          – A file system that spans all the nodes in a Hadoop cluster
          – It links together the file systems on many local nodes to make them into one big file system
  • 28. Logical MapReduce Example: Word Count
      Content of input documents:
          Hello World Bye World
          Hello IBM
      map(String key, String value):
          // key: document name
          // value: document contents
          for each word w in value:
              EmitIntermediate(w, "1");
      reduce(String key, Iterator values):
          // key: a word
          // values: a list of counts
          int result = 0;
          for each v in values:
              result += ParseInt(v);
          Emit(AsString(result));
      Map 1 emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
      Map 2 emits: <Hello, 1> <IBM, 1>
      Reduce (final output): <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
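The word count on this slide can be reproduced as a small runnable sketch. It runs in one process and only mirrors the logical flow of the pseudocode; the function names and the document names `doc1` and `doc2` are illustrative assumptions, and the counts are kept as integers rather than the strings used on the slide.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield word, 1


def reduce_counts(word, counts):
    # key: a word, values: the list of counts emitted for that word
    return word, sum(counts)


def word_count(documents):
    # Group the intermediate (word, 1) pairs by word, as the framework's
    # shuffle would do between the map and reduce steps.
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, one in map_words(doc_name, contents):
            grouped[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in sorted(grouped.items()))


documents = {"doc1": "Hello World Bye World", "doc2": "Hello IBM"}
counts = word_count(documents)
print(counts)  # {'Bye': 1, 'Hello': 2, 'IBM': 1, 'World': 2}
```

The totals match the slide's final reduce output; only the grouping step, which the framework performs implicitly, is spelled out here.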
  • 29. How To Create MapReduce Jobs
      ▪ MapReduce development in Java
          – Low level, very flexible
          – Time consuming development
      ▪ Hive
          – Open source language / Apache sub-project
          – Provides a SQL-like interface to Hadoop
      ▪ Pig
          – Data flow language / Apache sub-project
      ▪ Jaql
          – A query language for JSON
          – Useful for loosely structured data
  • 30. Management Tools: Web Console
      ▪ Graphically manage cluster, jobs, HDFS
      ▪ Sample administration tasks
          – Start/Stop Servers
          – Add/Remove Servers
          – Server Status Details (Log)
  • 31. Spreadsheet-like Analysis Tool (BigSheets)
      ▪ Web-based analysis and visualization tool
      ▪ Spreadsheet-like interface
          – Define and manage long running data collection jobs
          – Analyze content of the text on the pages that have been retrieved
  • 32. Text Analytics
      ▪ Distill structured info from unstructured data
          – Sentiment analysis
          – Consumer behavior
          – Illegal or suspicious activities
          – ...
      ▪ Pre-built library of text annotators for common business entities
      ▪ Rich language and tooling to build custom annotators
      ▪ Support for Western languages (English, Dutch/Flemish, French, German, Italian, Portuguese, or Spanish) plus select Asian languages (Japanese, Chinese)
      Annotators listed: "Acquisition", "Address", "Alliance", "AnalystEarningsEstimate", "City", "CompanyEarningsAnnouncement", "CompanyEarningsGuidance", "Continent", "Country", "County", "DateTime", "EmailAddress", "JointVenture", "Location", "Merger", "NotesEmailAddress", "Organization", "Person", "PhoneNumber", "StateOrProvince", "URL", "ZipCode"
  • 33. Eclipse-based Text Analytics Development
  • 34. So What Does This Result In?
      ▪ Easy To Scale
      ▪ Fault Tolerant and Self-Healing
      ▪ Data Agnostic
      ▪ Extremely Flexible
  • 35. Working with streaming data: a new paradigm
      ▪ Conventional processing: static data (Queries → Data → Results)
      ▪ Real-time processing: streaming data (Data → Queries → Results)
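The contrast on this slide can be illustrated with a short sketch: a conventional query runs once over data that is already stored, while a streaming query stays in place and produces a result as each item arrives. The generator-based design, the function names, and the sample readings are assumptions made for the example; nothing here is InfoSphere Streams code.

```python
from collections import deque


def static_average(stored_values):
    # Conventional processing: the data is at rest and the query visits it once.
    return sum(stored_values) / len(stored_values)


def moving_average(stream, window_size):
    # Streaming processing: the query is fixed and the data flows through it.
    # Only the most recent `window_size` values are kept in memory.
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


readings = [10.0, 20.0, 30.0, 40.0]

# One result, available only after all data has been collected.
batch_result = static_average(readings)

# One result per arriving reading, computed without storing the full history.
averages = list(moving_average(iter(readings), window_size=2))
print(batch_result, averages)  # 25.0 [10.0, 15.0, 25.0, 35.0]
```

Because the streaming version keeps only a bounded window, it can run over an unbounded source, which is what makes low-latency analysis of transient data practical.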
  • 36. Real-Time Data with InfoSphere Streams
      ▪ Streaming analytic applications
          – Multiple input streams
          – Advanced streaming analytics
      ▪ Eclipse based IDE
          – Define sources, apply operators, define intermediary and final output sinks
          – User defined operators in Java or C++
      ▪ Optimizing compiler automates deployment and connections
          – Extremely low latency
          – Cluster of up to 125 nodes
      Diagram labels: Source Adapters; Operator Repository; Sink Adapters; InfoSphere Streams Studio (IDE for Streams Processing Language); Automated, Optimized Deploy and Management (Scheduler)
  • 37. Scalable stream processing
      ▪ InfoSphere Streams provides
          – A programming model and IDE for defining data sources and software analytic modules called operators that are fused into process execution units (PEs)
          – Infrastructure to support the composition of scalable stream processing applications from these components
          – Deployment and operation of these applications across distributed x86 processing nodes, when scaled processing is required
          – Stream connectivity between data sources and PEs of a stream processing application
  • 38. Merging the Traditional and Big Data Approaches
      ▪ Traditional Approach: Structured & Repeatable Analysis
          – Business Users: determine what question to ask
          – IT: structures the data to answer that question
          – Examples: monthly sales reports, profitability analysis, customer surveys
      ▪ Big Data Approach: Iterative & Exploratory Analysis
          – IT: delivers a platform to enable creative discovery
          – Business: explores what questions could be asked
          – Examples: brand sentiment, product strategy, maximum asset utilization
  • 39. BigInsights and the data warehouse: filtering and summarizing “Big Data”
      Diagram labels: BigInsights, Data warehouse
      ▪ Broader analytic coverage
      ▪ Exploits IT investments while minimizing burden
  • 40. BigInsights as a “queryable archive” for growing data warehouses
      Diagram labels: BigInsights, Data warehouse
      ▪ Offload “cold” or dated warehouse info but maintain access for further exploration
      ▪ Keep warehouse size manageable and focused on well-known business analytic needs
  • 41. Trends and directions
      ▪ Enterprise software integration
          – Data warehouses, RDBMSs
          – ETL platforms
          – Business intelligence tools
          – Applications
          – ...
      ▪ Diverse range of analytics
          – Text
          – Image / video (e.g., content-based user profiling)
          – Predictive modeling (e.g., ranking and classification based on machine learning)
          – ...
      ▪ Sophisticated, scalable infrastructure for processing massive data volumes
          – High-performance file system with full POSIX compliance, granular security
          – Fully recoverable and restartable workflows
          – Parallel, distributed indexing for text (“BigIndex”)
          – Read-optimized column store
          – Tooling for administrators, programmers, analysts
          – ...
  • 42. Integrating Relational, Streams, and BigInsights
      Diagram labels:
      ▪ Traditional / Relational Data Sources; Database & Warehouse: at-rest data analytics; Traditional Warehouse Results
      ▪ Non-Traditional / Non-Relational Data Sources; Streams: in-motion analytics; Ultra Low Latency Results
      ▪ Varied data formats (semi-structured, unstructured, ...); InfoSphere BigInsights: massive scale, batch-oriented data analytics; Big Data Results
  • 43. Typical Strategy for Analytics
      Diagram labels: Sources; ETL (Extract subset, Transform, Load); Data warehouse / marts; SQL Analytics, Mining
  • 44. Emerging requirements for analytics
      Diagram labels: Structured Sources; ETL, ELT (Extract subset, Transform, Load); Warehouses / marts; SQL Analytics, Mining; Other Sources; BigInsights Source Repository; Transform, Analyze (MR BI, Mining)
      Explore large volumes of “raw” or diverse data. Discover, analyze new insights with BigInsights
  • 45. Conclusions
      – Scale out to crunch petabytes
      – We need a mix of technologies
          • Data at rest: MapReduce, Hadoop and beyond
          • Data in motion: stream processing
      – To be successful, integrate with conventional technologies
  • 46. Getting in touch
      ▪ Stephen Brodsky – IBM
          – Email: sbrodsky@us.ibm.com
      ▪ InfoSphere BigInsights
          – http://www-01.ibm.com/software/data/infosphere/biginsights/
      ▪ InfoSphere Streams
          – http://www-01.ibm.com/software/data/infosphere/streams/
      ▪ Vladimir Bacvanski - SciSpike
          – Email: vladimir.bacvanski@scispike.com
          – Blog: http://www.OnBuildingSoftware.com/
          – Twitter: http://twitter.com/OnSoftware
          – LinkedIn: http://www.linkedin.com/in/VladimirBacvanski