SlideShare uma empresa Scribd logo
1 de 37
Baixar para ler offline
Mining Big Data in Real Time

           Albert Bifet




  Turing/SLAIS 2012 Conference
BIG Data




   BIG DATA
           Measure and React
Motivation




     Source: IDC’s Digital Universe Study (EMC), June 2011




               Data is growing
Motivation

             Memory unit        Size   Binary size
             kilobyte (kB/KB)   103        210
             megabyte (MB)      106        220
             gigabyte (GB)      109        230
             terabyte (TB)      1012       240
             petabyte (PB)      1015       250
             exabyte (EB)       1018       260
             zettabyte (ZB)     1021       270
             yottabyte (YB)     1024       280




              Data is growing
Motivation




     Source: IDC’s Digital Universe Study (EMC), June 2011




               Data is growing
Motivation




     Source: IDC’s Digital Universe Study (EMC), June 2011




               Data is growing
Motivation




     Source: IDC’s Digital Universe Study (EMC), June 2011




               Data is growing
Streaming Data




       Big Data & Real Time
Big Data




           McKinsey Global Institute (MGI) Report on Big Data, 2011.



       Big data refers to datasets whose size is beyond
        the ability of typical database software tools to
            capture, store, manage, and analyze.
Big Data




           McKinsey Global Institute (MGI) Report on Big Data, 2011.



       Big data refers to datasets whose size is beyond
        the ability of typical database software tools to
            capture, store, manage, and analyze.
BIG Data




     Volume
     Variety
     Velocity



                3 Vs
Methodology




              Sampling and distributed systems
Methodology




                     Paolo Boldi
          Facebook Four degrees of separation


          Big Data does not need big machines,
                it needs big intelligence
Real time analytics




          We want to analyze what is happening now.
Real time analytics




          We want to analyze what is happening now.
Time and Memory




                    Number 8 Wire Mentality



      Time and memory are the resource dimensions of
                     the process.
Time and Memory




      Time and memory are the resource dimensions of
                     the process.
Algorithms




        Classification, Regression, Clustering, Frequent
                        Pattern Mining.
Applications




      sensor data: industry, cities
      telecomm data
      social networks: twitter, facebook, yahoo
      marketing: sales business


            Data may come from: humans, sensors, or
                         machines.
New applications: social networks
   Twitter: A Massive Data Stream




      Micro-blogging service
      Built to discover what is happening at any moment in time,
      anywhere in the world.
      3 billion requests a day via its API.

   MOA-TweetReader: a real-time system to
      read tweets in real time
      detect changes
      find the terms whose frequency changed
Sentiment Analysis on Twitter
   Sentiment analysis
   Classifying messages into two categories depending on
   whether they convey positive or negative feelings


   Emoticons are visual cues associated with emotional states,
   which can be used to define class labels for sentiment
   classification

             Positive Emoticons      Negative Emoticons
                       :)                     :(
                      :-)                    :-(
                      :)                     :(
                      :D
                      =)
             Table : List of positive and negative emoticons.
New problem: structured classification

   New methods for structured classification

                         D                   D

                         B               B       B
                                 →

                     C       C       C       C

                     A


       sequences, trees, graphs
New problem: structured classification

   New methods for structured classification

             D           D                  D           D

             B   ,       B            B         B   ,   B
                                 →
             C       C       C        C                 C

             A


       sequences, trees, graphs
       frequent pattern mining techniques
New problem: structured classification

   New methods for structured classification

             D           D                  D           D

             B   ,       B              B       B   ,   B
                                 →
             C       C       C          C               C

             A       a,b → class1, class2


       sequences, trees, graphs
       frequent pattern mining techniques
       multi-label data mining
           Example: Lord of the Rings → Action, Adventure, Fantasy
New Techniques: Distributed Systems




              Hadoop, S4 and Storm
Hadoop




         Hadoop
Hadoop




         Hadoop architecture
Apache Mahout




            Mahout: open source framework
Pig




      Pig: Similar to SQL
Pig




      A = LOAD ’data’ USING PigStorage() AS
      (f1:int, f2:int, f3:int);
      B = GROUP A BY f1;
      C = FOREACH B GENERATE COUNT ($0);
      DUMP C;



                  Pig: Similar to SQL
Apache S4




            Apache S4
Apache S4
Storm




        Storm from Twitter
Storm




        Stream, Spout, Bolt, Topology
Storm
        Tools
                                            ElephantDB, Voldemort

                       All     Hadoop Precomputed
                                         batch view
                      data
                                                        Storm   Query
                                        Precomputed
                                        realtime view
         New data stream       Storm
                                         Cassandra, Riak, HBase
            Kafka
                           “Lambda Architecture”




           Runaway complexity in Big Data
                Nathan Marz, 2012
Data Streams




       Big Data & Real Time
Data Streams




               Thanks!

Mais conteúdo relacionado

Mais procurados

Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataAnsgar Scherp
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonWes McKinney
 
A Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document AnnotationA Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document AnnotationAnsgar Scherp
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudAnsgar Scherp
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...Ansgar Scherp
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLMLconf
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetJ On The Beach
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingGabriele Modena
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona KeynoteTravis Oliphant
 

Mais procurados (20)

Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open Data
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in Python
 
A Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document AnnotationA Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document Annotation
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processing
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
 

Semelhante a Mining Big Data in Real Time

Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseAge Mooij
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1gauravsc36
 
Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big DataPierre De Wilde
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]Shirshanka Das
 
Multi-thematic spatial databases
Multi-thematic spatial databasesMulti-thematic spatial databases
Multi-thematic spatial databasesConor Mc Elhinney
 
Sample Paper.doc.doc
Sample Paper.doc.docSample Paper.doc.doc
Sample Paper.doc.docbutest
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardBrian O'Neill
 
Talk at West Coast Association of Shared Resource Directors
Talk at West Coast Association of Shared Resource DirectorsTalk at West Coast Association of Shared Resource Directors
Talk at West Coast Association of Shared Resource DirectorsDeepak Singh
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsIJMER
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataAndre Freitas
 
Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured DataDataWorks Summit
 
NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesVishy Poosala
 
Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]Trent McConaghy
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
Middeware2012 crowd
Middeware2012 crowdMiddeware2012 crowd
Middeware2012 crowdmjfrankli
 

Semelhante a Mining Big Data in Real Time (20)

STI Summit 2011 - Digital Worlds
STI Summit 2011 - Digital WorldsSTI Summit 2011 - Digital Worlds
STI Summit 2011 - Digital Worlds
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Galaxy of bits
Galaxy of bitsGalaxy of bits
Galaxy of bits
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBase
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big Data
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
 
Multi-thematic spatial databases
Multi-thematic spatial databasesMulti-thematic spatial databases
Multi-thematic spatial databases
 
Sample Paper.doc.doc
Sample Paper.doc.docSample Paper.doc.doc
Sample Paper.doc.doc
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
 
Talk at West Coast Association of Shared Resource Directors
Talk at West Coast Association of Shared Resource DirectorsTalk at West Coast Association of Shared Resource Directors
Talk at West Coast Association of Shared Resource Directors
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured Data
 
NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, Opportunities
 
Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]Blockchains for AI [With New Applications]
Blockchains for AI [With New Applications]
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
Middeware2012 crowd
Middeware2012 crowdMiddeware2012 crowd
Middeware2012 crowd
 
A Big Data Concept
A Big Data ConceptA Big Data Concept
A Big Data Concept
 

Mais de Albert Bifet

Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersAlbert Bifet
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data ScienceAlbert Bifet
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
A Short Course in Data Stream Mining
A Short Course in Data Stream MiningA Short Course in Data Stream Mining
A Short Course in Data Stream MiningAlbert Bifet
 
Multi-label Classification with Meta-labels
Multi-label Classification with Meta-labelsMulti-label Classification with Meta-labels
Multi-label Classification with Meta-labelsAlbert Bifet
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themAlbert Bifet
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.Albert Bifet
 
Efficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive WindowsEfficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive WindowsAlbert Bifet
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real TimeAlbert Bifet
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsAlbert Bifet
 
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsPAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsAlbert Bifet
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataAlbert Bifet
 
Leveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsLeveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsAlbert Bifet
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsAlbert Bifet
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsAlbert Bifet
 
MOA : Massive Online Analysis
MOA : Massive Online AnalysisMOA : Massive Online Analysis
MOA : Massive Online AnalysisAlbert Bifet
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streamsAlbert Bifet
 

Mais de Albert Bifet (20)

Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache Flink
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
A Short Course in Data Stream Mining
A Short Course in Data Stream MiningA Short Course in Data Stream Mining
A Short Course in Data Stream Mining
 
Multi-label Classification with Meta-labels
Multi-label Classification with Meta-labelsMulti-label Classification with Meta-labels
Multi-label Classification with Meta-labels
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid them
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.
 
Efficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive WindowsEfficient Data Stream Classification via Probabilistic Adaptive Windows
Efficient Data Stream Classification via Probabilistic Adaptive Windows
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real Time
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data Streams
 
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsPAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming Data
 
Leveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsLeveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data Streams
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data Streams
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
MOA : Massive Online Analysis
MOA : Massive Online AnalysisMOA : Massive Online Analysis
MOA : Massive Online Analysis
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streams
 

Mining Big Data in Real Time

  • 1. Mining Big Data in Real Time Albert Bifet Turing/SLAIS 2012 Conference
  • 2. BIG Data BIG DATA Measure and React
  • 3. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  • 4. Motivation Memory unit Size Binary size kilobyte (kB/KB) 103 210 megabyte (MB) 106 220 gigabyte (GB) 109 230 terabyte (TB) 1012 240 petabyte (PB) 1015 250 exabyte (EB) 1018 260 zettabyte (ZB) 1021 270 yottabyte (YB) 1024 280 Data is growing
  • 5. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  • 6. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  • 7. Motivation Source: IDC’s Digital Universe Study (EMC), June 2011 Data is growing
  • 8. Streaming Data Big Data & Real Time
  • 9. Big Data McKinsey Global Institute (MGI) Report on Big Data, 2011. Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
  • 10. Big Data McKinsey Global Institute (MGI) Report on Big Data, 2011. Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
  • 11. BIG Data Volume Variety Velocity 3 Vs
  • 12. Methodology Sampling and distributed systems
  • 13. Methodology Paolo Boldi Facebook Four degrees of separation Big Data does not need big machines, it needs big intelligence
  • 14. Real time analytics We want to analyze what is happening now.
  • 15. Real time analytics We want to analyze what is happening now.
  • 16. Time and Memory Number 8 Wire Mentality Time and memory are the resource dimensions of the process.
  • 17. Time and Memory Time and memory are the resource dimensions of the process.
  • 18. Algorithms Classification, Regression, Clustering, Frequent Pattern Mining.
  • 19. Applications sensor data: industry, cities telecomm data social networks: twitter, facebook, yahoo marketing: sales business Data may come from: humans, sensors, or machines.
  • 20. New applications: social networks Twitter: A Massive Data Stream Micro-blogging service Built to discover what is happening at any moment in time, anywhere in the world. 3 billion requests a day via its API. MOA-TweetReader: a real-time system to read tweets in real time detect changes find the terms whose frequency changed
  • 21. Sentiment Analysis on Twitter Sentiment analysis Classifying messages into two categories depending on whether they convey positive or negative feelings Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification Positive Emoticons Negative Emoticons :) :( :-) :-( :) :( :D =) Table : List of positive and negative emoticons.
  • 22. New problem: structured classification New methods for structured classification D D B B B → C C C C A sequences, trees, graphs
  • 23. New problem: structured classification New methods for structured classification D D D D B , B B B , B → C C C C C A sequences, trees, graphs frequent pattern mining techniques
  • 24. New problem: structured classification New methods for structured classification D D D D B , B B B , B → C C C C C A a,b → class1, class2 sequences, trees, graphs frequent pattern mining techniques multi-label data mining Example: Lord of the Rings → Action, Adventure, Fantasy
  • 25. New Techniques: Distributed Systems Hadoop, S4 and Storm
  • 26. Hadoop Hadoop
  • 27. Hadoop Hadoop architecture
  • 28. Apache Mahout Mahout: open source framework
  • 29. Pig Pig: Similar to SQL
  • 30. Pig A = LOAD ’data’ USING PigStorage() AS (f1:int, f2:int, f3:int); B = GROUP A BY f1; C = FOREACH B GENERATE COUNT ($0); DUMP C; Pig: Similar to SQL
  • 31. Apache S4 Apache S4
  • 33. Storm Storm from Twitter
  • 34. Storm Stream, Spout, Bolt, Topology
  • 35. Storm Tools ElephantDB, Voldemort All Hadoop Precomputed batch view data Storm Query Precomputed realtime view New data stream Storm Cassandra, Riak, HBase Kafka “Lambda Architecture” Runaway complexity in Big Data Nathan Marz, 2012
  • 36. Data Streams Big Data & Real Time
  • 37. Data Streams Thanks!