Demystify big data data science

An overview of the shift to Data Science Platforms

The 3 critical components of a Data Science platform

Industries that are most likely to get disrupted and shift to Data Science

Characteristics of firms that get left behind the Data Science wave

Factors that push an industry towards Data Science

A brief overview of aspects of platform architecture beyond technology


  1. 1. Demystify
  2. 2. What we will cover in the 60 mins: 1. Technology Basics 2. Big Data Overview & Snapshot 3. Big Data Architecture: Deep Dive 4. Hadoop Overview 5. Clear Understanding of Data Science 6. Big Data Career Opportunities 7. Q & A
  3. 3. Apart from that we will also cover … • An overview of the shift to Data Science Platforms • The 3 critical components of a Data Science platform • Industries that are most likely to get disrupted and shift to Data Science • Characteristics of firms that get left behind the Data Science wave • Factors that push an industry towards Data Science • A brief overview of aspects of platform architecture beyond technology
  4. 4. Who am I? • Mahesh Kumar CV is a Big Data entrepreneur • About 14 years of experience in architecting and developing distributed and real-time data-driven systems • Specialties: translating big data into action, Big Data trainings, product engineering services, and building Big Data CoEs and Big Data incubators • Has written more than 60 blogs on Big Data and SAP Analytics • Previously worked with IBM, Mindtree, CSC and Rolta • Has conducted a couple of boot camps and workshops at different companies
  5. 5. Data vs. Information • Data refers to a collection of numbers and characters, and is a relative term • Data is raw: facts, figures, etc. • Information is processed data
  6. 6. Structured Data vs. Unstructured Data
  7. 7. So where is this data getting generated?
     • Social networking and media: 700 million Facebook users, 250 million Twitter users, 175+ million public blogs. Each Facebook update, tweet, blog post and comment creates multiple new data points: structured, semi-structured and unstructured.
     • Mobile devices: 5 billion mobile phones in use worldwide. Each call, text and instant message is logged as data. Mobile devices, particularly smartphones and tablets, also make it easier to use social media.
     • Internet transactions: billions of online purchases, stock trades and other transactions happen every day, including countless automated transactions. Each creates a number of data points collected by retailers, banks, credit cards, credit agencies and others.
     • Networked devices and sensors: electronic devices of all sorts, including servers and other IT hardware, smart energy meters and temperature sensors, all create semi-structured log data that record every action.
  8. 8. Build Vs Buy [Figure: sources of big data. Human driven: email, web logs, documents, social. Machine driven: satellite images, bio-informatics, M2M, log files, sensors, video, audio. Business driven: OLTP. Scale across all data types: 1X, 10X, 100X, from Big Data today to Big Data tomorrow.]
  9. 9. Defining Big Data: "Any amount of data that's too BIG to be handled by one computer" (John Rauser)
  10. 10. Why Big Data?
     • 12 TB of tweets in a day
     • 80% of the world's data is unstructured
     • 30 billion pieces of content shared on Facebook every month
     • Data expected to reach 35 ZB by 2020
     • 5 million trade events per second
     • 2,267 million Internet users
     • 4.7 billion searches on Google per day
     • 5 billion people tweet, text, call and browse on mobile phones daily
     • Walmart handles 1 million transactions per hour
     • 255 million websites
  11. 11. Big Data Reference Architecture [Diagram, by layer]
     • Structured data sources: legacy applications, ERP and CRM applications, external feeds
     • Unstructured/semi-structured data sources: web logs, application/network logs, social, chat transcripts, emails; instrumentation data/sensors, RFID, telematics, time and location data
     • Data integration (batch / near real-time): data extraction, data cleaning and transformation, change data capture for structured data, real-time streaming/integration
     • Data repositories: ODS, data warehouse, DW appliances, data marts, MOLAP cube, in-memory databases, columnar databases, unstructured/semi-structured data, MDM
     • Analytics: data mining and exploration, predictive analytics, text analytics
     • End user: reports/dashboards, scorecards and metrics, events and alerts, visual exploration, mobile BI
  12. 12. Big Data Reference Architecture: SAP [Diagram: the reference architecture layers (data sources, data integration, data repositories, MDM, analytics, end user) annotated with SAP products. Products named: SAP HANA, SAP BW, Sybase IQ, Sybase RDS, Rapid Marts, SAP MDM, SAP BO Data Services, BO CMS, BO WebI, Crystal Reports, BO dashboard, BO Mobile, SAP Lumira, SAP Predictive Analysis, Hadoop platform, 3rd-party tools.]
  13. 13. Big Data Reference Architecture: IBM [Diagram: the reference architecture layers (data sources, data integration, data repositories, MDM, analytics, end user) annotated with IBM products. Products named: InfoSphere Information Server, InfoSphere Streams, InfoSphere CDC, InfoSphere MDM, InfoSphere Data Explorer, Big Insights (with Streams, NoSQL, HBase), PureData (Netezza, InfoSphere Warehouse, ISAS), Cognos Business Intelligence Enterprise, Cognos TM1, Cognos Mobile, SPSS Premium, SPSS Content Analytics, Analytics Sandbox.]
  14. 14. Big Data Reference Architecture: Oracle [Diagram: the reference architecture layers (data sources, data integration, data repositories, MDM, analytics, end user) annotated with Oracle products. Products named: Data Integrator, Golden Gate, Silver Creek, Exadata, Exadata EHCC / HBase, Big Data Appliance, Oracle MDM, Essbase / Hyperion, Exalytics, Oracle R Ent., Endeca, Real-time Decisions, BI Publisher, OBI Foundation Suite, OBI Scorecard, OBI Mobile, Analytics Sandbox.]
  15. 15. Big Data Reference Architecture: Informatica + EMC + SAS [Diagram: the reference architecture layers (data sources, data integration, data repositories, MDM, analytics, end user) annotated with vendor products. Products named: Informatica PowerCenter & Data Quality, Informatica PowerCenter Real-time edition, Informatica hParser / Hadoop Pwx, Informatica MDM, EMC GreenPlum Database, EMC GreenPlum UAP, EMC GreenPlum HD, HBase, SAS BI, SAS Visual Analytics, SAS Visual BI, SAS OLAP Server, SAS Ent. Miner, SAS Strategy Mgt, SAS Text Miner, JMP Pro, Analytics Sandbox.]
  16. 16. Big Data Reference Architecture: Open Source Technologies [Diagram: the reference architecture layers (data sources, data integration, data repositories, MDM, analytics, end user) annotated with mostly open-source tools, plus a "Commercial Product" legend. Tools named: Apache MapReduce, Pig, Talend Data Integration & Data Quality, Talend MDM, Apache Flume, Apache Hadoop, Apache HDFS + R, HBase, NoSQL, MySQL, Apache Hive, Apache Derby, R, Apache Mahout, Pentaho Business Analytics / BI, Pentaho Mobile BI, SAS OLAP Server, SAS Text Miner, Analytics Sandbox.]
  17. 17. What is Hadoop? • A framework for large-scale data processing • Inspired by Google's architecture • A top-level Apache project; Hadoop is open source • Written in Java, plus a few shell scripts • Supports data-intensive distributed applications • Abstracts and facilitates the storage and processing of large and rapidly growing data sets • Handles structured and unstructured data • Simple programming models
  18. 18. 2 key components of Core Hadoop
  19. 19. Hadoop is used everywhere • Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search • AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and 800 GB hard disk, giving a total of 37 TB HDFS capacity • Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB raw storage • FOX Interactive Media: 3 x 20-machine clusters (8 cores/machine, 2 TB/machine storage); 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining and machine learning • NetSeer: up to 1000 instances on Amazon EC2; data storage in Amazon S3; used for crawling, processing, serving and log analysis • Powerset / Microsoft: natural language search; up to 400 instances on Amazon EC2; data storage in Amazon S3
  20. 20. HDFS: High-level architecture • HDFS follows a master-slave architecture • 2 major daemons in HDFS: NameNode and DataNode • Master: NameNode • Responsible for namespace and metadata • Namespace: file hierarchy • Metadata: ownership, permissions, block locations, etc. • Slave: DataNode • Responsible for storing actual data blocks
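The split of responsibilities on this slide can be illustrated with a toy, in-memory sketch. It is not the HDFS API: the class names mirror the slide, but `BLOCK_SIZE`, `allocate`, `locate`, `put` and `get` are hypothetical names invented for the illustration, blocks are placed round-robin, and replication is left out entirely.

```python
BLOCK_SIZE = 4  # bytes per block; tiny on purpose (assumption for the demo)


class DataNode:
    """Slave: stores the actual data blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: holds namespace and block locations, never file contents."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.block_locations = {}  # path -> [(block_id, datanode_name)]

    def allocate(self, path, num_blocks):
        # Round-robin placement (a simplification; no replication).
        placements = [
            (f"{path}#{i}", self.datanode_names[i % len(self.datanode_names)])
            for i in range(num_blocks)
        ]
        self.block_locations[path] = placements
        return placements

    def locate(self, path):
        return self.block_locations[path]


def put(namenode, datanodes, path, data):
    """Client write: ask the master where blocks go, then send them to slaves."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placements = namenode.allocate(path, len(chunks))
    for (block_id, node_name), chunk in zip(placements, chunks):
        datanodes[node_name].store(block_id, chunk)


def get(namenode, datanodes, path):
    """Client read: look up block locations, then fetch blocks in order."""
    return b"".join(
        datanodes[name].read(block_id) for block_id, name in namenode.locate(path)
    )


datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
namenode = NameNode(list(datanodes))
put(namenode, datanodes, "/logs/a.txt", b"hello hdfs!")
restored = get(namenode, datanodes, "/logs/a.txt")
```

The point of the sketch is that the master only ever sees paths and block ids, while the bytes live on the slaves.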
  21. 21. MapReduce: High-level architecture • MapReduce has a master-slave architecture too • 2 daemon processes • Master: Job Tracker • Responsible for dividing, scheduling and monitoring work • Slave: Task Tracker • Responsible for actual processing
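The programming model behind this architecture can be sketched with the classic word count, run in a single process. This is an illustration only and not Hadoop's Java API; the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are made up for the example. In a real cluster the Job Tracker would divide the input into tasks and the Task Trackers would execute them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    # Everything runs locally here; a cluster would spread the map and
    # reduce calls over many worker machines.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["big data big insights", "data science"])
```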
  22. 22. High Level View
  23. 23. Apache Hadoop Ecosystem
  24. 24. Disruptions
  25. 25. 1. Japanese dating app
  26. 26. 2. Heart implants
  27. 27. 3. MOOC
  28. 28. Sensor-equipped cows in the Netherlands
  29. 29. Google's autonomous car
  30. 30. What's common to the following game-changing solutions? • Japanese dating app • Sensor-equipped cows in the Netherlands • Google's autonomous car • MOOC • Heart implants
  31. 31. At the core there is a deeply embedded DATA PRODUCT!
  32. 32. Created by DATA SCIENCE! Conquer the world! Become a Data Scientist
  33. 33. The world around us is changing… • How our health gets cared for • How we learn • How we fall in love • How we do farming • How we drive • Our lives are intimately surrounded by data products (an intimate fabric of our lives)
  34. 34. How did the following players disrupt the marketplace? • Amazon defeated Borders (books) • Netflix defeated Blockbuster (video) • iTunes defeated Tower Records (music) • Google defeated Yahoo (search), with the PageRank algorithm
  35. 35. If Data Science is not integral you are no longer in the game
  36. 36. Demystifying Data Science (in simple, plain, everyday English)
  37. 37. In a Nutshell • Data Science is the extraction of knowledge from data • Data Science is the art of turning data into actions • The ability to take data: to understand it, process it, extract value from it, visualize it, and communicate it • Data Science seeks to • Extract meaning from data • Create "Data Products" • Use all available data to tell a valuable story to non-practitioners • The future belongs to the companies and people that turn data into products
  38. 38. Data Science is everywhere
  39. 39. Data Science vs. BI • Known unknowns (BI) • Unknown unknowns (Data Science): lots of $-impacting patterns, unnoticed, waiting to be discovered!
  40. 40. "As is" state in most organizations: data (sales, finance) → reports (BO, Cognos, MSAS)
  41. 41. "As is" state with leading game changers: data repository; insights; analytics cell + modeling processes (segment, score, text mine) • Move from reports → insightful actions that impact
  42. 42. What are the 4 core differences between Data Science and dashboards? • Dashboards: data repository → dashboards • Data Science: data repository (purchase habits) → ML process (collaborative filtering) → signal (similar people discovery) → actions (recommend a product) → outcomes (improve cross sell) • ML + Signals + Actions = Game Changing Outcomes
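The chain on this slide (purchase habits, collaborative filtering, similar-people signal, product recommendation) can be sketched in a few lines. This is a minimal user-based variant under stated assumptions: similarity is plain Jaccard overlap of purchase sets, and the shopper names, products and the `recommend` helper are all invented for the illustration.

```python
def jaccard(a, b):
    """Similarity of two purchase sets: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(purchases, user, top_n=2):
    """Score products the user lacks by the similarity of shoppers who own them."""
    own = purchases[user]
    scores = {}
    for other, items in purchases.items():
        if other == user:
            continue
        similarity = jaccard(own, items)  # signal: similar people discovery
        for product in items - own:       # only products the user has not bought
            scores[product] = scores.get(product, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, score in ranked[:top_n] if score > 0]


# Data repository: purchase habits (hypothetical data).
purchases = {
    "asha": {"phone", "case", "charger"},
    "ben": {"phone", "case", "earbuds"},
    "chen": {"novel", "lamp"},
}
suggestions = recommend(purchases, "asha")  # action: recommend a product
```

Here "ben" shares two of four distinct items with "asha", so his extra purchase is the only suggestion; "chen" has nothing in common and contributes no signal.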
  43. 43. What exactly is a model? • A mathematical definition of a real-world phenomenon • Representative of the real world • For example, a cross-sell model
  44. 44. What are 3 common things between predictive models and caricatures? • It's an approximation, not a perfection • It's better than not having anything • It gets the job done [Figure labels: real world, analytical model]
  45. 45. What's the goal of Data Science? Use data to discover signals (patterns) that cause changes that impact $.
  46. 46. Data Science Reference Architecture: key components [Diagram labels: Hadoop, Hive, HANA, Infobright; clustering, text mining; mobile, digital; data ingestion pipeline]
  47. 47. Machine Learning Reference Architecture • 1. STORE (Hadoop, Hive, HANA, Cloudera, Splunk, Hortonworks) • 2. SENSE (signal extraction: text mining, scoring models) • 3. RESPOND (front-line actions through website, call centre)
  48. 48. Snapshot of Machine Learning Techniques
     • 1. Segmentation: customer behavior segmentation; defect segmentation; employee segmentation model; supplier segmentation model; "chunking" groups; discovered by algorithm
     • 2. Text mining: convert messy unstructured text into actionable signals; keyword frequencies; sentiment ratios; blogs; call center transcripts; emails; multi-channel sentiment analysis
     • 3. Forecasting: predict CLTV; predict sales at a neighborhood outlet; predict salary based on experience, qualification, rating, market demand; identify drivers of behavior; weights processing
     • 4. Visual analytics: beyond line, bar and pie charts; geospatial modeling to see geo correlation; spread analysis; outlier detection
     • 5. Scoring models: churn propensity; cross sell; attrition modeling in HR; risk scoring models in banking; logistic; neural networks; decision trees; support vector machines
     • 6. Optimisation: constraint modeling; maximize an outcome; maximize sales without cannibalizing sister brands
  49. 49. It's all about DETECTING PATTERNS!
  50. 50. 1. Segmentation
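As an illustration of segments being "discovered by algorithm", here is a minimal k-means sketch in plain Python. The deck does not say which algorithm it uses, so k-means is an assumption, as are the customer features (visits per month, average spend) and the naive choice of the first k points as starting centroids.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k segments by repeatedly re-centering on cluster means."""
    centroids = list(points[:k])  # naive, deterministic start (assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            # Assign each customer to the nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend)
customers = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 220), (11, 210)]
centroids, clusters = kmeans(customers, k=2)
```

On this toy data the algorithm separates occasional low spenders from frequent high spenders without being told that those groups exist.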
  51. 51. 2. Unstructured Text Mining
  52. 52. Real-world unstructured text mining in health care
     • Sources: doctors' transcripts, lab diagnostics, nurses' observations
     • Step 1, SPLIT: split sentences into words/tokens
     • Step 2, FILTER: filter "noise" words, e.g. I, the, is, was
     • Step 3, STEMMING: 'Pulmonary' = 'pulmonar'; 'Insomnia' = 'Sleep' = 'Sleeplessness'
     • Step 4, THEME EXTRACTION: keyword extraction and theme generation
     • Step 5, THEME / KEYWORD ANALYSIS
     • Watch lists: cardiac, oncology, pulmonary, diabetic, schizophrenia
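The five steps above can be sketched as a small pipeline. Everything beyond the slide's own examples is an assumption made for the illustration: the stop-word list, the crude suffix-stripping rules, the `THEMES` mapping and the sample note are all invented, and a real project would use a proper stemmer and a clinical vocabulary.

```python
import re
from collections import Counter

STOP_WORDS = {"i", "the", "is", "was", "a", "of", "and", "has", "with"}
SYNONYMS = {"insomnia": "sleep", "sleeplessness": "sleep"}  # from the slide
SUFFIXES = ("ness", "ing", "es", "y", "s")  # crude illustrative rules
THEMES = {"pulmonar": "pulmonary", "cardiac": "cardiac"}  # hypothetical mapping


def split(text):
    """Step 1: split sentences into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_noise(tokens):
    """Step 2: drop noise words."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Step 3: normalise synonyms, then strip one common suffix."""
    token = SYNONYMS.get(token, token)
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def extract_themes(text):
    """Steps 4 and 5: count keywords, then map them to watch lists."""
    keywords = Counter(stem(token) for token in filter_noise(split(text)))
    watch_lists = sorted({THEMES[word] for word in keywords if word in THEMES})
    return keywords, watch_lists


note = "Patient has pulmonary congestion and insomnia. Sleeplessness is worsening."
keywords, watch_lists = extract_themes(note)
```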
  53. 53. 3. Scoring Models
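A scoring model such as churn propensity can be sketched with logistic regression, one of the techniques the deck lists under scoring models. This is a minimal sketch trained by stochastic gradient descent in plain Python; the features (complaints last quarter, years as a customer), the toy data, and the learning rate and epoch count are assumptions chosen for the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Propensity score between 0 and 1 for one customer."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, features)))


def fit_logistic(rows, labels, learning_rate=0.05, epochs=2000):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            error = score(weights, bias, features) - label
            bias -= learning_rate * error
            weights = [w - learning_rate * error * v
                       for w, v in zip(weights, features)]
    return weights, bias


# Hypothetical history: (complaints last quarter, years as customer) -> churned?
rows = [(0, 5), (1, 4), (0, 6), (4, 1), (5, 1), (3, 2)]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(rows, labels)

risky = score(weights, bias, (5, 1))  # many complaints, short tenure
loyal = score(weights, bias, (0, 6))  # no complaints, long tenure
```

The output is a ranking signal: customers with the highest scores are the ones a retention team would contact first.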
  54. 54. 4. Forecasting !
  55. 55. 5. Recommenders
  56. 56. Industries disrupted by Data Science • Telecom: infrastructure optimisation, network security • Banking: customer sentiment, multi-channel analysis • Digital channel: consumer engagement, recommendation engines • Automotive: autonomous cars, OnStar • Health care: wearables • Oil and gas: operations optimisation • Retail: digitisation
  57. 57. What factors are driving companies towards data science? • Competitive advantage in the marketplace (get ahead fast using unique insights) • Existential threat (others are moving ahead fast and I need to catch up) • Revenue enhancement (cross-sell models, recommenders) • Cost optimisation (operational efficiency)
  58. 58. Technology behind Data Science: algorithms, machine learning, predictive analytics, R
  59. 59. Why is Big Data HOT ?
  60. 60. Big Data jobs are Exploding!
  61. 61. Data Science jobs are Exploding!
  62. 62. Data Science Jobs exploding in India too !
  63. 63. 1 2 3
  64. 64. Transform yourself to 21st Century Skills
  65. 65. The 6 Most Desired Skills in 2015
  66. 66. To summarize: 3 key takeaways …
  67. 67. FAQ
  68. 68. FAQ-1: "I am confused between Hadoop and Data Science … What's the difference between Hadoop and Data Science?" • Hadoop = data infrastructure layer • Data Science = sensing patterns from data to impact business outcomes
  69. 69. FAQ-2: "I have worked on SAP, Oracle, etc. How do I transition to becoming a Data Scientist?" • Execute your first Data Science pilot • Step 1: Learn R • Step 2: Zero in on a business problem to solve • Step 3: Set up the connector between R and your technology … get access to data from your technology • Step 4: Apply an analytical construct (VEDA ML) • Step 5: Discover the pattern that impacts the outcome • Step 6: Present final results to the executive business team • Explore setting up a Data Science project within your existing organisation • Attend meetups to explore the outside world
  70. 70. FAQ-3: “Should I know probability and advanced statistics ?” • Not really • We are focussed on APPLICATION and not THEORY underpinning it • We will teach you • Business problem to solve • How to execute the command on a platform • What to look for in the output • What happens within the black box can be seen later
  71. 71. FAQ-4: "This is a big shift for me … In your experience how long does it take to make the transition from IT to Data Science?" • We have seen people make the transition in anywhere from 4 weeks to about 6 months • It depends upon the time + passion + drive you have
  72. 72. FAQ-5: "How are we going to prepare you for the data science job market?" 1. Mock preparatory sessions 2. Worksheets + modelling checklists + Data Science playbooks 3. Live projects on clustering and scoring that can go on your resume 4. Our strategic tie-ups with organisations looking for data science skills 5. Top 30 practitioner-generated Data Science questions
  73. 73. FAQ-6: "I am not an IT professional but a domain person. How can I get started?" 1. Option 1: Focus on industry use cases 2. Option 2: Take a basic introduction to data science
  74. 74. Big Data Resources • datasciencecentral.com • bigdatauniversity.com • Courseera.com • Big Data Architecture • Spotting Signals in Big Data • Signal Extraction Methodology • Advanced Visualization in Big Data • Exploratory Data Analysis (EDA): Quick Deep Dive • Best practices in designing dashboards and scorecards • Exploring Big Data Using Bivariate Analysis • Where to start looking in Big Data using Univariate Analysis • Big Data Platform & Applications • Statistics Role in Data Science • Applied Mathematics Role in Data Science • Data-Scientist-playbook • 5-disruption-data-products By Data Science
  75. 75. All the best! Happy Hadooping & dating with Data Science. Conquer the world! Become a Data Scientist
