SlideShare uma empresa Scribd logo
1 de 18
June 2013
BIG DATA SCIENCE: A PATH FORWARD
CONFIDENTIAL | 2
linkedin.com/in/danmallinger/
@danmallinger
www.thinkbiganalytics.com
 Data Science Lead @ Think Big
 Product/Brand Obsessive
 Teacher
 Occasional Engineer
CONFIDENTIAL | 3
TODAY
• High level exploration of the
• skills, tools, and techniques
• needed to achieve early success
• and to help you build
• your data science practice.
CONFIDENTIAL | 4
 Understand our organizational needs for data science
 Infrastructure: Technological tools and platforms.
 Talent: Staff hired and trained.
 Capabilities: Data science techniques utilized.
INFRASTRUCTURE, TALENT, & CAPABILITIES
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce
Data
Exploration
Basic Modeling PhD Math
Visualization Clustering Categorization
Continuous
Models
Text Analysis
CONFIDENTIAL | 5
 Boxed Solutions: Mahout & Platform
 Toolkits: RHadoop, Scikit, etc.
 You will need toolkits to solve unique problems
 but smart techniques make that easier.
 Boxed solutions are limited
 but can be a good source of early velocity.
ANALYTICS TOOLS
CONFIDENTIAL | 6
 Gigabytes from Stackoverflow
 Questions from users
 With metadata
 Users have reputations
 Questions open or closed
 Follow along
 Thinking about your data
 To learn in a
 Familiar context and
 Plan
DATA
Presenter Audience
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
CONFIDENTIAL | 7
select count(1) as total
, sum(has_code)
, avg(body_count)
, stddev_samp(body_count)
, corr(reputation,
owner_questions)
,
histogram_numeric(body_count, 10)
from questions
;
STEP 1: EXPLORE
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
Patterns through Hive Patterns through Tableau
CONFIDENTIAL | 8
 Summaries of unstructured
data
 Time-since metrics
select transform(…)
using ‘python …’
 Clustering: Browsing cohorts
/bin/mahout canopy
STEP 2: FEATURE BUILDING
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
SQL Windowing Cross-Record Features
CONFIDENTIAL | 9
• Sample (don’t parallelize)
• Naturally parallel
• SVD
• Random Forests
• Estimators and Ensembles
• Bootstrapping
• Localizing
• Advanced Parallelization
• Linear models with SGD
• Neural networks
PARALLEL MODELS IN HADOOP
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
CONFIDENTIAL | 10
 Single R model
 run many times
 over samples
 and aggregated
m <- C5.0(status ~ …)
STEP 3: STRUCTURED MODEL (BAGGING)
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
Mapper 1:
Define n reducer keys
Send any record to reducer I with
probability p
Reducer 1:
Key: Id of sample
Value: List of records
Perform analysis over records
Reducer 2:
Key: One
Value: List of models
Aggregate the models (e.g. average)
Bagging a Model
CONFIDENTIAL | 11
WHERE ARE WE?
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
 We’ve created a structured model
 to flag questions that won’t be closed
 using Big Data.
 But we haven’t used unstructured data.
CONFIDENTIAL | 12
TEXT ANALYSIS
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
• Is “the big dog” really different from “dog is big?”
• How about “I like eggs but hate tofu” and “I hate eggs but like tofu?”
• Language has lexical and syntactical features
• Different techniques leverage these in different ways
 Bag of Words: Structure doesn’t matter
 n-gram: Structure matters (but not that much)
 Feature Extraction: BACON! BACON! BACON!
CONFIDENTIAL | 13
STEP 4: UNSTRUCTURED MODEL
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
 Similar to Hadoop’s Word
Count
 Create counts for
token/category pairs
 Use counts to calculate
Information Gain
MR Job 1:
Calculate information gain (IG) for all
tokens.
MR Job 2:
Select tokens with largest IG.
Create structured data for record, tokens:
question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3:
Build a classifier over the newly structured
data (prior slides)
Information Gain
CONFIDENTIAL | 14
WHERE ARE WE?
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
 We’ve created two models
 One structured,
 one unstructured.
 But they don’t work together.
CONFIDENTIAL | 15
STEP 5: ENSEMBLE MODEL
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
 Join many models together
 By using their output
 As input to ensemble model.
 Best when models perform
differently
 Exploit differences with
nonlinearities
 Like interaction effects.
Ensembling
Mapper 1:
Load multiple models
Score the models per record and output
Reducer 1:
Key: Id of record
Value: List of model outputs
Join model outputs to make new records
MR Job 2:
Build a model over the output data as if it
was raw data.
CONFIDENTIAL | 16
 We’ve created two models:
 one structured,
 one unstructured
 and have ensembled them
 to create a single, powerful model
 and solve a practical business problem.
WHERE ARE WE?
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
CONFIDENTIAL | 17
 This required simple infrastructure
 a blend of analysis and scripting skills
 an understanding of BIG data science techniques
 but not a team of PhDs or a billion dollars.
HOW DID WE GET HERE?
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Text Analysis
CONFIDENTIAL | 18
Questions?
www.thinkbiganalytics.com
@danmallinger

Mais conteúdo relacionado

Mais procurados

GraphTour 2020 - Customer Journey with Neo4j Services
GraphTour 2020 - Customer Journey with Neo4j ServicesGraphTour 2020 - Customer Journey with Neo4j Services
GraphTour 2020 - Customer Journey with Neo4j Services
Neo4j
 
SpeedTrack Tech Overview 2015
SpeedTrack Tech Overview 2015SpeedTrack Tech Overview 2015
SpeedTrack Tech Overview 2015
Michael Zoltowski
 

Mais procurados (20)

Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
 
Driving Datascience at scale using Postgresql, Greenplum and Dataiku - Greenp...
Driving Datascience at scale using Postgresql, Greenplum and Dataiku - Greenp...Driving Datascience at scale using Postgresql, Greenplum and Dataiku - Greenp...
Driving Datascience at scale using Postgresql, Greenplum and Dataiku - Greenp...
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
 
GraphTour 2020 - Opening Keynote
GraphTour 2020 - Opening KeynoteGraphTour 2020 - Opening Keynote
GraphTour 2020 - Opening Keynote
 
GraphTour 2020 - Customer Journey with Neo4j Services
GraphTour 2020 - Customer Journey with Neo4j ServicesGraphTour 2020 - Customer Journey with Neo4j Services
GraphTour 2020 - Customer Journey with Neo4j Services
 
Big data perspective solution & technology
Big data perspective solution & technologyBig data perspective solution & technology
Big data perspective solution & technology
 
GraphTour 2020 - Customer Journey with Neo4j Services
GraphTour 2020 - Customer Journey with Neo4j ServicesGraphTour 2020 - Customer Journey with Neo4j Services
GraphTour 2020 - Customer Journey with Neo4j Services
 
Enterprise ready: a look at Neo4j in production
Enterprise ready: a look at Neo4j in productionEnterprise ready: a look at Neo4j in production
Enterprise ready: a look at Neo4j in production
 
CI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. HuntCI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. Hunt
 
GraphTour 2020 - Neo4j: What's New?
GraphTour 2020 - Neo4j: What's New?GraphTour 2020 - Neo4j: What's New?
GraphTour 2020 - Neo4j: What's New?
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...
 
Data & Analytics at Scale
Data & Analytics at ScaleData & Analytics at Scale
Data & Analytics at Scale
 
SpeedTrack Tech Overview 2015
SpeedTrack Tech Overview 2015SpeedTrack Tech Overview 2015
SpeedTrack Tech Overview 2015
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Building up a Data Science Team from Scratch
Building up a Data Science Team from ScratchBuilding up a Data Science Team from Scratch
Building up a Data Science Team from Scratch
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
 
Intelligence Demo – Illustrating the Value of Your Connected Data
Intelligence Demo – Illustrating the Value of Your Connected DataIntelligence Demo – Illustrating the Value of Your Connected Data
Intelligence Demo – Illustrating the Value of Your Connected Data
 
PASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureMLPASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureML
 

Destaque

January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
Yahoo Developer Network
 

Destaque (10)

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Apachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to knowApachecon Europe 2012: Operating HBase - Things you need to know
Apachecon Europe 2012: Operating HBase - Things you need to know
 
Dan Mallinger – Data Science Practice Manager, Think Big Analytics at MLconf ATL
Dan Mallinger – Data Science Practice Manager, Think Big Analytics at MLconf ATLDan Mallinger – Data Science Practice Manager, Think Big Analytics at MLconf ATL
Dan Mallinger – Data Science Practice Manager, Think Big Analytics at MLconf ATL
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop cluster
 
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Predictive Analytics using R
Predictive Analytics using RPredictive Analytics using R
Predictive Analytics using R
 

Semelhante a BIG Data Science: A Path Forward

Semelhante a BIG Data Science: A Path Forward (20)

A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Multiplaform Solution for Graph Datasources
Multiplaform Solution for Graph DatasourcesMultiplaform Solution for Graph Datasources
Multiplaform Solution for Graph Datasources
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Python and data analytics
Python and data analyticsPython and data analytics
Python and data analytics
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Spark
SparkSpark
Spark
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 

Último

Último (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

BIG Data Science: A Path Forward

  • 1. June 2013 BIG DATA SCIENCE: A PATH FORWARD
  • 2. CONFIDENTIAL | 2 linkedin.com/in/danmallinger/ @danmallinger www.thinkbiganalytics.com  Data Science Lead @ Think Big  Product/Brand Obsessive  Teacher  Occasional Engineer
  • 3. CONFIDENTIAL | 3 TODAY • High level exploration of the • skills, tools, and techniques • needed to achieve early success • and to help you build • your data science practice.
  • 4. CONFIDENTIAL | 4  Understand our organizational needs for data science  Infrastructure: Technological tools and platforms.  Talent: Staff hired and trained.  Capabilities: Data science techniques utilized. INFRASTRUCTURE, TALENT, & CAPABILITIES Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Data Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Models Text Analysis
  • 5. CONFIDENTIAL | 5  Boxed Solutions: Mahout & Platform  Toolkits: RHadoop, Scikit, etc.  You will need toolkits to solve unique problems  but smart techniques make that easier.  Boxed solutions are limited  but can be a good source of early velocity. ANALYTICS TOOLS
  • 6. CONFIDENTIAL | 6  Gigabytes from Stackoverflow  Questions from users  With metadata  Users have reputations  Questions open or closed  Follow along  Thinking about your data  To learn in a  Familiar context and  Plan DATA Presenter Audience Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis
  • 7. CONFIDENTIAL | 7 select count(1) as total , sum(has_code) , avg(body_count) , stddev_samp(body_count) , corr(reputation, owner_questions) , histogram_numeric(body_count, 10) from questions ; STEP 1: EXPLORE Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis Patterns through Hive Patterns through Tableau
  • 8. CONFIDENTIAL | 8  Summaries of unstructured data  Time-since metrics select transform(…) using ‘python …’  Clustering: Browsing cohorts /bin/mahout canopy STEP 2: FEATURE BUILDING Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis SQL Windowing Cross-Record Features
  • 9. CONFIDENTIAL | 9 • Sample (don’t parallelize) • Naturally parallel • SVD • Random Forests • Estimators and Ensembles • Bootstrapping • Localizing • Advanced Parallelization • Linear models with SGD • Neural networks PARALLEL MODELS IN HADOOP Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis
  • 10. CONFIDENTIAL | 10  Single R model  run many times  over samples  and aggregated m <- C5.0(status ~ …) STEP 3: STRUCTURED MODEL (BAGGING) Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis Mapper 1: Define n reducer keys Send any record to reducer I with probability p Reducer 1: Key: Id of sample Value: List of records Perform analysis over records Reducer 2: Key: One Value: List of models Aggregate the models (e.g. average) Bagging a Model
  • 11. CONFIDENTIAL | 11 WHERE ARE WE? Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis  We’ve created a structured model  to flag questions that won’t be closed  using Big Data.  But we haven’t used unstructured data.
  • 12. CONFIDENTIAL | 12 TEXT ANALYSIS Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis • Is “the big dog” really different from “dog is big?” • How about “I like eggs but hate tofu” and “I hate eggs but like tofu?” • Language has lexical and syntactical features • Different techniques leverage these in different ways  Bag of Words: Structure doesn’t matter  n-gram: Structure matters (but not that much)  Feature Extraction: BACON! BACON! BACON!
  • 13. CONFIDENTIAL | 13 STEP 4: UNSTRUCTURED MODEL Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis  Similar to Hadoop’s Word Count  Create counts for token/category pairs  Use counts to calculate Information Gain MR Job 1: Calculate information gain (IG) for all tokens. MR Job 2: Select tokens with largest IG. Create structured data for record, tokens: question #4 | 0 | 1 | 0 | 1 | 1 MR Job 3: Build a classifier over the newly structured data (prior slides) Information Gain
  • 14. CONFIDENTIAL | 14 WHERE ARE WE? Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis  We’ve created two models  One structured,  one unstructured.  But they don’t work together.
  • 15. CONFIDENTIAL | 15 STEP 5: ENSEMBLE MODEL Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis  Join many models together  By using their output  As input to ensemble model.  Best when models perform differently  Exploit differences with nonlinearities  Like interaction effects. Ensembling Mapper 1: Load multiple models Score the models per record and output Reducer 1: Key: Id of record Value: List of model outputs Join model outputs to make new records MR Job 2: Build a model over the output data as if it was raw data.
  • 16. CONFIDENTIAL | 16  We’ve created two models:  one structured,  one unstructured  and have ensembled them  to create a single, powerful model  and solve a practical business problem. WHERE ARE WE? Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis
  • 17. CONFIDENTIAL | 17  This required simple infrastructure  a blend of analysis and scripting skills  an understanding of BIG data science techniques  but not a team of PhDs or a billion dollars. HOW DID WE GET HERE? Hadoop NoSQL Analytics SQL/MPP Real Time Scripting MapReduce Exploration Basic Modeling PhD Math Visualization Clustering Categorization Continuous Text Analysis