SlideShare uma empresa Scribd logo
1 de 19
Baixar para ler offline
Knowledge Discovery
& Data Mining
22nd ACM SIGKDD 2016
André Karpištšenko
~80 sessions for 2,700 participants
• Business Applications and Frameworks at Scale
• Data Streams Mining
• DashOpt features
• Outlier Detection
• Bayesian Optimization
• Deep Learning
• Investing into AI and Data
• Bonus keywords
88 countries, 35% YoY, 15-20% acceptance
Business Application Examples
• Consumer Internet focus: Content Ranking,
Recommendation, User Intent and Context Prediction
• Industrial Internet focus: Autonomy, Predictive
Maintenance, Operational Intelligence, Production Planning
• B2B focus: Targeting, Lead Generation, Sales Development,
Opportunity Management, Account Management
• Web content analytics: Image, Video, Text Classification for
Relevance, Products Categorization, Sentiments
• Other: Cyber Security, Fraud/Spam Detection, NLP, Speech
Recognition, Image/Video Recognition
Predictive Modeling Flow
DashOpt
Feature
Engineering
Raw
Data
Raw
Features
Labels
Feature
Integration
Features
with Labels
Data
Partitioning
Training
Data
Validation
Data
Testing
Data
Model Training
Evaluate for
model selection
Compute offline
evaluation metrics
Best model
Offline scoring
and indexing
Online/offline
systems
Online A/B test
Label
preparation Log data
Scoring
features
Raw features
Feature
integration
Model
Performance
Test Results
Data Technologies
• Most common: HDFS, MapReduce, Spark, Hive
• Decision spectrum: build, assemble, buy
• Factors: network, open-source, maturity, needs
Exploratory Production
Classic ML Platform at Scale
Workflows
HDFS Feature Mart Ground Truth Models Scores
MapReduce / Yarn
Workflow Scheduler & Manager
Workflows Workflows
Intelligence Engine
Metadata
Store
Pig, Hive,
Python, Scala,
Shell script, …
Feature Engineering
Libraries
Machine Learning
Libraries
Drivers
Drivers
Application 1 Application 2 Application N
…
http://github.com/szilard/benchm-ml
Baseline Performance Benchmark
https://sites.google.com/site/iotminingtutorial/
IoT Data Streams Mining
• 10 billion devices in 2015 -> 34 billion devices by 2020
• Continuous data, dynamic models, distributed, few seconds
Streams Mining: Actors Model
Data processing pipeline Distributed processing
Kappa Architecture
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Outlier Detection
• Single point anomaly detection: likelihood over distribution
• Finding anomalous groups: divergence estimation
• Methods: percentage change, T-test, Chi-square test, Generalized ESD (Extreme
Studentized Deviate) test, Seasonal Hybrid ESD, etc.
• Goal: move from detection to automated response
Outlier Detection in Practice
• Too many detections of too little value
• Use methods for thresholds
• Breakout detection and Concept Drift
• For changing distributions move baselines over time
• Risk of overfitting to known anomalies, not finding unknown anomalies
Bayesian aka Active Optimization
• Examples: Design of Experiments, hyper-parameters of
supervised learning, algorithms tested with simulations
f is an unknown expensive black-box function with the goal to
approximately optimize f with as few experiments as possible
• No free lunch theorem
• Other bio-inspired
algorithms for
optimization exploitation
and exploration: neural
networks, genetic
algorithms, swarm
intelligence, ant colony
optimisation, etc.
Bayesian Optimization in Practice
• SigOpt experience: 20 dimensions, above human capacity.
• Uber ATC experience: scaling active optimization to high
dimensions as the default works reliably for 5-7 dim.
• Variables are added during optimization.
• Choose fidelity using heuristics.
Deep Learning
Deep Learning
• Compute power, GPU, learning architectures and a lot of
labeled data are what drive DL
• Applied for Vision and Speech: matches human performance
• Not possible where experiments are costly: biotech
• Kaggle winners are not DL models: tree ensembles, SVMs
• Common technologies: TensorFlow, Caffe, Theano, Keras
• Thousands of pieces of software: modules and layers
• Explainability and interpretability are the next big things
• EU regulation. Tradeoff: accuracy vs explainability.
Deep Learning Trends
• Vision nets are deeper and structured (Larsson 2016)
• Language nets have also dynamics, memory and attention
(Rocktaschel 2016, Miller 2016)
• Probabilistic programming (Lake, Tenenbaum)
• Programs as networks (Riedel)
• The Neural Programmer and Interpreter for learning programs
(Reed et al 2016)
• Computation graphs interacting with memory
• Loop for reasoning for nested questions (Miller 2016)
• Generative adversarial networks (Reed 2016). Models capable of
imagining images, videos and text.
Investing into AI and Data
• Data acquisition, real-time detection and visualization not solved yet.
• Empower more people to do data science. Automate routines.
• Unsolved problems are learning from unlabeled data, planning,
reasoning, problem solving, concept formulation, 1/10k compute.
• Key decisions: Timing, accuracy in what is hard, find verticals and
focus, identify differentiation & size of the prize & people & partners
Business of outliers: 1% capital returns 526x, 48% returns 0
0
450
900
Q2'15 Q3'15 Q4'15 Q1'16 Q2'16
Peak Data
Peak ML
VC assessment:
Bonus Keywords
• Lifelong Machine Learning: systems approach, transfer learning,
never-ending learners. Useful for knowledge build.
• Graphons: graph convergence and limits through infinite number of
vertices. Useful for privacy preserving mining.
• Computational Social Science: how individuals interact to produce
collective behaviour. Individuals exert more effort by themselves than
groups.
• Information Security: trusted key management is most sensitive.
Secret must be changed frequently. Confidentiality easier to violate
than authenticity. Integrity. Offence more lucrative than defence.
• Enterprise Data: in reality “random data salad” prone to constant
change due to M&A, politics, dynamic schema DBs (e.g. Mongo), legacy
burden, restructuring, leadership changes, data hoarding. Machine
driven, human guided processes required.
Thinking about Value from Data Science

Mais conteúdo relacionado

Mais procurados

Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructurejoshwills
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Turi, Inc.
 
Introduction to Azure Machine Learning
Introduction to Azure Machine LearningIntroduction to Azure Machine Learning
Introduction to Azure Machine LearningPaul Prae
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create PyData
 
Dealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AIDealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AIData Products Meetup
 
Predicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AIPredicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AISri Ambati
 
Azure Machine Learning 101
Azure Machine Learning 101Azure Machine Learning 101
Azure Machine Learning 101Andrew Badera
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?Ivo Andreev
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with DatoTuri, Inc.
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDatabricks
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonSri Ambati
 
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
Real-time Recommendations for Retail: Architecture, Algorithms, and DesignReal-time Recommendations for Retail: Architecture, Algorithms, and Design
Real-time Recommendations for Retail: Architecture, Algorithms, and DesignJuliet Hougland
 
The path to be a data scientist
The path to be a data scientistThe path to be a data scientist
The path to be a data scientistPoo Kuan Hoong
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...Spark Summit
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnA Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnSarah Guido
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesTuri, Inc.
 
Introduction to Azure machine learning
Introduction to Azure machine learningIntroduction to Azure machine learning
Introduction to Azure machine learningJasjit Chopra
 
Teaching the cloud to think
Teaching the cloud to thinkTeaching the cloud to think
Teaching the cloud to thinkJosh Gillespie
 

Mais procurados (20)

Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
 
Introduction to Azure Machine Learning
Introduction to Azure Machine LearningIntroduction to Azure Machine Learning
Introduction to Azure Machine Learning
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
 
Dealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AIDealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AI
 
Predicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AIPredicting Medical Test Results using Driverless AI
Predicting Medical Test Results using Driverless AI
 
Azure Machine Learning 101
Azure Machine Learning 101Azure Machine Learning 101
Azure Machine Learning 101
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with Dato
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
Real-time Recommendations for Retail: Architecture, Algorithms, and DesignReal-time Recommendations for Retail: Architecture, Algorithms, and Design
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
 
The path to be a data scientist
The path to be a data scientistThe path to be a data scientist
The path to be a data scientist
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnA Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-Learn
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive Services
 
Introduction to Azure machine learning
Introduction to Azure machine learningIntroduction to Azure machine learning
Introduction to Azure machine learning
 
Teaching the cloud to think
Teaching the cloud to thinkTeaching the cloud to think
Teaching the cloud to think
 

Semelhante a Knowledge Discovery

Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceData Science Milan
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
 
predictive analysis and usage in procurement ppt 2017
predictive analysis and usage in procurement  ppt 2017predictive analysis and usage in procurement  ppt 2017
predictive analysis and usage in procurement ppt 2017Prashant Bhatmule
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsTraditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsGanesan Narayanasamy
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
Advanced Analytics and Data Science Expertise
Advanced Analytics and Data Science ExpertiseAdvanced Analytics and Data Science Expertise
Advanced Analytics and Data Science ExpertiseSoftServe
 
fINAL Lesson_1_Course_Introduction_v1.pptx
fINAL Lesson_1_Course_Introduction_v1.pptxfINAL Lesson_1_Course_Introduction_v1.pptx
fINAL Lesson_1_Course_Introduction_v1.pptxdataKarthik
 
Customer value analysis of big data products
Customer value analysis of big data productsCustomer value analysis of big data products
Customer value analysis of big data productsVikas Sardana
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Roger Barga
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!Khalid Salama
 
Data Science Training in Chandigarh h
Data Science Training in Chandigarh    hData Science Training in Chandigarh    h
Data Science Training in Chandigarh hasmeerana605
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningTamir Taha
 
Data Science at UCSB Information Meeting
Data Science at UCSB Information MeetingData Science at UCSB Information Meeting
Data Science at UCSB Information MeetingJason Freeberg
 

Semelhante a Knowledge Discovery (20)

Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial Intelligence
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
predictive analysis and usage in procurement ppt 2017
predictive analysis and usage in procurement  ppt 2017predictive analysis and usage in procurement  ppt 2017
predictive analysis and usage in procurement ppt 2017
 
AI in the Enterprise at Scale
AI in the Enterprise at ScaleAI in the Enterprise at Scale
AI in the Enterprise at Scale
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsTraditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
Advanced Analytics and Data Science Expertise
Advanced Analytics and Data Science ExpertiseAdvanced Analytics and Data Science Expertise
Advanced Analytics and Data Science Expertise
 
Deep learning
Deep learningDeep learning
Deep learning
 
fINAL Lesson_1_Course_Introduction_v1.pptx
fINAL Lesson_1_Course_Introduction_v1.pptxfINAL Lesson_1_Course_Introduction_v1.pptx
fINAL Lesson_1_Course_Introduction_v1.pptx
 
Customer value analysis of big data products
Customer value analysis of big data productsCustomer value analysis of big data products
Customer value analysis of big data products
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 
Managing AI Products
Managing AI ProductsManaging AI Products
Managing AI Products
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Data Science Training in Chandigarh h
Data Science Training in Chandigarh    hData Science Training in Chandigarh    h
Data Science Training in Chandigarh h
 
Data mining
Data miningData mining
Data mining
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Data Science at UCSB Information Meeting
Data Science at UCSB Information MeetingData Science at UCSB Information Meeting
Data Science at UCSB Information Meeting
 

Mais de André Karpištšenko

Mais de André Karpištšenko (7)

Lingvist - Statistical Methods in Language Learning
Lingvist - Statistical Methods in Language LearningLingvist - Statistical Methods in Language Learning
Lingvist - Statistical Methods in Language Learning
 
Starship, Building Intelligent Delivery Robots
Starship, Building Intelligent Delivery RobotsStarship, Building Intelligent Delivery Robots
Starship, Building Intelligent Delivery Robots
 
Cognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithmsCognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithms
 
Practical Deep Learning
Practical Deep LearningPractical Deep Learning
Practical Deep Learning
 
Data science for everyone
Data science for everyoneData science for everyone
Data science for everyone
 
AI Control
AI ControlAI Control
AI Control
 
Deep learning
Deep learningDeep learning
Deep learning
 

Último

怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...gragchanchal546
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfSayantanBiswas37
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 

Último (20)

怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

Knowledge Discovery

  • 1. Knowledge Discovery & Data Mining 22nd ACM SIGKDD 2016 André Karpištšenko
  • 2. ~80 sessions for 2,700 participants • Business Applications and Frameworks at Scale • Data Streams Mining • DashOpt features • Outlier Detection • Bayesian Optimization • Deep Learning • Investing into AI and Data • Bonus keywords 88 countries, 35% YoY, 15-20% acceptance
  • 3. Business Application Examples • Consumer Internet focus: Content Ranking, Recommendation, User Intent and Context Prediction • Industrial Internet focus: Autonomy, Predictive Maintenance, Operational Intelligence, Production Planning • B2B focus: Targeting, Lead Generation, Sales Development, Opportunity Management, Account Management • Web content analytics: Image, Video, Text Classification for Relevance, Products Categorization, Sentiments • Other: Cyber Security, Fraud/Spam Detection, NLP, Speech Recognition, Image/Video Recognition
  • 4. Predictive Modeling Flow DashOpt Feature Engineering Raw Data Raw Features Labels Feature Integration Features with Labels Data Partitioning Training Data Validation Data Testing Data Model Training Evaluate for model selection Compute offline evaluation metrics Best model Offline scoring and indexing Online/offline systems Online A/B test Label preparation Log data Scoring features Raw features Feature integration Model Performance Test Results
  • 5. Data Technologies • Most common: HDFS, MapReduce, Spark, Hive • Decision spectrum: build, assemble, buy • Factors: network, open-source, maturity, needs Exploratory Production
  • 6. Classic ML Platform at Scale Workflows HDFS Feature Mart Ground Truth Models Scores MapReduce / Yarn Workflow Scheduler & Manager Workflows Workflows Intelligence Engine Metadata Store Pig, Hive, Python, Scala, Shell script, … Feature Engineering Libraries Machine Learning Libraries Drivers Drivers Application 1 Application 2 Application N …
  • 8. https://sites.google.com/site/iotminingtutorial/ IoT Data Streams Mining • 10 billion devices in 2015 -> 34 billion devices by 2020 • Continuous data, dynamic models, distributed, few seconds
  • 9. Streams Mining: Actors Model Data processing pipeline Distributed processing Kappa Architecture https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  • 10. Outlier Detection • Single point anomaly detection: likelihood over distribution • Finding anomalous groups: divergence estimation • Methods: percentage change, T-test, Chi-square test, Generalized ESD (Extreme Studentized Deviate) test, Seasonal Hybrid ESD, etc. • Goal: move from detection to automated response
  • 11. Outlier Detection in Practice • Too many detections of too little value • Use methods for thresholds • Breakout detection and Concept Drift • For changing distributions move baselines over time • Risk of overfitting to known anomalies, not finding unknown anomalies
  • 12. Bayesian aka Active Optimization • Examples: Design of Experiments, hyper-parameters of supervised learning, algorithms tested with simulations f is an unknown expensive black-box function with the goal to approximately optimize f with as few experiments as possible • No free lunch theorem • Other bio-inspired algorithms for optimization exploitation and exploration: neural networks, genetic algorithms, swarm intelligence, ant colony optimisation, etc.
  • 13. Bayesian Optimization in Practice • SigOpt experience: 20 dimensions, above human capacity. • Uber ATC experience: scaling active optimization to high dimensions as the default works reliably for 5-7 dim. • Variables are added during optimization. • Choose fidelity using heuristics.
  • 15. Deep Learning • Compute power, GPU, learning architectures and a lot of labeled data are what drive DL • Applied for Vision and Speech: matches human performance • Not possible where experiments are costly: biotech • Kaggle winners are not DL models: tree ensembles, SVMs • Common technologies: TensorFlow, Caffe, Theano, Keras • Thousands of pieces of software: modules and layers • Explainability and interpretability are the next big things • EU regulation. Tradeoff: accuracy vs explainability.
  • 16. Deep Learning Trends • Vision nets are deeper and structured (Larsson 2016) • Language nets have also dynamics, memory and attention (Rocktaschel 2016, Miller 2016) • Probabilistic programming (Lake, Tenenbaum) • Programs as networks (Riedel) • The Neural Programmer and Interpreter for learning programs (Reed et al 2016) • Computation graphs interacting with memory • Loop for reasoning for nested questions (Miller 2016) • Generative adversarial networks (Reed 2016). Models capable of imagining images, videos and text.
  • 17. Investing into AI and Data • Data acquisition, real-time detection and visualization not solved yet. • Empower more people to do data science. Automate routines. • Unsolved problems are learning from unlabeled data, planning, reasoning, problem solving, concept formulation, 1/10k compute. • Key decisions: Timing, accuracy in what is hard, find verticals and focus, identify differentiation & size of the prize & people & partners Business of outliers: 1% capital returns 526x, 48% returns 0 0 450 900 Q2'15 Q3'15 Q4'15 Q1'16 Q2'16 Peak Data Peak ML VC assessment:
  • 18. Bonus Keywords • Lifelong Machine Learning: systems approach, transfer learning, never-ending learners. Useful for knowledge build. • Graphons: graph convergence and limits through infinite number of vertices. Useful for privacy preserving mining. • Computational Social Science: how individuals interact to produce collective behaviour. Individuals exert more effort by themselves than groups. • Information Security: trusted key management is most sensitive. Secret must be changed frequently. Confidentiality easier to violate than authenticity. Integrity. Offence more lucrative than defence. • Enterprise Data: in reality “random data salad” prone to constant change due to M&A, politics, dynamic schema DBs (e.g. Mongo), legacy burden, restructuring, leadership changes, data hoarding. Machine driven, human guided processes required.
  • 19. Thinking about Value from Data Science