Introduction to Data Mining
       for Newbies



                         Nov. 2nd, 2012
                          @echojuliett
Google Datacenter
@Douglas County, Georgia

“These colorful pipes send and receive water for cooling our facility.
Also pictured is a G-Bike, the vehicle of choice for team members to get
around outside our data centers.”




Source: http://www.google.com/about/datacenters/gallery/#/tech/10
Eunjeong Lucy Park
PhDs, Data scientist @SNU DMLab



A person who lives on lattes.




Find me at:
http://dmlab.snu.ac.kr, http://lucypark.kr




“All scientists are data scientists.”
                - Monica Rogati, Senior Research Scientist @LinkedIn




                                           Source: http://xkcd.com/242/
“Data is everywhere.”

                   Tweets
                                                      Cell phone logs




                     Social networking data


                                                Politician data


        Web documents




 Manufacturing fault data                     Credit card transactions



“Data mining is…”

   •   “…the process of exploration and analysis, by automatic or semi-automatic means,
       of large quantities of data in order to discover meaningful patterns and rules.”
                                                                                        - Berry and Linoff, 1997




Source: Berry and Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley, 1997.
“Data mining is…”

•   “…the belief in data.”
                                                                 - @echojuliett, 2012




•   Inductive reasoning
      Mathematical induction: prove for k=1, assume for k, then prove for k+1
      Induction vs. prejudice: # of cases
      Ex: What is your hobby?


1.   Basic Concepts of Data Mining

2.   Origins of Data Mining

3.   Data Mining Tools

4.   Masters of Data Mining




Data types




       Source: http://www.tipforest.com/t/83




      Structured data                          Unstructured data
(the general) Data mining process

  DATA warehouse → Selection → Target data → Preprocessing → Preprocessed data → Data mining → Patterns → Interpretation → KNOWLEDGE

  The data warehouse belongs to some domain (Marketing, Finance, Manufacturing, etc.)
Selection

  • Data exploration
     – How many variables?
           •   Independent variables, dependent variables, …

           •   Continuous variables, categorical variables, …

     – How many records?

     – What distribution?

     – …



  • Variable selection & dimensionality reduction
     – Ex: Stepwise selection, PCA (Principal Component Analysis)
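
The exploration questions above (how many variables, how many records, what distribution) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the `explore` helper are all hypothetical, and a variable is treated as continuous simply when every value is numeric.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical customer records; every field name is made up for illustration.
records = [
    {"age": 34, "income": 52000.0, "segment": "A", "churned": "no"},
    {"age": 45, "income": 61000.0, "segment": "B", "churned": "yes"},
    {"age": 29, "income": 48000.0, "segment": "A", "churned": "no"},
    {"age": 52, "income": 75000.0, "segment": "C", "churned": "yes"},
]


def explore(records, target):
    """Summarize record count, variable count, and a rough per-variable profile."""
    summary = {
        "n_records": len(records),
        "n_variables": len(records[0]),
        "variables": {},
    }
    for name in records[0]:
        values = [r[name] for r in records]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric values: treat as continuous and describe the distribution.
            info = {"kind": "continuous", "mean": mean(values), "stdev": stdev(values)}
        else:
            # Anything else: treat as categorical and count the levels.
            info = {"kind": "categorical", "counts": dict(Counter(values))}
        info["role"] = "dependent" if name == target else "independent"
        summary["variables"][name] = info
    return summary


summary = explore(records, target="churned")
for name, info in summary["variables"].items():
    print(name, info)
```

Variable selection and PCA would follow this step; they are left out here because they need more machinery than a short sketch allows.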
Preprocessing

  • “Partitioning” the data
     – training data & validation data (& test data …)




                                  Data set




              Training data                      Validation data
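
A minimal sketch of partitioning, assuming a shuffled 60/20/20 split into training, validation, and test data. The fractions, the seed, and the `partition` helper are illustrative choices, not values from the slides.

```python
import random


def partition(dataset, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train_fraction)
    n_valid = round(len(rows) * validation_fraction)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # whatever remains becomes test data
    return train, validation, test


train, validation, test = partition(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters when the data set is ordered (for example by date or by class); for time series a chronological cut is usually the safer choice.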
Preprocessing

  • Beware of “overfitting”




 Source: Bishop, PRML, p.7
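
Overfitting shows up as a gap between training error and validation error. The sketch below swaps in k-NN regression for the polynomial fit in Bishop's figure, because it needs no linear algebra: with k=1 every training point predicts itself, so the training error is zero while the validation error is not. The noisy sine data, the noise level, and the helper names are assumptions made for the example.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(t for _, t in nearest) / k


def mse(train, points, k):
    """Mean squared error of the k-NN model (built on train) over points."""
    return sum((knn_predict(train, x, k) - t) ** 2 for x, t in points) / len(points)


rng = random.Random(0)


def sample(n):
    """Draw n noisy observations of sin(2*pi*x) on [0, 1)."""
    points = []
    for _ in range(n):
        x = rng.random()
        points.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)))
    return points


train_points, validation_points = sample(30), sample(30)
for k in (1, 5, 15):
    train_error = mse(train_points, train_points, k)
    validation_error = mse(train_points, validation_points, k)
    print(f"k={k:2d}  train MSE={train_error:.3f}  validation MSE={validation_error:.3f}")
```

A model that memorizes the training data (k=1) looks perfect on that data, which is why the validation set from the previous slide is the one to watch when choosing model complexity.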
Data mining methods

  Predictive methods
   • Classification: learns a method for predicting the instance class from pre-labeled (classified) instances
   • Regression: an attempt to predict a continuous attribute

  Descriptive methods
   • Clustering: finds “natural” grouping of instances given un-labeled data
   • Association Rules: method for discovering interesting relations between variables in large DBs
Regression
  • Linear regression, k-nearest neighbors (k-NN), artificial neural networks (ANN), …


  • Polynomial curve fitting

        •   The basic form: min [formula shown as an image on the slide]

        •   The advanced form: min [formula shown as an image on the slide]



  • Example:
        •   Tomorrow’s stock price = f (recent prices, economic indicators, …)
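
The two formulas after “min” are not recoverable from the text. As an assumption, the sketch below takes the basic form to be a sum-of-squares error, ½ Σ (y(xₙ, w) − tₙ)², and the advanced form to add a penalty (λ/2)‖w‖², which is how the polynomial curve fitting example is usually presented in the PRML chapter the deck cites. The data, the degree, and all function names are illustrative.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a*w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit_polynomial(xs, ts, degree, lam=0.0):
    """Minimize 1/2*sum((y - t)^2) + lam/2*||w||^2 via the normal equations.

    lam=0 is the assumed basic form; lam>0 is the assumed advanced (regularized) form.
    """
    rows = [[x ** j for j in range(degree + 1)] for x in xs]
    k = degree + 1
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, ts)) for i in range(k)]
    return solve(a, b)


def predict(w, x):
    """Evaluate the fitted polynomial at x."""
    return sum(wj * x ** j for j, wj in enumerate(w))


rng = random.Random(0)
xs = [i / 9 for i in range(10)]
ts = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.2) for x in xs]

w_plain = fit_polynomial(xs, ts, degree=3)
w_ridge = fit_polynomial(xs, ts, degree=3, lam=1.0)
print("plain:", [round(w, 2) for w in w_plain])
print("ridge:", [round(w, 2) for w in w_ridge])
```

The penalty shrinks the coefficients toward zero, trading a little training error for a smoother curve. The degree is kept low because solving the normal equations this way becomes numerically fragile for high-degree polynomials.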
Classification
  • Regression with a categorical dependent variable


  • Naïve Bayes classification, decision trees, ANNs, SVMs,…




  • Ex: E-mail spam detection
       Incoming message (?) → inbox or spam
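
As a rough illustration of the spam example, here is a tiny Naïve Bayes classifier over word counts with add-one (Laplace) smoothing. The four training messages, the whitespace tokenization, and the function names are assumptions for the sketch; a real filter would need far more data and better preprocessing.

```python
import math
from collections import Counter

# Hypothetical pre-labeled messages (the "classified instances").
training_messages = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "inbox"),
    ("lunch on monday with the team", "inbox"),
]


def train_naive_bayes(messages):
    """Count messages per class and words per class."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in messages:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(training_messages)
print(classify("claim your free money", model))
print(classify("agenda for the team lunch", model))
```

The “naïve” part is the assumption that words are independent given the class, which is what lets the per-word log probabilities simply be added.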
Clustering
  • Grouping of similar objects
  • Unsupervised, Exploratory Knowledge Discovery


  • k-means, hierarchical clustering, SOM, …




  • Ex: Politician segmentation
     [Figure: Jaccard similarity based hierarchical clustering dendrogram (D9); similarity axis from 0 to 0.8; the leaves fall into three groups: Democratic United Party (liberal), Grand National Party (conservative), and Others]
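
A minimal sketch of hierarchical clustering on Jaccard similarity, in the spirit of the politician segmentation example. The data behind the dendrogram is not given, so the sets below (hypothetical politicians and the bills each one supported) are invented, and average linkage is an arbitrary choice of merge rule.

```python
from itertools import combinations

# Hypothetical input: for each politician, the set of bills they supported.
supported_bills = {
    "P1": {"bill1", "bill2", "bill3"},
    "P2": {"bill1", "bill2", "bill4"},
    "P3": {"bill7", "bill8", "bill9"},
    "P4": {"bill7", "bill8", "bill3"},
}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two non-empty sets."""
    return len(a & b) / len(a | b)


def agglomerate(items, n_clusters):
    """Repeatedly merge the two most similar clusters (average linkage)."""
    clusters = [[name] for name in items]

    def linkage(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard(items[x], items[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


clusters = agglomerate(supported_bills, n_clusters=2)
print(clusters)
```

Recording the similarity at which each merge happens, rather than stopping at a fixed number of clusters, is what produces a dendrogram like the one on the slide.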
Association Rules




 Source: http://lucypark.tistory.com/48
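
Association rules are usually scored by support and confidence. The sketch below computes both for one rule over a handful of made-up market-basket transactions; the items, the rule, and the helper names are illustrative and not taken from the slide or the linked post.

```python
import math

# Hypothetical transactions, one set of purchased items per basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transaction counts."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds; this sketch only scores a single rule that is already known.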
Pop quiz!




 Source: http://www.cis.hut.fi/research/som-research/worldmap.html
 Source: http://popupcity.net/2009/04/why-are-that-many-logos-blue/
2.   Origins of Data Mining
Historical Note
  Data Fishing, Data Dredging: 1960-
     • used by statisticians (as a pejorative)



  Knowledge Discovery in Databases (KDD): 1989-
     • used by Artificial Intelligence (AI), Machine Learning (ML) communities



  Data Mining, Data Analytics: 1990-
     • used in DB communities, business



  Big data: 2000-
Comparisons
  • Data mining
  • Statistics
  • Machine learning
  • Pattern recognition
  • …
3.   Data Mining Tools
R




Source: http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html
SAS Enterprise Miner (“E-miner”)
XLMiner
  • 15-day trial version available at http://www.solver.com/xlminer-data-mining
  • Useful for prototyping


  • Supports:
      •   Preprocessing
           •   Data partitioning
           •   Missing data imputation
           •   Categorical data transformation
           •   PCA (Principal Component Analysis)
      •   Algorithms
           •   Multiple linear regression
           •   k-NN (k nearest neighbors)
           •   CART (classification and regression trees)
           •   ANN (artificial neural networks)
           •   Discriminant analysis
           •   Logistic regression
           •   Naïve Bayes classification
           •   Association rules
           •   k-means clustering
           •   Hierarchical clustering
More…
 • Mathworks MATLAB / GNU Octave
     Most DM algorithms are preinstalled
     Relatively easy to learn



 • General purpose programming languages
     For example, C, Java, Python, etc.
     Packages such as Orange (http://orange.biolab.si/) for Python are available
     May be better suited to tasks like natural language processing


 • Even more…
     Try visiting http://www.kdnuggets.com/software/suites.html
4.   Masters of Data Mining
Foreign warriors




  •   Mitchell (Carnegie Mellon University)

  •   Vapnik (NEC Labs)

  •   Bishop (Microsoft Cambridge)

  •   Smola (Yahoo, Australian National University)

  •   Ng (Stanford University)
Domestic warriors




  •   조성준 (Seoul National University)

  •   조재희 (Kwangwoon University)

  •   조성배 (Yonsei University)

  •   이성임 (Dankook University)

  •   김성범 (Korea University)
References
  •   [1] Duda, Hart, Stork, Pattern Classification, 2nd ed., Wiley, 2001.

  •   [2] Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2006.

  •   [3] Shmueli, Patel, Bruce, Data Mining for Business Intelligence, 2nd ed., Wiley, 2010.
Any Questions?


                 ?
