Data Mining
                                                    2114.409: Creative Research Practice




HTTP://WWW.FLICKR.COM/PHOTOS/CPBILLS/2888144434/
Reflection
Homework 2

 Status?

 Auditors

Concerns

 Programming

 What can we build



                     HTTP://WWW.FLICKR.COM/PHOTOS/FLOWER87/76719859/
Course Outline
1. Foundations
   Introduction
   Survey Methods / Data Mining
   Visualization and Analysis
   Social Mechanics
2. Methods
   Creativity and Brainstorming
   Prototyping
   Project Management
3. Prototyping
   Crawling
   Text Mining
   To be determined (TBD)
   Project Update
4. Refinement
   TBD x3
   Project Presentations
   Reflection
Data Mining Overview
How do I see and communicate answers?       Lecture 2, HW2
What questions should I ask of the data?    Today, HW3
How do I clean and process the data?        on-demand
How do I gather meaningful data?            Later?
THIS LECTURE BARELY SCRATCHES THE
SURFACE OF INFORMATION VISUALIZATION.
IT IS A JUMPING OFF POINT.
Data Exploration
Often the questions are not obvious and it’s
 useful to look at the data for inspiration.
Exploration: Data Cubes
             Basic operations:

             ‣ Group
               (how to chunk data)

             ‣ Summarize
               (sum, mean, etc.)

             ‣ Filter
               (which rows to include)
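
The three operations above can be sketched in plain Python. This is a minimal illustration, not a pivot-table implementation; the records, the field names (`city`, `year`, `sales`), and the helper `pivot` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical records; the fields are illustrative only.
rows = [
    {"city": "Seoul", "year": 2010, "sales": 10},
    {"city": "Seoul", "year": 2011, "sales": 14},
    {"city": "Busan", "year": 2010, "sales": 6},
    {"city": "Busan", "year": 2011, "sales": 8},
]

def pivot(rows, group_key, value_key, keep=lambda r: True):
    """Filter rows, group them by one field, then summarize each group."""
    groups = defaultdict(list)
    for r in rows:
        if keep(r):                                    # Filter: which rows to include
            groups[r[group_key]].append(r[value_key])  # Group: how to chunk data
    # Summarize: one (sum, mean) pair per group
    return {k: (sum(v), sum(v) / len(v)) for k, v in groups.items()}

by_city = pivot(rows, "city", "sales")
only_2011 = pivot(rows, "city", "sales", keep=lambda r: r["year"] == 2011)
```

Changing `group_key` or the `keep` predicate gives a different slice of the same data, which is the kind of quick exploration a pivot table supports.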
Pivot Table Tutorial
Objectives

DATA MINING
‣ What is it?
‣ How does it relate to collective intelligence?

EACH TECHNIQUE
‣ What is it doing?
‣ Why is it useful?
‣ How might you apply it?
Are there patterns in the data?

HUMAN VISUAL SYSTEM vs. COMPUTER ANALYSIS
Why might we prefer analysis?

LABOR
‣ Too many pictures to look at.
‣ Don’t know which are interesting.

ACCURACY
‣ Can test for statistical significance, etc.
‣ Some patterns don’t visualize easily.

HTTP://WWW.FLICKR.COM/PHOTOS/STRIATIC/2144933705/
Common Techniques
‣ Clustering
‣ Classification & Regression
‣ Association Rules
‣ Anomaly Detection
HTTP://WWW.FLICKR.COM/PHOTOS/EXPLORATIVEAPPROACH/3866580875/
Clustering
Find natural
groupings in
the data



Organize data into classes:

‣ high intra-class similarity
‣ low inter-class similarity
Clustering
Input Data: Points OR Similarities, plus [ # of clusters ]
Output Clusters: Hard OR Soft OR Hierarchical
K-Means
[Figure sequence: scatter plot on axes from 0 to 5 (x: expression in condition 1, y: expression in condition 2) with three centroids, k1, k2, and k3, whose positions change from slide to slide.]
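
A minimal sketch of the standard k-means loop (Lloyd's algorithm) may help make the figure sequence concrete. It assumes 2-D points and hand-picked starting centroids; the sample points and starting positions below are invented for illustration and are not the values plotted on the slides.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: (x - centroids[j][0]) ** 2 + (y - centroids[j][1]) ** 2)
            for x, y in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: three loose groups on a 0-5 grid.
points = [(0.5, 0.5), (1.0, 0.8), (0.8, 1.2),
          (4.0, 4.2), (4.5, 4.0), (4.2, 4.6),
          (4.0, 0.5), (4.5, 1.0), (4.3, 0.7)]
start = [(3, 3), (1, 2), (3, 1)]  # illustrative initial k1, k2, k3
centroids, labels = kmeans(points, start)
```

The number of clusters is an input here (the length of `start`), and the result can depend on where the centroids begin, so in practice the algorithm is often run from several starting positions.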
Classification: Learn to map objects to categories
Regression: Learn to map objects to continuous variables
Typical Applications
Speech      Handwriting   OCR
Classification
Observations X, Labels Y
Learn f(x) = y
[Figure: plot of Y = gender (Male, Female) against X = height.]
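
As a rough sketch of what "learn f(x) = y" can mean for the height example, the code below learns a single cut-off halfway between the two class means. The heights are made-up numbers, and the threshold rule is a deliberately crude stand-in chosen for brevity, not the method used in the lecture.

```python
def fit_threshold(heights, labels):
    """Learn a cut-off halfway between the mean height of each class."""
    male = [h for h, l in zip(heights, labels) if l == "male"]
    female = [h for h, l in zip(heights, labels) if l == "female"]
    return (sum(male) / len(male) + sum(female) / len(female)) / 2

def predict(height, threshold):
    """f(x) = y. Assumes the 'male' class has the larger mean height."""
    return "male" if height >= threshold else "female"

# Hypothetical observations X (height in cm) and labels Y.
heights = [158, 162, 166, 172, 178, 184]
labels = ["female", "female", "female", "male", "male", "male"]

threshold = fit_threshold(heights, labels)
guess = predict(175, threshold)
```

Real height distributions overlap, so any such rule will misclassify some people; that error rate is what the evaluation step measures.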
The Whole Process
1. Data Set
2. Featurization → Featurized data
3. Random Split (e.g. 90/10) → Training Data and Test Data
4. Training (on Training Data) → Model
5. Evaluation (Model on Test Data) → Results
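
The pipeline can be sketched end to end in a few lines of Python. Everything specific here is an assumption made for illustration: the toy records, the length feature, the nearest-mean model, and the function names `split`, `train`, `predict`, and `evaluate`.

```python
import random

def split(examples, test_fraction=0.1, seed=0):
    """Random split, e.g. 90% training data and 10% test data."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train(training_data):
    """Model: the mean feature value observed for each label."""
    sums = {}
    for x, y in training_data:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict(model, x):
    """Pick the label whose mean is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def evaluate(model, test_data):
    """Fraction of held-out examples the model labels correctly."""
    return sum(predict(model, x) == y for x, y in test_data) / len(test_data)

# Data set: hypothetical (text, label) records.
raw = [("a" * n, "short" if n < 10 else "long") for n in range(1, 21)]
# Featurization: turn each record into a number, here its length.
featurized = [(len(text), label) for text, label in raw]

training_data, test_data = split(featurized)
model = train(training_data)
accuracy = evaluate(model, test_data)
```

The key point is that the model never sees the test data during training, so the accuracy is an estimate of how it would do on new examples.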
Real-World Classification
Observations X, Labels Y, f(x) = y
‣ Y: 100s of labels
‣ X: 1000s of features
‣ N: millions of examples
‣ ?: not all data is labeled
‣ ?: some data is mislabeled
‣ Model spatial context
‣ Model temporal context
Association Rules
Learn interesting relations in the data

supp(X) = proportion of events in which X occurs
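
The quantity defined above is commonly called the support of X. A small sketch shows how it is computed, along with confidence, a second standard measure that the slide does not mention and that is added here only for context. The shopping-basket data is invented.

```python
def support(transactions, itemset):
    """Proportion of events (transactions) in which every item of itemset occurs."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the events containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical events: each set is one shopping basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_support = support(transactions, {"bread"})
bread_to_milk = confidence(transactions, {"bread"}, {"milk"})
```

A rule such as "bread → milk" is usually considered interesting only when both its support and its confidence clear thresholds chosen by the analyst.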
Anomaly Detection

          Detect strange
          events in the data
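
One simple way to make "strange" concrete is to flag values that lie far from the mean, measured in standard deviations. This is only one of many possible approaches; the threshold of 2 and the visit counts below are assumptions for illustration.

```python
from statistics import mean, pstdev

def anomalies(values, threshold=2.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all values identical, so nothing stands out
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily visit counts with one unusual day.
daily_visits = [100, 102, 98, 101, 99, 103, 97, 250]
strange = anomalies(daily_visits)
```

Note that a large outlier inflates the mean and standard deviation it is judged against, so this approach can miss anomalies when several occur together.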
Homework: Data Mining
1. Form groups!

2. Choose a Collective Intelligence topic from
   Lecture 1, or propose similar.

3. Make a list of data sources that might
   provide insights into that topic.

4. Propose a set of meaningful questions about
   the data based on your intuition.

5. How would you have to clean/process your
   data to start answering those questions?

6. Consider clustering, association rules,
   anomaly detection, classification. For each
   technique, how might you apply it to the
   data and what would it show?

7. Document your work and be prepared to
   present.
                                                 HTTP://WWW.FLICKR.COM/PHOTOS/31907740@N00/4860840019/
Guest Lecture
Feedback
