SlideShare uma empresa Scribd logo
1 de 11
Baixar para ler offline
Data Mining &
Column Stores
Aung Thu Rha Hein
Why use Data Mining?
• Explosive growth of data available
• Major sources:
          • Business: Web, E-Commerce, transactions
          • Science : Remote Sensing, bioinformatics,….
          • Society : news, gadgets, social media

• Too much data but too little information
• To extract useful information from the data and to interpret
  the data
• can automate the process of finding relationships and patterns
  in raw data
What is Data Mining?
• Knowledge Discovery in Databases, or ”KDD”
• the process of extracting hidden predictive information
  from large data sets
• Converting information into knowledge to predict the
  future trends and decisions
• Examples :
             consumer buying behavior of retail supermarket sales
             Google instant, YouTube instant
             Blogs and news: Technorati, News360 and so on
             Social Mining : Livehoods: find pattern and behaviors
              of foursquare check-in data
Data Mining Process
The Cross-Industry Standard Process (CRISP-DM)



                                                 Business understanding

                                                 Data understanding

                                                 Data preparation

                                                 Modeling

                                                 Evaluation

                                                 Deployment
Techniques
I.    Association Rule-also known as market basket analysis.
           discover interesting associations between attributes
II.   Classification- a technique based on machine learning
           use mathematical techniques such as decision trees, linear
            programming, neural network and statistics.
III. Clustering- makes meaningful or useful cluster of objects that
                  have similar characteristic
IV. Prediction-discovers relationship between independent variables
                 and relationship between dependent and
                 independent variables
V. Sequential Patterns-discover similar patterns in data transaction
                 over a business period
Tools
• There are three categories of tools for data mining:
      i. Traditional Data Mining Tools
      ii. Dashboards
      iii. Text-mining Tools


Some data mining tools:
      •   R- r-project.org
      •   Datameer Analytics Solution - datameer.com
      •   SAS Analytics- sas.com
      •   Google Chart API- code.google.com/apis/chart
Column Stores
• stores data tables as columns of data
   • Column Oriented DBMS-
       • Bigtable, DBase, Hypertable, Cassandra(Relational)
       • Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright (NoSQL)
• Use in systems like data warehouses and data mining
• Example:         Emp_ID Emp_Name             Emp_Dept Emp_Salar
                                                        y
                   1          Smith            IT            40000
                   2          Adam             Sales         35000
                   3          Jones            Marketing     45000
the database must coax its two-dimensional table into one for the operating
system
             • 1,2,3
                Smith, Adam, Jones
               IT, Sales, Marketing
                40000, 35000, 45000
Advantages and Disadvantages of
Column Stores
Advantages
• Only need to read relevant data( improved bandwidth utilization)
• Improved cache locality
       No need to transmit surrounding attributes
• Compression efficiency-column compress better than rows
       Because rows contain values from different domain
       Row-store compression ratio: 1:3
       Colum-Store: 1:10
Disadvantages
• Increased Disk seek time
• Increased cost of inserts.
• Increased tuple reconstruction costs
Case Study: Bazaarvoice
• Facing difficulties to aggregate large amounts of data on the fly in real time
  for analytics product
• Common among queries- a small number of columns with most values
  being aggregates such as counts, sums and averages
• Use InfoBright, an open source database built on MySQL
• Test result using a data set with 100MM records in the main fact table




• Average query execution time for analytical queries was 20x faster than
  MySQL’s
Case Study: Bazaarvoice(cont.)
• disk footprint was over 10x smaller compared to MySQL due to data
  compression.
• Why?
  • Column stores- small disk I/O
  • “knowledge grid”, aggregate data Infobright calculates during data
    loading
      • E.g. pre-calculate min, max, and avg value for each column in the
        pack
  • Limitations of InfoBright
      • does not support DML
      • only way is to bulk loads using “LOAD DATA INFILE …” command
      • no way to update or delete existing data without reloading the table
References
Data Mining
•   http://en.wikipedia.org/wiki/Data_mining
•   http://www.inc.com/magazine/20101001/4-essential-data-mining-tools.html
•   http://www.dataminingtechniques.net/
•   http://www.unc.edu/~xluan/258/datamining.html
•   http://www.data-miners.com/
•   http://www.exforsys.com/tutorials/data-mining/how-data-mining-is-evolving.html
•   http://livehoods.org/


Column Stores
• http://en.wikipedia.org/wiki/Column_store
• http://developer.bazaarvoice.com/why-columns-are-cool
• http://www.calpont.com/doc/Calpont_Whitepaper-Best-Practices-
  in_the_Use_of_Columnar_Databases.pdf

Mais conteúdo relacionado

Mais procurados

Data Confidentiality in Cloud Computing
Data Confidentiality in Cloud ComputingData Confidentiality in Cloud Computing
Data Confidentiality in Cloud ComputingRitesh Dwivedi
 
Ensuring data storage security in cloud computing
Ensuring data storage security in cloud computingEnsuring data storage security in cloud computing
Ensuring data storage security in cloud computingUday Wankar
 
Multi Tenancy In The Cloud
Multi Tenancy In The CloudMulti Tenancy In The Cloud
Multi Tenancy In The Cloudrohit_ainapure
 
Data Management Gateway - Deep Dive
Data Management Gateway - Deep DiveData Management Gateway - Deep Dive
Data Management Gateway - Deep DiveJean-Pierre Riehl
 
Introduction of cloud computing
Introduction of cloud computingIntroduction of cloud computing
Introduction of cloud computingSuman Sharma
 
Saa s multitenant database architecture
Saa s multitenant database architectureSaa s multitenant database architecture
Saa s multitenant database architecturemmubashirkhan
 
Cloud computing 1
Cloud computing  1Cloud computing  1
Cloud computing 1Ashok Kumar
 
security Issues of cloud computing
security Issues of cloud computingsecurity Issues of cloud computing
security Issues of cloud computingprachupanchal
 
Cloud Computing Overview
Cloud Computing OverviewCloud Computing Overview
Cloud Computing OverviewSean Connolly
 
Infrastructure as a service (iaa s)
Infrastructure as a service (iaa s)Infrastructure as a service (iaa s)
Infrastructure as a service (iaa s)johndorian555
 
Platform as a Service
Platform as a ServicePlatform as a Service
Platform as a ServiceAshok Kumar
 
Multi-tenancy in Private Clouds
Multi-tenancy in Private CloudsMulti-tenancy in Private Clouds
Multi-tenancy in Private CloudsPatrick Nicolas
 
Microsoft Cloud Computing
Microsoft Cloud ComputingMicrosoft Cloud Computing
Microsoft Cloud ComputingDavid Chou
 
Third party cloud services cloud computing
Third party cloud services cloud computingThird party cloud services cloud computing
Third party cloud services cloud computingSohailAliMalik
 
Cloud computing intro
Cloud computing  introCloud computing  intro
Cloud computing introAshok Kumar
 
Multi-tenancy In the Cloud
Multi-tenancy In the CloudMulti-tenancy In the Cloud
Multi-tenancy In the Cloudsdevillers
 

Mais procurados (20)

Data Confidentiality in Cloud Computing
Data Confidentiality in Cloud ComputingData Confidentiality in Cloud Computing
Data Confidentiality in Cloud Computing
 
Ensuring data storage security in cloud computing
Ensuring data storage security in cloud computingEnsuring data storage security in cloud computing
Ensuring data storage security in cloud computing
 
Multi Tenancy In The Cloud
Multi Tenancy In The CloudMulti Tenancy In The Cloud
Multi Tenancy In The Cloud
 
Data Management Gateway - Deep Dive
Data Management Gateway - Deep DiveData Management Gateway - Deep Dive
Data Management Gateway - Deep Dive
 
Introduction of cloud computing
Introduction of cloud computingIntroduction of cloud computing
Introduction of cloud computing
 
Saa s multitenant database architecture
Saa s multitenant database architectureSaa s multitenant database architecture
Saa s multitenant database architecture
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Cloud computing 1
Cloud computing  1Cloud computing  1
Cloud computing 1
 
security Issues of cloud computing
security Issues of cloud computingsecurity Issues of cloud computing
security Issues of cloud computing
 
Cloud Computing Overview
Cloud Computing OverviewCloud Computing Overview
Cloud Computing Overview
 
CLOUD COMPUTING AND STORAGE
CLOUD COMPUTING AND STORAGECLOUD COMPUTING AND STORAGE
CLOUD COMPUTING AND STORAGE
 
Infrastructure as a service (iaa s)
Infrastructure as a service (iaa s)Infrastructure as a service (iaa s)
Infrastructure as a service (iaa s)
 
Cloud security
Cloud securityCloud security
Cloud security
 
Platform as a Service
Platform as a ServicePlatform as a Service
Platform as a Service
 
Multi-tenancy in Private Clouds
Multi-tenancy in Private CloudsMulti-tenancy in Private Clouds
Multi-tenancy in Private Clouds
 
Microsoft Cloud Computing
Microsoft Cloud ComputingMicrosoft Cloud Computing
Microsoft Cloud Computing
 
Third party cloud services cloud computing
Third party cloud services cloud computingThird party cloud services cloud computing
Third party cloud services cloud computing
 
Cloud computing intro
Cloud computing  introCloud computing  intro
Cloud computing intro
 
Multi-tenancy In the Cloud
Multi-tenancy In the CloudMulti-tenancy In the Cloud
Multi-tenancy In the Cloud
 
Security in Cloud Computing
Security in Cloud ComputingSecurity in Cloud Computing
Security in Cloud Computing
 

Semelhante a Data mining & column stores

Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationSunderland City Council
 
Dbms and it infrastructure
Dbms and  it infrastructureDbms and  it infrastructure
Dbms and it infrastructureprojectandppt
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDhilsath Fathima
 
Data mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsData mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsGDi Techno Solutions
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptxinfinix8
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptAravindReddy565690
 
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsTraditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsGanesan Narayanasamy
 
Lecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdfLecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdfahmedibrahimghnnam01
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1malathieswaran29
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxAIMLSEMINARS
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptxKannanThangavelu2
 

Semelhante a Data mining & column stores (20)

IT webinar 2016
IT webinar 2016IT webinar 2016
IT webinar 2016
 
Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data Visualisation
 
Dbms and it infrastructure
Dbms and  it infrastructureDbms and  it infrastructure
Dbms and it infrastructure
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
data warehousing
data warehousingdata warehousing
data warehousing
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
 
Data mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsData mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).ppt
 
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsTraditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
 
Lecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdfLecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdf
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptx
 
Data warehousing and Data mining
Data warehousing and Data mining Data warehousing and Data mining
Data warehousing and Data mining
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptx
 

Mais de Aung Thu Rha Hein

Bioinformatics for Computer Scientists
Bioinformatics for Computer Scientists Bioinformatics for Computer Scientists
Bioinformatics for Computer Scientists Aung Thu Rha Hein
 
Analysis of hybrid image with FFT (Fast Fourier Transform)
Analysis of hybrid image with FFT (Fast Fourier Transform)Analysis of hybrid image with FFT (Fast Fourier Transform)
Analysis of hybrid image with FFT (Fast Fourier Transform)Aung Thu Rha Hein
 
Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)Aung Thu Rha Hein
 
Private Browsing: A Window of Forensic Opportunity
Private Browsing: A Window of Forensic OpportunityPrivate Browsing: A Window of Forensic Opportunity
Private Browsing: A Window of Forensic OpportunityAung Thu Rha Hein
 
Digital Forensic: Brief Intro & Research Challenge
Digital Forensic: Brief Intro & Research ChallengeDigital Forensic: Brief Intro & Research Challenge
Digital Forensic: Brief Intro & Research ChallengeAung Thu Rha Hein
 
Survey & Review of Digital Forensic
Survey & Review of Digital ForensicSurvey & Review of Digital Forensic
Survey & Review of Digital ForensicAung Thu Rha Hein
 
Partitioned Based Regression Verification
Partitioned Based Regression VerificationPartitioned Based Regression Verification
Partitioned Based Regression VerificationAung Thu Rha Hein
 
CRAXweb: Automatic Exploit Generation for Web Applications
CRAXweb: Automatic Exploit Generation for Web ApplicationsCRAXweb: Automatic Exploit Generation for Web Applications
CRAXweb: Automatic Exploit Generation for Web ApplicationsAung Thu Rha Hein
 
Web application security: Threats & Countermeasures
Web application security: Threats & CountermeasuresWeb application security: Threats & Countermeasures
Web application security: Threats & CountermeasuresAung Thu Rha Hein
 
Can the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtCan the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtAung Thu Rha Hein
 
Fuzzy logic based students’ learning assessment
Fuzzy logic based students’ learning assessmentFuzzy logic based students’ learning assessment
Fuzzy logic based students’ learning assessmentAung Thu Rha Hein
 

Mais de Aung Thu Rha Hein (18)

Writing with ease
Writing with easeWriting with ease
Writing with ease
 
Bioinformatics for Computer Scientists
Bioinformatics for Computer Scientists Bioinformatics for Computer Scientists
Bioinformatics for Computer Scientists
 
Analysis of hybrid image with FFT (Fast Fourier Transform)
Analysis of hybrid image with FFT (Fast Fourier Transform)Analysis of hybrid image with FFT (Fast Fourier Transform)
Analysis of hybrid image with FFT (Fast Fourier Transform)
 
Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)
 
Private Browsing: A Window of Forensic Opportunity
Private Browsing: A Window of Forensic OpportunityPrivate Browsing: A Window of Forensic Opportunity
Private Browsing: A Window of Forensic Opportunity
 
Network switching
Network switchingNetwork switching
Network switching
 
Digital Forensic: Brief Intro & Research Challenge
Digital Forensic: Brief Intro & Research ChallengeDigital Forensic: Brief Intro & Research Challenge
Digital Forensic: Brief Intro & Research Challenge
 
Survey & Review of Digital Forensic
Survey & Review of Digital ForensicSurvey & Review of Digital Forensic
Survey & Review of Digital Forensic
 
Partitioned Based Regression Verification
Partitioned Based Regression VerificationPartitioned Based Regression Verification
Partitioned Based Regression Verification
 
CRAXweb: Automatic Exploit Generation for Web Applications
CRAXweb: Automatic Exploit Generation for Web ApplicationsCRAXweb: Automatic Exploit Generation for Web Applications
CRAXweb: Automatic Exploit Generation for Web Applications
 
Botnets 101
Botnets 101Botnets 101
Botnets 101
 
Session initiation protocol
Session initiation protocolSession initiation protocol
Session initiation protocol
 
TPC-H in MongoDB
TPC-H in MongoDBTPC-H in MongoDB
TPC-H in MongoDB
 
Web application security: Threats & Countermeasures
Web application security: Threats & CountermeasuresWeb application security: Threats & Countermeasures
Web application security: Threats & Countermeasures
 
Can the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtCan the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaught
 
Fuzzy logic based students’ learning assessment
Fuzzy logic based students’ learning assessmentFuzzy logic based students’ learning assessment
Fuzzy logic based students’ learning assessment
 
Link state routing protocol
Link state routing protocolLink state routing protocol
Link state routing protocol
 
Chat bot analysis
Chat bot analysisChat bot analysis
Chat bot analysis
 

Último

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Último (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Data mining & column stores

  • 1. Data Mining & Column Stores Aung Thu Rha Hein
  • 2. Why use Data Mining? • Explosive growth of data available • Major sources: • Business: Web, E-Commerce, transactions • Science : Remote Sensing, bioinformatics,…. • Society : news, gadgets, social media • Too much data but too little information • To extract useful information from the data and to interpret the data • can automate the process of finding relationships and patterns in raw data
  • 3. What is Data Mining? • Knowledge Discovery in Databases, or ”KDD” • the process of extracting hidden predictive information from large data sets • Converting information into knowledge to predict the future trends and decisions • Examples :  consumer buying behavior of retail supermarket sales  Google instant, YouTube instant  Blogs and news: Technorati, News360 and so on  Social Mining : Livehoods: find pattern and behaviors of foursquare check-in data
  • 4. Data Mining Process The Cross-Industry Standard Process (CRISP-DM) Business understanding Data understanding Data preparation Modeling Evaluation Deployment
  • 5. Techniques I. Association Rule-also known as market basket analysis.  discover interesting associations between attributes II. Classification- a technique based on machine learning  use mathematical techniques such as decision trees, linear programming, neural network and statistics. III. Clustering- makes meaningful or useful cluster of objects that have similar characteristic IV. Prediction-discovers relationship between independent variables and relationship between dependent and independent variables V. Sequential Patterns-discover similar patterns in data transaction over a business period
  • 6. Tools • There are three categories of tools for data mining: i. Traditional Data Mining Tools ii. Dashboards iii. Text-mining Tools Some data mining tools: • R- r-project.org • Datameer Analytics Solution - datameer.com • SAS Analytics- sas.com • Google Chart API- code.google.com/apis/chart
  • 7. Column Stores • stores data tables as columns of data • Column Oriented DBMS- • Bigtable, DBase, Hypertable, Cassandra(Relational) • Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright (NoSQL) • Use in systems like data warehouses and data mining • Example: Emp_ID Emp_Name Emp_Dept Emp_Salar y 1 Smith IT 40000 2 Adam Sales 35000 3 Jones Marketing 45000 the database must coax its two-dimensional table into one for the operating system • 1,2,3 Smith, Adam, Jones IT, Sales, Marketing 40000, 35000, 45000
  • 8. Advantages and Disadvantages of Column Stores Advantages • Only need to read relevant data( improved bandwidth utilization) • Improved cache locality  No need to transmit surrounding attributes • Compression efficiency-column compress better than rows  Because rows contain values from different domain  Row-store compression ratio: 1:3  Colum-Store: 1:10 Disadvantages • Increased Disk seek time • Increased cost of inserts. • Increased tuple reconstruction costs
  • 9. Case Study: Bazaarvoice • Facing difficulties to aggregate large amounts of data on the fly in real time for analytics product • Common among queries- a small number of columns with most values being aggregates such as counts, sums and averages • Use InfoBright, an open source database built on MySQL • Test result using a data set with 100MM records in the main fact table • Average query execution time for analytical queries was 20x faster than MySQL’s
  • 10. Case Study: Bazaarvoice(cont.) • disk footprint was over 10x smaller compared to MySQL due to data compression. • Why? • Column stores- small disk I/O • “knowledge grid”, aggregate data Infobright calculates during data loading • E.g. pre-calculate min, max, and avg value for each column in the pack • Limitations of InfoBright • does not support DML • only way is to bulk loads using “LOAD DATA INFILE …” command • no way to update or delete existing data without reloading the table
  • 11. References Data Mining • http://en.wikipedia.org/wiki/Data_mining • http://www.inc.com/magazine/20101001/4-essential-data-mining-tools.html • http://www.dataminingtechniques.net/ • http://www.unc.edu/~xluan/258/datamining.html • http://www.data-miners.com/ • http://www.exforsys.com/tutorials/data-mining/how-data-mining-is-evolving.html • http://livehoods.org/ Column Stores • http://en.wikipedia.org/wiki/Column_store • http://developer.bazaarvoice.com/why-columns-are-cool • http://www.calpont.com/doc/Calpont_Whitepaper-Best-Practices- in_the_Use_of_Columnar_Databases.pdf