Statistics & Data Mining


 R. Akerkar
 TMRF, Kolhapur, India




                  Data Mining - R. Akerkar   1
Why Data Preprocessing?
   Data in the real world is dirty:
       incomplete: lacking attribute values, lacking certain
        attributes of interest, or containing only aggregate data
           e.g., occupation=""
       noisy: containing errors or outliers
           e.g., Salary="-10"
       inconsistent: containing discrepancies in codes or names
           e.g., Age="42", Birthday="03/07/1997"
           e.g., was rating "1,2,3", now rating "A, B, C"
           e.g., discrepancy between duplicate records

       This is the real world…
Why Is Data Dirty?
   Incomplete data comes from
       n/a data value when collected
       different considerations between the time when the data was
        collected and when it is analyzed
       human/hardware/software problems
   Noisy data comes from the process of data
       collection (instrument's fault)
       entry
       transmission
   Inconsistent data comes from
       different data sources
       functional dependency violation
Why Is Data Preprocessing
    Important?
   No quality data, no quality mining results!
       Quality decisions must be based on quality data
           e.g., duplicate or missing data may cause incorrect or even
            misleading statistics.
       A data warehouse needs consistent integration of quality data
   "Data extraction, cleaning, and transformation
    comprises the majority of the work of building a data
    warehouse." —Bill Inmon
Major Tasks in Data Preprocessing
   Data cleaning
       Fill in missing values, smooth noisy data, and resolve
        inconsistencies
   Data integration
       Integration of multiple databases, data cubes, or files
   Data transformation
       Normalization and aggregation (distance-based mining algorithms
        produce better results if data is normalized and scaled to a range)
   Data reduction
       Obtains a reduced representation in volume that produces the same or
        similar analytical results (e.g., correlation analysis)
   Data discretization
       Part of data reduction, but of particular importance for
        numerical data
Forms of data preprocessing




Data Cleaning
   Importance
       "Data cleaning is one of the three biggest problems in
        data warehousing" —Ralph Kimball
       "Data cleaning is the number one problem in data
        warehousing" —DCI survey
   Data cleaning tasks
       Fill in missing values (time consuming)
       Identify outliers and smooth out noisy data
       Correct inconsistent data
       Resolve redundancy caused by data integration
Missing Data
   Data is not always available
       e.g., many tuples have no recorded value for several attributes,
        such as customer income in sales data
   Missing data may be due to
       equipment malfunction
       inconsistency with other recorded data, leading to deletion
       data not entered due to misunderstanding
       certain data not considered important at the time of entry
       failure to register history or changes of the data
   Missing data may need to be inferred.
How to Handle Missing
    Data?
   Ignore the tuple: usually done when the class label is missing (assuming
    the task is classification); not effective when the percentage of missing
    values per attribute varies considerably.
   Fill in the missing value manually: tedious + infeasible?
   Fill it in automatically with
       a global constant: e.g., "unknown" (a new class?!)
       the attribute mean
       the attribute mean for all samples belonging to the same class: smarter
       the most probable value: inference-based, such as a Bayesian formula,
        decision tree, or regression.
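The "fill in with the attribute mean" option above can be sketched in a few lines; a minimal example assuming values come as a plain Python list with None marking a missing entry (the incomes are made up):

```python
# A minimal sketch of automatic missing-value imputation, assuming
# None marks a missing value (hypothetical income data).
def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

incomes = [30, None, 50, 40, None]
print(fill_with_mean(incomes))  # [30, 40.0, 50, 40, 40.0]
```

The class-conditional variant would apply the same function separately to the values of each class.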
Noisy Data
   Noise: random error or variance in a measured variable.

   For example, for a numeric attribute such as "price", how can we smooth
    out the data to remove the noise?

   Incorrect attribute values may be due to
     faulty data collection instruments
     data entry problems
     data transmission problems
     technology limitations
     inconsistency in naming conventions
How to Handle Noisy Data?
   Binning method:
       first sort the data and partition it into (equi-depth) bins
       then smooth by bin means, by bin medians, or by bin
        boundaries, etc.

   Regression
       smooth by fitting the data to regression functions
Binning Methods for Data Smoothing
* Binning methods smooth sorted data by consulting its neighborhood:
   the sorted values are distributed into a number of buckets.
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
   29, 34
* Partition into (equi-depth) bins of depth 4:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
  (Similarly, smoothing by bin medians can be employed.)
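The worked example above can be reproduced in a few lines; a sketch assuming equi-depth bins of a fixed depth and Python's built-in rounding for the bin means:

```python
# Equi-depth binning with smoothing by bin means and by bin boundaries,
# using the price data from the slide.
def equi_depth_bins(sorted_data, depth):
    return [sorted_data[i:i + depth] for i in range(0, len(sorted_data), depth)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's min and max.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 4)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```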
Simple Discretization Methods: Binning

   Equal-width (distance) partitioning:
       Divides the range into N intervals of equal size:
        uniform grid
       If A and B are the lowest and highest values of the
        attribute, the width of the intervals will be: W = (B - A)/N.
       The most straightforward approach, but outliers may dominate
        the presentation
       Skewed data is not handled well.

       Binning is applied to each individual feature (or
        attribute). It does not use the class information.
   Equal-depth (frequency) partitioning:
       Divides the range into N intervals, each containing
        approximately the same number of samples
       Good data scaling
       Managing categorical attributes can be tricky.
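A minimal sketch contrasting the two partitioning schemes on the price data from the smoothing example. How interval edges are treated is a choice; here each equal-width interval is half-open, except that the last one is closed so the maximum value is included:

```python
# Equal-width vs. equal-depth partitioning, N = 3.
def equal_width(data, n):
    lo, hi = min(data), max(data)
    w = (hi - lo) / n                      # W = (B - A)/N
    # Assign each value to an interval [lo + i*w, lo + (i+1)*w).
    return [[v for v in data
             if lo + i * w <= v < lo + (i + 1) * w or (i == n - 1 and v == hi)]
            for i in range(n)]

def equal_depth(sorted_data, n):
    depth = len(sorted_data) // n
    return [sorted_data[i:i + depth] for i in range(0, len(sorted_data), depth)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width(prices, 3))   # width 10: intervals [4,14), [14,24), [24,34]
print(equal_depth(prices, 3))   # 3 bins of 4 values each
```

Note how the skew in the data leaves the equal-width intervals unevenly populated, while equal-depth bins stay balanced by construction.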
Exercise 1

   Suppose the data for analysis includes the attribute Age. The age
    values for the data tuples (instances) are (in increasing order):

   13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35,
    35, 35, 35, 36, 40, 45, 46, 52, 70.

   Use binning (by bin means) to smooth the above data, using a bin
    depth of 3.
   Illustrate your steps, and comment on the effect of this technique for
    the given data.
Data Integration
   Data integration:
       combines data from multiple sources into a coherent store

   Schema integration
       Entity identification problem: identify real-world entities from
        multiple data sources, e.g., A.cust-id ≡ B.cust-#
       integrate metadata from different sources

   Detecting and resolving data value conflicts
       for the same real-world entity, attribute values from different
        sources are different
       possible reasons: different representations, different scales,
        e.g., metric vs. British units
Handling Redundancy in Data Integration

   Redundant data often occur when integrating multiple databases
       The same attribute may have different names in different
        databases
       One attribute may be a "derived" attribute in another table, e.g.,
        annual revenue
   Redundant data may be detected by correlation analysis
   Careful integration of data from multiple sources may help
    reduce/avoid redundancies and inconsistencies, and improve mining
    speed and quality
Correlation Analysis

   Redundancies can be detected by this method.
   Given two attributes, such analysis can measure how strongly one
    attribute implies the other, based on the available data.

   The correlation between attributes A and B can be measured by

        r(A,B) = Σ (a_i - Ā)(b_i - B̄) / (n σ_A σ_B)

   where n is the number of tuples, Ā and B̄ are the respective mean values
    of A and B, and σ_A and σ_B are the respective standard deviations
    of A and B.
Correlation Analysis
   If the resulting value of the equation is greater than 0, then A and B
    are positively correlated.
       i.e., the values of A increase as the values of B increase.
       The higher the value, the more each attribute implies the other.
       Hence, a high value may indicate that A (or B) can be removed as a
        redundancy.

   If the resulting value is equal to zero, then A and B are uncorrelated.
       There is no linear correlation between them.

   If the resulting value is less than zero, then A and B are negatively
    correlated.
       i.e., the values of one attribute increase as the values of the other
        attribute decrease.
       Each attribute discourages the other.
Correlation Analysis




 [Figure omitted: three scatter plots of possible relationships between data.
  The plots of high positive and high negative correlation approach values of
  +1 and -1 respectively; the plot showing no correlation has a value of 0.]
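The correlation measure above can be sketched directly; a minimal example on made-up samples, using population standard deviations as in the slide's definition:

```python
# Pearson-style correlation r(A,B) = sum((a - mean_A)(b - mean_B)) / (n * sd_A * sd_B).
from math import sqrt

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Population standard deviations of A and B.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n * sd_a * sd_b)

a = [1, 2, 3, 4, 5]
print(correlation(a, [2, 4, 6, 8, 10]))   # ≈ +1.0 (perfectly positively correlated)
print(correlation(a, [10, 8, 6, 4, 2]))   # ≈ -1.0 (perfectly negatively correlated)
```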
Categorical Data
   To find the correlation between two categorical attributes we make
    use of contingency tables.

   Let us consider the following:
   Let there be 4 car manufacturers given by the set
    {A,B,C,D}, and let there be three segments of cars
    manufactured by these companies given by the set
    {S,M,L}, where S stands for small cars, M stands for
    medium-sized cars, and L stands for large cars.

   An observer collects data about the cars passing by that are
    manufactured by these companies, and categorizes them
    according to their sizes.
   To find the correlation between car manufacturers and the
    size of cars that they manufacture, we formulate a hypothesis:
    that the size of car manufactured and the companies that
    manufacture the cars are independent of each other.

   In other terms, we are saying that there is absolutely no
    correlation between the car manufacturing company and the size
    of the cars that they manufacture.
   Such a hypothesis in statistical terms is called the null
    hypothesis and is denoted by H0.

   Null hypothesis: The car size and car manufacturers are
    attributes independent of each other.
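The standard way to test such a null hypothesis is a chi-square (χ²) statistic computed from the contingency table. The counts below are hypothetical, purely for illustration:

```python
# Chi-square statistic for a manufacturer-by-size contingency table
# (the counts here are made up).
observed = {                     # rows: manufacturers; cols: sizes S, M, L
    "A": [20, 30, 10],
    "B": [15, 25, 20],
    "C": [30, 20, 10],
    "D": [10, 15, 15],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
total = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / grand_total.
chi2 = sum((rows[i][j] - row_totals[i] * col_totals[j] / total) ** 2
           / (row_totals[i] * col_totals[j] / total)
           for i in range(len(rows)) for j in range(len(col_totals)))
print(round(chi2, 2))
```

With (4 - 1) × (3 - 1) = 6 degrees of freedom, a statistic well above the 5% critical value of about 12.59 would lead us to reject the null hypothesis of independence for this made-up data.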
Data Transformation
   Smoothing: remove noise from data (binning and regression)
   Aggregation: summarization, data cube construction
    E.g., daily sales data aggregated to compute monthly and annual totals.
   Generalization: concept hierarchy climbing

   Normalization is useful for classification algorithms involving neural
    nets, clustering, etc.
   Normalization: attribute data are scaled to fall within a small, specified
    range, such as -1.0 to 1.0
       min-max normalization
       z-score normalization
       normalization by decimal scaling
   Attribute/feature construction
       New attributes constructed from the given ones
Data Transformation: Normalization
   min-max normalization (this type of normalization transforms the data
    into a desired range, usually [0,1]):

        v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA

    where [minA, maxA] is the initial range and [new_minA, new_maxA] is the
    new range.
    e.g.: If v = 73600 in [12000, 98000], then v' = 0.716 in the range [0, 1].
    Here the value for "income" is transformed to 0.716.

    It preserves the relationships among the original data values.
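The income example above can be checked with a one-line function:

```python
# min-max normalization: map v from [min_a, max_a] to [new_min, new_max].
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
```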
z-score normalization

By using this type of normalization, the mean of the transformed set
   of data points is reduced to zero. For this, the mean and
   standard deviation of the initial set of data values are required.
   The transformation formula is

        v' = (v - meanA) / std_devA

where meanA and std_devA are the mean and standard deviation
  of the initial data values.

e.g.: If meanIncome = 54000 and std_devIncome = 16000, then v
   = 73600 is transformed to v' = 1.225.

This is useful when the actual min and max of the attribute are
   unknown.
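The z-score transformation is equally short; this reproduces the income example with mean 54000 and standard deviation 16000:

```python
# z-score normalization: center on the mean, scale by the standard deviation.
def z_score(v, mean_a, std_dev_a):
    return (v - mean_a) / std_dev_a

print(z_score(73600, 54000, 16000))  # 1.225
```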
Normalisation by Decimal Scaling
   This type of scaling transforms the data into a range between [-1, 1].
    The transformation formula is

        v' = v / 10^j

    where j is the smallest integer such that Max(|v'|) < 1.

   e.g.: Suppose the recorded values of A are in the initial range [-991, 99];
    then j is 3, and v = -991 becomes v' = -0.991.
   The maximum absolute value of A is 991.
   To normalise, we divide each value by 1000 (i.e., j = 3), so -991
    normalises to -0.991.
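A sketch that also searches for the smallest j, using the [-991, 99] example:

```python
# Normalization by decimal scaling: find the smallest j such that every
# scaled value has absolute value below 1, then divide by 10^j.
def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scaling([-991, 99])
print(j, scaled)  # 3 [-0.991, 0.099]
```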
Exercise 2
   Using the data for Age in previous Question, answer the following:

a) Use min-max normalization to transform the value 35 for age into the
   range [0.0, 1.0].

b) Use z-score normalization to transform the value 35 for age, where
   the standard deviation of age is 12.94.

c) Use normalization by decimal scaling to transform the value 35 for
   age.

d) Comment on which method you would prefer to use for the given
   data, giving reasons as to why.




What Is Prediction?
   Prediction is similar to classification
       First, construct a model
       Second, use the model to predict unknown values
           The major method for prediction is regression
               Linear and multiple regression
               Non-linear regression
   Prediction is different from classification
       Classification refers to predicting categorical class labels
       Prediction models continuous-valued functions
   E.g., a model to predict the salary of a university graduate with 15 years
    of work experience.
Regression
   Regression shows a relationship between the average values of
    two variables.
   Thus regression is very useful in estimating and predicting the
    average value of one variable for a given value of the other variable.
   The estimate or prediction may be made with the help of a
    regression line.

   There are two types of variables in regression analysis:
    the independent variable and the dependent variable.
   The variable whose value is to be predicted is called the dependent
    variable, and the variable whose value is used for prediction is
    called the independent variable.
   Linear regression: If the regression curve is a straight
    line, then there is a linear regression between the two variables.

   Linear regression models a random variable Y (called the
    response variable) as a linear function of another random
    variable X (called the predictor variable):
   Y = α + βX
     Two parameters, α and β, specify the line and are to
      be estimated by using the data at hand (the regression
      coefficients).
     The variance of Y is assumed to be constant.

     The coefficients can be solved for by the method of
      least squares (which minimizes the error between the actual
      data and the estimate of the line).
Linear Regression

   Given s samples or data points of the form (x1, y1), (x2, y2), …, (xs, ys)
   The regression coefficients can be estimated as

        β = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²
        α = ȳ - β x̄

    where x̄ is the average of x1, x2, … and ȳ is the average of y1, y2, …
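The coefficient formulas can be sketched as follows; the sample points are made up to lie exactly on a known line, so the fit recovers it:

```python
# Least-squares estimates for Y = alpha + beta * X.
def least_squares(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Points lying exactly on y = 3 + 2x, so the fit should recover alpha = 3, beta = 2.
alpha, beta = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
print(alpha, beta)  # 3.0 2.0
```

The same function can be applied to the experience/salary data in Exercise 3 below.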
Multiple Regression

   Multiple regression: Y = α + β1 X1 + β2 X2

       Many nonlinear functions can be transformed into the
        above.
       The regression analysis for studying more than
        two variables at a time.
       It involves more than one predictor variable.
       The method of least squares can be applied to solve for
        α, β1, and β2.
Non-Linear Regression

   If the curve of regression is not a straight line, i.e., not a first-degree
    equation in the variables x and y, then it is called a non-linear
    regression or curvilinear regression.

   Consider a cubic polynomial relationship,
     Y = α + β1X + β2X² + β3X³

   To convert the above equation into linear form, we define new variables:
       X1 = X, X2 = X², X3 = X³

   Thus we get,
     Y = α + β1X1 + β2X2 + β3X3

   This is solvable by the method of least squares.
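The substitution trick can be checked numerically; a sketch using numpy's least-squares solver on points generated from a known cubic (numpy usage is my addition, not part of the slides):

```python
# Fit Y = alpha + b1*X + b2*X^2 + b3*X^3 by treating X, X^2, X^3 as
# three linear predictors and solving ordinary least squares.
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x - 1 * x**2 + 0.5 * x**3      # exact cubic, so the fit is exact

# Design matrix with columns [1, X1, X2, X3] = [1, X, X^2, X^3].
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coeffs, 6))  # ≈ [1, 2, -1, 0.5]
```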
Exercise 3
   The following table shows a set of paired data, where X is the number
    of years of work experience of a college graduate and Y is the
    corresponding salary of the graduate.

        X (Years Experience)   Y (Salary, in $1000s)
                 3                      30
                 8                      57
                 9                      64
                13                      72
                 3                      36
                 6                      43
                11                      59
                21                      90
                 1                      20
                16                      83

   Draw a graph of the data. Do X and Y seem to have a linear relationship?
   Also, predict the salary of a college graduate with 10 years of
    experience.
Assignment
     The following table shows the midterm and final exam grades obtained
      for students in a data mining course.

        X (Midterm exam)   Y (Final exam)
               72                84
               50                63
               81                77
               74                78
               94                90
               86                75
               59                49
               83                79
               65                77
               33                52
               88                74
               81                90

1.   Plot the data. Do X and Y seem to have a linear relationship?
2.   Use the method of least squares to find an equation for the prediction
     of a student's final grade based on the student's midterm grade in the
     course.
3.   Predict the final grade of a student who received an 86 on the midterm
     exam.

Unified Modelling LanguageR A Akerkar
 

Mais de R A Akerkar (20)

Rajendraakerkar lemoproject
Rajendraakerkar lemoprojectRajendraakerkar lemoproject
Rajendraakerkar lemoproject
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?
 
Connecting and Exploiting Big Data
Connecting and Exploiting Big DataConnecting and Exploiting Big Data
Connecting and Exploiting Big Data
 
Linked open data
Linked open dataLinked open data
Linked open data
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extraction
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data sets
 
Description logics
Description logicsDescription logics
Description logics
 
Data Mining
Data MiningData Mining
Data Mining
 
Link analysis
Link analysisLink analysis
Link analysis
 
artificial intelligence
artificial intelligenceartificial intelligence
artificial intelligence
 
Case Based Reasoning
Case Based ReasoningCase Based Reasoning
Case Based Reasoning
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language system
 
Data mining
Data miningData mining
Data mining
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization Systems
 
Rational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignRational Unified Process for User Interface Design
Rational Unified Process for User Interface Design
 
Unified Modelling Language
Unified Modelling LanguageUnified Modelling Language
Unified Modelling Language
 

Último

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 

Último (20)

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 

Statistics and Data Mining

  • 1. Statistics & Data Mining R. Akerkar TMRF, Kolhapur, India Data Mining - R. Akerkar 1
  • 2. Why Data Preprocessing? Data in the real world is dirty: incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., occupation=""), noisy (containing errors or outliers; e.g., Salary="-10"), and inconsistent (containing discrepancies in codes or names; e.g., Age="42" but Birthday="03/07/1997"; a rating that was "1,2,3" is now "A, B, C"; discrepancies between duplicate records). This is the real world.
  • 3. Why Is Data Dirty? Incomplete data comes from: "n/a" values at collection time; different considerations between the time the data was collected and the time it is analyzed; and human, hardware, or software problems. Noisy data comes from the data-handling process: faulty collection instruments, data-entry problems, and transmission errors. Inconsistent data comes from different data sources and from functional-dependency violations.
  • 4. Why Is Data Preprocessing Important? No quality data, no quality mining results! Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics. A data warehouse needs consistent integration of quality data. "Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse." (Bill Inmon)
  • 5. Major Tasks in Data Preprocessing. Data cleaning: fill in missing values, smooth noisy data, and resolve inconsistencies. Data integration: integration of multiple databases, data cubes, or files. Data transformation: normalization and aggregation (distance-based mining algorithms give better results if the data is normalized and scaled to a range). Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results (e.g., via correlation analysis). Data discretization: part of data reduction, of particular importance for numerical data.
  • 6. Forms of data preprocessing.
  • 7. Data Cleaning. Importance: "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball); "Data cleaning is the number one problem in data warehousing" (DCI survey). Data cleaning tasks: fill in missing values (time consuming); identify outliers and smooth out noisy data; correct inconsistent data; resolve redundancy caused by data integration.
  • 8. Missing Data. Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to: equipment malfunction; values inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not considered important at the time of entry; no record kept of the history or changes of the data. Missing data may need to be inferred.
  • 9. How to Handle Missing Data? Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually: tedious, and often infeasible. Fill it in automatically with: a global constant (e.g., "unknown"; but this may form a new class!); the attribute mean; the attribute mean for all samples belonging to the same class (smarter); or the most probable value, inferred with a Bayesian formula, a decision tree, or regression.
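The automatic fill-in strategies above can be sketched in Python. The income values, class labels, and helper names below are illustrative, not from the lecture; None marks a missing value.

```python
# Three of the automatic fill-in strategies from this slide, shown on a
# toy "income" column. Data and function names are invented for illustration.

def fill_with_constant(values, constant="unknown"):
    """Replace missing entries with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing entries with the attribute mean."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace missing entries with the mean of the same class (the 'smarter' option)."""
    filled = []
    for v, c in zip(values, labels):
        if v is None:
            same = [x for x, y in zip(values, labels) if y == c and x is not None]
            v = sum(same) / len(same)
        filled.append(v)
    return filled

income = [30, None, 50, 70, None, 90]
labels = ["low", "low", "low", "high", "high", "high"]

print(fill_with_mean(income))                 # [30, 60.0, 50, 70, 60.0, 90]
print(fill_with_class_mean(income, labels))   # [30, 40.0, 50, 70, 80.0, 90]
```

Note how the class-conditional mean (40.0 for "low", 80.0 for "high") stays closer to each tuple's neighborhood than the global mean of 60.0.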
  • 10. Noisy Data. Noise is random error or variance in a measured variable. For example, for a numeric attribute such as "price", how can we smooth out the data to remove the noise? Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions.
  • 11. How to Handle Noisy Data? Binning: first sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc. Regression: smooth by fitting the data to a regression function.
  • 12. Binning Methods for Data Smoothing. Binning methods smooth a sorted data set by consulting its neighborhood: the sorted values are distributed into a number of buckets.
    Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
    Partition into (equi-depth) bins of depth 4:
      Bin 1: 4, 8, 9, 15
      Bin 2: 21, 21, 24, 25
      Bin 3: 26, 28, 29, 34
    Smoothing by bin means:
      Bin 1: 9, 9, 9, 9
      Bin 2: 23, 23, 23, 23
      Bin 3: 29, 29, 29, 29
    Smoothing by bin boundaries (smoothing by bin medians can be employed similarly):
      Bin 1: 4, 4, 4, 15
      Bin 2: 21, 21, 25, 25
      Bin 3: 26, 26, 26, 34
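The worked example above can be reproduced with a short Python sketch (the function names are ours, not from the lecture):

```python
# Equi-depth binning followed by smoothing by bin means and by bin
# boundaries, reproducing the price example on the slide.

def partition(sorted_data, depth):
    """Split already-sorted data into equi-depth bins."""
    return [sorted_data[i:i + depth] for i in range(0, len(sorted_data), depth)]

def smooth_by_means(bins):
    """Replace every value by its (rounded) bin mean."""
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the two bin boundaries."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = partition(prices, 4)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```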
  • 13. Simple Discretization Methods: Binning. Equal-width (distance) partitioning divides the range into N intervals of equal size, a uniform grid: if A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N. It is the most straightforward method, but outliers may dominate the presentation, and skewed data is not handled well. Binning is applied to each individual feature (or attribute); it does not use the class information.
  • 14. Equal-depth (frequency) partitioning divides the range into N intervals, each containing approximately the same number of samples. It gives good data scaling, but managing categorical attributes can be tricky.
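A minimal sketch contrasting the two partitioning schemes on the same values. The interval-assignment convention (half-open intervals, with the maximum clamped into the last bin, and a sample count divisible by N) is our assumption, not stated on the slides:

```python
# Equal-width vs. equal-depth partitioning of the same sorted data.

def equal_width_bins(values, n):
    """N intervals of equal width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in values:
        idx = min(int((v - lo) / w), n - 1)  # clamp the maximum into the last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(sorted_values, n):
    """N bins with (approximately) the same number of samples each."""
    depth = len(sorted_values) // n  # assumes len(sorted_values) is divisible by n
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))  # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(data, 3))  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

Note how the skewed upper end of the data crowds into the last equal-width bin, while the equal-depth bins stay balanced, which is exactly the trade-off the two slides describe.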
  • 15. Exercise 1. Suppose the data for analysis includes the attribute Age. The age values for the data tuples (instances) are, in increasing order: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Use binning (by bin means) to smooth the data, using a bin depth of 3. Illustrate your steps, and comment on the effect of this technique for the given data.
  • 16. Data Integration. Data integration combines data from multiple sources into a coherent store. Schema integration: the entity identification problem is to identify real-world entities across multiple data sources, e.g., A.cust-id = B.cust-#; metadata from the different sources must be integrated. Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons are different representations or different scales, e.g., metric vs. British units.
  • 17. Handling Redundancy in Data Integration. Redundant data occurs often when integrating multiple databases: the same attribute may have different names in different databases, and one attribute may be a "derived" attribute in another table, e.g., annual revenue. Redundant data may be detectable by correlation analysis. Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality.
  • 18. Correlation Analysis. Redundancies can be detected by this method: given two attributes, such analysis measures how strongly one attribute implies the other, based on the available data. The correlation between attributes A and B can be measured by
      r(A,B) = Σ_i (a_i - mean(A)) (b_i - mean(B)) / (n · std(A) · std(B))
    where n is the number of tuples, mean(A) and mean(B) are the respective mean values of A and B, and std(A) and std(B) are the respective standard deviations of A and B.
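The measure above can be sketched directly in Python, using population standard deviations (dividing by n) to match the slide's symbols:

```python
from math import sqrt

# Correlation coefficient r(A,B) = Σ(a - Ā)(b - B̄) / (n · σA · σB).

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = sqrt(sum((x - mean_b) ** 2 for x in b) / n)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (std_a * std_b)

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0  (perfect positive)
print(round(correlation([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0 (perfect negative)
```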
  • 19. Correlation Analysis (continued). If the resulting value is greater than 0, then A and B are positively correlated: the values of A increase as the values of B increase. The higher the value, the more each attribute implies the other; hence a high value may indicate that A (or B) could be removed as a redundancy. If the resulting value equals zero, then A and B are independent: there is no correlation between them. If the resulting value is less than zero, then A and B are negatively correlated: the values of one attribute increase as the values of the other decrease, so each attribute discourages the other.
  • 20. Correlation Analysis (figure): three possible relationships between data. The graphs of high positive and high negative correlation approach values of 1 and -1 respectively; the graph showing no correlation has a value of 0.
  • 21. Categorical Data. To find the correlation between two categorical attributes we use contingency tables. Consider the following: let there be 4 car manufacturers, given by the set {A, B, C, D}, and three segments of cars manufactured by these companies, given by the set {S, M, L}, where S stands for small cars, M for medium-sized cars, and L for large cars. An observer collects data about the passing cars manufactured by these companies and categorizes them according to their sizes.
  • 22. To find the correlation between car manufacturers and the sizes of the cars they manufacture, we formulate a hypothesis: that the size of car manufactured and the company that manufactures it are independent of each other. In other words, we are saying that there is absolutely no correlation between the car-manufacturing company and the size of the cars it manufactures. Such a hypothesis is, in statistical terms, called the null hypothesis and is denoted by H0. Null hypothesis: car size and car manufacturer are independent attributes.
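The standard statistic for testing such a null hypothesis on a contingency table is chi-square (χ²): χ² = Σ (observed - expected)² / expected, with expected = row_total · column_total / grand_total under independence. The statistic itself and the car counts below are our addition, invented purely for illustration:

```python
# Chi-square statistic for a contingency table (test of independence).

def chi_square(table):
    rows = [sum(r) for r in table]          # row totals
    cols = [sum(c) for c in zip(*table)]    # column totals
    total = sum(rows)                       # grand total
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total  # expected count under H0
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Rows: manufacturers A..D; columns: sizes S, M, L (hypothetical counts).
observed = [[20, 30, 10],
            [25, 25, 10],
            [15, 35, 10],
            [20, 30, 10]]
print(round(chi_square(observed), 3))   # 4.167
```

A large χ² (relative to the chi-square distribution with (rows-1)·(cols-1) degrees of freedom) would lead us to reject H0 and conclude the attributes are correlated.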
  • 23. Data Transformation. Smoothing: remove noise from the data (binning and regression). Aggregation: summarization, data cube construction; e.g., daily sales data aggregated to compute monthly and annual totals. Generalization: concept hierarchy climbing. Normalization: attribute data are scaled to fall within a small, specified range such as -1.0 to 1.0; normalization is useful for classification algorithms involving neural nets, clustering, etc. Methods: min-max normalization, z-score normalization, and normalization by decimal scaling. Attribute/feature construction: new attributes constructed from the given ones.
  • 24. Data Transformation: Normalization. Min-max normalization transforms the data into a desired range, usually [0, 1]:
      v' = (v - minA) / (maxA - minA) · (new_maxA - new_minA) + new_minA
    where [minA, maxA] is the initial range and [new_minA, new_maxA] is the new range. E.g., if the value v = 73,600 for "income" lies in [12,000, 98,000], then v' = 0.716 in the range [0, 1]. Min-max normalization preserves the relationships among the original data values.
  • 25. Z-score normalization reduces the mean of the transformed set of data points to zero. It requires the mean and standard deviation of the initial set of data values. The transformation formula is
      v' = (v - meanA) / std_devA
    where meanA and std_devA are the mean and standard deviation of the initial data values. E.g., if meanIncome = 54,000 and std_devIncome = 16,000, then v = 73,600 is transformed to v' = 1.225. This is useful when the actual minimum and maximum of the attribute are unknown.
  • 26. Normalization by Decimal Scaling transforms the data into the range [-1, 1]:
      v' = v / 10^j
    where j is the smallest integer such that max(|v'|) < 1. E.g., suppose the recorded values of A lie in the initial range [-991, 99]; the maximum absolute value of A is 991, so j = 3 and we divide each value by 1,000: v = -991 normalizes to v' = -0.991.
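The three normalizations of slides 24-26 can be sketched together, reproducing the income examples above (mean 54,000; standard deviation 16,000; range [12,000, 98,000]):

```python
# The three normalization methods from slides 24-26.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """v' = (v - minA)/(maxA - minA) * (new_max - new_min) + new_min."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """v' = (v - meanA) / std_devA."""
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    """v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1."""
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling(-991, 3))                 # -0.991
```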
  • 27. Exercise 2. Using the data for Age from the previous question, answer the following: (a) Use min-max normalization to transform the value 35 for age into the range [0.0, 1.0]. (b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94. (c) Use normalization by decimal scaling to transform the value 35 for age. (d) Comment on which method you would prefer for the given data, giving reasons why.
  • 28. What Is Prediction? Prediction is similar to classification: first construct a model, then use the model to predict the unknown value. The major method for prediction is regression: linear and multiple regression, and non-linear regression. Prediction differs from classification: classification predicts a categorical class label, whereas prediction models continuous-valued functions; e.g., a model to predict the salary of a university graduate with 15 years of work experience.
  • 29. Regression. Regression shows a relationship between the average values of two variables; thus it is very useful for estimating and predicting the average value of one variable for a given value of the other. The estimate or prediction may be made with the help of a regression line. There are two types of variables in regression analysis: the variable whose value is to be predicted is called the dependent variable, and the variable whose value is used for prediction is called the independent variable.
  • 30. Linear regression: if the regression curve is a straight line, there is a linear regression between the two variables. Linear regression models a random variable Y (called the response variable) as a linear function of another random variable X (called a predictor variable):
      Y = α + β X
    The two parameters α and β (the regression coefficients) specify the line and are to be estimated from the data at hand. The variance of Y is assumed to be constant. The coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.
  • 31. Linear Regression. Given s samples or data points of the form (x1, y1), (x2, y2), ..., (xs, ys), the regression coefficients can be estimated as
      β = Σ_i (x_i - x̄)(y_i - ȳ) / Σ_i (x_i - x̄)²,   α = ȳ - β x̄
    where x̄ is the average of x1, x2, ... and ȳ is the average of y1, y2, ...
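A sketch of these least-squares estimates; the sample points below are invented so the fit is exact:

```python
# Simple linear regression by least squares:
#   β = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²,   α = ȳ - β·x̄.

def fit_line(points):
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in points)
            / sum((x - x_bar) ** 2 for x, _ in points))
    alpha = y_bar - beta * x_bar
    return alpha, beta

points = [(1, 3), (2, 5), (3, 7), (4, 9)]   # lies exactly on y = 1 + 2x
alpha, beta = fit_line(points)
print(alpha, beta)   # 1.0 2.0
```

The same function applies directly to the experience/salary data of Exercise 3 and the exam grades of the assignment below.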
  • 32. Multiple Regression: Y = α + β1 X1 + β2 X2. Many nonlinear functions can be transformed into this form. Multiple regression is the regression analysis for studying more than two variables at a time; it involves more than one predictor variable. The method of least squares can be applied to solve for α, β1, and β2.
  • 33. Non-Linear Regression. If the curve of regression is not a straight line, i.e., not a first-degree equation in the variables x and y, it is called non-linear (or curvilinear) regression. Consider a cubic polynomial relationship, Y = α + β1 X + β2 X² + β3 X³. To convert this equation into linear form, we define new variables X1 = X, X2 = X², X3 = X³, obtaining Y = α + β1 X1 + β2 X2 + β3 X3, which is solvable by the method of least squares.
  • 34. Exercise 3. The following table shows a set of paired data, where X is the number of years of work experience of a college graduate and Y is the graduate's corresponding salary.
      Years of experience (X) | Salary in $1000s (Y)
       3 | 30
       8 | 57
       9 | 64
      13 | 72
       3 | 36
       6 | 43
      11 | 59
      21 | 90
       1 | 20
      16 | 83
    Draw a graph of the data. Do X and Y seem to have a linear relationship? Also, predict the salary of a college graduate with 10 years of experience.
  • 35. Assignment. The following table shows the midterm and final exam grades obtained by students in a data mining course.
      Midterm exam (X) | Final exam (Y)
      72 | 84
      50 | 63
      81 | 77
      74 | 78
      94 | 90
      86 | 75
      59 | 49
      83 | 79
      65 | 77
      33 | 52
      88 | 74
      81 | 90
    1. Plot the data. Do X and Y seem to have a linear relationship?
    2. Use the method of least squares to find an equation for predicting a student's final grade from the student's midterm grade in the course.
    3. Predict the final grade of a student who received an 86 on the midterm exam.