4. Why Mine Data? Commercial Viewpoint
Lots of data is being collected
and warehoused
Web data, e-commerce
purchases at department/
grocery stores
Bank/Credit Card
transactions
Computers have become cheaper and more
powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in
SACC2011
Customer Relationship Management)
5. Why Mine Data? Scientific Viewpoint
Data collected and stored at
enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene
expression data
Traditional techniques infeasible for
raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
SACC2011
6. Mining Large Data Sets - Motivation
There is often information “hidden” in the data
that is not readily evident
Human analysts may take weeks to discover
useful information
Much of the data is never analyzed at all
4,000,000
3,500,000
3,000,000 The Data Gap
2,500,000
2,000,000 Total new disk (TB) since 1995
1,500,000
1,000,000 Number of
500,000 analysts
0 SACC2011
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
1995 1996 1997 1998 1999
7. What is Data Mining?
Many Definitions
Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data
Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
SACC2011
8. What is (not) Data Mining?
lWhat is not Data l What is Data Mining?
Mining?
– Look up phone – Certain names are more
number in phone prevalent in certain US
directory locations (O’Brien, O’Rurke,
O’Reilly… in Boston area)
– Query a Web – Group together similar
search engine for documents returned by
information about search engine according to
“Amazon” their context (e.g. Amazon
rainforest, Amazon.com,)
SACC2011
9. Origins of Data Mining
Draws ideas from machine learning/AI,
pattern recognition, statistics, and
database systems
Traditional Techniques Statistics/
Machine Learning/
Pattern
may be unsuitable due to AI
Recognition
Enormity of data
Data Mining
High dimensionality
of data
Database
Heterogeneous, systems
distributed nature
of data SACC2011
10. Data Mining Tasks
Data
Mining
Tasks
Prediction Description
Methods Methods
Use some Find human-
variables to interpretable
predict unknown patterns that
or future values of describe the data.
other variables.
SACC2011
11. Data Mining Tasks...
Classification [Predictive] Clustering [Descriptive]
Regression Data Association Rule
Discovery [Descriptive]
[Predictive]
Mining
Deviation Detection
[Predictive] Sequential Pattern Discovery
[Descriptive]
SACC2011
40. Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then
it’s probably a duck
Compute
Distance Test
Record
Training Choose k of the
Records “nearest” records
SACC2011
41. Nearest-Neighbor Classifiers
Unknown record l Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
l To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
SACC2011
42. Definition of Nearest Neighbor
X X X
(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor
K-nearest neighbors of a record x are data points
that have the k smallest distance to x
SACC2011
48. Privacy-preserving Data Mining
Hide sensitive individual data
values from the outside world
A Random Rotation
• Privacy-
•
Perturbation Preserving Data
Approach to Privacy Data mining
Data Classification conversion
A Framework for
•
•Deriving Private cryptology
Information from High Accuracy
Randomized Data … Privacy-
Preserving
Mining
A valid and effcient decision
model based on the distorted
data can be constructed
SACC2011
49. 微分流形:保持拓扑特性
设 M 是一个Hausdorff 拓扑空间, 若对每一点 p M ,都有
P 的一个开领域 U 和 R n 的一个开子集同胚, 则称 M 为 n
维拓扑流形, 简称为 n 维流形.
SACC2011
50. 几种流形学习算法
1 2 3
等距映射(Isomap)
局部线性嵌入(LLE) 拉普拉斯特征映射
J.B. Tenenbaum, V. de (Laplacian Eigenmap)
S. T. Roweis and L.
Silva, and J. C. M. Belkin, P. Niyogi,
K. Saul. Nonlinear
Langford. A global Laplacian Eigenmaps
dimensionality for Dimensionality
geometric framework
reduction by locally Reduction and Data
for nonlinear
linear embedding. Representation. Neural
dimensionality
Science, vol. 290, pp. Computation, Vol. 15,
reduction. Science, vol. Issue 6, pp. 1373 –1396,
2323--2326, 2000.
290, pp. 2319--2323, 2003 .
2000.
SACC2011
53. Dimensionality Reduction:
ISOMAP
By: Tenenbaum, de Silva,
Langford (2000)
Construct a neighbourhood graph
For each pair of points in the graph,
compute the shortest geodesic distances
SACC2011