CACS Lafayette (LA), May 1, 2009. A Domain-Driven Framework for Clustering with Plug-in Fitness Functions and its Application to Spatial Data Mining. Christoph F. Eick, Department of Computer Science, University of Houston
Talk Outline
1. Domain-driven Data Mining (D3M, DDDM)
2. A Framework for Clustering with Plug-in Fitness Functions
3. MOSAIC—a Clustering Algorithm that Supports Plug-in Fitness Functions
4. Popular Fitness Functions
5. Case Studies: Applications to Spatial Data Mining (Co-location Mining, Multi-objective Clustering, Change Analysis in Spatial Data)
6. Summary and Conclusion
Other Contributors to the Work Presented Today
Current PhD Students: Oner-Ulvi Celepcikay, Chun-Shen Chen, Rachsuda Jiamthapthaksin, Vadeerat Rinsurongkawong
Former PhD Student: Wei Ding (Assistant Professor, UMASS Boston)
Former Master Students: Rachana Parmar, Dan Jiang, Seungchan Lee
Domain Experts: Jean-Philippe Nicot (Bureau of Economic Geology, UT Austin), Tomasz F. Stepinski (Lunar and Planetary Institute, Houston), Michael Twa (College of Optometry, University of Houston)
DDDM—what is it about? Differences concerning the objectives of data mining have created a gap between academia and the application of data mining in business and science. Traditional data mining targets the production of generic, domain-independent algorithms and tools; as a result, data mining algorithms have little capability to adapt to external, domain-specific constraints and evaluation measures. To overcome this mismatch, current research has recognized the need to incorporate domain intelligence into data mining algorithms. Domain intelligence requires: the involvement of domain knowledge and experts, the consideration of domain constraints and domain-specific evaluation measures, and the discovery of in-depth patterns based on a deep domain model. On top of the data-driven framework, DDDM aims to develop novel methodologies and techniques for integrating domain knowledge as well as actionability measures into the KDD process, and to actively involve humans.
The Vision of DDDM    “DDDM…can assist in a paradigm shift from “data-driven hidden pattern mining” to “domain-driven actionable knowledge discovery”, and provides support for KDD to be translated to the real business situations as widely expected.” [CZ07]
IEEE TKDE Special Issue
2. Clustering with Plug-in Fitness Functions Motivation: Finding subgroups in geo-referenced datasets has many applications. However, in many applications the subgroups to be searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation. Domain knowledge frequently imposes additional requirements concerning what constitutes a “good” subgroup. Consequently, it is desirable to develop clustering algorithms that provide plug-in fitness functions that allow domain experts to express desirable characteristics of subgroups they are looking for. Only very few clustering algorithms published in the literature provide plug-in fitness functions; consequently existing clustering paradigms have to be modified and extended by our research to provide such capabilities.
Clustering with Plug-In Fitness Functions. Clustering algorithms can be categorized by how they use fitness functions: no fitness function, an implicit fitness function, a fixed fitness function, or a plug-in fitness function. Algorithms discussed: DBSCAN, Hierarchical Clustering, K-Means, PAM, CHAMELEON, and MOSAIC; of these, only MOSAIC provides a plug-in fitness function.
Current Suite of Spatial Clustering Algorithms
Representative-based: SCEC [1], SPAM [3], CLEVER [4]
Grid-based: SCMRG [1]
Agglomerative: MOSAIC [2]
Density-based: SCDE [4], DCONTOUR [8] (not really plug-in, but some fitness functions can be simulated)
Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.
Spatial Clustering Algorithms. Datasets are assumed to have the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude; <chemical concentrations>+). Clusters are found in the subspace of the spatial attributes, called regions in the following. The non-spatial attributes are used by the fitness function but neither in distance computations nor by the clustering algorithm itself. Clustering algorithms are assumed to maximize reward-based fitness functions of the form q(X) = Σ_{c∈X} interestingness(c) * size(c)^b, where b > 1 is a parameter that determines the premium put on cluster size (larger values of b → fewer, larger clusters).
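To make the structure of such fitness functions concrete, here is a minimal Python sketch; the plug-in interestingness callable and the representation of clusters as lists of object indices are illustrative assumptions, not the framework's actual interface.

```python
from typing import Callable, List, Sequence

def fitness_q(clusters: List[Sequence[int]],
              interestingness: Callable[[Sequence[int]], float],
              b: float = 1.01) -> float:
    """q(X) = sum over clusters c of interestingness(c) * size(c)**b, with b > 1."""
    return sum(interestingness(c) * (len(c) ** b) for c in clusters)

# Toy usage: a domain expert plugs in a purity-style interestingness measure.
labels = ["hot", "hot", "cold", "hot", "cold", "cold", "cold"]

def purity(cluster: Sequence[int]) -> float:
    counts: dict = {}
    for i in cluster:
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    return max(counts.values()) / len(cluster)

X = [[0, 1, 3], [2, 4, 5, 6]]            # a clustering with two clusters
print(fitness_q(X, purity, b=1.2))        # a larger b puts a premium on big clusters
```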
3. MOSAIC—a Clustering Algorithm that Supports Plug-in Fitness Functions. MOSAIC [2] supports plug-in fitness functions and provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs, and approximates arbitrary-shape clusters using unions of small convex polygons. (a) input (b) output. Fig. 6: An illustration of MOSAIC's approach
3.1 Representative-based Clustering
Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Properties: each object is assigned to its closest representative; cluster shapes are limited to convex polygons.
Popular Algorithms: K-means, K-medoids, CLEVER, SPAM
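As an illustration of the properties above (and not of CLEVER or SPAM themselves), the following sketch forms a clustering from a given set of representatives by assigning every object to its closest representative, which is what limits cluster shapes to convex (Voronoi-cell) polygons.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def assign_to_representatives(points: Sequence[Point],
                              reps: Sequence[Point]) -> List[List[int]]:
    """Form one cluster per representative by nearest-representative assignment."""
    clusters: List[List[int]] = [[] for _ in reps]
    for idx, p in enumerate(points):
        nearest = min(range(len(reps)), key=lambda r: math.dist(p, reps[r]))
        clusters[nearest].append(idx)
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (4.9, 5.3), (0.2, 0.1)]
reps = [(0.0, 0.0), (5.0, 5.0)]
print(assign_to_representatives(pts, reps))   # -> [[0, 1, 4], [2, 3]]
```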
3.2 MOSAIC and Agglomerative Clustering. Traditional agglomerative clustering algorithms: the decision of which clusters to merge next is made solely based on distances between clusters; in particular, the two clusters that are closest to each other with respect to a distance measure (single link, group average, …) are merged. The use of some distance measures might lead to non-contiguous clusters. Example: if group average is used, clusters C3 and C4 would be merged next.
MOSAIC and Agglomerative Clustering. Advantages of MOSAIC over traditional agglomerative clustering: plug-in fitness function; conducts a wider search—it considers all neighboring clusters and merges the pair of clusters that enhances fitness the most; clusters are always contiguous; the expensive algorithm is only run for 20–1000 iterations; highly generic algorithm.
3.3 Proximity Graphs How to identify neighbouring clusters for representative-based clustering algorithms? Proximity graphs provide various definitions of “neighbour”: NNG = Nearest Neighbour Graph MST = Minimum Spanning Tree RNG = Relative Neighbourhood Graph GG = Gabriel Graph DT = Delaunay Triangulation (neighbours of a 1NN-classifier)
Proximity Graphs: Delaunay. The Delaunay Triangulation is the dual of the Voronoi diagram. Three points are each other's neighbours if their circumscribing sphere contains no other points. Complete: captures all neighbouring clusters. Time-consuming to compute; impossible to compute in high dimensions.
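For low-dimensional representatives, the Delaunay neighbour pairs can be obtained with an off-the-shelf triangulation; the sketch below uses scipy.spatial.Delaunay as one possible implementation choice, not as part of the framework itself.

```python
import numpy as np
from scipy.spatial import Delaunay

reps = np.array([[0.0, 0.0], [1.0, 0.1], [0.5, 1.0], [2.0, 1.5], [1.5, -0.5]])
tri = Delaunay(reps)

edges = set()
for simplex in tri.simplices:                 # each row is a triangle of point indices
    for i in range(3):
        for j in range(i + 1, 3):
            edges.add(tuple(sorted((int(simplex[i]), int(simplex[j])))))
print(sorted(edges))                          # Delaunay neighbour pairs = merge candidates
```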
Proximity Graphs: Gabriel. The Gabriel graph is a subgraph of the Delaunay Triangulation (so some decision boundaries might be missed). Points are neighbours only if their (diametral) sphere of influence is empty. Can be computed more efficiently: O(k^3). Approximate algorithms with lower complexity exist.
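A direct O(k^3) construction of the Gabriel graph follows immediately from the empty-diametral-sphere test; the sketch below is a toy illustration of that test, not the framework's implementation.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def gabriel_edges(points: Sequence[Point]) -> List[Tuple[int, int]]:
    """(i, j) is a Gabriel edge iff no third point lies strictly inside the sphere
    that has the segment between points i and j as its diameter."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij_sq = math.dist(points[i], points[j]) ** 2
        empty = all(math.dist(points[i], points[k]) ** 2 +
                    math.dist(points[k], points[j]) ** 2 >= d_ij_sq
                    for k in range(len(points)) if k not in (i, j))
        if empty:
            edges.append((i, j))
    return edges

print(gabriel_edges([(0, 0), (2, 0), (1, 0.2), (1, 3)]))
```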
MOSAIC’s Input Fig. 10: Gabriel graph for clusters generated by  a representative-based clustering algorithm
3.4 Pseudo Code MOSAIC
1. Run a representative-based clustering algorithm to create a large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge-candidate relation using proximity graphs.
4. WHILE there are merge-candidates (Ci, Cj) left
   BEGIN
      Merge the pair of merge-candidates (Ci, Cj) that enhances fitness function q the most into a new cluster C'
      Update merge-candidates: ∀C Merge-Candidate(C', C) ⇔ Merge-Candidate(Ci, C) ∨ Merge-Candidate(Cj, C)
   END
RETURN the best clustering X found.
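The following Python sketch condenses the greedy loop of the pseudo code above; the data structures (clusters as lists of object indices, merge candidates as a set of unordered cluster-id pairs) and helper names are assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Set

def merge_score(clusters: Dict[int, List[int]], q, ci: int, cj: int) -> float:
    """Fitness of the clustering obtained by merging clusters ci and cj."""
    rest = [c for k, c in clusters.items() if k not in (ci, cj)]
    return q(rest + [clusters[ci] + clusters[cj]])

def mosaic(clusters: Dict[int, List[int]],
           candidates: Set[FrozenSet[int]],
           q: Callable[[Sequence[List[int]]], float]):
    """Greedy loop: always perform the merge that yields the highest fitness,
    keep merging while candidates remain, and return the best clustering seen."""
    best, best_score = dict(clusters), q(list(clusters.values()))
    next_id = max(clusters) + 1
    while candidates:
        # wider search: evaluate every neighbouring pair and merge the best one
        ci, cj = sorted(max(candidates, key=lambda e: merge_score(clusters, q, *sorted(e))))
        new_id, next_id = next_id, next_id + 1
        clusters[new_id] = clusters.pop(ci) + clusters.pop(cj)
        # update candidates: former neighbours of ci or cj become neighbours of the new cluster
        updated: Set[FrozenSet[int]] = set()
        for e in candidates:
            rest = e - {ci, cj}
            if len(rest) == 2:
                updated.add(e)
            elif len(rest) == 1:
                updated.add(frozenset(rest | {new_id}))
        candidates = updated
        score = q(list(clusters.values()))
        if score > best_score:
            best, best_score = dict(clusters), score
    return best, best_score

# toy usage with a trivial additive fitness that favours big clusters
q = lambda X: sum(len(c) ** 1.5 for c in X)
print(mosaic({0: [0, 1], 1: [2], 2: [3, 4]},
             {frozenset({0, 1}), frozenset({1, 2})}, q))
```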
Complexity MOSAIC. Let n be the number of objects in the dataset and k be the number of clusters generated by the representative-based algorithm. Complexity of MOSAIC: O(k^3 + k^2 * O(q(X))). Remarks: the above formula assumes that fitness is computed from scratch whenever a new clustering is obtained; lower complexities can be obtained by incrementally reusing the results of previous fitness computations; our current implementation assumes that only additive fitness functions are used.
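For an additive fitness function, the gain of a candidate merge can be computed incrementally, which is what the remark about lower complexities refers to; a hedged sketch:

```python
from typing import Callable, List, Sequence

def merge_gain(ci: List[int], cj: List[int],
               reward: Callable[[Sequence[int]], float],
               cached_reward_ci: float, cached_reward_cj: float) -> float:
    """For additive q(X) = sum of reward(c), a merge changes q by exactly this amount,
    so only the reward of the merged cluster has to be (re)computed."""
    return reward(ci + cj) - cached_reward_ci - cached_reward_cj

reward = lambda c: len(c) ** 1.2                       # toy reward = interestingness * size^b
print(merge_gain([0, 1], [2, 3, 4], reward, reward([0, 1]), reward([2, 3, 4])))
```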
4. Interestingness Measures for Spatial Clustering with Plug-in Fitness Functions. Clustering algorithms maximize fitness functions of the structure given earlier, q(X) = Σ_{c∈X} interestingness(c) * size(c)^b. Various interestingness functions have been introduced in our preliminary work: for supervised clustering [1]; for maximizing the variance of a continuous variable [5]; for regional association rule scoping [9]; for co-location patterns involving continuous variables [4]; …. Some examples of fitness functions will be presented in the case studies.
5. Case Studies: co-location patterns involving arsenic pollution; multi-objective clustering; change analysis involving earthquake patterns.
5.1 Co-location Patterns Involving Arsenic Pollution
Regional Co-location Mining. Goal: discover regional co-location patterns involving continuous variables, in which the continuous variables take values from the wings of their statistical distributions. Dataset: (longitude, latitude, <concentrations>+).
Summary of the Co-location Approach. Pattern interestingness in a region is evaluated using products of (cut-off) z-scores; in general, products of z-scores measure correlation. Additionally, purity is considered, controlled by its own parameter. Finally, the size-premium parameter (the exponent b in the fitness function) determines how much premium is put on the size of a region when computing region rewards.
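A hedged sketch of the z-score-product idea follows; the exact measure in [4] additionally involves the purity and size-premium parameters mentioned above, which are omitted here.

```python
import numpy as np

def zscore_product_interestingness(region_z: np.ndarray, cutoff: float = 0.0) -> float:
    """region_z: rows = objects in the region, columns = z-scores of the attributes
    that form the co-location pattern (e.g. As and Mo). Scores below the cut-off
    contribute nothing, so only objects on the relevant wing of a distribution count."""
    clipped = np.clip(region_z, cutoff, None)            # cut-off z-scores
    return float(np.mean(np.prod(clipped, axis=1)))       # average product per object

region = np.array([[1.8, 2.1],    # As and Mo both strongly elevated
                   [0.4, 1.2],
                   [-0.9, 2.5]])  # As below average -> contributes 0
print(zscore_product_interestingness(region))
```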
Domain-Driven Clustering for Co-location Mining (with a hydrologist as the domain expert in the loop):
1. Define the problem
2. Create/select a fitness function
3. Select a clustering algorithm
4. Select parameters of the clustering algorithm, parameters of the fitness function, and constraints with respect to which patterns are considered
5. Run the clustering algorithm to discover interesting regions and their associated patterns
6. Analyze the results
Example: 2 Sets of Results Using Medium/High Rewards for Purity
Challenges of Regional Co-location Mining (RCLM). It is a kind of “seeking a needle in a haystack” problem, because we search for both interesting places and interesting patterns. Our current interestingness measure is not anti-monotone: a superset of a co-location set might be more interesting. Observation: different fitness-function parameter settings lead to quite different results, many of which are valuable to domain experts; therefore, it is desirable to combine the results of many runs. “Clustering of the future”: run clustering algorithms multiple times with multiple fitness functions and summarize the results → multi-run/multi-objective clustering.
5.2 Multi-Run Clustering. Find clusters that are good with respect to multiple objectives in an automated fashion. Each objective is captured in a reward-based fitness function. To achieve this goal, we run clustering algorithms multiple times with respect to compound fitness functions that capture multiple objectives and store non-dominated clusters in a cluster repository. Summarization tools are provided that create final clusterings with respect to a user's perspective.
An Architecture for Multi-objective Clustering. Given: a set of objectives Q that need to be satisfied; moreover, Q' ⊆ Q. Components: a Goal-driven Fitness Function Generator, a Clustering Algorithm, a Storage Unit holding the cluster list M, and a Cluster Summarization Unit producing the final clusters M'. Steps in multi-run clustering: S1: generate a compound fitness function for Q'. S2: run a clustering algorithm with it. S3: update the cluster list M. S4: summarize the clusters discovered → M'.
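Two ingredients of this architecture can be sketched compactly: a compound fitness function assembled from a subset Q' of the objectives (combined here by summation, which is an assumption for illustration) and a repository update that keeps only non-dominated clusters.

```python
from typing import Callable, Dict, List, Sequence

Objective = Callable[[Sequence[int]], float]     # reward of one cluster under one objective

def compound_fitness(objectives: Dict[str, Objective], q_prime: List[str]):
    """Build a fitness function from the subset Q' of objectives (combined by summation)."""
    def q(clusters: Sequence[Sequence[int]]) -> float:
        return sum(objectives[name](c) for name in q_prime for c in clusters)
    return q

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """a dominates b if it is at least as good on every objective and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def update_repository(repo: List[Dict[str, float]], new: Dict[str, float]) -> List[Dict[str, float]]:
    """Keep only non-dominated clusters (each described by its per-objective rewards)."""
    if any(dominates(old, new) for old in repo):
        return repo
    return [old for old in repo if not dominates(new, old)] + [new]

repo: List[Dict[str, float]] = []
repo = update_repository(repo, {"As-Mo": 0.8, "As-V": 0.2})
repo = update_repository(repo, {"As-Mo": 0.5, "As-V": 0.6})   # kept: a genuine trade-off
repo = update_repository(repo, {"As-Mo": 0.4, "As-V": 0.1})   # dropped: dominated
print(repo)
```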
Example: Multi-Objective RCLM. Finding co-location patterns with respect to Arsenic and a single other chemical is a single objective; we are interested in finding co-location regions that satisfy multiple of those objectives, that is, where high arsenic concentrations are co-located with high concentrations of many other chemicals. Figure a: the top 5 regions ordered by rewards using the user-defined query {As, Mo}: AsMoVBF-Cl-SO42-TDS (Rank 1), AsMoVBF-Cl-SO42-TDS (Rank 2), AsMoVF-Cl-SO42-TDS (Rank 3), AsMo Cl-SO42-TDS (Rank 4), AsMoB Cl-SO42-TDS (Rank 5).
5.3 Change Analysis in Spatial Data. Question: how do interesting regions, where deep earthquakes are in close proximity to shallow earthquakes, change? Red: clusters in O_old; blue: clusters in O_new. Cluster interestingness measure: variance of earthquake depth.
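The interestingness measure used in this case study can be sketched as the variance of earthquake depth within a region (a plain sample-variance stand-in, not the exact normalized measure):

```python
import statistics

def depth_variance_interestingness(depths_in_region):
    """Regions where deep and shallow earthquakes co-occur get a high score."""
    return statistics.pvariance(depths_in_region) if len(depths_in_region) > 1 else 0.0

print(depth_variance_interestingness([5.0, 7.0, 600.0, 12.0]))   # mixed depths -> high
print(depth_variance_interestingness([5.0, 7.0, 6.0]))           # shallow only -> low
```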
Novelty Regions in O_new. Novelty change predicate: Novelty(r) ⇔ |r − (r'1 ∪ … ∪ r'k)| > 0, with r ∈ X_new and X_old = {r'1, …, r'k}.
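With regions represented as sets of covered objects (or grid cells), the novelty predicate becomes a simple set-difference test:

```python
from typing import List, Set

def novelty(r: Set[int], x_old: List[Set[int]]) -> bool:
    """Novelty(r) holds iff r covers something that no old region covers."""
    covered_old = set().union(*x_old) if x_old else set()
    return len(r - covered_old) > 0

x_old = [{1, 2, 3}, {7, 8}]
print(novelty({2, 3, 9}, x_old))   # True: cell/object 9 is newly covered
print(novelty({1, 7}, x_old))      # False: fully covered by regions of the old clustering
```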
Domain-Driven Change Analysis in Spatial Data (with a geologist as the domain expert in the loop):
1. Determine two datasets O_old and O_new for which change patterns have to be extracted
2. Cluster both datasets with respect to an interestingness perspective to obtain clusters for each dataset
3. Determine relevant change predicates and select thresholds for the change predicates
4. Instantiate the change predicates based on the results of step 3
5. Summarize emergent patterns
6. Analyze emergent patterns
6. Conclusion. A generic, domain-driven clustering framework has been introduced. It incorporates domain intelligence into domain-specific plug-in fitness functions that are maximized by clustering algorithms. Clustering algorithms are independent of the fitness function employed. Several clustering algorithms, including prototype-based, agglomerative, and grid-based clustering algorithms, have been designed and implemented in our past research. We conducted several case studies that illustrate the capability of the proposed domain-driven spatial clustering framework to solve challenging problems in planetary sciences, geology, environmental sciences, and optometry.
UH-DMML References
[1] C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, Discovery of Interesting Regions in Spatial Datasets Using Supervised Clustering, in Proc. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany, September 2006.
[2] C. Choo, R. Jiamthapthaksin, C.-S. Chen, O. Celepcikay, C. Giusti, and C. F. Eick, MOSAIC: A Proximity Graph Approach to Agglomerative Clustering, in Proc. 9th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Regensburg, Germany, September 2007.
[3] W. Ding, R. Jiamthapthaksin, R. Parmar, D. Jiang, T. Stepinski, and C. F. Eick, Towards Region Discovery in Spatial Datasets, in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan, May 2008.
[4] C. F. Eick, R. Parmar, W. Ding, T. Stepinski, and J.-P. Nicot, Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets, in Proc. 16th ACM SIGSPATIAL International Conference on Advances in GIS (ACM-GIS), Irvine, California, November 2008.
[5] C.-S. Chen, V. Rinsurongkawong, C. F. Eick, and M. D. Twa, Change Analysis in Spatial Data by Combining Contouring Algorithms with Supervised Density Functions, in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand, April 2009.
[6] A. Bagherjeiran, O. U. Celepcikay, R. Jiamthapthaksin, C.-S. Chen, V. Rinsurongkawong, S. Lee, J. Thomas, and C. F. Eick, Cougar**2: An Open Source Machine Learning and Data Mining Development Framework, in Proc. Open Source Data Mining Workshop (OSDM), Bangkok, Thailand, April 2009.
[7] C. F. Eick, O. U. Celepcikay, and R. Jiamthapthaksin, A Unifying Domain-driven Framework for Clustering with Plug-in Fitness Functions and Region Discovery, submitted to IEEE TKDE.
[8] R. Jiamthapthaksin, C. F. Eick, and R. Vilalta, A Framework for Multi-Objective Clustering and its Application to Co-Location Mining, submitted to the Fifth International Conference on Advanced Data Mining and Applications (ADMA), Beijing, China, August 2009.
[9] W. Ding, C. F. Eick, X. Yuan, J. Wang, and J.-P. Nicot, A Framework for Regional Association Rule Mining and Scoping in Spatial Datasets, under review for publication in Geoinformatica.
Other References
[CZ07] L. Cao and C. Zhang, "The Evolution of KDD: Towards Domain-Driven Data Mining," International Journal of Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692, World Scientific Publishing Company, 2007.
O. Thonnard and M. Dacier, Actionable Knowledge Discovery for Threats Intelligence Support Using a Multi-Dimensional Data Mining Methodology, DDDM08.
Region Discovery Framework Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets. Treats region discovery as a clustering problem.
Region Discovery Framework Continued. The clustering algorithms we currently investigate solve the following problem. Given: a dataset O with a schema R; a distance function d defined on instances of R; a fitness function q(X) that evaluates a clustering X = {c1, …, ck} as follows: q(X) = Σ_{c∈X} reward(c) = Σ_{c∈X} interestingness(c) * size(c)^b with b > 1. Objective: find c1, …, ck ⊆ O such that: (1) ci ∩ cj = ∅ if i ≠ j; (2) X = {c1, …, ck} maximizes q(X); (3) all clusters ci ∈ X are contiguous in the spatial subspace; (4) c1 ∪ … ∪ ck ⊆ O. The clusters c1, …, ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported.
[CZ07]
Arsenic Water Pollution Problem. Arsenic pollution is a serious problem in the Texas water supply, and it is hard to explain what causes arsenic pollution to occur. Several datasets were created using the Ground Water Database (GWDB) of the Texas Water Development Board (TWDB), which tests water wells regularly; one of these datasets was used in the experimental evaluation in the paper. All the wells have non-null samples for arsenic; multiple sample values are aggregated using avg/max functions; other chemicals may have null values. Format: (Longitude, Latitude, <z-values of chemical concentrations>).