SlideShare uma empresa Scribd logo
1 de 24
Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-HeeBae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School of Informatics and Computing Pervasive Technology Institute Indiana University SALSA project http://salsahpc.indiana.edu
Outline Introduction to Point Data Visualization Review of Dimension Reduction Algorithms. Multidimensional Scaling (MDS) Generative Topographic Mapping (GTM) Challenges Interpolation MDS Interpolation GTM Interpolation Experimental Results Conclusion 1
Point Data Visualization Visualize high-dimensional data as points in 2D or 3D by dimension reduction. Distances in target dimension approximate to the distances in the original HD space. Interactively browse data Easy to recognize clusters or groups An example of chemical data (PubChem) Visualization to display disease-gene relationship, aiming at finding cause-effect relationships between disease and genes. 2
Multi-Dimensional Scaling Pairwise dissimilarity matrix N-by-N matrix Each element can be a distance, score, rank, …  Given Δ, find a mapping in the target dimension Criteria (or objective function) STRESS SSTRESS SMACOF is one of algorithms to solve MDS problem 3
Generative Topographic Mapping K latent points N data points Input is high-dimensional vector points  Latent Variable Model (LVM) Define K latent variables (zk) Map K latent points to the data space by using a non-linear function f (by EM approach) Construct maps of data points in the latent space based on Gaussian Mixture Model  4
GTM vs. MDS 5 GTM MDS (SMACOF) ,[object Object]
 Find an optimal configuration in a lower-dimension
 Iterative optimization methodPurpose Maximize Log-Likelihood Minimize STRESS or SSTRESS ObjectiveFunction O(KN) (K << N) O(N2) Complexity EM Iterative Majorization (EM-like) Optimization Method Vector representation Pairwise Distance as well as Vector Input Format
Challenges Data is getting larger and high-dimensional PubChem : database of 60M chemical compounds Our initial results on 100K sequences need to be extended to millions of sequences Typical dimension 150-1000 MDS Results on 768 (32x24) core cluster with 1.54TB memory 6 Interpolation reduces the computational complexity O(N2)  O(n2 + (N-n)n)
Interpolation Approach Two-step procedure A dimension reduction alg. constructs a mapping of n sample data (among total N data) in target dimension. Remaining (N-n) out-of-samples are mapped in target dimension w.r.t. the constructed mapping of the n sample data w/o moving sample mappings. MPI nIn-sample Trained data Training N-n Out-of-sample Interpolated map 7 1 Interpolation 2 ...... P-1 p Total N data MapReduce
MDS Interpolation Assume it is given the mappings of n sampled data in target dimension (result of normal MDS). Landmark points (do not move during interpolation) Out-of-samples (N-n) are interpolated based on the mappings of n sample points. Find k-NN of the new point among n sample data. Based on the mappings of k-NN, find a position for a new point by the proposed iterative majorizing approach. Computational Complexity – O(Mn), M = N-n 8
GTM Interpolation Assume it is given the position of K latent points based on the sample data in the latent space. The most time consuming part of GTM Out-of-samples (N-n) are positioned directly w.r.t. Gaussian Mixture Model between the new point and the given position of K latent points. Computational Complexity – O(M), M = N-n 9
Experiment Environments 10
Quality Comparison (1) 11 GTM interpolation quality comparison w.r.t. different sample size of N = 100k MDS interpolation quality comparison w.r.t. different sample size of N = 100k
Quality Comparison (2) 12 GTM interpolation quality up to 2M MDS interpolation quality up to 2M
Parallel Efficiency 13 MDS parallel efficiency on Cluster-II GTM parallel efficiency on Cluster-II
GTM Interpolation via MapReduce 14 GTM Interpolation–Time per core to process 100k data points per core GTM Interpolationparallel efficiency ,[object Object]
DryadLINQ using a 16 core machine with 16 GB, Hadoop 8 core with 48 GB, Azure small instances with 1 core  with 1.7 GB.ThilinaGunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, “Cloud Computing Paradigms for Pleasingly Parallel  Biomedical Applications,” in Proceedings of ECMLS Workshop of ACM HPDC 2010
MDS Interpolation via MapReduce 15 DryadLINQ on 32 nodes X 24 Cores cluster with 48 GB per node. Azure using small instances ThilinaGunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, “Cloud Computing Paradigms for Pleasingly Parallel  Biomedical Applications,” in Proceedings of ECMLS Workshop of ACM HPDC 2010
MDS Interpolation Map 16 PubChem data visualization by using MDS (100k) and Interpolation (100k+100k).
GTM Interpolation Map 17 PubChem data visualization by using GTM (100k) and Interpolation (2M + 100k).
Conclusion Dimension reduction algorithms (e.g. GTM and MDS) are computation and memory intensive applications. Apply interpolation (out-of-sample) approach to GTM and MDS in order to process and visualize large- and high-dimensional dataset. It is possible to process millions data point via interpolation. Could be parallelized by MapReduce fashion as well as MPI fashion. 18
Future Works Make available as a Service Hierarchical Interpolation could reduce the computational complexity  				O(Mn) O(Mlog(n)) 19
Acknowledgment Our internal collaborators in School of Informatics and Computing at IUB Prof. David Wild Dr. Qian Zhu 20

Mais conteúdo relacionado

Mais procurados

CLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATACLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATAcsandit
 
010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian ProcessHa Phuong
 
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 AlgorithmSelf Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 AlgorithmChenghao Jin
 
Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)Ha Phuong
 
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...Daksh Raj Chopra
 
Btv thesis defense_v1.02-final
Btv thesis defense_v1.02-finalBtv thesis defense_v1.02-final
Btv thesis defense_v1.02-finalVinh Bui
 
QTML2021 UAP Quantum Feature Map
QTML2021 UAP Quantum Feature MapQTML2021 UAP Quantum Feature Map
QTML2021 UAP Quantum Feature MapHa Phuong
 
Lecture7 xing fei-fei
Lecture7 xing fei-feiLecture7 xing fei-fei
Lecture7 xing fei-feiTianlu Wang
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
 
Co-clustering of multi-view datasets: a parallelizable approach
Co-clustering of multi-view datasets: a parallelizable approachCo-clustering of multi-view datasets: a parallelizable approach
Co-clustering of multi-view datasets: a parallelizable approachAllen Wu
 
Pillar k means
Pillar k meansPillar k means
Pillar k meansswathi b
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)Sri Prasanna
 
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...acijjournal
 
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)neeraj7svp
 
[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...
[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...
[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...Susang Kim
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceinventy
 
A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringAllenWu
 

Mais procurados (20)

CLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATACLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATA
 
010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process
 
Pca ankita dubey
Pca ankita dubeyPca ankita dubey
Pca ankita dubey
 
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 AlgorithmSelf Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
Self Organizing Feature Map(SOM), Topographic Product, Cascade 2 Algorithm
 
Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)
 
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
 
Btv thesis defense_v1.02-final
Btv thesis defense_v1.02-finalBtv thesis defense_v1.02-final
Btv thesis defense_v1.02-final
 
50120140501016
5012014050101650120140501016
50120140501016
 
QTML2021 UAP Quantum Feature Map
QTML2021 UAP Quantum Feature MapQTML2021 UAP Quantum Feature Map
QTML2021 UAP Quantum Feature Map
 
Lecture7 xing fei-fei
Lecture7 xing fei-feiLecture7 xing fei-fei
Lecture7 xing fei-fei
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
 
Co-clustering of multi-view datasets: a parallelizable approach
Co-clustering of multi-view datasets: a parallelizable approachCo-clustering of multi-view datasets: a parallelizable approach
Co-clustering of multi-view datasets: a parallelizable approach
 
Pillar k means
Pillar k meansPillar k means
Pillar k means
 
Clustering (from Google)
Clustering (from Google)Clustering (from Google)
Clustering (from Google)
 
A1804010105
A1804010105A1804010105
A1804010105
 
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
 
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
 
[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...
[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...
[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Featu...
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clustering
 

Destaque

関東CV勉強会 Kernel PCA (2011.2.19)
関東CV勉強会 Kernel PCA (2011.2.19)関東CV勉強会 Kernel PCA (2011.2.19)
関東CV勉強会 Kernel PCA (2011.2.19)Akisato Kimura
 
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...Shuyo Nakatani
 
WSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender Systems
WSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender SystemsWSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender Systems
WSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender SystemsKotaro Tanahashi
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNETomoki Hayashi
 
AutoEncoderで特徴抽出
AutoEncoderで特徴抽出AutoEncoderで特徴抽出
AutoEncoderで特徴抽出Kai Sasaki
 
非線形データの次元圧縮 150905 WACODE 2nd
非線形データの次元圧縮 150905 WACODE 2nd非線形データの次元圧縮 150905 WACODE 2nd
非線形データの次元圧縮 150905 WACODE 2ndMika Yoshimura
 
CVIM#11 3. 最小化のための数値計算
CVIM#11 3. 最小化のための数値計算CVIM#11 3. 最小化のための数値計算
CVIM#11 3. 最小化のための数値計算sleepy_yoshi
 
Numpy scipyで独立成分分析
Numpy scipyで独立成分分析Numpy scipyで独立成分分析
Numpy scipyで独立成分分析Shintaro Fukushima
 
基底変換、固有値・固有ベクトル、そしてその先
基底変換、固有値・固有ベクトル、そしてその先基底変換、固有値・固有ベクトル、そしてその先
基底変換、固有値・固有ベクトル、そしてその先Taketo Sano
 
Hyperoptとその周辺について
Hyperoptとその周辺についてHyperoptとその周辺について
Hyperoptとその周辺についてKeisuke Hosaka
 

Destaque (12)

Topic Models
Topic ModelsTopic Models
Topic Models
 
関東CV勉強会 Kernel PCA (2011.2.19)
関東CV勉強会 Kernel PCA (2011.2.19)関東CV勉強会 Kernel PCA (2011.2.19)
関東CV勉強会 Kernel PCA (2011.2.19)
 
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
 
WSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender Systems
WSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender SystemsWSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender Systems
WSDM2016読み会 Collaborative Denoising Auto-Encoders for Top-N Recommender Systems
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNE
 
AutoEncoderで特徴抽出
AutoEncoderで特徴抽出AutoEncoderで特徴抽出
AutoEncoderで特徴抽出
 
LDA入門
LDA入門LDA入門
LDA入門
 
非線形データの次元圧縮 150905 WACODE 2nd
非線形データの次元圧縮 150905 WACODE 2nd非線形データの次元圧縮 150905 WACODE 2nd
非線形データの次元圧縮 150905 WACODE 2nd
 
CVIM#11 3. 最小化のための数値計算
CVIM#11 3. 最小化のための数値計算CVIM#11 3. 最小化のための数値計算
CVIM#11 3. 最小化のための数値計算
 
Numpy scipyで独立成分分析
Numpy scipyで独立成分分析Numpy scipyで独立成分分析
Numpy scipyで独立成分分析
 
基底変換、固有値・固有ベクトル、そしてその先
基底変換、固有値・固有ベクトル、そしてその先基底変換、固有値・固有ベクトル、そしてその先
基底変換、固有値・固有ベクトル、そしてその先
 
Hyperoptとその周辺について
Hyperoptとその周辺についてHyperoptとその周辺について
Hyperoptとその周辺について
 

Semelhante a Dimension Reduction And Visualization Of Large High Dimensional Data Via Interpolation

Web image annotation by diffusion maps manifold learning algorithm
Web image annotation by diffusion maps manifold learning algorithmWeb image annotation by diffusion maps manifold learning algorithm
Web image annotation by diffusion maps manifold learning algorithmijfcstjournal
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Multimodal Biometrics Recognition by Dimensionality Diminution Method
Multimodal Biometrics Recognition by Dimensionality Diminution MethodMultimodal Biometrics Recognition by Dimensionality Diminution Method
Multimodal Biometrics Recognition by Dimensionality Diminution MethodIJERA Editor
 
One shot scene specific crowd counting
One shot scene specific crowd countingOne shot scene specific crowd counting
One shot scene specific crowd countingmadhobilota
 
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...csandit
 
Clustering Algorithms for Data Stream
Clustering Algorithms for Data StreamClustering Algorithms for Data Stream
Clustering Algorithms for Data StreamIRJET Journal
 
240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...
240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...
240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...thanhdowork
 
Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...
Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...
Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...sipij
 
Laplacian-regularized Graph Bandits
Laplacian-regularized Graph BanditsLaplacian-regularized Graph Bandits
Laplacian-regularized Graph Banditslauratoni4
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxSivam Chinna
 
NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...
NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...
NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...ssuser4b1f48
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Learning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLLearning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLlauratoni4
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsDevansh16
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...cscpconf
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAEditor Jacotech
 

Semelhante a Dimension Reduction And Visualization Of Large High Dimensional Data Via Interpolation (20)

Web image annotation by diffusion maps manifold learning algorithm
Web image annotation by diffusion maps manifold learning algorithmWeb image annotation by diffusion maps manifold learning algorithm
Web image annotation by diffusion maps manifold learning algorithm
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Multimodal Biometrics Recognition by Dimensionality Diminution Method
Multimodal Biometrics Recognition by Dimensionality Diminution MethodMultimodal Biometrics Recognition by Dimensionality Diminution Method
Multimodal Biometrics Recognition by Dimensionality Diminution Method
 
One shot scene specific crowd counting
One shot scene specific crowd countingOne shot scene specific crowd counting
One shot scene specific crowd counting
 
ME Synopsis
ME SynopsisME Synopsis
ME Synopsis
 
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
 
Clustering Algorithms for Data Stream
Clustering Algorithms for Data StreamClustering Algorithms for Data Stream
Clustering Algorithms for Data Stream
 
240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...
240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...
240311_Thuy_Labseminar[Contrastive Multi-View Representation Learning on Grap...
 
Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...
Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...
Introducing New Parameters to Compare the Accuracy and Reliability of Mean-Sh...
 
Laplacian-regularized Graph Bandits
Laplacian-regularized Graph BanditsLaplacian-regularized Graph Bandits
Laplacian-regularized Graph Bandits
 
crowd counting.pptx
crowd counting.pptxcrowd counting.pptx
crowd counting.pptx
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptx
 
NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...
NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...
NS - CUK Seminar: V.T.Hoang, Review on "Long Range Graph Benchmark.", NeurIPS...
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Learning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLLearning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RL
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
 
1376846406 14447221
1376846406  144472211376846406  14447221
1376846406 14447221
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCA
 

Dimension Reduction And Visualization Of Large High Dimensional Data Via Interpolation

  • 1. Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-HeeBae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School of Informatics and Computing Pervasive Technology Institute Indiana University SALSA project http://salsahpc.indiana.edu
  • 2. Outline Introduction to Point Data Visualization Review of Dimension Reduction Algorithms. Multidimensional Scaling (MDS) Generative Topographic Mapping (GTM) Challenges Interpolation MDS Interpolation GTM Interpolation Experimental Results Conclusion 1
  • 3. Point Data Visualization Visualize high-dimensional data as points in 2D or 3D by dimension reduction. Distances in target dimension approximate to the distances in the original HD space. Interactively browse data Easy to recognize clusters or groups An example of chemical data (PubChem) Visualization to display disease-gene relationship, aiming at finding cause-effect relationships between disease and genes. 2
  • 4. Multi-Dimensional Scaling Pairwise dissimilarity matrix N-by-N matrix Each element can be a distance, score, rank, … Given Δ, find a mapping in the target dimension Criteria (or objective function) STRESS SSTRESS SMACOF is one of algorithms to solve MDS problem 3
  • 5. Generative Topographic Mapping K latent points N data points Input is high-dimensional vector points Latent Variable Model (LVM) Define K latent variables (zk) Map K latent points to the data space by using a non-linear function f (by EM approach) Construct maps of data points in the latent space based on Gaussian Mixture Model 4
  • 6.
  • 7. Find an optimal configuration in a lower-dimension
  • 8. Iterative optimization methodPurpose Maximize Log-Likelihood Minimize STRESS or SSTRESS ObjectiveFunction O(KN) (K << N) O(N2) Complexity EM Iterative Majorization (EM-like) Optimization Method Vector representation Pairwise Distance as well as Vector Input Format
  • 9. Challenges Data is getting larger and high-dimensional PubChem : database of 60M chemical compounds Our initial results on 100K sequences need to be extended to millions of sequences Typical dimension 150-1000 MDS Results on 768 (32x24) core cluster with 1.54TB memory 6 Interpolation reduces the computational complexity O(N2)  O(n2 + (N-n)n)
  • 10. Interpolation Approach Two-step procedure A dimension reduction alg. constructs a mapping of n sample data (among total N data) in target dimension. Remaining (N-n) out-of-samples are mapped in target dimension w.r.t. the constructed mapping of the n sample data w/o moving sample mappings. MPI nIn-sample Trained data Training N-n Out-of-sample Interpolated map 7 1 Interpolation 2 ...... P-1 p Total N data MapReduce
  • 11. MDS Interpolation Assume it is given the mappings of n sampled data in target dimension (result of normal MDS). Landmark points (do not move during interpolation) Out-of-samples (N-n) are interpolated based on the mappings of n sample points. Find k-NN of the new point among n sample data. Based on the mappings of k-NN, find a position for a new point by the proposed iterative majorizing approach. Computational Complexity – O(Mn), M = N-n 8
  • 12. GTM Interpolation Assume it is given the position of K latent points based on the sample data in the latent space. The most time consuming part of GTM Out-of-samples (N-n) are positioned directly w.r.t. Gaussian Mixture Model between the new point and the given position of K latent points. Computational Complexity – O(M), M = N-n 9
  • 14. Quality Comparison (1) 11 GTM interpolation quality comparison w.r.t. different sample size of N = 100k MDS interpolation quality comparison w.r.t. different sample size of N = 100k
  • 15. Quality Comparison (2) 12 GTM interpolation quality up to 2M MDS interpolation quality up to 2M
  • 16. Parallel Efficiency 13 MDS parallel efficiency on Cluster-II GTM parallel efficiency on Cluster-II
  • 17.
  • 18. DryadLINQ using a 16 core machine with 16 GB, Hadoop 8 core with 48 GB, Azure small instances with 1 core with 1.7 GB.ThilinaGunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, “Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications,” in Proceedings of ECMLS Workshop of ACM HPDC 2010
  • 19. MDS Interpolation via MapReduce 15 DryadLINQ on 32 nodes X 24 Cores cluster with 48 GB per node. Azure using small instances ThilinaGunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, “Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications,” in Proceedings of ECMLS Workshop of ACM HPDC 2010
  • 20. MDS Interpolation Map 16 PubChem data visualization by using MDS (100k) and Interpolation (100k+100k).
  • 21. GTM Interpolation Map 17 PubChem data visualization by using GTM (100k) and Interpolation (2M + 100k).
  • 22. Conclusion Dimension reduction algorithms (e.g. GTM and MDS) are computation and memory intensive applications. Apply interpolation (out-of-sample) approach to GTM and MDS in order to process and visualize large- and high-dimensional dataset. It is possible to process millions data point via interpolation. Could be parallelized by MapReduce fashion as well as MPI fashion. 18
  • 23. Future Works Make available as a Service Hierarchical Interpolation could reduce the computational complexity O(Mn) O(Mlog(n)) 19
  • 24. Acknowledgment Our internal collaborators in School of Informatics and Computing at IUB Prof. David Wild Dr. Qian Zhu 20
  • 25. Thank you Question? Email me at sebae@cs.indiana.edu 21
  • 26. EM optimization Find K centers for N data K-clustering problem, known as NP-hard Use Expectation-Maximization (EM) method EM algorithm Find local optimal solution iteratively until converge E-step: M-step: 22
  • 27. Parallelization Interpolation is pleasingly parallel application Out-of-sample data are independent each other. We can parallelize interpolation app. by MapReduce fashion as well as MPI fashion. ThilinaGunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, “Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications,” in Proceedings of ECMLS Workshop of ACM HPDC 2010 23 nIn-sample Trained data Training N-n Out-of-sample Interpolated map 1 Interpolation 2 ...... P-1 p Total N data

Notas do Editor

  1. Microarray data
  2. EC2 HM4XL - High memory large instances (8 * 3.25 Ghz cores, 68GB memory)EC2 HCXL – High CPU extra large (8 * 2.4 Ghz, 7GB memory)EC2 Large – 2 * 2.4 Ghz , 7.5 GB memoryGTM interpolation is memory bound. Hence, less cores per memory (less memory &amp; memory bandwidth contention) has the advantage. DryadLINQ efficiency suffers due to 16 core 16 GB machines.
  3. Efficiency drop in MDS DryadLINQ at the last point is due to unbalanced partition (2600 blocks in to 768 cores)