Study on Image Pattern Selection via Support Vector Machine for Improving Chinese Herb GC×GC Data Classification and Clustering Performance

                                              Wu Zhili
                                   Vincent@comp.hkbu.edu.hk
                            Computer Science Department, HKBU
                                      And other authors ….




Abstract


Two-dimensional gas chromatography (2D-GC, or GC×GC) is a highly powerful technique for the
analysis of complex mixtures. However, although the information-rich 2D-GC intensity image is
easily visualized by experts for manual interpretation, it imposes great complexity and
difficulty on computational approaches that aim to process 2D-GC data precisely and
automatically, compared with the already mature signal processing methods for 1D-GC data.

Complemented by techniques from image pre- and post-processing, this paper proposes a support
vector machine (SVM) method for pattern selection from 2D-GC images. Experiments on Chinese
herb data classification and clustering show the improvement gained by adopting the SVM
feature selection method.

Keywords: Chinese herb, 2D-GC, SVM, feature selection, image analysis, classification,
clustering.
1 Introduction


1.1 The significance and importance of Chinese herb data analysis
   ….

1.2 The superiority of 2D-GC when compared with 1D-GC, and its exact
    suitability for Chinese Herb analysis
    ….

1.3 The difficulties of analyzing 2D-GC data: computational complexity
   and the intractability of pattern recognition
   ….
   As described in the preceding introduction to 2D-GC, the captured data is saved in matrix
   form: each column holds the intensities sampled within one retention-time cycle of the
   second column of the 2D-GC device, and the row length corresponds to the total duration of
   the experiment. It is thus computationally expensive to analyze such large data matrices.

   One way to address the computational complexity is to reduce the matrix dimension, so that
   only significant and distinctive patterns in the image are retained as meaningful features.
   For example, the ANOVA method has been adopted [ Ref…] for feature selection from the 2D-GC
   data matrix. It uses a small subset of samples from each class and retains the matrix
   entries that have large inter-class variances and small intra-class variances.

   This paper presents the linear SVM for feature selection. A linear SVM, as a linear
   classifier for each pair of classes, assigns a weighting to each matrix entry. The
   weightings separate entries from different classes (e.g., by opposite sign or an obvious
   threshold), and their absolute values signify the importance of the corresponding entries.
   The entries with the largest absolute weightings are thereby retained as meaningful
   features. For data from more than two classes, feature selection operates pair by pair,
   and the features selected by the multiple runs are unified into one combined feature set,
   as sketched below.
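
   A minimal sketch of this pairwise scheme (Python with scikit-learn; the top-k cutoff and
   the helper name are illustrative assumptions, not the paper's exact settings):

       import numpy as np
       from itertools import combinations
       from sklearn.svm import LinearSVC

       def pairwise_svm_features(X, y, top_k=100):
           """Union of the largest-|w| features over every pair of classes."""
           selected = set()
           for a, b in combinations(np.unique(y), 2):
               mask = np.isin(y, [a, b])              # samples of this class pair only
               clf = LinearSVC(C=1.0).fit(X[mask], y[mask])
               w = np.abs(clf.coef_.ravel())          # per-entry importance
               selected.update(np.argsort(w)[-top_k:].tolist())
           return sorted(selected)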

   The 2D-GC data is analyzed as an image on which the patterns are regarded as stable and
   unique characteristics of a certain herb species or chemical component. However, pattern
   properties such as areas, intensities, and positions always show some variation, so the
   comparison algorithms must be variation-tolerant while not degrading sensitivity when
   analyzing patterns from two different species. Some image processing techniques are
   therefore adopted for more accurate pattern extraction and matching.



1.4 Why machine learning approaches such as SVM and the algorithm
    family of classification and clustering can help Chinese herb 2D-GC
    image data analysis
….

    Machine learning is the study of computer algorithms that improve automatically through
    experience [1]. As remarked in [2], machine learning will be




2. The Feature Selection Methodology: Linear Support Vector Machine

The formalism of SVM is introduced here…..
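
Until this section is filled in, the standard soft-margin linear SVM formulation (textbook
form, stated here for reference rather than taken from this draft) is:

    \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \;\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2}
        + C \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i,
    \quad \xi_i \ge 0, \quad i = 1, \dots, N,

where the solution vector w provides the per-entry weightings that Part 4 later thresholds
for feature selection.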




3. SVM applied to pattern extraction from Chinese Herb 2D-GC Image

Part 1. Input data:

Please give the full name of the element A and element B

A             B              Sample ID
0%            100%           1 2 3 4 5
30%           70%            6 7 8 9 10
50%           50%            11 12 13 14 15

Table 1. Composition of the 15 samples (percentages of elements A and B).




Data format:
     A 400 x 510 matrix is formed for each sample observation. During each 4-second time
segment, 400 readings are sampled from column 2 of the GC device at a rate of 1 FID reading
per 0.01 s. A complete experimental run lasts 34 minutes [(510 x 4 s)/60 = 34], and the FID
intensities range from 21 to nearly 4500.
     In the following analysis, however, we discard the readings obtained in the first 8
minutes, while the compounds are passing through the GC device, because of the severe noise
produced when booting the machine. We are therefore handling 15 data matrices of size
400 x 390 in total.
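
A minimal sketch of this trimming step (NumPy; the file names and loading convention are
hypothetical, assuming each run is stored as a 400 x 510 array):

    import numpy as np

    SEG_LEN_S = 4.0                              # seconds per second-column cycle
    DISCARD_MIN = 8                              # noisy start-up minutes to drop
    N_DROP = int(DISCARD_MIN * 60 / SEG_LEN_S)   # = 120 columns

    def trim_run(raw):
        """(400, 510) FID matrix -> (400, 390) matrix without start-up noise."""
        assert raw.shape == (400, 510)
        return raw[:, N_DROP:]

    # e.g. samples = [trim_run(np.load(f"run_{i}.npy")) for i in range(1, 16)]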


Part 2. 2D GC has larger information capacity than 1D GC:

It is claimed that the 1D GC trace can be obtained by accumulating the FID signal strengths
in each column of the 2D GC data matrix (Fig 1). If that is true, it follows directly that
the 2D GC has a larger information capacity than the corresponding 1D GC (a 400x390 matrix
vs. a 1x390 vector) (Fig 2).
It can be argued that such a reconstruction does not compare the 1D GC with the 2D GC under
the same conditions (e.g., the 2D data is sampled at frequency 100 Hz = 1/(0.01 s), while the
1D data is sampled at 0.25 Hz = 1/(4 s)); however, our other experiments show that this
reconstruction is credible.

A set of 1D experiments was also conducted at the high sampling frequency (f = 100 Hz), again
lasting 34 minutes. The FID readings were then sequentially folded into segments of length
400 and thereby transferred into data matrices of size 400 x 510. Shown as images (Fig 3),
they are nearly identical to those obtained by simply reconstructing the 1D GC from the 2D GC
and then replicating and tiling the mean readings into a matrix.
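
Both operations are short array manipulations; a sketch under the shape assumptions above:

    import numpy as np

    def reconstruct_1d(gc2d):
        """Sum FID strengths down each column: (400, 390) -> (390,) 1D GC trace."""
        return gc2d.sum(axis=0)

    def fold_1d(trace_100hz, seg_len=400):
        """Fold a high-frequency 1D trace, segment by segment, into matrix columns."""
        n_cols = trace_100hz.size // seg_len
        return trace_100hz[: n_cols * seg_len].reshape(n_cols, seg_len).T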
Fig. 1 (a) Image of a sample GC x GC chromatogram. (b) Reconstructed first-column
chromatogram of the same GC x GC chromatogram.




Fig 2. Extending the 1D GC data into the same matrix form, where each column is filled with
the flat mean reading obtained from column 2 of the GC device. The 2D GC x GC signals are
clearly more distinct owing to the varying strengths along each column.
Fig 3. 1D GC data obtained by increasing the sampling frequency to f = 100 Hz.


Part 3. Further Data preprocessing
     3.1 Gaussian filtering with window size 3 x 7.
           It is generally believed that the same characteristic patterns can be observed in
the graphs of two 2D GC experiments on the same compounds. Assume a significant pattern is
observed centering at (x,y) in the graph, where x is the row-wise pixel position and y is the
column-wise pixel position (which can be regarded as time when referring to the 2D GC
experiment). From experimental knowledge, such a pattern should not appear as a single pulse
at an isolated position but should be observed over ∆x and ∆y intervals (in the image
representation, a rectangular box of width ∆x and height ∆y). Thus, when comparing two 2D GC
graphs, we cannot simply take the difference of FID signal strengths between each pair of
pixels at the same (time / graph) position; we should instead consider pattern differences
over nearly the same region of the graphs. A simple way to incorporate the effect of the
neighboring pixels on a center pixel is to apply a local filter with a small window. Among
the many filters available in the field of image analysis, the Gaussian smoothing filter is
widely used. In our GC graph analysis we select a Gaussian filter with window size 3 x 7,
which suits the fact that the column-wise correlation of 2D GC data should be pinpointed
more accurately.
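
A sketch of such a filter, building an explicit 3 x 7 Gaussian kernel and convolving with
SciPy (the sigma values are illustrative assumptions; the paper does not state them):

    import numpy as np
    from scipy.signal import convolve2d

    def gaussian_kernel(rows=3, cols=7, sigma_r=0.8, sigma_c=1.5):
        """2-D Gaussian kernel of the given window size, normalized to sum to 1."""
        r = np.arange(rows) - rows // 2
        c = np.arange(cols) - cols // 2
        g = np.exp(-r[:, None] ** 2 / (2 * sigma_r ** 2)
                   - c[None, :] ** 2 / (2 * sigma_c ** 2))
        return g / g.sum()

    def smooth(gc2d):
        """Apply the 3 x 7 Gaussian smoothing filter to one GC x GC image."""
        return convolve2d(gc2d, gaussian_kernel(), mode="same", boundary="symm")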


Part 4. Feature Selection Using Linear Support Vector Classification Machine

4.1 Feature Selection for 2D GC Data by Using Linear SVM

     It is necessary to reduce each 2D GC data matrix from its huge size to a more economical
one by discarding insignificant values and keeping only the important features. This not only
greatly reduces the computational burden placed on the subsequently used classification or
clustering algorithms; more importantly, it is also essential for sketching out the featured
patterns in the 2D GC data for chemists' inspection or further chemical analysis.

     A machine learning approach to feature selection has recently been proposed that
utilizes the state-of-the-art Support Vector Machine. Following the general setting, we
reshape each 2D GC matrix into a one-dimensional vector by sequentially tiling its columns
and stack the vectors of all samples together, forming an N x d data matrix, where N = 15
and d = 400 x 390 = 15600.
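
In NumPy this reshaping is a column-major flatten plus a stack (a sketch, reusing the trimmed
matrices from Part 1):

    import numpy as np

    def build_data_matrix(samples):
        """Stack column-tiled sample vectors into an N x d matrix (here 15 x 15600)."""
        return np.stack([m.flatten(order="F") for m in samples])  # "F" = column by column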

     Since the support vector machine is a classification method, some training samples are
used to guide the feature selection procedure. For instance, four samples (the 1st, 2nd,
11th, and 12th) are taken as the training samples, where the 1st and 2nd samples belong to
one class (purely composed of A) and the remaining two samples (the 11th and 12th) are
grouped into the opposite class (contaminated by some B). Denote the four data vectors as
xi (i = 1, 2, 11, 12).

     The linear Support Vector Machine constructs a separating function f(x) = wx + b such
that wx1 + b > 0, wx2 + b > 0, wx11 + b < 0, and wx12 + b < 0, subject to some constraints on
w and b. After a systematic solving procedure, we obtain an explicit solution for w. The
vector w, which has the same dimension as x, expresses the importance of each dimension of
xi through the corresponding entry of w.
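
A sketch of this training step with scikit-learn (the C, tolerance, and cache-size parameters
follow Table 2; X is the 15 x 15600 matrix from the previous sketch, and the 0-based index
bookkeeping is our own):

    import numpy as np
    from sklearn.svm import SVC

    y = np.array([1] * 5 + [2] * 10)        # class 1: samples 1-5, class 2: samples 6-15
    train_idx = [0, 1, 10, 11]              # samples 1, 2, 11, 12 (0-based)

    clf = SVC(kernel="linear", C=1.0, tol=0.001, cache_size=100)
    clf.fit(X[train_idx], y[train_idx])
    print(clf.predict(X))                   # Table 2 reports zero training/testing error

    w = clf.coef_.ravel()                   # one weight per FID position (15600 in total)
    ranking = np.argsort(np.abs(w))[::-1]   # most important entries first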

      A       B              Sample                              Classification Result
0%            100%           1 2 3 4 5                           1 1 1 1 1
30%           70%            6 7 8 9 10                          2 2 2 2 2
50%           50%            11 12 13 14 15                      2 2 2 2 2

Table 2. Parameters: C = 1, tolerance = 0.001, cache size = 100 MB. The training error is 0,
and the testing error is also 0.

Constructing the linear SVM mainly aims at feature selection, although its classification
performance is already encouraging, as shown in Table 2. After training the linear SVM, the
obtained w is illustrated in the following graph:
Fig 4. Each FID signal position (15600 in total) is associated with a weight value.




Fig 5. Reducing the number of features does not hurt the classification accuracy much.
Fig 6. Fractional area of features as a function of the threshold value.
4.2 Classification accuracy comparison between 2D and 1D data

     A comparison was made to validate that 2D GC data has a larger information capacity than
the reconstructed 1D GC: we use the linear SVM to separate more classes of experimental data
and compare the classification accuracies.

     We then produced 5 classes of compounds with varying percentages of element B; in
particular, the percentages of B are 0%, 10%, 20%, 30%, and 40%. Each specific blend was fed
into the GC device 5 times to obtain a set of 5 replicated 2D measurements, from which the
same number of reconstructed 1D data vectors were derived.

     This is in fact a multi-class classification task. We report the separation rate for
each pair of classes using a training sample rate of 0.4 (2 training samples per class):

                     B:A=0:100     B:A=10:90      B:A=20:80     B:A=30:70      B:A=40:60
       B:A=0:100          -           0.77            1             1              1
       B:A=10:90        0.78            -            0.83          0.96           0.96
       B:A=20:80        1.00          0.92             -           0.75           0.99
       B:A=30:70        1.00          1.00           0.86            -            0.81
       B:A=40:60        1.00          1.00           0.93          0.93             -
     Table 3. Overall (training plus testing) classification accuracies for 1D and 2D GC
data using the linear SVM with the parameter settings C = 1, tolerance = 0.001, cache size =
100 MB. Each cell shows the accuracy of separating the sample type named in the column title
from the type named in the row title. The upper triangular part shows the results for 1D GC
data, and the lower triangular part those for 2D data. The better accuracy of each diagonally
symmetric pair of values is highlighted. The results are averaged over 10 repeated
experiments with different training samples.

      From Table 3 we can see that most results for classifying 2D data are better than those
for classifying 1D data. The only exception occurs when separating 40%-B from 30%-B, but the
accuracy difference there is too small to disprove the superiority of 2D data, taking into
account the device noise and the limited number of samples used.


Part 5: Further optimizing the 2D GC features by Image processing Methods

     The features selected by the linear Support Vector Machine can effectively distinguish
samples without any B mixed in from those contaminated by B. For example, keeping only 1
percent of the features still classifies the two classes of data perfectly (Fig 5 & 6).
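
A sketch of that 1-percent selection (continuing from the w and X of the earlier sketches;
the retraining one would run on the reduced data is omitted):

    import numpy as np

    def top_fraction(w, frac=0.01):
        """Indices of the largest-|w| entries covering the given fraction of features."""
        k = max(1, int(frac * w.size))
        return np.argsort(np.abs(w))[::-1][:k]

    keep = top_fraction(w, 0.01)    # ~156 of the 15600 FID positions
    X_reduced = X[:, keep]          # feature-reduced data for later classification/clustering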

     Although the classification results shown above are insensitive to the number of
features, choosing what percentage of features to keep is still critical from the viewpoint
of chemistry domain experts, because it may be too dangerous to represent a sample that is
originally high-dimensional with an extremely small feature set. We may have to determine an
optimal threshold for the w obtained from the linear SVM such that the number of features is
not formidably large, yet the feature representation of a sample is not vulnerably
oversimplified.

     If we had plenty of training samples, classical methods such as cross-validation could
be used to guide the choice of the optimal threshold on w. But in the chemometrics field,
and for our 2D GC experiments in particular, obtaining more samples is very time-consuming
and labor-demanding.

     The vector w is itself long and corresponds one-to-one with the flattened data vector of
each 2D image. Recalling that many unsupervised feature selection methods also achieve good
results by considering only the pixel intensities and pattern spatiality of a single 2D
image, we can apply the same methodology to selecting from w if we transform the w vector
back into a 2-D image of the same size as each 2D GC image.

     To extract more reasonable features from the images without too much supervision, we can
employ image processing techniques for contour/boundary detection. We adopt a set of
threshold values (twenty levels in total) to locate the contours in the image formed by w.
The threshold values are selected automatically for each sample image in an unsupervised
manner. Below is the set of important areas found for the image of w.
Figure 7.
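
A sketch of the contour step (reshaping w back to image form and using Matplotlib's
multi-level contouring; the evenly spaced levels are an assumption about how the twenty
thresholds were chosen):

    import numpy as np
    import matplotlib.pyplot as plt

    W = np.abs(w).reshape((400, 390), order="F")   # undo the column-major flatten
    levels = np.linspace(W.min(), W.max(), 20)     # twenty threshold levels
    plt.contour(W, levels=levels)
    plt.title("Contours of |w| over the GC x GC image plane")
    plt.show()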




Part 6: Clustering upon the whole dataset using new feature vectors
    6.1 PCA analyses to show the improvement due to feature selection

      To verify the effectiveness of the Support Vector Machine feature selection approach,
we report the clustering results on the feature-reduced 2D GC data and compare them with the
results on the original raw 2D GC data.
      The clustering algorithm combines PCA with the K-means method. The input data are first
represented by a subset of principal components through PCA and are then clustered into
several groups by the K-means algorithm. For high-dimensional input vectors, such as the raw
2D GC vectors of length 15600, PCA with the K-L transform trick (identical to kernel PCA with
a linear kernel) is used to avoid operating directly on the covariance matrix, whose size is
proportional to the square of the input dimension. Even for the lower-dimensional data, the
linear kernel PCA is performed alongside the conventional PCA, and the better result is
reported under the single (PCA + K-means) category in the following table.
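
A sketch of this pipeline with scikit-learn (the component and cluster counts are
illustrative assumptions; X_reduced is from the earlier sketch):

    from sklearn.decomposition import KernelPCA
    from sklearn.cluster import KMeans

    def pca_kmeans(X, n_components=3, n_clusters=3):
        """Linear-kernel PCA (the K-L transform trick) followed by K-means."""
        Z = KernelPCA(n_components=n_components, kernel="linear").fit_transform(X)
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)

    labels = pca_kmeans(X_reduced)   # cluster the feature-reduced samples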

Training            Raw 2D GC Data                 Feature Selection (a)   Feature Selection (b)
samples    Linear SVM   K-means   PCA+K-means          PCA + K-means           PCA + K-means
2            0.9356      0.8956      0.8874                0.9244                  0.8533
4            0.9711      0.8956      0.8874                0.9156                  0.9289
6            0.9911      0.8956      0.8874                0.9022                  0.9244


          Table 4. Validating the feature selection methods for 2D GC data. The K-means and
PCA+K-means results on the raw data are unsupervised and therefore identical for every
training size; the original layout listed each of those values once.

      As shown in Table 4, the linear SVM, being a supervised classification algorithm,
always achieves the best results under each training sample rate, even though it operates on
the raw data, while the K-means clustering, as an unsupervised method, with or without PCA,
does not achieve good results on the raw 2D data. After feature selection, the clustering
algorithms obtain higher accuracies with smaller computational complexity.
      One can also observe that feature selection (b) correlates positively with the training
sample rate used at the linear SVM stage. Although scheme (b) performs worse when the
training rate is small, it reaches a higher precision than scheme (a) as the training rate
increases. This is also verifiable from Figures 8 and 9 below: the summed variance percentage
of the first 3 principal components in Figure 9 reaches 80%, which is much higher than the
total variance percentage of the first three principal components in Figure 8.




               Figure 8. PCA and K-means on the data using feature selection scheme (a).
Figure 9. PCA and K-means on the data using feature selection scheme (b).




4. Conclusion



5. References
[1] T. Mitchell, Machine Learning, McGraw-Hill, 1997.
