International Journal of Engineering and Techniques - Volume 1 Issue 6, Nov – Dec 2015
ISSN: 2395-1303 http://www.ijetjournal.org Page 82
A Capable Text Data Mining Using in Artificial Neural Network
Mrs. R. Kalpana, Mrs. P. Padmapriya
(Head, Computer Science Department, Annai Vailankanni Arts and Science College, Thanjavur-7.)
I. INTRODUCTION
Artificial neural networks (ANNs) are processing devices, implemented as algorithms or in
hardware, that are loosely modeled on the neuronal structure of the mammalian brain but on a
much smaller scale. A large ANN might have a large number of processor units, whereas a
mammalian brain has vastly more neurons, with correspondingly greater overall interaction and
emergent behavior. In neural networks that address classification problems, the training set,
the testing set, and the learning rate are key elements: the training set is the collection of
input/output patterns used to train the network, the testing set is the collection used to
assess network performance, and the learning rate sets the rate of weight adjustments.
This paper describes a proposed back-propagation neural network classifier that performs
cross-validation on the original neural network, with the aim of optimizing classification
accuracy and reducing training time. The algorithm is independent of any specific data set, so
many of its ideas and solutions can be transferred to other classifier paradigms. We propose
to apply this artificial neural network to text data mining.
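As a minimal sketch of how cross-validation divides a data set into training and testing sets, the following Python fragment produces k-fold index splits. The function name `k_fold_splits` and the choice of 10 samples and 3 folds are illustrative assumptions; the paper does not state how its folds are built.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical example: 10 patterns split into 3 folds.
splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each pattern appears in exactly one testing fold, so every pattern is used both for training and for assessing performance across the k runs.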
Clustering, or cluster analysis, is an unsupervised data mining technique that tries to
identify the intrinsic groupings in a set of text documents, so that the resulting clusters
exhibit high intra-cluster similarity and low inter-cluster similarity [1]. Text clustering
methods generally attempt to separate the documents into groups, where each group represents
a topic that differs from the topics represented by the other groups.
Most current text clustering methods are based on the Vector Space Model (VSM), a widely used
data representation for text classification and clustering. Methods used for text mining
include decision trees [2], conceptual clustering [3], statistical analysis [4], and
clustering based on data summarization [5].
Usually, in text data mining techniques, the frequency of a term (a word or a phrase) is
computed to determine the importance of the term in the document. However, two terms can have
the same frequency in their documents while one contributes more to the meaning of its
sentences than the other.
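The term-frequency view of the Vector Space Model can be sketched as follows. This is an illustrative assumption of one common formulation (raw term counts compared with cosine similarity), not the paper's implementation; the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents.
doc_a = term_frequencies("nanotubes lead to artificial muscles")
doc_b = term_frequencies("researchers created sheets from nanotubes")
doc_c = term_frequencies("stock markets fell sharply")

sim_ab = cosine_similarity(doc_a, doc_b)  # the documents share one term
sim_ac = cosine_similarity(doc_a, doc_c)  # the documents share no terms
print(sim_ab, sim_ac)
```

In `doc_a` the tokens "to" and "nanotubes" receive the same count even though only one of them carries meaning, which is the limitation of pure frequency weighting noted above.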
Abstract:
Text mining attempts to discover new, previously unknown, or hidden information by automatically
extracting it from various written resources. Applying knowledge discovery methods to unstructured
text is known as Knowledge Discovery in Text, text data mining, or simply text mining. Most text
mining techniques are founded on the statistical analysis of a term, either a word or a phrase.
Different algorithms have been used in previous text mining methods. For example, the Single-Link
algorithm and the Self-Organizing Map (SOM) provide an approach for visualizing high-dimensional data
and a very useful tool for processing textual data based on the projection method. Genetic and
sequential algorithms provide the capability for multiscale representation of datasets and are fast to
compute, requiring less CPU time, based on the Isolet reduced subsets in unsupervised feature
selection. We propose a Vector Space Model and concept-based analysis algorithm to improve text
clustering quality and achieve a better text clustering result. We expect the proposed algorithm to
behave well in terms of robustness and stability with respect to the formation of the neural network.
Keywords — Concept analysis, document clustering, k-Nearest Neighbor (k-NN), data visualization,
Self-Organizing Map (SOM).
RESEARCH ARTICLE OPEN ACCESS
II. Concept-based mining model
The proposed concept-based mining model consists of sentence-based concept analysis,
document-based concept analysis, corpus-based concept analysis, and a concept-based
similarity measure. A raw text document is the input to the proposed model. Each document has
well-defined sentence boundaries. Each sentence in the document is labeled automatically
based on the parser. After running the semantic role labeler, each sentence in the document
may have one or more labeled verb-argument structures. In this model, both the verb and the
argument are considered terms. One term can be an argument to more than one verb in the same
sentence, which means that the term can have more than one semantic role in that sentence. In
such cases, the term plays important semantic roles that contribute to the meaning of the
sentence. In the concept-based mining model, a labeled term, either a word or a phrase, is
considered a concept. The system architecture consists of the following main modules:
• Text preprocessing
• Concept analysis
• Concept-based similarity measure
Fig. 1 shows the architecture of the concept-based model, which consists of sentence-based
concept analysis, document-based concept analysis, and the concept-based similarity measure.
Fig.1 Architecture of Concept Based Model
A. Text Preprocessing
1) Label Terms
A raw text document is the input to the proposed model. Each document has well-defined
sentence boundaries. Each sentence in the document is labeled automatically based on the
parser. After running the semantic role labeler, each sentence in the document may have one
or more labeled verb-argument structures. The labeled verb-argument structures, which are the
output of the role labeling task, are captured and analyzed by the concept-based mining model
at the sentence and document levels. In this model, both the verb and the argument are
considered terms. One term can be an argument to more than one verb in the same sentence,
which means that the term can have more than one semantic role in that sentence. In such
cases, the term plays important semantic roles that contribute to the meaning of the
sentence. In the concept-based mining model, a labeled term, either a word or a phrase, is
considered a concept.
2) Removing stop words
In computing, stop words are words that are filtered out before or after the processing of
natural language data (text). The list of stop words is controlled by human input and is not
generated automatically. There is no single definitive list of stop words used by all tools,
and some tools do not use one at all. Some tools specifically avoid removing stop words in
order to support phrase search.
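Stop-word removal can be sketched in a few lines of Python. The stop list below is a small hand-picked assumption chosen for illustration, consistent with the point that no single definitive list exists; the input is the example sentence this paper analyzes in the Concept Analysis subsection.

```python
# Hypothetical, hand-curated stop list; real tools each ship their own.
STOP_WORDS = {"and", "have", "of", "from", "that", "could", "to", "the"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


sentence = ("Texas and Australia researchers have created industry-ready sheets "
            "of materials made from nanotubes that could lead to the development "
            "of artificial muscles")
filtered = remove_stop_words(sentence)
print(filtered)
```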
3) Stem words
In linguistic morphology, stemming is the process of reducing inflected (or sometimes
derived) words to their stem, base, or root form, generally a written word form. The stem
need not be identical to the morphological root of the word; it is usually sufficient that
related words map to the same stem, even if this stem is not in itself a valid root.
Algorithms for stemming have been studied in computer science since 1968. Many search engines
treat words with the same stem as synonyms, as a kind of query broadening, a process called
conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.
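The idea of conflating related words to one stem can be illustrated with a deliberately naive suffix stripper. This is a toy sketch under stated assumptions (a four-entry suffix list and a minimum stem length of three), not an established stemming algorithm and not the stemmer used by the authors.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["clusters", "clustered", "clustering", "cluster"]}
print(stems)
print(naive_stem("muscles"))
```

All four variants of "cluster" map to the same stem, while "muscles" becomes "muscl", a stem that is not itself a valid root, as the text above allows.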
B. Concept Analysis
Analyzing each concept at the sentence level is called sentence-based concept analysis.
Consider the following sentence:
“Texas and Australia researchers have created industry-ready sheets of materials made from
nanotubes that could lead to the development of artificial muscles”.
[Fig. 1 contents: Text Preprocess (separate sentences, label terms, removing stop words); Concept Analysis (sentence based, document based, corpus based); Concept based similarity]
In this example, stop words are removed and concepts are shown without stemming for better
readability, as follows:
1. Concepts in the first verb-argument structure of the verb created:
• Texas Australia researchers
• created
• industry-ready sheets materials nanotubes lead development artificial muscles
2. Concepts in the second verb-argument structure of the verb made:
• materials
• nanotubes lead development artificial muscles
3. Concepts in the third verb-argument structure of the verb lead:
• nanotubes
• lead
• development artificial muscles
It is important to note that these concepts are extracted from the same sentence. Thus, the
concepts mentioned in this example sentence are:
• Texas
• Australia
• researchers
• created
• industry
• ready
• sheets
• materials
• nanotubes
• lead
• development
• artificial
• muscles
After the concepts are found at the sentence level, they are also found at the document level.
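The step from labeled verb-argument structures to the sentence's concept list can be sketched as below. The `structures` list is a hand-written stand-in for semantic role labeler output, and the stop list and the `extract_concepts` helper are hypothetical; the paper does not specify its data structures. Because this sketch keeps every non-stop-word token, the verb "made" also appears in the result, in addition to the thirteen concepts listed above.

```python
# Hypothetical stop list for this example.
STOP_WORDS = {"and", "of", "to", "the", "that", "could", "have", "from"}

# Hand-written stand-in for the semantic role labeler's output.
structures = [
    {"verb": "created",
     "arguments": ["Texas and Australia researchers",
                   "industry-ready sheets of materials made from nanotubes that "
                   "could lead to the development of artificial muscles"]},
    {"verb": "made",
     "arguments": ["materials",
                   "from nanotubes that could lead to the development of "
                   "artificial muscles"]},
    {"verb": "lead",
     "arguments": ["nanotubes", "to the development of artificial muscles"]},
]


def extract_concepts(structures, stop_words=STOP_WORDS):
    """Collect unique non-stop-word tokens from all verbs and arguments."""
    concepts = []
    for structure in structures:
        for phrase in structure["arguments"] + [structure["verb"]]:
            # Treat hyphenated words such as "industry-ready" as two tokens.
            for token in phrase.replace("-", " ").split():
                if token.lower() not in stop_words and token not in concepts:
                    concepts.append(token)
    return concepts


concepts = extract_concepts(structures)
print(concepts)
```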
III. Performance of Neural Network Systems
One concern in the machine learning community is that a system trained on small samples may
not perform well on test data. On the other hand, if the training data sets are too large,
the concern is how well and how efficiently a system can learn. The objective of this study
[6] is to determine which neural network systems are better suited to applications that have
small or large training data. To study neural learning from small training data, we chose
five data sets: contact-lenses, cpu, weather symbolic, Weather, and labor-nega-data. All five
collections have a fairly balanced distribution among classes, and the number of pattern
classes is not too large. First, we applied our text mining algorithms, including text mining
techniques based on classification of data, to several data collections. We then used the
existing neural network to measure the training time for the five data sets.
Experimental results show that the accuracy was the same for all data sets except
contact-lenses, which is the only one with missing attributes. For contact-lenses, the
accuracy of the proposed neural network was on average about 0.3% lower than that of the
original neural network. The larger the data set, the greater the improvement in speed. Other
informal experiments with larger data sets showed that the proposed neural network can be
more than ten times faster when the data set is larger than cpu or the network has many
unknown elements.
IV. Advantages and Disadvantages of Neural Networks
The calculated output [7] is compared to the known output. If the calculated output is
correct, nothing more is necessary. If it is incorrect, the weights are adjusted so as to
bring the computed output closer to the known output. This process is continued for a large
number of cases, or time series, until the net gives the correct output for a given input.
The entire collection of cases learned is called a “training sample” (Connor, Martin and
Atlas, 1994). In most real-world problems, the neural network is never 100% correct. Neural
networks are programmed to learn up to a given threshold of error. After the neural network
learns up to the error threshold, the weight adaptation mechanism is turned off and the net
is tested on known cases it has not seen before. The application of the neural network to
unseen cases gives the true error rate (Baets, 1994).
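The training procedure described above can be sketched with a single threshold neuron trained by the perceptron rule. This is a simplified illustration, not the back-propagation network of the paper: the two-feature toy data, the learning rate of 0.1, the zero error threshold, and all function names are assumptions made for the example.

```python
def predict(weights, bias, inputs):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def error_rate(weights, bias, samples):
    """Fraction of samples whose computed output differs from the known output."""
    wrong = sum(1 for inputs, target in samples if predict(weights, bias, inputs) != target)
    return wrong / len(samples)


def train(samples, learning_rate=0.1, error_threshold=0.0, max_epochs=100):
    """Adjust weights after each wrong output until the error threshold is met."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in samples:
            delta = target - predict(weights, bias, inputs)
            if delta != 0:  # incorrect output: move it closer to the known output
                errors += 1
                weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
                bias += learning_rate * delta
        if errors / len(samples) <= error_threshold:
            break  # learned up to the error threshold
    return weights, bias, epoch


# Hypothetical, linearly separable toy data: (inputs, known output).
training_samples = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.1, 0.3), 0),
                    ((1.0, 1.0), 1), ((0.9, 0.8), 1), ((0.8, 1.0), 1)]
unseen_samples = [((0.1, 0.1), 0), ((0.3, 0.2), 0), ((0.9, 0.9), 1)]

weights, bias, epochs_used = train(training_samples)
# Weights are now frozen; the unseen cases estimate the true error rate.
print(epochs_used, error_rate(weights, bias, unseen_samples))
```

Evaluating on `unseen_samples` only after training has stopped mirrors the separation between the training sample and the cases used to measure the true error rate.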
Artificial neural networks present a number of
advantages over conventional methods of analysis.
First, artificial neural networks make no assumptions about the nature of the distribution of
the data and are therefore not biased in their analysis. Instead of making assumptions about
the underlying population, neural networks with at least one middle layer use the data to
develop an internal representation of the relationship between the variables (White, 1992).
Second, since time-series data are dynamic in nature, non-linear tools are needed to discern
relationships among time-series data. Neural networks are best at discovering non-linear
relationships (Wasserman, 1989; Hoptroff, 1993; Moshiri, Cameron, and Scuse, 1999; Shtub and
Versano, 1999; Garcia and Gencay, 2000; and Hamm and Brorsen, 2000). Third, neural networks
perform well with missing or incomplete data. Whereas traditional regression analysis is not
adaptive, typically processing all older data together with new data, neural networks adapt
their weights as new input data becomes available (Kuo and Reitch, 1994). Fourth, it is
relatively easy to obtain a forecast in a short period of time compared with an econometric
model. However, there are some problems associated with the use of artificial neural
networks. No estimation or prediction errors are calculated with an artificial neural network
(Caporaletti, Dorsey, Johnson, and Powell, 1994). Also, artificial neural networks are “black
boxes,” because it is impractical to figure out how relations in hidden layers are estimated
(Li, 1994). In addition, a network may become overzealous and try to fit a curve to some data
even when there is no relationship. Another problem is that neural networks have long
training times. Reducing training time is crucial because building a neural network
forecasting system is a process of trial and error. Therefore, the more experiments a
researcher can run in a finite period of time, the more confident the researcher can be of
the result.
V. CONCLUSION
This work bridges the gap between artificial neural network processing and text data mining.
A new concept-based mining model composed of four components, namely sentence-based concept
analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based
similarity measure, is proposed to improve text clustering quality. By exploiting the
semantic structure of the sentences in documents, a better text clustering result is
achieved. By combining the factors affecting the weights of concepts at the sentence,
document, and corpus levels, a concept-based similarity measure capable of accurately
calculating pairwise document similarity is devised. This allows concept matching and
concept-based similarity calculations among documents to be performed in a very robust and
accurate way. The quality of text clustering achieved by this model considerably surpasses
that of traditional single-term-based approaches. There are a number of possibilities for
extending this work. One direction is to link this work to Web document clustering. Another
direction is to apply the same model to text data classification.
REFERENCES
[1] Shady Shehata, Fakhri Karray, and Mohamed S. Kamel, “An Efficient Concept-Based Mining Model for Enhancing Text Clustering,” IEEE Trans. Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360-1371, Oct. 2010.
[2] U.Y. Nahm and R.J. Mooney, “A Mutually Beneficial Integration of Data Mining and Information Extraction,” Proc. 17th Nat’l Conf. Artificial Intelligence (AAAI ’00), pp. 627-632, 2000.
[3] L. Talavera and J. Bejar, “Generality-Based Conceptual Clustering with Probabilistic Concepts,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 196-206, Feb. 2001.
[4] T. Hofmann, “The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data,” Proc. 16th Int’l Joint Conf. Artificial Intelligence (IJCAI ’99), pp. 682-687, 1999.
[5] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, “WEBSOM - Self-Organizing Maps of Document Collections,” Proc. Workshop Self-Organizing Maps (WSOM ’97), 1997.
[6] Guobin Ou and Yi Lu Murphey, “Multi-class pattern classification using neural networks,” Pattern Recognition, vol. 40, 2007.
[7] Yochanan Shachmurove (Department of Economics, The City College of the City University of New York and The University of Pennsylvania) and Dorota Witkowska (Department of Management, Technical University of Lodz), “Utilizing Artificial Neural Network Model to Predict Stock Markets,” CARESS Working Paper #00-11, Sept. 2000.
[8] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering Techniques,” Proc. Knowledge Discovery and Data Mining (KDD) Workshop Text Mining, Aug. 2000.
[9] C. Fillmore, “The Case for Case,” Universals in Linguistic Theory, Holt, Rinehart and Winston, 1968.
[10] S.Y. Lu and K.S. Fu, “A Sentence-to-Sentence Clustering Procedure for Pattern Analysis,” IEEE Trans. Systems, Man, and Cybernetics, vol. 8, no. 5, pp. 381-389, May 1978.
[11] S. Pradhan, W. Ward, K. Hacioglu, J. Martin, and D. Jurafsky, “Shallow Semantic Parsing Using Support Vector Machines,” Proc. Human Language Technology/North Am. Assoc. for Computational Linguistics (HLT/NAACL), 2004.