Text Mining Infrastructure in R

This presentation includes basic text processing techniques using R packages

  1. Text Mining Infrastructure in R. Presented by Ashraf Uddin (http://ashrafsau.blogspot.in/), South Asian University, New Delhi, India. 29 January 2014
  2. What is R?
      - A free software environment for statistical computing and graphics
      - Open source and package based; modeled on the S language, which was developed at Bell Labs
      - Many statistical functions are already built in
      - Contributed packages expand the functionality to cutting-edge research
      - Implementation languages: C, Fortran
  3. What is R?
      - R is the result of a collaborative effort with contributions from all over the world
      - R was initially written by Robert Gentleman and Ross Ihaka (also known as "R & R") of the Statistics Department of the University of Auckland
      - R was inspired by the S environment
      - R can be extended (easily) via packages
  4. What R does and does not do
      - Is not a database, but connects to DBMSs
      - The language interpreter can be very slow, but allows calling your own C/C++ code
      - No professional / commercial support
  5. Data Types in R
      - numeric (integer, double, complex)
      - character
      - logical
      - data frame
      - factor
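The types listed on this slide can be inspected directly at the R prompt. The following is a minimal sketch in base R; all variable names and values are made up for illustration.

```r
# Atomic types
n_int  <- 42L            # integer
n_dbl  <- 3.14           # double
n_cplx <- 1 + 2i         # complex
txt    <- "text mining"  # character
flag   <- TRUE           # logical

# A factor stores categorical values as integer codes plus a set of labels
topic <- factor(c("acq", "crude", "acq"))

# A data frame holds columns of different types, one row per observation
docs <- data.frame(
  id    = 1:3,
  topic = topic,
  words = c(120.5, 98, 143),
  stringsAsFactors = FALSE
)

# Inspect the storage type of each atomic value and the structure of the rest
print(sapply(list(n_int, n_dbl, n_cplx, txt, flag), typeof))
print(levels(topic))
str(docs)
```

Factors matter later in the deck: the class labels passed to a classifier are typically stored as a factor.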
  6. Contributed Packages
      - At the time of this presentation (January 2014), the CRAN package repository featured 5034 available packages
  7. Growing number of R users
  8. Text Mining: Basics
      - Text is an unstructured collection of words
      - Documents are the basic units, each consisting of a sequence of tokens or terms
      - Terms are words or roots of words, semantic units or phrases, which are the atoms of indexing
      - Repositories (databases) and corpora are collections of documents
      - A corpus is a conceptual entity, similar to a database, for holding and managing text documents
      - Text mining involves computations to gain interesting information
  9. Text Mining: Practical Applications
      - Spam filtering
      - Business intelligence, marketing applications: predictive analytics
      - Sentiment analysis
      - Text IR, indexing
      - Creating suggestions and recommendations (as on Amazon)
      - Monitoring public opinion (for example in blogs or on review sites)
      - Customer service, email support
      - Automatic labeling of documents in business libraries
      - Fraud detection by investigating notifications of claims
      - Fighting cyberbullying or cybercrime in IM and IRC chat
      - And many more
  10. A List of Text Mining Tools
  11. Text Mining Packages in R
      corpora, gsubfn, kernlab, KoNLP, koRpus, lsa, lda, maxent, movMF, openNLP, qdap, RcmdrPlugin.temis, RKEA, RWeka, skmeans, RTextTools, Snowball, SnowballC, tau, tm.plugin.dc, tm.plugin.factiva, tm.plugin.mail, topicmodels, wordcloud, wordnet, zipfR, textir, tm, textcat
  12. Text Mining Packages in R
      - plyr: tools for splitting, applying and combining data
      - class: various functions for classification
      - tm: a framework for text mining applications
      - corpora: statistics and data sets for corpus frequency data
      - Snowball: stemmers
      - RWeka: interface to Weka, a collection of ML algorithms for data mining tasks
      - wordnet: interface to WordNet using the Jawbone Java API to WordNet
      - wordcloud: draws word clouds
      - textir: a suite of tools for text and sentiment mining
      - tau: text analysis utilities
      - topicmodels: an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM)
      - zipfR: statistical models for word frequency distributions
  13. Conceptual process in Text Mining
      - Organize and structure the texts (into a repository)
      - Find a convenient representation (preprocessing)
      - Transform the texts into structured formats (e.g. a term-document matrix, TDM)
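To make the last step of this process concrete, here is a minimal base-R sketch of turning raw texts into a term-document matrix. It does not use the tm package; the three toy documents and all variable names are assumptions made for illustration only.

```r
# Step 1: a tiny "repository" of documents (made-up examples)
docs <- c(d1 = "Text mining in R",
          d2 = "Mining text with the tm package",
          d3 = "R packages for text")

# Step 2: preprocessing: lower-case, then split on anything that is not a letter
tokens <- lapply(tolower(docs), function(d) {
  words <- strsplit(d, "[^a-z]+")[[1]]
  words[nzchar(words)]
})
names(tokens) <- names(docs)

# Step 3: structured format: rows are terms, columns are documents,
# and each cell holds the frequency of the term in the document
vocab <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(w) as.integer(table(factor(w, levels = vocab))),
              integer(length(vocab)))
rownames(tdm) <- vocab

print(tdm)
```

Each column is now a fixed-length numeric vector, which is the representation that later steps such as classification rely on.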
  14. The framework
      - Texts come in different file formats and in different locations: standardized interfaces are needed to access the documents (sources)
      - Metadata gives valuable insights into the document structure: the framework must make metadata easy to use in order to work with the documents efficiently
      - It must provide tools and algorithms to perform common tasks (transformations)
      - It must allow extracting patterns of interest (filtering)
  15. Text document collections: Corpus

      Constructor:

          Corpus(object = ..., readerControl = list(reader = object@DefaultReader, language = "en_US", load = FALSE))

      Example:

          > txt <- system.file("texts", "txt", package = "tm")
          > (ovid <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "la", load = TRUE)))
          A corpus with 5 text documents

  16. Corpus: Meta Data

          > meta(ovid[[1]])
          Available meta data pairs are:
            Author       :
            DateTimeStamp: 2013-11-19 18:54:04
            Description  :
            Heading      :
            ID           : ovid_1.txt
            Language     : la
            Origin       :
          > ID(ovid[[1]])
          [1] "ovid_1.txt"

  17. Corpus: Document's text

          > ovid[[1]]
          Si quis in hoc artem populo non novit amandi,
          hoc legat et lecto carmine doctus amet.
          arte citae veloque rates remoque moventur,
          arte leves currus: arte regendus amor.
          curribus Automedon lentisque erat aptus habenis,
          Tiphys in Haemonia puppe magister erat:
          me Venus artificem tenero praefecit Amori;
          Tiphys et Automedon dicar Amoris ego.
          ille quidem ferus est et qui mihi saepe repugnet:
          sed puer est, aetas mollis et apta regi.
          Phillyrides puerum cithara perfecit Achillem,
          atque animos placida contudit arte feros.
          qui totiens socios, totiens exterruit hostes,
          creditur annosum pertimuisse senem.

  18. Corpus: Meta Data

          > c(ovid[1:2], ovid[3:4])
          A corpus with 4 text documents
          > length(ovid)
          [1] 5
          > summary(ovid)
          A corpus with 5 text documents

          The metadata consists of 2 tag-value pairs and a data frame
          Available tags are:
            create_date creator
          Available variables in the data frame are:
            MetaID

  19. Corpus: Meta Data

          > CMetaData(ovid)
          $create_date
          [1] "2013-11-19 18:54:04 GMT"

          $creator
          [1] ""

          > DMetaData(ovid)
            MetaID
          1      0
          2      0
          3      0
          4      0
          5      0

  20. Corpus: Transformations and Filters

          > getTransformations()
          [1] "as.PlainTextDocument" "removeNumbers" "removePunctuation" "removeWords"
          [5] "stemDocument" "stripWhitespace"
          > tm_map(ovid, FUN = tolower)
          A corpus with 5 text documents
          > getFilters()
          [1] "searchFullText" "sFilter" "tm_intersect"
          > tm_filter(ovid, FUN = searchFullText, "Venus", doclevel = TRUE)
          A corpus with 1 text document

  21. Text Preprocessing: import

          > txt <- system.file("texts", "acq", package = "tm")
          > (acq <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE)))
          A corpus with 50 text documents
          > txt <- system.file("texts", "crude", package = "tm")
          > (crude <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE)))
          A corpus with 20 text documents

      This results in 50 articles of topic acq and 20 articles of topic crude.

  22. Preprocessing: stemming
      - Morphological variants of a word (morphemes). Similar terms derived from a common stem:
          - engineer, engineered, engineering
          - use, user, users, used, using
      - Stemming in information retrieval: grouping words with a common stem together
      - For example, a search on reads also finds read, reading, and readable
      - Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.
  23. Preprocessing: stemming
      - Reduce terms to their "roots"
      - automate(s), automatic, automation are all reduced to automat
      - Before stemming: "for example compressed and compression are both accepted as equivalent to compress."
      - After stemming: "for exampl compress and compress ar both accept as equival to compress"
  24. Preprocessing: stemming
      Typical rules in stemming:
      - sses → ss
      - ies → i
      - ational → ate
      - tional → tion
      Rules sensitive to the weight (measure m) of the word, e.g. (m>1) EMENT → (removed):
      - replacement → replac
      - cement → cement
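The rules on this slide can be sketched in a few lines of base R. This is a simplified illustration and not the full Porter algorithm: the helper names `apply_suffix_rules`, `measure` and `strip_ement` are hypothetical, the four rewrite rules are applied without their usual measure conditions, and the letter y is always treated as a consonant.

```r
# The four rewrite rules from the slide, longest suffix first
suffix_rules <- c(ational = "ate", tional = "tion", sses = "ss", ies = "i")

# Apply the first rule whose suffix matches; leave other words untouched
apply_suffix_rules <- function(word, rules = suffix_rules) {
  for (suffix in names(rules)) {
    pattern <- paste0(suffix, "$")
    if (grepl(pattern, word)) {
      return(sub(pattern, rules[[suffix]], word))
    }
  }
  word
}

# Measure m of a stem: the number of vowel-run/consonant-run pairs (VC)
# Simplification: only a, e, i, o, u count as vowels
measure <- function(stem) {
  cv <- gsub("[aeiou]+", "V", stem)
  cv <- gsub("[^V]+", "C", cv)
  lengths(regmatches(cv, gregexpr("VC", cv)))
}

# (m > 1) EMENT -> removed: strip the suffix only if the remaining stem is long enough
strip_ement <- function(word) {
  if (grepl("ement$", word)) {
    stem <- sub("ement$", "", word)
    if (measure(stem) > 1) return(stem)
  }
  word
}

words <- c("caresses", "ponies", "relational", "conditional", "cat")
print(vapply(words, apply_suffix_rules, character(1), USE.NAMES = FALSE))
print(vapply(c("replacement", "cement"), strip_ement, character(1)))
```

The measure condition is what keeps "cement" intact: removing "ement" would leave the stem "c", which has m = 0, whereas "replac" has m = 2.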
  25. Preprocessing: stemming
      - Stemming helps recall for some queries but harms precision on others
      - Fine distinctions may be lost through stemming
  26. Preprocessing: stemming

          > acq[[10]]
          Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter
          > stemDocument(acq[[10]])
          Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin and terminal oper for 12.2 mln dlrs. The compani said the sale is subject to certain post clos adjustments, which it did not explain. Reuter
          > tm_map(acq, stemDocument)
          A corpus with 50 text documents

  27. Preprocessing: Whitespace elimination & lower case conversion

          > stripWhitespace(acq[[10]])
          Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter
          > tolower(acq[[10]])
          gulf applied technologies inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. the company said the sale is subject to certain post closing adjustments, which it did not explain. reuter

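The same two normalizations can be reproduced on a plain character string with base R alone. This is only a sketch of the idea, not the tm implementation; the input string is a made-up fragment with irregular spacing.

```r
# A made-up fragment with a double space, a tab and a line break
raw <- "Gulf  Applied\tTechnologies Inc \n said it sold its subsidiaries"

# Whitespace elimination: collapse every run of whitespace into one space
collapsed <- gsub("[[:space:]]+", " ", raw)

# Lower case conversion
lowered <- tolower(collapsed)

print(collapsed)
print(lowered)
```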
  28. Preprocessing: Stopword removal
      - Very common words, such as of, and, the, are rarely of use in information retrieval
      - A long stop list saves space in indexes, speeds processing, and eliminates many false hits
      - However, common words are sometimes significant in information retrieval, which is an argument for a short stop list (consider the query "To be or not to be?")
  29. Preprocessing: Stopword removal
      - Include the most common words in the English language (perhaps 50 to 250 words)
      - Do not include words that might be important for retrieval (among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world)
      - In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents)
  30. Preprocessing: Stopword removal
      about at did either etc above according across actually afterwards again against along already also although amongst an another any anything anywhere are be became because been before beforehand begin being below beside besides billion both but by cannot caption co could didn't do does doesn't during each eg eight else elsewhere end even ever every adj after all almost alone always among anyhow anyone aren't around become becomes becoming beginning behind between beyond can can't couldn't don't down eighty ending enough everyone everything
  31. Preprocessing: Stopword removal
      - How many words should be in the stop list?
          - A long list lowers recall
      - Which words should be in the list?
          - Some common words may have retrieval importance: war, home, life, water, world
          - In certain domains, some words are very common: computer, program, source, machine, language
  32. Preprocessing: Stopword removal

          > mystopwords <- c("and", "for", "in", "is", "it", "not", "the", "to")
          > removeWords(acq[[10]], mystopwords)
          Gulf Applied Technologies Inc said sold its subsidiaries engaged pipeline terminal operations 12.2 mln dlrs. The company said sale subject certain post closing adjustments, which did explain. Reuter
          > tm_map(acq, removeWords, mystopwords)
          A corpus with 50 text documents

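A stop list like the one above can be applied with a single regular expression. The sketch below is a base-R approximation of the idea, not the tm source code; `remove_words` is a hypothetical helper name, and the sentence is taken from the document shown on the slide.

```r
# Remove whole-word matches of any stopword (case-sensitive)
remove_words <- function(text, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", text, perl = TRUE)
}

mystopwords <- c("and", "for", "in", "is", "it", "not", "the", "to")
sentence <- "The company said the sale is subject to certain post closing adjustments, which it did not explain."

cleaned <- remove_words(sentence, mystopwords)
# Removal leaves double spaces behind, so collapse them afterwards
cleaned <- gsub(" +", " ", cleaned)

print(cleaned)
```

The word boundaries keep "its" intact when "it" is removed, and the capitalized "The" survives because matching is case-sensitive. That is one reason to convert the text to lower case before removing stopwords.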
  33. Preprocessing: Synonyms

          > library("wordnet")
          > synonyms("company")
          [1] "caller" "companionship" "company" "fellowship"
          [5] "party" "ship's company" "society" "troupe"
          > replaceWords(acq[[10]], synonyms(dict, "company"), by = "company")
          > tm_map(acq, replaceWords, synonyms(dict, "company"), by = "company")

  34. Preprocessing: Part of speech tagging

          library("NLP")
          library("openNLP")
          s <- as.String(acq[[10]])
          ## Need sentence and word token annotations.
          sent_token_annotator <- Maxent_Sent_Token_Annotator()
          word_token_annotator <- Maxent_Word_Token_Annotator()
          a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
          ## POS tagging builds on the sentence and word annotations in a2.
          pos_tag_annotator <- Maxent_POS_Tag_Annotator()
          a3 <- annotate(s, pos_tag_annotator, a2)
          ## Keep only the word annotations and pair each word with its tag.
          a3w <- subset(a3, type == "word")
          tags <- sapply(a3w$features, "[[", "POS")
          sprintf("%s/%s", s[a3w], tags)

  35. Preprocessing: Part of speech tagging

          "Gulf/NNP" "Applied/NNP" "Technologies/NNP" "Inc/NNP" "said/VBD" "it/PRP" "sold/VBD" "its/PRP$"
          "subsidiaries/NNS" "engaged/VBN" "in/IN" "pipeline/NN" "and/CC" "terminal/NN" "operations/NNS" "for/IN"
          "12.2/CD" "mln/NN" "dlrs/NNS" "./." "The/DT" "company/NN" "said/VBD" "the/DT"
          "sale/NN" "is/VBZ" "subject/JJ" "to/TO" "certain/JJ" "post/NN" "closing/NN" "adjustments/NNS"
          ",/," "which/WDT" "it/PRP" "did/VBD" "not/RB" "explain/VB" "./." "Reuter/NNP"

  36. Preprocessing R Demo
  37. Classification using KNN
      K-Nearest Neighbor algorithm:
      - Most basic instance-based method
      - Data are represented in a vector space
      - Supervised learning: the class labels come from V, the finite set {v1, ..., vn}
      - For a query instance xq, k-NN returns the most common value among the k training examples nearest to xq
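The decision rule on this slide fits in a few lines of base R. The sketch below is an illustration under stated assumptions, not the implementation used later in the deck (which calls `knn()` from the class package): `knn_predict` is a hypothetical helper, distance is Euclidean, and the six training points are made up.

```r
# Classify one query point by majority vote among its k nearest training rows
knn_predict <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train, 2, query)^2))
  # Indices of the k closest training examples
  nearest <- order(dists)[seq_len(k)]
  # Most common label among them (ties go to the first label in table order)
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up training data: two well-separated clusters in a 2-D feature space
train <- rbind(c(1, 1), c(1, 2), c(2, 1),
               c(6, 6), c(6, 7), c(7, 6))
labels <- c("red", "red", "red", "blue", "blue", "blue")

print(knn_predict(train, labels, query = c(2, 2), k = 3))
print(knn_predict(train, labels, query = c(6, 5), k = 3))
```

There is no training step beyond storing the examples; all the work happens at query time, which is why k-NN is called an instance-based method.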
  38. KNN Feature space
  39. KNN Training algorithm
  40. Classification using KNN: Example
      - Two classes: Red and Blue
      - Green is unknown
      - With k=3, classification is Red
      - With k=4, classification is Blue
  41. How to determine a good value for k?
      - Determined experimentally
      - Start with k=1 and use a test set to validate the error rate of the classifier
      - Repeat with k=k+2
      - Choose the value of k for which the error rate is minimum
      - Note: with two classes, k should be an odd number to avoid ties
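The search procedure above can be sketched with `knn()` from the class package, which the deck uses later. The data set here is R's built-in `iris`, chosen only so that the example is self-contained; the 70/30 split, the seed and the upper limit of 15 for k are assumptions.

```r
library(class)

set.seed(42)  # make the random split reproducible
features <- as.matrix(iris[, 1:4])
labels <- iris$Species

# Hold out 30% of the rows as a test set
train_idx <- sample(nrow(features), ceiling(nrow(features) * 0.7))
test_idx <- setdiff(seq_len(nrow(features)), train_idx)

# Start with k = 1 and repeat with k = k + 2
candidate_k <- seq(1, 15, by = 2)
error_rate <- vapply(candidate_k, function(k) {
  pred <- knn(features[train_idx, ], features[test_idx, ],
              labels[train_idx], k = k)
  mean(pred != labels[test_idx])  # share of misclassified test rows
}, numeric(1))

# Choose the k with the minimum test error (first one in case of equal rates)
best_k <- candidate_k[which.min(error_rate)]
print(data.frame(k = candidate_k, error_rate = error_rate))
print(best_k)
```

With a data set as small as the 40 speeches used in the following slides, the error estimate from a single split is noisy, so repeating the split several times and averaging would be a reasonable refinement.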
  42. KNN for speech classification
      - Dataset: 40 instances
          - Barack Obama: 20 speeches
          - Mitt Romney: 20 speeches
      - Training data: 70% (28)
      - Test data: 30% (12)
      - Accuracy: on average more than 90%
  43. Speech Classification Implementation in R

          # initialize the R environment
          libs <- c("tm", "plyr", "class")
          lapply(libs, require, character.only = TRUE)

          # set parameters / source directory
          dir.names <- c("obama", "romney")
          path <- "E:/Ashraf/speeches"

          # clean text / preprocessing
          cleanCorpus <- function(corpus) {
            corpus.tmp <- tm_map(corpus, removePunctuation)
            corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
            # tm 0.6 and later need content_transformer() around base functions such as tolower
            corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
            corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
            return(corpus.tmp)
          }

  44. Speech Classification Implementation in R

          # build a term-document matrix for one candidate's directory
          generateTDM <- function(dir.name, dir.path) {
            s.dir <- sprintf("%s/%s", dir.path, dir.name)
            s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
            s.cor.cl <- cleanCorpus(s.cor)
            s.tdm <- TermDocumentMatrix(s.cor.cl)
            # keep only terms that occur in more than 30% of the documents
            s.tdm <- removeSparseTerms(s.tdm, 0.7)
            result <- list(name = dir.name, tdm = s.tdm)
            return(result)
          }
          tdm <- lapply(dir.names, generateTDM, dir.path = path)

  45. Speech Classification Implementation in R

          # attach the candidate name to each row of the transposed TDM (one row per speech)
          bindCandidateToTDM <- function(tdm) {
            s.mat <- t(data.matrix(tdm[["tdm"]]))
            s.df <- as.data.frame(s.mat, stringsAsFactors = FALSE)
            s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)))
            colnames(s.df)[ncol(s.df)] <- "targetcandidate"
            return(s.df)
          }
          candTDM <- lapply(tdm, bindCandidateToTDM)

  46. Speech Classification Implementation in R

          # stack the TDMs together (for both Obama and Romney)
          tdm.stack <- do.call(rbind.fill, candTDM)
          # terms missing from one candidate's matrix come back as NA; count them as 0
          tdm.stack[is.na(tdm.stack)] <- 0

          # hold-out: split into training (70%) and test (30%) sets
          train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack) * 0.7))
          test.idx <- (1:nrow(tdm.stack))[-train.idx]

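The stacking step works because `rbind.fill` from plyr aligns columns by name and fills the gaps with `NA`, which the script then replaces with 0. The base-R sketch below imitates that behavior for two data frames so the effect can be seen without plyr; `stack_fill` is a hypothetical helper and the term counts are made up.

```r
# Stack two data frames whose columns only partly overlap,
# filling the columns that one side lacks with 0
stack_fill <- function(a, b) {
  cols <- union(names(a), names(b))
  for (col in setdiff(cols, names(a))) a[[col]] <- 0
  for (col in setdiff(cols, names(b))) b[[col]] <- 0
  rbind(a[cols], b[cols])
}

# Made-up term counts: one row per speech, one column per term
obama <- data.frame(oil = c(2, 1), price = c(1, 0),
                    targetcandidate = "obama", stringsAsFactors = FALSE)
romney <- data.frame(oil = 3, tax = 4,
                     targetcandidate = "romney", stringsAsFactors = FALSE)

stacked <- stack_fill(obama, romney)
print(stacked)
```

A term that one candidate never uses is thus encoded as a count of 0 for all of that candidate's speeches, which keeps every row in the same feature space for the distance computation in k-NN.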
  47. Speech Classification Implementation in R

          # model KNN
          tdm.cand <- tdm.stack[, "targetcandidate"]
          tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targetcandidate"]
          knn.pred <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.cand[train.idx])

          # accuracy of the prediction
          conf.mat <- table("Predictions" = knn.pred, Actual = tdm.cand[test.idx])
          (accuracy <- (sum(diag(conf.mat)) / length(test.idx)) * 100)

          # show result
          show(conf.mat)
          show(accuracy)

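The accuracy figure is the share of test speeches on the diagonal of the confusion matrix. A small worked example with made-up labels (six test speeches, one of them misclassified) shows the arithmetic; the label vectors are assumptions, not results from the presentation.

```r
# Made-up true and predicted labels for six test speeches
actual <- factor(c("obama", "obama", "obama", "romney", "romney", "romney"))
predicted <- factor(c("obama", "obama", "romney", "romney", "romney", "romney"),
                    levels = levels(actual))

# Rows are predictions, columns are the true classes
conf.mat <- table(Predictions = predicted, Actual = actual)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf.mat)) / sum(conf.mat) * 100

print(conf.mat)
print(accuracy)
```

Giving both factors the same levels keeps the table square even if the classifier never predicts one of the classes, so that `diag()` still lines up predictions with the matching true class.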
  48. Speech Classification Implementation in R: Show R Demo
  49. References
      1. Ingo Feinerer, Kurt Hornik, David Meyer. Text Mining Infrastructure in R. Journal of Statistical Software, Vol. 25, Issue 5, March 2008.
      2. http://mittromneycentral.com/speeches/
      3. http://obamaspeeches.com/
      4. http://cran.r-project.org/