This document provides an overview and examples of text mining Japanese documents with Scala and Spark. It discusses preprocessing Japanese text with word segmentation tools like Kuromoji, building topic models with LDA, and generating word embeddings with Word2Vec. Examples are given for segmenting Japanese text, creating document corpora, training LDA models to extract topics, and using Word2Vec to find word similarities. The document emphasizes that high quality word vectors require large datasets.
2. About Me
• Eduardo Gonzalez
• Japan Business Systems
• Japanese System Integrator (SIer)
• Social Systems Design Center (R&D)
• University of Pittsburgh
• Computer Science
• Japanese
@wm_eddie
3. Agenda
• Intro to Text mining with Spark
• Pre-processing Japanese Text
• Japanese Word Breaking
• Spark Gotchas
• Topic Extraction with LDA
• Intro to Word2Vec
• Recommendation with Word Embedding
4. Machine Learning Vocabulary
• Feature: A number that represents something about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can be used for prediction
• Corpus: A collection of documents
Feature, Label, Document, Model, and Corpus are the basic vocabulary of machine learning.
5. What is Apache Spark
• A library that defines a Resilient Distributed Dataset type and a set of transformations
• RDDs are only representations of calculations
• A runtime that can execute RDDs in a distributed manner
• A master process that schedules and monitors executors
• Executors actually do the calculations and can keep results in their memory
• Spark SQL, MLlib and GraphX define special types of RDDs
Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graphs.
6. Apache Spark Example
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val sc = new SparkContext(new SparkConf())
  val text = sc.textFile("hdfs:///kjb.txt")
  val counts = text.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.collect().foreach(println)
}
A word-count application built with Spark looks like this.
7. Spark’s Text-Mining Tools
• LDA for topic extraction
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer: turns documents into vectors based on word counts
• HashingTF-IDF: calculates the important words of a document with respect to the corpus (see the sketch below)
• And much more
Spark’s text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and more.
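As a rough illustration of the last point, here is a minimal TF-IDF sketch using the RDD-based MLlib API; the input path and the whitespace split are placeholders (Japanese text needs the tokenization covered later).

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

// Tokenized documents; a plain whitespace split stands in for a real tokenizer
val docs: RDD[Seq[String]] = sc.textFile("hdfs:///docs.txt").map(_.split(" ").toSeq)

// Term frequencies via the hashing trick
val tf = new HashingTF().transform(docs)
tf.cache()

// IDF down-weights terms that appear in many documents
val tfidf = new IDF().fit(tf).transform(tf)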
8. How to use Spark LDA
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
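A quick way to inspect what was learned (a sketch; describeTopics is part of the MLlib LDAModel API, and the term indices refer to positions in the count vectors):

// Print the top 5 terms (as vocabulary indices) and their weights for each topic
ldaModel.describeTopics(maxTermsPerTopic = 5).zipWithIndex.foreach {
  case ((termIndices, termWeights), topicId) =>
    println(s"Topic $topicId: " +
      termIndices.zip(termWeights).map { case (t, w) => f"$t%d:$w%.3f" }.mkString(" "))
}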
10. LDA Step 0: Get words
Before running LDA, the first step is to extract the words.
11. Word Segmentation
• Hard to actually get right.
• Simple in theory with English
• str.split(" ")
• But not enough for real data.
• (Take parens for example.)
• ["(Take", "parens", "for", "example.)"]
• Etc.
Real-world word segmentation is hard; simply splitting on delimiters is not enough.
12. Word Segmentation
• Since Japanese lacks spaces, it’s hard even in theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help
Japanese has no word delimiters, so a probabilistic approach is necessary; libraries make it practical.
13. Morphological Analyzers
• Include POS tagging, pronunciation and stemming
• MeCab
• Written in C++ with SWIG bindings to pretty much everything
• Kuromoji
• Written in Java, available via Maven
• Others
Libraries such as MeCab and Kuromoji are available for morphological analysis (POS tagging, pronunciation, stemming).
14. JMecab & Spark/Hadoop
• Not impossible but difficult
• Add MeCab to each node
• Add jar to classpaths
• Include jar in project for compilation
• Not too bad with own hardware, but painful with Amazon EMR or Azure HDInsight
JMecab must be installed on every node in advance, which is manageable on-premises but painful in cloud environments.
15. Kuromoji & Spark/Hadoop
• Easy
• Include dependency in build.sbt (see the sketch below)
• Include jar file in FatJar with sbt-assembly
Kuromoji only needs a dependency entry and a fat JAR build, so it is easy to use.
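A minimal build.sbt sketch; the Kuromoji coordinates, resolver URL, and Spark version shown here are assumptions to check against atilika.org and your cluster, not values from the talk.

// build.sbt (sketch) -- coordinates and versions are illustrative assumptions
resolvers += "Atilika" at "https://www.atilika.org/nexus/content/repositories/atilika"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.0" % "provided",
  "org.atilika.kuromoji" % "kuromoji"  % "0.7.7"
)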
16. Using Kuromoji
import org.atilika.kuromoji.Tokenizer

object Main extends App {
  import scala.collection.JavaConverters.asScalaBufferConverter

  val tokenizer = Tokenizer.builder().build()
  val ex1 = "リストのような構造の物から条件を満たす物を探す"
  val res1 = tokenizer.tokenize(ex1).asScala
  for (token <- res1) {
    println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
  }
}
19. Vocabulary
lazy val tokenizer = Tokenizer.builder().build()

val text = sc.textFile("documents")

// Tokenize every line, then assign each distinct base form a unique index
val words = for {
  line <- text
  token <- tokenizer.tokenize(line).asScala
} yield token.getBaseForm

val vocab = words.distinct().zipWithIndex().collectAsMap()
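The deck jumps from the vocabulary to using the model, so here is a hedged sketch of the step in between: turning each tokenized document into a sparse count vector over that vocabulary and training LDA on it (K = 10 and the filtering details are assumptions).

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// One document per line: count how often each in-vocabulary word occurs
// and pack the counts into a sparse vector indexed by the vocabulary
val corpus = text.map { line =>
  val counts = tokenizer.tokenize(line).asScala
    .map(_.getBaseForm)
    .filter(vocab.contains)
    .groupBy(identity)
    .map { case (word, occs) => (vocab(word).toInt, occs.size.toDouble) }
    .toSeq
  Vectors.sparse(vocab.size, counts)
}.zipWithIndex.map(_.swap).cache()

val ldaModel = new LDA().setK(10).run(corpus)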
29. Using the LDA model
• Prediction requires a LocalLDAModel
• Use .toLocal if isInstanceOf[DistributedLDAModel]
• Convert to Vector using the same steps
• Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores (sketch below)
The LDA model is used to predict the topics of new documents.
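A sketch of the steps in this list; stringToCountVector is a hypothetical helper that applies the same tokenize / filter / count steps used to build the training corpus.

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

// Prediction needs a LocalLDAModel; the default EM optimizer returns a distributed one
val localModel: LocalLDAModel = ldaModel match {
  case m: DistributedLDAModel => m.toLocal
  case m: LocalLDAModel       => m
}

// Vectorize a new document exactly as during training (unknown words dropped)
val newDocs = sc.parallelize(Seq((0L, stringToCountVector("新しい文書のテキスト"))))

// topicDistributions returns, per document ID, a vector of topic proportions
localModel.topicDistributions(newDocs).collect().foreach {
  case (id, topics) => println(s"$id -> $topics")
}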
31. Now what?
• Find the minimum logLikelihood in a set of documents you know are OK
• Report an anomaly whenever a new document has a lower logLikelihood, as sketched below
Compute the minimum log-likelihood over a set of known-good documents, and classify a new document as an anomaly whenever it falls below that value.
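A minimal sketch of that thresholding, assuming knownGoodDocs is a local Seq of (id, countVector) pairs and localModel is the LocalLDAModel from above; logLikelihood here is MLlib's lower-bound estimate, scored one document at a time.

// Score a single document by wrapping it in a one-element RDD
def score(doc: (Long, org.apache.spark.mllib.linalg.Vector)): Double =
  localModel.logLikelihood(sc.parallelize(Seq(doc)))

// The lowest score over documents known to be OK becomes the threshold
val threshold = knownGoodDocs.map(score).min

// Flag anything that scores below the threshold
if (score(newDoc) < threshold) println("anomaly")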
33. Word2Vec
• Creates vectors that represent points in meaning space
• Unsupervised, but requires a lot of data to generate good vectors
• Google’s sample vectors were trained on 100 billion words (~X00GB?)
• Vectors trained on less data can show interesting similarities, but can’t do so consistently
Word2Vec turns words into vectors, a quantitative representation that makes it possible to compute similarity between words.
34. Word2Vec Intuition
• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
An example of actual word vectorization.
37. Making Word2VecModel
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

val documentWords: RDD[Seq[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)
documentWords.cache()

val model = new Word2Vec().setVectorSize(300).fit(documentWords)
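Once fitted, the model can be queried for nearest neighbors in the vector space; the query word below is just an example and must appear in the training vocabulary or findSynonyms will throw.

// 10 most similar words to the query, with similarity scores
model.findSynonyms("東京", 10).foreach {
  case (word, similarity) => println(f"$word%s $similarity%.3f")
}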
41. Embedding with Vector Concatenation
• Calculate the sum of the word vectors in an item's description
• Add it to the vectors from Word2VecModel.getVectors under a special keyword (e.g. ITEM_1234)
• Create a new Word2VecModel using the constructor
• ※Not state of the art, but can produce reasonable recommendations without user rating data
Embedding by vector concatenation: for each item, sum the vectors of the words it contains.
43. Item Embedding (2/2)
import scala.util.Try
import breeze.linalg.DenseVector
import org.apache.spark.mllib.feature.Word2VecModel

def stringToVector(s: String): Array[Double] = {
  val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
  // Fall back to a common particle's vector for out-of-vocabulary words
  val vectors = words.map(word =>
    Try(model.transform(word)).getOrElse(model.transform("は")))
  val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
  val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
  concat.toArray
}

val embedVectors: Map[String, Array[Float]] = embeds.map {
  case (key, value) => (key, stringToVector(value).map(_.toFloat))
}

val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
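With the item vectors merged in, the same findSynonyms call can relate items and words; the ITEM_1234 key follows the naming convention from slide 41 and is only an example.

// Items (or words) closest to a given item in the shared vector space
embedModel.findSynonyms("ITEM_1234", 5).foreach {
  case (key, similarity) => println(f"$key%s $similarity%.3f")
}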