This document provides an overview and examples of text mining Japanese documents with Scala and Spark. It discusses preprocessing Japanese text with word segmentation tools like Kuromoji, building topic models with LDA, and generating word embeddings with Word2Vec. Examples are given for segmenting Japanese text, creating document corpora, training LDA models to extract topics, and using Word2Vec to find word similarities. The document emphasizes that high quality word vectors require large datasets.
2. About Me
• Eduardo Gonzalez
• Japan Business Systems
• Japanese System Integrator (SIer)
• Social Systems Design Center (R&D)
• University of Pittsburgh
• Computer Science
• Japanese
@wm_eddie
3. Agenda
• Intro to Text mining with Spark
• Pre-processing Japanese Text
• Japanese Word Breaking
• Spark Gotchas
• Topic Extraction with LDA
• Intro to Word2Vec
• Recommendation with Word Embedding
4. Machine Learning Vocabulary
• Feature: A number that represents something about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can be used for prediction
• Corpus: A collection of documents
Feature, Label, Document, Model, and Corpus are the basic vocabulary of machine learning.
5. What is Apache Spark
• A library that defines a Resilient Distributed Dataset type and a set of transformations
• RDDs are only representations of calculations
• A runtime that can execute RDDs in a distributed manner
• A master process that schedules and monitors executors
• Executors actually do the calculations and can keep results in their memory
• Spark SQL, MLlib and GraphX define special types of RDDs
Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graphs.
6. Apache Spark Example
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val sc = new SparkContext(new SparkConf())
  val text = sc.textFile("hdfs:///kjb.txt")
  val counts = text.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.collect().foreach(println)
}
A word-count application built with Spark looks like this.
7. Spark’s Text-Mining Tools
• LDA for topic extraction
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer: turns documents into vectors based on word counts
• HashingTF-IDF: calculates the important words of a document with respect to the corpus (see the sketch below)
• And much more
Spark’s text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and more.
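As a rough illustration of the last point, here is a minimal TF-IDF sketch using the RDD-based MLlib API; the input path and the whitespace split are placeholders (Japanese text needs the tokenization covered later).

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

// Tokenized documents; a plain whitespace split stands in for a real tokenizer
val docs: RDD[Seq[String]] = sc.textFile("hdfs:///docs.txt").map(_.split(" ").toSeq)

// Term frequencies via the hashing trick
val tf = new HashingTF().transform(docs)
tf.cache()

// IDF down-weights terms that appear in many documents
val tfidf = new IDF().fit(tf).transform(tf)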
8. How to use Spark LDA
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
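A quick way to inspect what was learned (a sketch; describeTopics is part of the MLlib LDAModel API, and the term indices refer to positions in the count vectors):

// Print the top 5 terms (as vocabulary indices) and their weights for each topic
ldaModel.describeTopics(maxTermsPerTopic = 5).zipWithIndex.foreach {
  case ((termIndices, termWeights), topicId) =>
    println(s"Topic $topicId: " +
      termIndices.zip(termWeights).map { case (t, w) => f"$t%d:$w%.3f" }.mkString(" "))
}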
10. LDA Step 0: Get words
Before running LDA, the first step is to extract the words.
11. Word Segmentation
• Hard to actually get right.
• Simple in theory with English
• str.split(" ")
• But not enough for real data.
• (Take parens for example.)
• ["(Take", "parens", "for", "example.)"]
• Etc.
Real-world word segmentation is hard; simply splitting on delimiters is not enough.
12. Word Segmentation
• Since Japanese lacks spaces, it’s hard even in theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help
Japanese has no word delimiters, so a probabilistic approach is necessary; libraries make it practical.
13. Morphological Analyzers
• Include POS tagging, pronunciation and stemming
• MeCab
• Written in C++ with SWIG bindings to pretty much everything
• Kuromoji
• Written in Java, available via Maven
• Others
Libraries such as MeCab and Kuromoji are available for morphological analysis (POS tagging, pronunciation, stemming).
14. JMecab & Spark/Hadoop
• Not impossible but difficult
• Add MeCab to each node
• Add jar to classpaths
• Include jar in project for compilation
• Not too bad with own hardware, but painful with Amazon EMR or Azure HDInsight
JMecab must be installed on every node in advance, which is manageable on-premises but painful in cloud environments.
15. Kuromoji & Spark/Hadoop
• Easy
• Include dependency in build.sbt (see the sketch below)
• Include jar file in FatJar with sbt-assembly
Kuromoji only needs a dependency entry and a fat JAR build, so it is easy to use.
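A minimal build.sbt sketch; the Kuromoji coordinates, resolver URL, and Spark version shown here are assumptions to check against atilika.org and your cluster, not values from the talk.

// build.sbt (sketch) -- coordinates and versions are illustrative assumptions
resolvers += "Atilika" at "https://www.atilika.org/nexus/content/repositories/atilika"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.0" % "provided",
  "org.atilika.kuromoji" % "kuromoji"  % "0.7.7"
)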
16. Using Kuromoji
import org.atilika.kuromoji.Tokenizer

object Main extends App {
  import scala.collection.JavaConverters.asScalaBufferConverter

  val tokenizer = Tokenizer.builder().build()
  val ex1 = "リストのような構造の物から条件を満たす物を探す"
  val res1 = tokenizer.tokenize(ex1).asScala
  for (token <- res1) {
    println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
  }
}
19. Vocabulary
lazy val tokenizer = Tokenizer.builder().build()

val text = sc.textFile("documents")

// Tokenize every line, then assign each distinct base form a unique index
val words = for {
  line <- text
  token <- tokenizer.tokenize(line).asScala
} yield token.getBaseForm

val vocab = words.distinct().zipWithIndex().collectAsMap()
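The deck jumps from the vocabulary to using the model, so here is a hedged sketch of the step in between: turning each tokenized document into a sparse count vector over that vocabulary and training LDA on it (K = 10 and the filtering details are assumptions).

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// One document per line: count how often each in-vocabulary word occurs
// and pack the counts into a sparse vector indexed by the vocabulary
val corpus = text.map { line =>
  val counts = tokenizer.tokenize(line).asScala
    .map(_.getBaseForm)
    .filter(vocab.contains)
    .groupBy(identity)
    .map { case (word, occs) => (vocab(word).toInt, occs.size.toDouble) }
    .toSeq
  Vectors.sparse(vocab.size, counts)
}.zipWithIndex.map(_.swap).cache()

val ldaModel = new LDA().setK(10).run(corpus)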
29. Using the LDA model
• Prediction requires a LocalLDAModel
• Use .toLocal if isInstanceOf[DistributedLDAModel]
• Convert to Vector using the same steps
• Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores (sketch below)
The LDA model is used to predict the topics of new documents.
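A sketch of the steps in this list; stringToCountVector is a hypothetical helper that applies the same tokenize / filter / count steps used to build the training corpus.

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

// Prediction needs a LocalLDAModel; the default EM optimizer returns a distributed one
val localModel: LocalLDAModel = ldaModel match {
  case m: DistributedLDAModel => m.toLocal
  case m: LocalLDAModel       => m
}

// Vectorize a new document exactly as during training (unknown words dropped)
val newDocs = sc.parallelize(Seq((0L, stringToCountVector("新しい文書のテキスト"))))

// topicDistributions returns, per document ID, a vector of topic proportions
localModel.topicDistributions(newDocs).collect().foreach {
  case (id, topics) => println(s"$id -> $topics")
}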
31. Now what?
• Find the minimum logLikelihood in a set of documents you know are OK
• Report an anomaly whenever a new document has a lower logLikelihood, as sketched below
Compute the minimum log-likelihood over a set of known-good documents, and classify a new document as an anomaly whenever it falls below that value.
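A minimal sketch of that thresholding, assuming knownGoodDocs is a local Seq of (id, countVector) pairs and localModel is the LocalLDAModel from above; logLikelihood here is MLlib's lower-bound estimate, scored one document at a time.

// Score a single document by wrapping it in a one-element RDD
def score(doc: (Long, org.apache.spark.mllib.linalg.Vector)): Double =
  localModel.logLikelihood(sc.parallelize(Seq(doc)))

// The lowest score over documents known to be OK becomes the threshold
val threshold = knownGoodDocs.map(score).min

// Flag anything that scores below the threshold
if (score(newDoc) < threshold) println("anomaly")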
33. Word2Vec
• Creates vectors that represent points in meaning space
• Unsupervised, but requires a lot of data to generate good vectors
• Google’s sample vectors were trained on 100 billion words (~X00GB?)
• Vectors trained on less data can show interesting similarities, but can’t do so consistently
Word2Vec turns words into vectors, a quantitative representation that makes it possible to compute similarity between words.
34. Word2Vec Intuition
• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
An example of actual word vectorization.
37. Making Word2VecModel
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

val documentWords: RDD[Seq[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)
documentWords.cache()

val model = new Word2Vec().setVectorSize(300).fit(documentWords)
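Once fitted, the model can be queried for nearest neighbors in the vector space; the query word below is just an example and must appear in the training vocabulary or findSynonyms will throw.

// 10 most similar words to the query, with similarity scores
model.findSynonyms("東京", 10).foreach {
  case (word, similarity) => println(f"$word%s $similarity%.3f")
}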
41. Embedding with Vector Concatenation
• Calculate the sum of the word vectors in an item's description
• Add it to the vectors from Word2VecModel.getVectors under a special keyword (e.g. ITEM_1234)
• Create a new Word2VecModel using the constructor
• ※Not state of the art, but can produce reasonable recommendations without user rating data
Embedding by vector concatenation: for each item, sum the vectors of the words it contains.
43. Item Embedding (2/2)
import scala.util.Try
import breeze.linalg.DenseVector
import org.apache.spark.mllib.feature.Word2VecModel

def stringToVector(s: String): Array[Double] = {
  val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
  // Fall back to a common particle's vector for out-of-vocabulary words
  val vectors = words.map(word =>
    Try(model.transform(word)).getOrElse(model.transform("は")))
  val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
  val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
  concat.toArray
}

val embedVectors: Map[String, Array[Float]] = embeds.map {
  case (key, value) => (key, stringToVector(value).map(_.toFloat))
}

val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
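With the item vectors merged in, the same findSynonyms call can relate items and words; the ITEM_1234 key follows the naming convention from slide 41 and is only an example.

// Items (or words) closest to a given item in the shared vector space
embedModel.findSynonyms("ITEM_1234", 5).foreach {
  case (key, similarity) => println(f"$key%s $similarity%.3f")
}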