Japanese Text Mining with Scala and Spark
Eduardo Gonzalez
Scala Matsuri 2016
About Me
• Eduardo Gonzalez
• Japan Business Systems
• Japanese System Integrator (SIer)
• Social Systems Design Center (R&D)
• University of Pittsburgh
• Computer Science
• Japanese
@wm_eddie
Agenda
• Intro to Text mining with Spark
• Pre-processing Japanese Text
• Japanese Word Breaking
• Spark Gotchas
• Topic Extraction with LDA
• Intro to Word2Vec
• Recommendation with Word Embedding
Machine Learning Vocabulary
• Feature: A number that represents something about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can be used for prediction
• Corpus: A collection of documents
The prerequisite machine learning vocabulary is Feature, Label, Document, Model, and Corpus.
What is Apache Spark
• A library that defines a Resilient Distributed Dataset (RDD) type and a set of transformations
• RDDs are only representations of calculations
• A runtime that can execute RDDs in a distributed manner
• A master process that schedules and monitors executors
• Executors actually do the calculations and can keep results in their memory
• Spark SQL, MLlib and GraphX define special types of RDDs
Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graphs.
Apache Spark Example
import org.apache.spark.{SparkConf, SparkContext}
object Main extends App {
val sc = new SparkContext(new SparkConf())
val text = sc.textFile("hdfs:///kjb.txt")
val counts = text.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.collect().foreach(println)
}
This is what a word count application looks like in Spark.
Spark’s Text-Mining Tools
• LDA for Topic Extraction
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates important words of a document with respect to the corpus
• And much more
Spark's text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and others.
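To make the TF-IDF bullet concrete, here is a minimal plain-Scala sketch of the weighting idea. It is an illustration only, not Spark's implementation: the object name `TfIdf` is hypothetical, the hashing step is left out, and the smoothed formula ln((m + 1) / (df + 1)) is one common variant assumed here.

```scala
object TfIdf {
  // Weight each term of each document by its term frequency times a smoothed
  // inverse document frequency: idf(t) = ln((m + 1) / (df(t) + 1)),
  // where m is the number of documents (an assumed, common variant).
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val m = docs.size
    // Number of documents that contain each term at least once.
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, hits) => (t, hits.size) }
    docs.map { doc =>
      doc.groupBy(identity).map { case (term, hits) =>
        term -> hits.size * math.log((m + 1.0) / (docFreq(term) + 1.0))
      }
    }
  }
}
```

A word that appears in every document (such as です in a two-document toy corpus) gets weight 0, while a word unique to one document gets a positive weight.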
How to use Spark LDA
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
sample_lda_data.txt
However, the LDA input data does not look like text.
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
(´Д`)
This does not look like text
LDA Step 0: Get words
Before running LDA, we first need to extract words.
Word Segmentation
• Hard to actually get right.
• Simple in theory with English
• str.split(" ")
• But not enough for real data.
• (Take parens for example.)
• ["(Take", "parens", "for", "example.)"]
• Etc.
Real-world word extraction is hard; splitting on a delimiter alone does not work well.
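The parenthesis problem above can be reproduced in a few lines. This is a minimal sketch, assuming a simple regular expression is good enough for illustration; the object name `SimpleTokenizer` and the pattern are hypothetical choices, not part of the talk.

```scala
object SimpleTokenizer {
  // Naive approach from the slide: split on single spaces only.
  def naive(text: String): List[String] = text.split(" ").toList

  // Slightly better: keep runs of letters and digits (with an optional
  // apostrophe suffix such as "don't") and drop surrounding punctuation.
  private val wordPattern = """[\p{L}\p{N}]+(?:'\p{L}+)?""".r

  def regex(text: String): List[String] = wordPattern.findAllIn(text).toList
}
```

`SimpleTokenizer.naive("(Take parens for example.)")` keeps the punctuation glued to the words, while `SimpleTokenizer.regex` strips it. Real data still needs more care (URLs, hyphens, numbers with separators).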
Word Segmentation
• Since Japanese lacks spaces, it’s hard even in theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help
Japanese has no word delimiters, so extracting words requires a probabilistic approach; libraries make this efficient.
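To give a feel for what "probabilistic" means here, the following is a toy sketch of a unigram segmenter that picks the split with the highest total log-probability using dynamic programming. Everything in it is an assumption for illustration: the object name, the tiny dictionary, and the probability values are made up, and real analyzers such as MeCab or Kuromoji use much richer dictionaries and models.

```scala
object UnigramSegmenter {
  // Toy unigram probabilities (hypothetical values, not from a real corpus).
  val wordProb: Map[String, Double] = Map(
    "東京" -> 0.02, "京都" -> 0.02, "東" -> 0.005, "都" -> 0.005,
    "東京都" -> 0.01, "に" -> 0.05, "住む" -> 0.01
  )
  val maxWordLen: Int = wordProb.keys.map(_.length).max
  // Penalty for a single character that is not in the dictionary.
  val unknownLogProb: Double = math.log(1e-8)

  def segment(text: String): List[String] = {
    val n = text.length
    // best(i) = (best log-probability of the first i characters,
    //            start index of the last word in that best split)
    val best = Array.fill(n + 1)((Double.NegativeInfinity, 0))
    best(0) = (0.0, 0)
    for (end <- 1 to n; start <- math.max(0, end - maxWordLen) until end) {
      val word = text.substring(start, end)
      val logP = wordProb.get(word).map(p => math.log(p))
        .getOrElse(if (word.length == 1) unknownLogProb else Double.NegativeInfinity)
      val score = best(start)._1 + logP
      if (score > best(end)._1) best(end) = (score, start)
    }
    // Walk the back-pointers from the end of the string.
    var end = n
    var words = List.empty[String]
    while (end > 0) {
      val start = best(end)._2
      words = text.substring(start, end) :: words
      end = start
    }
    words
  }
}
```

With this toy dictionary, 東京都に住む can be split as 東京 / 都, 東 / 京都, or 東京都; the segmenter prefers 東京都 because that split has the highest total probability.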
Morphological Analyzers
• Include POS tagging, pronunciation and stemming
• MeCab
• Written in C++ with SWIG bindings to pretty much everything
• Kuromoji
• Written in Java, available via Maven
• Others
Libraries such as MeCab and Kuromoji provide morphological analysis (POS tagging, pronunciation, stemming).
JMecab & Spark/Hadoop
• Not impossible but difficult
• Add MeCab to each node
• Add jar to classpaths
• Include jar in project for compilation
• Not too bad with own hardware but painful with Amazon EMR or Azure HDInsight
JMecab requires MeCab to be installed beforehand, so it is manageable on-premises but hard to run in cloud environments.
Kuromoji & Spark/Hadoop
• Easy
• Include dependency in build.sbt
• Include jar file in FatJar with sbt-assembly
Kuromoji is easy to use: just add the dependency and build a FatJar.
Using Kuromoji
import org.atilika.kuromoji.Tokenizer
object Main extends App {
import scala.collection.JavaConverters.asScalaBufferConverter
val tokenizer = Tokenizer.builder().build()
val ex1 = "リストのような構造の物から条件を満たす物を探す"
val res1 = tokenizer.tokenize(ex1).asScala
for (token <- res1) {
println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
}
}
Using Kuromoji
With Kuromoji, text is recognized like this:
厚生 名詞,一般,*,*
年金 名詞,一般,*,*
基金 名詞,一般,*,*
脱退 名詞,サ変接続,*,*
に 助詞,格助詞,一般,*
伴う 動詞,自立,*,*
手続き 名詞,サ変接続,*,*
について 助詞,格助詞,連語,*
の 助詞,連体化,*,*
リマ 名詞,固有名詞,地域,一般
インド 名詞,固有名詞,地域,国
です 助動詞,*,*,*
リスト 名詞,一般,*,*
の 助詞,連体化,*,*
よう 名詞,非自立,助動詞語幹,*
だ 助動詞,*,*,*
構造 名詞,一般,*,*
の 助詞,連体化,*,*
物 名詞,非自立,一般,*
から 助詞,格助詞,一般,*
条件 名詞,一般,*,*
を 助詞,格助詞,一般,*
満たす 動詞,自立,*,*
物 名詞,非自立,一般,*
を 助詞,格助詞,一般,*
探す 動詞,自立,*,*
Step 1: Build Vocabulary
Building the vocabulary
Vocabulary
lazy val tokenizer = Tokenizer.builder().build()
val text = sc.textFile("documents")
val words = for {
line <- text
token <- tokenizer.tokenize(line).asScala
} yield token.getBaseForm
val vocab = words.distinct().zipWithIndex().collectAsMap()
Step 2: Create Corpus
Creating the corpus
Corpus
val documentWords: RDD[Array[String]] =
text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)
val documentCounts: RDD[Array[(String, Int)]] =
documentWords.map(words => words.distinct.map { word =>
(word, words.count(_ == word))
})
val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
documentCounts.map(wordsAndCount => wordsAndCount.map {
case (word, count) => (vocab(word).toInt, count.toDouble)
})
val corpus: RDD[(Long, Vector)] =
documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)
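The per-document counting step can be sketched without Spark. This is a minimal illustration under stated assumptions: the object and method names are hypothetical, and unlike the slide code (where `vocab(word)` throws on a missing key) it silently drops words that are not in the vocabulary, which matters once stop words are filtered out of the vocabulary.

```scala
object CountVectors {
  // Turn one tokenized document into sorted (index, count) pairs,
  // dropping words that are not in the vocabulary.
  def toSparseCounts(tokens: Seq[String], vocab: Map[String, Int]): Seq[(Int, Double)] =
    tokens
      .flatMap(t => vocab.get(t))   // unknown words disappear here
      .groupBy(identity)
      .map { case (index, hits) => (index, hits.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

The resulting pairs have the same shape as the `Seq[(Int, Double)]` that the slide passes to `Vectors.sparse(vocab.size, _)`.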
Step 3: Learn Topics
Training the topic model
Learn Topics
val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)
val topics = ldaModel.describeTopics(10).map {
case (terms, weights) =>
terms.map(vocabulary(_)).zip(weights)
}
topics.zipWithIndex.foreach {
case (topic, i) =>
println(s"TOPIC $i")
topic.foreach { case (term, weight) => println(s"$term\t$weight") }
println(s"==========")
}
Step 4: Evaluate
Evaluating the results
Topics?
Topic 0:
です 0.10870545899718176
。 0.09623411796419644
さん 0.06105040403724023
Topic 1:
の 0.11035671185240525
を 0.07860862808644907
する 0.05605566895190625
Topic 2:
お願い 0.07579177409154919
ご 0.04431117457391179
よろしく 0.032788330612439916
The resulting topics consisted of particles and auxiliary words.
Step 5: GOTO 2
Filter Stopwords
val popular = words
.map(w => (w, 1))
.reduceByKey(_ + _)
.sortBy(-_._2)
.take(50)
.map(_._1)
.toSet
val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
val vocab: Map[String, Long] = vocabIndicies.collectAsMap()
val vocabulary = vocabIndicies.collect().map(_._1)
Removing stop words
Topics!
Topic 0:
変更 0.032952997236706624
サーバー 0.03140777729144046
設定 0.021643554361727567
エラー 0.017955380768330902
Topic 1:
ログ 0.028665774057609564
時間 0.026686704628121154
時 0.02404938565591628
発生 0.020797622509804107
Topic 2:
様 0.0474658820402456
株式会社 0.026174292703953685
お世話 0.021939329774535308
Using the LDA model
• Prediction requires a LocalLDAModel
• Use .toLocal if isInstanceOf[DistributedLDAModel]
• Convert to Vector using same steps
• Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores
The LDA model is used to predict topics.
Topics Prediction
New document topics:
0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,
0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643
New document topics:
0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,
0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193
Topic prediction
[Figure: axis labels Topic 0, Topic 1, Topic 2, Topic …]
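A topic distribution like the ones printed above is one score per topic, and the scores add up to roughly 1. A small sketch of reading off the dominant topic follows; the object and method names are hypothetical, and the values are the first distribution from the slide.

```scala
object TopicScores {
  // Index of the highest-scoring topic in one topic distribution.
  def dominantTopic(distribution: Seq[Double]): Int =
    distribution.zipWithIndex.maxBy(_._1)._2

  // First "New document topics" distribution from the slide.
  val firstDoc: Seq[Double] = Seq(
    0.091084004103132, 0.1044111561202625, 0.09090943947509807,
    0.11607354553753861, 0.10404284803971378, 0.09697071269561051,
    0.09571658794577831, 0.0919546186785918, 0.09176248930132802,
    0.11707459810294643
  )
}
```

For that document the highest score belongs to topic 9, narrowly ahead of topic 3, which shows how flat these distributions can be.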
Now what?
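• Find the minimum logLikelihood in a set of documents you know are OK
• Report an anomaly whenever a new document has a lower logLikelihood
Compute the minimum log-likelihood over the set of documents whose topics were predicted correctly; if a new document falls below that value, classify it as an anomaly.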
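The thresholding rule above fits in two small functions. This is a sketch only: the names are hypothetical, and the scores in the usage note are made-up numbers rather than results from the talk.

```scala
object AnomalyThreshold {
  // Lowest log-likelihood observed among documents known to be OK.
  def threshold(knownGoodScores: Seq[Double]): Double = {
    require(knownGoodScores.nonEmpty, "need at least one known-good score")
    knownGoodScores.min
  }

  // A new document is flagged when it scores below that floor.
  def isAnomaly(score: Double, floor: Double): Boolean = score < floor
}
```

For example, with hypothetical known-good scores of -2100000.0, -1950000.0 and -2050000.0 the floor is -2100000.0, so a new document scoring -2300000.0 would be reported and one scoring -2000000.0 would not.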
Anomaly Detection
val newDoc = sc.parallelize(Seq("平素は当社サービスをご利用いただき、誠にありがとうございます。"))
def stringToCountVector(strings: RDD[String]) = {
. . .
}
val score = lda.logLikelihood(stringToCountVector(newDoc))
/*
-2153492.694125671
*/
Word2Vec
• Creates vectors that represent points in meaning space
• Unsupervised but requires a lot of data to generate good vectors
• Google’s sample vectors trained on 100 billion words (~X00GB?)
• Vectors with less data can provide interesting similarities but can’t do so consistently
Word2Vec turns words into vectors, giving a quantitative representation from which the similarity between words can be computed.
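One common way to compare two word vectors is cosine similarity. The sketch below is plain Scala with hypothetical names and is meant only to illustrate the idea of similarity in a vector space; it does not claim to reproduce how Spark scores synonyms.

```scala
object Cosine {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // 1.0 for vectors pointing the same way, 0.0 for orthogonal vectors.
  // The result is NaN if either vector is all zeros.
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))
}
```

Because the measure ignores vector length, a word vector and the same vector scaled by any positive factor have similarity 1.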
Word2Vec Intuition
• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.
Linguistic Regularities in Continuous Space Word
Representations. In Proceedings of NAACL HLT, 2013.
An example of actual word vectors
Vector Concatenation
Concatenating vectors
ITEM_01
営業
活用
営業
の
情報
共有
と
サポート. . .
Step 1: Make vectors
Generating word vectors
Making Word2VecModel
val documentWords: RDD[Seq[String]] =
text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)
documentWords.cache()
val model = new Word2Vec().setVectorSize(300).fit(documentWords)
Step 2: Use vectors
Applying word vectors
Using Word2VecModel
model.findSynonyms("日本", 5).foreach(println)
/*
(マイクロソフト,3.750299190465294)
(ビジネス,3.7329870992662104)
(株式会社,3.323983664186244)
(システムズ,3.1331352923187987)
(ビジネスプロダクティビティ,2.595931613590554)
*/
An example of computed word similarities. Results vary greatly with the source data, so the source data is very important.
Big dataset is very important.
Recommendation
• Paragraph Vectors
• Not available in Spark T_T
Recommendation based on paragraph (document) vectors is not possible in Spark.
Embedding with Vector Concatenation
• Calculate the sum of the word vectors in the description
• Add it to vectors from Word2VecModel.getVectors with special keyword (Ex. ITEM_1234)
• Create new Word2VecModel using constructor
• ※Not state of the art but can produce reasonable recommendations without user rating data
Embedding by vector concatenation: for each item, sum the vectors of the words it contains.
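The summing step can be sketched in plain Scala, independent of Spark and Breeze. The names below are hypothetical, and one design choice differs from the talk's code: unknown words are simply skipped here instead of being replaced by a fallback word's vector.

```scala
object ItemEmbedding {
  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

  // Sum the vectors of all known words in a description;
  // words without a vector are skipped (an assumption of this sketch).
  def embed(words: Seq[String], wordVectors: Map[String, Vec], size: Int): Vec =
    words.flatMap(w => wordVectors.get(w)).foldLeft(Array.fill(size)(0.0))(add)

  // Register the item vector next to the word vectors under a special key.
  def withItem(key: String, words: Seq[String],
               wordVectors: Map[String, Vec], size: Int): Map[String, Vec] =
    wordVectors + (key -> embed(words, wordVectors, size))
}
```

Once an item key such as the hypothetical `ITEM_0001` lives in the same map as the words, any nearest-neighbour lookup over that map can return items as well as words.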
Item Embedding (1/2)
val embeds = Map(
"ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
"ITEM_001_02" -> "組織的な営業力・売れる仕組みを構築します",
"ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
"ITEM_002_01" -> "一般的なサーバ、ネットワーク機器やOSレベルの監視に加え",
"ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
"ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
"ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
"ITEM_003_02" -> "導入にとどまらず、アプリケーションやOAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
"ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
)
Item Embedding (2/2)
def stringToVector(s: String): Array[Double] = {
val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
val vectors = words.map(word => Try(model.transform(word)).getOrElse(model.transform("は")))
val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
concat.toArray
}
val embedVectors: Map[String, Array[Float]] = embeds.map {
case (key, value) => (key, stringToVector(value).map(_.toFloat))
}
val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
Recommending Similar
embedModel.findSynonyms("ITEM_001_01", 5).foreach(println)
/*
(ITEM_001_03,12.577457221575695)
(ITEM_003_03,12.542920930725996)
(ITEM_003_02,12.315240961298104)
(ITEM_001_02,12.260734177166485)
(ITEM_002_01,10.866897938028856)
*/
Computing similarity
Recommending New
val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)
/*
(ITEM_001_02,14.372981084681571)
(ITEM_003_03,14.343473534848325)
(ITEM_001_01,13.83593570884867)
(ITEM_002_01,13.61507040314043)
(ITEM_002_03,13.462141195072414)
*/
Recommendation from a new sample
Thank you
• Questions?
• Example source code at:
• https://github.com/wmeddie/spark-text

Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

  • 1. Japanese Text Mining with Scala and Spark Eduardo Gonzalez Scala Matsuri 2016
  • 2. About Me • Eduardo Gonzalez • Japan Business Systems • Japanese System Integrator (SIer) • Social Systems Design Center (R&D) • Pittsburgh University • Computer Science • Japanese @wm_eddie
  • 3. Agenda • Intro to Text mining with Spark • Pre-processing Japanese Text • Japanese Word Breaking • Spark Gotchas • Topic Extraction with LDA • Intro to Word2Vec • Recommendation with Word Embedding
  • 4. Machine Learning Vocabulary • Feature: A number that represents something about a data point • Label: A feature of the data we want to predict • Document: A block of text with a unique ID • Model: A learned set of parameters that can be used for prediction • Corpus: A collection of documents (Feature, Label, Document, Model, and Corpus are the prerequisite vocabulary for machine learning.)
  • 5. What is Apache Spark • A library that defines a Resilient Distributed Dataset (RDD) type and a set of transformations • RDDs are only representations of calculations • A runtime that can execute RDDs in a distributed manner • A master process that schedules and monitors executors • Executors actually do the calculations and can keep results in their memory • Spark SQL, MLlib and GraphX define special types of RDDs (Spark is a general-purpose distributed processing platform with SQL, machine learning, and graph components.)
  • 6. Apache Spark Example
      import org.apache.spark.{SparkConf, SparkContext}

      object Main extends App {
        val sc = new SparkContext(new SparkConf())
        val text = sc.textFile("hdfs:///kjb.txt")
        // Split each line into words, pair each word with 1, then sum per word
        val counts = text.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println)
      }
      (A WordCount application built with Spark looks like this.)
  • 7. Spark’s Text-Mining Tools • LDA for topic extraction • Word2Vec: an unsupervised way to turn words into features based on their meaning • CountVectorizer turns documents into vectors based on word count • HashingTF-IDF calculates important words of a document with respect to the corpus • And much more (Spark's text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and others.)
  • 8. How to use Spark LDA
      import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
      import org.apache.spark.mllib.linalg.Vectors

      // Load and parse the data
      val data = sc.textFile("data/mllib/sample_lda_data.txt")
      val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
      // Index documents with unique IDs
      val corpus = parsedData.zipWithIndex.map(_.swap).cache()

      // Cluster the documents into three topics using LDA
      val ldaModel = new LDA().setK(3).run(corpus)
  • 9. sample_lda_data.txt
      1 2 6 0 2 3 1 1 0 0 3
      1 3 0 1 3 0 0 2 0 0 1
      1 4 1 0 0 4 9 0 1 2 0
      2 1 0 3 0 0 5 0 2 3 9
      3 1 1 9 3 0 2 0 0 1 3
      4 2 0 3 4 5 1 1 1 4 0
      2 1 0 3 0 0 5 0 2 2 9
      1 1 1 9 2 1 2 0 0 1 3
      4 4 0 3 4 2 1 3 0 0 0
      2 8 2 0 3 0 2 0 2 7 2
      1 1 1 9 0 2 2 0 0 3 3
      4 1 0 0 4 5 1 3 0 1 0
      (´Д`) This does not look like text
  • 10. LDA Step 0: Get words (Before running LDA, the words first have to be extracted.)
  • 11. Word Segmentation • Hard to actually get right. • Simple in theory with English • Str.Split(" ") • But not enough for real data. • (Take parens for example.) • ["(Take", "parens", "for", "example.)"] • Etc. (Real word extraction is hard; simply splitting on a delimiter does not work well.)
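The parenthesis problem on this slide can be reproduced with plain Scala collections. The following is only a minimal sketch: the object name `SegmentationSketch` and both helper names are made up for illustration, and the regex-based variant is one possible improvement, not something the talk prescribes.

```scala
object SegmentationSketch {
  // Naive approach from the slide: split on single spaces only.
  def naiveSplit(text: String): List[String] =
    text.split(" ").toList

  // Slightly better for English: keep runs of letters and digits, drop punctuation.
  def wordSplit(text: String): List[String] =
    "[\\p{L}\\p{N}]+".r.findAllIn(text).toList
}
```

`naiveSplit("(Take parens for example.)")` keeps the parentheses and the period attached to the words, while `wordSplit` strips them. Neither helps with Japanese: `wordSplit("リストのような構造")` returns the whole string as a single token, which is why a morphological analyzer is needed.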
  • 12. Word Segmentation • Since Japanese lacks spaces it’s hard even in theory • A probabilistic approach is necessary • Thankfully there are libraries that can help (Japanese has no word delimiter, so word extraction needs a probabilistic approach; libraries make this practical.)
  • 13. Morphological Analyzers • Include POS tagging, pronunciation and stemming • MeCab • Written in C++ with SWIG bindings to pretty much everything • Kuromoji • Written in Java, available via Maven • Others (Libraries such as MeCab and Kuromoji perform morphological analysis: POS tagging, pronunciation, stemming, and so on.)
  • 14. JMecab & Spark/Hadoop • Not impossible but difficult • Add MeCab to each node • Add jar to classpaths • Include jar in project for compilation • Not too bad with own hardware but painful with Amazon EMR or Azure HDInsight (JMecab has to be installed in advance, so it is manageable on-premises but hard to run in cloud environments.)
  • 15. Kuromoji & Spark/Hadoop • Easy • Include dependency in build.sbt • Include jar file in FatJar with sbt-assembly (Kuromoji is easy to use: just add the dependency and build a FatJar.)
  • 16. Using Kuromoji
      import org.atilika.kuromoji.Tokenizer

      object Main extends App {
        import scala.collection.JavaConverters.asScalaBufferConverter

        val tokenizer = Tokenizer.builder().build()
        val ex1 = "リストのような構造の物から条件を満たす物を探す"
        val res1 = tokenizer.tokenize(ex1).asScala

        // Print each token's base form and part of speech, tab-separated
        for (token <- res1) {
          println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
        }
      }
  • 17. Using Kuromoji (Kuromoji recognizes the text like this.)
      厚生	名詞,一般,*,*
      年金	名詞,一般,*,*
      基金	名詞,一般,*,*
      脱退	名詞,サ変接続,*,*
      に	助詞,格助詞,一般,*
      伴う	動詞,自立,*,*
      手続き	名詞,サ変接続,*,*
      について	助詞,格助詞,連語,*
      の	助詞,連体化,*,*
      リマ	名詞,固有名詞,地域,一般
      インド	名詞,固有名詞,地域,国
      です	助動詞,*,*,*
      リスト	名詞,一般,*,*
      の	助詞,連体化,*,*
      よう	名詞,非自立,助動詞語幹,*
      だ	助動詞,*,*,*
      構造	名詞,一般,*,*
      の	助詞,連体化,*,*
      物	名詞,非自立,一般,*
      から	助詞,格助詞,一般,*
      条件	名詞,一般,*,*
      を	助詞,格助詞,一般,*
      満たす	動詞,自立,*,*
      物	名詞,非自立,一般,*
      を	助詞,格助詞,一般,*
      探す	動詞,自立,*,*
  • 19. Vocabulary
      lazy val tokenizer = Tokenizer.builder().build()

      val text = sc.textFile("documents")
      val words = for {
        line <- text
        token <- tokenizer.tokenize(line).asScala
      } yield token.getBaseForm

      val vocab = words.distinct().zipWithIndex().collectAsMap()
  • 20. Step 2: Create Corpus
  • 21. Corpus
      val documentWords: RDD[Array[String]] =
        text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)

      val documentCounts: RDD[Array[(String, Int)]] =
        documentWords.map(words => words.distinct.map { word =>
          (word, words.count(_ == word))
        })

      val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
        documentCounts.map(wordsAndCount => wordsAndCount.map {
          case (word, count) => (vocab(word).toInt, count.toDouble)
        })

      val corpus: RDD[(Long, Vector)] =
        documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)
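The corpus step boils down to turning each tokenized document into (vocabulary index, count) pairs, which is the shape that `Vectors.sparse(vocab.size, pairs)` expects. The sketch below shows that transformation on ordinary Scala collections, without Spark or Kuromoji. `CorpusSketch`, `buildVocab`, and `toSparseCounts` are hypothetical names, and sorting the vocabulary is an assumption made here only to get deterministic indices.

```scala
object CorpusSketch {
  // Assign each distinct word an index (sorted here for determinism).
  def buildVocab(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // Turn one tokenized document into sorted (index, count) pairs,
  // skipping words that are not in the vocabulary.
  def toSparseCounts(doc: Seq[String], vocab: Map[String, Int]): Seq[(Int, Double)] =
    doc.flatMap(w => vocab.get(w))
      .groupBy(identity)
      .map { case (idx, hits) => (idx, hits.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

Skipping unknown words matters later, when the same conversion is applied to new documents whose words were never seen during training.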
  • 22. Step 3: Learn Topics
  • 23. Learn Topics
      val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)

      // Map term indices back to words and pair them with their weights
      val topics = ldaModel.describeTopics(10).map { case (terms, weights) =>
        terms.map(vocabulary(_)).zip(weights)
      }

      topics.zipWithIndex.foreach { case (topic, i) =>
        println(s"TOPIC $i")
        topic.foreach { case (term, weight) => println(s"$term\t$weight") }
        println(s"==========")
      }
  • 25. Topics?
      Topic 0: です 0.10870545899718176, 。 0.09623411796419644, さん 0.06105040403724023
      Topic 1: の 0.11035671185240525, を 0.07860862808644907, する 0.05605566895190625
      Topic 2: お願い 0.07579177409154919, ご 0.04431117457391179, よろしく 0.032788330612439916
      (The resulting topics consisted of particles and auxiliary words.)
  • 27. Filter Stopwords
      // Treat the 50 most frequent words as stopwords
      val popular = words
        .map(w => (w, 1))
        .reduceByKey(_ + _)
        .sortBy(-_._2)
        .take(50)
        .map(_._1)
        .toSet

      val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
      // collectAsMap returns a scala.collection.Map, not an immutable Map
      val vocab: scala.collection.Map[String, Long] = vocabIndicies.collectAsMap()
      val vocabulary = vocabIndicies.collect().map(_._1)
      (Removing stopwords.)
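The same frequency-based filter can be tried on a small in-memory word list before running it on an RDD. This is a sketch under assumptions: `StopwordSketch` and its method names are illustrative, and ties are broken alphabetically only so the result is deterministic.

```scala
object StopwordSketch {
  // The n most frequent words, treated as stopwords (ties broken alphabetically).
  def mostFrequent(words: Seq[String], n: Int): Set[String] =
    words.groupBy(identity)
      .map { case (w, hits) => (w, hits.size) }
      .toSeq
      .sortBy { case (w, count) => (-count, w) }
      .take(n)
      .map(_._1)
      .toSet

  // Index the remaining words in order of first appearance.
  def buildVocab(words: Seq[String], stopwords: Set[String]): Map[String, Int] =
    words.distinct.filterNot(w => stopwords.contains(w)).zipWithIndex.toMap
}
```

A fixed cutoff such as the top 50 words is a heuristic; a corpus with a different size or domain may need a different cutoff or an explicit stopword list.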
  • 28. Topics!
      Topic 0: 変更 0.032952997236706624, サーバー 0.03140777729144046, 設定 0.021643554361727567, エラー 0.017955380768330902
      Topic 1: ログ 0.028665774057609564, 時間 0.026686704628121154, 時 0.02404938565591628, 発生 0.020797622509804107
      Topic 2: 様 0.0474658820402456, 株式会社 0.026174292703953685, お世話 0.021939329774535308
  • 29. Using the LDA model • Prediction requires a LocalLDAModel • Use .toLocal if isInstanceOf[DistributedLDAModel] • Convert to Vector using same steps • Be sure to filter out words not in the vocabulary • Call topicDistributions to see topic scores (The LDA model is used to predict the topics of documents.)
  • 30. Topics Prediction
      New document topics: 0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643
      New document topics: 0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193
      (Topic prediction: the values are the scores for Topic 0, Topic 1, Topic 2, …)
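  • 31. Now what? • Find the minimum logLikelihood in a set of documents you know are OK • Report anomaly whenever a new document has a lower logLikelihood (Compute the minimum log-likelihood over a set of documents known to be OK, and classify a new document as an anomaly when its value falls below that minimum.)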
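The thresholding rule itself is independent of Spark and can be sketched with plain numbers. `AnomalySketch` and its members are hypothetical names, and any scores passed to it are assumed to come from something like the model's `logLikelihood` call.

```scala
object AnomalySketch {
  // Threshold = the lowest log-likelihood seen among documents known to be OK.
  def threshold(knownGoodScores: Seq[Double]): Double = {
    require(knownGoodScores.nonEmpty, "need at least one known-good score")
    knownGoodScores.min
  }

  // A new document is flagged when it scores below that threshold.
  def isAnomaly(score: Double, threshold: Double): Boolean =
    score < threshold
}
```

Because a log-likelihood sums over the tokens of a document, longer documents tend to score lower, so comparing documents of roughly similar length is likely to give more meaningful results.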
  • 32. Anomaly Detection
      val newDoc = sc.parallelize(Seq("平素は当社サービスをご利用いただき、誠にありがとうございます。"))

      def stringToCountVector(strings: RDD[String]) = { . . . }

      val score = lda.logLikelihood(stringToCountVector(newDoc))
      /* -2153492.694125671 */
  • 33. Word2Vec • Creates vectors that represent points in meaning space • Unsupervised but requires a lot of data to generate good vectors • Google’s sample vectors were trained on 100 billion words (~X00GB?) • Vectors trained with less data can provide interesting similarities but can’t do so consistently (Word2Vec turns words into vectors, which represents them quantitatively and makes it possible to compute the similarity between words.)
  • 34. Word2Vec Intuition • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013. (An actual example of word vectorization.)
  • 36. Step 1: Make vectors
  • 37. Making Word2VecModel
      val documentWords: RDD[Seq[String]] =
        text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)

      documentWords.cache()

      val model = new Word2Vec().setVectorSize(300).fit(documentWords)
  • 38. Step 2: Use vectors
  • 40. Recommendation • Paragraph Vectors • Not available in Spark T_T (Recommendation based on vectorizing whole documents is not possible in Spark.)
  • 41. Embedding with Vector Concatenation • Calculate the sum of the word vectors in each item description • Add it to vectors from Word2VecModel.getVectors with special keyword (Ex. ITEM_1234) • Create new Word2VecModel using constructor • Note: Not state of the art but can produce reasonable recommendations without user rating data (Embedding by vector concatenation: for each "item", sum the vectors of the words it contains.)
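The idea of summing word vectors per item and then ranking items by similarity can be sketched with plain arrays. Everything named here (`EmbeddingSketch`, `sumVectors`, `cosine`, `recommend`) is hypothetical. The sketch also makes two choices that differ from the slides: it skips unknown words instead of substituting a fallback word, and it ranks by cosine similarity computed directly rather than through `Word2VecModel.findSynonyms`.

```scala
object EmbeddingSketch {
  type Vec = Array[Double]

  // Element-wise sum of the vectors of all known words; unknown words are skipped.
  def sumVectors(words: Seq[String], wordVectors: Map[String, Vec], dim: Int): Vec =
    words.flatMap(w => wordVectors.get(w))
      .foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
        acc.zip(v).map { case (a, b) => a + b }
      }

  // Cosine similarity; defined as 0.0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Rank item keys by cosine similarity to the query vector, best first.
  def recommend(query: Vec, items: Map[String, Vec], n: Int): Seq[String] =
    items.toSeq
      .map { case (key, v) => (key, cosine(query, v)) }
      .sortBy { case (_, sim) => -sim }
      .take(n)
      .map(_._1)
}
```

With word vectors taken from a trained model, each item description would be tokenized, passed through `sumVectors`, and stored under its item key; a new sentence is embedded the same way and passed to `recommend`.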
  • 42. Item Embedding (1/2)
      val embeds = Map(
        "ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
        "ITEM_001_02" -> "組織的な営業力・売れる仕組みを構築します",
        "ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
        "ITEM_002_01" -> "一般的なサーバ、ネットワーク機器やOSレベルの監視に加え",
        "ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
        "ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
        "ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
        "ITEM_003_02" -> "導入にとどまらず、アプリケーションやOAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
        "ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
      )
  • 43. Item Embedding (2/2)
      def stringToVector(s: String): Array[Double] = {
        val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
        // Fall back to the vector for "は" when a word is not in the model
        val vectors = words.map(word => Try(model.transform(word)).getOrElse(model.transform("は")))
        val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
        val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
        concat.toArray
      }

      val embedVectors: Map[String, Array[Float]] = embeds.map {
        case (key, value) => (key, stringToVector(value).map(_.toFloat))
      }

      val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
  • 45. Recommending New
      val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
      embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)
      /*
      (ITEM_001_02,14.372981084681571)
      (ITEM_003_03,14.343473534848325)
      (ITEM_001_01,13.83593570884867)
      (ITEM_002_01,13.61507040314043)
      (ITEM_002_03,13.462141195072414)
      */
      (Recommendation from a new sample.)
  • 46. Thank you • Questions? • Example source code at: • https://github.com/wmeddie/spark-text