Latent Semantic Analysis of Wikipedia with Spark

•

15 gostaram•9,463 visualizações

These are the slides from my talk for Text by the Bay in April 2015.

1© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with Spark
Sandy Ryza | Senior Data Scientist

2© Cloudera, Inc. All rights reserved.
Me
• Data scientist at Cloudera
• Recently lead Cloudera’s Apache Spark
development
• Author of Advanced Analytics with Spark

3© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with Spark
Sandy Ryza | Senior Data Scientist

4© Cloudera, Inc. All rights reserved.
Latent Semantic Analysis
• Fancy name for applying a matrix decomposition (SVD) to text data

5© Cloudera, Inc. All rights reserved.

6© Cloudera, Inc. All rights reserved.

7© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results

8© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results

9© Cloudera, Inc. All rights reserved.
Wikipedia Content Data Set
• http://dumps.wikimedia.org/enwiki/latest/
• XML-formatted
• 46 GB uncompressed

10© Cloudera, Inc. All rights reserved.
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>584215651</id>
<parentid>584213644</parentid>
<timestamp>2013-12-02T15:14:01Z</timestamp>
<contributor>
<username>AnomieBOT</username>
<id>7611264</id>
</contributor>
<comment>Rescuing orphaned refs ("autogenerated1" from rev
584155010; "bbc" from rev 584155010)</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|
Anarchist (comics)}}
{{Redirect|Anarchists}}
{{pp-move-indef}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] that advocates [[stateless society|
stateless societies]] often defined as [[self-governance|self-governed]] voluntary
institutions,<ref>"ANARCHISM, a social philosophy that rejects
authoritarian government and maintains that voluntary institutions are best suited
to express man's natural social tendencies." George Woodcock.
"Anarchism" at The Encyclopedia of Philosophy</ref><ref>
"In a society developed on these lines, the voluntary associations which
already now begin to cover all the fields of human activity would take a still
greater extension so as to substitute
...

11© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results

12© Cloudera, Inc. All rights reserved.
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "hdfs:///user/ds/wikidump.xml"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path,
classOf[XmlInputFormat],
classOf[LongWritable],
classOf[Text],
conf)
val rawXmls = kvs.map(p => p._2.toString)

13© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results

14© Cloudera, Inc. All rights reserved.
Lemmatization
“the boy’s cars are different colors”
“the boy car be different color”

15© Cloudera, Inc. All rights reserved.
CoreNLP
def createNLPPipeline(): StanfordCoreNLP = {
val props = new Properties()
props.put("annotators", "tokenize, ssplit, pos, lemma")
new StanfordCoreNLP(props)
}

16© Cloudera, Inc. All rights reserved.
Stop Words
“the boy car be different color”
“boy car different color”

17© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results

18© Cloudera, Inc. All rights reserved.
Tail Monkey Algorithm Scala
Document 1 1.5 1.8
Document 2 2.0 4.3
Document 3 1.4 6.7
Document 4 1.6
Document 5 1.2
Term-Document Matrix

19© Cloudera, Inc. All rights reserved.
tf-idf
• (Term Frequency) * (Inverse Document Frequency)
• tf(document, word) = # times word appears in document
• idf(word) = 1 / (# documents that contain word)

20© Cloudera, Inc. All rights reserved.
val rowVectors: RDD[Vector] = ...

21© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results

22© Cloudera, Inc. All rights reserved.
Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• U is m x n
• S is n x n
• V is n x n

23© Cloudera, Inc. All rights reserved.
Low Rank Approximation
• Account for synonymy by condensing related terms.
• Account for polysemy by placing less weight on terms that have multiple meanings.
• Throw out noise.
SVD can find the rank-k approximation that has the lowest Frobenius distance from the
original matrix.

24© Cloudera, Inc. All rights reserved.
Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• k = # concepts
• U is m x n
• S is k x k
• V is k x n

25© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V

26© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V

27© Cloudera, Inc. All rights reserved.
rowVectors.cache()
val mat = new RowMatrix(rowVectors)
val k = 1000
val svd = mat.computeSVD(k, computeU=true)

28© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results

29© Cloudera, Inc. All rights reserved.
What are the top “concepts”?
I.e. what dimensions in term-space and document-space explain most of the variance of
the data?

30© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V

31© Cloudera, Inc. All rights reserved.
U S V

32© Cloudera, Inc. All rights reserved.
U S V

$33© Cloudera, Inc. All rights reserved. def topTermsInConcept(concept: Int, numTerms: Int) : Seq[(String, Double)] = { val v = svd.V.toBreezeMatrix val termWeights = v(::, k).toArray.zipWithIndex val sorted = termWeights.sortBy(-_._1) sorted.take(numTerms) }$

$34© Cloudera, Inc. All rights reserved. def topDocsInConcept(concept: Int, numDocs: Int) : Seq[Seq[(String, Double)]] = { val u = svd.U val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId() docWeights.top(numDocs) }$

35© Cloudera, Inc. All rights reserved.
Concept 1
Terms: department, commune, communes, insee, france, see, also,
southwestern, oise, marne, moselle, manche, eure, aisne, isère
Docs: Communes in France, Saint-Mard, Meurthe-et-Moselle,
Saint-Firmin, Meurthe-et-Moselle, Saint-Clément, Meurthe-et-Moselle,
Saint-Sardos, Lot-et-Garonne, Saint-Urcisse, Lot-et-Garonne, Saint-Sernin,
Lot-et-Garonne, Saint-Robert, Lot-et-Garonne, Saint-Léon, Lot-et-Garonne,
Saint-Astier, Lot-et-Garonne

36© Cloudera, Inc. All rights reserved.
Concept 2
Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum,
snail, database, natural, find, geometridae, reference, museum, noctuidae
Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini,
Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus),
Alpina (moth), Arycanda (moth)

37© Cloudera, Inc. All rights reserved.
Querying
• Given a set of terms, find the closest documents in the latent space

38© Cloudera, Inc. All rights reserved.
Reconstructed Matrix
(U * S * V)
Doc
Term

$39© Cloudera, Inc. All rights reserved. def topTermsForTerm( normalizedVS : BDenseMatrix[Double], termId: Int): Seq[(Double, Int)] = { val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray) val termScores = (normalizedVS * rowVec).toArray.zipWithIndex termScores.sortBy(- _._1).take(10) } val VS = multiplyByDiagonalMatrix(svd.V, svd.s) val normalizedVS = rowsNormalized(VS) topTermsForTerm(normalizedVS, id, termIds)$

40© Cloudera, Inc. All rights reserved.
printRelevantTerms("radiohead")
radiohead 0.9999999999999993
lyrically 0.8837403315233519
catchy 0.8780717902060333
riff 0.861326571452104
lyricsthe 0.8460798060853993
lyric 0.8434937575368959
upbeat 0.8410212279939793
Term Similarity

41© Cloudera, Inc. All rights reserved.
printRelevantTerms("algorithm")
algorithm 1.000000000000002
heuristic 0.8773199836391916
compute 0.8561015487853708
constraint 0.8370707630657652
optimization 0.8331940333186296
complexity 0.823738607119692
algorithmic 0.8227315888559854
Term Similarity

42© Cloudera, Inc. All rights reserved.
(algorithm,1.000000000000002), (heuristic,0.8773199836391916),
(compute,0.8561015487853708), (constraint,0.8370707630657652),
(optimization,0.8331940333186296), (complexity,0.823738607119692),
(algorithmic,0.8227315888559854), (iterative,0.822364922633442),
(recursive,0.8176921180556759), (minimization,0.8160188481409465)

$43© Cloudera, Inc. All rights reserved. def topDocsForTerm( US: RowMatrix, V: Matrix, termId: Int) : Seq[(Double, Long)] = { val rowArr = row(V, termId).toArray val rowVec = Matrices.dense(termRowArr.length, 1, termRowArr) val docScores = US.multiply(termRowVec) val allDocWeights = docScores.rows.map( _.toArray(0)). zipWithUniqueId() allDocWeights.top( 10) }$

44© Cloudera, Inc. All rights reserved.
printRelevantDocs("fir")
Silver tree 0.006292909647173194
See the forest for the trees 0.004785047583508223
Eucalyptus tree 0.004592837783089319
Sequoia tree 0.004497446632469554
Willow tree 0.004429936059594164
Coniferous tree 0.004381572286629475
Tulip Tree 0.004374705020233878
Document Similarity

45© Cloudera, Inc. All rights reserved.
• https://github.com/sryza/aas/tree/master/ch06-lsa
• https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
•
More detail?

46© Cloudera, Inc. All rights reserved.
Thank you
@sandysifting

Recomendados

History of MySQL

History of MySQL

History of MySQLKentAnderson43

Snowflake Overview

Snowflake Overview

Snowflake OverviewSnowflake Computing

Greenplum Database Overview

Greenplum Database Overview

Greenplum Database Overview EMC

Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...

Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...

Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...DataStax Academy

Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...

Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...

Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...IDERA Software

Introduction to Cassandra

Introduction to Cassandra

Introduction to CassandraGokhan Atil

Kafka Connect - debezium

Kafka Connect - debezium

Kafka Connect - debeziumKasun Don

Data models in NoSQL

Data models in NoSQL

Data models in NoSQLDr-Dipali Meher

Recomendados

History of MySQL

History of MySQL

History of MySQLKentAnderson43

Snowflake Overview

Snowflake Overview

Snowflake OverviewSnowflake Computing

Greenplum Database Overview

Greenplum Database Overview

Greenplum Database Overview EMC

Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...

Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...

Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...DataStax Academy

Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...

Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...

Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...IDERA Software

Introduction to Cassandra

Introduction to Cassandra

Introduction to CassandraGokhan Atil

Kafka Connect - debezium

Kafka Connect - debezium

Kafka Connect - debeziumKasun Don

Data models in NoSQL

Data models in NoSQL

Data models in NoSQLDr-Dipali Meher

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...Demetris Trihinas

Deep Dive on Amazon Redshift

Deep Dive on Amazon Redshift

Deep Dive on Amazon RedshiftAmazon Web Services

Mysql security 5.7

Mysql security 5.7

Mysql security 5.7 Mark Swarbrick

Map ReduceSri Prasanna

Snowflake Data Loading.pptx

Snowflake Data Loading.pptx

Snowflake Data Loading.pptxParag860410

Change Data Streaming Patterns for Microservices With Debezium

Change Data Streaming Patterns for Microservices With Debezium

Change Data Streaming Patterns for Microservices With Debezium confluent

MySQL Group Replication

MySQL Group Replication

MySQL Group ReplicationKenny Gryp

Building robust CDC pipeline with Apache Hudi and Debezium

Building robust CDC pipeline with Apache Hudi and Debezium

Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Amazon Web Services

Building and running cloud native cassandra

Building and running cloud native cassandra

Building and running cloud native cassandraVinay Kumar Chella

An Introduction to MongoDB Compass

An Introduction to MongoDB Compass

An Introduction to MongoDB CompassMongoDB

ProxySQL Tutorial - PLAM 2016

ProxySQL Tutorial - PLAM 2016

ProxySQL Tutorial - PLAM 2016Derek Downey

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017Amazon Web Services

Azure Data Factory Data Flow Performance Tuning 101

Azure Data Factory Data Flow Performance Tuning 101

Azure Data Factory Data Flow Performance Tuning 101Mark Kromer

Sql vs NoSQL-Presentation

Sql vs NoSQL-Presentation

Sql vs NoSQL-PresentationShubham Tomar

Amazon EMR Masterclass

Amazon EMR Masterclass

Amazon EMR MasterclassAmazon Web Services

Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day

Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day

Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Web Services Korea

9. Document Oriented Databases

9. Document Oriented Databases

9. Document Oriented DatabasesFabio Fumarola

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache SparkDB Tsai

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro TripathyAuro Tripathy

Mais conteúdo relacionado

Mais procurados

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...Demetris Trihinas

Deep Dive on Amazon Redshift

Deep Dive on Amazon Redshift

Deep Dive on Amazon RedshiftAmazon Web Services

Mysql security 5.7

Mysql security 5.7

Mysql security 5.7 Mark Swarbrick

Map ReduceSri Prasanna

Snowflake Data Loading.pptx

Snowflake Data Loading.pptx

Snowflake Data Loading.pptxParag860410

Change Data Streaming Patterns for Microservices With Debezium

Change Data Streaming Patterns for Microservices With Debezium

Change Data Streaming Patterns for Microservices With Debezium confluent

MySQL Group Replication

MySQL Group Replication

MySQL Group ReplicationKenny Gryp

Building robust CDC pipeline with Apache Hudi and Debezium

Building robust CDC pipeline with Apache Hudi and Debezium

Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Amazon Web Services

Building and running cloud native cassandra

Building and running cloud native cassandra

Building and running cloud native cassandraVinay Kumar Chella

An Introduction to MongoDB Compass

An Introduction to MongoDB Compass

An Introduction to MongoDB CompassMongoDB

ProxySQL Tutorial - PLAM 2016

ProxySQL Tutorial - PLAM 2016

ProxySQL Tutorial - PLAM 2016Derek Downey

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017Amazon Web Services

Azure Data Factory Data Flow Performance Tuning 101

Azure Data Factory Data Flow Performance Tuning 101

Azure Data Factory Data Flow Performance Tuning 101Mark Kromer

Sql vs NoSQL-Presentation

Sql vs NoSQL-Presentation

Sql vs NoSQL-PresentationShubham Tomar

Amazon EMR Masterclass

Amazon EMR Masterclass

Amazon EMR MasterclassAmazon Web Services

Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day

Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day

Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Web Services Korea

9. Document Oriented Databases

9. Document Oriented Databases

9. Document Oriented DatabasesFabio Fumarola

Mais procurados (20)

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...

Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...

Deep Dive on Amazon Redshift

Deep Dive on Amazon Redshift

Deep Dive on Amazon Redshift

Mysql security 5.7

Mysql security 5.7

Mysql security 5.7

Map Reduce

Snowflake Data Loading.pptx

Snowflake Data Loading.pptx

Snowflake Data Loading.pptx

Change Data Streaming Patterns for Microservices With Debezium

Change Data Streaming Patterns for Microservices With Debezium

Change Data Streaming Patterns for Microservices With Debezium

MySQL Group Replication

MySQL Group Replication

MySQL Group Replication

Building robust CDC pipeline with Apache Hudi and Debezium

Building robust CDC pipeline with Apache Hudi and Debezium

Building robust CDC pipeline with Apache Hudi and Debezium

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...

Building and running cloud native cassandra

Building and running cloud native cassandra

Building and running cloud native cassandra

An Introduction to MongoDB Compass

An Introduction to MongoDB Compass

An Introduction to MongoDB Compass

ProxySQL Tutorial - PLAM 2016

ProxySQL Tutorial - PLAM 2016

ProxySQL Tutorial - PLAM 2016

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017

RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017

Azure Data Factory Data Flow Performance Tuning 101

Azure Data Factory Data Flow Performance Tuning 101

Azure Data Factory Data Flow Performance Tuning 101

Sql vs NoSQL-Presentation

Sql vs NoSQL-Presentation

Sql vs NoSQL-Presentation

Amazon EMR Masterclass

Amazon EMR Masterclass

Amazon EMR Masterclass

Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day

Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day

Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day

9. Document Oriented Databases

9. Document Oriented Databases

9. Document Oriented Databases

Destaque

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache SparkDB Tsai

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro TripathyAuro Tripathy

Latent Semantic Indexing and Analysis

Latent Semantic Indexing and Analysis

Latent Semantic Indexing and AnalysisMercy Livingstone

Big data app meetup 2016-06-15

Big data app meetup 2016-06-15

Big data app meetup 2016-06-15Illia Polosukhin

Driving Healthcare Operations with Data Science

Driving Healthcare Operations with Data Science

Driving Healthcare Operations with Data ScienceSandy Ryza

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiSocial Media Camp

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...Davide Chicco

Introduction to Probabilistic Latent Semantic Analysis

Introduction to Probabilistic Latent Semantic Analysis

Introduction to Probabilistic Latent Semantic AnalysisNYC Predictive Analytics

Algorithm ethics

Algorithm ethics

Algorithm ethicsJoerg Blumtritt

Empreendedorismo Agil

Empreendedorismo Agil

Empreendedorismo AgilSaulo Arruda

Ili structuredauthoring

Ili structuredauthoring

Ili structuredauthoringTony Hirst

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!TheFamily

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...Pablo Mendes

LOD2 Webinar Series: DBpedia Spotlight

LOD2 Webinar Series: DBpedia Spotlight

LOD2 Webinar Series: DBpedia SpotlightLOD2 Creating Knowledge out of Interlinked Data

DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight at I-SEMANTICS 2011Pablo Mendes

Google Cloud Machine Learning

Google Cloud Machine Learning

Google Cloud Machine Learning India Quotient

Lec13: Clustering Based Medical Image Segmentation Methods

Lec13: Clustering Based Medical Image Segmentation Methods

Lec13: Clustering Based Medical Image Segmentation MethodsUlaş Bağcı

Context Semantic Analysis: a knowledge-based technique for computing inter-do...

Context Semantic Analysis: a knowledge-based technique for computing inter-do...

Context Semantic Analysis: a knowledge-based technique for computing inter-do...Fabio Benedetti

Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine

Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine

Deep Learning Meetup 7 - Building a Deep Learning-powered Search EngineKoby Karp

GoogLeNet Insights

GoogLeNet Insights

GoogLeNet InsightsAuro Tripathy

Destaque (20)

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache Spark

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro Tripathy

Latent Semantic Indexing and Analysis

Latent Semantic Indexing and Analysis

Latent Semantic Indexing and Analysis

Big data app meetup 2016-06-15

Big data app meetup 2016-06-15

Big data app meetup 2016-06-15

Driving Healthcare Operations with Data Science

Driving Healthcare Operations with Data Science

Driving Healthcare Operations with Data Science

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

Introduction to Probabilistic Latent Semantic Analysis

Introduction to Probabilistic Latent Semantic Analysis

Introduction to Probabilistic Latent Semantic Analysis

Algorithm ethics

Algorithm ethics

Algorithm ethics

Empreendedorismo Agil

Empreendedorismo Agil

Empreendedorismo Agil

Ili structuredauthoring

Ili structuredauthoring

Ili structuredauthoring

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...

LOD2 Webinar Series: DBpedia Spotlight

LOD2 Webinar Series: DBpedia Spotlight

LOD2 Webinar Series: DBpedia Spotlight

DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight at I-SEMANTICS 2011

Google Cloud Machine Learning

Google Cloud Machine Learning

Google Cloud Machine Learning

Lec13: Clustering Based Medical Image Segmentation Methods

Lec13: Clustering Based Medical Image Segmentation Methods

Lec13: Clustering Based Medical Image Segmentation Methods

Context Semantic Analysis: a knowledge-based technique for computing inter-do...

Context Semantic Analysis: a knowledge-based technique for computing inter-do...

Context Semantic Analysis: a knowledge-based technique for computing inter-do...

Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine

Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine

Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine

GoogLeNet Insights

GoogLeNet Insights

GoogLeNet Insights

Semelhante a Latent Semantic Analysis of Wikipedia with Spark

Spark etlImran Rashid

What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016

What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016

What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016StampedeCon

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at ClouderaDataconomy Media

Overview of running R in the Oracle Database

Overview of running R in the Oracle Database

Overview of running R in the Oracle DatabaseBrendan Tierney

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...Hakka Labs

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson

Bids talk 9.18Travis Oliphant

Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day

Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day

Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps dayPlain Concepts

Cassandra + Spark (You’ve got the lighter, let’s start a fire)

Cassandra + Spark (You’ve got the lighter, let’s start a fire)

Cassandra + Spark (You’ve got the lighter, let’s start a fire)Robert Stupp

Python Data Ecosystem: Thoughts on Building for the Future

Python Data Ecosystem: Thoughts on Building for the Future

Python Data Ecosystem: Thoughts on Building for the FutureWes McKinney

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson

Spark zeppelin-cassandra at synchrotron

Spark zeppelin-cassandra at synchrotron

Spark zeppelin-cassandra at synchrotronDuyhai Doan

Real Time Data Processing Using Spark Streaming

Real Time Data Processing Using Spark Streaming

Real Time Data Processing Using Spark StreamingHari Shreedharan

Real Time Data Processing using Spark Streaming | Data Day Texas 2015

Real Time Data Processing using Spark Streaming | Data Day Texas 2015

Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.

Time Series Analysis

Time Series Analysis

Time Series AnalysisQAware GmbH

Time Series Processing with Solr and Spark

Time Series Processing with Solr and Spark

Time Series Processing with Solr and SparkJosef Adersberger

Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...

Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...

Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...Lucidworks

Unlocking Your Hadoop Data with Apache Spark and CDH5

Unlocking Your Hadoop Data with Apache Spark and CDH5

Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur

Why Your Apache Spark Job is Failing

Why Your Apache Spark Job is Failing

Why Your Apache Spark Job is FailingCloudera, Inc.

Why your Spark Job is Failing

Why your Spark Job is Failing

Why your Spark Job is FailingDataWorks Summit

Semelhante a Latent Semantic Analysis of Wikipedia with Spark (20)

Spark etl

What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016

What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016

What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

Overview of running R in the Oracle Database

Overview of running R in the Oracle Database

Overview of running R in the Oracle Database

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

Bids talk 9.18

Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day

Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day

Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day

Cassandra + Spark (You’ve got the lighter, let’s start a fire)

Cassandra + Spark (You’ve got the lighter, let’s start a fire)

Cassandra + Spark (You’ve got the lighter, let’s start a fire)

Python Data Ecosystem: Thoughts on Building for the Future

Python Data Ecosystem: Thoughts on Building for the Future

Python Data Ecosystem: Thoughts on Building for the Future

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

Spark zeppelin-cassandra at synchrotron

Spark zeppelin-cassandra at synchrotron

Spark zeppelin-cassandra at synchrotron

Real Time Data Processing Using Spark Streaming

Real Time Data Processing Using Spark Streaming

Real Time Data Processing Using Spark Streaming

Real Time Data Processing using Spark Streaming | Data Day Texas 2015

Real Time Data Processing using Spark Streaming | Data Day Texas 2015

Real Time Data Processing using Spark Streaming | Data Day Texas 2015

Time Series Analysis

Time Series Analysis

Time Series Analysis

Time Series Processing with Solr and Spark

Time Series Processing with Solr and Spark

Time Series Processing with Solr and Spark

Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...

Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...

Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...

Unlocking Your Hadoop Data with Apache Spark and CDH5

Unlocking Your Hadoop Data with Apache Spark and CDH5

Unlocking Your Hadoop Data with Apache Spark and CDH5

Why Your Apache Spark Job is Failing

Why Your Apache Spark Job is Failing

Why Your Apache Spark Job is Failing

Why your Spark Job is Failing

Why your Spark Job is Failing

Why your Spark Job is Failing

Último

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams

Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivityPrincipled Technologies

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?Igalia

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software

A Domino Admins Adventures (Engage 2024)

A Domino Admins Adventures (Engage 2024)

A Domino Admins Adventures (Engage 2024)Gabriella Davis

Powerful Google developer tools for immediate impact! (2023-24 C)

Powerful Google developer tools for immediate impact! (2023-24 C)

Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung

Axa Assurance Maroc - Insurer Innovation Award 2024

Axa Assurance Maroc - Insurer Innovation Award 2024

Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer

Manulife - Insurer Innovation Award 2024

Manulife - Insurer Innovation Award 2024

Manulife - Insurer Innovation Award 2024The Digital Insurer

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer

Apidays New York 2024 - The value of a flexible API Management solution for O...

Apidays New York 2024 - The value of a flexible API Management solution for O...

Apidays New York 2024 - The value of a flexible API Management solution for O...apidays

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun

The 7 Things I Know About Cyber Security After 25 Years | April 2024

The 7 Things I Know About Cyber Security After 25 Years | April 2024

The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los

Scaling API-first – The story of a global engineering organization

Scaling API-first – The story of a global engineering organization

Scaling API-first – The story of a global engineering organizationRadu Cotescu

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1

Último (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivity

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

A Domino Admins Adventures (Engage 2024)

A Domino Admins Adventures (Engage 2024)

A Domino Admins Adventures (Engage 2024)

Powerful Google developer tools for immediate impact! (2023-24 C)

Powerful Google developer tools for immediate impact! (2023-24 C)

Powerful Google developer tools for immediate impact! (2023-24 C)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Axa Assurance Maroc - Insurer Innovation Award 2024

Axa Assurance Maroc - Insurer Innovation Award 2024

Axa Assurance Maroc - Insurer Innovation Award 2024

Manulife - Insurer Innovation Award 2024

Manulife - Insurer Innovation Award 2024

Manulife - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Apidays New York 2024 - The value of a flexible API Management solution for O...

Apidays New York 2024 - The value of a flexible API Management solution for O...

Apidays New York 2024 - The value of a flexible API Management solution for O...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt Robison

The 7 Things I Know About Cyber Security After 25 Years | April 2024

The 7 Things I Know About Cyber Security After 25 Years | April 2024

The 7 Things I Know About Cyber Security After 25 Years | April 2024

Scaling API-first – The story of a global engineering organization

Scaling API-first – The story of a global engineering organization

Scaling API-first – The story of a global engineering organization

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of Terraform

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

Latent Semantic Analysis of Wikipedia with Spark

1. 1© Cloudera, Inc. All rights reserved. LSA-ing Wikipedia with Spark Sandy Ryza | Senior Data Scientist

2. 2© Cloudera, Inc. All rights reserved. Me • Data scientist at Cloudera • Recently lead Cloudera’s Apache Spark development • Author of Advanced Analytics with Spark

3. 3© Cloudera, Inc. All rights reserved. LSA-ing Wikipedia with Spark Sandy Ryza | Senior Data Scientist

4. 4© Cloudera, Inc. All rights reserved. Latent Semantic Analysis • Fancy name for applying a matrix decomposition (SVD) to text data

5. 5© Cloudera, Inc. All rights reserved.

6. 6© Cloudera, Inc. All rights reserved.

7. 7© Cloudera, Inc. All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results

8. 8© Cloudera, Inc. All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results

9. 9© Cloudera, Inc. All rights reserved. Wikipedia Content Data Set • http://dumps.wikimedia.org/enwiki/latest/ • XML-formatted • 46 GB uncompressed

10. 10© Cloudera, Inc. All rights reserved. <page> <title>Anarchism</title> <ns>0</ns> <id>12</id> <revision> <id>584215651</id> <parentid>584213644</parentid> <timestamp>2013-12-02T15:14:01Z</timestamp> <contributor> <username>AnomieBOT</username> <id>7611264</id> </contributor> <comment>Rescuing orphaned refs ("autogenerated1" from rev 584155010; "bbc" from rev 584155010)</comment> <text xml:space="preserve">{{Redirect|Anarchist|the fictional character| Anarchist (comics)}} {{Redirect|Anarchists}} {{pp-move-indef}} {{Anarchism sidebar}} '''Anarchism''' is a [[political philosophy]] that advocates [[stateless society| stateless societies]] often defined as [[self-governance|self-governed]] voluntary institutions,<ref>"ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies." George Woodcock. "Anarchism" at The Encyclopedia of Philosophy</ref><ref> "In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute ...

11. 11© Cloudera, Inc. All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results

12. 12© Cloudera, Inc. All rights reserved. import org.apache.mahout.text.wikipedia.XmlInputFormat import org.apache.hadoop.conf.Configuration import org.apache.hadoop.io._ val path = "hdfs:///user/ds/wikidump.xml" val conf = new Configuration() conf.set(XmlInputFormat.START_TAG_KEY, "<page>") conf.set(XmlInputFormat.END_TAG_KEY, "</page>") val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf) val rawXmls = kvs.map(p => p._2.toString)

13. 13© Cloudera, Inc. All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results

14. 14© Cloudera, Inc. All rights reserved. Lemmatization “the boy’s cars are different colors” “the boy car be different color”

15. 15© Cloudera, Inc. All rights reserved. CoreNLP def createNLPPipeline(): StanfordCoreNLP = { val props = new Properties() props.put("annotators", "tokenize, ssplit, pos, lemma") new StanfordCoreNLP(props) }

16. 16© Cloudera, Inc. All rights reserved. Stop Words “the boy car be different color” “boy car different color”

17. 17© Cloudera, Inc. All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results

18. 18© Cloudera, Inc. All rights reserved. Tail Monkey Algorithm Scala Document 1 1.5 1.8 Document 2 2.0 4.3 Document 3 1.4 6.7 Document 4 1.6 Document 5 1.2 Term-Document Matrix

19. 19© Cloudera, Inc. All rights reserved. tf-idf • (Term Frequency) * (Inverse Document Frequency) • tf(document, word) = # times word appears in document • idf(word) = 1 / (# documents that contain word)

20. 20© Cloudera, Inc. All rights reserved. val rowVectors: RDD[Vector] = ...

21. 21© Cloudera, Inc. All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results

22. 22© Cloudera, Inc. All rights reserved. Singular Value Decomposition • Factors matrix into the product of three matrices: U, S, and V • m = # documents • n = # terms • U is m x n • S is n x n • V is n x n

23. 23© Cloudera, Inc. All rights reserved. Low Rank Approximation • Account for synonymy by condensing related terms. • Account for polysemy by placing less weight on terms that have multiple meanings. • Throw out noise. SVD can find the rank-k approximation that has the lowest Frobenius distance from the original matrix.

24. 24© Cloudera, Inc. All rights reserved. Singular Value Decomposition • Factors matrix into the product of three matrices: U, S, and V • m = # documents • n = # terms • k = # concepts • U is m x n • S is k x k • V is k x n

25. 25© Cloudera, Inc. All rights reserved. Docs: Terms:U S V

26. 26© Cloudera, Inc. All rights reserved. Docs: Terms:U S V

27. 27© Cloudera, Inc. All rights reserved. rowVectors.cache() val mat = new RowMatrix(rowVectors) val k = 1000 val svd = mat.computeSVD(k, computeU=true)

28. 28© Cloudera, Inc. All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results

29. 29© Cloudera, Inc. All rights reserved. What are the top “concepts”? I.e. what dimensions in term-space and document-space explain most of the variance of the data?

30. 30© Cloudera, Inc. All rights reserved. Docs: Terms:U S V

31. 31© Cloudera, Inc. All rights reserved. U S V

32. 32© Cloudera, Inc. All rights reserved. U S V

33. 33© Cloudera, Inc. All rights reserved. def topTermsInConcept(concept: Int, numTerms: Int) : Seq[(String, Double)] = { val v = svd.V.toBreezeMatrix val termWeights = v(::, k).toArray.zipWithIndex val sorted = termWeights.sortBy(-_._1) sorted.take(numTerms) }

34. 34© Cloudera, Inc. All rights reserved. def topDocsInConcept(concept: Int, numDocs: Int) : Seq[Seq[(String, Double)]] = { val u = svd.U val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId() docWeights.top(numDocs) }

35. 35© Cloudera, Inc. All rights reserved. Concept 1 Terms: department, commune, communes, insee, france, see, also, southwestern, oise, marne, moselle, manche, eure, aisne, isère Docs: Communes in France, Saint-Mard, Meurthe-et-Moselle, Saint-Firmin, Meurthe-et-Moselle, Saint-Clément, Meurthe-et-Moselle, Saint-Sardos, Lot-et-Garonne, Saint-Urcisse, Lot-et-Garonne, Saint-Sernin, Lot-et-Garonne, Saint-Robert, Lot-et-Garonne, Saint-Léon, Lot-et-Garonne, Saint-Astier, Lot-et-Garonne

36. 36© Cloudera, Inc. All rights reserved. Concept 2 Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum, snail, database, natural, find, geometridae, reference, museum, noctuidae Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini, Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus), Alpina (moth), Arycanda (moth)

37. 37© Cloudera, Inc. All rights reserved. Querying • Given a set of terms, find the closest documents in the latent space

38. 38© Cloudera, Inc. All rights reserved. Reconstructed Matrix (U * S * V) Doc Term

39. 39© Cloudera, Inc. All rights reserved. def topTermsForTerm( normalizedVS : BDenseMatrix[Double], termId: Int): Seq[(Double, Int)] = { val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray) val termScores = (normalizedVS * rowVec).toArray.zipWithIndex termScores.sortBy(- _._1).take(10) } val VS = multiplyByDiagonalMatrix(svd.V, svd.s) val normalizedVS = rowsNormalized(VS) topTermsForTerm(normalizedVS, id, termIds)

40. 40© Cloudera, Inc. All rights reserved. printRelevantTerms("radiohead") radiohead 0.9999999999999993 lyrically 0.8837403315233519 catchy 0.8780717902060333 riff 0.861326571452104 lyricsthe 0.8460798060853993 lyric 0.8434937575368959 upbeat 0.8410212279939793 Term Similarity

41. 41© Cloudera, Inc. All rights reserved. printRelevantTerms("algorithm") algorithm 1.000000000000002 heuristic 0.8773199836391916 compute 0.8561015487853708 constraint 0.8370707630657652 optimization 0.8331940333186296 complexity 0.823738607119692 algorithmic 0.8227315888559854 Term Similarity

42. 42© Cloudera, Inc. All rights reserved. (algorithm,1.000000000000002), (heuristic,0.8773199836391916), (compute,0.8561015487853708), (constraint,0.8370707630657652), (optimization,0.8331940333186296), (complexity,0.823738607119692), (algorithmic,0.8227315888559854), (iterative,0.822364922633442), (recursive,0.8176921180556759), (minimization,0.8160188481409465)

43. 43© Cloudera, Inc. All rights reserved. def topDocsForTerm( US: RowMatrix, V: Matrix, termId: Int) : Seq[(Double, Long)] = { val rowArr = row(V, termId).toArray val rowVec = Matrices.dense(termRowArr.length, 1, termRowArr) val docScores = US.multiply(termRowVec) val allDocWeights = docScores.rows.map( _.toArray(0)). zipWithUniqueId() allDocWeights.top( 10) }

44. 44© Cloudera, Inc. All rights reserved. printRelevantDocs("fir") Silver tree 0.006292909647173194 See the forest for the trees 0.004785047583508223 Eucalyptus tree 0.004592837783089319 Sequoia tree 0.004497446632469554 Willow tree 0.004429936059594164 Coniferous tree 0.004381572286629475 Tulip Tree 0.004374705020233878 Document Similarity

45. 45© Cloudera, Inc. All rights reserved. • https://github.com/sryza/aas/tree/master/ch06-lsa • https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html • More detail?

46. 46© Cloudera, Inc. All rights reserved. Thank you @sandysifting