Enviar pesquisa
Carregar
Latent Semantic Analysis of Wikipedia with Spark
•
15 gostaram
•
9,463 visualizações
Sandy Ryza
Seguir
These are the slides from my talk for Text by the Bay in April 2015.
Leia menos
Leia mais
Tecnologia
Denunciar
Compartilhar
Denunciar
Compartilhar
1 de 46
Baixar agora
Baixar para ler offline
Recomendados
History of MySQL
History of MySQL
KentAnderson43
Snowflake Overview
Snowflake Overview
Snowflake Computing
Greenplum Database Overview
Greenplum Database Overview
EMC
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
DataStax Academy
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...
IDERA Software
Introduction to Cassandra
Introduction to Cassandra
Gokhan Atil
Kafka Connect - debezium
Kafka Connect - debezium
Kasun Don
Data models in NoSQL
Data models in NoSQL
Dr-Dipali Meher
Recomendados
History of MySQL
History of MySQL
KentAnderson43
Snowflake Overview
Snowflake Overview
Snowflake Computing
Greenplum Database Overview
Greenplum Database Overview
EMC
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
DataStax Academy
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...
IDERA Software
Introduction to Cassandra
Introduction to Cassandra
Gokhan Atil
Kafka Connect - debezium
Kafka Connect - debezium
Kasun Don
Data models in NoSQL
Data models in NoSQL
Dr-Dipali Meher
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Demetris Trihinas
Deep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
Amazon Web Services
Mysql security 5.7
Mysql security 5.7
Mark Swarbrick
Map Reduce
Map Reduce
Sri Prasanna
Snowflake Data Loading.pptx
Snowflake Data Loading.pptx
Parag860410
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
confluent
MySQL Group Replication
MySQL Group Replication
Kenny Gryp
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Amazon Web Services
Building and running cloud native cassandra
Building and running cloud native cassandra
Vinay Kumar Chella
An Introduction to MongoDB Compass
An Introduction to MongoDB Compass
MongoDB
ProxySQL Tutorial - PLAM 2016
ProxySQL Tutorial - PLAM 2016
Derek Downey
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
Amazon Web Services
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
Sql vs NoSQL-Presentation
Sql vs NoSQL-Presentation
Shubham Tomar
Amazon EMR Masterclass
Amazon EMR Masterclass
Amazon Web Services
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Web Services Korea
9. Document Oriented Databases
9. Document Oriented Databases
Fabio Fumarola
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
DB Tsai
Latent Semanctic Analysis Auro Tripathy
Latent Semanctic Analysis Auro Tripathy
Auro Tripathy
Mais conteúdo relacionado
Mais procurados
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Demetris Trihinas
Deep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
Amazon Web Services
Mysql security 5.7
Mysql security 5.7
Mark Swarbrick
Map Reduce
Map Reduce
Sri Prasanna
Snowflake Data Loading.pptx
Snowflake Data Loading.pptx
Parag860410
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
confluent
MySQL Group Replication
MySQL Group Replication
Kenny Gryp
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Amazon Web Services
Building and running cloud native cassandra
Building and running cloud native cassandra
Vinay Kumar Chella
An Introduction to MongoDB Compass
An Introduction to MongoDB Compass
MongoDB
ProxySQL Tutorial - PLAM 2016
ProxySQL Tutorial - PLAM 2016
Derek Downey
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
Amazon Web Services
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
Sql vs NoSQL-Presentation
Sql vs NoSQL-Presentation
Shubham Tomar
Amazon EMR Masterclass
Amazon EMR Masterclass
Amazon Web Services
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Web Services Korea
9. Document Oriented Databases
9. Document Oriented Databases
Fabio Fumarola
Mais procurados
(20)
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Designing Scalable and Secure Microservices by Embracing DevOps-as-a-Service ...
Deep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
Mysql security 5.7
Mysql security 5.7
Map Reduce
Map Reduce
Snowflake Data Loading.pptx
Snowflake Data Loading.pptx
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
MySQL Group Replication
MySQL Group Replication
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Building and running cloud native cassandra
Building and running cloud native cassandra
An Introduction to MongoDB Compass
An Introduction to MongoDB Compass
ProxySQL Tutorial - PLAM 2016
ProxySQL Tutorial - PLAM 2016
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Sql vs NoSQL-Presentation
Sql vs NoSQL-Presentation
Amazon EMR Masterclass
Amazon EMR Masterclass
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
9. Document Oriented Databases
9. Document Oriented Databases
Destaque
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
DB Tsai
Latent Semanctic Analysis Auro Tripathy
Latent Semanctic Analysis Auro Tripathy
Auro Tripathy
Latent Semantic Indexing and Analysis
Latent Semantic Indexing and Analysis
Mercy Livingstone
Big data app meetup 2016-06-15
Big data app meetup 2016-06-15
Illia Polosukhin
Driving Healthcare Operations with Data Science
Driving Healthcare Operations with Data Science
Sandy Ryza
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
Social Media Camp
"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...
"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...
Davide Chicco
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
NYC Predictive Analytics
Algorithm ethics
Algorithm ethics
Joerg Blumtritt
Empreendedorismo Agil
Empreendedorismo Agil
Saulo Arruda
Ili structuredauthoring
Ili structuredauthoring
Tony Hirst
Diving into Machine Learning with Rob Craft, Group Product Manager at Google!
Diving into Machine Learning with Rob Craft, Group Product Manager at Google!
TheFamily
A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...
A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...
Pablo Mendes
LOD2 Webinar Series: DBpedia Spotlight
LOD2 Webinar Series: DBpedia Spotlight
LOD2 Creating Knowledge out of Interlinked Data
DBpedia Spotlight at I-SEMANTICS 2011
DBpedia Spotlight at I-SEMANTICS 2011
Pablo Mendes
Google Cloud Machine Learning
Google Cloud Machine Learning
India Quotient
Lec13: Clustering Based Medical Image Segmentation Methods
Lec13: Clustering Based Medical Image Segmentation Methods
Ulaş Bağcı
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Fabio Benedetti
Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
Koby Karp
GoogLeNet Insights
GoogLeNet Insights
Auro Tripathy
Destaque
(20)
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
Latent Semanctic Analysis Auro Tripathy
Latent Semanctic Analysis Auro Tripathy
Latent Semantic Indexing and Analysis
Latent Semantic Indexing and Analysis
Big data app meetup 2016-06-15
Big data app meetup 2016-06-15
Driving Healthcare Operations with Data Science
Driving Healthcare Operations with Data Science
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...
"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
Algorithm ethics
Algorithm ethics
Empreendedorismo Agil
Empreendedorismo Agil
Ili structuredauthoring
Ili structuredauthoring
Diving into Machine Learning with Rob Craft, Group Product Manager at Google!
Diving into Machine Learning with Rob Craft, Group Product Manager at Google!
A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...
A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...
LOD2 Webinar Series: DBpedia Spotlight
LOD2 Webinar Series: DBpedia Spotlight
DBpedia Spotlight at I-SEMANTICS 2011
DBpedia Spotlight at I-SEMANTICS 2011
Google Cloud Machine Learning
Google Cloud Machine Learning
Lec13: Clustering Based Medical Image Segmentation Methods
Lec13: Clustering Based Medical Image Segmentation Methods
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
GoogLeNet Insights
GoogLeNet Insights
Semelhante a Latent Semantic Analysis of Wikipedia with Spark
Spark etl
Spark etl
Imran Rashid
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
Dataconomy Media
Overview of running R in the Oracle Database
Overview of running R in the Oracle Database
Brendan Tierney
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
Hakka Labs
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson
Bids talk 9.18
Bids talk 9.18
Travis Oliphant
Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day
Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day
Plain Concepts
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Robert Stupp
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
Spark zeppelin-cassandra at synchrotron
Spark zeppelin-cassandra at synchrotron
Duyhai Doan
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Cloudera, Inc.
Time Series Analysis
Time Series Analysis
QAware GmbH
Time Series Processing with Solr and Spark
Time Series Processing with Solr and Spark
Josef Adersberger
Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...
Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...
Lucidworks
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
SAP Concur
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
Cloudera, Inc.
Why your Spark Job is Failing
Why your Spark Job is Failing
DataWorks Summit
Semelhante a Latent Semantic Analysis of Wikipedia with Spark
(20)
Spark etl
Spark etl
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
Overview of running R in the Oracle Database
Overview of running R in the Oracle Database
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Bids talk 9.18
Bids talk 9.18
Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day
Aprovisionamiento multi-proveedor con Terraform - Plain Concepts DevOps day
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Spark zeppelin-cassandra at synchrotron
Spark zeppelin-cassandra at synchrotron
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Time Series Analysis
Time Series Analysis
Time Series Processing with Solr and Spark
Time Series Processing with Solr and Spark
Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...
Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
Why your Spark Job is Failing
Why your Spark Job is Failing
Último
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Edi Saputra
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Roshan Dwivedi
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Principled Technologies
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Igalia
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Gabriella Davis
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
wesley chun
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Drew Madelung
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
The Digital Insurer
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
The Digital Insurer
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
The Digital Insurer
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
apidays
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Miguel Araújo
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
Anna Loughnan Colquhoun
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
Rafal Los
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
Radu Cotescu
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Andrey Devyatkin
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
apidays
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
sudhanshuwaghmare1
Último
(20)
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
Latent Semantic Analysis of Wikipedia with Spark
1.
1© Cloudera, Inc.
All rights reserved. LSA-ing Wikipedia with Spark Sandy Ryza | Senior Data Scientist
2.
2© Cloudera, Inc.
All rights reserved. Me • Data scientist at Cloudera • Recently lead Cloudera’s Apache Spark development • Author of Advanced Analytics with Spark
3.
3© Cloudera, Inc.
All rights reserved. LSA-ing Wikipedia with Spark Sandy Ryza | Senior Data Scientist
4.
4© Cloudera, Inc.
All rights reserved. Latent Semantic Analysis • Fancy name for applying a matrix decomposition (SVD) to text data
5.
5© Cloudera, Inc.
All rights reserved.
6.
6© Cloudera, Inc.
All rights reserved.
7.
7© Cloudera, Inc.
All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
8.
8© Cloudera, Inc.
All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
9.
9© Cloudera, Inc.
All rights reserved. Wikipedia Content Data Set • http://dumps.wikimedia.org/enwiki/latest/ • XML-formatted • 46 GB uncompressed
10.
10© Cloudera, Inc.
All rights reserved. <page> <title>Anarchism</title> <ns>0</ns> <id>12</id> <revision> <id>584215651</id> <parentid>584213644</parentid> <timestamp>2013-12-02T15:14:01Z</timestamp> <contributor> <username>AnomieBOT</username> <id>7611264</id> </contributor> <comment>Rescuing orphaned refs ("autogenerated1" from rev 584155010; "bbc" from rev 584155010)</comment> <text xml:space="preserve">{{Redirect|Anarchist|the fictional character| Anarchist (comics)}} {{Redirect|Anarchists}} {{pp-move-indef}} {{Anarchism sidebar}} '''Anarchism''' is a [[political philosophy]] that advocates [[stateless society| stateless societies]] often defined as [[self-governance|self-governed]] voluntary institutions,<ref>"ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies." George Woodcock. "Anarchism" at The Encyclopedia of Philosophy</ref><ref> "In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute ...
11.
11© Cloudera, Inc.
All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
12.
12© Cloudera, Inc.
All rights reserved. import org.apache.mahout.text.wikipedia.XmlInputFormat import org.apache.hadoop.conf.Configuration import org.apache.hadoop.io._ val path = "hdfs:///user/ds/wikidump.xml" val conf = new Configuration() conf.set(XmlInputFormat.START_TAG_KEY, "<page>") conf.set(XmlInputFormat.END_TAG_KEY, "</page>") val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf) val rawXmls = kvs.map(p => p._2.toString)
13.
13© Cloudera, Inc.
All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
14.
14© Cloudera, Inc.
All rights reserved. Lemmatization “the boy’s cars are different colors” “the boy car be different color”
15.
15© Cloudera, Inc.
All rights reserved. CoreNLP def createNLPPipeline(): StanfordCoreNLP = { val props = new Properties() props.put("annotators", "tokenize, ssplit, pos, lemma") new StanfordCoreNLP(props) }
16.
16© Cloudera, Inc.
All rights reserved. Stop Words “the boy car be different color” “boy car different color”
17.
17© Cloudera, Inc.
All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
18.
18© Cloudera, Inc.
All rights reserved. Tail Monkey Algorithm Scala Document 1 1.5 1.8 Document 2 2.0 4.3 Document 3 1.4 6.7 Document 4 1.6 Document 5 1.2 Term-Document Matrix
19.
19© Cloudera, Inc.
All rights reserved. tf-idf • (Term Frequency) * (Inverse Document Frequency) • tf(document, word) = # times word appears in document • idf(word) = 1 / (# documents that contain word)
20.
20© Cloudera, Inc.
All rights reserved. val rowVectors: RDD[Vector] = ...
21.
21© Cloudera, Inc.
All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
22.
22© Cloudera, Inc.
All rights reserved. Singular Value Decomposition • Factors matrix into the product of three matrices: U, S, and V • m = # documents • n = # terms • U is m x n • S is n x n • V is n x n
23.
23© Cloudera, Inc.
All rights reserved. Low Rank Approximation • Account for synonymy by condensing related terms. • Account for polysemy by placing less weight on terms that have multiple meanings. • Throw out noise. SVD can find the rank-k approximation that has the lowest Frobenius distance from the original matrix.
24.
24© Cloudera, Inc.
All rights reserved. Singular Value Decomposition • Factors matrix into the product of three matrices: U, S, and V • m = # documents • n = # terms • k = # concepts • U is m x n • S is k x k • V is k x n
25.
25© Cloudera, Inc.
All rights reserved. Docs: Terms:U S V
26.
26© Cloudera, Inc.
All rights reserved. Docs: Terms:U S V
27.
27© Cloudera, Inc.
All rights reserved. rowVectors.cache() val mat = new RowMatrix(rowVectors) val k = 1000 val svd = mat.computeSVD(k, computeU=true)
28.
28© Cloudera, Inc.
All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
29.
29© Cloudera, Inc.
All rights reserved. What are the top “concepts”? I.e. what dimensions in term-space and document-space explain most of the variance of the data?
30.
30© Cloudera, Inc.
All rights reserved. Docs: Terms:U S V
31.
31© Cloudera, Inc.
All rights reserved. U S V
32.
32© Cloudera, Inc.
All rights reserved. U S V
33.
33© Cloudera, Inc.
All rights reserved. def topTermsInConcept(concept: Int, numTerms: Int) : Seq[(String, Double)] = { val v = svd.V.toBreezeMatrix val termWeights = v(::, k).toArray.zipWithIndex val sorted = termWeights.sortBy(-_._1) sorted.take(numTerms) }
34.
34© Cloudera, Inc.
All rights reserved. def topDocsInConcept(concept: Int, numDocs: Int) : Seq[Seq[(String, Double)]] = { val u = svd.U val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId() docWeights.top(numDocs) }
35.
35© Cloudera, Inc.
All rights reserved. Concept 1 Terms: department, commune, communes, insee, france, see, also, southwestern, oise, marne, moselle, manche, eure, aisne, isère Docs: Communes in France, Saint-Mard, Meurthe-et-Moselle, Saint-Firmin, Meurthe-et-Moselle, Saint-Clément, Meurthe-et-Moselle, Saint-Sardos, Lot-et-Garonne, Saint-Urcisse, Lot-et-Garonne, Saint-Sernin, Lot-et-Garonne, Saint-Robert, Lot-et-Garonne, Saint-Léon, Lot-et-Garonne, Saint-Astier, Lot-et-Garonne
36.
36© Cloudera, Inc.
All rights reserved. Concept 2 Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum, snail, database, natural, find, geometridae, reference, museum, noctuidae Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini, Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus), Alpina (moth), Arycanda (moth)
37.
37© Cloudera, Inc.
All rights reserved. Querying • Given a set of terms, find the closest documents in the latent space
38.
38© Cloudera, Inc.
All rights reserved. Reconstructed Matrix (U * S * V) Doc Term
39.
39© Cloudera, Inc.
All rights reserved. def topTermsForTerm( normalizedVS : BDenseMatrix[Double], termId: Int): Seq[(Double, Int)] = { val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray) val termScores = (normalizedVS * rowVec).toArray.zipWithIndex termScores.sortBy(- _._1).take(10) } val VS = multiplyByDiagonalMatrix(svd.V, svd.s) val normalizedVS = rowsNormalized(VS) topTermsForTerm(normalizedVS, id, termIds)
40.
40© Cloudera, Inc.
All rights reserved. printRelevantTerms("radiohead") radiohead 0.9999999999999993 lyrically 0.8837403315233519 catchy 0.8780717902060333 riff 0.861326571452104 lyricsthe 0.8460798060853993 lyric 0.8434937575368959 upbeat 0.8410212279939793 Term Similarity
41.
41© Cloudera, Inc.
All rights reserved. printRelevantTerms("algorithm") algorithm 1.000000000000002 heuristic 0.8773199836391916 compute 0.8561015487853708 constraint 0.8370707630657652 optimization 0.8331940333186296 complexity 0.823738607119692 algorithmic 0.8227315888559854 Term Similarity
42.
42© Cloudera, Inc.
All rights reserved. (algorithm,1.000000000000002), (heuristic,0.8773199836391916), (compute,0.8561015487853708), (constraint,0.8370707630657652), (optimization,0.8331940333186296), (complexity,0.823738607119692), (algorithmic,0.8227315888559854), (iterative,0.822364922633442), (recursive,0.8176921180556759), (minimization,0.8160188481409465)
43.
43© Cloudera, Inc.
All rights reserved. def topDocsForTerm( US: RowMatrix, V: Matrix, termId: Int) : Seq[(Double, Long)] = { val rowArr = row(V, termId).toArray val rowVec = Matrices.dense(termRowArr.length, 1, termRowArr) val docScores = US.multiply(termRowVec) val allDocWeights = docScores.rows.map( _.toArray(0)). zipWithUniqueId() allDocWeights.top( 10) }
44.
44© Cloudera, Inc.
All rights reserved. printRelevantDocs("fir") Silver tree 0.006292909647173194 See the forest for the trees 0.004785047583508223 Eucalyptus tree 0.004592837783089319 Sequoia tree 0.004497446632469554 Willow tree 0.004429936059594164 Coniferous tree 0.004381572286629475 Tulip Tree 0.004374705020233878 Document Similarity
45.
45© Cloudera, Inc.
All rights reserved. • https://github.com/sryza/aas/tree/master/ch06-lsa • https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html • More detail?
46.
46© Cloudera, Inc.
All rights reserved. Thank you @sandysifting
Baixar agora