SlideShare uma empresa Scribd logo
1 de 17
Baixar para ler offline
Exploring Language Classification
With Apache Spark and the Spark Notebook
A practical introduction to interactive Data Engineering
Gerard Maas
Gerard Maas
Lead Engineer @ Kensu
Computer Engineer
Scala Programmer
Early Spark Adopter
Spark Notebook Dev
Cassandra MVP (2015, 2016)
Stack Overflow Top Contributor
(Spark, Spark Streaming, Scala)
Wannabe IoT Hacker
Arduino Enthusiast
@maasg
https://github.com/maasg
https://www.linkedin.com/
in/gerardmaas/
https://stackoverflow.com
/users/764040/maasg
DATA SCIENCE GOVERNANCE
Adalog helps enterprises to ensure that data pipelines continually deliver
their value by combining the contextual information when the pipeline was
created with the evolving environment where the pipelines execute.
CONNECT - COLLECT - LEARN
Language Classification
Language Classification
Some inspiration...
What’s is a language? How is it composed?
Letter Frequency
Could we characterize a language by calculating the relative frequency of letters in some
text ?
Spanish vs English letter frequency
n-grams
"cavnar and trenkle"
bi-grams: ca,av,vn,na,ar,r_,_a,an,nd,d_,_t,tr,re,en,nk,kl,le,e_
tri-grams: cav,avn,vna,nar,ar_,r_a,_an,and,nd_,d_t,_tr,tre,ren,enk,nkl,kle,le_
quad-grams: cavn,...
http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
Could we characterize a language by calculating the relative frequency of sequence of
letters in some text ?
Tech
Spark APIs
RDD -> Resilient Distributed Datasets
- Lazy, functional-oriented, low level API
- Basis for execution of all high-level libraries
Dataframes
- Column-oriented, SQL-inspired DSL
- Many optimizations under the hood (Catalyst, Tungsten)
Dataset
- Best of both worlds (except …)
Spark Notebook
A dynamic and visual web-based notebook for Spark
with Scala
Spark Notebook - Open Source Roadmap
2017
GIT Kerberos
Project
Generator
Q1 Q2 Q3
Announcements: blog.kensu.io
Notebooks
Notebooks for this presentation are located at:
https://github.com/maasg/spark-notebooks
- have fun!
https://github.com/maasg/spark-notebooks/languageclassification/language-detection-letter-freq.snb
Implements the idea of using a letter frequency model to classify the language in a doc.
Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/
It produces a training set of sampled strings that will be used also for the n-gram classifier
(Note: this notebook is missing a function that’s left as an exercise to the reader. The folder
/solutions contains the full working version.)
Notebook 1 : Naive Language Classification
Notebook 2 : n-gram Language Classification
https://github.com/maasg/spark-notebooks/languageclassification/n-gram-language-classification.snb
Implements the n-gram algorithm described in the paper.
Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/
Uses the resulting classifier to implement a custom Spark ML Transformer that can be easily used to classify
new texts. Transformers can be combined into Spark ML Pipelines of arbitrary complexity.
(Note: this notebook is missing a function that’s left as an exercise to the reader. The folder
/solutions contains the full working version.)

Mais conteúdo relacionado

Destaque

Destaque (20)

Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Apache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream AnalyticsApache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream Analytics
 
Spark Summit 2015 Highlights in Tweets
Spark Summit 2015 Highlights in TweetsSpark Summit 2015 Highlights in Tweets
Spark Summit 2015 Highlights in Tweets
 
Data Analytics with Apache Spark and Cassandra
Data Analytics with Apache Spark and CassandraData Analytics with Apache Spark and Cassandra
Data Analytics with Apache Spark and Cassandra
 
Международная и российская практика проектного управления
Международная и российская практика проектного управленияМеждународная и российская практика проектного управления
Международная и российская практика проектного управления
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSense
 
Top 5 Deep Learning and AI Stories 3/9
Top 5 Deep Learning and AI Stories 3/9Top 5 Deep Learning and AI Stories 3/9
Top 5 Deep Learning and AI Stories 3/9
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Founders4Schools diversity slides 6 march 2017
Founders4Schools diversity slides 6 march 2017Founders4Schools diversity slides 6 march 2017
Founders4Schools diversity slides 6 march 2017
 
Les défauts de WordPress pour le SEO
Les défauts de WordPress pour le SEOLes défauts de WordPress pour le SEO
Les défauts de WordPress pour le SEO
 
Squeezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile PhonesSqueezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile Phones
 
Nmap Hacking Guide
Nmap Hacking GuideNmap Hacking Guide
Nmap Hacking Guide
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Nmap
NmapNmap
Nmap
 
Apache Spark: Coming up to speed
Apache Spark: Coming up to speedApache Spark: Coming up to speed
Apache Spark: Coming up to speed
 
NMAP - The Network Scanner
NMAP - The Network ScannerNMAP - The Network Scanner
NMAP - The Network Scanner
 
Incident response: Advanced Network Forensics
Incident response: Advanced Network ForensicsIncident response: Advanced Network Forensics
Incident response: Advanced Network Forensics
 

Semelhante a Exploring language classification with spark and the spark notebook

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 

Semelhante a Exploring language classification with spark and the spark notebook (20)

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
What is Spark
What is SparkWhat is Spark
What is Spark
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with R
 
Sequoia Spark Talk March 2015.pdf
Sequoia Spark Talk March 2015.pdfSequoia Spark Talk March 2015.pdf
Sequoia Spark Talk March 2015.pdf
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
.net developer for Jupyter Notebook and Apache Spark and viceversa
.net developer for Jupyter Notebook and Apache Spark and viceversa.net developer for Jupyter Notebook and Apache Spark and viceversa
.net developer for Jupyter Notebook and Apache Spark and viceversa
 
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Dart
DartDart
Dart
 
Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...
Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...
Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 

Último

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

Exploring language classification with spark and the spark notebook

  • 1. Exploring Language Classification With Apache Spark and the Spark Notebook A practical introduction to interactive Data Engineering Gerard Maas
  • 2. Gerard Maas Lead Engineer @ Kensu Computer Engineer Scala Programmer Early Spark Adopter Spark Notebook Dev Cassandra MVP (2015, 2016) Stack Overflow Top Contributor (Spark, Spark Streaming, Scala) Wannabe IoT Hacker Arduino Enthusiast @maasg https://github.com/maasg https://www.linkedin.com/ in/gerardmaas/ https://stackoverflow.com /users/764040/maasg
  • 3. DATA SCIENCE GOVERNANCE Adalog helps enterprises to ensure that data pipelines continually deliver their value by combining the contextual information when the pipeline was created with the evolving environment where the pipelines execute. CONNECT - COLLECT - LEARN
  • 4.
  • 7. What’s is a language? How is it composed?
  • 8. Letter Frequency Could we characterize a language by calculating the relative frequency of letters in some text ? Spanish vs English letter frequency
  • 9. n-grams "cavnar and trenkle" bi-grams: ca,av,vn,na,ar,r_,_a,an,nd,d_,_t,tr,re,en,nk,kl,le,e_ tri-grams: cav,avn,vna,nar,ar_,r_a,_an,and,nd_,d_t,_tr,tre,ren,enk,nkl,kle,le_ quad-grams: cavn,... http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf Could we characterize a language by calculating the relative frequency of sequence of letters in some text ?
  • 10. Tech
  • 11.
  • 12. Spark APIs RDD -> Resilient Distributed Datasets - Lazy, functional-oriented, low level API - Basis for execution of all high-level libraries Dataframes - Column-oriented, SQL-inspired DSL - Many optimizations under the hood (Catalyst, Tungsten) Dataset - Best of both worlds (except …)
  • 13. Spark Notebook A dynamic and visual web-based notebook for Spark with Scala
  • 14. Spark Notebook - Open Source Roadmap 2017 GIT Kerberos Project Generator Q1 Q2 Q3 Announcements: blog.kensu.io
  • 15. Notebooks Notebooks for this presentation are located at: https://github.com/maasg/spark-notebooks - have fun!
  • 16. https://github.com/maasg/spark-notebooks/languageclassification/language-detection-letter-freq.snb Implements the idea of using a letter frequency model to classify the language in a doc. Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/ It produces a training set of sampled strings that will be used also for the n-gram classifier (Note: this notebook is missing a function that’s left as an exercise to the reader. The folder /solutions contains the full working version.) Notebook 1 : Naive Language Classification
  • 17. Notebook 2 : n-gram Language Classification https://github.com/maasg/spark-notebooks/languageclassification/n-gram-language-classification.snb Implements the n-gram algorithm described in the paper. Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/ Uses the resulting classifier to implement a custom Spark ML Transformer that can be easily used to classify new texts. Transformers can be combined into Spark ML Pipelines of arbitrary complexity. (Note: this notebook is missing a function that’s left as an exercise to the reader. The folder /solutions contains the full working version.)