Advanced Natural Language Processing
with Spark NLP
Alex Thomas, Principal Data Scientist at WiseCube
David Talby, CTO at John Snow Labs
Agenda
Introducing Spark NLP
Accuracy, scalability, and speed benchmarks
Out-of-the-box functionality
Getting Things Done
End-to-end NLP tasks in 3 lines of code
Key concepts and a backstage tour
Notebooks!
Using pre-trained pipelines & models
Named entity recognition
Document classification
INTRODUCING SPARK NLP
STATE OF THE ART NLP FOR PYTHON, JAVA & SCALA
1. ACCURACY
2. SCALABILITY
3. SPEED
SPARK NLP IN THE ENTERPRISE
O’REILLY AI ADOPTION IN THE ENTERPRISE SURVEY OF 1,300 PRACTITIONERS, FEB 2019
• “State of the art” means the best peer-reviewed academic results
• Public benchmarks: Comparing production-grade NLP libraries
• Public benchmarks of pre-trained models: nlp.johnsnowlabs.com
“Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching”
New: Redesigned NER-DL and BERT-large
New: Spark OCR image filters & scalable pipelines
New: Hierarchical clinical entity resolution
“Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis”
New: ALBERT & XLNet embeddings
New: Contextual spell checker
New: DL-based sentiment analysis
ACCURACY
SCALABILITY
• Zero code changes to scale a pipeline to any Spark cluster
• The only natively distributed open-source NLP library
• Spark provides execution planning, caching, serialization, shuffling
• Caveats
– Speedup depends heavily on what you actually do
– Not all algorithms scale well
– Spark configuration matters
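One way to reason about the first two caveats is Amdahl's law. It is not from the slides, but it is a standard bound: if only a fraction p of a job parallelizes, n workers give a speedup of at most 1 / ((1 - p) + p / n). The sketch below is a minimal illustration in Python; the fractions and worker counts are made-up inputs, not Spark NLP measurements.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative inputs only: how the bound changes with cluster size.
    for fraction in (0.5, 0.9, 0.99):
        bounds = [round(amdahl_speedup(fraction, n), 2) for n in (1, 4, 16, 64)]
        print(fraction, bounds)
```

A stage that is only half parallelizable can never run more than twice as fast, however many nodes are added, which is why the speedup depends on what the pipeline actually does.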
SPEED: GET THE MOST FROM MODERN HARDWARE
• Optimized builds of Spark NLP for both Intel and Nvidia
• Benchmark done on AWS: train a Named Entity Recognizer in French
• Achieving an F1-score of 89% requires at least 80 epochs with a batch size of 512
• Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than Tesla P100
Production Grade + Active Community
In production at multiple Fortune 500 companies
26 new releases in 2018, 30 in 2019
Active Slack community
Permissive open source license: Apache 2.0
SPARK NLP OUT-OF-THE-BOX FUNCTIONALITY
OFFICIALLY SUPPORTED RUNTIMES
Getting Things Done
SENTIMENT ANALYSIS
import sparknlp
sparknlp.start()
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('analyze_sentiment', 'en')
result = pipeline.annotate('Harry Potter is a great movie')
print(result['sentiment'])  # will print ['positive']
NAMED ENTITY RECOGNITION
pipeline = PretrainedPipeline('recognize_entities_bert', 'en')
result = pipeline.annotate('Harry and Ron met in Hogsmeade')
print(result['ner'])
# prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC']
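The tag list lines up one-to-one with the tokens, so adjacent tokens that share an entity type can be merged into spans. Below is a minimal sketch in plain Python, assuming whitespace tokenization and the tag sequence shown on the slide. `group_entities` is a hypothetical helper for illustration, not part of Spark NLP.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens of the same entity type into (text, type) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans = []  # each item is [list_of_tokens, entity_type]
    previous_type = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Outside any entity: close whatever span was open.
            previous_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == previous_type:
            # Continuation of the entity that is currently open.
            spans[-1][0].append(token)
        else:
            # A "B-" tag, or an "I-" tag with nothing to continue, opens a new span.
            spans.append([[token], entity_type])
        previous_type = entity_type
    return [(" ".join(words), entity_type) for words, entity_type in spans]


tokens = "Harry and Ron met in Hogsmeade".split()
tags = ["I-PER", "O", "I-PER", "O", "O", "I-LOC"]
print(group_entities(tokens, tags))
```

With the slide's tags this yields three single-token entities: two persons and one location.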
SPELL CHECKING & CORRECTION
Now in Scala:
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = PretrainedPipeline("spell_check_ml", "en")
val result = pipeline.annotate("Harry Potter is a graet muvie")
println(result("spell"))
/* will print Seq[String](…, "is", "a", "great", "movie") */
UNDER THE HOOD
1. sparknlp.start() starts a new Spark session if there isn’t one, and returns it.
2. PretrainedPipeline() loads the English version of the explain_document_dl pipeline,
the pre-trained models, and the embeddings it depends on.
3. These are stored and cached locally.
4. TensorFlow is initialized, within the same JVM process that runs Spark.
The pre-trained embeddings and deep-learning models (like NER) are loaded. Models are
automatically distributed and shared if running on a cluster.
5. The annotate() call runs an NLP inference pipeline which activates each stage’s
algorithm (tokenization, POS, etc.).
6. The NER stage is run on TensorFlow – applying a neural network with bi-LSTM layers for
tokens and a CNN for characters.
7. Embeddings are used to convert contextual tokens into vectors during the NER inference
process.
8. The result object is a plain old local Python dictionary.
KEY CONCEPT #1: PIPELINE
A list of text processing steps.
Each step has input and output columns.
Stages: Document Assembler → Sentence Detector → Tokenizer → Sentiment Analyzer
Columns: text → document → sentence → token → sentiment
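The column wiring can be imitated in a few lines of plain Python. This is a toy sketch of the idea, not Spark NLP's implementation: `Stage`, `run_pipeline`, and the three lambda stages are hypothetical, and the naive regex splitting stands in for the real annotators. The sentiment stage is left out to keep the sketch short.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One processing step that reads one column and writes another."""
    name: str
    input_col: str
    output_col: str
    transform: Callable[[object], object]


def run_pipeline(stages: List[Stage], row: Dict[str, object]) -> Dict[str, object]:
    """Apply each stage in order, adding its output column to a copy of the row."""
    result = dict(row)
    for stage in stages:
        if stage.input_col not in result:
            raise KeyError(f"{stage.name} needs missing column '{stage.input_col}'")
        result[stage.output_col] = stage.transform(result[stage.input_col])
    return result


stages = [
    Stage("document_assembler", "text", "document", lambda text: text.strip()),
    Stage("sentence_detector", "document", "sentence",
          lambda doc: [s for s in re.split(r"(?<=[.!?])\s+", doc) if s]),
    Stage("tokenizer", "sentence", "token",
          lambda sentences: [re.findall(r"\w+|[^\w\s]", s) for s in sentences]),
]

row = run_pipeline(stages, {"text": " Harry met Ron. They went to Hogsmeade. "})
print(row["sentence"])
print(row["token"])
```

Each stage only needs to know the name of the column it consumes, which is what lets stages be reordered or swapped as long as the column names still line up.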
KEY CONCEPT #2: ANNOTATOR
sentiment_detector = SentimentDetector() \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentiment_score") \
    .setDictionary(resource_path + "sent.txt")
An object encapsulating one text processing step.
KEY CONCEPT #3: RESOURCE
• Trained ML models
• Trained DL networks
• Dictionaries
• Embeddings
• Rules
• Pretrained pipelines
An external file that an annotator needs.
Resources can be shared, cached, and locally stored.
KEY CONCEPT #4: PRETRAINED PIPELINE
A pre-built pipeline, with all the annotators and resources it needs.
PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT
Initialization
Training data
Resources
Annotator
Pipeline
Run Training
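The six steps map onto Spark NLP code roughly as follows. This is a sketch under assumptions, not the notebook's code: the function name, the CoNLL file path argument, and the epoch count are placeholders, and the default `BertEmbeddings.pretrained()` model is used rather than any specific one. Running it requires `spark-nlp` and `pyspark`; the imports sit inside the function so the definition itself loads without them.

```python
def train_ner_with_bert(conll_path: str, max_epochs: int = 1):
    """Train a NER model on a CoNLL-style file and return the fitted pipeline."""
    # Imports live here so this sketch can be loaded without Spark installed.
    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BertEmbeddings, NerDLApproach
    from sparknlp.training import CoNLL

    # 1. Initialization: start (or reuse) a Spark session.
    spark = sparknlp.start()

    # 2. Training data: the CoNLL reader is expected to yield sentence,
    #    token and label columns, among others.
    training_data = CoNLL().readDataset(spark, conll_path)

    # 3. Resources: pretrained BERT embeddings (library default model).
    embeddings = (
        BertEmbeddings.pretrained()
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )

    # 4. Annotator: the trainable deep-learning NER approach.
    ner_approach = (
        NerDLApproach()
        .setInputCols(["sentence", "token", "embeddings"])
        .setLabelColumn("label")
        .setOutputCol("ner")
        .setMaxEpochs(max_epochs)  # placeholder; real training needs more epochs
    )

    # 5. Pipeline: embeddings first, then the NER trainer that consumes them.
    pipeline = Pipeline(stages=[embeddings, ner_approach])

    # 6. Run training.
    return pipeline.fit(training_data)
```

Calling `train_ner_with_bert("path/to/train.conll")` (a hypothetical path) would return a fitted `PipelineModel` whose `transform()` adds the `ner` column.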
Notebooks!
Cleaning, Splitting, and Finding Text
+ Understanding Grammar
Run “Spark NLP Basics” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
Using Pre-trained Pipelines
+ Named Entity Recognition
Run “Entity Recognizer with Deep Learning” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
Training your own NER model
Run “NER BERT Training” notebook
Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ
Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
Document Classification
+ Universal Sentence Embeddings
Run “Text Classification with ClassifierDL” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
WHAT ELSE IS AVAILABLE?
• Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking
to medical terminologies, assertion status detection, and de-identification
• Spark OCR: 20 annotators for image enhancement, layout, and smart editing
LEARN MORE: TECHNICAL CASE STUDIES
Improving Patient Flow Forecasting
Automated clinical coding & chart reviews
Knowledge Extraction from Pathology Reports
High-accuracy fact extraction from long financial documents
Improving Mental Health for HIV-Positive Adolescents
Accelerating Clinical Trial Recruiting
NEXT STEPS
1. READ THE DOCS & JOIN SLACK
HTTPS://NLP.JOHNSNOWLABS.COM
2. STAR & FORK THE REPO
GITHUB.COM/JOHNSNOWLABS/SPARK-NLP
3. QUESTIONS? GET IN TOUCH
Thank you!
alex@wisecube.com
david@johnsnowlabs.com
Advanced Natural Language Processing with Apache Spark NLP

  • 1.
  • 2. Advanced Natural Language Processing with Spark NLP Alex Thomas, Principal Data Scientist at WiseCube David Talby, CTO at John Snow Labs
  • 3. Agenda Introducing Spark NLP Accuracy, scalability, and speed benchmarks Out-of-the-box functionality Getting Things Done End-to-end NLP tasks in 3 lines of code Key concepts and a backstage tour Notebooks! Using pre-trained pipelines & models Named entity recognition Document classification
  • 4. INTRODUCING SPARK NLP STATE OF THE ART NLP FOR PYTHON, JAVA & SCALA 1. ACCURACY 2. SCALABILITY 3. SPEED
  • 5. SPARK NLP IN THE ENTERPRISE O’REILLY AI ADOPTION IN THE ENTERPRISE SURVEY OF 1,300 PRACTITIONERS, FEB 2019
  • 6.
  • 7. • ”State of the art” means the best peer-reviewed academic results • Public benchmarks: Comparing production-grade NLP libraries • Public benchmarks of pre-trained models: nlp.johnsnowlabs.com “Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching” New: Redesigned NER-DL and BERT-large New: Spark OCR image filters & scalable pipelines New: Hierarchical clinical entity resolution “Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis” New: ALBERT & XLNet embeddings New: Contextual spell checker New: DL-based sentiment analysis ACCURACY
  • 8. SCALABILITY • Zero code changes to scale a pipeline to any Spark cluster • Only natively distributed open-source NLP library • Spark provides execution planning, caching, serialization, shuffling • Caveats – Speedup depends heavily on what you actually do – Not all algorithms scale well – Spark configuration matters
  • 9. SPEED: GET THE MOST FROM MODERN HARDWARE • Optimized builds of Spark NLP for both Intel and Nvidia • Benchmark done on AWS: Train a Named Entity Recognizer in French • Achieving F1-score of 89% requires at least 80 Epochs with batch size of 512 • Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than Tesla P-100
  • 10. Production Grade + Active Community In production in multiple Fortune 500’s 26 new releases in 2018, 30 in 2019 Active Slack community Permissive open source license: Apache 2.0
  • 14. SENTIMENT ANALYSIS import sparknlp sparknlp.start() from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('analyze-sentiment', 'en') result = pipeline.annotate('Harry Potter is a great movie’) print(result['sentiment’]) ## will print ['positive']
  • 15. NAMED ENTITY RECOGNITION
      pipeline = PretrainedPipeline('recognize_entities_bert', 'en')
      result = pipeline.annotate('Harry and Ron met in Hogsmeade')
      print(result['ner'])  # prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC']
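The tag list on the NER slide is easier to use once adjacent tags are merged back into named entities. The sketch below is plain Python and does not call Spark NLP; `group_entities` is a hypothetical helper, and it assumes the tokens are available as a list parallel to the tags, with every tag being either `O` or of the form `B-LABEL` / `I-LOC`.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge parallel token/tag lists into (text, label) entity pairs."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # "O" means outside any entity; otherwise keep the part after the prefix.
        label = None if tag == "O" else tag.split("-", 1)[-1]
        starts_new = tag.startswith("B-") or label != current_label
        # Close the open entity when we leave it or a different one begins.
        if current_words and (label is None or starts_new):
            entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        if label is not None:
            current_words.append(token)
            current_label = label
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Tokens and tags taken from the slide's example sentence.
tokens = ["Harry", "and", "Ron", "met", "in", "Hogsmeade"]
tags = ["I-PER", "O", "I-PER", "O", "O", "I-LOC"]
print(group_entities(tokens, tags))
```

Consecutive tags with the same label are merged into one entity, so a two-token place name tagged `B-LOC`, `I-LOC` comes back as a single pair.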
  • 16. SPELL CHECKING & CORRECTION Now in Scala:
      val pipeline = PretrainedPipeline("spell_check_ml", "en")
      val result = pipeline.annotate("Harry Potter is a graet muvie")
      println(result("spell")) /* will print Seq[String](…, "is", "a", "great", "movie") */
  • 17. UNDER THE HOOD
      1. sparknlp.start() starts a new Spark session if there isn’t one, and returns it.
      2. PretrainedPipeline() loads the English version of the explain_document_dl pipeline, the pre-trained models, and the embeddings it depends on.
      3. These are stored and cached locally.
      4. TensorFlow is initialized, within the same JVM process that runs Spark. The pre-trained embeddings and deep-learning models (like NER) are loaded. Models are automatically distributed and shared if running on a cluster.
      5. The annotate() call runs an NLP inference pipeline which activates each stage’s algorithm (tokenization, POS, etc.).
      6. The NER stage is run on TensorFlow – applying a neural network with bi-LSTM layers for tokens and a CNN for characters.
      7. Embeddings are used to convert contextual tokens into vectors during the NER inference process.
      8. The result object is a plain old local Python dictionary.
  • 18. KEY CONCEPT #1: PIPELINE A list of text processing steps. Each step has input and output columns. Document Assembler Sentence Detector Tokenizer Sentiment Analyzer text document sentence token sentiment
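The column wiring on the pipeline slide can be illustrated without Spark. The following is a toy sketch in plain Python, not the Spark NLP API: `Step`, `run_pipeline`, and the tiny `POSITIVE` word set are all hypothetical names made up for illustration. Each step declares the columns it reads and the column it writes, and the runner refuses to execute a step whose inputs have not been produced yet.

```python
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    name: str
    input_cols: List[str]
    output_col: str
    fn: Callable[..., object]


def run_pipeline(steps: List[Step], text: str) -> Dict[str, object]:
    """Run the steps in order, storing each result under its output column."""
    columns: Dict[str, object] = {"text": text}
    for step in steps:
        missing = [c for c in step.input_cols if c not in columns]
        if missing:
            raise KeyError(f"{step.name} needs missing column(s): {missing}")
        columns[step.output_col] = step.fn(*(columns[c] for c in step.input_cols))
    return columns


POSITIVE = {"great", "good"}  # toy lexicon, illustration only

steps = [
    Step("document_assembler", ["text"], "document", lambda text: text.strip()),
    Step("sentence_detector", ["document"], "sentence",
         lambda doc: [s.strip() for s in doc.split(".") if s.strip()]),
    Step("tokenizer", ["sentence"], "token",
         lambda sentences: [s.split() for s in sentences]),
    Step("sentiment_analyzer", ["token"], "sentiment",
         lambda token_lists: [
             "positive" if POSITIVE & {w.lower() for w in words} else "negative"
             for words in token_lists
         ]),
]

result = run_pipeline(steps, "Harry Potter is a great movie.")
print(result["sentiment"])
```

The returned dictionary holds every intermediate column (text, document, sentence, token, sentiment), which mirrors the chain of column names listed on the slide.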
  • 19. KEY CONCEPT #2: ANNOTATOR An object encapsulating one text processing step.
      sentiment_detector = (
          SentimentDetector()
          .setInputCols(["sentence"])
          .setOutputCol("sentiment_score")
          .setDictionary(resource_path + "sent.txt")
      )
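The chained setters on the annotator slide work because each setter returns the object it was called on. The sketch below shows that pattern with a toy class in plain Python; `ToySentimentDetector` is hypothetical and is not Spark NLP's `SentimentDetector`. In particular, its `setDictionary` takes an in-memory word-to-score mapping rather than a file path, purely to keep the example self-contained.

```python
class ToySentimentDetector:
    """Toy annotator: one processing step with configurable columns."""

    def __init__(self):
        self.input_cols = []
        self.output_col = None
        self.dictionary = {}

    def setInputCols(self, cols):
        self.input_cols = list(cols)
        return self  # returning self is what makes chaining possible

    def setOutputCol(self, col):
        self.output_col = col
        return self

    def setDictionary(self, word_scores):
        # Assumption for this sketch: a plain dict of word -> score.
        self.dictionary = dict(word_scores)
        return self

    def annotate(self, row):
        """Read the input column, write the summed word scores to the output column."""
        sentence = row[self.input_cols[0]]
        score = sum(self.dictionary.get(w.lower(), 0) for w in sentence.split())
        return {**row, self.output_col: score}


detector = (
    ToySentimentDetector()
    .setInputCols(["sentence"])
    .setOutputCol("sentiment_score")
    .setDictionary({"great": 1, "awful": -1})
)
row = detector.annotate({"sentence": "Harry Potter is a great movie"})
print(row["sentiment_score"])
```

When a chain is split across several lines in Python, wrapping it in parentheses as above keeps the leading-dot lines syntactically valid.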
  • 20. KEY CONCEPT #3: RESOURCE • Trained ML models • Trained DL networks • Dictionaries • Embeddings • Rules • Pretrained pipelines An external file that an annotator needs. Resources can be shared, cached, and locally stored.
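One way to picture "shared, cached, and locally stored" is a loader that fetches a resource once, keeps a copy on local disk, and hands the same in-memory object to every caller. The sketch below is a minimal illustration in plain Python under that assumption; `ResourceCache` and its `fetch` callable are hypothetical and do not describe how Spark NLP actually manages its resources.

```python
import tempfile
from pathlib import Path


class ResourceCache:
    """Fetch each resource at most once, store it on disk, share it in memory."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.fetch = fetch      # hypothetical callable: resource name -> bytes
        self._loaded = {}       # in-memory copies shared by all callers
        self.fetch_count = 0    # how many times the remote source was hit

    def get(self, name):
        if name in self._loaded:            # already shared in memory
            return self._loaded[name]
        path = self.cache_dir / name
        if not path.exists():               # not stored locally yet
            self.fetch_count += 1
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(self.fetch(name))
        data = path.read_bytes()
        self._loaded[name] = data
        return data


with tempfile.TemporaryDirectory() as tmp:
    cache = ResourceCache(tmp, fetch=lambda name: f"contents of {name}".encode())
    first = cache.get("sent.txt")
    second = cache.get("sent.txt")          # served from memory
    # A fresh cache over the same directory reuses the file already on disk.
    fresh = ResourceCache(tmp, fetch=lambda name: b"should not be called")
    third = fresh.get("sent.txt")
    print(first is second, third == first, cache.fetch_count, fresh.fetch_count)
```

The second cache instance stands in for a later session: it finds the file already on disk and never calls its fetch function.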
  • 21. KEY CONCEPT #4: PRETRAINED PIPELINE A pre-built pipeline, with all the annotators and resources it needs.
  • 22. PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT Initialization Training data Resources Annotator Pipeline Run Training
  • 24. Cleaning, Splitting, and Finding Text + Understanding Grammar Run “Spark NLP Basics” notebook Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
  • 25. Using Pre-trained Pipelines + Named Entity Recognition Run “Entity Recognizer with Deep Learning” notebook Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
  • 26. Training your own NER model Run “NER BERT Training” notebook Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
  • 27. Document Classification + Universal Sentence Embeddings Run “Text Classification with ClassifierDL” notebook Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
  • 28. WHAT ELSE IS AVAILABLE? • Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking to medical terminologies, assertion status detection, and de-identification • Spark OCR: 20 annotators for image enhancement, layout, and smart editing
  • 29. LEARN MORE: TECHNICAL CASE STUDIES Improving Patient Flow Forecasting Automated clinical coding & chart reviews Knowledge Extraction from Pathology Reports High-accuracy fact extraction from long financial documents Improving Mental Health for HIV-Positive Adolescents Accelerating Clinical Trial Recruiting
  • 30. NEXT STEPS 1. READ THE DOCS & JOIN SLACK HTTPS://NLP.JOHNSNOWLABS.COM 2. STAR & FORK THE REPO GITHUB.COM/JOHNSNOWLABS/SPARK-NLP 3. QUESTIONS? GET IN TOUCH