Creating Knowledge Bases from Text in the Absence of Training Data
Sanghamitra Deb
Accenture Technology Laboratory
Phil Rogers, Jana Thompson, Hans Li
Typical Business Process
Executive Summary
Business Decisions
Hours of knowledge curation by experts
The Generalized approach of extracting text: Parsing
Tokenization, Normalization, Parsing, Lemmatization
Tokenization: separating sentences and words, removing special characters, detecting phrases.
Normalization: lowercasing words, word-sense disambiguation.
Parsing: detecting parts of speech (nouns, verbs, etc.).
Lemmatization: reducing plurals and other word forms to a single word (found in the dictionary).
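The preprocessing steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names are hypothetical, `naive_lemma` is a crude suffix-stripping stand-in for a dictionary-based lemmatizer, and the parsing (POS tagging) step is left out because it needs a trained tagger.

```python
import re

def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Keep alphanumeric word tokens; punctuation and special characters are dropped."""
    return re.findall(r"[A-Za-z0-9]+", sentence)

def normalize(tokens):
    """Lowercase every token (word-sense disambiguation is not attempted here)."""
    return [t.lower() for t in tokens]

def naive_lemma(token):
    """Crude plural stripping; an assumption standing in for a real lemmatizer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith(("ss", "us", "is")) and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text):
    """Sentence split -> tokenize -> normalize -> naive lemmatization."""
    return [[naive_lemma(t) for t in normalize(tokenize(s))]
            for s in split_sentences(text)]

# The first sentence comes from the FDA label examples; the second is made up.
result = preprocess("For the relief of symptoms of depression. It treats allergies.")
print(result)
```

In practice a library such as nltk (listed in the requirements file of this deck) would replace the hand-written functions.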
The Generalized approach of extracting text: ML
Extract sentences that contain the specific attribute.
POS tag and extract unigrams, bigrams and trigrams centered on nouns.
Extract features: words around nouns (bag of words/word vectors), position of the noun and length of sentence.
Map training data to create a balanced positive and negative training set.
Train a Machine Learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.
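As a rough sketch of the n-gram and feature extraction steps in this approach, the snippet below assumes that "centered on nouns" means every n-gram that contains the noun, and it uses a hand-tagged sentence instead of a real POS tagger. The function names and the tagged example are hypothetical.

```python
def ngrams_around(tokens, index, max_n=3):
    """Return all n-grams (n <= max_n) that contain the token at `index`."""
    found = []
    for n in range(1, max_n + 1):
        for start in range(index - n + 1, index + 1):
            end = start + n
            if start >= 0 and end <= len(tokens):
                found.append(" ".join(tokens[start:end]))
    return found

def features(tokens, index, window=2):
    """Bag of context words, position of the noun, and sentence length."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return {
        "context": sorted(set(left + right)),
        "position": index,
        "length": len(tokens),
    }

# Hypothetical POS-tagged sentence; the tags would normally come from a tagger.
tagged = [("relief", "NN"), ("of", "IN"), ("symptoms", "NNS"),
          ("of", "IN"), ("depression", "NN")]
tokens = [word for word, _ in tagged]
noun_positions = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("NN")]

candidates = {i: ngrams_around(tokens, i) for i in noun_positions}
depression_features = features(tokens, 4)
print(candidates[4])
print(depression_features)
```

Each candidate n-gram, paired with its feature dictionary, would then become one row of the training set for the classifier.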
How do we generate this training data?
A different Approach
Stanford
Replaces training data by encoding domain knowledge
The Snorkel approach to Entity Extraction
Extract sentences that contain the specific attribute.
POS tag and extract unigrams, bigrams and trigrams centered on nouns.
Write rules: encode your domain knowledge into rules.
Validate rules: coverage, conflicts, accuracy.
Run learning: logistic regression, LSTM, …
Examine a random set of candidates and create new rules.
Observe the lowest-accuracy (highest-conflict) rules and edit them.
Iterate.
Training Data | Rules (figure: Planetary Orbits)
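The write-and-validate loop for rules can be sketched without any library. The code below is not Snorkel's API: the rule functions, the candidate sentences and the gold labels are all hypothetical, and only the candidate phrases are borrowed from the Lithium Carbonate example in this deck. Each rule votes 1 (positive), -1 (negative) or 0 (abstain), and the helpers compute coverage, conflict and accuracy per rule.

```python
import re

POSITIVE, ABSTAIN, NEGATIVE = 1, 0, -1

def rule_trigger_phrase(c):
    """Vote positive when the sentence contains a treatment trigger phrase."""
    hit = re.search(r"indicated for|treatment of|relief of", c["sentence"])
    return POSITIVE if hit else ABSTAIN

def rule_generic_word(c):
    """Vote negative for generic words that are not diseases."""
    return NEGATIVE if c["phrase"] in {"individual", "maintenance"} else ABSTAIN

def rule_disease_suffix(c):
    """Vote positive for phrases ending in a disease-like word."""
    return POSITIVE if c["phrase"].endswith(("disorder", "episode")) else ABSTAIN

RULES = [rule_trigger_phrase, rule_generic_word, rule_disease_suffix]

def apply_rules(candidates):
    """One row of votes per candidate, one column per rule."""
    return [[rule(c) for rule in RULES] for c in candidates]

def coverage(votes, j):
    """Fraction of candidates on which rule j does not abstain."""
    return sum(1 for row in votes if row[j] != ABSTAIN) / len(votes)

def conflict(votes, j):
    """Fraction of candidates where rule j votes and another rule votes the opposite."""
    clashes = 0
    for row in votes:
        others = [v for k, v in enumerate(row) if k != j]
        if row[j] != ABSTAIN and -row[j] in others:
            clashes += 1
    return clashes / len(votes)

def accuracy(votes, j, gold):
    """Accuracy of rule j on annotated candidates where it does not abstain."""
    pairs = [(row[j], g) for row, g in zip(votes, gold) if row[j] != ABSTAIN]
    return sum(v == g for v, g in pairs) / len(pairs) if pairs else None

# Hypothetical candidates and hand annotations (1 = disease, -1 = not a disease).
candidates = [
    {"phrase": "bipolar disorder", "sentence": "indicated for the treatment of bipolar disorder"},
    {"phrase": "individual", "sentence": "dosage must be adjusted for each individual"},
    {"phrase": "maintenance", "sentence": "indicated for maintenance therapy"},
    {"phrase": "manic episode", "sentence": "treatment of a manic episode"},
]
gold = [1, -1, -1, 1]

votes = apply_rules(candidates)
for j, rule in enumerate(RULES):
    print(rule.__name__, coverage(votes, j), conflict(votes, j), accuracy(votes, j, gold))
```

A rule with low accuracy or high conflict, such as the trigger-phrase rule on the "maintenance" candidate here, is the one to inspect and edit in the next iteration.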
How does Snorkel work without training data?
Write rules: encode your domain knowledge into rules.
The rules are modeled as a Naive Bayes model, which assumes that the rules are conditionally independent.
Even though this assumption does not hold most of the time, in practice it generates a pretty good training set with probabilities of being in either class.
These probabilities are fed into a Machine Learning algorithm (Logistic Regression in the simplest case) to create a model used to make future predictions.
http://arxiv.org/pdf/1512.06474v2.pdf
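To make the conditional-independence idea concrete, here is a minimal sketch of how independent rule votes combine into a class probability. It assumes each rule has a known accuracy and abstains independently of the true label; the accuracies below are hypothetical fixed numbers, whereas Snorkel estimates them from the rule outputs themselves.

```python
import math

def positive_probability(votes, accuracies, prior=0.5):
    """P(label = +1 | votes), assuming rules vote independently given the true label."""
    log_odds = math.log(prior / (1 - prior))
    for vote, acc in zip(votes, accuracies):
        # A +1 vote adds log(acc / (1 - acc)), a -1 vote subtracts it,
        # and an abstain (0) leaves the odds unchanged.
        log_odds += vote * math.log(acc / (1 - acc))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical accuracies for three rules.
accuracies = [0.9, 0.6, 0.8]

p_mixed = positive_probability([1, -1, 0], accuracies)    # rules disagree
p_abstain = positive_probability([0, 0, 0], accuracies)   # nobody votes
print(round(p_mixed, 3), p_abstain)
```

The resulting probabilities play the role of soft labels: a downstream classifier such as logistic regression can be trained on them instead of on hand-labeled data.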
Data Dive: FDA Drug Labels
It is indicated for treating respiratory disorder caused due to allergy.
For the relief of symptoms of depression.
Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:
When oral therapy is not feasible and the strength, dosage form, and route of administration of the drug reasonably lend the preparation to the treatment of the condition
Candidate Extraction
Using domain knowledge and language structure, collect a set of candidates with high recall and low precision. Typically this set should have 80% recall and 20% precision.
60% accuracy: too specific, need to make it more general.
30% accuracy: this looks fine.
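A deliberately broad candidate extractor can be sketched as follows. The trigger phrases are guesses based on the FDA label sentences quoted earlier, and the gold set is a hypothetical annotation, so the numbers only illustrate the high-recall, low-precision trade-off.

```python
import re

# Hypothetical trigger phrases; everything after a trigger is treated as candidate text.
TRIGGER = re.compile(r"\b(?:indicated for|relief of|treatment of)\s+(.*)", re.IGNORECASE)

def extract_candidates(sentence):
    """Return every 1- to 3-word phrase after the first trigger phrase."""
    match = TRIGGER.search(sentence)
    if not match:
        return set()
    words = re.findall(r"[a-z]+", match.group(1).lower())
    return {" ".join(words[i:i + n])
            for n in (1, 2, 3)
            for i in range(len(words) - n + 1)}

def precision_recall(candidates, gold):
    """Precision and recall of a candidate set against annotated true phrases."""
    true_pos = len(candidates & gold)
    precision = true_pos / len(candidates) if candidates else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

found = extract_candidates("For the relief of symptoms of depression.")
gold = {"depression"}  # hypothetical annotation for this sentence
precision, recall = precision_recall(found, gold)
print(sorted(found))
print(precision, recall)
```

The extractor keeps the one true phrase (full recall) at the cost of five spurious ones, which is the kind of candidate set the rules are then expected to clean up.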
Automated Features:
pos-tags
context
dep-tree
char-offsets
Rule Functions
Testing Rule Functions:
Generation of training data (figure: plots over the labels -1, 0, 1 with one rule, two rules, three rules, four rules and 20 rules)
Results and performance.
drug-name          disease candidate   Candidates   snorkel
Lithium Carbonate  bipolar disorder    1            1
Lithium Carbonate  individual          1            0
Lithium Carbonate  maintenance         1            0
Lithium Carbonate  manic episode       1            1
Precision and recall ~90%
Evolution of F1-score with sample size
Relationship extraction
• Is person X married to person Y?
• Does drug X cure disease Y?
• Does software X (example: snorkel) run on programming language Y (example: python3)?
Define filters for candidate extraction for a pair (X, Y),
for example: (snorkel, python2.7), (snorkel, python3.1), …
Once you have the pairs, examine them using the annotation tool.
Write rules -> observe their performance against annotated data.
Iterate.
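A candidate filter for (X, Y) pairs might look like the sketch below. The software dictionary, the regular expression for language versions and the example sentence are assumptions made for illustration; a real filter would be tuned to the corpus.

```python
import re
from itertools import product

# Hypothetical dictionary of software names (X) and pattern for language versions (Y).
SOFTWARE = {"snorkel", "numpy"}
LANGUAGE = re.compile(r"python\s?\d(?:\.\d+)?", re.IGNORECASE)

def candidate_pairs(sentence):
    """Pair every software mention with every language mention in the sentence."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    xs = [w for w in words if w in SOFTWARE]
    ys = [m.group(0).lower().replace(" ", "") for m in LANGUAGE.finditer(sentence)]
    return list(product(xs, ys))

pairs = candidate_pairs("Snorkel was tested on Python 2.7 and Python 3.1.")
print(pairs)
```

Every pair produced this way is only a candidate; whether the relationship actually holds is decided later by the rules and the learned model.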
Crowdsourced training data
In some cases training data is generated on the same dataset by multiple people.
In Snorkel each source can be incorporated as a separate rule function.
The model for the rules figures out the relative weights for each person and creates cleaner training data.
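The idea of weighting annotators can be illustrated with a much simpler scheme than Snorkel's model: treat each annotator as a rule function, score each one by agreement with the unweighted majority vote, and then take a weighted vote. The label matrix is hypothetical, and this agreement heuristic is an assumption chosen for brevity, not the method used by the library.

```python
def majority(votes):
    """Sign of the summed votes: 1, -1, or 0 on a tie."""
    total = sum(votes)
    return 1 if total > 0 else -1 if total < 0 else 0

def annotator_weights(label_matrix):
    """Weight each annotator by agreement with the unweighted majority vote."""
    consensus = [majority(row) for row in label_matrix]
    weights = []
    for j in range(len(label_matrix[0])):
        # Skip items the annotator did not label and items with a tied consensus.
        voted = [(row[j], c) for row, c in zip(label_matrix, consensus)
                 if row[j] != 0 and c != 0]
        agree = sum(v == c for v, c in voted)
        weights.append(agree / len(voted) if voted else 0.0)
    return weights

def weighted_labels(label_matrix, weights):
    """Majority vote where each annotator's label is scaled by their weight."""
    return [majority([w * v for w, v in zip(weights, row)]) for row in label_matrix]

# Hypothetical labels: rows are items, columns are annotators (0 = not labeled).
label_matrix = [
    [1, 1, -1],
    [-1, -1, 1],
    [1, 1, 1],
    [-1, 0, 1],
]
weights = annotator_weights(label_matrix)
cleaned = weighted_labels(label_matrix, weights)
print(weights, cleaned)
```

The last item is a tie under plain majority voting; once the third annotator is down-weighted for disagreeing with the others, the tie resolves in favor of the more consistent annotator.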
Why Docker?
• Portability: develop here, run there (internal clusters, AWS, Google Cloud, etc.); reusable by team and clients
• Isolation: OS and Docker isolated from bugs.
• Fast
• Easy virtualization: hardware emulation, virtualized OS.
• Lightweight
Python stack on docker
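FROM ubuntu:latest
# MAINTAINER Sanghamitra Deb <sangha123@gmail.com>
# Package names are kept from the slides; newer ubuntu:latest images may name them differently.
# RUN (not CMD) so the message is printed while the image is built.
RUN echo Installing Accenture Tech Labs Scientific Python Enviro
# Refresh the package index before installing anything.
RUN apt-get update && apt-get upgrade -y
RUN apt-get install python -y
RUN apt-get install curl -y
RUN apt-get install emacs -y
# Install pip from the bootstrap script, then clean up.
RUN curl -O https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN rm get-pip.py
RUN echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc
# Build tools and native libraries needed by the scientific Python packages.
RUN apt-get install python-setuptools build-essential python-dev -y
RUN apt-get install gfortran swig -y
RUN apt-get install libatlas-dev liblapack-dev -y
RUN apt-get install libfreetype6 libfreetype6-dev -y
RUN apt-get install libxft-dev -y
RUN apt-get install libxml2-dev libxslt-dev zlib1g-dev -y
RUN apt-get install python-numpy -y
# Install the Python packages listed in requirements.txt.
ADD requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt -q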
Dockerfile
scipy
matplotlib
ipython
jupyter
pandas
Bottleneck
patsy
pymc
statsmodels
scikit-learn
BeautifulSoup
seaborn
gensim
fuzzywuzzy
xmltodict
untangle
nltk
flask
enum34
requirements.txt
docker build -t sangha/python .
docker run -it -p 1108:1108 -p 1106:1106 --name pharmaExtraction0.1 -v /location/in/hadoop/ sangha/python bash
docker exec -it pharmaExtraction0.1 bash
docker exec -d  pharmaExtraction0.1 python  /root/pycodes/rest_api.py
Building the Dockerfile
Typical ML pipeline vs Snorkel
(1) Candidate Extraction.
(2) Rule Function
(3) Hyperparameter tuning
Snorkel:
Pros:
• Very little training data necessary
• Do not have to think about feature generation
• Do not need deep knowledge of Machine Learning
• Convenient UI for data annotation
• Creates structured databases from unstructured text
Cons:
• Code is new, so it may not be robust to all situations.
• Doing online prediction is difficult.
• Not much transparency in the internal workings.
Applicability across a variety of industries and use cases
Banks: Loan Approval
Paleontology
Design of Clinical Trials
Legal Investigation
Market Research Reports
Human Trafficking
Skills extraction from resumes
Content Marketing
Product descriptions and reviews
Pharmaceutical Industry
Where to get it?
https://github.com/HazyResearch/snorkel
http://arxiv.org/pdf/1512.06474v2.pdf

Mais conteúdo relacionado

Destaque

Clinical research and clinical data management - Ikya Global
Clinical research and clinical data management - Ikya GlobalClinical research and clinical data management - Ikya Global
Clinical research and clinical data management - Ikya Globalikya global
 
Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6
Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6
Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6Perficient
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseJosh Patterson
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecJosh Patterson
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonSri Ambati
 
Medical Informatics: Computational Analytics in Healthcare
Medical Informatics: Computational Analytics in HealthcareMedical Informatics: Computational Analytics in Healthcare
Medical Informatics: Computational Analytics in HealthcareNUS-ISS
 
Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...
Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...
Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...David Talby
 
Machine learning and big data
Machine learning and big dataMachine learning and big data
Machine learning and big dataPoo Kuan Hoong
 
Protocol Understanding_ Clinical Data Management_KatalystHLS
Protocol Understanding_ Clinical Data Management_KatalystHLSProtocol Understanding_ Clinical Data Management_KatalystHLS
Protocol Understanding_ Clinical Data Management_KatalystHLSKatalyst HLS
 
Argus Product Tab Screens - Katalyst HLS
Argus Product Tab Screens - Katalyst HLSArgus Product Tab Screens - Katalyst HLS
Argus Product Tab Screens - Katalyst HLSKatalyst HLS
 
Big Data and Clinical Research: Trends, Issues and Considerations
Big Data and Clinical Research: Trends, Issues and ConsiderationsBig Data and Clinical Research: Trends, Issues and Considerations
Big Data and Clinical Research: Trends, Issues and ConsiderationsMerge eClinicalOS
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksJosh Patterson
 
Adverse Events and Serious Adverse Events - Katalyst HLS
Adverse Events and Serious Adverse Events - Katalyst HLSAdverse Events and Serious Adverse Events - Katalyst HLS
Adverse Events and Serious Adverse Events - Katalyst HLSKatalyst HLS
 
Overview of Validation in Pharma_Katalyst HLS
Overview of Validation in Pharma_Katalyst HLSOverview of Validation in Pharma_Katalyst HLS
Overview of Validation in Pharma_Katalyst HLSKatalyst HLS
 
Argus Analysis Tab Screen - Katalyst HLS
Argus Analysis Tab Screen - Katalyst HLSArgus Analysis Tab Screen - Katalyst HLS
Argus Analysis Tab Screen - Katalyst HLSKatalyst HLS
 
Argus Event Tab Screen - Katalyst HLS
Argus Event Tab Screen - Katalyst HLSArgus Event Tab Screen - Katalyst HLS
Argus Event Tab Screen - Katalyst HLSKatalyst HLS
 
Clinical Data Management Process Overview_Katalyst HLS
Clinical Data Management Process Overview_Katalyst HLSClinical Data Management Process Overview_Katalyst HLS
Clinical Data Management Process Overview_Katalyst HLSKatalyst HLS
 
Clinical data management process setup
Clinical data management process  setupClinical data management process  setup
Clinical data management process setupDr.K Pati
 

Destaque (19)

Clinical research and clinical data management - Ikya Global
Clinical research and clinical data management - Ikya GlobalClinical research and clinical data management - Ikya Global
Clinical research and clinical data management - Ikya Global
 
Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6
Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6
Flexible Study Design in Oracle Clinical and Remote Data Capture 4.6
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
Medical Informatics: Computational Analytics in Healthcare
Medical Informatics: Computational Analytics in HealthcareMedical Informatics: Computational Analytics in Healthcare
Medical Informatics: Computational Analytics in Healthcare
 
Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...
Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...
Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On...
 
Machine learning and big data
Machine learning and big dataMachine learning and big data
Machine learning and big data
 
Protocol Understanding_ Clinical Data Management_KatalystHLS
Protocol Understanding_ Clinical Data Management_KatalystHLSProtocol Understanding_ Clinical Data Management_KatalystHLS
Protocol Understanding_ Clinical Data Management_KatalystHLS
 
Clinical trial
Clinical trialClinical trial
Clinical trial
 
Argus Product Tab Screens - Katalyst HLS
Argus Product Tab Screens - Katalyst HLSArgus Product Tab Screens - Katalyst HLS
Argus Product Tab Screens - Katalyst HLS
 
Big Data and Clinical Research: Trends, Issues and Considerations
Big Data and Clinical Research: Trends, Issues and ConsiderationsBig Data and Clinical Research: Trends, Issues and Considerations
Big Data and Clinical Research: Trends, Issues and Considerations
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
 
Adverse Events and Serious Adverse Events - Katalyst HLS
Adverse Events and Serious Adverse Events - Katalyst HLSAdverse Events and Serious Adverse Events - Katalyst HLS
Adverse Events and Serious Adverse Events - Katalyst HLS
 
Overview of Validation in Pharma_Katalyst HLS
Overview of Validation in Pharma_Katalyst HLSOverview of Validation in Pharma_Katalyst HLS
Overview of Validation in Pharma_Katalyst HLS
 
Argus Analysis Tab Screen - Katalyst HLS
Argus Analysis Tab Screen - Katalyst HLSArgus Analysis Tab Screen - Katalyst HLS
Argus Analysis Tab Screen - Katalyst HLS
 
Argus Event Tab Screen - Katalyst HLS
Argus Event Tab Screen - Katalyst HLSArgus Event Tab Screen - Katalyst HLS
Argus Event Tab Screen - Katalyst HLS
 
Clinical Data Management Process Overview_Katalyst HLS
Clinical Data Management Process Overview_Katalyst HLSClinical Data Management Process Overview_Katalyst HLS
Clinical Data Management Process Overview_Katalyst HLS
 
Clinical data management process setup
Clinical data management process  setupClinical data management process  setup
Clinical data management process setup
 

Semelhante a Data day2017

Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownAdrian Cuyugan
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalystdwm042
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11Harish Ganesan
 
TensorFlow BASTA2018 Machinelearning
TensorFlow BASTA2018 MachinelearningTensorFlow BASTA2018 Machinelearning
TensorFlow BASTA2018 MachinelearningMax Kleiner
 
computer notes - Data Structures - 1
computer notes - Data Structures - 1computer notes - Data Structures - 1
computer notes - Data Structures - 1ecomputernotes
 
Good practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimizationGood practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimizationPrestaShop
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ssSatoshi Kume
 
Computer notes - data structures
Computer notes - data structuresComputer notes - data structures
Computer notes - data structuresecomputernotes
 
Object Oriented Programming in Matlab
Object Oriented Programming in Matlab Object Oriented Programming in Matlab
Object Oriented Programming in Matlab AlbanLevy
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challengesMarc Borowczak
 
MCL309_Deep Learning on a Raspberry Pi
MCL309_Deep Learning on a Raspberry PiMCL309_Deep Learning on a Raspberry Pi
MCL309_Deep Learning on a Raspberry PiAmazon Web Services
 
B2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draftB2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draftSteve Feldman
 
Exploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningExploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningYunyao Li
 
Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web: Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web: butest
 
Automated Unit Testing
Automated Unit TestingAutomated Unit Testing
Automated Unit TestingMike Lively
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine LearningNarong Intiruk
 

Semelhante a Data day2017 (20)

Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_Markdown
 
Raptor user manual3.0
Raptor user manual3.0Raptor user manual3.0
Raptor user manual3.0
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
Amazon cloud search comparison report
Amazon cloud search comparison reportAmazon cloud search comparison report
Amazon cloud search comparison report
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
 
TensorFlow BASTA2018 Machinelearning
TensorFlow BASTA2018 MachinelearningTensorFlow BASTA2018 Machinelearning
TensorFlow BASTA2018 Machinelearning
 
computer notes - Data Structures - 1
computer notes - Data Structures - 1computer notes - Data Structures - 1
computer notes - Data Structures - 1
 
Good practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimizationGood practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimization
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ss
 
Computer notes - data structures
Computer notes - data structuresComputer notes - data structures
Computer notes - data structures
 
Object Oriented Programming in Matlab
Object Oriented Programming in Matlab Object Oriented Programming in Matlab
Object Oriented Programming in Matlab
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challenges
 
MCL309_Deep Learning on a Raspberry Pi
MCL309_Deep Learning on a Raspberry PiMCL309_Deep Learning on a Raspberry Pi
MCL309_Deep Learning on a Raspberry Pi
 
B2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draftB2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draft
 
Exploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningExploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active Learning
 
Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web: Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web:
 
Automated Unit Testing
Automated Unit TestingAutomated Unit Testing
Automated Unit Testing
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine Learning
 

Mais de Sanghamitra Deb

Multi-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learningMulti-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learningSanghamitra Deb
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureSanghamitra Deb
 
Intro to NLP: Text Categorization and Topic Modeling
Intro to NLP: Text Categorization and Topic ModelingIntro to NLP: Text Categorization and Topic Modeling
Intro to NLP: Text Categorization and Topic ModelingSanghamitra Deb
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for BeginnersSanghamitra Deb
 
NLP Classifier Models & Metrics
NLP Classifier Models & MetricsNLP Classifier Models & Metrics
NLP Classifier Models & MetricsSanghamitra Deb
 
Developing Recommendation System to provide a Personalized Learning experienc...
Developing Recommendation System to provide a PersonalizedLearning experienc...Developing Recommendation System to provide a PersonalizedLearning experienc...
Developing Recommendation System to provide a Personalized Learning experienc...Sanghamitra Deb
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsSanghamitra Deb
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningSanghamitra Deb
 
NLP and Machine Learning for non-experts
NLP and Machine Learning for non-expertsNLP and Machine Learning for non-experts
NLP and Machine Learning for non-expertsSanghamitra Deb
 
Democratizing NLP content modeling with transfer learning using GPUs
Democratizing NLP content modeling with transfer learning using GPUsDemocratizing NLP content modeling with transfer learning using GPUs
Democratizing NLP content modeling with transfer learning using GPUsSanghamitra Deb
 
Natural Language Comprehension: Human Machine Collaboration.
Natural Language Comprehension: Human Machine Collaboration.Natural Language Comprehension: Human Machine Collaboration.
Natural Language Comprehension: Human Machine Collaboration.Sanghamitra Deb
 
Understanding Product Attributes from Reviews
Understanding Product Attributes from ReviewsUnderstanding Product Attributes from Reviews
Understanding Product Attributes from ReviewsSanghamitra Deb
 

Mais de Sanghamitra Deb (14)

odsc_2023.pdf
odsc_2023.pdfodsc_2023.pdf
odsc_2023.pdf
 
Multi-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learningMulti-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learning
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and Future
 
Intro to NLP: Text Categorization and Topic Modeling
Intro to NLP: Text Categorization and Topic ModelingIntro to NLP: Text Categorization and Topic Modeling
Intro to NLP: Text Categorization and Topic Modeling
 
Intro to ml_2021
Intro to ml_2021Intro to ml_2021
Intro to ml_2021
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for Beginners
 
NLP Classifier Models & Metrics
NLP Classifier Models & MetricsNLP Classifier Models & Metrics
NLP Classifier Models & Metrics
 
Developing Recommendation System to provide a Personalized Learning experienc...
Developing Recommendation System to provide a PersonalizedLearning experienc...Developing Recommendation System to provide a PersonalizedLearning experienc...
Developing Recommendation System to provide a Personalized Learning experienc...
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_experts
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
NLP and Machine Learning for non-experts
NLP and Machine Learning for non-expertsNLP and Machine Learning for non-experts
NLP and Machine Learning for non-experts
 
Democratizing NLP content modeling with transfer learning using GPUs
Democratizing NLP content modeling with transfer learning using GPUsDemocratizing NLP content modeling with transfer learning using GPUs
Democratizing NLP content modeling with transfer learning using GPUs
 
Natural Language Comprehension: Human Machine Collaboration.
Natural Language Comprehension: Human Machine Collaboration.Natural Language Comprehension: Human Machine Collaboration.
Natural Language Comprehension: Human Machine Collaboration.
 
Understanding Product Attributes from Reviews
Understanding Product Attributes from ReviewsUnderstanding Product Attributes from Reviews
Understanding Product Attributes from Reviews
 

Último

Low Rate Call Girls Udaipur {9xx000xx09} ❤️VVIP NISHA CCall Girls in Udaipur ...
Low Rate Call Girls Udaipur {9xx000xx09} ❤️VVIP NISHA CCall Girls in Udaipur ...Low Rate Call Girls Udaipur {9xx000xx09} ❤️VVIP NISHA CCall Girls in Udaipur ...
Low Rate Call Girls Udaipur {9xx000xx09} ❤️VVIP NISHA CCall Girls in Udaipur ...Sheetaleventcompany
 
Lucknow Call Girls Service ❤️🍑 9xx000xx09 👄🫦 Independent Escort Service Luckn...
Lucknow Call Girls Service ❤️🍑 9xx000xx09 👄🫦 Independent Escort Service Luckn...Lucknow Call Girls Service ❤️🍑 9xx000xx09 👄🫦 Independent Escort Service Luckn...
Lucknow Call Girls Service ❤️🍑 9xx000xx09 👄🫦 Independent Escort Service Luckn...Sheetaleventcompany
 
❤️Chandigarh Escorts☎️9814379184☎️ Call Girl service in Chandigarh☎️ Chandiga...
❤️Chandigarh Escorts☎️9814379184☎️ Call Girl service in Chandigarh☎️ Chandiga...❤️Chandigarh Escorts☎️9814379184☎️ Call Girl service in Chandigarh☎️ Chandiga...
❤️Chandigarh Escorts☎️9814379184☎️ Call Girl service in Chandigarh☎️ Chandiga...Sheetaleventcompany
 
💞 Safe And Secure Call Girls Prayagraj 🧿 9332606886 🧿 High Class Call Girl Se...
💞 Safe And Secure Call Girls Prayagraj 🧿 9332606886 🧿 High Class Call Girl Se...💞 Safe And Secure Call Girls Prayagraj 🧿 9332606886 🧿 High Class Call Girl Se...
💞 Safe And Secure Call Girls Prayagraj 🧿 9332606886 🧿 High Class Call Girl Se...India Call Girls
 
❤️ Chandigarh Call Girls Service☎️9878799926☎️ Call Girl service in Chandigar...
❤️ Chandigarh Call Girls Service☎️9878799926☎️ Call Girl service in Chandigar...❤️ Chandigarh Call Girls Service☎️9878799926☎️ Call Girl service in Chandigar...
❤️ Chandigarh Call Girls Service☎️9878799926☎️ Call Girl service in Chandigar...daljeetkaur2026
 
💚Chandigarh Call Girls Service 💯Jiya 📲🔝8868886958🔝Call Girls In Chandigarh No...
💚Chandigarh Call Girls Service 💯Jiya 📲🔝8868886958🔝Call Girls In Chandigarh No...💚Chandigarh Call Girls Service 💯Jiya 📲🔝8868886958🔝Call Girls In Chandigarh No...
💚Chandigarh Call Girls Service 💯Jiya 📲🔝8868886958🔝Call Girls In Chandigarh No...Sheetaleventcompany
 
Erotic Call Girls Bangalore {7304373326} ❤️VVIP SIYA Call Girls in Bangalore ...
Erotic Call Girls Bangalore {7304373326} ❤️VVIP SIYA Call Girls in Bangalore ...Erotic Call Girls Bangalore {7304373326} ❤️VVIP SIYA Call Girls in Bangalore ...
Erotic Call Girls Bangalore {7304373326} ❤️VVIP SIYA Call Girls in Bangalore ...Sheetaleventcompany
 
❤️Chandigarh Escort Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ C...
❤️Chandigarh Escort Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ C...❤️Chandigarh Escort Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ C...
❤️Chandigarh Escort Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ C...Sheetaleventcompany
 
Data day2017

  • 1. Creating Knowledge bases from text in absence of training data. Sanghamitra Deb Accenture Technology Laboratory Phil Rogers, Jana Thompson, Hans Li
  • 3. The generalized approach of extracting text: Parsing. Steps: Tokenization, Normalization, Parsing, Lemmatization. Tokenization: separating sentences and words, removing special characters, phrase detection. Normalization: lowercasing words, word-sense disambiguation. Parsing: detecting parts of speech (nouns, verbs, etc.). Lemmatization: reducing plurals and other word forms to a single word (found in the dictionary).
  • 4. The generalized approach of extracting text: ML. Extract sentences that contain the specific attribute. POS tag and extract unigrams, bigrams and trigrams centered on nouns. Extract features: words around nouns (bag of words/word vectors), position of the noun and length of sentence. Map training data to create a balanced positive and negative training set. Train a Machine Learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.
  • 5. The generalized approach of extracting text: ML. Extract sentences that contain the specific attribute. POS tag and extract unigrams, bigrams and trigrams centered on nouns. Extract features: words around nouns (bag of words/word vectors), position of the noun and length of sentence. Map training data to create a balanced positive and negative training set. Train a Machine Learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship. How do we generate this training data?
  • 6. A different approach (Stanford): replaces training data by encoding domain knowledge.
  • 7. The Snorkel approach of entity extraction. Extract sentences that contain the specific attribute. POS tag and extract unigrams, bigrams and trigrams centered on nouns. Write rules: encode your domain knowledge into rules. Validate rules: coverage, conflicts, accuracy. Run learning: logistic regression, LSTM, … Examine a random set of candidates and create new rules. Observe the lowest-accuracy (highest-conflict) rules and edit them. Iterate.
  • 8. Training Data | Rules (figure: Planetary Orbits)
  • 9. How does Snorkel work without training data? Write rules: encode your domain knowledge into rules. The rules are modeled as a Naive Bayes model, which assumes that the rules are conditionally independent. Even though most of the time this is not true, in practice it generates a pretty good training set with probabilities of being in either class. These probabilities are fed into a Machine Learning algorithm (Logistic Regression in the simplest case) to create a model used to make future predictions. http://arxiv.org/pdf/1512.06474v2.pdf
  • 10. Data Dive: FDA Drug Labels
  • 11. Data Dive: FDA Drug Labels. Example sentences: "It is indicated for treating respiratory disorder caused due to allergy." "For the relief of symptoms of depression." "Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:" "When oral therapy is not feasible and the strength, dosage form, and route of administration of the drug reasonably lend the preparation to the treatment of the condition"
  • 12. Candidate Extraction. Using domain knowledge and language structure, collect a candidate set with high recall and low precision. Typically this set should have 80% recall and 20% precision. Annotations on the examples: "60% accuracy: too specific, need to make it more general"; "30% accuracy: this looks fine".
  • 16. Generation of training data: one rule (figure: counts per label -1, 0, 1)
  • 17. Generation of training data: two rules (figure: counts per label -1, 0, 1)
  • 18. Generation of training data: three rules (figure: counts per label -1, 0, 1)
  • 19. Generation of training data: four rules (figure: counts per label -1, 0, 1)
  • 20. Generation of training data: 20 rules (figure: counts per label -1, 0, 1)
  • 21. Results and performance. Precision and recall ~90%.
      drug-name         | disease candidate | Candidates | snorkel
      Lithium Carbonate | bipolar disorder  | 1          | 1
      Lithium Carbonate | individual        | 1          | 0
      Lithium Carbonate | maintenance       | 1          | 0
      Lithium Carbonate | manic episode     | 1          | 1
  • 22. Evolution of F1-score with sample size
  • 23. Relationship extraction • Is person X married to person Y? • Does drug X cure disease Y? • Does software X (example: snorkel) run on programming language Y (example: python3)? Define filters for candidate extraction for a pair (X, Y), for example: (snorkel, python2.7), (snorkel, python3.1), … Once you have the pairs, examine them using the annotation tool. Write rules -> observe their performance against annotated data. Iterate.
  • 24. Crowdsourced training data. In some cases training data is generated on the same dataset by multiple people. In Snorkel each source can be incorporated as a separate rule function. The model for the rules figures out the relative weight for each person and creates cleaner training data.
  • 25. Why Docker? Lightweight Python stack on Docker. • Portability: develop here, run there (internal clusters, AWS, Google Cloud, etc.); reusable by team and clients • Isolation: OS and Docker isolated from bugs • Fast • Easy virtualization: hardware emulation, virtualized OS
  • 26. Building the Dockerfile
      Dockerfile:
        FROM ubuntu:latest
        # MAINTAINER Sanghamitra Deb <sangha123@gmail.com>
        CMD echo Installing Accenture Tech Labs Scientific Python Enviro
        # Refresh the package index before any install
        RUN apt-get update && apt-get upgrade -y
        RUN apt-get install python -y
        RUN apt-get install curl -y
        RUN apt-get install emacs -y
        # Bootstrap pip
        RUN curl -O https://bootstrap.pypa.io/get-pip.py
        RUN python get-pip.py
        RUN rm get-pip.py
        RUN echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc
        # Build dependencies for the scientific Python packages
        RUN apt-get install python-setuptools build-essential python-dev -y
        RUN apt-get install gfortran swig -y
        RUN apt-get install libatlas-dev liblapack-dev -y
        RUN apt-get install libfreetype6 libfreetype6-dev -y
        RUN apt-get install libxft-dev -y
        RUN apt-get install libxml2-dev libxslt-dev zlib1g-dev -y
        RUN apt-get install python-numpy -y
        ADD requirements.txt /tmp/requirements.txt
        RUN pip install -r /tmp/requirements.txt -q
      requirements.txt (one package per line): scipy, matplotlib, ipython, jupyter, pandas, Bottleneck, patsy, pymc, statsmodels, scikit-learn, BeautifulSoup, seaborn, gensim, fuzzywuzzy, xmltodict, untangle, nltk, flask, enum34
      Build and run:
        docker build -t sangha/python .
        docker run -it -p 1108:1108 -p 1106:1106 --name pharmaExtraction0.1 -v /location/in/hadoop/ sangha/python bash
        docker exec -it pharmaExtraction0.1 bash
        docker exec -d pharmaExtraction0.1 python /root/pycodes/rest_api.py
  • 27. Typical ML pipeline vs Snorkel: (1) candidate extraction, (2) rule functions, (3) hyperparameter tuning.
  • 28. Snorkel. Pros: • Very little training data necessary • Do not have to think about feature generation • Do not need deep knowledge of Machine Learning • Convenient UI for data annotation • Creates structured databases from unstructured text. Cons: • Code is new, so it may not be robust to all situations • Doing online prediction is difficult • Not much transparency in the internal workings
  • 29. Applicability across a variety of industries and use cases: Banks (Loan Approval), Paleontology, Design of Clinical Trials, Legal Investigation, Market Research Reports, Human Trafficking, Skills extraction from resume, Content Marketing, Product descriptions and reviews, Pharmaceutical Industry
  • 30. Where to get it? https://github.com/HazyResearch/snorkel http://arxiv.org/pdf/1512.06474v2.pdf
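
The rule-combination idea on slide 9 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Snorkel's API or the authors' code: the three rule functions, their regular expressions, and the fixed per-rule accuracies are all hypothetical (Snorkel estimates rule accuracies itself rather than taking them as given). Each rule votes +1, -1, or 0 (abstain) on a candidate, and the votes are combined under the conditional-independence (Naive Bayes) assumption into a probability that the candidate is a treated disease.

```python
import math
import re

# Hypothetical rule functions. Each returns +1 (drug treats candidate),
# -1 (it does not), or 0 (abstain).
def rule_indication_phrase(sentence, candidate):
    pattern = r"\b(indicated for|relief of|treatment of)\b"
    return 1 if re.search(pattern, sentence, re.I) else 0

def rule_generic_noun(sentence, candidate):
    return -1 if candidate.lower() in {"individual", "maintenance", "patients"} else 0

def rule_disease_suffix(sentence, candidate):
    return 1 if re.search(r"(disorder|depression|episode)$", candidate, re.I) else 0

# Assumed accuracies (probability that a non-abstaining vote is correct).
RULES = [
    (rule_indication_phrase, 0.7),
    (rule_generic_noun, 0.9),
    (rule_disease_suffix, 0.8),
]

def probabilistic_label(sentence, candidate, rules=RULES):
    """Combine rule votes into P(label = +1), assuming a uniform prior
    and conditionally independent rules."""
    log_odds = 0.0
    for rule, accuracy in rules:
        vote = rule(sentence, candidate)
        # An abstain (vote == 0) leaves the log-odds unchanged.
        log_odds += vote * math.log(accuracy / (1.0 - accuracy))
    return 1.0 / (1.0 + math.exp(-log_odds))

if __name__ == "__main__":
    text = ("Lithium Carbonate is indicated for the treatment of "
            "manic episode and bipolar disorder.")
    for cand in ["bipolar disorder", "manic episode", "maintenance"]:
        print(cand, round(probabilistic_label(text, cand), 3))
```

The resulting probabilities play the role of the soft training labels that slide 9 describes feeding into logistic regression; a candidate on which every rule abstains gets 0.5, meaning it carries no training signal.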