A major part of the Big Data collected in most industries is unstructured text. Examples include log files in the IT sector, analyst reports in the finance sector, patents, laboratory notes, and papers. Two of the main challenges in gaining insight from unstructured text are converting it into structured information and generating training sets for machine learning. Training sets for supervised learning are typically generated through human annotation; for text, this means subject matter experts (SMEs) reading several thousand to a million lines. This is very expensive, and the experts may not always be available, so it is important to solve the problem of generating training sets before attempting to build machine learning models. Our approach combines rule-based techniques with small amounts of SME time to bypass the time-consuming manual creation of training data. Once we have a good set of rules that mimic the training data, we use them to create knowledge bases out of the structured data. These knowledge bases can then be queried to gain insight into the domain. I have applied this technique to several domains, including drug labels and medical journals, log data generated through customer interaction, and the generation of market research reports. I will talk about the results in some of these domains and the advantages of this approach.
3. The Generalized approach of extracting text: Parsing
Tokenization, Normalization, Parsing, Lemmatization
Tokenization: separating sentences and words, removing special characters, detecting phrases.
Normalization: lowercasing words, word-sense disambiguation.
Parsing: detecting parts of speech (nouns, verbs, etc.).
Lemmatization: reducing plurals and other word forms to a single word (found in the dictionary).
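The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the regular expressions, the function names, and the tiny `LEMMAS` dictionary are hypothetical stand-ins for a real tokenizer and lexicon, and the parsing (part-of-speech) step is left out because it needs a trained tagger.

```python
import re

# Toy lemma dictionary (hypothetical); a real pipeline would use a full lexicon.
LEMMAS = {
    "patients": "patient",
    "studies": "study",
    "enrolled": "enroll",
    "seizures": "seizure",
}


def split_sentences(text):
    """Tokenization, part 1: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Tokenization, part 2: keep alphanumeric words (inner hyphens allowed),
    which drops punctuation and other special characters."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)


def normalize(tokens):
    """Normalization: lowercase every token."""
    return [t.lower() for t in tokens]


def lemmatize(tokens):
    """Lemmatization: map known word forms to their dictionary form."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Run the steps in order and return one token list per sentence."""
    return [lemmatize(normalize(tokenize(s))) for s in split_sentences(text)]


if __name__ == "__main__":
    print(preprocess("Studies enrolled 20 patients. Seizures stopped!"))
```

Words missing from the toy dictionary pass through unchanged, which is why a real lemmatizer with full vocabulary coverage would replace `LEMMAS` in practice.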
4. The Generalized approach of extracting text: ML
Extract sentences that contain the specific attribute.
POS tag and extract unigrams, bigrams and trigrams centered on nouns.
Extract features: words around nouns (bag of words/word vectors), position of the noun, and length of sentence.
Train a Machine Learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.
Map training data to create a balanced positive and negative training set.
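A minimal sketch of the n-gram, feature, and balancing steps, assuming the sentence has already been POS-tagged into `(word, tag)` pairs with Penn Treebank style tags (noun tags start with `NN`). The function names, the context window of two words, and the downsampling strategy are illustrative assumptions; model training itself is omitted.

```python
import random
from collections import Counter


def ngrams_around_nouns(tagged, max_n=3):
    """Return (start, end) spans of every 1- to max_n-gram that contains a noun."""
    spans = set()
    for i, (_, tag) in enumerate(tagged):
        if not tag.startswith("NN"):
            continue
        for n in range(1, max_n + 1):
            # Slide a window of size n over every position that covers token i.
            for start in range(i - n + 1, i + 1):
                end = start + n
                if start >= 0 and end <= len(tagged):
                    spans.add((start, end))
    return sorted(spans)


def features(tagged, span, window=2):
    """Features for one candidate n-gram: surrounding bag of words,
    position in the sentence, and sentence length."""
    words = [w.lower() for w, _ in tagged]
    start, end = span
    context = words[max(0, start - window):start] + words[end:end + window]
    return {
        "ngram": " ".join(words[start:end]),
        "context_bow": Counter(context),
        "position": start,
        "sentence_length": len(words),
    }


def balance(positives, negatives, seed=0):
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return rng.sample(positives, k), rng.sample(negatives, k)


if __name__ == "__main__":
    tagged = [("Ritalin", "NNP"), ("treats", "VBZ"), ("ADHD", "NNP")]
    for span in ngrams_around_nouns(tagged):
        print(features(tagged, span))
```

The feature dictionaries would still need to be vectorized before being passed to a classifier, and every candidate needs a positive or negative label, which is exactly the training data problem raised next.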
How do we generate this training data?
7. The Snorkel approach of Entity Extraction
Extract sentences that contain the specific attribute.
POS tag and extract unigrams, bigrams and trigrams centered on nouns.
Write Rules: encode your domain knowledge into rules.
Validate Rules: coverage, conflicts, accuracy.
Run learning: logistic regression, LSTM, …
Examine a random set of candidates and create new rules.
Observe the lowest accuracy (highest conflict) and edit rules.
Iterate.
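The write/validate loop can be illustrated without any library. The sketch below is not Snorkel's API: the rule functions, the vote encoding (`1`, `-1`, `0`), the example sentences, and the hand-assigned `gold` labels are all hypothetical, and a simple majority vote stands in for the generative model that Snorkel fits over the rules.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


# Rules: each one encodes a piece of domain knowledge about the
# drug-treats-disease relationship and may abstain.
def lf_indicated_for(sentence):
    return POSITIVE if "indicated for" in sentence.lower() else ABSTAIN


def lf_relief_of(sentence):
    return POSITIVE if "relief of" in sentence.lower() else ABSTAIN


def lf_due_to(sentence):
    # "due to" often signals a cause or side effect, not a treatment.
    return NEGATIVE if "due to" in sentence.lower() else ABSTAIN


def apply_lfs(lfs, sentences):
    """Label matrix: one row per sentence, one column per rule."""
    return [[lf(s) for lf in lfs] for s in sentences]


def coverage(matrix, j):
    """Fraction of candidates on which rule j votes."""
    return sum(row[j] != ABSTAIN for row in matrix) / len(matrix)


def conflict(matrix, j):
    """Fraction of candidates where rule j votes and another rule disagrees."""
    def clashes(row):
        return row[j] != ABSTAIN and any(v not in (ABSTAIN, row[j]) for v in row)
    return sum(clashes(row) for row in matrix) / len(matrix)


def accuracy(matrix, j, gold):
    """Accuracy of rule j on the candidates where it votes (None if it never votes)."""
    votes = [(row[j], g) for row, g in zip(matrix, gold) if row[j] != ABSTAIN]
    return sum(v == g for v, g in votes) / len(votes) if votes else None


def majority_vote(row):
    total = sum(row)
    return POSITIVE if total > 0 else NEGATIVE if total < 0 else ABSTAIN


sentences = [
    "It is indicated for treating respiratory disorder caused due to allergy.",
    "For the relief of symptoms of depression.",
    "Pancreatitis due to the use of ritalin.",
    "The boy was generally healthy.",
]
gold = [POSITIVE, POSITIVE, NEGATIVE, NEGATIVE]  # small hand-labelled check set (hypothetical)
lfs = [lf_indicated_for, lf_relief_of, lf_due_to]
matrix = apply_lfs(lfs, sentences)

if __name__ == "__main__":
    for j, lf in enumerate(lfs):
        print(lf.__name__, coverage(matrix, j), conflict(matrix, j), accuracy(matrix, j, gold))
    print([majority_vote(row) for row in matrix])
```

In this toy run `lf_due_to` fires on the first sentence, where "caused due to allergy" describes the disorder rather than a side effect, so it is the rule with the lowest accuracy and the one to edit in the next iteration.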
9. Data Dive: FDA Drug Labels
It is indicated for treating respiratory disorder caused due to allergy.
For the relief of symptoms of depression.
Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:
When oral therapy is not feasible and the strength , dosage form , and route of administration of the drug reasonably lend the preparation to the treatment of the condition
10. Data Dive: Clinical Trials Data
We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd).
The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily.
We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case.
Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.
11. Final Goal: Entity and relationship Extraction
Data                  Dosage  Drug  Treats Disease  Side Effects  Age  Gender  Ethnicity  Duration
10-year-old           0       0     0               0             1    0       0          0
pancreatitis-ritalin  0       0     0               1             0    0       0          0
adhd-ritalin          0       0     1               0             0    0       0          0
ritalin               0       1     0               0             0    0       0          0
30 mg                 1       0     0               0             0    0       0          0
past three weeks      0       0     0               0             0    0       0          1
boy                   0       0     0               0             0    1       0          0
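The table is a one-hot encoding of each extracted mention against the entity and relationship types. A short sketch of how such rows could be produced, assuming the extraction step has already assigned one label per mention (the `extracted` dictionary and the `one_hot` helper are illustrative names, not part of any library):

```python
# Column order follows the table above.
LABELS = [
    "Dosage", "Drug", "Treats Disease", "Side Effects",
    "Age", "Gender", "Ethnicity", "Duration",
]

# Mention -> label, as read off the clinical trial example.
extracted = {
    "10-year-old": "Age",
    "pancreatitis-ritalin": "Side Effects",
    "adhd-ritalin": "Treats Disease",
    "ritalin": "Drug",
    "30 mg": "Dosage",
    "past three weeks": "Duration",
    "boy": "Gender",
}


def one_hot(label):
    """Return a 0/1 row with a single 1 in the column of the given label."""
    return [1 if name == label else 0 for name in LABELS]


table = {mention: one_hot(label) for mention, label in extracted.items()}

if __name__ == "__main__":
    for mention, row in table.items():
        print(f"{mention:<22}", *row)
```

Relationship rows such as `adhd-ritalin` pair two entities under a single label, so the same structure covers both entities and relationships.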
12. Candidate Extraction
Using domain knowledge and language structure, collect a candidate set with high recall and low precision. Typically this set should have 80% recall and 20% precision.
60% accuracy: too specific, needs to be made more general.
30% accuracy: this looks fine.
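A small sketch of what a deliberately loose candidate rule and its evaluation might look like for dosage mentions. The regular expression, the sample text (adapted from the clinical example), and the one-item `gold` set are assumptions made for illustration; the point is that the rule over-generates and the later steps filter.

```python
import re

# Loose rule: any number followed by a lowercase word is a dosage candidate.
# It is meant to miss little (high recall) and will pick up durations and
# counts as well (low precision).
DOSAGE_CANDIDATE = re.compile(r"\b\d+\s+[a-z]+")


def extract_candidates(text):
    return DOSAGE_CANDIDATE.findall(text.lower())


def precision_recall(candidates, gold):
    """Precision and recall of a candidate set against hand-labelled mentions."""
    candidates, gold = set(candidates), set(gold)
    true_positives = len(candidates & gold)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


text = (
    "A 10-year-old boy had pancreatitis three times in two months within "
    "3 weeks after starting ritalin at a dose of 30 mg daily"
)
gold = {"30 mg"}  # the only true dosage mention in the text (hand-labelled)
candidates = extract_candidates(text)

if __name__ == "__main__":
    print(candidates)
    print(precision_recall(candidates, gold))
```

Here the rule also returns the duration "3 weeks", which lowers precision but not recall; tightening the pattern to known dosage units would raise precision at the risk of missing unusual units.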
…………………………………………………………………………………………………………………………………………………………………….
…………………………………………………………………………………………………………………………………………………………………….
18. Why Docker?
• Portability: develop here, run there (internal clusters, AWS, Google Cloud, etc.); reusable by team and clients.
• Isolation: OS and Docker isolated from bugs.
• Fast
• Easy virtualization: hardware emulation, virtualized OS.
• Lightweight
Python stack on Docker
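19. Dockerfile
FROM ubuntu:latest
# MAINTAINER Sanghamitra Deb <sangha123@gmail.com>
# Package names below assume an Ubuntu release that still ships Python 2.
# CMD only sets the container's default command; it prints nothing at build time.
CMD echo Installing Accenture Tech Labs Scientific Python Enviro
# Refresh the package index before any install, otherwise apt-get cannot find packages.
RUN apt-get update && apt-get upgrade -y
RUN apt-get install python -y
RUN apt-get install curl -y
RUN apt-get install emacs -y
# Install pip via the bootstrap script.
RUN curl -O https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN rm get-pip.py
RUN echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc
# Build tools and native libraries needed to compile the scientific Python packages.
RUN apt-get install python-setuptools build-essential python-dev -y
RUN apt-get install gfortran swig -y
RUN apt-get install libatlas-dev liblapack-dev -y
RUN apt-get install libfreetype6 libfreetype6-dev -y
RUN apt-get install libxft-dev -y
# -y answers apt-get's confirmation prompt, which would otherwise abort a non-interactive build.
RUN apt-get install libxml2-dev libxslt-dev zlib1g-dev -y
RUN apt-get install python-numpy -y
ADD requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt -q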
requirements.txt
scipy
matplotlib
ipython
jupyter
pandas
Bottleneck
patsy
pymc
statsmodels
scikit-learn
BeautifulSoup
seaborn
gensim
fuzzywuzzy
xmltodict
untangle
nltk
flask
enum34
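Building the Dockerfile
# Build the image from the Dockerfile in the current directory.
docker build -t sangha/python .
# Start an interactive container with ports 1108 and 1106 published and a volume at /location/in/hadoop/.
docker run -it -p 1108:1108 -p 1106:1106 --name pharmaExtraction0.1 -v /location/in/hadoop/ sangha/python bash
# Open another shell in the running container.
docker exec -it pharmaExtraction0.1 bash
# Start the REST API in the background inside the container.
docker exec -d pharmaExtraction0.1 python /root/pycodes/rest_api.py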
20. Typical ML pipeline vs Snorkel
(1) Candidate Extraction.
(2) Rule Function
(3) Hyperparameter tuning
21. Snorkel:
Pros:
• Very little training data necessary
• Do not have to think about feature generation
• Do not need deep knowledge of Machine Learning
• Convenient UI for data annotation
• Creates structured databases from unstructured text
Cons:
• Code is being refactored rapidly and frequently.
• Not much transparency in the internal workings.
22. Applicability across a variety of industries and use cases
Banks: Loan Approval
Paleontology
Design of Clinical Trials
Legal Investigation
Market Research Reports
Human Trafficking
Inventory Management
Content Marketing
Product descriptions and reviews
Pharmaceutical Industry
23. Where to get it?
https://github.com/HazyResearch/snorkel
http://arxiv.org/pdf/1512.06474v2.pdf