The document discusses natural language processing and machine learning techniques including sentiment analysis, automated essay scoring, content summarization, chatbots, information retrieval, cluster analysis, language neural networks, and language translation. It provides examples and links to resources on topics such as word embeddings, one-hot encoding, the curse of dimensionality, neural networks, and building chatbots. Key points include designing applications to tolerate imperfect model accuracy and recognizing that without data, no machine learning is possible.
3. NLP
ML
Sentiment analysis
Automated essay scoring
Content summarization
Chatbots
Information retrieval
Cluster analysis
Language neural networks
Language translation
AI Big Data
6. NLP
ML
Sentiment analysis
Automated essay scoring
Content summarization
Chatbots
Information retrieval
Cluster analysis
Language neural networks
Document categorization
AI Big Data
8. Meta-analysis of studies: Burns, G. A., Feng, D., & Hovy, E. (2008). Intelligent approaches to mining the primary research literature: techniques, systems, and examples. In Computational Intelligence in Medical Informatics (pp. 17-50). Springer, Berlin, Heidelberg. Retrieved from: http://www.academia.edu/download/30797420/burns_feng_hovy_comp_intel-final.pdf
12. Statistical word embeddings: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119). At: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Cited by over 9,804 papers according to Google Scholar, as of 10/22/2018.
Based on statistical relationships between words: https://www.coursera.org/lecture/intro-to-deep-learning/word-embeddings-dhzl5
17. So what are bigrams?
Examples of less useful bigrams:
of the
what is
they are
to the
way to
hey you
Examples of useful bigrams:
New York
West Virginia
Imagine Learning
Imagine Math
Microsoft Office
Neural network
Ping pong
20. Top 10 bigrams:
1. need help
2. back need
3. nice day
4. help nice
5. click back
6. please come
7. hear voice
8. type please
9. problem ask
10. ask find
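To make the bigram lists above concrete, here is a minimal sketch of extracting and ranking bigrams with NLTK's collocation tools; the sample utterances are made up for illustration and are not the data behind the lists above.

# Minimal bigram-extraction sketch with NLTK (hypothetical sample utterances).
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

utterances = [
    "i need help with my account",
    "please come back i need help",
    "have a nice day",
]

# Simple whitespace tokenization keeps the example self-contained.
tokens = [word for utterance in utterances for word in utterance.split()]

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()

# Rank bigrams by raw frequency; PMI or likelihood ratio are common alternatives
# for surfacing "useful" bigrams such as named entities.
for bigram, score in finder.score_ngrams(measures.raw_freq)[:10]:
    print(bigram, round(score, 3))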
24. Chatbot:
• Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. At: https://arxiv.org/pdf/1508.04025
• Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. At: https://arxiv.org/pdf/1409.0473
27. • Validation score improved!...
• The problem is that validation scores have different meanings once the model itself changes.
31. • Key point: Ensure that your application is designed to tolerate imperfect model accuracy.
32. Neural networks: very good at detecting patterns, but they don't always beat less complex ML models (e.g. Naïve Bayes, XGBoost).
The data volume paradigm: most common cases
https://blog.easysol.net/building-ai-applications/
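As a hedged illustration of that point, the sketch below trains a simple Naive Bayes baseline with scikit-learn; the texts and labels are placeholders, and the idea is simply that any more complex (e.g. neural) model should be compared against a baseline like this before assuming it is worth the extra cost.

# Minimal Naive Bayes text-classification baseline (placeholder data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "i need help with my password",
    "my account is locked please help",
    "help i cannot log in",
    "have a nice day",
    "thanks everything works now",
    "please come back later",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = support request, 0 = other (illustrative only)

baseline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
scores = cross_val_score(baseline, texts, labels, cv=3)
print("baseline accuracy:", scores.mean())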
54. Clues that you have an organizational or
architectural problem:
55. Excuse #1: But all of our developers are so constantly busy that we will never get around to making those changes!
Implication: But we have so much technical debt, we spend all of our time putting out fires!
Image cropped from: https://www.flickr.com/photos/41284017@N08/9599182665
From: http://gis.nwcg.gov/gist_2004/logos/federal_logos.html
56. Excuse #2: We have all of the data that we need!
Implication: We are so unwilling to take a look at the reality of our problem that we have no idea how bad it really is.
57. Excuse #3: It’s really not that important. We have higher priorities.
Implication: We think we’re so right 100% of the time that no data could possibly ever tell us that we’re wrong. Or, we don’t make mistakes (only our developers do).
https://www.recruiter.com/i/does-a-worker%E2%80%99s-personal-life-affect-your-brand/fingers-pointing-blame-to-man/
58. Excuse #4: We make our decisions based on our instincts and gut feelings.
Implication: We’re so unwilling to have our assumptions challenged that we don’t want to think about the idea that additional data could make our instincts even better.
https://medium.com/@vaidoshia/building-my-own-design-gut-instinct-f7f773d6d608
59. Excuse #5: That’s nice, but that doesn’t apply to us.
Implication: I live in my own little world where truth doesn’t apply to me.
https://www.deviantart.com/bluejennybird/art/my-own-planet-159966933
60. Excuse #6: That would be too expensive.
Implication: We’re at least 5 years behind on what big data technologies and cloud services can offer.
What’s a serverless function? What’s an event stream?
[picture of a person getting rained on by a cloud]
http://i.telegraph.co.uk/multimedia/archive/01244/appleimac1984_1244597i.jpg
61. Excuse #7: We don’t have time for that.
Implication: We’re so busy chasing the carrot in front of our faces that we probably won’t notice if our competitors knock us out of the market until it’s too late.
https://www.derekhuether.com/blog/2010/11/12/chasing-the-carrot
https://forum.slowtwitch.com/forum/Slowtwitch_Forums_C1/Triathlon_Forum_F1/What%27s_the_average_first_year_out_of_pocket%3F_P5797700/
62. Excuse #8: We need to make use of our existing technologies.
Implication: We can’t bear the thought that we have been wasting our investments in outdated technologies. Or, we don’t think this effort is important enough to justify our investment. (See excuses 1-7.)
63. Excuse #9: It would be too hard to maintain.
Implication: I don’t know what “serverless” means. Is that part of “The Cloud”?
https://www.thoughtco.com/types-of-clouds-recognize-in-the-sky-4025569
65. Python libraries for exploring word embeddings include:
• Gensim: https://radimrehurek.com/gensim/tutorial.html
• SpaCy: https://spacy.io/usage/spacy-101
• NLTK: https://www.nltk.org
• CoreNLP: https://stanfordnlp.github.io/CoreNLP/
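As one way to start with the libraries above, here is a minimal Gensim sketch that trains a tiny Word2Vec model and queries it; the toy corpus is a placeholder, and real embeddings require far more text (or pretrained vectors).

# Minimal word-embedding sketch with Gensim (toy corpus; parameter names follow Gensim 4.x).
from gensim.models import Word2Vec

sentences = [
    ["students", "need", "help", "with", "math"],
    ["imagine", "math", "helps", "students", "learn"],
    ["neural", "networks", "learn", "word", "embeddings"],
    ["word", "embeddings", "capture", "word", "context"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100, seed=1)

print(model.wv["math"].shape)                 # one dense 50-dimensional vector per word
print(model.wv.most_similar("math", topn=3))  # nearest neighbours in embedding space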
Editor's Notes
Unfortunately, when people think of big data, they often think of this: Massive amounts of data.
But the reality is that big data is everywhere. Everything that can potentially collect data should be considered. Data can still be considered Big Data if the variety is high, such as if many different data sources are involved.
Considering that big data is all inclusive, where then does NLP fit into this landscape?
Natural language processing (NLP) can be used to extract features from human language. Our goal is usually to gain deeper insight into what is actually being said by using a computational approach that allows us to detect patterns or gain insights in an automated manner.
What are
Extracted terms can be mapped to domain-specific ontologies. An ontology is like a word map. Ontologies can be industry specific or can be broad. Either way, they allow us to attach additional meaning to our original data. In Big Data, we call this enrichment.
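A minimal sketch of that enrichment step, assuming a hypothetical domain ontology held as a simple lookup table (the terms and metadata fields are illustrative only):

# Hypothetical enrichment: attach ontology metadata to extracted terms.
ontology = {
    "fractions": {"subject": "math", "grade_band": "3-5"},
    "photosynthesis": {"subject": "science", "grade_band": "6-8"},
}

extracted_terms = ["fractions", "photosynthesis", "recess"]

enriched = [
    {"term": term, **ontology.get(term, {"subject": "unknown", "grade_band": "unknown"})}
    for term in extracted_terms
]
print(enriched)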
It is common to use what are called one-hot word vectors to represent the words in the data. They are very commonly used with neural network models, such as the models used for Neural Machine Translation (NMT).
Unfortunately, this can result in what we call The Curse of Dimensionality. This is a problem that results from the high number of dimensions that are represented by modeling languages. For example, for neural machine translation (NMT) models used to translate languages, it is common to have millions or even billions of dimensions, depending on the size of the dictionary used.
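A minimal sketch of one-hot word vectors, using a toy vocabulary to show why the dimensionality grows with the size of the dictionary:

# One-hot word vectors: one dimension per vocabulary word (toy vocabulary).
import numpy as np

vocabulary = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vector = np.zeros(len(vocabulary))
    vector[index[word]] = 1.0
    return vector

print(one_hot("cat"))        # [0. 1. 0. 0. 0.]
print(one_hot("cat").size)   # dimensionality equals the vocabulary size,
                             # so a large dictionary means huge, sparse vectors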
A very influential method was developed in 2013 by some very bright researchers who discovered a dimensionality reduction technique that creates what we call “word embeddings.” These embeddings represent statistical relationships between words and the words they frequently co-occur next to. This method allows us to replace one-hot vectors with millions of context-free dimensions by dense vectors of only a few hundred dimensions that carry very rich context.
As a consequence, a word embedding is a compact vector-space representation of a word produced by this dimensionality reduction.
Because the model is a linear space, it allows us to represent relationships like this:
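A minimal sketch of that kind of linear relationship, using pretrained GloVe vectors loaded through Gensim's downloader (the model name and the king/queen example are illustrative, not taken from the slides):

# Analogy arithmetic in embedding space: king - man + woman lands near queen.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ~0.77)]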
The linear features of word embeddings are particularly useful for building neural network models for languages.
The latest version of SwiftKey uses a neural network to predict text to accelerate typing on a mobile device.
Bigrams are pairs of words that co-occur in a dataset. Bigrams are most useful when the combined pair carries a distinct meaning of its own.
Any useful bigrams?… (Ignore the b character at the start of the string.)
Here are some good libraries for experimenting with word embeddings and natural language processing.