<Little Big Data #1> Machine Learning with Korean Chat Data
summatic@scatterlab.co.kr
1
• 

(@ , 2016. 1~)

• 

(2016. 8~)

• 

(2018. 5~)
!2
:
:
:• 

• (?) 

• B

•
.

•
.

•
.

•
.

• 

• ( , id )
.

• ( , , )
.
!3
• Intro

• 

• 

• 

• 

• 

• Preprocessing

• Word Embedding

• Document Similarity

•
!4
Intro
• 

• 

• “ ” -> “ " -> “ ” .

• “ ” .

• .

• .

• 

• .

• .
!6
-
• Hell 

• .

• 

• 

• 

• 

•
< >
- ?
- ? / ? ?
< > , ,
< > , ,
< >
< > , , ,
!7
• 

• 

•
-
< >
/ / ? / / ? / ? / 

< >
- (X) -> (O)
- ? (X) -> ? (O)
- ? (X) -> ? (O)
- (X) -> (O)
< >
-
-
!8
- preprocess
• Data Science 

• Garbage in, Garbage out

• , preprocess
.

• preprocess ?
!10
Preprocessing
Preprocessing -
• 

• preprocess (POS[1] tagger)
.

• : 

• KoNLPy[2]

• 

• , ,
[1] POS: Part of speech

[2] http://konlpy-ko.readthedocs.io/ko/v0.4.3/
!12
Preprocessing - ( )
• . ?
 • _NP _MAG _VV _ECE 

_VXA _EFN ._SF _MAG 

_VV _EFQ ?_SF
• 
 • _NP _MAG _NNG 

_XSV _ECE
• . 
 • _NNG _VA _ECD _VV 

_EFN ._SF _MAG _VV 

_ECE _NNG _XSV _ECS
< > < >
!13
Preprocessing - ( )
• . ?
 • _UN _JKS _MAG _MAG 

_VV _ECE _NNG _MAG 

_MAG _VV _ECS ?_SF
• 
 • _NP _NNG _NNG 

_JKM _VV
• . 
 • _NNG _VA _ECD _NP 

_UN ._SF _MAG _VV _ECE
_MAG _VV _ECS _EMO
< > < >
!15
Preprocessing -
• 

• ( , corpus)


• (corpus)

•
!17
Source: https://ko.wikipedia.org/wiki/
Preprocessing -
• Sejong Corpus

• National Institute of the Korean Language, 1998-2007.

• 

• (..)
!18
Source: https://ithub.korean.go.kr/user/guide/corpus/guide1.do
• preprocess

• normalize( )

• preprocessing

• 

• tokenizing
< >
count(“ ”) < count(“ ?”) , “ ” .
Preprocessing -
!19
Preprocessing - Tokenizing
• Tokenizing: 

• token , .

• , token 

• “ ” “ ” tokenizing
.
!20
< >
before tokenizing:
.
after tokenizing:
/ / / / / / / / / / / / / / / /
/ / / / .
• 

• 

• The probability that cn follows c1c2..cn-1, forming c1..cn

•
Preprocessing - Tokenizing(Cohesion Probability)
!21
< >
“ ” “ ” .
Source: https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/05/05/cohesion/
Preprocessing - Tokenizing(Cohesion Probability)
• ) 

• = +
!22
substring count
- count( ) = 20000
- count( ) = 1500
- count( ) = 1200
- count( ) = 30
- count( ) = 15
cohesion probability
- CP( ) = 0.2738
- CP( ) = 0.3914
- CP( ) = 0.1968
- CP( ) = 0.2371
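The counts and cohesion values listed above are consistent with CP(c1..cn) = (count(c1..cn) / count(c1))^(1/n); some references use the exponent 1/(n-1) instead. The following is a minimal sketch under that reading. The prefixes `a`, `ab`, ... are hypothetical stand-ins for the substrings on the slide, and `cohesion` and `best_split` are illustrative names, not the API of any library.

```python
# Sketch of cohesion probability, assuming
#   CP(c1..cn) = (count(c1..cn) / count(c1)) ** (1 / n)
# The prefixes "a", "ab", ... are hypothetical placeholders;
# only the counts are taken from the slide.

def cohesion(prefix, counts):
    """n-th root of count(prefix) / count(first character)."""
    n = len(prefix)
    if n < 2:
        raise ValueError("cohesion is computed here for prefixes of length >= 2")
    return (counts[prefix] / counts[prefix[0]]) ** (1.0 / n)


def best_split(word, counts):
    """Split a word into (left, right) at the prefix with the highest cohesion."""
    candidates = [word[:n] for n in range(2, len(word) + 1) if word[:n] in counts]
    left = max(candidates, key=lambda p: cohesion(p, counts))
    return left, word[len(left):]


counts = {"a": 20000, "ab": 1500, "abc": 1200, "abcd": 30, "abcde": 15}
scores = {p: cohesion(p, counts) for p in counts if len(p) >= 2}
left, right = best_split("abcde", counts)

for prefix, score in scores.items():
    print(prefix, round(score, 4))
print(left, "+", right)
```

With these counts the three-character prefix has the highest cohesion, so the word is split into that prefix plus the remainder.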
Preprocessing - Tokenizing
• Cohesion probability .

• .

• [ 2017] NLP - 

• 

• https://www.slideshare.net/kimhyunjoonglovit/pycon2017-koreannlp

• 

• https://github.com/lovit/soynlp
!23
Word Embedding
Word Embedding - Word2Vec
• vector .

• word embedding word representation .

• word2vec

• You shall know a word by the company it keeps (Firth, J. R. 1957:11)
!25
Word Embedding - Word2Vec
• word2vec OOV
.

• OOV(Out-of-vocabulary): (=dictionary ) vocabulary
vector 

• training input vocabulary OOV
, inference .

• inference : 

•


• ( , )
, dictionary .
!26
• word2vec 

• word2vec:

• 

• fasttext: 

• where G_w is the set of n-grams appearing in w

• subword
Word Embedding - Fasttext
!27
< >
w: Alpaca
n grams of w (n=3) = <Al, Alp, lpa, pac, aca, ca>
Source: Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
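The "Alpaca" example can be reproduced with a few lines of code. This is a minimal sketch, not the fastText implementation: `char_ngrams` is a hypothetical helper, and it extracts a single n, whereas the paper uses a range of n-gram lengths and also keeps the whole word.

```python
# Sketch of fastText-style character n-gram extraction.
# "<" and ">" are word-boundary markers, as in the "Alpaca" example.

def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


ngrams = char_ngrams("Alpaca", n=3)
print(ngrams)
```

Because an unseen word still shares n-grams with known words, a vector can be assembled for it from those n-gram vectors.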
Word Embedding - Fasttext +
• fasttext .

• (character) subword 

• subword 

• , OOV .
!28
< >
subwords( ) = < , , , , >
< >
= _ _ _
subwords( ) = < , _, _ , …, >
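Jamo-level subwords can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables. The code below is an illustration under assumptions: `_` is used as the placeholder for an empty final consonant (as the underscores above suggest), and `to_jamo` and `jamo_ngrams` are hypothetical names that are not taken from the hangul_jamo_fasttext repository.

```python
# Sketch of jamo-level decomposition for Hangul syllables (U+AC00..U+D7A3).
# Assumption: "_" marks an empty final consonant.

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
JONGSEONG = "_ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"  # 28 finals, "_" = none


def to_jamo(text):
    """Decompose each Hangul syllable into initial, vowel, and final jamo."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:  # 19 * 21 * 28 precomposed syllables
            out.append(CHOSEONG[code // 588])           # 588 = 21 * 28
            out.append(JUNGSEONG[(code % 588) // 28])
            out.append(JONGSEONG[code % 28])
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return "".join(out)


def jamo_ngrams(word, n=3):
    """Character n-grams over the jamo sequence, with boundary markers."""
    wrapped = "<" + to_jamo(word) + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(to_jamo("알파카"))
print(jamo_ngrams("가"))
```

Working at the jamo level gives even short Korean words several subwords, which is what lets the model share information across related forms.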
Word Embedding - Fasttext
•
!29
- , 0.8590
- , 0.8465
- , 0.8180
- , 0.8055
- , 0.8018
- , 0.8017
- , 0.8007
- , 0.7983
- , 0.7972
- , 0.7948
- , 0.9022
- , 0.8986
- , 0.8887
- , 0.8866
- , 0.8567
- , 0.8498
- , 0.8474
- , 0.8413
- , 0.8335
- , 0.8191
Word Embedding - Fasttext
• 

• 

• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word
vectors with subword information. arXiv preprint arXiv:1607.04606.

• 

• https://github.com/facebookresearch/fastText

• https://radimrehurek.com/gensim/models/fasttext.html

• https://github.com/summatic/hangul_jamo_fasttext
!30
Sentence Similarity
Sentence Similarity
• document
.

• document short sentence .

• word embedding vector embedding
cosine similarity .
!32
< >
sim( , ?)
Sentence Similarity - BOW + Word Embedding
• word vector 

• doc2vec 

• word embedding 

• word embedding ?

• word embedding 

• !=
!34
- similarity( , ) = 0.9011
- similarity( , ) = 0.8839
- similarity( , ) = 0.9707
Sentence Similarity - RNN
• Sentence embeddings are commonly built with RNNs (LSTM, Bi-RNN, GRU, etc.).

• RNN language modeling

• “ .” <-> “ ”


• sequence embedding .

• .. “ ” “ ” embedding .

• “?”
!35
Sentence Similarity - Term vector
• vector embedding
embedding .

• embedding term vector 

• one hot encoding .

• term vector cosine similarity, edit distance
.
!36
< >
- I love you, you love me
- {“I”: 1, “love”: 2, “you”: 2, “me”: 1}
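The term vector in this example is simply a count per token. A minimal sketch, assuming whitespace tokenization with commas ignored (`term_vector` is a hypothetical helper):

```python
from collections import Counter


def term_vector(sentence):
    """Count each whitespace-separated token, treating commas as separators."""
    tokens = sentence.replace(",", " ").split()
    return Counter(tokens)


vec = term_vector("I love you, you love me")
print(dict(vec))
```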
Sentence Similarity - Term vector
• term vector 

• . 

• 

• pair1 pair2 ?
!38
< >
pair1: I love you <-> I like you
pair2: I love you <-> I hate you
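The limitation can be checked directly: with plain term vectors, cosine similarity only sees shared tokens, so both pairs receive the same score. A minimal sketch, assuming whitespace tokenization (`cosine` is a hypothetical helper):

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


base = Counter("I love you".split())
sim_like = cosine(base, Counter("I like you".split()))
sim_hate = cosine(base, Counter("I hate you".split()))
print(round(sim_like, 3), round(sim_hate, 3))
```

Each pair shares two of three tokens, so neither "like" nor "hate" is treated as closer to "love".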
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• (=word vector) 

• cosine similarity

• ESA similarity
!39
I love you
I like you
similarity   I     love   you
I            1     0.2    0.5
like         0.3   0.9    0.4
you          0.5   0.4    1
max          1     0.9    1
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• (=word vector) 

• cosine similarity

• ESA similarity
!40
I love you
I hate you
similarity   I     love   you
I            1     0.2    0.5
hate         0.3   0.5    0.4
you          0.5   0.4    1
max          1     0.5    1
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• I love you 

• .
!41
         I like you   I hate you
cosine   0.667        0.667
ESA      0.967        0.833
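The ESA figures are consistent with taking, for each word of "I love you", the best word-to-word similarity in the other sentence and averaging those maxima. The sketch below assumes that reading; `esa_similarity` is a hypothetical name, and the matrices are the word-similarity tables from the two preceding slides.

```python
def esa_similarity(sim_matrix):
    """Average of the column-wise maxima of a word-to-word similarity matrix.

    Rows are the words of the compared sentence, columns are the words of
    the reference sentence ("I love you").
    """
    n_cols = len(sim_matrix[0])
    col_max = [max(row[j] for row in sim_matrix) for j in range(n_cols)]
    return sum(col_max) / n_cols


# Rows: I / like / you and I / hate / you; columns: I, love, you.
like_matrix = [[1.0, 0.2, 0.5], [0.3, 0.9, 0.4], [0.5, 0.4, 1.0]]
hate_matrix = [[1.0, 0.2, 0.5], [0.3, 0.5, 0.4], [0.5, 0.4, 1.0]]

esa_like = esa_similarity(like_matrix)
esa_hate = esa_similarity(hate_matrix)
print(round(esa_like, 3), round(esa_hate, 3))
```

Unlike the term-vector cosine, this score separates the two pairs because the word vectors place "like" closer to "love" than "hate" is.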
Sentence Similarity - ESA Similarity
• .

• 

• Song, Y., & Roth, D. (2015). Unsupervised sparse vector densification for short
text similarity. In Proceedings of the 2015 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 1275-1280).

• 

• ( )
!42
• preprocessing 80% 

• Zipf’s law

• corpus ,


• ( ) .


• 

• 

• , count based


• unlabeled data label 

• label insight
!44
WE WANT YOU!
- End of Document -
46