Globally Scalable Web Document Classification
Using Word2Vec
Kohei Nakaji (SmartNews)
These are the slides for the SF Bayarea Machine Learning Meetup (http://www.meetup.com/SF-Bayarea-Machine-Learning/events/221739934/).


  1. Globally Scalable Web Document Classification Using Word2Vec. Kohei Nakaji (SmartNews)
  2. Keyword: machine learning for discovery
  3. SmartNews Demo
  4. About SmartNews (Funding: $50M)
      ・Japan: launched 2013; 4M+ monthly active users; 50% DAU/MAU; 100+ publishers; 2013 App of The Year.
      ・US: launched Oct 2014; 1M+ monthly active users; same engagement; 80+ publishers; Top News Category App.
      ・International: launched Feb 2015; 10M downloads worldwide; same engagement; English beta; Featured App.
  5. Outline of our algorithm. Diagram labels: Signals on the Internet; URLs Found; Structure Analysis; Semantics Analysis; Importance Estimation; Diversification. Volumes shown: 10 million/day and 1000+/day.
  6. Outline of our algorithm (continued). Web Document Classification is one part (⊂) of this pipeline.
  7. Web Document Classification. Task definition: when an arbitrary web document arrives, choose exactly one category from a predetermined category set (WORLD, ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, …).
  8. Web Document Classification. There are roughly two steps: ① Main Content Extraction, ② Text Classification (example output: ENTERTAINMENT).
  10. Main Content Extraction. Two approaches:
      ・Extract after rendering the whole page: easier, but takes time.
      ・Extract from HTML: difficult, but fast. (Our approach)
  12. Main Content Extraction from HTML. Example (the two <p> paragraphs are main content; the link blocks are not main content):
      <html> <body>
      <div>click <a>here</a> for </div>
      <div>
      <a>tweet</a><a>share</a>
      <p>Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p>
      <a>you also like this</a>
      <p>So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p>
      </div>
      </body> </html>
  13. Main Content Extraction from HTML. A rule-based extraction algorithm is possible. For English:
      Rule 1: a div with text length > 200 and fewer than 3 'a' tags is main content.
      Rule 2: a div with text length < 100 and more than 4 'p' tags is main content.
      Rule N: …
  14. But this is not scalable: another set of rules (…) is needed for Japanese.
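Rules 1 and 2 above can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library `html.parser`; the names `DivStats` and `is_main_content` are hypothetical, and measuring text length in characters is an assumption (the slide does not say whether length means characters or words).

```python
from html.parser import HTMLParser


class DivStats(HTMLParser):
    """Collect text length and tag counts for every <div> on a page."""

    def __init__(self):
        super().__init__()
        self.open_divs = []  # stack of stats for divs that are still open
        self.divs = []       # stats of closed divs, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.open_divs.append({"text_len": 0, "a": 0, "p": 0})
        elif tag in ("a", "p"):
            for stats in self.open_divs:  # nested tags count for every open div
                stats[tag] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.divs.append(self.open_divs.pop())

    def handle_data(self, data):
        for stats in self.open_divs:
            stats["text_len"] += len(data.strip())


def is_main_content(stats):
    rule1 = stats["text_len"] > 200 and stats["a"] < 3  # Rule 1 from the slide
    rule2 = stats["text_len"] < 100 and stats["p"] > 4  # Rule 2 from the slide
    return rule1 or rule2


# Made-up page: a navigation div followed by a long article div.
page = (
    "<html><body>"
    "<div><a>home</a><a>news</a><a>about</a></div>"
    "<div><p>" + "word " * 60 + "</p></div>"
    "</body></html>"
)
parser = DivStats()
parser.feed(page)
parser.close()
results = [is_main_content(stats) for stats in parser.divs]
```

Every threshold here is tied to one language and one family of page layouts, which is the scalability problem the slides point out.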
  15. Main Content Extraction from HTML. We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
      ① Training: block separation & feature extraction turns labeled pages into examples (features, main), (features, not main), (features, main), …, which are used to train a decision tree.
      ② Live data: block separation & feature extraction turns a page into block1: (features), block2: (features), block3: (features), …, which the decision tree classifies.
  17. Feature Extraction from HTML (same example page as slide 12).
      Step 1: Separate the HTML into text blocks.
      Step 2: Extract local features for every text block. Example: word count = 36, num of <a> = 0.
      Step 3: Define the feature of each text block as a combination of local features. Example: word count (current block): 36, num of <a> (current block): 0, word count (previous block): 4, num of <a> (previous block): 1.
  22. Making Main Content Using Decision Tree. block1 (features): not main; block2 (features): not main; block3 (features): main; block4 (features): not main; block5 (features): main.
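The block separation, feature extraction, and block labeling described on the slides can be sketched end to end as below. This is only an illustration under stated assumptions: `BlockSplitter`, `combine_with_previous`, and `toy_tree` are hypothetical names, the set of block-delimiting tags is an assumption, and `toy_tree` is a hand-written stand-in with made-up thresholds, not a trained decision tree and not the classifier SmartNews uses.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"html", "body", "div", "p"}  # assumption: tags that delimit text blocks


class BlockSplitter(HTMLParser):
    """Steps 1 and 2: split HTML into text blocks and compute local features."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words, self._links = [], 0

    def _flush(self):
        if self._words:
            self.blocks.append({"text": " ".join(self._words),
                                "words": len(self._words),
                                "links": self._links})
        self._words, self._links = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._links += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._words.extend(data.split())


def combine_with_previous(blocks):
    """Step 3: add the previous block's local features to each block."""
    features = []
    for i, block in enumerate(blocks):
        prev = blocks[i - 1] if i > 0 else {"words": 0, "links": 0}
        features.append({"words": block["words"], "links": block["links"],
                         "prev_words": prev["words"], "prev_links": prev["links"]})
    return features


def toy_tree(f):
    """Hand-written stand-in for a trained decision tree (thresholds are made up)."""
    if f["links"] > 0:
        return "not main"
    return "main" if f["words"] >= 10 else "not main"


page = """<html><body>
<div>click <a>here</a> for </div>
<div>
<a>tweet</a><a>share</a> <p>Robert Bates was a volunteer deputy who'd never led an
arrest for the Tulsa County Sheriff's Office.</p>
<a>you also like this</a> <p>So how did the 73-year-old insurance company CEO end up
joining a sting operation this month that ended when he pulled out his handgun and
killed suspect Eric Harris instead of stunning him with a Taser?</p></div>
</body></html>"""

splitter = BlockSplitter()
splitter.feed(page)
splitter.close()
features = combine_with_previous(splitter.blocks)
labels = [toy_tree(f) for f in features]
# The main content is the concatenation of the blocks labeled "main".
main_content = " ".join(b["text"] for b, l in zip(splitter.blocks, labels) if l == "main")
```

On the example page, the last block gets the same features as the slide's example (word count 36, no links, previous block with 4 words and 1 link). In a real system the tree would be learned from labeled blocks instead of written by hand.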
  25. Text Classification. Ordinary text classification architecture:
      ① Training: feature extraction turns labeled documents into examples (features, entertainment), (features, sports), (features, entertainment), (features, politics), …, and a training algorithm produces the classifier.
      ② Live data: feature extraction turns a new document into (features), and the classifier outputs its category (entertainment, sports, …).
  27. Feature Extraction in Text Classification. 'Bag-of-words' is commonly used as a feature vector, with some feature engineering (stop words, a sports players dictionary, tf-idf). Example: "Will LeBron James deliver an NBA championship to Cleveland?" becomes the bag {Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland}, where the dictionary maps "LeBron James" to NBA_PLAYER.
  29. Feature Extraction in Text Classification. Bag-of-words is used similarly in Japanese. Example: 私は中路です。よろしくお願いします。 ("I am Nakaji. Nice to meet you.") is segmented into 私 / は / 中路 / です / よろしく / お願い / し / ます, then processed with stop words, a person dictionary (PERSON), and tf-idf.
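A bag-of-words feature extractor with the feature engineering named on the slides (stop words, a players dictionary, tf-idf) might look like the sketch below. The stop-word list, the dictionary contents, and the second document are made up for illustration, and `tf * log(N / df)` is just one common tf-idf variant, not necessarily the one used in the talk.

```python
import math
from collections import Counter

STOP_WORDS = {"will", "an", "to", "a", "the"}       # hypothetical stop-word list
PLAYER_DICT = {("lebron", "james"): "NBA_PLAYER"}   # hypothetical players dictionary


def tokenize(text):
    words = [w.strip("?.,!").lower() for w in text.split()]
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PLAYER_DICT:            # dictionary lookup on two-word names
            tokens.append(PLAYER_DICT[pair])
            i += 2
        else:
            if words[i] and words[i] not in STOP_WORDS:
                tokens.append(words[i])
            i += 1
    return tokens


def tfidf(docs):
    bags = [Counter(tokenize(d)) for d in docs]
    df = Counter(term for bag in bags for term in bag)  # document frequency
    n = len(docs)
    return [{t: count * math.log(n / df[t]) for t, count in bag.items()} for bag in bags]


docs = [
    "Will LeBron James deliver an NBA championship to Cleveland?",
    "Cleveland will host a tech conference.",
]
vectors = tfidf(docs)
```

Each language needs its own tokenizer, stop-word list, and dictionaries, which is the engineering cost that the paragraph vector approach in the following slides avoids.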
  30. Another Option: Paragraph Vector
  31. Example: each document is mapped to a paragraph vector (dimension ∼ several hundred). 私は中路です。よろしくお願いします。 → [0.2, 0.3, …, 0.2]; "Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1].
  32. Outline of Distributed Representation
      ・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/)
      ・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053)
  34. Word Vector in word2vec Model. Every word is mapped to a unique word vector with good properties. Example vectors: [0.1, 0.2, …, 0.2], [0.1, 0.1, …, -0.1], [0.3, 0.4, …, 0], [0.3, 0.3, …, 0.3] for the words Germany, Berlin, Paris, France, … Good property: "Germany - Berlin = France - Paris", that is, $v_{Germany} - v_{Berlin} = v_{France} - v_{Paris}$.
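The "Germany - Berlin = France - Paris" property can be checked with a few lines of vector arithmetic. The sketch below uses hypothetical hand-made 4-dimensional vectors chosen so that the relation holds exactly; real word2vec vectors have hundreds of dimensions and satisfy it only approximately.

```python
import math

# Hypothetical toy vectors: (is_country, german, french, japanese).
vectors = {
    "Germany": [1.0, 1.0, 0.0, 0.0],
    "Berlin":  [0.0, 1.0, 0.0, 0.0],
    "France":  [1.0, 0.0, 1.0, 0.0],
    "Paris":   [0.0, 0.0, 1.0, 0.0],
    "Japan":   [1.0, 0.0, 0.0, 1.0],
    "Tokyo":   [0.0, 0.0, 0.0, 1.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word closest to v_a - v_b + v_c, excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Germany - Berlin + Paris should land on France.
answer = analogy("Germany", "Berlin", "Paris")
```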
  35. Procedure to Create Word Vectors. Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)
      Example corpus: "A cat sat on the street. … I love cat very much. He comes from Japan. …" (words indexed …, $w_{220}$, $w_{221}$, …). [Figure: network over the words "cat", "sat", "on", "the", "street".]
      Objective function (cbow-case):
      $$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c})$$
      Model (sum-case):
      $$P(w_t \mid w_{t-c}, \cdots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \qquad v = \sum_{t' \neq t,\ t-c \le t' \le t+c} v_{w_{t'}}$$
      Procedure:
      ① Maximize $L$ for $u_w$ and $v_w$.
      ② $v_w$ is the word vector for $w$.
      Word vectors are trained so that they become good features for predicting surrounding words.
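To make the objective concrete, the sketch below evaluates the sum-case CBOW model and its log-likelihood L in plain Python. The function names are hypothetical, the window is simply truncated at the sentence boundaries (an assumption), and the parameters are untrained zeros, so every word gets probability 1/|vocabulary|. Training would then adjust u and v to increase L.

```python
import math


def context_of(words, t, c):
    """Words within c positions of position t, excluding position t itself."""
    return [words[j] for j in range(max(0, t - c), min(len(words), t + c + 1)) if j != t]


def cbow_prob(target, context, u, v):
    """P(target | context): softmax over u_W . h, where h sums the context v vectors."""
    dim = len(next(iter(v.values())))
    h = [sum(v[w][k] for w in context) for k in range(dim)]
    scores = {w: math.exp(sum(a * b for a, b in zip(u[w], h))) for w in u}
    return scores[target] / sum(scores.values())


def objective(words, c, u, v):
    """L = sum over t of log P(w_t | surrounding words)."""
    return sum(math.log(cbow_prob(w, context_of(words, t, c), u, v))
               for t, w in enumerate(words))


sentence = "a cat sat on the street".split()
vocab = sorted(set(sentence))
# Hypothetical untrained parameters: every vector starts at zero.
u = {w: [0.0, 0.0] for w in vocab}
v = {w: [0.0, 0.0] for w in vocab}
L = objective(sentence, c=2, u=u, v=v)
```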
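  38. Procedure to Create Paragraph Vectors. Add a vector to the model for each document.
      Example corpus split into documents doc_1, doc_2, …: "A cat sat on the street. … I love cat very much. He comes from Japan. …" [Figure: the network of Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf) over "cat", "sat", "on", "the", "street", with doc_1 as an additional input.]
      Objective function (dbow-case):
      $$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i)$$
      where $\mathrm{doc}_i$ is the document in which $w_t$ is included.
      Model (sum-case):
      $$P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \qquad v = \sum_{t' \neq t,\ t-c \le t' \le t+c} v_{w_{t'}} + d_i$$
      Procedure:
      ① Maximize $L$ for $u_w$, $v_w$, and $d_i$.
      ② Preserve $u_w$, $v_w$ as $\tilde{u}_w$, $\tilde{v}_w$.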
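  39. Procedure to Create Paragraph Vector. After training, we can get a good paragraph vector as a feature for a new document.
      Training:
      ① Maximize $L$ for $u_w$, $v_w$, and $d_i$.
      ② Preserve $u_w$, $v_w$ as $\tilde{u}_w$, $\tilde{v}_w$.
      Live data (new document doc: "I love SmartNews very much. We love SmartNews. …", with document vector $d$):
      Objective function (dbow-case):
      $$L_{\mathrm{doc}} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc})$$
      Model (sum-case):
      $$P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}) = \frac{\exp(\tilde{u}_{w_t} \cdot \tilde{v})}{\sum_{W} \exp(\tilde{u}_W \cdot \tilde{v})}, \qquad \tilde{v} = \sum_{t' \neq t,\ t-c \le t' \le t+c} \tilde{v}_{w_{t'}} + d$$
      ③ Maximize $L_{\mathrm{doc}}$ for $d$.
      ④ Use $d$ as a paragraph vector.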
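Inferring the vector d of a new document with frozen word parameters can be sketched as plain gradient ascent on the document objective. Everything concrete below is an assumption made for illustration: the function names, the tiny 2-dimensional vectors standing in for the trained parameters, the step size, and the number of steps. The gradient follows from the sum-case model: the derivative of log P(w_t | …) with respect to d is u_{w_t} minus the probability-weighted average of all u_W.

```python
import math


def hidden(words, t, c, v, d):
    """Sum of the frozen context word vectors plus the document vector d."""
    ctx = [words[j] for j in range(max(0, t - c), min(len(words), t + c + 1)) if j != t]
    return [d[k] + sum(v[w][k] for w in ctx) for k in range(len(d))]


def softmax(u, h):
    scores = {w: math.exp(sum(a * b for a, b in zip(u[w], h))) for w in u}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}


def doc_objective(words, c, u, v, d):
    """L_doc = sum over t of log P(w_t | context, doc)."""
    return sum(math.log(softmax(u, hidden(words, t, c, v, d))[w])
               for t, w in enumerate(words))


def infer_paragraph_vector(words, c, u, v, dim, steps=50, lr=0.1):
    """Gradient ascent on L_doc with respect to d only; u and v stay frozen."""
    d = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for t, w in enumerate(words):
            p = softmax(u, hidden(words, t, c, v, d))
            for k in range(dim):
                # d/dd log P(w_t) = u_{w_t} - sum_W P(W) u_W
                grad[k] += u[w][k] - sum(p[x] * u[x][k] for x in u)
        d = [d[k] + lr * grad[k] for k in range(dim)]
    return d


# Hypothetical frozen vectors standing in for the trained parameters.
u = {"i": [0.5, 0.0], "love": [0.0, 0.5], "smartnews": [0.5, 0.5],
     "very": [-0.5, 0.0], "much": [0.0, -0.5], "we": [-0.5, -0.5]}
v = {w: [0.1 * x for x in vec] for w, vec in u.items()}
doc = "i love smartnews very much".split()
d = infer_paragraph_vector(doc, c=2, u=u, v=v, dim=2)
```

Because only d is updated, the cost of handling a new document does not depend on the size of the training corpus.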
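  40. Procedure to Create Paragraph Vector (summary). Maximizing $L$ yields $\tilde{u}_w$ and $\tilde{v}_w$, which form the feature extractor; maximizing $L_{\mathrm{doc}}$ for a document then yields its paragraph vector $d$, e.g. [0.2, 0.3, …, 0.2].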
  41. Text Classification. Ordinary text classification architecture, with paragraph vectors as features:
      ① Training: feature extraction produces examples ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), …, and a training algorithm produces the classifier.
      ② Live data: feature extraction produces ([0.1, -0.1, …]), and the classifier outputs the category (entertainment, sports, …).
  42. Benefits of Using Paragraph Vector
      Good:
      ・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data set (labeled: ∼several tens of thousands; unlabeled: ∼100,000).
      ・High scalability: we don't need to work hard on feature engineering for each language.
      Bad:
      ・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.
  43. Benefits of Using Paragraph Vector. It is important that Paragraph Vector has a different nature from Bag-of-Words. Reason: we can get a better classifier by combining two different types of classifiers.
  44. Our Use Case (Bag-of-Words-based classifier vs. Paragraph Vector-based classifier)
      ・Validation: use one to validate the other.
      ・Combination: use the more reliable result of the two classifiers.
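The two uses above can be sketched as follows. The slide does not say how "more reliable" is measured, so this example assumes each classifier returns a (label, confidence) pair and treats the higher confidence as more reliable; the function names and the sample outputs are hypothetical.

```python
def combine(bow_prediction, pv_prediction):
    """Combination: return the label of the prediction with the higher confidence."""
    return max(bow_prediction, pv_prediction, key=lambda p: p[1])[0]


def disagree(bow_prediction, pv_prediction):
    """Validation: flag documents on which the two classifiers disagree."""
    return bow_prediction[0] != pv_prediction[0]


# Hypothetical (label, confidence) outputs for one document.
bow = ("sports", 0.62)          # Bag-of-Words-based classifier
pv = ("entertainment", 0.91)    # Paragraph Vector-based classifier
label = combine(bow, pv)
flagged = disagree(bow, pv)
```

Flagged documents are natural candidates for manual review or for extra training data.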
  45. Our Use Case (future). In multilingual localization: use only the Paragraph Vector-based classifier, without any feature engineering.
  47. The Challenge
  48. The Challenge. News is uncertainty seeking for long-term values (exploitation vs. exploration).
      ・What SmartNews does: uncertainty-seeking discovery.
      ・What Big Data firms typically do: preference estimation and risk quantification.
      What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
  49. The Challenge. Searching for a form of exploration that is not optimal, but acceptable. Why? Humans are not rational enough to simply accept the optimum. Without acceptance, users will never read SmartNews. We are developing:
      ① For a better feature vector of users and articles: topic extraction, image extraction.
      ② For human-acceptable exploration: a multi-armed bandit based scoring model.
      [Diagram: user interests; feature vectors for 10 million users × real-time feature vectors for articles.]
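As one illustration of what a bandit-based score can look like, the sketch below uses the standard UCB1 rule: an article's observed click rate plus a bonus that is large while the article has few impressions. The slides do not say which bandit algorithm SmartNews uses, so the choice of UCB1, the function name, and the click counts are all assumptions.

```python
import math


def ucb1_score(clicks, impressions, total_impressions):
    """Mean click rate plus an exploration bonus that shrinks as impressions grow."""
    if impressions == 0:
        return float("inf")  # always try an article that has never been shown
    mean = clicks / impressions
    bonus = math.sqrt(2 * math.log(total_impressions) / impressions)
    return mean + bonus


# Hypothetical (clicks, impressions) per article.
articles = {"a": (50, 1000), "b": (1, 20), "c": (0, 0)}
total = sum(n for _, n in articles.values())
scores = {name: ucb1_score(k, n, total) for name, (k, n) in articles.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Articles "a" and "b" have the same click rate, but "b" ranks higher because it has been shown less often, which is the exploration side of the trade-off.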
  50. We are building our engineering team in SF - please join us! We are hiring (採用してます):
      ・ML/NLP Engineer
      ・Data Science Engineer
      …
  51. kohei.nakaji@smartnews.com
  52. References
      Main Content Extraction
      ・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features"
      ・BoilerPipe (GoogleCode)
      Text Classification
      ・Quoc V. Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents"
      ・Word2Vec (GoogleCode)
  53. References: About SmartNews
      SmartNews
      ・About our Company
      Articles about SmartNews
      ・Japan's SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S.
      ・SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S.
      ・Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M
