AIMeetup #4: Neural-machine-translation

1. How to build your own translator in 15 minutes
   Neural Machine Translation in practice
   Bartek Rozkrut, 2040.io
2. Why is this so important?
   - A 40 billion USD / year industry
   - Language is a huge barrier for many people
   - Provides unlimited access to knowledge
   - Helps scale other NLP problems
3. Why build your own translator?
   1. Private / sensitive data
   2. Huge amounts of data, e.g. e-mail translation (cost)
   3. Off-line / off-cloud / on-premise deployment
   4. Custom domain-specific translation / vocabulary
4. Neural Machine Translation: an example workflow
   1. Download parallel corpus files
   2. Concatenate all corpus files (source + target) in the same order
   3. Split into TRAIN / VAL sets
   4. Tokenize
   5. Preprocess
   6. Train
   7. Release the model (CPU compatible)
   8. Translate!
   9. REPEAT!
5. Parallel corpus: public data at http://opus.lingfil.uu.se
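A minimal sketch of the download step, assuming a Unix shell with wget and unzip; the exact archive URL below is an illustrative assumption, so browse the OPUS site for the real download links:

   # Fetch one pl-en archive from OPUS (URL pattern is an assumption, not from the deck)
   wget -O pl-en.txt.zip "http://opus.lingfil.uu.se/download.php?f=Europarl/pl-en.txt.zip"
   unzip pl-en.txt.zip   # should yield line-aligned source/target text files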
6. Parallel corpus (source file, PL, EUROPARL)
   1. Tytuł: Admirał NATO potrzebuje przyjaciół.
   2. Dziękuję.
   3. Naprawdę potrzebuję...
   4. Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"
7. Parallel corpus (target file, EN, EUROPARL)
   1. The headline was: NATO Admiral Needs Friends.
   2. Thank you.
   3. Which I do.
   4. And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"
8. Vocabulary
   1. Word level
   2. Sub-word level (e.g. Byte Pair Encoding)
   3. Character level
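As a hedged illustration of the sub-word option, here is one way to learn and apply BPE with the subword-nmt package; the tool choice, the 32k merge count, and the file names are assumptions, since the deck does not prescribe a specific BPE implementation:

   pip install subword-nmt
   # Learn 32k BPE merge operations on the training source side (values illustrative)
   subword-nmt learn-bpe -s 32000 < train-src.txt > bpe-codes.src
   # Apply the learned codes to produce sub-word tokenized text
   subword-nmt apply-bpe -c bpe-codes.src < train-src.txt > train-src.txt.bpe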
9. BLEU: the standard automatic MT evaluation metric, scoring n-gram overlap between a candidate translation and reference translations.
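For reference, the standard BLEU definition (well-known background, not spelled out in the deck), in LaTeX:

   \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big),
   \qquad \mathrm{BP} = \min\big(1,\, e^{1 - r/c}\big)

where p_n is the modified n-gram precision, w_n = 1/N (usually N = 4), r is the reference length, and c is the candidate length.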
10. OpenNMT (December 2016): http://opennmt.net/
11. Google's seq2seq (March 2017): https://google.github.io/seq2seq/
12. Our experience from PL => EN training
   1. 100k vocabulary (word level)
   2. Bidirectional LSTM, 2 layers, RNN size 500
   3. 5M sentences from public data sources
   4. ~20 BLEU
13. OpenNMT: run the Docker container
   Run a CPU-based interactive session:
      sudo docker run -it 2040/opennmt bash
   Run a GPU-based interactive session:
      sudo nvidia-docker run -it 2040/opennmt bash
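Corpus files prepared on the host are not visible inside the container unless a directory is mounted; a minimal sketch using Docker's standard -v flag (the host path is a placeholder, not from the deck):

   # Mount a host directory with the corpus at /corpus inside the container
   sudo nvidia-docker run -it -v /home/me/corpus:/corpus 2040/opennmt bash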
14. OpenNMT: split the parallel corpus
   # Put the first 90% of lines in TRAIN and the remaining 10% in VAL;
   # split(1) names its output chunks xaa, xab, ...
   split -l $(( $(wc -l < src.txt) * 9 / 10 )) src.txt
   mv xaa train-src.txt
   mv xab val-src.txt
   split -l $(( $(wc -l < tgt.txt) * 9 / 10 )) tgt.txt
   mv xaa train-tgt.txt
   mv xab val-tgt.txt
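A quick sanity check, not in the deck, that the split kept the two sides aligned:

   # Line counts must match pairwise (train-src vs train-tgt, val-src vs val-tgt)
   wc -l train-src.txt train-tgt.txt val-src.txt val-tgt.txt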
15. OpenNMT: preprocess the parallel corpus
   th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok
   th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok
   th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-src.txt.tok
   th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-tgt.txt.tok
   th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
16. OpenNMT: train && release && translate
   th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1
   th tools/release_model.lua -model model.t7 -gpuid 1
   th translate.lua -model model.t7 -src val-src.txt.tok -output file-tgt.tok -gpuid 1
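The translated output is still tokenized with joiner marks; a hedged final step, assuming the companion detokenize.lua script that ships alongside the tokenize.lua used above:

   # Undo tokenization to get plain-text translations (script name assumed
   # from OpenNMT's tools/ directory; verify it exists in your checkout)
   th tools/detokenize.lua < file-tgt.tok > file-tgt.txt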
17. Best hyperparameters from 250k GPU hours (thanks, Google): https://arxiv.org/abs/1703.03906
18. Other applications
   1. Image-to-text
   2. OCR (e.g. Tesseract OCR v4.0, LSTM-based)
   3. Lip reading
   4. Simple Q&A
   5. Chatbots
19. Slides used with permission from Richard Socher: http://web.stanford.edu/class/cs224n/
20. Thanks!
   Bartek Rozkrut
   bartek@2040.io