Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media Streams

  1. Deriving Actionable Insights from High Volume Media Streams Jörn Kottmann Peter Thygesen Big Data Spain 2017, Madrid
  2. $WhoAreWe Jörn Kottmann ● Senior Software Engineer, Sandstone SA, Luxembourg ● Member of Apache Software Foundation ● PMC Chair & Committer, Apache OpenNLP ● PMC and Committer, Apache UIMA Peter Thygesen ● Senior Software Engineer & Partner, Paqle A/S, Denmark ● PMC and Committer, Apache OpenNLP
  3. What is a Natural Language?
  4. What is a Natural Language? A natural language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. (From Wikipedia)
  5. What is NOT a Natural Language?
  6. Characteristics of Natural Language ● Unstructured ● Ambiguous ● Complex ● Hidden semantic ● Irony ● Informal ● Unpredictable ● Rich ● Most updated ● Noise ● Hard to search ● Metaphors
  7. and it holds most of human knowledge
  10. As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents. MIT Press - May 12, 2017
  11. What is Natural Language Processing? NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. (From Wikipedia)
  12. ???
  13. How?
  14. By solving small problems each time: a pipeline in which one type of ambiguity is resolved at each step. Language Detector: Guten Morgen ● 早上好 ● おはようございます ● Hyvää huomenta ● Καλημέρα ● доброе утро ● शुभ प्रभात ● dzień dobry ● ¡Buenos días!
  15. By solving small problems each time: a pipeline in which one type of ambiguity is resolved at each step. Sentence Detector: "Mr. Robert talk is today at room num. 7. Let's go?" ❌ four boundaries (after "Mr.", "num.", "7." and "go?") ✅ two boundaries (after "7." and "go?"). Tokenizer: the same sentence, shown with ❌ incorrect and ✅ correct token boundaries.
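
The sentence-boundary ambiguity on this slide can be illustrated with a minimal sketch in plain Java. This is not how Apache OpenNLP's sentence detector works (it uses a trained model); the class name `NaiveSentenceSplitter` and the hard-coded abbreviation list are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: a rule-based splitter that shows
// why "split after every period" fails on abbreviations.
final class NaiveSentenceSplitter {

    // Assumed abbreviation list, chosen only to cover the slide's example.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr.", "num."));

    /**
     * Splits after a word ending in '.', '?' or '!'.
     * When useAbbreviations is true, known abbreviations do not end a sentence.
     */
    static List<String> split(String text, boolean useAbbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            boolean endsWithTerminator =
                    word.endsWith(".") || word.endsWith("?") || word.endsWith("!");
            boolean isAbbreviation = useAbbreviations && ABBREVIATIONS.contains(word);
            if (endsWithTerminator && !isAbbreviation) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep trailing text that has no terminator.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }
}
```

Calling `split(text, false)` on the slide's sentence yields four fragments, while `split(text, true)` yields the two intended sentences. A fixed abbreviation list does not scale across languages and domains, which is the motivation for a trained sentence detector.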
  16. By solving small problems each time: each step of a pipeline solves one ambiguity problem. Name Finder: <Person>Washington</Person> was the first president of the USA. <Place>Washington</Place> is a state in the Pacific Northwest region of the USA. POS Tagger: Laura/NNP Keene/NNP brushed/VBD by/IN him/PRP with/IN the/DT glass/NN of/IN water/NN ./.
  17. By solving small problems each time: a pipeline can be long and resolve many ambiguities. Lemmatizer: He is better than many others → He be good than many other
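
The lemmatizer step can be sketched as a dictionary lookup keyed on the token and its part-of-speech tag. This is a simplified illustration in plain Java, not OpenNLP's lemmatizer; the class `DictionaryLemmatizerSketch`, its three dictionary entries, and the Penn Treebank tags in the usage note are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: looks up (word, POS tag) pairs
// in a small in-memory dictionary and falls back to the token itself.
final class DictionaryLemmatizerSketch {

    private final Map<String, String> lemmas = new HashMap<>();

    void add(String word, String posTag, String lemma) {
        lemmas.put(key(word, posTag), lemma);
    }

    /** Returns one lemma per token; unknown (word, tag) pairs are left unchanged. */
    String[] lemmatize(String[] tokens, String[] posTags) {
        if (tokens.length != posTags.length) {
            throw new IllegalArgumentException("tokens and posTags must have the same length");
        }
        String[] result = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            result[i] = lemmas.getOrDefault(key(tokens[i], posTags[i]), tokens[i]);
        }
        return result;
    }

    private static String key(String word, String posTag) {
        return word.toLowerCase(Locale.ROOT) + "\t" + posTag;
    }

    /** Dictionary holding only the entries needed for the slide's example. */
    static DictionaryLemmatizerSketch slideExample() {
        DictionaryLemmatizerSketch lemmatizer = new DictionaryLemmatizerSketch();
        lemmatizer.add("is", "VBZ", "be");
        lemmatizer.add("better", "JJR", "good");
        lemmatizer.add("others", "NNS", "other");
        return lemmatizer;
    }
}
```

With the tags `PRP VBZ JJR IN JJ NNS`, `slideExample().lemmatize(...)` maps "He is better than many others" to "He be good than many other". Keying on the tag is why the lemmatizer runs after the POS tagger: "better" tagged as an adverb would need a different lemma than "better" tagged as an adjective.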
  18. Language Detector → per-language pipeline (Language 1, Language 2, ..., Language N): Sentence Detector → Tokenizer → POS Tagger → Lemmatizer → Name Finder → Chunker → Index
  19. Apache OpenNLP
  20. Apache OpenNLP ● Mature project (> 10 years) ● Actively developed ● Machine learning ● Java ● Easy to train ● Highly customizable ● Fast. Components: Language Detector, Sentence Detector, Tokenizer, Part of Speech Tagger, Lemmatizer, Chunker, Parser, ...
  21. Language Detection ● Extract character n-grams as features, can be done for any string in any language ● Use the n-grams as features for classification via Maximum Entropy, Perceptron or Naive Bayes ● Train on high quality data like Leipzig corpus to classify the text in more than 100 languages Source: Language Detection Library for Java - Shuyo Nakatani
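
The first bullet, extracting character n-grams as features, can be sketched in plain Java as follows. This is only an illustration of the idea; the class name `CharNGrams`, the lowercasing step, and the choice of n-gram range are assumptions and do not reproduce OpenNLP's own feature generation or text normalization.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: counts character n-grams of a string.
final class CharNGrams {

    /**
     * Counts all character n-grams with minN <= n <= maxN.
     * Works on UTF-16 code units, which is sufficient for a sketch but would
     * split characters outside the Basic Multilingual Plane.
     */
    static Map<String, Integer> count(String text, int minN, int maxN) {
        if (minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("require 1 <= minN <= maxN");
        }
        // Lowercasing is an assumed normalization step, not taken from the slides.
        String normalized = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= normalized.length(); i++) {
                counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For example, `CharNGrams.count("abab", 1, 2)` contains "a" and "b" twice each, "ab" twice, and "ba" once. Because the features are computed from raw characters, the same extraction applies to any script, which is what allows one classifier to cover many languages.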
  22. Training Language Detection Model. Corpus - Leipzig
      svn co opennlp-corpus //(roughly 25GB)
      bin/opennlp LanguageDetectorConverter leipzig -sentencesDir data -sentencesPerSample 5 -samplesPerLanguage 2000 -encoding UTF-8 > ld-train.txt
      bin/opennlp LanguageDetectorConverter leipzig -sentencesDir data -sentencesPerSample 5 -samplesPerLanguage 2000 -samplesToSkip 2000 -encoding UTF-8 > ld-eval.txt
      bin/opennlp LanguageDetectorTrainer -model lang.bin -params MAXENT_45_PARAMS.txt -data ld-train.txt -encoding UTF-8
      bin/opennlp LanguageDetectorEvaluator -model lang.bin -misclassified true -reportOutputFile report.txt -data ld-eval.txt -encoding UTF-8
  23. Training Models for English. Corpus - OntoNotes
      bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-ner-ontonotes.bin
      bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
  24. Training Models for Portuguese. Corpus - Amazonia
      bin/opennlp -lang por -data -model por-tokenizer.bin -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1
      bin/opennlp -lang por -data -model por-pos.bin -encoding ISO-8859-1 -includeFeatures false
      bin/opennlp -lang por -data -model por-chunk.bin -encoding ISO-8859-1
      bin/opennlp -lang por -data -model por-ner.bin -encoding ISO-8859-1
  25. Name Finder API - Detect Names
      // Load the Portuguese NER model from the classpath
      TokenNameFinderModel model = new TokenNameFinderModel(
          OpenNLPMain.class.getResourceAsStream("/opennlp-models/por-ner.bin"));
      NameFinderME nameFinder = new NameFinderME(model);
      for (String[][] document : documents) {
        for (String[] sentence : document) {
          Span[] nameSpans = nameFinder.find(sentence);
          // do something with the names
        }
        // Reset document-level adaptive data before the next document
        nameFinder.clearAdaptiveData();
      }
  26. Name Finder API - Train a model
      ObjectStream<String> lineStream = new PlainTextByLineStream(
          new FileInputStream("en-ner-person.train"), StandardCharsets.UTF_8);
      TokenNameFinderModel model;
      // try-with-resources closes the sample stream after training
      try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) {
        model = NameFinderME.train("eng", "person", sampleStream,
            TrainingParameters.defaultParams(), new TokenNameFinderFactory());
      }
      model.serialize(modelFile);
  27. Name Finder API - Evaluate a model
      TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME(model));
      evaluator.evaluate(sampleStream);
      FMeasure result = evaluator.getFMeasure();
      System.out.println(result.toString());
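
The F-measure reported by the evaluator combines precision and recall over name spans. The following plain-Java sketch shows the standard definition only; it is not OpenNLP's `FMeasure` implementation, and the class `SpanFMeasure` and the string encoding of spans (for example `"0-2:person"`) are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: precision, recall and F1 over
// spans encoded as strings such as "0-2:person".
final class SpanFMeasure {

    /** Returns {precision, recall, f1}; a span counts only on an exact match. */
    static double[] score(Set<String> reference, Set<String> predicted) {
        Set<String> truePositives = new HashSet<>(predicted);
        truePositives.retainAll(reference);
        int tp = truePositives.size();

        // Empty sets are scored as 0.0 here; other conventions exist.
        double precision = predicted.isEmpty() ? 0.0 : (double) tp / predicted.size();
        double recall = reference.isEmpty() ? 0.0 : (double) tp / reference.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2.0 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }
}
```

With three reference spans and four predicted spans of which two match exactly, precision is 2/4, recall is 2/3, and F1 is 4/7. Exact matching means a name found with the wrong boundary or the wrong type counts as both a false positive and a false negative.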
  28. Name Finder API - Cross Evaluate a model
      FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train");
      // Wrap the line stream so that each line is parsed into a NameSample
      ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
          new PlainTextByLineStream(sampleDataIn.getChannel(), StandardCharsets.UTF_8));
      TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator("eng", 100, 5);
      // Evaluate with 10 folds
      evaluator.evaluate(sampleStream, 10);
      FMeasure result = evaluator.getFMeasure();
      System.out.println(result.toString());
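
The second argument to `evaluate` above is the number of folds. As a rough illustration of what k-fold cross-validation does with the samples, here is a plain-Java sketch; the class `KFold` and its round-robin assignment are assumptions, and OpenNLP's own partitioning may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: partitions samples into k folds.
final class KFold {

    /** Assigns samples to k folds round-robin, so fold sizes differ by at most one. */
    static <T> List<List<T>> folds(List<T> samples, int k) {
        if (k < 2 || k > samples.size()) {
            throw new IllegalArgumentException("require 2 <= k <= number of samples");
        }
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < samples.size(); i++) {
            folds.get(i % k).add(samples.get(i));
        }
        return folds;
    }

    /** Training data for one round: every fold except the held-out one. */
    static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> training = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != heldOut) {
                training.addAll(folds.get(i));
            }
        }
        return training;
    }
}
```

In each of the k rounds a model is trained on `trainingSet(folds, i)` and evaluated on `folds.get(i)`, and the scores are then combined across rounds. Every sample is used for evaluation exactly once, which gives a more stable estimate than a single train/test split on a small corpus.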
  29. Apache Flink
  30. Apache Flink ● Mature project: 320+ contributors, > 11K commits ● Very active project on GitHub ● Java/Scala ● Streaming first ● Fault-tolerant ● Unified batch and streaming APIs ● Stateful stream processing
  31. Apache Flink - NLP Pipeline
      final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      DataStream<Annotation> rawStream = env.readFile(
          new AnnotationInputFormat(NewsArticleAnnotationFactory.getFactory()),
          parameterTool.getRequired("file"));
      // Detect the language of each article, then split the stream by language
      SplitStream<Annotation> articleStream = rawStream
          .map(new LanguageDetectorFunction())
          .split(new LanguageSelector(nlpLanguages));
  32. Apache Flink - NLP Pipeline
      "eng")
          .map(new SentenceDetectorFunction(engSentenceModel))
          .map(new TokenizerFunction(engTokenizerModel))
          .map(new POSTaggerFunction(engPosModel))
          .map(new ChunkerFunction(engChunkModel))
          .map(new NameFinderFunction(engNerPersonModel))
          .addSink(new ElasticsearchSink<>(config, transportAddresses, new ESSinkFunction()));
  33. Apache Flink - NLP Pipeline
      "por")
          .map(new SentenceDetectorFunction(porSentenceModel))
          .map(new TokenizerFunction(porTokenizerModel))
          .map(new POSTaggerFunction(porPosModel))
          .map(new ChunkerFunction(porChunkModel))
          .map(new NameFinderFunction(porNerPersonModel))
          .addSink(new ElasticsearchSink<>(config, transportAddresses, new ESSinkFunction()));
  34. Apache Flink - NLP Pipeline
      private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> {
        // Returns the name of the output (the detected language code) for each element
        public Iterable<String> select(Tuple2<String, String> s) {
          List<String> list = new ArrayList<>();
          list.add(languageDetectorME.predictLanguage(s.f1).getLang());
          return list;
        }
      }
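
The routing idea behind the selector, sending each document to the pipeline for its detected language, can be sketched without Flink or OpenNLP. In this plain-Java illustration the class `LanguageRouter` is hypothetical and the detector is passed in as a function standing in for a real language detector; dropping documents in unsupported languages is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Flink or OpenNLP: groups documents by the
// language code returned by a detector function.
final class LanguageRouter {

    /**
     * Routes each document to the bucket of its detected language.
     * Documents whose language is not in the supported set are dropped.
     */
    static Map<String, List<String>> route(List<String> documents,
                                           Function<String, String> detector,
                                           Set<String> supported) {
        Map<String, List<String>> routed = new LinkedHashMap<>();
        for (String document : documents) {
            String language = detector.apply(document);
            if (supported.contains(language)) {
                routed.computeIfAbsent(language, k -> new ArrayList<>()).add(document);
            }
        }
        return routed;
    }
}
```

Each bucket then feeds a pipeline built from models for that language, mirroring the separate "eng" and "por" pipelines in the preceding slides.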
  35. Apache Flink - POS Tagger and NER
      class POSTaggerMapFunction extends RichMapFunction<Tuple2<String, String[]>, POSSample> {
        …
        // Create the tagger in open() so it is built once per task, not once per record
        public void open(Configuration parameters) throws Exception {
          posTagger = new POSTaggerME(model);
        }
        public POSSample map(Tuple2<String, String[]> s) {
          String[] tags = posTagger.tag(s.f1);
          return new POSSample(s.f0, s.f1, tags);
        }
      }
  36. Apache Flink - POS Tagger and NER
      class NameFinderMapFunction extends RichMapFunction<Tuple2<String, String[]>, NameSample> {
        …
        // Create the name finder in open() so it is built once per task, not once per record
        public void open(Configuration parameters) throws Exception {
          nameFinder = new NameFinderME(model);
        }
        public NameSample map(Tuple2<String, String[]> s) {
          Span[] names = nameFinder.find(s.f1);
          return new NameSample(s.f0, s.f1, names, null, true);
        }
      }
  37. TODO: Add Kibana preview screenshot
  38. What's Coming? ● Apache MXNet: mature project, backed by Amazon, Apple, Intel, NVIDIA ● Modular: tensor library, reinforcement learning, ETL, ... ● Focused on integrating with the JVM ecosystem while supporting the state of the art, such as GPUs on large clusters ● Implements most neural nets you'd need for language ● Named Entity Recognition using MXNet with LSTMs ● Language Detection using MXNet, for short texts ● Possible: translation using bidirectional LSTMs with embeddings ● Computation graph architecture for more advanced use cases
  39. Credits Suneel Marthi @suneelmarthi Tommaso Teofili @tteofili William Colen @wcolen Rodrigo Agerri @ragerri Jörn Kottmann @joernkottmann Peter Thygesen in:thygesen Daniel Russ in:daniel-russ-9541aa15 Koji Sekiguchi @kojisays Jeff Zemerick in:jeffzemerick Bruno Kinoshita @kinow
  40. Questions ???