SlideShare uma empresa Scribd logo
1 de 12
Baixar para ler offline
An open source tech stack to
Analyze all reviews on the Internet
Steffen Wenz, CTO TrustYou
steffen@trustyou.net
✓ Very good hotel!*
✓ Near city centre
“Close to the city center”
✓ Clean rooms
« Chambre impeccable »
✓ Popular with solo travelers
“Remote doesnt work”
*) Ramada Cluj (Full summary)
DBCrawling
Semantic
Analysis
TrustYou
Analytics
API
Google,
Hotels.com …
TrustYou Architecture
200 million
reqs/month
❤ Python
Scrapy
● Build your own web crawlers
● Extract data via CSS selectors, XPath, regexes …
● Handles “tag soup”, queuing, request parallelism,
cookies, throttling …
● Code sample on GitHub
NLP in Python
● NLTK
○ Word/sentence tokenization
○ POS tagging, parsing
● Great support for scientific computation:
NumPy, SciPy, Pandas
● Scikit-learn
● TensorFlow!
Gensim: Fun with Word2Vec
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django',
0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland',
0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
Big Data & Open Source
2004
MapReduce, GFS
BigTable, Spanner, F1
…
Apache
Beam …
Spark
● User writes driver program which transparently
schedules execution in a cluster
● Faster and more expressive than MapReduce
● Spark SQL: Interactive query of large datasets
● Spark Streaming: Spark is “batch first”, but fast enough
to implement stream processing with “mini batches”
● Spark MLlib: Machine learning
● Build complex pipelines of
batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
Luigi
Try it out!
GitHub repo showcasing:
● Luigi
● Scrapy
● Word2Vec model training with gensim
@ https://github.com/trustyou/meetups
steffen@trustyou.com

Mais conteúdo relacionado

Mais procurados

閒聊Python應用在game server的開發
閒聊Python應用在game server的開發閒聊Python應用在game server的開發
閒聊Python應用在game server的開發Eric Chen
 
Webinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDBWebinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDBMongoDB
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRick Copeland
 
ooc - OSDC 2010 - Amos Wenger
ooc - OSDC 2010 - Amos Wengerooc - OSDC 2010 - Amos Wenger
ooc - OSDC 2010 - Amos WengerAmos Wenger
 
ELK Stack - Turn boring logfiles into sexy dashboard
ELK Stack - Turn boring logfiles into sexy dashboardELK Stack - Turn boring logfiles into sexy dashboard
ELK Stack - Turn boring logfiles into sexy dashboardGeorg Sorst
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUJohn Colvin
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DMithun Hunsur
 
Why my Go program is slow?
Why my Go program is slow?Why my Go program is slow?
Why my Go program is slow?Inada Naoki
 
Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...
Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...
Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...PROIDEA
 
Robust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time ChecksRobust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time ChecksStoyan Nikolov
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineAndy McKay
 
Продвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev ToolsПродвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev ToolsFDConf
 
Introduction to MongoDB for C# developers
Introduction to MongoDB for C# developersIntroduction to MongoDB for C# developers
Introduction to MongoDB for C# developersTaras Romanyk
 
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013Gregg Donovan
 
History & Practices for UniRx(EN)
History & Practices for UniRx(EN)History & Practices for UniRx(EN)
History & Practices for UniRx(EN)Yoshifumi Kawai
 
Learning groovy -EU workshop
Learning groovy  -EU workshopLearning groovy  -EU workshop
Learning groovy -EU workshopadam1davis
 

Mais procurados (20)

閒聊Python應用在game server的開發
閒聊Python應用在game server的開發閒聊Python應用在game server的開發
閒聊Python應用在game server的開發
 
Webinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDBWebinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDB
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
 
ooc - OSDC 2010 - Amos Wenger
ooc - OSDC 2010 - Amos Wengerooc - OSDC 2010 - Amos Wenger
ooc - OSDC 2010 - Amos Wenger
 
ELK Stack - Turn boring logfiles into sexy dashboard
ELK Stack - Turn boring logfiles into sexy dashboardELK Stack - Turn boring logfiles into sexy dashboard
ELK Stack - Turn boring logfiles into sexy dashboard
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPU
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in D
 
Why my Go program is slow?
Why my Go program is slow?Why my Go program is slow?
Why my Go program is slow?
 
The State of JavaScript
The State of JavaScriptThe State of JavaScript
The State of JavaScript
 
Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...
Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...
Atmosphere 2016 - Krzysztof Kaczmarek - Don't fear the brackets - Clojure in ...
 
Robust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time ChecksRobust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time Checks
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App Engine
 
Продвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev ToolsПродвинутая отладка JavaScript с помощью Chrome Dev Tools
Продвинутая отладка JavaScript с помощью Chrome Dev Tools
 
Node.js extensions in C++
Node.js extensions in C++Node.js extensions in C++
Node.js extensions in C++
 
Introduction to MongoDB for C# developers
Introduction to MongoDB for C# developersIntroduction to MongoDB for C# developers
Introduction to MongoDB for C# developers
 
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
Living with Garbage by Gregg Donovan at LuceneSolr Revolution 2013
 
MongoDB
MongoDBMongoDB
MongoDB
 
History & Practices for UniRx(EN)
History & Practices for UniRx(EN)History & Practices for UniRx(EN)
History & Practices for UniRx(EN)
 
Learning groovy -EU workshop
Learning groovy  -EU workshopLearning groovy  -EU workshop
Learning groovy -EU workshop
 
Living With Garbage
Living With GarbageLiving With Garbage
Living With Garbage
 

Semelhante a DevTalks Cluj - Open-Source Technologies for Analyzing Text

Smart Data Meetup - NLP At Scale
Smart Data Meetup - NLP At ScaleSmart Data Meetup - NLP At Scale
Smart Data Meetup - NLP At ScaleSteffen Wenz
 
Integrate LLM in your applications 101
Integrate LLM in your applications 101 Integrate LLM in your applications 101
Integrate LLM in your applications 101 Marco De Nittis
 
Trip Report from Meeting C++ 2017: It's Way More Than C++
Trip Report from Meeting C++ 2017: It's Way More Than C++Trip Report from Meeting C++ 2017: It's Way More Than C++
Trip Report from Meeting C++ 2017: It's Way More Than C++Andrey Upadyshev
 
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기 Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기 Mario Cho
 
Edição de Texto Rico com React e Draft.js
Edição de Texto Rico com React e Draft.jsEdição de Texto Rico com React e Draft.js
Edição de Texto Rico com React e Draft.jsGuilherme Vierno
 
Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnOlivier Grisel
 
Future Tech Now: AI, ML & ChatBots
Future Tech Now: AI, ML & ChatBotsFuture Tech Now: AI, ML & ChatBots
Future Tech Now: AI, ML & ChatBotsSagittarius
 
Web_of_Things_2013
Web_of_Things_2013Web_of_Things_2013
Web_of_Things_2013Max Kleiner
 
Khan farhan cv
Khan farhan cvKhan farhan cv
Khan farhan cvfarhan0039
 
2015 Pharo Prague Lambda Meetup
2015 Pharo Prague Lambda Meetup2015 Pharo Prague Lambda Meetup
2015 Pharo Prague Lambda MeetupPharo
 
Moving Forward with AI
Moving Forward with AIMoving Forward with AI
Moving Forward with AIAdrian Hornsby
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
Resume-Aditya Parkhi NCSU_MSCS
Resume-Aditya Parkhi NCSU_MSCSResume-Aditya Parkhi NCSU_MSCS
Resume-Aditya Parkhi NCSU_MSCSaditya140
 
Low Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdfLow Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdfDenis Gagné
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Andrés Leonardo Martinez Ortiz
 
Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...
Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...
Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...Windows Developer
 

Semelhante a DevTalks Cluj - Open-Source Technologies for Analyzing Text (20)

Smart Data Meetup - NLP At Scale
Smart Data Meetup - NLP At ScaleSmart Data Meetup - NLP At Scale
Smart Data Meetup - NLP At Scale
 
Integrate LLM in your applications 101
Integrate LLM in your applications 101 Integrate LLM in your applications 101
Integrate LLM in your applications 101
 
Trip Report from Meeting C++ 2017: It's Way More Than C++
Trip Report from Meeting C++ 2017: It's Way More Than C++Trip Report from Meeting C++ 2017: It's Way More Than C++
Trip Report from Meeting C++ 2017: It's Way More Than C++
 
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기 Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
 
Edição de Texto Rico com React e Draft.js
Edição de Texto Rico com React e Draft.jsEdição de Texto Rico com React e Draft.js
Edição de Texto Rico com React e Draft.js
 
Real_World_0days.pdf
Real_World_0days.pdfReal_World_0days.pdf
Real_World_0days.pdf
 
Statistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learnStatistical Learning and Text Classification with NLTK and scikit-learn
Statistical Learning and Text Classification with NLTK and scikit-learn
 
Future Tech Now: AI, ML & ChatBots
Future Tech Now: AI, ML & ChatBotsFuture Tech Now: AI, ML & ChatBots
Future Tech Now: AI, ML & ChatBots
 
Web_of_Things_2013
Web_of_Things_2013Web_of_Things_2013
Web_of_Things_2013
 
Khan farhan cv
Khan farhan cvKhan farhan cv
Khan farhan cv
 
2015 Pharo Prague Lambda Meetup
2015 Pharo Prague Lambda Meetup2015 Pharo Prague Lambda Meetup
2015 Pharo Prague Lambda Meetup
 
Moving Forward with AI
Moving Forward with AIMoving Forward with AI
Moving Forward with AI
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Resume
ResumeResume
Resume
 
Resume-Aditya Parkhi NCSU_MSCS
Resume-Aditya Parkhi NCSU_MSCSResume-Aditya Parkhi NCSU_MSCS
Resume-Aditya Parkhi NCSU_MSCS
 
PonArasuNeranjSubramanian_ASU
PonArasuNeranjSubramanian_ASUPonArasuNeranjSubramanian_ASU
PonArasuNeranjSubramanian_ASU
 
Low Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdfLow Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdf
 
Chayma ben nacer
Chayma ben nacerChayma ben nacer
Chayma ben nacer
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies
 
Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...
Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...
Build 2017 - B8090 - How Microsoft Cognitive Services can help your apps comm...
 

Último

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Último (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

DevTalks Cluj - Open-Source Technologies for Analyzing Text

  • 1. An open source tech stack to Analyze all reviews on the Internet Steffen Wenz, CTO TrustYou steffen@trustyou.net
  • 2. ✓ Very good hotel!* ✓ Near city centre “Close to the city center” ✓ Clean rooms « Chambre impeccable » ✓ Popular with solo travelers “Remote doesnt work” *) Ramada Cluj (Full summary)
  • 3.
  • 5. Scrapy ● Build your own web crawlers ● Extract data via CSS selectors, XPath, regexes … ● Handles “tag soup”, queuing, request parallelism, cookies, throttling … ● Code sample on GitHub
  • 6. NLP in Python ● NLTK ○ Word/sentence tokenization ○ POS tagging, parsing ● Great support for scientific computation: NumPy, SciPy, Pandas ● Scikit-learn ● TensorFlow!
  • 7. Gensim: Fun with Word2Vec >>> # trained from 100k meetup descriptions! >>> m = gensim.models.Word2Vec.load("data/word2vec") >>> m.most_similar(positive=["python"])[:3] [(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)] >>> m.doesnt_match(["python", "c++", "javascript"]) 'c++' >>> m.most_similar(positive=["berlin"])[:3] [(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)] >>> m.most_similar(positive=["ladies"])[:3] [(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
  • 8. Big Data & Open Source 2004 MapReduce, GFS BigTable, Spanner, F1 … Apache Beam …
  • 9. Spark ● User writes driver program which transparently schedules execution in a cluster ● Faster and more expressive than MapReduce ● Spark SQL: Interactive query of large datasets ● Spark Streaming: Spark is “batch first”, but fast enough to implement stream processing with “mini batches” ● Spark MLlib: Machine learning
  • 10. ● Build complex pipelines of batch jobs ○ Dependency resolution ○ Parallelism ○ Resume failed jobs ● Some support for Hadoop ● Pythonic replacement for Oozie Luigi
  • 11. Try it out! GitHub repo showcasing: ● Luigi ● Scrapy ● Word2Vec model training with gensim @ https://github.com/trustyou/meetups