PyBabe
Eat whatever data you wanna eat




                                  Dataiku™
Project goal

 Integrate game logs for a large social gaming actor

       IsCool Entertainment (Euronext: ALWEK), 70 people,
       €10M revenues.

 Around 30 GB of raw logs per day for 7 games (web, mobile)

       That's about 10 TB per year.

       Some Hadoop'ing + analytics SQL at the end, but
       lots of data integration in the middle

 Any kind of logs and data

       Partial database extracts

       Apache/Nginx logs

       Tracking logs (web analytics, etc.)

       Application logs

       REST APIs (currency exchange, geo data,
       Facebook APIs, …)

As a reminder
What do most data scientists do?

    LinkedIn & Twitter               Real life

    "Data Science"              80% of their time is spent
    "Recommendation"            getting the data right
    "Clustering algorithms"
    "Big Data"                  19% analytics
    "Machine Learning"
    "Hidden Markov Model"       1% Twitter & LinkedIn
    "Predictive Analytics"
    "Logistic Regression"

Goal

 A project based on an ETL solution had
 previously failed

 Need for

     Agility

     The ability to manage any data

     Speed

 The answer is…

     PYTHON !!!

Step 1: Open your favorite
editor, write a .py file

 Scripts for data parsing, filling up the
 database, enrichment, cleanup,
 etc.

 Around 2,000 lines of code

 5 man-days of work

      Good, but hard to maintain
      in the long run

      Not fun

 I switched from emacs to
 SublimeText2 in the meantime; that
 was cool.

Step 2: Abstract and
generalize: PyBabe

 A micro-ETL in Python

 Can read and write: FTP, HTTP, SQL, filesystem, Amazon S3, e-mail, ZIP,
 GZIP, MongoDB, Excel, etc.

 Basic file filters and transformations (filters, regular expressions, date parsing,
 GeoIP, transpose, sort, group, …)

 Uses yield and named tuples

 Open source
     https://github.com/fdouetteau/PyBabe

 And the old project?
     It became 200 lines of specific code

Sample PyBabe script
(1) Fetch a log file from S3 and integrate it into a database

babe = Babe()

## Fetch multiple CSV files from S3, cache locally
babe = babe.pull(url="s3://myapp/mydir 2012-07-07_*.csv.gz", cache=True)

## Read the IP from the "ip" field, look up the country via GeoIP
babe = babe.geoip_country_code(field="ip", country_code="country", ignore_error=True)

## Parse the user agent and store the browser name
babe = babe.user_agent(field="user_agent", browser="browser")

## Keep only the relevant fields
babe = babe.filterFields(fields=["user_id", "date", "country", "user_agent"])

## Store the result in a database
babe.push_sql(database="mydb", table="mytable", username="…")

Sample PyBabe script
(2) Large-file sort, join

babe = Babe()

## Fetch a large CSV file
babe = babe.pull(filename="mybigfile.csv")

## Perform a disk-based sort, batching 100k lines in memory
babe = babe.sortDiskBased(field="uid", nsize=100000)

## Group by uid and sum revenue per user
babe = babe.groupBy(field="uid", reducer=lambda x, y: (x.uid, x.amount + y.amount))

## Join this stream on "uid" with the result of a SQL query
babe = babe.join(Babe().pull_sql(database="mydb", table="user_info"), "uid", "uid")

## Store the result in an Excel file
babe.push(filename="reports.xlsx")

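The disk-based sort above can be sketched in plain Python: sort fixed-size batches in memory, spill each sorted run to a temporary file, then lazily k-way merge the runs with `heapq.merge`. This is an illustrative external merge sort under assumed names (`sort_disk_based`, its signature), not PyBabe's actual `sortDiskBased` implementation.

```python
import heapq
import itertools
import tempfile

def sort_disk_based(rows, key, nsize=1000):
    """Sort an arbitrarily large iterable of text rows in bounded memory:
    sort batches of `nsize` rows, spill each batch to a temp file as a
    sorted run, then lazily merge the runs with heapq.merge."""
    rows = iter(rows)  # ensure islice consumes the input progressively
    runs = []
    while True:
        batch = sorted(itertools.islice(rows, nsize), key=key)
        if not batch:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(line + "\n" for line in batch)
        run.seek(0)
        runs.append(run)
    # Each run is read back lazily, so memory stays bounded by nsize
    yield from heapq.merge(*((line.rstrip("\n") for line in run) for run in runs),
                           key=key)

# Example: sort 2,500 numeric rows arriving in reverse order, 1,000 per batch
rows = (str(n) for n in range(2500, 0, -1))
out = list(sort_disk_based(rows, key=int))
```

The same batch-spill-merge structure is what makes a `nsize` parameter like the one in the slide meaningful: it is the only knob controlling peak memory.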
Sample PyBabe script
(3) Mail a report

babe = Babe()

## Pull the result of a SQL query
babe = babe.pull(database="mydb", name="First Query", query="SELECT …")

## Pull the result of a second SQL query
babe = babe.pull(database="mydb", name="Second Query", query="SELECT …")

## Send the overall (concatenated) stream as an email, with the content
## attached as Excel and some sample data in the body
babe = babe.sendmail(subject="Your Report", recipients="fd@me.com", data_in_body=True,
                     data_in_body_row_limit=10, attach_formats="xlsx")

Some Design Choices

 Use collections.namedtuple

 Use generators

     Nice and easy programming style

          def filter(stream, f):
              for data in stream:
                  if isinstance(data, StreamMeta):
                      yield data
                  elif f(data):
                      yield data

 IO streaming whenever possible

     An HTTP-downloaded file begins to be processed as it starts downloading

 Use bulk loaders (SQL) or an external program when faster than the Python
 implementation (e.g. gzip)

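The two design choices above (namedtuples for rows, generators for lazy filtering) combine into a minimal streaming pipeline like the following sketch. The `Visit` type, `source`, and `keep` are illustrative names, not PyBabe's real API.

```python
from collections import namedtuple

# Rows are lightweight, field-addressable namedtuples
Visit = namedtuple("Visit", ["name", "country"])

def source():
    # A generator source: rows are produced one at a time
    yield Visit("Florian", "FR")
    yield Visit("John", "US")
    yield Visit("Phil", "FR")

def keep(stream, predicate):
    # Lazily pass through only the rows matching the predicate;
    # nothing is materialized, so the pipeline streams end-to-end
    for row in stream:
        if predicate(row):
            yield row

french = list(keep(source(), lambda v: v.country == "FR"))
```

Because every stage is a generator, chaining ten such filters still processes one row at a time, which is what makes the "start processing an HTTP file while it downloads" behavior possible.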
PyBabe data model

 A Babe works on a generator that contains
 a sequence of partitions

 A partition is composed of a header
 (StreamHeader), rows, and a footer
 (StreamFooter)

     def sample_pull():
         header = StreamHeader(name="visits",
                               partition={'day': '2012-09-14'},
                               fields=["name", "day"])
         yield header
         yield header.makeRow('Florian', '2012-09-14')
         yield header.makeRow('John', '2012-09-14')
         yield StreamFooter()

         yield header.replace(partition={'day': '2012-09-15'})
         yield header.makeRow('Phil', '2012-09-15')
         yield StreamFooter()

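A consumer of that flat header/rows/footer stream just regroups items by sentinel type. The sketch below uses minimal stand-ins for `StreamHeader` and `StreamFooter` (assumptions, not PyBabe's real classes) so it runs standalone; `partitions` is a hypothetical helper illustrating the regrouping.

```python
from collections import namedtuple

# Minimal stand-ins for the protocol objects described on the slide
StreamHeader = namedtuple("StreamHeader", ["name", "partition", "fields"])

class StreamFooter:
    pass

def sample_pull():
    header = StreamHeader(name="visits",
                          partition={"day": "2012-09-14"},
                          fields=["name", "day"])
    yield header
    yield ("Florian", "2012-09-14")
    yield StreamFooter()
    yield header._replace(partition={"day": "2012-09-15"})
    yield ("Phil", "2012-09-15")
    yield StreamFooter()

def partitions(stream):
    """Regroup the flat stream into (header, rows) pairs, one per partition."""
    header, rows = None, []
    for item in stream:
        if isinstance(item, StreamHeader):
            header, rows = item, []
        elif isinstance(item, StreamFooter):
            yield header, rows
        else:
            rows.append(item)

parts = list(partitions(sample_pull()))
```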
Some thoughts and
associated projects

 strptime and performance

     Parsing a date with time.strptime or datetime.strptime:
         30 microseconds vs. 3 microseconds for regexp matching!

     "Tarpys", a date-parsing library with date guessing

 Charset management (pyencoding_cleaner)

     Sniff ISO or UTF-8 charsets over a fragment

     Optionally try to fix bad encodings (î, í, ü)

 Python 2.x's csv module is OK, but…

     No Unicode support

     Separator sniffing is buggy on edge cases

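The strptime-vs-regexp gap is easy to measure. A minimal benchmark sketch, for a fixed `YYYY-MM-DD` format: the exact microsecond figures vary by machine and Python version, but a precompiled regexp is consistently several times faster than `datetime.strptime`.

```python
import re
import timeit
from datetime import datetime

# Precompiled regexp for a fixed YYYY-MM-DD layout
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def parse_re(s):
    y, m, d = DATE_RE.match(s).groups()
    return int(y), int(m), int(d)

# Time 10,000 parses of the same date string with each approach
t_strptime = timeit.timeit(
    lambda: datetime.strptime("2012-09-14", "%Y-%m-%d"), number=10000)
t_regex = timeit.timeit(
    lambda: parse_re("2012-09-14"), number=10000)
# t_strptime is reliably larger than t_regex; the slide reports
# roughly 30µs vs. 3µs per call on its hardware.
```

The trade-off: the regexp only handles the one layout it was written for, which is why a guessing layer like "Tarpys" is still needed on top.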
Future

 Separate the GitHub project into core and plugins

 Rewrite the CSV module in C?

 Configurable error system: should an error row fail the
 whole stream, fail the whole babe, send a warning, or
 be skipped?

 Pandas/NumPy integration

 A homepage, docs, etc.

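One way the configurable error system could look, as a sketch: a per-stage policy deciding what happens to a row whose transformation raises. The policy names and the `process` helper are hypothetical, not PyBabe API.

```python
import warnings

# Hypothetical per-stage error policies
SKIP, WARN, FAIL_STREAM = "skip", "warn", "fail_stream"

def process(rows, transform, on_error=SKIP):
    """Apply transform to each row, handling failures per the policy:
    re-raise (fail the stream), warn and drop, or drop silently."""
    for row in rows:
        try:
            yield transform(row)
        except Exception as exc:
            if on_error == FAIL_STREAM:
                raise
            if on_error == WARN:
                warnings.warn(f"bad row {row!r}: {exc}")
            # SKIP: drop the row silently

clean = list(process(["1", "2", "oops", "4"], int, on_error=SKIP))
```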
Any questions?

                  babe = Babe().pull("questions.csv")

                  babe = babe.filter(smart=True)

                  babe = babe.mapTo(oracle)

                  babe.push("answers.csv")

Florian Douetteau
@fdouetteau
CEO, Dataiku

Dataiku: our goal is to leverage and provide the best of open
source technologies to help people build their own data science platform
