Applied Data Analytics 
Building a real data product
Github Repository 
http://bit.ly/1eLBzki 
Matrix Factorization 
http://slidesha.re/15Qssf0 
Links to various resources
Goals for this Course 
● Apply the ideas and tools learned during all previous program courses 
● Use a real world data set with actionable prediction 
● Present a completed project to faculty and peers 
● Build a data project portfolio 
What are your goals? 
● Understand the Data Science Pipeline 
● Understand what a complete data product looks like 
● Be able to set up and implement a data product in Python
Some Logistics 
This is a small class, so I’m hoping for lots of participation! 
Course materials can be found in three places: 
● IPython: http://bit.ly/1gJ73Tt 
● Github: https://github.com/DistrictDataLabs/science-bookclub 
● Slides: on SlideShare or on Blackboard 
Recommended Reading: 
● Matrix Factorization: A simple tutorial and implementation 
● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
Agenda - Day One 
● Review Data Products 
● Review Data Science Pipeline 
● Discuss architecture of the data product we’re going to build. 
● Setting up our project 
● Ingestion of Goodreads Data 
● Lunch 
● Creating a command line admin program 
● Wrangling of Goodreads Data 
● A computational data store
Agenda - Day Two 
● Review current state of recommender project 
● Matrix math review 
● Introduction to matrix factorization 
● Building a recommender system 
● Reporting with Jinja2 
● Lunch 
● Presentations of Capstone Projects 
● Course wrap-up
Building Data Products
“A data product is a product that is 
based on the combination of data 
and algorithms.” 
Hilary Mason 
“A data application acquires its value from the 
data itself, and creates more data as a result. 
It’s not just an application with data; it’s a 
data product. Data science enables the 
creation of data products.” 
Mike Loukides 
The Data Science Pipeline
1. Data Ingestion 
2. Data Munging and Wrangling 
3. Computation and Analyses 
4. Modeling and Application 
5. Reporting and Visualization
Data Ingestion 
● There is a world of data out 
there. How do we get it? Web 
crawlers, APIs, sensors? Python 
and other scripting languages 
are well suited to this task. 
● The real question is how can we 
deal with such a giant volume 
and velocity of data? 
● Big Data and Data Science often 
require ingestion specialists!
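A minimal sketch of an ingestion step, assuming we only want to fetch a few URLs and keep the untouched responses on disk. The names `fetch_url` and `ingest`, the `.raw` file suffix, and the example URL are hypothetical, not taken from the course repository. The fetcher is passed in as a parameter so the demo can run without touching the network.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=10):
    """Download the raw bytes at `url` using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def ingest(urls, raw_dir, fetch=fetch_url):
    """Fetch each URL and store the untouched response in `raw_dir`.

    Returns a dict mapping each URL to the path of its stored copy.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    stored = {}
    for url in urls:
        # Hash the URL so that any address maps to a safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".raw"
        path = raw_dir / name
        path.write_bytes(fetch(url))
        stored[url] = path
    return stored


# Demo with a stand-in fetcher (an assumption for illustration only).
def fake_fetch(url):
    return ("<book>" + url + "</book>").encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    result = ingest(["http://example.com/book/1"], tmp, fetch=fake_fetch)
    for url, path in result.items():
        print(url, "->", path.name)
```

Storing the response bytes unchanged keeps the raw copy available if the wrangling rules change later.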
Data Wrangling 
● Warehousing the data means 
storing the data in as raw a form 
as possible. 
● Extract, transform, and load 
operations move data to 
operational storage locations. 
● Filtering, aggregation, 
normalization and 
denormalization all ensure data is 
in a form it can be computed on. 
● Annotated training sets must be 
created for ML tasks.
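The extract, transform, and load idea can be sketched as follows. This is an illustration only: it uses the standard library `sqlite3` module in place of SQLAlchemy so that it stays self-contained, and the records, table name, and column names are made up.

```python
import sqlite3

# Hypothetical raw records, shaped like something an ingestion step might save.
RAW_REVIEWS = [
    {"reader": " Alice ", "title": "Dune", "rating": "5"},
    {"reader": "bob", "title": "Dune", "rating": "4"},
    {"reader": "alice", "title": "Emma", "rating": ""},  # missing rating
    {"reader": "Bob", "title": "Emma", "rating": "2"},
]


def transform(record):
    """Normalize one raw record, or return None if it cannot be used."""
    rating = record["rating"].strip()
    if not rating.isdigit():
        return None  # filter out rows without a numeric rating
    return (record["reader"].strip().lower(), record["title"].strip(), int(rating))


def load(records, conn):
    """Create the reviews table and insert every usable record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "reader TEXT NOT NULL, title TEXT NOT NULL, rating INTEGER NOT NULL, "
        "PRIMARY KEY (reader, title))"
    )
    rows = [row for row in map(transform, records) if row is not None]
    conn.executemany("INSERT OR REPLACE INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load(RAW_REVIEWS, conn)
# Aggregation now runs on clean, typed data.
averages = conn.execute(
    "SELECT title, AVG(rating) FROM reviews GROUP BY title ORDER BY title"
).fetchall()
print(loaded, averages)
```

Filtering and normalization happen in `transform`, so the operational store only ever holds rows that can be computed on.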
Computation and Analyses 
● Hypothesis driven computation 
includes design and development 
of predictive models. 
● Many models have to be trained 
or constrained into a 
computational form like a Graph 
database, and this is time 
consuming. 
● Other data products like indices, 
relations, classifications, and 
clusters may be computed.
Modeling and Application 
This is the part we’re most familiar with. 
Supervised classification and unsupervised 
clustering: Bayes, logistic regression, 
decision trees, and other models. 
This is also where the money is.
Reporting and Visualization 
● Often overlooked, this part is 
crucial, even if we have data 
products. 
● Humans recognize patterns 
better than machines. Human 
feedback is crucial in Active 
Learning and remodeling (error 
detection). 
● Mashups and collaborations 
generate more data, and 
therefore more value!
Don’t forget feedback! 
(Active Learning for Data 
Products)
What we’re going to build today 
SCIENCE BOOKCLUB!! 
● A book club that chooses what to 
read via a recommender system. 
● Uses GoodReads data to ingest 
and return feedback on books. 
● Statistical model is a non-negative 
matrix factorization 
● Reporting using Jinja (almost a 
web app)
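To make the non-negative matrix factorization idea concrete, here is a rough sketch using numpy. It uses masked multiplicative updates, which is one common way to fit such a model and may differ from what the course repository does. The function name `nmf`, the tiny ratings matrix, and the convention that 0 means "not rated" are all assumptions for illustration.

```python
import numpy as np


def nmf(ratings, rank=2, iterations=500, eps=1e-9, seed=0):
    """Factor `ratings` (readers x books) into non-negative W (readers x rank)
    and H (rank x books) using masked multiplicative updates.

    Entries equal to 0 are treated as "not rated" and ignored in the fit.
    """
    V = np.asarray(ratings, dtype=float)
    mask = (V > 0).astype(float)
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 0.1
    H = rng.random((rank, V.shape[1])) + 0.1
    for _ in range(iterations):
        # Each update multiplies by a non-negative ratio, so W and H stay >= 0.
        H *= (W.T @ (mask * V)) / (W.T @ (mask * (W @ H)) + eps)
        W *= ((mask * V) @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H


# Hypothetical ratings: rows are readers, columns are books, 0 means unrated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])
W, H = nmf(R)
predicted = W @ H
print(np.round(predicted, 2))
```

The product `W @ H` has a value in every cell, including the ones that were 0 in `R`. Those filled-in cells are the predicted ratings a recommender would rank to suggest the next book.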
Workflow 
1. Setting up a Python skeleton 
2. Creating and Running Tests 
3. Wading in with a configuration 
4. Ingestion with urllib and requests 
5. Creating a command line admin with argparse 
6. Wrangling with BeautifulSoup and SQLAlchemy 
7. Modeling with numpy 
8. Reporting with Jinja2
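Step 5 of the workflow could be sketched like this. It is a skeleton under assumptions, not the course's actual admin program: the program name `admin`, the subcommands, and their options are hypothetical, and each command only reports what it would do.

```python
import argparse


def build_parser():
    """Create the parser for a hypothetical `admin` program with subcommands."""
    parser = argparse.ArgumentParser(
        prog="admin", description="Administer the book club data product."
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    ingest = subparsers.add_parser("ingest", help="download raw data")
    ingest.add_argument("urls", nargs="+", help="one or more URLs to fetch")

    wrangle = subparsers.add_parser("wrangle", help="load raw data into the database")
    wrangle.add_argument("--database", default="books.db", help="database file path")

    report = subparsers.add_parser("report", help="render the recommendation report")
    report.add_argument("--output", default="report.html", help="where to write the report")
    return parser


def main(argv=None):
    """Parse `argv` (or sys.argv when None) and describe the chosen command."""
    args = build_parser().parse_args(argv)
    # A real program would dispatch to the matching module here.
    if args.command == "ingest":
        return "would ingest {} url(s)".format(len(args.urls))
    if args.command == "wrangle":
        return "would wrangle into {}".format(args.database)
    return "would write report to {}".format(args.output)


# Passing an explicit argument list makes the demo independent of sys.argv.
print(main(["ingest", "http://example.com/book/1"]))
```

One subcommand per pipeline stage keeps the admin program aligned with the modules, so each stage can be run and tested on its own.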
Octavo Architecture (really clear DSP) 
● Ingestion Module (requests.py) 
● Raw Data Storage 
● Wrangling Module (BeautifulSoup, SQLAlchemy) 
● Computational Data Storage 
● Recommender Module (Numpy) 
● Reporting Module (Matplotlib, Jinja2)
Let’s dive into some code!

 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

Building Data Products with Python (Georgetown)

  • 1. Applied Data Analytics Building a real data product
  • 2. Github Repository http://bit.ly/1eLBzki Matrix Factorization http://slidesha.re/15Qssf0 Links to various resources
  • 3. Goals for this Course ● Apply the ideas and tools learned during all previous program courses ● Use a real world data set with actionable prediction ● Present a completed project to faculty and peers ● Build a data project portfolio What are your goals? ● Understand the Data Science Pipeline ● Understand what a complete data product looks like ● Be able to set up and implement a data product in Python
  • 4. Some Logistics This is a small class, I’m hoping for lots of participation! Course materials can be found in the following places: ● iPython: http://bit.ly/1gJ73Tt ● Github: https://github.com/DistrictDataLabs/science-bookclub ● Slides: on slideshare or on Blackboard Recommended Reading: ● Matrix Factorization: A simple tutorial and implementation ● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
  • 5. Agenda - Day One ● Review Data Products ● Review Data Science Pipeline ● Discuss architecture of the data product we’re going to build. ● Setting up our project ● Ingestion of Goodreads Data ● Lunch ● Creating a command line admin program ● Wrangling of Goodreads Data ● A computational data store
  • 6. Agenda - Day Two ● Review current state of recommender project ● Matrix math review ● Introduction to matrix factorization ● Building a recommender system ● Reporting with Jinja2 ● Lunch ● Presentations of Capstone Projects ● Course wrap-up
  • 8. “A data product is a product that is based on the combination of data and algorithms.” Hilary Mason
  • 9.
  • 10. “A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.” Mike Loukides
  • 11.
  • 12. The Data Science Pipeline
  • 13. Data Ingestion → Data Munging and Wrangling → Computation and Analyses → Modeling and Application → Reporting and Visualization
  • 14. Data Ingestion ● There is a world of data out there; how do we get it? Web crawlers, APIs, sensors? Python and other web scripting languages are custom-made for this task. ● The real question is how we can deal with such a giant volume and velocity of data. ● Big Data and Data Science often require ingestion specialists!
  • 15. Data Wrangling ● Warehousing the data means storing the data in as raw a form as possible. ● Extract, transform, and load operations move data to operational storage locations. ● Filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on. ● Annotated training sets must be created for ML tasks.
  • 16. Computation and Analyses ● Hypothesis-driven computation includes design and development of predictive models. ● Many models have to be trained or constrained into a computational form like a graph database, and this is time-consuming. ● Other data products like indices, relations, classifications, and clusters may be computed.
  • 17. Modeling and Application This is the part we’re most familiar with. Supervised classification, Unsupervised clustering - Bayes, Logistic Regression, Decision Trees, and other models. This is also where the money is.
  • 18. Reporting and Visualization ● Often overlooked, this part is crucial, even if we have data products. ● Humans recognize patterns better than machines. Human feedback is crucial in Active Learning and remodeling (error detection). ● Mashups and collaborations generate more data, and therefore more value!
  • 19. Don’t forget feedback! (Active Learning for Data Products)
  • 20. What we’re going to build today SCIENCE BOOKCLUB!! ● A book club that chooses what to read via a recommender system. ● Uses GoodReads data to ingest and return feedback on books. ● Statistical model is a non-negative matrix factorization ● Reporting using Jinja (almost a web app)
  • 21. Workflow 1. Setting up a Python skeleton 2. Creating and Running Tests 3. Wading in with a configuration 4. Ingestion with urllib and requests 5. Creating a command line admin with argparse 6. Wrangling with BeautifulSoup and SQLAlchemy 7. Modeling with numpy 8. Reporting with Jinja2
  • 22. Octavo Architecture (really clear DSP) [diagram]: Ingestion Module (requests.py) → Raw Data Storage → Wrangling Module (BeautifulSoup, SQLAlchemy) → Computational Data Storage → Recommender Module (Numpy) → Reporting Module (Jinja2, Matplotlib)
  • 23. Let’s dive into some code!
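
As a starting point for the "Modeling with numpy" step, here is a minimal sketch of a non-negative matrix factorization recommender. It is not the course's own implementation. It assumes a small users-by-books ratings matrix in which 0 means "not yet rated", and it uses masked multiplicative updates, one common way to fit such a model. The names `factorize`, `masked_error`, and `recommend`, along with the toy ratings, are hypothetical and chosen only for illustration.

```python
import numpy as np


def factorize(ratings, n_factors=2, n_iter=500, eps=1e-9, seed=0):
    """Approximate `ratings` (users x books) as W @ H with W, H >= 0.

    Assumption: a 0 in `ratings` means "not yet rated" and is ignored in the fit.
    """
    R = np.asarray(ratings, dtype=float)
    mask = (R > 0).astype(float)  # 1 where a rating was observed
    rng = np.random.default_rng(seed)
    n_users, n_books = R.shape
    W = rng.random((n_users, n_factors)) + 0.1  # strictly positive start
    H = rng.random((n_factors, n_books)) + 0.1

    for _ in range(n_iter):
        # Multiplicative updates never flip a sign, so W and H stay non-negative.
        WH = mask * (W @ H)
        W *= (R @ H.T) / (WH @ H.T + eps)
        WH = mask * (W @ H)
        H *= (W.T @ R) / (W.T @ WH + eps)
    return W, H


def masked_error(ratings, W, H):
    """Sum of squared errors over the observed ratings only."""
    R = np.asarray(ratings, dtype=float)
    mask = R > 0
    return float((((W @ H) - R)[mask] ** 2).sum())


def recommend(ratings, predicted, user):
    """Return the user's unrated book indices, best predicted score first."""
    unrated = np.flatnonzero(np.asarray(ratings)[user] == 0)
    order = np.argsort(predicted[user, unrated])[::-1]
    return [int(i) for i in unrated[order]]


# Toy data: 5 readers, 4 books, ratings 1-5, 0 = unread.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

W, H = factorize(ratings)
predicted = W @ H  # dense matrix: a predicted score for every reader/book pair
print(np.round(predicted, 2))
print("Suggested reading order for reader 1:", recommend(ratings, predicted, user=1))
```

The mask keeps unread books from being fit as real zero ratings, so the scores filled into those cells come from the learned factors rather than from the missing data.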