Jonathan Dinu

Co-Founder, Zipfian Academy

jonathan@zipfianacademy.com

@clearspandex
@ZipfianAcademy
Data Engineering 101: Building your first
data product
May 4th, 2014
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
Formerly
Formerly
Currently
Disclaimer:
All characters appearing in this presentation are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.
Disclaimer:
This presentation contains strong opinions that you may or may not agree with. All thoughts are my own.
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Creating Value for Users
• Q&A
nwsrdr (News Reader)
Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png
Get nwsrdr on your desktop
getnews.com/bookmarklet
When browsing the web, simply click the +nwsrdr button to save any page to nwsrdr
nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Auto-categorize Articles
• Find Similar Articles
• Recommend Articles
• Suggest Feeds to Follow
• No Ads!
nwsrdr
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model!
Data Products
Product Built on Data (that you sell)
OR
Product that Generates Data (that you sell), e.g. Facebook
Source: http://gifgif.media.mit.edu/
Data Generating Products
Source: http://www.adamlaiacano.com/post/57703317453/data-generating-products
Products that enhance a user’s experience the more “data” the user provides
Ex: Recommender Systems
Data Science
i.e. solve more problems than you create
But.... How?!?!?!!?
Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan-Meme.jpg
Data Engineering!
Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG
Data Science
[Diagram: Prepared Data → Sampling → Training Set / Test Set → Train Model → Evaluate (Cross Validation)]
Data Engineering
[Diagram: Raw Data → Scrubbing → Cleaned Data → Vectorization → Prepared Data → Sampling → Training Set / Test Set → Train Model → Evaluate (Cross Validation); New Data → Scrubbing → Cleaned Data → Vectorization → Prepared Data → Predict → Labels/Classes]
What
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model
Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg
Abstraction (Cake)
How
(ABK)
Questions? tweet @zipfianacademy #pydata
Obligatory Name Drop
• Acquisition: requests (locally); scrapy (at scale)
• Parse: BeautifulSoup4 (locally); Hadoop Streaming w/ BeautifulSoup4 (at scale)
• Storage: pymongo (locally); Snakebite (HDFS) (at scale)
• Transform/Explore: pandas (locally); mrjob or Mortar w/ Python UDF (at scale)
• Vectorization, Train Model: scikit-learn/NLTK (locally); MLlib (pySpark) (at scale)
• Expose: yHat
• Presentation: Flask
Pipeline
Iteration 0:
• Find out how much data	

• Run locally	

• Experiment
Acquire
Retrieve Meta-data for ALL NYT articles
import requests

api_key = 'xxxxxxxxxxxxx'

# restrict the search to a handful of sections, newest first
url = ('http://api.nytimes.com/svc/search/v2/articlesearch.json'
       '?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")'
       '&sort=newest&api-key=' + api_key)

# make an API request
api = requests.get(url)
# parse resulting JSON and insert into a MongoDB collection
# (collection is a pymongo Collection opened earlier, not shown on the slides)
for content in api.json()['response']['docs']:
    if not collection.find_one(content):
        collection.insert_one(content)

# only returns 10 per page
print("There are only %i documents returned 0_o" % len(api.json()['response']['docs']))
# there are many more than 10 articles, however
total_art = articles_left = api.json()['response']['meta']['hits']

print("There are currently %s articles in the NYT archive" % total_art)

#=> There are currently 15277775 articles in the NYT archive
Acquire
Gotchas!
• Rate Limiting	

• Page Limiting
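The rate-limiting gotcha can be sketched in isolation as "retry with a growing delay". This is a minimal illustration, not the talk's code: `fetch_with_backoff`, `FakeResponse`, and the default delays are hypothetical names and values, and treating 403 and 429 as "rate limited" is an assumption (the deck's own loop checks for 403). Page limiting is not covered here.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a response that is not rate limited.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    """
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code not in (403, 429):
            return response
        # rate limited: wait 1x, 2x, 4x, ... the base delay before retrying
        sleep(base_delay * (2 ** attempt))
    raise RuntimeError("still rate limited after %d attempts" % max_retries)


class FakeResponse(object):
    """Stand-in for a requests.Response so the sketch runs offline."""
    def __init__(self, status_code):
        self.status_code = status_code


# simulate an API that rejects the first two calls, then succeeds
statuses = iter([403, 403, 200])
delays = []  # record the requested sleeps instead of actually sleeping
result = fetch_with_backoff(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(result.status_code, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; with a real API you would leave the default `time.sleep` in place.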
Questions? tweet @zipfianacademy #pydata
Acquire
Iterate
Iteration 1:
• (Meaningful) Sample of Data	

• Prototype — “Close the Loop”
Questions? tweet @zipfianacademy #pydata
Retrieve Meta-data for ALL NYT articles
Questions? tweet @zipfianacademy #pydata
Acquire
(take 2)
Iteration 0.5
import time

import requests
from dateutil import parser  # assumption: the slides' `parser` is dateutil.parser
from pymongo import errors

# page, page_count, max_pages, cursor_count, final_page and last_date
# are initialized earlier (not shown on the slides)

# let us loop (and hopefully not hit our rate limit)
while articles_left > 0 and page_count < max_pages:
    more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))
    print("Inserting page " + str(page))
    # make sure it was successful
    if more_articles.status_code == 200:
        for content in more_articles.json()['response']['docs']:
            latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")
            if not collection.find_one(content) and content['document_type'] == 'article':
                print("No dups")
                try:
                    print("Inserting article " + str(content['headline']))
                    collection.insert_one(content)
                except errors.DuplicateKeyError:
                    print("Duplicates")
                    continue
            else:
                print("In collection already")
        # …
        articles_left -= 10
        page += 1
        page_count += 1
        cursor_count += 1
        final_page = max(final_page, page)
    else:
        if more_articles.status_code == 403:
            print("Sleepy...")
            # account for rate limiting
            time.sleep(2)
        elif cursor_count > 100:
            print("Adjusting date")
            # account for page limiting
            cursor_count = 0
            page = 0
            last_date = latest_article
        else:
            print("ERRORS: " + str(more_articles.status_code))
            cursor_count = 0
            page = 0
            last_date = latest_article
Acquire
Download HTML content of articles from NYT.com (and store in MongoDB™)
import requests
from bs4 import BeautifulSoup

# now we can get some content!
#limit = 100
limit = 10000

for article in collection.find({'html' : {'$exists' : False} }):
    if limit and limit > 0:
        if 'html' not in article and article['document_type'] == 'article':
            limit -= 1
            print(article['web_url'])
            html = requests.get(article['web_url'] + "?smid=tw-nytimes")

            if html.status_code == 200:
                soup = BeautifulSoup(html.text, 'html.parser')

                # serialize html
                collection.update_one({ '_id' : article['_id'] },
                                      { '$set' : { 'html' : str(soup), 'content' : [] } })

                for p in soup.find_all('div', class_='articleBody'):
                    collection.update_one({ '_id' : article['_id'] },
                                          { '$push' : { 'content' : p.get_text() } })
Parse
Parse HTML with BeautifulSoup and extract the article body (and store in MongoDB™)
# parse HTML content of articles
for article in collection.find({'html' : {'$exists' : True} }):
    print(article['web_url'])
    soup = BeautifulSoup(article['html'], 'html.parser')
    arts = soup.find_all('div', class_='articleBody')

    # fall back to the other article markup when the first selector finds nothing
    if len(arts) == 0:
        arts = soup.find_all('p', class_='story-body-text')

    …
Parse
Store
    # still inside the loop over articles: append each extracted chunk of text
    for p in arts:
        collection.update_one({ '_id' : article['_id'] },
                              { '$push' : { 'content' : p.get_text() } })
Store
Explore
Exploratory Data Analysis with pandas
import matplotlib.pyplot as plt

# articles is a pandas DataFrame with 'text' and 'section' columns
articles.describe()
#          text  section
# count    1405     1405
# unique   1397       10

fig = plt.figure()
# bar chart of section counts
articles['section'].value_counts().plot(kind='bar')
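The same two questions that `describe()` and `value_counts()` answer (how many distinct values, how often each occurs) can be sketched with the standard library alone. The records below are made-up stand-ins for the stored articles, not data from the talk.

```python
from collections import Counter

# hypothetical records shaped like the stored articles
articles = [
    {"section": "Sports", "text": "Giants win in extra innings"},
    {"section": "Arts", "text": "A new gallery opens downtown"},
    {"section": "Sports", "text": "Giants win in extra innings"},  # duplicate text
    {"section": "World", "text": "Talks resume after a long pause"},
]

section_counts = Counter(a["section"] for a in articles)
text_counts = Counter(a["text"] for a in articles)

# texts that occur more than once explain a gap between "count" and "unique"
duplicates = [text for text, n in text_counts.items() if n > 1]

print("count:", len(articles))
print("unique text:", len(text_counts), "unique section:", len(section_counts))
print(section_counts.most_common())
print("duplicated:", duplicates)
```

A "unique" figure lower than "count" for the text column, as in the summary above, points at duplicate articles worth removing before training.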
Explore
error with NYT API
# note: the filter below names 6 sections, while describe() reported 10 unique section values
api_key = 'xxxxxxxxxxxxx'

url = ('http://api.nytimes.com/svc/search/v2/articlesearch.json'
       '?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")'
       '&sort=newest&api-key=' + api_key)

# make an API request
api = requests.get(url)
Vectorize
Tokenize article text and create feature vectors with NLTK
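import nltk
from nltk import tokenize
from nltk.corpus import stopwords

wnl = nltk.WordNetLemmatizer()
# build the stop word set once instead of re-reading the list for every word
stop_words = set(stopwords.words('english'))

def tokenize_and_normalize(chunks):
    # split into sentences, then into words
    words = [ tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize("".join(chunks)) ]
    flatten = [ inner for sublist in words for inner in sublist ]
    stripped = []

    for word in flatten:
        if word not in stop_words:
            try:
                # repair mis-decoded text and lowercase
                stripped.append(word.encode('latin-1').decode('utf8').lower())
            except (UnicodeEncodeError, UnicodeDecodeError):
                print("Cannot encode: " + word)

    # drop single-character tokens (mostly punctuation)
    no_punks = [ word for word in stripped if len(word) > 1 ]
    return [wnl.lemmatize(t) for t in no_punks]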
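The slides show tokenization but not the step that turns tokens into feature vectors. A minimal bag-of-words sketch of that step follows; `build_vocabulary` and `vectorize` are hypothetical helpers written for illustration. The later `vectorizer.transform` call in the deck suggests a library vectorizer was used instead, but the slides do not show which one.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map every distinct token to a column index (sorted for stability)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(tokens, vocabulary):
    """Turn one token list into a list of counts, one per vocabulary entry."""
    counts = Counter(tokens)
    ordered = sorted(vocabulary, key=vocabulary.get)  # tokens in column order
    return [counts.get(tok, 0) for tok in ordered]

# two tiny "documents", already tokenized and normalized
docs = [["giant", "win", "game"], ["gallery", "open", "game", "game"]]
vocabulary = build_vocabulary(docs)
X = [vectorize(doc, vocabulary) for doc in docs]

print(vocabulary)
print(X)
```

Each row of `X` has one column per vocabulary word, which is the shape a classifier such as Naive Bayes expects.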
Train
Train and score a model with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# hold out 30% of the data as a test set
xtrain, xtest, ytrain, ytest = train_test_split(X, labels, test_size=0.3)

# train a model
alpha = 1
multi_bayes = MultinomialNB(alpha=alpha)

multi_bayes.fit(xtrain, ytrain)
# mean accuracy on the held-out test set
multi_bayes.score(xtest, ytest)
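To make the role of `alpha` concrete, here is a from-scratch sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is an illustration of the technique only: `train_nb`, `predict_nb`, and the toy documents are hypothetical, and a library implementation will differ in details.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document and token counts; docs are lists of tokens."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "token_counts": token_counts,
            "vocab": vocab, "alpha": alpha, "n_docs": len(docs)}

def predict_nb(model, tokens):
    """Return the class with the highest log posterior."""
    best_label, best_score = None, float("-inf")
    vocab_size = len(model["vocab"])
    for label, n in model["class_docs"].items():
        counts = model["token_counts"][label]
        total = sum(counts.values())
        score = math.log(n / model["n_docs"])  # log prior
        for tok in tokens:
            if tok in model["vocab"]:  # ignore words never seen in training
                # smoothed likelihood: alpha keeps unseen (class, word) pairs above zero
                score += math.log((counts[tok] + model["alpha"])
                                  / (total + model["alpha"] * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["goal", "match", "team"], ["team", "win", "match"],
        ["painting", "gallery"], ["gallery", "museum", "exhibit"]]
labels = ["Sports", "Sports", "Arts", "Arts"]

model = train_nb(docs, labels, alpha=1.0)
print(predict_nb(model, ["team", "goal"]))
print(predict_nb(model, ["museum", "painting"]))
```

Without smoothing (`alpha=0`), a single word that never appeared in a class would give `log(0)` and fail, which is why `alpha` is set to a positive value.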
Train
Gotchas!
• Model only exists locally on Laptop	

• Not Automated for realtime prediction
Questions? tweet @zipfianacademy #pydata
Train
Exposé
Questions? tweet @zipfianacademy #pydata
Iteration 2:
• Expose your model	

• Automate your processes
Questions? tweet @zipfianacademy #pydata
Exposé
Getting that model	

off your lap(top)
Questions? tweet @zipfianacademy #pydata
Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/a_560x0.jpg
A model is just a function
Inputs...
Outputs...
Exposé
Serialize your model with pickle 	

(or cPickle or joblib)
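A minimal sketch of the pickle round trip, assuming a toy "model" made of plain Python containers; the `model` dictionary and `predict` function are hypothetical stand-ins for a trained classifier.

```python
import pickle

# toy "trained model": learned parameters held in plain Python containers
model = {"keywords": {"goal": "Sports", "gallery": "Arts"}, "default": "Unknown"}

def predict(model, tokens):
    """Return the section of the first known keyword, else the default."""
    for tok in tokens:
        if tok in model["keywords"]:
            return model["keywords"][tok]
    return model["default"]

# serialize to bytes; pickle.dump(model, f) would write the same bytes to a file
blob = pickle.dumps(model)

# later, possibly in another process: restore the model and predict
restored = pickle.loads(blob)
print(predict(restored, ["new", "gallery"]))
```

Only unpickle data you trust: loading a pickle can execute arbitrary code.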
Persistence
Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/g-6mevh13be74mgnc9i8qifa0
SerDes
• Disk
• Database
• Memory
Exposé
Deploy your Model to yHat
from yhat import Yhat, YhatModel, preprocess

class DocumentClassifier(YhatModel):
    @preprocess(in_type=dict, out_type=dict)
    def execute(self, data):
        # vectorize the incoming text and predict its section
        featureBody = vectorizer.transform([data['content']])
        result = multi_bayes.predict(featureBody)
        list_res = result.tolist()
        return {"section_name": list_res}

clf = DocumentClassifier()
yh = Yhat("jonathan@zipfianacademy.com", "xxxxxx", "http://cloud.yhathq.com/")
yh.deploy("documentClassifier", DocumentClassifier, globals())
Present
Create a Flask application to 	

expose your model on the web
from flask import Flask, request, jsonify
from yhat import Yhat

app = Flask(__name__)
yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/")

@app.route('/')
def index():
    return app.send_static_file('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    article = request.form['article']
    # the model name must match the name used in yh.deploy
    results = yh.predict("documentClassifier", { 'content': article })
    return jsonify({"results": results})
Present
Pipeline
Only Data should Flow
Questions? tweet @zipfianacademy #pydata
Data
Remember to Remember	

(Lineage)
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
Questions? tweet @zipfianacademy #pydata
Pipeline
Immutable, append-only set of raw data
Computation is a view on data
*Lambda Architecture by Nathan Marz
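A toy sketch of the idea, under the assumption that "view" means a result recomputed from the raw records rather than stored state that is edited in place. `RawLog` and `articles_per_section` are hypothetical names, and a real Lambda Architecture involves far more than this.

```python
class RawLog(object):
    """Append-only store: records can be added but never changed or removed."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(dict(record))  # copy so callers cannot mutate it later

    def __iter__(self):
        return (dict(r) for r in self._records)  # hand out copies only


def articles_per_section(log):
    """A view: recomputed from the raw records whenever it is needed."""
    view = {}
    for record in log:
        view[record["section"]] = view.get(record["section"], 0) + 1
    return view


log = RawLog()
log.append({"url": "a", "section": "Sports"})
log.append({"url": "b", "section": "Arts"})
before = articles_per_section(log)

log.append({"url": "c", "section": "Sports"})
after = articles_per_section(log)
print(before, after)
```

Because the raw records are never modified, a buggy view can be fixed and recomputed without losing data.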
Pipeline
Functional Data Science
• Modularity
• Define interfaces
• Separate data from computation
• Data Lineage
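These points can be sketched as a pipeline of small functions that share one interface (record in, new record out) and a runner that records which steps produced the result. The step names and the `lineage` field are illustrative assumptions, not part of the talk.

```python
def scrub(record):
    """Return a new record with normalized text; the input is left untouched."""
    return dict(record, text=record["text"].strip().lower())

def tokenize(record):
    """Return a new record with a list of tokens added."""
    return dict(record, tokens=record["text"].split())

def run_pipeline(record, steps):
    """Apply each step in order and log the name of every step that ran."""
    current = dict(record, lineage=[])
    for step in steps:
        result = step(current)
        result["lineage"] = current["lineage"] + [step.__name__]
        current = result
    return current

raw = {"text": "  Giants WIN again  "}
prepared = run_pipeline(raw, [scrub, tokenize])

print(prepared["tokens"])
print(prepared["lineage"])
print(raw)  # the raw record is unchanged
```

Since every step has the same interface, steps can be reordered, swapped, or tested on their own.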
Need Robust and Flexible Pipeline!
Questions? tweet @zipfianacademy #pydata
Pipeline
Whatever you do, DO NOT cross the streams
Questions? tweet @zipfianacademy #pydata
Pipeline
Where we are
[Diagram: NYT API → MongoDB → BeautifulSoup → Feature Matrix (NLTK, scikit-learn) → Model (scikit-learn) → Deploy → yHat; Web App (Heroku) → POST → Predict → Predicted Section]
Gotchas!
• Only have a static subset of articles	

• Pipeline not automated for re-training
Questions? tweet @zipfianacademy #pydata
Gotchas
Iteration 3:
Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.png
Iterate
Where we are
[Diagram: NYT API → MongoDB → cron (Amazon EC2) → Feature Matrix (NLTK, scikit-learn) → Model (scikit-learn) → Deploy → yHat; Web App (Heroku) → POST → Predict → Predicted Section]
How to Scale
1. Start small (data) and fast (development)
2. Increase size of data set
3. Optimize and productionize
4. PROFIT! $$$
(testing at each step)
How to Scale
1. Develop locally
2. Distribute computation (run on cluster)
3. Tune parameters
4. PROFIT! $$$
(testing at each step)
Can also use a streaming algorithm or single-machine, disk-based “medium data” technologies (e.g. a database or memory-mapped files)
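A minimal sketch of the streaming option: the input is consumed one line at a time, so memory use depends on the number of distinct tokens rather than on the size of the data. `stream_token_counts` is a hypothetical helper, and the in-memory buffer stands in for a large file.

```python
import io
from collections import Counter

def stream_token_counts(lines):
    """Consume an iterable of lines one at a time, keeping only running counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# io.StringIO stands in for a large file opened with open(path)
fake_file = io.StringIO(u"Giants win\nGiants lose\nArts fair opens\n")
counts = stream_token_counts(fake_file)
print(counts.most_common(2))
```

The same function works unchanged on an open file handle, because file objects also yield one line per iteration.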
Products
If you build it...
Questions? tweet @zipfianacademy #pydata
Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg
Products
Questions? tweet @zipfianacademy #pydata
Q & A
Q&A
Questions? tweet @zipfianacademy #pydata
Zipfian
Academy
@ZipfianAcademy
Data Science & Data Engineering	

12-week Bootcamp (May 12th & Sep 8th)
Weekend Workshops
http://zipfianacademy.com/apply
http://zipfianacademy.com/workshops
Next: Interactive Visualizations w/ d3.js (June 7th)
Questions? tweet @zipfianacademy #pydata
Thank You!
Jonathan Dinu

Co-Founder, Zipfian Academy

jonathan@zipfianacademy.com

@clearspandex
@ZipfianAcademy
http://zipfianacademy.com
Questions? tweet @zipfianacademy #pydata
Appendix
Questions? tweet @zipfianacademy #pydata
Data Sources
Obtain
(ranked by ease of use)
1. DaaS -- Data as a service	

2. Bulk Download	

3. APIs	

4. Web Scraping
Questions? tweet @zipfianacademy #pydata
DaaS
(Data as a Service)
• Time Series/Numeric: Quandl
• Financial Modeling: Quantopian
• Email Contextualization: Rapleaf
• Location and POI: Factual
Data Sources
Questions? tweet @zipfianacademy #pydata
Bulk Download
(just like the good ol’ days)
• File Transfer Protocol (FTP): CDC
• Amazon Web Services: Public Datasets
• Infochimps: Data Marketplace
• Academia: UCI Machine Learning Repository
Data Sources
Questions? tweet @zipfianacademy #pydata
APIs
(if it’s not RESTed, I’m not buying)
• Geographic: Foursquare
• Social: Facebook
• Audio: Rdio
• Content: Tumblr
• Realtime: Twitter
• Hidden: Yahoo Finance
Data Sources
Questions? tweet @zipfianacademy #pydata
Web Scraping
1. wget and curl 	

2. Web Spider/Crawler	

3. API scraping	

4. Manual Download
(DIY for life)
Data Sources
Questions? tweet @zipfianacademy #pydata
• Delimited Values
  • TSV
  • CSV
  • WSV
• JSON
• XML
• Ad Hoc Formats (avoid these if you can)
Data Formats
Questions? tweet @zipfianacademy #pydata
• JSON is made up of hash tables and arrays
• Hash tables: { "foo" : 1, "bar" : 2, "baz" : "3" }
• Arrays: [1, 2, 3]
• Arrays of arrays: [[1, 2, 3], ["foo", "bar", "baz"]]
• Array of hashes: [{"foo": 1, "bar": 2}, {"baz": 3}]
• Hashes of hashes: {"foo": {"bar": 2, "baz": 3}}
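In Python, the standard `json` module maps these structures onto dicts and lists. A short sketch with a made-up document:

```python
import json

text = '{"foo": 1, "bar": [1, 2, 3], "baz": {"qux": "3"}}'

# hash tables (JSON objects) become dicts, arrays become lists
data = json.loads(text)
print(type(data).__name__, type(data["bar"]).__name__)
print(data["bar"][1], data["baz"]["qux"])

# serialize back to text; sort_keys makes the output order predictable
round_trip = json.dumps(data, sort_keys=True)
print(round_trip)
```

Note that JSON itself requires double-quoted keys and strings; single quotes or bare keys are rejected by `json.loads`.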
Questions? tweet @zipfianacademy #pydata
Data Formats
{"widget": {
    "debug": "on",
    "window": {
        "title": "Sample Konfabulator Widget",
        "name": "main_window",
        "width": 500,
        "height": 500
    },
    "image": {
        "src": "Images/Sun.png",
        "name": "sun1",
        "hOffset": 250,
        "vOffset": 250,
        "alignment": "center"
    },
    "text": {
        "data": "Click Here",
        "size": 36,
        "style": "bold",
        "name": "text1",
        "hOffset": 250,
        "vOffset": 100,
        "alignment": "center",
        "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
    }
}}
Data Formats
• XML is a recursive self-describing container
<container>
  <item>Foo</item>
  <item>Bar</item>
  <item>
    <container>
      <item attr="SomethingAboutBaz">Baz</item>
    </container>
  </item>
</container>
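Because the container is recursive, a recursive function is a natural way to walk it. The sketch below uses the standard library's `xml.etree.ElementTree`; the sample document and the `leaf_items` helper are written for illustration only.

```python
import xml.etree.ElementTree as ET

xml_text = """
<container>
  <item>Foo</item>
  <item>Bar</item>
  <item>
    <container>
      <item attr="SomethingAboutBaz">Baz</item>
    </container>
  </item>
</container>
""".strip()

def leaf_items(element):
    """Recursively collect (text, attributes) for items that hold text, not containers."""
    found = []
    for child in element:
        if child.tag == "item" and len(child) == 0:
            found.append((child.text, dict(child.attrib)))
        else:
            found.extend(leaf_items(child))  # descend into nested containers
    return found

root = ET.fromstring(xml_text)
items = leaf_items(root)
print(items)
```

The tag names and attributes travel with the data, which is what "self-describing" refers to.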
Data Formats
<widget>
    <debug>on</debug>
    <window title="Sample Konfabulator Widget">
        <name>main_window</name>
        <width>500</width>
        <height>500</height>
    </window>
    <image src="Images/Sun.png" name="sun1">
        <hOffset>250</hOffset>
        <vOffset>250</vOffset>
        <alignment>center</alignment>
    </image>
    <text data="Click Here" size="36" style="bold">
        <name>text1</name>
        <hOffset>250</hOffset>
        <vOffset>100</vOffset>
        <alignment>center</alignment>
        <onMouseUp>
            sun1.opacity = (sun1.opacity / 100) * 90;
        </onMouseUp>
    </text>
</widget>
Data Formats
• Ad hoc data formats	

• Fixed-width (Census data)	

• Graph Edgelists
• Voting records
• etc.
Questions? tweet @zipfianacademy #pydata
Data Formats
• 7-5-5 format (fixed-width columns of 7, 5, and 5 characters)
Sam    foo  bar
Roger  baz  6
Jane   314  99
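A sketch of a parser for such a layout, assuming "7-5-5" means three fixed-width columns of 7, 5, and 5 characters; `parse_fixed_width` is a hypothetical helper, and the rows are rebuilt with padding so the column widths are exact.

```python
WIDTHS = (7, 5, 5)

def parse_fixed_width(line, widths=WIDTHS):
    """Slice a line into fields by column width and strip the padding."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields

# pad each value to its column width to build the sample rows
records = [("Sam", "foo", "bar"), ("Roger", "baz", "6"), ("Jane", "314", "99")]
rows = ["%-7s%-5s%-5s" % record for record in records]

parsed = [parse_fixed_width(row) for row in rows]
print(rows)
print(parsed)
```

Fixed-width files carry no delimiters, so the column widths have to come from documentation outside the file.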
Questions? tweet @zipfianacademy #pydata
Data Formats
• Directed Graph Format
1 2
1 3
1 4
2 3
4 4
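An edge list like this is easy to load into an adjacency mapping. The sketch assumes each line reads "source target"; `read_edges` is a hypothetical helper.

```python
from collections import defaultdict

edge_list = """1 2
1 3
1 4
2 3
4 4"""

def read_edges(text):
    """Build {source: [targets]} from one 'source target' pair per line."""
    graph = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split()
        graph[int(source)].append(int(target))
    return dict(graph)

graph = read_edges(edge_list)
print(graph)
```

Node 3 has no outgoing edges, so it appears only as a target; the last line, `4 4`, is a self-loop.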
Questions? tweet @zipfianacademy #pydata
Data Formats
Programming languages like Python, Ruby, and R have built-in parsers for data formats such as JSON and CSV. For other, more esoteric formats you will probably have to write your own.
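For Python, the built-in CSV parser is the standard `csv` module, which also handles other delimiters such as tabs. The sample data below is made up for illustration.

```python
import csv
import io

csv_text = u"name,section,words\nSam,Sports,120\nJane,Arts,95\n"

# DictReader uses the header row as keys; all values arrive as strings
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_words = sum(int(row["words"]) for row in rows)
print(rows[0]["name"], total_words)

# the same module reads tab-separated values with a different delimiter
tsv_text = u"a\tb\n1\t2\n"
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(tsv_rows)
```

With real files, pass the handle from `open(path, newline="")` instead of the in-memory buffer.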
Questions? tweet @zipfianacademy #pydata
Data Formats

Mais conteúdo relacionado

Semelhante a Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

Semelhante a Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014 (20)

H2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaH2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral Bajaria
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
Working With Facebook, Twitter, et al. - Social Media Camp
Working With Facebook, Twitter, et al. - Social Media CampWorking With Facebook, Twitter, et al. - Social Media Camp
Working With Facebook, Twitter, et al. - Social Media Camp
 
Offensive OSINT
Offensive OSINTOffensive OSINT
Offensive OSINT
 
Fringe IA (InfoCamp Seattle 2013)
Fringe IA (InfoCamp Seattle 2013)Fringe IA (InfoCamp Seattle 2013)
Fringe IA (InfoCamp Seattle 2013)
 
Organisational Identifiers at OpenCon satellite event, Oxford 2016
Organisational Identifiers at OpenCon satellite event, Oxford 2016Organisational Identifiers at OpenCon satellite event, Oxford 2016
Organisational Identifiers at OpenCon satellite event, Oxford 2016
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
Ultimate Free Digital Marketing Toolkit #EdgeBristol 2012
Ultimate Free Digital Marketing Toolkit #EdgeBristol 2012Ultimate Free Digital Marketing Toolkit #EdgeBristol 2012
Ultimate Free Digital Marketing Toolkit #EdgeBristol 2012
 
Be a Better Blogger: The Surprising Magic of Content Calendars
Be a Better Blogger: The Surprising Magic of Content CalendarsBe a Better Blogger: The Surprising Magic of Content Calendars
Be a Better Blogger: The Surprising Magic of Content Calendars
 
Data Science-Why?What?How? By Hari Prasad
Data Science-Why?What?How? By Hari PrasadData Science-Why?What?How? By Hari Prasad
Data Science-Why?What?How? By Hari Prasad
 
The Ultimate Free Digital Marketing Toolkit
The Ultimate Free Digital Marketing ToolkitThe Ultimate Free Digital Marketing Toolkit
The Ultimate Free Digital Marketing Toolkit
 
Skillteam workshop social media final v1.0 05.10.2011
Skillteam workshop social media final v1.0 05.10.2011Skillteam workshop social media final v1.0 05.10.2011
Skillteam workshop social media final v1.0 05.10.2011
 
ASPPA social media
ASPPA social mediaASPPA social media
ASPPA social media
 
Prototyping to the North Star
Prototyping to the North StarPrototyping to the North Star
Prototyping to the North Star
 
Microservices Manchester: Security, Microservces and Vault by Nicki Watt
Microservices Manchester:  Security, Microservces and Vault by Nicki WattMicroservices Manchester:  Security, Microservces and Vault by Nicki Watt
Microservices Manchester: Security, Microservces and Vault by Nicki Watt
 
Enterprise Open Source Intelligence Gathering
Enterprise Open Source Intelligence GatheringEnterprise Open Source Intelligence Gathering
Enterprise Open Source Intelligence Gathering
 
Data_Science_Generating_Value_From_Data_Course_Slides_red.pdf
Data_Science_Generating_Value_From_Data_Course_Slides_red.pdfData_Science_Generating_Value_From_Data_Course_Slides_red.pdf
Data_Science_Generating_Value_From_Data_Course_Slides_red.pdf
 
SEO 101: An Intro to Search Engine Optimization
SEO 101: An Intro to Search Engine OptimizationSEO 101: An Intro to Search Engine Optimization
SEO 101: An Intro to Search Engine Optimization
 
Twitter recruiting McGill Sept 2013
Twitter recruiting McGill Sept 2013Twitter recruiting McGill Sept 2013
Twitter recruiting McGill Sept 2013
 
PyData: Past, Present Future (PyData SV 2014 Keynote)
PyData: Past, Present Future (PyData SV 2014 Keynote)PyData: Past, Present Future (PyData SV 2014 Keynote)
PyData: Past, Present Future (PyData SV 2014 Keynote)
 

Mais de PyData

Mais de PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer

Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma

Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...

RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro

Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...

Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott

Words in Space - Rebecca Bilbro

End-to-End Machine learning pipelines for Python driven organizations - Nick ...

Pydata beautiful soup - Monica Puerto

1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...

Extending Pandas with Custom Types - Will Ayd

Measuring Model Fairness - Stephen Hoover

What's the Science in Data Science? - Skipper Seabold

Applying Statistical Modeling and Machine Learning to Perform Time-Series For...

Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward

The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...


Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014