Agile Analytics Applications
Russell Jurney
About me…Bearding.
• I’m going to beat this guy
• Seriously
• Bearding is my #1 natural talent
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals
Agile Data: The Book (August, 2013)
Read @ Safari Rough Cuts
A philosophy, not the only way. But still, it’s good! Really!
We go fast... but don’t worry!
• Download the slides - click the links - read examples!
• If it’s not on the blog, it’s in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
• Read the book on Safari Rough Cuts
Agile Application Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility
+ NoSQL
Data Warehousing
Scientific Computing / HPC
• ‘Smart kid’ only: MPI, Globus, etc. until Hadoop
Tubes and Mercury (old school) vs. Cores and Spindles (new school)
UNIVAC and Deep Blue both fill a warehouse. We’re back...
Data Science?
Roughly one-third each: Application Development, Data Warehousing, and Scientific Computing / HPC.
Data Center as Computer
• Warehouse Scale Computers and applications
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’
Hadoop to the Rescue!
• Easy to use! (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoah!
• An army of mappers and reducers at your command
• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
NOW WHAT?
Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative
Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Run like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most
How to get insight into product?
• Back-end has gotten THICKER
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• How do you collaborate on research vs developer timeline?
The Wrong Way - Part One
“We made a great design. Your job is to predict the future for it.”
The Wrong Way - Part Two
“What is taking you so long to reliably predict the future?”
The Wrong Way - Part Three
“The users don’t understand what 86% true means.”
The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
The Wrong Way - Inevitable Conclusion
(Image: a plane flying into a mountain.)
Reminds me of... the waterfall model :(
Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
-> Strategy
So make an app for exploring your data.
Which becomes a palette for what you ship.
Iterate and publish intermediate results.
Data Design
• It’s not the 1st query that yields insight, it’s the 15th, or the 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch…
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex, delicate
• Opportunity lies at intersection of data & design
How do we get back to Agile?
Statement of Principles
(then tricks, with code)
Set up an environment where...
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day Zero
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
Snowballing Audience
Value document > relation
Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
Value document > relation
Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction.
Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure
• Column compressed document formats beat JOINs!
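To make the trade concrete, here is a hedged sketch (illustrative values, not the book’s code) of the same email modeled both ways: normalized rows that need a JOIN at query time, and one denormalized document, ETL’d once on import.

# Illustrative only: the same email as relational rows vs. one document.
relational = {
    "messages":   [("msg1", "bob.dobbs@enron.com", "Re: frop futures")],
    "recipients": [("msg1", "to", "connie@enron.com"),   # JOIN on message_id
                   ("msg1", "cc", "pat@enron.com")],     # at every query
}

document = {  # ETL once on import; reads need no JOIN
    "message_id": "msg1",
    "from": {"address": "bob.dobbs@enron.com"},
    "subject": "Re: frop futures",
    "tos": [{"address": "connie@enron.com"}],
    "ccs": [{"address": "pat@enron.com"}],
}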
Value imperative > declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of data scientist’s time spent munging. See: ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
Value dataflow > SELECT
Ex. dataflow: ETL + email sent count
(I can’t read this either. Get a big version here.)
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, JavaScript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive…
See: HCatalog for Pig/Hive integration.
Localhost vs Petabyte scale: same tools
• Simplicity is essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.
Data-Value Pyramid
Climb it. Do not skip steps. See here.
0/1) Display atomic records on the web
0.0) Document-serialize events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard.
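A minimal sketch of what “schema onboard” means, using the Python avro package (the function is avro.schema.parse in the Python 2 era library; the file path and record fields here are made up):

import json
import avro.schema
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "Email",
    "fields": [{"name": "message_id", "type": "string"},
               {"name": "body", "type": ["null", "string"]}]}))

# The schema travels inside the file...
writer = DataFileWriter(open('/tmp/emails.avro', 'wb'), DatumWriter(), schema)
writer.append({"message_id": "<1731@thyme>", "body": "scamming people, blah blah"})
writer.close()

# ...so readers need no external schema definition.
for record in DataFileReader(open('/tmp/emails.avro', 'rb'), DatumReader()):
    print(record)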
0.1) Documents via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
message_id:chararray,
sql_date:chararray,
from_address:chararray,
from_name:chararray,
subject:chararray,
body:chararray
);
 
enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);
 
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
 
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
enron_messages::subject as subject,
enron_messages::body as body,
headers::tos.(address, name) as tos,
headers::ccs.(address, name) as ccs,
headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage();
Example here.
0.2) Serialize events from streams
import imaplib  # the excerpt below needs it

class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()  # shut down any existing connection
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)
Scrape your own gmail in Python and Ruby.
0.3) ETL Logs
log_data = LOAD 'access_log'
  USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
  AS (remoteAddr,
      remoteLogname,
      user,
      time,
      method,
      uri,
      proto,
      bytes);
1) Plumb atomic events -> browser
(Example stack that enables high productivity)
1.1) cat our Avro serialized events
me$ cat_avro ~/Data/enron.avro
{
u'bccs': [],
u'body': u'scamming people, blah blah',
u'ccs': [],
u'date': u'2000-08-28T01:50:00.000Z',
u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
u'subject': u'Re: Enron trade for frop futures',
u'tos': [
{u'address': u'connie@enron.com', u'name': None}
]
}
Get cat_avro in python, ruby
1.2) Load our events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
enron_emails: {
message_id: chararray,
datetime: chararray,
from:tuple(address:chararray,name:chararray),
subject: chararray,
body: chararray,
tos: {to: (address: chararray,name: chararray)},
ccs: {cc: (address: chararray,name: chararray)},
bccs: {bcc: (address: chararray,name: chararray)}
}
 
1.3) ILLUSTRATE our events in Pig
grunt> illustrate enron_emails
 
---------------------------------------------------------------------------
| emails |
| message_id:chararray |
| datetime:chararray |
| from:tuple(address:chararray,name:chararray) |
| subject:chararray |
| body:chararray |
| tos:bag{to:tuple(address:chararray,name:chararray)} |
| ccs:bag{cc:tuple(address:chararray,name:chararray)} |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
| |
| <1731.10095812390082.JavaMail.evans@thyme> |
| 2001-01-09T06:38:00.000Z |
| (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| Re: Enron trade for frop futures |
| scamming people, blah blah |
| {(connie@enron.com,)} |
| {} |
| {} |
Upgrade to Pig 0.10+
1.4) Publish our events to a ‘database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
-param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
1.5) Check events in our ‘database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}
1.6) Publish events on the web
require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'
connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']
get '/email/:message_id' do |message_id|
data = collection.find_one({:message_id => message_id})
JSON.generate(data)
end
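Once Sinatra is up (it listens on port 4567 by default), sanity-check a record from the shell, with the message_id URL-encoded:
curl 'http://localhost:4567/email/<message_id>'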
One-Liner to Transition Stack
What’s the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
1.7) Wrap events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
Complete example here with code here.
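For the Python-inclined, a hedged sketch of the server side that could feed this Jinja2 template (the linked example is the authoritative version; names like table.html are illustrative):

from flask import Flask, render_template
from pymongo import MongoClient

app = Flask(__name__)
collection = MongoClient()['agile_data']['emails']

@app.route('/email/<path:message_id>')
def email(message_id):
    doc = collection.find_one({'message_id': message_id}) or {}
    # The template above expects data['keys'] and data['values']
    data = {'keys': list(doc.keys()), 'values': list(doc.values())}
    return render_template('table.html', data=data)

if __name__ == '__main__':
    app.run(debug=True)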
Refine. Add links between documents.
Not the Mona Lisa, but coming along... See: here
1.8) List links to sorted events
Use your ‘database’, if it can sort:
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
1.8) List links to sorted documents
1.9) Make it searchable...
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
2) Create Simple Charts
2) Create Simple Tables and Charts
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
2.1) Top N (of anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
sorted = order things by arbitrary_rank desc;
top_10_things = limit sorted 10;
generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as json.
This would make a good Pig Macro.
2.2) Time Series (of anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
generate flatten(group) as (key, month),
COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
timeseries = order things_by_month by month;
generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
Data processing in our stack
A new feature in our application might begin at any layer... great!
Any team member can add new features, no problemo!
(Cartoon captions: “I’m creative! I know Pig!” “I’m creative too! I <3 Javascript!” “omghi2u! where r my legs? send halp”)
Data processing in our stack
... but we shift the data-processing towards batch, as we are able.
Ex: Overall total emails calculated in each layer
See real example here.
3) Exploring with Reports
3.0) From charts to reports...
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
3.1) Looks like this...
3.2) Cultivate common keyspaces
3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
4) Predictions and Recommendations
4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
4.2) Think in different perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book
4.3) Networks
4.3.1) Weighted Email Networks in Pig
DEFINE header_pairs(email, col1, col2) RETURNS pairs {
filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
$pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
}
/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;
/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;
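To eyeball the weighted edges before (or alongside) Gephi, a hedged sketch that is not from the book: load (ego1, ego2, total) rows into networkx and write GEXF, which Gephi opens directly.

import networkx as nx

# Illustrative rows; in practice, read the sent_counts output from HDFS.
edges = [("bob.dobbs@enron.com", "connie@enron.com", 42),
         ("connie@enron.com", "bob.dobbs@enron.com", 7)]

g = nx.DiGraph()
for ego1, ego2, total in edges:
    g.add_edge(ego1, ego2, weight=total)  # email count becomes edge weight

nx.write_gexf(g, "/tmp/email_network.gexf")  # open this file in Gephi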
4.3.2) Networks Viz with Gephi
4.3.3) Gephi = Easy
4.3.4) Social Network Analysis
4.4) Time Series & Distributions
pig -l /tmp -x local -v -w
/* Count things per day */
things_per_day = foreach (group things by (key, ISOToDay(datetime)))
generate flatten(group) as (key, day),
COUNT_STAR(things) as total;
/* Sort our totals per key by day to get a sorted time series */
things_timeseries = foreach (group things_per_day by key) {
timeseries = order things_per_day by day;
generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
4.4.1) Smooth Sparse Data
See here.
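The linked example has the real thing; as a hedged sketch of the idea, fill the missing days with zeros, then take a centered moving average (field names and window size are illustrative):

from datetime import date, timedelta

def smooth(counts, start, end, window=7):
    """counts: {date: total}; returns [(date, smoothed)] for every day in range."""
    days = [start + timedelta(n) for n in range((end - start).days + 1)]
    dense = [float(counts.get(d, 0)) for d in days]  # sparse -> dense, zero-filled
    half = window // 2
    out = []
    for i in range(len(dense)):
        lo, hi = max(0, i - half), min(len(dense), i + half + 1)
        out.append((days[i], sum(dense[lo:hi]) / (hi - lo)))  # centered average
    return out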
4.4.2) Regress to find Trends
JRuby Linear Regression UDF
Pig to use the UDF
Trend Line in your Application
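The UDF is linked above; the underlying math is plain ordinary least squares over (day_index, total) points. A hedged sketch with made-up numbers:

def linear_regression(points):
    # Ordinary least squares: slope and intercept minimizing squared error.
    n = float(len(points))
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

slope, intercept = linear_regression([(0, 2.0), (1, 3.1), (2, 3.9), (3, 5.2)])
print(slope, intercept)  # ~1.04, ~1.99: emails per day are trending up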
4.5.1) Natural Language Processing
Example with code here and macro here.
import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');
/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
  sorted = order my_tf_idf_scores by value desc;
  top_10_topics = limit sorted 10;
  generate group, top_10_topics.(score, value);
};
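If the macro feels opaque, here is the idea in a few lines of plain Python (toy documents, not the book’s implementation): score = term frequency times the log of inverse document frequency.

import math
from collections import Counter

docs = {"msg1": "enron trade frop futures".split(),
        "msg2": "scamming people blah blah".split()}

# Document frequency: how many docs each term appears in.
df = Counter(term for words in docs.values() for term in set(words))
N = len(docs)

def tf_idf(words):
    tf = Counter(words)
    return {t: (tf[t] / float(len(words))) * math.log(N / float(df[t])) for t in tf}

for doc_id, words in docs.items():
    top = sorted(tf_idf(words).items(), key=lambda kv: -kv[1])[:3]
    print(doc_id, top)  # the highest-scoring terms are the 'topics'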
4.5.2) NLP: Extract Topics!
4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
• http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
• http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
4.6) Probability & Bayesian Inference
4.6.1) Gmail Suggested Recipients
4.6.1) Reproducing it with Pig...
4.6.2) Step 1: COUNT(From -> To)
4.6.2) Step 2: COUNT(From, To, Cc)/Total
P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
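A toy sketch of that probability (counts below are made up; at scale, the counting is the two Pig jobs from Steps 1 and 2):

from collections import defaultdict

to_counts = defaultdict(int)   # Step 1: COUNT(from -> to)
cc_counts = defaultdict(int)   # Step 2: COUNT(from -> to, cc)

emails = [
    {"from": "bob.dobbs@enron.com", "tos": ["connie@enron.com"], "ccs": ["pat@enron.com"]},
    {"from": "bob.dobbs@enron.com", "tos": ["connie@enron.com"], "ccs": []},
]
for e in emails:
    for to in e["tos"]:
        to_counts[(e["from"], to)] += 1
        for cc in e["ccs"]:
            cc_counts[(e["from"], to, cc)] += 1

# P(cc | to) = COUNT(from, to, cc) / COUNT(from, to)
key = ("bob.dobbs@enron.com", "connie@enron.com", "pat@enron.com")
p = cc_counts[key] / float(to_counts[key[:2]])
print(p)  # 0.5: suggest pat half the time you email connie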
4.6.3) Wait - Stop Here! It works!
They match…
4.4) Add predictions to reports
5) Enable new actions
Why doesn’t Kate reply to my emails?
• What time is best to catch her?
• Are they too long?
• Are they meant to be replied to (contain original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom)?
Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) as country,
    priority;

countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');
Code here.
Example: Packetpig and PacketLoop
Thank You!
Questions & Answers
Follow: @rjurney
Read the Blog: datasyndrome.com

Mais conteúdo relacionado

Mais procurados

Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Social Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainSocial Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainRussell Jurney
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MDDonald Miner
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Miguel González-Fierro
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest Donald Miner
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 

Mais procurados (20)

Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Social Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainSocial Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem Domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 

Destaque

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Your moment is Waiting
Your moment is WaitingYour moment is Waiting
Your moment is Waitingrittujacob
 
Enabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopEnabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopJason Plurad
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLSimon Harris
 
Bitraf - Particle Photon IoT workshop
Bitraf - Particle Photon IoT workshopBitraf - Particle Photon IoT workshop
Bitraf - Particle Photon IoT workshopJens Brynildsen
 
Mapa mental de un lider tahi
Mapa mental de un lider  tahiMapa mental de un lider  tahi
Mapa mental de un lider tahiTahi04
 
ConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureEricsson
 
Teraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationTeraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationGord Sissons
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...Andreas Önnerfors
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboralalexander_hv
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBGord Sissons
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศPetch Boonyakorn
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook Total
 

Destaque (19)

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Your moment is Waiting
Your moment is WaitingYour moment is Waiting
Your moment is Waiting
 
JSON-LD Update
JSON-LD UpdateJSON-LD Update
JSON-LD Update
 
Enabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopEnabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPop
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQL
 
tarea 7 gabriel
tarea 7 gabrieltarea 7 gabriel
tarea 7 gabriel
 
Bitraf - Particle Photon IoT workshop
Bitraf - Particle Photon IoT workshopBitraf - Particle Photon IoT workshop
Bitraf - Particle Photon IoT workshop
 
Mapa mental de un lider tahi
Mapa mental de un lider  tahiMapa mental de un lider  tahi
Mapa mental de un lider tahi
 
ConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving Future
 
Zipcar
ZipcarZipcar
Zipcar
 
Teraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationTeraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview Presentation
 
Feb 13 17 word of the day (1)
Feb 13 17 word of the day (1)Feb 13 17 word of the day (1)
Feb 13 17 word of the day (1)
 
Mapa mental
Mapa mentalMapa mental
Mapa mental
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboral
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TB
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศ
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook
 

Semelhante a Agile analytics applications on hadoop

Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
Nemesis - SAINTCON.pdf
Nemesis - SAINTCON.pdfNemesis - SAINTCON.pdf
Nemesis - SAINTCON.pdfWill Schroeder
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureGwen (Chen) Shapira
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
AppEngine Performance Tuning
AppEngine Performance TuningAppEngine Performance Tuning
AppEngine Performance TuningDavid Chen
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...OpenSource Connections
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify Dataconomy Media
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18DataconomyGmbH
 
AzureML – zero to hero
AzureML – zero to heroAzureML – zero to hero
AzureML – zero to heroGovind Kanshi
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic WebRoberto García
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made EasyDataWorks Summit
 

Semelhante a Agile analytics applications on hadoop (20)

Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
Nemesis - SAINTCON.pdf
Nemesis - SAINTCON.pdfNemesis - SAINTCON.pdf
Nemesis - SAINTCON.pdf
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Debugging machine-learning
Debugging machine-learningDebugging machine-learning
Debugging machine-learning
 
AppEngine Performance Tuning
AppEngine Performance TuningAppEngine Performance Tuning
AppEngine Performance Tuning
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
 
Breaking data
Breaking dataBreaking data
Breaking data
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18
 
AzureML – zero to hero
AzureML – zero to heroAzureML – zero to hero
AzureML – zero to hero
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
 

Agile analytics applications on hadoop

  • 1. Agile Analytics Applications Russell Jurney 1 Wednesday, May 8, 13
  • 2. About me…Bearding. • I’m going to beat this guy • Seriously • Bearding is my #1 natural talent • Salty Sea Beard • Fortified with Pacific Ocean Minerals 2 Wednesday, May 8, 13
  • 3. Agile Data: The Book (August, 2013) 3 Read @ Safari Rough Cuts A philosophy, not the only way But still, its good! Really! Wednesday, May 8, 13
  • 4. We go fast... but don’t worry! • Download the slides - click the links - read examples! • If its not on the blog, its in the book! • Order now: http://shop.oreilly.com/product/0636920025054.do • Read the book on Safari Rough Cuts 4 Wednesday, May 8, 13
  • 5. Agile Application Development: Check • LAMP stack mature • Post-Rails frameworks to choose from • Enable rapid feedback and agility 5 + NoSQL Wednesday, May 8, 13
  • 7. Scientific Computing / HPC • ‘Smart kid’ only: MPI, Globus, etc. until Hadoop 7 Tubes and Mercury (old school) Cores and Spindles (new school) UNIVAC and Deep Blue both fill a warehouse. We’re back... Wednesday, May 8, 13
  • 9. Data Center as Computer • Warehouse Scale Computers and applications 9 “A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’ Wednesday, May 8, 13
  • 10. Hadoop to the Rescue! • Easy to use! (Pig, Hive, Cascading) • CHEAP: 1% the cost of SAN/NAS • A department can afford its own Hadoop cluster! • Dump all your data in one place: Hadoop DFS • Silos come CRASHING DOWN! • JOIN like crazy! • ETL like whoah! • An army of mappers and reducers at your command • OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME! 10 Wednesday, May 8, 13
  • 12. Analytics Apps: It takes a Team 12 • Broad skill-set • Nobody has them all • Inherently collaborative Wednesday, May 8, 13
  • 13. Data Science Team • 3-4 team members with broad, diverse skill-sets that overlap • Transactional overhead dominates at 5+ people • Expert researchers: lend 25-50% of their time to teams • Creative workers. Run like a studio, not an assembly line • Total freedom... with goals and deliverables. • Work environment matters most 13 Wednesday, May 8, 13
  • 14. How to get insight into product? • Back-end has gotten THICKER • Generating $$$ insight can take 10-100x app dev • Timeline disjoint: analytics vs agile app-dev/design • How do you ship insights efficiently? • How do you collaborate on research vs developer timeline? 14 Wednesday, May 8, 13
  • 15. The Wrong Way - Part One 15 “We made a great design. Your job is to predict the future for it.” Wednesday, May 8, 13
  • 16. The Wrong Way - Part Two 16 “What is taking you so long to reliably predict the future?” Wednesday, May 8, 13
  • 17. The Wrong Way - Part Three 17 “The users don’t understand what 86% true means.” Wednesday, May 8, 13
  • 18. The Wrong Way - Part Four 18 GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!! Wednesday, May 8, 13
  • 19. The Wrong Way - Inevitable Conclusion 19 Plane Mountain Wednesday, May 8, 13
  • 20. Reminds me of... the waterfall model 20:(Wednesday, May 8, 13
  • 21. Chief Problem 21 You can’t design insight in analytics applications. You discover it. You discover by exploring. Wednesday, May 8, 13
  • 22. -> Strategy 22 So make an app for exploring your data. Which becomes a palette for what you ship. Iterate and publish intermediate results. Wednesday, May 8, 13
  • 23. Data Design • Not the 1st query that = insight, it’s the 15th, or the 150th • Capturing “Ah ha!” moments • Slow to do those in batch… • Faster, better context in an interactive web application. • Pre-designed charts wind up terrible. So bad. • Easy to invest man-years in the wrong statistical models • Semantics of presenting predictions are complex, delicate • Opportunity lies at intersection of data & design 23 Wednesday, May 8, 13
  • 24. How do we get back to Agile? 24 Wednesday, May 8, 13
  • 25. Statement of Principles 25 (then tricks, with code) Wednesday, May 8, 13
  • 26. Setup an environment where... • Insights repeatedly produced • Iterative work shared with entire team • Interactive from day Zero • Data model is consistent end-to-end • Minimal impedance between layers • Scope and depth of insights grow • Insights form the palette for what you ship • Until the application pays for itself and more 26 Wednesday, May 8, 13
  • 28. Value document > relation 28 Most data is dirty. Most data is semi-structured or unstructured. Rejoice! Wednesday, May 8, 13
  • 29. Value document > relation 29 Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction. Wednesday, May 8, 13
  • 30. Relational Data = Legacy Format • Why JOIN? Storage is fundamentally cheap! • Duplicate that JOIN data in one big record type! • ETL once to document format on import, NOT every job • Not zero JOINs, but far fewer JOINs • Semi-structured documents preserve data’s actual structure • Column compressed document formats beat JOINs! 30 Wednesday, May 8, 13
  • 31. Value imperative > declarative • We don’t know what we want to SELECT. • Data is dirty - check each step, clean iteratively. • 85% of data scientist’s time spent munging. See: ETL. • Imperative is optimized for our process. • Process = iterative, snowballing insight • Efficiency matters, self optimize 31 Wednesday, May 8, 13
  • 32. Value dataflow > SELECT 32 Wednesday, May 8, 13
  • 33. Ex. dataflow: ETL + email sent count 33(I can’t read this either. Get a big version here.) Wednesday, May 8, 13
  • 34. Value Pig > Hive (for app-dev) • Pigs eat ANYTHING • Pig is optimized for refining data, as opposed to consuming it • Pig is imperative, iterative • Pig is dataflows, and SQLish (but not SQL) • Code modularization/re-use: Pig Macros • ILLUSTRATE speeds dev time (even UDFs) • Easy UDFs in Java, JRuby, Jython, Javascript • Pig Streaming = use any tool, period. • Easily prepare our data as it will appear in our app. • If you prefer Hive, use Hive. 34 But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive… See: HCatalog for Pig/Hive integration. Wednesday, May 8, 13
  • 35. Localhost vs Petabyte scale: same tools • Simplicity essential to scalability: highest level tools we can • Prepare a good sample - tricky with joins, easy with documents • Local mode: pig -l /tmp -x local -v -w • Frequent use of ILLUSTRATE • 1st: Iterate, debug & publish locally • 2nd: Run on cluster, publish to team/customer • Consider skipping Object-Relational-Mapping (ORM) • We do not trust ‘databases,’ only HDFS @ n=3. • Everything we serve in our app is re-creatable via Hadoop. 35 Wednesday, May 8, 13
  • 36. Data-Value Pyramid 36 Climb it. Do not skip steps. See here. Wednesday, May 8, 13
  • 37. 0/1) Display atomic records on the web 37 Wednesday, May 8, 13
  • 38. 0.0) Document-serialize events • Protobuf • Thrift • JSON • Avro - I use Avro because the schema is onboard. 38 Wednesday, May 8, 13
  • 39. 0.1) Documents via Relation ETL 39 enron_messages = load '/enron/enron_messages.tsv' as ( message_id:chararray, sql_date:chararray, from_address:chararray, from_name:chararray, subject:chararray, body:chararray );   enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);   split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';   headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10; with_headers = join headers by group, enron_messages by message_id parallel 10; emails = foreach with_headers generate enron_messages::message_id as message_id, CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date, TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray), enron_messages::subject as subject, enron_messages::body as body, headers::tos.(address, name) as tos, headers::ccs.(address, name) as ccs, headers::bccs.(address, name) as bccs; store emails into '/enron/emails.avro' using AvroStorage( Example here. Wednesday, May 8, 13
  • 40. 0.2) Serialize events from streams 40 class  GmailSlurper(object):    ...    def  init_imap(self,  username,  password):        self.username  =  username        self.password  =  password        try:            imap.shutdown()        except:            pass        self.imap  =  imaplib.IMAP4_SSL('imap.gmail.com',  993)        self.imap.login(username,  password)        self.imap.is_readonly  =  True    ...    def  write(self,  record):        self.avro_writer.append(record)    ...    def  slurp(self):        if(self.imap  and  self.imap_folder):            for  email_id  in  self.id_list:                (status,  email_hash,  charset)  =  self.fetch_email(email_id)                if(status  ==  'OK'  and  charset  and  'thread_id'  in  email_hash  and  'froms'  in  email_hash):                    print  email_id,  charset,  email_hash['thread_id']                    self.write(email_hash) Scrape your own gmail in Python and Ruby. Wednesday, May 8, 13
  • 41. 0.3) ETL Logs 41 log_data  =  LOAD  'access_log'        USING  org.apache.pig.piggybank.storage.apachelog.CommongLogLoader        AS  (remoteAddr,                remoteLogname,                user,                time,                method,                uri,                proto,                bytes); Wednesday, May 8, 13
  • 42. 1) Plumb atomic events -> browser 42 (Example stack that enables high productivity) Wednesday, May 8, 13
  • 43. 1.1) cat our Avro serialized events 43 me$ cat_avro ~/Data/enron.avro { u'bccs': [], u'body': u'scamming people, blah blah', u'ccs': [], u'date': u'2000-08-28T01:50:00.000Z', u'from': {u'address': u'bob.dobbs@enron.com', u'name': None}, u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>', u'subject': u'Re: Enron trade for frop futures', u'tos': [ {u'address': u'connie@enron.com', u'name': None} ] } Get cat_avro in python, ruby Wednesday, May 8, 13
  • 44. 1.2) Load our events in Pig 44 me$ pig -l /tmp -x local -v -w grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage(); grunt> describe enron_emails emails: { message_id: chararray, datetime: chararray, from:tuple(address:chararray,name:chararray) subject: chararray, body: chararray, tos: {to: (address: chararray,name: chararray)}, ccs: {cc: (address: chararray,name: chararray)}, bccs: {bcc: (address: chararray,name: chararray)} }   Wednesday, May 8, 13
  • 45. 1.3) ILLUSTRATE our events in Pig 45 grunt> illustrate enron_emails   --------------------------------------------------------------------------- | emails | | message_id:chararray | | datetime:chararray | | from:tuple(address:chararray,name:chararray) | | subject:chararray | | body:chararray | | tos:bag{to:tuple(address:chararray,name:chararray)} | | ccs:bag{cc:tuple(address:chararray,name:chararray)} | | bccs:bag{bcc:tuple(address:chararray,name:chararray)} | --------------------------------------------------------------------------- | | | <1731.10095812390082.JavaMail.evans@thyme> | | 2001-01-09T06:38:00.000Z | | (bob.dobbs@enron.com, J.R. Bob Dobbs) | | Re: Enron trade for frop futures | | scamming people, blah blah | | {(connie@enron.com,)} | | {} | | {} | Upgrade to Pig 0.10+ Wednesday, May 8, 13
  • 46. 1.4) Publish our events to a ‘database’ 46 pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig /* MongoDB libraries and configuration */ register /me/mongo-hadoop/mongo-2.7.3.jar register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar /* Set speculative execution off to avoid chance of duplicate records in Mongo */ set mapred.map.tasks.speculative.execution false set mapred.reduce.tasks.speculative.execution false define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */ /* By default, lets have 5 reducers */ set default_parallel 5 avros = load '$avros' using AvroStorage(); store avros into '$mongourl' using MongoStorage(); Full instructions here. Which does this: From Avro to MongoDB in one command: Wednesday, May 8, 13
• 47. 1.5) Check events in our ‘database’ 47

$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : {
    "address" : "bob.dobbs@enron.com",
    "name" : "J.R. Bob Dobbs"
  },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}

Wednesday, May 8, 13
• 48. 1.6) Publish events on the web 48

require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end

Wednesday, May 8, 13
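Not part of the deck, but a quick sanity check of the service from Python. Sinatra’s default port (4567) and the sample message_id are assumptions:

# A minimal sketch (not the deck's code): fetch one email from the
# Sinatra service and pretty-print it. Assumes Sinatra's default port
# 4567 and a message_id that actually exists in your collection.
import json
import urllib
import urllib2

message_id = '<1731.10095812390082.JavaMail.evans@thyme>'  # hypothetical id
url = 'http://localhost:4567/email/%s' % urllib.quote(message_id)

response = urllib2.urlopen(url)
email = json.loads(response.read())
print(json.dumps(email, indent=2))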
  • 49. 1.6) Publish events on the web 49 Wednesday, May 8, 13
  • 50. One-Liner to Transition Stack 50 Wednesday, May 8, 13
• 51. What’s the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• The entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
51
Wednesday, May 8, 13
• 52. 1.7) Wrap events with Bootstrap 52

<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
  <div class="container" style="margin-top: 100px;">
    <table class="table table-striped table-bordered table-condensed">
      <thead>
        {% for key in data['keys'] %}
          <th>{{ key }}</th>
        {% endfor %}
      </thead>
      <tbody>
        <tr>
          {% for value in data['values'] %}
            <td>{{ value }}</td>
          {% endfor %}
        </tr>
      </tbody>
    </table>
  </div>
</body>

Complete example here with code here.

Wednesday, May 8, 13
  • 53. 1.7) Wrap events with Bootstrap 53 Wednesday, May 8, 13
• 54. Refine. Add links between documents. 54
Not the Mona Lisa, but coming along... See: here
Wednesday, May 8, 13
• 55. 1.8) List links to sorted events 55

Use your ‘database’, if it can sort:

mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...

Or use Pig, and serve/cache a bag/array of email documents per user:

pig -l /tmp -x local -v -w

emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};

store emails_per_user into '$mongourl' using MongoStorage();

Wednesday, May 8, 13
  • 56. 1.8) List links to sorted documents 56 Wednesday, May 8, 13
• 57. 1.9) Make it searchable... 57

If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch(
  '/me/elasticsearch-0.18.6/config/elasticsearch.yml',
  '/me/elasticsearch-0.18.6/plugins'
);

Test it with curl:

curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

ElasticSearch has no security features. Take note. Isolate.

Wednesday, May 8, 13
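The same query from Python, if you want it in app code — a sketch that assumes the email/email index built above and the 0.18-era response shape:

# A minimal sketch (not in the deck): run the same query from Python,
# assuming the 'email/email' index and type created by the Pig job above.
# The response shape (hits.total as a number) is per 0.18-era ElasticSearch.
import json
import urllib2

url = 'http://localhost:9200/email/email/_search?q=hadoop&size=1'
results = json.loads(urllib2.urlopen(url).read())

# Total matches, then the first hit's source document
print(results['hits']['total'])
print(json.dumps(results['hits']['hits'][0]['_source'], indent=2))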
  • 58. 2) Create Simple Charts 58 Wednesday, May 8, 13
  • 59. 2) Create Simple Tables and Charts 59 Wednesday, May 8, 13
• 60. 2) Create Simple Charts
• Start with an HTML table, on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
60
Wednesday, May 8, 13
• 61. 2.1) Top N (of anything) in Pig 61

pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};

store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json. This would make a good Pig Macro.

Wednesday, May 8, 13
• 62. 2.2) Time Series (of anything) in Pig 62

pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month),
           COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro.

Wednesday, May 8, 13
• 63. Data processing in our stack 63
A new feature in our application might begin at any layer... great! Any team member can add new features, no problemo!
“I’m creative! I know Pig!” “I’m creative too! I <3 Javascript!” “omghi2u! where r my legs? send halp”
Wednesday, May 8, 13
• 64. Data processing in our stack 64
... but we shift the data-processing towards batch, as we are able.
Ex: overall total emails, calculated in each layer. See real example here.
Wednesday, May 8, 13
  • 65. 3) Exploring with Reports 65 Wednesday, May 8, 13
  • 66. 3) Exploring with Reports 66 Wednesday, May 8, 13
• 67. 3.0) From charts to reports...
• Extract entities from the properties we aggregated by in charts (Step 2)
• Each entity type gets its own kind of web page
• Each unique entity gets its own web page (see the sketch after this list)
• Link to entities as they appear in atomic event documents (Step 1)
• Link the most-related entities together, both within and between types.
• More visualizations!
• Parameterize results via forms.
67
Wednesday, May 8, 13
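To make ‘every entity gets a page’ concrete, here’s a minimal sketch in Python/Flask — not the deck’s Sinatra code; Flask, the ‘agile_data’ database, the ‘emails_per_user’ collection and the template name are all assumptions:

# A minimal sketch (not the deck's code): one web page per email address,
# rendered from MongoDB. Flask, the 'agile_data' database, the
# 'emails_per_user' collection and 'address.html' are assumptions.
from flask import Flask, render_template
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient()['agile_data']

@app.route('/address/<address>')
def address_page(address):
    # One record per sender, with their recent emails, as built in step 1.8
    record = db['emails_per_user'].find_one({'from_address': address})
    return render_template('address.html', record=record)

if __name__ == '__main__':
    app.run(debug=True)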
  • 68. 3.1) Looks like this... 68 Wednesday, May 8, 13
  • 69. 3.2) Cultivate common keyspaces 69 Wednesday, May 8, 13
• 70. 3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning, and clean it.
• Start thinking about predictions & recommendations.
70
‘People’ could be just your team, if the data is sensitive.
Wednesday, May 8, 13
  • 71. 4) Predictions and Recommendations 71 Wednesday, May 8, 13
• 72. 4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
72
Wednesday, May 8, 13
• 73. 4.2) Think from different perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book
73
Wednesday, May 8, 13
• 75. 4.3.1) Weighted Email Networks in Pig 75

DEFINE header_pairs(email, col1, col2) RETURNS pairs {
  filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
  flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
  $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
}

/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;

/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;

Wednesday, May 8, 13
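To hand sent_counts to Gephi (next slide), a small networkx script will write GEXF. A sketch, assuming the edges were stored as tab-separated ego1, ego2, total:

# A minimal sketch (not from the deck): load the weighted edges produced
# above into networkx and write GEXF, which Gephi opens directly.
# Assumes sent_counts was stored as tab-separated: ego1, ego2, total.
import networkx as nx

graph = nx.DiGraph()
for line in open('sent_counts.tsv'):  # hypothetical export of sent_counts
    ego1, ego2, total = line.rstrip('\n').split('\t')
    graph.add_edge(ego1, ego2, weight=int(total))

nx.write_gexf(graph, 'email_network.gexf')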
  • 76. 4.3.2) Networks Viz with Gephi 76 Wednesday, May 8, 13
  • 77. 4.3.3) Gephi = Easy 77 Wednesday, May 8, 13
  • 78. 4.3.4) Social Network Analysis 78 Wednesday, May 8, 13
• 79. 4.4) Time Series & Distributions 79

pig -l /tmp -x local -v -w

/* Count things per day */
things_per_day = foreach (group things by (key, ISOToDay(datetime)))
  generate flatten(group) as (key, day),
           COUNT_STAR(things) as total;

/* Sort our totals per key by day to get a sorted time series */
things_timeseries = foreach (group things_per_day by key) {
  timeseries = order things_per_day by day;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

Wednesday, May 8, 13
• 80. 4.4.1) Smooth Sparse Data 80
See here.
Wednesday, May 8, 13
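The idea in miniature, with pandas (an assumption, not the deck’s stack) and made-up numbers: reindex the sparse series onto every calendar day, so gaps don’t read as trends:

# A minimal sketch (not from the deck): fill the missing days in a sparse
# daily count with zeros before charting. Toy numbers, invented here.
import pandas as pd

counts = pd.Series(
    [30, 12, 50],
    index=pd.to_datetime(['2001-01-01', '2001-01-04', '2001-01-09'])
)

# Reindex onto every calendar day in the range, filling holes with zero
all_days = pd.date_range(counts.index.min(), counts.index.max(), freq='D')
dense = counts.reindex(all_days, fill_value=0)
print(dense)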
• 81. 4.4.2) Regress to find Trends 81
JRuby Linear Regression UDF | Pig to use the UDF | Trend Line in your Application
Wednesday, May 8, 13
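The deck’s UDF is JRuby; the same trend line in plain numpy looks like this (a sketch with made-up numbers):

# A minimal sketch (not the deck's JRuby UDF): fit a least-squares trend
# line to a weekly email count with numpy.polyfit. Data is invented.
import numpy as np

weeks = np.arange(10)                                    # x: week index
totals = np.array([3, 5, 4, 8, 7, 9, 12, 11, 14, 15])    # y: emails per week

# Degree-1 polynomial fit = linear regression; coefficients come back
# highest degree first, so this unpacks as (slope, intercept)
slope, intercept = np.polyfit(weeks, totals, 1)
trend = slope * weeks + intercept

print('slope: %.2f emails/week' % slope)
print(trend)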
• 82. 4.5.1) Natural Language Processing 82

Example with code here and macro here.

import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

/* Get the top 10 TF*IDF scores per message */
per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
  sorted = order my_tf_idf_scores by value desc;
  top_10_topics = limit sorted 10;
  generate group, top_10_topics.(score, value);
}

Wednesday, May 8, 13
  • 83. 4.5.2) NLP: Extract Topics! 83 Wednesday, May 8, 13
• 84. 4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
• http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
• http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
84
Wednesday, May 8, 13
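If you’d rather prototype TF-IDF off-cluster first, scikit-learn does it in a few lines — a sketch on a made-up toy corpus:

# A minimal sketch (not from the deck): TF-IDF on a toy corpus with
# scikit-learn, printing each document's top-scoring terms.
from sklearn.feature_extraction.text import TfidfVectorizer

bodies = [
    'scamming people, blah blah',
    'enron trade for frop futures',
    'futures trade meeting tomorrow',
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(bodies)
# Newer scikit-learn renames this to get_feature_names_out()
terms = vectorizer.get_feature_names()

for row in tfidf.toarray():
    # Pair terms with scores, keep the top 3 per document
    scored = sorted(zip(terms, row), key=lambda x: -x[1])[:3]
    print(scored)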
  • 85. 4.6) Probability & Bayesian Inference 85 Wednesday, May 8, 13
  • 86. 4.6.1) Gmail Suggested Recipients 86 Wednesday, May 8, 13
  • 87. 4.6.1) Reproducing it with Pig... 87 Wednesday, May 8, 13
  • 88. 4.6.2) Step 1: COUNT(From -> To) 88 Wednesday, May 8, 13
• 89. 4.6.2) Step 2: COUNT(From, To, Cc)/Total 89
P(cc | to) = the probability of cc’ing someone, given that you’ve already to’d someone. A worked toy example follows below.
Wednesday, May 8, 13
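The worked toy example (numbers invented, not from the Enron data):

# A minimal sketch (toy numbers): estimate P(cc | to) from the two
# counts the previous slides compute in Pig.
from collections import defaultdict

# (from, to, cc) triples observed in sent email headers -- invented
emails = [
    ('russell', 'kate', 'bob'),
    ('russell', 'kate', 'bob'),
    ('russell', 'kate', 'alice'),
    ('russell', 'bob', None),
]

to_totals = defaultdict(int)      # Step 1: COUNT(from -> to)
cc_given_to = defaultdict(int)    # Step 2: COUNT(from, to, cc)

for sender, to, cc in emails:
    to_totals[(sender, to)] += 1
    if cc:
        cc_given_to[(sender, to, cc)] += 1

# P(cc=bob | from=russell, to=kate) = 2/3
key = ('russell', 'kate', 'bob')
p = float(cc_given_to[key]) / to_totals[('russell', 'kate')]
print('P(cc=bob | to=kate) = %.2f' % p)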
  • 90. 4.6.3) Wait - Stop Here! It works! 90 They match… Wednesday, May 8, 13
  • 91. 4.4) Add predictions to reports 91 Wednesday, May 8, 13
  • 92. 5) Enable new actions 92 Wednesday, May 8, 13
• 93. Why doesn’t Kate reply to my emails?
• What time is best to catch her?
• Are they too long?
• Are they meant to be replied to (contain original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)? (see the sketch after this list)
• Do I cc the wrong people (my mom)?
93
Wednesday, May 8, 13
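One of these features — reply rate per recipient — is easy to prototype. A sketch over invented records; the in_reply_to field is an assumption:

# A minimal sketch (invented records): compute reply rate per recipient.
# Assumes each record carries 'from', 'to' and an 'in_reply_to'
# message_id when it answers another email -- an assumed schema.
from collections import defaultdict

sent = [
    {'from': 'russell', 'to': 'kate', 'message_id': 'a'},
    {'from': 'russell', 'to': 'kate', 'message_id': 'b'},
    {'from': 'kate', 'to': 'russell', 'message_id': 'c', 'in_reply_to': 'a'},
]

sent_to = defaultdict(int)
replies_from = defaultdict(int)
my_ids = set(e['message_id'] for e in sent if e['from'] == 'russell')

for email in sent:
    if email['from'] == 'russell':
        sent_to[email['to']] += 1
    elif email.get('in_reply_to') in my_ids:
        replies_from[email['from']] += 1

for person in sent_to:
    rate = float(replies_from[person]) / sent_to[person]
    print('%s replies to %.0f%% of my emails' % (person, rate * 100))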
• 94. Example: Packetpig and PacketLoop 94

snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) as country,
    priority;

countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Code here.

Wednesday, May 8, 13
  • 95. Example: Packetpig and PacketLoop 95 Wednesday, May 8, 13
  • 96. Thank You! Questions & Answers Follow: @rjurney Read the Blog: datasyndrome.com 96 Wednesday, May 8, 13