SlideShare uma empresa Scribd logo
1 de 133
Baixar para ler offline
BigData with Python
A gentle and simple introduction
Marcel Caraciolo
@marcelcaraciolo
Developer, Cientist, contributor to the Crab recsys project,
works with Python for 6 years, interested at mobile,
education, machine learning and dataaaaa!
Recife, Brazil - http://aimotion.blogspot.com
https://class.coursera.org/datasci-001/
Disclaimer
Alguns slides foram
retirados das apts
do curso de Bill Howe -
Introduction to Data
Science
About me
Co-founder of Crab - Python recsys library
Cientist Chief at Atepassar, e-learning social network
Co-Founder and Instructor of PyCursos, teaching Python on-line
Co-Founder of Pingmind, on-line infrastructure for MOOC’s
Interested at Python, mobile, e-learning and machine learning!
What is BigData ?
Big Data
“Big Data is any data that is expensive to manage
and hard to extract value from.”
Michael Franklin Thomas M. Siebel
Professor of Computer Science Director of the Algorithms,
Machines and People Lab University of Berkeley
Big Data
Erik Larson, 1989, Harper’s magazine
“The keepers of big data say they do it for the
consumer’s benefit. But data have a way of
being used for purposes other than originally
intended.”
!"#$%"$#&'(')")(
*$+,(!"#$%"$#&'(')")(
!"#$%&'(%$)*%+%),-+.),#$$)
*#/(#*)01.#2%03)
!4'1-%$)*%+%5)6$'70)5)1$-18)0+9#%25)
:%1.-(#)7#(#9%+#*5);2$)#+1<3)
Big Data, muitos dados.
Challenges
Volume
the size of the data
Velocity
the latency of data processing relative to the
growing demand for interactivity
Variety
the diversity of sources, formats, quality,
structures
!"#$%&'&$%"()*+",*+$-./0$
.,12()$$
.&3"'4$ .)1,5"'4$
.&12)$
.&12)$
Where does big
data comes from ?
“data exhaust” from customers
new and pervasive sensors
the ability to “keep everything”
!"#$%&'(')*"+$#"''
!"#$%$&'(
)*+,"-.%(/%"01(
!"#$%&'"()%*+%
2"-&*3&(.4.+*(/"5607-8(
,"-./%('-0"&12%3/&124.(1%
500%6&'216%./7%!.281&%0#.(1%
91:.-/*1(9-.%'7/,(
;-(5*5"+'(/"5607-8(
9211/%7.&.%(1/&21%
:'#37%6&.&1%723;1%
<#.6)%!1-'2=%
,>!?@%
:'(3.#%AB!%
!"#$%*(.-.%'7/,(
<$8(=.&.(
Where does big
data comes from ?
photo in public domain
Where does big
data comes from ?
4/28/13 Bill Howe, UW 11
!"#$%&'()#*+$
$
,-.#(''/$
$
$011$*2))2'3$-.45#$67#&7$-7$'8$9-&."$:;<:=$
$<$23$>$?3@#&3#@$67#&7$"-5#$-$,-.#(''/$
$-..'63@$
$$
$9'&#$@"-3$>;$(2))2'3$A2#.#7$'8$.'3@#3@$BC#($
$)23/7=$3#C7$7@'&2#7=$()'D$A'7@7=$3'@#7=$A"'@'$
$-)(6*7=$#@.EF$7"-&#G$#-."$*'3@"E$
$
$H')G7$>;%I$'8$G-@-$8'&$-3-)J727=$-GG7$<:$!I$'8$
$.'*A&#77#G$G-@-$G-2)J$
!"#$%&'!
!
!())!*$++$,-!./&'/0!12)!*$++$,-!34$+5!6#&&6/!
!789!:$++$,-!/&4';<!=.&'$&/!4!345!
!>!"?!3464!@,'!4-4+5/$/!A&-&'46&3!34$+5!
!!
"<&!B',:+&*C!
!"#$%&''&(!
!
!)*+!#,-.'&/0!.'$('1!!2++!#,-.'&/0!3#$4'!
!5$6-44$/0!3#$4'!0&378!9-(!!*+++!.'$('!
!):!;<!0&#&!9-(!&/&78'3'!=$/$(&#$0!0&378!
!!
>(&03?-/&7!0&#&!'#-(&=$1!#$6,/3@.$'!A!&/&78'3'!
#--7'!B.'#!0-!/-#!C-(D!&#!#,$'$!'6&7$'!E!
!
>,$!F(-G7$4H!
Scalability
What does
Scalable mean ?
Operationally:
In the past: Works even if data doesn’t fit in main memory
Now: Can make use of 1000s of cheap computers
Algorithmically:
In the past: If you have N data items, you must do no more
than Nm operations -- polynomial time algorithms
Now: If you have N data items, you must do no more than Nm/k operations,
for some large k -- “polynomial time algorithms must be parallelized”
Soon: If you have N data items, you should do no more than N * log(N) operations
Example
Given a set of DNA sequences, find all
sequences equal to GATTACGATATTA .
GATTACGATATTATACCTGCCGTAA
TACCTGCCGTAA = GATTACGATATTA?
GATTACGATATTA
time = 0
No.
Time = 0
Time = 1
CCCCCAATGAC = GATTACGATATTA?
GATTACGATATTA
time = 1
No.
Time = 17
GATTACGATATTA contains GATTACGATATTA?
GATTACGATATTA
time = 17
Yes!
Send it to the output.
GATTACGATATTA
40 records, 40 comparions
N records, N comparisons
The algorithmic complexity is order N: O(N)
GATTACGATATTA
AAAATCCTGCA
AAACGCCTGCA
TTTACGTCAA
TTTTCGTAATT
What if we sort the sequences?
What if we sort the sequences ?
CTGTACACAACCT < GATTACGATATTA
GATTACGATATTA
No match.
Skip to 75% mark
Start at the 50% mark
time = 0
0% 100%
CTGTACACAACCT
Time = 0
GGATACACATTTA > GATTACGATATTA
GATTACGATATTA
No match.
0% 100%
GGATACACATTTA
GATATTTTAAGC < GATTACGATATTA
GATTACGATATTA
No match.
Skip back to 68.75% mark
0% 100%
GATATTTTAAGC
GATTACGATATTA = GATTACGATATTA
GATTACGATATTA
0% 100%
Match!
Walk through the records until we
fail to match.
GATTACGATATTA
0% 100%
How many comparisons did we do?
40 records, only 4 comparisons
N records, log(N) comparisons
This algorithm is O(log(N)) Far better scalability
Relational
Databases
Relational Databases
•  Databases are good at Needle in Haystack problems:
–  Extracting small results from big datasets
–  Transparently provide old style scalability
–  Your query will always* finish, regardless of dataset size.
–  Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = GATTACGATATTA ;
Example
New task: Read Trimming
Given a set of DNA sequences,
trim the final n bps of each sequence
Generate a new dataset
GATTACGATATTATACCTGCCGTAA
TACCTGCCGTAA becomes TACCT
time = 0
Time = 0
Time = 1
CCCCCAATGAC becomes CCCCC
time = 1
Time = 17
GATTACGATATTA becomes GATTA
time = 17
Can we use an index?
No. We have to touch every record no matter what.
The task is fundamentally O(N)
Can we do any better?
Time = 0
Time = 1
Time = 2
Time = 3
Time = 7
time = 7
How much time did this take?
7 cycles
40 records, 6 workers
O(N/k)
time = 7
How much time did this ta
7 cycles
40 records, 6 workers
O(N/k)
Schematic of a Parallel “Read Trimming” Task
You are given short
“reads”: genomic
sequences about
35-75 characters each
Distribute the reads
among k computers
f f f f f f
f is a function to
trim a read; apply it
to every item
Now we have a big
distributed set of
trimmed reads
Schematic of a Parallel “Read Trimming” Task
You are given short
“reads”: genomic
sequences about
35-75 characters each
Distribute the reads
among k computers
f f f f f f
f is a function to
trim a read; apply it
to every item
Now we have a big
distributed set of
trimmed reads
Schematic of a Parallel “Read Trimming” Task
You are given short
“reads”: genomic
sequences about
35-75 characters each
Distribute the reads
among k computers
f f f f f f
f is a function to
trim a read; apply it
to every item
Now we have a big
distributed set of
trimmed reads
Schematic of a Parallel “Read Trimming” Task
You are given short
“reads”: genomic
sequences about
35-75 characters each
Distribute the reads
among k computers
f f f f f f
f is a function to
trim a read; apply it
to every item
Now we have a big
distributed set of
trimmed reads
You are given TIFF
images
Distribute the images
among k computers
f f f f f f
f is a function to
convert TIFF to
PNG; apply it to
every item
Now we have a big
distributed set of
converted images
New Task: Convert 405k TIFF images to PNG
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
Convert 405k TIFF images to PNG
You have sets of
parameters for
thousands of small
simulations
Divide the parameter
sets among k
computers
f f f f f f
f is runs the
simulation and
produces some
output; apply it to
every item
Now we have a big
distributed set of
simulation results
New Task: Run thousands of simulations
http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud
Run thousands of simulations
You have millions
of documents
Distribute the
documents among k
computers
f f f f f f
f finds the most
common word in a
single document
Now we have a big
distributed list of
(doc_id, word) pairs
Find the most common word in each document
Find the most common word in each
document
Consider a slightly more general program
to compute the word frequency of every
word in a singe documentConsider a slightly more general program to compute the
word frequency of every word in a single document
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
(people, 2)
(government, 6)
(assume, 1)
(history, 2)
…
You have millions
of documents
Distribute the
documents among k
computers
f f f f f f
For each document
f returns a set of
(word, freq) pairs
Compute the word frequency of 5M documents
Now we have a big
distributed list of
sets of word freqs.
There’s a pattern...
There’s a pattern here….
•  A function that maps a read to a trimmed read
•  A function that maps a TIFF image to a PNG image
•  A function that maps a set of parameters to a simulation result
•  A function that maps a document to its most common word
•  A function that maps a document to a histogram of word
frequencies
What if we want to compute the word
frequency across all documents?
US Constitution Declaration of
Independence
Articles of
Confederation
(people, 78)
(government, 123)
(assume, 23)
(history, 38)
…
You have millions
of documents
Distribute the
documents among
k computers
map map map map map map
For each document,
return a set of (word,
freq) pairs
Compute the word frequency across 5M documents
Now what?
•  But we don’t want a bunch of little histograms –
we want one big histogram.
•  How can we make sure that a single computer
has access to every occurrence of a given word
regardless of which document it appeared in?
Compute the word frequency across 5M docs
Compute the word frequency across 5M docs
Distribute the
documents among k
computers
map map map map map map
For each document,
return a set of
(word, freq) pairs
Now we have a big
distributed list of
sets of word freqs.
reduce reduce reduce reduce
Now just count the
occurrences of each word
44 3
We have our
distributed histogram
Compute the word frequency across 5M documents
Compute the word frequency across 5M docs
5/15/13 Bill Howe, eScience Institute 11
Some distributed algorithm…
Map
(Shuffle)
Reduce
Compute the word frequency across 5M docs
5/15/13 Bill Howe, eScience Institute 12
MapReduce Programming Model
•  Input & Output: each a set of key/value pairs
•  Programmer specifies two functions:
–  Processes input key/value pair
–  Produces set of intermediate pairs

–  Combines all intermediate values for a particular key
–  Produces a set of merged output values (usually just one)
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by primitives from functional programming
languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
Compute the word frequency across 5M docs
5/15/13 Bill Howe, eScience Institute 13
Example: What does this do?
map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, 1);


reduce(String output_key, Iterator intermediate_values):
// output_key: word
// output_values: ????
int result = 0;
for each v in intermediate_values:
result += v;
Emit(result);
slide source: Google, Inc.
Example
3/12/09 Bill Howe, eScience Institute 14
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Document Processing
Word length histogram example
3/12/09 Bill Howe, eScience Institute 14
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Document Processing
3/12/09 Bill Howe, eScience Institute 15
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Word length histogram
How many big , medium , and small words are used?
Word length histogram example
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
Example: Word length histogram
Word length histogram example
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Split the document into
chunks and process each
chunk on a different
computer
Chunk 1
Chunk 2
Example: Word length histogram
Word length histogram example
(yellow, 20)
(red, 71)
(blue, 93)
(pink, 6 )
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Map Task 1
(204 words)
Map Task 2
(190 words)
(key, value)
(yellow, 17)
(red, 77)
(blue, 107)
(pink, 3)
Example: Word length histogram
Word length histogram example
3/12/09 Bill Howe, eScience Institute 19
(yellow, 17)
(red, 77)
(blue, 107)
(pink, 3)
(yellow, 20)
(red, 71)
(blue, 93)
(pink, 6 )
Reduce tasks
(yellow, 17)
(yellow, 20)
(red, 77)
(red, 71)
(blue, 93)
(blue, 107)
(pink, 6)
(pink, 3)
Example: Word length histogram
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Map task 1
Map task 2
Shuffle step
(yellow, 37)
(red, 148)
(blue, 200)
(pink, 9)
MR Phases
!"#$%&'(")$*+&,&
MR Phases
•  Each Map and Reduce task has multiple phases:
MR Phases
MAP REDUCE
(w1,1)
(w2,1)
(w1,1)
…
(w1,1)
(w2,1)
…
(did1,v1)
(did2,v2)
(did3,v3)
. . . .
(w1, (1,1,1,…,1))
(w2, (1,1,…))
(w3,(1…))
…
…
…
…
(w1, 25)
(w2, 77)
(w3, 12)
…
…
…
…
Shuffle
!"#$%&'()%"**$"(+%,&-.$/%%
012%3',%45+,%+$3)%6&78%9:;%
MR Phases
Hadoop in
One Slide
src: Huy Vo,
NYU Poly
Taxonomy of
Parallel Architectures
5/15/13 Bill Howe, eScience Institute 22
Taxonomy of Parallel Architectures
Easiest to program,
but $$$$
Scales to 1000s of computers
5/15/13 Bill Howe, UW 32
5/15/13 Bill Howe, UW 25
5/15/13 Bill Howe, eScience Institute 26
Large-Scale Data Processing
•  Many tasks process big data, produce big data
•  Want to use hundreds or thousands of CPUs
–  ... but this needs to be easy
–  Parallel databases exist, but they are expensive, difficult to set
up, and do not necessarily scale to hundreds of nodes.
•  MapReduce is a lightweight framework, providing:
–  Automatic parallelization and distribution
–  Fault-tolerance
–  I/O scheduling
–  Status and monitoring
5/15/13 Bill Howe, eScience Institute 35
MapReduce Contemporaries
•  Dryad (Microsoft)
–  Relational Algebra
•  Pig (Yahoo)
–  Near Relational Algebra over MapReduce
•  HIVE (Facebook)
–  SQL over MapReduce
•  Cascading
–  Relational Algebra
•  Clustera
–  U of Wisconsin
•  Hbase
–  Indexing on HDFS
MapReduce Contemporaries
Talk is cheap,
Show me the code
Examples
More Examples: Build an Inverted Index
5/15/13 Bill Howe, UW 14
Input:
tweet1, (“I love pancakes for breakfast”)
tweet2, (“I dislike pancakes”)
tweet3, (“What should I eat for breakfast?”)
tweet4, (“I love to eat”)
Desired output:
“pancakes”, (tweet1, tweet2)
“breakfast”, (tweet1, tweet3)
“eat”, (tweet3, tweet4)
“love”, (tweet1, tweet4)
…
Examples
More Examples: Relational Join
5/15/13 Bill Howe, UW 15
Name SSN
Sue 999999999
Tony 777777777
EmpSSN DepName
999999999 Accounts
777777777 Sales
777777777 Marketing
Emplyee ⨝ Assigned Departments
Employee
Assigned Departments
Name SSN EmpSSN DepName
Sue 999999999 999999999 Accounts
Tony 777777777 777777777 Sales
Tony 777777777 777777777 Marketing
Examples
Relational Join in MapReduce: Before Map Phase
5/15/13 Bill Howe, UW 16
Name SSN
Sue 999999999
Tony 777777777
EmpSSN DepName
999999999 Accounts
777777777 Sales
777777777 Marketing
Employee
Assigned Departments
Key idea: Lump all the tuples
together into one dataset
Employee, Sue, 999999999
Employee, Tony, 777777777
Department, 999999999, Accounts
Department, 777777777, Sales
Department, 777777777, Marketing
What is this for?
Examples
Relational Join in MapReduce: Map Phase
5/15/13 Bill Howe, UW 17
Employee, Sue, 999999999
Employee, Tony, 777777777
Department, 999999999, Accounts
Department, 777777777, Sales
Department, 777777777, Marketing
key=999999999, value=(Employee, Sue, 999999999)
key=777777777, value=(Employee, Tony, 777777777)
key=999999999, value=(Department, 999999999, Accounts)
key=777777777, value=(Department, 777777777, Sales)
key=777777777, value=(Department, 777777777, Marketing)
why do we use this as the key?
Examples
Relational Join in MapReduce: Reduce Phase
5/15/13 Bill Howe, UW 18
key=999999999, values=[(Employee, Sue, 999999999),
(Department, 999999999, Accounts)]
key=777777777, values=[(Employee, Tony, 777777777),
(Department, 777777777, Sales),
(Department, 777777777, Marketing)]
Sue, 999999999, 999999999, Accounts
Tony, 777777777, 777777777, Sales
Tony, 777777777, 777777777, Marketing
Examples
Relational Join in MapReduce, again
5/15/13 Bill Howe, Data Science, Autumn 2012 19
Order(orderid, account, date)
1, aaa, d1
2, aaa, d2
3, bbb, d3
LineItem(orderid, itemid, qty)
1, 10, 1
1, 20, 3
2, 10, 5
2, 50, 100
3, 20, 1tagged with
relation name
Order
1, aaa, d1 ! 1 : “Order”, (1,aaa,d1)
2, aaa, d2 ! 2 : “Order”, (2,aaa,d2)
3, bbb, d3 ! 3 : “Order”, (3,bbb,d3)
Line
1, 10, 1 ! 1 : “Line”, (1, 10, 1)
1, 20, 3 ! 1 : “Line”, (1, 20, 3)
2, 10, 5 ! 2 : “Line”, (2, 10, 5)
2, 50, 100 ! 2 : “Line”, (2, 50, 100)
3, 20, 1 ! 3 : “Line”, (3, 20, 1)
Map
Reducer for key 1
“Order”, (1,aaa,d1)
“Line”, (1, 10, 1)
“Line”, (1, 20, 3)
(1, aaa, d1, 1, 10, 1)
(1, aaa, d1, 1, 20, 3)
ExamplesSimple Social Network Analysis:
Count Friends
Jim, Sue
Sue, Jim
Lin, Joe
Joe, Lin
Jim, Kai
Kai, Jim
Jim, Lin
Lin, Jim
Input Desired Output
Jim, 3
Lin, 2
Sue, 1
Kai, 1
Joe, 1
Jim, 1
Sue, 1
Lin, 1
Joe, 1
Jim, 1
Kai, 1
Jim, 1
Lin, 1
MAP REDUCE
Jim, (1, 1, 1)
Lin, (1, 1)
Sue, (1)
Kai, (1)
Joe, (1)
SHUFFLE
Examples
Matrix Multiplication
!" #" $" %&"
'" &" %#" !"
!" %&"
$" #"
%#" %&"
(" $"
X =
!" %)"
&#" $"
Examples
Matrix Multiply in MapReduce
C = A X B
A has dimensions L,M
B has dimensions M,N
•  In the map phase:
–  for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N
–  for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L
•  In the reduce phase, emit
–  key = (i,k)
–  value = Sumj (A[i,j] * B[j,k])
5/15/13 Bill Howe, Data Science, Autumn 2012 21
Examples
AB
X
A B
=
•  One reducer per output cell
Meeting Hadoop
Cluster Computing
•  Large number of commodity servers,
connected by high speed, commodity
network
•  Rack: holds a small number of servers
•  Data center: holds many racks
!"#$%&'(#))%'&*"&+!"
#$%&'()*+(",-(+./0"123145"67$%8('"1",-(+./09":;1;<"/0=>4""
/?"#@0@0A"/?"#$99@B("C$8$9(89;"D>"E$F$'$G$0"$0)"H==G$0""
),,(-.,(/0123,(456,7852(9,:3;-,(
Cluster Computing
•  Massive parallelism:
– 100s, or 1000s, or 10000s servers
– Many hours
•  Failure:
– If medium-time-between-failure is 1 year
– Then 10000 servers have one failure / hour
24slide src: Dan Suciu and Magda Balazinska
Distributed File System (DFS)
•  For very large files: TBs, PBs
•  Each file is partitioned into chunks,
typically 64MB
•  Each chunk is replicated several times
(!3), on different racks, for fault
tolerance
•  Implementations:
– Google’s DFS: GFS, proprietary
– Hadoop’s DFS: HDFS, open source
25slide src: Dan Suciu and Magda Balazinska
Hadoop
!"#$%&'%(#)**+%,%
!"#$%&"#'()*'(+(%"(&"#'(,-.%/#-/0,#'12,'
"(,3#'4-("#'*%4/,%&0/#*'-#$."'5,2-#44%)3'
2)'(')#/62,7'21'-2882*%/9'.(,*6(,#:'
!"#$%&'()"'*&+&*'",)-&$('
!"#$%%!&'((#)&#&*!+)(,-%.
.//'$)0(,123(),4'
5('%#4')0&')6'(%&'4(,)07&4('&$)'484(&94':1(%'*#,7&'0)')6'432'",)-&$(4'
;#%))'%#4')0&')6'(%&'2177&4('104(#**#<)0'=#>))"'
?300107'@///4')6'4&,+&,4')0'=#>))"'
!"#$%&'()*+),)
!"##$%&'"()'*'+,-'.&/01&'*'23$'4,5%&6'
'
'78193:&1:08&'5&93;/'"##$%&<='
'''''
' ')&,819'>;$3;&'
-&'./0&)01)2.(00$)$&03'4/)
!"#$%&"#"$'$()&*$+"$,&-../$0"#-$1.2$
3+456.%+&7$-&*&$&8&79"+"$
:#;*$<+8+84=$/&>#28"$"#&2%)$
?&%)+8#$7.4$&8&79"+"$
@#.A"/&%+B&7$&8&79"+"$
@#8.<#$C8&79"+"$
D204$D+"%.E#29$
F2&0-$&8-$%.</7+&8%#$<&8&4#<#8*$
G+-#.$&8-$+<&4#$&8&79"+"$
:2#8-$C8&79"+"$
!"#$%&'&$()*##+$,$-#./$-0&1$
2$34)5#.637$$
2$8)9':##;$$
2$<##/-'$$
2$=>?$$
2$@0&.'A$
2$B)&1CD4$$
2$E'F$G#H;$I04'&$$
2$G)"##J$$
2$IF0K'H$
2$B0.;'*$0.$$
$$
$
!"#$$%&'($)*)+',&-&
!"#$%& '#()%*+,%-!./0&
12+34#&
!56%& 758&
96:;&
<;;=%%(%:&
0>;;(&
/?+@%& A;B5%& C25::&
!"#$$%&#'()*'+,-$.&/&
!"#$%&'(&
!""#
$%&'()*+&,#-&.&/&010#
!"#$%&'()&&
!""#
!"#$$%&#'()*'+,)-#&./-&(0()-1&
!"#$$%&'$(%$)*)+,&-&
.$/&+0"12*0&
3",2&30"12*0&
4"(*&4$#*&
5"%&6*#71*&
!89:&&&&
MapReduce
http://labs.google.com/papers/mapreduce.html
Exemplo de MapReduce
from mrjob.job import MRJob
import re
WORD_RE = re.compile(r"[w']+")
class MRWordFreqCount(MRJob):
def mapper(self, _, line):
for word in WORD_RE.findall(line):
yield (word.lower(), 1)
def reducer(self, word, counts):
yield (word, sum(counts))
if __name__ == '__main__':
MRWordFreqCount().run()
Projeto MrJob
Criado pela Equipe de Engenharia doYelp
Totalmente Open-Source
Todo em Python
Utiliza Map-Reduce para Processamento
Permite rodar tanto no Amazon EMR como no Hadoop
Objetivos do MrJobs
Se você quer aprender MapReduce, ele é para você
Se você tem um problema cavalar e precisa de muito
processamento e não está afim de mexer em Hadoop
Se você já tem um cluster Hadoop e quer rodar scripts
Python
Se você quer migrar seu código Python do Hadoop
para o EMR
Se você não quer escrever Python (Impossível!), não é
para você!
Passos importantes
sudo easy_install mrjob
Vamos a uma demo...
Texto da posse de Obama em 2009.
Desempenho MapReduce
Mais informações
http://packages.python.org/mrjob/
https://github.com/Yelp/mrjob
Distributed Computing with mrJob
https://github.com/Yelp/mrjob
Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
Distributed Computing with mrJob
https://github.com/Yelp/mrjob
Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
Atepassar Recommendations
http://atepassar.com
Problema: Como recomendar novos amigos ?
Atepassar Recommendations
http://atepassar.com
Atepassar Recommendations
http://atepassar.com
marcel;jonas,maria,jose,amanda
maria;carol,fabiola,amanda,marcel
amanda;paula,patricia,maria,marcel
carol;maria,jose,patricia
fabiola;maria
paula;fabio,amanda
patricia;amanda,carol
jose;marcel,carol
jonas;marcel,fabio
fabio;jonas,paula
carla
"marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]
"maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]
"patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]
"paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]
"amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]
"carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]
"fabio" [["amanda", 1], ["marcel", 1]]
"fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]
"jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]
"jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]
$python friends_recommender.py - r emr --num-ec2-instances 5 facebook_data.csv > output.dat
Atepassar Recommendations
http://atepassar.com
marcel;jonas,maria,jose,amanda
maria;carol,fabiola,amanda,marcel
amanda;paula,patricia,maria,marcel
carol;maria,jose,patricia
fabiola;maria
paula;fabio,amanda
patricia;amanda,carol
jose;marcel,carol
jonas;marcel,fabio
fabio;jonas,paula
carla
"marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]
"maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]
"patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]
"paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]
"amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]
"carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]
"fabio" [["amanda", 1], ["marcel", 1]]
"fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]
"jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]
"jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]
$python friends_recommender.py - r emr --num-ec2-instances 5 facebook_data.csv > output.dat
marcel;jonas,maria,jose,amanda
carol;maria,jose,patricia
fabio;jonas,paula
...
Atepassar Recommendations
http://atepassar.com
fsim(u,i) = w1 f1(u,i) + w2 f2(u,i) + b,
https://gist.github.com/3970945#file_atepassar_recommender.py
Atepassar Recommendations
http://atepassar.com
Celery - para agendamento dos jobs coletores e
executores.
mrJob - para mapreduce e acesso ao Hadoop
MongoDb - para armazenamento das recomendações
Boto - acesso aos files do S3.
A melhor parte!
Projetos interessantes
Estrutura de dados para manipulação rápida
- slicing
- indexing
- subseting
Handling missing data
Agregações, Séries Temporais
Pandas: a data analysis library for Python,
poised to give R a run for its money… http://pandas.pydata.org/
Projetos interessantes
Outro framework para computação distribuída com
Python com MapReduce.
Criado pelo Instituto Nokia.
Backend dele é escrito em Erlang (funcional,
concorrente e bem escalável!)
Não utiliza o FileSystem mais usado HDFS e sim um
novo padrão por eles (DDFS).
http://discoproject.org/
Projetos interessantes
http://api.mongodb.org/python/2.0/examples/
map_reduce.html
Python & MongoDb
MongoDb - Banco de Dados Não relacional (NoSQL)
Possui suporte nativo built-in para fazer MapReduce.
Escrever o código em JS e não é muito legível e fica preso ao
Mongo ...
>>> reduce = Code("function (key, values) {"
... " var total = 0;"
... " for (var i = 0; i < values.length; i++) {"
... " total += values[i];"
... " }"
... " return total;"
... "}")
Projetos interessantes
https://github.com/klbostee/dumbo/wiki/Short-tutorial
Dumbo
Uma das primeiras bibliotecas em cima do
MapReduce e Python.
Complicado para começar e está desatualizada :(
Projetos interessantes
Um wrapper em Python em cima do Hadoop
para computação distribuída.
http://pydoop.sourceforge.net/docs/index.html
Legal, mas dá um trabalho para configurar.
Projetos interessantes
Mapreduce com Python na Google AppEngine
https://developers.google.com/appengine/docs/python/
dataprocessing/
Ainda experimental e fica “preso” à plataforma
AppEngine.
Projetos interessantes
Algoritmos de aprendizagem de máquina
Supervisionados & Não supervisionados
Pré-processamento, extração de dados
Avaliação de classificadores, Pipeline,
seleção de atributos.
http://scikit-learn.org/stable/
Projetos interessantes
Processamento de linguagem natural
Várias ferramentas para tokenização,
pos tagging, named entity recognition,
classificadores, etc.
Vários corpus disponíveis!
http://nltk.org/
Projetos interessantes
Processamento de linguagem natural
Várias ferramentas para tokenização,
pos tagging, named entity recognition,
classificadores, etc.
Vários corpus disponíveis!
http://nltk.org/
Fiquem de olho...
Pipeline for distributed Natural Language Processing, made in Python
https://github.com/NAMD/pypln
Projetos interessantes
Our Project’s Home Page
http://muricoca.github.com/crab
Future Releases
Planned Release 0.13
New home for python-recsys:
https://github.com/python-recsys/crab
Planned Release 0.14
Support to Item-Based Recommenders using MapReduce with MrJob
New commiters: vinnigracindo, ocelma, fcurella
Join us!
1. Read our Wiki Page
https://github.com/muricoca/crab/wiki/Developer-Resources
2. Check out our current sprints and open issues
https://github.com/muricoca/crab/issues
3. Forks, Pull Requests mandatory
4. Join us at irc.freenode.net #muricoca or at our
discussion list
http://groups.google.com/group/scikit-crab
Vários outros ...
Matplotlib
StatsModelsScipy
Numpy
Matplotlib
NetworkX
PyBrain
milk
Orange
Livros recomendados
Livros recomendados
http://shop.oreilly.com/product/0636920022640.do?cmp=il-radar-ebooks-big-data-now-radar
For free...
Livros recomendados
For free...
http://infolab.stanford.edu/~ullman/mmds/book.pdf
Artigos recomendados
For free...
http://aimotion.blogspot.com.br/2012/10/atepassar-
recommendations-recommending.html
http://aimotion.blogspot.com.br/2012/08/introduction-to-
recommendations-with.html
Artigos recomendados
For free...
http://aimotion.blogspot.com.br/2012/10/atepassar-
recommendations-recommending.html
http://aimotion.blogspot.com.br/2012/08/introduction-to-
recommendations-with.html
https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob
BigData with Python
A gentle and simple introduction
Marcel Caraciolo
@marcelcaraciolo
Developer, Cientist, contributor to the Crab recsys project,
works with Python for 6 years, interested at mobile,
education, machine learning and dataaaaa!
Recife, Brazil - http://aimotion.blogspot.com

Mais conteúdo relacionado

Mais procurados

Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013MLconf
 
Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...Kelle Cruz
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019Travis Oliphant
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
Introducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applicationsIntroducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applicationsRokesh Jankie
 
Data science apps: beyond notebooks
Data science apps: beyond notebooksData science apps: beyond notebooks
Data science apps: beyond notebooksNatalino Busa
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Anubhav Jain
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
RESTo - restful semantic search tool for geospatial
RESTo - restful semantic search tool for geospatialRESTo - restful semantic search tool for geospatial
RESTo - restful semantic search tool for geospatialGasperi Jerome
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019Travis Oliphant
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in Rmikaelhuss
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009Ian Foster
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona KeynoteTravis Oliphant
 
Data Visualization in Python
Data Visualization in PythonData Visualization in Python
Data Visualization in PythonJagriti Goswami
 
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016MLconf
 

Mais procurados (20)

Python and Sage
Python and SagePython and Sage
Python and Sage
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Introducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applicationsIntroducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applications
 
Data science apps: beyond notebooks
Data science apps: beyond notebooksData science apps: beyond notebooks
Data science apps: beyond notebooks
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
RESTo - restful semantic search tool for geospatial
RESTo - restful semantic search tool for geospatialRESTo - restful semantic search tool for geospatial
RESTo - restful semantic search tool for geospatial
 
H20: A platform for big math
H20: A platform for big math H20: A platform for big math
H20: A platform for big math
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
Python pandas Library
Python pandas LibraryPython pandas Library
Python pandas Library
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
 
Data Visualization in Python
Data Visualization in PythonData Visualization in Python
Data Visualization in Python
 
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
Rajat Monga, Engineering Director, TensorFlow, Google at MLconf 2016
 

Destaque

Recommender Systems with Ruby (adding machine learning, statistics, etc)
Recommender Systems with Ruby (adding machine learning, statistics, etc)Recommender Systems with Ruby (adding machine learning, statistics, etc)
Recommender Systems with Ruby (adding machine learning, statistics, etc)Marcel Caraciolo
 
GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...
GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...
GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...Marcel Caraciolo
 
Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Marcel Caraciolo
 
Computação Científica com Python, Numpy e Scipy
Computação Científica com Python, Numpy e ScipyComputação Científica com Python, Numpy e Scipy
Computação Científica com Python, Numpy e ScipyMarcel Caraciolo
 
Python e Aprendizagem de Máquina (Inteligência Artificial)
Python e Aprendizagem de Máquina (Inteligência Artificial)Python e Aprendizagem de Máquina (Inteligência Artificial)
Python e Aprendizagem de Máquina (Inteligência Artificial)Marcel Caraciolo
 
Construindo Soluções Científicas com Big Data & MapReduce
Construindo Soluções Científicas com Big Data & MapReduceConstruindo Soluções Científicas com Big Data & MapReduce
Construindo Soluções Científicas com Big Data & MapReduceMarcel Caraciolo
 
Como Python está mudando a forma de aprendizagem à distância no Brasil
Como Python está mudando a forma de aprendizagem à distância no BrasilComo Python está mudando a forma de aprendizagem à distância no Brasil
Como Python está mudando a forma de aprendizagem à distância no BrasilMarcel Caraciolo
 
Redes Neurais e Python
Redes Neurais e PythonRedes Neurais e Python
Redes Neurais e Pythonpugpe
 
Palestra: Cientista de Dados – Dominando o Big Data com Software Livre
Palestra: Cientista de Dados – Dominando o Big Data com Software LivrePalestra: Cientista de Dados – Dominando o Big Data com Software Livre
Palestra: Cientista de Dados – Dominando o Big Data com Software LivreAmbiente Livre
 
Orientação a Objetos em Python
Orientação a Objetos em PythonOrientação a Objetos em Python
Orientação a Objetos em PythonLuciano Ramalho
 
Python para iniciantes
Python para iniciantesPython para iniciantes
Python para iniciantesrichardsonlima
 
Crab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation SystemsCrab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation SystemsMarcel Caraciolo
 
Python simplecv
Python simplecvPython simplecv
Python simplecvUFPA
 
Desenvolvendo uma aplicacao Full Javascript
Desenvolvendo uma aplicacao Full JavascriptDesenvolvendo uma aplicacao Full Javascript
Desenvolvendo uma aplicacao Full JavascriptDenis Vieira
 
Cientista da computacao usando python
Cientista da computacao usando pythonCientista da computacao usando python
Cientista da computacao usando pythonJean Lopes
 
Recommendation systems with Mahout: introduction
Recommendation systems with Mahout: introductionRecommendation systems with Mahout: introduction
Recommendation systems with Mahout: introductionAdam Warski
 
Pepe Legal Python e Babalu MongoDB, uma dupla dinâmica
Pepe Legal Python e Babalu MongoDB, uma dupla dinâmicaPepe Legal Python e Babalu MongoDB, uma dupla dinâmica
Pepe Legal Python e Babalu MongoDB, uma dupla dinâmicaFATEC São José dos Campos
 

Destaque (20)

Recommender Systems with Ruby (adding machine learning, statistics, etc)
Recommender Systems with Ruby (adding machine learning, statistics, etc)Recommender Systems with Ruby (adding machine learning, statistics, etc)
Recommender Systems with Ruby (adding machine learning, statistics, etc)
 
GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...
GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...
GeoMapper, Python Script for Visualizing Data on Social Networks with Geo-loc...
 
Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks
 
Computação Científica com Python, Numpy e Scipy
Computação Científica com Python, Numpy e ScipyComputação Científica com Python, Numpy e Scipy
Computação Científica com Python, Numpy e Scipy
 
Python e Aprendizagem de Máquina (Inteligência Artificial)
Python e Aprendizagem de Máquina (Inteligência Artificial)Python e Aprendizagem de Máquina (Inteligência Artificial)
Python e Aprendizagem de Máquina (Inteligência Artificial)
 
Construindo Soluções Científicas com Big Data & MapReduce
Construindo Soluções Científicas com Big Data & MapReduceConstruindo Soluções Científicas com Big Data & MapReduce
Construindo Soluções Científicas com Big Data & MapReduce
 
Como Python está mudando a forma de aprendizagem à distância no Brasil
Como Python está mudando a forma de aprendizagem à distância no BrasilComo Python está mudando a forma de aprendizagem à distância no Brasil
Como Python está mudando a forma de aprendizagem à distância no Brasil
 
Redes Neurais e Python
Redes Neurais e PythonRedes Neurais e Python
Redes Neurais e Python
 
Aprendendo python
Aprendendo pythonAprendendo python
Aprendendo python
 
Palestra: Cientista de Dados – Dominando o Big Data com Software Livre
Palestra: Cientista de Dados – Dominando o Big Data com Software LivrePalestra: Cientista de Dados – Dominando o Big Data com Software Livre
Palestra: Cientista de Dados – Dominando o Big Data com Software Livre
 
Orientação a Objetos em Python
Orientação a Objetos em PythonOrientação a Objetos em Python
Orientação a Objetos em Python
 
Python para iniciantes
Python para iniciantesPython para iniciantes
Python para iniciantes
 
Crab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation SystemsCrab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation Systems
 
Python simplecv
Python simplecvPython simplecv
Python simplecv
 
Desenvolvendo uma aplicacao Full Javascript
Desenvolvendo uma aplicacao Full JavascriptDesenvolvendo uma aplicacao Full Javascript
Desenvolvendo uma aplicacao Full Javascript
 
Cientista da computacao usando python
Cientista da computacao usando pythonCientista da computacao usando python
Cientista da computacao usando python
 
Slide curso excel 2007 avancado
Slide curso excel 2007 avancadoSlide curso excel 2007 avancado
Slide curso excel 2007 avancado
 
Api facebook
Api facebookApi facebook
Api facebook
 
Recommendation systems with Mahout: introduction
Recommendation systems with Mahout: introductionRecommendation systems with Mahout: introduction
Recommendation systems with Mahout: introduction
 
Pepe Legal Python e Babalu MongoDB, uma dupla dinâmica
Pepe Legal Python e Babalu MongoDB, uma dupla dinâmicaPepe Legal Python e Babalu MongoDB, uma dupla dinâmica
Pepe Legal Python e Babalu MongoDB, uma dupla dinâmica
 

Semelhante a Big Data com Python

Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Functional data structures
Functional data structuresFunctional data structures
Functional data structuresRalf Laemmel
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Lecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdfLecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdfRanvinuHewage
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores inside-BigData.com
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pRobert Grossman
 
Extending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured InformationExtending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured InformationAntonio Vallecillo
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are AlgorithmsInfluxData
 
Wikipedia Views As A Proxy For Social Engagement
Wikipedia Views As A Proxy For Social EngagementWikipedia Views As A Proxy For Social Engagement
Wikipedia Views As A Proxy For Social EngagementDaniel Cuneo
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...ssuser4b1f48
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyVarun Vijayaraghavan
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesIan Foster
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentationlordjoe
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdfLevLafayette1
 
Computer notes - data structures
Computer notes - data structuresComputer notes - data structures
Computer notes - data structuresecomputernotes
 

Semelhante a Big Data com Python (20)

Bill howe 3_mapreduce
Bill howe 3_mapreduceBill howe 3_mapreduce
Bill howe 3_mapreduce
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Functional data structures
Functional data structuresFunctional data structures
Functional data structures
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Lecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdfLecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdf
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores 
 
PPT
PPTPPT
PPT
 
Deep learning
Deep learningDeep learning
Deep learning
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9p
 
Extending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured InformationExtending Complex Event Processing to Graph-structured Information
Extending Complex Event Processing to Graph-structured Information
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are Algorithms
 
Wikipedia Views As A Proxy For Social Engagement
Wikipedia Views As A Proxy For Social EngagementWikipedia Views As A Proxy For Social Engagement
Wikipedia Views As A Proxy For Social Engagement
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
NS-CUK Seminar: S.T.Nguyen, Review on "Continuous-Time Sequential Recommendat...
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.ly
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentation
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
 
Computer notes - data structures
Computer notes - data structuresComputer notes - data structures
Computer notes - data structures
 

Mais de Marcel Caraciolo

Como interpretar seu próprio genoma com Python
Como interpretar seu próprio genoma com PythonComo interpretar seu próprio genoma com Python
Como interpretar seu próprio genoma com PythonMarcel Caraciolo
 
Joblib: Lightweight pipelining for parallel jobs (v2)
Joblib:  Lightweight pipelining for parallel jobs (v2)Joblib:  Lightweight pipelining for parallel jobs (v2)
Joblib: Lightweight pipelining for parallel jobs (v2)Marcel Caraciolo
 
Construindo softwares de bioinformática para análises clínicas : Desafios e...
Construindo softwares  de bioinformática  para análises clínicas : Desafios e...Construindo softwares  de bioinformática  para análises clínicas : Desafios e...
Construindo softwares de bioinformática para análises clínicas : Desafios e...Marcel Caraciolo
 
Como Python ajudou a automatizar o nosso laboratório v.2
Como Python ajudou a automatizar o nosso laboratório v.2Como Python ajudou a automatizar o nosso laboratório v.2
Como Python ajudou a automatizar o nosso laboratório v.2Marcel Caraciolo
 
Como Python pode ajudar na automação do seu laboratório
Como Python pode ajudar na automação do  seu laboratórioComo Python pode ajudar na automação do  seu laboratório
Como Python pode ajudar na automação do seu laboratórioMarcel Caraciolo
 
Python on Science ? Yes, We can.
Python on Science ?   Yes, We can.Python on Science ?   Yes, We can.
Python on Science ? Yes, We can.Marcel Caraciolo
 
Oficina Python: Hackeando a Web com Python 3
Oficina Python: Hackeando a Web com Python 3Oficina Python: Hackeando a Web com Python 3
Oficina Python: Hackeando a Web com Python 3Marcel Caraciolo
 
Opensource - Como começar e dá dinheiro ?
Opensource - Como começar e dá dinheiro ?Opensource - Como começar e dá dinheiro ?
Opensource - Como começar e dá dinheiro ?Marcel Caraciolo
 
Benchy, python framework for performance benchmarking of Python Scripts
Benchy, python framework for performance benchmarking  of Python ScriptsBenchy, python framework for performance benchmarking  of Python Scripts
Benchy, python framework for performance benchmarking of Python ScriptsMarcel Caraciolo
 
Python e 10 motivos por que devo conhece-la ?
Python e 10 motivos por que devo conhece-la ?Python e 10 motivos por que devo conhece-la ?
Python e 10 motivos por que devo conhece-la ?Marcel Caraciolo
 
Construindo Sistemas de Recomendação com Python
Construindo Sistemas de Recomendação com PythonConstruindo Sistemas de Recomendação com Python
Construindo Sistemas de Recomendação com PythonMarcel Caraciolo
 
Python, A pílula Azul da programação
Python, A pílula Azul da programaçãoPython, A pílula Azul da programação
Python, A pílula Azul da programaçãoMarcel Caraciolo
 
Novas Tendências para a Educação a Distância: Como reinventar a educação ?
Novas Tendências para a Educação a Distância: Como reinventar a educação ?Novas Tendências para a Educação a Distância: Como reinventar a educação ?
Novas Tendências para a Educação a Distância: Como reinventar a educação ?Marcel Caraciolo
 
Aula WebCrawlers com Regex - PyCursos
Aula WebCrawlers com Regex - PyCursosAula WebCrawlers com Regex - PyCursos
Aula WebCrawlers com Regex - PyCursosMarcel Caraciolo
 
Arquivos Zip com Python - Aula PyCursos
Arquivos Zip com Python - Aula PyCursosArquivos Zip com Python - Aula PyCursos
Arquivos Zip com Python - Aula PyCursosMarcel Caraciolo
 
PyFoursquare: Python Library for Foursquare
PyFoursquare: Python Library for FoursquarePyFoursquare: Python Library for Foursquare
PyFoursquare: Python Library for FoursquareMarcel Caraciolo
 
Sistemas de Recomendação: Como funciona e Onde Se aplica?
Sistemas de Recomendação: Como funciona e Onde Se aplica?Sistemas de Recomendação: Como funciona e Onde Se aplica?
Sistemas de Recomendação: Como funciona e Onde Se aplica?Marcel Caraciolo
 
Recomendação de Conteúdo para Redes Sociais Educativas
Recomendação de Conteúdo para Redes Sociais EducativasRecomendação de Conteúdo para Redes Sociais Educativas
Recomendação de Conteúdo para Redes Sociais EducativasMarcel Caraciolo
 
Content Recommendation Based on Data Mining in Adaptive Social Networks
Content Recommendation Based on Data Mining  in Adaptive Social NetworksContent Recommendation Based on Data Mining  in Adaptive Social Networks
Content Recommendation Based on Data Mining in Adaptive Social NetworksMarcel Caraciolo
 
Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Marcel Caraciolo
 

Mais de Marcel Caraciolo (20)

Como interpretar seu próprio genoma com Python
Como interpretar seu próprio genoma com PythonComo interpretar seu próprio genoma com Python
Como interpretar seu próprio genoma com Python
 
Joblib: Lightweight pipelining for parallel jobs (v2)
Joblib:  Lightweight pipelining for parallel jobs (v2)Joblib:  Lightweight pipelining for parallel jobs (v2)
Joblib: Lightweight pipelining for parallel jobs (v2)
 
Construindo softwares de bioinformática para análises clínicas : Desafios e...
Construindo softwares  de bioinformática  para análises clínicas : Desafios e...Construindo softwares  de bioinformática  para análises clínicas : Desafios e...
Construindo softwares de bioinformática para análises clínicas : Desafios e...
 
Como Python ajudou a automatizar o nosso laboratório v.2
Como Python ajudou a automatizar o nosso laboratório v.2Como Python ajudou a automatizar o nosso laboratório v.2
Como Python ajudou a automatizar o nosso laboratório v.2
 
Como Python pode ajudar na automação do seu laboratório
Como Python pode ajudar na automação do  seu laboratórioComo Python pode ajudar na automação do  seu laboratório
Como Python pode ajudar na automação do seu laboratório
 
Python on Science ? Yes, We can.
Python on Science ?   Yes, We can.Python on Science ?   Yes, We can.
Python on Science ? Yes, We can.
 
Oficina Python: Hackeando a Web com Python 3
Oficina Python: Hackeando a Web com Python 3Oficina Python: Hackeando a Web com Python 3
Oficina Python: Hackeando a Web com Python 3
 
Opensource - Como começar e dá dinheiro ?
Opensource - Como começar e dá dinheiro ?Opensource - Como começar e dá dinheiro ?
Opensource - Como começar e dá dinheiro ?
 
Benchy, python framework for performance benchmarking of Python Scripts
Benchy, python framework for performance benchmarking  of Python ScriptsBenchy, python framework for performance benchmarking  of Python Scripts
Benchy, python framework for performance benchmarking of Python Scripts
 
Python e 10 motivos por que devo conhece-la ?
Python e 10 motivos por que devo conhece-la ?Python e 10 motivos por que devo conhece-la ?
Python e 10 motivos por que devo conhece-la ?
 
Construindo Sistemas de Recomendação com Python
Construindo Sistemas de Recomendação com PythonConstruindo Sistemas de Recomendação com Python
Construindo Sistemas de Recomendação com Python
 
Python, A pílula Azul da programação
Python, A pílula Azul da programaçãoPython, A pílula Azul da programação
Python, A pílula Azul da programação
 
Novas Tendências para a Educação a Distância: Como reinventar a educação ?
Novas Tendências para a Educação a Distância: Como reinventar a educação ?Novas Tendências para a Educação a Distância: Como reinventar a educação ?
Novas Tendências para a Educação a Distância: Como reinventar a educação ?
 
Aula WebCrawlers com Regex - PyCursos
Aula WebCrawlers com Regex - PyCursosAula WebCrawlers com Regex - PyCursos
Aula WebCrawlers com Regex - PyCursos
 
Arquivos Zip com Python - Aula PyCursos
Arquivos Zip com Python - Aula PyCursosArquivos Zip com Python - Aula PyCursos
Arquivos Zip com Python - Aula PyCursos
 
PyFoursquare: Python Library for Foursquare
PyFoursquare: Python Library for FoursquarePyFoursquare: Python Library for Foursquare
PyFoursquare: Python Library for Foursquare
 
Sistemas de Recomendação: Como funciona e Onde Se aplica?
Sistemas de Recomendação: Como funciona e Onde Se aplica?Sistemas de Recomendação: Como funciona e Onde Se aplica?
Sistemas de Recomendação: Como funciona e Onde Se aplica?
 
Recomendação de Conteúdo para Redes Sociais Educativas
Recomendação de Conteúdo para Redes Sociais EducativasRecomendação de Conteúdo para Redes Sociais Educativas
Recomendação de Conteúdo para Redes Sociais Educativas
 
Content Recommendation Based on Data Mining in Adaptive Social Networks
Content Recommendation Based on Data Mining  in Adaptive Social NetworksContent Recommendation Based on Data Mining  in Adaptive Social Networks
Content Recommendation Based on Data Mining in Adaptive Social Networks
 
Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems
 

Último

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 

Último (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 

Big Data com Python

  • 1. BigData with Python A gentle and simple introduction Marcel Caraciolo @marcelcaraciolo Developer, Cientist, contributor to the Crab recsys project, works with Python for 6 years, interested at mobile, education, machine learning and dataaaaa! Recife, Brazil - http://aimotion.blogspot.com
  • 2. https://class.coursera.org/datasci-001/ Disclaimer Alguns slides foram retirados das apts do curso de Bill Howe - Introduction to Data Science
  • 3. About me Co-founder of Crab - Python recsys library Cientist Chief at Atepassar, e-learning social network Co-Founder and Instructor of PyCursos, teaching Python on-line Co-Founder of Pingmind, on-line infrastructure for MOOC’s Interested at Python, mobile, e-learning and machine learning!
  • 5. Big Data “Big Data is any data that is expensive to manage and hard to extract value from.” Michael Franklin Thomas M. Siebel Professor of Computer Science Director of the Algorithms, Machines and People Lab University of Berkeley
  • 6. Big Data Erik Larson, 1989, Harper’s magazine “The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.”
  • 8. Challenges Volume the size of the data Velocity the latency of data processing relative to the growing demand for interactivity Variety the diversity of sources, formats, quality, structures
  • 10. Where does big data comes from ? “data exhaust” from customers new and pervasive sensors the ability to “keep everything”
  • 12. Where does big data comes from ? photo in public domain
  • 13. Where does big data comes from ? 4/28/13 Bill Howe, UW 11
  • 18. What does Scalable mean ? Operationally: In the past: Works even if data doesn’t fit in main memory Now: Can make use of 1000s of cheap computers Algorithmically: In the past: If you have N data items, you must do no more than Nm operations -- polynomial time algorithms Now: If you have N data items, you must do no more than Nm/k operations, for some large k -- “polynomial time algorithms must be parallelized” Soon: If you have N data items, you should do no more than N * log(N) operations
  • 19. Example Given a set of DNA sequences, find all sequences equal to GATTACGATATTA .
  • 22. Time = 1 CCCCCAATGAC = GATTACGATATTA? GATTACGATATTA time = 1 No.
  • 23. Time = 17 GATTACGATATTA contains GATTACGATATTA? GATTACGATATTA time = 17 Yes! Send it to the output.
  • 24. GATTACGATATTA 40 records, 40 comparions N records, N comparisons The algorithmic complexity is order N: O(N)
  • 25. GATTACGATATTA AAAATCCTGCA AAACGCCTGCA TTTACGTCAA TTTTCGTAATT What if we sort the sequences? What if we sort the sequences ?
  • 26. CTGTACACAACCT < GATTACGATATTA GATTACGATATTA No match. Skip to 75% mark Start at the 50% mark time = 0 0% 100% CTGTACACAACCT Time = 0
  • 27. GGATACACATTTA > GATTACGATATTA GATTACGATATTA No match. 0% 100% GGATACACATTTA
  • 28. GATATTTTAAGC < GATTACGATATTA GATTACGATATTA No match. Skip back to 68.75% mark 0% 100% GATATTTTAAGC
  • 29. GATTACGATATTA = GATTACGATATTA GATTACGATATTA 0% 100% Match! Walk through the records until we fail to match.
  • 30. GATTACGATATTA 0% 100% How many comparisons did we do? 40 records, only 4 comparisons N records, log(N) comparisons This algorithm is O(log(N)) Far better scalability
  • 31. Relational Databases Relational Databases •  Databases are good at Needle in Haystack problems: –  Extracting small results from big datasets –  Transparently provide old style scalability –  Your query will always* finish, regardless of dataset size. –  Indexes are easily built and automatically used when appropriate CREATE INDEX seq_idx ON sequence(seq); SELECT seq FROM sequence WHERE seq = GATTACGATATTA ;
  • 32. Example New task: Read Trimming Given a set of DNA sequences, trim the final n bps of each sequence Generate a new dataset
  • 35. Time = 1 CCCCCAATGAC becomes CCCCC time = 1
  • 36. Time = 17 GATTACGATATTA becomes GATTA time = 17
  • 37. Can we use an index? No. We have to touch every record no matter what. The task is fundamentally O(N) Can we do any better?
  • 38.
  • 43. Time = 7 time = 7 How much time did this take? 7 cycles 40 records, 6 workers O(N/k) time = 7 How much time did this ta 7 cycles 40 records, 6 workers O(N/k)
  • 44. Schematic of a Parallel “Read Trimming” Task You are given short “reads”: genomic sequences about 35-75 characters each Distribute the reads among k computers f f f f f f f is a function to trim a read; apply it to every item Now we have a big distributed set of trimmed reads Schematic of a Parallel “Read Trimming” Task You are given short “reads”: genomic sequences about 35-75 characters each Distribute the reads among k computers f f f f f f f is a function to trim a read; apply it to every item Now we have a big distributed set of trimmed reads Schematic of a Parallel “Read Trimming” Task You are given short “reads”: genomic sequences about 35-75 characters each Distribute the reads among k computers f f f f f f f is a function to trim a read; apply it to every item Now we have a big distributed set of trimmed reads Schematic of a Parallel “Read Trimming” Task You are given short “reads”: genomic sequences about 35-75 characters each Distribute the reads among k computers f f f f f f f is a function to trim a read; apply it to every item Now we have a big distributed set of trimmed reads
  • 45. You are given TIFF images Distribute the images among k computers f f f f f f f is a function to convert TIFF to PNG; apply it to every item Now we have a big distributed set of converted images New Task: Convert 405k TIFF images to PNG http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/ Convert 405k TIFF images to PNG
  • 46. You have sets of parameters for thousands of small simulations Divide the parameter sets among k computers f f f f f f f is runs the simulation and produces some output; apply it to every item Now we have a big distributed set of simulation results New Task: Run thousands of simulations http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud Run thousands of simulations
  • 47. You have millions of documents Distribute the documents among k computers f f f f f f f finds the most common word in a single document Now we have a big distributed list of (doc_id, word) pairs Find the most common word in each document Find the most common word in each document
  • 48. Consider a slightly more general program to compute the word frequency of every word in a singe documentConsider a slightly more general program to compute the word frequency of every word in a single document Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. (people, 2) (government, 6) (assume, 1) (history, 2) …
  • 49. You have millions of documents Distribute the documents among k computers f f f f f f For each document f returns a set of (word, freq) pairs Compute the word frequency of 5M documents Now we have a big distributed list of sets of word freqs.
  • 50. There’s a pattern... There’s a pattern here…. •  A function that maps a read to a trimmed read •  A function that maps a TIFF image to a PNG image •  A function that maps a set of parameters to a simulation result •  A function that maps a document to its most common word •  A function that maps a document to a histogram of word frequencies
  • 51. What if we want to compute the word frequency across all documents? US Constitution Declaration of Independence Articles of Confederation (people, 78) (government, 123) (assume, 23) (history, 38) …
  • 52. You have millions of documents Distribute the documents among k computers map map map map map map For each document, return a set of (word, freq) pairs Compute the word frequency across 5M documents Now what? •  But we don’t want a bunch of little histograms – we want one big histogram. •  How can we make sure that a single computer has access to every occurrence of a given word regardless of which document it appeared in? Compute the word frequency across 5M docs
  • 53. Compute the word frequency across 5M docs Distribute the documents among k computers map map map map map map For each document, return a set of (word, freq) pairs Now we have a big distributed list of sets of word freqs. reduce reduce reduce reduce Now just count the occurrences of each word 44 3 We have our distributed histogram Compute the word frequency across 5M documents
  • 54. Compute the word frequency across 5M docs 5/15/13 Bill Howe, eScience Institute 11 Some distributed algorithm… Map (Shuffle) Reduce
  • 55. Compute the word frequency across 5M docs 5/15/13 Bill Howe, eScience Institute 12 MapReduce Programming Model •  Input & Output: each a set of key/value pairs •  Programmer specifies two functions: –  Processes input key/value pair –  Produces set of intermediate pairs –  Combines all intermediate values for a particular key –  Produces a set of merged output values (usually just one) map (in_key, in_value) -> list(out_key, intermediate_value) reduce (out_key, list(intermediate_value)) -> list(out_value) Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell slide source: Google, Inc.
  • 56. Compute the word frequency across 5M docs 5/15/13 Bill Howe, eScience Institute 13 Example: What does this do? map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, 1); reduce(String output_key, Iterator intermediate_values): // output_key: word // output_values: ???? int result = 0; for each v in intermediate_values: result += v; Emit(result); slide source: Google, Inc.
  • 57. Example 3/12/09 Bill Howe, eScience Institute 14 Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Example: Document Processing
  • 58. Word length histogram example 3/12/09 Bill Howe, eScience Institute 14 Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Example: Document Processing 3/12/09 Bill Howe, eScience Institute 15 Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Example: Word length histogram How many big , medium , and small words are used?
  • 59. Word length histogram example Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Big = Yellow = 10+ letters Medium = Red = 5..9 letters Small = Blue = 2..4 letters Tiny = Pink = 1 letter Example: Word length histogram
  • 60. Word length histogram example Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Split the document into chunks and process each chunk on a different computer Chunk 1 Chunk 2 Example: Word length histogram
  • 61. Word length histogram example (yellow, 20) (red, 71) (blue, 93) (pink, 6 ) Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Map Task 1 (204 words) Map Task 2 (190 words) (key, value) (yellow, 17) (red, 77) (blue, 107) (pink, 3) Example: Word length histogram
  • 62. Word length histogram example 3/12/09 Bill Howe, eScience Institute 19 (yellow, 17) (red, 77) (blue, 107) (pink, 3) (yellow, 20) (red, 71) (blue, 93) (pink, 6 ) Reduce tasks (yellow, 17) (yellow, 20) (red, 77) (red, 71) (blue, 93) (blue, 107) (pink, 6) (pink, 3) Example: Word length histogram A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Map task 1 Map task 2 Shuffle step (yellow, 37) (red, 148) (blue, 200) (pink, 9)
  • 63. MR Phases !"#$%&'(")$*+&,& MR Phases •  Each Map and Reduce task has multiple phases:
  • 64. MR Phases MAP REDUCE (w1,1) (w2,1) (w1,1) … (w1,1) (w2,1) … (did1,v1) (did2,v2) (did3,v3) . . . . (w1, (1,1,1,…,1)) (w2, (1,1,…)) (w3,(1…)) … … … … (w1, 25) (w2, 77) (w3, 12) … … … … Shuffle !"#$%&'()%"**$"(+%,&-.$/%% 012%3',%45+,%+$3)%6&78%9:;%
  • 65. MR Phases Hadoop in One Slide src: Huy Vo, NYU Poly
  • 66. Taxonomy of Parallel Architectures 5/15/13 Bill Howe, eScience Institute 22 Taxonomy of Parallel Architectures Easiest to program, but $$$$ Scales to 1000s of computers
  • 69. 5/15/13 Bill Howe, eScience Institute 26 Large-Scale Data Processing •  Many tasks process big data, produce big data •  Want to use hundreds or thousands of CPUs –  ... but this needs to be easy –  Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes. •  MapReduce is a lightweight framework, providing: –  Automatic parallelization and distribution –  Fault-tolerance –  I/O scheduling –  Status and monitoring
  • 70. 5/15/13 Bill Howe, eScience Institute 35 MapReduce Contemporaries •  Dryad (Microsoft) –  Relational Algebra •  Pig (Yahoo) –  Near Relational Algebra over MapReduce •  HIVE (Facebook) –  SQL over MapReduce •  Cascading –  Relational Algebra •  Clustera –  U of Wisconsin •  Hbase –  Indexing on HDFS MapReduce Contemporaries
  • 71. Talk is cheap, Show me the code
  • 72. Examples More Examples: Build an Inverted Index 5/15/13 Bill Howe, UW 14 Input: tweet1, (“I love pancakes for breakfast”) tweet2, (“I dislike pancakes”) tweet3, (“What should I eat for breakfast?”) tweet4, (“I love to eat”) Desired output: “pancakes”, (tweet1, tweet2) “breakfast”, (tweet1, tweet3) “eat”, (tweet3, tweet4) “love”, (tweet1, tweet4) …
  • 73. Examples More Examples: Relational Join 5/15/13 Bill Howe, UW 15 Name SSN Sue 999999999 Tony 777777777 EmpSSN DepName 999999999 Accounts 777777777 Sales 777777777 Marketing Emplyee ⨝ Assigned Departments Employee Assigned Departments Name SSN EmpSSN DepName Sue 999999999 999999999 Accounts Tony 777777777 777777777 Sales Tony 777777777 777777777 Marketing
  • 74. Examples Relational Join in MapReduce: Before Map Phase 5/15/13 Bill Howe, UW 16 Name SSN Sue 999999999 Tony 777777777 EmpSSN DepName 999999999 Accounts 777777777 Sales 777777777 Marketing Employee Assigned Departments Key idea: Lump all the tuples together into one dataset Employee, Sue, 999999999 Employee, Tony, 777777777 Department, 999999999, Accounts Department, 777777777, Sales Department, 777777777, Marketing What is this for?
  • 75. Examples Relational Join in MapReduce: Map Phase 5/15/13 Bill Howe, UW 17 Employee, Sue, 999999999 Employee, Tony, 777777777 Department, 999999999, Accounts Department, 777777777, Sales Department, 777777777, Marketing key=999999999, value=(Employee, Sue, 999999999) key=777777777, value=(Employee, Tony, 777777777) key=999999999, value=(Department, 999999999, Accounts) key=777777777, value=(Department, 777777777, Sales) key=777777777, value=(Department, 777777777, Marketing) why do we use this as the key?
  • 76. Examples Relational Join in MapReduce: Reduce Phase 5/15/13 Bill Howe, UW 18 key=999999999, values=[(Employee, Sue, 999999999), (Department, 999999999, Accounts)] key=777777777, values=[(Employee, Tony, 777777777), (Department, 777777777, Sales), (Department, 777777777, Marketing)] Sue, 999999999, 999999999, Accounts Tony, 777777777, 777777777, Sales Tony, 777777777, 777777777, Marketing
  • 77. Examples Relational Join in MapReduce, again 5/15/13 Bill Howe, Data Science, Autumn 2012 19 Order(orderid, account, date) 1, aaa, d1 2, aaa, d2 3, bbb, d3 LineItem(orderid, itemid, qty) 1, 10, 1 1, 20, 3 2, 10, 5 2, 50, 100 3, 20, 1tagged with relation name Order 1, aaa, d1 ! 1 : “Order”, (1,aaa,d1) 2, aaa, d2 ! 2 : “Order”, (2,aaa,d2) 3, bbb, d3 ! 3 : “Order”, (3,bbb,d3) Line 1, 10, 1 ! 1 : “Line”, (1, 10, 1) 1, 20, 3 ! 1 : “Line”, (1, 20, 3) 2, 10, 5 ! 2 : “Line”, (2, 10, 5) 2, 50, 100 ! 2 : “Line”, (2, 50, 100) 3, 20, 1 ! 3 : “Line”, (3, 20, 1) Map Reducer for key 1 “Order”, (1,aaa,d1) “Line”, (1, 10, 1) “Line”, (1, 20, 3) (1, aaa, d1, 1, 10, 1) (1, aaa, d1, 1, 20, 3)
  • 78. ExamplesSimple Social Network Analysis: Count Friends Jim, Sue Sue, Jim Lin, Joe Joe, Lin Jim, Kai Kai, Jim Jim, Lin Lin, Jim Input Desired Output Jim, 3 Lin, 2 Sue, 1 Kai, 1 Joe, 1 Jim, 1 Sue, 1 Lin, 1 Joe, 1 Jim, 1 Kai, 1 Jim, 1 Lin, 1 MAP REDUCE Jim, (1, 1, 1) Lin, (1, 1) Sue, (1) Kai, (1) Joe, (1) SHUFFLE
  • 79. Examples Matrix Multiplication !" #" $" %&" '" &" %#" !" !" %&" $" #" %#" %&" (" $" X = !" %)" &#" $"
  • 80. Examples Matrix Multiply in MapReduce C = A X B A has dimensions L,M B has dimensions M,N •  In the map phase: –  for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N –  for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L •  In the reduce phase, emit –  key = (i,k) –  value = Sumj (A[i,j] * B[j,k]) 5/15/13 Bill Howe, Data Science, Autumn 2012 21
  • 81. Examples AB X A B = •  One reducer per output cell
  • 83. Cluster Computing •  Large number of commodity servers, connected by high speed, commodity network •  Rack: holds a small number of servers •  Data center: holds many racks !"#$%&'(#))%'&*"&+!" #$%&'()*+(",-(+./0"123145"67$%8('"1",-(+./09":;1;<"/0=>4"" /?"#@0@0A"/?"#$99@B("C$8$9(89;"D>"E$F$'$G$0"$0)"H==G$0"" ),,(-.,(/0123,(456,7852(9,:3;-,(
  • 84. Cluster Computing •  Massive parallelism: – 100s, or 1000s, or 10000s servers – Many hours •  Failure: – If medium-time-between-failure is 1 year – Then 10000 servers have one failure / hour 24slide src: Dan Suciu and Magda Balazinska
  • 85. Distributed File System (DFS) •  For very large files: TBs, PBs •  Each file is partitioned into chunks, typically 64MB •  Each chunk is replicated several times (!3), on different racks, for fault tolerance •  Implementations: – Google’s DFS: GFS, proprietary – Hadoop’s DFS: HDFS, open source 25slide src: Dan Suciu and Magda Balazinska
  • 98. Exemplo de MapReduce from mrjob.job import MRJob import re WORD_RE = re.compile(r"[w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount().run()
  • 99. Projeto MrJob Criado pela Equipe de Engenharia doYelp Totalmente Open-Source Todo em Python Utiliza Map-Reduce para Processamento Permite rodar tanto no Amazon EMR como no Hadoop
  • 100. Objetivos do MrJobs Se você quer aprender MapReduce, ele é para você Se você tem um problema cavalar e precisa de muito processamento e não está afim de mexer em Hadoop Se você já tem um cluster Hadoop e quer rodar scripts Python Se você quer migrar seu código Python do Hadoop para o EMR Se você não quer escrever Python (Impossível!), não é para você!
  • 102. Vamos a uma demo... Texto da posse de Obama em 2009.
  • 105. Distributed Computing with mrJob https://github.com/Yelp/mrjob Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
  • 106. Distributed Computing with mrJob https://github.com/Yelp/mrjob Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
  • 109. Atepassar Recommendations http://atepassar.com marcel;jonas,maria,jose,amanda maria;carol,fabiola,amanda,marcel amanda;paula,patricia,maria,marcel carol;maria,jose,patricia fabiola;maria paula;fabio,amanda patricia;amanda,carol jose;marcel,carol jonas;marcel,fabio fabio;jonas,paula carla "marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]] "maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]] "patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]] "paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]] "amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]] "carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]] "fabio" [["amanda", 1], ["marcel", 1]] "fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]] "jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]] "jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]] $python friends_recommender.py - r emr --num-ec2-instances 5 facebook_data.csv > output.dat
  • 110. Atepassar Recommendations http://atepassar.com marcel;jonas,maria,jose,amanda maria;carol,fabiola,amanda,marcel amanda;paula,patricia,maria,marcel carol;maria,jose,patricia fabiola;maria paula;fabio,amanda patricia;amanda,carol jose;marcel,carol jonas;marcel,fabio fabio;jonas,paula carla "marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]] "maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]] "patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]] "paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]] "amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]] "carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]] "fabio" [["amanda", 1], ["marcel", 1]] "fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]] "jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]] "jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]] $python friends_recommender.py - r emr --num-ec2-instances 5 facebook_data.csv > output.dat marcel;jonas,maria,jose,amanda carol;maria,jose,patricia fabio;jonas,paula ...
  • 111. Atepassar Recommendations http://atepassar.com fsim(u,i) = w1 f1(u,i) + w2 f2(u,i) + b, https://gist.github.com/3970945#file_atepassar_recommender.py
  • 112. Atepassar Recommendations http://atepassar.com Celery - para agendamento dos jobs coletores e executores. mrJob - para mapreduce e acesso ao Hadoop MongoDb - para armazenamento das recomendações Boto - acesso aos files do S3.
  • 114. Projetos interessantes Estrutura de dados para manipulação rápida - slicing - indexing - subseting Handling missing data Agregações, Séries Temporais Pandas: a data analysis library for Python, poised to give R a run for its money… http://pandas.pydata.org/
  • 115. Projetos interessantes Outro framework para computação distribuída com Python com MapReduce. Criado pelo Instituto Nokia. Backend dele é escrito em Erlang (funcional, concorrente e bem escalável!) Não utiliza o FileSystem mais usado HDFS e sim um novo padrão por eles (DDFS). http://discoproject.org/
  • 116. Projetos interessantes http://api.mongodb.org/python/2.0/examples/ map_reduce.html Python & MongoDb MongoDb - Banco de Dados Não relacional (NoSQL) Possui suporte nativo built-in para fazer MapReduce. Escrever o código em JS e não é muito legível e fica preso ao Mongo ... >>> reduce = Code("function (key, values) {" ... " var total = 0;" ... " for (var i = 0; i < values.length; i++) {" ... " total += values[i];" ... " }" ... " return total;" ... "}")
  • 117. Projetos interessantes https://github.com/klbostee/dumbo/wiki/Short-tutorial Dumbo Uma das primeiras bibliotecas em cima do MapReduce e Python. Complicado para começar e está desatualizada :(
  • 118. Projetos interessantes Um wrapper em Python em cima do Hadoop para computação distribuída. http://pydoop.sourceforge.net/docs/index.html Legal, mas dá um trabalho para configurar.
  • 119. Projetos interessantes Mapreduce com Python na Google AppEngine https://developers.google.com/appengine/docs/python/ dataprocessing/ Ainda experimental e fica “preso” à plataforma AppEngine.
  • 120. Projetos interessantes Algoritmos de aprendizagem de máquina Supervisionados & Não supervisionados Pré-processamento, extração de dados Avaliação de classificadores, Pipeline, seleção de atributos. http://scikit-learn.org/stable/
  • 121. Projetos interessantes Processamento de linguagem natural Várias ferramentas para tokenização, pos tagging, named entity recognition, classificadores, etc. Vários corpus disponíveis! http://nltk.org/
  • 122. Projetos interessantes Processamento de linguagem natural Várias ferramentas para tokenização, pos tagging, named entity recognition, classificadores, etc. Vários corpus disponíveis! http://nltk.org/ Fiquem de olho... Pipeline for distributed Natural Language Processing, made in Python https://github.com/NAMD/pypln
  • 123. Projetos interessantes Our Project’s Home Page http://muricoca.github.com/crab
  • 124. Future Releases Planned Release 0.13 New home for python-recsys: https://github.com/python-recsys/crab Planned Release 0.14 Support to Item-Based Recommenders using MapReduce with MrJob New commiters: vinnigracindo, ocelma, fcurella
  • 125. Join us! 1. Read our Wiki Page https://github.com/muricoca/crab/wiki/Developer-Resources 2. Check out our current sprints and open issues https://github.com/muricoca/crab/issues 3. Forks, Pull Requests mandatory 4. Join us at irc.freenode.net #muricoca or at our discussion list http://groups.google.com/group/scikit-crab
  • 132.
  • 133. BigData with Python A gentle and simple introduction Marcel Caraciolo @marcelcaraciolo Developer, Cientist, contributor to the Crab recsys project, works with Python for 6 years, interested at mobile, education, machine learning and dataaaaa! Recife, Brazil - http://aimotion.blogspot.com