Driscoll bi sig_15_jun2010

•Transferir como PPTX, PDF•

1 gostou•1,503 visualizações

Dataspora

Tecnologia

WHAT IS BIG DATA? Data that is distributed.

“The sexy job in the next ten years will be statisticians…” - Hal Varian = +

1. CHOOSE THE RIGHT TOOL You don’t need a chainsaw to cut butter.

2. COMPRESS EVERYTHING mysqldump -u myuser -p mypasssourceDB | br />gzip | ssh mike@dataspora.com "cat - | br />gunzip | mysql -u myuser -p mypasstargetDB" The world is IO-bound.

3. SPLIT UP YOUR DATA Split, apply, combine.

4. WORK WITH SAMPLES perl -ne "print if (rand() < 0.01)" data.csv > sample.csv Big Data is heavy, samples are light.

COPY FROM OTHERS git clone git://github.com/kevinweil/hadoop-lzo Use open source.

7. ESCHEW CHART TYPOLOGIES Charts are compositions, not containers.

8. COLORWITH CARE Color can enhance or insult.

WHY DO TELCO CUSTOMERS LEAVE? Sign up Leave Goal: “less churn.”

DATA: BILLIONS OF CALLS … and millions of callers.

DOES CALL QUALITY MATTER? … a difference, but not significant.

BUILD THE CALL GRAPH … but is it predictive?

700% INCREASE IN CHURN when a cancellation occurs in a call network.

THE BIG DATA STACK Actions Data Products (Content Filters, Rec Engines) Analytics (R, SPSS, SAS, SAP) Insights Big Data Dedicated RDBMS Data

THANKS! QUESTIONS? Michael Driscoll med@dataspora.com @dataspora on Twitter http://www.dataspora.com/blog SDForum BI SIG June 15, 2010

Mais conteúdo relacionado

Semelhante a Driscoll bi sig_15_jun2010

Big Data Platform Landscape by 2017Donghui Zhang

Suicide Risk Prediction Using Social Media and CassandraKen Krugler

C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by K...DataStax Academy

Sql saturday el salvador 2016 - Me, A Data Scientist?Fabricio Quintanilla

Big Data, Big OpportunitiesArimo, Inc.

Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...Matt Stubbs

Horizon 20110928Mike Miller

Literacy in the Age of Big DataCentre for Advanced Management Education

Biq query devfest2017_slidesgetdinesh

Big dataFACTS Computer Software L.L.C

Big Data By Vijay Bhaskar SemwalIIIT Allahabad

Data-Ed Webinar: Data Modeling FundamentalsDATAVERSITY

Multiplatform solution for graph datasourcesJavier Domínguez Montes

Data Driven Economy @CMUKomes Chandavimol

Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit

BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...Alex Liu

Satyam open analytics nycOpen Analytics

Speeding Up Data Science: From a Data Management PerspectiveJiannan Wang

Big data, why careDaan Gerits

Interview questions on Apache spark [part 2]knowbigdata

Semelhante a Driscoll bi sig_15_jun2010 (20)

Big Data Platform Landscape by 2017

Suicide Risk Prediction Using Social Media and Cassandra

C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by K...

Sql saturday el salvador 2016 - Me, A Data Scientist?

Big Data, Big Opportunities

Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...

Horizon 20110928

Literacy in the Age of Big Data

Biq query devfest2017_slides

Big data

Big Data By Vijay Bhaskar Semwal

Data-Ed Webinar: Data Modeling Fundamentals

Multiplatform solution for graph datasources

Data Driven Economy @CMU

Data infrastructure architecture for medium size organization: tips for colle...

BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...

Satyam open analytics nyc

Speeding Up Data Science: From a Data Management Perspective

Big data, why care

Interview questions on Apache spark [part 2]

Último

08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls

Scaling API-first – The story of a global engineering organizationRadu Cotescu

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal

Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK

Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun

[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745

How to convert PDF to text with Nanonetsnaman860154

Histor y of HAM Radio presentation slidevu2urc

08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls

IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge

Slack Application Development 101 Slidespraypatel2

Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays

Salesforce Community Group Quito, Salesforce 101Paola De la Torre

From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software

The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad

Automating Google Workspace (GWS) & more with Apps Scriptwesley chun

Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge

Finology Group – Insurtech Innovation Award 2024The Digital Insurer

Driscoll bi sig_15_jun2010

1. WINNING WITH BIG DATA Secrets of the Successful Data Scientist SDForum BI SIG June 15, 2010 Michael Driscoll @dataspora

2. WHY DATA MATTERS NOW

3. THE INDUSTRIAL AGE OF DATA

4. WHAT IS BIG DATA? Data that is distributed.

5. WHAT IS DATA SCIENCE?

6. WHY DATA SCIENCE IS SEXY

7. “The sexy job in the next ten years will be statisticians…” - Hal Varian = +

9. data model 1000 bytes 2 bytes

10. 9 WAYS TO WIN WITH DATA

11. 1. CHOOSE THE RIGHT TOOL You don’t need a chainsaw to cut butter.

12. 2. COMPRESS EVERYTHING mysqldump -u myuser -p mypasssourceDB | br />gzip | ssh mike@dataspora.com "cat - | br />gunzip | mysql -u myuser -p mypasstargetDB" The world is IO-bound.

13. 3. SPLIT UP YOUR DATA Split, apply, combine.

14. 4. WORK WITH SAMPLES perl -ne "print if (rand() < 0.01)" data.csv > sample.csv Big Data is heavy, samples are light.

15. 5. USE STATISTICS

16. COPY FROM OTHERS git clone git://github.com/kevinweil/hadoop-lzo Use open source.

17. 7. ESCHEW CHART TYPOLOGIES Charts are compositions, not containers.

18. 8. COLORWITH CARE Color can enhance or insult.

19. 9. TELL A STORY People are listening.

20. ONE SUCCESS STORY

21. WHY DO TELCO CUSTOMERS LEAVE? Sign up Leave Goal: “less churn.”

22. DATA: BILLIONS OF CALLS … and millions of callers.

23. DOES CALL QUALITY MATTER? … a difference, but not significant.

24. WHAT ABOUT SOCIAL NETWORKS? Hmmm...

25. BUILD THE CALL GRAPH … but is it predictive?

26. EVOLUTION OF A CALL GRAPH April

27. EVOLUTION OF A CALL GRAPH May

28. EVOLUTION OF A CALL GRAPH June

29. EVOLUTION OF A CALL GRAPH July

30. 700% INCREASE IN CHURN when a cancellation occurs in a call network.

31. FINAL THOUGHTS

32. THE BIG DATA STACK Actions Data Products (Content Filters, Rec Engines) Analytics (R, SPSS, SAS, SAP) Insights Big Data Dedicated RDBMS Data

33. THANKS! QUESTIONS? Michael Driscoll med@dataspora.com @dataspora on Twitter http://www.dataspora.com/blog SDForum BI SIG June 15, 2010

Notas do Editor

I’m Mike Driscoll, founder of Dataspora LLC, we’re a boutique analytics firm based in San Francisco.Before coming out to the Bay Area, I worked on the human genome project & got a doctorate in Computational Biology.Today I’m going to talk about Big Data, Data Science, and some tips for the Data Scientist.
If you had to put your finger on the beginning of the information age, it might be the creation of the first telegraph in 1792, in France, by a pair of brothers.The first time that man-made information began at the speed of light, over long distances.Cars, cash registers, subway turnstyles, gene chips, TiVos, and cell phones are streaming billions of data points.We live in a world exploding with data. In any given minute, databases somewhere are tracking mouse clicks on web sites, point of sale purchases, rider swipes through subway turnstyles, physician prescriptions, digital video recorder rewinds, and the location of every GPS-enabled car and phone on the planet.Prof. Joe Hellerstein of Berkeley has dubbed it “The Industrial Revolution of Data” – where machines, not people, are the dominant producers of data.So the world is streaming billions of data points per minute. This is Big Data – capital B, capital D. Ben Lorica of O’Reilly Media has said Big Data is “data that you have to think about” when storing, analyzing or otherwise grappling with it.But capturing data isn’t enough. We need tools to make sense of it.At Facebook, they call their data analysts, ‘data scientists’. I like this term, because it captures the point of collecting this data: testing hypotheses about the world.And to test hypotheses using Big Data, we need statistics.
In this talk I’m also going to be talking about tools for medium data; b/c these translate well into the Big Data space.
In this talk I’m also going to be talking about tools for medium data; b/c these translate well into the Big Data space.I’m defining data Science is: applying tools to data to answer questions. It is at the intersection of these tools. And it is a growing field, because data is getting bigger, and our tools are getting better. (Suffice to say, the questions we ask have been around since time immemorial: whoAnother word for questions is hypotheses.I’ll talk about tools for munging; the answers to these questions are
Do you really need Hadoop for that job? Think twice about it.Can you do everything on one machine?Escalate only as necessary… don’t solve problems that don’t yet exist.At the same time, optimize for scalability, not performance. Cleverness is usually punished in the long run.
Compressing gives you a 6-8x bump immediately in network and disk IO, out of the gate.This example also illustrates another piece: avoid hitting disk at all costs.If you’re working on the cloud,
This is the essence of parallelism, and in fact, of big data: the key is to some independent dimension on which to split your data.Otherwise everything sits together, in a monolithic file system, database, or data store -- which often spells disaster.* Even your data isn’t in a database, split it up the old-fashioned way – one file per hour, day, or month, depending on its size – these often form natural samples to work from.* Learn & understand how to partition, shard, or otherwise distribute your data in a database.* Parallel load is your friend: Several databases have parallel load features; Hadoop has distcp.
do you want to moving GBs and TBs around?sometimes you want to visualize and work on the data locally…so sample!* reservoir sampling is a fixed-memory algorithm for achieving a defined-sized sample* the above illustrates how to get a basic 1% uniform sample method in a perl one-liner
When we compare two real-valued measures, they will almost always be different.The critical question is: How confident are we in the difference? Is it significant?There’s also something to be said for significant but so small in magnitude as to be meaningless.(I once sat through a heart drug presentation, which showed a significant but inconsequential difference versus Aspirin. The price differential was not inconsequential, however).
Don’t reinvent the wheel, steal someone else’s wheels of 1s and 0s.Statistics is hard – so go ahead & use someone else’s stuff. Go ahead. It’s there. Just today I cribbed code from StackOverflow to make a heatmap in R.That what’s great about R. 2000 statistical libraries written by professors.
Not machines, people.
Okay, now I want you to try and forget everything you just heard about base graphics.ggplot2 is a new visualization package formally released in 2009, developed by Professor Hadley Wickham.It is a based a different perspective of developing graphics, and has its own set of functions and parameters.
Most telcos lose 1-2% of their customers every month.It’s 7x more expensive to acquire a customer, than to retain.
Not machines, people.
This illustrates what we said earlier: statistics matters. We needed to rule this out.(If anything the correlation occurs opposite of what we expected).
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
Windowing functions in Greenplum, which is a modified Postgres distributed database.
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
Okay, now I want you to try and forget everything you just heard about base graphics.ggplot2 is a new visualization package formally released in 2009, developed by Professor Hadley Wickham.It is a based a different perspective of developing graphics, and has its own set of functions and parameters.
The stack is loosely coupled: right tool for the right job. No one firm can do it all.- There aren’t – not yet at least – out of the box solutions for getting through this: the data scientists occupy the middle.Big Data is disrupting this entire stack: -- at the bottom, new DB firms like Aster-- in the middle, the same revo-You know who sits on the top of that stack? We do. That’s why storytelling is such an important skill.

Driscoll bi sig_15_jun2010

Recomendados

Recomendados

Mais conteúdo relacionado

Semelhante a Driscoll bi sig_15_jun2010

Semelhante a Driscoll bi sig_15_jun2010 (20)

Último

Último (20)

Driscoll bi sig_15_jun2010

Notas do Editor