Better Together - Using Spark and Redshift to Combine Your Data with Public Datasets

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1bbrPiV.

Eugene Mandel discusses challenges of conforming data sources and compares processing stacks: Hadoop+Redshift vs Spark, showing how the technology drives the way the problem is modeled. Filmed at qconsf.com.

Eugene Mandel is Senior Data Engineer on the Data Science team at Jawbone.

Published in: Technology

  1. BETTER TOGETHER - USING SPARK AND REDSHIFT TO COMBINE YOUR DATA WITH PUBLIC DATASETS • EUGENE MANDEL (@EUGMANDEL) • JAWBONE • QCON SF 2014
  2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month • Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/hadoop-redshift-spark
  3. Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation • Strategy: practitioner-driven conference designed for YOU: influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators • Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide • Presented at QCon San Francisco www.qconsf.com
  4. JAWBONE DATA: MOVEMENT • SLEEP • WORKOUTS • MEALS • MOOD
  5. SOUTH NAPA EARTHQUAKE 2014
  6. [Chart: % OF PEOPLE AWAKE AT 3:25 vs. DISTANCE FROM EPICENTER (MILES)]
  7. DATA FUSION IS THE PROCESS OF INTEGRATION OF MULTIPLE DATA AND KNOWLEDGE REPRESENTING THE SAME REAL-WORLD OBJECT INTO A CONSISTENT, ACCURATE, AND USEFUL REPRESENTATION. (WIKIPEDIA)
  8. DATA FUSION - HOW TO FIND THE ELEPHANT • IMAGE SOURCE: HTTP://COMMONS.WIKIMEDIA.ORG/WIKI/FILE%3ABLIND_MEN_AND_ELEPHANT.PNG
  9. DATA FUSION: POWERFUL BUT HARD • DATA IS NOISY • DOMAIN UNDERSTANDING IS KEY
  10. LET’S TALK ABOUT THE WEATHER
  11. MODEL THE PROBLEM
  12. [Chart: ACTIVITY vs. AIR TEMP (°F)] ?
  13. FIND THE DATA
  14. UNDERSTAND THE DATA
  15. HOURLY • DAILY
  16. DATA GENERATION PROCESS: NETWORK OF WEATHER STATIONS • FREQUENCY OF MEASUREMENTS - HOURLY TO DAILY • COLLABORATION WITH INTERNATIONAL AGENCIES • AGGREGATION AND QA BY NCDC
  17. UNDERSTAND THE DOMAIN • WEATHER STATION • TIME: 2014-07-09 13:04:00 • AIR TEMP: 86°F • PRECIPITATION: 3CM
  18. QA THE DATA
  19. BUT ISN’T IT DONE?
  20. AIR TEMP: 105°F • BAKERSFIELD, CA JULY 17, 15:00 • DULUTH, MN JAN 12, 05:00 • …MAYBE NOT!
  21. DATA VALIDATION: DOMAIN KNOWLEDGE • COMPARE MULTIPLE SOURCES - E.G. CLIMATE • MANUAL REVIEW OF FLAGGED DATA POINTS
  22. JOIN
  23. DOMAIN SPECIFIC • HOW? • WEATHER STATION B: LAT: 39.35, LON: -74.44, TIME: 2014-07-09 13:00:00, AIR TEMP: 60°F • WEATHER STATION A: LAT: 39.36, LON: -74.45, TIME: 2014-07-09 13:04:00, AIR TEMP: 74°F • ELEVATION: 30FT • ELEVATION: 120FT
  24. DO THE DATASETS INTERSECT ENOUGH? COVERAGE: PLACES • TIMES • USERS
  25. ISOLATE THE EFFECT
  26. CONFOUNDING VARIABLES: WHAT ELSE AFFECTS ACTIVITY? WEEKDAYS/WEEKENDS • DAYLIGHT • RAIN/SNOW
  27. REDSHIFT VS SPARK
  28. AMAZON REDSHIFT: RELATIONAL ANALYTICAL DATABASE BY AMAZON • COMPLEX QUERIES ON LARGE DATASETS IN SECONDS • SQL INTERFACE (POSTGRES) • MANAGED CLUSTER
  29. EXAMPLE: DAYLIGHT • PYTHON • REDSHIFT
  30. SPARK: IN-MEMORY DATA PROCESSING FRAMEWORK • MODELS COMPUTATION AS A GRAPH OF RDDS (RESILIENT DISTRIBUTED DATASETS) • FUNCTIONAL PROGRAMMING MODEL (SCALA, PYTHON) • SQL • CAN READ FROM SAME SOURCES AS HADOOP
  31. SPARK EXAMPLE: DAYLIGHT
  32. PICK YOUR OWN ADVENTURE • SILVER BULLET? • SPARK: PROGRAMMER-FRIENDLY; END-TO-END SOLUTION; SELF-DOCUMENTING • REDSHIFT: EASY TO SHARE DATA WITH NON-DEVELOPERS; MANAGED - EASY SCALING
  33. WHAT DID WE FIND?
  34. IDEAL TEMP FOR MOVEMENT • [Chart: DAILY STEPS vs. MAX TEMP (F)]
  35. AND NOW BY STATE… • [Chart: DAILY STEPS vs. MAX TEMP (F)]
  36. HOURLY STEPS BY AIR TEMP • WEEKENDS
  37. LESS CHOICE = SMALLER EFFECT • WEEKDAYS
  38. DATA FUSION: POWERFUL BUT HARD • DATA IS NOISY • DOMAIN UNDERSTANDING IS KEY
  39. THANK YOU! @EUGMANDEL • WWW.LINKEDIN.COM/IN/EUGENEMANDEL
  40. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/hadoop-redshift-spark
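
The validation idea on slide 21 (compare a reading against a second source such as climate data, then send outliers to manual review) could be sketched roughly as follows. This is only an illustration: the `CLIMATE_NORMALS` table, its numbers, the tolerance values, and the `flag_for_review` helper are all hypothetical and do not come from the talk or from NCDC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    station: str
    month: int
    air_temp_f: float


# Hypothetical climate reference: (station, month) -> (typical temp °F, tolerance °F).
# The numbers are made up for illustration only.
CLIMATE_NORMALS = {
    ("BAKERSFIELD_CA", 7): (84.0, 30.0),
    ("DULUTH_MN", 1): (10.0, 40.0),
}


def flag_for_review(readings, normals=CLIMATE_NORMALS):
    """Return readings that deviate from the climate reference or lack one."""
    flagged = []
    for r in readings:
        normal = normals.get((r.station, r.month))
        if normal is None:
            flagged.append(r)  # no second source to compare against
            continue
        typical, tolerance = normal
        if abs(r.air_temp_f - typical) > tolerance:
            flagged.append(r)  # implausible for this place and season
    return flagged


readings = [
    Reading("BAKERSFIELD_CA", 7, 105.0),  # hot, but plausible in July
    Reading("DULUTH_MN", 1, 105.0),       # same value, implausible in January
]
flagged = flag_for_review(readings)
```

With these assumed normals, only the Duluth reading ends up in `flagged`, which mirrors the contrast drawn on slide 20.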
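
Slide 23 asks how to join user data to weather stations when nearby stations disagree. One possible rule, shown here purely as an assumption, is to pick the nearest station whose observation falls within a time window. The station coordinates, times, and temperatures are taken from slide 23; the user location, the `max_miles` and `max_gap` thresholds, and the function names are hypothetical. The sketch ignores elevation, which the slide suggests may also matter.

```python
import math
from datetime import datetime, timedelta


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    earth_radius_miles = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))


def nearest_observation(lat, lon, when, observations,
                        max_miles=25.0, max_gap=timedelta(hours=1)):
    """Pick the closest observation in space among those close enough in time."""
    best, best_dist = None, None
    for obs in observations:
        if abs(obs["time"] - when) > max_gap:
            continue  # too far apart in time
        dist = haversine_miles(lat, lon, obs["lat"], obs["lon"])
        if dist <= max_miles and (best_dist is None or dist < best_dist):
            best, best_dist = obs, dist
    return best


observations = [
    {"station": "A", "lat": 39.36, "lon": -74.45,
     "time": datetime(2014, 7, 9, 13, 4), "air_temp_f": 74.0},
    {"station": "B", "lat": 39.35, "lon": -74.44,
     "time": datetime(2014, 7, 9, 13, 0), "air_temp_f": 60.0},
]

# Hypothetical user location and timestamp.
match = nearest_observation(39.352, -74.441, datetime(2014, 7, 9, 13, 2), observations)
no_match = nearest_observation(39.352, -74.441, datetime(2014, 7, 9, 20, 0), observations)
```

A nearest-station rule is only one choice; averaging nearby stations or weighting by elevation would be equally defensible depending on the domain.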
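
The code shown on the daylight example slides (29 and 31) is not part of this transcript. As a stand-in, here is a minimal plain-Python sketch of what labelling activity records with a daylight flag might look like. The `SUN_TIMES` lookup, its sunrise and sunset values, the place key, and the record layout are all assumptions made for illustration, not the speaker's implementation.

```python
from datetime import date, datetime, time

# Hypothetical sunrise/sunset lookup keyed by (place, date); times are made up.
SUN_TIMES = {
    ("SF", date(2014, 7, 9)): (time(5, 56), time(20, 33)),
}


def is_daylight(place, ts, sun_times=SUN_TIMES):
    """True if timestamp ts falls between sunrise and sunset for that place and day."""
    sunrise, sunset = sun_times[(place, ts.date())]
    return sunrise <= ts.time() < sunset


# Records are (place, timestamp, steps); the values are invented.
records = [
    ("SF", datetime(2014, 7, 9, 13, 4), 900),
    ("SF", datetime(2014, 7, 9, 22, 0), 120),
]

# Append the daylight flag to each record with a per-record map.
labeled = list(map(lambda r: (*r, is_daylight(r[0], r[1])), records))
```

The per-record `map` is loosely in the spirit of the functional model slide 30 attributes to Spark, while a SQL engine would more naturally express the same lookup as a join against a sunrise/sunset table; both framings are assumptions here.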
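
Slides 26 and 34 to 37 describe looking at steps by temperature while holding a confounder (weekday versus weekend) apart. A minimal sketch of that kind of stratified aggregation is below. The function name, the 10°F bin width, and the sample rows are hypothetical; the talk's actual results are in the charts, not reproduced here.

```python
from collections import defaultdict
from datetime import date


def mean_steps_by_temp_bin(rows, bin_width=10):
    """Average steps per (stratum, temperature bin).

    rows: iterable of (day, max_temp_f, steps).
    Returns a dict keyed by ("weekday" | "weekend", bin_start_f).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of steps, count]
    for day, max_temp_f, steps in rows:
        stratum = "weekend" if day.weekday() >= 5 else "weekday"
        bin_start = int(max_temp_f // bin_width) * bin_width
        acc = totals[(stratum, bin_start)]
        acc[0] += steps
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Invented sample data: 2014-07-09 is a Wednesday, 2014-07-12 a Saturday.
rows = [
    (date(2014, 7, 9), 72.0, 7000),
    (date(2014, 7, 10), 75.0, 6000),
    (date(2014, 7, 12), 71.0, 9000),
    (date(2014, 7, 13), 95.0, 5000),
]
means = mean_steps_by_temp_bin(rows)
```

Keeping weekdays and weekends in separate strata means a difference between temperature bins is less likely to be an artifact of which days happened to be warm.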
