Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces to Large-scale Data

You like to use R, and you need to use big data. dplyr, one of the most popular packages for R, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala.
But there can be pitfalls: dplyr works differently with different data sources—and those differences can bite you if you don’t know what you’re doing.
Ian Cook is a data scientist, an R contributor, and a curriculum developer at Cloudera University. In this webinar, Ian will show you exactly what you need to know about the sparklyr package (from RStudio) and the implyr package (from Cloudera). He will show you how to write dplyr code that works across these different interfaces. He will also solve these mysteries:
Do I need to know SQL to use dplyr?
When is a “tbl” not a “tibble”?
Why is 1 not always equal to 1?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?

Published in: Technology

  1. dplyr Interfaces to Large-Scale Data. Ian Cook, @ianmcook, ian@cloudera.com. © Cloudera, Inc. All rights reserved.
  2. Context
     Mission for Cloudera: Provide a platform for data analysts, data scientists to efficiently query, analyze, model large-scale data in clusters, cloud storage
       • By distributing Apache Spark, Apache Impala, other tools
       • By enabling productive use of these tools
     Python and R users often have difficulty moving from smaller data to large-scale distributed data
       • Familiar packages, methods don’t work the same way on distributed data
  3. Poll question
  4. [Diagram; labels: SQL, PySpark, SparkR, SQL or DataFrame API]
  5. Poll question
  6. [Diagram; labels: SQL, PySpark, SparkR, dplyr]
  7. dplyr
     dplyr provides a set of verbs that perform common data manipulation steps
       • select() to select columns
       • filter() to filter rows
       • arrange() to order rows
       • mutate() to create new columns
       • summarise() to aggregate
       • group_by() to perform operations by group
     dplyr works on local data and with remote data sources
       • For remote sources, dplyr commands are translated into SQL
  8. Poll question
  9. Demonstration. Example code at github.com/ianmcook/dplyr-examples
  10. dplyr SQL backends
      dplyr ↕ dbplyr ↕ dplyr SQL backend package* ↕ DBI ↕ DBI-compatible interface package ↕ database driver or connector ↕ database/engine
      * optional
  11. sparklyr
       • Provides a SQL backend to dplyr for Spark
       • Also exposes the MLlib API and a subset of the Spark DataFrames API
       • Developed by RStudio
      spark.rstudio.com
  12. implyr
       • Provides a SQL backend to dplyr for Impala
       • Uses ODBC or JDBC to connect to Impala
       • Developed at Cloudera
      tiny.cloudera.com/implyr
  13. Five tips for using dplyr with SQL data sources
  14. Tip 1: Use show_query()
  15. Tip 2: filter() early, arrange() late
  16. Tip 3: Check your data types
  17. Tip 4: Know your SQL engine
  18. Tip 5: Know when to collect()
  19. Questions? Ian Cook, @ianmcook, ian@cloudera.com
  20. Cloudera Data Science Workbench
      More information: tiny.cloudera.com/cdsw
      OnDemand training: tiny.cloudera.com/cdsw-training
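
Several of the tips above (show_query(), filter() early and arrange() late, and knowing when to collect()) can be sketched in a few lines of R. This is a minimal illustration, not code from the webinar or from the dplyr-examples repository. It assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed, and it uses an in-memory SQLite database as a stand-in for Spark or Impala. The `flights` table and its columns are hypothetical.

```r
library(dplyr)
library(dbplyr)
library(DBI)

# Assumption: in-memory SQLite stands in for a remote engine such as Spark or Impala.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical sample data, invented for this sketch.
flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(5, 30, 12, 45, 3)
)
dbWriteTable(con, "flights", flights)

# A lazy reference to the remote table: a "tbl", but not a local tibble.
flights_tbl <- tbl(con, "flights")

# Build the query lazily: filter rows early, sort only at the end.
query <- flights_tbl %>%
  filter(dep_delay > 4) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(carrier)

# Tip 1: inspect the SQL that dplyr generates before running anything.
show_query(query)

# Tip 5: collect() executes the query and returns a local tibble.
result <- collect(query)
print(result)

dbDisconnect(con)
```

Nothing is sent to the database until collect() is called, so the filtering and aggregation run in the SQL engine and only the small summarised result is returned to R. With sparklyr or implyr the connection call would differ, and the generated SQL may differ too, which is why checking it with show_query() is useful.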
