O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

dplyr Interfaces to Large-Scale Data

2.975 visualizações

Publicada em

The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.

Publicada em: Software
  • Seja o primeiro a comentar

  • Seja a primeira pessoa a gostar disto

dplyr Interfaces to Large-Scale Data

  1. 1. 1© Cloudera, Inc. All rights reserved. dplyr Interfaces to Large-Scale Data Ian Cook @ianmcook ian@cloudera.com
  2. 2. 2© Cloudera, Inc. All rights reserved. Mission for Cloudera: Provide a platform for data analysts, data scientists to efficiently query, analyze, model large-scale data in clusters, cloud storage • By distributing Apache Spark, Apache Impala, other tools • By enabling productive use of these tools Python and R users often have difficulty moving from smaller data to large-scale distributed data • Familiar packages, methods don’t work the same way on distributed data Context
  3. 3. 3© Cloudera, Inc. All rights reserved. ]
  4. 4. 4© Cloudera, Inc. All rights reserved. dplyr provides a set of verbs that perform common data manipulation steps: • select() to select columns • filter() to filter rows • arrange() to order rows • mutate() to create new columns • summarise() to aggregate • group_by() to perform operations by group dplyr works on local data and with remote data sources • For remote sources, dplyr commands are translated into SQL dplyr
  5. 5. 5© Cloudera, Inc. All rights reserved. Demonstration Example code at github.com/ianmcook/dplyr-examples
  6. 6. 6© Cloudera, Inc. All rights reserved. dplyr SQL backends dplyr ↕ dbplyr ↕ dplyr SQL backend package* ↕ DBI ↕ DBI-compatible interface package ↕ database driver or connector ↕ database/engine * optional
  7. 7. 7© Cloudera, Inc. All rights reserved. • Provides a SQL backend to dplyr for Spark • Also exposes the MLlib API and a subset of the Spark DataFrames API • Developed by RStudio spark.rstudio.com sparklyr
  8. 8. 8© Cloudera, Inc. All rights reserved. • Provides a SQL backend to dplyr for Impala • Uses ODBC or JDBC to connect to Impala • Developed at Cloudera tiny.cloudera.com/implyr implyr implyr
  9. 9. 9© Cloudera, Inc. All rights reserved. Five tips for using dplyr with SQL data sources
  10. 10. 10© Cloudera, Inc. All rights reserved. Use show_query() 1
  11. 11. 11© Cloudera, Inc. All rights reserved. filter() early arrange() late 2
  12. 12. 12© Cloudera, Inc. All rights reserved. Check your data types 3
  13. 13. 13© Cloudera, Inc. All rights reserved. Know your SQL engine 4
  14. 14. 14© Cloudera, Inc. All rights reserved. Know when to collect() 5
  15. 15. 15© Cloudera, Inc. All rights reserved. Thank you Ian Cook @ianmcook ian@cloudera.com

×