RDD

Reference: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Published in: Software

  1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury... 2012, University of California, Berkeley
  2. OUTLINE • Introduction • Resilient Distributed Datasets (RDDs) • Representing RDDs • Evaluation • Conclusion
  3. Introduction: Cluster computing frameworks like MapReduce perform poorly on iterative machine learning and graph algorithms, because reusing data between computations incurs data replication, disk I/O, and serialization overhead.
  4. Introduction: Pregel is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop offers an iterative MapReduce interface. However, these systems support only specific computation patterns and do not provide abstractions for more general reuse.
  5. Introduction: RDDs define a programming interface that can provide fault tolerance efficiently. RDDs vs. distributed shared memory: RDDs are built through coarse-grained transformations (e.g., map, filter, and join) and recover lost data using lineage, whereas distributed shared memory allows fine-grained updates to mutable state.
  6. Resilient Distributed Datasets (RDDs): Transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
  7. Resilient Distributed Datasets (RDDs)
  8. Resilient Distributed Datasets (RDDs): An RDD is a read-only, partitioned collection of records. It can only be created from (1) data in stable storage or (2) other RDDs. Example: lines = spark.textFile("hdfs://..."); errors = lines.filter(_.startsWith("ERROR")); errors.count()
  9. Resilient Distributed Datasets (RDDs): lines = spark.textFile("hdfs://...") is RDD1; errors = lines.filter(_.startsWith("ERROR")) is RDD2; number = errors.count() is a Long. Diagram: RDD1 -> (transformation) -> RDD2 -> (action) -> Long
  10. Resilient Distributed Datasets (RDDs): DEMO
  11. Resilient Distributed Datasets (RDDs): lines = spark.textFile("hdfs://...") is RDD1; errors = lines.filter(_.startsWith("ERROR")) is RDD2; error = errors.persist() (or errors.cache()) is RDD3. RDD3 (error) will be kept in memory.
  12. Resilient Distributed Datasets (RDDs): Lineage provides fault tolerance. Diagram: RDD1 -> (transformation) -> RDD2 -> (action) -> Long. If RDD2 is lost, the transformation is reapplied to RDD1 to produce a new RDD2.
  13. Resilient Distributed Datasets (RDDs): Spark provides the RDD abstraction through a language-integrated API in Scala, a functional programming language for the Java VM.
  14. Representing RDDs: Dependencies between RDDs. Narrow dependencies allow for pipelined execution on one cluster node. Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
  15. Representing RDDs (figure labels: on the same node, on different nodes)
  16. Representing RDDs: How Spark computes job stages (figure labels: partition, RDD, RDD in memory)
  17. Resilient Distributed Datasets (RDDs): Each stage contains as many pipelined transformations with narrow dependencies as possible, because this avoids shuffling data across the nodes.
  18. Evaluation: Amazon EC2 m1.xlarge nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks.
  19. Evaluation: Logistic regression and k-means, 10 iterations on 100 GB datasets using 25–100 machines. Logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
  20. Evaluation: HadoopBinMem converts the input data to a binary format and keeps it in memory.
  21. Evaluation: PageRank on a 54 GB Wikipedia dump (4 million articles), 10 iterations.
  22. Evaluation: PageRank, 10 iterations.
  23. Evaluation: Fault recovery. k-means on 100 GB of data, 75 nodes, 10 iterations; one node fails at the start of the 6th iteration.
  24. Evaluation: k-means on 100 GB of data, 75 nodes, 10 iterations.
  25. Evaluation: Behavior with insufficient memory. Logistic regression on 100 GB of data, 25 machines.
  26. Evaluation: k-means on 100 GB of data, 25 machines.
  27. Conclusion: RDDs are an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications. RDDs offer an API based on coarse-grained transformations that lets them recover data efficiently using lineage. Spark vs. Hadoop: Spark is up to 20× faster in iterative applications and can be used interactively to query hundreds of gigabytes of data.
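
The slides on lazy transformations, persist()/cache(), and lineage-based recovery (slides 6, 11, and 12) can be illustrated with a small toy model in plain Scala. This is only a sketch and not Spark's implementation: the names ToyRdd, Source, Mapped, Filtered, Cached, evict, and sourceReads are hypothetical, and everything runs on a single machine over an in-memory Seq instead of partitioned data.

```scala
// Toy model of an RDD. NOT Spark's implementation; all names are hypothetical.
object LineageSketch {
  // Counts how often the simulated "stable storage" is read.
  var sourceReads = 0

  sealed trait ToyRdd[A] {
    // Rebuilds the records by walking the lineage back to the source.
    def compute(): Seq[A]

    // Transformations are lazy: they only record a new lineage node.
    def map[B](f: A => B): ToyRdd[B] = Mapped(this, f)
    def filter(p: A => Boolean): ToyRdd[A] = Filtered(this, p)
    def persist(): Cached[A] = new Cached(this)

    // Actions trigger the computation and return a value to the program.
    def count(): Long = compute().size.toLong
    def collect(): Seq[A] = compute()
  }

  // Lineage root: data loaded from (simulated) stable storage.
  final case class Source[A](load: () => Seq[A]) extends ToyRdd[A] {
    def compute(): Seq[A] = { sourceReads += 1; load() }
  }

  final case class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  final case class Filtered[A](parent: ToyRdd[A], p: A => Boolean) extends ToyRdd[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }

  // Keeps the computed records in memory after the first action.
  final class Cached[A](parent: ToyRdd[A]) extends ToyRdd[A] {
    private var data: Option[Seq[A]] = None
    def compute(): Seq[A] = data.getOrElse {
      val d = parent.compute()
      data = Some(d)
      d
    }
    // Simulates losing the in-memory copy, e.g. through a node failure.
    def evict(): Unit = data = None
  }

  // Returns (reads before any action, first count, reads after a cached
  // second count, count after eviction, reads after recovery).
  def demo(): (Int, Long, Int, Long, Int) = {
    sourceReads = 0
    val lines = Source(() => Seq("ERROR disk", "INFO ok", "ERROR net"))
    val errors = lines.filter(_.startsWith("ERROR")).persist()
    val readsBeforeAction = sourceReads // still 0: nothing has run yet
    val firstCount = errors.count()     // first action reads the source
    errors.count()                      // served from the cached copy
    val readsWhileCached = sourceReads
    errors.evict()                      // the cached data is lost
    val recoveredCount = errors.count() // recomputed from the lineage
    (readsBeforeAction, firstCount, readsWhileCached, recoveredCount, sourceReads)
  }
}
```

Calling LineageSketch.demo() in this toy model yields (0, 2, 1, 2, 2): the source is not read until the first action, it is not read again while the cached copy exists, and it is read once more after the copy is evicted, because the lost data is rebuilt from its lineage rather than restored from a replica.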

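
The rule from slides 14 and 17, that a stage pipelines as many narrow-dependency transformations as possible and that wide dependencies force a shuffle, can be sketched as a simple grouping over a linear chain of operations. This is a simplified illustration in plain Scala, not Spark's scheduler: Op, Narrow, Wide, and splitIntoStages are hypothetical names, real lineage graphs are DAGs rather than chains, and placing the wide operation at the start of the next stage is an assumption made here for simplicity.

```scala
// Simplified stage splitting over a linear chain. NOT Spark's scheduler;
// all names are hypothetical.
object StageSketch {
  sealed trait Dependency
  case object Narrow extends Dependency // can be pipelined on one node
  case object Wide extends Dependency   // needs a shuffle across nodes

  // One step in the chain; `dep` describes how it depends on the previous
  // step. The dependency of the very first step is ignored.
  final case class Op(name: String, dep: Dependency)

  // Cuts the chain before every wide dependency, so that each stage holds
  // a maximal run of steps that can be pipelined without a shuffle.
  def splitIntoStages(ops: List[Op]): List[List[String]] = {
    // Stages and their steps are accumulated in reverse order.
    val reversed = ops.foldLeft(List.empty[List[String]]) {
      case (Nil, op)                           => List(List(op.name))
      case (stages, Op(name, Wide))            => List(name) :: stages
      case (current :: done, Op(name, Narrow)) => (name :: current) :: done
    }
    reversed.map(_.reverse).reverse
  }

  val example: List[Op] = List(
    Op("textFile", Narrow),
    Op("map", Narrow),
    Op("filter", Narrow),
    Op("groupByKey", Wide),
    Op("mapValues", Narrow),
    Op("join", Wide)
  )
}
```

For the example chain, StageSketch.splitIntoStages(StageSketch.example) returns three stages: textFile, map, and filter; then groupByKey and mapValues; then join. Each cut corresponds to a point where data would have to be shuffled across the nodes.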