Rust is for "Big Data"

Presentation given at the Boulder/Denver Rust Meetup on 4/11/18.

Published in: Software

  1. Rust is for “Big Data”
     Andy Grove @ Boulder/Denver Rust Meetup 4/11/18
  2. About Me
     • I’ve been a software engineer for ~30 years
     • 20 years of that using Java
     • Also some management/founder roles
     • In my day job I mostly work with Scala, Spark, Parquet, Kudu, Thrift, and HDFS
     • Yay! I'm a Big Data Engineer™
     • I have been learning Rust in my spare time, on and off, over the past couple of years
     • One of my goals for 2018 was to become proficient in Rust, so I decided to take on a substantial project
  3. What’s wrong with Spark/JVM?
     • Spark is actually pretty neat, but …
     • Garbage collection overheads can be huge
     • OutOfMemory errors are common
     • Java serialization is inefficient, even with Kryo
     • Expensive up-front query planning and code generation make it inefficient for interactive queries and small data sets
     • Difficult to configure, monitor, and debug
     • Generally row-oriented, even when working with columnar data sources
  4. A typical day in Spark-land …
  5. Let’s build something better!
     • Rust > JVM:
       • Raw performance of compiled code
       • Efficient memory usage
       • Predictable memory usage
       • No serialization overhead to map raw bytes to Rust structs
       • Access to hardware (SIMD, DMA, etc.)
  6. Keep Calm and Keep Columnar
     • Column-oriented > Row-oriented
     • Just load the columns you need from disk (efficient projections)
     • “a > b” and “a + b” are now vectorized operations that can take advantage of SIMD (Single Instruction, Multiple Data)
     • Apache Arrow is a standardized columnar in-memory format for zero-copy data interchange between systems
     • Apache Parquet is a columnar file format with efficient per-column encoding and compression
  7. DataFusion
     • DataFusion is a proof-of-concept of a modern distributed compute platform, implemented in Rust
     • Programming model is similar to Apache Spark (DataFrame and SQL APIs)
     • Apache Arrow is used for the core memory model
     • Apache Parquet is partially supported (read-only and no support for nested types yet)
     • CSV is supported too (where there is Big Data, there is CSV)
     • etcd is used for co-ordination between nodes
     • Kubernetes/Docker deployment model (planned)
  8. Arrow Memory Layout
  9. Source code example
  10. First Benchmark
      • Simple job to convert lat/lng pairs into ESRI WKT (Well-known text) format
      • SELECT ST_AsText(ST_Point(lat, lng)) FROM locations
      • Reads from CSV file
      • Calls two UDFs, and creates one UDT
      • Writes results to CSV file
      • Single thread, single core
  11. Detailed Results (throughput in rows/second)

      # Rows   DataFusion 0.2.6   Apache Spark 2.2.1     Ratio
      10^1               18,191                    2   7,523.8
      10^2               47,489                  437     108.7
      10^3              607,057                3,731     162.7
      10^4              820,819               32,258      25.4
      10^5              957,025              181,159       5.3
      10^6            1,044,030              256,213       4.1
      10^7              797,224              268,853       3.0
      10^8            1,026,443              271,022       3.8
      10^9              958,960              282,576       3.4
  12. Thanks!
      • Resources:
        • DataFusion: https://datafusion.rs/
        • My blog: https://andygrove.io
        • Apache Arrow: https://arrow.apache.org/
      • Contact me:
        • LinkedIn: https://www.linkedin.com/in/andygrove/
        • Twitter: @andygrove73
        • Email: andygrove73@gmail.com
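
The "Keep Calm and Keep Columnar" slide says that “a > b” and “a + b” become vectorized operations once data is stored column by column. A minimal sketch of that idea in plain Rust follows. The `Batch` type and the `add` and `gt` functions are hypothetical names made up for illustration; they are not the Apache Arrow or DataFusion API, and the layout here is just one `Vec` per column, not the Arrow format.

```rust
/// A tiny column-oriented batch: each field lives in its own contiguous vector.
/// Hypothetical type for illustration only; this is not the Apache Arrow layout.
pub struct Batch {
    pub a: Vec<f64>,
    pub b: Vec<f64>,
}

/// Element-wise `a + b` over two columns of equal length.
/// A tight loop over contiguous slices is the kind of code a compiler may
/// auto-vectorize; whether SIMD instructions are actually emitted depends on
/// the target and optimization level.
pub fn add(a: &[f64], b: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "columns must have the same length");
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Element-wise `a > b`, producing a boolean mask column.
pub fn gt(a: &[f64], b: &[f64]) -> Vec<bool> {
    assert_eq!(a.len(), b.len(), "columns must have the same length");
    a.iter().zip(b).map(|(x, y)| x > y).collect()
}
```

Calling `add(&batch.a, &batch.b)` or `gt(&batch.a, &batch.b)` touches only the two columns involved, which is the projection benefit the slide describes: other columns in a wider batch would never be read.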

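The "First Benchmark" slide runs `SELECT ST_AsText(ST_Point(lat, lng)) FROM locations` over a CSV file. The per-row work can be sketched in plain Rust as below. This is an assumption-laden illustration, not DataFusion's implementation: `Point`, `st_point`, `st_as_text`, and `convert_line` are hypothetical names, the input is assumed to be a two-field `lat,lng` line with no header or quoting, and the coordinates are written in the same argument order the query uses.

```rust
/// A 2D point, standing in for the user-defined type (UDT) the query creates.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub x: f64,
    pub y: f64,
}

/// Sketch of the first UDF: build a point from two numeric columns.
pub fn st_point(x: f64, y: f64) -> Point {
    Point { x, y }
}

/// Sketch of the second UDF: render a point as WKT, e.g. `POINT (1.5 2.5)`.
pub fn st_as_text(p: &Point) -> String {
    format!("POINT ({} {})", p.x, p.y)
}

/// Convert one `lat,lng` CSV line to WKT.
/// Returns `None` if a field is missing or does not parse as a number.
pub fn convert_line(line: &str) -> Option<String> {
    let mut fields = line.split(',');
    let lat: f64 = fields.next()?.trim().parse().ok()?;
    let lng: f64 = fields.next()?.trim().parse().ok()?;
    // Arguments are passed in the same order as the query on the slide.
    Some(st_as_text(&st_point(lat, lng)))
}
```

A single-threaded job in the spirit of the benchmark would read the input line by line, call `convert_line` on each, and write the resulting strings to an output file.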