It’s no secret that Apache Spark is becoming the successor to MapReduce for data processing in Hadoop. With its easy development, flexible API, and performance benefits, Spark is a powerful data processing engine that has quickly gained popularity within the community. On the other hand, Hive continues to be the most widely used data warehouse/ETL engine, with large-scale adoption across enterprises. It is therefore imperative to enable Spark as the underlying execution engine for Hive, so that existing and future Hive workloads can seamlessly leverage the advantages of Spark.
With the recent release of Cloudera 5.7, we have delivered on this goal by adding support for Hive-on-Spark (HoS). Data engineers and ETL developers can now transition their Hive workloads from MapReduce (MR) to Spark seamlessly, thereby benefiting from the advantages of Spark without any disruption on their end.
Join Santosh Kumar, Senior Product Manager at Cloudera, and Rui Li, Apache Hive committer and engineer at Intel, as we discuss:
An introduction to Spark and its advantages over MR
An introduction to Hive-on-Spark: goals and design principles
Migrating to HoS and a live demo
Configuring and tuning for batch workloads
What’s next for both tools