Ensuring Spark is well integrated with YARN, Ambari, and Ranger lets enterprises deploy Spark applications with confidence. And because HDP is available on Windows and Linux, on-premises and in the cloud, adoption becomes that much easier for enterprises.
TALK TRACK
MACHINE LEARNING WITH SPARK
WEBTRENDS PROVIDES DIGITAL MARKETING ANALYTICS
STATS
+ PROCESSES 10 BILLION EVENTS PER DAY
+ AT AN AVERAGE SPEED OF 20 MILLISECONDS PER EVENT
[Background]
Listen to Peter Crossley's Hadoop Summit keynote (min 22:22):
http://brightcove.fora.tv/services/player/bcpid4287593805001?bckey=AQ~~,AAACbMgRlRk~,KnD13XNmCDZZPWkNmxmPMFTH2h0USbHh&bclid=4287661488001&bctid=4285724101001
Read a recent article on Webtrends by Peter Crossley:
http://insidebigdata.com/2015/07/14/strategic-big-data-pivot-leads-webtrends-to-success/
Overwhelmed by data ingest rates, the team sampled the data down to fit on an edge node
Random Forest models built in R
Team expertise was in R, not Scala/Java
Many key features, such as textual features, were left out (R cannot handle the feature blowup)
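Moving the model from R to Spark MLlib is what lets the full event volume and high-dimensional features fit a distributed Random Forest. A minimal sketch using MLlib's RDD-based API (Spark 1.x era); the input path and all parameters are illustrative, not Webtrends' actual pipeline:

```scala
// Sketch: distributed Random Forest with MLlib (Spark 1.x RDD-based API).
// Path and hyperparameters are illustrative.
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///events/features.libsvm")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))

val model = RandomForest.trainClassifier(
  train,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),  // all features treated as continuous
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 10,
  maxBins = 32)

// Evaluate on the held-out split
val accuracy = test.map(p => (model.predict(p.features), p.label))
  .filter { case (pred, label) => pred == label }
  .count.toDouble / test.count
```

Because training is distributed across the cluster, there is no need to sample the data down to a single edge node, and wide textual feature vectors are handled the same way as any other features.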
HBase has efficient scans; can Spark leverage them?
Push predicates and prune columns
Allow for RDD sharing with the HDFS Memory Tier. Improve dynamic resource allocation via YARN. Mature SparkSQL and Spark Streaming to GA quality.
HDFS Memory Tier
There are many use cases where Spark today feels less than ideal. For example, using Spark in a shared environment, with a middle tier fielding requests from multiple clients, is a common pattern. SparkContext is a heavyweight object and is tied to a specific user session. Using Spark in a shared environment requires features such as RDD sharing via the HDFS memory tier and SparkContext sharing.
There are other areas for improvement in Spark's YARN integration. For example, today YARN application logs are published only when the application finishes running. This model does not work for Spark Streaming, which is a long-running application and never gets a chance to publish its logs. Spark's YARN ATS integration also needs work to scale and not become a bottleneck.
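For context on the dynamic resource allocation work mentioned above, Spark already exposes the basic knobs; a sketch of the relevant spark-defaults.conf settings (the executor counts are illustrative):

```
# Dynamic allocation on YARN requires the external shuffle service
# to be running on each NodeManager
spark.shuffle.service.enabled         true
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  50
```

With this in place, YARN grows and shrinks the executor pool with the workload instead of holding a fixed allocation for the life of the application.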
SparkSQL is another critical area where we want to add value by bringing Hive-level features such as security (SPARK-11265), ACID, and vectorization to SparkSQL, and to make it GA in our platform over the coming months.
Seamless Data Access
One Hive
Seamless use of capabilities across Spark and Hive via SQL including common file formats. Deliver connectors for HBase (HFile).
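As an illustration of the "One Hive" goal, Spark can already run SQL against existing Hive tables through HiveContext (Spark 1.x); a minimal sketch, with an illustrative table name:

```scala
// Query an existing Hive table from Spark using HiveContext (Spark 1.x).
// The table name "web_logs" is illustrative.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

val topPages = hiveContext.sql(
  """SELECT page, COUNT(*) AS hits
    |FROM web_logs
    |GROUP BY page
    |ORDER BY hits DESC
    |LIMIT 10""".stripMargin)

topPages.show()
```

The same tables, file formats, and metastore serve both engines, so teams can mix Hive and Spark jobs over one copy of the data.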
Spark is a great data processing engine, and it provides more value when it can process more data. The value of a data lake is that it brings more data under one roof, opening new opportunities for insight and efficiency. The data lake is delivered by YARN and Hadoop, which provide massive scale and run all types of workloads against that data.
With the DataSource API, Spark provides a first-class way to bring in data from external sources while leveraging those systems for filtering, predicate pushdown, etc. We used the DataSource API to bring ORC data into Spark, and we are now working to bring HBase data into Spark efficiently. This will differ from the existing Spark + HBase connector in that it will leverage HBase for predicate pushdown and column filtering, and will be more efficient. Look for a tech preview of this feature in the coming months.
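The ORC case shows what the DataSource API buys you: filters and column selections written against a DataFrame are pushed down into the reader, so only the matching stripes and referenced columns are read. A sketch (Spark 1.4+, with illustrative path and column names):

```scala
// Read ORC through the DataSource API; Catalyst pushes the filter and
// the column pruning down into the ORC reader.
// Path and column names are illustrative.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

val events = hiveContext.read.format("orc").load("hdfs:///warehouse/events_orc")

val recent = events
  .filter($"event_date" >= "2015-07-01")   // predicate pushdown
  .select("user_id", "event_type")         // column pruning

recent.explain()  // the physical plan shows the pushed-down filters
```

The planned HBase connector applies the same idea, mapping DataFrame predicates onto HBase scans instead of ORC stripes.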
Data science notebooks and automation for the most common analysis scenarios.
Zeppelin
Include support for GeoSpatial and Entity Resolution.
Magellan
TALK TRACK
Ad-hoc experimentation with Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc.
Deeply integrated with Spark + Hadoop
Can be managed via Ambari Stacks
Supports multiple language backends
Pluggable “Interpreters”
Incubating at Apache
100% open source and open community
[NEXT SLIDE]
Hive
ESRI Hive (a thin wrapper on the ESRI Java library)
Magellan available on Github (https://github.com/harsha2010/magellan)
Can parse and understand the most widely used formats
GeoJSON, ESRI Shapefile
All geometries supported
1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)
Broadcast join available for common scenarios
Work in progress (targeted 1.0.4)
Geohash Join optimization
Map Matching algorithm using Markov Models
Python and Scala support
Please give it a try and give us feedback!
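A hedged sketch of what trying Magellan looks like, based on the project README for the 1.0.x line on Spark 1.x; the exact package paths and DSL may differ between versions, and the shapefile path and point data are illustrative:

```scala
// Point-in-polygon join with Magellan (1.0.x, Spark 1.x) — a sketch,
// not a definitive example; check the README for your version.
import magellan.Point
import org.apache.spark.sql.magellan.dsl.expressions._

// Neighborhood polygons loaded from an ESRI Shapefile directory (illustrative path)
val polygons = sqlContext.read.format("magellan").load("hdfs:///data/neighborhoods/")

// Points to locate; Magellan registers a UDT so Point works in DataFrames
case class Trip(id: Int, pickup: Point)
val trips = sc.parallelize(Seq(Trip(1, Point(-122.42, 37.77)))).toDF()

// The "within" predicate joins each point to its containing polygon;
// the broadcast-join optimization applies when the polygon set is small
trips.join(polygons).where($"pickup" within $"polygon").show()
```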
One of our insurance customers is using Spark to optimize claims reimbursements, applying Spark's machine learning capabilities to process and analyze all claims.
We have seen rapid adoption of Spark in our customer base, and we want to thank our customers for choosing Spark on HDP. We also want to thank our partners Microsoft, Databricks, HP, and NFLabs, and the community, for sharing this journey with us.