Democratizing Apache Spark for the Enterprise with Jonathan Gole
1. Confidential
Unifying Analytics with Apache Spark for the Enterprise
Jonathan Gole
Sr. Director, Product Management & Business Analytics
US Card, Data Products
We’re innovating to disrupt an industry
“We founded Capital One on the belief that information and technology would revolutionize
financial services. Two decades later, our belief is even stronger.”
– Rich Fairbank, Co-founder & CEO, Capital One
How We Work
Who am I and why am I here?
US Card – Data Products team lead
Manage the largest Apache Spark projects at Capital One
Infrastructure challenges have slowed down digital transformation at banks
Typical Bank Challenges
▸Mainframes
▸Slow batch data processes
▸Overly complex and redundant systems
▸Limited support for public cloud and open-source
Harm to the Business
▸Less efficient marketing targeting strategies
▸Limitations with new underwriting techniques
▸Greater process and operational complexity
▸Disconnected or incomplete digital experiences
These infrastructure limitations frustrate and constrain associates
Data Engineer — Repetitive work building and fixing ETL pipelines
Business Analyst — Difficulty analyzing all data at scale; frustrated by the inability to get new insights to customers
Software Engineer — Reliant on data engineers to build workloads; limited access to data sources
Data Scientist — Limited by compute, access to open source, and time to get new models to production
Associates were isolated in separate technology paradigms.
We needed to improve our technology AND our culture.
Solution: Unifying data and AI through Apache Spark
▸Test & prototype
▸Decide to build around Apache Spark
▸Apply a product lens and learn through doing
Learning from initial challenges
▸Infrastructure & resilience
▸Associate learning curve
▸Diversity & complexity of use cases
Takeaways:
▸Value in centrally managed products to enable rapid innovation
▸Realized we didn’t have the expertise to build it all ourselves
Next step: deploy distinct, optimized strategies to meet the needs of all users

Operations — Data Engineers, Software Engineers
▸Well-tested, production-ready code
▸Deploy workloads/apps independently
▸Develop primarily in Scala, Java
Solution: shared “Quantum” application framework and code libraries

Analytics — Data Scientists, Analysts
▸Low barrier to getting started, easy automation
▸Emphasis on fast iteration and collaboration
▸Work using SQL, Python, ML libraries
Solution: Unified Analytics Platform, leveraging a notebook UI & infrastructure automation
Continuous investment in improving our products and our ecosystem

Product philosophy: Make it Work! → Make it Scale! → Make it Easy!

Operations
▸Developed “Quantum” 1.0 framework and deployment tools
▸Enabled Spark 2.0, structured streaming, and always-on streaming operations
▸Created an “inner-source” model inside the enterprise
▸Built a JSON-DSL abstraction & common code libraries
▸Spark on Kubernetes for unified dev-ops
▸Custom application for visually editing and testing jobs

Analytics
▸Deployed a POC of a Unified Analytics Platform (UAP)
▸Opened the platform to data scientists
▸Deployed new architecture to scale
▸Integrations with common enterprise apps and data sources
▸Built a “stockpile” of well-written code for common use cases
▸Mass user training through live & self-paced courses
▸Opened to the enterprise!
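The roadmap above mentions a JSON-DSL abstraction over common code libraries. A minimal sketch of how such a DSL might work (all names hypothetical — this is not Capital One’s actual Quantum framework): a job spec in JSON names a source and a chain of transforms, and a small interpreter maps each step onto the engine. Plain Python over lists of dicts stands in for Spark DataFrames so the sketch is self-contained.

```python
import json

# Hypothetical JSON-DSL job spec: declarative steps, interpreted at run time.
JOB_SPEC = json.loads("""
{
  "source": [{"account": "a1", "spend": 120}, {"account": "a2", "spend": 45},
             {"account": "a3", "spend": 300}],
  "transforms": [
    {"op": "filter", "field": "spend", "min": 100},
    {"op": "select", "fields": ["account"]}
  ]
}
""")

def run_job(spec):
    """Interpret the spec: each 'op' maps to one data operation."""
    rows = spec["source"]
    for step in spec["transforms"]:
        if step["op"] == "filter":
            # Keep rows whose field meets the threshold
            rows = [r for r in rows if r[step["field"]] >= step["min"]]
        elif step["op"] == "select":
            # Project down to the named fields
            rows = [{f: r[f] for f in step["fields"]} for r in rows]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return rows

print(run_job(JOB_SPEC))  # [{'account': 'a1'}, {'account': 'a3'}]
```

In a real framework each `op` would map to well-tested Spark DataFrame operations, so engineers contribute an operation once and others compose jobs declaratively — one way centrally managed code libraries can enable rapid innovation.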
Creating a vast ecosystem around Apache Spark
▸Infrastructure
▸Data integrations
▸System integrations
▸Quantum application framework + Unified Analytics Platform
▸User interfaces: notebooks, custom workflow editor, BI products
Enabling innovation across a wide range of data-intensive use cases
▸ETL
▸Marketing campaigns
▸Account management
▸External data sharing
▸Feature calculation
▸Models & ML
▸Streaming alerts
▸Cloud SQL analytics
▸Business reporting
▸…
Creating GREAT jobs for our associates
Data Engineer — Transforming operations via new data sources, real-time streams, & machine learning
Business Analyst — Deriving better insights more quickly; partnering more closely with engineering and data science
Software Engineer — Operating as a full-stack team, quickly adding data operations to the application stack
Data Scientist — Using advanced ML techniques, easily deploying new models, and accessing valuable new data
A more effective and collaborative culture around data
Lessons learned for unifying analytics within the enterprise
▸Focus on your customers
▸Start small, prove value, and iterate
▸Embrace a community
▸Take a unified approach
Editor's Notes
Hello! Thank you to Databricks and the spark summit organizers for having us here.
I want to share our story of unifying analytics by successfully spreading Apache spark across our enterprise
You may know us for our credit cards and our catchy commercials..
Sorry I couldn’t bring Samuel L Jackson
However, we are a diversified financial services leader.
corporate offices nationwide – headquartered outside of Washington DC, office one block away in San Francisco.
Cap One is a relatively new company; IPO’d in 1994. First FinTech unicorn.
To understand us, look to our long-term CEO and co-founder Rich Fairbank.
Founded to disrupt the industry by leveraging the technology and data revolution
Embraced rapid experimentation and rigorous data-driven decision making
Data technology was and is still at the heart of Capital One’s success
Apache Spark is the latest technology revolution transforming the way we use data to drive our businesses.
PdM team delivering new products and services for transforming our ability to succeed with Data.
In this role, my team has built the largest Spark projects in Capital one
tell you the story of how we transformed our business through the unifying capabilities of Apache Spark
Like other financial companies…Built a mature, growing business on prior technology paradigms
As data systems aged, we were limited in ingenuity and innovation in our core businesses
- company uses data at the heart of everything, the cost was meaningful
- needed to change this paradigm
associates were clearly impacted - less efficient, less able to focus on innovation, less able to test and learn.
testing & experimentation were core to the foundation of Capital One, so technology limitations were challenging the nature of our culture
shortcomings as a team - people were isolated from each other, working in different technology paradigms
Create a new paradigm – both eliminate technology limitations and our culture around data
typical evaluation process – work with experts, get hands dirty
build around Spark - most active open-source project, unifies batch and streaming, flexible APIs and support for multiple languages
more important - product-centric focus: Unified ownership, real high value use cases, not abstract technology principles
Focus on our associates as customers
Learn through doing - learn quickly and altering approach as needed
Do not start by building an enterprise platform, first build use cases ourselves to deeply understand needs
challenges in first use case:
rolled our own infrastructure - several months to master
Associates working in new paradigms: AWS & public cloud, distributed system, use of scala, separation of storage and compute
quickly realized we could not upfront design one system that would meet everyone’s needs
Diversity & complexity of our business - 80M customers, long-term relationships, regulatory burdens.
Aware of our own ignorance, need to continue to learn through use cases & our customers
Operations: business critical applications, fulfill customer & regulatory promise
Enable a high degree of flexibility and customization by engineers
Yet maintain a degree of efficiency in development, with standards to properly govern our critical apps
Answer: build a custom in-house shared application framework, leveraging shared code libraries and tooling.
Analytics – insight generation, data science, and machine learning
Iterate on new ideas quickly, collaborate effectively
Ignore infrastructure - It should just work, and let me be creative
Answer: deploy a shared data science platform – a Unified Analytics Platform.
We knew we didn’t have the expertise, so we partnered with Databricks
iteratively expanded on that strategy
Invested continuously into improving the products
core product philosophy to drive our decision making – similar to Suffering Oriented programming
Focus first on enabling and proving out new transformative functionality – use real business use cases to validate progress (highlight: Spark 2.0)
Set the foundation for growth. Improve performance and scalability. Reduce technical debt. Build tooling to attract future customers (highlight: JSON DSL, inner-sourcing)
Expand through good user experience – invest in ease of use and efficiency (ensure a great on-boarding experience for UAP; created self-service training in notebooks)
Over time, this product and customer focus organically led to a large ecosystem operating around our Apache Spark products.
Spark as a primary data compute layer for many of our systems
We didn’t have control over this whole ecosystem, but we embedded flexibility in our platforms and allowed for emergent design
Apache Spark reflects this ecosystem approach - continue to deepen the connectivity across the data and software ecosystem
Truly shows the massive impact of Spark- a design that unifies domains can have a massive ROI
To call out a few accomplishments, our developer & DS community:
Re-building 100s of ETL jobs in the cloud
self-service marketing automation platform
improving our fraud defenses, protecting customer identities
Enabled rapid testing of ML approaches, detecting changing credit trends, automating manual processes
transforming what our associates could achieve
leveling-up in their productivity
Enable shift to more creative ideas.
Given the shared technology paradigm, associates organically work together, collaborate more effectively.
growing importance of small, empowered, cross-functional or cross-skilled teams.
Highlight and reemphasize several learnings from our efforts with Spark:
1) see colleagues as customers, understand & solve for both broad platform needs & enable unique problem solving
2) Learn quickly through doing, instead of abstract evaluation.
3) enable community development, collect feedback, and provide training
4) unified approach – accelerate innovation by removing barriers
Thank you for having me here, I’m proud to participate in such a great Spark community