Implementing new tools, let alone an entire unified data platform like Databricks, can be quite the undertaking. Implementing a tool whose ins and outs you have not yet learned can be even more frustrating. Have you ever wished you could take some of that uncertainty away? Four years ago, Western Governors University (WGU) took on the task of rewriting all of our ETL pipelines in Scala and Python, as well as migrating our Enterprise Data Warehouse into Delta, all on the Databricks platform. Starting with 4 users and rapidly growing to over 120 users across 8 business units, our Databricks environment turned into a unified platform used by individuals with a wide range of skill levels, data needs, and internal security requirements.
Through this process, our team has had the opportunity to learn while making a lot of mistakes. Looking back at those mistakes, there are a lot of things we wish we had known before opening the platform to our enterprise.
We would like to share 10 things we wish we had known before WGU started operating in our Databricks environment. They cover user management from both an AWS and a Databricks perspective, understanding and managing costs, creating custom pipelines for efficient code management, Apache Spark snippets that saved us a fortune, and more. We also provide recommendations on how to overcome these pitfalls, to help new, current, and prospective users make their environments easier, safer, and more reliable to work in.
3. Who Am I?
• Jake Kulas
• https://www.linkedin.com/in/jakekulas/
• Senior Big Data Developer / Data Engineer at Western Governors University
• Wisconsin transplant living in Utah
• BS / MS in Information Systems at the University of Utah
• Working with Apache Spark/Databricks for 4 years
5. What is Western Governors University?
§ Founded in 1997 by 19 US governors
§ Non-profit, all-online, competency-based
§ Undergraduate and Graduate degrees
§ Regionally and Nationally accredited
§ 8 State affiliates
§ 228,000+ graduates
§ 135,000+ active students
§ 8,000+ employees
Education without boundaries
6. Introduction
§ Unify data platforms
▪ Data engineering
▪ Analysts / Researchers
▪ Data Scientists
▪ Psychometricians / Statisticians
§ EDW rearchitected on Delta
▪ EDW
▪ Lakehouse architecture
Implementation Reasoning
§ New languages
▪ Scala
▪ Python
§ New platform
§ New to cloud architecture / design
▪ AWS internal difficulties
§ Rolled out to entire enterprise
▪ 8 business units
▪ 140+ direct users
▪ 300+ jobs
Implementation Without Education
8. Key Mistakes and Challenges
• Understanding of Apache Spark and Delta
• Optimizing JDBC
• Delta optimizations
• Multilingual empowerment
• Code management in a new environment
• CICD Your Way
• Reduce, Reuse, Recycle
• Cost management
• Job/Cluster management
• User management
• User groups / permissions
• Cluster segregation
• Leveraging secrets
• Training / Best Practices
13. Multilingual Empowerment
• Databricks allows multilingual coding in notebooks
• Utilize what you know best to get the job done
• Mixing languages based on the task at hand (see the notebook sketch below)
• Python/Scala + SQL
• Scala + SQL + R
• R + SQL
• Empowering less experienced analysts/engineers/users
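As a concrete illustration, here is a minimal sketch of one notebook handing data from Python to SQL through a temp view; the table and column names are hypothetical. Databricks exports notebooks in this form as Python source, with cells separated by COMMAND markers and non-default languages prefixed by MAGIC comments.

# Databricks notebook source
# Engineers prepare the data with PySpark and expose it as a temp view
df = spark.read.table("edw.enrollments")  # hypothetical table name
df.createOrReplaceTempView("enrollments")

# COMMAND ----------

# MAGIC %sql
# MAGIC -- Analysts query the same view in plain SQL in the next cell
# MAGIC SELECT college, COUNT(*) AS active_students
# MAGIC FROM enrollments
# MAGIC GROUP BY college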
16. CICD Your Way
• No defined way to do CICD
• Dependent on architecture
• Protecting production
• Creating production workspace folders
• Limiting permissions for users
• Git Integration in notebooks
• Empowering users to push their own code
• Utilize Projects
• Now called Repos API
• https://docs.databricks.com/repos.html
Folder permissions: https://docs.databricks.com/security/access-control/workspace-acl.html
Git Integration
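For the folder-permissions piece, here is a minimal sketch of locking down a production workspace folder through the workspace and Permissions APIs; the host, token, folder path, and group names are all placeholders, not our real values.

import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "..."  # admin personal access token, e.g. read from an env var

# Look up the production folder's object id, then restrict it so ordinary
# users can only read it while engineers retain manage rights
status = requests.get(
    f"{HOST}/api/2.0/workspace/get-status",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": "/Production"},  # hypothetical folder
).json()

requests.patch(
    f"{HOST}/api/2.0/permissions/directories/{status['object_id']}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "access_control_list": [
            {"group_name": "analysts", "permission_level": "CAN_READ"},
            {"group_name": "data-engineers", "permission_level": "CAN_MANAGE"},
        ]
    },
)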
17. Pipeline POC
GitHub / Databricks Git:
▪ Users develop code
▪ Code is pushed using Git integration
▪ Pull request is created
▪ Approvals by other users/managers
AWS CodePipeline:
▪ CodePipeline picks up the repository modification
▪ Lambda executes a Projects (now Repos) API call to update the repository (sketch below)
Databricks Repos:
▪ Workspace repo is updated, and the already-linked job is now up to date
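A minimal sketch of that Lambda step, assuming the workspace host, token, and repo ID are configured as environment variables on the function (all hypothetical); it asks the Repos API to pull the workspace repo up to the latest main.

import json
import os
import urllib.request

def lambda_handler(event, context):
    # Tell Databricks to fast-forward the workspace repo to main
    req = urllib.request.Request(
        f"{os.environ['DATABRICKS_HOST']}/api/2.0/repos/{os.environ['REPO_ID']}",
        data=json.dumps({"branch": "main"}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
    with urllib.request.urlopen(req) as resp:
        return {"statusCode": resp.status}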
18. Reduce, Reuse, Recycle
• Write code once and share
• Utilizing the Databricks %run command (see the sketch after this list)
• Enterprise use functions
• JDBC strings
• Secret retrieval
• Common transformations
• Notebook splitting based on functionality
• Core/Master (main notebook)
• Operations (functions/method definitions)
• Configuration (properties definitions)
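A sketch of what the core/master notebook can look like under this split; the paths and helper names (readDBProperties, readJDBCTable) are illustrative, not our actual library.

# Databricks notebook source
# Core notebook: pull shared definitions in with %run, then use them directly

# MAGIC %run /Shared/common/configuration

# COMMAND ----------

# MAGIC %run /Shared/common/operations

# COMMAND ----------

# readDBProperties and readJDBCTable are defined in the notebooks above
props = readDBProperties("research")
df = readJDBCTable("analytics.students", props)
df.createOrReplaceTempView("students")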
20. Managing Your Costs
§ Job/Cluster management
▪ Understanding your job requirements
▪ Understanding cluster costs
▪ Use dashboarding to visualize that cost (see the sketch below)
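A minimal sketch of the kind of aggregation that can feed such a dashboard, assuming billable usage CSVs are delivered to an S3 bucket; the bucket path, column names, and target table all depend on your usage-delivery configuration and are assumptions here.

# Read the delivered usage logs (hypothetical bucket path)
usage = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("s3://wgu-usage-logs/billable-usage/csv/"))

# Roll DBUs up by SKU and cluster, then persist for Tableau to read
dbu_totals = usage.groupBy("sku", "clusterName").sum("dbus")
dbu_totals.write.format("delta").mode("overwrite").saveAsTable("admin.dbu_totals")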
21. Monitoring Suggestions
▪ Use the right cluster for the job
▪ What is the main purpose of the cluster?
▪ Need for memory
▪ Need for processors
▪ Need for both
▪ Need for ML capabilities
▪ Use the Ganglia UI
▪ What is your job doing?
▪ What are its requirements?
▪ Frequency
▪ Completion times (see the Jobs API sketch below)
▪ Data sizes
▪ Test, test, test -- Using Ganglia UI to compare
• Job Management
• Cluster Management
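For frequency and completion times, here is a sketch of pulling a job's recent runs through the Jobs API; the host, token, and job ID are placeholders.

import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
TOKEN = "..."  # e.g. read from an environment variable

runs = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"job_id": 123, "limit": 25},  # hypothetical job id
).json().get("runs", [])

# Print the duration of each completed run (timestamps are epoch millis)
for r in runs:
    if r.get("end_time"):
        minutes = (r["end_time"] - r["start_time"]) / 60000
        print(r["run_id"], r["state"].get("result_state"), f"{minutes:.1f} min")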
23. Dashboard Examples
• Job Monitoring in Tableau
• Job successes, failures, and skips in the past 24 hours
• Failures in past 14 days
• Most recent failure explanations
• Stale table lists
• Usage Monitoring in Tableau
• Split between business units
• Allowing managers to see cluster costs per month from both AWS and Databricks
• Showing the top 10 highest-costing jobs
25. Managing Your Users
§ User groups / permissions
§ Cluster segregation
§ Leveraging secrets
§ Training / Best Practices
26. Databricks Groups & Permissions
• Utilize Databricks groups to separate by business unit or by user function
• Data Engineers
• Analysts
• Job Users
• Admins
• Utilize cloud permissions to limit access
• IAM roles assumed by clusters using instance profiles (see the sketch below)
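A sketch of tying the two together: creating a shared cluster that assumes a limited IAM role through an instance profile, so cloud permissions bound what the cluster can reach. The ARN, names, and versions below are illustrative.

import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
TOKEN = "..."

requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "analysts-shared",
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "autoscale": {"min_workers": 1, "max_workers": 4},
        # Hypothetical role limited to the buckets this team may read
        "aws_attributes": {
            "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/analysts-readonly"
        },
    },
)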
27. Cluster Segregation
• Would you trust 120+ users to manage their own clusters?
• Possibilities with Cluster Policies
• Setting up team-based shared clusters:
• Ad Hoc
• Machine Learning
• ETL
• All jobs run on automated clusters
• Jobs are owned by, and cost is assigned to, each business unit
• Cluster restrictions managed by cluster policies (see the sketch below)
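A sketch of such a policy, created through the Cluster Policies API; the node types and limits are illustrative, not our real restrictions.

import json
import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
TOKEN = "..."

# Cap node types and size, and force auto-termination, for ad hoc clusters
definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 60},
}

requests.post(
    f"{HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "adhoc-analysts", "definition": json.dumps(definition)},
)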
28. Databricks Secrets
• User retrieval without physical access
• Permissions to access scopes
• Return secrets through functions
• Example returning database credentials:
readDBProperties("research")
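A minimal sketch of what a helper like this could look like, assuming one secret scope per database with jdbc-url/username/password keys (scope and key names are hypothetical):

def readDBProperties(scope: str) -> dict:
    # Values come from a Databricks secret scope and are redacted
    # if a user tries to print them in a notebook
    return {
        "url": dbutils.secrets.get(scope=scope, key="jdbc-url"),
        "user": dbutils.secrets.get(scope=scope, key="username"),
        "password": dbutils.secrets.get(scope=scope, key="password"),
    }

props = readDBProperties("research")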
29. Training & Best Practices
• It is hard to train hundreds of users on a new product
• Let your users learn, and train them on how you want the product to be used
• Utilize Databricks Academy for new hires / users
• Monthly "tech talks" going over best practices or new features
• Open weekly office hours assisting engineers and analysts with their code and general questions
30. Understanding and accomplishing just half of these challenges prior to releasing enterprise-wide could have saved a year's worth of work and tens of thousands of dollars, and given us a more secure, more efficient operating environment.
31. Q & A
email: jake.kulas@wgu.edu
linkedin: https://www.linkedin.com/in/jakekulas/