Thanks for coming early!
Want to make clothes from code?
https://haute.codes
Want to hear about a KF book?
http://www.introtomlwithkubeflow.com
Teach kids Apache Spark?
http://distributedcomputing4kids.com
@holdenkarau
Starting to Contribute to
Apache Spark
Spark Summit EU 2019
I am on the PMC but this represents my own personal views
Who am I?
Holden
● Preferred pronouns: she/her
● Co-author of the Learning Spark & High Performance Spark books
● Spark PMC & Committer
● Twitter @holdenkarau
● Live stream code & reviews: http://bit.ly/holdenLiveOSS
● Spark Dev in the Bay Area (no longer @ Google)
What we are going to explore together!
Getting a change into Apache Spark & the components
involved:
● The current state of the Apache Spark dev community
● Reasons to contribute to Apache Spark
● Different ways to contribute
● Places to find things to contribute
● Tooling around code & doc contributions
Torsten Reuschling
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● May know some Apache Spark?
● Want to contribute to Apache Spark
Why I’m assuming you might want to contribute:
● Fix your own bugs/problems with Apache Spark
● Learn more about distributed systems (for fun or profit)
● Improve your Scala/Python/R/Java experience
● You <3 functional programming and want to trick more
people into using it
● “Credibility” of some vague type
● You just like hacking on random stuff and Spark seems
shiny
What’s the state of the Spark dev community?
● Really large number of contributors
● Active PMC members & committers are somewhat concentrated
○ Better than we used to be
● Also a lot of SF Bay Area - but certainly not exclusively
so
gigijin
How can we contribute to Spark?
● Direct code in the Apache Spark code base
● Code in packages built on top of Spark
● Code reviews
● Yak shaving (aka fixing things that Spark uses)
● Documentation improvements & examples
● Books, Talks, and Blogs
● Answering questions (mailing lists, stack overflow, etc.)
● Testing & Release Validation
Andrey
Which is right for you?
● Direct code in the Apache Spark code base
○ High visibility, some things can only really be done here
○ Can take a lot longer to get changes in
● Code in packages built on top of Spark
○ Really great for things like formats or standalone features
● Yak shaving (aka fixing things that Spark uses)
○ Super important to do sometimes - can take even longer to get in
romana klee
Which is right for you? (continued)
● Code reviews
○ High visibility to PMC, can be faster to get started, easier to
time-box
○ Less tracked in metrics
● Documentation improvements & examples
○ Lots of places to contribute - mixed visibility - large impact
● Advocacy: Books, Talks, and Blogs
○ Can be high visibility
romana klee
Testing/Release Validation
● Join the dev@ list and look for [VOTE] threads
○ Check and see if Spark deploys on your environment
○ If your application still works, or if we need to fix something
○ Great way to keep your Spark application working with less work
● Adding more automated tests is good too
○ Especially integration tests
● Check out release previews
○ Run mirrors of your production workloads if possible and compare the
results
○ The earlier we know the easier it is to improve
○ Even if we can't fix it, gives you a heads up on coming changes
Helping users
● Join the user@ list to answer people's questions
○ You'll probably want to make some filter rules so you see the
relevant ones
○ I tried this with ML once -- it didn't go great. Now I look for
specific Python questions.
● Contribute to docs (internal and external)
● Stack Overflow questions
● Blog posts
● Tools to explain errors
● Pay it forward
● Stream your experiences -- there is value in not being
alone
Mitchell Friedman
Contributing Code Directly to Spark
● Maybe we encountered a bug we want to fix
● Maybe we’ve got a feature we want to add
● Either way we should see if other people are doing it
● And if what we want to do is complex, it might be better
to find something simple to start with
● It’s dangerous to go alone - take this
http://spark.apache.org/contributing.html
Jon Nelson
The different pieces of Spark: 3+?
(Diagram of the main components:)
● Apache Spark “Core”
● SQL & DataFrames
● Streaming and Structured Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: Bagel & GraphX
● Spark ML and MLlib
● Community packages
● Cluster managers: Spark on YARN, Spark on Mesos, Spark on Kubernetes, Standalone Spark
Choosing a component?
● Core
○ Conservative to external changes, but biggest impact
● ML / MLlib
○ ML is the home of the future - you can improve existing algorithms -
new algorithms face uphill battle
● Structured Streaming
○ Current API is in a lot of flux so it is difficult for external
participation
● SQL
○ Lots of fun stuff - very active - I have limited personal experience
● Python / R
○ Improve coverage of current APIs, improve performance
Rikki's Refuge
Choosing a component? (cont)
● GraphX - See (external) GraphFrames instead
● Kubernetes
○ New, lots of active work and reviewers
● YARN
○ Old faithful, always needs a little work.
● Mesos
○ Needs some love, probably easy-ish-path to committer (still hard)
● Standalone
○ Not a lot going on
Rikki's Refuge
Onto JIRA - Issue tracking funtimes
● It’s like Bugzilla or FogBugz
● There is an Apache JIRA for many Apache projects
● You can (and should) sign up for an account
● All changes in Spark (now) require a JIRA
● https://www.youtube.com/watch?v=ca8n9uW3afg
● Check it out at:
○ https://issues.apache.org/jira/browse/SPARK
What can we do with ASF JIRA?
● Search for issues (remember to filter to Spark project)
● Create new issues
○ search first to see if someone else has reported it
● Comment on issues to let people know we are working on it
● Ask people for clarification or help
○ e.g. “Reading this I think you want the null values to be replaced by
a string when processing - is that correct?”
○ @mentions work here too
What can’t we do with ASF JIRA?
● Assign issues (to ourselves or other people)
○ In lieu of assigning we can “watch” & comment
● Post long design documents (create a Google Doc & link to
it from the JIRA)
● Tag issues
○ While we can add tags, they often get removed
Finding a good “starter” issue:
● https://issues.apache.org/jira/browse/SPARK
○ Has a starter issue tag, but inconsistently applied
● Instead read through and look for simple issues
● Pick something in the same component you eventually want to work in
● Look at the reporter and commenters - is there a committer or someone
whose name you recognize?
● Leave a comment that says you are going to start working on this
● Look for old issues that we couldn't fix because of API compatibility
Going beyond reported issues:
Read the user list & look for issues
Grep for TODO in components you are interested in (e.g. grep
-r TODO ./python/pyspark or grep -R TODO ./core/src)
Look between language APIs and see if anything is missing
that you think is interesting
Check deprecations (internal & external)
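The grep-for-TODO idea above can also be sketched in Python, which makes it easy to filter by file type and keep line numbers. This is a minimal sketch, not part of Spark's tooling: the function name `find_todos` and the default list of file extensions are assumptions made for illustration.

```python
import os


def find_todos(root, extensions=(".py", ".scala", ".java", ".R")):
    """Yield (path, line_number, text) for every line containing TODO under root.

    The extension list is an assumption; adjust it to the component you care about.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Skip files that are not source code in one of the chosen languages.
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if "TODO" in line:
                        yield path, number, line.strip()
```

Calling `find_todos("./python/pyspark")` from a Spark checkout and printing each tuple gives roughly the same list as the grep commands, one TODO per line.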
neko kabachi
While we are here: Bug Triage
● Add tags as you go
○ e.g. Found a good starter issue in another area? Tag it!
● Things that are questions in the bug tracker?
○ Redirect folks to the dev/user lists gently and helpfully
● Data correctness issues tagged as "minor"?
○ Help us avoid missing important issues with "blockers"
● Additional information required to be useful?
○ Let people know what would help the bug be more actionable
● Old issue - not sure if it's fixed?
○ Try and repro. A repro from a 2nd person is so valuable
● It's OK not to look at all of the issues
Carol VanHook
Finding SPIPs:
https://issues.apache.org/jira/browse/SPARK-24374?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22SPIP%22
Large pieces of work
Not the easiest to contribute to, but can see design
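The long link above is just a JIRA query (JQL) that has been URL-encoded. A minimal sketch of building such a link follows, so the query can be tweaked (for example, to a different status list). The helper name `spip_search_url` and the `/jira/issues/` search path are assumptions for illustration; the JQL text itself is the decoded form of the link above.

```python
from urllib.parse import urlencode

# Assumed search endpoint on the ASF JIRA instance (not taken from the slides).
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def spip_search_url(project="SPARK", statuses=("Open", "In Progress", "Reopened")):
    """Build a JIRA search URL for unresolved issues mentioning SPIP."""
    # Status names containing spaces must be quoted inside JQL.
    quoted = ", ".join(f'"{s}"' if " " in s else s for s in statuses)
    jql = f'project = {project} AND status in ({quoted}) AND text ~ "SPIP"'
    # urlencode percent-encodes the query so it is safe to paste into a browser.
    return JIRA_SEARCH + "?" + urlencode({"jql": jql})
```

The encoded output will not match the link above byte for byte (spaces and parentheses are escaped differently), but it decodes to the same query.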
Warrick Wynne
But before we get too far:
● Spark wishes to maintain compatibility between releases
● We're working on Spark 3.0 though, so this is the time to break
things
Meagan Fisher
Getting at the code: yay for GitHub :)
● https://github.com/apache/spark
● Make a fork of it
● Clone it locally
dougwoods
Building Spark
./build/sbt
or
./build/mvn
Working in Python? Make sure to build the package target so
your Python code will run :)
You can quickly verify build with the Spark Shell :)
Kara
What about documentation changes?
● Still use JIRAs to track
● We can’t edit the wiki :(
● But a lot of documentation lives in docs/*.md
Kreg Steppe
Building Spark’s docs
./docs/README.md has a lot of info - but quickly:
SKIP_API=1 jekyll build
SKIP_API=1 jekyll serve --watch
*Requires a recentish Jekyll - install instructions assume
Ruby 2.0 only; on Debian-based systems, s/gem/gem2.0/
Finding your way around the project
● Organized into sub-projects by directory
● IntelliJ is very popular with Spark developers
○ The free version is fine
● Some people like using Emacs + ENSIME or Magit too
● Language specific code is in each sub directory
Testing the issue
The spark-shell can often be a good way to verify the issue
reported in the JIRA is still occurring and come up with a
reasonable test.
Once you’ve got a handle on the issue in the spark-shell (or
if you decide to skip that step) check out
./[component]/src/test for Scala or doctests for Python
While we get our code working:
● Remember to follow the style guides
○ http://spark.apache.org/contributing.html#code-style-guide
● Please always add tests
○ For development we can run Scala tests with “sbt [module]/testOnly”
○ In python we can specify module with ./python/run-tests -m
● ./dev/lint-scala & ./dev/lint-python check for some style issues
● Changing the API? Make sure we pass MiMa, or update its exclusions!
○ Sometimes it’s OK to make breaking changes, and MiMa can be a bit
overzealous, so adding exceptions is common
A bit more on MiMa
● Spark wishes to maintain binary compatibility
○ in non-experimental components
○ 3.0 can be different
● MiMa exclusions can be added if we verify (and document
how we verified) the compatibility
● Often MiMa is a bit oversensitive, so don’t feel stressed
- feel free to ask for help if confused
Julie Krawczyk
Making the change:
No arguing about which editor please - kthnx
Making a doc change? Look inside docs/*.md
Making a code change? grep, IntelliJ, or GitHub's in-project
code search can all help you find what you're looking for.
Python API change parity update?
Yay! Let’s make a PR :)
● Push to your branch
● Visit GitHub
● Create PR (put JIRA name in title as well as component)
○ Components control where our PR shows up in
https://spark-prs.appspot.com/
● If you’ve been whitelisted tests will run
● Otherwise they will wait for someone to verify
● Tag it “WIP” if it’s a work in progress (but maybe wait)
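Putting the JIRA name and component in the title follows the form the Spark contributing guide describes, `[SPARK-xxxx][COMPONENT] Title`. Below is a small sketch of composing and checking such a title. The helper names, the regular expression, and the issue number 12345 used in the usage note are hypothetical, and the placement of `[WIP]` after the component tags is an assumption based on that guide.

```python
import re

# Hypothetical pattern for "[SPARK-<number>][TAG]...[TAG] summary".
TITLE_PATTERN = re.compile(r"^\[SPARK-(\d+)\]((?:\[[A-Za-z0-9 _-]+\])+) (.+)$")


def format_pr_title(jira_number, components, summary, wip=False):
    """Compose a PR title from a JIRA number, component names, and a summary."""
    tags = "".join(f"[{component.upper()}]" for component in components)
    if wip:
        # Assumption: the WIP marker goes after the component tags.
        tags += "[WIP]"
    return f"[SPARK-{jira_number}]{tags} {summary}"


def parse_pr_title(title):
    """Return (jira_number, tags, summary), or None if the title does not match."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return None
    number, tags, summary = match.groups()
    return int(number), re.findall(r"\[([^\]]+)\]", tags), summary
```

For example, `format_pr_title(12345, ["sql"], "Fix typo in docs")` produces `[SPARK-12345][SQL] Fix typo in docs`, and `parse_pr_title` splits that title back into its parts.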
[puamelia]
Code review time
● Note: this is after the pull request creation
● I believe code reviews should be done in the open
○ With an exception of when we are deciding if we want to try and
submit a change
○ Even then should have hopefully decided that back at the JIRA stage
● My personal beliefs & your org’s may not align
● If you have the time you can contribute by reviewing
others’ code too (please!)
Mitchell Joyce
And now onto the actual code review...
● Most often committers will review your code (eventually)
● Other people can help too
● People can be very busy (check the release schedule)
● If you don’t get traction try pinging people
○ Me ( @holdenkarau - I'm not an expert everywhere but I can look)
○ The author of the JIRA (even if not a committer)
○ The shepherd of the JIRA (if applicable)
○ The person who wrote the code you are changing (git blame)
○ Active committers for the component
Mitchell Joyce
What does the review look like?
● LGTM - Looks good to me
○ Individual thinks the code looks good - ready to merge (sometimes
LGTM pending tests or LGTM but check with @[name]).
● SGTM - Sounds good to me (normally in response to a
suggestion)
● Sometimes get sent back to the drawing board
● Not all PRs get in - it’s OK!
○ Don’t feel bad & don’t get discouraged.
● Mixture of in-line comments & general comments
● You can see some videos of my live reviews at
http://bit.ly/holdenLiveOSS
Phil Long
That’s a pretty standard small PR
● It took some time to get merged in
● It was fairly simple
● Review cycles are long - so move on to other things
● Only two reviewers
● Apache Spark Jenkins comments on build status :)
○ “Jenkins retest this please” is great
● Big PRs - like making PySpark pip installable can have >
10 reviewers and take a long time
● Sometimes it can be hard to find reviewers - tag your PRs
& ping people on GitHub
James Joel
Don’t get discouraged
David Martyn Hunt
It is normal to not get every pull request accepted
Sometimes other people will “scoop” you on your
pull request
Sometimes people will be super helpful with your
pull request
When things don't go well...
If you don’t hear anything there is a good chance it is a “soft no”
The community has been trying to get better at explicit “Won’t Fix” or saying no on PRs
If folks say "no" (explicitly or implicitly) it doesn't mean your idea isn't awesome
If your idea doesn't fit in Spark at present, see if you can make it as a library
If you can't make a library see what hooks Spark would need to make those libraries possible and
suggest them.
While we are waiting:
● Keep merging in master when we get out of sync
● If we don’t jenkins can’t run :(
● We get out of sync surprisingly quickly!
● If our pull request gets older than 30 days it might get
auto-closed
● If you don’t hear anything try pinging the dev list to
see if it's a “soft no” (and/or ping me :))
Moyan Brenn
Open source code reviews are like
Mermaid School
1) They help you grow your skills
2) Build on your existing skills (e.g. swimming or Scala)
3) You get better with time but you need to start
4) People (read: sometimes management*) don't
understand how they help you grow your skills and don't
want to pay for it
5) Coffee makes it better
Why the community needs you?
● Many projects suffer from maintainer burn out
○ Some of this comes from the pressure to review too much code
● Reviewing code is less “fun”
○ and with a largely fun-motivated work base
● Some projects are limited by reviewing, not coding
○ Spark has > 500 open PRs
● More diverse reviewers: more diverse solutions
● Experienced reviewers become blind to “the way it’s
always been done”
● Represent the user(s)
Jerry Lai
Rate of PRs / Reviews
Benefits you get from OSS reviews
● Grow skills
● See the world*
● Faster recognition
● Deeper integration in community
● The ability to contribute with fixed amounts of time
*Of open source & maybe the real world
See more of the world
● Starter issues are often designed to only touch a few
things
● Even moving beyond starter issues, there’s only so
many hours in the day and you can’t write everything
● Helps you gain a better understanding of the project as a
whole
● Lets you take skills between projects faster
○ Know what good Python looks like? Great, many projects need help
with that
Vania Rivalta
Possible Faster Recognition
● Generally more contributors than reviewers
● Reviewers stand out
● Reviews can be the difference between a contributor
and someone trusted to make their own changes to the
project
● Allows you to work with more people
Sham Hardy
Easier to control your time
● Getting code into large OSS projects can take lots of
time
● Want to contribute a new PR? You will often need to
shepherd a PR for an extended period of time
● “One more bug”
● With reviews: do what you can, but you don’t have to be
continuously responding to provide value
Rob Hill
Finding a good first PR to review
● Smaller PRs can be better
● Something you care about
● Often easier to be one of the early reviewers, so if it’s
late stage, stay away from it
● You can drill down by component in
https://spark-prs.appspot.com/
Doing that first review:
● Feel free to leave comments like
○ “I’m new to the project; reading this, I think its intention is X. Is that
correct? Maybe we could add a comment here”
○ Look for when changes are getting out of sync with docs “Can we
update the docs or create a follow up issue to do that?”
○ Style: Is there a style guide? Does this follow it? Does this follow
general “good” style?
○ Building: Does this build on your platform?
○ Look around for duplicated logic elsewhere in the codebase
○ Find the original author and ping them to take a look
● Get your IDE set up and jump to definition a lot
● Be prepared to look at the libraries' documentation
Communicate carefully please
● The internet is scary enough
● “This sucks” can be heartbreaking
● You don’t know how much time someone put in
● Make it clear you are new to the project (gives you
some more leeway) & sets expectations
● Understand folks can get defensive about designs:
sometimes it’s not worth the argument
● People are allowed to be wrong on the internet
● It’s ok to be scared
ivva
Phrasing matters a lot
● Instead of “This is slow”, try “Could we do this faster?”
● Instead of “This is hard to understand”, try “I'm confused, is it doing X & could we add a comment?”
● Instead of “This library sucks”, try “Have you looked at X?”
● Instead of “No one would ever use this”, try “What's the usage pattern?”
● Instead of “You're using this wrong”, try “X has problem Y, how about Z?”
OSS reviews videos (live & recorded):
https://www.youtube.com/user/holdenkarau
Depending on time we can do one now….
What about when we want to make big changes?
● Talk with the community
○ Developer mailing list dev@spark.apache.org
○ User mailing list user@spark.apache.org
● First change? Try and build some karma first
● Consider if it can be published as a spark-package
● Create a public design document (google doc normally)
● Be aware this will be somewhat of an uphill battle (I’m
sorry)
● You can look at SPIPs (Spark's version of PEPs)
How about yak shaving?
● Lots of areas need shaving
● JVM deps are easier to update, Python deps are not :(
● Things built on top are a great place to go yak shaving
○ Jupyter etc.
Jason Crane
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
You can buy it today! On the internet!
Cats love it*
*Or at least the box it comes in. If buying for a cat, get
print rather than e-book.
Sign up for the mailing list @
http://www.distributedcomputing4kids.com
Local to Amsterdam?
● I'll be back for ITNext at the end of the month
● Have spark/oss questions?
○ Let me know and we can set up office hours
● Also, know of any good Halloween parties?
○ I've got a cool costume but I'm told y'all don't normally celebrate :(
k thnx bye :)
If you care about Spark testing and don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
I want to give better talks and feedback is welcome:
http://bit.ly/holdenTalkFeedback

Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Último (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond

  • 1. Thanks for coming early! Want to make clothes from code? https://haute.codes Want to hear about a KF book? http://www.introtomlwithkubeflow.com Teach kids Apache Spark? http://distributedcomputing4kids.com
  • 2. @holdenkarau Starting to Contribute to Apache Spark Spark Summit EU 2019 I am on the PMC but this represents my own personal views
  • 3. @holdenkarau Who am I? Holden ● Preferred pronouns: she/her ● Co-author of the Learning Spark & High Performance Spark books ● Spark PMC & Committer ● Twitter @holdenkarau ● Live stream code & reviews: http://bit.ly/holdenLiveOSS ● Spark Dev in the bay area (no longer @ Google)
  • 5. @holdenkarau What we are going to explore together! Getting a change into Apache Spark & the components involved: ● The current state of the Apache Spark dev community ● Reason to contribute to Apache Spark ● Different ways to contribute ● Places to find things to contribute ● Tooling around code & doc contributions Torsten Reuschling
  • 6. @holdenkarau Who I think you wonderful humans are? ● Nice* people ● Don’t mind pictures of cats ● May know some Apache Spark? ● Want to contribute to Apache Spark
  • 7. @holdenkarau Why I’m assuming you might want to contribute: ● Fix your own bugs/problems with Apache Spark ● Learn more about distributed systems (for fun or profit) ● Improve your Scala/Python/R/Java experience ● You <3 functional programming and want to trick more people into using it ● “Credibility” of some vague type ● You just like hacking on random stuff and Spark seems shiny
  • 8. @holdenkarau What’s the state of the Spark dev community? ● Really large number of contributors ● Active PMC & Committers somewhat concentrated ○ Better than we used to be ● Also a lot of SF Bay Area - but certainly not exclusively so gigijin
  • 9. @holdenkarau How can we contribute to Spark? ● Direct code in the Apache Spark code base ● Code in packages built on top of Spark ● Code reviews ● Yak shaving (aka fixing things that Spark uses) ● Documentation improvements & examples ● Books, Talks, and Blogs ● Answering questions (mailing lists, stack overflow, etc.) ● Testing & Release Validation Andrey
  • 10. @holdenkarau Which is right for you? ● Direct code in the Apache Spark code base ○ High visibility, some things can only really be done here ○ Can take a lot longer to get changes in ● Code in packages built on top of Spark ○ Really great for things like formats or standalone features ● Yak shaving (aka fixing things that Spark uses) ○ Super important to do sometimes - can take even longer to get in romana klee
  • 11. @holdenkarau Which is right for you? (continued) ● Code reviews ○ High visibility to PMC, can be faster to get started, easier to time box ○ Less tracked in metrics ● Documentation improvements & examples ○ Lots of places to contribute - mixed visibility - large impact ● Advocacy: Books, Talks, and Blogs ○ Can be high visibility romana klee
  • 12. @holdenkarau Testing/Release Validation ● Join the dev@ list and look for [VOTE] threads ○ Check and see if Spark deploys on your environment ○ If your application still works, or if we need to fix something ○ Great way to keep your Spark application working with less work ● Adding more automated tests is good too ○ Especially integration tests ● Check out release previews ○ Run mirrors of your production workloads if possible and compare the results ○ The earlier we know the easier it is to improve ○ Even if we can't fix it, gives you a heads up on coming changes
  • 13. @holdenkarau Helping users ● Join the user@ list to answer people's questions ○ You'll probably want to make some filter rules so you see the relevant ones ○ I tried this with ML once -- it didn't go great. Now I look for specific Python questions. ● Contribute to docs (internal and external) ● Stack Overflow questions ● Blog posts ● Tools to explain errors ● Pay it forward ● Stream your experiences -- there is value in not being alone Mitchell Friedman
  • 14. @holdenkarau Contributing Code Directly to Spark ● Maybe we encountered a bug we want to fix ● Maybe we’ve got a feature we want to add ● Either way we should see if other people are doing it ● And if what we want to do is complex, it might be better to find something simple to start with ● It’s dangerous to go alone - take this http://spark.apache.org/contributing.html Jon Nelson
  • 15. @holdenkarau The different pieces of Spark: 3+? Apache Spark “Core”; SQL & DataFrames; Streaming; Language APIs (Scala, Java, Python, & R); Graph Tools; Spark ML; Bagel & GraphX; MLlib; Community Packages; Structured Streaming; Spark on YARN; Spark on Mesos; Spark on Kubernetes; Standalone Spark
  • 16. @holdenkarau Choosing a component? ● Core ○ Conservative to external changes, but biggest impact ● ML / MLlib ○ ML is the home of the future - you can improve existing algorithms - new algorithms face uphill battle ● Structured Streaming ○ Current API is in a lot of flux so it is difficult for external participation ● SQL ○ Lots of fun stuff - very active - I have limited personal experience ● Python / R ○ Improve coverage of current APIs, improve performance Rikki's Refuge
  • 17. @holdenkarau Choosing a component? (cont) ● GraphX - See (external) GraphFrames instead ● Kubernetes ○ New, lots of active work and reviewers ● YARN ○ Old faithful, always needs a little work. ● Mesos ○ Needs some love, probably easy-ish-path to committer (still hard) ● Standalone ○ Not a lot going on Rikki's Refuge
  • 18. @holdenkarau Onto JIRA - Issue tracking funtimes ● It’s like bugzilla or fog bugz ● There is an Apache JIRA for many Apache projects ● You can (and should) sign up for an account ● All changes in Spark (now) require a JIRA ● https://www.youtube.com/watch?v=ca8n9uW3afg ● Check it out at: ○ https://issues.apache.org/jira/browse/SPARK
  • 19. @holdenkarau What we can do with ASF JIRA? ● Search for issues (remember to filter to Spark project) ● Create new issues ○ search first to see if someone else has reported it ● Comment on issues to let people know we are working on it ● Ask people for clarification or help ○ e.g. “Reading this I think you want the null values to be replaced by a string when processing - is that correct?” ○ @mentions work here too
  • 20. @holdenkarau What can’t we do with ASF JIRA? ● Assign issues (to ourselves or other people) ○ In lieu of assigning we can “watch” & comment ● Post long design documents (create a Google Doc & link to it from the JIRA) ● Tag issues ○ While we can add tags, they often get removed
  • 22. @holdenkarau Finding a good “starter” issue: ● https://issues.apache.org/jira/browse/SPARK ○ Has a starter issue tag, but inconsistently applied ● Instead read through and look for simple issues ● Pick something in the same component you eventually want to work in ● Look at the reporter and commenters - is there a committer or someone whose name you recognize? ● Leave a comment that says you are going to start working on this ● Look for old issues that we couldn't fix because of API compatibility
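Since the starter tag is applied inconsistently, it can help to keep a saved search around and tweak it. A minimal sketch, assuming the label is spelled `starter` and that the component name (here `PySpark`) matches one in the SPARK project; both values and the helper name `starter_issue_url` are illustrative, not something the slides specify:

```python
from urllib.parse import urlencode

# Public search endpoint of the ASF JIRA instance.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def starter_issue_url(component=None, label="starter"):
    """Build a JIRA search URL for unresolved SPARK issues with a label.

    The label and component names are assumptions; adjust them to whatever
    the project currently uses.
    """
    clauses = [
        "project = SPARK",
        "resolution = Unresolved",
        f'labels = "{label}"',
    ]
    if component:
        clauses.append(f'component = "{component}"')
    jql = " AND ".join(clauses) + " ORDER BY updated DESC"
    # urlencode escapes the JQL so it is safe to paste into a browser.
    return JIRA_SEARCH + "?" + urlencode({"jql": jql})


if __name__ == "__main__":
    print(starter_issue_url(component="PySpark"))
```

Opening the printed URL in a browser shows the matching issues; dropping the label clause gives the broader list to read through by hand.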
  • 23. @holdenkarau Going beyond reported issues: ● Read the user list & look for issues ● Grep for TODO in components you are interested in (e.g. grep -r TODO ./python/pyspark or grep -R TODO ./core/src) ● Look between language APIs and see if anything is missing that you think is interesting ● Check deprecations (internal & external) neko kabachi
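The grep commands above list every TODO; a rough per-file count can make it easier to spot where the leftovers cluster. A small sketch of that idea, assuming it is run from the root of a Spark checkout (the function name `count_todos` and the extension list are illustrative choices):

```python
import os
from collections import Counter


def count_todos(root, extensions=(".py", ".scala", ".java", ".R")):
    """Return a Counter mapping file path -> number of lines containing TODO."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Ignore undecodable bytes so one odd file does not stop the scan.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                hits = sum(1 for line in handle if "TODO" in line)
            if hits:
                counts[path] = hits
    return counts


if __name__ == "__main__":
    # Assumes the current directory is the root of a Spark checkout.
    for path, hits in count_todos("./python/pyspark").most_common(10):
        print(f"{hits:4d}  {path}")
```

This is only a convenience on top of grep; reading the TODO text itself is still what tells you whether it is a reasonable first change.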
  • 24. @holdenkarau While we are here: Bug Triage ● Add tags as you go ○ e.g. Found a good starter issue in another area? Tag it! ● Things that are questions in the bug tracker? ○ Redirect folks to the dev/user lists gently and helpfully ● Data correctness issues tagged as "minor"? ○ Help us avoid missing important issues with "blockers" ● Additional information required to be useful? ○ Let people know what would help the bug be more actionable ● Old issue - not sure if it's fixed? ○ Try and repro. A repro from a 2nd person is so valuable ● It's OK not to look at all of the issues Carol VanHook
  • 27. @holdenkarau But before we get too far: ● Spark wishes to maintain compatibility between releases ● We're working on 3 though so this is the time to break things Meagan Fisher
  • 28. @holdenkarau Getting at the code: yay for GitHub :) ● https://github.com/apache/spark ● Make a fork of it ● Clone it locally dougwoods
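The fork-and-clone steps can be written down as a short command list. This sketch only prints the git commands rather than running them; it assumes the fork already exists on GitHub, and the user name and branch name are placeholders, not values from the talk:

```python
import shlex

UPSTREAM = "https://github.com/apache/spark.git"


def fork_setup_commands(github_user, branch):
    """Return git commands (as argv lists) for a typical fork-based setup."""
    fork = f"https://github.com/{github_user}/spark.git"
    return [
        # Clone your own fork; this creates a local "spark" directory.
        ["git", "clone", fork],
        # Track the Apache repository so the fork can be kept in sync.
        ["git", "-C", "spark", "remote", "add", "upstream", UPSTREAM],
        ["git", "-C", "spark", "fetch", "upstream"],
        # Start the work on a topic branch based on upstream master.
        ["git", "-C", "spark", "checkout", "-b", branch, "upstream/master"],
    ]


if __name__ == "__main__":
    # "your-github-user" and the branch name are hypothetical placeholders.
    for argv in fork_setup_commands("your-github-user", "SPARK-12345-fix-docs"):
        print(shlex.join(argv))
```

Keeping an `upstream` remote from the start makes it easier to merge in master later while a pull request is waiting for review.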
  • 30. @holdenkarau Building Spark ./build/sbt or ./build/mvn Working in Python? Make sure to build the package target so your Python code will run :) You can quickly verify build with the Spark Shell :) Kara
  • 31. @holdenkarau What about documentation changes? ● Still use JIRAs to track ● We can’t edit the wiki :( ● But a lot of documentation lives in docs/*.md Kreg Steppe
  • 32. @holdenkarau Building Spark’s docs ./docs/README.md has a lot of info - but quickly: SKIP_API=1 jekyll build SKIP_API=1 jekyll serve --watch *Requires a recentish jekyll - install instructions assume ruby2.0 only, on debian based s/gem/gem2.0/
  • 33. @holdenkarau Finding your way around the project ● Organized into sub-projects by directory ● IntelliJ is very popular with Spark developers ○ The free version is fine ● Some people like using emacs + ensime or magit too ● Language specific code is in each sub directory
  • 34. @holdenkarau Testing the issue The spark-shell can often be a good way to verify the issue reported in the JIRA is still occurring and come up with a reasonable test. Once you’ve got a handle on the issue in the spark-shell (or if you decide to skip that step) check out ./[component]/src/test for Scala or doctests for Python
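For readers who have not used doctests before, here is a minimal, Spark-free sketch of the pattern: the examples in the docstring double as documentation and as tests. The function `fill_missing` is a made-up illustration (loosely echoing the "replace null values with a string" JIRA comment earlier), not code from PySpark:

```python
import doctest


def fill_missing(values, default):
    """Replace None entries with a default string.

    >>> fill_missing(["a", None, "b"], "unknown")
    ['a', 'unknown', 'b']
    >>> fill_missing([], "unknown")
    []
    """
    return [default if value is None else value for value in values]


if __name__ == "__main__":
    # Runs every docstring example in this module and reports failures.
    failures, attempted = doctest.testmod()
    print(f"{attempted} examples run, {failures} failed")
```

A reproduction worked out interactively can often be pasted into a docstring in this form with little change, which is part of why it is a convenient way to pin down a reported bug.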
  • 35. @holdenkarau While we get our code working: ● Remember to follow the style guides ○ http://spark.apache.org/contributing.html#code-style-guide ● Please always add tests ○ For development we can run Scala tests with “sbt [module]/testOnly” ○ In Python we can specify the module with ./python/run-tests -m ● ./dev/lint-scala & ./dev/lint-python check for some style ● Changing the API? Make sure we pass or you update MiMa! ○ Sometimes it’s OK to make breaking changes, and MiMa can be a bit overzealous so adding exceptions is common
  • 36. @holdenkarau A bit more on MiMa ● Spark wishes to maintain binary compatibility ○ in non-experimental components ○ 3.0 can be different ● MiMa exclusions can be added if we verify (and document how we verified) the compatibility ● Often MiMa is a bit over sensitive so don’t feel stressed - feel free to ask for help if confused Julie Krawczyk
  • 37. @holdenkarau Making the change: No arguing about which editor please - kthnx Making a doc change? Look inside docs/*.md Making a code change? grep or intellij or github inside project codesearch can all help you find what you're looking for.
  • 39. @holdenkarau Yay! Let’s make a PR :) ● Push to your branch ● Visit GitHub ● Create PR (put JIRA name in title as well as component) ○ Components control where our PR shows up in https://spark-prs.appspot.com/ ● If you’ve been whitelisted tests will run ● Otherwise will wait for someone to verify ● Tag it “WIP” if it’s a work in progress (but maybe wait) [puamelia]
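To make the "JIRA name plus component in the title" point concrete, here is a small sketch that assembles a title in the `[SPARK-NNNN][COMPONENT] Summary` shape. The helper name `pr_title`, the placement of `[WIP]` after the component tags, and the example values are assumptions for illustration; check the contributing guide for the current convention:

```python
import re

# A SPARK JIRA id looks like SPARK-12345.
JIRA_ID = re.compile(r"^SPARK-\d+$")


def pr_title(jira_id, components, summary, wip=False):
    """Format a pull request title as [SPARK-NNNN][COMPONENT] Summary."""
    if not JIRA_ID.match(jira_id):
        raise ValueError(f"not a SPARK JIRA id: {jira_id!r}")
    tags = "".join(f"[{component.upper()}]" for component in components)
    if wip:
        # Assumed placement: the WIP marker follows the component tags.
        tags += "[WIP]"
    return f"[{jira_id}]{tags} {summary.strip()}"


if __name__ == "__main__":
    # SPARK-12345 is a placeholder id, not a real issue reference.
    print(pr_title("SPARK-12345", ["python"], "Fix typo in DataFrame docs"))
```

Getting the component tag right matters because, as noted above, it controls where the PR shows up for reviewers.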
  • 40. @holdenkarau Code review time ● Note: this is after the pull request creation ● I believe code reviews should be done in the open ○ With an exception of when we are deciding if we want to try and submit a change ○ Even then should have hopefully decided that back at the JIRA stage ● My personal beliefs & your org’s may not align ● If you have the time you can contribute by reviewing others’ code too (please!) Mitchell Joyce
  • 41. @holdenkarau And now onto the actual code review... ● Most often committers will review your code (eventually) ● Other people can help too ● People can be very busy (check the release schedule) ● If you don’t get traction try pinging people ○ Me ( @holdenkarau - I'm not an expert everywhere but I can look) ○ The author of the JIRA (even if not a committer) ○ The shepherd of the JIRA (if applicable) ○ The person who wrote the code you are changing (git blame) ○ Active committers for the component Mitchell Joyce
  • 42. @holdenkarau What does the review look like? ● LGTM - Looks good to me ○ Individual thinks the code looks good - ready to merge (sometimes LGTM pending tests or LGTM but check with @[name]). ● SGTM - Sounds good to me (normally in response to a suggestion) ● Sometimes get sent back to the drawing board ● Not all PRs get in - it’s OK! ○ Don’t feel bad & don’t get discouraged. ● Mixture of in-line comments & general comments ● You can see some videos of my live reviews at http://bit.ly/holdenLiveOSS Phil Long
  • 52. @holdenkarau That’s a pretty standard small PR ● It took some time to get merged in ● It was fairly simple ● Review cycles are long - so move on to other things ● Only two reviewers ● Apache Spark Jenkins comments on build status :) ○ “Jenkins retest this please” is great ● Big PRs - like making PySpark pip installable can have > 10 reviewers and take a long time ● Sometimes it can be hard to find reviewers - tag your PRs & ping people on github James Joel
  • 53. @holdenkarau Don’t get discouraged David Martyn Hunt It is normal to not get every pull request accepted Sometimes other people will “scoop” you on your pull request Sometimes people will be super helpful with your pull request
  • 54. @holdenkarau When things don't go well... If you don’t hear anything there is a good chance it is a “soft no” The community has been trying to get better at explicit “Won’t Fix” or saying no on PRs If folks say "no" (explicitly or implicitly) it doesn't mean your idea isn't awesome If your idea doesn't fit in Spark at present, see if you can make it as a library If you can't make a library see what hooks Spark would need to make those libraries possible and suggest them.
  • 55. @holdenkarau While we are waiting: ● Keep merging in master when we get out of sync ● If we don’t, Jenkins can’t run :( ● We get out of sync surprisingly quickly! ● If our pull request gets older than 30 days it might get auto-closed ● If you don’t hear anything try pinging the dev list to see if it's a “soft no” (and/or ping me :)) Moyan Brenn
  • 56. Open Source Code reviews are like Mermaid School 1) They help you grow your skills 2) Build on your existing skills (e.g. swimming or Scala) 3) You get better with time but you need to start 4) People (read sometimes management*) don't understand how they help you grow your skills and don't want to pay for it 5) Coffee makes it better
  • 57. Why the community needs you? ● Many projects suffer from maintainer burn out ○ Some of this comes from the pressure to review too much code ● Reviewing code is less “fun” ○ and with a largely fun motivated work base ● Some projects are limited by reviewers not coding ○ Spark has > 500 open PRs ● More diverse reviewers: more diverse solutions ● Experienced reviewers become blind to “the way it’s always been done” ● Represent the user(s) Jerry Lai
  • 58. Rate of PRs / Reviews
  • 59. Benefits you get from OSS reviews ● Grow skills ● See the world* ● Faster recognition ● Deeper integration in community ● The ability to contribute with fixed amounts of time *Of open source & maybe the real world
  • 60. See more of the world ● Starter issues are often designed to only touch a few things ● Even moving beyond starter issues, there’s only so many hours in the day and you can’t write everything ● Helps you gain a better understanding of the project as a whole ● Lets you take skills between projects faster ○ Know what good Python looks like? Great, many projects need help with that Vania Rivalta
  • 61. Possible Faster Recognition ● Generally more contributors than reviewers ● Reviewers stand out ● Reviews can be the difference between a contributor and someone trusted to make their own changes to the project ● Allows you to work with more people Sham Hardy
  • 62. Easier to control your time ● Getting code into large OSS projects can take lots of time ● Want to contribute a new PR? You will often need to shepherd a PR for an extended period of time ● “One more bug” ● With reviews: do what you can, but you don’t have to be continuously responding to provide value Rob Hill
  • 63. Finding a good first PR to review ● Smaller PRs can be better ● Something you care about ● Often easier to be one of the early reviewers, so if it’s late stage, stay away from it ● You can drill down by component in https://spark-prs.appspot.com/
  • 64. Doing that first review: ● Feel free to leave comments like ○ “I’m new to the project; reading this, I think its intention is X. Is that correct? Maybe we could add a comment here” ○ Look for when changes are getting out of sync with docs: “Can we update the docs or create a follow-up issue to do that?” ○ Style: Is there a style guide? Does this follow it? Does this follow general “good” style? ○ Building: Does this build on your platform? ○ Look around for duplicated logic elsewhere in the codebase ○ Find the original author and ping them to take a look ● Get your IDE set up and jump to definition a lot ● Be prepared to look at the libraries' documentation
  • 65. Communicate carefully please ● The internet is scary enough ● “This sucks” can be heartbreaking ● You don’t know how much time someone put in ● Make it clear you are new to the project (gives you some more leeway) & sets expectations ● Understand folks can get defensive about designs: sometimes it’s not worth the argument ● People are allowed to be wrong on the internet ● It’s ok to be scared ivva
  • 66. Phrasing matters a lot Instead of: ● This is slow ● This is hard to understand ● This library sucks ● No one would ever use this ● You're using this wrong Try: ● Could we do this faster? ● I'm confused, is it doing X & could we add a comment? ● Have you looked at X? ● What's the usage pattern? ● X has problem Y, how about Z?
  • 67. OSS reviews videos (live & recorded): https://www.youtube.com/user/holdenkarau Depending on time we can do one now….
  • 68. @holdenkarau What about when we want to make big changes? ● Talk with the community ○ Developer mailing list dev@spark.apache.org ○ User mailing list user@spark.apache.org ● First change? Try and build some karma first ● Consider if it can be published as a spark-package ● Create a public design document (Google Doc normally) ● Be aware this will be somewhat of an uphill battle (I’m sorry) ● You can look at SPIPs (Spark's version of PEPs)
  • 69. @holdenkarau How about yak shaving? ● Lots of areas need shaving ● JVM deps are easier to update, Python deps are not :( ● Things built on top are a great place to go yak shaving ○ Jupyter etc. Jason Crane
  • 70. @holdenkarau ● Learning Spark ● Fast Data Processing with Spark (Out of Date) ● Fast Data Processing with Spark (2nd edition) ● Advanced Analytics with Spark ● Spark in Action ● High Performance Spark ● Learning PySpark
  • 71. @holdenkarau High Performance Spark! You can buy it today! On the internet! Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 72. @holdenkarau Sign up for the mailing list @ http://www.distributedcomputing4kids.com
  • 73. @holdenkarau Local to Amsterdam? ● I'll be back for ITNext at the end of the month ● Have Spark/OSS questions? ○ Let me know and we can set up office hours ● Also, know of any good Halloween parties? ○ I've got a cool costume but I'm told y'all don't normally celebrate :(
  • 74. @holdenkarau k thnx bye :) ● If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark (will tweet results “eventually” @holdenkarau) ● Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF ● I want to give better talks and feedback is welcome: http://bit.ly/holdenTalkFeedback