
Sharing (or stealing) the jewels of python with big data & the jvm (1)




With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look to the Python world and ask “what else do we want to steal besides TensorFlow?”, or, as a Python developer, to ask “how can I get my code into production without it being rewritten into a mess of Java?”

Regardless of your side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let's learn how to share! In this brief talk we will look at sharing some of the wonders of spaCy with the Java world, which still has a somewhat lackluster set of options for NLP.


  1. 1. Stealing/Sharing the Jewels From Python w/Spark Guilty looking software cat goes here @holdenkarau Photo by Dean Wampler
  2. 2. Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC :) ● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● Co-author of Learning Spark & High Performance Spark ● @holdenkarau ● SlideShare: http://www.slideshare.net/hkarau ● LinkedIn: https://www.linkedin.com/in/holdenkarau ● GitHub: https://github.com/holdenk ● Spark videos: http://bit.ly/holdenSparkVideos ● Talk feedback: http://bit.ly/holdenTalkFeedback
  3. 3. Who I think you wonderful humans are? ● Nice enough people ● I’m sure you love pictures of cats ● Possibly know some Apache Spark ● Interested in stealing from Python (or getting your Python code into production faster) Lori Erickson
  4. 4. What will be covered? ● A quick look at the current state of PySpark ● Looking at how to reverse this ● Using Arrow for fast Python UDFs with Spark ● Reversing this again ● Beam outside the JVM ● Our even less subtle attempts to get you to buy my new book ● Pictures of cats & stuffed animals ● tl;dr - Java has poor NLP and limited DL options, but it doesn't matter, we can steal them from Python Photo by Dean Wampler
  5. 5. What’s the state of non-JVM big data? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? Dataframe Api + Arrow ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  6. 6. PySpark: ● The Python interface to Spark ● Fairly mature, integrates well-ish into the ecosystem, less of a Pythonrific API ● Has some serious performance hurdles from the design ● Same general technique used as the basis for the other non-JVM implementations in Spark ○ C# ○ R ○ Julia ○ Javascript - surprisingly different
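A minimal sketch of what that Python interface looks like in practice, just so the API being discussed is concrete (the input path and app name are illustrative placeholders, not from the deck):
      from operator import add
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("wordcount").getOrCreate()
      lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
      counts = (lines.flatMap(lambda line: line.split(" "))
                     .map(lambda word: (word, 1))
                     .reduceByKey(add))
      print(counts.take(10))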
  7. 7. A quick detour into PySpark’s internals [diagram: Py4J + pickling + JSON]
  8. 8. Spark in Scala, how does PySpark work? ● Py4J + pickling + JSON and magic ○ Py4j in the driver ○ Pipes to start python process from java exec ○ cloudPickle to serialize data between JVM and python executors (transmitted via sockets) ○ Json for dataframe schema ● Data from Spark worker serialized and piped to Python worker --> then piped back to jvm ○ Multiple iterator-to-iterator transformations are still pipelined :) ○ So serialization happens only once per stage ● Spark SQL (and DataFrames) avoid some of this kristin klein
  9. 9. So what does that look like? [diagram: the driver talks Py4J to the JVM, and data is piped between the JVM and Python workers 1 through K]
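To make the cloudpickle step from the previous slide concrete, here is a small standalone sketch (outside Spark) of what serializing a closure for shipment to a Python worker amounts to:
      import cloudpickle

      def make_adder(n):
          # cloudpickle, unlike the stock pickle module, can serialize closures
          # like this one, which is why PySpark uses it to ship functions around.
          return lambda x: x + n

      payload = cloudpickle.dumps(make_adder(10))   # bytes written to the pipe/socket
      add_ten = cloudpickle.loads(payload)          # reconstructed on the worker side
      assert add_ten(5) == 15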
  10. 10. Ok so how do we use this from the JVM? ● Dirty dirty tricks ● Launch Python from the JVM ● Instead of launching context.py launch our own special entry point (startup.py - I’m bad at names) ● Implement an interface matching a Python class we can call to register Python functions by string names ○ Optional: implement Scala classes for each of the Python classes. But that sounded like more work, so uhhh PRs welcome? ● Run it! Curse. Debug. Run it!
  11. 11. So what does that look like? [diagram: the JVM launches startup.py over shell/pipes and talks to it with Py4J, alongside Spark] Photo by Dean Wampler
  12. 12. So what does that look like? [diagram: the JVM requests a UDF, and startup.py registers one] Photo by Dean Wampler
  13. 13. So what does that look like? [diagram: Driver piped to Worker 1 through Worker K]
  14. 14. What is/why Sparkling ML ● A place for useful Spark ML pipeline stages to live ○ Including both feature transformers and estimators ● The why: Spark ML can’t keep up with every new algorithm ● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML or together. ● We make it easier to expose Python transformers into Scala land and vice versa. ● Our repo is at: https://github.com/sparklingpandas/sparklingml
  15. 15. So what goes in startup.py? ● A class for our Java code to call with parameters & request functions ● Code to take the Python UDFS and construct/return the underlying Java UDFS ● A main function to startup the Py4J gateway & Spark context to serialize our functions in the way that is expected ● Pretty much it’s just boilerplate but you can take a look if you want. Jennifer C.
  16. 16. So what goes in startup.py?
      # Py4J callback class: the nested Java class tells the gateway which JVM
      # interface this Python object implements, so the JVM side can call into it.
      class PythonRegistrationProvider(object):
          class Java:
              package = "com.sparklingpandas.sparklingml.util.python"
              className = "PythonRegisterationProvider"
              implements = [package + "." + className]
      Jennifer C.
  17. 17. So what goes in startup.py?
      def registerFunction(self, ssc, jsession, function_name, params):
          setup_spark_context_if_needed()
          if function_name in functions_info:
              function_info = functions_info[function_name]
              # params comes across the gateway as a string; evaluate it safely.
              evaledParams = ast.literal_eval(params)
              func = function_info.func(*evaledParams)
              # functions_info, ret_type, and the helper functions are defined
              # elsewhere in startup.py; this slide only shows the registration path.
              udf = UserDefinedFunction(func, ret_type, make_registration_name())
              # Hand the wrapped Java UDF back across Py4J to the JVM caller.
              return udf._judf
          else:
              print("Could not find function")
      Jennifer C.
  18. 18. What’s the boilerplate in Java? ● Call Python ● A trait representing the Python entry point ● Wrapping the UDFs in Spark ML stages (optional buuut nice?) ● Also kind of boring, it’s in a few files if you want to look.
  19. 19. Enough boilerplate: counting words! With Spacy, so you know more than English*
      def inner(inputString):
          # SpacyMagic is sparklingml's helper for loading the spaCy model for
          # the requested language (so it isn't re-loaded on every call).
          nlp = SpacyMagic.get(lang)
          def spacyTokenToDict(token):
              """Convert the input token into a dictionary"""
              # lookup_field_or_none and fields are defined elsewhere in the real
              # code; fields is the list of spaCy token attributes to keep.
              return dict(map(lookup_field_or_none, fields))
          return list(map(spacyTokenToDict, list(nlp(inputString))))
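For anyone who hasn't used spaCy, a tiny standalone sketch of the per-token attributes a UDF like the one above can extract (this assumes the small English model has been installed with `python -m spacy download en_core_web_sm`):
      import spacy

      nlp = spacy.load("en_core_web_sm")
      doc = nlp("Boo is a happy cat")
      for token in doc:
          # each token carries attributes such as text, lemma, and part of speech
          print(token.text, token.lemma_, token.pos_)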
  20. 20. And from the JVM:
      val transformer = new SpacyTokenizePython()
      transformer.setLang("en")
      val input = spark.createDataset(
        List(InputData("hi boo"), InputData("boo")))
      transformer.setInputCol("input")
      transformer.setOutputCol("output")
      val result = transformer.transform(input).collect()
      Alexy Khrabrov
  21. 21. Ok but now it’s kind of slow…. ● Well yeah ● Think back to that architecture diagram ● It’s not like a fast design ● We could try Jython?
  22. 22. *For a small price of your fun libraries. Bad idea.
  23. 23. That was a bad idea, buuut….. ● Work going on in Scala land to translate simple Scala into SQL expressions - need the Dataset API ○ Maybe we can try similar approaches with Python? ● POC: use Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369 ○ Early benchmarking w/word count 5% slower than native Scala UDF, close to 2x faster than regular Python ● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF *The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it's ok!
  24. 24. *Arrow: likely the future. I really hope so. Spark 2.3 and beyond! Photo by Andrew Skudder
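As a hedged illustration of what "Arrow in Spark 2.3" buys you on the Python side (the config key below is the Spark 2.3-era name; later versions rename it):
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("arrow-demo").getOrCreate()
      # Spark 2.3-era flag; newer releases call it spark.sql.execution.arrow.pyspark.enabled
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      df = spark.range(0, 1000000)
      # With Arrow on, toPandas() transfers columnar batches instead of pickling
      # rows one at a time, which is where the speedup comes from.
      pdf = df.toPandas()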
  25. 25. What does the future look like?* *Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html. *Vendor benchmark. Trust but verify.
  26. 26. What does the future look like - in code
      from pyspark.sql.functions import pandas_udf, PandasUDFType

      @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
      def normalize(pdf):
          # pdf is a pandas DataFrame for one group; return a DataFrame
          # matching the declared "id long, v double" schema.
          v = pdf.v
          return pdf.assign(v=(v - v.mean()) / v.std())
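A short usage sketch for the grouped-map UDF above (the DataFrame here is made up just so there are id and v columns, and spark is an existing session):
      df = spark.createDataFrame(
          [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
      # Each group's pandas DataFrame is handed to normalize() via Arrow.
      df.groupby("id").apply(normalize).show()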
  27. 27. And we can share this with Java...: With NLTK now! Sentiment is all the rage.
      def inner(input_series):
          from nltk.sentiment.vader import SentimentIntensityAnalyzer
          sid = SentimentIntensityAnalyzer()
          result = input_series.apply(
              lambda sentence: sid.polarity_scores(sentence)['pos'])
          return result
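A hedged sketch of how a Series-to-Series function like inner can be exposed as a vectorized (Arrow-backed) UDF on the plain PySpark side; this is not necessarily how sparklingml wires it up, the column name is illustrative, and the VADER lexicon needs a one-time nltk.download("vader_lexicon"):
      from pyspark.sql.functions import pandas_udf, PandasUDFType

      df = spark.createDataFrame([("Boo is happy",), ("Boo is sad",)], ["sentence"])
      positive_sentiment = pandas_udf(inner, "double", PandasUDFType.SCALAR)
      df.select(positive_sentiment(df.sentence).alias("pos")).show()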
  28. 28. And from the JVM:
      val transformer = new NltkPosPython()
      val input = spark.createDataset(
        List(InputData("Boo is happy"), InputData("Boo is sad")))
      transformer.setInputCol("input")
      transformer.setOutputCol("output")
      val result = transformer.transform(input).collect()
      result.size shouldBe 2
      result(0)(0) shouldBe "Boo is happy"
      result(0)(1) shouldBe 0.649
      Alexy Khrabrov
  29. 29. Everyone loves wordcount right? With Spacy now! Non-English language support!
      def inner(inputSeries):
          """Tokenize the input series using spaCy for the provided language."""
          nlp = SpacyMagic.get(lang)
          def tokenizeElem(elem):
              return list(map(lambda token: token.text, list(nlp(unicode(elem)))))
          return inputSeries.apply(tokenizeElem)
  30. 30. BEAM Beyond the JVM ● Non-JVM BEAM doesn’t work outside of Google’s environment yet, so I’m going to skip the details. ● tl;dr: uses grpc / protobuf ● But exciting new plans to unify the runners and ease the support of different languages (called SDKs) ○ See https://beam.apache.org/contribute/portability/ ● If this is exciting, you can come join me on making BEAM work in Python 3 ○ Yes we still don’t have that :( ○ But we're getting closer!
  31. 31. Why now? ● There’s been better formats/options for a long time ● JVM devs want to use libraries in other languages with lots of data ○ e.g. startup + Deep Learning + ? => profit ● Arrow has solved the chicken-egg problem by building not just the chicken & the egg, but also a hen house Andrew Mager
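A tiny standalone sketch of the Arrow piece itself: the same columnar representation can be handed between Python libraries, and via Arrow's Java implementation to the JVM, without converting row by row (the example DataFrame is made up):
      import pandas as pd
      import pyarrow as pa

      pdf = pd.DataFrame({"word": ["boo", "hi"], "count": [2, 1]})
      table = pa.Table.from_pandas(pdf)   # columnar Arrow representation
      print(table.schema)
      round_tripped = table.to_pandas()   # and back again, cheaply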
  32. 32. References ● Live Streaming of Working with Spark + Arrow: https://www.youtube.com/watch?v=EPvd5BhhevM&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw&index=5 ● Sparkling ML: https://github.com/sparklingpandas/sparklingml ● Apache Arrow: https://arrow.apache.org/ ● Brian (IBM) on initial Spark + Arrow: https://arrow.apache.org/blog/2017/07/26/spark-arrow/ ● Li Jin (Two Sigma): https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html ● Bill Maimone: https://blogs.nvidia.com/blog/2017/06/27/gpu-computation-visualization/
  33. 33. Learning Spark, Fast Data Processing with Spark (Out of Date), Fast Data Processing with Spark (2nd edition), Advanced Analytics with Spark, Spark in Action, High Performance Spark, Learning PySpark
  34. 34. High Performance Spark! You can buy it today! Only one chapter on non-JVM stuff, I’m sorry. Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  35. 35. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark I need to give a testing talk in a few months, help a “friend” out. Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF It’s performance review season, so help a friend out and fill out this survey with your talk feedback http://bit.ly/holdenTalkFeedback
  36. 36. Beyond wordcount: dependencies? ● Your machines probably already have pandas ○ But maybe an old version ● But they might not have “special_business_logic” ○ Very special business logic, no one wants to change fortran code*. ● Option 1: Talk to your vendor** ● Option 2: Try some sketchy open source software from a hack day ● We’re going to focus on option 2! *Because it’s perfect, it is fortran after all. ** I don’t like this option because the vendor I work for doesn’t have an answer.
  37. 37. coffee_boat to the rescue*
      # This is beta, be careful. It may screw up your venv
      !pip install --upgrade coffee_boat
      # Use the coffee boat
      from coffee_boat import Captain
      captain = Captain(accept_conda_license=True)
      captain.add_pip_packages("pyarrow", "edtf")
      captain.launch_ship()
      sc = SparkContext(master="yarn")
      # You can now use pyarrow & edtf
      captain.add_pip_packages("yourmagic")
      # You can now use your magic in transformations!
