6. we might be a wee bit ambitious…
[diagram: an 8-hour workshop = intro to Spark + cluster computing apps; tutorials + banging head against keyboard + personal projects + today } professional experience]
7. why do we care about spark?
what we data people do all day:
• lots of data collection, curation + storage
• lots and lots of data engineering
• product development with machine learning algorithms!
[diagram: the data science pipeline: ideation → exploration → experimentation → productization]
8. what is spark?
• “fast and general-purpose cluster computing system”
• advanced cyclic data flow and in-memory computing
> runs 10x-100x faster than Hadoop MR
• interactive shells in several languages (incl. SQL)
• performant + scalable
courtesy databricks
WARNING: THINGS CHANGE IN SPARK ALL THE TIME. SOME
THINGS MIGHT BE HIDDEN OR NO LONGER ACCESSIBLE,
LIKE SPARK.UNICORNS()
9. n.b.> it will not solve every problem for everyone
• not an all-in-one cluster management + admin tool; it relies on other
resource managers (YARN, Mesos, Amazon EC2)
• fast-moving releases (a major release roughly every 3 months)
sometimes require additional work for backwards compatibility
• for small and medium sized data: not necessary for performant
analysis, data science + ML apps
• steep learning curve for designing cluster applications @ scale
what is spark?
11. • multi-language APIs give many different users the
ability to work with Spark
• a gateway into Spark, but you must still run Spark!
• current languages supported (with various levels of
depth): Scala, Python, Java, R
• moving beyond the shell + text edit
12. MLlib + GraphX on Apache Spark
• Classification and regression: linear models (SVMs, logistic regression,
linear regression), naive Bayes, decision trees, ensembles of trees
(Random Forests and Gradient-Boosted Trees), isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering (PIC),
latent Dirichlet allocation (LDA), streaming k-means
• Dimensionality reduction: singular value decomposition (SVD),
principal component analysis (PCA)
• Optimization (developer): stochastic gradient descent,
limited-memory BFGS (L-BFGS)
• Graph analytics
• how can we continue to approach every data science
product with scale + performance as top priority?
v1.3 -> v1.6
13. • Spark SQL allows you to query structured data in Spark programs,
using either SQL or the DataFrames API
• can be used in applications + iterative workflows, from a shell or a notebook
• the DataFrames API is conceptually similar to a table in a relational
database or a data frame in R/Python
• preserves schema of original data for many file formats, including Parquet
• highly optimized, distributed collection of data
• Datasets: experimental interface (Scala + Java)
14. • spark-packages is a hosted index of packages
developed by the Spark community
• extends functionality + integration options for
current Spark releases
• examples: spark-csv, spark-testing-base
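For example, community packages can be pulled in when launching a shell or job with `--packages` (a sketch; the spark-csv coordinates below are illustrative of the Spark 1.x era):

```shell
# fetch com.databricks:spark-csv and its dependencies at launch,
# making the CSV data source available inside the shell session
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
```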