Scalding by Adform Research, Alex Gryzlov


Vasil Remeniuk


Cascading
Tap / Pipe / Sink abstraction over Map / Reduce in Java

Scalding
• Scala wrapper for Cascading
• Just like working with in-memory collections (map/filter/sort…)
• Built-in parsers for {T|C}SV, date annotations, etc.
• Helper algorithms, e.g.
  • approximations (Algebird library)
  • matrix API
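
The "in-memory collections" feel can be illustrated with a word count in the Fields API. This is a minimal sketch modeled on the example in the Scalding README, not taken from these slides; it assumes scalding-core is on the classpath, and the WordCount helper object and the --input/--output argument names are illustrative.

```scala
import com.twitter.scalding._

object WordCount {
  // Lower-cases the text, strips everything except letters, digits and
  // whitespace, then splits on runs of whitespace.
  def tokenize(text: String): List[String] =
    text.toLowerCase
      .replaceAll("[^a-z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toList
}

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                 // each input line arrives in the 'line field
    .flatMap('line -> 'word) { line: String => WordCount.tokenize(line) }
    .groupBy('word) { _.size }            // one output row per word, with its count
    .write(Tsv(args("output")))           // tab-separated: word, count
}
```

The pipeline reads like flatMap and groupBy on a local collection, while Scalding plans it as Cascading pipes underneath.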

Run the WordCountJob in local mode with the given input and output.
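
One way to start a job in local mode from Scala is to call the same entry class that hadoop jar uses. The sketch below assumes scalding-core and its Hadoop dependencies are on the classpath and that a job class named WordCountJob exists; the launcher object and the file names are placeholders.

```scala
import com.twitter.scalding.Tool

object RunWordCountLocally {
  // Tool expects the job class name first, followed by the mode flag
  // and the job's own parameters.
  def localArgs(jobClass: String, input: String, output: String): Array[String] =
    Array(jobClass, "--local", "--input", input, "--output", output)

  def main(args: Array[String]): Unit =
    // "WordCountJob", "input.txt" and "output.tsv" are placeholder names.
    Tool.main(localArgs("WordCountJob", "input.txt", "output.tsv"))
}
```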

Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10
• sbt s3-upload builds the jar and uploads it to S3

Running on EMR
• hadoop fs -get s3://dev-adform-test/madeup-job.jar job.jar
• hadoop jar job.jar
    com.twitter.scalding.Tool                (entry class)
    com.adform.dspr.MadeupJob                (Scalding job class)
    --hdfs                                   (run in HDFS mode)
    --logs s3://dev-adform-test/logs         (parameter)
    --meta s3://dev-adform-test/metadata     (parameter)
    --output s3://dev-adform-test/output     (parameter)
For more complicated workflows, use a workflow tool such as Oozie or Pentaho, or write a custom runner app; see https://gitz.adform.com/dco/dco-amazon-runner
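
A custom runner can be quite small. The following is a rough sketch, not the dco-amazon-runner linked above: it chains jobs through Hadoop's ToolRunner and stops at the first failure. It assumes scalding-core and Hadoop are on the classpath; SequentialRunner, FollowUpJob and the report path are hypothetical names.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding.Tool

object SequentialRunner {
  // Runs the steps in order; once a step returns a non-zero status,
  // the remaining steps are skipped and that status is returned.
  def runInOrder(steps: Seq[Array[String]])(runStep: Array[String] => Int): Int =
    steps.foldLeft(0) { (status, stepArgs) =>
      if (status != 0) status else runStep(stepArgs)
    }

  def main(args: Array[String]): Unit = {
    val steps = Seq(
      Array("com.adform.dspr.MadeupJob", "--hdfs",
        "--logs", "s3://dev-adform-test/logs",
        "--meta", "s3://dev-adform-test/metadata",
        "--output", "s3://dev-adform-test/output"),
      // Hypothetical second job that consumes the first job's output.
      Array("com.adform.dspr.FollowUpJob", "--hdfs",
        "--input", "s3://dev-adform-test/output",
        "--output", "s3://dev-adform-test/report")
    )
    // An exception thrown by a failing job also aborts the chain.
    val status = runInOrder(steps)(a => ToolRunner.run(new Configuration, new Tool, a))
    sys.exit(status)
  }
}
```

Passing the step runner as a function keeps the ordering logic independent of Hadoop, so it can be exercised without a cluster.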

Development
• Two APIs:
  • Fields – everything is a string
  • Typed – working with classes, e.g. Request/Transaction

Development
• Fields:
  • No need to parse columns
  • Redundancy
  • No IDE support like auto-completion
• Typed:
  • All benefits of types, esp. compile-time checking
  • More manual work with parsing
  • The API can sometimes be confusing (TypedPipe/Grouped/CoGrouped…)
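
The trade-off on the Typed side can be sketched as follows: the parsing is manual, but every later step is checked by the compiler. The slides name Request only as an example class, so its fields, the tab-separated input layout and the job itself are assumptions made for illustration; scalding-core is assumed to be on the classpath.

```scala
import com.twitter.scalding._

// Hypothetical record type; the real Request class is not shown in the slides.
case class Request(userId: String, url: String)

object Request {
  // Manual parsing: expects exactly two tab-separated columns,
  // anything else is treated as a malformed line.
  def parse(line: String): Option[Request] =
    line.split("\t", -1) match {
      case Array(userId, url) => Some(Request(userId, url))
      case _                  => None
    }
}

class RequestsPerUserJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))             // TypedPipe[String]
    .flatMap { line => Request.parse(line).toList }   // malformed lines are dropped
    .groupBy { request => request.userId }            // grouped by user id
    .size                                             // number of requests per user
    .write(TypedTsv[(String, Long)](args("output")))
}
```

A typo such as request.userID fails at compile time here, whereas a misspelled field symbol in the Fields API only surfaces when the job is planned or run.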

Downsides
• A lot of configuring and googling of random issues
• Scarce documentation; you have to read the source code and Stack Overflow
• IntelliJ is slow
• Boilerplate code for parsing data

Some tips
• In local mode you specify files as input/output; in HDFS mode you specify folders
• You can use the Hadoop API to read files from HDFS directly, but only on the submitting node, not in the pipeline
• As a workaround for the previous problem, you can use the distributed cache mechanism, but that only works on Hadoop 1 AFAIK
• The default memory limit per mapper/reducer is ~200 MB; it can be raised by overriding Job.config and adding "mapred.child.java.opts" -> "-Xmx<NUMBER>m"
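
The memory tip might look like this in a job class. This is a sketch under assumptions: it presumes a Scalding version in which Job.config takes no parameters and returns Map[AnyRef, AnyRef] (older releases declare it differently), and the --heap-mb argument, the JobMemory helper and the pass-through pipeline are invented for illustration.

```scala
import com.twitter.scalding._

object JobMemory {
  // Builds the JVM option string for child tasks, e.g. "-Xmx1024m".
  def childOpts(heapMb: Int): String = s"-Xmx${heapMb}m"
}

class MemoryHungryJob(args: Args) extends Job(args) {
  // Hypothetical --heap-mb argument; falls back to 1024 MB when absent.
  private val heapMb = args.getOrElse("heap-mb", "1024").toInt

  // Adds the child JVM options on top of Scalding's default job config.
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map("mapred.child.java.opts" -> JobMemory.childOpts(heapMb))

  // Placeholder pipeline: copies the input lines to the output.
  TextLine(args("input")).read.write(Tsv(args("output")))
}
```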

Resources
• https://github.com/twitter/scalding/wiki Wiki
• https://github.com/twitter/scalding/tree/develop/tutorial Basic stuff
• https://github.com/twitter/scalding/tree/develop/scalding-core/src/main/scala/com/twitter/scalding/examples Advanced examples, e.g., iterative jobs
• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation
• http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf
• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014
