SCALDING
Introduction and usage
What is Scalding?
• Scalding is a Scala-based API for MapReduce applications
• Scalding is built on top of Cascading
• Cascading is a flow-oriented processing framework that acts as an abstraction layer over MapReduce
What is Cascading?
• Cascading introduces the concept of source taps (input), sink taps (output), and pipes to connect them, essentially abstracting away the key/value scheme of MapReduce
• Within a pipe, users define the transformation of data by applying operations such as GroupBy, Every and others.
WordCount!
• WordCount in Cascading:
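The code from the original slide is not available here. As a rough sketch of what a Cascading word count tends to look like, the following builds the flow by constructing pipe objects one by one. It is written in Scala against Cascading's Java API and assumes Cascading 2.x on Hadoop; the object name, field names and argument order are illustrative choices, not taken from the slides.

```scala
import java.util.Properties

import cascading.flow.FlowDef
import cascading.flow.hadoop.HadoopFlowConnector
import cascading.operation.aggregator.Count
import cascading.operation.regex.RegexSplitGenerator
import cascading.pipe.{Each, Every, GroupBy, Pipe}
import cascading.property.AppProps
import cascading.scheme.hadoop.{TextDelimited, TextLine}
import cascading.tap.hadoop.Hfs
import cascading.tuple.Fields

// Hypothetical example object; assumes Cascading 2.x is on the classpath.
object CascadingWordCount {
  // Regex used to split a line into words.
  val SplitRegex = "\\s+"

  def main(args: Array[String]): Unit = {
    val properties = new Properties()
    AppProps.setApplicationJarClass(properties, getClass)

    // Source tap reads text lines, sink tap writes tab-separated word/count pairs.
    val source = new Hfs(new TextLine(new Fields("line")), args(0))
    val sink = new Hfs(new TextDelimited(new Fields("word", "count"), "\t"), args(1))

    // Each operation is an object wrapped around the previous pipe.
    var pipe: Pipe = new Pipe("wordcount")
    pipe = new Each(pipe, new Fields("line"),
      new RegexSplitGenerator(new Fields("word"), SplitRegex), Fields.RESULTS)
    pipe = new GroupBy(pipe, new Fields("word"))
    pipe = new Every(pipe, Fields.ALL, new Count(new Fields("count")), Fields.ALL)

    val flowDef = FlowDef.flowDef()
      .setName("wordcount")
      .addSource(pipe, source)
      .addTailSink(pipe, sink)

    new HadoopFlowConnector(properties).connect(flowDef).complete()
  }
}
```

Note how every step is an explicit object (Each, GroupBy, Every) with the operation embedded in it.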
In comes Scalding
• Scalding was created by Twitter, essentially as a Scala DSL for Cascading.
• The goal is to offer functions that operate on the data flow, as opposed to constructing objects with embedded operations
• Scalding applications feel and behave like scripts, ideally replacing Pig.
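For comparison, a word count in Scalding's Field API can be sketched as below. It follows the pattern of the word count example in the Scalding project; the helper object WordCountOps and the --input / --output argument names are assumptions made for this sketch.

```scala
import com.twitter.scalding._

// Pure helper kept outside the job so it can be tested without Hadoop.
object WordCountOps {
  def tokenize(line: String): List[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toList
}

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))            // source: provides a 'line field per input line
    .flatMap('line -> 'word) { line: String => WordCountOps.tokenize(line) }
    .groupBy('word) { _.size }       // adds a 'size field holding the count per word
    .write(Tsv(args("output")))      // sink: tab-separated output
}
```

The whole flow is a chain of function calls on the pipe, which is what gives Scalding jobs their script-like feel.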
Scalding APIs
• Scalding offers three different APIs:
  • Field API – a simple, abstracted, symbol-based, function-oriented API; the first choice for most use cases
  • Type-safe API – a lower-level, typed API with closer access to Cascading. This API is used for more complex inputs, such as Avro
  • Matrix API – lets you apply matrix and vector operations to pipes, though only for the types Int, Long and String (due to comparator operations)
• The Field and Typed APIs can be converted into one another and are designed to offer the same kinds of functions, i.e. (Field) Pipe instances convert to TypedPipe and vice versa.
Functions
• Scalding has map-like functions, such as:
  • map
  • flatMap
  • filter and filterNot
  • collect
• Grouping / joining functions:
  • groupBy
  • groupAll
  • join (left, right, outer, etc.)
• Reduce functions:
  • reduce (DUH!)
  • foldLeft
  • average, sum
Documentation: https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
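Most of these names mirror operations on Scala's standard collections. The sketch below is a local analogy on an in-memory List, not Scalding code: it only illustrates what each kind of function does to the data. The object and value names are made up for the example.

```scala
// Plain Scala collections, used here only as an analogy for the pipe functions.
object FunctionSemantics {
  val lines: List[String] = List("to be or not", "to be")

  // flatMap: one input element yields zero or more output elements
  val words: List[String] = lines.flatMap(_.split(" "))

  // map: exactly one output per input
  val lengths: List[Int] = words.map(_.length)

  // filter keeps matches, filterNot drops them
  val longWords: List[String] = words.filter(_.length > 2)
  val shortWords: List[String] = words.filterNot(_.length > 2)

  // collect: filter and map in one step via a partial function
  val shortUpper: List[String] = words.collect { case w if w.length == 2 => w.toUpperCase }

  // groupBy followed by a per-group aggregate
  val counts: Map[String, Int] = words.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // reduce and foldLeft both combine all values into one
  val totalChars: Int = lengths.reduce(_ + _)
  val totalViaFold: Int = lengths.foldLeft(0)(_ + _)
  val averageLength: Double = totalChars.toDouble / lengths.size
}
```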
Example – Field API
Simple map and filter with the Field API
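The slide's code is not reproduced here. A minimal sketch of a map and a filter in the Field API might look like this, assuming a tab-separated input with a name and an age column; the job name, field names and helper object are hypothetical.

```scala
import com.twitter.scalding._

// Pure helpers, testable without running a job.
object AdultFilterOps {
  def isAdult(age: Int): Boolean = age >= 18
  def normalize(name: String): String = name.trim.toLowerCase
}

class AdultUsersJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('name, 'age))   // name the two input columns
    .read
    .filter('age) { age: Int => AdultFilterOps.isAdult(age) }
    .map('name -> 'normalizedName) { name: String => AdultFilterOps.normalize(name) }
    .project(('normalizedName, 'age)) // keep only the fields to be written
    .write(Tsv(args("output")))
}
```

Fields are addressed by symbol, and the argument type of each function tells Scalding how to convert the field value.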
Example – Typed API
Simple mapping with Avro and the Typed API
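The original example used an Avro source, which is not reproduced here. The sketch below shows the same style of typed mapping with a TypedTsv source instead, which is a substitution made for this example; the Purchase case class, the job name and the argument names are hypothetical.

```scala
import com.twitter.scalding._

// Hypothetical record type for the example.
case class Purchase(user: String, amount: Double)

object PurchaseOps {
  def parse(row: (String, Double)): Purchase = Purchase(row._1, row._2)
}

class SpendPerUserJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, Double)](args("input")))
    .map(PurchaseOps.parse)          // tuples become typed records
    .filter(_.amount > 0.0)
    .map(p => (p.user, p.amount))    // key/value pairs for grouping
    .group
    .sum                             // total amount per user
    .toTypedPipe
    .write(TypedTsv[(String, Double)](args("output")))
}
```

Here the compiler checks field types at every step, at the cost of spelling out the tuple or record types.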
Example – Configuring and running
Configuration uses Hadoop
And the Job / ToolRunner scheme:
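As a sketch of both points: a job can add Hadoop properties by overriding config, and a main method can launch it through Hadoop's ToolRunner with Scalding's Tool class. The property value, class names and the toolArgs helper are assumptions for illustration, and the exact signature of config has varied between Scalding versions.

```scala
import com.twitter.scalding._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner

class ConfiguredJob(args: Args) extends Job(args) {
  // Extra Hadoop properties are merged into the job configuration.
  // The property chosen here is only an example.
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map("mapred.reduce.tasks" -> "4")

  TextLine(args("input")).read.write(Tsv(args("output")))
}

object JobRunner {
  // Scalding's Tool expects the job class name first, then the mode flag, then job arguments.
  def toolArgs(jobClass: Class[_], userArgs: Array[String]): Array[String] =
    Array(jobClass.getName, "--hdfs") ++ userArgs

  def main(args: Array[String]): Unit = {
    val exitCode = ToolRunner.run(new Configuration, new Tool, toolArgs(classOf[ConfiguredJob], args))
    sys.exit(exitCode)
  }
}
```

Replacing "--hdfs" with "--local" runs the same job in local mode, which is convenient for trying things out.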
Flow Listener
• You can monitor execution progress with Cascading flow listeners.
1. Define Scalding Stat objects (case classes for Hadoop counters)
2. Increment them within your operations by calling incBy(Int)
3. Implement the FlowListener interface and add your implementation to the job's listeners:
override def listeners = super.listeners ++ List(new MyFlowListener)
(MyFlowListener stands for your own FlowListener implementation.)
Example: Flow Listener
Accessing stats values:
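The slide code is not available here, so the following is only a sketch of the three steps, assuming a Scalding version where Stat can be created inside a Job and listeners takes no parameters. The counter group and name, the listener class and the job are all hypothetical.

```scala
import cascading.flow.{Flow, FlowListener}
import com.twitter.scalding._

object Counters {
  val Group = "example"           // hypothetical counter group
  val ShortLines = "short_lines"  // hypothetical counter name
  def isShort(line: String): Boolean = line.length < 10
}

// Reads the counter from the flow statistics once the flow has completed.
class CounterReportingListener extends FlowListener {
  override def onStarting(flow: Flow[_]): Unit = ()
  override def onStopping(flow: Flow[_]): Unit = ()
  override def onCompleted(flow: Flow[_]): Unit = {
    val value = flow.getFlowStats.getCounterValue(Counters.Group, Counters.ShortLines)
    println(s"${Counters.ShortLines} = $value")
  }
  // Returning false signals that the exception was not handled here.
  override def onThrowable(flow: Flow[_], throwable: Throwable): Boolean = false
}

class CountingJob(args: Args) extends Job(args) {
  // Step 1: define the counter.
  val shortLines = Stat(Counters.ShortLines, Counters.Group)

  // Step 3: register the listener with the job.
  override def listeners = super.listeners ++ List(new CounterReportingListener)

  TextLine(args("input"))
    .map('line -> 'length) { line: String =>
      // Step 2: increment inside an operation.
      if (Counters.isShort(line)) shortLines.incBy(1)
      line.length
    }
    .write(Tsv(args("output")))
}
```

The values are read through the flow's statistics object, so the same listener can report any counter whose group and name it knows.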
Resources
Scalding home and docs on Github:
https://github.com/twitter/scalding
Excellent intro and advanced topics:
http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014

Scalding intro 20141125
