SlideShare a Scribd company logo
1 of 19
11
Headline Goes Here
Speaker Name or Subhead Goes Here
Cloudera Developer Kit:
Hadoop Application Development Made Easier
E. Sammer | Engineering Manager
May 2013
22
“[I]t’s not enough to just build a
scalable and stable system; the system
also has to be easy enough for
thousands of internal developers of all
types and all skill levels to use.”
http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/
3
Hadoop is incredibly powerful
3
4
Hadoop is incredibly flexible
4
5
Hadoop is incredibly low-level
5
6
Hadoop is incredibly complex
6
7
A typical system (zoom 100:1)
7
8
A typical system (zoom 10:1)
8
9
A typical system (zoom 5:1)
9
10
What you actually care about
Getting data from A to B
Using it later
10
11
Infrastructure details
Serialization, file formats, and compression
Metadata capture and maintenance
Dataset organization and partitioning
Durability and delivery guarantees
Well-defined failure semantics
Performance and health instrumentation
11
12
Cloudera Development Kit
Make Hadoop accessible to the enterprise developer
Codify expert patterns and practices
Make the “right thing” easy and obvious
Address the most common cases
Let developers focus on business logical, not infrastructure
12
13
Cloudera Development Kit
An open source set of libraries, guides, and examples for
building data-oriented systems and applications
Provides higher level APIs atop existing components of CDH
Supports piecemeal adoption via loosely coupled modules
13
14
CDK Data Module
High level APIs for interacting with datasets in HDFS
Configuration-based format and schema management
Consistent data model and serialization semantics
Metadata system integration and support
Automatic dataset partitioning and file management
14
1515
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
.fileSystem(FileSystem.get(new Configuration()))
.directory(new Path(“/data”))
.get();
Dataset events = repo.create(“events”,
new DatasetDescriptor.Builder()
.schema(new File(“event.avsc”))
.partitionStrategy(
new PartitionStrategy.Builder().hash(“userId”, 53).get()
).get()
);
DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
new GenericRecordBuilder(schema)
.set(“userId”, 1)
.set(“timeStamp”, System.currentTimeMillis())
.build()
);
writer.close();
/data
/events
/.metadata
/schema.avsc
/descriptor.properties
/userId=0
/10000000.avro
/10000001.avro
/userId=1
/20000000.avro
/userId=2
/30000000.avro
Code
Data
16
Under development
Configuration-based record transformation and filtering engine
Data pipeline deployment, discovery, and management
Working with customers, partners, and the community on new
modules and features
16
17
Getting started
CDK code repo: github.com/cloudera/cdk
CDK example repo: github.com/cloudera/cdk-examples
Binary artifacts available from Cloudera’s Maven repository
Mailing list: groups.google.com/a/cloudera.org/d/forum/cdk-dev
17
18
• Submit questions in the Q&A panel
• Watch this webinar on-demand at
http://cloudera.com
• Follow Cloudera @Cloudera
• Follow Cloudera Engineering
@ClouderaEng
• Thank you for attending!
Learn more about the CDK
http://cloudera.com/cdk
CDK on GitHub
http://cloudera.github.io/cdk/docs/0.
2.0/
1919

More Related Content

Viewers also liked

A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013
Nathan Bijnens
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
DataWorks Summit
 
How to Design a Quality system that meets compliance requirements 2014
How to Design a Quality system that meets compliance requirements 2014How to Design a Quality system that meets compliance requirements 2014
How to Design a Quality system that meets compliance requirements 2014
Gilead Sciences
 

Viewers also liked (19)

Intro to Pig UDF
Intro to Pig UDFIntro to Pig UDF
Intro to Pig UDF
 
A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013A real time architecture using Hadoop and Storm @ FOSDEM 2013
A real time architecture using Hadoop and Storm @ FOSDEM 2013
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Inside hadoop-dev
Inside hadoop-devInside hadoop-dev
Inside hadoop-dev
 
Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)
 
Data protection for hadoop environments
Data protection for hadoop environmentsData protection for hadoop environments
Data protection for hadoop environments
 
Importance of Big data for your Business
Importance of Big data for your BusinessImportance of Big data for your Business
Importance of Big data for your Business
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
TQM- Quality management system
TQM- Quality management systemTQM- Quality management system
TQM- Quality management system
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
How to Design a Quality system that meets compliance requirements 2014
How to Design a Quality system that meets compliance requirements 2014How to Design a Quality system that meets compliance requirements 2014
How to Design a Quality system that meets compliance requirements 2014
 
Quality management
Quality managementQuality management
Quality management
 
Quality Management System
Quality Management SystemQuality Management System
Quality Management System
 
ISO 9000
ISO 9000ISO 9000
ISO 9000
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

More from Cloudera, Inc.

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Cloudera Development Kit (CDK): Hadoop Application Development Made Easier

  • 1. 11 Headline Goes Here Speaker Name or Subhead Goes Here Cloudera Developer Kit: Hadoop Application Development Made Easier E. Sammer | Engineering Manager May 2013
  • 2. 22 “[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.” http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/
  • 7. 7 A typical system (zoom 100:1) 7
  • 8. 8 A typical system (zoom 10:1) 8
  • 9. 9 A typical system (zoom 5:1) 9
  • 10. 10 What you actually care about Getting data from A to B Using it later 10
  • 11. 11 Infrastructure details Serialization, file formats, and compression Metadata capture and maintenance Dataset organization and partitioning Durability and delivery guarantees Well-defined failure semantics Performance and health instrumentation 11
  • 12. 12 Cloudera Development Kit Make Hadoop accessible to the enterprise developer Codify expert patterns and practices Make the “right thing” easy and obvious Address the most common cases Let developers focus on business logical, not infrastructure 12
  • 13. 13 Cloudera Development Kit An open source set of libraries, guides, and examples for building data-oriented systems and applications Provides higher level APIs atop existing components of CDH Supports piecemeal adoption via loosely coupled modules 13
  • 14. 14 CDK Data Module High level APIs for interacting with datasets in HDFS Configuration-based format and schema management Consistent data model and serialization semantics Metadata system integration and support Automatic dataset partitioning and file management 14
  • 15. 1515 DatasetRepository repo = new FileSystemDatasetRepository.Builder() .fileSystem(FileSystem.get(new Configuration())) .directory(new Path(“/data”)) .get(); Dataset events = repo.create(“events”, new DatasetDescriptor.Builder() .schema(new File(“event.avsc”)) .partitionStrategy( new PartitionStrategy.Builder().hash(“userId”, 53).get() ).get() ); DatasetWriter<GenericRecord> writer = events.getWriter(); writer.open(); writer.write( new GenericRecordBuilder(schema) .set(“userId”, 1) .set(“timeStamp”, System.currentTimeMillis()) .build() ); writer.close(); /data /events /.metadata /schema.avsc /descriptor.properties /userId=0 /10000000.avro /10000001.avro /userId=1 /20000000.avro /userId=2 /30000000.avro Code Data
  • 16. 16 Under development Configuration-based record transformation and filtering engine Data pipeline deployment, discovery, and management Working with customers, partners, and the community on new modules and features 16
  • 17. 17 Getting started CDK code repo: github.com/cloudera/cdk CDK example repo: github.com/cloudera/cdk-examples Binary artifacts available from Cloudera’s Maven repository Mailing list: groups.google.com/a/cloudera.org/d/forum/cdk-dev 17
  • 18. 18 • Submit questions in the Q&A panel • Watch this webinar on-demand at http://cloudera.com • Follow Cloudera @Cloudera • Follow Cloudera Engineering @ClouderaEng • Thank you for attending! Learn more about the CDK http://cloudera.com/cdk CDK on GitHub http://cloudera.github.io/cdk/docs/0. 2.0/
  • 19. 1919