Hadoop Real World: For S2DS Students
Sean Roberts — Partner Engineering, Hortonworks @seano
© Hortonworks Inc. 2015. All Rights Reserved
$ id seano
Sean Roberts
Partner Solutions Engineer
London & EMEA & everywhere
@seano
linkedin.com/in/seanorama
Louisiana. MacGyver. Cook. Autodidact. Volunteer.
Ancestral Health. Fito. Couchsurfer. Nomad.
I am your antithesis
Formal education ended at 17
self-taught. autodidact to the extreme.
Avoided Java my entire life
as do most systems engineers
Fail: Then joined the Hadoop company
where everything is Java!
Horrified by R, Matlab
But I do love Python and shell scripts!
Statistics
not once
Machine Learning, Predictive Analytics
I have no idea what I’m doing
Questions to you
Areas of Study?
Hadoop?
Data sets? How complex, mixed or big?
Side note & fun for me: Horton’s Gym
Heart rate monitor, Solr, Banana, Kafka, Storm, HBase, SparkML
Chief Data Officer’s Needs Application Team’s Response
Lorry/Truck Fleet Use Case
Tim Brady is a General
Manager at a major
energy company in the
forest city basin and
revenue has been a little
flat. Senior Management
has asked Tim to see
what he can do to contain
cost.
Mr Brady’s background in working with equipment has served him well
in his role overseeing the water hauling, pumper, and equipment trucks
at his company. However, despite the recent drop in gas prices, fuel
costs have continued to increase for the fleet of trucks that Mr Brady
oversees.
2012 2013 2014 2015
50% 60% 70% 80%
Senior management asked
Mr. Brady to explain the
cost increases and get them
under control as well as
look for opportunities to
grow revenue.
Insurance premiums and equipment outages have also increased
under Brady’s watch.
[Chart: Insurance Premiums and Equipment Outages by year, 2012 to 2015; only the 900K axis label survives]
At first, Mr. Brady feels deflated as
he thinks through the volume of
complex and varied data types that
he must analyze to answer the
questions posed by senior
management. In addition, Mr.
Brady realizes that whatever
system he chooses will have to
handle batch, interactive and real-
time processing.
Clickstream
Route Data As The Drivers
Choose Their Routes Through
Mapping Software
Sensor Data
Coming Off The Assets
Geolocation Data
Providing The Location of Assets
Web Data
Weather
Structured Data
Master Data on Drivers and
Assets
Unstructured Data
Asset Work Orders and Assets
New
Traditional
New Data Growth
Then Mr. Brady starts to get a
grip on the situation and
remembers a team he once
used to get him some data.
Tim reaches out to his team.
Jim
Business Analyst
Sue
Developer
Varun
System Admin
Maria
SME
Tim’s team has recently downloaded Hortonworks’ Sandbox from
http://hortonworks.com/products/hortonworks-sandbox/
and they tell him they think Hadoop can do the job.
Hadoop’s Genesis and Unique Characteristics Make It The Perfect
Target for The Modern Data Architecture
Any Data, Anywhere,
Anytime
Continuous Availability
Data Locality
Self-Healing Self-Leveling
Schema on Read Machine Learning
Our Mission:
Power your Modern Data Architecture
with HDP and Enterprise Apache Hadoop
Customer Momentum
• 330+ customers (as of end of 2014)
• Two thirds of customers come from F1000
Hortonworks Data Platform Hadoop at Scale
• Multiple +1000 node clusters under support, including
35,000 nodes at Yahoo!, 800 nodes at Spotify
• Open multi-tenant platform for any app & any data.
• Centralized architecture
Partner for Customer Success
• Open source community leadership focus on enterprise
needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers,
operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners
No One Knows Hadoop Better Than Hortonworks
Hortonworks Data Platform is an Enterprise Ready Centralized
Architecture That Allows For Batch, Interactive, and Real-Time
Processing on a Single Data Source
Storage
YARN: Data Operating System
Governance Security
Operations
Resource Management
Existing
Apps
New
Analytics
Partner
Apps
(ie. SAS)
Data Access: Batch, Interactive & Real-time
Mr. Brady is encouraged that the Hortonworks
Data Platform can handle the volume of
complex and varied data types that he must
analyze as well as handle the batch, interactive
and real-time processing that is required.
HCatalog: Shared Table & User Defined Metadata for All Workloads
Falcon & Oozie: Orchestrate Processing
Ambari: Provision, Manage and Monitor Cluster Resources
Ingest
Sqoop
NFS
WebHDFS
Stream
Storm
Flume
Source Systems
Clickstream
Social/Web
Geolocation
Machine/Sensor
Server Log
Unstructured
CRM/ERP
ODS
EDW
Security: Perimeter and Full Stack Policy Definition & Enforcement
Data
Processing
&
Data
Transforms
Data
Science
&
Machine
Learning
Spark
Hive
Solr
Stream
Processing
&
Stream
Analytics
Kafka
Storm
Target Systems
ODS
EDW
Pig
Hive
Cascading
Real Time &
NoSQL
SQL
Batch & Interactive
Hive
HDFS (Hadoop Distributed File System)
YARN (Cluster Resource Management) HBase
Accumulo
Visualization &
Reporting
Business
Applications
Data
Marts
Hortonworks Data Platform
Jim
Business Analyst
Sue
Developer
Varun
System Admin
Maria
SME
+
HDP Data Analyst
Training
=
HDP Data Analyst
+
Developer
Training
=
HDP Developer
+
HDP System Admin
Training
=
HDP Sys Admin
+
Data Science
Training
=
HDP Data Scientist
www.hortonworks.com
Varun Stands up the Cluster
Varun
HDP Sys Admin
Demo Here
Data Scientist: Explore Data & Build Model in Cloud
Click-thru Demo
Provision data science
environment in the cloud
Use data science
notebook to explore data
Run algorithms to create
predictive model
Cloudbreak
1. Choose a cloud
2. Pick the Spark
blueprint
3. Launch HDP
Microsoft Azure
Login to launch.hortonworks.com
which is a self-service portal for
launching HDP clusters to the cloud
Name the cluster, choose your region,
and pick your blueprint…in this case,
we want “hdp-spark-cluster” for our
data science work
We clicked “create cluster” and
Cloudbreak is now provisioning our
Spark environment on Azure
Varun Secures the Cluster
Varun
HDP Sys Admin
Demo Here
Jim and Sue Build Monitoring App
Jim
HDP Data Analyst
Sue
HDP Developer
Demo Here
Apps on YARN
Datasets stored in HDFS
Real-time and Predictive Application Architecture
Monitoring application
Truck sensors
App alerts
(ActiveMQ)
Messages
Stream NoSQL
Apache Kafka
▪ High throughput distributed messaging
system
▪ Publish-Subscribe semantics but re-
imagined at the implementation level
to operate at speed with big data
volumes
▪ Kafka @LinkedIn:
▪ 800 billion messages per day
▪ 175 terabytes of data written per day
▪ 650 terabytes of data read per day
▪ Over 13 million messages/2.75GB of
data per second
Kafka
Cluster
producer
producer
producer
consumer
consumer
consumer
Kafka: Anatomy of a Topic
[Figure: a topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered sequence of numbered offsets running from old to new, with writes appended at the new end.]
▪ Partitioning allows topics to
scale beyond a single
machine/node
▪ Topics can also be replicated,
for high availability.
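A minimal sketch of the partitioning idea, in plain Python rather than the Kafka client API. The `Topic` class, the topic name `truck_events` and the truck keys are all made up for illustration; the point is only that each partition is an append-only log with its own offsets, and that a key always lands in the same partition.

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy in-memory model of a partitioned topic (hypothetical, not the Kafka API)."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        # Each partition is an ordered log; a message's offset is its list index.
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # A stable hash keeps every message for one key in the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)  # writes always land at the "new" end of the log
        return partition, len(log) - 1

    def read(self, partition, offset):
        # Reading does not remove anything; a consumer just remembers its offset.
        return self.partitions[partition][offset:]


topic = Topic("truck_events", num_partitions=3)
placements = defaultdict(list)
for i in range(6):
    for truck in ("truck-1", "truck-2"):
        partition, offset = topic.publish(truck, f"{truck} reading {i}")
        placements[truck].append((partition, offset))
```

Because ordering is only kept within a partition, choosing the key (here the truck) decides which messages stay in order relative to each other.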
Apache Storm
• Distributed, real time, fault tolerant Stream Processing platform.
• Provides processing guarantees.
• Key concepts include:
•Tuples
•Streams
•Spouts
•Bolts
•Topology
Storm: Tuples and Streams
• What is a Tuple?
–The fundamental data structure in Storm: a named list of values, each of which can be of any data type.
• What is a Stream?
–An unbounded sequence of tuples.
–The core abstraction in Storm; streams are what you “process”.
Storm: Spouts
• What is a Spout?
–Generates, or is a source of, Streams
–E.g.: JMS, Twitter, Log, Kafka Spout
–Can spin up multiple instances of a Spout and dynamically adjust as needed
Storm: Bolts
• What is a Bolt?
–Processes any number of input streams and produces output streams
–Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
–Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro Files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
Storm: Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
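To make the vocabulary concrete, here is a rough sketch of a topology using Python generators. It is not the Storm API: the function names, the event fields and the threshold of 2 are assumptions for illustration. Tuples are dicts, the spout is the source, each bolt consumes one stream and emits another, and the topology is just the wiring.

```python
from collections import Counter


def truck_event_spout():
    """Spout: the source of the stream. Here, a fixed list of hypothetical tuples."""
    events = [
        {"driver": "d1", "event": "speeding"},
        {"driver": "d2", "event": "normal"},
        {"driver": "d1", "event": "lane_departure"},
        {"driver": "d1", "event": "speeding"},
        {"driver": "d2", "event": "speeding"},
    ]
    yield from events


def violation_filter_bolt(stream):
    """Bolt: drops tuples that are not violations."""
    for tup in stream:
        if tup["event"] != "normal":
            yield tup


def counting_bolt(stream, counts):
    """Bolt: keeps a running count per driver and emits the updated count."""
    for tup in stream:
        counts[tup["driver"]] += 1
        yield {"driver": tup["driver"], "violations": counts[tup["driver"]]}


def monitoring_bolt(stream, threshold):
    """Bolt: emits an alert when a driver's count exceeds the threshold."""
    for tup in stream:
        if tup["violations"] > threshold:
            yield f"ALERT: {tup['driver']} has {tup['violations']} violations"


# The topology is the wiring: spout -> filter -> count -> monitor.
counts = Counter()
alerts = list(
    monitoring_bolt(
        counting_bolt(violation_filter_bolt(truck_event_spout()), counts),
        threshold=2,
    )
)
```

In a real cluster each stage would run as many parallel instances and the counts would live in a store such as HBase rather than an in-process `Counter`.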
Apache HBase
• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned Records
– Distributed across a cluster of machines
– Low Latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps
HBase: Data Assignment
Keys within an HBase table are divided among different RegionServers.
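One way to picture the assignment: sorted row keys are cut into contiguous ranges, and each range is served by one server. The split points and server names below are invented for illustration; this is a sketch of the lookup idea, not HBase's own code.

```python
import bisect

# Hypothetical split points: each region holds a contiguous, sorted range of row keys.
# Regions: [.. "g"), ["g" .. "n"), ["n" .. "t"), ["t" ..]
split_keys = ["g", "n", "t"]
# One server per region (illustrative names; a server can host several regions).
region_servers = ["rs1", "rs2", "rs3", "rs1"]


def locate(row_key):
    """Return (region index, server) for a row key via binary search over split points."""
    region = bisect.bisect_right(split_keys, row_key)
    return region, region_servers[region]
```

Because neighbouring keys share a region, the choice of row key decides which reads and writes end up on the same server.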
HBase: Data Access
• Get
–Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with
a matching rowkey
• Put
–Inserts a new version of a cell.
• Scan
–The whole table, row by row, or a section of that table starting at a particular start key and
ending at a particular end key
• Delete
–Actually a kind of put: it adds a new version that carries a deletion marker
• SQL via Apache Phoenix
–Unique capability in the NoSQL market
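The four operations can be sketched with a toy in-memory table. This is not the HBase client API: `ToyTable`, the counter used as a timestamp and the exclusive stop key in `scan` are assumptions of the sketch. It shows that a put adds a version rather than overwriting, and that a delete is itself a write of a marker.

```python
import itertools

_clock = itertools.count(1)  # stand-in for timestamps, so versions are ordered
TOMBSTONE = object()         # deletion marker


class ToyTable:
    """Toy versioned key/value table (hypothetical, for illustration only)."""

    def __init__(self):
        # (row key, column) -> list of (version, value), newest last
        self.cells = {}

    def put(self, row, column, value):
        # Inserts a new version of the cell; older versions are kept.
        self.cells.setdefault((row, column), []).append((next(_clock), value))

    def delete(self, row, column):
        # A delete is a put of a deletion marker.
        self.put(row, column, TOMBSTONE)

    def get(self, row, column):
        # Returns the newest value, or None if the cell is missing or deleted.
        versions = self.cells.get((row, column), [])
        if not versions or versions[-1][1] is TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_row, stop_row):
        # Walks rows in sorted key order; start is inclusive, stop is exclusive here.
        for row, column in sorted(self.cells):
            if start_row <= row < stop_row:
                value = self.get(row, column)
                if value is not None:
                    yield row, column, value


table = ToyTable()
table.put("driver-001", "events:count", 1)
table.put("driver-001", "events:count", 2)
table.put("driver-002", "events:count", 5)
table.put("driver-003", "events:count", 7)
table.delete("driver-003", "events:count")
```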
Apache HDFS: Hadoop Distributed File System
• Very large scale distributed file system
• 10K nodes, tens of millions of files and PBs of data
• Supports large files
• Designed to run on commodity hardware, assumes hardware failures
• Files are replicated to handle hardware failure
• Detects failures and recovers from them automatically
• Optimized for Large Scale Processing
• Data locations are exposed so that the computations can move to where data resides
• Data Coherency
• Write once and read many times access pattern
• Files are broken up in chunks called ‘blocks’
• Blocks are distributed over nodes
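A small sketch of blocks and replication. The 128 MB block size, the replication factor of 3, the node names and the round-robin placement are assumptions for illustration; real HDFS placement is rack-aware and more involved.

```python
def split_into_blocks(file_size, block_size):
    """Return the byte length of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes). Round-robin is a simplification.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def survives(placement, failed_node):
    """True if every block still has at least one replica after one node fails."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


MB = 1024 * 1024
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(file_size=300 * MB, block_size=128 * MB)
placement = place_replicas(len(blocks), nodes)
```

Since the placement map says which nodes hold which block, a scheduler can send the computation to one of those nodes instead of moving the data.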
Jim Builds BI Reports to Analyze Routes
Jim
HDP Data Analyst
Demo Here
Jim Builds BI Reports on Events per Route
Jim
HDP Data Analyst
Demo Here
Apps on YARN
Datasets stored in HDFS
Real-time and Predictive Application Architecture
Your BI Tool
Predictive application
Truck sensors
App alerts
(ActiveMQ)
Messages
SQL Stream NoSQL
Mr. Brady is happy with the
results. He is able to
determine that a subset of
drivers is responsible for the
increased cost. But like most
managers he is not happy for
long. Now he wants to be able
to predict which drivers are
likely to be a risk.
Maria
Data Scientist
Machine Learning
Maria points out that HDP has tremendous Machine Learning
capabilities and she can use this to predict which drivers are likely
to have an event before the event occurs.
Maria implements predicted
violations logic using HDP
Machine Learning
and is able to predict events
before they happen
Demo Here
Apps on YARN
Datasets stored in HDFS
Real-time and Predictive Application Architecture
Your BI Tool
Predictive application
Truck sensors
App alerts
(ActiveMQ)
Messages
SQL Stream NoSQL ML
Use
Model
Why We Love Spark at Hortonworks
Elegant Developer APIs
DataFrames, Machine Learning, and SQL
Interactive Data Science
All apps need to get predictive at scale and fine granularity
Democratize Machine Learning
Spark is doing to ML on Hadoop what Hive did for SQL on Hadoop
Community
Broad developer, customer and partner interest
Realize Value of Data Operating System
A key tool in the Hadoop toolbox
Spark and Hadoop – How Can We Do Better?
Resource Management
YARN for multi-tenant, diverse workloads with predictable SLAs
Tiered Memory Storage
HDFS in-memory tier for off-heap RDD cache
SparkSQL & Hive for SQL
Interop with modern metastore, HS2; optimized ORC support
Spark & NoSQL
Deep integration with HBase via RDDs for predicate pushdown
Connect The Dots – Algorithms to Use-Cases
Higher-level ML abstractions - Validation, tuning, pipeline assembly... e.g. GeoSpatial
Ease of Use
Apache Zeppelin for interactive notebooks
Mr. Brady is happy now that he
can isolate where problems
exist, identify causal events
and build models that help him
predict events before they
occur. However, he knows he
still has to come up with a way
to grow revenue.
Demo Here
Mr. Brady thinks there may be a
mismatch between his truck
capacity and route demand. In
other words, he has some routes
that would generate more revenue if
the trucks on those routes had more
capacity. He also has some routes
where the trucks have excess
capacity. The problem is, the truck
capacities exist only in a PDF.
Type: Peterbilt 348 Heavy Duty Trucks - Tank Trucks - Water
Capacity: 5000 Gallon
Type: DynaHauler®/MH Water Trucks - Water
Capacity: 8000 Gallon
Type: MAN Heavy Duty Water Tank Truck
Capacity: 10000 Gallon
Demo Here
Mr. Brady struggles with how
to match the right truck with
the right route because he
knows of no way to relate
unstructured pdf data with the
route data that he has in a
structured database.
Jim
Business Analyst
Jim points out that HDP can handle unstructured data and can
process the equipment spec sheets.
Schema on Read
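A sketch of what schema on read could look like for the spec sheets: the raw text is stored as-is, and structure is applied only when it is read. The truck types and capacities come from the spec sheet in this deck; the extracted-text layout, the route names and the demand figures are made up for illustration.

```python
import re

# Raw text as it might come out of a PDF extractor (layout is an assumption).
raw_specs = """
Type: Peterbilt 348 Heavy Duty Trucks - Tank Trucks - Water
Capacity: 5000 Gallon
Type: DynaHauler®/MH Water Trucks - Water
Capacity: 8000 Gallon
Type: MAN Heavy Duty Water Tank Truck
Capacity: 10000 Gallon
"""

# The "schema" is applied at read time: pull a type and a numeric capacity out of the text.
pattern = re.compile(r"Type:\s*(?P<type>.+?)\s*\nCapacity:\s*(?P<gallons>\d+)\s*Gallon")
trucks = [(m["type"], int(m["gallons"])) for m in pattern.finditer(raw_specs)]

# Hypothetical route demand in gallons, standing in for the structured route database.
route_demand = {"route-A": 4500, "route-B": 9000, "route-C": 7000}


def smallest_sufficient_truck(demand):
    """Pick the smallest truck whose capacity covers the demand, or None."""
    candidates = [truck for truck in trucks if truck[1] >= demand]
    return min(candidates, key=lambda truck: truck[1]) if candidates else None


assignments = {route: smallest_sufficient_truck(d) for route, d in route_demand.items()}
```

If the questions change later, only the read-time pattern changes; the stored text does not have to be reloaded.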
Mr. Brady is overjoyed with his
big win as he adds millions in
revenue by matching the right
truck with the right route at the
right time.
Demo Here
We can now access Zeppelin which is
a data science notebook for Spark
that’s similar to IPython Notebook
Does location have an impact on
incidents?
Upcoming Workshop:
Deep Learning with Hadoop & Apache Spark
http://hortonworks.com/partners/learn/
The End. Thanks. Questions?
@seano
Links for Reference
● Hortonworks Sandbox: http://hortonworks.com/sandbox
● CloudBreak (to deploy HDP on Cloud):
○ http://sequenceiq.com/cloudbreak/
○ http://cloudbreak.sequenceiq.com
● Apache Zeppelin: https://zeppelin.incubator.apache.org/
● Apache Zeppelin installer for Ambari: https://github.com/hortonworks-gallery/ambari-zeppelin-service
● HortonsGym: https://itunes.apple.com/us/app/hortons-gym/id993130619?mt=8
● IOT Demo Code: https://github.com/abajwa-hw/iotdemo-service
Extra slides showing Apache Zeppelin
Let’s look at our data. We can see
eventType, if the driver’s certified,
how many hours driven, as well as
weather data such as foggy, rainy,
Let’s start asking questions of our
data; such as, does fatigue cause
violations?
Let’s view the data in a pie chart
graphic to see how violations look by
hours driven.
How are violations impacted by fog?
OK, we’ve learned enough about the
data and what features we want to
include in our model. So we’ll run a
logistic regression on training data.
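For readers without the notebook to hand, this is roughly what a logistic regression does, written in plain Python instead of Spark. Everything here is an assumption for illustration: the eight training rows, the two features (hours driven and fog), the scaling by 12 hours and the learning rate. It is not the code from the demo.

```python
import math

# Hypothetical training rows: (hours_driven, foggy, violation)
rows = [
    (2, 0, 0), (3, 0, 0), (4, 1, 0), (5, 0, 0),
    (9, 1, 1), (10, 0, 1), (11, 1, 1), (12, 1, 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, hours, foggy):
    """Probability of a violation; hours are scaled to roughly 0..1."""
    bias, w_hours, w_fog = weights
    return sigmoid(bias + w_hours * (hours / 12.0) + w_fog * foggy)


def train(rows, learning_rate=0.5, epochs=5000):
    """Batch gradient descent on the logistic (log-loss) objective."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for hours, foggy, label in rows:
            error = predict(weights, hours, foggy) - label
            grad[0] += error
            grad[1] += error * (hours / 12.0)
            grad[2] += error * foggy
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


weights = train(rows)
```

The learned weights are the model: scoring a new event in the real-time application is one call to `predict`.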
Let’s run our code
Let’s look at our model.
Next step is to hand the model off to
the Enterprise Architect to integrate
into our real-time application.

A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

S2DS London 2015 - Hadoop Real World

  • 1. Hadoop Real World: For S2DS Students Sean Roberts — Partner Engineering, Hortonworks @seano ®
  • 2. © Hortonworks Inc. 2015. All Rights Reserved $ id seano Sean Roberts Partner Solutions Engineer London & EMEA & everywhere @seano linkedin.com/in/seanorama Louisiana. MacGyver. Cook. Autodidact. Volunteer. Ancestral Health. Fito. Couchsurfer. Nomad.
  • 3. I am your antithesis
  • 4. Formal education ended at 17: self-taught, autodidact to the extreme.
  • 5. Avoided Java my entire life, as do most systems engineers
  • 6. Fail: Then joined the Hadoop company, where everything is Java!
  • 7. Horrified by R, Matlab. But I do love Python and shell scripts!
  • 8. Statistics: not once
  • 9. Machine Learning, Predictive Analytics: I have no idea what I’m doing
  • 10. Questions to you: Areas of study? Hadoop? Data sets? How complex, mixed or big?
  • 11. Side note & fun for me: Horton’s Gym. Heart rate monitor, Solr, Banana, Kafka, Storm, HBase, SparkML
  • 13. Hadoop Real World: For S2DS Students Sean Roberts — Partner Engineering, Hortonworks @seano ®
  • 14. Lorry/Truck Fleet Use Case: Chief Data Officer’s Needs, Application Team’s Response
  • 15. Tim Brady is a General Manager at a major energy company in the forest city basin, and revenue has been a little flat. Senior management has asked Tim to see what he can do to contain costs. Mr. Brady’s background in working with equipment has served him well in his role overseeing the water hauling, pumper, and equipment trucks at his company. However, despite the recent drop in gas prices, fuel costs have continued to increase for the fleet of trucks that Mr. Brady oversees. [Chart: 2012 to 2015, axis 50% to 80%]
  • 16. Senior management asked Mr. Brady to explain the cost increases and get them under control, as well as look for opportunities to grow revenue. Insurance premiums and equipment outages have also increased under Brady’s watch. [Chart: Insurance Premiums and Equipment Outages, 2012 to 2015, axis label 900K]
  • 17. At first, Mr. Brady feels deflated as he thinks through the volume of complex and varied data types that he must analyze to answer the questions posed by senior management. In addition, Mr. Brady realizes that whatever system he chooses will have to handle batch, interactive and real-time processing. • Clickstream: Route Data As The Drivers Choose Their Routes Through Mapping Software • Sensor: Data Coming Off The Assets • Geolocation: Data Providing The Location of Assets • Web Data: Weather • Structured Data: Master Data on Drivers and Assets • Unstructured Data: Asset Work Orders and Assets [Chart labels: New, Traditional, New Data Growth]
  • 18. Then Mr. Brady starts to get a grip on the situation and remembers a team he once used to get him some data. Tim reaches out to his team: Jim (Business Analyst), Sue (Developer), Varun (System Admin) and Maria (SME). Tim’s team has recently downloaded Hortonworks’ Sandbox from http://hortonworks.com/products/hortonworks-sandbox/ and they tell him they think Hadoop can do the job.
  • 19. Hadoop’s Genesis and Unique Characteristics Make It The Perfect Target for The Modern Data Architecture • Any Data, Anywhere, Anytime • Continuous Availability • Data Locality • Self-Healing • Self-Leveling • Schema on Read • Machine Learning
  • 20. No One Knows Hadoop Better Than Hortonworks. Our Mission: Power your Modern Data Architecture with HDP and Enterprise Apache Hadoop. Customer Momentum • 330+ customers (as of end of 2014) • Two thirds of customers come from F1000 Hortonworks Data Platform Hadoop at Scale • Multiple +1000 node clusters under support, including 35,000 nodes at Yahoo!, 800 nodes at Spotify • Open multi-tenant platform for any app & any data. • Centralized architecture Partner for Customer Success • Open source community leadership focus on enterprise needs • Unrivaled world class support • Founded in 2011 • Original 24 architects, developers, operators of Hadoop from Yahoo! • 600+ Employees • 1000+ Ecosystem Partners
  • 21. Hortonworks Data Platform is an Enterprise-Ready Centralized Architecture That Allows For Batch, Interactive, and Real-Time Processing on a Single Data Source. [Diagram labels: Storage; YARN: Data Operating System; Governance; Security; Operations; Resource Management; Existing Apps; New Analytics; Partner Apps (i.e. SAS); Data Access: Batch, Interactive & Real-time] Mr. Brady is encouraged that the Hortonworks Data Platform can handle the volume of complex and varied data types that he must analyze, as well as the batch, interactive and real-time processing that is required.
  • 22. Hortonworks Data Platform [Diagram labels: HCatalog: Shared Table & User Defined Metadata for All Workloads; Falcon & Oozie: Orchestrate Processing; Ambari: Provision, Manage and Monitor Cluster Resources; Ingest: Sqoop, NFS, WebHDFS; Stream: Storm, Flume; Source Systems: Clickstream, Social/Web, Geolocation, Machine/Sensor, Server Log, Unstructured, CRM/ERP, ODS, EDW; Security: Perimeter and Full Stack Policy Definition & Enforcement; Data Processing & Data Transforms; Data Science & Machine Learning; Spark, Hive, Solr; Stream Processing & Stream Analytics; Kafka, Storm; Target Systems: ODS, EDW; Pig, Hive, Cascading; Real Time & NoSQL; SQL; Batch & Interactive; Hive; HDFS (Hadoop Distributed File System); YARN (Cluster Resource Management); HBase, Accumulo; Visualization & Reporting; Business Applications; Data Marts]
  • 23. Jim (Business Analyst) + HDP Data Analyst Training = HDP Data Analyst. Sue (Developer) + Developer Training = HDP Developer. Varun (System Admin) + HDP System Admin Training = HDP Sys Admin. Maria (SME) + Data Science Training = HDP Data Scientist.
  • 24. Varun Stands up the Cluster. Varun, HDP Sys Admin. Demo Here. www.hortonworks.com
  • 25. Data Scientist: Explore Data & Build Model in Cloud. Click-thru Demo: • Provision data science environment in the cloud • Use data science notebook to explore data • Run algorithms to create predictive model. Cloudbreak: 1. Choose a cloud 2. Pick the Spark blueprint 3. Launch HDP. Microsoft Azure
  • 26. Log in to launch.hortonworks.com, a self-service portal for launching HDP clusters in the cloud
  • 28. Name the cluster, choose your region, and pick your blueprint. In this case, we want “hdp-spark-cluster” for our data science work
  • 29. We clicked “create cluster” and Cloudbreak is now provisioning our Spark environment on Azure
  • 31. Jim and Sue Build Monitoring App. Jim: HDP Data Analyst. Sue: HDP Developer. Demo Here
  • 32. Real-time and Predictive Application Architecture [Diagram labels: Truck sensors; Messages; Stream; NoSQL; Monitoring application; App alerts (ActiveMQ); Apps on YARN; Datasets stored in HDFS]
  • 33. Apache Kafka ▪ High throughput distributed messaging system ▪ Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes ▪ Kafka @LinkedIn: ▪ 800 billion messages per day ▪ 175 terabytes of data written per day ▪ 650 terabytes of data read per day ▪ Over 13 million messages/2.75GB of data per second [Diagram: producers publish to the Kafka Cluster; consumers read from it]
  • 34. Kafka: Anatomy of a Topic ▪ Partitioning allows topics to scale beyond a single machine/node ▪ Topics can also be replicated, for high availability. [Diagram: Partition 0, Partition 1 and Partition 2, each an ordered log of numbered messages running from old to new; writes are appended at the new end]
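
The partition idea on this slide can be sketched in a few lines of plain Python. This is a single-process toy, not the Kafka client API: `ToyTopic`, its methods and the `truck-N` keys are hypothetical names for illustration, and CRC32 is an assumed stand-in for the hash a real partitioner would use.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for one topic: a fixed set of append-only partition logs."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 as a stable hash. Only the idea is illustrated
        # here (hash of the key modulo the partition count).
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        """Append to the partition chosen by key; return (partition, offset)."""
        p = self.partition_for(key)
        log = self.partitions[p]
        log.append((key, value))
        return p, len(log) - 1

    def read(self, partition, offset):
        """Return every message in a partition from `offset` onward."""
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=3)
placements = [topic.publish(f"truck-{i % 4}", {"seq": i}) for i in range(12)]

# Messages with the same key land in the same partition, in publish order.
by_key = defaultdict(list)
for i, (partition, offset) in enumerate(placements):
    by_key[f"truck-{i % 4}"].append((partition, offset))
```

Each partition is its own ordered log, so ordering holds per key rather than across the whole topic; a consumer tracks one offset per partition and can resume from it with `read`.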
  • 35. Real-time and Predictive Application Architecture [Diagram labels: Truck sensors; Messages; Stream; NoSQL; Monitoring application; App alerts (ActiveMQ); Apps on YARN; Datasets stored in HDFS]
  • 36. Apache Storm • Distributed, real-time, fault-tolerant stream processing platform. • Provides processing guarantees. • Key concepts include: Tuples, Streams, Spouts, Bolts, Topology
  • 37. Storm: Tuples and Streams • What is a Tuple? –The fundamental data structure in Storm: a named list of values that can be of any data type. • What is a Stream? –An unbounded sequence of tuples. –The core abstraction in Storm; streams are what you “process” in Storm
  • 38. Storm: Spouts • What is a Spout? –Generates, or is a source of, streams –E.g.: JMS, Twitter, Log, Kafka Spout –Can spin up multiple instances of a Spout and dynamically adjust as needed
  • 39. Storm: Bolts • What is a Bolt? –Processes any number of input streams and produces output streams –Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic –Can spin up multiple instances of a Bolt and dynamically adjust as needed • Bolts used in the Use Case: 1. HBaseBolt: persisting and counting in HBase 2. HDFSBolt: persisting into HDFS as Avro files using Flume 3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
  • 40. Storm: Topology • What is a Topology? –A network of spouts and bolts wired together into a workflow
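
To make the spout, bolt and topology vocabulary concrete, here is a minimal single-process sketch using chained Python generators. It is not the Storm API: the function names, the event fields and the sample events are all made up for illustration, and it has none of Storm's distribution, parallelism or processing guarantees. The alerting step loosely mirrors the threshold logic described for the MonitoringBolt.

```python
from collections import Counter

# Hypothetical event stream; field names and values are made up for illustration.
EVENTS = [
    {"driver": "A", "event": "normal"},
    {"driver": "B", "event": "speeding"},
    {"driver": "B", "event": "lane_departure"},
    {"driver": "A", "event": "speeding"},
    {"driver": "B", "event": "speeding"},
]


def truck_event_spout(events):
    """Spout: the source of the stream, emitting one tuple at a time."""
    for e in events:
        yield (e["driver"], e["event"])


def violation_filter_bolt(stream):
    """Bolt: pass through only tuples that describe a violation."""
    for driver, event in stream:
        if event != "normal":
            yield (driver, event)


def counting_bolt(stream, counts):
    """Bolt: keep a running count per driver and emit (driver, count)."""
    for driver, _event in stream:
        counts[driver] += 1
        yield (driver, counts[driver])


def alert_bolt(stream, threshold):
    """Bolt: emit an alert each time a driver's count exceeds the threshold."""
    for driver, count in stream:
        if count > threshold:
            yield f"ALERT driver={driver} violations={count}"


# Topology: spout -> filter -> count -> alert, wired as chained generators.
counts = Counter()
topology = alert_bolt(
    counting_bolt(violation_filter_bolt(truck_event_spout(EVENTS)), counts),
    threshold=2,
)
alerts = list(topology)
```

With the sample events above, driver B is the only one to pass the threshold, so `alerts` holds a single entry for B's third violation.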
  • 41. Real-time and Predictive Application Architecture [Diagram labels: Truck sensors; Messages; Stream; NoSQL; Monitoring application; App alerts (ActiveMQ); Apps on YARN; Datasets stored in HDFS]
  • 42. Apache HBase • HBase = Key / Value store • Designed for petabyte scale • Supports low latency reads, writes and updates • Key features – Updateable records – Versioned records – Distributed across a cluster of machines – Low latency – Caching • Popular use cases: – User profiles and session state – Object store – Sensor apps
  • 43. HBase: Data Assignment [Diagram: the keys within an HBase table are divided among different RegionServers]
  • 44. HBase: Data Access • Get –Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey • Put –Inserts a new version of a cell. • Scan –Reads the whole table, row by row, or a section of that table starting at a particular start key and ending at a particular end key • Delete –Actually a version of put (adds a new version with a deletion marker) • SQL via Apache Phoenix –Unique capability in the NoSQL market
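
The four access operations can be modelled with a small in-memory table. This is a sketch of the semantics listed on the slide, not the HBase client API: `ToyTable`, the `driver#NNN` row keys and the `events:count` column are hypothetical, and a simple counter stands in for cell timestamps.

```python
import itertools


class ToyTable:
    """Single-process stand-in for a table of versioned cells."""

    _TOMBSTONE = object()

    def __init__(self):
        self._cells = {}                  # (rowkey, column) -> [(version, value), ...]
        self._clock = itertools.count(1)  # assumed stand-in for cell timestamps

    def put(self, rowkey, column, value):
        """Write a new version of a cell; older versions are kept."""
        self._cells.setdefault((rowkey, column), []).append((next(self._clock), value))

    def delete(self, rowkey, column):
        """A delete is a put of a deletion marker, not an in-place erase."""
        self.put(rowkey, column, self._TOMBSTONE)

    def get(self, rowkey, column):
        """Return the newest value of a cell, or None if absent or deleted."""
        versions = self._cells.get((rowkey, column))
        if not versions:
            return None
        value = versions[-1][1]
        return None if value is self._TOMBSTONE else value

    def scan(self, start_key, stop_key):
        """Yield (rowkey, column, value) for start_key <= rowkey < stop_key, in key order."""
        for rowkey, column in sorted(self._cells):
            if start_key <= rowkey < stop_key:
                value = self.get(rowkey, column)
                if value is not None:
                    yield rowkey, column, value


table = ToyTable()
table.put("driver#011", "events:count", 1)
table.put("driver#011", "events:count", 2)   # second version of the same cell
table.put("driver#012", "events:count", 5)
table.put("driver#020", "events:count", 7)
table.delete("driver#012", "events:count")   # adds a marker; history is kept
latest = table.get("driver#011", "events:count")
rows = list(table.scan("driver#011", "driver#020"))
```

Because row keys are kept in sorted order, a scan is just a range read, which is why key design matters so much for sensor-style workloads.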
  • 45. Real-time and Predictive Application Architecture [Diagram labels: Truck sensors; Messages; Stream; NoSQL; Monitoring application; App alerts (ActiveMQ); Apps on YARN; Datasets stored in HDFS]
  • 46. Apache HDFS: Hadoop Distributed File System • Very large scale distributed file system • 10K nodes, tens of millions of files and PBs of data • Supports large files • Designed to run on commodity hardware, assumes hardware failures • Files are replicated to handle hardware failure • Detects failures and recovers from them automatically • Optimized for large scale processing • Data locations are exposed so that the computations can move to where data resides • Data coherency • Write once and read many times access pattern • Files are broken up in chunks called ‘blocks’ • Blocks are distributed over nodes
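
The block and replica bullets can be illustrated with a short sketch. The function names, node names and toy sizes here are hypothetical, and round-robin placement is an assumed simplification chosen for brevity; it is not how HDFS actually picks replica locations.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes, replication: int):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin is a simplification for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


NODES = ["node1", "node2", "node3", "node4"]   # hypothetical node names
data = b"x" * 1000                             # toy "file" of 1000 bytes
blocks = split_into_blocks(data, block_size=256)
placement = place_replicas(len(blocks), NODES, replication=3)
still_readable = readable_after_failure(placement, "node2")
```

Since the placement map records which nodes hold each block, a scheduler can send work to a node that already has the data, which is the "computations move to where data resides" point above.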
  • 47. Jim Builds BI Reports To Analyze Routes. Jim, HDP Data Analyst. Demo Here
  • 48. Jim Builds BI Reports On Events Per Route. Jim, HDP Data Analyst. Demo Here
  • 49. Real-time and Predictive Application Architecture [Diagram labels: Truck sensors; Messages; Stream; NoSQL; SQL; Your BI Tool; Predictive application; App alerts (ActiveMQ); Apps on YARN; Datasets stored in HDFS]
  • 51. Mr. Brady is happy with the results. He is able to determine that a subset of drivers is responsible for the increased cost. But like most managers he is not happy for long. Now he wants to be able to predict which drivers are likely to be a risk. Maria (Data Scientist, Machine Learning) points out that HDP has tremendous Machine Learning capabilities and she can use them to predict which drivers are likely to have an event before the event occurs.
  • 52. Maria implements predicted-violations logic using HDP Machine Learning and is able to predict events before they happen. Demo Here
  • 53. Real-time and Predictive Application Architecture [Diagram labels: Truck sensors; Messages; Stream; NoSQL; SQL; ML; Use Model; Your BI Tool; Predictive application; App alerts (ActiveMQ); Apps on YARN; Datasets stored in HDFS]
  • 54. Why We Love Spark at Hortonworks • Elegant Developer APIs: DataFrames, Machine Learning, and SQL • Interactive Data Science: All apps need to get predictive at scale and fine granularity • Democratize Machine Learning: Spark is doing to ML on Hadoop what Hive did for SQL on Hadoop • Community: Broad developer, customer and partner interest • Realize Value of Data Operating System: A key tool in the Hadoop toolbox [Diagram labels: Storage; YARN: Data Operating System; Governance; Security; Operations; Resource Management]
  • 55. Spark and Hadoop – How Can We Do Better? • Resource Management: YARN for multi-tenant, diverse workloads with predictable SLAs • Tiered Memory Storage: HDFS in-memory tier for off-heap RDD cache • SparkSQL & Hive for SQL: Interop with modern metastore, HS2; optimized ORC support • Spark & NoSQL: Deep integration with HBase via RDDs for predicate pushdown • Connect The Dots – Algorithms to Use-Cases: Higher-level ML abstractions - Validation, tuning, pipeline assembly... e.g. GeoSpatial • Ease of Use: Apache Zeppelin for interactive notebooks [Diagram labels: Storage; YARN: Data Operating System; Governance; Security; Operations; Resource Management]
  • 56. Mr. Brady is happy now that he can isolate where problems exist, identify causal events and build models that help him predict events before they occur. However, he knows he still has to come up with a way to grow revenue. Demo Here
  • 57. Mr. Brady thinks there may be a mismatch between his truck capacity and route demand. In other words, he has some routes that would generate more revenue if the trucks on those routes had more capacity. He also has some routes where the trucks have excess capacity. The problem is, the trucks’ capacity exists only in a PDF. • Peterbilt 348 Heavy Duty Trucks - Tank Trucks - Water, Capacity: 5000 Gallon • DynaHauler®/MH Water Trucks - Water, Capacity: 8000 Gallon • MAN Heavy Duty Water Tank Truck, Capacity: 10000 Gallon. Demo Here
  • 58. Mr. Brady struggles with how to match the right truck with the right route because he knows of no way to relate unstructured PDF data with the route data that he has in a structured database. Jim (Business Analyst) points out that HDP can handle unstructured data and can process the equipment spec sheets. Schema on Read
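
"Schema on read" means the raw spec text is stored untouched and a structure is applied only when someone queries it. A minimal sketch, assuming the PDF text has already been extracted into lines of the tidied form "model, Capacity: N Gallon"; the regular expression, the function names and the 7500-gallon route demand are all made up for illustration.

```python
import re

# Raw lines kept exactly as stored; no schema was applied when loading them.
RAW_SPEC_LINES = [
    "Peterbilt 348 Heavy Duty Trucks - Tank Trucks - Water, Capacity: 5000 Gallon",
    "DynaHauler/MH Water Trucks - Water, Capacity: 8000 Gallon",
    "MAN Heavy Duty Water Tank Truck, Capacity: 10000 Gallon",
    "free-form note with no capacity field",
]

# The "schema" lives in the reader, so it can change without reloading the data.
CAPACITY = re.compile(r"^(?P<model>.+?),\s*Capacity:\s*(?P<gallons>\d+)\s*Gallon")


def read_specs(lines):
    """Apply the schema at read time; lines that do not fit are skipped."""
    for line in lines:
        m = CAPACITY.match(line)
        if m:
            yield {"model": m.group("model"), "gallons": int(m.group("gallons"))}


def smallest_truck_for(demand_gallons, specs):
    """Pick the smallest truck whose capacity covers the route's demand, if any."""
    fits = [s for s in specs if s["gallons"] >= demand_gallons]
    return min(fits, key=lambda s: s["gallons"]) if fits else None


specs = list(read_specs(RAW_SPEC_LINES))
choice = smallest_truck_for(7500, specs)   # 7500 is a made-up route demand
```

Once the capacities are structured records, they can be joined against the route data that already sits in the structured database.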
  • 59. Real-time and Predictive Application Architecture [Diagram labels: Truck sensors; Messages; Stream; NoSQL; SQL; ML; Use Model; Your BI Tool; Predictive application; App alerts (ActiveMQ); Apps on YARN; Datasets stored in HDFS]
  • 60. Mr. Brady is overjoyed with his big win as he adds millions in revenue by matching the right truck with the right route at the right time. Demo Here
  • 62. We can now access Zeppelin, a data science notebook for Spark that’s similar to iPython notebook
  • 63. Does location have an impact on incidents?
  • 64. Upcoming Workshop: Deep Learning with Hadoop & Apache Spark http://hortonworks.com/partners/learn/
  • 65. The End. Thanks. Questions? @seano
  • 66. Links for Reference ● Hortonworks Sandbox: http://hortonworks.com/sandbox ● CloudBreak (to deploy HDP on Cloud): ○ http://sequenceiq.com/cloudbreak/ ○ http://cloudbreak.sequenceiq.com ● Apache Zeppelin: https://zeppelin.incubator.apache.org/ ● Apache Zeppelin installer for Ambari: https://github.com/hortonworks-gallery/ambari-zeppelin-service ● HortonsGym: https://itunes.apple.com/us/app/hortons-gym/id993130619?mt=8 ● IOT Demo Code: https://github.com/abajwa-hw/iotdemo-service
  • 67. Extra slides showing Apache Zeppelin
  • 68. Let’s look at our data. We can see eventType, if the driver’s certified, how many hours driven, as well as weather data such as foggy, rainy,
  • 69. Let’s start asking questions of our data, such as: does fatigue cause violations?
  • 70. Let’s view the data in a pie chart to see how violations look by hours driven.
  • 71. How are violations impacted by fog?
  • 72. OK, we’ve learned enough about the data and what features we want to include in our model. So we’ll run a logistic regression on training data.
  • 73. Let’s run our code
  • 74. Let’s look at our model. Next step is to hand the model off to the Enterprise Architect to integrate into our real-time application.
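
The notebook code itself is not part of this transcript, so the following is only a sketch of what "run a logistic regression on training data" involves, written in plain Python rather than Spark. The rows are made-up toy values (not data from the talk), the two features (hours driven, foggy) are borrowed from the columns mentioned on the earlier slides, and every function name is hypothetical.

```python
import math

# Made-up toy rows for illustration only: (hours_driven, foggy) -> violation.
ROWS = [
    ((2.0, 0), 0), ((3.0, 0), 0), ((4.0, 1), 0), ((5.0, 0), 0),
    ((8.0, 1), 1), ((9.0, 0), 1), ((10.0, 1), 1), ((11.0, 1), 1),
]


def make_scaler(rows):
    """Center and scale hours so plain gradient descent converges quickly."""
    hours = [features[0] for features, _ in rows]
    mean = sum(hours) / len(hours)
    spread = math.sqrt(sum((h - mean) ** 2 for h in hours) / len(hours))
    return lambda features: ((features[0] - mean) / spread, features[1])


def predict(weights, bias, features):
    """Probability of a violation under the current model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the logistic log-loss."""
    n_features = len(rows[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in rows:
            error = predict(weights, bias, features) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / len(rows)
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


scale = make_scaler(ROWS)
weights, bias = train([(scale(f), y) for f, y in ROWS])
risk_long_foggy = predict(weights, bias, scale((10.5, 1)))
risk_short_clear = predict(weights, bias, scale((2.5, 0)))
```

The trained `weights` and `bias` are the whole model, which is what makes the hand-off described on the last slide practical: the real-time application only needs the scaling step and `predict`.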