Lecture to the London S2DS students.
Some fun in highlighting that I'm their polar opposite (no schooling since 17, and focused on operations, not science).
15. Tim Brady is a General Manager at a major energy company in the Forest City basin, and revenue has been a little flat. Senior management has asked Tim to see what he can do to contain costs.
Mr Brady’s background in working with equipment has served him well
in his role overseeing the water hauling, pumper, and equipment trucks
at his company. However, despite the recent drop in gas prices, fuel
costs have continued to increase for the fleet of trucks that Mr Brady
oversees.
(Chart: fleet fuel costs rising year over year, 2012-2015: 50%, 60%, 70%, 80%)
16. Senior management asked Mr. Brady to explain the cost increases and get them under control, as well as look for opportunities to grow revenue.
Insurance premiums and equipment outages have also increased
under Brady’s watch.
(Chart: Insurance Premiums and Equipment Outages rising, 2012-2015; values on the order of 900K)
17. At first, Mr. Brady feels deflated as he thinks through the volume of complex and varied data types that he must analyze to answer the questions posed by senior management. In addition, Mr. Brady realizes that whatever system he chooses will have to handle batch, interactive and real-time processing.
• Clickstream – route data as the drivers choose their routes through mapping software
• Sensor data – coming off the assets
• Geolocation data – providing the location of assets
• Web data – weather
• Structured data – master data on drivers and assets
• Unstructured data – asset work orders and assets
(Chart: new data types growing faster than traditional data)
18. Then Mr. Brady starts to get a grip on the situation and remembers a team he once used to get him some data. Tim reaches out to his team.
Jim – Business Analyst
Sue – Developer
Varun – System Admin
Maria – SME
Tim’s team has recently downloaded Hortonworks’ Sandbox from
http://hortonworks.com/products/hortonworks-sandbox/
and they tell him they think Hadoop can do the job.
19. Hadoop’s Genesis and Unique Characteristics Make It the Perfect Target for the Modern Data Architecture
Any Data, Anywhere, Anytime
Continuous Availability
Data Locality
Self-Healing, Self-Leveling
Schema on Read
Machine Learning
20.
Our Mission:
Power your Modern Data Architecture
with HDP and Enterprise Apache Hadoop
Customer Momentum
• 330+ customers (as of end of 2014)
• Two-thirds of customers come from the F1000
Hortonworks Data Platform: Hadoop at Scale
• Multiple 1,000+ node clusters under support, including
35,000 nodes at Yahoo! and 800 nodes at Spotify
• Open multi-tenant platform for any app & any data.
• Centralized architecture
Partner for Customer Success
• Open source community leadership focus on enterprise
needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers,
operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners
No One Knows Hadoop Better Than Hortonworks
21. Hortonworks Data Platform is an Enterprise-Ready Centralized Architecture That Allows for Batch, Interactive, and Real-Time Processing on a Single Data Source
(Diagram: HDP architecture – YARN as the data operating system over storage, with governance, security, operations, and resource management; a data-access layer for batch, interactive and real-time workloads serving existing apps, new analytics, and partner apps such as SAS)
Mr. Brady is encouraged that the Hortonworks
Data Platform can handle the volume of
complex and varied data types that he must
analyze as well as handle the batch, interactive
and real-time processing that is required.
33. Apache Kafka
▪ High-throughput distributed messaging system
▪ Publish-subscribe semantics, but re-imagined at the implementation level to operate at speed with big-data volumes
▪ Kafka @LinkedIn:
▪ 800 billion messages per day
▪ 175 terabytes of data written per day
▪ 650 terabytes of data read per day
▪ Over 13 million messages / 2.75 GB of data per second
(Diagram: multiple producers publishing to a Kafka cluster; multiple consumers reading from it)
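The producer/consumer picture above can be sketched in a few lines. This is a toy in-memory broker illustrating publish-subscribe fan-out, not the Kafka client API; the class and topic names are invented for illustration.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory broker: every subscriber of a topic receives
    every message published to it (fan-out), unlike a work queue."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = MiniBroker()
seen_a, seen_b = [], []
broker.subscribe("truck-events", seen_a.append)
broker.subscribe("truck-events", seen_b.append)
broker.publish("truck-events", {"driver": 42, "event": "speeding"})
# Both consumers receive the same message.
```

Real Kafka decouples this further: the broker persists messages to a replicated log, and consumers pull at their own pace rather than being called back.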
34. Kafka: Anatomy of a Topic
(Diagram: a topic with three partitions, each an ordered, append-only log of messages addressed by offset; writes go to the new end of each partition, and partitions can differ in length)
▪ Partitioning allows topics to scale beyond a single machine/node
▪ Topics can also be replicated, for high availability
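The partition anatomy above can be modeled directly: messages with the same key land in the same partition, and each partition is an append-only log addressed by offset. A minimal sketch, assuming a simplified key hash (Kafka's default partitioner uses murmur2; a byte sum keeps the toy deterministic):

```python
NUM_PARTITIONS = 3

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Simplified stand-in for Kafka's key hashing.
    return sum(key.encode()) % num_partitions

topic = [[] for _ in range(NUM_PARTITIONS)]  # one append-only log per partition

def produce(key, value):
    p = partition_for(key)
    topic[p].append(value)           # writes always go to the "new" end
    return p, len(topic[p]) - 1      # (partition, offset)

# All messages for one truck stay ordered within its partition.
p1, o1 = produce("truck-17", "ignition on")
p2, o2 = produce("truck-17", "route started")
```

Ordering is guaranteed only within a partition, which is why keyed messages that must stay in order share a key.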
37. Storm: Tuples and Streams
• What is a Tuple?
–The fundamental data structure in Storm: a named list of values that can be of any data type.
• What is a Stream?
–An unbounded sequence of tuples.
–The core abstraction in Storm, and what you “process” in Storm.
Page 37
38. Storm: Spouts
• What is a Spout?
–A source of streams; it generates the tuples that enter a topology
–E.g.: JMS, Twitter, Log, Kafka Spout
–Can spin up multiple instances of a Spout and dynamically adjust as needed
Page 38
39. Storm: Bolts
• What is a Bolt?
–Processes any number of input streams and produces output streams
–Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
–Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold
Page 39
40. Storm: Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
Page 40
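The spout-to-bolt flow of the last few slides can be simulated in plain Python. A sketch only: the names (SensorSpout, AlertBolt) and the threshold logic are illustrative, loosely modeled on the MonitoringBolt in the use case; a real topology is wired with Storm's TopologyBuilder and runs distributed.

```python
class SensorSpout:
    """Source of a stream: emits tuples (named lists of values)."""
    def __init__(self, events):
        self.events = events
    def next_tuple(self):
        for driver_id, event in self.events:
            yield {"driver": driver_id, "event": event}

class AlertBolt:
    """Counts violations per driver; alerts once a driver's count
    exceeds the threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}
        self.alerts = []
    def execute(self, tup):
        if tup["event"] == "violation":
            d = tup["driver"]
            self.counts[d] = self.counts.get(d, 0) + 1
            if self.counts[d] > self.threshold:
                self.alerts.append(d)

# A "topology" is just the spout wired to the bolt:
spout = SensorSpout([(1, "violation"), (2, "ok"),
                     (1, "violation"), (1, "violation")])
bolt = AlertBolt(threshold=2)
for tup in spout.next_tuple():
    bolt.execute(tup)
```

In Storm the wiring also declares how tuples are grouped across bolt instances (e.g. fields grouping on `driver` so one instance sees all of a driver's events).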
42. Apache HBase
• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned Records
– Distributed across a cluster of machines
– Low Latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps
Page 42
44. HBase: Data Access
• Get
–Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey
• Put
–Inserts a new version of a cell
• Scan
–Reads the whole table row by row, or a section of the table starting at a particular start key and ending at a particular end key
• Delete
–Actually a version of Put: it adds a new cell version carrying a deletion marker
• SQL via Apache Phoenix
–A capability that sets HBase apart in the NoSQL market
Page 44
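The versioned-cell semantics above, including delete-as-put, can be shown with a toy store. This is a sketch of the semantics, not the HBase client API; `MiniTable` and the rowkeys are invented for illustration.

```python
TOMBSTONE = object()  # stands in for HBase's deletion marker

class MiniTable:
    def __init__(self):
        self.cells = {}  # (rowkey, column) -> list of (version, value)

    def put(self, rowkey, column, value):
        vs = self.cells.setdefault((rowkey, column), [])
        vs.append((len(vs), value))  # each Put adds a NEW version

    def get(self, rowkey, column):
        vs = self.cells.get((rowkey, column), [])
        if not vs:
            return None
        latest = max(vs)[1]  # highest version wins
        return None if latest is TOMBSTONE else latest

    def delete(self, rowkey, column):
        # Delete = Put of a deletion marker; old versions remain
        # until compaction removes them.
        vs = self.cells.setdefault((rowkey, column), [])
        vs.append((len(vs), TOMBSTONE))

    def scan(self, start, stop):
        # HBase keeps rows sorted by rowkey; emulate a range scan.
        rows = sorted({rk for (rk, _) in self.cells})
        return [rk for rk in rows if start <= rk < stop]

t = MiniTable()
t.put("driver-1", "status", "active")
t.put("driver-1", "status", "suspended")
latest = t.get("driver-1", "status")   # latest version
rows = t.scan("driver-0", "driver-2")  # rowkeys in [start, stop)
t.delete("driver-1", "status")
after = t.get("driver-1", "status")    # masked by the tombstone
```

The key point the slide makes: a Delete never rewrites history in place; reads simply see the tombstone as the newest version.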
46. Apache HDFS: Hadoop Distributed File System
• Very large-scale distributed file system
– 10K nodes, tens of millions of files, and PBs of data
• Supports large files
• Designed to run on commodity hardware; assumes hardware failures
– Files are replicated to handle hardware failure
– Detects failures and recovers from them automatically
• Optimized for large-scale processing
– Data locations are exposed so that computations can move to where the data resides
• Data coherency
– Write-once, read-many-times access pattern
• Files are broken up into chunks called “blocks”
– Blocks are distributed over nodes
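The block mechanics above are simple arithmetic. A sketch under stated assumptions: 128 MB is the HDFS default block size and 3 the default replication factor, but the round-robin placement here is a simplification, not the NameNode's rack-aware policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (128 MB)
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append((offset, min(block_size, file_size - offset)))
        offset += block_size
    return blocks

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Naive round-robin placement of each block's replicas across nodes."""
    return [[nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)]

# A 300 MB file yields three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

Because each block's replica locations are known, YARN can schedule a task on one of those nodes and move the computation to the data, not the data to the computation.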
51. Mr. Brady is happy with the results. He is able to determine that a subset of drivers are responsible for the increased cost. But like most managers he is not happy for long. Now he wants to be able to predict which drivers are likely going to be a risk.
Maria
Data Scientist
Machine Learning
Maria points out that HDP has tremendous Machine Learning
capabilities and she can use this to predict which drivers are likely
to have an event before the event occurs.
56. Mr. Brady is happy now that he can isolate where problems exist, identify causal events and build models that help him predict events before they occur. However, he knows he still has to come up with a way to grow revenue.
Demo Here
57. Mr. Brady thinks there may be a mismatch between his truck capacity and route demand. In other words, he has some routes that would generate more revenue if the trucks on those routes had more capacity. He also has some routes where the trucks have excess capacity. The problem is, the trucks’ capacities exist only in a PDF.
Type: Peterbilt 348 Heavy Duty Trucks - Tank Trucks - Water
Capacity: 5000 Gallon
Type: DynaHauler®/MH Water Trucks - Water
Capacity: 8000 Gallon
Type: MAN Heavy Duty Water Tank Truck
Capacity: 10000 Gallon
Demo Here
58. Mr. Brady struggles with how to match the right truck with the right route because he knows of no way to relate unstructured PDF data with the route data that he has in a structured database.
Jim
Business Analyst
Jim points out that HDP can handle unstructured data and can process the equipment spec sheets.
Schema on Read
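Schema on read is exactly what Jim's suggestion relies on: the raw text extracted from the spec-sheet PDFs is stored as-is, and structure is imposed only at query time. A minimal sketch, assuming text extraction has already produced lines like those on slide 57 (the `read_with_schema` helper and the regexes are illustrative, not a particular HDP tool):

```python
import re

# Raw, unstructured text as it might come out of a spec-sheet PDF.
raw_specs = """
Type: Peterbilt 348 Heavy Duty Trucks - Tank Trucks - Water
Capacity: 5000 Gallon
Type: DynaHauler/MH Water Trucks - Water
Capacity: 8000 Gallon
Type: MAN Heavy Duty Water Tank Truck
Capacity: 10000 Gallon
"""

def read_with_schema(text):
    """Impose structure at read time: pair each truck type with
    the gallon capacity that follows it."""
    types = re.findall(r"Type:\s*(.+)", text)
    capacities = [int(c) for c in
                  re.findall(r"Capacity:\s*(\d+)\s*Gallon", text)]
    return list(zip(types, capacities))

trucks = read_with_schema(raw_specs)
```

The resulting (type, capacity) pairs can then be joined against the structured route data, which is the match Mr. Brady needs.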
60. Mr. Brady is overjoyed with his big win as he adds millions in revenue by matching the right truck with the right route at the right time.
Demo Here