Schema Registry & Stream Analytics Manager
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Streaming Analytics Manager (SAM) & Registry
Agenda
Registry
Streaming Analytics Manager (SAM)
Demo
Questions
History of Streaming at Hortonworks
Introduced Storm as the stream processing engine in HDP-2.1 (late 2013)
First to ship Apache Kafka as an enterprise messaging queue (early 2014)
Added several improvements and features to Apache Storm
Added security and other critical features and improvements to Apache Kafka
Learned a great deal from shipping Storm and Kafka over the past 3 years
The vision and implementation of Registry and Streaming Analytics Manager are based on what we learned from shipping Storm and Kafka over the past 3 years
Registry
Foundational service that enables multiple use cases, including streaming, machine learning, service discovery, and application templates
Offers base frameworks for developing Schema Registry, ML Registry, etc.
Registry modules such as Schema Registry and ML Registry build their own entities on top of a versioned entity
Modular approach to running registry services
Users have the flexibility to choose which registry services they would like to enable
We have Schema Registry and ML Registry
What is Schema Registry? What Value Does it Provide?
What is Schema Registry?
• A shared repository of schemas that allows applications to flexibly interact with each other
What Value does Schema Registry Provide?
– Central Metadata Repository
• Provides reusable schemas
• Defines relationships between schemas
• Enables generic format conversion and generic routing
– Operational Efficiency
• Avoids attaching a schema to every piece of data
• Producers and consumers can evolve at different rates
Example Use
– Register schemas for Kafka topics, to be used by the consumers of those topics (e.g., NiFi, StreamLine)
Schema Registry Concepts
• Schema Group
A logical grouping/container for similar types of schemas, or based on any criteria the customer has for managing the schemas
• Schema Metadata
Metadata associated with a named schema
• Schema Version
The actual versioned schema associated with a schema metadata definition
[Diagram: a Schema Group (group name) contains Schema Metadata (schema name, schema type, description, compatibility policy, serializers, deserializers); each Schema Metadata has Schema Versions 1, 2, 3, and each version carries a version number, schema text, and fingerprint]
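The three concepts above can be sketched as a small in-memory model. This is only an illustration, not the Registry's own classes: the class layout, the MD5 fingerprint, and the rule that re-registering identical schema text returns the existing version are all assumptions made for the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** One immutable schema version: version number, schema text, fingerprint of the text. */
class SchemaVersion {
    final int version;
    final String text;
    final String fingerprint;

    SchemaVersion(int version, String text) {
        this.version = version;
        this.text = text;
        this.fingerprint = fingerprintOf(text);
    }

    /** Hex-encoded MD5 of the schema text (the choice of hash is an assumption). */
    static String fingerprintOf(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}

/** Metadata for a named schema, plus its ordered list of versions. */
class SchemaMetadata {
    final String name;
    final String type;
    final String schemaGroup;
    final String description;
    final String compatibilityPolicy;
    private final List<SchemaVersion> versions = new ArrayList<>();

    SchemaMetadata(String name, String type, String schemaGroup,
                   String description, String compatibilityPolicy) {
        this.name = name;
        this.type = type;
        this.schemaGroup = schemaGroup;
        this.description = description;
        this.compatibilityPolicy = compatibilityPolicy;
    }

    /** Adds a new version, or returns the existing one if the same text was already registered. */
    SchemaVersion addVersion(String text) {
        String fingerprint = SchemaVersion.fingerprintOf(text);
        for (SchemaVersion existing : versions) {
            if (existing.fingerprint.equals(fingerprint)) {
                return existing;
            }
        }
        SchemaVersion created = new SchemaVersion(versions.size() + 1, text);
        versions.add(created);
        return created;
    }

    /** Returns the most recently added version; fails if no version was registered yet. */
    SchemaVersion latest() {
        return versions.get(versions.size() - 1);
    }
}
```

The fingerprint gives a cheap way to detect that a submitted schema text is already known, which is why it sits next to the version number and text in the diagram.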
Sender/Receiver flow
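[Diagram: on the sender side, the Producer's Serializer uses a Schema Registry Client with a local schema/serdes cache and writes messages (version + payload) to the Message Store; on the receiver side, the Consumer's Deserializer uses its own Schema Registry Client and local schema/serdes cache to read them; both clients talk to a set of SchemaRegistry instances backed by Schema Storage and SerDes Storage]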
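The "version + payload" framing in the flow above can be illustrated with a toy codec. This is a sketch under stated assumptions: the 4-byte big-endian version prefix and the in-memory map standing in for the client's schema cache are choices made for the example, not the Registry's actual wire format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Toy codec: every message carries the schema version in front of the payload. */
class VersionedMessageCodec {
    // Stand-in for the client's local schema cache: version -> schema text.
    private final Map<Integer, String> schemaCache = new HashMap<>();

    void registerSchema(int version, String schemaText) {
        schemaCache.put(version, schemaText);
    }

    /** Sender side: prefix the payload with the 4-byte schema version. */
    byte[] serialize(int schemaVersion, String payload) {
        if (!schemaCache.containsKey(schemaVersion)) {
            throw new IllegalArgumentException("Unknown schema version " + schemaVersion);
        }
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + body.length).putInt(schemaVersion).put(body).array();
    }

    /** Receiver side: read the version, look up the schema, return {schema text, payload}. */
    String[] deserialize(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        int version = buffer.getInt();
        String schema = schemaCache.get(version);
        if (schema == null) {
            throw new IllegalStateException("Unknown schema version " + version);
        }
        byte[] body = new byte[buffer.remaining()];
        buffer.get(body);
        return new String[] { schema, new String(body, StandardCharsets.UTF_8) };
    }
}
```

Because only a small version identifier travels with each message, the schema itself does not have to be attached to every piece of data.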
Schema Registry
Schema Registry Component Architecture
SR Web Server: Schema Registry Web App, REST API
Schema Registry Client: Java Client
Integrations: NiFi Processors, Kafka Ser/Des, StreamLine
Schema Storage (pluggable storage): MySQL, Postgres, In-Memory
Serializer/Deserializer Jar Storage: Local File System, HDFS
Schema Compatibility Policies
What is a Compatibility Policy?
– Defines the rules for how schemas can evolve
– Subsequent version updates have to honor the schema’s original compatibility policy.
Policies Supported
– Backward
– Forward
– Both
– None
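A rough sketch of how the four policies could be evaluated is shown below. It assumes the usual Avro-style meaning of the terms (backward: the new schema can read data written with the old one; forward: the old schema can read data written with the new one) and models a schema as nothing more than a map from field name to "has a default value". The class and method names are hypothetical.

```java
import java.util.Map;

/** Toy compatibility check over schemas modelled as: field name -> has a default value. */
class CompatibilityChecker {
    enum Policy { BACKWARD, FORWARD, BOTH, NONE }

    /** A reader can read a writer's data if each reader field exists in the writer or has a default. */
    static boolean canRead(Map<String, Boolean> reader, Map<String, Boolean> writer) {
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean hasDefault = field.getValue();
            if (!hasDefault && !writer.containsKey(field.getKey())) {
                return false;
            }
        }
        return true;
    }

    /** Decides whether newSchema may be registered after oldSchema under the given policy. */
    static boolean isAllowed(Policy policy,
                             Map<String, Boolean> oldSchema,
                             Map<String, Boolean> newSchema) {
        switch (policy) {
            case BACKWARD:
                return canRead(newSchema, oldSchema);
            case FORWARD:
                return canRead(oldSchema, newSchema);
            case BOTH:
                return canRead(newSchema, oldSchema) && canRead(oldSchema, newSchema);
            default:
                return true; // NONE: any change is accepted
        }
    }
}
```

For example, adding a field with a default passes under BACKWARD, while adding a required field does not, because the new schema could not read records that were written without it.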
Schema evolution
[Diagram: producers and consumers running on different schema versions: Producer v2, Consumer v2, Producer v1, Producer v4, Consumer v5, Producer v1, Consumer v7]
Serializers/Deserializers
Snapshot based serializer/deserializer
– Serializes the complete payload
– Deserializes the payload to the respective type
Pull based serializer/deserializer
– Serializes only the elements that are required and ignores the others
– Pulls out only the elements required to build the desired object
Push based deserializer
– Provides callbacks that receive parsing events for the respective fields in the schema
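The difference between the three styles can be sketched on the deserializing side with a toy "key=value;key=value" payload. The payload format and all names here are assumptions for illustration; the Registry's real serializer/deserializer interfaces are not reproduced.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

/** Three deserialization styles over a toy "key=value;key=value" payload. */
class DeserializerStyles {

    /** Snapshot: deserialize the complete payload into one object. */
    static Map<String, String> snapshot(String payload) {
        Map<String, String> all = new LinkedHashMap<>();
        push(payload, all::put);
        return all;
    }

    /** Pull: extract only the requested fields and ignore everything else. */
    static Map<String, String> pull(String payload, Set<String> wanted) {
        Map<String, String> selected = new LinkedHashMap<>();
        push(payload, (name, value) -> {
            if (wanted.contains(name)) {
                selected.put(name, value);
            }
        });
        return selected;
    }

    /** Push: invoke the callback once per field as it is parsed. */
    static void push(String payload, BiConsumer<String, String> onField) {
        for (String pair : payload.split(";")) {
            int separator = pair.indexOf('=');
            if (separator < 0) {
                continue; // skip empty or malformed segments
            }
            onField.accept(pair.substring(0, separator), pair.substring(separator + 1));
        }
    }
}
```

In this sketch the snapshot and pull variants are built on top of the push variant, which shows that push is the most general of the three: the caller decides what to keep as each field arrives.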
Schema registry client
REST based client
Caching
– Metadata
– Schema versions
– Ser/des libs and class loaders
URL selectors
– Round robin
– Failover
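The two URL selection strategies listed above could look roughly like this. The interface and its method names (`select`, `reportFailure`) are hypothetical and chosen for the sketch; they are not taken from the client library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical strategy for choosing which registry URL the client calls next. */
interface UrlSelector {
    String select();
    void reportFailure(String url);
}

/** Round robin: spread requests evenly over all configured URLs. */
class RoundRobinSelector implements UrlSelector {
    private final List<String> urls;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSelector(List<String> urls) {
        if (urls.isEmpty()) {
            throw new IllegalArgumentException("At least one URL is required");
        }
        this.urls = new ArrayList<>(urls);
    }

    @Override
    public String select() {
        // floorMod keeps the index valid even after the counter overflows.
        return urls.get(Math.floorMod(next.getAndIncrement(), urls.size()));
    }

    @Override
    public void reportFailure(String url) {
        // Nothing to do: the next call moves on to another URL anyway.
    }
}

/** Failover: stick to one URL and switch only when that URL is reported as failing. */
class FailoverSelector implements UrlSelector {
    private final List<String> urls;
    private int current = 0;

    FailoverSelector(List<String> urls) {
        if (urls.isEmpty()) {
            throw new IllegalArgumentException("At least one URL is required");
        }
        this.urls = new ArrayList<>(urls);
    }

    @Override
    public synchronized String select() {
        return urls.get(current);
    }

    @Override
    public synchronized void reportFailure(String url) {
        // Ignore stale reports about a URL that is no longer the active one.
        if (urls.get(current).equals(url)) {
            current = (current + 1) % urls.size();
        }
    }
}
```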
HA
Storage provider
– Depends on the transactional support of the underlying SQL stores
– Spin up the required schema registry instances
Supports HA at SchemaRegistry
– Using ZK/Curator
– Automatic failover of master
– Master gets all writes
– Slaves receive only reads
[Diagrams: multiple SchemaRegistry instances sharing one storage]
Integration of Schema Registry
Kafka
– Using producer/consumer API for serializer/deserializer
NiFi processors for Schema Registry
– Fetch Schema
– Serialize/Deserialize with Schema
StreamLine processors for Schema Registry
– Look up the schema of a Kafka, Kinesis, or EventHubs topic
– Look up the schema of an HDFS directory
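For the Kafka integration, plugging in a serializer through the producer API usually comes down to producer configuration. The sketch below builds such a configuration with plain `java.util.Properties`. The `bootstrap.servers`, `key.serializer`, and `value.serializer` keys are standard Kafka producer settings; the serializer class name and the `schema.registry.url` key are assumptions about the Registry's Kafka module and should be checked against the release in use.

```java
import java.util.Properties;

/** Builds Kafka producer settings that delegate value serialization to a registry-aware serializer. */
class RegistryKafkaConfig {
    static Properties producerProperties(String bootstrapServers, String registryUrl) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Assumed class name of the Registry's Avro serializer for Kafka; verify before use.
        props.put("value.serializer",
                "com.hortonworks.registries.schemaregistry.serdes.avro.kafka.KafkaAvroSerializer");
        // Assumed name of the client setting that points at the Schema Registry REST endpoint.
        props.put("schema.registry.url", registryUrl);
        return props;
    }
}
```

The resulting `Properties` object would be passed to a `KafkaProducer` constructor; the consumer side would mirror it with the corresponding deserializer settings.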
WIP/Future enhancements
Security
– Kerberos support
– Default authorizers and Apache Ranger support
Audit of Schemas & Clients
Rich Types in Schema definition
Pluggable Listeners
Schema Policies
Notifications
– New versions
– Archiving
Converters
Try it out!
It's open source under the Apache License
https://github.com/hortonworks/registry
Apache incubation soon
Registry 0.2 release April 25th, 0.3 release on May 31st
https://groups.google.com/forum/#!forum/registry
We are seeing outside contributions
Contributions are welcome!
Streaming Analytics Manager
What is it?
• A platform used to design, develop, deploy, and manage streaming analytics applications in minutes, using a drag-and-drop visual paradigm
• Supports event correlation, context enrichment, complex pattern matching, analytical aggregations, and alerts/notifications when insights are discovered
• Agnostic to the underlying streaming engine; can support multiple streaming substrates (e.g., Storm, Spark Streaming, Flink)
• Extensibility is a first-class citizen (add sinks, processors, and sources as needed)
Guiding Principle
– Build streaming applications easily while focusing on business logic
Complexities in building streaming applications
New streaming engines and APIs
Implementing windows, joins, and state management is hard
Adding user’s business logic into the application
Interaction with external services such as HBase, Hive, HDFS, etc.
Deploying with all the necessary configuration files
Operations around the streaming application, including monitoring and metrics
Debugging streaming applications
Key challenges that SAM is trying to solve
Building streaming applications requires specialized skill sets that most enterprise organizations don’t have today
Streaming applications require a considerable amount of programming, testing, and tuning before deployment to production, which takes a significant amount of time
Key streaming primitives such as joining/splitting streams, aggregations over a window of time, and pattern matching are difficult to implement
People prefer not to write code to build complex streaming applications
No true open source project today solves all of the above challenges
People care less about which streaming engine powers their streaming applications than about having the challenges above addressed without being forced into vendor lock-in
Streaming Analytics Manager Components and User Personas
Distributed Streaming Computation Engine (different streaming engines that power the higher-level services used to build streaming applications)
User personas: App Developer, Business Analyst, Operations
SAM’s Value Proposition
A platform using a graphical programming paradigm that lets users focus on business logic and easily build and deploy complex streaming applications
Makes it easier for users to import other service configurations and use them in streaming applications
Provides an abstraction over the streaming engine, which makes it possible to plug in open source streaming engines (Storm, Spark, Flink, etc.)
Decouples the schema from the streaming application via integration with Schema Registry
Provides operational metrics for monitoring streaming applications via pluggable metrics storage (e.g., Ambari, OpenTSDB)
Streaming Insights: visualize the data being processed by a streaming application
SAM’s Key Capabilities
Building streaming apps using the following primitives
– Connecting to Streams
– Joining Streams
– Forking Streams
– Aggregations over Windows
– Stream Analytics – Descriptive, Predictive, Prescriptive
– Rules Engine
– Transformations
– Filtering and Routing
– Notifications / Alerts
Deploying streaming apps
– Deploying the streaming app on a supported streaming engine
– Monitoring the streaming app with metrics
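To make the "Aggregations over Windows" primitive concrete, here is a minimal sketch of an average over fixed-size, non-overlapping (tumbling) windows. It is plain Java written for illustration and says nothing about how SAM or the underlying engine implements windowing; it also assumes non-negative timestamps in milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

/** Averages values per tumbling window, keyed by the window's start time in milliseconds. */
class TumblingWindowAverage {
    static Map<Long, Double> average(long[] timestampsMs, double[] values, long windowMs) {
        // window start -> {running sum, running count}
        Map<Long, double[]> accumulators = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long windowStart = (timestampsMs[i] / windowMs) * windowMs;
            double[] sumAndCount = accumulators.computeIfAbsent(windowStart, k -> new double[2]);
            sumAndCount[0] += values[i];
            sumAndCount[1] += 1;
        }
        Map<Long, Double> averages = new TreeMap<>();
        for (Map.Entry<Long, double[]> entry : accumulators.entrySet()) {
            averages.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
        }
        return averages;
    }
}
```

A real streaming engine additionally has to handle late and out-of-order events and decide when a window is complete, which is part of why windowing is listed among the hard problems earlier in the deck.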
Typical Streaming Application Workflow
[Diagram: Kafka, P1, W1, HBase]
SAM’s Service Pools and Environments
• Service Pool
• A pool of services that can be used to create different environments
• Environment
• Consists of a set of services you choose from one or more service pools
• Stream App
• The environment is then associated with a stream application, which uses the services in that environment for various configuration
[Diagram: Stream App 1, Stream App 2]
SAM’s Components
Builder Components
Source (all integrated with Schema Registry)
• Kafka Source
• Event Hub
• HDFS
Processor
• Join
• Window/Aggregate
• Rule
• Normalization/Projection
• Branch
• PMML
• Custom
Sinks
• Notification/Alerts (Email Support)
• HDFS
• HBase
• Hive
• JDBC
• Druid
• Cassandra
• Kafka
• OpenTSDB
• Solr
Streaming Analytics powered by Druid and Superset
What is Stream Insight?
– Provides business analysts with a tool for descriptive analytics of streaming data and perishable insights, using a sophisticated UI provided by Superset
– Tooling to create time-series and real-time analytics dashboards, charts, and graphs, and to create rich, customizable visualizations of data
Extensibility with SAM SDK
Custom Processor
– Allows users to write their own business logic
/**
 * Interface for processors to implement for processing messages at runtime
 */
public interface ProcessorRuntime {
    /**
     * Process the {@link StreamlineEvent} and throw a {@link ProcessingException} if an error arises during processing
     * @param event to be processed
     * @return the list of results produced for the event
     * @throws ProcessingException if an error arises during processing
     */
    List<Result> process(StreamlineEvent event) throws ProcessingException;

    /**
     * Initialize any necessary resources needed for the implementation
     * @param config configuration for the implementation
     */
    void initialize(Map<String, Object> config);

    /**
     * Clean up any necessary resources needed for the implementation
     */
    void cleanup();
}
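A custom processor built on this lifecycle (initialize, process, cleanup) might look like the sketch below. To keep it compilable on its own, it uses simplified stand-ins: `SimpleEvent` and `SimpleProcessorRuntime` are hypothetical replacements for the SDK's `StreamlineEvent`, `Result`, and `ProcessorRuntime`, and the local `ProcessingException` only mimics the SDK class of the same name. The `speedLimit` setting and the `speed` field are invented for the example.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Simplified stand-ins so the sketch compiles alone; the real SDK types are richer.
class ProcessingException extends Exception {
    ProcessingException(String message) {
        super(message);
    }
}

class SimpleEvent {
    final Map<String, Object> fields;

    SimpleEvent(Map<String, Object> fields) {
        this.fields = fields;
    }
}

interface SimpleProcessorRuntime {
    List<SimpleEvent> process(SimpleEvent event) throws ProcessingException;
    void initialize(Map<String, Object> config);
    void cleanup();
}

/** Hypothetical business logic: forward only events whose speed exceeds a configured limit. */
class SpeedingFilterProcessor implements SimpleProcessorRuntime {
    private double limit;

    @Override
    public void initialize(Map<String, Object> config) {
        Object configured = config.get("speedLimit");
        // Fall back to an arbitrary default when the setting is absent or not numeric.
        limit = configured instanceof Number ? ((Number) configured).doubleValue() : 65.0;
    }

    @Override
    public List<SimpleEvent> process(SimpleEvent event) throws ProcessingException {
        Object speed = event.fields.get("speed");
        if (!(speed instanceof Number)) {
            throw new ProcessingException("Event has no numeric 'speed' field");
        }
        if (((Number) speed).doubleValue() > limit) {
            return Collections.singletonList(event);
        }
        return Collections.emptyList();
    }

    @Override
    public void cleanup() {
        // No resources to release in this sketch.
    }
}
```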
Extensibility with SAM SDK
Window UDF
– Custom UDFs to process window data
/**
 * This is an interface for implementing user defined functions for a single argument.
 *
 * @param <O> type of the result
 * @param <I> type of the input argument
 */
public interface UDF<O, I> {
    O evaluate(I i);
}
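As an illustration of how small a single-argument UDF can be, the sketch below implements the interface for a temperature conversion. The `UDF` interface is copied from the slide (without the `public` modifier) so the example compiles by itself; `CelsiusToFahrenheit` is a hypothetical function, not one of the built-ins.

```java
/** Copied from the slide so that the sketch compiles on its own. */
interface UDF<O, I> {
    O evaluate(I i);
}

/** Hypothetical UDF: converts a Celsius reading to Fahrenheit. */
class CelsiusToFahrenheit implements UDF<Double, Double> {
    @Override
    public Double evaluate(Double celsius) {
        return celsius * 9.0 / 5.0 + 32.0;
    }
}
```

Note the order of the type parameters: the result type comes first and the input type second.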
Built in functions
STDDEV
STDDEVP
VARIANCE
VARIANCEP
MEAN
MIN
MAX
SUM
COUNT
UPPER
LOWER
INITCAP
SUBSTRING
CHAR_LENGTH
CONCAT
Extensibility with SAM SDK
Notification Sink
– Interface for sending notifications such as email and SMS, or more complex notifications that invoke external APIs
public interface Notifier {
    void open(NotificationContext ctx);
    void notify(Notification notification);
    void close();
    boolean isPull();
    List<String> getFields();
    NotificationContext getContext();
}
public interface Notification {
    enum Status {
        NEW, DELIVERED, FAILED
    }
    String getId();
    List<String> getEventIds();
    List<String> getDataSourceIds();
    String getRuleId();
    Status getStatus();
    Map<String, Object> getFieldsAndValues();
    String getNotifierName();
    long getTs();
}
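A notifier that simply records what it was asked to deliver gives a feel for the `Notifier` lifecycle (open, notify, close). The interfaces below are trimmed stand-ins for the SDK types on the slide so that the sketch compiles alone: `NotificationContext` is left empty and `Notification` keeps only two of its getters. What `isPull()` and `getFields()` mean to the framework is not stated on the slide, so the values returned here are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Trimmed stand-ins for the SDK types shown on the slide.
interface NotificationContext { }

interface Notification {
    String getId();
    Map<String, Object> getFieldsAndValues();
}

interface Notifier {
    void open(NotificationContext ctx);
    void notify(Notification notification);
    void close();
    boolean isPull();
    List<String> getFields();
    NotificationContext getContext();
}

/** Hypothetical notifier that keeps delivered notifications in memory, e.g. for tests. */
class InMemoryNotifier implements Notifier {
    private final List<String> delivered = new ArrayList<>();
    private NotificationContext context;

    @Override
    public void open(NotificationContext ctx) {
        this.context = ctx;
    }

    @Override
    public void notify(Notification notification) {
        // A real notifier would send an email, an SMS, or call an external API here.
        delivered.add(notification.getId() + ": " + notification.getFieldsAndValues());
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }

    @Override
    public boolean isPull() {
        return false; // placeholder value
    }

    @Override
    public List<String> getFields() {
        return Arrays.asList("subject", "body"); // placeholder field names
    }

    @Override
    public NotificationContext getContext() {
        return context;
    }

    List<String> delivered() {
        return delivered;
    }
}
```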
What’s Next?
Manual service pool registration not requiring Ambari
Test sources and sinks to easily test functionality of streaming app
Authentication and Authorization
Other components (sources such as Kinesis, processors, and sinks)
Try it out!
It's open source under the Apache License
https://github.com/hortonworks/streamline
Apache incubation soon
SAM 0.4 is out!
https://groups.google.com/forum/#!forum/streamline-users
Contributions are welcome!
Follow-up questions
JP Player, Principal Solutions Engineer
jplayer@hortonworks.com
650.773.3313
Sam Hjelmfelt, Resident Architect
shjelmfelt@hortonworks.com
605.393.7244
Kristine Hannigan, Enterprise Account Manager
khannigan@hortonworks.com
415.323.8819