This document discusses the need for end-to-end data quality when using Apache Kafka to build trust in streaming data. It outlines common challenges organizations face when adopting Kafka, such as the inability to monitor data or identify issues. The Infogix Data360 platform provides data quality validation, balancing, and reconciliation across the full data pipeline, from source to consumption, to ensure trust in streaming data. It features over 100 predefined rules and capabilities to handle data quality for streaming, batch, and hybrid use cases.
Slides: Why You Need End-to-End Data Quality to Build Trust in Kafka
1. Infogix Confidential Copyright 2020
Why You Need End-to-End Data Quality to Build Trust in Kafka
2. Webinar Speakers
Jeff Brown
Infogix, Inc., Director, Data Quality and Analytics
Jeff Brown has been with Infogix for more than 8 years. He is currently a Director in Product Management responsible for delivering customer-driven solutions across the Infogix Data3Sixty Platform. He has a Bachelor of Science in Engineering from Michigan State University and an MBA from DePaul University.
4. Kafka at a Glance…
• 90%: Kafka will be a mission-critical part of their organization in 2018**
• 62%: Kafka deployments replacing existing technology**
• 60%: Fortune 100 companies using Apache Kafka*
• 100K+: Organizations worldwide using Apache Kafka*
*Kafka Summit San Francisco 2019; Jun Rao: Confluent  **2018 Apache Kafka Report; Confluent Survey of 600 Users from 59 Countries
5. Why are organizations moving to a streaming-based architecture?
6. What is Apache Kafka?
*Source: https://kafka.apache.org/intro
Kafka is an open-source, real-time streaming messaging system built around a publish-subscribe model*
Producers publish data to feeds (topics) that consumers subscribe to and receive messages from
Messages within a topic are stored across multiple partitions, which supports redundancy
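The publish-subscribe model above can be illustrated with a toy in-memory broker. This is a sketch of the concept only, not a Kafka client; the topic name, keys, and three-partition setup are invented for the example, and the hash-based partitioner mimics Kafka's default key-based partitioning:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory broker illustrating Kafka's publish-subscribe model.
    Not a real Kafka client; partition choice is a simple key hash,
    echoing Kafka's default key-based partitioner."""

    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        # topic -> list of partitions, each an append-only message log
        self.topics = defaultdict(lambda: [[] for _ in range(num_partitions)])

    def publish(self, topic, key, value):
        # Messages with the same key always land in the same partition,
        # which preserves per-key ordering
        partition = hash(key) % self.num_partitions
        self.topics[topic][partition].append((key, value))
        return partition

    def subscribe(self, topic, partition):
        # A consumer reads a partition's log from the beginning
        return list(self.topics[topic][partition])

broker = MiniBroker()
p = broker.publish("payments", key="cust-42", value={"amount": 100})
# Same key, same partition: ordering is preserved for this customer
assert broker.publish("payments", key="cust-42", value={"amount": 55}) == p
print(broker.subscribe("payments", p))
```

A real deployment would use a client library such as confluent-kafka against a running cluster; the point here is only the topic/partition/consumer relationship.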
7. Kafka Data Pipeline Flow
[Diagram: producers (Applications / 3rd Party Vendors, Files, Logs / IoT) publish messages to topics on the Kafka Platform; consumers (Data Lake, Database, Application) subscribe to those topics and receive messages]
8. Advantages of Apache Kafka
• Fault Tolerance: ensures that data is available even if failures occur within the cluster
• Real-Time Data Availability: enables reduced lag time in critical data-driven decisions
• Centralized Access to Data: provides a consolidated data hub approach to reduce complexity
• Data Storage Layer: acts as intermediary storage, enabling consumption when needed
• Scalability of Data Handling: supports high-volume data handling and data delivery
• Reduced Integration Points: lowers the complexity of data system communication
9. Key Drivers to Move to Kafka
• Create a unified data hub for the business to consume data
• Give data scientists and analytics teams better access to data
• Support data communication for a digital transformation strategy
• Make faster business decisions on more real-time data
10. Common challenges confronting organizations as they adopt Kafka
11. What are organizations saying?
“We are moving all system-to-system communication from file-based to Kafka messages”
• New means of digital communication
• Recognize need for real-time data access
“We don’t trust the stability of our Kafka platform to expand its usage”
• Lack of trust in their Kafka platform
• Require insights into operations
“Audit will not let us move forward with our Kafka platform without being able to validate the data”
• Need validation on data in motion
• Auditability of data and process is still a key focus
12. New Technologies, Same Challenges
Organizations should be asking:
• Do we know if all transactions that were supposed to be sent were sent?
• Do we know if all transactions that were supposed to be sent arrived?
• How do we know if all data arrived in the correct order?
• Do we know if duplicate data transactions have been sent?
• Do we know if all transactions were sent and arrived in a timely manner?
• Do we know if all transactions were aggregated and transformed correctly?
• What action should be taken on errors or potential lost data?
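Several of the questions above (completeness, duplicates, ordering) can be answered mechanically if producers stamp each message with a sequence number. A minimal sketch, assuming a 1..N sequence field named `seq` (an invented convention for the example):

```python
def audit_sequence(received, expected_count):
    """Audit a received message stream for missing, duplicated,
    and out-of-order transactions, assuming each message carries
    a 1..expected_count sequence number under the key "seq"."""
    seen = [m["seq"] for m in received]
    # Completeness: which sequence numbers never arrived?
    missing = sorted(set(range(1, expected_count + 1)) - set(seen))
    # Duplicates: which sequence numbers arrived more than once?
    duplicates = sorted({s for s in seen if seen.count(s) > 1})
    # Ordering: which messages arrived after a later-numbered message?
    out_of_order = [s for prev, s in zip(seen, seen[1:]) if s < prev]
    return {"missing": missing, "duplicates": duplicates,
            "out_of_order": out_of_order}

msgs = [{"seq": s} for s in [1, 2, 4, 3, 3, 6]]
report = audit_sequence(msgs, expected_count=6)
print(report)  # {'missing': [5], 'duplicates': [3], 'out_of_order': [3]}
```

Timeliness checks would follow the same pattern with timestamps instead of sequence numbers.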
13. What is the Impact of the Challenges?
IT / Operations:
• Unable to monitor data volumes for anomalies
• Inaccurate prediction of data volume needed for retention
• Unable to identify underlying infrastructure issues
Business:
• Incorrect data being consumed to make business decisions
• Potential customer loss, harm to reputation, revenue loss, regulatory fines
• Reduced overall trust
14. How do you build data trust within your organization?
15. Focus on Data Pipeline from End-to-End
Source: Raw Data, Source Data, Third Party
Processing: Semi-Processed Data, Non-Aggregated Data
Finished Good: Data Warehouse, Data Mart, MDM
16. Infogix Enables That Level of Independent Trust
[Diagram: the same pipeline as before, with producers (Applications / 3rd Party Vendors, Files, Logs / IoT) publishing through the Kafka Cluster to consumers (Data Lake / Warehouse, Database, Application)]
17. Trust is Built on a Multifaceted Approach
Producer to consumer: Data Quality, Reconciliation, Balancing, Integrity
18. Provide a 360° Standard to Data Trust
[Diagram: a continuous cycle spanning producer to consumer, top to bottom: Validate, Monitor, Visualize, Remediate]
20. Data Quality for Streaming Data
• Real-Time and Batch Validation: validate streaming data in real time or in batch to meet required time windows
• Balancing & Reconciliation: reconcile data from source to target to ensure all messages arrived and values are balanced
• Visualize: generate dashboards and track streaming data over time to highlight operational results
• Identify and Manage Exceptions: identify, route and remediate streaming data exceptions
• Transformation & Aggregation: capture, transform and aggregate data for both streaming and non-streaming data
• Statistical Control: monitor streams and apply statistical controls such as threshold violations or standard deviations
• Machine Learning: utilize ML to identify patterns and outliers within data streams for better insights
• Enrich Streaming Data: enrich or join streaming data, then generate new streams or other output types
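The balancing and reconciliation capability above can be sketched as a source-to-target comparison of record keys and control totals. This is an illustrative sketch only; the field names `id` and `amount` and the report shape are invented for the example:

```python
def reconcile(source, target, key="id", amount="amount"):
    """Source-to-target reconciliation sketch: verify every source
    record arrived at the target and that control totals balance.
    Field names are illustrative defaults."""
    src_keys = {r[key] for r in source}
    tgt_keys = {r[key] for r in target}
    return {
        # Completeness: records that never arrived, or arrived unexpectedly
        "missing_in_target": sorted(src_keys - tgt_keys),
        "unexpected_in_target": sorted(tgt_keys - src_keys),
        # Balancing: record counts and summed values must agree
        "counts_match": len(source) == len(target),
        "totals_balance": sum(r[amount] for r in source)
                          == sum(r[amount] for r in target),
    }

src = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
tgt = [{"id": 1, "amount": 100}]
print(reconcile(src, tgt))
```

In a streaming context the same comparison would run per time window, with the source side fed by producer-side counts and the target side by consumed messages.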
21. Data360 Streaming Functionality
• Streaming Data Store Input: bring in data from a streaming data source
• Streaming Data Store Output: output streaming data to a data source
• Convert to Micro Batch: convert streaming data to batch data
• Streaming SQL: query stream input using SQL statements
• Streaming Join: join two streaming sources, or a streaming and a batch source
• Streaming Deduplication: eliminate redundant streaming data
22. Handling Messages In-Flight
Customer Use Cases
• Validate Streaming Data Inline
◦ Read messages from a Kafka topic
◦ Apply data quality/custom rules to the message
◦ Determine if the message data passes/fails the rules
◦ Route the message to a corresponding topic (valid/invalid)
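The inline validation steps above can be sketched as a pure routing function. The rule names, topic names, and message fields are invented for the example; a real deployment would consume from and produce to Kafka via a client library such as confluent-kafka:

```python
import json

# Illustrative rule set; a production setup would draw on a larger
# library of predefined data quality rules
RULES = [
    ("amount_positive", lambda m: m.get("amount", 0) > 0),
    ("has_account_id", lambda m: "account_id" in m),
]

def route(raw_message):
    """Validate one in-flight message and choose its destination topic.
    Returns (topic, failed_rule_names); topic names are illustrative."""
    msg = json.loads(raw_message)
    failed = [name for name, rule in RULES if not rule(msg)]
    topic = "payments.valid" if not failed else "payments.invalid"
    return topic, failed

print(route('{"account_id": "A1", "amount": 50}'))  # ('payments.valid', [])
print(route('{"amount": -5}'))
```

Keeping validation as a side-effect-free function makes the rules easy to test independently of the broker.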
23. Handling Messages via Micro Batches
Customer Use Cases
• Validate & Process Micro-Batched Streaming Data
◦ Read small batches (micro batches) of messages from a Kafka topic
◦ Apply DQ/custom rules or complex processing to the micro batch
◦ Route the micro batch to other downstream processes OR convert the micro batch to messages and post to a Kafka topic
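The micro-batch pattern above can be sketched as a generator that groups a message stream into fixed-size batches, with batch-level processing applied to each. The batch size, message fields, and aggregation are invented for the example:

```python
from itertools import islice

def micro_batches(stream, size):
    """Group a message iterator into fixed-size micro batches.
    Real systems typically also flush on a time window; omitted here."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def process(batch):
    # Example batch-level processing: drop invalid (negative) amounts,
    # then aggregate what remains
    valid = [m for m in batch if m["amount"] > 0]
    return {"count": len(valid), "total": sum(m["amount"] for m in valid)}

stream = [{"amount": a} for a in [10, -2, 30, 5, 7]]
results = [process(b) for b in micro_batches(stream, size=2)]
print(results)  # [{'count': 1, 'total': 10}, {'count': 2, 'total': 35}, {'count': 1, 'total': 7}]
```

Converting the processed batch back into individual messages for a downstream topic is then just a matter of iterating over the batch output.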
24. Handling Streaming & Non-Streaming Messages
Customer Use Cases
• Streaming & Non-Streaming Data
◦ Validate both streaming and non-streaming sources
◦ Join non-streaming data with streaming data messages
◦ Output to Kafka or non-streaming data types
25. Infogix Provides a Single Solution to Build Data Trust
• 100+ pre-defined data quality rules
• Built-in quality and exception tracking
• Route, workflow and resolve issues
26. What are we Hearing from Customers?
• An international bank with $100B+ in assets is working with us for reconciliation on a streaming architecture
• A large financial institution is working with us to deliver Kafka capabilities as part of their data integrity and data quality controls group
• One of the largest health insurers in the world, and a 30-year Infogix customer, has initiated discussions with us around our Kafka solutions
27. Key Takeaways
• Organizations are sprinting toward adopting Kafka, but will face the same operational data quality issues as before
• Faster data delivery and higher data volumes will lead to increased data quality issues if not managed properly
• The entire data pipeline must be validated and monitored end-to-end to ensure trust and optimize streaming data investments
28. Find out more:
• Infographic
• eBook
• Data Sheet
• Blogs
• Visit our resource center to learn more about Infogix and Kafka
www.infogix.com
Jeff Brown
Infogix, Inc., Director, Data Quality and Analytics
Email: jbrown@infogix.com
Phone: 1.630.505.5566
29. Questions?
Please submit your questions via the web in the Q&A panel in the lower right-hand corner of your screen.
If we do not get to your question, we will personally follow up with you after the event.