Streaming Time Series Data With Kenny Gorman and Elena Cuevas | Current 2022
Modern streaming use cases generate massive amounts of data, much of which needs to be organized and queried over time. The sheer volume and complexity of this problem present new challenges for data engineers and developers alike.
To solve this problem, Apache Kafka and MongoDB Time Series collections make a powerful combination. In this talk, Kenny Gorman and Elena Cuevas will show how Apache Kafka on Confluent Cloud can stream massive amounts of data to Time Series collections via the MongoDB Connector for Apache Kafka. Elena and Kenny will discuss the required configuration details and the critical components of Confluent Cloud and MongoDB Atlas, as well as some tips, tricks, and best practices.
You will leave armed with the knowledge of how Confluent Cloud, Apache Kafka, MongoDB Atlas, and Time Series collections fit into your event-driven architecture.
1. Streaming Time Series Data with Apache Kafka and MongoDB
Kenny Gorman, Principal Product Manager - Streaming, MongoDB
Elena Cuevas, Manager, Cloud Partner Solutions Engineering, Confluent
5. MongoDB Developer Data Platform: Confluent and MongoDB in the Cloud
[Architecture diagram. Sources (sensors, digital content, transactions, clients, security, legacy systems of record / sources of truth, mobile and web apps) feed Confluent Platform and Confluent Cloud, which provide high-volume real-time event processing, a bridge to cloud with bidirectional sync and native connectors, Schema Registry, ksqlDB stream processing, and highly available, scalable Kafka source and sink connectors (supported and fully managed Atlas connectors). Data lands in MongoDB's real-time online data store (primary/secondary replica sets, sharding, high-volume real-time operational data, analytical Time Series collections, Lucene-based text search, data tiering to DL/DWH and archives, Realm mobile sync), serving consumers such as personalized marketing, research & analytics, AML / AFM, real-time vertical solutions, BI Connector / real-time analytics, Apache Spark, team collaboration, and flexible APIs and microservices.]
6. MongoDB Developer Data Platform: Confluent and MongoDB in the Cloud
[Same architecture diagram as slide 5, highlighting the end-to-end flow: produce event → connector → Time Series collection → query.]
8. Real-time & Historical Data
Events such as a sale, a shipment, a trade, or a customer interaction demand a new paradigm for Data in Motion: continuously process streams of data in real time.
"We need to shift our thinking from everything at rest, to everything in motion."
Real-Time Stream Processing powers rich, front-end customer experiences and real-time, software-driven business operations.
10. Cloud-Native: Apache Kafka®, fully managed and re-architected to harness the power of the cloud

● Serverless: elastic scaling up and down from 0 to GBps; automatic capacity management, load balancing, and upgrades
● High Availability: 99.99% uptime SLA; multi-region / multi-AZ availability across cloud providers; patches deployed in Confluent Cloud before Apache Kafka
● Infinite Storage: store data cost-effectively at any scale without growing compute
● DevOps Automation: API-driven and/or point-and-click operations; service portability and consistency across cloud providers and on-prem
● Network Flexibility: public, VPC, and Private Link networking; self-managed option for air-gapped environments

Elastic: Instantly scale to meet any demand. Seamlessly provision and deploy fully managed, elastically scaling clusters with infinite storage that expand and shrink to cost-effectively support all streaming use cases.
Reliable: Power all your streaming apps and analytics with resilience. Maintain high availability of your clusters and data streams with our 99.99% uptime SLA, multi-AZ / multi-region clusters, and no-touch Kafka patches and upgrades.
Agile: Focus on innovation, not infrastructure. Fully automate management of serverless clusters through code via Terraform integration and REST APIs, paying only for what you use when you use it.

"Before Confluent, when we had broker outages that required rebuilds, it could take up to three days of developer time to resolve. Now, Confluent takes care of everything for us, so our developers can focus on building new features and applications."
11. Complete: Go above and beyond Kafka with all the essential tools for a complete data streaming platform

● Connectors & Stream Processing: Connect to and from any app / system and process your data streams in-flight. Reduce TCO and architectural complexity with our portfolio of 120+ pre-built connectors and stream processing powered by ksqlDB, all available fully managed and built in with Confluent Cloud.
● Stream Designer: Quickly build and deploy streaming apps & pipelines. Rapidly build, test, and deploy streaming data pipelines with Stream Designer, extensible with SQL, while reducing the need to write boilerplate code.
● Security & Governance: Secure, discover, and organize your data streams. Build trust and put your data streams to work with enterprise-grade security and the only Stream Governance suite for data in motion.

"BHG is a fast-moving company, and Confluent is quickly becoming not only a central highway for our data with their vast connector portfolio, but a streaming transformation engine as well for a vast number of use cases… We are making Confluent the true backbone of BHG, including leveraging 20+ Confluent connectors across both modern, cloud-based technologies & legacy systems, to help integrate our critical apps & data systems together."

Platform components: Connectors, Security, Data Governance, Stream Processing, Monitoring, Global Resilience, Stream Designer.
12. Everywhere: Connect your data in real time with a platform that spans from on-prem to cloud and across clouds

● Run Anywhere: Deploy across any environment. Provision Confluent as a fully managed service on AWS, Azure, and Google Cloud across 60+ regions with Confluent Cloud, or on-premises with Confluent Platform.
● Unified: Unify data across hybrid and multi-clouds. Provide consistent, self-service access to real-time data across all your environments with Cluster Linking and globally connected clusters that perfectly mirror data.
● Consistent: Learn one platform for all environments. Remove the burden of learning new tools for each environment with a consistent experience spanning cloud, on-prem, and hybrid / multicloud.

"Our transformation to a cloud-native, agile company required a large-scale migration from open source Apache Kafka. With Confluent, we now support real-time data sharing across all of our environments, and see a clear path forward for our hybrid cloud roadmap."
13. Using fully managed connectors is the fastest, most efficient way to break data silos
Accelerated time-to-value • Increased developer productivity • Reduced operational burden

Custom-built connector:
● Costly to allocate resources to design, build, test, and maintain non-differentiated data integration components
● Delays time-to-value, taking up to 3-6+ engineering months to develop
● Perpetual management and maintenance increases tech debt and risk of downtime

Self-managed connector:
● Pre-built, but requires manual installation / configuration effort to set up and deploy connectors
● Perpetual management and maintenance of connectors that leads to ongoing tech debt
● Risk of downtime and business disruption due to connector / Connect cluster related issues

Fully managed connector:
● Streamlined configuration and on-demand provisioning of your connectors
● Eliminates operational overhead and management complexity with seamless scaling and load balancing
● Reduced risk of downtime with Confluent Cloud's 99.99% SLA for all your mission-critical use cases
14. Easily connect with IoT data sources

[Diagram: devices and gateways connect over MQTT through MQTT Proxy¹ to the Kafka broker.]

● Leverage existing infrastructure investments
● Reduce operational complexity: avoid the need for third-party MQTT brokers
● Ensure IoT data delivery: compatible with all QoS levels of the MQTT protocol

¹ Support for self-managed components with a Confluent Cloud subscription with Business support tier or higher.
16. MongoDB Connector for Apache Kafka
● Enables users to easily integrate MongoDB with Kafka
● Users can configure MongoDB as a source to publish data changes from MongoDB into Kafka topics for streaming to consuming applications
● Users can configure MongoDB as a sink to easily persist events from Kafka topics directly to MongoDB collections
● Dead letter queue
● Time series integration
● JMX integration
● Available from Confluent Hub and Verified Gold
● Fully managed using Confluent Cloud
● Configured via Confluent Cloud or the Kafka Connect REST endpoint
● Certified against Apache Kafka 2.3 and Confluent Platform 5.3 (or later)
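As a concrete illustration, a self-managed sink connector writing a topic into a Time Series collection might be configured roughly like this (a sketch, not from the talk: the topic, database, and collection names are made up, and the `timeseries.*` sink options are from recent versions of the MongoDB Connector for Apache Kafka; fully managed Atlas connector options in Confluent Cloud may differ):

```json
{
  "name": "mongodb-sink-timeseries",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "topics": "weather",
    "connection.uri": "mongodb+srv://<user>:<password>@<cluster>/",
    "database": "demo",
    "collection": "weather",
    "timeseries.timefield": "timestamp",
    "timeseries.metafield": "sensorId",
    "timeseries.timefield.auto.convert": "true",
    "timeseries.expire.after.seconds": "9000"
  }
}
```

A payload like this would typically be POSTed to the Kafka Connect REST endpoint mentioned above, or entered through the Confluent Cloud UI.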
17. MongoDB Connector for Kafka

Sink: a Kafka cluster (topicA, topicB, topicC) feeds the MongoDB Sink Connector, which receives events from Kafka topic(s) and writes documents to a collection in the destination MongoDB database.
Source: the MongoDB Source Connector reads documents from a source MongoDB database collection via Change Streams and writes events to Kafka topic(s) (topicA, topicB, topicC) in the Kafka cluster.
18. Sink Connector Specifics

● Reads messages from the topic, based on a pointer (offset) to the next message in the topic
● Writes the messages into a MongoDB database collection
● Moves the pointer to the next message based on the successful write to the database

[Diagram: Kafka topic → connector → database collection. 1: the offset points to the message to read; 2: bulk write to the database; 3: on a successful write of the batch, the offset moves to the next batch.]
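The read / bulk-write / advance-offset loop described above can be sketched in a few lines. This is an illustrative simulation of the control flow, not the connector's actual code; the topic is modeled as a list and the database as an in-memory collection:

```python
def run_sink(topic, write_batch, batch_size=2):
    """Simulate the sink loop: topic is a list of messages, write_batch is a
    callable that persists a batch (standing in for a MongoDB bulk write)."""
    offset = 0  # pointer to the next message to read
    while offset < len(topic):
        batch = topic[offset:offset + batch_size]  # 1: read at the current offset
        write_batch(batch)                         # 2: bulk write the batch
        offset += len(batch)                       # 3: advance only after a successful write
    return offset

collection = []  # stand-in for a MongoDB collection
final_offset = run_sink([{"t": i} for i in range(5)], collection.extend)
print(final_offset, len(collection))  # 5 5
```

Because the offset only advances after a successful write, a failed batch is re-read on restart, which is why this pattern gives at-least-once delivery.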
20. Time Series Collections (launched with MongoDB 5.0)
An optimized, column-oriented collection type for time-series data that organizes writes so that data from the same source is stored in the same bucket, alongside other data points from a similar point in time.
● Increases developer productivity
● Reduces complexity of working with time series data
● Reduces I/O for read operations
● Massive reduction in storage size and index size
● Optimized WiredTiger cache usage
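The bucketing idea above can be sketched as follows. This is a toy model of the concept (grouping by source and time window), not MongoDB's internal bucket format; the field names mirror the weather example used later in the talk:

```python
from collections import defaultdict

def bucket(measurements, meta_field="sensorId", time_field="timestamp", span=60):
    """Group measurements by (metaField, time window): a toy model of how a
    Time Series collection co-locates points from the same source taken at a
    similar point in time."""
    buckets = defaultdict(list)
    for m in measurements:
        window = m[time_field] // span  # e.g. one bucket per 60-second window
        buckets[(m[meta_field], window)].append(m)
    return dict(buckets)

data = [
    {"sensorId": 123, "timestamp": 0,  "temperature": 47.0},
    {"sensorId": 123, "timestamp": 30, "temperature": 47.5},
    {"sensorId": 456, "timestamp": 10, "temperature": 69.8},
]
print(sorted(bucket(data)))  # [(123, 0), (456, 0)] -- one bucket per sensor here
```

Co-locating a sensor's nearby readings is what enables the I/O and compression benefits listed above.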
21. Creating a Time Series Collection (launched with MongoDB 5.0)
To create a Time Series collection, use the timeseries option:

db.createCollection("weather", {
  timeseries: {
    timeField: "timestamp",
    metaField: "sensorId",
    granularity: "minutes"
  },
  expireAfterSeconds: 9000
})

The timeField is the only required parameter for a Time Series collection.
22. Terminology & concepts: metaField

● A label or tag that uniquely identifies a time series
● Never or rarely changes over time

> db.createCollection("weather", { timeseries: { ..., metaField: "sensorId" } })

{ "sensorId": 123, "timestamp": ISODate("..."), "temperature": 47.0 },
{ "sensorId": 456, "timestamp": ISODate("..."), "temperature": 69.8 },
{ "sensorId": 789, "timestamp": ISODate("..."), "temperature": 97.0 }

[Chart: one temperature series per sensorId (123, 456, 789).]
23. Terminology & concepts: measurement

● A set of related key-value pairs at a specific time
● Any field other than the metadata and time fields

In the weather example above, temperature is the measurement for each sensorId at each timestamp.
25. Time Series Collection Columnar Compression (launched with MongoDB 5.2)
Columnar compression adds a number of innovations that work together to significantly improve practical compression before on-disk compression.
● Dramatically reduces database storage footprint
● Improves read performance
● Increases cache efficiency, fitting more data in memory and using less I/O
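One ingredient commonly used in columnar time-series compression is delta encoding: storing the first value plus successive differences, which turns slowly changing columns (timestamps, sensor readings) into streams of small, highly compressible numbers. The sketch below illustrates that general idea only; it is not MongoDB's actual on-disk encoding:

```python
def delta_encode(values):
    """Store the first value, then each successive difference."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Reverse delta encoding by cumulative summation."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Regularly spaced timestamps collapse into a run of identical small deltas.
timestamps = [1000, 1060, 1120, 1180, 1240]
encoded = delta_encode(timestamps)
print(encoded)  # [1000, 60, 60, 60, 60]
assert delta_decode(encoded) == timestamps
```

A run of identical deltas like this is exactly what downstream on-disk compression handles extremely well, which is how the storage reductions on the next slide become possible.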
26. Time Series Collection Columnar Compression Example
Uncompressed BSON vs. storage size (weather data):
● Uncompressed BSON size: 107 MB
● Time Series collection compressed bucket size: 6 MB
● Time Series collection compressed storage size: 2.2 MB (-97%)
27. Querying Time Series Collections (launched with MongoDB 5.0)

> db.weather.find()

When querying time-series collections, two main things happen under the hood:
● Query rewrites
● Bucket "unpacking"
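The "unpacking" step can be illustrated with a toy model: measurements are stored column-wise inside a bucket, and a query expands them back into the individual documents the user sees. This is a conceptual sketch with invented field names (meta, timestamps, temperatures), not MongoDB's actual bucket schema:

```python
def unpack(bucket):
    """Expand a column-wise bucket back into row-wise measurement documents,
    a toy model of the 'unpacking' a time series query performs."""
    return [
        {"sensorId": bucket["meta"],
         "timestamp": bucket["timestamps"][i],
         "temperature": bucket["temperatures"][i]}
        for i in range(len(bucket["timestamps"]))
    ]

bucket = {"meta": 123, "timestamps": [0, 60], "temperatures": [47.0, 47.5]}
docs = unpack(bucket)
print(len(docs), docs[0]["temperature"])  # 2 47.0
```

Query rewrites complement this: where possible, predicates are pushed down to whole buckets (e.g. using per-bucket metadata) so that buckets which cannot match are never unpacked at all.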