Responding to a global pandemic presents a unique set of technical and public health challenges. The central challenge is gathering data that arrives over many data streams in a variety of formats: how well that is done influences real-world outcomes and impacts everyone. The Centers for Disease Control and Prevention's CELR (COVID-19 Electronic Lab Reporting) program was established to rapidly aggregate, validate, transform, and distribute laboratory testing data submitted by public health departments and other partners. Confluent Kafka with Kafka Streams and Connect plays a critical role in the program's objectives to:
• Track the threat of the COVID-19 virus
• Provide comprehensive data for local, state, and federal response
• Better understand locations with an increase in incidence
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka Summit 2020
1. National Center for Emerging and Zoonotic Infectious Diseases
Flattening the Curve with COVID-19 Electronic Lab Reporting
Rishi Tarar, Northrop Grumman
Jason Hall, CDC
Kafka Summit, 2020
2. Background
§ This architecture stemmed from the needs of CDC's EIP (Emerging
Infections Program) programs, with an eye on ongoing agency efforts
(CDC Data and IT Modernization)
§ Multiple national-level use case implementations proved out the
architecture and exposed commonality that can extend enterprise-wide…
§ And meet hard challenges like a pandemic head on
3. COVID-19 Electronic Lab Reporting (CELR) - Scope
§ Agency initiative to collect COVID-19 line-level lab testing data
from all jurisdictions in the United States
§ Goal is to have the most comprehensive testing data
§ Improve the quality and fidelity of line-level data on an
ongoing basis
§ Could be used for other conditions
6. Primary Citizen -> TESTING EVENT
§ Data: Each record is a TESTING EVENT
§ Producers organized adjacent to feed formats
§ Streaming data and shaping it record-by-record through the pipelines
§ Each record is a primary citizen
– Each record flows through the set of stream processors
– Metadata is added to each record
– Makes “things” happening to “a” record rapidly observable
– Each record conforms to a schema that can evolve
§ Data can be aggregated and streamed to any destination on the fly
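The per-record metadata idea above can be sketched in a few lines. This is a hypothetical illustration, not the program's actual schema: the `_meta` field name, `event_id`, and `trace` structure are assumptions chosen to show how each stage a record passes through becomes observable.

```python
import uuid
from datetime import datetime, timezone

def enrich(record: dict, stage: str) -> dict:
    """Attach pipeline metadata to a single testing-event record so that
    everything happening to "a" record stays rapidly observable."""
    meta = record.setdefault("_meta", {})
    # Stable event id, assigned once on first contact with the pipeline.
    meta.setdefault("event_id", str(uuid.uuid4()))
    # Append one trace entry per processing stage.
    meta.setdefault("trace", []).append({
        "stage": stage,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return record

record = {"patient_age": 42, "result": "positive"}
record = enrich(record, "validate")
record = enrich(record, "transform")
print([step["stage"] for step in record["_meta"]["trace"]])
# -> ['validate', 'transform']
```

Because the trace rides along inside the record itself, any sink or dashboard downstream can reconstruct the record's history without joining against a separate audit store.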
7. Event Pipelines
[Architecture diagram, summarized:]
§ Event sourcing: pipelines organized by Program and Pathogen (Program X,
Program Y), with current and new producers organized per feed (Feed1, Feed2)
§ Each event flows through staged stream processors:
Validate -> Redact -> Transform -> Translate -> Biz Rules -> Case Classification
§ Configuration driven workloads
§ Data events sink-ed via S3 Sink, JDBC Sink, and Elastic Sink Connectors
into the Data Lake and storage (Blob, Relational, ElasticSearch)
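The staged processor chain (Validate -> Redact -> Transform -> Translate -> …) can be sketched as a sequence of pure functions applied record-by-record. This is a minimal, hypothetical Python sketch of the pattern, not the actual KStreams topology; the field names and stage bodies are invented for illustration.

```python
def validate(rec: dict) -> dict:
    """Reject records missing the fields every testing event must carry."""
    if "result" not in rec:
        raise ValueError("missing result field")
    return rec

def redact(rec: dict) -> dict:
    """Drop direct identifiers before the record leaves the pipeline."""
    rec = dict(rec)
    rec.pop("patient_name", None)
    return rec

def transform(rec: dict) -> dict:
    """Normalize values into the pipeline's canonical shape."""
    rec = dict(rec)
    rec["result"] = rec["result"].upper()
    return rec

# The chain is data, so reordering or extending it is a config change.
STAGES = [validate, redact, transform]

def run_pipeline(rec: dict) -> dict:
    for stage in STAGES:
        rec = stage(rec)
    return rec

out = run_pipeline({"patient_name": "J. Doe", "result": "detected"})
# out == {"result": "DETECTED"}
```

In the real system each stage would be a Kafka Streams processor reading from and writing to topics, which is what makes every intermediate hop independently observable and replayable.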
8. The Platform High Level Architecture
[Architecture diagram, summarized:]
§ Data sources: Labs, Hospitals, SPHL, Registries
§ Inbound formats: HL7, CDA, FHIR, CSV/JSON, XLSX
§ Format-specific Kafka pipelines: HL7 Pipelines, CDA Pipelines,
FHIR Pipelines, FLAT Pipelines
§ Data Lake (S3), queried via Athena and Redshift, with a Schema Dictionary
§ Outputs: Dashboards, Data Sets, Data Science Tools, Partner Collaboration
Tools, Real Time Data Stream, Custom Data Sets, Bulk Exports, Machine Learning
§ Use Cases Implemented: Case Notifications, Lab Reporting, Healthcare
Interoperability
§ Users: Business User (Non Tech), Data Manager, Data Science User
9. Data Storefronts
[Architecture diagram, summarized:]
§ Self Service Data Storefront (Business User, Data Manager, Data Science
User): Athena Tables, Redshift Tables, QuickSight, Data Science Tools,
DCIPHER, Curated Views, All Data, HHS CELR Portal
§ Automated Data Storefront: Line Level Lab Data, Aggregated Lab Data,
Merged Lab Data
§ Data Products: Provenance, Validation Reports, Dead Letter Reports,
Audit Reports (VAR Team)
§ Analytical Pipelines: Glue Crawler and Glue ETL Jobs over the Data Lake,
scheduled and triggered to update hourly; steps include Translation,
Exclusion Tagging, Race and Age Calculation, and Filter; results feed the
Data Catalogue
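The hourly ETL steps named above (Exclusion Tagging, Age Calculation) can be illustrated with a minimal, hypothetical Python sketch. The field names (`dob`, `age`, `excluded`) and the 0–120 plausibility bound are assumptions for illustration, not the program's actual rules.

```python
from datetime import date

def age_in_years(dob: date, as_of: date) -> int:
    """Completed years between date of birth and a reference date."""
    years = as_of.year - dob.year
    # Subtract one if the birthday has not yet occurred this year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

def tag_record(rec: dict, as_of: date) -> dict:
    """Derive age and tag implausible records for exclusion, rather than
    silently dropping them, so audit reports can account for every record."""
    rec = dict(rec)
    rec["age"] = age_in_years(rec["dob"], as_of)
    rec["excluded"] = not (0 <= rec["age"] <= 120)
    return rec

tagged = tag_record({"dob": date(1980, 6, 15)}, date(2020, 6, 14))
# tagged["age"] == 39, tagged["excluded"] == False
```

Tagging instead of deleting is the design choice that makes the downstream validation and dead-letter reports possible: nothing disappears, it only gets labeled.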
10. Features in place TODAY
§ Ingest
• Real time Staged Event Pipeline Processing or Manual Upload
• HL7 Pipelines - Support HL7 (2.5.1, 2.5, 2.3.z, 2.3.1)
• FLAT Pipelines - CSV/FLAT/JSON (Any Size)
• FHIR Pipeline*
§ Validation
• FLAT File and Record level Validations via Configurations (no code)
• HL7 2.5.1, 2.5, 2.3.z, 2.3.1 Validations via Configurations
§ Transformations
• FLAT to HL7 Hierarchy
• HL7 to FHIR (per build.fhir.org)*
• HL7 to FLAT via Configurations
§ Translations
• Terminology transformations via Configurations
§ Data Lake Management Services
• At-scale ETL Workflows
• SQL Style Querying on all Data
• Data Replay and Data De-Duplication
• Biz rules for calculating fields
• Machine Learning for feature extraction from raw data and ETL for Data cleaning*
• Configuration Management
• Data Case Classifications
• Data Catalogue (Schemas and Dictionaries)
• Auditor Services for proactive issue detection
§ Data Policy and Governance
• Data Use Agreement Filter
• Data Enrichment
• Auto Data Catalogue
• Data Security
• Data Redaction
• Pseudonymization for linking*
• De-Identification
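The "validations via Configurations (no code)" feature above amounts to keeping rules in data rather than in code, so a new feed needs a config entry rather than a release. A minimal, hypothetical sketch of that pattern follows; the rule vocabulary (`required`, `allowed`, `pattern`) and field names are assumptions, not CELR's actual configuration format.

```python
import re

# Rules live in configuration, not code.
RULES = {
    "result": {"required": True,
               "allowed": ["positive", "negative", "inconclusive"]},
    "specimen_collected": {"required": True},
    "patient_zip": {"required": False, "pattern": r"^\d{5}$"},
}

def validate_record(rec: dict, rules: dict = RULES) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field, rule in rules.items():
        value = rec.get(field)
        if value is None:
            if rule.get("required"):
                errors.append(f"{field}: missing required field")
            continue
        allowed = rule.get("allowed")
        if allowed and value not in allowed:
            errors.append(f"{field}: {value!r} not in {allowed}")
        pattern = rule.get("pattern")
        if pattern and not re.match(pattern, str(value)):
            errors.append(f"{field}: {value!r} fails pattern")
    return errors

good = {"result": "positive", "specimen_collected": "2020-05-01",
        "patient_zip": "30333"}
bad = {"result": "maybe"}
# validate_record(good) == []; validate_record(bad) has two errors
```

Returning an error list rather than raising lets the pipeline route a failing record to a dead-letter report with every problem attached, instead of stopping at the first one.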
11. More Features in place TODAY
§ Data Products (Reporting/Provision)
• Merged Line level Data from all sources in single schema
• On demand canned Data Products (extracts)
• Bulk Data Exports - time stamped data sets at scale
• Self Service Custom queries
• De-duplications for resubmissions at the record level
§ Data Integration Products
• Data Routing
• Clinical Decision Support for Guidance Delivery*
• Exposing Data as FHIR API*
• SMART on FHIR App for integration with EHR*
§ Analytics
• Real Time Dashboards for Lake Operations
• Real Time Dashboards for Lake Data Quality and Provenance
• Jupyter Notebooks with all tooling (R, Python, Scala) for Data Science
• Spark Jobs for high volume batch processing
• Canned ML algorithms
§ DEVSECOPS
• DEV to PROD in hours not days
• Full scans and deployment as part of CI/CD
• HOSTED on FISMA Moderate Cloud Environment
• CDC ATO Environment
• HIPAA Compliant Environment
§ Data Apps
• Portal Access for Partner Agencies based on Business needs
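"De-duplications for resubmissions at the record level" follows a simple last-write-wins pattern: when a jurisdiction resubmits, the latest version of each record, keyed by a stable record identifier, replaces the earlier one. A minimal, hypothetical sketch, with invented field names (`record_id`, `submitted_at`):

```python
def deduplicate(records: list) -> list:
    """Keep only the most recent submission for each record key."""
    latest = {}
    for rec in records:
        key = rec["record_id"]
        if key not in latest or rec["submitted_at"] > latest[key]["submitted_at"]:
            latest[key] = rec
    return list(latest.values())

batch = [
    {"record_id": "A1", "submitted_at": "2020-05-01T10:00", "result": "inconclusive"},
    {"record_id": "A1", "submitted_at": "2020-05-02T09:00", "result": "positive"},
    {"record_id": "B2", "submitted_at": "2020-05-01T10:00", "result": "negative"},
]
deduped = deduplicate(batch)
# Two records survive; A1's later "positive" result wins.
```

In a Kafka setting the same effect falls out naturally from keying records by `record_id` and letting a compacted topic or keyed table retain only the latest value per key.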
13. All Features are AT SCALE
§ Parallelism in Data Pipelines for Large-Scale Processing
– 30 Kafka Partitions, 5 Broker Kafka Cluster
§ Horizontal scaling for storage (S3, Redshift -> Petabytes)
§ Delivering Data to Consumers at Scale
– Bulk Exports -> Gigabyte Slices of Data
§ Cloud managed Serverless services for analytics
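The 30-partition figure above is what buys parallelism: records with the same key always land on the same partition, preserving per-key ordering while different keys are processed concurrently. The sketch below illustrates the principle only; Kafka's default partitioner actually uses murmur2, not MD5, and the key scheme is invented.

```python
import hashlib

NUM_PARTITIONS = 30  # matches the cluster sizing described above

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition.
    Same key -> same partition, so per-key order is preserved."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("jurisdiction-GA")
p2 = partition_for("jurisdiction-GA")
# p1 == p2: every record for this key goes to one partition
```

Up to 30 consumer instances can then each own a slice of the key space, which is how one pipeline scales horizontally without reordering any single submitter's records.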
14. Current Status
§ Status: In Production
§ Infrastructure Build out completed in ~5 days
§ Initial production deployment in ~10 days
§ Data Streams Logistical stabilization: ~15 days
§ O&M Started 30 days from Start Date
§ Full stack release cycle every 3 days (twice a week) -> now down to once per week
§ Data Products and Analytic Products are in “Real Time”
§ Data Consumers
– HHS Protect
– CDC
16. For more information, contact CDC
1-800-CDC-INFO (232-4636)
TTY: 1-888-232-6348 www.cdc.gov
The findings and conclusions in this report are those of the authors and do not necessarily represent the
official position of the Centers for Disease Control and Prevention.
Jason Hall, NCEZID, (zfr9@cdc.gov)
Rishi Tarar, Enterprise Architect and Fellow, Northrop Grumman (rrt8@cdc.gov)
17. Terminology
§ Disease surveillance is an epidemiological practice by which the spread of
disease is monitored in order to establish patterns of progression.
– The main role of disease surveillance is to
• predict, observe, and minimize the harm caused by outbreak,
epidemic, and pandemic situations, as well as
• increase knowledge about which factors contribute to such
circumstances.
18. “Surveillance data is a series of natural and spontaneous
raw data streams.
Don't resist them; that only creates sorrow and silos.
Let reality be the reality.
Let data streams flow naturally forward in whatever way it likes.”
-- Adapted from Lao Tzu