Takeda’s Plasma Derived Therapies (PDT) business unit has recently embarked on a project to use Spark Streaming on Databricks to empower how they deliver value to their plasma donation centers. As patients come in and interface with our clinics, we store and track all of the patient interactions in real time and deliver outputs and results based on those interactions. The problem with our existing architecture is that it is very expensive to maintain and has an unsustainable number of failure points. Spark Streaming is essential for this use case because it allows for a more robust ETL pipeline: with Spark Streaming, we are able to replace our existing ETL processes (based on Lambdas, Step Functions, triggered jobs, etc.) with a purely stream-driven architecture.
Data is brought into our S3 raw layer as a large set of CSV files through AWS DMS and Informatica IICS, which move data from on-prem systems into our cloud layer. A stream picks up these raw files and merges them into Delta tables established in the bronze/stage layer, with AWS Glue serving as the metadata provider for all of these operations. From the stage layer, another set of streams uses the stage Delta tables as their source, transforming and conducting stream-to-stream lookups before writing the enriched records into RDS (the silver/prod layer). Once the data has been merged into RDS, a DMS task lifts it back into S3 as CSV files; a small intermediary stream merges these CSV files into corresponding Delta tables, which feed our gold/analytic streams. The on-prem systems are able to speak to the silver layer, enabling the near real-time latency that our patient care centers require.
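As a rough sketch of the raw-to-bronze leg of this pipeline, the ingest stream might look like the following. The donor_visits feed, S3 paths, schema, and visit_id merge key are illustrative assumptions, not details from the talk; spark is the ambient SparkSession on Databricks.

```python
# Sketch of the raw -> bronze ingest stream (illustrative names and paths).
from delta.tables import DeltaTable
from pyspark.sql.types import StructType, StringType, TimestampType

raw_schema = (
    StructType()
    .add("visit_id", StringType())
    .add("donor_id", StringType())
    .add("updated_at", TimestampType())
)

def upsert_to_bronze(batch_df, batch_id):
    # Keep one row per key so the merge has unambiguous matches.
    deduped = batch_df.dropDuplicates(["visit_id"])
    bronze = DeltaTable.forName(spark, "bronze.donor_visits")  # assumes table exists
    (bronze.alias("t")
        .merge(deduped.alias("s"), "t.visit_id = s.visit_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .schema(raw_schema)
    .csv("s3://pdt-raw/donor_visits/")  # files landed by DMS / Informatica IICS
    .writeStream
    .foreachBatch(upsert_to_bronze)
    .option("checkpointLocation", "s3://pdt-checkpoints/bronze/donor_visits/")
    .start())
```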
Empowering Real Time Patient Care Through Spark Streaming
1. Empowering PDT Analytics through Databricks & Spark Structured Streaming
05/05/2021
Arnav Chaudhary (he/him)
Digital Product Manager
Takeda
Jonathan E. Yee (he/him)
Data and Analytics Executive
EY
Jeff Cubeta (they/them)
Clinical Intelligence Executive
Algernon Solutions
3. Databricks on Takeda’s Enterprise Data Backbone
The EDB (Enterprise Data Backbone) is Takeda’s integrated data platform responsible for combining
global data assets into a single source of truth and enabling tools to provide insights via analytics.
Domains
• Data Ingestion
• Data Processing
• Advanced Analytics
Global Regions
• US
• Europe
• Japan
Applications
• Python
• RStudio
Specialized Deployment
• MIT research as an ongoing collaboration
Databricks on AWS is used heavily by Takeda across the business:
• 200,000 DBUs of monthly compute
• 600+ monthly active users
• 50+ validated schemas with 100s of tables
• 15 advanced analytics teams using Databricks
4. PDT Analytics Program
What are we solving for?
• Drive improved plasma yield
• Reduce cost per liter
• Harvest the value of PDT’s data assets
Expected Outcomes
• Increased access to a greater volume of plasma donors
• Gain access to a larger share of the donor market to reduce CPL
• Increase yield by improving retention and the conversion funnel
• Reduce manual processes and increase automation to improve operations efficiency
• Improved data, analytics, and process layers for PDT analytics
5. PDT Donor Portal Application & Analytics Foundation
Previously…
• Existing 153 disparate center systems
• Reliance on 3rd party for marketing insights
• Manual report generation
• Lack of real-time information for quick decision making
• Reactive decision-making process
Going forward…
• Consolidated data into one operational data store (ODS) with near real-time data transmission
• Data lake to store years of information
• Analytics platform allowing data scientists to perform data mining, create predictive models, and generate actionable insights
• Reduction of manual reports
PDT/BioLife Data Backbone
• PDT is the pioneer using the newly developed Takeda Enterprise Data Backbone Platform in the cloud
• Supporting Analytics, Operational Use, and other Products’ data needs (e.g. Donor Engagement, Fuji Innovation Engine)
6. Three Key Pain Points with PDT Data Analytics
• Data Isolation
• Latency of Analytics
• Narrow Audience
The legacy landscape behind these pain points:
• Daily batch jobs
• Manual report generation
• Limited access to data
• Structured, typed SQL data
• API-returned JSON
• Scheduled CSV uploads
At scale: 4 enterprise data systems, 151 collection centers, 250 SQL tables, ~1 TB of historic data, and ~0.5 GB/hr of ongoing CDC.
We designed opportunities to drive value and address the core pain point themes for PDT:
Real Time Data
• Spark Structured Streams
• Low latency data processing
• Standardized event streams to empower downstream apps
Unified Data Schema
• Single presence for Donors
• Cross system relationships
• Business process data entities
Lakehouse Model
• Uniform ingestion process
• Configuration driven operations (see the sketch below)
• S3 Delta Tables
• Data served to SQL DB for low latency, high volume querying
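The configuration-driven operations bullet could be realized along these lines. The TABLE_CONFIGS entries, paths, and use of Databricks Auto Loader (cloudFiles) are assumptions for illustration, not the team’s actual configuration:

```python
# Hypothetical configuration-driven launcher: one generic job starts an
# ingest stream per configured table instead of bespoke pipelines.
TABLE_CONFIGS = [
    {"name": "donor_visits", "raw_path": "s3://pdt-raw/donor_visits/"},
    {"name": "donors",       "raw_path": "s3://pdt-raw/donors/"},
]

for cfg in TABLE_CONFIGS:
    (spark.readStream
        .format("cloudFiles")                     # Databricks Auto Loader (assumption)
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation",
                f"s3://pdt-checkpoints/schemas/{cfg['name']}/")
        .load(cfg["raw_path"])
        .writeStream
        .option("checkpointLocation",
                f"s3://pdt-checkpoints/bronze/{cfg['name']}/")
        .toTable(f"bronze.{cfg['name']}"))
```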
9. Lakehouse Model
Key Design Details
• Uniform ingestion platform
• Improved accessibility to data
• Delta Tables backing each layer
• Structured Streams between layers (see the sketch after this list)
• Support for big data analysis through serving Delta Tables
• Support for high volume, low latency querying using SQL-based tools
• Extensible design to allow expansion
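A minimal sketch of one stream between layers, assuming hypothetical stage.donor_visits and stage.donors tables and a donor_id join key. The talk describes stream-to-stream lookups; this simplified version uses a stream-static join, where Delta resolves the static side afresh as micro-batches run:

```python
# Sketch of a stage -> silver stream: stream from the stage Delta table,
# enrich via a lookup, write to a silver table (names are illustrative).
visits = spark.readStream.table("stage.donor_visits")

# Static Delta lookup joined against the stream (stream-static join).
donors = spark.read.table("stage.donors")

enriched = visits.join(
    donors.select("donor_id", "center_id", "donor_status"),
    on="donor_id",
    how="left",
)

(enriched.writeStream
    .option("checkpointLocation", "s3://pdt-checkpoints/silver/donor_visits/")
    .toTable("silver.donor_visits_enriched"))
```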
11. Using foreachBatch to Fork and Serve Streaming CDC Data
When writing the CDC stream, we use a foreachBatch function to fork each micro-batch to multiple sinks, applying the Delta Table merge construct within _serve:
• Delta Table
• SQL Database
• Event Bridge
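A hedged sketch of what such a _serve function could look like, assuming a hypothetical gold.donor_events table keyed on event_id, a placeholder JDBC endpoint and secret scope, and the default EventBridge bus; the actual keys, credentials, and sinks in the pipeline may differ:

```python
import json

import boto3
from delta.tables import DeltaTable

def _serve(batch_df, batch_id):
    batch_df.persist()  # the batch feeds three sinks; avoid recomputing it

    # 1) Delta Table: upsert the CDC micro-batch with the merge construct.
    target = DeltaTable.forName(spark, "gold.donor_events")  # assumed table
    (target.alias("t")
        .merge(batch_df.dropDuplicates(["event_id"]).alias("s"),
               "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # 2) SQL Database: write the same batch over JDBC.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://prod-rds:5432/pdt")   # placeholder
        .option("dbtable", "gold_donor_events")
        .option("user", dbutils.secrets.get("pdt", "rds-user"))      # assumed scope
        .option("password", dbutils.secrets.get("pdt", "rds-password"))
        .mode("append")
        .save())

    # 3) EventBridge: emit a per-batch notification (runs on the driver).
    boto3.client("events").put_events(Entries=[{
        "Source": "pdt.streaming",
        "DetailType": "cdc-batch-served",
        "Detail": json.dumps({"batchId": batch_id, "rows": batch_df.count()}),
        "EventBusName": "default",
    }])

    batch_df.unpersist()

(spark.readStream
    .table("silver.donor_events")  # assumed CDC source table
    .writeStream
    .foreachBatch(_serve)
    .option("checkpointLocation", "s3://pdt-checkpoints/gold/donor_events/")
    .start())
```

Note that foreachBatch provides at-least-once guarantees, so a non-idempotent sink like the EventBridge notification may see duplicate events if a batch is retried; the Delta merge, by contrast, is idempotent per key.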