SlideShare uma empresa Scribd logo
1 de 19
Baixar para ler offline
A Journey To Modernization
Shawn Benjamin & Prabha Rajendran
Problems Faced
q Informatica ETL pipeline was brittle
q Lengthy Informatica ETL development
cycle
q Lengthy load time workflows for ingestion
of data
q Lack of ability for real time/ near real-time
data
q Lack of data science platform
2
Legacy Architecture
Data Scientists
Statisticians
Business Analysts
Extracted
Data
R / Python Code Business
Insight
3
Datawarehouse
Source Systems
4
C3 CON
eCIMS
C3 LANS
C4
NFTS
ELIS
NASS
CPMS
Pay.gov
AR-11
RAPS
MFAS
VSS
ATS
PCTS
SNAP
RNACS
Benefits
Files
Payment
Verification
Validation
Cust Svc
Scheduler
AdminFraudFOIA
Benefits
Mart
Scheduler
Mart
Payment
Mart
Validation
Mart
APSS
C3 Con
C4
CIS
ELIS2
MFAS
NFTS
PCTS
RNACS
VSS
C3 LAN
AAO x2
CSC x2
MSC x2
NSC x2
TSC x2
VSC x2
CHAMPS
ECHO
iCLAIMS
IDCMS
QUEUE
NPWR
CAMINO VIBE
SODA
Data Marts
SCCLAIMS
FDNS-DS
SMART Subject Areas
SAS LibrarieseCISCOR
Direct Connects
Data
Marts
Direct Connects
Active ODSes Decom. ODSes
Treasury
CIR
CIS
x7
x1
x2
x1
x1
x5
x2
x5
x2
x1
x2
x6 x1
x1
x1
x1
x2
x1
x2
x1
x1
LEGEND
ODSes
VIS x2
CPMS x1
SRMT x1
NFTS x1x1
x1
x1
FACCON x1
ePULSE x1
BI Tools
Data Marts
Users
4
2016SNAPSHOT
JANUARY
Data
Sources
SMART
Subject Areas
36
2
eCISCOR
ODS
FutureData MartDirect Connect
66
2,354
ETL28 Processes
Implemented
Databricks private
cloud
VPC (26 nodes) in the
AWS
Connected the Databricks
cluster to the Oracle database
Created all relevant
DB tables in HIVE
metadata pointing to
Oracle database
Copied relevant tables from Oracle
database to S3 using Scala code
Data is stored in Apache Parquet
columnar format. For context, the 120
million row 83 column can be dumped
to S3 in just 10 minutes.
Identified appropriate
partition scheme
large tables were partitioned to
optimize Spark query performance
Created multiple notebooks
Perform data analysis and visualize the
results, e.g. created histogram of case
life cycle duration
Successful Proof of Concept
5
Current Databricks Implementation
6
Statisticians
Business Analysts
Business
Insight
S3 Data
Lake
Data Scientists
LakeHouse
7
• 75 Data
Sources
• Xx Data
Interfaces
• 7 Data
Marts
• 4 BI Tools
• 6,086
Tableau
Dashboards
• 118 SMART
Subject
Areas
• 56 SAS
Libraries
• 6,233 Users
• 75 Data
Sources
• 35 Application
Interfaces
• 7 Data Marts
• 4 BI Tools
• 6,086
Tableau
Dashboards
• 118 SMART
Subject Areas
• 56 SAS
Libraries
• 6,233 Users
Databricks Accomplishments
IMPLEMENTATION
OF DELTA LAKE
EASY INTEGRATION
WITH OBIEE ,SAS
AND TABLEAU WITH
NATIVE
CONNECTORS
INTEGRATION WITH
GITHUB FOR
CONTINUOUS
INTEGRATION &
DEPLOYMENT
AUTOMATING
ACCOUNT
PROVISIONING
MACHINE LEARNING
(ML FLOW)
INTEGRATION
8
Change Data Capture using Delta Lake
Databricks Delta –Success Factors
v Faster Ingestion of CDC changes
v Resiliency
v Improved Data Quality , Reporting
Availability and runtime
performance
v Schema evolution - adding
additional columns without
rewriting the entire table
Databricks Delta-Lessons Learned
v Storage requirements increased
v Vacuum and Optimization is
mandatory to improve the
performance
9
Unified Data Analytical Platform -Tableau
10
Unified Data Analytical Platform –OBIEE/SAS
11
Data Science Experiments using ML
12
ML Graphs processes after running the models
Prediction Model Samples
Text and Log Mining
0
0.5
1
NegativeSentiment Positive Sentiment
Sentiment Analysis
13
Time Series Models and H2O Integration
Integrated H20 with Databricks and built a model predicting the count of ‘No show’ on N400 using
the traditional Time series forecasting to predict inefficiencies in normal day-to-day planning and operations
14
Enabling Security & Governance
15
Access
Control (ACL)
Credentials
Passthrough
Secrets
Management
v Control users access to data using the
Databricks view-based access control
model (Table and schema level ACLs)
v Control users access to clusters that are
not enabled for table access control
v Enforced data object privileges at
onboarding phase
v Used Databricks secrets manager to store
credentials and reference in notebooks
Databricks Management API Usage
16
Cluster/Jobs management àCreate, delete,
manage clusters and get execution status of daily
scheduled jobs which helped automated
monitoring.
Library /Secret managementà Easy upload of
any third-party libraries and manage encrypted
scopes/credentials to connect to source and
target endpoints.
Integration and Deploymentsà API with Git
and Jenkins for continuous integration and
continuous deployment
Enabled MLFlow Tracking API for our Data
Science experiments
API
Integrated Databricks Management API with Jenkins and other scripting tools to
automate all our administration and management tasks.
Lessons learned through this Journey
Training plan Cloud based
experience
Subject Matter
Expertise
Automation
17
Success Strategy
Success Criteria Benefit
Performance
ü Auto-scalability leveraging on-demand and spot instances
ü Efficient processing of larger datasets comparable to RDMS systems
ü Scalable read/write performance on S3
Support for a variety of statistical
programming languages
ü Data Science Platform ( R, Python, Scala and SQL)
ü Supports MLIB : Machine Learning & Deep Learning
Integration with existing tools
ü Allows connections to industry standard technologies via ODBC/JDBC
connection and inbuilt connectors.
Easily integrate new data sources
ü Supports seamless integration with data streaming technologies like
Kafka/Kinesis using Spark Streaming. This supports both structured and
unstructured
ü Leverages S3 extensively
Secure
ü Supports integration with multiple Single-Sign-On platforms
ü Supports native encryption-decryption features (AES-256 and KMS)
ü Supports Access Control Layer (ACL)
ü Implemented in USCIS Private cloud
18
Questions!
19

Mais conteúdo relacionado

Semelhante a Lessons Learned from Modernizing USCIS Data Analytics Platform

Leveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern AnalyticsLeveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern Analyticsconfluent
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 editionDavid Talby
 
Overview SQL Server 2019
Overview SQL Server 2019Overview SQL Server 2019
Overview SQL Server 2019Juan Fabian
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafkaconfluent
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeDATAVERSITY
 
Sql server 2019 new features
Sql server 2019 new featuresSql server 2019 new features
Sql server 2019 new featuresGeorge Walters
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingTimothy Spann
 
AnalysisServices
AnalysisServicesAnalysisServices
AnalysisServiceswebuploader
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsVMware Tanzu
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Replyconfluent
 
Gs08 modernize your data platform with sql technologies wash dc
Gs08 modernize your data platform with sql technologies   wash dcGs08 modernize your data platform with sql technologies   wash dc
Gs08 modernize your data platform with sql technologies wash dcBob Ward
 
Streaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaStreaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaAttunity
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperVasu S
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & RŁukasz Grala
 

Semelhante a Lessons Learned from Modernizing USCIS Data Analytics Platform (20)

Leveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern AnalyticsLeveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern Analytics
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
Overview SQL Server 2019
Overview SQL Server 2019Overview SQL Server 2019
Overview SQL Server 2019
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafka
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 
Sql server 2019 new features
Sql server 2019 new featuresSql server 2019 new features
Sql server 2019 new features
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
 
AnalysisServices
AnalysisServicesAnalysisServices
AnalysisServices
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive Applications
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Reply
 
Gs08 modernize your data platform with sql technologies wash dc
Gs08 modernize your data platform with sql technologies   wash dcGs08 modernize your data platform with sql technologies   wash dc
Gs08 modernize your data platform with sql technologies wash dc
 
Sparkflows.io
Sparkflows.ioSparkflows.io
Sparkflows.io
 
Streaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaStreaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache Kafka
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 

Último (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

Lessons Learned from Modernizing USCIS Data Analytics Platform

  • 1. A Journey To Modernization Shawn Benjamin & Prabha Rajendran
  • 2. Problems Faced q Informatica ETL pipeline was brittle q Lengthy Informatica ETL development cycle q Lengthy load time workflows for ingestion of data q Lack of ability for real time/ near real-time data q Lack of data science platform 2
  • 3. Legacy Architecture Data Scientists Statisticians Business Analysts Extracted Data R / Python Code Business Insight 3 Datawarehouse Source Systems
  • 4. 4 C3 CON eCIMS C3 LANS C4 NFTS ELIS NASS CPMS Pay.gov AR-11 RAPS MFAS VSS ATS PCTS SNAP RNACS Benefits Files Payment Verification Validation Cust Svc Scheduler AdminFraudFOIA Benefits Mart Scheduler Mart Payment Mart Validation Mart APSS C3 Con C4 CIS ELIS2 MFAS NFTS PCTS RNACS VSS C3 LAN AAO x2 CSC x2 MSC x2 NSC x2 TSC x2 VSC x2 CHAMPS ECHO iCLAIMS IDCMS QUEUE NPWR CAMINO VIBE SODA Data Marts SCCLAIMS FDNS-DS SMART Subject Areas SAS LibrarieseCISCOR Direct Connects Data Marts Direct Connects Active ODSes Decom. ODSes Treasury CIR CIS x7 x1 x2 x1 x1 x5 x2 x5 x2 x1 x2 x6 x1 x1 x1 x1 x2 x1 x2 x1 x1 LEGEND ODSes VIS x2 CPMS x1 SRMT x1 NFTS x1x1 x1 x1 FACCON x1 ePULSE x1 BI Tools Data Marts Users 4 2016SNAPSHOT JANUARY Data Sources SMART Subject Areas 36 2 eCISCOR ODS FutureData MartDirect Connect 66 2,354 ETL28 Processes
  • 5. Implemented Databricks private cloud VPC (26 nodes) in the AWS Connected the Databricks cluster to the Oracle database Created all relevant DB tables in HIVE metadata pointing to Oracle database Copied relevant tables from Oracle database to S3 using Scala code Data is stored in Apache Parquet columnar format. For context, the 120 million row 83 column can be dumped to S3 in just 10 minutes. Identified appropriate partition scheme large tables were partitioned to optimize Spark query performance Created multiple notebooks Perform data analysis and visualize the results, e.g. created histogram of case life cycle duration Successful Proof of Concept 5
  • 6. Current Databricks Implementation 6 Statisticians Business Analysts Business Insight S3 Data Lake Data Scientists LakeHouse
  • 7. 7 • 75 Data Sources • Xx Data Interfaces • 7 Data Marts • 4 BI Tools • 6,086 Tableau Dashboards • 118 SMART Subject Areas • 56 SAS Libraries • 6,233 Users • 75 Data Sources • 35 Application Interfaces • 7 Data Marts • 4 BI Tools • 6,086 Tableau Dashboards • 118 SMART Subject Areas • 56 SAS Libraries • 6,233 Users
  • 8. Databricks Accomplishments IMPLEMENTATION OF DELTA LAKE EASY INTEGRATION WITH OBIEE ,SAS AND TABLEAU WITH NATIVE CONNECTORS INTEGRATION WITH GITHUB FOR CONTINUOUS INTEGRATION & DEPLOYMENT AUTOMATING ACCOUNT PROVISIONING MACHINE LEARNING (ML FLOW) INTEGRATION 8
  • 9. Change Data Capture using Delta Lake Databricks Delta –Success Factors v Faster Ingestion of CDC changes v Resiliency v Improved Data Quality , Reporting Availability and runtime performance v Schema evolution - adding additional columns without rewriting the entire table Databricks Delta-Lessons Learned v Storage requirements increased v Vacuum and Optimization is mandatory to improve the performance 9
  • 10. Unified Data Analytical Platform -Tableau 10
  • 11. Unified Data Analytical Platform –OBIEE/SAS 11
  • 12. Data Science Experiments using ML 12 ML Graphs processes after running the models Prediction Model Samples
  • 13. Text and Log Mining 0 0.5 1 NegativeSentiment Positive Sentiment Sentiment Analysis 13
  • 14. Time Series Models and H2O Integration Integrated H20 with Databricks and built a model predicting the count of ‘No show’ on N400 using the traditional Time series forecasting to predict inefficiencies in normal day-to-day planning and operations 14
  • 15. Enabling Security & Governance 15 Access Control (ACL) Credentials Passthrough Secrets Management v Control users access to data using the Databricks view-based access control model (Table and schema level ACLs) v Control users access to clusters that are not enabled for table access control v Enforced data object privileges at onboarding phase v Used Databricks secrets manager to store credentials and reference in notebooks
  • 16. Databricks Management API Usage 16 Cluster/Jobs management àCreate, delete, manage clusters and get execution status of daily scheduled jobs which helped automated monitoring. Library /Secret managementà Easy upload of any third-party libraries and manage encrypted scopes/credentials to connect to source and target endpoints. Integration and Deploymentsà API with Git and Jenkins for continuous integration and continuous deployment Enabled MLFlow Tracking API for our Data Science experiments API Integrated Databricks Management API with Jenkins and other scripting tools to automate all our administration and management tasks.
  • 17. Lessons learned through this Journey Training plan Cloud based experience Subject Matter Expertise Automation 17
  • 18. Success Strategy Success Criteria Benefit Performance ü Auto-scalability leveraging on-demand and spot instances ü Efficient processing of larger datasets comparable to RDMS systems ü Scalable read/write performance on S3 Support for a variety of statistical programming languages ü Data Science Platform ( R, Python, Scala and SQL) ü Supports MLIB : Machine Learning & Deep Learning Integration with existing tools ü Allows connections to industry standard technologies via ODBC/JDBC connection and inbuilt connectors. Easily integrate new data sources ü Supports seamless integration with data streaming technologies like Kafka/Kinesis using Spark Streaming. This supports both structured and unstructured ü Leverages S3 extensively Secure ü Supports integration with multiple Single-Sign-On platforms ü Supports native encryption-decryption features (AES-256 and KMS) ü Supports Access Control Layer (ACL) ü Implemented in USCIS Private cloud 18