The Census Bureau is the U.S. government's largest statistical agency, with a mission to provide current facts and figures about America's people, places and economy. The Bureau operates a large number of surveys to collect this data, the best known being the decennial population census. Data is being collected in increasing volumes, and analytics solutions must scale to meet growing needs while maintaining the confidentiality of the data. In the past, data analytics occurred in processing silos that inhibited the sharing of information, and common reference data was replicated across multiple systems. The use of the Hortonworks Data Platform, Hortonworks Data Flow and other open-source technologies is enabling the creation of a cloud-based enterprise data lake and analytics platform. Cloud object stores provide scalable data storage, and cloud compute supports permanent and transient clusters. Data governance tools track data lineage and provide access controls for sensitive data.
2. Census Background
• The United States' leading provider of quality data about its people and economy
• Decennial, Economic, Demographic, and a multitude of other surveys
• Serves other federal agencies
• Processes large volumes of data
• Preparing for future analytic needs of the enterprise
3. Enterprise Data Lake
Make changes to culture, processes, and technologies to practice and accelerate the efforts required to remain a leader in data and technology innovation.
• Optimize Survey Operations
• Reduce Respondent Burden
• Improve Data Products
• Consolidate Data and Code
• Manage Large Datasets
• Centralize Security
4. EDL Guiding Principles
• Scalability
• Availability
• Automation
• Security & Privacy
• Data Diversity
• Data Stewardship
• Identifiable, Locatable and Linkable Data
• Reproducibility
• Governance
5. Cloud First
• Establish the EDL in GovCloud
• On-demand Server Instances
• Cloud Object Stores
• Leverage Serverless Computing
6. Data Availability
• Short-term and shared data available through cloud object stores
• Long-term data available through archival stores
• Built-in resiliency of storage to prevent data loss
• EDL applications deployed as highly available
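The split between short-term and long-term storage on this slide can be pictured as a simple tiering rule. The following is a minimal sketch under stated assumptions, not the Bureau's implementation: the tier names, the `choose_tier` function, and the 90-day threshold are all hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

# Hypothetical threshold: data untouched for longer than this is treated as long-term.
ARCHIVE_AFTER = timedelta(days=90)


def choose_tier(last_accessed: date, shared: bool, today: date) -> str:
    """Return a storage tier name for a dataset (illustrative rule only)."""
    # Shared data stays in the object store so other projects can reach it.
    if shared:
        return "object-store"
    # Recently used data also stays in the object store.
    if today - last_accessed <= ARCHIVE_AFTER:
        return "object-store"
    # Everything else is assigned to the archival store.
    return "archive"
```

In a real deployment a rule like this would more likely be expressed as a storage lifecycle policy than as application code; the function form is used here only to make the decision explicit.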
8. Census Business Process and EDL
Mapping the Survey Lifecycle to the Data Lifecycle and identifying the data flow allows the EDL to build incrementally upon the key areas highlighted in green. The EDL will focus on the Enterprise Data Lifecycle stages (Process, Derive, etc.) and leverage technology advances (e.g., data mashups, machine learning, distributed computing).
[Figure: Survey Lifecycle mapped to the Data Lifecycle]
• Consolidation areas: Data Collection Systems (CEDCaP); Data Management / Store Systems (Enterprise Data Lake, EDL); Data Dissemination Systems (CEDSCI)
• Survey Lifecycle activities: Survey Design; Frame Development; Sample Design; Instrument Development; Response Data Collection; Data Editing & Imputation; Estimation, Data Review, & Analysis; Disclosure Avoidance; Data Product Dissemination; Research/Analytics
• Data Lifecycle stages: Define, Collect, Capture (3rd Party Data Capture), Process, Derive, Publish, Research, Disseminate
• Legend: areas currently in the scope of CEDCaP, CEDSCI or other programs; areas currently in the scope of EDL; EDL supported areas (not in scope)
9. Enterprise Data Lake Features
• Data Control
• Data Lineage
• Authorization Model
• Storage Management
• Data Sharing
• Dynamic Platform Provisioning
• Cloud-based
• Cost Control
10. Data Control
• Data Registration
  • Datasets onboarded with mandatory metadata
  • Registered in the existing Data Management System
  • Project access controls generated
• Code Repositories
• Data Lineage: Atlas
• Authorization: Ranger
  • Controls for projects
  • Column protection
  • Row filtering
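The data control ideas on this slide (mandatory metadata at registration, column protection, row filtering) can be illustrated in plain Python. This is a conceptual sketch only: it does not call Apache Ranger or Atlas, and the field names in `REQUIRED_METADATA`, the functions `register_dataset` and `apply_policy`, and the sample records are all hypothetical.

```python
# Hypothetical set of mandatory metadata fields for dataset registration.
REQUIRED_METADATA = {"name", "owner", "project", "sensitivity"}


def register_dataset(metadata: dict) -> dict:
    """Accept a dataset only if all mandatory metadata fields are present."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"missing mandatory metadata: {sorted(missing)}")
    return dict(metadata)


def apply_policy(rows, masked_columns, row_filter):
    """Return the rows a project may see, with protected columns masked."""
    visible = []
    for row in rows:
        # Row filtering: skip records the policy does not allow.
        if not row_filter(row):
            continue
        # Column protection: replace protected values with a mask.
        visible.append(
            {col: ("***" if col in masked_columns else val) for col, val in row.items()}
        )
    return visible


# Made-up sample records, for illustration only.
rows = [
    {"state": "MD", "income": 52000},
    {"state": "VA", "income": 61000},
]
visible = apply_policy(
    rows,
    masked_columns={"income"},
    row_filter=lambda r: r["state"] == "MD",
)
print(visible)
```

In a Ranger-managed platform, masking and filtering are declared as policies and enforced by the query engine rather than written into application code; the sketch only shows the effect such policies are meant to have on what a project sees.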