1. Digital Worlds (applications)
q VEC (Enterprise Scale)
• 1,300 source databases
• 10+ million views (via data integration)
q US Healthcare (National Scale)
• Scale
o Health care and social assistance offices: 784,626, including
• Doctors' offices: 220,131
• Dentists: 127,057
• Hospitals: 6,505
• Clinics: ~5,000, each roughly an SME with, say, 100 databases
o Patients: 100-300+ million
o Databases: ~32 million
• Scope
o Comprehensive medical events, methods, analysis, …
• E.g., Alice (62) in Emergency Room with liver failure
o Insurance, payments, …
o New metric: healthcare quality
• Examples
o SHRINE (2009): 3 hospitals; uses 2,381,883 distinct concepts (ontologies)
o HHS CIO (Todd Park): Open Health Data Initiative
o US (PCAST, White House) vision
2. Observations
q Data Sources
• Massive
o Number
o Heterogeneity
o Distribution (data at source)
o Constant change – data, model, ontology, business rules, …
• Constrained
o Governance: privacy, confidentiality, legal, …
o Quality, correctness, precision, …
o Competition
q Critical Requirement: meaningful
• Human lives
• Health of individuals, communities, nation
• Economic impact: $ trillions / year
• Political: meaningless debates
3. Trends
q Digital Universe
q Holistic Views
• Information Ecosystems: data
• Ecosystems: Processes over services
q Big Data: massive
o Number
o Distribution
o Heterogeneity
• Semantics
• Structure: relational databases, X databases, web, deep web
• Technology: databases, data warehouses, files, …
q New Models: problem solving, data, …
• Data-driven
• Social computing: data as social artifacts
• Science: Wolfram Alpha
• Pragmatics: Driven by healthcare quality improvement
4. Databases and AI: The Twain Just Met
q Database World
• Engineering (RDBMSs) @ scale
• Reasoning: relational model (first-order logic)
q AI World
• Reasoning: more powerful & expressive
• Engineering: in the small
q Digital Universe, e.g., Web
• Reasoning: beyond the RDM & AI?
• Engineering: way beyond RDBMS
q Information ecosystems
• Databases: join
• Web: link
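The contrast between "join" and "link" can be sketched in a few lines. This is only an illustration under stated assumptions: the `patient` and `visit` tables, the `WEB` dictionary standing in for linked resources, and both function names are hypothetical and do not come from the talk.

```python
import sqlite3

def join_in_database():
    """Databases: combine records with a declarative join on a shared key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE visit (patient_id INTEGER, reason TEXT);
        INSERT INTO patient VALUES (1, 'Alice');
        INSERT INTO visit VALUES (1, 'liver failure');
    """)
    rows = conn.execute(
        "SELECT p.name, v.reason "
        "FROM patient p JOIN visit v ON v.patient_id = p.id"
    ).fetchall()
    conn.close()
    return rows

# Hypothetical stand-in for web resources that point at each other by identifier.
WEB = {
    "person/alice": {"name": "Alice", "links": ["visit/1"]},
    "visit/1": {"reason": "liver failure", "links": []},
}

def follow_links(start, web=WEB):
    """Web: combine records by traversing links from a starting resource."""
    seen, stack, found = set(), [start], []
    while stack:
        key = stack.pop()
        if key in seen or key not in web:
            continue  # skip already-visited or dangling links
        seen.add(key)
        found.append(key)
        stack.extend(web[key]["links"])
    return found
```

The join needs a shared schema and key agreed in advance; the link traversal needs neither, but it also gives no guarantee about what it will find.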
Power Law of Data
The value of a data element is proportional to the number of its meaningful uses.
5. What Underlies the Digital Universe
Modelling           Execution
Data Models         DBMS Engines
Languages           Algorithms
Semantics           Semantics
Problem Solving     Computation
6. What Underlies the Data Universe
Relational Data Model    Data Independence    RDBMS
Semantics                                     Semantics
Problem Solving                               Computation
7. Relational Database Improvements
q Pre-Relational
• Hierarchical
• Network
q Relational
• Row store
• OLAP / Data Warehouse
q Post-Relational
• RDF store
• Column store
• Bare bones relational
• Stream / complex event processing
q Push Down
• Database / data warehouse appliances (20+ on the market)
• In-database analytics, … (10+ on the market)
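As a rough illustration of what "push down" means for in-database analytics, the sketch below computes the same per-clinic average twice: once by pulling every row into the application, once by letting the engine aggregate. SQLite stands in for an appliance or warehouse purely for convenience; the `claim` table, its values, and the function names are hypothetical.

```python
import sqlite3

def make_db():
    """Build a tiny in-memory table of made-up claims."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE claim (clinic TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO claim VALUES (?, ?)",
        [("A", 100.0), ("A", 300.0), ("B", 50.0)],
    )
    return conn

def average_client_side(conn):
    """Pull every row out of the database, then aggregate in application code."""
    totals = {}
    for clinic, amount in conn.execute("SELECT clinic, amount FROM claim"):
        running_sum, count = totals.get(clinic, (0.0, 0))
        totals[clinic] = (running_sum + amount, count + 1)
    return {clinic: s / n for clinic, (s, n) in totals.items()}

def average_pushed_down(conn):
    """Push the aggregation into the engine; one row per clinic comes back."""
    query = "SELECT clinic, AVG(amount) FROM claim GROUP BY clinic"
    return dict(conn.execute(query))
```

Both functions return the same result; the difference is how much data crosses the boundary between the engine and the application, which is what matters at scale.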
8. Data Models For New Domains Must Honor Data Independence
q Array (Matrix)-store (SciDB) [Linear algebra]
q XML databases: structured content, information exchange
q Content management: e.g., SharePoint
q Graph/network store: social networking (Facebook), link analysis
q Protein store: protein folding, drug discovery, …
q Geospatial / map store: location-based applications
q Time series: signal processing, statistical and financial analysis
q Cloud / Mesh data (NoSQL) stores: web scale applications
q and they just keep coming …
9. Data Universe
[Figure labels: Database Universe; Relational Data Universe]
10. Data Universe
[Figure labels: DBU; RDM; Graph-Network Data Model; Time Series Data Model; Scientific Data Model; Geo-Spatial Data Model; Document Data Model; Digital Media Data Model; etc.]
12. Data Integration Solution Space: Data Independence Required
                                 Computation                 Problem Solving
Databases
  Relational                     Optimal for homogeneous     Optimal for pure
                                 relational data             relational data
  Domain-specific                Emerging                    Emerging
Semantic Technologies (AI)
  Knowledge Representation       Minimal                     Powerful
  Ontologies                     Minimal                     Powerful
  Semantic Web                   Modest / emerging           Modest / emerging
  Semantic Data Management       Emerging                    Emerging
Architectural
  Information-As-A-Service       Emerging                    Emerging
  Cloud                          Emerging                    N/A
13. Databases vs. Semantic Web
Databases                              Semantic Web
Discrete Worlds                        Heterogeneous Worlds
Single Versions of Truth               Multiple Truths
Data Models                            LOD Models?
Mathematical Logic                     What Logic?
1,000s of databases
Probabilistic / Eventual Reasoning     Common Sense Reasoning?
DI: Relational Join                    DI: Evidence Gathering
14. Databases vs. Web
[Figure labels: Web; Exploration; Multiple versions of truth; Analysis / BI; Evidence Gathering; Data Warehouses; Scale; Semantically Heterogeneous Views; Single versions of truth; Data Management; Semantically Homogeneous Databases]
15. Data Integration
q Query: define the result
• Entity
• Computation
q Find candidate data sets: search (hard)
q Extract, Transform, and Load (ETL): engineering
q Data Integration
• Entity resolution (harder)
• Integration computation
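The steps on this slide can be sketched end to end on toy data. This is a minimal sketch, not the method from the talk: the two sources, their field names, the name-normalization rule, and the 0.9 similarity threshold are all assumptions made for illustration, and real entity resolution is far harder than a string comparison.

```python
from difflib import SequenceMatcher

# Two hypothetical sources describing the same patients in different shapes.
SOURCE_A = [{"name": "Alice Smith", "dob": "1950-03-02", "charge": 1200.0}]
SOURCE_B = [
    {"patient": "SMITH, ALICE", "birth": "1950-03-02", "charge": 300.0},
    {"patient": "JONES, BOB", "birth": "1971-07-15", "charge": 80.0},
]

def normalize_name(raw):
    """Transform: 'SMITH, ALICE' and 'Alice Smith' both become 'alice smith'."""
    raw = raw.strip().lower()
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
        raw = f"{first} {last}"
    return " ".join(raw.split())

def etl(source_a, source_b):
    """Extract both sources and load them into one common record shape."""
    rows = [
        {"name": normalize_name(r["name"]), "dob": r["dob"], "charge": r["charge"]}
        for r in source_a
    ]
    rows += [
        {"name": normalize_name(r["patient"]), "dob": r["birth"], "charge": r["charge"]}
        for r in source_b
    ]
    return rows

def same_entity(a, b, threshold=0.9):
    """Entity resolution (assumed rule): same birth date and near-identical name."""
    if a["dob"] != b["dob"]:
        return False
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold

def integrate(rows):
    """Integration computation: total charge per resolved entity."""
    entities = []
    for row in rows:
        for entity in entities:
            if same_entity(entity, row):
                entity["total"] += row["charge"]
                break
        else:
            # No existing entity matched, so this row starts a new one.
            entities.append(
                {"name": row["name"], "dob": row["dob"], "total": row["charge"]}
            )
    return entities
```

Calling `integrate(etl(SOURCE_A, SOURCE_B))` merges the two Alice records and keeps Bob separate. Finding `SOURCE_A` and `SOURCE_B` in the first place, the search step, is not modelled here.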
16. Managing Data @ Scale I
q Introduction
• Michael L. Brodie
q Global Data Integration and Global Data Mining
• Chris Bizer
q DB vs RDF: structure vs correlation
• Peter Boncz