Factual
 
Eric Lui
Software Engineer, Data Storage
eric@factual.com 
 
What is Factual.com?
Factual is a platform for sharing, 
mashing, and publishing open data.
Crowd-Sourced Data
… is terrific!
• Verifiable
• Vote-driven
• Customizable
Demo
Data Storage
Goal:
• 10M tables 
• 1B rows (summarized)
• 10B inputs (or "votes")
 
Raw storage
• 1TB per input server
• 100MB+ per dataset
What does all this "scale" mean?
Map-Reduce is the right architecture for us:
•High volume storage
•Scales (with the right design)
•Shards and partitions in-place
•Minimal downtime
•Throwaway intermediary stages
What does all this "scale" mean?
•Hard to profile
•Hard to predict what table will get "hot"
•Performance tuning has to be general, unless we're on a 
Service Level Agreement and can devote DBA resources (not 
our core strength)
•Map-Reduce is not real time
Data Storage
Challenges
 
• Summarization operations are memory-intensive
• N-way merging is expensive (i.e., slow)
• Streaming is necessary to serve back full summaries
• Common use case is just the first N rows
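The last two bullets suggest merging lazily and stopping early. A minimal sketch, assuming each shard already yields rows in sorted order (the shard contents and the `first_n_rows` name are hypothetical, not from the talk):

```python
import heapq
from itertools import islice


def first_n_rows(sorted_shards, n, key=None):
    """Lazily N-way merge already-sorted shards and stop after n rows."""
    # heapq.merge pulls from each shard only as needed, so serving the
    # first n rows does not require materializing the full summary.
    merged = heapq.merge(*sorted_shards, key=key)
    return list(islice(merged, n))


# Hypothetical shards: each yields (sort_key, row) pairs in ascending order.
shard_a = iter([(1, "alpha"), (4, "delta"), (7, "eta")])
shard_b = iter([(2, "beta"), (5, "epsilon")])
shard_c = iter([(3, "gamma"), (6, "zeta")])

top3 = first_n_rows([shard_a, shard_b, shard_c], 3)
```

The merge itself is still N-way, but the cost paid per request is bounded by n rather than by table size.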
Emerging Patterns
• Many Reads
• (Relatively) Few New rows
• (Very) Few row Updates
• Infrequent (< 1 per day) table-wide re-summarizations
High Availability
Votestore
• 3x Redundancy
High Availability
Problem: Summarization is slow.
Solution: Build a caching layer.
Cache
• 3x Replication
• "Dumb" load balancing
• Server Affinity (via Zookeeper)
Metaphor Shear
Why PostgreSQL?
Pros
• End-user expectations map to RDBMS world
• Indexing on common operations
o (ORDER BY, WHERE)
• Full-text search
• Latitude/longitude/geo functions with PostGIS
• Aggregation on summarized results
• Built-in persistence
Metaphor Shear
Why PostgreSQL?
Cons
• No built-in "versioning"
• Re-summarization, though infrequent, is expensive
• Need to map lisp-based query language to SQL
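To illustrate the last point, here is a rough sketch of translating a prefix-notation filter into a parameterized SQL WHERE clause. The nested-tuple syntax, the operator set, and the column allow-list are all assumptions made for illustration; the talk does not show Factual's actual query language. The `%s` placeholders follow the style used by PostgreSQL drivers such as psycopg2.

```python
# Hypothetical operator table and column allow-list (not from the talk).
COMPARISONS = {"=": "=", "<": "<", ">": ">", "<=": "<=", ">=": ">="}
ALLOWED_COLUMNS = {"city", "votes", "name"}


def to_sql(expr):
    """Translate a prefix expression into (where_fragment, params)."""
    op, *args = expr
    if op in ("and", "or"):
        parts, params = [], []
        for sub in args:
            sql, sub_params = to_sql(sub)
            parts.append(f"({sql})")
            params.extend(sub_params)
        return f" {op.upper()} ".join(parts), params
    if op in COMPARISONS:
        column, value = args
        # Identifiers cannot be bound as parameters, so check them explicitly.
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column!r}")
        return f"{column} {COMPARISONS[op]} %s", [value]
    raise ValueError(f"unsupported operator: {op!r}")


# Roughly (and (= city "Palo Alto") (> votes 10)) in s-expression form.
where, params = to_sql(("and", ("=", "city", "Palo Alto"), (">", "votes", 10)))
```

Values travel as bound parameters and never get spliced into the SQL text; only allow-listed column names reach the query string.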
High Availability
Why PostgreSQL?
Other considerations
• Must pro-actively store attributes
• Schema changes are expensive
• Handling "upsert" operations is awkward
• Deletes are difficult (but infrequent)
• (related) No concept of row merge
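As an illustration of why upserts are awkward, below is a sketch of the common update-then-insert pattern. It uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; the `summary` table and its columns are hypothetical. PostgreSQL only gained a native `INSERT ... ON CONFLICT` in version 9.5, well after this talk.

```python
import sqlite3


def upsert_row(conn, row_id, value):
    """Emulate an upsert: try UPDATE first, INSERT if no row matched."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE summary SET value = ? WHERE row_id = ?", (value, row_id)
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO summary (row_id, value) VALUES (?, ?)",
                (row_id, value),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (row_id INTEGER PRIMARY KEY, value TEXT)")
upsert_row(conn, 1, "first")   # no existing row, so this inserts
upsert_row(conn, 1, "second")  # row exists, so this updates in place
rows = conn.execute("SELECT row_id, value FROM summary").fetchall()
```

With concurrent writers, two sessions can both see zero updated rows and both attempt the insert, so a real deployment would also need a retry on unique-key violation or explicit locking.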
Demo
Cache Consistency
ACID? Not really...
High-concurrency 
favored over
database-style transactions 
Eventually Consistent
Consistency Challenges
Cache Invalidation
• How do I handle new inputs?
o Shield the Input Store
▪ Low-priority - shield the input store
▪ Row-level invalidations
o Lazy
▪ Fetch updated rows on summary request
▪ Leverage postgres to track invalidations
o Decouple From Input API call
▪ Async notification
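The lazy, row-level approach above can be sketched as follows. This is only an illustration: the class and callback names are hypothetical, and an in-memory set stands in for the PostgreSQL-side invalidation tracking the slide mentions.

```python
class LazySummaryCache:
    """Row-level cache that re-summarizes a row only when it is next read."""

    def __init__(self, summarize_row):
        # summarize_row is a hypothetical (slow) call into the input store.
        self._summarize_row = summarize_row
        self._rows = {}
        self._stale = set()  # stand-in for invalidations tracked in the database

    def invalidate(self, row_id):
        """Record that a row has new inputs; does no summarization itself."""
        self._stale.add(row_id)

    def get(self, row_id):
        """Return the cached summary, refreshing it only if missing or stale."""
        if row_id in self._stale or row_id not in self._rows:
            self._rows[row_id] = self._summarize_row(row_id)
            self._stale.discard(row_id)
        return self._rows[row_id]


calls = []


def slow_summarize(row_id):
    calls.append(row_id)
    return f"summary of row {row_id}, computed {len(calls)} time(s)"


cache = LazySummaryCache(slow_summarize)
cache.get(7)         # miss: hits the input store
cache.get(7)         # hit: served from cache
cache.invalidate(7)  # new input arrives; input store is not touched
cache.get(7)         # stale: re-summarized on request
```

Because `invalidate` only records the row id, it is cheap enough to run from an asynchronous notification, decoupled from the input API call.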
Consistency Challenges
Cache Instance Management
• How do we handle query changes?
o filtering out spam inputs
o change the aggregation function
o give more weight to table owner's votes
Consistency Challenges
Cache Instance Management
• Simple Re-cache
o Dump the current cached copy, and re-cache.
o Slow
o Poor user experience
Consistency Challenges
Cache Instance Management
• Better solution: Double Buffering
o Reload new version in background
o Continue to serve current table
▪ "closest match" warning
o Allow switch-back
▪ Continue to accept invalidations against old table
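A minimal sketch of the double-buffering idea, assuming a cached table can be represented as a dict of rows plus a version number (both hypothetical). It is single-threaded for brevity; a real cache would guard the swap with a lock.

```python
class DoubleBufferedTable:
    """Serve one cached copy while keeping the other around for switch-back."""

    def __init__(self, rows, version):
        self._active = (version, rows)
        self._previous = None

    @property
    def version(self):
        return self._active[0]

    def read(self):
        """Readers always see the active buffer, even during a reload."""
        return self._active[1]

    def swap_in(self, rows, version):
        """Publish a replacement that was fully built in the background."""
        self._previous, self._active = self._active, (version, rows)

    def switch_back(self):
        """Return to the previous copy, e.g. if the new query misbehaves."""
        if self._previous is None:
            raise RuntimeError("no previous buffer to switch back to")
        self._active, self._previous = self._previous, self._active

    def invalidate(self, row_id):
        """Drop a row from both buffers so the old copy stays usable."""
        for buffer in (self._active, self._previous):
            if buffer is not None:
                buffer[1].pop(row_id, None)


table = DoubleBufferedTable({"r1": "old summary"}, version=1)
table.swap_in({"r1": "new summary"}, version=2)  # built elsewhere, then published
table.switch_back()                               # version 1 is served again
```

Applying invalidations to both buffers is what keeps switch-back safe: the old copy never serves rows known to be out of date.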
Performance
Encoding-compliant tablespaces
•Support UTF-8, non-Latin sort orders
Select Tables get SSD-based PostgreSQL caching
•See Jignesh Shah's terrific slides from PgEast 2009
•http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
•20x improvement in random reads (IO pattern for unclustered
index reads)
•2x improvement on sequential writes (generally pretty smooth)
What's next?
Encoding-compliant tablespaces
•Support UTF-8, non-Latin sort orders
Select Tables get SSD-based PostgreSQL caching
•See Jignesh Shah's terrific slides from PgEast 2009
•http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
•20x improvement in random reads (IO pattern for unclustered
index reads)
•2x improvement on sequential writes (generally pretty smooth)
How can I use Factual?
Web UI
• Dataset Creation
• Workbench
http://www.factual.com/
APIs
• Server API
http://wiki.developer.factual.com/FrontPage
• Visualizations
http://wiki.developer.factual.com/Factual-Visualization-Documentation
Questions
eric@factual.com
Twitter: @factualinc
http://blog.factual.com

Factual presentation for pg west 2010

Editor's Notes

  1. "built" on living data?
  2. Look for data; create a new table; sort + search it
  3. Alternate: roll our own solution. Persistence: application/server restarts
  4. Other attributes: number of inputs, level of consensus, etc.
  5. Add new column; add new row; give inputs; do a merge