Factual presentation for PG West 2010

Factual

Eric Lui
Software Engineer, Data Storage
eric@factual.com

What is Factual.com?
Factual is a platform for sharing,
mashing, and publishing open data.

Crowd-Sourced Data
… is terrific!
• Verifiable
• Vote-driven
• Customizable

Data Storage
Goal:
• 10M tables
• 1B rows (summarized)
• 10B inputs (or "votes")

Raw storage
• 1TB per input server
• 100MB+ per dataset

What does all this "scale" mean?
Map-Reduce is the right architecture for us:
• High volume storage
• Scales (with the right design)
• Shards and partitions in-place
• Minimal downtime
• Throwaway intermediary stages

What does all this "scale" mean?
• Hard to profile
• Hard to predict what table will get "hot"
• Performance tuning has to be general, unless we're on a Service Level Agreement and can devote DBA resources (not our core strength)
• Map-Reduce is not real time

Data Storage
Challenges

• Summarization operations are memory-intensive
• N-way merging is expensive (i.e., slow)
• Streaming is necessary to serve back full summaries
• Common use case is just the first N rows
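
The streaming and first-N points above can be pictured with a minimal Python sketch. It assumes each input server can yield its rows already sorted by a row key; the shard contents and the `first_n_rows` helper are hypothetical and are not taken from Factual's code.

```python
import heapq
from itertools import islice

def merged_rows(sorted_shards):
    """Lazily merge any number of already-sorted row streams.

    heapq.merge keeps only one pending row per shard in memory,
    so the full merged result is never materialized.
    """
    return heapq.merge(*sorted_shards, key=lambda row: row[0])

def first_n_rows(sorted_shards, n):
    """Serve the common case: stop after the first n merged rows."""
    return list(islice(merged_rows(sorted_shards), n))

# Hypothetical shards: (row_key, value) pairs, each sorted by row_key.
shard_a = iter([(1, "apple"), (4, "date"), (7, "grape")])
shard_b = iter([(2, "banana"), (5, "elderberry")])
shard_c = iter([(3, "cherry"), (6, "fig")])

top = first_n_rows([shard_a, shard_b, shard_c], 3)
```

Because the merge is a generator, a request for the first few rows reads only a few rows from each shard, while a full summary can still be streamed by iterating `merged_rows` to the end.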

Emerging Patterns
• Many Reads
• (Relatively) Few New rows
• (Very) Few row Updates
• Infrequent (< 1 per day) table-wide re-summarizations

High Availability
Votestore
• 3x Redundancy

High Availability
Problem: Summarization is slow.

High Availability
Solution: Build a caching layer.

Cache
• 3x Replication
• "Dumb" load balancing
• Server Affinity (via Zookeeper)
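
A rough sketch of how server affinity and "dumb" load balancing can fit together. The deck coordinates affinity through Zookeeper; this example swaps in a static server list and rendezvous hashing purely for illustration, and every name in it (`replicas_for`, `pick_server`, the `cache-N` hosts) is an assumption.

```python
import hashlib

def _score(table_id, server):
    """Stable pseudo-random score for a (table, server) pair."""
    digest = hashlib.sha256(f"{table_id}|{server}".encode("utf-8")).hexdigest()
    return int(digest, 16)

def replicas_for(table_id, servers, copies=3):
    """Pick the `copies` servers with the highest score for this table.

    The same table always maps to the same servers while the server
    list is unchanged (rendezvous hashing).
    """
    ranked = sorted(servers, key=lambda s: _score(table_id, s), reverse=True)
    return ranked[:copies]

def pick_server(table_id, servers, request_number, copies=3):
    """'Dumb' load balancing: rotate requests across the table's replicas."""
    replicas = replicas_for(table_id, servers, copies)
    return replicas[request_number % len(replicas)]

# Hypothetical cache hosts.
servers = ["cache-1", "cache-2", "cache-3", "cache-4", "cache-5"]
```

With affinity, a table's cached copy lives on a small fixed set of servers, so the balancer in front of them needs no knowledge of load or content.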

Metaphor Shear
Why PostgreSQL?
Pros
• End-user expectations map to RDBMS world
• Indexing on common operations
o (ORDER BY, WHERE)
• Full-text search
• Latitude/longitude/geo functions with PostGIS
• Aggregation on summarized results
• Built-in persistence

Metaphor Shear
Why PostgreSQL?
Cons
• No built-in "versioning"
• Re-summarization, though infrequent, is expensive
• Need to map lisp-based query language to SQL
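
To make the last point concrete, here is a minimal sketch of translating a prefix-notation filter into a SQL WHERE clause. The deck does not show Factual's query language, so the nested-tuple form, the operator set, and the `to_where` helper are all hypothetical; the `%s` placeholders assume a DB-API driver that uses that parameter style.

```python
import re

COMPARISONS = {"=", "<", ">", "<=", ">=", "!="}
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def to_where(expr):
    """Translate a nested-tuple filter into (sql_fragment, params)."""
    op, *args = expr
    if op in ("and", "or"):
        parts, params = [], []
        for sub in args:
            sql, sub_params = to_where(sub)
            parts.append(f"({sql})")
            params.extend(sub_params)
        return f" {op.upper()} ".join(parts), params
    if op in COMPARISONS:
        column, value = args
        if not IDENTIFIER.match(column):
            raise ValueError(f"bad column name: {column!r}")
        # Values travel as bound parameters, never inside the SQL text.
        return f'"{column}" {op} %s', [value]
    raise ValueError(f"unsupported operator: {op!r}")

# Hypothetical parsed form of (and (= city "Los Angeles") (> rating 3)).
where_sql, where_params = to_where(
    ("and", ("=", "city", "Los Angeles"), (">", "rating", 3))
)
```

The recursion mirrors the nesting of the source expression, which is why a prefix-notation language maps onto SQL boolean clauses fairly directly; the awkward parts tend to be the operators that have no SQL equivalent.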

High Availability
Why PostgreSQL?
Other considerations
• Must proactively store attributes
• Schema changes are expensive
• Handling "upsert" operations is awkward
• Deletes are difficult (but infrequent)
• (related) No concept of row merge
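
The "upsert" awkwardness can be illustrated with the usual update-then-insert pattern, which was needed because PostgreSQL releases of that period had no single-statement upsert. This is a sketch only: it uses Python's built-in `sqlite3` as a stand-in database so it runs anywhere, and the `summary` table and `upsert_row` helper are hypothetical.

```python
import sqlite3

def upsert_row(conn, row_id, value):
    """Update the row if it exists; otherwise insert it."""
    cur = conn.execute(
        "UPDATE summary SET value = ? WHERE row_id = ?", (value, row_id)
    )
    if cur.rowcount == 0:
        try:
            conn.execute(
                "INSERT INTO summary (row_id, value) VALUES (?, ?)",
                (row_id, value),
            )
        except sqlite3.IntegrityError:
            # Another writer inserted the row first; fall back to update.
            conn.execute(
                "UPDATE summary SET value = ? WHERE row_id = ?", (value, row_id)
            )
    conn.commit()

# Stand-in database and table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (row_id INTEGER PRIMARY KEY, value TEXT)")
upsert_row(conn, 1, "first")   # takes the insert path
upsert_row(conn, 1, "second")  # takes the update path
```

Two statements plus a race-handling branch for what is conceptually one operation is the awkwardness the slide refers to.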

Cache Consistency
ACID? Not really...
High-concurrency
favored over
database-style transactions

Eventually Consistent

Consistency Challenges
Cache Invalidation
• How do I handle new inputs?

o Shield the Input Store
▪ Low-priority - shield the input store
▪ Row-level invalidations
o Lazy
▪ Fetch updated rows on summary request
▪ Leverage postgres to track invalidations
o Decouple From Input API call
▪ Async notification
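
The lazy, row-level approach above can be sketched as follows. In the deck, invalidations are tracked in postgres; here an in-memory set stands in for that bookkeeping, and the `LazyRowCache` class and `slow_summarize` function are hypothetical names used only for illustration.

```python
class LazyRowCache:
    """Cache of summarized rows with lazy, row-level invalidation."""

    def __init__(self, summarize_row):
        self._summarize_row = summarize_row  # slow call into the input store
        self._rows = {}      # row_id -> cached summary
        self._stale = set()  # row_ids invalidated since their last fetch

    def invalidate(self, row_id):
        """Called by the async notifier; cheap, never touches the input store."""
        self._stale.add(row_id)

    def get(self, row_id):
        """Re-summarize only if the row is missing or was invalidated."""
        if row_id in self._stale or row_id not in self._rows:
            self._rows[row_id] = self._summarize_row(row_id)
            self._stale.discard(row_id)
        return self._rows[row_id]

calls = []

def slow_summarize(row_id):
    """Hypothetical stand-in for a summarization request to the input store."""
    calls.append(row_id)
    return f"summary of row {row_id} (v{len(calls)})"

cache = LazyRowCache(slow_summarize)
first = cache.get(7)    # miss: summarized once
cache.get(7)            # served from cache
cache.invalidate(7)     # new input arrived; nothing is recomputed yet
second = cache.get(7)   # recomputed now, on request
```

Marking a row stale costs almost nothing, so a burst of new inputs against one row results in a single re-summarization the next time someone reads it, which is what shields the input store.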

Cache Instance Management
• How do we handle query changes?
o filtering out spam inputs
o change the aggregation function
o give more weight to table owner's votes
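
Each of these query changes alters how votes turn into a summarized value, which is why the cached table has to be rebuilt. A minimal sketch, assuming a simple weighted plurality vote; the `summarize_cell` function, its parameters, and the sample votes are hypothetical and not Factual's actual aggregation.

```python
from collections import defaultdict

def summarize_cell(votes, owner, spam_users=frozenset(), owner_weight=3.0):
    """Pick the winning value for one cell from a list of votes.

    votes: iterable of (user, value) pairs.
    Spam users are skipped; the table owner's votes count owner_weight times.
    """
    totals = defaultdict(float)
    for user, value in votes:
        if user in spam_users:
            continue
        totals[value] += owner_weight if user == owner else 1.0
    if not totals:
        return None
    # Compare on (weight, value) so ties break deterministically.
    return max(totals.items(), key=lambda item: (item[1], item[0]))[0]

# Hypothetical inputs for a single zip-code cell.
votes = [
    ("ann", "90210"),
    ("bob", "90211"),
    ("cat", "90211"),
    ("owner", "90210"),
    ("spammer", "00000"),
]
weighted = summarize_cell(votes, owner="owner", spam_users={"spammer"})
unweighted = summarize_cell(votes, owner="owner", owner_weight=1.0)
```

Changing `owner_weight` or the spam list can flip the winner for many cells at once, so these changes are table-wide rather than row-level invalidations.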

• Simple Re-cache
o Dump the current cached copy, and re-cache.
o Slow
o Poor user experience

• Better solution: Double Buffering
o Reload new version in background
o Continue to serve current table
▪ "closest match" warning
o Allow switch-back
▪ Continue to accept invalidations against old table
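
The double-buffering steps above can be sketched as a small state machine. This is an illustration under assumptions: the `DoubleBufferedTable` class, its method names, and the version labels are hypothetical, and a real reload would run in a background worker rather than inline.

```python
class DoubleBufferedTable:
    """Serve one cached version while another loads in the background."""

    def __init__(self, version, rows):
        self.active = {"version": version, "rows": dict(rows), "stale": set()}
        self.standby = None  # in-progress version, or the previous one

    def begin_reload(self, version):
        """Start filling a new buffer; reads keep hitting `active`."""
        self.standby = {"version": version, "rows": {}, "stale": set()}

    def load_row(self, row_id, summary):
        """Add one freshly summarized row to the buffer being built."""
        self.standby["rows"][row_id] = summary

    def swap(self):
        """Promote the standby buffer; calling it again switches back."""
        self.active, self.standby = self.standby, self.active

    def invalidate(self, row_id):
        """Record the invalidation against both buffers so switch-back stays valid."""
        for buf in (self.active, self.standby):
            if buf is not None:
                buf["stale"].add(row_id)

    def read(self, row_id, wanted_version):
        """Return (summary, is_closest_match)."""
        closest_match = self.active["version"] != wanted_version
        return self.active["rows"].get(row_id), closest_match

table = DoubleBufferedTable("v1", {1: "old summary"})
table.begin_reload("v2")
during = table.read(1, "v2")   # still served from v1, flagged as closest match
table.load_row(1, "new summary")
table.swap()
after = table.read(1, "v2")
table.invalidate(1)            # recorded against both versions
table.swap()                   # switch back to v1
back = table.read(1, "v2")
```

Keeping the old buffer alive and up to date costs extra storage, but it means both the cut-over and the switch-back are a pointer swap rather than a slow re-cache.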

Performance
Encoding-compliant tablespaces
• Support UTF-8, non-Latin sort orders
Select Tables get SSD-based PostgreSQL caching
• See Jignesh Shah's terrific slides from PgEast 2009
• http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
• 20x improvement in random reads (IO pattern for unclustered index reads)
• 2x improvement on sequential writes (generally pretty smooth)

What's next?
Encoding-compliant tablespaces
• Support UTF-8, non-Latin sort orders
Select Tables get SSD-based PostgreSQL caching
• See Jignesh Shah's terrific slides from PgEast 2009
• http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
• 20x improvement in random reads (IO pattern for unclustered index reads)
• 2x improvement on sequential writes (generally pretty smooth)

How can I use Factual?
Web UI
• Dataset Creation
• Workbench
http://www.factual.com/
APIs
• Server API
http://wiki.developer.factual.com/FrontPage
• Visualizations
http://wiki.developer.factual.com/Factual-Visualization-Documentation

eric@factual.com
Twitter: @factualinc
http://blog.factual.com
