5.
h = NaiveHash.new(("A".."J").to_a)
tracknodes = Array.new(100000)

100000.times do |i|
  tracknodes[i] = h.node(i)
end

h.add("K")

misses = 0
100000.times do |i|
  misses += 1 if tracknodes[i] != h.node(i)
end

puts "misses: #{(misses.to_f/100000) * 100}%"

misses: 90.922%
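The slide uses NaiveHash without showing it. A minimal sketch of what such a class might look like (the modulo-over-node-count scheme is my assumption; only its use appears on the slide):

require 'zlib'

# Hypothetical stand-in for NaiveHash: hash the key, then pick a node by
# simple modulo over however many nodes exist right now.
class NaiveHash
  def initialize(nodes = [])
    @nodes = nodes
  end

  def add(node)
    @nodes << node
  end

  def node(key)
    @nodes[Zlib.crc32(key.to_s) % @nodes.length]
  end
end

Under that scheme, adding an 11th node changes the modulus from 10 to 11, and a given hash keeps its old index only about 1 time in 11, which is why roughly 91% of keys land on a different node.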
6. [Diagram: a consistent hash ring spanning 0 to 2^160, divided into 32 partitions with a single partition highlighted; SHA1(Key) places a key on the ring, shown near 2^160/2; the partitions are owned by Node 0, Node 1, and Node 2.]
7. [Diagram: the same ring with 32 partitions, now owned by Node 0, Node 1, Node 2, and Node 3; SHA1(Key) still places the key near 2^160/2.]
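The two ring diagrams boil down to one computation. A rough sketch (the constant names and helper are mine, not from the slides) of how a key lands in one of 32 partitions on a 2^160 ring:

require 'digest/sha1'

SHA1BITS   = 160
PARTITIONS = 32

# A key's ring position is its SHA1 digest read as a 160-bit integer;
# its partition is just the top log2(PARTITIONS) bits of that position.
def partition_for(key)
  position = Digest::SHA1.hexdigest(key.to_s).hex
  position >> (SHA1BITS - Math.log2(PARTITIONS).to_i)
end

partition_for("dinner")  # => an integer between 0 and 31

Adding a node, as in the second diagram, only changes which node owns each partition, not where keys hash to.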
8.
require 'digest/sha1'

SHA1BITS = 160

class PartitionedConsistentHash
  def initialize(nodes=[], partitions=32)
    @partitions = partitions
    @nodes, @ring = nodes.clone.sort, {}
    @power = SHA1BITS - Math.log2(partitions).to_i
    # deal the partitions out to the nodes round-robin
    @partitions.times do |i|
      @ring[range(i)] = @nodes[0]
      @nodes << @nodes.shift
    end
    @nodes.sort!
  end

  def range(partition)
    (partition*(2**@power)..(partition+1)*(2**@power)-1)
  end

  def hash(key)
    Digest::SHA1.hexdigest(key.to_s).hex
  end

  def add(node)
    @nodes << node
    # hand every Nth partition (N = new node count) over to the new node
    (0...@partitions).step(@nodes.length) do |i|
      @ring[range(i)] = node
    end
  end

  def node(keystr)
    return nil if @ring.empty?
    key = hash(keystr)
    @ring.each do |range, node|
      return node if range.cover?(key)
    end
  end
end

h = PartitionedConsistentHash.new(("A".."J").to_a)
nodes = Array.new(100000)
100000.times do |i|
  nodes[i] = h.node(i)
end

puts "add K"
h.add("K")

misses = 0
100000.times do |i|
  misses += 1 if nodes[i] != h.node(i)
end

puts "misses: #{(misses.to_f/100000) * 100}%\n"

misses: 9.473%
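The drop from ~91% to ~9% misses follows from the partition arithmetic: add("K") makes 11 nodes, so the step(@nodes.length) loop hands over every 11th partition, i.e. partitions 0, 11, and 22. That is 3 of the 32 partitions, or 9.375% of the key space, which is essentially the 9.473% measured above.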
9.
class Node
  def initialize(name, nodes=[], partitions=32)
    @name = name
    @data = {}
    @ring = ConsistentHash.new(nodes, partitions)
  end

  def put(key, value)
    if @name == @ring.node(key)
      puts "put #{key} #{value}"
      @data[@ring.hash(key)] = value
    end
  end

  def get(key)
    if @name == @ring.node(key)
      puts "get #{key}"
      @data[@ring.hash(key)]
    end
  end
end
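A small usage sketch (the node names and key are made up here, not from the slides): every Node holds the same ring, so only the owner of a key actually stores or returns it.

nodes   = ["A", "B", "C"]
cluster = nodes.map { |name| Node.new(name, nodes) }

# Only the node the ring maps "dinner" to stores the value;
# the other nodes silently ignore the call.
cluster.each { |n| n.put("dinner", "pizza") }
cluster.map  { |n| n.get("dinner") }   # => two nils and one "pizza"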
25.
def replicate(message, n)
  list = @ring.pref_list(n)
  results = []
  while replicate_node = list.shift
    results << remote_call(replicate_node, message)
  end
  results
end
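Neither pref_list nor remote_call is shown on this slide. One hedged way a preference list could be derived from the ring class above (the two-argument signature and body are my assumptions, not the talk's): walk the ring clockwise from the partition that covers the key and collect the first n distinct owners.

class PartitionedConsistentHash
  def pref_list(n, keystr)
    key    = hash(keystr)
    ranges = @ring.keys.sort_by(&:first)
    start  = ranges.index { |r| r.cover?(key) } || 0
    # rotate so we start at the key's partition, then take n distinct nodes
    ranges.rotate(start).map { |r| @ring[r] }.uniq.first(n)
  end
end

remote_call would then send the message to each of those nodes and collect the replies.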
29. WHAT TO EAT FOR DINNER?
• Adam wants Pizza
{value:"pizza", vclock:{adam:1}}
• Barb wants Tacos
{value:"tacos", vclock:{barb:1}}
• Adam gets the value; the system can’t resolve the conflict, so he gets both
[{value:"pizza", vclock:{adam:1}},
{value:"tacos", vclock:{barb:1}}]
• Adam resolves the value however he wants
{value:"taco pizza", vclock:{adam:2, barb:1}}
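A minimal sketch of that resolution step in Ruby (the merge helper is mine; only the values and clocks appear on the slide): the resolved value's vector clock descends from both siblings, so it dominates each of them.

siblings = [
  { value: "pizza", vclock: { adam: 1 } },
  { value: "tacos", vclock: { barb: 1 } }
]

# merge the sibling clocks pointwise (max per actor), then bump the resolver's entry
merged = siblings.map { |s| s[:vclock] }
                 .reduce { |a, b| a.merge(b) { |_actor, x, y| [x, y].max } }
merged[:adam] = merged.fetch(:adam, 0) + 1

resolved = { value: "taco pizza", vclock: merged }
# => {:value=>"taco pizza", :vclock=>{:adam=>2, :barb=>1}}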
30.
# artificially create a conflict with vclocks
req.send('put 1 foo {"B":1} hello1') && req.recv
req.send('put 1 foo {"C":1} hello2') && req.recv
puts req.send("get 2 foo") && req.recv
sleep 5

# resolve the conflict by descending from one of the vclocks
req.send('put 2 foo {"B":3} hello1') && req.recv
puts req.send("get 2 foo") && req.recv
45. WHAT WE’VE DONE SO FAR
http://git.io/MYrjpQ
• Distributed, Replicated, Self-healing, Conflict-resolving, Eventually Consistent Key/Value Datastore... with MapReduce
46. Key/Value
Preference List
Vector Clocks
Distributed Hash Ring
Request/Response
Merkle Tree
Node Gossip
CRDT (coming)
Read Repair
47. basho
@coderoshi
http://pragprog.com/book/rwdata http://github.com/coderoshi/little_riak_book
48. I am the very model of a distributed database,
I've information in my nodes residing out in cyber space,
While other datastorers keep consistent values just in case,
It's tolerant partitions and high uptime, that I embrace.
49. I use a SHA-1 algorithm, one-sixty bits of hashing space
consistent hash ensures k/v's are never ever out of place,
I replicate my data across more than one partition,
In case a running node or two encounters decommission.
50. My read repairs consistently, at least it does eventually,
CAP demands you've only 2 to choose from preferentially,
In short, you get consistency or high availability,
To claim you can distribute and do both is pure futility.
51. A system such as I is in a steady tiff with Entropy,
since practically a quorum is consistent only sloppily,
My favorite solutions are both read repair and AAE,
My active anti-entropy trades deltas via Merkle Tree.
52. After write, the clients choose their conflict resolution,
With convergent replicated data types as a solution:
I only take a change in state, and not results imperforate,
And merge results to forge a final value that is proximate.
53. How to know a sequence of events w/o employing locks,
Well naturally, I use Lamport logic ordered vector-clocks.
I have both sibling values when my v-clocks face a conflict,
and keep successive values 'til a single one is picked.
54. If values are your query goal then write some mapreduces.
Invert-indexing values may reduce query obtuseness.
Distributing a graph or a relation-style structure,
though somewhat possible provides a weaker juncture.
55. For data of the meta kind, my ring state is a toss up,
To keep in sync my nodes will chat through protocol gossip.
I am a mesh type network, not tree/star topological, their
single points of failure make such choices most illogical.
56. My network ring is just the thing, ensuring writes are quick,
But know that CAP demands my choice give only 2 to pick.
In short, I get consistency or high availability,
To claim I could distribute and do both is just futility.