MongoDB: an all-in-one NoSQL with KVS performance, RDBMS indexes, and MapReduce

Rakuten, Inc., Development Unit, Architect Group: Hiroaki Kubota | January 18, 2012

Introduction
Agenda
• Introduction
• How we use MongoDB on news.infoseek.co.jp

Introduction
Profile

Name：窪田博昭 Hiroaki Kubota
Company： Rakuten Inc.
Unit： ACT = Development Unit Architect Group
Mail: hiroaki.kubota@mail.rakuten.com

Hobby： Futsal , Golf
Recently: My stamina has been gradually declining...

twitter : crumbjp
github: crumbjp

How we take advantage of MongoDB
for Infoseek News

An example of our pages

Layout / Components


Albatross structure

[Diagram: how a request is served]
• Internet: requests arrive at the WEB servers
• SessionDB (MongoDB ReplSet)
• LayoutDB (MongoDB ReplSet): Get page layout, Get components
• WEB servers: Call APIs; Memcache
• API: Retrieve data
• ContentsDB (MongoDB ReplSet)

Albatross structure

[Diagram: how pages and data are deployed]
• Developer: HTML markup, API settings
• CMS: Set page layout & Deploy API, Set components
• LayoutDB (MongoDB ReplSet)
• Batch servers: Insert data
• API servers
• ContentsDB (MongoDB ReplSet)

CMS
Layout editor


MapReduce
Our usage
We have never used MapReduce in regular operation.
However, we have used it for some irregular cases:

• To search for invalid articles that had to be removed
because of someone's mistake...

• To count the number of new articles posted per day.

• To count how many times an article was updated.

• We will soon start considering regular use of it
for social data analysis...

Structure & Performance


Structure
We are using very low-spec machines (virtual machines)!!

• Intel(R) Xeon(R) CPU X5650 2.67GHz, 1 core!!
• 4 GB memory
• 50 GB disk space (iSCSI)
• CentOS 5.5 64-bit
• MongoDB 1.8.0
– ReplicaSet, 5 nodes (+ 1 arbiter)
– Oplog size 1.2 GB
– Average object size 1 KB

Structure
Researched environment

We also tested the following environments...
• Virtual machine, 1 core
– 1 KB documents, 6,000,000 documents
– 8 KB documents, 200,000 documents
• Virtual machine, 3 cores
– 1 KB documents, 6,000,000 documents
– 8 KB documents, 200,000 documents
• EC2 large instance
– 2 KB documents, 60,000,000 documents (100 GB)

Performance
I found a formula for a rough estimate of QPS

Conditions: 1 to 8 KB documents + 1 unique index
C = number of CPU cores (Xeon 2.67 GHz)
DD = throughput reported by the 'dd' command (bytes/sec)
S = document size (bytes)

• GET qps = 4500 × C
• SET (fsync) qps = 0.05 × DD ÷ S
• SET (nsync) qps = 4500, BUT...
the secondaries may become STALE
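As a worked example of the rule of thumb above, here is a minimal sketch in JavaScript. The function name `estimateQps` and the sample inputs (1 core, a `dd` result of 100 MB/s, 1 KB documents) are assumptions for illustration, not measurements from these slides; only the constants 4500 and 0.05 come from the formula. The sketch treats the fsync figure as documents per second, because DD ÷ S divides bytes/sec by bytes.

```javascript
// Rough QPS estimate from the rule of thumb on the slide.
// Only meant for the stated range: 1 to 8 KB documents with one unique index.
function estimateQps({ cores, ddBytesPerSec, docSizeBytes }) {
  const GET_QPS_PER_CORE = 4500;    // constant quoted on the slide
  const FSYNC_DISK_FRACTION = 0.05; // constant quoted on the slide
  return {
    get: GET_QPS_PER_CORE * cores,
    setFsync: (FSYNC_DISK_FRACTION * ddBytesPerSec) / docSizeBytes,
    setNoFsync: 4500,               // fast, but secondaries may go stale
  };
}

// Hypothetical inputs: 1 core, dd reports 100 MB/s, 1 KB documents.
const estimate = estimateQps({
  cores: 1,
  ddBytesPerSec: 100e6,
  docSizeBytes: 1024,
});
console.log(estimate);
```

With these assumed inputs the fsync write estimate lands close to the read estimate; a slower disk or larger documents pull it down proportionally.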

Performance example (on EC2 large)


Environment and amount of data

EC2 large instance
– 2 KB documents, 60,000,000 documents (100 GB)
– 1 unique index

Data type
{
  shop: 'someone',
  item: 'something',
  description: 'item explanation sentences...'
}

Batch insert (batches of 1000 documents), fsync=true
17906 sec (≈298 min) (=3358 docs/sec)

Ensure index (background=false)

4049 sec (=67 min)
1. primary 2101 sec (=35 min)
2. secondary 1948 sec (=32 min)

Add one node
5833 sec (=97 min)
1. Get files (2 GB × 48) 2120 sec (=35 min)
2. _id indexing 1406 sec (=23 min)
3. unique indexing 2251 sec (=38 min)
4. other processes 56 sec (=1 min)

Group by
• Narrow down via the unique index, then map & reduce
– 368 msec
db.data.group({
  key: { shop: 1 },                      // group by shop
  cond: { shop: 'someone' },             // only documents of this shop
  reduce: function (o, p) { p.sum++; },  // count documents in the group
  initial: { sum: 0 }
});

MapReduce
• Scan all data: 3116 sec (=52 min)
– number of keys = 39092
db.data.mapReduce(
  function () { emit(this.shop, 1); },   // map: emit 1 per document, keyed by shop
  function (k, v) {                      // reduce: sum the emitted values of one key
    var ret = 0;
    v.forEach(function (value) { ret += value; });
    return ret;
  },
  { query: {}, inline: 1, out: 'Tmp' }
);
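To make clear what the map and reduce functions above compute, the following is a small in-memory stand-in in plain JavaScript. It is not MongoDB's implementation: the helper `mapReduce`, the explicit `emit` argument (the mongo shell provides `emit` as a global instead), and the three sample documents are all hypothetical.

```javascript
// Minimal in-memory imitation of the mapReduce call: count documents per shop.
function mapReduce(docs, map, reduce) {
  const groups = new Map(); // key -> list of emitted values
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // As in the mongo shell, each document is bound to `this` inside map.
  for (const doc of docs) map.call(doc, emit);

  const result = {};
  for (const [key, values] of groups) result[key] = reduce(key, values);
  return result;
}

// Hypothetical sample data with the same shape as the documents in the slides.
const docs = [
  { shop: 'someone', item: 'something', description: '...' },
  { shop: 'someone', item: 'another thing', description: '...' },
  { shop: 'other', item: 'something', description: '...' },
];

const counts = mapReduce(
  docs,
  function (emit) { emit(this.shop, 1); },
  function (k, v) {
    var ret = 0;
    v.forEach(function (value) { ret += value; });
    return ret;
  }
);
console.log(counts);
```

Because the reduce step is a plain sum, it gives the same answer even if it is applied again to partial results, which is how a real MapReduce engine may call it.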

Major problems...


Index problem
Online indexing is completely useless, even in the latest version (2.0.2)
Indexing is a locking operation by default.
Indexing can run in the background
on the primary. But...
It CANNOT run in the background on the secondaries.
Moreover, all the secondaries run the indexing
at the same time!!
As a result...

All slaves freeze! orz...

Present indexing (default)

[Diagram sequence: a Primary with a Batch writer, three Secondaries serving five Clients]
1. The Batch saves to the Primary while the Clients read from the Secondaries.
2. ensureIndex runs on the Primary: the Primary is locked while indexing, and the Batch cannot write.
3. The Primary finishes. All three Secondaries SYNC, lock, and index at the same time: the Clients cannot read!!
4. Indexing completes on every node.


Present indexing (background)

[Diagram sequence]
1. The Batch saves to the Primary.
2. ensureIndex(background) runs on the Primary: the Primary slows down, and so does the Batch.
3. The Primary finishes. All three Secondaries SYNC and lock at the same time: the Clients cannot read!! Background indexing doesn't work on the secondaries.
4. Indexing completes on every node.




Probable 2.1.X indexing

Index problem
According to mongodb.org, this problem will be fixed in 2.1.0

But it has not been formally released.
So I checked out the latest source code.
It certainly looks like it will be fixed!
Moreover, it looks like indexing will run in the foreground
when the slave's status isn't SECONDARY
(that is, when it is RECOVERING)

[Diagram sequence: probable 2.1.X background indexing]
1. The Batch saves to the Primary.
2. Indexing runs on the Primary: the Batch slows down.
3. The Primary finishes. All three Secondaries SYNC and index in the background: they all slow down at the same time.
4. Indexing completes on every node.




Index problem
Background indexing in 2.1.X

But I think that's not enough.
I think it can be fatal for the system that
all the secondaries slow down at the same time!!

So...

Ideal indexing

[Diagram sequence]
1. The Batch saves to the Primary.
2. Indexing runs on the Primary: the Batch slows down.
3. The Primary finishes. ensureIndex then runs on the first Secondary only: it becomes Recovering while it indexes, and the other two stay Secondary.
4. The second Secondary and then the third take their turn in the same way.
5. Indexing completes on every node.




Index problem
But... I can easily guess that this is difficult to do with the current Oplog

It would be great if I could run the indexing manually
on each secondary

I suggest manual indexing

Manual indexing

[Diagram sequence]
1. The Batch saves to the Primary.
2. ensureIndex(manual, background) runs on the Primary: the Primary and the Batch slow down.
3. The Primary finishes. The secondaries don't sync the index automatically.
4. ensureIndex(manual) is run on the first Secondary and then on the second: each becomes Recovering while it indexes, and the others stay Secondary.
5. On the last Secondary, ensureIndex(manual, background) is run: it slows down. Manual indexing needs to support background operation, just in case the ReplSet has only one Secondary.
6. Indexing completes on every node.




That's all about the indexing problem

Struggling to control the sync

Unknown log messages & a ReplSet out of control
We often suffered from Secondaries going out of control...

• Secondaries repeatedly flipped status
between Secondary and Recovering (1.8.0)
• Then we found a strange line in the log...

[rsSync] replSet error RS102 too stale to catch up

What's stale?
stale [stéil] (level: essential for working adults), powered by goo.ne.jp

• (of food, drink, etc.) not fresh (⇔ fresh);
• flat; (of coffee) having lost its aroma,
• (of bread) dried out, hardened,
• (of air, smells, etc.) stuffy,
• foul-smelling

Apparently it is something very bad...

Mechanism of becoming stale

ReplicaSet

[Diagram: a Client and two mongod processes, Primary and Secondary, each with a Database and an Oplog]

Replication (simple case)

[Diagram sequence]
1. The Client inserts A into the Primary. The Primary stores A in its Database and records "Insert A" in its Oplog.
2. The Secondary syncs: it copies "Insert A" into its own Oplog and applies it, so both Databases hold A.

Replication (busy case)

[Diagram sequence]
1. Both nodes hold A, and both Oplogs end with "Insert A".
2. The Client inserts B, inserts C, and updates A on the Primary. The Primary's Oplog now holds Insert A, Insert B, Insert C, Update A; the Secondary is still at "Insert A".
3. The Secondary checks the Primary's Oplog and finds "Insert A", the last operation it applied.
4. The Secondary syncs Insert B, Insert C, and Update A, and catches up.

Replication (more busy)

Stale

[Diagram sequence]
1. Both nodes hold A, and both Oplogs end with "Insert A".
2. The Client inserts B, inserts C, updates A, updates C, and inserts D on the Primary before the Secondary syncs.
3. The Oplog has a limited size, so "Insert A" is pushed out of the Primary's Oplog. It now holds Insert B, Insert C, Update A, Update C, Insert D.
4. The Secondary checks the Primary's Oplog: [Insert A] not found!!
5. It cannot get the information about [Insert B], so it cannot sync!! This is called STALE, and the node goes to Recovering.
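The busy and stale cases above can be replayed with a toy model. This is a simplified sketch, not MongoDB's code: the `Oplog` class is hypothetical, operations are plain strings rather than timestamped entries, and the capacity is counted in entries (5, as drawn in the diagrams) whereas a real oplog is limited by size in bytes.

```javascript
// Toy capped oplog: keeps only the newest `capacity` operations.
class Oplog {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = [];
  }

  append(op) {
    this.entries.push(op);
    // When full, the oldest entry is overwritten.
    if (this.entries.length > this.capacity) this.entries.shift();
  }

  // Operations recorded after `lastApplied`,
  // or null when `lastApplied` is no longer in the oplog (stale).
  since(lastApplied) {
    const i = this.entries.indexOf(lastApplied);
    return i === -1 ? null : this.entries.slice(i + 1);
  }
}

const primary = new Oplog(5);
primary.append('Insert A');
const lastApplied = 'Insert A'; // the secondary has synced up to here

// Busy case: three more writes, "Insert A" is still in the oplog.
['Insert B', 'Insert C', 'Update A'].forEach((op) => primary.append(op));
const busyCase = primary.since(lastApplied);
console.log(busyCase); // the operations the secondary still has to apply

// More busy: two more writes push "Insert A" out of the oplog.
['Update C', 'Insert D'].forEach((op) => primary.append(op));
const moreBusyCase = primary.since(lastApplied);
console.log(moreBusyCase === null ? 'too stale to catch up' : moreBusyCase);
```

The secondary can only resume from the last operation it applied, so once that operation has been overwritten there is no safe starting point left.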

Stale
We have to understand the importance of tuning the oplog size

We can specify the oplog size with a command-line option,
but only on the first start for a given dbpath
(which is also specified on the command line).
We cannot change the oplog size afterwards
without clearing the dbpath.

Be careful!
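One way to reason about the oplog size is to estimate how long a secondary may stay behind before its position is overwritten. The sketch below is a back-of-the-envelope assumption, not a MongoDB API: `oplogWindowSeconds` is a hypothetical helper, and the write rate of 1 MB/s of oplog data is an invented figure. The 1.2 GB oplog size and the 5833 sec needed to add a node are the numbers quoted earlier in these slides, although they were measured on different environments.

```javascript
// Seconds of writes the oplog can hold at a steady write rate.
function oplogWindowSeconds(oplogBytes, oplogBytesPerSec) {
  return oplogBytes / oplogBytesPerSec;
}

const OPLOG_BYTES = 1.2e9;      // 1.2 GB, the oplog size from the slides
const WRITE_RATE = 1e6;         // hypothetical: 1 MB of oplog entries per second
const ADD_NODE_SECONDS = 5833;  // time to add one node, from the slides

const windowSec = oplogWindowSeconds(OPLOG_BYTES, WRITE_RATE);
console.log(`oplog window: ${windowSec} sec`);

// If joining takes longer than the window, the new node is likely
// to be stale by the time it tries to catch up.
const likelyStale = ADD_NODE_SECONDS > windowSec;
console.log(`likely stale after joining: ${likelyStale}`);
```

Under these assumed numbers the window is far shorter than the time needed to add a node, which is the kind of mismatch that makes the oplog size worth checking before the first start.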

Replication (joining as a new node)

InitialSync

[Diagram sequence]
1. The Primary holds A, B, C, D; its Oplog ends with Insert C, Update A, Update C, Insert D.
2. A new mongod starts (Startup).
3. It gets the last Oplog entry from the Primary (Insert D) and becomes Recovering.
4. It clones the DB (A, B, C, D). Meanwhile the Client keeps writing to the Primary: Insert E, Update B.
5. Cloning the DB completes.
6. It checks the Primary's Oplog, starting from the entry it recorded (Insert D).
7. It syncs the later operations (Insert E, Update B) and becomes Secondary.

Additional information
From the source code. (I have never tested these...)

A Secondary will try to sync from other Secondaries
when it cannot reach the Primary or
might be stale against the Primary.

There is a small chance that the sync problem does not occur if a
Secondary has an older Oplog or a larger Oplog space than the Primary

Sync from another Secondary

[Diagram sequence: a Primary and two Secondaries]
1. The Primary's Oplog has already lost "Insert A". The first Secondary is still at "Insert A". The other Secondary holds the same data as the Primary and still has "Insert A" in its Oplog.
2. The first Secondary checks the Primary's Oplog: [Insert A] not found!!
3. But it is found on the other Secondary, so it is able to sync.
4. It syncs from the other Secondary, and all three Databases hold the same data.

That’s all about sync


Disk space
Data fragments sparsely across the DB files...
We ran into an unfavorable situation in our DBs

It appeared in some of our collections
around 3 months after we launched the services

db.ourcol.storageSize() = 16200727264 (15GB)
db.ourcol.totalSize() = 16200809184
db.ourcol.totalIndexSize() = 81920
db.ourcol.dataSize() = 2032300 (2MB)

What happened to them!!
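A quick calculation shows how sparse the files have become. The numbers below are copied from the figures above; the helper names `utilization` and `needsShrink` and the 10% threshold are hypothetical, chosen only to illustrate a possible check.

```javascript
// Sizes in bytes, as reported for the collection above.
const stats = {
  storageSize: 16200727264,
  totalSize: 16200809184,
  totalIndexSize: 81920,
  dataSize: 2032300,
};

// Fraction of the allocated storage that actually holds documents.
function utilization(s) {
  return s.dataSize / s.storageSize;
}

// Hypothetical policy: flag a collection when less than 10%
// of its allocated storage is live data.
function needsShrink(s, threshold = 0.1) {
  return utilization(s) < threshold;
}

console.log(`utilization: ${(utilization(stats) * 100).toFixed(4)}%`);
console.log(`needs shrink: ${needsShrink(stats)}`);
```

Only about 0.01% of the allocated storage holds documents here, so almost all of the 15 GB is empty space left behind in the files.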

Disk space
Data fragments sparsely across the DB files...
It seems to be caused by a specific pattern of operations:
inserting, updating, and deleting over and over.

Anyway, we have to shrink the disk space in use regularly,
just like PostgreSQL's VACUUM.

But how?

Disk space
Shrinking the disk space in use
MongoDB offers some functions for this.
But we couldn't use them in our case!

repairDatabase:
Can only be run on the Primary.
It takes a long time and BLOCKS all operations!!

compact:
Can only be run on a Secondary.
It zero-fills the blank space instead of shrinking the disk space.
So it cannot shrink...

Disk space
Our countermeasures
For temporary collections:
Issue the drop command regularly.
For other collections:
1. Remove one Secondary from the ReplSet.
2. Shut it down.
3. Remove all its DB files.
4. Rejoin it to the ReplSet.
5. Repeat steps 1 to 4 for each Secondary, one after another.
6. Step down the Primary. (Change the Primary node)
7. Finally, do steps 1 to 4 on the former Primary.
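The order of the procedure above can be written down as a small planning function. This sketch only builds a checklist of plain strings; it does not talk to MongoDB, and the function name `rollingResyncPlan` and the node names are hypothetical.

```javascript
// Build the ordered checklist for resyncing every node of a ReplSet,
// one node at a time, leaving the current Primary for last.
function rollingResyncPlan(primary, secondaries) {
  const steps = [];
  const resync = (node) => {
    steps.push(
      `remove ${node} from the ReplSet`,
      `shut down ${node}`,
      `remove all DB files on ${node}`,
      `rejoin ${node} to the ReplSet and wait for its initial sync`
    );
  };

  // Steps 1 to 5: each Secondary in turn.
  secondaries.forEach((node) => resync(node));

  // Step 6: hand over the Primary role.
  steps.push(`step down ${primary}`);

  // Step 7: the former Primary is now a Secondary and can be resynced.
  resync(primary);
  return steps;
}

// Hypothetical three-node set.
const plan = rollingResyncPlan('node0', ['node1', 'node2']);
plan.forEach((step, i) => console.log(`${i + 1}. ${step}`));
```

Waiting for each initial sync to finish before touching the next node keeps the rest of the set available throughout.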

PHP client
We tried 1.4.4 and 1.2.2
1.4.4:
There are some critical bugs around the connection pool.
We struggled to invalidate broken connections.
I think you should use 1.2.X instead of 1.4.X
1.2.2:
The connection pool issues seem to be fixed.
But there are 2 critical bugs!
– Socket handle leak
– Useless sleep
However, this version is relatively stable
as long as these bugs are fixed

https://github.com/crumbjp/Personal

- mongo1.2.2.non-wait.patch
- mongo1.2.2.sock-leak.patch

Closing
What's MongoDB?
It has very good READ performance.
We can use mongo instead of memcached,
if we can accept the limited write performance.
Die hard!
MongoDB stays highly available even under severe stress.
It can be used easily, without deep consideration.
We can manage to do anything once we get started.
Let's forget the awkward, trivial things that have bothered us:
How to handle huge data?
How to fit in a cache system?
How to keep availability?
And so on...

Closing
Keep in mind
Sharding is challenging...
It's the last resort!
It's hard to operate. In particular, maintaining the config servers.
[Mongos] is also difficult to keep alive.
I want a way to fail over Mongos.
Mongo can run in a poor environment, but...
The ONE thing you should set aside is plenty of disk space
Heavy writes are a sensitive matter
Adjust the oplog size carefully
The indexing function is still unfinished
Indexes cannot be applied online

All right, have fun!!

Thank you for listening
