S2Graph :
A large-scale graph database
with HBase
daumkakao
2
Reference
1. HBaseCon 2015
   1. http://www.slideshare.net/HBaseCon/use-cases-session-5
   2. https://vimeo.com/128203919
2. Deview 2015
3. ApacheCon Big Data Europe
   1. http://sched.co/3ztM
3
Our Social Graph
[Diagram: a social graph linking a user to people and content through edge types such as Message, Write (length), Read, Coupon (price), Present (price: 3), Friend, Group (size: 6), Emoticon, Eat (rating), View (count), Play (level: 6), Style (share: 3), Advertise, Search (keyword), Listen (count), Like (count: 7), and Comment, with affinity scores on friend edges.]
4
Our Social Graph
[Diagram: the same graph with concrete values: Message (length: 9) on Message ID 201, Write (length: 3) on Post ID 97, Play (level: 6) on Game ID 1984, Style (share: 3) on Item ID 13, Advertise (ctr: 0.32) on Ad ID 603, Search (keyword: "HBase"), Listen (count: 6) on a Music ID, and Friend edges weighted by affinity (1-9).]
5
Technical Challenges
1. A large social graph that is constantly changing
a. Scale: more than
- social network: 10 billion edges, 200 million vertices, 50 million updates on existing edges
- user activities: over 1 billion new edges per day
6
Technical Challenges (cont)
2. Low latency for breadth-first search traversal on connected data
a. Performance requirements
- peak graph-traversal queries per second: 20,000
- response time: 100 ms
7
Technical Challenges (cont)
3. Realtime update capabilities for viral effects
[Diagram: Person A writes a Post; Person B comments, Person C shares, and Person D mentions it. Each hop must propagate fast.]
8
Technical Challenges (cont)
4. Support for Dynamic Ranking logic
a. Push strategy: hard to change the data ranking logic dynamically.
b. Pull strategy: lets users try out various data ranking logics.
9
Before
Each app server has to know each DB's sharding logic: a highly inter-connected architecture.
[Diagram: Messaging, SNS, and Blog apps each wired directly to the friend-relationship, SNS-feed, blog-user-activity, and messaging stores.]
10
After
[Diagram: SNS, Blog, and Messaging apps all talk to S2Graph DB through stateless app servers.]
What is S2Graph?
daumkakao
12
What is S2Graph?
Storage-as-a-Service + Graph API = Realtime Breadth First Search
13
Example: Messenger Data Model
[Diagram: User -Participates-> Chat Room -Contains-> Messages.]
Recent messages in my chat rooms.
SELECT b.* FROM user_chat_rooms a, chat_room_messages b
WHERE a.user_id = 1 AND a.chat_room_id = b.chat_room_id AND b.created_at >= yesterday
14
Example: Messenger Data Model
[Diagram: User -Participates-> Chat Room -Contains-> Messages.]
Recent messages in my chat rooms.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [
        [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
        [{"label": "chat_room_messages", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
    ]
}
'
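For reference, the same two-step traversal body can be assembled programmatically before POSTing it to `/graphs/getEdges`. A minimal Python sketch; the label and column names come from the slide, while the helper function itself is hypothetical:

```python
import json

def get_edges_payload(user_id):
    """Build the two-step getEdges body: chat rooms first, then recent messages."""
    return {
        "srcVertices": [
            {"serviceName": "s2graph", "columnName": "user_id", "id": user_id}
        ],
        "steps": [
            # step 1: up to 100 chat rooms the user participates in
            [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
            # step 2: up to 10 recent messages per room
            [{"label": "chat_room_messages", "direction": "out", "limit": 10,
              "where": "created_at >= yesterday"}],
        ],
    }

body = json.dumps(get_edges_payload(1))  # POST this string to /graphs/getEdges
```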
15
Example: News Feed (cont)
[Diagram: User -Friends-> users who create/like/share Posts.]
Posts that my friends interacted with.
SELECT a.*, b.* FROM friends a, user_posts b
WHERE a.user_id = b.user_id AND b.updated_at >= yesterday
AND b.action_type in ('create', 'like', 'share')
16
Example: News Feed (cont)
[Diagram: User -Friends-> users who create/like/share Posts.]
Posts that my friends interacted with.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [
        [{"label": "friends", "direction": "out", "limit": 100}],
        [{"label": "user_posts", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
    ]
}
'
17
Example: Recommendation(User-based CF) (cont)
[Diagram: User -> Similar Users (computed in batch) -> Products via click/buy/like/share.]
Products that similar users interacted with recently.
SELECT a.* , b.* FROM similar_users a, user_products b WHERE a.sim_user_id = b.user_id AND b.updated_at >=
yesterday
Batch
18
Example: Recommendation(User-based CF) (cont)
Products that similar users interacted with recently.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
    "filterOut": {
        "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
        "steps": [[{"label": "user_products_interact"}]]
    },
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [
        [{"label": "similar_users", "direction": "out", "limit": 100, "where": "similarity > 0.2"}],
        [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= yesterday and price >= 1000"}]
    ]
}
'
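The filterOut clause runs the exclusion traversal first and drops any main-traversal edge whose target was already reached, so already-seen products never surface as recommendations. A rough Python sketch of that set-difference semantics (a simplification over plain id lists, not the server implementation):

```python
def apply_filter_out(candidates, filter_out):
    """Drop candidates whose target id appears in the filterOut traversal result."""
    seen = set(filter_out)
    return [c for c in candidates if c not in seen]

# products reached via similar users, minus products the user already touched
recommended = apply_filter_out([13, 42, 7, 99], filter_out=[42, 7])
# recommended == [13, 99]
```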
19
Example: Recommendation(Item-based CF) (cont)
[Diagram: User -click/buy/like/share-> Products -> Similar Products (computed in batch).]
Products similar to the ones I have interacted with.
SELECT b.* FROM user_products a, similar_products b WHERE a.user_id = 1 AND a.product_id = b.product_id AND a.updated_at >= yesterday
Batch
20
Example: Recommendation(Item-based CF) (cont)
Products similar to the ones I have interacted with.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [
        [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
        [{"label": "similar_products", "direction": "out", "limit": 10, "where": "similarity > 0.2"}]
    ]
}
'
21
Example: Recommendation(Content + Most popular) (cont)
TopK (k=1) product per time unit (day)
[Diagram: User -click/buy/like/share-> Products -> Categories -> daily top products per category.]
Daily top product per category, among the products I liked.
SELECT c.*
FROM user_products a, product_categories b, category_daily_top_products c
WHERE a.user_id = 1 AND a.product_id = b.product_id AND b.category_id = c.category_id
AND c.time BETWEEN yesterday AND today
22
Example: Recommendation(Content + Most popular) (cont)
Daily top product per category, among the products I liked.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [
        [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
        [{"label": "product_cates", "direction": "out", "limit": 3}],
        [{"label": "category_products_topK", "direction": "out", "limit": 10}]
    ]
}
'
23
Example: Recommendation(Spreading Activation) (cont)
[Diagram: User -interact-> Products <-interact- other Users -interact-> more Products.]
Products interacted with by users who interacted with the products I interacted with.
SELECT c.product_id, count(*)
FROM user_products a, user_products b, user_products c
WHERE a.user_id = 1
AND a.product_id = b.product_id
AND b.user_id = c.user_id
GROUP BY c.product_id
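The same spreading-activation count can be sketched as a two-hop walk over in-memory adjacency maps. Plain dicts stand in for the user_products_interact label here; note the user's own products come back too unless excluded (e.g. with filterOut):

```python
from collections import Counter

def spreading_activation(user_products, product_users, start_user):
    """Count products reached via: my products -> their users -> their products."""
    counts = Counter()
    for product in user_products.get(start_user, ()):
        for other in product_users.get(product, ()):
            if other == start_user:
                continue  # skip the starting user's own hop back
            counts.update(user_products.get(other, ()))
    return counts

user_products = {1: ["p1"], 2: ["p1", "p2"], 3: ["p1", "p3"]}
product_users = {"p1": [1, 2, 3], "p2": [2], "p3": [3]}
scores = spreading_activation(user_products, product_users, 1)
# scores == Counter({"p1": 2, "p2": 1, "p3": 1})
```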
24
Example: Recommendation(Spreading Activation) (cont)
[Diagram: User -interact-> Products <-interact- other Users -interact-> more Products.]
Products interacted with by users who interacted with the products I interacted with.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [
        [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
        [{"label": "user_products_interact", "direction": "in", "limit": 10, "where": "created_at >= today"}],
        [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= 1 hour ago"}]
    ]
}
'
25
Realization
1. These examples resemble graphs.
2. Objects are vertices; relationships are edges.
3. Necessary API: breadth-first search on a large-scale graph.
26
S2Graph API: Vertex
Vertex:
1. insert, delete, getVertex
2. vertex id: user-provided (string/int/long)
[Example vertex: ID 1231-123 with properties Prop1 = Val1, Prop2 = Val2, ...]
27
S2Graph API: Edge
Edges:
1. insert, delete, update, getEdge (like CRUD in an RDBMS)
2. Edge reference: (from, to, label, direction)
3. Multiple props per edge.
4. All edges are kept ordered (details follow).
[Example edge: reference (1, 101, "friend", "out") with properties Prop1 = Val1, Prop2 = Val2, ...]
28
S2Graph API: Query
Query: getEdges, countEdges, removeEdges
Class Query {
// Defines a breadth-first search
List[VertexId] startVertices;
List[Step] steps;
}
Class Step {
// Defines one breadth
List[QueryParam] queryParams;
}
Class QueryParam {
// Defines which edges to traverse in the current breadth
String label;
String direction;
Map options;
}
[Diagram: a Query is a sequence of Steps; each Step holds its QueryParams.]
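The pseudocode above maps directly onto plain data classes. A Python sketch (field names mirror the slide; the friend-of-friend example is illustrative):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class QueryParam:
    """Which edges to traverse in the current breadth."""
    label: str
    direction: str = "out"
    options: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Step:
    """One breadth of the search."""
    query_params: List[QueryParam]

@dataclass
class Query:
    """A whole breadth-first search: start vertices plus ordered steps."""
    start_vertices: List[Any]
    steps: List[Step]

# friend-of-friend: src.out("friend").limit(100).out("friend").limit(10)
q = Query(
    start_vertices=[1],
    steps=[
        Step([QueryParam("friend", options={"limit": 100})]),
        Step([QueryParam("friend", options={"limit": 10})]),
    ],
)
```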
29
S2Graph API: indices
[Diagram: vertex 1's "friend" out-edges under the primary index, kept ordered (DESC): degree 3, entries c-103, b-102, a-101 for friends 101 (name: a), 102 (name: b), 103 (name: c).]
Indices:
1. addIndex, createIndex
2. Edges are automatically kept ordered for multiple indices.
3. Supported data types: int/long/float/string.
class Index {
// Defines how to order edges
String indexName;
List[Prop] indexProps;
}
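Conceptually, an index fixes a sort order over a vertex's edges so a "top-N" read is a prefix scan instead of a read-time sort. A toy Python sketch of that ordering (the edge data is invented for illustration):

```python
# An index orders a vertex's edges by its indexProps (DESC in this sketch),
# so "top-N by affinity" is a prefix read rather than a read-time sort.
index_props = ["affinity"]

edges = [
    {"to": 101, "affinity": 3},
    {"to": 102, "affinity": 9},
    {"to": 103, "affinity": 6},
]

ordered = sorted(edges,
                 key=lambda e: tuple(e[p] for p in index_props),
                 reverse=True)
top2 = [e["to"] for e in ordered[:2]]
# top2 == [102, 103]
```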
30
What S2Graph Is Not
- No support for global computation (unlike Apache Giraph or GraphX).
- No support for graph algorithms such as PageRank or shortest path.
Storage-as-a-Service + Graph API = Realtime Breadth First Search
31
Why S2Graph: Push vs Pull. Feeds with Push
1. Only timestamp can be used as scoring
2. Hard to change scoring function dynamically
[Diagram: a Post/Like event is fanned out on write to each friend's feed queue.]

Write: # of friends
Read: O(1) for friends
Storage: AVG(# of friends) * total user activity
Query: O(1)
32
Why S2Graph: Push vs Pull. Feeds with Pull
1. Different weights for different action types: Like = 0.8, Click = 0.1, ...
2. Clients can change scoring dynamically.
[Diagram: reads fan in: traverse Friends, then their Post/Like edges.]

Write: O(1)
Read: none precomputed (edges are traversed at query time)
Storage: total user activity
Query: O(1) for friends + O(# of friends)
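The two cost tables boil down to a simple trade: push pays at write time and in storage, pull pays at read time. A back-of-the-envelope Python sketch (the numbers are illustrative, not from the slides):

```python
def push_costs(avg_friends, total_activity):
    """Fan-out on write: every activity is copied into each friend's feed."""
    return {"write_per_activity": avg_friends,
            "storage": avg_friends * total_activity}

def pull_costs(avg_friends, total_activity):
    """Fan-in on read: store each activity once, traverse friends at query time."""
    return {"write_per_activity": 1,
            "storage": total_activity,
            "read_per_query": 1 + avg_friends}

# e.g. 100 friends on average, 1 billion activities stored:
push = push_costs(avg_friends=100, total_activity=10**9)
pull = pull_costs(avg_friends=100, total_activity=10**9)
# push stores 100x more, pull pays ~101 lookups per feed read
```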
33
Why S2Graph: S2Graph Supports Pull + Push
Pull beats push only with
1. fast response time: 10~100 ms
2. throughput: 10K~20K QPS
S2Graph provides linear scalability in
1. the number of machines.
2. BFS search space (how many edges a single query traverses).
More detail in the benchmark section later.
34
Why S2Graph: Simplify Data Flow
[Diagram: S2Graph (Write API + Query DSL; open-sourced) ships its WAL log to Apache Spark (batch computing layer) for User/Item Similarity, TopK, Counter, and other jobs, whose results load back through the S2Graph Bulk Loader (to be open-sourced soon).]
35
Why S2Graph: Built in A/B test
1. Register a query template: each query template has an impressionId.
2. Insert click/impression events into S2Graph as edge inserts.
36
Why S2Graph: Just Insert Edge
S2Graph
1. user activity history.
2. friends feed.
3. user-item based collaborative filtering.
4. topK ranking (most popular, segmented most popular).
and many, many more:
just model your service as a graph.
S2Graph Internal
daumkakao
38
Details: previous talk at HBaseCon 2015
1.https://vimeo.com/128203919
2.http://www.slideshare.net/HBaseCon/use-cases-session-5
Benchmarks
daumkakao
40
HBase Table Configuration
1. setDurability(Durability.ASYNC_WAL)
2. setCompressionType(Compression.Algorithm.LZ4)
3. setBloomFilterType(BloomType.ROW)
4. setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)
5. setBlockSize(32768)
6. setBlockCacheEnabled(true)
7. pre-split by (Integer.MaxValue / regionCount); regionCount = 120 at table creation (on 20 region servers).
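The pre-split in item 7 can be sketched as uniform boundaries over [0, Integer.MaxValue), assuming row keys are hashed into that range; this is a sketch of the arithmetic, not the actual table-creation tooling:

```python
INT_MAX = 2**31 - 1  # Java Integer.MaxValue

def split_keys(region_count):
    """Boundaries that pre-split the hashed key space into region_count regions."""
    step = INT_MAX // region_count
    return [i * step for i in range(1, region_count)]

keys = split_keys(120)
# 119 boundaries -> 120 regions; first boundary is INT_MAX // 120
```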
41
HBase Cluster Configuration
• each machine: 8 cores, 32 GB memory, SSD
• hfile.block.cache.size: 0.6
• hbase.hregion.memstore.flush.size: 128MB
• otherwise default values from CDH 5.3.1
• s2graph rest server: 4 cores, 16 GB memory
42
Performance
1. Total # of edges: 100,000,000,000 (100,000,000 rows x 1,000 columns)
2. Test environment
a. Zookeeper server: 3
b. HBase Masterserver: 2
c. HBase Regionserver: 20
d. App server: 4 core, 16GB Ram
e. Write traffic: 5K / second
43
- Benchmark query: src.out("friend").limit(100).out("friend").limit(10)
- Total concurrency: 20 * # of app servers
Performance
2. Linear scalability
[Chart reconstructed as a table:]
# of app servers    QPS      Latency (ms)
1                   464      43
2                   885      45
4                   1,763    45
8                   3,491    46
Performance
3. Varying width of traverse (tested with a single server)
[Chart reconstructed as a table:]
Limit on first step    QPS      Latency (ms)
20                     1,821    11
40                     1,023    19
80                     570      35
200                    237      84
400                    122      164
800                    61       327
- Benchmark query: src.out("friend").limit(x).out("friend").limit(10)
- Total concurrency = 20 * 1 (# of app servers)
45
- All queries touch 1,000 edges.
- Each step's limit is on the x axis.
- Performance can be estimated from a given query's search space.
Performance
4. Different query paths (different I/O patterns)
[Chart reconstructed as a table:]
Limits on path            QPS      Latency (ms)
10 -> 100                 695      14
100 -> 10                 435.3    23
10 -> 10 -> 10            274.4    36
2 -> 5 -> 10 -> 10        292.1    34
2 -> 5 -> 2 -> 5 -> 10    307.5    32
46
Performance
5. Write throughput per operation on a single app server
Insert operation
[Chart: insert latency (0-5 ms axis) vs. requests per second; per-point values are not recoverable from the export.]
47
Performance
6. Write throughput per operation on a single app server
Update (increment/update/delete) operation
[Chart: update latency (0-8 ms axis) vs. requests per second (2,000-6,000 axis); per-point values are not recoverable from the export.]
48
Stats
1. HBase cluster per IDC (2 IDC)
- 3 Zookeeper Server
- 2 HBase Master
- (20 + 40) HBase Slave
2. App server per IDC
- 10 server for write-only
- 30 server for query only
3. Real traffic
- read: 10K~20K requests per second
- currently mostly 2-step queries with limit 100 on the first step.
- write: 5K~10K requests per second
49
50
Through S2Graph!
51
Now Available as Open Source
- https://github.com/daumkakao/s2graph
- Looking for contributors and mentors
Contact
- Doyoung Yoon : shom83@gmail.com

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

[263] s2graph large-scale-graph-database-with-hbase-2

  • 12. 12 What is S2Graph? Storage-as-a-Service + Graph API = Realtime Breadth First Search
  • 13. 13 Example: Messenger Data Model (a user Participates in Chat Rooms; a Chat Room Contains Messages). Recent messages in my chat rooms:
SELECT a.*
FROM user_chat_rooms a, chat_room_messages b
WHERE a.user_id = 1
  AND a.chat_room_id = b.chat_room_id
  AND b.created_at >= yesterday
  • 14. 14 Example: Messenger Data Model. Recent messages in my chat rooms:
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
    [{"label": "chat_room_messages", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}'
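The same two-step traversal can be built programmatically. A minimal Python sketch, assuming the endpoint and label names from the slides; the helper function `recent_messages_query` is hypothetical, not part of S2Graph:

```python
import json

# Hypothetical helper that builds the getEdges payload for the
# "recent messages in my chat rooms" traversal shown above.
def recent_messages_query(user_id):
    return {
        "srcVertices": [
            {"serviceName": "s2graph", "columnName": "user_id", "id": user_id}
        ],
        "steps": [
            # step 1: my chat rooms
            [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
            # step 2: recent messages in those rooms
            [{"label": "chat_room_messages", "direction": "out", "limit": 10,
              "where": "created_at >= yesterday"}],
        ],
    }

payload = json.dumps(recent_messages_query(1))
# POST payload to http://localhost:9000/graphs/getEdges with
# Content-Type: application/json to run the traversal.
```

Each inner list in `steps` is one breadth of the search, matching the curl example above.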
  • 15. 15 Example: News Feed (cont). Friends create/like/share posts. Posts that my friends interacted with:
SELECT a.*, b.*
FROM friends a, user_posts b
WHERE a.user_id = b.user_id
  AND b.updated_at >= yesterday
  AND b.action_type IN ('create', 'like', 'share')
  • 16. 16 Example: News Feed (cont). Posts that my friends interacted with:
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "friends", "direction": "out", "limit": 100}],
    [{"label": "user_posts", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}'
  • 17. 17 Example: Recommendation (User-based CF) (cont). Similar Users (batch-computed) plus user-product interactions (click/buy/like/share). Products that similar users interacted with recently:
SELECT a.*, b.*
FROM similar_users a, user_products b
WHERE a.sim_user_id = b.user_id
  AND b.updated_at >= yesterday
  • 18. 18 Example: Recommendation (User-based CF) (cont). Products that similar users interacted with recently:
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "filterOut": {
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [[{"label": "user_products_interact"}]]
  },
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "similar_users", "direction": "out", "limit": 100, "where": "similarity > 0.2"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= yesterday and price >= 1000"}]
  ]
}'
  • 19. 19 Example: Recommendation (Item-based CF) (cont). Similar Products (batch-computed) plus user-product interactions (click/buy/like/share). Products that are similar to those I have shown interest in:
SELECT a.*, b.*
FROM similar_ a, user_products b
WHERE a.sim_user_id = b.user_id
  AND b.updated_at >= yesterday
  • 20. 20 Example: Recommendation (Item-based CF) (cont). Products that are similar to those I have shown interest in:
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "similar_products", "direction": "out", "limit": 10, "where": "similarity > 0.2"}]
  ]
}'
  • 21. 21 Example: Recommendation (Content + Most Popular) (cont). TopK (k=1) product per time unit (day), from user-product interactions (click/buy/like/share). Daily top product per category among products that I liked:
SELECT c.*
FROM user_products a, product_categories b, category_daily_top_products c
WHERE a.user_id = 1
  AND a.product_id = b.product_id
  AND b.category_id = c.category_id
  AND c.time BETWEEN (yesterday, today)
  • 22. 22 Example: Recommendation (Content + Most Popular) (cont). Daily top product per category among products that I liked:
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "product_cates", "direction": "out", "limit": 3}],
    [{"label": "category_products_topK", "direction": "out", "limit": 10}]
  ]
}'
  • 23. 23 Example: Recommendation (Spreading Activation) (cont). User-product interactions (click/buy/like/share). Products interacted with by users who interacted with the products that I interacted with:
SELECT b.product_id, count(*)
FROM user_products a, user_products b
WHERE a.user_id = 1
  AND a.product_id = b.product_id
GROUP BY b.product_id
  • 24. 24 Example: Recommendation (Spreading Activation) (cont). Products interacted with by users who interacted with the products that I interacted with:
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "user_products_interact", "direction": "in", "limit": 10, "where": "created_at >= today"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= 1 hour ago"}]
  ]
}'
  • 25. 25 Realization 1. These examples resemble graphs. 2. Each Object is a Vertex; each Relationship is an Edge. 3. Necessary API: breadth first search on a large-scale graph.
  • 26. 26 S2Graph API: Vertex. Vertices: 1. insert, delete, getVertex 2. Vertex id: a user-provided (string/int/long) ID, plus arbitrary properties (Prop1: Val1, Prop2: Val2, ...).
  • 27. 27 S2Graph API: Edge. Edges: 1. insert, delete, update, getEdge (like CRUD in an RDBMS) 2. Edge reference: (from, to, label, direction), e.g. (1, 101, "friend", "out") 3. Multiple props on each edge. 4. Every edge is ordered (details follow).
  • 28. 28 S2Graph API: Query. Query: getEdges, countEdges, removeEdges. A Query is a list of Steps; each Step is one breadth of the search, made up of QueryParams:
class Query {            // Defines a breadth first search
  List[VertexId] startVertices;
  List[Step] steps;
}
class Step {             // Defines one breadth
  List[QueryParam] queryParams;
}
class QueryParam {       // Defines which edges to traverse in the current breadth
  String label;
  String direction;
  Map options;
}
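The Query/Step/QueryParam structure above can be mirrored in a few lines of Python. This is a sketch of the shape only (S2Graph itself is written in Scala; these class and field names follow the slide, not the actual codebase):

```python
from dataclasses import dataclass, field

@dataclass
class QueryParam:
    """One edge type to traverse in the current breadth."""
    label: str
    direction: str = "out"
    options: dict = field(default_factory=dict)

@dataclass
class Step:
    """One breadth of the search."""
    query_params: list

@dataclass
class Query:
    """A whole breadth first search: start vertices plus ordered steps."""
    start_vertices: list
    steps: list

# The benchmark query from later slides:
# src.out("friend").limit(100).out("friend").limit(10)
friend_of_friend = Query(
    start_vertices=[1],
    steps=[
        Step([QueryParam("friend", "out", {"limit": 100})]),
        Step([QueryParam("friend", "out", {"limit": 10})]),
    ],
)
```

Each Step fans out from the results of the previous one, which is exactly how the getEdges JSON payloads in the examples are laid out.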
  • 29. 29 S2Graph API: Indices. Indices: 1. addIndex, createIndex 2. Edges are automatically kept ordered (DESC) for multiple indices. 3. Supported index prop types: int/long/float/string. (Diagram: under the index key 1-friend-out-PK with degree 3, edges c-103, b-102, a-101 are stored in descending order.)
class Index {            // Defines how to order edges
  String indexName;
  List[Prop] indexProps;
}
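The effect of an index can be illustrated in memory. S2Graph maintains this ordering physically in HBase row keys; the sketch below only shows the resulting read behavior (the "affinity" prop and edge values are assumed examples):

```python
# An index with indexProps = ["affinity"] keeps a vertex's edges sorted
# DESC on that prop, so a limit-N read returns the strongest N edges
# without scanning or sorting at query time.
edges = [
    {"to": 101, "affinity": 3},
    {"to": 102, "affinity": 9},
    {"to": 103, "affinity": 1},
]
ordered = sorted(edges, key=lambda e: e["affinity"], reverse=True)
top2 = [e["to"] for e in ordered[:2]]   # the two highest-affinity friends
```

Because the ordering is precomputed on write, queries like "top 100 friends by affinity" cost a single short range scan.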
  • 30. 30 What S2Graph is Not: it does not support global computation (unlike Apache Giraph or GraphX), and it does not support graph algorithms such as PageRank or shortest path. S2Graph is Storage-as-a-Service + Graph API = Realtime Breadth First Search.
  • 31. 31 Why S2Graph: Push vs. Pull. Feeds with Push: 1. Only the timestamp can be used for scoring. 2. Hard to change the scoring function dynamically. Each Post/Like is fanned out (written) to every friend's Feed Queue. Costs: Write: # of friends; Read: O(1); Storage: AVG(# of friends) * total user activity; Query: O(1).
  • 32. 32 Why S2Graph: Push vs. Pull. Feeds with Pull: 1. Different weights for different action types (Like = 0.8, Click = 0.1, ...). 2. The client can change scoring dynamically. Posts/Likes are fetched through Friends edges at read time. Costs: Write: O(1); Read on write: none; Storage: total user activity; Query: O(1) for the friend list + O(# of friends) fetches.
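The two cost tables above can be restated as a back-of-the-envelope model. The formulas follow the slides; the sample numbers (100 friends, 1M activities) are assumptions for illustration only:

```python
def push_costs(avg_friends, total_activity):
    return {
        "write_per_activity": avg_friends,         # fan out to every friend's feed queue
        "query": 1,                                # O(1): read my precomputed queue
        "storage": avg_friends * total_activity,   # each activity copied once per friend
    }

def pull_costs(avg_friends, total_activity):
    return {
        "write_per_activity": 1,                   # O(1): insert one edge
        "query": 1 + avg_friends,                  # friend lookup + one fetch per friend
        "storage": total_activity,                 # each activity stored once
    }

push = push_costs(100, 1_000_000)
pull = pull_costs(100, 1_000_000)
# Pull stores ~100x less data but pays ~100x more work per query,
# which is why pull only wins if each query stays fast (10~100ms).
```

This is the trade-off the next slide quantifies: pull beats push only when per-query latency and throughput are good enough.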
  • 33. 33 Why S2Graph: S2Graph Supports Pull + Push. Pull beats push only with: 1. fast response time (10~100ms) 2. throughput of 10K~20K QPS. S2Graph provides linear scalability in: 1. the number of machines 2. the BFS search space (how many edges a single query traverses). More details in the benchmark section later.
  • 34. 34 Why S2Graph: Simplified Data Flow. S2Graph (Write API + Query DSL) emits a WAL log (open sourced), consumed by Apache Spark (the batch computing layer) for user/item similarity, topK counters, and more. The S2Graph Bulk Loader will be open sourced soon.
  • 35. 35 Why S2Graph: Built-in A/B Testing. 1. Register a query template: each query template has an impressionId. 2. Insert click/impression events into S2Graph as edge inserts.
  • 36. 36 Why S2Graph: Just Insert Edges. S2Graph handles: 1. user activity history 2. friends feed 3. user/item-based collaborative filtering 4. topK ranking (most popular, segmented most popular), and many more. Just think of your service as a graph model.
  • 38. 38 Detail: previous talk on HBaseCon 2015 1.https://vimeo.com/128203919 2.http://www.slideshare.net/HBaseCon/use-cases-session-5
  • 40. 40 HBase Table Configuration 1. setDurability(Durability.ASYNC_WAL) 2. setCompressionType(Compression.Algorithm.LZ4) 3. setBloomFilterType(BloomType.ROW) 4. setDataBlockEncoding(DataBlockEncoding.FAST_DIFF) 5. setBlockSize(32768) 6. setBlockCacheEnabled(true) 7. pre-split by (Integer.MAX_VALUE / regionCount), with regionCount = 120 at table creation (on 20 region servers).
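The pre-split scheme in item 7 is simple arithmetic: split points at multiples of Integer.MAX_VALUE / regionCount. A sketch (the regionCount value 120 comes from the slide):

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

def split_points(region_count):
    """Evenly spaced region split keys over the signed 32-bit key space."""
    step = INT_MAX // region_count
    return [i * step for i in range(1, region_count)]

points = split_points(120)
# 120 regions need 119 split points, spread evenly so that write load
# (and region assignment across the 20 region servers) stays balanced.
```

Pre-splitting at table creation avoids the hot-region problem of writing everything into a single initial region and waiting for organic splits.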
  • 41. 41 HBase Cluster Configuration • each machine: 8 cores, 32GB memory, SSD • hfile.block.cache.size: 0.6 • hbase.hregion.memstore.flush.size: 128MB • otherwise default values from CDH 5.3.1 • S2Graph REST server: 4 cores, 16GB memory
  • 42. 42 Performance 1. Total # of edges: 100,000,000,000 (100,000,000 rows x 1,000 columns) 2. Test environment a. Zookeeper servers: 3 b. HBase Master servers: 2 c. HBase Region servers: 20 d. App server: 4 cores, 16GB RAM e. Write traffic: 5K/second
  • 43. 43 Performance 2. Linear scalability. Benchmark query: src.out("friend").limit(100).out("friend").limit(10); total concurrency: 20 * # of app servers.
# of app servers | QPS (queries/sec) | Latency (ms)
       1         |        464        |      43
       2         |        885        |      45
       4         |      1,763        |      45
       8         |      3,491        |      46
  • 44. Performance 3. Varying traverse width (tested with a single server). Benchmark query: src.out("friend").limit(x).out("friend").limit(10); total concurrency = 20 * 1 (# of app servers).
Limit on first step | QPS   | Latency (ms)
        20          | 1,821 |      11
        40          | 1,023 |      19
        80          |   570 |      35
       200          |   237 |      84
       400          |   122 |     164
       800          |    61 |     327
  • 45. 45 Performance 4. Different query paths (different I/O patterns). All queries touch 1000 edges; each step's limit is shown on the x axis. Performance can be predicted from a given query's search space.
Limits on path         | QPS   | Latency (ms)
10 -> 100              | 695   |      14
100 -> 10              | 435.3 |      23
10 -> 10 -> 10         | 307.5 |      36
2 -> 5 -> 10 -> 10     | 292.1 |      34
2 -> 5 -> 2 -> 5 -> 10 | 274.4 |      32
  • 46. 46 Performance 5. Write throughput per operation on a single app server: insert operations. (Chart: latency in ms vs. requests per second.)
  • 47. 47 Performance 6. Write throughput per operation on a single app server: update (increment/update/delete) operations. (Chart: latency in ms vs. requests per second.)
  • 48. 48 Stats 1. HBase cluster per IDC (2 IDCs): 3 Zookeeper servers, 2 HBase Masters, (20 + 40) HBase slaves. 2. App servers per IDC: 10 write-only servers, 30 query-only servers. 3. Real traffic: reads 10K~20K requests per second (now mostly 2-step queries with limit 100 on the first step); writes over 5K~10K requests per second.
  • 49. 49
  • 51. 51 Now Available as Open Source - https://github.com/daumkakao/s2graph - Looking for contributors and mentors. Contact - Doyoung Yoon : shom83@gmail.com