Solbase, the real-time open-source search engine, is now available on GitHub. Solbase was developed by Photobucket.com and is built upon Lucene, Solr, and HBase. Photobucket has also recently released a real-time community activity stream capturing the 4 million daily uploads as well as all of your friends' comments and favorite photos. The foundation of the system is HBase, and it also employs Kestrel queues. This talk will cover the architecture and implementation details and share many of the lessons learned while developing this real-time big data system.
Photobucket Overview
• Photobucket is the most-visited photo site, with 23.4 million UVs
• Over 9 billion photos stored!
• Users upload 4 million images per day!
• Photobucket users spend more time than on any other photo site: 3.8 avg. mins/visit
• 2.0 million avg. daily visitors (more daily visits than Flickr and Picasa combined)
Sources: 1. comScore May 2011; 2. internal data
We should go over the agenda and introduce each of the presenters.
First, Koh is going to talk about Solbase, our real-time search engine built on top of Lucene, Solr, and HBase. We first presented Solbase about 9 months ago. At that time we reported that our standard Lucene/Solr implementation was no longer scaling to meet our needs, and our initial tests of Solbase gave us hope that we were going to solve that problem AND dramatically improve performance. In addition, we were updating our search index in real time. Great results, but possibly the bigger news at the time was that we were planning to open source all the code. Tonight Koh is here to deliver on that promise.

The next topic we'll cover is another HBase feature developed at PB: our activity stream. It's what you'd probably expect: a social network feature that distributes events about photos and videos in near real time. We've seen a number of presentations on similar features, but rarely do you see any detail on the architecture or lessons learned that would help you build your own. Ron and Josh are going to do exactly that.

But before we jump into all that... why do you care? Who is PB?
We're the biggest dedicated photo site on the web and we're right next door. We have millions of active users and billions of photos.
Here's a quick slide on our size compared to our peers… it's a little old, but you get the idea. We have millions of unique visitors.
Over time those users have contributed half a billion public photos and videos to our search index, and we generate a boatload of social events around all that public media.
• Lucene's field cache for sorting and filtering became very problematic for us
• Turnaround time for building the entire set of indices was about a day
• Every 100 ms improvement in response time equates to approximately 1 extra page view
• It was impractical to add a significant number of new docs and data
In a nutshell, Solbase basically replaces indices stored on the local filesystem with tables in HBase, and it also overcomes some of Lucene's inherent limitations; one major one we solved is sort/filter.
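The sort/filter idea can be sketched roughly like this (a minimal illustration, not actual Solbase code: the row layout, field names, and data here are all hypothetical). Instead of relying on Lucene's in-memory field cache, the needed sort value is embedded alongside each posting in the term's row, so results can be sorted at query time from the row itself:

```python
# Hypothetical inverted index stored as term-keyed rows (HBase-like):
# term -> list of (doc_id, embedded sort value, e.g. upload timestamp).
index_rows = {
    "sunset": [(101, 1696000000), (205, 1695000000), (330, 1697000000)],
    "beach":  [(205, 1695000000), (412, 1694000000)],
}

def search_sorted(term, descending=True):
    """Fetch one 'row' by term and sort using the embedded sort value,
    avoiding a separate per-query field-cache lookup."""
    postings = index_rows.get(term, [])
    return [doc for doc, _ in sorted(postings, key=lambda p: p[1],
                                     reverse=descending)]

print(search_sorted("sunset"))  # newest first: [330, 101, 205]
```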
Ron Here
Kestrel is open source and was developed at Twitter.
Talk about scale and real-time processing speed: ops per second. A single thread pushes 40 ops/s all the way through to HBase.
Josh Here
HBase is a distributed, Bigtable-like database built upon Hadoop components; it leverages HDFS, Hadoop's distributed file system. Built on Hadoop, it scales to a massive size, virtually limitless, and is used by many large-scale companies: Facebook, Yahoo, and Google (through their Bigtable implementation).
Features: a column store; a key/value store with semi-structured values.
Why use HBase? Horizontal scalability; high write throughput; millions of columns, billions of rows.
Ask who has used HBase.
HBase consists of master nodes with a set of region servers to distribute the data. The master is the gateway interface that directs clients to the proper region server for the requested data. Data is replicated among several data nodes by Hadoop's file system, HDFS. There is 'locational affinity' between a region server and the data it serves.
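The routing step above can be sketched in a few lines (a toy illustration, assuming hypothetical region boundaries and server names, not real HBase client code). Regions partition the sorted key space, so finding the owner of a row key is just a search over the region start keys:

```python
import bisect

# Hypothetical regions: each owns the key range from its start key
# up to the next region's start key.
region_starts = ["", "g", "p"]          # region 0: ["", "g"), region 1: ["g", "p"), ...
region_servers = ["rs1", "rs2", "rs3"]  # server hosting each region

def locate(row_key):
    """Find the region whose start key is the greatest one <= row_key,
    which is what the client metadata lookup conceptually does."""
    idx = bisect.bisect_right(region_starts, row_key) - 1
    return region_servers[idx]

print(locate("apple"))  # rs1
print(locate("horse"))  # rs2
print(locate("zebra"))  # rs3
```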
Each table consists of a row key, a set of defined column families, and an arbitrary number of qualified columns for each family. Keys are stored lexicographically, so range scans between two keys are extremely fast. All data is binary. Interestingly, this is similar to the concept of the inverted index, where the terms are stored lexicographically; this is something we leverage in our implementation.
Mention using the lexicographic key ordering to pre-sort data.
Get: single-row access, similar to a SQL query by primary key. Put: single-row update/insert (can be done in batches). Scan: lexicographic range query between two specified keys.
Back to Ron. HBase optimization: scans continue to be fast, but large multi-gets have been an issue.