# HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon


Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.



1. Unique Sets with HBase - Elliott Clark
2. Problem Summary - The business needs to know how many unique people have done some action.
3. Problem Specifics
   - Need to create a lot of different counts of unique users
     - 100 different counters per day per game (could be per website, or any other group)
     - 1,000 different games
   - Some counters require knowledge of old data
     - Count of unique users who joined today
     - Count of unique users who have ever paid
4. 1st Try - Bit Set per Hour
   - Row key is the game and the hour
   - Column qualifiers are the counter names
   - Column values are 1.5 MB bit sets
   - Each hour a new Bloom filter is created for every counter
   - Compute a day's counter by OR'ing the bits and dividing the count of high bits by the probability of a collision
5. 1st Try - Example

   | Row | D:DAU | D:new_uniques |
   | --- | --- | --- |
   | Game1 2012-01-01 0100 | NUM_IN_SET: 1.5M, 010010001101100100… | NUM_IN_SET: 0.9M, 1100110100111010… |
6. 1st Try - Pluses & Minuses
   - (+) Allows accuracy requirements to drive size
   - (-) Requires a full table scan of all bit sets
   - (-) Generates a lot of data
   - (-) Huge number of regions
   - (-) Not 100% accurate
   - (-) Very hard to debug
7. 2nd Try - Bit Sets per User
   - Row key is the user's ID, reversed
     - Reversing the ID stops hot-spotting of regions
   - Column qualifiers are a compound key of game and counter name
   - Column values are a start date-hour and a bit set
     - Each position in the bit set refers to a subsequent hour after the start time
     - 1 means the user performed that action; 0 means the user did not
8. 2nd Try - Example

   | Row | D:game1_active | D:game2_paid_money |
   | --- | --- | --- |
   | Game1 2012-01-01 0100 | Start Date: 2012-01-01 0500, 010010001101100100… | Start Date: 2012-01-01 0500, 00000000000000000100… |
9. 2nd Try - Pluses & Minuses
   - (+) Easier to debug
   - (+) Size grows with the number of users, not with the accuracy required
   - (-) Requires a full table scan of all users
   - (-) Scales with the number of users ever seen, not the number active on a given day
   - (-) Very active users can make rows grow without bound
   - (-) Very hard to undo any mistakes; dirty data is very hard to correct
10. 3rd Try - Multi-Pass Group
    - Group all log data for a day by user ID
    - Join the log data with historic data in HBase by doing a GET on the user's row
    - Compute new information about the user
    - Emit new data about the user and a +1 for every action the user performed in the log data
11. 3rd Try - Data Flow (diagram): multiple log data streams are joined with HBase user data, producing recomputed user data and count increments (+1)
12. 3rd Try - Pluses & Minuses
    - (+) Easy to debug
    - (+) Scales with the number of users that are active
    - (+) Allows a more holistic view of the users
    - (-) Requires a large amount of data to be shuffled and sorted
13. Conclusions
    - Try to get the best upper bound on runtime
    - More and more flexibility will be required as time goes on
    - Store more data now; when new features are requested, development will be easier
    - Choose a good serialization framework and stick with it
    - Always clean your data before inserting
14. Questions?
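The 1st-try scheme (slide 4) hashes each user into a fixed-size bit set and corrects the raw count of set bits for collisions. A minimal sketch, assuming a standard linear-counting correction; the names (`BITSET_BITS`, `add_user`, `estimate_uniques`) and the MD5 hash are illustrative, not from the talk, and the bit set is smaller than the 1.5 MB ones described:

```python
import hashlib
import math

# Hypothetical sketch of the per-hour bit-set counter (1st try).
# Smaller than the 1.5 MB sets in the talk, for demonstration.
BITSET_BITS = 1 << 20

def bit_position(user_id: str) -> int:
    """Hash a user ID to a position in the fixed-size bit set."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % BITSET_BITS

def add_user(bits: bytearray, user_id: str) -> None:
    """Mark one user as seen in this hour's bit set."""
    pos = bit_position(user_id)
    bits[pos // 8] |= 1 << (pos % 8)

def or_bitsets(a: bytes, b: bytes) -> bytearray:
    """Combine hourly bit sets into a daily one by OR'ing the bits."""
    return bytearray(x | y for x, y in zip(a, b))

def estimate_uniques(bits: bytearray) -> float:
    """Correct the count of high bits for collisions
    (linear-counting estimate: -m * ln(zero_fraction))."""
    m = BITSET_BITS
    zeros = m - sum(bin(byte).count("1") for byte in bits)
    if zeros == 0:
        return float(m)  # bit set saturated; estimate is a lower bound
    return -m * math.log(zeros / m)
```

As the slide notes, the estimate is not 100% accurate: the bit-set size (not the user count) bounds the error, which is why accuracy requirements drive storage size in this design.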
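The 2nd-try scheme (slides 7-8) keys rows by the reversed user ID and stores, per counter, a start date-hour plus a bit per subsequent hour. A sketch under those assumptions; the helper names (`row_key`, `set_action_bit`, `acted_in_hour`) are hypothetical:

```python
from datetime import datetime

def row_key(user_id: str) -> str:
    """Reverse the user ID so sequential IDs spread across regions
    instead of hot-spotting the last one."""
    return user_id[::-1]

def set_action_bit(start: datetime, bits: bytearray, when: datetime) -> None:
    """Set the bit at the hour offset from the column's start date-hour."""
    offset = int((when - start).total_seconds() // 3600)
    if offset < 0:
        raise ValueError("action precedes the bit set's start hour")
    needed = offset // 8 + 1
    if needed > len(bits):
        # This is the slide's caveat: very active (or long-lived)
        # users make rows grow without bound.
        bits.extend(b"\x00" * (needed - len(bits)))
    bits[offset // 8] |= 1 << (offset % 8)

def acted_in_hour(start: datetime, bits: bytearray, when: datetime) -> bool:
    """1 means the user performed the action in that hour; 0 means not."""
    offset = int((when - start).total_seconds() // 3600)
    byte = offset // 8
    return byte < len(bits) and bool(bits[byte] & (1 << (offset % 8)))
```

Counting uniques for a day then requires scanning every user row ever seen and testing the relevant 24 bits, which is why this design scales with total users rather than daily actives.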
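The 3rd-try flow (slides 10-11) can be sketched in miniature: in production this would be a Hadoop job issuing HBase GETs, but here plain dicts stand in for the grouped log data and the user table, and the counter names (`new_uniques`, `first_time_payers`, `unique_*`) are made up for illustration:

```python
from collections import defaultdict

def process_day(log_events, user_table):
    """Sketch of the multi-pass group-and-join.
    log_events: iterable of (user_id, action) pairs for one day.
    user_table: dict user_id -> historic data (stands in for an HBase GET).
    Returns (recomputed user rows, unique counters)."""
    # Pass 1: group all of the day's log data by user ID.
    by_user = defaultdict(set)
    for user_id, action in log_events:
        by_user[user_id].add(action)

    counters = defaultdict(int)
    updates = {}
    for user_id, actions in by_user.items():
        # Join with historic data: one GET per *active* user,
        # so the job scales with daily actives, not total users.
        history = user_table.get(user_id, {"seen": False, "ever_paid": False})
        if not history["seen"]:
            counters["new_uniques"] += 1
        if "paid" in actions and not history["ever_paid"]:
            counters["first_time_payers"] += 1
        # Emit a +1 for every distinct action this user performed.
        for action in actions:
            counters[f"unique_{action}"] += 1
        # Emit the recomputed user data to write back.
        updates[user_id] = {
            "seen": True,
            "ever_paid": history["ever_paid"] or "paid" in actions,
        }
    return updates, counters
```

The grouping pass is where the cost named on slide 12 appears: all of the day's log data must be shuffled and sorted by user ID before the join can run.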