2. Social Business Wha?
Big Data meets Big Budgets
• Brand marketers spend
• $450 (~£270) billion annually on traditional media
• $50 (~£30) billion annually on SEO/SEM
• Starting to transition to social media
5. The Dachis Group
Measure all the things!
• Jeff Dachis amasses small army of social strategists
• Funds team to create social analytics platform
• Measure business outcomes of social media strategies
• Track social media surrounding Forbes Global 2000
• Include all brands, all subsidiaries, all social media types
6. Architecture
• Raw data in S3
• Cassandra
• Realtime queries to return raw data
• Hadoop analytic integration for foundational measures
• Horizontal scalability
• Operationally simple
• RDBMS
• Time rollups of measures
• Aggregates and composite measures
• Arbitrary dimensional queries
• Mini data warehouse
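The "time rollups of measures" kept in the RDBMS can be pictured with a minimal Python sketch. It assumes measures arrive as (timestamp, value) pairs; the function name, bucket formats, and sample values are hypothetical and are not taken from the platform described in the slides.

```python
from collections import defaultdict
from datetime import datetime


def rollup(measures, granularity):
    """Sum (timestamp, value) measures into hour or day buckets."""
    # Bucket labels are an assumption made for this sketch.
    formats = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d"}
    fmt = formats[granularity]
    buckets = defaultdict(int)
    for ts, value in measures:
        buckets[ts.strftime(fmt)] += value
    return dict(buckets)


# Hypothetical signal counts for one brand.
measures = [
    (datetime(2011, 9, 1, 9, 15), 3),
    (datetime(2011, 9, 1, 9, 45), 2),
    (datetime(2011, 9, 1, 17, 5), 4),
    (datetime(2011, 9, 2, 8, 0), 1),
]
hourly = rollup(measures, "hour")
daily = rollup(measures, "day")
```

Precomputing rollups like these at several granularities is one way a small relational store can answer dimensional queries without rescanning raw signals.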
7. Pipeline
• Memcached
• AWS S3: Raw Signal Storage
• Cassandra: Signal Repository
• Postgres: Metrics Store
• Stages: Normalization, Enrichment, Analysis
8. Normalization
• Parallel copy from S3 to HDFS
• MapReduce to Cassandra from Raw to Normalized CF
• Normalized data model
• Decent investment to get right
• Mostly for conceptual reasons rather than concerns about queries
• Secondary indexes vs app maintained indexes
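The trade-off between secondary indexes and app-maintained indexes is easier to see with a small sketch of the app-maintained side. The Python below uses plain dictionaries as stand-ins for a data column family and an index column family. The class, method, and field names are hypothetical, and it does not use any Cassandra client API.

```python
from collections import defaultdict


class SignalStore:
    """In-memory stand-in for a data column family plus an index column family."""

    def __init__(self):
        self.signals = {}                  # row key -> columns
        self.by_brand = defaultdict(dict)  # brand -> {signal row key: ""}

    def put(self, key, columns):
        old = self.signals.get(key)
        if old is not None and old["brand"] != columns["brand"]:
            # The application, not the database, removes the stale index entry.
            self.by_brand[old["brand"]].pop(key, None)
        self.signals[key] = dict(columns)
        self.by_brand[columns["brand"]][key] = ""

    def find_by_brand(self, brand):
        # Read the index row first, then fetch each data row it points to.
        keys = sorted(self.by_brand.get(brand, {}))
        return [self.signals[k] for k in keys]


store = SignalStore()
store.put("s1", {"brand": "acme", "text": "hello"})
store.put("s2", {"brand": "acme", "text": "world"})
store.put("s1", {"brand": "globex", "text": "hello"})  # s1 moves to another brand
```

The extra bookkeeping in `put` is the cost of maintaining the index in the application; a built-in secondary index moves that work into the database.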
9. Enrichment
• Enrich with
• Unique company/brand information
• Sentiment
• Relationships
• Conversations
• Social graph information
• Enter Pig
• Enter Oozie
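Enrichment is largely a join between signals and reference data such as company/brand information. The following Python sketch shows the shape of that join on a hypothetical `account_id` key; the field names and sample records are assumptions, and in the talk this work is done in Pig rather than Python.

```python
def enrich(signals, accounts):
    """Left-join each signal to its account record on account_id."""
    by_id = {account["account_id"]: account for account in accounts}
    enriched = []
    for signal in signals:
        account = by_id.get(signal["account_id"], {})
        enriched.append({
            **signal,
            # Signals with no matching account keep None for these fields.
            "brand": account.get("brand"),
            "company": account.get("company"),
        })
    return enriched


# Hypothetical reference data and signals.
accounts = [
    {"account_id": 1, "brand": "BrandA", "company": "ParentCo"},
    {"account_id": 2, "brand": "BrandB", "company": "ParentCo"},
]
signals = [
    {"signal_id": "x", "account_id": 1, "text": "love it"},
    {"signal_id": "y", "account_id": 3, "text": "who is this"},
]
enriched = enrich(signals, accounts)
```

A left join keeps signals whose account is not yet known, so they can be enriched later once the reference data catches up.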
10. The Bleeding Edge
Pig
• newlogicalplan in 0.8.0
• Debugging/tracing?
• Incremental development
• Working with Cassandra
• Pygmalion - facilitating to and from Cassandra
• Experience, unit test framework, UDFs, community
slowly became
11. The Bleeding Edge
Oozie
• Learning curve and common errors
• User impersonation
• Logs, we haz them, lots of them
• Web UI needs love
• Specific to Cassandra
• mapreduce.fileoutputcommitter.marksuccessfuljobs
• See http://wiki.apache.org/cassandra/HadoopSupport#Oozie
• Still very good DAG workflow crunching tool
• Subworkflows, fork/join, regular scheduling, dataset detection
• Extensible
• Apache Incubator (@oozie on twitter, #oozie on freenode)
12. The Bleeding Edge
Cassandra
• Rack aware snitch and replication
• Always rotate racks in order in topology
• In EC2 this likely means rotate AZs
• Dealing with scanning over column families
• Project early
• General tuning and unique workload
• Mahout and other higher memory hadoop tasks
• EC2 instance types
• Visualization tool helped (OpsCenter, Acunu has Control Center)
• Community++
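The advice to "always rotate racks in order in topology" can be illustrated with a short Python sketch that interleaves nodes so consecutive ring positions cycle through racks (availability zones in EC2). The node names are hypothetical, and the token range of 2**127 assumes RandomPartitioner; this is a planning aid, not part of any Cassandra tooling.

```python
from itertools import zip_longest


def rotate_racks(nodes_by_rack):
    """Order nodes so consecutive ring positions cycle through racks in order."""
    racks = sorted(nodes_by_rack)
    columns = zip_longest(*(nodes_by_rack[rack] for rack in racks))
    return [node for column in columns for node in column if node is not None]


def evenly_spaced_tokens(count, ring_size=2**127):
    """Evenly spaced tokens; ring_size assumes RandomPartitioner's range."""
    return [i * ring_size // count for i in range(count)]


# Hypothetical cluster: two nodes in each of three availability zones.
nodes_by_rack = {
    "us-east-1a": ["a1", "a2"],
    "us-east-1b": ["b1", "b2"],
    "us-east-1c": ["c1", "c2"],
}
ordered = rotate_racks(nodes_by_rack)
ring = list(zip(evenly_spaced_tokens(len(ordered)), ordered))
```

With this ordering, walking the ring from any token reaches a different rack at each of the next steps, which is what lets rack-aware replica placement spread replicas evenly.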
13. Social Business Index
Launches September 2011
• Global Ranking of Companies
• Industry Rankings
• Visualization of strategy
14. This might actually work!
• Fall 2011, built up the team
• Expertise in Pig, Lucene/Solr, machine learning, statistics, event
prediction and analysis
• Making everlasting gobstoppers
17. Productizing Topics
• Ongoing automated topic detection
• Lessons from one-off topic analysis
• Represented by term distributions
• Threads with detail like
• Signal volume
• Participants
• Links
• Sentiment gauge
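A topic "represented by term distributions" can be sketched as the relative frequency of each term across the signals in the topic. The Python below is a minimal illustration; the tokenizer, stopword list, and sample texts are assumptions, and the talk does not specify how the production system computes its distributions.

```python
import re
from collections import Counter


def term_distribution(texts, stopwords=frozenset()):
    """Return each term's share of all non-stopword terms in the texts."""
    counts = Counter()
    for text in texts:
        for term in re.findall(r"[a-z0-9']+", text.lower()):
            if term not in stopwords:
                counts[term] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


# Hypothetical signals belonging to one topic.
dist = term_distribution(
    ["Great game, great ad", "The ad was funny"],
    stopwords={"the", "was"},
)
```

Two topics can then be compared by comparing their distributions, and the highest-weight terms give a readable label for each thread.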
18. Advocates
• Auto-discovery of potential advocates
• Curated set of known advocates
• Example signal (from Cassandra)
• Reports and other useful bits
19. Lessons learned
• Emerging products are sometimes frustrating, but well worth the pain in
their respective niche.
• “Never underestimate the massive impact of small bugs in big
data.” (@peteskomoroch at LinkedIn)
• Community karma
20. A Note on Community
• Community involvement
• IRC, mailing lists, twitter, conferences, meetups
• Newer projects have little or outdated docs
• Some features may be
• Deprecated
• Not ready for primetime
• Not a fit for your use case
• Community karma
• Don’t just take
• Be a bridge builder
• Positive karma helps
This is how they see their capability after, for example, the Super Bowl.
When managers ask how effective the campaign was, the marketing department says it was awesome. When asked how they know that, they say that Zoltar told them so. In reality there are a lot of homegrown methods, some good, some not so good. Some of what we did grew out of a spreadsheet that was manually updated, validated and refined over time with one of our major customers.
What brands does Berkshire Hathaway have under its gigantic umbrella?!? Can mention Red Bull, Disney, HP, Levis, Samsung, Honda, etc.
Operationally simple doesn’t mean that you don’t need to learn a lot about it, just that there aren’t a lot of moving parts. Unique use case in that it’s hybrid: lots of writes as well as analytics and reads.
It’s just scads of text, but we do classify - conversations long/short difference between microblogs and blogs. We may use Hadoop to generate alternate CFs for specific queries as we need them.
Company information is unique because we had to buy, borrow, steal and, yes, crowdsource that data. Pig handles joins really well, for example joining account snapshots and signals for enrichment.
Mention Brandon’s work to make things better with CassandraStorage and newer versions of Pig, including regression tests. Speculative execution.
Mention having looked at Azkaban as well. No real way around the logs, just takes getting used to. User impersonation is a product of the authorization framework, patch added to DSE.
Mention consistency level choices. Rotate racks - yeah, wasn’t documented except in the code. Backup/restore. Root causes sometimes difficult to determine. Scaling up - each order of magnitude jump has its own problems.
But the long sleepless Summer finally pays off...
Everlasting gobstoppers are a fun phase for the projects.
Reveals numbers.
Explanation. Great working as a team. Mention Boxing Day.
Also customer curated topics in the future.
Data consistency - periodic checks, staging cluster, unit tests, integration testing. Repairable data. Sometimes incredibly painful, but possible. Mention backup/restore. Mention root causes.
Be active in the communities of these new projects. If necessary, start building communities around them. Don’t just take: answer questions, follow mailing lists (but have a filter), and contribute docs, bug submissions, feature requests, votes, representation, tests, and patches/pull requests.