Scaling at Showyou
      John Muellerleile (@jrecursive, john@remixation.com)
      September 26, 2011




Tuesday, September 27, 2011
John Muellerleile

      • Basho Technologies: Jan. 2008 - Dec. 2010
            • Riak
            • Riak Search
            • Automated research, NLP, spidering
      • Consulting: 2003 - 2008
            • E-commerce, AdSense, AdWords, ...
      • Infrastructure:
                  • Messaging
                  • Riak
                  • Solr



Agenda

      • Who am I?


      • What is Showyou?


      • The Nature of “Social Data”


      • Showyou’s Data: Today & Tomorrow


      • Data Management: Technology Stack


      • Riak: The Awesome & Sub-Awesome, Integration Patterns & Observations


      • Not Bob’s Riak: The “Mecha” Backend & Query System


What is Showyou?




A minute about the nature of “Social Data”



Hi!




Showyou’s Data: Today

      • Riak with Bitcask


      • Solr with replication


      • Often ever-growing blobs of JSON


      • No useful way to “find” data based on anything other than compound primary
        keys :(


            • Primary keys crafted for specific access patterns :(


      • This will not last forever




Showyou’s Data: Tomorrow

      • Significant data growth per additional user


      • Find & aggregate data about our users & their videos


      • Derive useful “signal” from this data


            • Better search: disambiguation, “more like this”, performance


            • “Collective Intelligence”: trending, “smart collections”


            • Spam & De-duplication


            • & more: #hashtags, auto-complete, statistics, usage, ...


Data Management: Technology Stack

      • Erlang/OTP


            • Riak, Bitcask


      • Java


            • JInterface


            • Hazelcast


            • Solr




Riak: The Awesome Parts

      • By far the best operational story in its class


      • Shared-nothing


      • No single point of failure


      • Masterless multi-site replication


      • Support via EnterpriseDS Startup Program


      • I helped write it & turned it inside-out to design Riak Search while working at
        Basho -- this helps




Riak: The Sub-Awesome Parts

      • Riak Search as it exists today does not perform well for us & lacks features


      • Bitcask keeps all keys in memory


      • Listing keys for a bucket will take your cluster down


      • Map/Reduce is virtually useless (for us) other than as “multi-get”


      • Pre-1.0 cluster membership changes are “at your peril”


      • Usable/useful built-in monitoring is non-existent


      • If you are not intimately familiar with Riak, it’s very hard to debug!


Riak: Integration Patterns

      • api_riak_node: “main” Riak cluster node


      • sidecar: post-commit hook talks to a local
        Java-based Erlang node using JInterface


      • fabric: distributed data structures &
        utilities via Hazelcast - very similar in
        spirit & implementation to Riak!


      • indexer: pull from fabric queue &
        index record in Solr


      • Identical deployment on every node
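
The flow above (a post-commit hook hands each change to a local sidecar, the sidecar puts it on a shared queue, and an indexer drains that queue into Solr) can be sketched in a few lines. This is a single-process stand-in, not the Showyou code: `queue.Queue` plays the role of the Hazelcast-backed fabric queue, a plain dict plays the role of Solr, and the names `post_commit`, `run_indexer`, and `fake_index` are hypothetical.

```python
import json
import queue

# Stand-in for the distributed "fabric" queue (Hazelcast in the deployment described above).
fabric_queue = queue.Queue()

# Stand-in for the Solr index: maps "bucket/key" to the indexed document.
fake_index = {}


def post_commit(bucket, key, value_json):
    """Hypothetical post-commit hook: publish a change event and return immediately."""
    fabric_queue.put({"bucket": bucket, "key": key, "value": value_json})


def run_indexer():
    """Drain the queue and index each changed record; return how many were indexed."""
    indexed = 0
    while True:
        try:
            event = fabric_queue.get_nowait()
        except queue.Empty:
            return indexed
        doc = json.loads(event["value"])
        fake_index[event["bucket"] + "/" + event["key"]] = doc
        indexed += 1


post_commit("videos", "v1", json.dumps({"title_t": "hello"}))
post_commit("videos", "v2", json.dumps({"title_t": "world"}))
print(run_indexer())  # number of change events drained into the index
```

Keeping the hook down to a single enqueue is the point of the sketch: indexing work happens later and elsewhere, so a slow index does not sit in the write path.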



Riak: Integration Patterns: The Big Picture

      • backend_node: nginx, redis; logs


      • spider: finds, extracts,
        stores & indexes further
        information on users & videos


      • log_indexer: aggregate & index
        interesting parts of our access logs


      • research_riak_node: a special-purpose
        Riak cluster node to support Showyou
        data “Tomorrow”




Riak: Integration Patterns: Observations

      • Riak post-commit hooks


      • Why not pre-commit hooks?


      • Riak as a “virtual memory”


      • Post-commit hooks as “change events”


      • Wishlist: pre- & post-delete hook (I realize this is tricky - do it anyway)


      • Wishlist: pre- & post-create hook (Less tricky - do it anyway)




Riak: Integration Patterns: Search

      • Monolithic replicated Solr won’t last forever & sharding is a faceted
        multi-value “shitshow” [1]


      • We’re doing fine, for now, with lots of RAM, SSDs, etc. but...


      • I don’t want to find out “the hard way” where the joyride ends and the hellride
        starts [2]


      • Clearly the answer is to kill every bird in nearby airspace by writing a
        Riak storage backend and integrated query mechanism! [2,3,4]

          [1] Cliff Moon suggested the use of this word (for emphasis)
          [2] It’s okay, I have done this several dozen times
          [3] Yes, really: BDB, BDBJE, Innostore, sqlite, hsql, mysql, postgres, etc.
          [4] This is not a pride thing: it was a hard, lonely, unforgiving road




Not Bob’s Riak: A Moment of Weakness




                              “My way of joking is to tell the truth;
                               it’s the funniest joke in the world”
                                      George Bernard Shaw




Not Bob’s Riak: Introducing “Mecha”

      • Bob?




Mecha: Goals

      • Birds to kill:


            • Tight, purposed Riak integration


            • Efficient & feature-complete indexing


            • Fast sequential & range object access


            • Flexible distributed query mechanism


            • Query parallelism where appropriate




Mecha: What Stays The Same

      • Works with “stock” (unmodified) Riak 1.0 (pre-release & release)


      • Little/no difference from the “Riak side” -- everything works “as it should”


      • Differences:


            • All objects you “put” into Riak must be JSON objects (this will change by
              release to respect content type)


            • Any fields without a specific “indexed field type” (e.g., _t, _s, _dt, etc.) are
              simply stored along with the rest of the fields (i.e. “stored field”)




Mecha: At a glance

      • Written in Java, uses many, many third-party libraries:
        LevelDB (JNI), Solr, JInterface, Jetlang, Netty, Protobufs, Commons, ...


      • Riak backend module written in Erlang; beaten into submission for reliable
        interaction with JInterface-driven Java node


      • LevelDB instance per partition, per bucket; “ulimit -n 90000” :)


      • One Solr instance per node (covers all partitions, buckets)


      • Objects stored as Riak Objects in JSON form in LevelDB


      • Standardized schema covering common data types using name suffixes



Mecha: Riak Integration




Mecha: Index Field Types

      • Supported index field types (by suffix):


      • _t, _tt - full-text (optionally w/ term vectors, ...)


      • _s, _s_mv - exact string, multi-value exact string


      • _i, _l, _d, _f - trie-based integer, long, double, float


      • _dt - trie-based date (YYYY-MM-DDTHH:MM:SS)


      • _b - boolean


      • _xy, _ll, _geo - point, lat/lon & geohash
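
A minimal sketch of how suffix-based typing could work, assuming the suffix list above is complete and that a field with no recognized suffix is stored but not indexed. The names `SUFFIX_TYPES`, `index_type`, and `split_fields` are hypothetical and are not taken from Mecha.

```python
# Suffix -> index type, copied from the list above.
SUFFIX_TYPES = {
    "_s_mv": "multi-value exact string",
    "_geo": "geohash",
    "_tt": "full-text",
    "_dt": "date",
    "_xy": "point",
    "_ll": "lat/lon",
    "_t": "full-text",
    "_s": "exact string",
    "_i": "integer",
    "_l": "long",
    "_d": "double",
    "_f": "float",
    "_b": "boolean",
}

# Longest suffixes are tried first so the most specific match wins.
_SUFFIXES_BY_LENGTH = sorted(SUFFIX_TYPES, key=len, reverse=True)


def index_type(field_name):
    """Return the index type for a field name, or None for a stored-only field."""
    for suffix in _SUFFIXES_BY_LENGTH:
        if field_name.endswith(suffix):
            return SUFFIX_TYPES[suffix]
    return None


def split_fields(obj):
    """Partition a JSON object into (indexed, stored) field dicts."""
    indexed = {k: v for k, v in obj.items() if index_type(k) is not None}
    stored = {k: v for k, v in obj.items() if index_type(k) is None}
    return indexed, stored


indexed, stored = split_fields({"content_t": "hello", "lol_count_i": 47, "note": "kept as-is"})
print(sorted(indexed), sorted(stored))
```

The appeal of the convention is that the field name alone carries the schema, so no separate per-bucket schema definition is needed before writing an object.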


Mecha: Example Object

          { "content_t": "NoSQL is a ghetto",
            "lol_count_i": 47,
            "lol_dt": "2009-04-13T07:01:43.000Z",
            "rating_f": 4.111111164093018,
            "tags_s_mv": [
              "funny",
              "lol",
              "nosql",
              "ghetto"]}


Mecha: Fast Sequential & Range Object Access

      • Every bucket gets own instance of LevelDB per partition


      • No “multiplexing” buckets or partitions per LevelDB instance


      • Keys are literal, no encoded Erlang terms; simple ranges, smaller values


      • Values stored as JSON-ized Riak objects (why? you’ll see)


      • LevelDB JNI binaries shipped with Snappy compression built-in
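
Why literal keys in a sorted store make ranges cheap can be shown with a toy model. The sketch below is not the LevelDB API: a sorted Python list with `bisect` stands in for one per-bucket, per-partition LevelDB instance, and the class name `OrderedStore` and the key format are assumptions made for illustration.

```python
import bisect


class OrderedStore:
    """Toy ordered key-value store standing in for a single LevelDB instance."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range(self, start, end):
        """Yield (key, value) for start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

    def count(self):
        """Counting keys needs no scan of the values."""
        return len(self._keys)


store = OrderedStore()
for k in ["video:9", "user:2", "user:1", "video:1"]:
    store.put(k, {"id": k})

# A prefix scan is a range whose end is the prefix with its last character bumped by one.
print([k for k, _ in store.range("user:", "user;")])
```

Because keys are compared as plain strings, a range is one seek plus a sequential read; if keys were encoded Erlang terms, the sort order would follow the encoding rather than the key text.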




Mecha: Flexible Distributed Querying

      • Exact, prefix, suffix, & wildcard filtering on multiple index fields


      • Ultra-fast list_keys, list_bucket, list_buckets replacements


      • Equally fast bucket count operation


      • Ridiculous range query performance (any trie-based type; sane datetime
        functions)


      • Faceting, group counts


      • Spatial (bounding box/bowl, geofilt, Haversine, distance faceting)
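
For reference, the Haversine distance named in the spatial bullet is the great-circle distance between two lat/lon points. The sketch below computes it directly and uses it as a radius filter; it is an illustration of the formula, not of how Mecha or Solr evaluates spatial queries, and `EARTH_RADIUS_KM`, `haversine_km`, and `within_radius` are names chosen here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(points, center, radius_km):
    """Keep the (lat, lon) points that lie within radius_km of center."""
    return [p for p in points if haversine_km(center[0], center[1], p[0], p[1]) <= radius_km]


print(within_radius([(0.0, 0.0), (0.0, 1.0), (0.0, 5.0)], (0.0, 0.0), 200.0))
```

A bounding-box filter is usually applied first because it only needs range comparisons on latitude and longitude; the exact distance is then computed for the survivors.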




Mecha: Query Parallelism




Mecha: Coverage

      • Coverage is the set of nodes, and the partitions each of them owns, that you
        must process to cover all objects in a given bucket.

        [Diagram: coverage across Node 1 and Node 2 (for n=2)]
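
A simplified sketch of the coverage idea, under one stated assumption: each object's n replicas live on n consecutive partitions of the ring. Under that assumption, visiting every n-th partition touches at least one replica of every object. The function names and the `owners` list are hypothetical, and a real planner would also have to deal with unavailable nodes, which this does not.

```python
def coverage_plan(owners, n):
    """Pick partitions so that every run of n consecutive partitions contains a pick.

    owners[i] is the node that owns partition i on a ring of len(owners)
    partitions. Returns {node: [partition, ...]}.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    plan = {}
    for partition in range(0, len(owners), n):
        plan.setdefault(owners[partition], []).append(partition)
    return plan


def covers_everything(owners, n, plan):
    """Check a plan: every window of n consecutive partitions (wrapping) is hit."""
    q = len(owners)
    picked = {p for parts in plan.values() for p in parts}
    return all(
        any((start + i) % q in picked for i in range(n))
        for start in range(q)
    )


# Eight partitions spread round-robin over four nodes, n=2.
ring = ["node1", "node2", "node3", "node4"] * 2
plan = coverage_plan(ring, 2)
print(plan, covers_everything(ring, 2, plan))
```

In this example only two of the four nodes need to do any work for a full-bucket query, which is why a coverage-based query can be cheaper than asking every partition.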



Mecha: Examples: Count & Faceting

      • Count the number of records in the "research" bucket modified within the last
        10 seconds




      • Top search terms by count for the last hour
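
The two operations named above, a time-windowed count and a facet (top values by count), can be illustrated over in-memory records. This is plain Python, not Mecha's query syntax, which does not appear in the text here; the record layout (`modified`, `ts`, `term`, all using epoch seconds) is an assumption made for the example.

```python
from collections import Counter


def count_modified_since(records, now, window_seconds):
    """Count records whose 'modified' timestamp falls within the last window."""
    cutoff = now - window_seconds
    return sum(1 for r in records if r["modified"] >= cutoff)


def top_terms(search_log, now, window_seconds, limit=3):
    """Facet: the most frequent search terms within the last window, with counts."""
    cutoff = now - window_seconds
    counts = Counter(e["term"] for e in search_log if e["ts"] >= cutoff)
    return counts.most_common(limit)


now = 1_000_000
records = [{"modified": now - 3}, {"modified": now - 9}, {"modified": now - 60}]
search_log = [
    {"ts": now - 10, "term": "cats"},
    {"ts": now - 20, "term": "cats"},
    {"ts": now - 30, "term": "riak"},
    {"ts": now - 7200, "term": "old"},
]
print(count_modified_since(records, now, 10))
print(top_terms(search_log, now, 3600))
```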




Mecha: Examples: Multi-value String Fields

      • See which field values occur with other values, and how often (using a
        multi-value string field)
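
What "which values occur with other values, and how often" means can be shown with a small co-occurrence count over a multi-value field. This is a sketch of the idea in plain Python, not the query the slide demonstrated; the function name `cooccurrence` is hypothetical, and the field name `tags_s_mv` is borrowed from the example object earlier in the deck.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(objects, field):
    """Count how often each unordered pair of values appears together in `field`."""
    pairs = Counter()
    for obj in objects:
        # De-duplicate and sort so each pair is counted once, in a stable order.
        values = sorted(set(obj.get(field, [])))
        pairs.update(combinations(values, 2))
    return pairs


objects = [
    {"tags_s_mv": ["funny", "lol", "nosql"]},
    {"tags_s_mv": ["lol", "nosql"]},
    {"tags_s_mv": ["funny"]},
]
for pair, count in cooccurrence(objects, "tags_s_mv").most_common():
    print(pair, count)
```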




Mecha: Next Steps

      • Code cleanup, test & polish current functionality


      • Sane build & deployment (yay, Maven)


      • Simplify configuration


      • Embed Solr (currently running standalone for debugging)


      • Extend & improve Solr “standard configuration”


      • “Out of Band” Map/reduce with “direct bucket-level” LevelDB instance


      • After that? Join operators, ... :)


Mecha: Availability

      • Right now it is a complex system


      • It will be worth waiting for tighter integration & polish


      • This is me not answering your question :)


      • As soon as responsible, sane, & possible :)




Thank you!

      • Basho Technologies, especially Andy Gross (@argv0) & Kelly McLaughlin
        (@_klm) specifically for help with Riak 1.0 changes


      • Of course, without Erlang/OTP, LevelDB, Solr, Java, JInterface, and a host of
        other open source projects, this would have never even gotten started --
        thank you.


      • Last but not least, thank you to my Showyou teammates for encouragement
        & support!


      • Contact:
        John Muellerleile / http://twitter.com/jrecursive
        john@remixation.com, jmuellerleile@gmail.com
        http://github.com/jrecursive



Scaling with Riak at Showyou

  • 1. Scaling at Showyou John Muellerleile (@jrecursive, john@remixation.com) September 26, 2011 Tuesday, September 27, 2011
  • 2. John Muellerleile • Basho Technologies: Jan. 2008 - Dec. 2010 • Riak • Riak Search • Automated research, NLP, spidering • Consulting: 2003 - 2008 • E-commerce, AdSense, AdWords, ... • Infrastructure: • Messaging • Riak • Solr Tuesday, September 27, 2011
  • 3. Agenda • Who am I? • What is Showyou? • The Nature of “Social Data” • Showyou’s Data: Today & Tomorrow • Data Management: Technology Stack • Riak: The Awesome & Sub-Awesome, Integration Patterns & Observations • Not Bob’s Riak: The “Mecha” Backend & Query System Tuesday, September 27, 2011
  • 4. What is Showyou? Tuesday, September 27, 2011
  • 5. A minute about the nature of “Social Data” Tuesday, September 27, 2011
  • 12. Showyou’s Data: Today • Riak with Bitcask • Solr with replication • Often ever-growing blobs of JSON • No useful way to “find” data based on anything other than compound primary keys :( • Primary keys crafted for specific access patterns :( • This will not last forever Tuesday, September 27, 2011
  • 13. Showyou’s Data: Tomorrow • Significant data growth per additional user • Find & aggregate data about our users & their videos • Derive useful “signal” from this data • Better search: disambiguation, “more like this”, performance • “Collective Intelligence”: trending, “smart collections” • Spam & De-duplication • & more: #hashtags, auto-complete, statistics, usage, ... Tuesday, September 27, 2011
  • 14. Data Management: Technology Stack • Erlang/OTP • Riak, Bitcask • Java • JInterface • Hazelcast • Solr Tuesday, September 27, 2011
  • 15. Riak: The Awesome Parts • By far the best operational story in its class • Shared-nothing • No single point of failure • Masterless multi-site replication • Support via EnterpriseDS Startup Program • I helped write it & turned it inside-out to design Riak Search while working at Basho -- this helps Tuesday, September 27, 2011
  • 16. Riak: The Sub-Awesome Parts • Riak Search as it exists today does not perform well for us & lacks features • Bitcask keeps all keys in memory • Listing keys for a bucket will take your cluster down • Map/Reduce is virtually useless (for us) other than as “multi-get” • Pre-1.0 cluster membership changes are “at your peril” • Usable/useful built-in monitoring is non-existent • If you are not intimately familiar with Riak, it’s very hard to debug! Tuesday, September 27, 2011
  • 17. Riak: Integration Patterns • api_riak_node: “main” Riak cluster node • sidecar: post-commit hook talks to a local Java-based Erlang node using JInterface • fabric: distributed data structures & utilities via Hazelcast - very similar in spirit & implementation to Riak! • indexer: pulls from the fabric queue & indexes the record in Solr • Identical deployment on every node
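The sidecar, fabric and indexer roles on this slide can be sketched as a small in-process pipeline. This is only an illustration under stated assumptions: `TinyStore`, `run_indexer`, `fabric_queue` and `search_index` are hypothetical names, a plain dict stands in for the Riak node, `queue.Queue` stands in for the Hazelcast-backed queue, and another dict stands in for Solr. None of it is the Showyou code.

```python
import json
import queue


class TinyStore:
    """Hypothetical in-memory stand-in for a store with post-commit hooks."""

    def __init__(self):
        self._data = {}
        self._hooks = []

    def add_post_commit_hook(self, hook):
        self._hooks.append(hook)

    def put(self, bucket, key, value):
        self._data[(bucket, key)] = value
        # Hooks fire only after the write has been applied.
        for hook in self._hooks:
            hook(bucket, key, value)


def run_indexer(work_queue, index):
    """Drain the queue and 'index' each record into a plain dict."""
    while True:
        try:
            bucket, key, doc = work_queue.get_nowait()
        except queue.Empty:
            break
        index[f"{bucket}/{key}"] = json.loads(doc)


fabric_queue = queue.Queue()   # stands in for the distributed queue
store = TinyStore()            # stands in for the Riak node
store.add_post_commit_hook(lambda b, k, v: fabric_queue.put((b, k, v)))

store.put("videos", "v1", json.dumps({"title_t": "hello"}))
store.put("videos", "v2", json.dumps({"title_t": "world"}))

search_index = {}              # stands in for Solr
run_indexer(fabric_queue, search_index)
print(sorted(search_index))
```

The write path never waits on the indexer: the hook only enqueues, and indexing happens whenever the indexer drains the queue.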
  • 18. Riak: Integration Patterns: The Big Picture • backend_node: nginx, redis; logs • spider: finds, extracts, stores & indexes further information on users & videos • log_indexer: aggregate & index interesting parts of our access logs • research_riak_node: a special-purpose Riak cluster node to support Showyou data “Tomorrow”
  • 19. Riak: Integration Patterns: Observations • Riak post-commit hooks • Why not pre-commit hooks? • Riak as a “virtual memory” • Post-commit hooks as “change events” • Wishlist: pre- & post-delete hook (I realize this is tricky - do it anyway) • Wishlist: pre- & post-create hook (Less tricky - do it anyway)
  • 20. Riak: Integration Patterns: Search • Monolithic replicated Solr won’t last forever & sharding is a faceted multi-value “shitshow” [1] • We’re doing fine, for now, with lots of RAM, SSDs, etc. but... • I don’t want to find out “the hard way” where the joyride ends and the hellride starts [2] • Clearly the answer is to kill every bird in nearby airspace by writing a Riak storage backend and integrated query mechanism! [2,3,4] • Footnotes: [1] Cliff Moon suggested the use of this word (for emphasis); [2] It’s okay, I have done this several dozen times; [3] Yes, really: BDB, BDBJE, Innostore, sqlite, hsql, mysql, postgres, etc.; [4] This is not a pride thing: it was a hard, lonely, unforgiving road
  • 21. Not Bob’s Riak: A Moment of Weakness “My way of joking is to tell the truth; it’s the funniest joke in the world” George Bernard Shaw
  • 22. Not Bob’s Riak: Introducing “Mecha” • Bob?
  • 23. Mecha: Goals • Birds to kill: • Tight, purposed Riak integration • Efficient & feature-complete indexing • Fast sequential & range object access • Flexible distributed query mechanism • Query parallelism where appropriate
  • 24. Mecha: What Stays The Same • Works with “stock” (unmodified) Riak 1.0 (pre & release) • Little/no difference from the “Riak side” -- everything works “as it should” • Differences: • All objects you “put” into Riak must be JSON objects (this will change by release to respect content type) • Any fields without a specific “indexed field type” (e.g., _t, _s, _dt, etc.) are simply stored along with the rest of the fields (i.e. “stored field”)
  • 25. Mecha: At a glance • Written in Java, uses many, many third-party libraries: LevelDB (JNI), Solr, JInterface, Jetlang, Netty, Protobufs, Commons, ... • Riak backend module written in Erlang; beaten into submission for reliable interaction with JInterface-driven Java node • LevelDB instance per partition, per bucket; “ulimit -n 90000” :) • One Solr instance per node (covers all partitions, buckets) • Objects stored as Riak Objects in JSON form in LevelDB • Standardized schema covering common data types using name suffixes
  • 26. Mecha: Riak Integration
  • 27. Mecha: Index Field Types • Supported index field types (by suffix): • _t, _tt - full-text (optionally w/ term vectors, ...) • _s, _s_mv - exact string, multi-value exact string • _i, _l, _d, _f - trie-based integer, long, double, float • _dt - trie-based date (YYYY-MM-DDTHH:MM:SS) • _b - boolean • _xy, _ll, _geo - point, lat/lon & geohash
  • 28. Mecha: Example Object { "content_t": "NoSQL is a ghetto", "lol_count_i": 47, "lol_dt": "2009-04-13T07:01:43.000Z", "rating_f": 4.111111164093018, "tags_s_mv": ["funny", "lol", "nosql", "ghetto"] }
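A minimal sketch of how name suffixes could map to index types, applied to the example object from this slide. The suffix list comes from the "Index Field Types" slide; the `SUFFIX_TYPES` table, the `field_kind` helper and the "stored only" label are hypothetical names chosen for illustration and are not Mecha's actual code.

```python
import json

# Suffix -> index type, taken from the "Index Field Types" slide.
SUFFIX_TYPES = {
    "_s_mv": "multi-value exact string",
    "_geo": "geohash",
    "_tt": "full-text",
    "_dt": "date",
    "_xy": "point",
    "_ll": "lat/lon",
    "_t": "full-text",
    "_s": "exact string",
    "_i": "integer",
    "_l": "long",
    "_d": "double",
    "_f": "float",
    "_b": "boolean",
}


def field_kind(name):
    """Return the index type implied by the field name's suffix."""
    for suffix, kind in SUFFIX_TYPES.items():
        if name.endswith(suffix):
            return kind
    # No recognised suffix: the field is kept with the object but not indexed.
    return "stored only"


doc = json.loads("""
{ "content_t": "NoSQL is a ghetto",
  "lol_count_i": 47,
  "lol_dt": "2009-04-13T07:01:43.000Z",
  "rating_f": 4.111111164093018,
  "tags_s_mv": ["funny", "lol", "nosql", "ghetto"] }
""")

kinds = {name: field_kind(name) for name in doc}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Because the type travels in the field name, a new field needs no schema change; a field with no recognised suffix (say, a hypothetical `note`) would simply be stored.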
  • 29. Mecha: Fast Sequential & Range Object Access • Every bucket gets own instance of LevelDB per partition • No “multiplexing” buckets or partitions per LevelDB instance • Keys are literal, no encoded Erlang terms; simple ranges, smaller values • Values stored as JSON-ized Riak objects (why? you’ll see) • LevelDB JNI binaries shipped with Snappy compression built-in
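Why literal keys make range access simple can be shown with a toy ordered store. This is a sketch under assumptions: `OrderedStore` is a hypothetical in-memory class that keeps keys sorted with the standard `bisect` module, standing in for an ordered key/value store such as LevelDB; the key layout is invented for the example.

```python
import bisect


class OrderedStore:
    """Hypothetical in-memory store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []     # always sorted
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range(self, start, end):
        """Yield (key, value) for start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = OrderedStore()
store.put("video:2011-09-25:a", {"n": 1})
store.put("video:2011-09-26:b", {"n": 2})
store.put("video:2011-09-26:c", {"n": 3})
store.put("video:2011-09-27:d", {"n": 4})

# Everything from the 26th is one contiguous slice of the key space.
hits = [key for key, _ in store.range("video:2011-09-26", "video:2011-09-27")]
print(hits)
```

With keys stored as plain strings, a range query is two binary searches plus a sequential read, with no decoding step per key.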
  • 30. Mecha: Flexible Distributed Querying • Exact, prefix, suffix, & wildcard filtering on multiple index fields • Ultra-fast list_keys, list_bucket, list_buckets replacements • Equally fast bucket count operation • Ridiculous range query performance (any trie-type; sane datetime functions) • Faceting, group counts • Spatial (bounding box/bowl, geofilt, Haversine, distance faceting)
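The Haversine distance mentioned in the spatial bullet is a standard great-circle formula, and a radius filter on top of it can be sketched in a few lines. The function names, the sample coordinates and the 6371 km mean Earth radius are assumptions of this sketch; it does not show how Mecha or Solr implement spatial filtering.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def within_radius(points, center, radius_km):
    """Return the names of points no further than radius_km from center."""
    clat, clon = center
    return [name for name, lat, lon in points
            if haversine_km(clat, clon, lat, lon) <= radius_km]


points = [
    ("sf", 37.7749, -122.4194),
    ("oakland", 37.8044, -122.2712),
    ("nyc", 40.7128, -74.0060),
]
nearby = within_radius(points, (37.7749, -122.4194), 50.0)
print(nearby)
```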
  • 31. Mecha: Query Parallelism
  • 32. Mecha: Coverage • Coverage is the set of nodes, and the partitions each of them owns, that you must process to cover all objects in a given bucket. [Diagram: Node 1, Node 2 (for n=2)]
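The coverage idea can be sketched with a simplified ring model. Assumptions, all of them hypothetical: each object lives on `n_val` consecutive partitions starting at its primary partition, partitions are numbered 0 to `ring_size - 1`, and ownership is a plain function. Under that model, taking every `n_val`-th partition touches at least one replica of every object. This is not Riak's or Mecha's actual coverage planner.

```python
def coverage_plan(ring_size, n_val, owner_of):
    """Pick every n_val-th partition and group the picks by owning node."""
    plan = {}
    for partition in range(0, ring_size, n_val):
        plan.setdefault(owner_of(partition), []).append(partition)
    return plan


def covers_everything(ring_size, n_val, chosen):
    """Check that each object's replica set contains a chosen partition."""
    chosen = set(chosen)
    for primary in range(ring_size):
        replicas = {(primary + i) % ring_size for i in range(n_val)}
        if not replicas & chosen:
            return False
    return True


# Toy layout: 8 partitions, n=2, two nodes owning four partitions each.
def owner(partition):
    return "node1" if partition < 4 else "node2"


plan = coverage_plan(8, 2, owner)
print(plan)

picked = [p for parts in plan.values() for p in parts]
print(covers_everything(8, 2, picked))
```

Each node then only has to scan the partitions listed for it, which is what keeps a whole-bucket query from reading every replica.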
  • 33. Mecha: Examples: Count & Faceting • Count the number of records in the "research" bucket modified within the last 10 seconds • Top search terms by count for the last hour
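The two examples on this slide, a time-windowed count and a top-terms facet, can be illustrated over in-memory records. Everything here is hypothetical: the field names `modified_dt`, `term_s` and `at_dt`, the sample data, and the helper functions. It shows the shape of the result only, not Mecha's query syntax.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_modified_since(records, now, window):
    """Count records whose modification time falls inside the window."""
    cutoff = now - window
    return sum(1 for r in records if r["modified_dt"] >= cutoff)


def top_terms(searches, now, window, limit=3):
    """Facet: the most frequent search terms inside the window."""
    cutoff = now - window
    counts = Counter(s["term_s"] for s in searches if s["at_dt"] >= cutoff)
    return counts.most_common(limit)


now = datetime(2011, 9, 26, 12, 0, 0)

records = [
    {"modified_dt": now - timedelta(seconds=3)},
    {"modified_dt": now - timedelta(seconds=8)},
    {"modified_dt": now - timedelta(seconds=30)},
]
recent = count_modified_since(records, now, timedelta(seconds=10))
print(recent)

searches = (
    [{"term_s": "riak", "at_dt": now - timedelta(minutes=5)}] * 3
    + [{"term_s": "solr", "at_dt": now - timedelta(minutes=20)}] * 2
    + [{"term_s": "erlang", "at_dt": now - timedelta(minutes=40)}]
    + [{"term_s": "riak", "at_dt": now - timedelta(hours=2)}]
)
top = top_terms(searches, now, timedelta(hours=1), limit=2)
print(top)
```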
  • 34. Mecha: Examples: Multi-value String Fields • See which field values occur with other values, and how often (using a multi-value string field)
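Counting which values of a multi-value string field occur together can be sketched with the standard library. The `cooccurrence` helper and the sample documents are hypothetical; the field name `tags_s_mv` is borrowed from the example object earlier in the deck.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs, field):
    """Count how often each pair of values appears in the same document."""
    pairs = Counter()
    for doc in docs:
        # Sort and de-duplicate so each pair has one canonical ordering.
        values = sorted(set(doc.get(field, [])))
        pairs.update(combinations(values, 2))
    return pairs


docs = [
    {"tags_s_mv": ["funny", "lol", "nosql"]},
    {"tags_s_mv": ["lol", "nosql"]},
    {"tags_s_mv": ["funny", "lol"]},
]
pairs = cooccurrence(docs, "tags_s_mv")
for pair, count in sorted(pairs.items()):
    print(pair, count)
```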
  • 35. Mecha: Next Steps • Code cleanup, test & polish current functionality • Sane build & deployment (yay, Maven) • Simplify configuration • Embed Solr (currently running standalone for debugging) • Extend & improve Solr “standard configuration” • “Out of Band” Map/reduce with “direct bucket-level” LevelDB instance • After that? Join operators, ... :)
  • 36. Mecha: Availability • Right now it is a complex system • It will be worth waiting for tighter integration & polish • This is me not answering your question :) • As soon as responsible, sane, & possible :)
  • 37. Thank you! • Basho Technologies, especially Andy Gross (@argv0) & Kelly McLaughlin (@_klm) specifically for help with Riak 1.0 changes • Of course, without Erlang/OTP, LevelDB, Solr, Java, JInterface, and a host of other open source projects, this would have never even gotten started -- thank you. • Last but not least, thank you to my Showyou teammates for encouragement & support! • Contact: John Muellerleile / http://twitter.com/jrecursive / john@remixation.com, jmuellerleile@gmail.com / http://github.com/jrecursive