Sharing a Startup’s Big Data Lessons
Experiences with non-RDBMS solutions at Traackr
Who we are
• A search engine
• A people search engine
• An influencer search engine
• Subscription-based
George Stathis

VP Engineering
14+ years of experience
building full-stack web
software systems with a past
focus on e-commerce and
publishing. Currently
responsible for building
engineering capability to
enable Traackr's growth goals.
What’s this talk about?

• Share what we know about Big Data/NoSQL:
 what’s behind the buzz words?
• Our reasons and method for picking a NoSQL
 database
• Share the lessons we learned going through
 the process
Big Data/NoSQL: behind the buzz words
What is Big Data?
• 3 Vs:
  – Volume
  – Velocity
  – Variety
What is Big Data? Volume + Velocity
• Data sets too large or coming in at too high a velocity
  to process using traditional databases or desktop tools.
  E.g.

   big science, astronomy, web logs, atmospheric science, RFID, genomics,
   sensor networks, biogeochemical, social networks, military surveillance,
   social data, medical records, internet text and documents, photography
   archives, internet search indexing, video archives, call detail records,
   large-scale e-commerce
What is Big Data? Variety
• Big Data is varied and unstructured
Traditional static reports vs. analytics, exploration & experimentation
What is Big Data?
• Scaling data processing cost effectively
[Chart: relative cost ($$$) of scaling different data-processing approaches]
What is NoSQL?
• NoSQL ≠ No SQL
• NoSQL ≈ Not Only SQL
• NoSQL addresses RDBMS limitations, it’s not
  about the SQL language
• RDBMS = static schema
• NoSQL = schema flexibility; don’t have to
  know exact structure before storing
What is Distributed Computing?
• Sharing the workload: divide a problem into
  many tasks, each of which can be solved by one
  or more computers
• Allows computations to be accomplished in
  acceptable timeframes
• Distributed computation approaches were
  developed to leverage multiple machines:
  MapReduce
• With MapReduce, the program goes to the data
  since the data is too big to move
What is MapReduce?
[Diagram: MapReduce data flow]
Source: developer.yahoo.com
What is MapReduce?
• MapReduce = batch processing = analytical
• MapReduce ≠ interactive
• Therefore many NoSQL solutions don’t
  outright replace warehouse solutions, they
  complement them
• RDBMS is still safe :-)
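
To make the map/reduce split concrete, here is a minimal sketch of the canonical word-count job against the Hadoop MapReduce Java API (an illustration, not code from the talk):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs where the data lives and emits (word, 1) per token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce step: receives all counts for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}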
What is Big Data? Velocity
• In some instances, being able to process large
  amounts of data in real-time can yield a
  competitive advantage. E.g.
   – Online retailers leveraging buying history and click-through data for real-time recommendations
• No time to wait for MapReduce jobs to finish
• Solutions: streaming processing (e.g. Twitter
  Storm), pre-computing (e.g. aggregate and count
  analytics as data arrives), quick to read key/value
  stores (e.g. distributed hashes)
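
As a toy illustration of the pre-computing idea (a sketch under assumptions, not code from the talk): bump a counter as each event arrives, so reads become O(1) lookups instead of waits on batch jobs.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// "Aggregate and count analytics as data arrives": each incoming event
// increments a pre-computed counter, so queries are simple key lookups
// instead of MapReduce jobs over raw history.
public class RunningCounts {
  private final Map<String, LongAdder> countsByKey = new ConcurrentHashMap<>();

  // Write path: called once per incoming event (e.g. a click).
  public void record(String key) {
    countsByKey.computeIfAbsent(key, k -> new LongAdder()).increment();
  }

  // Read path: constant time, no batch job to wait for.
  public long count(String key) {
    LongAdder adder = countsByKey.get(key);
    return adder == null ? 0 : adder.sum();
  }

  public static void main(String[] args) {
    RunningCounts counts = new RunningCounts();
    counts.record("product:42:click");
    counts.record("product:42:click");
    System.out.println(counts.count("product:42:click")); // prints 2
  }
}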
What is Big Data? Data Science
• Emergence of Data Science
• Data Scientist ≈ Statistician
• Possess scientific discipline & expertise
• Formulate and test hypotheses
• Understand the math behind the algorithms so
  they can tweak when they don’t work
• Can distill the results into an easy to understand
  story
• Help businesses gain actionable insights
Big Data Landscape
[Diagram: Big Data vendor landscape]
Source: capgemini.com
So what’s Traackr and why did we
      need a NoSQL DB?
Traackr: context
• A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
Traackr: a people search engine




     Up to 50 keywords per search!
Traackr: a people search engine
Proprietary 3-scale ranking
People as search results; content aggregated by author
Traackr: 30,000 feet




Acquisition → Processing → Storage & Indexing → Services → Applications
NoSQL is usually associated with
“Web Scale” (Volume & Velocity)
Do we fit the “Web scale” profile?


       • In terms of users/traffic?
[Charts: traffic comparison]
Source: compete.com
Do we fit the “Web scale” profile?


       • In terms of users/traffic?

    • In terms of the amount of data?
PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
     "db" : "traackr",
     "collections" : 12,
     "objects" : 68226121,
     "avgObjSize" : 2972.0800625760330,
     "dataSize" : 202773493971,
     "storageSize" : 221491429671,
     "numExtents" : 199,
     "indexes" : 33,
     "indexSize" : 27472394891,
     "fileSize" : 266623699968,
     "nsSizeMB" : 16,
     "ok" : 1
}
That’s a quarter of a terabyte…
Wait! What? My
Synology NAS at home
can hold 2TB!
No need for us to track the entire web
[Diagram: influencer content is a small subset of all web content]
Not at scale :-)
Do we fit the “Web scale” profile?


       • In terms of users/traffic?

    • In terms of the amount of data?
Variety view of “Web Scale”

        Web data is:

      Heterogeneous

     Unstructured (text)
Visualization of the Internet, Nov. 23rd 2003
          Source: http://www.opte.org/
Data sources are isolated islands of rich data with loose links to one another
How do we build a database that
models all possible entities found on
             the web?
Modeling the web: the RDBMS way
Source: socialbutterflyclt.com
or
{
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "dchancogne@traackr.com",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
      {
         "siteUrl": "http://twitter.com/dchancogne",
         "metrics": [
           {
              "value": 216,
              "name": "twitter_followers_count"
           },
           {
              "value": 2107,
              "name": "twitter_statuses_count"
           }
         ]
      },
      {
         "siteUrl": "http://traackr.com/blog/author/david",
         "metrics": [
           {
              "value": 21,
              "name": "google_inbound_links"
           }
         ]
      }
    ]
}
                          Influencer data as JSON
NoSQL = schema flexibility
Do we fit the “Web scale” profile?
• In terms of users/traffic? ✗
• In terms of the amount of data? ✗
• In terms of the variety of the data ✓
Traackr’s Datastore Requirements

• Schema flexibility   ✓
• Good at storing lots of variable length text

• Batch processing options
Requirement: text storage

        Variable text length:
140-character tweets < big variance < multi-page blog posts
Requirement: text storage

RDBMS’ answer to variable text length:

    Plan ahead for largest value

            CLOB/BLOB
Requirement: text storage

    Issues with CLOB/BLOB for us:

    No clue what largest value is

CLOB/BLOB for tweets = wasted space
Requirement: text storage

  NoSQL solutions are great for text:

No length requirements (automated
             chunking)

      Limited space overhead
Traackr’s Datastore Requirements

• Schema flexibility   ✓
• Good at storing lots of variable length text   ✓
• Batch processing options
Requirement: batch processing


 Some NoSQL
solutions come
with MapReduce




                 Source: http://code.google.com/
Requirement: batch processing

        MapReduce + RDBMS:

 Possible but proprietary solutions
Usually involves exporting data from
RDBMS into a NoSQL system anyway.
 Defeats data locality benefit of MR
Traackr’s Datastore Requirements

• Schema flexibility   ✓
• Good at storing lots of variable length text   ✓
• Batch processing options      ✓

           A NoSQL option is the right fit
How did we pick a NoSQL DB?
Bewildering number of options (early 2010)

  Key/Value Databases
  • Distributed hashtables
  • Designed for high load
  • In-memory or on-disk
  • Eventually consistent

  Column Databases
  • Spreadsheet-like
  • Key is a row id
  • Attributes are columns
  • Columns can be grouped into families

  Document Databases
  • Like Key/Value
  • Value = Document
  • Document = JSON/BSON
  • JSON = Flexible Schema

  Graph Databases
  • Graph Theory G=(V,E)
  • Great for modeling networks
  • Great for graph-based query algorithms
Trimming options

Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis, but not as the main data store.
Trimming options

Memcache: memory-based; we need true persistence.
Trimming options

Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Trimming options

Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
Trimming options

CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
Trimming options

Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
Trimming options

MongoDB: in early 2010, maturity questions, adoption questions, and no batch processing options.
Trimming options

Riak: very close, but in early 2010 we had adoption questions.
Trimming options

HBase: came across as the most mature at the time, with several deployments, a healthy community, “out-of-the-box” secondary indexes through a contrib, and support for batch processing using Hadoop/MR.
Lessons Learned

Challenges               Rewards
- Complexity             - Choices

- Missing Features       - Empowering

- Problem solution fit   - Community

- Resources              - Cost
Rewards: Choices
Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent

Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families

Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema

Graph Databases
• Graph Theory G=(V,E)
• Great for modeling networks
• Great for graph-based query algorithms
Rewards: Choices
[Diagram: Big Data vendor landscape]
Source: capgemini.com
Lessons Learned

Challenges               Rewards
- Complexity             - Choices

- Missing Features       - Empowering

- Problem solution fit   - Community

- Resources              - Cost
When Big Data = Big Architectures
• Must have an odd number of Zookeeper quorum nodes.
• Master/slave architecture means a single point of failure, so you need to protect your master.
• Then you can run your Hbase nodes, but it’s recommended to co-locate regionservers with hadoop datanodes, so you have to manage resources.
• Must have a Hadoop HDFS cluster of at least 2x replication factor nodes.
• And then we also have to manage the MapReduce processes and resources in the Hadoop layer.

Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
Source: socialbutterflyclt.com
Jokes aside, no one said open source
           was easy to use
To be expected
• Hadoop/Hbase are
  designed to move
  mountains

• If you want to move big
  stuff, be prepared to
  sometimes use big
  equipment
What it means to a startup



                Development capacity before




Congrats, you
  are now a
 sysadmin…      Development capacity after
Lessons Learned

Challenges               Rewards
- Complexity             - Choices

- Missing Features       - Empowering

- Problem solution fit   - Community

- Resources              - Cost
Mapping a saved search to a column store
[Screenshot: a saved search with a name, influencer ranks, and references to influencer records]

Mapping a saved search to a column store
Unique key; “attributes” column family for general attributes; “influencerId” column family for influencer ranks and foreign keys

Mapping a saved search to a column store
“name” attribute; influencer ranks can be attribute names as well

Mapping a saved search to a column store
The influencer list can get pretty long, so it needs indexing and pagination (see the sketch below)
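
For illustration only, a hedged sketch of writing one such row with the HBase Java client; the “attributes” and “influencerId” column families come from the slides, while the table name, row key, and values are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SaveSearchRow {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("saved_searches"))) {

      // One row per saved search, keyed by a unique id (hypothetical value).
      Put put = new Put(Bytes.toBytes("search_id_42"));

      // "attributes" column family: general attributes of the search.
      put.addColumn(Bytes.toBytes("attributes"), Bytes.toBytes("name"),
          Bytes.toBytes("IT bloggers"));

      // "influencerId" column family: one column per influencer, holding the
      // rank; the influencer id doubles as the column (attribute) name.
      put.addColumn(Bytes.toBytes("influencerId"), Bytes.toBytes("influencer_id_1234"),
          Bytes.toBytes("1"));
      put.addColumn(Bytes.toBytes("influencerId"), Bytes.toBytes("influencer_id_5678"),
          Bytes.toBytes("2"));

      table.put(put);
    }
  }
}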
Problem: no out-of-the-box row-based
      indexing and pagination
Jumping right into the code
Lessons Learned

Challenges               Rewards
- Complexity             - Choices

- Missing Features       - Empowering

- Problem solution fit   - Community

- Resources              - Cost
a few months later…
Need to upgrade to Hbase 0.90

• Making sure to remain on recent code base

• Performance improvements

• Mostly to get the latest bug fixes




                              No thanks!
Looks like something is missing
Our DB indexes depend on this!
Let’s get this straight

• Hbase no longer comes with secondary
  indexing out-of-the-box

• It’s been moved out of the trunk to GitHub

• Where only one other company besides us
  seems to care about it
Only one other
 maintainer
  besides us
What it means to a startup
  Congrats, you are
    now an hbase
 contrib maintainer…




                       Development capacity
Source: socialbutterflyclt.com
Lessons Learned

Challenges               Rewards
- Complexity             - Choices

- Missing Features       - Empowering

- Problem solution fit   - Community

- Resources              - Cost
Homegrown Hbase Indexes
                Row ids for Posts




 Rows have id prefixes that can be
 efficiently scanned using STARTROW
 and STOPROW filters
Homegrown Hbase Indexes
       Row ids for Posts




                    Find posts for
                influencer_id_1234
Homegrown Hbase Indexes
       Row ids for Posts




                    Find posts for
                influencer_id_5678
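
A hedged sketch of such a prefix scan with the HBase Java client; the row-id shapes mirror the slides, while the table name and the stop-row construction are illustrative (a common idiom, not necessarily Traackr’s exact code):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PostsByInfluencer {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table posts = conn.getTable(TableName.valueOf("posts"))) {

      // Row ids look like "<influencer_id>_<post_id>", so all posts for one
      // influencer are contiguous and can be read with a bounded scan.
      byte[] start = Bytes.toBytes("influencer_id_1234_");
      // Stop just past the prefix; \uFFFF sorts after any ASCII suffix.
      byte[] stop = Bytes.toBytes("influencer_id_1234_" + "\uFFFF");

      Scan scan = new Scan().withStartRow(start).withStopRow(stop);
      try (ResultScanner scanner = posts.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}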
Homegrown Hbase Indexes

• No longer depending on
 unmaintained code

• Work with out-of-the-box Hbase
 installation
What it means to a startup
 You are back but you
     still need to
  maintain indexing
          logic




                        Development capacity
a few months later…
Cracks in the data model
[Diagram: each influencer “writes for” huffingtonpost.com; their posts, e.g. http://www.huffingtonpost.com/arianna-huffington/post_1.html and http://www.huffingtonpost.com/shaun-donovan/post1.html, are “authored by” them and “published under” a per-influencer copy of the site record]

Cracks in the data model
Site records were denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties.

Cracks in the data model
[Diagram: http://www.huffingtonpost.com/arianna-huffington/post_3.html ends up attached to the wrong influencer’s copy of the site]
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.

Cracks in the data model
Exacerbated when we started tracking people’s content on a daily basis in mid-2011.
Fixing the cracks in the data model
Normalize the sites
[Diagram: a single huffingtonpost.com record; each influencer “writes for” it, and their posts are “authored by” them and “published under” the one site]
Fixing the cracks in the data model

• Normalization requires stronger
 secondary indexing

• Our application layer indexing would
 need revisiting…again!
What it means to a startup
 Psych! You are back
 to writing indexing
        code.




                       Development capacity
Source: socialbutterflyclt.com
Lessons Learned

Challenges               Rewards
- Complexity             - Choices

- Missing Features       - Empowering

- Problem solution fit   - Community

- Resources              - Cost
Traackr’s Datastore Requirements
               (Revisited)
• Schema flexibility

• Good at storing lots of variable length text

• Out-of-the-box SECONDARY INDEX support!

• Simple to use and administer
NoSQL picking – Round 2 (mid 2011)

  Key/Value Databases
  • Distributed hashtables
  • Designed for high load
  • In-memory or on-disk
  • Eventually consistent

  Column Databases
  • Spreadsheet-like
  • Key is a row id
  • Attributes are columns
  • Columns can be grouped into families

  Document Databases
  • Like Key/Value
  • Value = Document
  • Document = JSON/BSON
  • JSON = Flexible Schema

  Graph Databases
  • Graph Theory G=(V,E)
  • Great for modeling networks
  • Great for graph-based query algorithms
NoSQL picking – Round 2 (mid 2011)

Nope!
NoSQL picking – Round 2 (mid 2011)

Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
NoSQL picking – Round 2 (mid 2011)

Memcache: still no.
NoSQL picking – Round 2 (mid 2011)

Amazon SimpleDB: still no.
NoSQL picking – Round 2 (mid 2011)

Redis and LinkedIn’s Project Voldemort: still no.
NoSQL picking – Round 2 (mid 2011)

CouchDB: more mature, but still no ad-hoc queries.
NoSQL picking – Round 2 (mid 2011)

Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the Hbase lesson, simplicity of use was now more important.
NoSQL picking – Round 2 (mid 2011)

Riak: still a strong contender, but adoption questions remained.
NoSQL picking – Round 2 (mid 2011)

MongoDB: matured by leaps and bounds; increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and it fit into our existing code base very nicely.
Lessons Learned

Challenges               Rewards
- Complexity             - Choices

- Missing Features       - Empowering

- Problem solution fit   - Community

- Resources              - Cost
Immediate Benefits

• No more maintaining custom application-layer
 secondary indexing code
What it means to a startup
  Yay! I’m back!




                   Development capacity
Immediate Benefits

• No more maintaining custom application-layer
  secondary indexing code

• Single binary installation greatly simplifies
  administration
What it means to a startup
 Honestly, I thought
  I’d never see you
      guys again!




                       Development capacity
Immediate Benefits

• No more maintaining custom application-layer
  secondary indexing code
• Single binary installation greatly simplifies
  administration
• Our NoSQL could now support our domain
  model
many-to-many relationship

{
    "_id": "770cf5c54492344ad5e45fb791ae5d52",
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "dchancogne@traackr.com",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
      {
         "siteId": "b31236da306270dc2b5db34e943af88d",
         "contribution": 0.25
      },
      {
         "siteId": "602dc370945d3b3480fff4f2a541227c",
         "contribution": 1.0
      }
    ]
}

Embedded list of references to sites, augmented with influencer-specific site attributes (e.g. percent contribution to content)

Modeling an influencer
{
    "_id": "770cf5c54492344ad5e45fb791ae5d52",
    "realName": "David Chancogne",
    "title": "CTO",
    "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
    "primaryAffiliation": "Traackr",
    "email": "dchancogne@traackr.com",
    "location": "Cambridge, MA, United States",
    "siteReferences": [
      {
         "siteId": "b31236da306270dc2b5db34e943af88d",
         "contribution": 0.25
      },
      {
         "siteId": "602dc370945d3b3480fff4f2a541227c",
         "contribution": 1.0
      }
    ]
}

siteId indexed for “find influencers connected to site X” (note the dotted field name must be quoted in the shell):

> db.influencers.ensureIndex({"siteReferences.siteId": 1});
> db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});

Modeling an influencer
Other Benefits
• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write map/reduce code to extract the data in a usable form, as was needed with Hbase.

• Simpler backups: Hbase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.

• Great documentation

• Great adoption and community
looks like we found the right fit!
We have more of this




     Development capacity
And less of this




 Source: socialbutterflyclt.com
Recap & Final Thoughts
• 3 Vs of Big Data:
  – Volume
  – Velocity
  – Variety ← where Traackr fits
• Big Data technologies are complementary to SQL and RDBMS
• Until machines can think for themselves, Data Science will be increasingly important
Recap & Final Thoughts

• Be prepared to deal with less mature tech
• Be as flexible as the data => fearless
  refactoring
• Importance of ease of use and
  administration cannot be overstated for a
  small startup
Q&A

Mais conteúdo relacionado

Mais procurados

Webinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDBWebinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDBMongoDB
 
Introducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No ProblemIntroducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No ProblemAndrew Liu
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Casesboorad
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherMongoDB
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDBMongoDB
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSTreasure Data, Inc.
 
Benefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSsBenefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSsMongoDB
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4jM. David Allen
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesMongoDB
 
The state of the art in Linked Data
The state of the art in Linked DataThe state of the art in Linked Data
The state of the art in Linked DataJoshua Shinavier
 
Javascript & SQL within database management system
Javascript & SQL within database management systemJavascript & SQL within database management system
Javascript & SQL within database management systemClusterpoint
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkMongoDB
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageBethmi Gunasekara
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irdatastack
 

Mais procurados (20)

Webinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDBWebinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDB
 
Introducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No ProblemIntroducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No Problem
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDB
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWS
 
Benefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSsBenefits of Using MongoDB Over RDBMSs
Benefits of Using MongoDB Over RDBMSs
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4j
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
 
NoSQL-Overview
NoSQL-OverviewNoSQL-Overview
NoSQL-Overview
 
The state of the art in Linked Data
The state of the art in Linked DataThe state of the art in Linked Data
The state of the art in Linked Data
 
Javascript & SQL within database management system
Javascript & SQL within database management systemJavascript & SQL within database management system
Javascript & SQL within database management system
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.ir
 

Destaque

Health Declartion Minor
Health Declartion MinorHealth Declartion Minor
Health Declartion MinorJewish Agency
 
Israel. A Better Stimulus Plan Guidebook
Israel. A Better Stimulus Plan GuidebookIsrael. A Better Stimulus Plan Guidebook
Israel. A Better Stimulus Plan GuidebookJewish Agency
 
Jewish Agency Mission Guide
Jewish Agency Mission GuideJewish Agency Mission Guide
Jewish Agency Mission GuideJewish Agency
 
Jewish Agency General Brochure
Jewish Agency General BrochureJewish Agency General Brochure
Jewish Agency General BrochureJewish Agency
 
2008 Jewish Agency Annual Report
2008 Jewish Agency Annual Report2008 Jewish Agency Annual Report
2008 Jewish Agency Annual ReportJewish Agency
 
Health Declartion Minor
Health Declartion MinorHealth Declartion Minor
Health Declartion MinorJewish Agency
 
Jewish Agency Budget Shortfall
Jewish Agency Budget ShortfallJewish Agency Budget Shortfall
Jewish Agency Budget ShortfallJewish Agency
 
The Flourishing of Jewish Life in Hungary
The Flourishing of Jewish Life in HungaryThe Flourishing of Jewish Life in Hungary
The Flourishing of Jewish Life in HungaryJewish Agency
 
Jafi Image Brochure A10 3.31.08
Jafi Image Brochure A10 3.31.08Jafi Image Brochure A10 3.31.08
Jafi Image Brochure A10 3.31.08Jewish Agency
 
Firststepsolim 090421154659 Phpapp01
Firststepsolim 090421154659 Phpapp01Firststepsolim 090421154659 Phpapp01
Firststepsolim 090421154659 Phpapp01Jewish Agency
 
Firststepskc 090421154931 Phpapp02
Firststepskc 090421154931 Phpapp02Firststepskc 090421154931 Phpapp02
Firststepskc 090421154931 Phpapp02Jewish Agency
 

Destaque (17)

Health Declartion Minor
Health Declartion MinorHealth Declartion Minor
Health Declartion Minor
 
Israel. A Better Stimulus Plan Guidebook
Israel. A Better Stimulus Plan GuidebookIsrael. A Better Stimulus Plan Guidebook
Israel. A Better Stimulus Plan Guidebook
 
Jewish Agency Mission Guide
Jewish Agency Mission GuideJewish Agency Mission Guide
Jewish Agency Mission Guide
 
Mounting Impact0809
Mounting Impact0809Mounting Impact0809
Mounting Impact0809
 
Jewish Agency General Brochure
Jewish Agency General BrochureJewish Agency General Brochure
Jewish Agency General Brochure
 
Questioniar Eo+Kc
Questioniar Eo+KcQuestioniar Eo+Kc
Questioniar Eo+Kc
 
2008 Jewish Agency Annual Report
2008 Jewish Agency Annual Report2008 Jewish Agency Annual Report
2008 Jewish Agency Annual Report
 
Health Declartion Minor
Health Declartion MinorHealth Declartion Minor
Health Declartion Minor
 
Jewish Agency Budget Shortfall
Jewish Agency Budget ShortfallJewish Agency Budget Shortfall
Jewish Agency Budget Shortfall
 
Health Declartion
Health DeclartionHealth Declartion
Health Declartion
 
Questioniar
QuestioniarQuestioniar
Questioniar
 
The Flourishing of Jewish Life in Hungary
The Flourishing of Jewish Life in HungaryThe Flourishing of Jewish Life in Hungary
The Flourishing of Jewish Life in Hungary
 
Jafi Image Brochure A10 3.31.08
Jafi Image Brochure A10 3.31.08Jafi Image Brochure A10 3.31.08
Jafi Image Brochure A10 3.31.08
 
2010
2010 2010
2010
 
Project TEN
Project TENProject TEN
Project TEN
 
Firststepsolim 090421154659 Phpapp01
Firststepsolim 090421154659 Phpapp01Firststepsolim 090421154659 Phpapp01
Firststepsolim 090421154659 Phpapp01
 
Firststepskc 090421154931 Phpapp02
Firststepskc 090421154931 Phpapp02Firststepskc 090421154931 Phpapp02
Firststepskc 090421154931 Phpapp02
 

Semelhante a Sharing a Startup’s Big Data Lessons

Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
NoSQL in the context of Social Web
NoSQL in the context of Social WebNoSQL in the context of Social Web
NoSQL in the context of Social WebBogdan Gaza
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...Felix Gessert
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataDebajani Mohanty
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
MongoDB: What, why, when
MongoDB: What, why, whenMongoDB: What, why, when
MongoDB: What, why, whenEugenio Minardi
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraVishal Puri
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Transform your DBMS to drive engagement innovation with Big Data
Sharing a Startup’s Big Data Lessons

  • 1. Sharing a Startup’s Big Data Lessons Experiences with non-RDBMS solutions at
  • 2. Who we are • A search engine • A people search engine • An influencer search engine • Subscription- based
  • 3. George Stathis VP Engineering 14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.
  • 4. What’s this talk about? • Share what we know about Big Data/NoSQL: what’s behind the buzz words? • Our reasons and method for picking a NoSQL database • Share the lessons we learned going through the process
  • 5. Big Data/NoSQL: behind the buzz words
  • 6. What is Big Data? • 3 Vs: – Volume – Velocity – Variety
  • 7. What is Big Data? Volume + Velocity • Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools. E.g. big science Astronomy web logs atmospheric science rfid genomics sensor networks biogeochemical social networks military surveillance social data medical records internet text and documents photography archives internet search indexing video archives call detail records large-scale e-commerce
  • 8. What is Big Data? Variety • Big Data is varied and unstructured Traditional static reports Analytics, exploration & experimentation
  • 9. What is Big Data? • Scaling data processing cost effectively $$$$$ $$$$$$$$ $$$
  • 10. What is NoSQL? • NoSQL ≠ No SQL • NoSQL ≈ Not Only SQL • NoSQL addresses RDBMS limitations, it’s not about the SQL language • RDBMS = static schema • NoSQL = schema flexibility; don’t have to know exact structure before storing
  • 11. What is Distributed Computing? • Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers • Allows computations to be accomplished in acceptable timeframes • Distributed computation approaches were developed to leverage multiple machines: MapReduce • With MapReduce, the program goes to the data since the data is too big to move
  • 12. What is MapReduce? Source: developer.yahoo.com
  • 13. What is MapReduce? • MapReduce = batch processing = analytical • MapReduce ≠ interactive • Therefore many NoSQL solutions don’t outright replace warehouse solutions, they complement them • RDBMS is still safe 
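  To make the batch model concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API (the canonical example, shown for illustration; input/output paths are passed in as arguments and the class name is ours, not from the talk):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map runs where the data lives and emits (word, 1) for every token.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      // Reduce receives all the counts shuffled to it for one word and sums them.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

  The point of the model is visible in main(): the framework ships this small program to the nodes holding the data, rather than shipping the data to the program.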
  • 14. What is Big Data? Velocity • In some instances, being able to process large amounts of data in real-time can yield a competitive advantage. E.g. – Online retailers leveraging buying history and click-through data for real-time recommendations • No time to wait for MapReduce jobs to finish • Solutions: streaming processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
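  One way to sketch the pre-computing pattern: increment a counter atomically as each event arrives, so a dashboard read becomes a single key lookup instead of a batch job. This hypothetical example uses the current MongoDB Java driver; the collection, ids and field names are invented for illustration:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.UpdateOptions;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    public class MentionCounter {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> counters =
              client.getDatabase("analytics").getCollection("daily_mentions");
          // Called once per incoming mention: one atomic $inc keyed by
          // (influencerId, day) keeps the aggregate current as data arrives.
          counters.updateOne(
              Filters.and(Filters.eq("influencerId", "inf_1234"),
                          Filters.eq("day", "2013-04-01")),
              Updates.inc("mentions", 1),
              new UpdateOptions().upsert(true)); // create the counter on first sight
        }
      }
    }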
  • 15. What is Big Data? Data Science • Emergence of Data Science • Data Scientist ≈ Statistician • Possess scientific discipline & expertise • Formulate and test hypotheses • Understand the math behind the algorithms so they can tweak when they don’t work • Can distill the results into an easy to understand story • Help businesses gain actionable insights
  • 16. Big Data Landscape Source: capgemini.com
  • 17. Big Data Landscape Source: capgemini.com
  • 18. Big Data Landscape Source: capgemini.com
  • 19. So what’s Traackr and why did we need a NoSQL DB?
  • 20. Traackr: context • A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
  • 21. Traackr: a people search engine Up to 50 keywords per search!
  • 22. Traackr: a people search engine • Proprietary 3-scale ranking • People as search results • Content aggregated by author
  • 23. Traackr: 30,000 feet Acquisition Processing Storage & Indexing Services Applications
  • 24. NoSQL is usually associated with “Web Scale” (Volume & Velocity)
  • 25. Do we fit the “Web scale” profile? • In terms of users/traffic?
  • 30.
  • 31. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
  • 32. PRIMARY> use traackr switched to db traackr PRIMARY> db.stats() { "db" : "traackr", "collections" : 12, "objects" : 68226121, "avgObjSize" : 2972.0800625760330, "dataSize" : 202773493971, "storageSize" : 221491429671, "numExtents" : 199, "indexes" : 33, "indexSize" : 27472394891, "fileSize" : 266623699968, "nsSizeMB" : 16, "ok" : 1 } (That’s a quarter of a terabyte…)
  • 33. Wait! What? My Synology NAS at home can hold 2TB!
  • 34. No need for us to track the entire web Influencer Content Web Content Not at scale :-)
  • 35. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
  • 36. Variety view of “Web Scale” Web data is: Heterogeneous Unstructured (text)
  • 37. Visualization of the Internet, Nov. 23rd 2003 Source: http://www.opte.org/
  • 38. Data sources are isolated islands of rich data with loose links to one another
  • 39. How do we build a database that models all possible entities found on the web?
  • 40. Modeling the web: the RDBMS way
  • 42. or
  • 43.
  • 44. { "realName": "David Chancogne", "title": "CTO", "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me", "primaryAffiliation": "Traackr", "email": "dchancogne@traackr.com", "location": "Cambridge, MA, United States", "siteReferences": [ { "siteUrl": "http://twitter.com/dchancogne", "metrics": [ { "value": 216, "name": "twitter_followers_count" }, { "value": 2107, "name": "twitter_statuses_count" } ] }, { "siteUrl": "http://traackr.com/blog/author/david", "metrics": [ { "value": 21, "name": "google_inbound_links" } ] } ] } Influencer data as JSON
  • 45. NoSQL = schema flexibility
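  A minimal sketch of what that flexibility buys you, using the MongoDB Java driver (the talk only gets to MongoDB later; the collection and documents here are invented to mirror the influencer JSON above). Two records with different shapes land in the same collection, with no ALTER TABLE and no NULL-padded columns:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class SchemaFlexibility {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> influencers =
              client.getDatabase("demo").getCollection("influencers");
          // A blogger with site references and metrics...
          influencers.insertOne(Document.parse(
              "{ \"realName\": \"Jane Blogger\", \"siteReferences\": [" +
              " { \"siteUrl\": \"http://example.com/blog\"," +
              "   \"metrics\": [ { \"name\": \"google_inbound_links\", \"value\": 21 } ] } ] }"));
          // ...and a record with fields the first one never declared.
          influencers.insertOne(Document.parse(
              "{ \"realName\": \"John Tweeter\", \"twitterHandle\": \"@john\"," +
              "  \"location\": \"Boston, MA\" }"));
        }
      }
    }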
  • 46. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data?
  • 47. Do we fit the “Web scale” profile? • In terms of users/traffic? • In terms of the amount of data? • In terms of the variety of the data ✓
  • 48. Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text • Batch processing options
  • 49. Requirement: text storage • Variable text length: 140-character tweets < big variance < multi-page blog posts
  • 50. Requirement: text storage RDBMS’ answer to variable text length: Plan ahead for largest value CLOB/BLOB
  • 51. Requirement: text storage • Issues with CLOB/BLOB for us: No clue what the largest value is; CLOB/BLOB for tweets = wasted space
  • 52. Requirement: text storage NoSQL solutions are great for text: No length requirements (automated chunking) Limited space overhead
  • 53. Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text ✓ • Batch processing options
  • 54. Requirement: batch processing Some NoSQL solutions come with MapReduce Source: http://code.google.com/
  • 55. Requirement: batch processing MapReduce + RDBMS: Possible but proprietary solutions Usually involves exporting data from RDBMS into a NoSQL system anyway. Defeats data locality benefit of MR
  • 56. Traackr’s Datastore Requirements • Schema flexibility ✓ • Good at storing lots of variable length text ✓ • Batch processing options ✓ A NoSQL option is the right fit
  • 57. How did we pick a NoSQL DB?
  • 58. Bewildering number of options (early 2010) Key/Value Databases • Distributed hashtables • Designed for high load • In-memory or on-disk • Eventually consistent Column Databases • Spreadsheet-like • Key is a row id • Attributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(V,E) • Great for modeling networks • Great for graph-based query algorithms
  • 60. Trimming options. Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis, but not as the main data store.
  • 61. Trimming options. Memcache: memory-based; we need true persistence.
  • 62. Trimming options. Amazon SimpleDB: not willing to store our data in a proprietary datastore.
  • 63. Trimming options. Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
  • 64. Trimming options. CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
  • 65. Trimming options. Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (those came later on).
  • 66. Trimming options. MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
  • 67. Trimming options. Riak: very close, but in early 2010 we had adoption questions.
  • 68. Trimming options. HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib, and support for batch processing using Hadoop/MR.
  • 69. Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  • 70. Rewards: Choices Key/Value Databases • Distributed hashtables • Designed for high load • In-memory or on-disk • Eventually consistent Column Databases • Spreadsheet-like • Key is a row id • Attributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(V,E) • Great for modeling networks • Great for graph-based query algorithms
  • 71. Rewards: Choices Source: capgemini.com
  • 72. Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  • 73. When Big-Data = Big Architectures. Master/slave architecture means a single point of failure, so you need to protect your master. Must have an odd number of Zookeeper quorum nodes. Must have a Hadoop HDFS cluster of at least 2x replication factor nodes. Then you can run your Hbase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources. And then we also have to manage the MapReduce processes and resources in the Hadoop layer. Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
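  For a sense of the client side of that architecture, here is a hypothetical connection sketch with the HBase Java client; the quorum hosts and table name are invented, and hbase.zookeeper.quorum is the standard property for pointing a client at the ZooKeeper ensemble:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class ConnectToCluster {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // An odd number of ZooKeeper quorum members avoids voting deadlocks.
        conf.set("hbase.zookeeper.quorum",
                 "zk1.example.com,zk2.example.com,zk3.example.com");
        HTable table = new HTable(conf, "alists"); // hypothetical table
        // ...reads and writes are then routed to the regionservers...
        table.close();
      }
    }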
  • 75. Jokes aside, no one said open source was easy to use
  • 76. To be expected • Hadoop/Hbase are designed to move mountains • If you want to move big stuff, be prepared to sometimes use big equipment
  • 77. What it means to a startup: development capacity before vs. development capacity after (congrats, you are now a sysadmin…)
  • 78. Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  • 79. Mapping a saved search to a column store • Name • Ranks • References to influencer records
  • 80. Mapping a saved search to a column store • Unique key per row • An “attributes” column family for general attributes • An “influencerId” column family for influencer ranks and foreign keys
  • 81. Mapping a saved search to a column store • The “name” attribute • Influencer ranks can be attribute names as well
  • 82. Mapping a saved search to a column store • Can get pretty long, so it needs indexing and pagination (a sketch of this row layout in code follows below)
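  A rough sketch of that row layout with the HBase 0.90-era Java client; the table name, row key and values are invented, but the column-family split mirrors the slides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaveAlist {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "alists"); // hypothetical table

        Put put = new Put(Bytes.toBytes("alist_42")); // row key = the saved-search id
        // General attributes go in the "attributes" family, so list info
        // can be read without loading all the influencers...
        put.add(Bytes.toBytes("attributes"), Bytes.toBytes("name"),
                Bytes.toBytes("Cloud Computing Influencers"));
        // ...while ranks and foreign keys live in the "influencerIds" family,
        // one column per ranked influencer.
        put.add(Bytes.toBytes("influencerIds"), Bytes.toBytes("rank_1"),
                Bytes.toBytes("influencer_id_1234"));
        put.add(Bytes.toBytes("influencerIds"), Bytes.toBytes("rank_2"),
                Bytes.toBytes("influencer_id_5678"));
        table.put(put);
        table.close();
      }
    }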
  • 83. Problem: no out-of-the-box row-based indexing and pagination
  • 84. Jumping right into the code
  • 85. Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  • 86. a few months later…
  • 87. Need to upgrade to Hbase 0.90 • Making sure to remain on a recent code base • Performance improvements • Mostly to get the latest bug fixes No thanks!
  • 88. Looks like something is missing
  • 89.
  • 90. Our DB indexes depend on this!
  • 91. Let’s get this straight • Hbase no longer comes with secondary indexing out-of-the-box • It’s been moved out of the trunk to GitHub • Where only one other company besides us seems to care about it
  • 92. Only one other maintainer besides us
  • 93. What it means to a startup Congrats, you are now an hbase contrib maintainer… Development capacity
  • 95. Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  • 96. Homegrown Hbase Indexes Row ids for Posts Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters
  • 97. Homegrown Hbase Indexes Row ids for Posts Find posts for influencer_id_1234
  • 98. Homegrown Hbase Indexes Row ids for Posts Find posts for influencer_id_5678
  • 99. Homegrown Hbase Indexes • No longer depending on unmaintained code • Work with out-of-the-box Hbase installation
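  A sketch of how such a homegrown index scan might look with the same 0.90-era client, assuming the row-key scheme from the slides (the table name is invented): because row ids are prefixed with the influencer id, a STARTROW/STOPROW range scan acts as the index.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PostsByInfluencer {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable posts = new HTable(conf, "posts"); // hypothetical table

        // Rows sort lexicographically, so scanning from "influencer_id_1234_"
        // up to (but not including) "influencer_id_1234`" covers every post
        // for that influencer: '`' is the next ASCII character after '_'.
        Scan scan = new Scan(Bytes.toBytes("influencer_id_1234_"),
                             Bytes.toBytes("influencer_id_1234`"));
        ResultScanner scanner = posts.getScanner(scan);
        try {
          for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow())); // one post row id
          }
        } finally {
          scanner.close();
          posts.close();
        }
      }
    }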
  • 100. What it means to a startup You are back but you still need to maintain indexing logic Development capacity
  • 101. a few months later…
  • 102. Cracks in the data model huffingtonpost.com published under writes for http://www.huffingtonpost.com/arianna-huffington/post_1.html http://www.huffingtonpost.com/arianna-huffington/post_2.html authored by http://www.huffingtonpost.com/arianna-huffington/post_3.html huffingtonpost.com published under writes for http://www.huffingtonpost.com/shaun-donovan/post1.html http://www.huffingtonpost.com/shaun-donovan/post2.html authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
  • 103. Cracks in the data model. The huffingtonpost.com site record and its “writes for”/“published under” relationships are denormalized/duplicated under each influencer, for fast runtime access and storage of influencer-to-site relationship properties.
  • 104. Cracks in the data model. Content attribution logic could sometimes mis-attribute posts because of the duplicated data (e.g. one of Arianna Huffington’s posts showing up under Shaun Donovan’s copy of the huffingtonpost.com record).
  • 105. Cracks in the data model. Exacerbated when we started tracking people’s content on a daily basis in mid-2011.
  • 106. Fixing the cracks in the data model. Normalize the sites: keep a single huffingtonpost.com record that each influencer “writes for” and each post is “published under”.
  • 107. Fixing the cracks in the data model • Normalization requires stronger secondary indexing • Our application layer indexing would need revisiting…again!
  • 108. What it means to a startup Psych! You are back to writing indexing code. Development capacity
  • 110. Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  • 111. Traackr’s Datastore Requirements (Revisited) • Schema flexibility • Good at storing lots of variable length text • Out-of-the-box SECONDARY INDEX support! • Simple to use and administer
  • 112. NoSQL picking – Round 2 (mid 2011) Key/Value Databases • Distributed hashtables • Designed for high load • In-memory or on-disk • Eventually consistent Column Databases • Spreadsheet-like • Key is a row id • Attributes are columns • Columns can be grouped into families Document Databases • Like Key/Value • Value = Document • Document = JSON/BSON • JSON = Flexible Schema Graph Databases • Graph Theory G=(V,E) • Great for modeling networks • Great for graph-based query algorithms
  • 113. NoSQL picking – Round 2 (mid 2011). Nope!
  • 114. NoSQL picking – Round 2 (mid 2011). Graph Databases: we looked at Neo4J a bit closer but passed again, for the same reasons as before.
  • 115. NoSQL picking – Round 2 (mid 2011). Memcache: still no.
  • 116. NoSQL picking – Round 2 (mid 2011). Amazon SimpleDB: still no.
  • 117. NoSQL picking – Round 2 (mid 2011). Redis and LinkedIn’s Project Voldemort: still no.
  • 118. NoSQL picking – Round 2 (mid 2011). CouchDB: more mature, but still no ad-hoc queries.
  • 119. NoSQL picking – Round 2 (mid 2011). Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the Hbase lesson, simplicity of use was now more important.
  • 120. NoSQL picking – Round 2 (mid 2011). Riak: still a strong contender, but adoption questions remained.
  • 121. NoSQL picking – Round 2 (mid 2011). MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
  • 122. Lessons Learned. Challenges: Complexity, Missing Features, Problem-solution fit, Resources. Rewards: Choices, Empowering, Community, Cost.
  • 123. Immediate Benefits • No more maintaining custom application-layer secondary indexing code
  • 124. What it means to a startup Yay! I’m back! Development capacity
  • 125. Immediate Benefits • No more maintaining custom application-layer secondary indexing code • Single binary installation greatly simplifies administration
  • 126. What it means to a startup Honestly, I thought I’d never see you guys again! Development capacity
  • 127. Immediate Benefits • No more maintaining custom application-layer secondary indexing code • Single binary installation greatly simplifies administration • Our NoSQL could now support our domain model
  • 129. { "_id": "770cf5c54492344ad5e45fb791ae5d52", "realName": "David Chancogne", "title": "CTO", "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me", "primaryAffiliation": "Traackr", "email": "dchancogne@traackr.com", "location": "Cambridge, MA, United States", "siteReferences": [ { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 }, { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 } ] } Modeling an influencer: embedded list of references to sites, augmented with influencer-specific site attributes (e.g. percent contribution to content).
  • 130. { "_id": "770cf5c54492344ad5e45fb791ae5d52", "realName": "David Chancogne", "title": "CTO", "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me", "primaryAffiliation": "Traackr", "email": "dchancogne@traackr.com", "location": "Cambridge, MA, United States", "siteReferences": [ { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 }, { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 } ] } siteId is indexed for “find influencers connected to site X” (note the dotted path must be quoted in the shell): > db.influencers.ensureIndex({"siteReferences.siteId": 1}); > db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"}); Modeling an influencer
  • 131. Other Benefits • Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write map reduce code to extract the data in a usable form, as was needed with Hbase. • Simpler backups: Hbase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up. • Great documentation • Great adoption and community
  • 132. looks like we found the right fit!
  • 133. We have more of this Development capacity
  • 134. And less of this Source: socialbutterflyclt.com
  • 135. Recap & Final Thoughts • 3 Vs of Big Data: – Volume – Velocity – Variety ✓ (Traackr) • Big Data technologies are complementary to SQL and RDBMS • Until machines can think for themselves, Data Science will be increasingly important
  • 136. Recap & Final Thoughts • Be prepared to deal with less mature tech • Be as flexible as the data => fearless refactoring • Importance of ease of use and administration cannot be overstated for a small startup
  • 137. Q&A

Editor's Notes

  1. Big science: Large Hadron Collider (LHC). Sensor networks: forest fire detection. Call detail record: a record of a (billing) event produced by a telecommunication network element.
  2. Scaling here means maintaining throughput of computation and analysis while data sizes increase: divide up the work across multiple machines.
  3. Taking a look at the amount of storage we are using as of a month ago in Mongo; this includes indexes.
  4. The point is that we don’t need to track the entire web: just the subset belonging to influencers!
  5. There is a different perspective on “Web Scale” that has to do with the nature of the data on the web.
  6. Take the approach of using a simplified entity model…
  7. …with semi-structured data storage formats like JSON: facilitate capturing related attribute structures; enable the flexibility of defining new attributes as they are discovered.
  8. CLOB pre-allocated space.
  9. Sparse maps.
  10. This is something we thought we needed back in early 2010. Traackr needs to score its entire DB of influencers on a weekly basis to adjust the weighted averages and stats that drive the scores. This means processing north of 750K sites, over 650K influencers and, soon, millions of posts (post-level attributes).
  11. Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
  12. Memcache: memory-based, we need true persistence.
  13. Amazon SimpleDB: not willing to store our data in a proprietary datastore.
  14. Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches.
  15. CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes.
  16. Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
  17. MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
  18. Riak: very close but in early 2010, we had adoption questions.
  19. HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib and support for batch processing using Hadoop/MR. Hadoop and its maturity was a big reason we picked HBase.
  20. Had to deal with a complex architecture right from the start: minimum number of data nodes to support replication; odd number of ZooKeeper nodes to avoid voting deadlocks; co-locating region servers = paying close attention to JVM resources; master = SPOF; co-locating job trackers = paying close attention to JVM resources.
  21. Quick overview of how we modeled a list in Hbase => saved searches. This is what our customers see. Let’s consider the name, the ranks of the influencers and the influencer references.
  22. Each row has a unique key: the alist id. We would group general attributes under one family of columns appropriately named “attributes”. Benefit: can get Alist information without loading all the influencers. We would group the influencer references under another family of columns named “influencerIds”.
  23. Now we can see where the attributes we see on the screen are stored.
  24. We coded the pagination and indexing features ourselves and contributed them back. Felt really good about it!
  25. It wasn’t bad enough that we had to write our own code to support our indexing needs; we now had to maintain a third-party code base that was quickly becoming outdated!
  26. Simplified example for posts.
  27. Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties.
  28. Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
  29. Exacerbated when we started tracking people’s content on a daily basis in mid-2011.
  30. Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
  31. CouchDB: more mature but still no ad-hoc queries.
  32. Cassandra: matured quite a bit, added secondary indexes and batch processing options but more restrictive in its use than other solutions. After the Hbase lesson, simplicity of use was now more important.
  33. Riak: strong contender still but adoption questions.
  34. MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented and fit into our existing code base very nicely.
  35. Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content).
  36. siteId indexed for “find influencers connected to site X”.