Cassandra @walmartlabs

• Cassandra adoption at Walmart
    – Using the DataStax distribution http://www.datastax.com/
• Introduction to the talks
• Hiring @labs




Walmart eCommerce
Cassandra @walmartlabs

• Introduction to the talks
    – Walmartlabs
       • @labs – Using Cassandra for real-time stream processing
       • @services – Using Cassandra for products and items
    – DataStax
       • Data modeling with Cassandra




Cassandra @walmartlabs

• Hiring @labs
    – Cassandra admins
    – Java engineers
    – http://www.walmartlabs.com/open-positions/




Cassandra for Real-time
Stream Processing
Karl Mueller, @WalmartLabs
Wang Lam, @WalmartLabs
Data-stream computation

• “Big” data: MapReduce (Hadoop)
    – Map and Reduce steps
    – Batch process large input (e.g., from HDFS)
    – Hadoop distributes computation



• Fast data: MapUpdate (Muppet)
    –   Map and Update steps
    –   Continuously process streaming input
    –   Muppet maintains computation
    –   Muppet manages memory/storage




2012 Cassandra for Real-Time Stream Processing @WalmartLabs
The MapReduce framework (Hadoop)

• Event
    – A <key, value> pair of data


• Map
    – A function that performs (stateless) computation on incoming
      events


• Reduce
    – A function that combines all input for a particular key


• Application
    – Map -> Reduce
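
The batch Map -> Reduce flow above can be sketched in a few lines of Python. This is a single-process toy, not Hadoop's API: the names map_reduce, word_mapper, and sum_reducer are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def map_reduce(events, mapper, reducer):
    """Run one batch Map -> Reduce pass over a finite list of (key, value) events."""
    grouped = defaultdict(list)
    for key, value in events:
        # Map is stateless: each event is handled on its own.
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reduce sees every value collected for one key at once.
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(_key, line):
    """Emit one (word, 1) event per word in the line."""
    return [(word, 1) for word in line.split()]


def sum_reducer(_key, values):
    """Combine all input for a key by summing it."""
    return sum(values)


counts = map_reduce([(1, "a b a"), (2, "b a")], word_mapper, sum_reducer)
print(counts)
```

The reducer only runs once the whole input has been grouped, which is why this model suits batch input rather than an unbounded stream.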


The MapUpdate framework (Muppet)

• Event
    – A <key, value> pair of data


• Map
    – A function that performs (stateless) computation on incoming
      events


• Update
    – A function that updates a slate using incoming events


• Application
    – A directed graph of Mappers and Updaters
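
To make the Map/Update/slate vocabulary concrete, here is a minimal single-process sketch in Python. It is not the Muppet API: MapUpdateApp, subscribe, publish, and the stream names are assumptions made for illustration, and a real deployment would spread operators and slates across nodes.

```python
from collections import defaultdict


class MapUpdateApp:
    """Toy MapUpdate runner (hypothetical; not the Muppet API)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # stream name -> operators
        self.slates = defaultdict(dict)       # (updater name, key) -> slate

    def subscribe(self, stream, kind, name, fn):
        """Attach a mapper ("map") or updater ("update") to a stream."""
        self.subscribers[stream].append((kind, name, fn))

    def publish(self, stream, key, value):
        """Deliver one <key, value> event to every operator on the stream."""
        for kind, name, fn in self.subscribers[stream]:
            if kind == "map":
                # Mappers are stateless; they may publish further events.
                fn(self, key, value)
            else:
                # Updaters also receive the slate kept for (updater, key).
                fn(self, key, value, self.slates[(name, key)])


def split_words(app, _key, line):
    for word in line.split():
        app.publish("words", word, 1)


def count_word(_app, _key, value, slate):
    slate["count"] = slate.get("count", 0) + value


app = MapUpdateApp()
app.subscribe("lines", "map", "splitter", split_words)
app.subscribe("words", "update", "counter", count_word)
for line in ["a b a", "b a"]:
    app.publish("lines", None, line)
print(app.slates[("counter", "a")])
```

Unlike a reducer, the updater never waits for the end of the input: the slate always holds the running result for its key.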


A MapUpdate application




The Map (Foursquare::CheckinMapper)
sub map {
      my $self = shift;
      my $event = shift;

      # Bucket the check-in time into 900-second (15-minute) time slots.
      my $checkin = $event->{checkin};
      my $timeslot = int($checkin->{created} / 900) * 900;
      $event->{kosmix}->{timeslot} = $timeslot;
      $event->{kosmix}->{interval} = 900;

      # Match the venue name against known retailers; the last match wins.
      my $venue_name = $checkin->{venue}->{name};
      my $retailer = 0;
      $retailer = 'ToysRUs' if ($venue_name =~ /toys.*r.*us/i);
      $retailer = 'Walmart' if ($venue_name =~ /wal.*mart/i);
      $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);
      if ($retailer) {
             # Only retailer check-ins are published; the last argument
             # combines the retailer and the time slot.
             $event->{kosmix}->{retailer} = $retailer;
             $self->publish("FoursquareRetailerCheckin", $event,
                     $retailer.".".$timeslot);
      }
}
2012 ISD YBM Tech Fair - Big Fast Data @WalmartLabs
The Update (Foursquare::RetailerUpdater)
use Muppet::Updater;
package Foursquare::RetailerUpdater;
@ISA = qw( Muppet::Updater );

use strict;

sub update {
      my $self = shift;
      my $event = shift;
      my $slate = shift;
      my $config = shift;
      my $key = shift;

      # Record which retailer and time slot this slate summarizes.
      $slate->{timeslot} = $event->{kosmix}->{timeslot};
      $slate->{interval} = $event->{kosmix}->{interval};
      $slate->{retailer} = $event->{kosmix}->{retailer};
      # Count one more check-in for this retailer and time slot.
      $slate->{count} += 1;
}
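
For readers who want to run the example end to end, the following is a rough Python port of the mapper and updater shown on these two slides. The regular expressions and the 900-second interval come from the slides; the function names, the slates dictionary, and the sample check-ins are made up for illustration, and the port assumes created is a non-negative integer timestamp in seconds.

```python
import re

INTERVAL = 900  # seconds per time slot (15 minutes), as on the slide

# Same patterns as the Perl mapper; a later match overrides an earlier one.
RETAILER_PATTERNS = [
    ("ToysRUs", re.compile(r"toys.*r.*us", re.I)),
    ("Walmart", re.compile(r"wal.*mart", re.I)),
    ("SamsClub", re.compile(r"sam.*club", re.I)),
]


def checkin_map(event):
    """Return (key, event) for a retailer check-in, or None for other venues."""
    checkin = event["checkin"]
    timeslot = (checkin["created"] // INTERVAL) * INTERVAL
    retailer = None
    for name, pattern in RETAILER_PATTERNS:
        if pattern.search(checkin["venue"]["name"]):
            retailer = name
    if retailer is None:
        return None
    event["kosmix"] = {"timeslot": timeslot, "interval": INTERVAL,
                       "retailer": retailer}
    return f"{retailer}.{timeslot}", event


def retailer_update(event, slate):
    """Fold one check-in event into the slate for its retailer and time slot."""
    slate["timeslot"] = event["kosmix"]["timeslot"]
    slate["interval"] = event["kosmix"]["interval"]
    slate["retailer"] = event["kosmix"]["retailer"]
    slate["count"] = slate.get("count", 0) + 1


# Made-up sample check-ins: (venue name, created timestamp in seconds).
samples = [("Walmart Supercenter", 1000), ("Walmart Supercenter", 1700),
           ("Joe's Cafe", 1750), ("Walmart Supercenter", 1900)]
slates = {}
for venue, created in samples:
    mapped = checkin_map({"checkin": {"created": created,
                                      "venue": {"name": venue}}})
    if mapped is not None:
        key, event = mapped
        retailer_update(event, slates.setdefault(key, {}))
print({key: slate["count"] for key, slate in slates.items()})
```

Each slate ends up holding the number of check-ins for one retailer in one 15-minute window, which is what the updater on the slide accumulates.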
Example results




Muppet Processing

• Slates are 1 – 100KB in size

• Local cache on Muppet Node
    – 85% reads from cache
    – Write-through delayed cache
    – ~750K slates in cache per node


• Remote slates read through Muppet API

• Cassandra is the permanent datastore

• Slates tend to be updated and read in batches
    – 10-50 at a time
Muppet & Cassandra Architecture

[Figure: ~100x Muppet nodes, each drawn with several Processes, a Slate Cache, and a Delay stage, linked to one another by the API. Below them, 16x Cassandra nodes, each with 8x RAID0 SSD and 1.2TB raw.]




Datastore Requirements

• Consistent, low response time
    – 10ms or less for slate reads on average


• 1+ billion keys, future expansion maybe 5-10 billion

• Value is whole set of data
    – Slate losses in small amounts OK


• Datastore gets entirely “cold” reads
    – Muppet Cache: 85% for reads
    – Datastore cannot rely on cache for performance



Why Cassandra?

• Timeframe: Early 2010
    – Low latency: a rare feature among NoSQL
    – Most NoSQL favors throughput over response time
    – New “Best NoSQL evur!!” every 2 months


• Cassandra:
    – Open-Source, active community, Clustering a core feature


• Simple is good
    – Peer networking, Data file format, key distribution


• QUORUM consistency good middle ground
    – AP focus in CAP aligns well with our needs
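
As a reminder of what QUORUM buys, the short sketch below computes the quorum size for a given replication factor and checks the usual overlap rule: a read is guaranteed to reach at least one replica holding the latest acknowledged write when read replicas plus write replicas exceed the replication factor. The replication factor of 3 is an assumption for the example; the slides do not state the one used at WalmartLabs.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority of the replica set."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, read_replicas, write_replicas):
    """True when every read set must share a replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumed replication factor, for illustration only
q = quorum(rf)
print(q, read_overlaps_write(rf, q, q), read_overlaps_write(rf, 1, 1))
```

With three replicas, QUORUM reads and writes touch two of them, so one replica can be down without losing either availability or the overlap guarantee. Reading and writing a single replica is faster but gives up the overlap.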
Why Cassandra – the Challenges

• Seeks are going to be difficult
    –   Overwrites mean nightly compactions
    –   Compactions blow up seek performance
    –   90%+ cold reads means lots of seeks
    –   Head and body reads can produce a lot of seeks


• Slates as an atomic unit means no bulk column slice reads

• Likely to have unfavorable read:write ratio
    – Early estimates: 1:3, or even worse


• Oh yeah, spinning disks hate seeks. Uh oh!
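
The link between overwrites, compaction, and seeks can be illustrated with a toy log-structured store. This is a deliberate simplification and not Cassandra's storage engine: ToyStore is hypothetical, each flushed dictionary stands in for one immutable data file, and every file that contains the key is assumed to cost at least one seek on a cold read.

```python
class ToyStore:
    """Toy log-structured store: immutable files, newest last."""

    def __init__(self):
        self.sstables = []

    def flush(self, memtable):
        """Write the in-memory rows out as a new immutable file."""
        self.sstables.append(dict(memtable))

    def read(self, key):
        """Return (latest value, number of files that had to be probed)."""
        hits = [table for table in self.sstables if key in table]
        value = hits[-1][key] if hits else None
        return value, len(hits)

    def compact(self):
        """Merge all files into one, keeping only the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest to newest, so newer values win
            merged.update(table)
        self.sstables = [merged]


store = ToyStore()
for count in (1, 2, 3):
    # The same slate is overwritten between flushes during the day.
    store.flush({"Walmart.900": {"count": count}})

before = store.read("Walmart.900")
store.compact()
after = store.read("Walmart.900")
print(before, after)
```

Before compaction the row lives in three files, so a cold read probes all three; after compaction it lives in one. On spinning disks each probe is a head movement, which is why frequently overwritten rows and cold reads are a painful combination.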

Frequent Row Overwrites in Cassandra

[Figure: data files (SSTables) growing during the day until a full compaction, with regions labeled TAIL (few seeks), BODY (some seeks), and HEAD (many seeks).]
Solution

• Cassandra + SSDs !!

• Expensive in terms of space, cheap in terms of IOps

• Random seeks “free”

• Good performance during nightly compactions




Compaction Effect on System




How did Cassandra do?

• Average latency below 10ms, often 5-8ms

• Read:write ratio: 1:2
    – Today, 1:1


• Compacting 500GB every night in <4 hours

• Individual C* nodes handled over 1500 rps/wps

• SSD cost: well worth it
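
The compaction figure implies a minimum sustained rate, worked out below. The calculation assumes decimal units (1 GB = 1000 MB) and treats the 4 hours as an upper bound, so the real rate was somewhat higher. The slides do not say whether 500GB is per node or for the cluster, and the sketch does not assume either.

```python
def sustained_mb_per_s(gigabytes, hours):
    """Average throughput needed to process the given volume in the given time."""
    return gigabytes * 1000 / (hours * 3600)


rate = sustained_mb_per_s(500, 4)
print(f"at least {rate:.1f} MB/s")  # at least 34.7 MB/s
```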


Helping Cassandra out

• Muppet absorbs writes in local cache
    – Write on # of updates or staleness
    – Reduces write counts in Cassandra
    – More efficient


• Compress all slates on Muppet nodes
    – Easier to scale than C* nodes doing compression
    – Less disk IO, less network
    – CPU on Muppet nodes cheap


• Expire data via TTL
    – Muppet apps decide data-keep length


• Java GC tuning flattened out CPU and GC stops
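
The first two ideas above, absorbing writes in a local cache and compressing slates before they leave the Muppet node, can be sketched together. This is a toy under stated assumptions, not Muppet's cache: SlateCache, its thresholds, and the JSON plus zlib encoding are all hypothetical, a plain dictionary stands in for Cassandra, and cache misses and TTL expiry are left out.

```python
import json
import zlib


class SlateCache:
    """Toy write-back cache: flush after N updates or T seconds of staleness."""

    def __init__(self, store, max_updates=3, max_age=5.0):
        self.store = store        # dict standing in for the datastore
        self.max_updates = max_updates
        self.max_age = max_age
        self.slates = {}          # key -> slate (dict)
        self.dirty = {}           # key -> [pending updates, time first dirtied]
        self.writes = 0           # number of writes sent to the datastore

    def update(self, key, fn, now):
        """Apply fn to the cached slate and flush if enough updates piled up."""
        slate = self.slates.setdefault(key, {})
        fn(slate)
        pending = self.dirty.setdefault(key, [0, now])
        pending[0] += 1
        if pending[0] >= self.max_updates:
            self._flush(key)

    def tick(self, now):
        """Flush every slate that has been dirty for at least max_age seconds."""
        stale = [key for key, (_, since) in self.dirty.items()
                 if now - since >= self.max_age]
        for key in stale:
            self._flush(key)

    def _flush(self, key):
        # Compress on the cache side so the datastore handles fewer bytes.
        blob = zlib.compress(json.dumps(self.slates[key]).encode("utf-8"))
        self.store[key] = blob
        self.writes += 1
        del self.dirty[key]


def bump(slate):
    slate["count"] = slate.get("count", 0) + 1


store = {}
cache = SlateCache(store, max_updates=3, max_age=5.0)
for second in range(7):           # seven updates, one per second
    cache.update("Walmart.900", bump, now=second)
cache.tick(now=12)                # the last update has now gone stale
final = json.loads(zlib.decompress(store["Walmart.900"]))
print(cache.writes, final)
```

Seven updates turn into three datastore writes here. Raising either threshold trades fewer writes against more data at risk if a node fails before flushing, which fits the earlier requirement that small slate losses are acceptable.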
Recent and Future

• Cassandra 0.8.x
    – Faster compaction
    – Stability
    – Performance


• Cassandra 1.0.x
    –   Close to deployment @WML
    –   LevelDB-style (leveled) compaction is very, very interesting
    –   Cache memory changes make large caches feasible!
    –   Row[Column] latest-only: very nice
    –   SSDs no longer needed? Possibly!
         • Depends on cold seek requirements


Lessons

• Simple is usually faster and cheaper
    – Add complexity only where needed


• Best solution can usually be made to work

• Proactive monitoring very important
    – Trend graph everything relevant!


• Failing fast is better than succeeding late

• No substitute for understanding your platform

• Spend money when it will save you time and complexity
Q&A



Using Cassandra for
 Products & Items
Rajkumar Venkat
rvenkat@walmartlabs.com
First Challenge
Build a truly Global Product Catalog
Dimensions, Products & Product Offerings - Example
Second Challenge
Catalog (& Categorize) Any Sellable Item
Flexible Categorization & Attribution


 • The right kind of categorization and
   attribution is crucial to making sense of the
   enormity of product data
     • Ultimate shopping experience
     • Fine-grained analytics & planning

 • Standards exist, but are severely limiting
     • Product landscape changes dramatically
       every day
Other excerpts from the “shopping list”

• Lookup and potentially match products
  and offerings by any combination of
  attributes and other dimensional
  criteria
• Item-Item Relationships & Collections
    • Hierarchical
    • Graph
• Low Latency, High Throughput,
  Highly Available
• A scalable but unified system of record
  for all product and offering data
Translating to Cassandra

• Modeling options
    1. Product as a “wide row” encompassing all
       offerings
    2. Product assembled from several offering
       “fragment” rows
• Multiple Column Families
    • Product fragments
    • Custom consistency enabler
    • Custom row caching at column family level
• Single keyspace to hold all core data fragments
    • Tighter control of replication factor, strategy
    • Additional keyspaces only for supporting data
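
The two modeling options can be compared with plain Python dictionaries. Everything in this sketch is invented for illustration: the product ID, the column names, the values, and the helper assemble_from_fragments are not taken from the Walmart catalog or from any Cassandra client API.

```python
# Option 1: one wide row per product; every offering is a group of columns.
wide_rows = {
    "p123": {
        ("core", "", "name"): "Toy robot",
        ("offering", "uk", "price"): 7.99,
        ("offering", "us", "price"): 9.99,
    }
}

# Option 2: the product is spread over several smaller "fragment" rows.
fragment_rows = {
    ("p123", "core", ""): {"name": "Toy robot"},
    ("p123", "offering", "uk"): {"price": 7.99},
    ("p123", "offering", "us"): {"price": 9.99},
}


def assemble_from_fragments(rows, product_id):
    """Read the fragment rows of one product and merge them into one view."""
    product = {}
    for (pid, kind, scope), columns in rows.items():
        if pid == product_id:
            for name, value in columns.items():
                product[(kind, scope, name)] = value
    return product


assembled = assemble_from_fragments(fragment_rows, "p123")
print(assembled == wide_rows["p123"])
```

Both layouts carry the same data. The wide row answers a whole-product read from one row, while the fragment layout lets each offering be written, cached, and replicated on its own at the cost of assembling the product on read.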
Translating to Cassandra (contd.)

• Flexible, selective denormalization
• Secondary indexes for faster attribute-level queries
• Dynamic composites
    • define flexible comparators for different column key levels
    • capture 1-n levels of dimension intersections
• Column slicing to retrieve the right offerings
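
Column slicing over composite column keys can be pictured as a prefix range scan over sorted tuples. The sketch below imitates that idea in plain Python with bisect; it is not a Cassandra client call, and the column names, values, and slice_prefix helper are assumptions for illustration.

```python
from bisect import bisect_left

# Columns of one product row, kept sorted by their composite key.
columns = sorted({
    ("core", "", "name"): "Toy robot",
    ("offering", "uk", "price"): 7.99,
    ("offering", "us", "price"): 9.99,
    ("offering", "us", "stock"): 12,
}.items())


def slice_prefix(sorted_columns, prefix):
    """Return the contiguous run of columns whose key starts with prefix."""
    keys = [key for key, _ in sorted_columns]
    start = bisect_left(keys, prefix)  # first key that sorts at or after prefix
    result = []
    for key, value in sorted_columns[start:]:
        if key[:len(prefix)] != prefix:
            break                      # past the end of the matching run
        result.append((key, value))
    return result


us_offering = slice_prefix(columns, ("offering", "us"))
print(us_offering)
```

Because the columns are stored in key order, all columns for one dimension intersection sit next to each other, so retrieving "the right offerings" is one contiguous read rather than a scan of the whole row.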
The “Supporting Cast”


• Solr for additional indexing and querying capabilities
   • Mainly for attribute values
        • Pattern matching
        • Non-standard type comparisons
        • Range checks
Queries?

Mais conteúdo relacionado

Último

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Último (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Destaque

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 

Destaque (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

Cassandra atwalmartlabsmeetup201203

  • 2. Cassandra @walmartlabs • Cassandra adoption at Walmart – Using the DataStax distribution http://www.datastax.com/ • Introduction to the talks • Hiring @labs Walmart eCommerce 2
  • 3. Cassandra @walmartlabs • Introduction to the talks – Walmartlabs • @labs – Using Cassandra for real-time stream processing • @services – Using Cassandra for product and items – DataStax • Data modeling with Cassandra Walmart eCommerce 3
  • 4. Cassandra @walmartlabs • Hiring @labs – Cassandra admins – Java engineers – http://www.walmartlabs.com/open-positions/ Walmart eCommerce 4
  • 5. Cassandra for Real-time Stream Processing Karl Mueller, @WalmartLabs Wang Lam, @WalmartLabs
  • 6. Data-stream computation • “Big” data: MapReduce (Hadoop) – Map and Reduce steps – Batch process large input (e.g., from HDFS) – Hadoop distributes computation • Fast data: MapUpdate (Muppet) – Map and Update steps – Continuously process streaming input – Muppet maintains computation – Muppet manages memory/storage 2012 Cassandra for Real-Time Stream Processing @WalmartLabs
  • 7. The MapReduce framework (Hadoop) • Event – A <key, value> pair of data • Map – A function that performs (stateless) computation on incoming events • Reduce – A function that combines all input for a particular key • Application – Map -> Reduce 2012 Cassandra for Real-Time Stream Processing @WalmartLabs
  • 8. The MapUpdate framework (Muppet) • Event – A <key, value> pair of data • Map – A function that performs (stateless) computation on incoming events • Update – A function that updates a slate using incoming events • Application – A directed graph of Mappers and Updaters 2012 Cassandra for Real-Time Stream Processing @WalmartLabs
  • 9. A MapUpdate application 2012 Cassandra for Real-Time Stream Processing @WalmartLabs
  • 10. The Map (Foursquare::CheckinMapper)
        sub map {
            my $self  = shift;
            my $event = shift;

            my $checkin = $event->{checkin};
            # Round the check-in time down to a 900-unit slot
            # (15 minutes if `created` is in seconds).
            my $timeslot = int($checkin->{created} / 900) * 900;
            $event->{kosmix}->{timeslot} = $timeslot;
            $event->{kosmix}->{interval} = 900;

            # Tag the check-in with a retailer when the venue name matches.
            my $venue_name = $checkin->{venue}->{name};
            my $retailer = 0;
            $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
            $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
            $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

            if ($retailer) {
                $event->{kosmix}->{retailer} = $retailer;
                # Publish with a third argument built from retailer and timeslot.
                $self->publish("FoursquareRetailerCheckin", $event, $retailer.".".$timeslot);
            }
        }
      2012 ISD YBM Tech Fair - Big Fast Data @WalmartLabs
  • 11. The Update (Foursquare::RetailerUpdater)
        use Muppet::Updater;
        package Foursquare::RetailerUpdater;
        @ISA = qw( Muppet::Updater );
        use strict;

        sub update {
            my $self   = shift;
            my $event  = shift;
            my $slate  = shift;
            my $config = shift;
            my $key    = shift;

            # Copy the mapper's annotations into the slate.
            $slate->{timeslot} = $event->{kosmix}->{timeslot};
            $slate->{interval} = $event->{kosmix}->{interval};
            $slate->{retailer} = $event->{kosmix}->{retailer};
            # Count one more check-in in this slate.
            $slate->{count} += 1;
        }
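The map and update steps on slides 10 and 11 can be simulated end to end in a few lines. The following is a minimal Python sketch, not Muppet itself: `map_checkin`, `update_slate`, and `run` are hypothetical names, slates are plain dictionaries held in memory, and timestamps are assumed to be non-negative numbers. The venue patterns, the 900-unit interval, and the `retailer.timeslot` key are taken from the slides.

```python
import re
from collections import defaultdict

INTERVAL = 900  # slot width from the slides (15 minutes if timestamps are in seconds)

# Venue-name patterns copied from the slide's mapper.
# As in the Perl code, a later match overrides an earlier one.
RETAILER_PATTERNS = [
    ("ToysRUs", re.compile(r"toys.*r.*us", re.I)),
    ("Walmart", re.compile(r"wal.*mart", re.I)),
    ("SamsClub", re.compile(r"sam.*club", re.I)),
]


def map_checkin(event):
    """Stateless map step: return (key, enriched event) for retailer check-ins, else None."""
    checkin = event["checkin"]
    timeslot = int(checkin["created"] // INTERVAL) * INTERVAL
    retailer = None
    for name, pattern in RETAILER_PATTERNS:
        if pattern.search(checkin["venue"]["name"]):
            retailer = name
    if retailer is None:
        return None
    annotations = {"timeslot": timeslot, "interval": INTERVAL, "retailer": retailer}
    return f"{retailer}.{timeslot}", dict(event, kosmix=annotations)


def update_slate(slate, event):
    """Update step: fold one event into the slate that belongs to its key."""
    slate.update(event["kosmix"])
    slate["count"] = slate.get("count", 0) + 1


def run(events):
    """Route each mapped event to the slate for its key (in-memory stand-in for Muppet)."""
    slates = defaultdict(dict)
    for event in events:
        mapped = map_checkin(event)
        if mapped is not None:
            key, enriched = mapped
            update_slate(slates[key], enriched)
    return dict(slates)


if __name__ == "__main__":
    sample = [
        {"checkin": {"created": 1000, "venue": {"name": "Walmart Supercenter"}}},
        {"checkin": {"created": 1700, "venue": {"name": "Wal-Mart"}}},
        {"checkin": {"created": 1900, "venue": {"name": "Sam's Club"}}},
        {"checkin": {"created": 50, "venue": {"name": "Corner Cafe"}}},
    ]
    for key, slate in sorted(run(sample).items()):
        print(key, slate["count"])
```

Each slate ends up holding a running count per retailer and timeslot, which is the kind of per-key state the update step is meant to maintain.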
  • 12. Example results
  • 13. Muppet Processing
      • Slates are 1–100 KB in size
      • Local cache on Muppet node
          – 85% of reads served from cache
          – Write-through delayed cache
          – ~750K slates in cache per node
      • Remote slates read through Muppet API
      • Cassandra is the permanent datastore
      • Slates tend to be updated and read in batches
          – 10–50 at a time
  • 14. Muppet & Cassandra Architecture
      [Architecture diagram: ~100x Muppet nodes, each running multiple processes with an API and a slate cache (delayed writes); 16x Cassandra nodes, each with 8x SSD in RAID0, 1.2TB raw]
  • 15. Datastore Requirements
      • Consistent, low response time
          – 10 ms or less for slate reads on average
      • 1+ billion keys, future expansion maybe 5–10 billion
      • Value is whole set of data
          – Slate losses in small amounts OK
      • Datastore gets entirely “cold” reads
          – Muppet cache: 85% for reads
          – Datastore cannot rely on cache for performance
  • 16. Why Cassandra?
      • Timeframe: Early 2010
          – Low latency: a rare feature among NoSQL
          – Most NoSQL favors throughput over response time
          – New “Best NoSQL evur!!” every 2 months
      • Cassandra:
          – Open-source, active community, clustering a core feature
      • Simple is good
          – Peer networking, data file format, key distribution
      • QUORUM consistency good middle ground
          – AP focus in CAP aligns well with our needs
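The "good middle ground" of QUORUM on slide 16 comes down to simple arithmetic: a quorum is a strict majority of replicas, so any quorum read overlaps any quorum write in at least one replica. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and the replication factor of 3 is an assumption for illustration; the slides do not state the value used for this cluster.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write: a strict majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int, read_replicas: int, write_replicas: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumed replication factor, for illustration only
q = quorum(rf)
print(q, read_sees_latest_write(rf, q, q))  # 2 True
print(read_sees_latest_write(rf, 1, 1))     # False
```

With three replicas, QUORUM tolerates one replica being down for both reads and writes while still overlapping, whereas reading and writing a single replica gives no overlap guarantee.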
  • 17. Why Cassandra – the Challenges
      • Seeks are going to be difficult
          – Overwrites mean nightly compactions
          – Compactions blow up seek performance
          – 90%+ cold reads means lots of seeks
          – Head and body reads can produce a lot of seeks
      • Slates as an atomic unit means no bulk column slice reads
      • Likely to have unfavorable read:write ratio
          – Early estimates: 1:3, or even worse
      • Oh yeah, spinning disks hate seeks. Uh oh!
  • 18. Frequent Row Overwrites in Cassandra
      [Diagram of data files (SSTables): TAIL (few seeks), BODY (some seeks), HEAD (many seeks); growth during day; full compaction]
  • 19. Solution
      • Cassandra + SSDs !!
      • Expensive in terms of space, cheap in terms of IOPS
      • Random seeks “free”
      • Good performance during nightly compactions
  • 20. Compaction Effect on System
  • 21. How did Cassandra do?
      • Average latency below 10 ms, often 5–8 ms
      • Read:write ratio: 1:2
          – Today, 1:1
      • Compacting 500 GB every night in <4 hours
      • Individual C* nodes handled over 1500 rps/wps
      • SSD cost: well worth it
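The compaction figure on slide 21 implies a lower bound on sustained throughput. A small Python sketch of that arithmetic follows. It assumes decimal units (1 GB = 1000 MB) and treats the 500 GB as the volume processed within the window; the slide does not say whether the figure is per node or for the whole cluster, and the function name is hypothetical.

```python
def min_sustained_rate_mb_s(gigabytes: float, hours: float) -> float:
    """Average rate needed to process `gigabytes` within `hours`, in MB/s (decimal units)."""
    return gigabytes * 1000 / (hours * 3600)


# 500 GB in 4 hours; finishing in under 4 hours means the real rate was higher.
rate = min_sustained_rate_mb_s(500, 4)
print(f"at least {rate:.1f} MB/s")  # at least 34.7 MB/s
```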
  • 22. Helping Cassandra out
      • Muppet absorbs writes in local cache
          – Write on # of updates or staleness
          – Reduces write counts in Cassandra
          – More efficient
      • Compress all slates on Muppet nodes
          – Easier to scale than C* nodes doing compression
          – Less disk IO, less network
          – CPU on Muppet nodes cheap
      • Expire data via TTL
          – Muppet apps decide data-keep length
      • Java GC tuning flattened out CPU and GC stops
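The first two bullets on slide 22 (absorb writes locally, flush on update count or staleness, compress before sending) can be sketched as a small write-absorbing cache. This is a Python illustration under stated assumptions, not Muppet's implementation: `WriteAbsorbingCache`, its thresholds, the JSON serialization, and the dict standing in for Cassandra are all hypothetical.

```python
import json
import time
import zlib


class WriteAbsorbingCache:
    """Buffers slate updates and writes a compressed slate to the store
    once enough updates accumulate or the pending data gets too old."""

    def __init__(self, store, max_updates=10, max_age_s=30.0, clock=time.monotonic):
        self.store = store            # dict-like stand-in for the datastore
        self.max_updates = max_updates
        self.max_age_s = max_age_s
        self.clock = clock
        self._slates = {}             # key -> slate (dict)
        self._dirty = {}              # key -> (pending update count, time of first pending update)

    def update(self, key, mutate):
        """Apply `mutate` to the cached slate and flush if a threshold is reached."""
        slate = self._slates.setdefault(key, {})
        mutate(slate)
        count, first = self._dirty.get(key, (0, self.clock()))
        count += 1
        self._dirty[key] = (count, first)
        if count >= self.max_updates or self.clock() - first >= self.max_age_s:
            self.flush(key)

    def flush(self, key):
        """Compress the slate on this node and write it to the store once."""
        payload = zlib.compress(json.dumps(self._slates[key]).encode("utf-8"))
        self.store[key] = payload
        self._dirty.pop(key, None)

    def flush_stale(self):
        """Periodic sweep: flush slates whose pending updates are older than max_age_s."""
        now = self.clock()
        for key, (_, first) in list(self._dirty.items()):
            if now - first >= self.max_age_s:
                self.flush(key)


def read_slate(store, key):
    """Decompress and decode a slate written by the cache."""
    return json.loads(zlib.decompress(store[key]).decode("utf-8"))
```

With `max_updates=3`, three increments to the same slate produce one write to the store instead of three, which is the write reduction the slide describes. The staleness sweep bounds how much recent data a node failure could lose, in line with the earlier requirement that small slate losses are acceptable.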
  • 23. Recent and Future
      • Cassandra 0.8.x
          – Faster compaction
          – Stability
          – Performance
      • Cassandra 1.0.x
          – Close to deployment @WML
          – LevelDB is very, very interesting
          – Cache memory changes make large caches feasible!
          – Row[Column] latest-only: very nice
          – SSDs no longer needed? Possibly!
              • Depends on cold seek requirements
  • 24. Lessons
      • Simple is usually faster and cheaper
          – Add complexity only where needed
      • Best solution can usually be made to work
      • Proactive monitoring very important
          – Trend graph everything relevant!
      • Failing fast is better than succeeding late
      • No substitute for understanding your platform
      • Spend money when it will save you time and complexity
  • 25. Q&A
  • 26. Using Cassandra for Products & Items
      Rajkumar Venkat
      rvenkat@walmartlabs.com
  • 27. First Challenge
      Build a truly Global Product Catalog
  • 28. Dimensions, Products & Product Offerings - Example
  • 29. Second Challenge
      Catalog (& Categorize) Any Sellable Item
  • 30. Flexible Categorization & Attribution
      • The right kind of categorization and attribution is crucial to making sense of the enormity of product data
      • Ultimate shopping experience
      • Fine-grained analytics & planning
      • Standards exist, but are severely limiting
      • Product landscape changes dramatically every day
  • 31. Other excerpts from the “shopping list”
      • Look up and potentially match products and offerings by any combination of attributes and other dimensional criteria
      • Item-Item Relationships & Collections
          – Hierarchical
          – Graph
      • Low Latency, High Throughput, Highly Available
      • A scalable but unified system of record for all product and offering data
  • 32. Translating to Cassandra
      • Modeling options
          1. Product as a “wide row” encompassing all offerings
          2. Product assembled from several offering “fragment” rows
      • Multiple Column Families
          – Product fragments
          – Custom consistency enabler
          – Custom row caching at column family level
      • Single keyspace to hold all core data fragments
          – Tighter control of replication factor, strategy
      • Additional keyspaces only for supporting data
  • 33. Translating to Cassandra (contd.)
      • Flexible, selective denormalization
      • Secondary indexes for faster attribute-level queries
      • Dynamic composites
          – Define flexible comparators for different column key levels
          – Capture 1-n levels of dimension intersections
      • Column slicing to retrieve the right offerings
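The composite-column and slicing bullets on slide 33 rely on one property: columns in a row are kept sorted by their composite name, so all offerings sharing a leading set of dimensions sit next to each other and can be read as one contiguous slice. The Python sketch below models only that sort-and-slice behaviour with tuples. It uses no Cassandra client API, and the dimension order (business unit, country, channel), the sample values, and `slice_prefix` are assumptions for illustration.

```python
from bisect import bisect_left

# One product row: composite column name -> offering data (values are made up).
columns = {
    ("ASDA", "UK", "Website"): {"currency": "GBP"},
    ("SamsClub", "US", "Store"): {"currency": "USD"},
    ("Walmart", "CA", "Website"): {"currency": "CAD"},
    ("Walmart", "US", "Mobile"): {"currency": "USD"},
    ("Walmart", "US", "Website"): {"currency": "USD"},
}

# Column names kept in comparator order, compared component by component.
names = sorted(columns)


def slice_prefix(sorted_names, prefix):
    """Return the contiguous run of column names whose leading components equal `prefix`."""
    start = bisect_left(sorted_names, prefix)  # first name that sorts at or after the prefix
    matches = []
    for name in sorted_names[start:]:
        if name[:len(prefix)] != prefix:
            break                              # past the end of the slice
        matches.append(name)
    return matches


print(slice_prefix(names, ("Walmart", "US")))
# [('Walmart', 'US', 'Mobile'), ('Walmart', 'US', 'Website')]
print(len(slice_prefix(names, ("Walmart",))))  # 3
```

A shorter prefix selects a coarser intersection of dimensions and a longer prefix a finer one, which is one way to read "capture 1-n levels of dimension intersections".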
  • 34. The “Supporting Cast”
      • Solr for additional indexing and querying capabilities
          – Mainly for attribute values
          – Pattern matching
          – Non-standard type comparisons
          – Range checks

Editor's Notes

  1. Products are inherently multi-dimensional and mostly multi-variant
      • Dimensions include
          – Business Unit (Walmart, Sam’s Club, ASDA etc.)
          – Geography (US, Canada, UK etc.)
          – Language (en_US, fr_CA, en_UK etc.)
          – Supply Chain (Owned Inventory, Direct Ship, Marketplace etc.)
          – Channel (Website, Retail/Store, Mobile, Facebook etc.)
      • Variants include
          – Size (S, M, L, XL etc.)
          – Color (Red, Green, Blue etc.)
          – Capacity (8 GB, 16 GB etc.)
      • A true Global Product content is typically agnostic of any specific dimensions or variants
      • Items as we know and see them are actually Product Offerings
          – Representing content and behavior changes captured at every dimension and variant intersection
      • What you shop for is different from what you order, which is different from what you actually get!
  2. Notice the need for the concept of dimensions and variants to capture and maintain data at each level
      • Ingest external catalogs, even if we do not plan to sell the items right away
      • On a scale of hundreds of millions of unique SKUs
      • Base-variant and pre-configured bundles create order-of-magnitude increases in these estimates
  3. How do we give our customers access to the largest assortment in the world? As the digital arm of the world's largest retailer, we need not only to give existing customers access to an endless shelf, but also to have a broad assortment to expand into the consideration set of non-Walmart shoppers… this means millions and millions of items. And we do so in a manner that is scalable and gives the consumer the right product information to make an informed decision about whether or not the product will meet their needs.
  4. Ultimate shopping experience
          – Customer finds everything that he/she needs intuitively and in the right place, whether browsing or searching
      • Fine-grained analytics & planning
          – Fine-grained analytics helps us put the right kind of products on our shelves (physical or virtual) at the right level of availability (inventory) and pricing
      • Standards exist, but are severely limiting
          – e.g. GPC hierarchical classification and attribution structure
      • Product landscape changes dramatically every day
          – e.g. when tablets, a radically new form factor, arrive on the market, we want to be able to adopt and sell them ASAP and not wait for a cumbersome change control process caused by inflexible categorization and attribution
  5. Ability to look up and potentially match products and offerings/items by any combination of attributes and other dimensional criteria
      • Item-Item Relationships & Collections
          – Hierarchical
              • Base-variants (e.g. iPhone 4S 16/32/64 GB)
          – Graph
              • Bundles: Hard, Fixed, Inflexible or Configurable
              • Components & Ingredients
              • Accessories & Replacements
              • Case Packs & Vendor Packs
      • Low Latency, High Throughput, Highly Available
          – Sellers typically update 40-50% of their offerings at some level each day
          – Based on global projections, this may be comparable to the scale of social media feeds
          – Accept, process, search, retrieve and analyze large volumes of data 24x7
  6. Multiple Column Families
          – Product fragments
          – Custom consistency enabler
              • Separate the “data” from the “index” or “event log”
              • Use to separate “Work In Progress” from golden copy
              • Implicit versioning and potential archiving/purging requirements
              • Tunable consistency levels per API call (Read/Write)
          – Custom row caching at column family level
              • Optimize for read-intensive vs. write-intensive column families
      • Single keyspace to hold all data fragments
          – Tighter control of replication factor (DC + 3 or 5), strategy (NetworkTopologyStrategy, formerly known as DatacenterShardStrategy)
      • Additional keyspaces only for supporting data
          – Lower priority, loosely coupled or completely decoupled
          – E.g. purgeable audit & history logs
  7. Flexible, selective denormalization
          – Bi-directional relationships
          – Capture more than just foreign keys
          – Indices
          – Merge records to create product offering in the application/DaaS layer
          – Right balance of optimization of the retrieval algorithm vs. space
      • Secondary indexes for faster attribute-level queries, but simple queries only
          – However, complex queries may need to be supplemented with other tools, as we will see later
      • Dynamic composites
          – Capture 1-n levels of dimension intersections
          – Define flexible comparators for different column key levels
      • Column slicing to retrieve the right offerings (i.e. intersections)
          – No need to use Order Preserving Partitioner
      • Categorization and structure are completely handled outside of the data store
          – Cassandra only used to capture attribute values
  8. Solr for additional indexing and querying capabilities
          – Mainly at attribute value level
          – Pattern matching
          – Non-standard comparisons and range checks
      • HDFS/Hadoop for “extreme” bulk/batch operations
          – Large file/content streaming and parallel processing
          – Corresponding response aggregation
          – Hadoop “append”