SlideShare uma empresa Scribd logo
1 de 15
Baixar para ler offline
Demonstration
Outline
● Some comments on what we're trying to
  show
  ○ high level cluster configuration
  ○ an example application that might use this config
    ■ based on a Gowalla data set
● Launch cluster nodes on EC2
● Launch/configure Cassandra on cluster
● Demonstrate use of Cassandra
  ○ cassandra-cli, pycassa scripts to interact with db
● Demonstrate use of Hadoop
● Demonstrate use of Pig on the real data
Cluster configuration
● Four EC2 nodes
  ○ m1.medium instances
    ■ realistically a bit small for real world
● 3 nodes part of Cassandra
  ○ data can be input dynamically into db via Thrift API
● All nodes run Hadoop Tasktracker
● MapReduce runs close to (Cassandra) data
● JobTracker on separate node
Cluster config


        Job Tracker                           Cassandra

                                             Task Tracker




                               Cassandra                     Cassandra

                              Task Tracker                  Task Tracker



All nodes m1.small for demo
Let's get the cluster up...
       ...over to Lamine!
Let's get Cassandra
      running...
  ...and show the basic cli...
Application data
● Used Gowalla data in this test application
● Gowalla provide anonymized data for
  test/research purposes:
  ○ Graph of UID connections
  ○ List of checkins - UID, LocID
● Size of data set:
  ○ 400MB checkins
    ■ 6.4m checkins
  ○ ~200k users
● Also generated simpler variant of this data
  for demonstration
  ○ more real user information
  ○ more real location information
Application data - User Graph




 Simple graph structure -
 unidirectional graph with
 UIDs as nodes
Application Data - Checkin info
How this data can be used
● Application interested in:
   ○   my checkins
   ○   list my friends
   ○   checkins at given location
   ○   my friends checkins
● Analytics:
   ○ top ten most active users - most checkins
   ○ aggregate checkins per week
   ○ aggregate checkins per week per city
Cassandra data models
● The following data models were used:
  ○ User
  ○ Location
  ○ Checkin
  ○ FriendRels
    ■ graph of friend relationships
  ○ UserCheckins
    ■ checkins by user
  ○ LocationCheckins
    ■ checkins by location
  ○ FriendCheckins
    ■ checkins by friends
Cassandra data models
● Use of valueless columns
  ○ FriendRels, UserCheckins, LocationCheckins,
    FriendCheckins are just sets of valueless columns
● FriendRel:
  ○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...}
    ■ row_key is a uid
● UserCheckins:
  ○ row_key: {checkinid1: '', checkinid2: '', ...}
    ■ row_key is uid
● LocationCheckins use LocID as row key
● FriendCheckins use my UID to get my
  friend's checkins
Let's import the data into
       Cassandra...
You deserve a coffee...
Using Hadoop and Pig
 ...and we can do some analytics...

Mais conteúdo relacionado

Mais procurados

Elasticsearch avoiding hotspots
Elasticsearch  avoiding hotspotsElasticsearch  avoiding hotspots
Elasticsearch avoiding hotspotsChristophe Marchal
 
Object multifunctional indexing with an open API
Object multifunctional indexing with an open API Object multifunctional indexing with an open API
Object multifunctional indexing with an open API akvalex
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...NoSQLmatters
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streamsMikhail Girkin
 
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...Srinath Perera
 
DB reading group may 16, 2018
DB reading group may 16, 2018DB reading group may 16, 2018
DB reading group may 16, 2018Keisuke Suzuki
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkSpark Summit
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRaghunath A
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...Rob Skillington
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей РыбакOntico
 
Data Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinData Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinGeoff Ness
 
Pain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication worksPain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication worksRob Skillington
 
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Taiwan User Group
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodbDeep Kapadia
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaJuan Antonio Roy Couto
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...NETWAYS
 
Amazon Web Services lection 4
Amazon Web Services lection 4  Amazon Web Services lection 4
Amazon Web Services lection 4 Binary Studio
 
Data Lessons Learned at Scale
Data Lessons Learned at ScaleData Lessons Learned at Scale
Data Lessons Learned at ScaleCharlie Reverte
 

Mais procurados (20)

Elasticsearch avoiding hotspots
Elasticsearch  avoiding hotspotsElasticsearch  avoiding hotspots
Elasticsearch avoiding hotspots
 
Object multifunctional indexing with an open API
Object multifunctional indexing with an open API Object multifunctional indexing with an open API
Object multifunctional indexing with an open API
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streams
 
NBITSearch. Features.
NBITSearch. Features.NBITSearch. Features.
NBITSearch. Features.
 
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
 
DB reading group may 16, 2018
DB reading group may 16, 2018DB reading group may 16, 2018
DB reading group may 16, 2018
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By Spark
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
 
Data Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinData Step Hash Object vs SQL Join
Data Step Hash Object vs SQL Join
 
Pain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication worksPain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication works
 
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de Huelva
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
 
Amazon Web Services lection 4
Amazon Web Services lection 4  Amazon Web Services lection 4
Amazon Web Services lection 4
 
R user group 2011 09
R user group 2011 09R user group 2011 09
R user group 2011 09
 
Data Lessons Learned at Scale
Data Lessons Learned at ScaleData Lessons Learned at Scale
Data Lessons Learned at Scale
 

Destaque

No sql course introduction
No sql course   introductionNo sql course   introduction
No sql course introductionSean Murphy
 
Rss announcements
Rss announcementsRss announcements
Rss announcementsSean Murphy
 
Overview of no sql
Overview of no sqlOverview of no sql
Overview of no sqlSean Murphy
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandraTarun Garg
 

Destaque (7)

No sql course introduction
No sql course   introductionNo sql course   introduction
No sql course introduction
 
Rocco pres-v1
Rocco pres-v1Rocco pres-v1
Rocco pres-v1
 
Rss announcements
Rss announcementsRss announcements
Rss announcements
 
Rss talk
Rss talkRss talk
Rss talk
 
Overview of no sql
Overview of no sqlOverview of no sql
Overview of no sql
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandra
 

Semelhante a Demonstrate Cassandra, Hadoop and Pig on EC2 Cluster

Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoringRohit Jnagal
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanDatabricks
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in RetailHari Shreedharan
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaJose Mº Muñoz
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdfAbhi Jain
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
 
Avoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfAvoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfCédrick Lunven
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightScyllaDB
 
Peer sim (p2p network)
Peer sim (p2p network)Peer sim (p2p network)
Peer sim (p2p network)Hein Min Htike
 
Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016Juan Antonio Roy Couto
 

Semelhante a Demonstrate Cassandra, Hadoop and Pig on EC2 Cluster (20)

Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Native Container Monitoring
Native Container MonitoringNative Container Monitoring
Native Container Monitoring
 
Druid
DruidDruid
Druid
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
Running Cassandra in AWS
Running Cassandra in AWSRunning Cassandra in AWS
Running Cassandra in AWS
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
 
Presentation
PresentationPresentation
Presentation
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdf
 
MongoDB FabLab León
MongoDB FabLab LeónMongoDB FabLab León
MongoDB FabLab León
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Avoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfAvoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdf
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
Peer sim (p2p network)
Peer sim (p2p network)Peer sim (p2p network)
Peer sim (p2p network)
 
Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016
 

Último

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 

Último (20)

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 

Demonstrate Cassandra, Hadoop and Pig on EC2 Cluster

  • 2. Outline ● Some comments on what we're trying to show ○ high level cluster configuration ○ an example application that might use this config ■ based on a Gowalla data set ● Launch cluster nodes on EC2 ● Launch/configure Cassandra on cluster ● Demonstrate use of Cassandra ○ cassandra-cli, pycassa scripts to interact with db ● Demonstrate use of Hadoop ● Demonstrate use of Pig on the real data
  • 3. Cluster configuration ● Four EC2 nodes ○ m1.medium instances ■ realistically a bit small for real world ● 3 nodes part of Cassandra ○ data can be input dynamically into db via Thrift API ● All nodes run Hadoop Tasktracker ● MapReduce runs close to (Cassandra) data ● JobTracker on separate node
  • 4. Cluster config Job Tracker Cassandra Task Tracker Cassandra Cassandra Task Tracker Task Tracker All nodes m1.small for demo
  • 5. Let's get the cluster up... ...over to Lamine!
  • 6. Let's get Cassandra running... ...and show the basic cli...
  • 7. Application data ● Used Gowalla data in this test application ● Gowalla provide anonymized data for test/research purposes: ○ Graph of UID connections ○ List of checkins - UID, LocID ● Size of data set: ○ 400MB checkins ■ 6.4m checkins ○ ~200k users ● Also generated simpler variant of this data for demonstration ○ more real user information ○ more real location information
  • 8. Application data - User Graph Simple graph structure - unidirectional graph with UIDs as nodes
  • 9. Application Data - Checkin info
  • 10. How this data can be used ● Application interested in: ○ my checkins ○ list my friends ○ checkins at given location ○ my friends checkins ● Analytics: ○ top ten most active users - most checkins ○ aggregate checkins per week ○ aggregate checkins per week per city
  • 11. Cassandra data models ● The following data models were used: ○ User ○ Location ○ Checkin ○ FriendRels ■ graph of friend relationships ○ UserCheckins ■ checkins by user ○ LocationCheckins ■ checkins by location ○ FriendCheckins ■ checkins by friends
  • 12. Cassandra data models ● Use of valueless columns ○ FriendRels, UserCheckins, LocationCheckins, FriendCheckins are just sets of valueless columns ● FriendRel: ○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...} ■ row_key is a uid ● UserCheckins: ○ row_key: {checkinid1: '', checkinid2: '', ...} ■ row_key is uid ● LocationCheckins use LocID as row key ● FriendCheckins use my UID to get my friend's checkins
  • 13. Let's import the data into Cassandra...
  • 14. You deserve a coffee...
  • 15. Using Hadoop and Pig ...and we can do some analytics...