Cassandra/Hadoop Integration
OLTP + OLAP = Cassandra

Cassandra (basic overview)
  • BigTable + Dynamo
  • Semi-structured data model
  • Decentralized: no special roles, no SPOF
  • Horizontally scalable
  • Ridiculously fast writes, fast reads
  • Tunably consistent
  • Cross-DC capable

Querying with Cassandra
  • Design your data model based on your query model
  • Real-time ad-hoc queries aren't viable
  • Secondary indexes help
  • What about analytics?

Enter Hadoop
  • Hadoop brings analytics
      • MapReduce
      • Pig/Hive and other tools built on top of MapReduce
  • Configurable data sources/destinations
  • Many already familiar with it
  • Active community

Cluster Configuration
  • Basic recipe
      • Overlay Hadoop on top of Cassandra
      • Separate server for name node and job tracker
      • Co-locate task trackers with Cassandra nodes
      • Data nodes for distributed cache
  • Voilà
      • Data locality
      • Analytics engine scales with data

Cluster Tuning
  • Always tune Cassandra to taste
  • For Hadoop workloads you might
      • Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
      • Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
      • Tune the cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper)

All-in-one Configuration
  • JobTracker and NameNode
  • Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)

Separate Analytics Configuration
  • Separated nodes for analytics
  • Nodes for real-time random access
  • A single Cassandra cluster with different virtual data centers

MapReduce - InputFormat
  • Cassandra-specific InputFormat: ColumnFamilyInputFormat
  • Configuration: ConfigHelper, Hadoop variables
  • InputSplits over the data (tunable)
  • Example usage in contrib/word_count
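
As a rough illustration of what "InputSplits over the data" means, the sketch below cuts a token range into contiguous sub-ranges and attaches a list of replica hosts to each one, so that a scheduler could place each task near its data. This is not ColumnFamilyInputFormat's actual logic: the `Split` type, the `make_splits` helper, the integer token range, and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Split:
    start_token: int      # exclusive lower bound of the token sub-range
    end_token: int        # inclusive upper bound of the token sub-range
    locations: List[str]  # hosts assumed to hold replicas of this sub-range


def make_splits(start: int, end: int, replicas: List[str], pieces: int) -> List[Split]:
    """Cut the token range (start, end] into contiguous sub-ranges (hypothetical helper)."""
    if pieces < 1 or end <= start:
        raise ValueError("need pieces >= 1 and end > start")
    pieces = min(pieces, end - start)  # never produce an empty sub-range
    width = end - start
    # Integer boundaries, spaced as evenly as the range allows.
    bounds = [start + (width * i) // pieces for i in range(pieces + 1)]
    return [Split(lo, hi, list(replicas)) for lo, hi in zip(bounds, bounds[1:])]


# Hypothetical ring segment owned by two replicas, cut into four splits.
splits = make_splits(0, 100, ["node-a", "node-b"], pieces=4)
```

Asking for more pieces gives smaller splits and therefore more map tasks; that trade-off is what the "tunable" on the slide refers to.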

MapReduce - OutputFormat
  • OutputFormat: ColumnFamilyOutputFormat
  • Configuration: ConfigHelper, Hadoop variables
  • Batches output (tunable)
  • Don't have to use the Cassandra API
  • Some optimizations (e.g. ConsistencyLevel.ONE)
  • Uses Avro for output serialization (enables streaming)
  • Example usage in contrib/word_count
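
The "batches output" point can be pictured with a small buffering writer: mutations accumulate in memory and are sent together once a configurable batch size is reached, with a final flush on close. The sketch below is only an illustration of that idea; `BatchingWriter`, the `send` callback, and the tuple layout are hypothetical and do not reflect ColumnFamilyOutputFormat's real classes.

```python
from typing import Callable, List, Tuple

# Hypothetical mutation layout: (row key, column name, value)
Mutation = Tuple[str, str, str]


class BatchingWriter:
    """Buffers mutations and hands them to `send` in groups of `batch_size`."""

    def __init__(self, send: Callable[[List[Mutation]], None], batch_size: int):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._send = send
        self._batch_size = batch_size
        self._pending: List[Mutation] = []

    def write(self, key: str, name: str, value: str) -> None:
        self._pending.append((key, name, value))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered, then start a fresh buffer.
        if self._pending:
            self._send(self._pending)
            self._pending = []

    def close(self) -> None:
        # Flush the final, possibly partial, batch.
        self.flush()


# Usage: collect the batches in a list instead of sending them to a cluster.
sent: List[List[Mutation]] = []
writer = BatchingWriter(sent.append, batch_size=2)
for i in range(5):
    writer.write(f"row{i}", "count", str(i))
writer.close()
```

A larger batch size means fewer round trips but more memory held per task and more work to redo if a batch fails.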

Visualizing
  • Take vertical slices of columns over the whole column family

Hadoop Streaming
  • What about languages outside of Java?
  • Build on what Hadoop uses: Streaming
  • Output streaming as of 0.7.0
  • Example in contrib/hadoop_streaming_output
  • Input streaming in progress, hoping for 0.7.2
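
For readers new to Hadoop Streaming, the sketch below shows the general shape of a streaming job in a non-Java language: the mapper and reducer exchange tab-separated key/value lines, and the reducer relies on its input arriving sorted by key. It covers only the generic line protocol; it does not show the record format that Cassandra's output streaming expects (see contrib/hadoop_streaming_output for that). The function names are hypothetical, and in a real job each function would read `sys.stdin` and print to standard output.

```python
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of each run of identical keys (input must be sorted)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() stands in for Hadoop's shuffle-and-sort phase.
mapped = sorted(map_lines(["to be or not", "to be"]))
reduced = list(reduce_lines(mapped))
```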

Pig
  • Developed at Yahoo!
  • Pig Latin / Grunt shell
  • Powerful scripting language for analytics
  • Configuration: Hadoop/environment variables
  • Uses Pig 0.7+
  • Example usage in contrib/pig

rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
    as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
cols = FOREACH rows GENERATE flatten(cols) as (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) as count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;
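
To make the data flow of the Pig script easier to follow, here is a plain Python sketch of the same steps over a few in-memory rows. The sample rows are invented for illustration, and `str.split` only approximates Pig's TOKENIZE, which also splits on some punctuation.

```python
from collections import Counter

# Hypothetical sample data: (row key, list of (column name, column value)).
rows = [
    ("key1", [("text1", "the quick brown fox"), ("text2", "the lazy dog")]),
    ("key2", [("text1", "the fox")]),
]

# cols: flatten each row's bag of columns into (name, value) tuples.
cols = [(name, value) for _key, columns in rows for name, value in columns]

# words: tokenize every column value into individual words.
words = [word for _name, value in cols for word in value.split()]

# grouped + counts: group identical words and count each group.
counts = Counter(words)

# ordered + topten: sort by count, descending, and keep at most ten entries.
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
topten = ordered[:10]
```

In Pig the same pipeline runs as MapReduce jobs over the whole column family rather than over a list in memory.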

Summary of Integration
  • ColumnFamilyInputFormat
  • ColumnFamilyOutputFormat
  • Hadoop Streaming Output
  • Pig support: Cassandra LoadFunc

Users of Cassandra + Hadoop
  • Raptr.com
      • Home grown solution -> Cassandra + Hadoop
      • Query time: hours -> minutes
      • Pig obviated their need for multi-lingual MR
      • Speed and ease are enabling
  • Imagini/Visual DNA
  • The Dachis Group
  • US Government (Digital Reasoning)
      • See http://github.com/digitalreasoning/PyStratus

Future
  • Hive support in progress (HIVE-1434)
  • Hadoop Input Streaming (hoping for 0.7.2 - 1497)
  • Pig Storage Func (CASSANDRA-1828)
  • Row predicates (pending CASSANDRA-1600)
  • MapReduce et al over secondary indexes (1600)
  • Performance improvements (though already good)

Conclusion
  • Performant OLTP + powerful OLAP
  • Less need to shuttle data between storage systems
  • Data locality for processing
  • Scales with the cluster
  • Can separate analytics load into virtual DC

Learn More
  • About Cassandra
      • http://www.datastax.com/docs
      • http://wiki.apache.org/cassandra
      • Search and subscribe to the user mailing list (very active)
      • #Cassandra on freenode (IRC): ~150-200+ users from around the world
      • Cassandra: The Definitive Guide
  • About Hadoop Support in Cassandra
      • Check out various <source>/contrib modules: README/code
      • http://wiki.apache.org/cassandra/HadoopSupport

Questions
  • About me: jeremy.hanna@dachisgroup.com
  • @jeromatron on Twitter
  • jeromatron on IRC in #cassandra

Editor's Notes

  1. Floating above the clouds
  2. Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  3. Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  4. Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  5. In other words, are people using this stuff in the real world? In production? Put some notes in here about Raptr's and Imagini's use cases.