Store and Process Big Data
with Hadoop and Cassandra
     Apache BarCamp
              By
      Deependra Ariyadewa
          WSO2, Inc.
Store Data with Cassandra

 ● Project site : http://cassandra.apache.org

 ● The latest release version is 1.0.7

 ● Cassandra is in use at Netflix, Twitter, Urban Airship, Constant
  Contact, Reddit, Cisco, OpenX, Digg, CloudKick and Ooyala

 ● Cassandra Users : http://www.datastax.com/cassandrausers

 ● The largest known Cassandra cluster has over 300 TB of data across more than
   400 machines.

 ● Commercial support: http://wiki.apache.org/cassandra/ThirdPartySupport
Cassandra Deployment Architecture

 [Ring diagram: each row key is hashed (hash(key1), hash(key2)) to determine its position on the node ring.]

 key => {(k,v),(k,v),(k,v)}

 hash(key) => key order
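
A purely illustrative Java sketch (not Cassandra's code) of the idea behind the diagram: each node owns a token on a ring, a row key is hashed with MD5 (as RandomPartitioner does), and the row is stored on the first node whose token is at or past that hash.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class RingSketch {
        // token -> node name; TreeMap keeps tokens sorted around the ring
        private final SortedMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();

        public void addNode(String node, BigInteger token) {
            ring.put(token, node);
        }

        public String nodeFor(String rowKey) throws Exception {
            BigInteger hash = new BigInteger(1,
                    MessageDigest.getInstance("MD5").digest(rowKey.getBytes("UTF-8")));
            SortedMap<BigInteger, String> tail = ring.tailMap(hash);
            // Wrap around the ring if the hash is larger than every token.
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }
    }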
How to Install Cassandra

 ● Download the artifact apache-cassandra-1.0.7-bin.tar.gz from
   http://cassandra.apache.org/download/


 ● Extract
   tar -xzvf apache-cassandra-1.0.7-bin.tar.gz

 ● Create the data and log directories

        mkdir -p /var/log/cassandra

        chown -R `whoami` /var/log/cassandra

        mkdir -p /var/lib/cassandra
        chown -R `whoami` /var/lib/cassandra
How to Configure Cassandra
Main configuration file:

  $CASSANDRA_HOME/conf/cassandra.yaml

    cluster_name: 'Test Cluster'

    seed_provider:
        - seeds: "192.168.0.121"

    storage_port: 7000

    listen_address: localhost

    rpc_address: localhost

    rpc_port: 9160
Cassandra Clustering

 initial_token:

 partitioner: org.apache.cassandra.dht.RandomPartitioner


 http://wiki.apache.org/cassandra/Operations
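
The Operations page linked above describes how initial_token values are typically assigned with RandomPartitioner: the token space runs from 0 to 2^127, so node i of N is given i * 2^127 / N. A small sketch of that calculation (the class name is illustrative):

    import java.math.BigInteger;

    public class TokenCalculator {
        public static void main(String[] args) {
            int nodeCount = Integer.parseInt(args[0]);          // e.g. 4
            BigInteger range = BigInteger.valueOf(2).pow(127);  // RandomPartitioner token space
            for (int i = 0; i < nodeCount; i++) {
                BigInteger token = range.multiply(BigInteger.valueOf(i))
                                        .divide(BigInteger.valueOf(nodeCount));
                System.out.println("node " + i + " initial_token: " + token);
            }
        }
    }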
Cassandra DevOps

$CASSANDRA_HOME/bin$ ./cassandra-cli --host localhost

   [default@unknown] show keyspaces;
   Keyspace: system:
    Replication Strategy: org.apache.cassandra.locator.LocalStrategy
    Durable Writes: true
     Options: [replication_factor:1]
    Column Families:
     ColumnFamily: HintsColumnFamily (Super)
     "hinted handoff data"
       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
       Default column value validator: org.apache.cassandra.db.marshal.BytesType
       Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
       Row cache size / save period in seconds / keys to save : 0.0/0/all
       Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
       Key cache size / save period in seconds: 0.01/0
       GC grace seconds: 0
       Compaction min/max thresholds: 4/32
       Read repair chance: 0.0
       Replicate on write: true
       Bloom Filter FP chance: default
       Built indexes: []
       Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Cassandra CLI

[default@apache] create column family Location with comparator=UTF8Type and
default_validation_class=UTF8Type and key_validation_class=UTF8Type;
f04561a0-60ed-11e1-0000-242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster


[default@apache] set Location[00001][City]='Colombo';
Value inserted.
Elapsed time: 140 msec(s).


[default@apache] list Location;
Using default limit of 100
-------------------
RowKey: 00001
=> (column=City, value=Colombo, timestamp=1330311097464000)

1 Row Returned.
Elapsed time: 122 msec(s).
Store Data with Hector
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

import java.util.HashMap;
import java.util.Map;

public class ExampleHelper {

    public static final String CLUSTER_NAME = "ClusterOne";
    public static final String USERNAME_KEY = "username";
    public static final String PASSWORD_KEY = "password";
    public static final String RPC_PORT = "9160";
    public static final String CSS_NODE0 = "localhost";
    public static final String CSS_NODE1 = "css1.stratoslive.wso2.com";
    public static final String CSS_NODE2 = "css2.stratoslive.wso2.com";

    public static Cluster createCluster(String username, String password) {
      Map<String, String> credentials =
            new HashMap<String, String>();
      credentials.put(USERNAME_KEY, username);
      credentials.put(PASSWORD_KEY, password);
      String hostList = CSS_NODE0 + ":" + RPC_PORT + "," +
                        CSS_NODE1 + ":" + RPC_PORT + "," +
                        CSS_NODE2 + ":" + RPC_PORT;
      return HFactory.createCluster(CLUSTER_NAME,
                           new CassandraHostConfigurator(hostList), credentials);
    }

}
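
A minimal sketch (not part of the original slides) of how the helper above might be used; the credentials and keyspace name are placeholders.

    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    public class ExampleClient {
        public static void main(String[] args) {
            // Connect through the three nodes configured in ExampleHelper.
            Cluster cluster = ExampleHelper.createCluster("admin", "admin");

            // Obtain a Keyspace handle; it is what the Mutator and query
            // examples on the next slide operate on.
            Keyspace keyspace = HFactory.createKeyspace("ExampleKeyspace", cluster);
        }
    }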
Store Data with Hector
Create Keyspace:

    KeyspaceDefinition definition = new ThriftKsDef(keyspaceName);
    cluster.addKeyspace(definition);

Add column family:
    ColumnFamilyDefinition familyDefinition = new ThriftCfDef(keyspaceName, columnFamily);
    cluster.addColumnFamily(familyDefinition);

Write Data:

Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());

String columnValue = UUID.randomUUID().toString();
          mutator.insert(rowKey, columnFamily, HFactory.createStringColumn(columnName, columnValue));


Read Data:
        ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);

         columnQuery.setColumnFamily(columnFamily).setKey(key).setName(columnName);
         QueryResult<HColumn<String, String>> result = columnQuery.execute();
         HColumn<String, String> hColumn = result.get();

         System.out.println("Column: " + hColumn.getName() + " Value: " + hColumn.getValue() + "\n");
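
Not shown on the slide: deleting the column again uses the same Mutator; a one-line sketch reusing the rowKey, columnFamily and columnName variables from above.

        // Remove the column written above.
        mutator.delete(rowKey, columnFamily, columnName, new StringSerializer());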
Variable Consistency
    ● ANY: Wait until some replica has responded.

    ● ONE: Wait until one replica has responded.

    ● TWO: Wait until two replicas have responded.

    ● THREE: Wait until three replicas have responded.

    ● LOCAL_QUORUM: Wait for a quorum of replicas in the datacenter where the
      connection was established.

    ● EACH_QUORUM: Wait for quorum on each datacenter.

    ● QUORUM: Wait for a quorum of replicas (no matter which datacenter).

    ● ALL: Blocks for all the replicas before returning to the client.
Variable Consistency

Create a customized Consistency Level:

ConfigurableConsistencyLevel configurableConsistencyLevel = new ConfigurableConsistencyLevel();
Map<String, HConsistencyLevel> clmap = new HashMap<String, HConsistencyLevel>();


clmap.put("MyColumnFamily", HConsistencyLevel.ONE);

configurableConsistencyLevel.setReadCfConsistencyLevels(clmap);
configurableConsistencyLevel.setWriteCfConsistencyLevels(clmap);


HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);
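
A short sketch (not on the slide) of using the keyspace created above for a write, so the per-column-family levels registered in clmap take effect; the row key and values are placeholders.

    Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);

    Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());

    // Writes to MyColumnFamily now run at consistency level ONE, as registered in clmap.
    mutator.insert("00001", "MyColumnFamily", HFactory.createStringColumn("City", "Colombo"));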
CQL

Insert data with CQL:

  cqlsh> INSERT INTO Location (KEY, City) VALUES ('00001', 'Colombo');


Retrieve data with CQL

  cqlsh> select * from Location where KEY='00001';
Apache Hadoop


 ● Project Site: http://hadoop.apache.org

 ● Latest Version 1.0.1

 ● Hadoop is in use at Amazon, Yahoo, Adobe, eBay,
   Facebook

 ● Commercial support : http://hortonworks.com
                        http://www.cloudera.com
Hadoop Deployment Architecture
How to install Hadoop

 ● Download the artifact from:

                       http://hadoop.apache.org/common/releases.html

 ● Extract : tar -xzvf hadoop-1.0.1.tar.gz


 ● Copy the installation to each data node and extract it:


        scp hadoop-1.0.1.tar.gz user@datanode01:/home/hadoop

 ● Start Hadoop : $HADOOP_HOME/bin/start-all.sh
Hadoop CLI - HDFS


Format Namenode :

  $HADOOP_HOME/bin/hadoop namenode -format

File operations on HDFS:

  $HADOOP_HOME/bin/hadoop dfs -lsr /

  $HADOOP_HOME/bin/hadoop dfs -mkdir /users/deep/wso2
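
The same operations are also available from Java through the HDFS FileSystem API; a minimal sketch, assuming the NameNode is reachable at hdfs://localhost:9000 (adjust fs.default.name for your cluster).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:9000");

            FileSystem fs = FileSystem.get(conf);
            fs.mkdirs(new Path("/users/deep/wso2"));                  // like: hadoop dfs -mkdir

            for (FileStatus status : fs.listStatus(new Path("/"))) {  // like: hadoop dfs -ls /
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }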
MapReduce

 [MapReduce overview diagram]

 Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
Simple MapReduce Job

Mapper

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {

          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);

          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
      }
  }
Simple MapReduce Job
Reducer:
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

   public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
  }
Simple MapReduce Job

Job Runner:

     JobConf conf = new JobConf(WordCount.class);
     conf.setJobName("wordcount");

     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(IntWritable.class);

     conf.setMapperClass(Map.class);
     conf.setCombinerClass(Reduce.class);
     conf.setReducerClass(Reduce.class);

     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);

     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));

     JobClient.runJob(conf);
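
For completeness, a sketch of the driver class and imports that the Mapper, Reducer and job-runner snippets above assume (the old org.apache.hadoop.mapred API of Hadoop 1.x); the class and jar names are illustrative. Build a jar and run it with: hadoop jar wordcount.jar WordCount <input> <output>

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCount {

        // Map class from the Mapper slide goes here.

        // Reduce class from the Reducer slide goes here.

        public static void main(String[] args) throws Exception {
            // Job-runner code from this slide goes here.
        }
    }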
High-level MapReduce Interfaces

● Hive

● Pig
Q&A
