Joseph Adler, April 24, 2012
Don’t use Hadoop.
(Unless you have to.)
• What is Hadoop?

• Why do people use Hadoop?

• How does it work?

• When should you consider Hadoop?
What is Hadoop?
Apache Hadoop is an open source, Java-based
system for processing data on a network of
commodity servers using a map/reduce
paradigm.
How do people use Hadoop?
A few examples from the Apache site
  – Amazon search
  – Facebook log storage and reporting
  – LinkedIn’s People You May Know
  – Twitter data analysis
  – Yahoo! ad targeting
A search on LinkedIn shows that people in financial
services, biotech, oil and gas exploration, retail,
and other industries are using Hadoop.
Where did Hadoop come from?
• Hadoop was created by Doug Cutting. It’s
  named after his son’s toy elephant.
• Hadoop was written to support Nutch, an
  open source web search engine; it was spun
  out as its own project in 2006.
• Yahoo! invested in Hadoop,
  bringing it to “web scale” by
  2008.
Hadoop is open source
• Hadoop is an open source project (Apache
  license)
  – You can download and install it freely
  – You can also compile your own custom version of
    Hadoop
• There are three subprojects: Common, HDFS,
  and MapReduce
Hadoop is written in Java
• The good news: Hadoop runs on a JVM
  – You can run Hadoop on your workstation (for testing),
    on a private cluster, or in a cloud
  – You can write Hadoop jobs in Java, or in Scala, JRuby,
    Jython, Clojure, or any other JVM language
  – You can use other Java libraries
• The bad news: Hadoop was originally written by
  and for Java programmers.
  – You can do basic work without knowing Java. But you
    will quickly get stuck if you can’t write code.
Hadoop runs on a network of servers
Hadoop runs on commodity servers
• Doesn’t require very fast, very big, or very
  reliable servers
• Works better on good quality servers
  connected through a fast network
• Hadoop is fault tolerant—multiple copies of
  data, protection against failed jobs
When should you consider Hadoop?
•   A big problem
•   A fit for the Map/Reduce model
•   No need for real-time results
•   A technical team
Picking the right tool for the job

[Chart: number of data items, on a log scale from 1 to
1,000,000,000,000, versus the right tool for the job:
calculator, spreadsheet, numerical software, parallel
systems, ?]
Man / Reduce
• I need 7 volunteers:
  – 4 mappers
  – 3 reducers
• We’re going to show how map/reduce works
  by sorting and counting some notes.
What is Map/Reduce?
• You compute things in two phases
  – The map step
     • Reads the input data
     • Transforms the data
     • Tags each datum with a key and sends each datum to
       the right reducer
  – The reduce step
     • Collects all the data for each key
     • Does some work on the data for each key
     • Outputs the results
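To make the two phases concrete, here is a minimal single-machine
sketch in plain Java (no Hadoop involved). The input strings are
hypothetical stand-ins for the notes in the demo, and grouping by key
in a HashMap plays the role of routing each datum to the right reducer:

  import java.util.*;

  public class MapReduceSketch {
      public static void main(String[] args) {
          List<String> notes = Arrays.asList("jazz", "rock", "jazz", "pop");

          // Map step: read each datum and tag it with a key
          // (here, the note itself).
          Map<String, List<String>> byKey = new HashMap<String, List<String>>();
          for (String note : notes) {
              if (!byKey.containsKey(note)) {
                  byKey.put(note, new ArrayList<String>());
              }
              byKey.get(note).add(note);
          }

          // Reduce step: collect all the data for each key, do some
          // work on it (count it), and output the results.
          for (Map.Entry<String, List<String>> e : byKey.entrySet()) {
              System.out.println(e.getKey() + "\t" + e.getValue().size());
          }
      }
  }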
Map/Reduce is over 100 years old
• Hollerith tabulating machines sorted and counted
  punch cards for the 1890 US census
Good fits for Map/Reduce
• Aggregating unstructured data to enter into a
  database (ETL)
• Creating email messages
• Processing log files and creating reports
Problems that don’t perfectly fit
• Logistic regression
• Matrix operations
• Social graph calculations
Batch computation
Hadoop is a shared system that allocates
resources to jobs from a queue. It's not a
real-time system.
Coding example
Suppose we have some log files with timestamped
events (say, page views). Let's count the number of
events by day!

Sample data:

  1335300359000,Home Page, Joe
  1335300359027,Login,
  1335300359031,Home Page, Romy
  1335300369123,Settings, Joe
  …
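
The timestamps are Unix epoch times in milliseconds, and all four
sample rows fall on the same day (2012-04-24, in UTC). Counting just
those rows, the job developed below would emit one line per day, e.g.:

  2012-04-24    4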
A Java Example
• Mappers will
  – Read the input files
  – Extract the timestamp
  – Round to the nearest day
  – Set the output key to the day
• Reducers will
  – Iterate through records by day, counting records
  – Output the count for each day
A Java example (Mapper)
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormatter;
import org.joda.time.format.ISODateTimeFormat;

public class exampleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        String[] values = line.split(",");
        // The first field is a Unix timestamp in milliseconds.
        long timeStampLong = Long.parseLong(values[0]);
        DateTime timeStamp = new DateTime(timeStampLong);
        // Emit the ISO date (yyyy-MM-dd) as the key, so all events
        // from the same day go to the same reducer.
        DateTimeFormatter dateFormat = ISODateTimeFormat.date();
        output.collect(new Text(dateFormat.print(timeStamp)),
                new Text(line));
    }
}
A Java example (Reducer)
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class exampleReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, LongWritable> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, LongWritable> output,
            Reporter reporter) throws IOException {
        long count = 0;
        while (values.hasNext()) {
            values.next();  // consume the value, or this loop never ends
            count++;
        }
        // One output record per day: the date key and its event count.
        output.collect(key, new LongWritable(count));
    }
}
A Java example (job file)
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class exampleJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), getClass());
        conf.setJobName("Count events by date");

        conf.setInputFormat(TextInputFormat.class);
        TextInputFormat.addInputPath(conf, new Path(args[0]));

        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        // The mapper emits Text values, not LongWritable, so the map
        // output value class must be declared separately.
        conf.setMapOutputValueClass(Text.class);
        TextOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(exampleMapper.class);
        conf.setReducerClass(exampleReducer.class);

        JobClient.runJob(conf);
        return 0;
    }
}
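
The job class above has no main method. A minimal sketch of the usual
entry point, added to exampleJob, uses Hadoop's ToolRunner so that
generic command-line options are parsed before run() is called:

  public static void main(String[] args) throws Exception {
      // ToolRunner applies generic Hadoop options, then invokes run().
      int exitCode = ToolRunner.run(new exampleJob(), args);
      System.exit(exitCode);
  }

You would then package the classes into a jar and launch it with
something like hadoop jar count-events.jar exampleJob <input path>
<output path> (the jar name here is hypothetical).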
• Tools that make it easier to use Hadoop:

  – Hive
  – Pig
  – Cascading
Cascading
• Tool for constructing Hadoop workflows in Java
• Example:
  Scheme pvScheme = new TextLine(new Fields("timestamp", ...));
  Tap source = new Hfs(pvScheme, inpath);
  Scheme countScheme = new TextLine(new Fields("date", "count"));
  Tap sink = new Hfs(countScheme, outpath);

  Pipe assembly = new Pipe("pagesByDate");
  // Format the millisecond timestamp as a "date" field.
  Function function = new DateFormatter(new Fields("date"),
   "yyyy/MM/dd");
  assembly = new Each(assembly, new Fields("timestamp"), function);
  assembly = new GroupBy(assembly, new Fields("date"));
  Aggregator count = new Count(new Fields("count"));
  assembly = new Every(assembly, count);

  Properties properties = new Properties();
  FlowConnector.setApplicationJarClass(properties, Main.class);
  FlowConnector flowConnector = new FlowConnector(properties);
  Flow flow = flowConnector.connect("pagesByDate", source, sink,
    assembly);
  flow.complete();
Pig
• Tool to write SQL-like queries against Hadoop
• Example:
  define TODATE
   org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay();
  %declare now `date "+%s000"`;
  page_views = LOAD 'PAGEVIEWS' USING PigStorage()
    AS (timestamp:long, page:chararray, user:chararray);
  last_week = FILTER page_views BY timestamp > $now - 86400000 * 7;
  truncated = FOREACH last_week GENERATE *,
    TODATE(timestamp) AS date;
  grouped = GROUP truncated BY date;
  counted = FOREACH grouped GENERATE group AS date,
    COUNT_STAR(truncated) AS N;
  sorted = ORDER counted BY date;
  STORE sorted INTO 'results' USING PigStorage();
Hive
• Tool from Facebook that lets you write SQL
  queries against Hadoop
• Example code:

  SELECT TO_DATE(timestamp), COUNT(*)
  FROM PAGEVIEWS
  WHERE timestamp > unix_timestamp() * 1000 - 86400000 * 7
  GROUP BY TO_DATE(timestamp)
  ORDER BY TO_DATE(timestamp)
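
The query assumes a PAGEVIEWS table is already registered in the Hive
metastore. A hypothetical definition matching the comma-separated
sample logs (the column names, types, and location are all
assumptions) might look like:

  CREATE EXTERNAL TABLE pageviews (
    timestamp BIGINT,   -- epoch milliseconds
    page STRING,
    user STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/logs/pageviews';   -- hypothetical HDFS path

(In newer Hive versions, timestamp and user are keywords and would
need backtick quoting.)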
Some important related projects
•   HBase
•   NextGen Hadoop (0.23)
•   ZooKeeper
•   Mahout
•   Giraph
What to do next
• Watch training videos at
  http://www.cloudera.com/resource-types/video/

• Get Hadoop (including the code!) at
      http://hadoop.apache.org

• Get commercial support from
      http://www.cloudera.com/
   or http://hortonworks.com/

• Run it in the cloud with Amazon Elastic MapReduce:
     http://aws.amazon.com/elasticmapreduce/

Speaker Notes

  1. Thanks for having me here today as part of Big Data Week. For a lot of people, Hadoop is big data. Today, I’m here to share my experience as a Hadoop user. I use Hadoop every day at LinkedIn because it helps me get my work done. Ask audience: Who uses Hadoop now? Who is thinking about it? Who sort of knows what Hadoop is for, but isn’t sure how it helps them?
  2. Hadoop can help you if you have a gigantic amount of data. You can do things with Hadoop that are hard to do with any other off-the-shelf tool. But Hadoop can be a handful.
  3. I’m hoping that you leave here today knowing what Hadoop is.
  4. Open source; Java-based; a network of servers; commodity servers; map/reduce.
  5. The biggest users are mostly web companies: Amazon builds their search indices on Hadoop. Facebook processes all their usage logs on Hadoop. (They also store photos with HBase.) I bet they do other things as well. Twitter uses Hadoop for data analysis. Yahoo! uses Hadoop for many things, including a lot of their advertising models. eBay and Netflix use Hadoop as well. And a lot more people are using Hadoop for some tasks.
  6. The source code for Hadoop is freely available and easy to modify. But that doesn’t mean it’s cheap and easy to run. It takes a lot of operational expertise to set up and run a system with hundreds or thousands of computers. Every big Hadoop shop has a team of developers and operations people who keep the system running. We’ve modified the Hadoop scheduler, added extra code for debugging, and fixed quite a few bugs.
  7. I have become very good at reading Java stack traces.
  8. Hadoop was designed to run on commodity servers. It doesn’t need servers with super-fast processors, huge amounts of memory, solid state disks, or any other exotic features. But that doesn’t mean you should just run down to Fry’s and buy the cheapest computers you can find. Cheap computers fail more often. You need to find a good balance between cost and reliability. By the way, Hadoop runs really well on cloud services.
  9. Even really good quality computers fail, and Hadoop was designed to deal with that problem. If the probability of a machine failing is 1/1000 for a given day, you’re going to see failures when you have thousands of computers. As a user, you don’t usually have to worry too much about how Hadoop runs your jobs. But sometimes, understanding what Hadoop is doing can help you understand what the system is up to.
  10. Let’s talk about each of these things. Hadoop is great for doing all the data munging that you do at the start of a data project.
  11. Mentally, this is my hierarchy of tools. As your data gets bigger, it takes more work to use each tool, so I try not to overshoot. [Should add in databases and Python tools in the middle, between R and Hadoop.] But sometimes, you have to upgrade. For example, suppose that it takes 25 hours to analyze 24 hours of data on your desktop…
  12. As we said before, for your problem to fit, your problem should meet 4 criteria… one of them is that it has to work with Map/Reduce. To help explain map/reduce, we’re going to use map/reduce here to do some work. [Ask for volunteers.]
  13. The key is used to group data together and to route it to the right reducer.
  14. At LinkedIn, we have hundreds of users on our Hadoop system running dozens of jobs. It’s pretty busy in the middle of the day. Unlike some other tools (like Oracle), Hadoop won’t start working on your problem until earlier jobs finish. It’s a very efficient way to use resources, but it could mean that you have to wait around for a long time.
  15. So far, we’ve talked about who uses Hadoop, and how Hadoop works. I’d like to show an example of what you see as a Hadoop user: how do you write programs for Hadoop? In practice, you might have many input files from many different web servers. Or maybe one giant file. Either way, Hadoop can split up those files to divide the processing work across the cluster.
  16. Most Java map/reduce jobs have three parts: a mapper, a reducer, and a job file. I’m going to walk through all three of them here.
  17. Here is part of the Java Map/Reduce job for doing this calculation. At this point, it should be clear why we didn’t make this a hands-on session. I’m not going to explain everything that’s going on here, but I’ll point out a few pieces of how this works.
  18. All the records for a key are handled by a specific reducer. In this case, that means that all the records for each date will be sent to a single reducer, so all we have to do is to count those records.
  19. Lastly, you connect everything together with a job file and run it.
  20. I’ve probably scared off a lot of people in this room by showing the Java Map/Reduce code. Luckily, there are some simpler ways to solve the problem.
  21. One of the coolest things about Cascading is that you can use it from other JVM languages: Jython, JRuby, Clojure, and Scala.
  22. Don’t need a lot of software, can run from your workstation
  23. Hive is great, but it takes some work to set it up. It’s great for working with unstructured data… The big disadvantage of Hive is that every operation is a full table scan. With a database like Oracle, data is stored with indexes, so you can quickly look up single values. Hive is good for large calculations, bad for lookups. Another issue with Hive is that it’s not as mature as most databases. You can easily see a Java stack trace.
  24. Wrap up, then more things to know