4. What is Hadoop?
It is an open source Apache Foundation
project for large-scale data processing
It was inspired by Google’s MapReduce
and Google File System (GFS) papers
It was originally conceived by Doug
Cutting
Incidentally, it is named after his son's
toy elephant
6. So what’s the big deal?
Scalable: New nodes can be added as
needed, without changing the formats
Flexible: It is schema-less, and can
absorb any type of data, structured or
not, from any number of sources
Fault tolerant: System redirects work to
another location if a node fails
7. Hadoop = HDFS + MapReduce
HDFS: For storing massive datasets
using low-cost storage
MapReduce: The algorithm on which
Google built its empire
8. HDFS
It is a fault-tolerant storage system
Able to store huge amounts of
information
It creates clusters of machines and
coordinates work among them
If one machine fails, it continues to
operate the cluster without losing data or
interrupting work, by shifting work to the
remaining machines in the cluster
9. HDFS
It manages storage on the cluster by
breaking incoming files into
pieces, called blocks
Stores each of the blocks redundantly
across the pool of servers
It stores three complete copies of each
file by copying each piece to three
different servers
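The block-and-replica scheme on this slide can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: the 4-byte block size, the server names, and the round-robin placement are all illustrative (real HDFS uses blocks of tens of megabytes and rack-aware placement).

```python
import itertools

BLOCK_SIZE = 4          # toy value; real HDFS blocks are tens of MB
REPLICATION = 3         # HDFS default: three copies of every block
SERVERS = ["s1", "s2", "s3", "s4", "s5"]

def place_blocks(data):
    """Split data into fixed-size blocks and assign each block's
    replicas to REPLICATION distinct servers (round-robin here)."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    rotation = itertools.cycle(SERVERS)
    placement = {}
    for idx, block in enumerate(blocks):
        placement[idx] = (block, [next(rotation) for _ in range(REPLICATION)])
    return placement

layout = place_blocks(b"hello hdfs world")
# 16 bytes -> 4 blocks, each stored on 3 of the 5 servers
```

Because three consecutive picks from a cycle of five servers are always distinct, every block ends up on three different machines, matching the "three complete copies" rule above.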
Hadoop is only one project under the Apache Foundation. According to IDC, the amount of digital information produced in 2012 will be ten times that produced in 2006: 1,800 exabytes. The majority of this data will be "unstructured": complex data poorly suited to management by structured storage systems like relational databases.
1 Petabyte [where most SME corporations are?]
1 Exabyte [where most large corporations are?]
1 Zettabyte [where leaders like Facebook and Google are]
Flexible: Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide. 80% of the world's data is unstructured, and most businesses don't even attempt to use this data to their advantage. Imagine if you had a way to analyze that data.
HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.

MapReduce refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map.

MapReduce was first presented to the world in a 2004 white paper by Google that laid out its salient insights. Yahoo re-implemented the technique and open sourced it via the Apache Foundation.

As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times: the census bureau would dispatch its people to each city in the empire. Each census taker would count the number of people in that city and return the result to the capital. There, the results from each city would be reduced to a single count (the sum over all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire serially.

Large volumes of complex data can hide important insights. Are there buying patterns in point-of-sale data that can forecast demand for products at particular stores? Do user logs from a website, or calling records in a mobile network, contain information about relationships among individual customers? Companies that can extract facts like these from huge volumes of data can better control processes and costs, better predict demand, and build better products.
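The map-then-reduce sequence described above can be sketched as a word count, MapReduce's classic introductory example. This is a single-process imitation of what Hadoop distributes across a cluster; the function names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: break the input into (key, value) tuples -- here, (word, 1)."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: combine the tuples for each key into a smaller set."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Like the Roman census: each "city" (document) is counted independently,
# then the per-city tallies are summed in one place.
documents = ["rome rome carthage", "carthage rome"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
totals = reduce_phase(mapped)
# totals == {"rome": 3, "carthage": 2}
```

In real Hadoop, each map task runs on the node holding its block of input, and a shuffle phase routes all pairs with the same key to the same reducer.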
HDFS: Hadoop Distributed File System
MapReduce: Parallel data-processing framework
Hadoop Common: A set of utilities that support the Hadoop subprojects
HBase: Hadoop database for random read/write access
Hive: SQL-like queries and tables on large datasets
Pig: Data flow language and compiler
Oozie: Workflow for interdependent Hadoop jobs
Sqoop: Integration of databases and data warehouses with Hadoop
Flume: Configurable streaming data collection
ZooKeeper: Coordination service for distributed applications
Hue: User interface framework and SDK for visual Hadoop applications
In the very simple example shown, any two servers can fail, and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of missing data from the replicas it manages. Because the cluster stores several copies of every block, more clients can read them at the same time without creating bottlenecks.
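The recovery behaviour described here — noticing lost replicas and re-creating them from surviving copies — can be imitated in a few lines. This is a toy model with hypothetical function and server names, not HDFS's actual NameNode logic.

```python
def surviving_replicas(placement, failed):
    """Map each block to the servers that still hold a copy."""
    return {blk: [s for s in servers if s not in failed]
            for blk, servers in placement.items()}

def re_replicate(placement, failed, spare_servers, target=3):
    """Copy under-replicated blocks to healthy servers until each
    block is back at the target replica count."""
    healthy = surviving_replicas(placement, failed)
    for blk, servers in healthy.items():
        candidates = [s for s in spare_servers if s not in servers]
        while len(servers) < target and candidates:
            servers.append(candidates.pop(0))
    return healthy

# Two of five servers fail; every block still has at least one live copy,
# and the missing replicas are re-created on the survivors.
placement = {0: ["s1", "s2", "s3"], 1: ["s3", "s4", "s5"], 2: ["s1", "s4", "s5"]}
after = re_replicate(placement, failed={"s1", "s2"},
                     spare_servers=["s3", "s4", "s5"])
```

With three replicas spread across distinct servers, any two failures leave at least one live copy of every block, which is exactly the property the example in the text relies on.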
Each server runs the analysis on its own blocks of the file; the results are collated and digested into a single result after each piece has been analyzed. Running the analysis on the nodes that actually store the data delivers much better performance than reading the data over the network from a single centralized server. Hadoop monitors jobs during execution and will restart work lost to node failure if necessary. In fact, if a particular node is running very slowly, it will restart that node's work on another server holding a copy of the data.
All of the above companies use Hadoop for a variety of tasks, including marketing, advertising, and sentiment and risk analysis. IBM used the software as the engine for its Watson computer, which competed with the champions of the TV game show Jeopardy!
Foursquare aims to let your friends in almost every country know where you are, and to figure out where they are. As a platform, it is now aware of 25+ million venues worldwide, each of which can be described by unique signals about who is coming to these places, when, and for how long. To reward and incentivize users, Foursquare allows frequent users to collect points, prize "badges," and eventually coupons for check-ins.