Course Instructor: Dr. C. Sreedhar
BIG DATA ANALYTICS
B.Tech VII Sem CSE A
*Note: Some images are downloaded and used from internet sources
Unit I
 What is Big Data Analytics
 Why this sudden hype around big data analytics
 Classification of Analytics
 Top Challenges facing big data
 Few top analytics tools
 Introduction to Hadoop;
 HDFS, HDFS Commands
 Processing Data with Hadoop
 Managing Resources and Applications with Hadoop YARN
 Interacting with Hadoop Ecosystem
Unit II
 Understanding MapReduce & YARN:
 The Map Reduce Framework Concept
 Developing Simple MapReduce Application
 Points to consider while designing mapreduce
 YARN background
 YARN architecture
 Working of YARN
Unit III
 Analyzing Data with Pig
 Introducing Pig
 Running Pig
 Getting started with pig latin
 Working with operators in pig
 Debugging pig
Unit IV
 Understanding HIVE:
 Introducing Hive
 Hive services
 Builtin functions in Hive
 Hive DDL
 Data manipulation in Hive
Unit V
 NoSQL Data Management:
 Introduction to NoSQL
 characteristics of NoSQL
 Types of NoSQL data models
 Schema less databases
BDA (P): List of Experiments
Big Data: Introduction
What ?
Why ?
Who ?
How ?
Existing ?
When ?
Applications ?
Big Data: common misconceptions
 Expensive
 Machine Data
 Quality Data
 Always right
 100% accurate
Big Data is NOT:
 A Self-Learning Algorithm
 Solution for every Business
 Meant only for Data Scientists
 Magic that changes overnight
Traditional method of file management
[Diagram] Each application program (Patients program, Doctors program, Wards program, Rooms program) maintains its own file (Patients, Doctors, Wards, Rooms) and serves its own set of users.
What Big Data is
 Big Data is about the extraction of actionable or
useful information from very large datasets.
 Big data is a field that treats ways to analyze,
systematically extract information from, or otherwise
deal with data sets that are too large or complex to
be dealt with by traditional data-processing
application software.
Big Data
 The importance of big data does not revolve around
how much data the organization/company has, but
what can be done with such massive volumes of data.
 Big Data helps in
 cost reductions,
 time reductions,
 new product development and optimized offerings and
 smart decision making.
Big Data in the real world
 By 2020, there will be around 40 trillion gigabytes of data
(40 zettabytes).
 90% of all data has been created in the last few years.
 Today it would take a person approximately 181 million years
to download all the data from the internet.
 In 2012, only 0.5% of all data was analyzed.
 In 2018, internet users spent 2.8 million years online.
 Social media accounts for 33% of the total time spent online.
Big Data in the real world
 97.2% of organizations are investing in big data and AI.
 Using big data, Netflix saves $1 billion per year on
customer retention.
 Job listings for data science and analytics will reach
around 2.7 million by 2020.
 Automated analytics will be vital to big data by 2022.
 The big data analytics market is set to reach $103 billion
by 2023
What Big Data is
Big datasets are too large and complex to be
processed by traditional methods.
Consider that in a single minute, there are approximately:
 300,000 Instagram posts
 500,000 tweets sent
 4,500,000 YouTube videos watched
 4,500,000 Google searches
 200 million emails sent
Big Data
How do organizations optimize the values of big data?
 Set a big data strategy
 Identify big data sources
 Access, manage and store big data
 Analyze big data
 Make data-driven decisions
Big Data: Definition
Gartner:
 Big data is high-volume, high-velocity and/or
high-variety information assets that demand cost-
effective, innovative forms of information
processing that enable enhanced insight, decision
making, and process automation.
Characteristics of Big Data
 Volume
 Variety
 Velocity
 Veracity
What is Big Data Analytics
 Analytics, in general, involves the use of mathematical
or scientific methods to generate insight from data
 Big data analytics is the use of advanced analytic
techniques against very large, diverse data sets that
include structured, semi-structured and unstructured
data, from different sources, and in different sizes
from terabytes to zettabytes.
What is Big Data Analytics
 Technology-enabled analytics:
 Quite a few data analytics and visualization tools are available
in the market today from leading vendors such as IBM, Tableau,
SAS, R Analytics, Statistica, World Programming Systems (WPS),
etc. to help process and analyze big data.
 About gaining a meaningful, deeper, and richer insight into
the business to steer it in the right direction, and understanding
customer demographics to cross-sell and up-sell to them.
What is Big Data Analytics
 Handshake between three communities: IT, business users,
and data scientists.
 Working with datasets whose volume and variety exceed
current storage, processing capabilities and infrastructure.
 About moving code to data. This makes perfect sense as
the program for distributed processing is tiny (just a few
KBs) compared to the data (Terabytes or Petabytes today
and likely to be Exabytes or Zettabytes in the near future).
Why this sudden hype around BDA?
 Following are some of the reasons for sudden hype about BDA:
 Data is growing at a 40% compound annual rate, reaching nearly 45
ZB by 2020.
 Volume of business data worldwide is expected to double every 1.2 years.
 500 million "tweets" are posted by Twitter users every day.
 2.7 billion "Likes" and comments are posted by Facebook users in a day.
 90% of the world’s data created in the past 2 years.
 Cost per gigabyte of storage has hugely dropped.
 There are an overwhelming number of user-friendly analytics tools
available in the market today.
Classification of Analytics
 There are basically two schools of thought:
 Those that classify analytics into basic,
operationalized, advanced, and monetized.
 Those that classify analytics into analytics 1.0,
analytics 2.0, analytics 3.0 and analytics 4.0.
First School of thought
Basic analytics:
used to explore your data in a graphical manner where the data provides some
value through simple visualizations
Operationalized analytics:
Operationalized analytics includes several concepts like data discovery, decision
management, information delivery
Advanced analytics:
Provide analytical algorithms for executing complex analysis of either structured
or unstructured data
Monetized analytics:
This is analytics in use to derive direct business revenue.
Second School of Analytics
 Analytics 1.0
 Data sources relatively small and
structured, from internal systems
 Majority of analytical activity
was descriptive analytics, or
reporting
 Creating analytical models was a
time-consuming batch process
 Decisions were made based on
experience and intuition
 Analytics 2.0
 Data sources are big,
complex, unstructured, fast
moving data
 Rise of Data Scientists
 Rise of Hadoop & open
source
 Visual Analytics
Second School of Analytics
 Analytics 3.0
 Mix of all data
 Internal/external
products/decisions
 Analytics a core capability
 Move at speed & scale
 Predictive & prescriptive
analytics
 Analytics 4.0
 Analytics embedded,
invisible and automated
 Cognitive technologies
 Robotic process automation
for digital tasks
 Augmentation and not
automation
Classification of Analytics
 Descriptive analytics
 Diagnostic analytics
 Predictive analytics
 Prescriptive analytics
Challenges in Big Data
 Capture
 Cleaning
 Availability
 Integration
 Storage
 Processing
 Indexing
 Security
 Sharing
 Consistency
 Partition tolerance
 Analysis
 Visualization
Top Analytics Tools
 Apache Hadoop
 Apache Spark
 Apache Storm
 Apache Cassandra
 Tableau
 HBase
 Windows Azure
 Splunk
 Talend
 Elastic
 Apache Pig
 Lumify
Apache Hadoop
 is a collection of open-source software utilities
that facilitate using a network of many
computers to solve problems involving massive
amounts of data and computation.
 It provides a software framework for distributed
storage and processing of big data using the
MapReduce programming model
Apache Spark
 is an open-source distributed general-purpose
cluster-computing framework.
 is a unified analytics engine for large-scale data
processing.
 provides an interface for programming entire
clusters with implicit data parallelism and fault
tolerance
Apache Storm
 A system for processing streaming data in real time
 adds reliable real-time data processing capabilities
to Enterprise Hadoop
 It is distributed, resilient and real-time
Cassandra
 is a free and open-source, distributed, wide column
store, NoSQL database management system
designed to handle large amounts of data across
many commodity servers, providing high availability
with no single point of failure
 is the right choice when you need scalability and
high availability without compromising performance
Tableau
 Tableau empowers business users to quickly and
easily find valuable insights in their vast Hadoop
datasets.
 Tableau removes the need for users to have
advanced knowledge of query languages by
providing a clean visual analysis interface
Lumify
 LUMIFY is a powerful big data fusion, analysis, and
visualization platform that supports the
development of actionable intelligence.
 Lumify is possibly the choice for those poring over
11 million-plus documents
Windows Azure
 is a cloud computing service created by Microsoft
for building, testing, deploying, and managing
applications and services through Microsoft-
managed data centers
Splunk
 Splunk is a software platform to search, analyze
and visualize the machine-generated data
gathered from websites, applications, sensors,
devices, etc., which make up the IT infrastructure and
business.
Talend
 Talend is an open source software integration
platform that helps you effortlessly turn data
into business insights.
 provides various software and services for data
integration, data management, enterprise
application integration, data quality, cloud storage
and Big Data.
HBase
 is an open-source non-relational distributed
database modeled after Google's Bigtable and
written in Java.
 It is developed as part of Apache Software
Foundation's Apache Hadoop project and runs on
top of HDFS, providing Bigtable-like capabilities for
Hadoop
Hive
 is a data warehouse software project built on top of
Apache Hadoop for providing data query and
analysis.
 Hive gives an SQL-like interface to query data
stored in various databases and file systems that
integrate with Hadoop
Apache Pig
 is a high-level platform for creating programs that
run on Apache Hadoop.
 It is a tool/platform used to analyze large
sets of data by representing them as data flows.
 All the data manipulation operations in
Hadoop can be performed using Apache Pig.
Introduction to Hadoop
 Hadoop is an open source framework that allows you to
store (HDFS) and process (MapReduce) large data sets
in a distributed and parallel manner.
Traditional DB vs. Hadoop
 Data movement: In a traditional database system, data is stored in a central location and sent to the processor at runtime. In Hadoop, the program goes to the data: Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located (distributed computation).
 Scale: Traditional database systems cannot be used to process and store a significant amount of data (big data). Hadoop works better when the data size is big; it can process and store a large amount of data efficiently and effectively.
 Variety: A traditional RDBMS is used to manage only structured and semi-structured data; it cannot be used to handle unstructured data. Hadoop can process and store a variety of data, whether structured or unstructured.
History of Hadoop
 1997: Doug Cutting developed Lucene, an open-source, Java-based indexing and search library.
 2001: Mike Cafarella focused on indexing the entire web.
 Problems:
 Schema-less (no tables and columns)
 Durable (once written, data should never be lost)
 Capability of handling component failure (CPU, memory, network)
 Automatically re-balanced (disk space consumption)
 2003: Google published the GFS paper; Nutch DFS was developed.
 Problem of durability and fault tolerance was still not solved.
History of Hadoop…
 Solution: distributed processing; the file system was divided into 64 MB chunks, with each
chunk stored on 3 different nodes (replication factor).
 2004: Google published a paper: MapReduce – Simplified Data Processing on Large
Clusters.
 Solved the problems of parallelization, distribution and fault tolerance.
 2005: Cutting integrated MapReduce into Nutch.
 2006: Cutting named it Hadoop, which included HDFS and MR.
 2008: Hadoop was licensed under the Apache Software Foundation.
 Certain problems/enhancements in Hadoop created sub-projects like Hive, Pig,
HBase and ZooKeeper.
Hadoop
 Hadoop runs code across a cluster of computers. This process
includes the following core tasks that Hadoop performs −
 Data is divided into directories and files (uniform-sized blocks of 128 MB)
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checks for successful execution of the code
 Performs the sort phase that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
 Hadoop is a good choice for:
 Indexing log files
 Sorting vast amounts of data
 Image analysis
 Search engine optimization
 Analytics
 Hadoop is a poor choice for:
 Calculating the value of pi to 1,000,000 digits
 Calculating Fibonacci sequences
 Small structured data
Components of Hadoop
 HDFS
 YARN
 MapReduce
HDFS
 HDFS is a distributed file system that is fault tolerant,
scalable and extremely easy to expand.
 HDFS is the primary distributed storage for Hadoop
applications.
 HDFS provides interfaces for applications to move
themselves closer to data.
 HDFS is designed to process large data sets with
write-once-read-many semantics
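The write-once-read-many model shows up directly in the HDFS client API. Below is a minimal, hedged Java sketch using Hadoop's FileSystem API; the class name HdfsReadWriteSketch and the path /tmp/sample.txt are illustrative assumptions, not taken from the slides.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the configured file system
        Path file = new Path("/tmp/sample.txt");         // hypothetical HDFS path

        // Write once: create the file and close the stream.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read many: any number of clients can now open the file and read its blocks.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}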
HDFS
[Architecture diagram] A NameNode (supported by a Secondary NameNode) manages a set of DataNodes (Datanode1, Datanode2, ...).
Namenode
 It is master daemon that maintains and manages the DataNodes (slave nodes)
 It records metadata of all blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
 It records each and every change that takes place to file system metadata
 If a file is deleted in HDFS, NameNode will immediately record this in EditLog
 It regularly receives a Heartbeat and a block report from all the DataNodes
in the cluster to ensure that the DataNodes are alive
 It keeps record of all blocks in HDFS and DataNode in which they are stored
DataNode
 It is the slave daemon which runs on each slave machine
 The actual data is stored on DataNodes
 It is responsible for serving read and write requests from the clients
 It is also responsible for creating blocks, deleting blocks and
replicating the same based on the decisions taken by the
NameNode
 It sends heartbeats to the NameNode periodically to report the
overall health of HDFS, by default, this frequency is set to 3 seconds
Secondary Namenode
 Works as a helper node to primary NameNode
 Downloads the Fsimage file and edit logs file from
NameNode
 Reads from RAM of NameNode and stores it to hard
disks periodically.
 If the NameNode fails, the last saved FsImage on the Secondary
NameNode can be used to recover the file system
metadata
HDFS Commands
 Listing files
 Read / Write files
 Upload / Download files
 File Management
 Permissions
 File System
 Administration
List: Files
hdfs dfs -ls /
List all the files/directories for the given HDFS destination path.
hdfs dfs -ls -d /inputnew2
Directories are listed as plain files. In this case, this command will list the details of the inputnew2 folder.
hdfs dfs -ls -R /user
Recursively list all files in the /user directory and all its subdirectories.
hdfs dfs -ls /inputnew*
List all the files matching the pattern. In this case, it will list all the files inside the present directory which start with 'inputnew'.
Read / Write Files
hdfs dfs -text /inputnew/inputFile.txt
Takes a source file and outputs the file in text format on the terminal.
hdfs dfs -cat /inputnew/inputFile.txt
Displays the content of the HDFS file inputFile.txt on stdout.
hdfs dfs -appendToFile /home/ubuntu/test1 /hadoop/text2
Appends the content of the local file test1 to the HDFS file text2.
Upload / Download Files
hdfs dfs -put /home/ubuntu/sample /hadoop
Copies the file from the local file system to HDFS.
hdfs dfs -put -f /home/ubuntu/sample /hadoop
Copies the file from the local file system to HDFS; if the file already exists in the given destination path, the -f option with the put command will overwrite it.
hdfs dfs -get /newfile /home/ubuntu/
Copies the file from HDFS to the local file system.
hdfs dfs -copyFromLocal /home/ubuntu/sample /hadoop
Works similarly to the put command, except that the source is restricted to a local file reference.
hdfs dfs -copyToLocal /newfile /home/ubuntu/
Works similarly to the get command, except that the destination is restricted to a local file reference.
File management
hdfs dfs -cp /hadoop/file1 /hadoop1
Copies a file from source to destination on HDFS. In this case, copies file1 from the hadoop directory to the hadoop1 directory.
hdfs dfs -cp -f /hadoop/file1 /hadoop1
Copies a file from source to destination on HDFS. Passing -f overwrites the destination if it already exists.
hdfs dfs -mv /hadoop/file1 /hadoop1
Moves files that match the specified file pattern <src> to a destination <dst>. When moving multiple files, the destination must be a directory.
hdfs dfs -rm /hadoop/file1
Deletes the file (sends it to the trash).
hdfs dfs -rmdir /hadoop1
Deletes a directory.
hdfs dfs -mkdir /hadoop2
Creates a directory in the specified HDFS location.
Permissions
hdfs dfs -touchz /hadoop3
Creates a file of zero length at <path> with the current time as the timestamp of that <path>.
hdfs dfs -chmod 755 /hadoop/file1
Changes the permissions of the file.
hdfs dfs -chown ubuntu:ubuntu /hadoop
Changes the owner of the file; the first ubuntu in the command is the owner and the second one is the group.
File System
hdfs dfs -df /hadoop
Shows the capacity, free and used space of the filesystem.
hdfs dfs -df -h /hadoop
Shows the capacity, free and used space of the filesystem; the -h parameter formats the file sizes in a human-readable fashion.
hdfs dfs -du /hadoop/file
Shows the amount of space, in bytes, used by the files that match the specified file pattern.
hdfs dfs -du -s /hadoop/file
Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size.
Administration
hadoop version
Checks the version of Hadoop.
hdfs fsck /
Checks the health of the Hadoop file system.
hdfs dfsadmin -safemode leave
Turns off the safemode of the NameNode.
hdfs dfsadmin -refreshNodes
Re-reads the hosts and exclude files to update the set of DataNodes that are allowed to connect to the NameNode and those that should be decommissioned or recommissioned.
hdfs namenode -format
Formats the NameNode.
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Data processing using programming
 Spark: In-memory Data Processing
 PIG, HIVE: Data Processing Services using Query (SQL-like)
 Hbase: NoSQL Database
 Mahout, Spark MLlib: Machine Learning
 Apache Drill: SQL on Hadoop
 Zookeeper: Managing Cluster
 Oozie: Job Scheduling
 Flume, Sqoop: Data Ingesting Services
 Solr & Lucene: Searching & Indexing
 Ambari: Provision, Monitor and Maintain cluster
MapReduce Framework
 Need for parallel distribution of tasks
 Automatic expansion and contraction of processes
 Enables continuation of processes w/o being affected by
network failures or system failures
 MapReduce: Map and reduce
 Map and reduce do not modify the original data; instead, they create new data
structures to hold their output
Features of MapReduce
 Scheduling
 Synchronization
 Data locality
 Handling of errors/faults
 Scale out architecture
MapReduce Framework
 Very large scale data: peta, exa bytes
 Write once and read many data: allows for parallelism without mutexes
 Map and Reduce are the main operations: simple code
 There are other supporting operations such as combine and partition.
 All the map should be completed before reduce operation starts.
 Map and reduce operations are typically performed by the same physical processor.
 Number of map tasks and reduce tasks are configurable.
 Operations are provisioned near the data.
 Commodity hardware and storage.
 Special distributed file system. Example: Hadoop Distributed File System (HDFS)
Working of MapReduce
 MapReduce programming model works on an algorithm to execute the map and
reduce operations. Algorithm steps as follows:
 Take a large dataset or set of records
 Perform iteration over the data
 Extract interesting patterns to prepare an o/p list using map function
 Arrange o/p list to enable optimization for further processing
 Compute a set of results by using reduce function
 Provide the final output
 The MapReduce model executes a given task by dividing it into two functions: map and reduce.
The map function is executed first, in parallel on different machines. The reduce function
takes the output of the map function to present the final output in an aggregate form.
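As a concrete illustration of the map and reduce functions described above, here is a minimal, hedged word-count sketch using Hadoop's Java MapReduce API; the class names and the whitespace tokenization are illustrative assumptions, not taken from the slides.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for every input line, emit an intermediate (word, 1) pair per token.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: sum all counts received for a word and emit the final (word, total) pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(word, new IntWritable(total));
    }
}
The mapper emits an intermediate (word, 1) pair for every token; the framework sorts and groups these pairs by key before the reducer aggregates them into the final output.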
MapReduce framework
 JobTracker receives the jobs from client applications to process
large information.
 These jobs are assigned in the forms of individual tasks (after a
job is divided into smaller parts) to various TaskTrackers.
 The task distribution is transmitted to the reduce function so that
the final, integrated output, which is an aggregate of the data
processed by the map function, can be provided.
 A cluster uses commodity servers as nodes. The data
processing job is accomplished through MapReduce and HDFS
 Input is provided from large data files in the form of key-value pair,
which is the standard input format in a Hadoop MapReduce
programming model.
 The input data is divided into small pieces, and master and slave
nodes are created. The master node usually executes on the
machine where the data is present, and slaves are made to work
remotely on the data
 The map operation is performed simultaneously on all the data
pieces, which are read by the map function. The map function
extracts the relevant data and generates a key-value pair for it.
Client: This initializes the job
JobTracker: is the master daemon for both Job resource management and
scheduling/monitoring of jobs
TaskTracker: is a slave node daemon in the cluster that accepts tasks (Map, Reduce and
Shuffle operations) from a JobTracker
Map tasks deal with splitting and mapping of data, while Reduce tasks shuffle and
reduce the data
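The client step ("this initializes the job") roughly corresponds to a driver program. The following is a hedged sketch that assumes the WordCountMapper and WordCountReducer classes from the earlier sketch; with YARN, the same Job API is used even though the JobTracker/TaskTracker daemons are replaced.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");            // the client initializes the job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);             // optional local aggregation on the map side
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit the job and wait for completion
    }
}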
MapReduce Framework
Hadoop Modes
Local (standalone) mode:
•No daemons; executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage
•No DFS
•Useful for testing/debugging MapReduce applications locally
Pseudo-distributed mode:
•Runs Hadoop on a single machine, emulating a distributed cluster
•Runs the different services of Hadoop as different Java processes, but within a single machine; each Hadoop daemon runs as a separate Java process
Fully distributed mode:
•Supports clusters that span from a few nodes to thousands of nodes
•Full production runs
Hadoop MapReduce: A Closer Look
[Data-flow diagram] On each node, files loaded from the local HDFS store pass through an InputFormat, which produces splits; RecordReaders (RR) turn each split into input (K, V) pairs for the Map tasks; a Partitioner and a Sort step produce intermediate (K, V) pairs, which are exchanged (shuffled) among all nodes; the Reduce tasks then emit the final (K, V) pairs, which the OutputFormat writes back to the local HDFS store.
Hadoop YARN
Resource Manager (RM)
 Responsibilities: Resource management and
assignment of all the apps
 Master daemon of YARN
 Requests received by the RM are forwarded to the
corresponding node manager.
 RM does the allocation of available resources.
 RM is highest authority for the allocation of resources.
Node Manager (NM)
More generic, flexible and efficient than the TaskTracker.
Dynamically created resource containers.
Container refers to a collection of resources such as
memory, CPU, disk and network IO.
NM is the slave daemon of Yarn.
NM is the per-machine/per-node framework agent,
responsible for containers, monitoring their resource usage
and reporting the same to the ResourceManager.
Container
 Container in YARN is where a unit of work happens in the
form of task.
 A job/application is split in tasks and each task gets
executed in one container having a specific amount of
allocated resources.
 A container can be understood as logical reservation of
resources (memory and vCores) that will be utilized by task
running in that container
Application Master
 Responsible for managing a set of submitted tasks or applications.
 It first verifies and validates a submitted application's specifications and
rejects the application if there are not enough resources available.
 It also ensures that no other application with the same ID has
already been submitted.
 Finally, it also observes the states of applications and manages
finished applications to save some Resource Manager’s memory.
YARN: Yet Another Resource Negotiator
 YARN: Resource management + Job scheduling
Criteria: MapReduce vs. YARN
 Type of processing: MapReduce offers batch processing with a single engine; YARN supports real-time, batch, and interactive processing with multiple engines.
 Cluster resource optimization: average with MapReduce due to fixed Map and Reduce slots; excellent with YARN due to central resource management.
 Suitable for: MapReduce supports only MapReduce applications; YARN supports MapReduce and non-MapReduce applications.
 Managing cluster resources: done by the JobTracker in MapReduce; done by YARN.
 Namespace: MapReduce supports only one namespace, i.e., HDFS; with YARN, Hadoop supports multiple namespaces.
Working of YARN
1. A client program submits the application
2. ResourceManager allocates a specified
container to start the ApplicationMaster (AM)
3. ApplicationMaster, on boot-up, registers with RM
4. AM negotiates with RM for appropriate resource
containers
5. On successful container allocations, AM contacts
NM to launch the container
6. Application code is executed within the
container, and then AM is responded with the
execution status
7. During execution, client communicates directly
with AM or RM to get status, progress updates etc.
8. Once the application is complete, the AM unregisters
with the RM and shuts down, allowing its own container
to be reclaimed
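Step 7 (the client asking the RM for status and progress) can be illustrated with a small, hedged Java sketch using the YarnClient API; the class name YarnStatusSketch is an illustrative assumption.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnStatusSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();          // picks up yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a report on every application it knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport report : apps) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}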
Introduction: Apache Pig
 Pig is a high level scripting language for operating on large datasets inside
Hadoop.
 Compiles scripting language into MapReduce operations.
 provides a simple language called Pig Latin, for queries and data
manipulation
 Pig's multi-query approach reduces the number of times data is scanned.
 Pig was developed as an ad-hoc way of creating and executing MapReduce
jobs on very large data sets.
 Pig provides data operations like filters, joins, ordering, etc. and nested data
types like tuples and maps, that are missing from MapReduce.
When to use and Not to use Pig
When to use Pig:
 When data loads are time sensitive.
 When processing various data sources.
 When analytical insights are required through sampling.
When not to use Pig:
 In places where the data is completely unstructured, like video, audio and readable text.
 In places where time constraints exist, as Pig is slower than MapReduce jobs.
 In places where more power is required to optimize the code.
Applications of Pig
 Processing of web logs.
 Data processing for search platforms.
 Provides the supports across large data-sets for Ad-hoc queries.
 For exploring large datasets Pig Scripting is used.
 In the prototyping of large data-sets processing algorithms.
 Required to process the time sensitive data loads.
 Collecting large amounts of datasets in form of search logs, web
crawls.
 Used where the analytical insights are needed using the sampling.
Apache Pig vs. MapReduce
 Pig is a scripting language; MapReduce is a compiled programming language.
 Pig's abstraction is at a higher level; MapReduce's abstraction is at a lower level.
 Pig needs fewer lines of code compared to MapReduce; MapReduce needs more lines of code.
 Less development effort is needed for Apache Pig; more development effort is required for MapReduce.
 Code efficiency of Pig is less compared to MapReduce; the efficiency of MapReduce code is higher.
Features of Pig
Rich set of operators
It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming:
Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
Optimization opportunities:
The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus
only on semantics of the language.
Extensibility:
Using existing operators, users can develop their own functions to read, process, and write data.
UDFs (see the sketch after this list):
Pig provides the facility to create User-Defined Functions in other programming languages
such as Java and to invoke or embed them in Pig scripts.
Handles all kinds of data:
Apache Pig analyzes all kinds of data, both structured and unstructured.
It stores the results in HDFS.
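A minimal, hedged sketch of such a Java UDF is shown below; the class name ToUpper is an illustrative assumption.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial eval UDF that upper-cases its first argument.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}
Assuming the class is packaged into a jar (say, myudfs.jar), it could then be used from a Pig script as: REGISTER myudfs.jar; B = FOREACH A GENERATE ToUpper(name);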
Apache Pig: Introduction
 The language used to analyze data in Hadoop using Pig is known
as Pig Latin.
 To perform a particular task using Pig, programmers
need to write a Pig script using the Pig Latin language and execute
it using any of the execution mechanisms (Grunt shell, UDFs,
embedded).
 Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
Parameter: MapReduce vs. Pig
 Paradigm: MapReduce is a data processing paradigm; Pig is a procedural dataflow language.
 Type of language: MapReduce is low level and rigid; Pig is a high-level language.
 Join operation: it is difficult to perform join operations between datasets in MapReduce; performing a join operation in Pig is simple.
 Skills needed: the MapReduce developer needs to have a robust knowledge of Java; for Pig, a good knowledge of SQL is needed.
 Code length: MapReduce requires 20 times more code length to accomplish the same task; due to the multi-query approach, the code length in Pig is greatly reduced.
 Compilation: MapReduce jobs have a prolonged compilation process; Pig needs no compilation, as every Pig operator is converted internally into MapReduce jobs.
 Nested data types: not present in MapReduce; present in Pig.
Pig Architecture
Running Pig
 Pig Latin statements; Pig commands
 Pig Latin statements and Pig commands can be run using interactive
mode and batch mode.
 Pig Latin commands can be run in three ways: via the Grunt interactive shell,
through a script file, and as embedded queries inside Java
programs.
 Pig has six execution modes
 Local mode, Tez Local mode, Spark local mode, Mapreduce mode, Tez mode,
Spark mode
 Local Mode – This mode runs on a single machine; all files are installed and
run using the local host and local file system. Uses the -x flag (pig -x local).
 Tez Local Mode – This mode is similar to local mode, except internally Pig will
invoke the Tez runtime engine. Uses the -x flag (pig -x tez_local).
 Spark Local Mode – This mode is similar to local mode, except internally Pig will
invoke the Spark runtime engine. Uses the -x flag (pig -x spark_local).
 Mapreduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster and
an HDFS installation. Mapreduce mode is the default mode; uses the -x flag (pig, or pig -x mapreduce).
 Tez Mode – This mode requires access to a Hadoop cluster and an HDFS installation. Specify
Tez mode using the -x flag (-x tez).
 Spark Mode – To run Pig in Spark mode, you need access to a Spark, YARN or Mesos
cluster and an HDFS installation. Specify Spark mode using the -x flag (-x spark).
Interactive mode
 Local Mode : $ pig -x local
 Tez Local Mode : $ pig -x tez_local
 Spark Local Mode : $ pig -x spark_local
 Mapreduce Mode : $ pig -x mapreduce
 Tez Mode : $ pig -x tez
 Spark Mode : $ pig -x spark
Batch mode
 Local Mode : $ pig -x local id.pig
 Tez Local Mode : $ pig -x tez_local id.pig
 Spark Local Mode : $ pig -x spark_local id.pig
 Mapreduce Mode : $ pig id.pig
 Tez Mode : $ pig -x tez id.pig
 Spark Mode : $ pig -x spark id.pig
Pig Latin statements
 Pig Latin statements are basic constructs used to process data
using Pig.
 A Pig Latin statement is an operator that takes a relation as input
and produces another relation as output.
 Pig Latin statements may include expressions and schemas.
 Pig Latin statements can span multiple lines and must end with a
semi-colon ( ; ).
 By default, Pig Latin statements are processed using multi-query
execution.
 Pig Latin statements are generally organized as follows:
 A LOAD statement to read data from the file system.
 A series of "transformation" statements to process the data.
 A DUMP statement to view results or a STORE statement to save the results.
In the following example, Pig will validate, but not execute, the LOAD and FOREACH
statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In the following example, Pig will validate and then execute the LOAD, FOREACH, and
DUMP statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(Ram)
(Sam)
(Ham)
(Pam)

Mais conteúdo relacionado

Mais procurados

Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 

Mais procurados (20)

Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Overview of Big data(ppt)
Overview of Big data(ppt)Overview of Big data(ppt)
Overview of Big data(ppt)
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
What is big data?
What is big data?What is big data?
What is big data?
 
Chapter 1 big data
Chapter 1 big dataChapter 1 big data
Chapter 1 big data
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Data cubes
Data cubesData cubes
Data cubes
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Hadoop
Hadoop Hadoop
Hadoop
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 

Semelhante a Big Data Analytics

using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
ranjit banshpal
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective View
ijtsrd
 

Semelhante a Big Data Analytics (20)

Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential Tools
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Big Data
Big DataBig Data
Big Data
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop Platform
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Big data (word file)
Big data  (word file)Big data  (word file)
Big data (word file)
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective View
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Top 10 renowned big data companies
Top 10 renowned big data companiesTop 10 renowned big data companies
Top 10 renowned big data companies
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Big data
Big dataBig data
Big data
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
Moving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and PerspectivesMoving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and Perspectives
 
Big data Analytics
Big data Analytics Big data Analytics
Big data Analytics
 
How to Become a Big Data Professional.pdf
How to Become a Big Data Professional.pdfHow to Become a Big Data Professional.pdf
How to Become a Big Data Professional.pdf
 

Mais de Sreedhar Chowdam

Mais de Sreedhar Chowdam (20)

Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture Notes
 
Design and Analysis of Algorithms (Knapsack Problem)
Design and Analysis of Algorithms (Knapsack Problem)Design and Analysis of Algorithms (Knapsack Problem)
Design and Analysis of Algorithms (Knapsack Problem)
 
DCCN Network Layer congestion control TCP
DCCN Network Layer congestion control TCPDCCN Network Layer congestion control TCP
DCCN Network Layer congestion control TCP
 
Data Communication and Computer Networks
Data Communication and Computer NetworksData Communication and Computer Networks
Data Communication and Computer Networks
 
DCCN Unit 1.pdf
DCCN Unit 1.pdfDCCN Unit 1.pdf
DCCN Unit 1.pdf
 
Data Communication & Computer Networks
Data Communication & Computer NetworksData Communication & Computer Networks
Data Communication & Computer Networks
 
PPS Notes Unit 5.pdf
PPS Notes Unit 5.pdfPPS Notes Unit 5.pdf
PPS Notes Unit 5.pdf
 
PPS Arrays Matrix operations
PPS Arrays Matrix operationsPPS Arrays Matrix operations
PPS Arrays Matrix operations
 
Programming for Problem Solving
Programming for Problem SolvingProgramming for Problem Solving
Programming for Problem Solving
 
Python Programming: Lists, Modules, Exceptions
Python Programming: Lists, Modules, ExceptionsPython Programming: Lists, Modules, Exceptions
Python Programming: Lists, Modules, Exceptions
 
Python Programming by Dr. C. Sreedhar.pdf
Python Programming by Dr. C. Sreedhar.pdfPython Programming by Dr. C. Sreedhar.pdf
Python Programming by Dr. C. Sreedhar.pdf
 
Python Programming Strings
Python Programming StringsPython Programming Strings
Python Programming Strings
 
Python Programming
Python Programming Python Programming
Python Programming
 
Python Programming
Python ProgrammingPython Programming
Python Programming
 
C Recursion, Pointers, Dynamic memory management
C Recursion, Pointers, Dynamic memory managementC Recursion, Pointers, Dynamic memory management
C Recursion, Pointers, Dynamic memory management
 
C Programming Storage classes, Recursion
C Programming Storage classes, RecursionC Programming Storage classes, Recursion
C Programming Storage classes, Recursion
 
Programming For Problem Solving Lecture Notes
Programming For Problem Solving Lecture NotesProgramming For Problem Solving Lecture Notes
Programming For Problem Solving Lecture Notes
 
Data Structures Notes 2021
Data Structures Notes 2021Data Structures Notes 2021
Data Structures Notes 2021
 
Computer Networks Lecture Notes 01
Computer Networks Lecture Notes 01Computer Networks Lecture Notes 01
Computer Networks Lecture Notes 01
 
Dbms university library database
Dbms university library databaseDbms university library database
Dbms university library database
 

Último

Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
dharasingh5698
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 

Último (20)

data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 

Big Data Analytics

  • 1. Course Instructor: Dr. C. Sreedhar BIG DATA ANALYTICS B.Tech VII Sem CSE A *Note: Some images are downloaded and used from internet sources
  • 2. Unit I  What is Big Data Analytics  Why this sudden hype around big data analytics  Classification of Analytics  Top Challenges facing big data  Few top analytics tools  Introduction to Hadoop;  HDFS, HDFS Commands  Processing Data with Hadoop  Managing Resources and Applications with Hadoop YARN  Interacting with Hadoop Ecosystem
  • 3. Unit II  Understanding MapReduce & YARN:  The Map Reduce Framework Concept  Developing Simple MapReduce Application  Points to consider while designing mapreduce  YARN background  YARN architecture  Working of YARN
  • 4. Unit III  Analyzing Data with Pig  Introducing Pig  Running Pig  Getting started with pig latin  Working with operators in pig  Debugging pig
  • 5. Unit IV  Understanding HIVE:  Introducing Hive  Hive services  Builtin functions in Hive  Hive DDL  Data manipulation in Hive
  • 6. Unit V  NoSQL Data Management:  Introducing to NoSQL,  characteristics of NoSQL  Types of NoSQL data models  Schema less databases
  • 7. BDA (P): List of Experiments
  • 8. Big Data: Introduction 8 What ? Why ? Who ? How ? Existing ? When ? Applications ?
  • 9. Big Data: common misconceptions  Expensive  Machine Data  Quality Data  Always right  100% accurate Big Data is NOT:  A Self-Learning Algorithm  Solution for every Business  Meant only for Data Scientists  Magic that changes overnight 9
  • 10. Traditional method of file management Patients Doctors Wards Rooms Patients program Doctors program Wards program Rooms program Users Users Users Users
  • 11. What Big Data is  Big Data is about the extraction of actionable or useful information from very large datasets.  Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. 11
  • 12. Big Data  The importance of big data does not revolve around how much data the organization/company has, but what can be done with such massive volumes of data.  Big Data helps in  cost reductions,  time reductions,  new product development and optimized offerings and  smart decision making. 12
  • 13. Big Data in the real world  By 2020, there will be around 40 trillion gigabytes of data (40 zettabytes).  90% of all data has been created in the last few years.  Today it would take a person approximately 181 million years to download all the data from the internet.  In 2012, only 0.5% of all data was analyzed.  In 2018, internet users spent 2.8 million years online.  Social media accounts for 33% of the total time spent online. 13
  • 14. Big Data in the real world  97.2% of organizations are investing in big data and AI.  Using big data, Netflix saves $1 billion per year on customer retention.  Job listings for data science and analytics will reach around 2.7 million by 2020.  Automated analytics will be vital to big data by 2022.  The big data analytics market is set to reach $103 billion by 2023 14
  • 15. What Big Data is Big datasets are too large and complex to be processed by traditional methods. Considering in a single minute, there are approx.:  3,00,000 Instagrams posted  5,00,000 tweets sent  45,00,000 Youtube videos watched  45,00,000 Google searches  20 crores of emails sent 15
  • 16. Big Data How do organizations optimize the values of big data?  Set a big data strategy  Identify big data sources  Access, manage and store big data  Analyze big data  Make data-driven decisions 16
  • 17. Big Data: Definition Gartner:  Big data is high-volume, high-velocity and/or high-variety information assets that demand cost- effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. 17
  • 18. Characteristics of Big Data  Volume  Variety  Velocity  Veracity 18
  • 19. What is Big Data Analytics  Analytics in general, involves the use of mathematical or scientific methods to generate insight from data  Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.
  • 20. What is Big Data Analytics  Technology-enabled analytics:  Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze big data.  About gaining a meaningful, deeper, and richer insight into the business to steer it in the right direction, understanding the customers demographics to cross-sell and up-sell to them.
  • 21. What is Big Data Analytics  Handshake between three communities: IT, business users, and data scientists.  Working with datasets whose volume and variety exceed current storage, processing capabilities and infrastructure.  About moving code to data. This makes perfect sense as the program for distributed processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and likely to be Exabytes or Zettabytes in near future).
  • 22. Why this sudden hype around BDA?  Following are some of the reasons for sudden hype about BDA:  Data is growing at a 40% compound annual rate, reaching nearly 45 ZB by 2020.  Volume of business data worldwide is expected to double every 1.2 y.  500 million ―tweets‖ are posted by Twitter users every day.  2.7 billion ―Likes‖, comments posted by Facebook users in a day.  90% of the world’s data created in the past 2 years.  Cost per gigabyte of storage has hugely dropped.  There are an overwhelming number of user-friendly analytics tools available in the market today.
  • 23. Classification of Analytics  There are basically two schools of thought:  Those that classify analytics into basic, operationalized, advanced, and monetized.  Those that classify analytics into analytics 1.0, analytics 2.0, analytics 3.0 and analytics 4.0.
  • 24. First School of thought Basic analytics: used to explore your data in a graphical manner where the data provides some value through simple visualizations Operationalized analytics: Operationalized analytics includes several concepts like data discovery, decision management, information delivery Advanced analytics: Provide analytical algorithms for executing complex analysis of either structured or unstructured data Monetized analytics: This is analytics in use to derive direct business revenue.
  • 25. Second School of Analytics  Analytics 1.0  Data sources relatively small and structured, from internal systems  Majority of analytical activity was descriptive analytics, or reporting  Creating analytical models was a time-consuming batch process  Decisions were made based on experience and intuition  Analytics 2. 0  Data sources are big, complex, unstructured, fast moving data  Rise of Data Scientists  Rise of Hadoop & open source  Visual Analytics
  • 26. Second School of Analytics  Analytics 3.0  Mix of all data  Internal/external products/decisions  Analytics a core capability  Move at speed & scale  Predictive & prescriptive analytics  Analytics 4.0  Analytics embedded, invisible and automated  Cognitive technologies  Robotic process automation for digital tasks  Augmentation and not automation
  • 27. Classification of Analytics  Descriptive analytics  Diagnostic analytics  Predictive analytics  Prescriptive analytics
  • 28. Challenges in Big Data  Capture  Cleaning  Availability  Integration  Storage  Processing  Indexing  Security  Sharing  Consistency  Partition tolerance  Analysis  Visualization
  • 29. Top Analytics Tools  Apache Hadoop  Apache Spark  Apache Storm  Apache Cassandra  Tableau  HBase  Windows Azure  Splunk  Talend  Elastic  Apache Pig  Lumify
  • 30. Apache Hadoop  is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.  It provides a software framework for distributed storage and processing of big data using the MapReduce programming model
  • 31. Apache Spark  is an open-source distributed general-purpose cluster-computing framework.  is a unified analytics engine for large-scale data processing.  provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
  • 32. Apache Storm  A system for processing streaming data in real time  Adds reliable real-time data processing capabilities to Enterprise Hadoop  Is distributed, resilient and real-time
  • 33. Cassandra  is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure  is the right choice when you need scalability and high availability without compromising performance
  • 34. Tableau  Tableau empowers business users to quickly and easily find valuable insights in their vast Hadoop datasets.  Tableau removes the need for users to have advanced knowledge of query languages by providing a clean visual analysis interface
  • 35. Lumify  Lumify is a powerful big data fusion, analysis, and visualization platform that supports the development of actionable intelligence.  Lumify is possibly the choice for those poring over 11 million-plus documents
  • 36. Windows Azure  is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft- managed data centers
  • 37. Splunk  Splunk is a software platform to search, analyze and visualize the machine-generated data gathered from websites, applications, sensors, devices, etc. that make up IT infrastructure and business.
  • 38. Talend  Talend is an open source software integration platform that helps you effortlessly turn data into business insights.  It provides various software and services for data integration, data management, enterprise application integration, data quality, cloud storage and Big Data.
  • 39. HBase  is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java.  It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing Bigtable-like capabilities for Hadoop
  • 40. Hive  is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.  Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop
  • 41. Apache Pig  is a high-level platform for creating programs that run on Apache Hadoop.  It is a tool/platform used to analyze large sets of data, representing them as data flows.  All the data manipulation operations in Hadoop can be performed using Apache Pig.
  • 42. Introduction to Hadoop  Hadoop is an open source framework that allows us to store (HDFS) and process (MapReduce) large data sets in a distributed and parallel manner.
  • 43. Traditional DB vs. Hadoop Traditional Database System Hadoop Data is stored in a central location and sent to the processor at runtime. In Hadoop, the program goes to the data. It initially distributes the data to multiple systems and later runs the computation wherever the data is located. (distributed computation) Traditional Database Systems cannot be used to process and store a significant amount of data (big data). Hadoop works better when the data size is big. It can process and store a large amount of data efficiently and effectively. Traditional RDBMS is used to manage only structured and semi-structured data. It cannot be used to control unstructured data. Hadoop can process and store a variety of data, whether it is structured or unstructured.
  • 44. History of Hadoop  1997: Doug Cutting developed Lucene, an open-source, Java-based indexing and search library.  2001: Mike Cafarella focused on indexing the entire web  Problems:  Schema-less (no tables and columns)  Durable (once written, data should never be lost)  Capability of handling component failure (CPU, memory, network)  Automatically re-balanced (disk space consumption)  2003: Google published the GFS paper; the Nutch Distributed File System (NDFS) was developed.  The problem of durability and fault tolerance was still not solved.
  • 45. History of Hadoop…  Solution: distributed storage and processing; the file system divided files into 64 MB chunks, storing each chunk on 3 different nodes (replication factor).  2004: Google published the paper "MapReduce: Simplified Data Processing on Large Clusters".  Solved the problems of parallelization, distribution and fault tolerance.  2005: Cutting integrated MapReduce into Nutch.  2006: Cutting named it Hadoop, which included HDFS and MR.  2008: Hadoop became a top-level project of the Apache Software Foundation.  Certain problems/enhancements in Hadoop created sub-projects like Hive, Pig, HBase and ZooKeeper.
  • 46. Hadoop  Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs −  Data is divided into directories and files (uniform-sized blocks, typically 128 MB)  These files are then distributed across various cluster nodes for further processing.  HDFS, being on top of the local file system, supervises the processing.  Blocks are replicated for handling hardware failure.  Checks for successful execution of the code  Performs the sort phase that takes place between the map and reduce stages.  Sends the sorted data to a certain computer.  Writes the debugging logs for each job.
  • 47.  Hadoop as a good choice for:  Indexing log files  Sorting vast amounts of data  Image analysis  Search engine optimization  Analytics  Hadoop as a poor choice for:  Calculating value of pi to 1,000,000 digits  Calculating Fibonacci sequences  Small structured data
  • 48. Components of Hadoop  HDFS  YARN  MapReduce
  • 49. HDFS  HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand.  HDFS is the primary distributed storage for Hadoop applications.  HDFS provides interfaces for applications to move themselves closer to data.  HDFS is designed to process large data sets with write-once-read-many semantics
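To see what "interfaces for applications" and write-once-read-many semantics look like in practice, below is a minimal, illustrative sketch using the Hadoop FileSystem Java API. The class name and the path /user/ubuntu/demo.txt are made up for the example, and the code assumes the cluster's core-site.xml is available on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/ubuntu/demo.txt");   // hypothetical path

        // Write once: create (or overwrite) the file and write a record
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }
        // Read many: any number of clients can now open and read the file
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}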
  • 50. HDFS (architecture diagram): NameNode; DataNode 1, DataNode 2, ...; Secondary NameNode
  • 51. Namenode  It is the master daemon that maintains and manages the DataNodes (slave nodes)  It records the metadata of all blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.  It records each and every change that takes place to the file system metadata  If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog  It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive  It keeps a record of all blocks in HDFS and the DataNodes in which they are stored
  • 52. DataNode  It is the slave daemon which runs on each slave machine  The actual data is stored on DataNodes  It is responsible for serving read and write requests from the clients  It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode  It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds
  • 53. Secondary Namenode  Works as a helper node to the primary NameNode  Downloads the FsImage file and edit logs file from the NameNode  Reads from the RAM of the NameNode and stores the information to hard disks periodically.  If the NameNode fails, the last saved FsImage on the Secondary NameNode can be used to recover the file system metadata
  • 54. HDFS Commands  Listing files  Read / Write files  Upload / Download files  File Management  Permissions  File System  Administration
  • 55. List: Files hdfs dfs -ls / List all the files/directories for the given HDFS destination path. hdfs dfs -ls -d /inputnew2 Directories are listed as plain files. In this case, this command will list the details of the inputnew2 folder. hdfs dfs -ls -R /user Recursively list all files in the /user directory and all its subdirectories. hdfs dfs -ls /inputnew* List all the files matching the pattern. In this case, it will list all files in the root directory whose names start with 'inputnew'.
  • 56. Read / Write Files hdfs dfs -text /inputnew/inputFile.txt HDFS command that takes a source file and outputs the file in text format on the terminal. hdfs dfs -cat /inputnew/inputFile.txt This command will display the content of the HDFS file inputFile.txt on your stdout. hdfs dfs -appendToFile /home/ubuntu/test1 /hadoop/text2 Appends the content of the local file test1 to the HDFS file text2.
  • 57. Upload / Download Files hdfs dfs -put /home/ubuntu/sample /hadoop Copies the file from local file system to HDFS. hdfs dfs -put -f /home/ubuntu/sample /hadoop Copies the file from local file system to HDFS, and in case the file already exists at the given destination path, using the -f option with the put command will overwrite it. hdfs dfs -get /newfile /home/ubuntu/ Copies the file from HDFS to local file system. hdfs dfs -copyFromLocal /home/ubuntu/sample /hadoop Works similarly to the put command, except that the source is restricted to a local file reference. hdfs dfs -copyToLocal /newfile /home/ubuntu/ Works similarly to the get command, except that the destination is restricted to a local file reference.
  • 58. File management hdfs dfs -cp /hadoop/file1 /hadoop1 Copies file from source to destination on HDFS. In this case, copying file1 from hadoop directory to hadoop1 directory. hdfs dfs -cp -f /hadoop/file1 /hadoop1 Copies file from source to destination on HDFS. Passing -f overwrites the destination if it already exists. hdfs dfs -mv /hadoop/file1 /hadoop1 Move files that match the specified file pattern <src> to a destination <dst>. When moving multiple files, destination must be a directory. hdfs dfs -rm /hadoop/file1 Deletes the file (sends it to the trash). hdfs dfs -rmdir /hadoop1 Delete a directory. hdfs dfs -mkdir /hadoop2 Create a directory in specified HDFS location.
  • 59. Permissions hdfs dfs -touchz /hadoop3 Creates a file of zero length at <path> with current time as the timestamp of that <path>. hdfs dfs -chmod 755 /hadoop/file1 Changes permissions of file. hdfs dfs -chown ubuntu:ubuntu /hadoop Changes owner of the file. 1st ubuntu in the command is owner and 2nd one is group.
  • 60. File System hdfs dfs -df /hadoop Shows the capacity, free and used space of the filesystem. hdfs dfs -df -h /hadoop Shows the capacity, free and used space of the filesystem. -h parameter Formats the sizes of files in a human-readable fashion. hdfs dfs -du /hadoop/file Show the amount of space, in bytes, used by the files that match the specified file pattern. hdfs dfs -du -s /hadoop/file Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size.
  • 61. Administration hadoop version To check the version of Hadoop. hdfs fsck / It checks the health of the Hadoop file system. hdfs dfsadmin -safemode leave The command to turn off the safemode of NameNode. hdfs dfsadmin -refreshNodes Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned. hdfs namenode -format Formats the NameNode.
  • 63. Hadoop Ecosystem  HDFS: Hadoop Distributed File System  YARN: Yet Another Resource Negotiator  MapReduce: Data processing using programming  Spark: In-memory Data Processing  Pig, Hive: Data Processing Services using Query (SQL-like)  HBase: NoSQL Database  Mahout, Spark MLlib: Machine Learning  Apache Drill: SQL on Hadoop  ZooKeeper: Managing Cluster  Oozie: Job Scheduling  Flume, Sqoop: Data Ingesting Services  Solr & Lucene: Searching & Indexing  Ambari: Provision, Monitor and Maintain cluster
  • 64. MapReduce Framework  Need for parallel distribution of tasks  Automatic expansion and contraction of processes  Enables continuation of processes without being affected by network or system failures  MapReduce: Map and Reduce  Map and Reduce do not modify the original data; instead, they create new data structures to hold their output
  • 65. Features of MapReduce  Scheduling  Synchronization  Data locality  Handling of errors/faults  Scale out architecture
  • 66. MapReduce Framework  Very large scale data: petabytes, exabytes  Write once and read many data: allows for parallelism without mutexes  Map and Reduce are the main operations: simple code  There are other supporting operations such as combine and partition.  All the map tasks should be completed before the reduce operation starts.  Map and reduce operations are typically performed by the same physical processor.  The number of map tasks and reduce tasks is configurable.  Operations are provisioned near the data.  Commodity hardware and storage.  Special distributed file system. Example: Hadoop Distributed File System (HDFS)
  • 67. Working of MapReduce  The MapReduce programming model works on an algorithm to execute the map and reduce operations. The algorithm steps are as follows:  Take a large dataset or set of records  Perform iteration over the data  Extract interesting patterns to prepare an output list using the map function  Arrange the output list to enable optimization for further processing  Compute a set of results by using the reduce function  Provide the final output  The MapReduce model executes a given task by dividing it into two functions: map and reduce. The map function is executed first, in parallel, on different machines. The reduce function takes the output of the map function to present the final output in an aggregate form (a minimal word-count sketch is given after this slide).
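To make these steps concrete, here is the classic word-count pattern written against the Hadoop MapReduce Java API. It is a minimal sketch rather than the exact program used in the course: the class names are illustrative and the input/output HDFS paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts received for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner acts as a local reduce
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar (say wordcount.jar, a hypothetical name), it would typically be submitted with: hadoop jar wordcount.jar WordCount /input /output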
  • 68. MapReduce framework  The JobTracker receives jobs from client applications to process large information.  These jobs are assigned in the form of individual tasks (after a job is divided into smaller parts) to various TaskTrackers.  The output of these distributed tasks is transmitted to the reduce function so that the final, integrated output, which is an aggregate of the data processed by the map function, can be provided.  A cluster uses commodity servers as nodes. The data processing job is accomplished through MapReduce and HDFS
  • 69.  Input is provided from large data files in the form of key-value pairs, which is the standard input format in the Hadoop MapReduce programming model.  The input data is divided into small pieces, and master and slave nodes are created. The master node usually executes on the machine where the data is present, and slaves are made to work remotely on the data  The map operation is performed simultaneously on all the data pieces, which are read by the map function. The map function extracts the relevant data and generates key-value pairs for it.
  • 70. MapReduce Framework  Client: initializes the job  JobTracker: the master daemon for both job resource management and scheduling/monitoring of jobs  TaskTracker: a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker  Map tasks deal with splitting and mapping of data, while Reduce tasks shuffle and reduce the data
  • 71. Hadoop Modes  Local mode: no daemons; executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage; no DFS; useful for testing/debugging MapReduce applications locally.  Pseudo-distributed mode: we can run Hadoop on a single machine emulating a distributed cluster; the different services of Hadoop run as different Java processes, but within a single machine; each Hadoop daemon runs as a separate Java process.  Fully distributed mode: supports clusters that span from a few nodes to thousands of nodes; full production run.
  • 72. Hadoop MapReduce: A Closer Look (pipeline diagram): files loaded from the local HDFS store → InputFormat → Splits → RecordReaders (RR) → Map → intermediate (K, V) pairs → Partitioner → shuffling process (intermediate (K, V) pairs exchanged by all nodes) → Sort → Reduce → OutputFormat → final (K, V) pairs written back to the local HDFS store
  • 74. Resource Manager (RM)  Responsibilities: resource management and assignment for all the applications  Master daemon of YARN  Requests received by the RM are forwarded to the corresponding node manager.  The RM does the allocation of available resources.  The RM is the highest authority for the allocation of resources.
  • 75. Node Manager (NM)  More generic, flexible and efficient than the TaskTracker.  Dynamically created resource containers.  A container refers to a collection of resources such as memory, CPU, disk and network IO.  The NM is the slave daemon of YARN.  The NM is the per-machine/per-node framework agent, responsible for containers, monitoring their resource usage and reporting the same to the ResourceManager.
  • 76. Container  A container in YARN is where a unit of work happens in the form of a task.  A job/application is split into tasks and each task gets executed in one container having a specific amount of allocated resources.  A container can be understood as a logical reservation of resources (memory and vCores) that will be utilized by the task running in that container
  • 77. Application Master  Responsible for managing a set of submitted tasks or applications.  First, it verifies and validates the submitted application's specifications and rejects applications if there are not enough resources available.  It also ensures that no other already-submitted application exists with the same ID.  Finally, it observes the states of applications and manages finished applications to save some of the Resource Manager's memory.
  • 78. YARN: Yet Another Resource Negotiator  YARN: Resource management + Job scheduling  Type of processing: MapReduce offers batch processing with a single engine; YARN offers real-time, batch, and interactive processing with multiple engines  Cluster resource optimization: MapReduce is average due to fixed Map and Reduce slots; YARN is excellent due to central resource management  Suitable for: MapReduce suits only MapReduce applications; YARN suits MapReduce and non-MapReduce applications  Managing cluster resources: done by the JobTracker in MapReduce; done by YARN in YARN  Namespace: MapReduce supports only one namespace, i.e., HDFS; with YARN, Hadoop supports multiple namespaces
  • 79. Working of YARN  1. A client program submits the application  2. The ResourceManager allocates a specified container to start the ApplicationMaster (AM)  3. The ApplicationMaster, on boot-up, registers with the RM  4. The AM negotiates with the RM for appropriate resource containers  5. On successful container allocations, the AM contacts the NM to launch the container  6. The application code is executed within the container, and the execution status is reported back to the AM  7. During execution, the client communicates directly with the AM or RM to get status, progress updates, etc.  8. Once the application is complete, the AM unregisters with the RM and shuts down, allowing its own container to be reclaimed
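Step 7 above (the client asking the ResourceManager for status and progress) can be illustrated with the YarnClient Java API. The sketch below simply lists every application the RM knows about together with its state; it is illustrative only and assumes the cluster's yarn-site.xml is on the client's classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager
        Configuration conf = new Configuration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the RM for a report of every application it knows about
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}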
  • 80. Introduction: Apache Pig  Pig is a high-level scripting language for operating on large datasets inside Hadoop.  It compiles the scripting language into MapReduce operations.  Provides a simple language called Pig Latin for queries and data manipulation  Pig's multi-query approach reduces the number of times data is scanned.  Pig was developed for an ad-hoc way of creating and executing MapReduce jobs on very large data sets.  Pig provides data operations like filters, joins, ordering, etc. and nested data types like tuples and maps, that are missing from MapReduce.
  • 81. When to use and when not to use Pig  Use Pig:  When data loads are time sensitive.  When processing various data sources.  When analytical insights are required through sampling.  Do not use Pig:  In places where the data is completely unstructured, like video, audio and readable text.  In places where time constraints exist, as Pig is slower than MapReduce jobs.  In places where more power is required to optimize the code.
  • 82. Applications of Pig  Processing of web logs.  Data processing for search platforms.  Provides support for ad-hoc queries across large datasets.  Pig scripting is used for exploring large datasets.  In the prototyping of large data-set processing algorithms.  Required to process time-sensitive data loads.  Collecting large amounts of data in the form of search logs and web crawls.  Used where analytical insights are needed using sampling.
  • 83. Apache Pig vs. MapReduce  Pig is a scripting language; MapReduce is a compiled programming language.  Pig's abstraction is at a higher level; MapReduce's abstraction is at a lower level.  Pig needs fewer lines of code compared to MapReduce; MapReduce needs more lines of code.  Less development effort is needed for Apache Pig; more development effort is required for MapReduce.  Code efficiency of Pig is less compared to MapReduce; compared to Pig, the efficiency of MapReduce code is higher.
  • 84. Features of Pig  Rich set of operators: it provides many operators to perform operations like join, sort, filter, etc.  Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.  Optimization opportunities: the tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.  Extensibility: using the existing operators, users can develop their own functions to read, process and write data.  UDFs: Pig provides the facility to create User-Defined Functions in other programming languages such as Java and invoke or embed them in Pig scripts (see the sketch after this slide).  Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
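As a rough illustration of the UDF facility mentioned above, the sketch below defines a Java EvalFunc that upper-cases a chararray field. The class name ToUpper and the jar name used afterwards are hypothetical; this is a minimal sketch, not part of the course material.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: converts the first field of the input tuple to upper case
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                      // Pig treats null as missing data
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Packaged into a jar (say myudfs.jar, a made-up name), it could be used from a Pig script roughly as: REGISTER myudfs.jar; B = FOREACH A GENERATE ToUpper(name); assuming a relation A like the one in the later examples.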
  • 85. Apache Pig: Introduction  The language used to analyze data in Hadoop using Pig is known as Pig Latin.  To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).  Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy.
  • 86. MapReduce vs. Pig  Paradigm: MapReduce is a data processing paradigm; Pig is a procedural dataflow language  Type of language: MapReduce is low level and rigid; Pig is a high-level language  Join operation: it is difficult to perform join operations between datasets in MapReduce; performing a join operation in Pig is simple  Skills needed: the MapReduce developer needs to have a robust knowledge of Java; for Pig, a good knowledge of SQL is needed  Code length: MapReduce requires about 20 times more code to accomplish the same task; due to the multi-query approach, the Pig code length is greatly reduced  Compilation: MapReduce jobs have a prolonged compilation process; Pig needs no compilation, as every Pig operator is converted internally into MapReduce jobs  Nested data types: not present in MapReduce; present in Pig
  • 88. Running Pig  Pig Latin statements; Pig commands  Pig Latin statements and Pig commands can be run in interactive mode and batch mode.  Pig Latin commands can be run in three ways: via the Grunt interactive shell, through a script file, and as embedded queries inside Java programs (a minimal embedded sketch follows this slide).  Pig has six execution modes:  Local mode, Tez local mode, Spark local mode, MapReduce mode, Tez mode, Spark mode
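The third way listed above, embedding Pig in a Java program, can be sketched with the PigServer API. The snippet is illustrative only: it assumes a file named student exists on the local file system (matching the later LOAD examples), and "local" can be replaced with "mapreduce" to run against a cluster.

import java.util.Iterator;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // "local" runs Pig against the local file system; use "mapreduce" on a cluster
        PigServer pig = new PigServer("local");
        pig.registerQuery("A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);");
        pig.registerQuery("B = FOREACH A GENERATE name;");

        // Iterate over the tuples of relation B, similar to DUMP B in the Grunt shell
        Iterator<Tuple> it = pig.openIterator("B");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}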
  • 89.  Local Mode – This mode runs on a single machine; all files are installed and run using the local host and file system. Uses the -x flag (pig -x local).  Tez Local Mode – This mode is similar to local mode, except internally Pig will invoke the Tez runtime engine. Uses the -x flag (pig -x tez_local).  Spark Local Mode – This mode is similar to local mode, except internally Pig will invoke the Spark runtime engine. Uses the -x flag (pig -x spark_local).  Mapreduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; uses the -x flag (pig or pig -x mapreduce).  Tez Mode – In this mode, you need access to a Hadoop cluster and HDFS installation. Specify Tez mode using the -x flag (-x tez).  Spark Mode – To run Pig in Spark mode, you need access to a Spark, Yarn or Mesos cluster and HDFS installation. Specify Spark mode using the -x flag (-x spark)
  • 90. Interactive mode  Local Mode : $ pig -x local  Tez Local Mode : $ pig -x tez_local  Spark Local Mode : $ pig -x spark_local  Mapreduce Mode : $ pig -x mapreduce  Tez Mode : $ pig -x tez  Spark Mode : $ pig -x spark
  • 91. Batch mode  Local Mode : $ pig -x local id.pig  Tez Local Mode : $ pig -x tez_local id.pig  Spark Local Mode : $ pig -x spark_local id.pig  Mapreduce Mode : $ pig id.pig  Tez Mode : $ pig -x tez id.pig  Spark Mode : $ pig -x spark id.pig
  • 92. Pig Latin statements  Pig Latin statements are basic constructs used to process data using Pig.  A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.  Pig Latin statements may include expressions and schemas.  Pig Latin statements can span multiple lines and must end with a semi-colon ( ; ).  By default, Pig Latin statements are processed using multi-query execution.
  • 93.  Pig Latin statements are generally organized as follows:  A LOAD statement to read data from the file system.  A series of "transformation" statements to process the data.  A DUMP statement to view results or a STORE statement to save the results.
  • 94. In the following example, Pig will validate, but not execute, the LOAD and FOREACH statements:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In the following example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(Ram)
(Sam)
(Ham)
(Pam)