Fundamentals of big data analytics and Hadoop
1. I am Archana R,
Assistant Professor,
Department of Computer Science,
SACWC.
I am here because I love giving
presentations.
2. BIG DATA AND ANALYTICS?
• Big data analytics is the use of advanced analytic techniques against very large, diverse
data sets that include structured, semi-structured and unstructured data, from different
sources, and in sizes ranging from terabytes to zettabytes.
• Big data analytics refers to the method of analysing huge volumes of data, or big data. ...
The major aim of big data analytics is to discover new patterns and relationships that
might otherwise be invisible, and to provide new insights about the users who created the data.
3. BIG DATA ANALYTICS EXAMPLE
• Big data analytics helps businesses to get insights from today's huge data resources.
People, organizations, and machines now produce massive amounts of data. Social
media, cloud applications, and machine sensor data are just some examples.
Why is big data analytics important?
• Big data analytics helps organizations harness their data and use it to identify new
opportunities. That, in turn, leads to smarter business moves, more efficient operations,
higher profits and happier customers.
4. BIG DATA ANALYTICS TOOLS
• Hadoop - helps in storing and analysing data.
• MongoDB - used on datasets that change frequently.
• Talend - used for data integration and management.
• Cassandra - a distributed database designed to handle large amounts of data across many servers.
• Spark - used for real-time processing and analysing large amounts of data.
5. WHAT ARE THE CONCEPTS OF BIG DATA?
• Big data was originally associated with three key concepts: volume, variety, and
velocity.
• The analysis of big data presents challenges in sampling: data sets are often too large
to examine in full, so earlier approaches relied on samples and limited observations.
6. WHAT ARE THE THREE TYPES OF BIG DATA?
• Big data is commonly classified into three types:
• Structured Data.
• Unstructured Data.
• Semi-Structured Data.
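The three types above can be illustrated with a short Python sketch. The sample records are invented for illustration; the point is how each kind of data is (or is not) bound to a schema:

```python
import csv, io, json

# Structured: fixed schema, e.g. a CSV row with known columns.
structured = next(csv.DictReader(io.StringIO("id,name,age\n1,Asha,30")))

# Semi-structured: self-describing but flexible schema, e.g. JSON.
semi_structured = json.loads('{"id": 2, "name": "Ravi", "tags": ["ml", "bigdata"]}')

# Unstructured: free text with no predefined schema at all.
unstructured = "Customer wrote: the delivery was late but support was helpful."

print(structured["name"])         # field access via a known schema
print(semi_structured["tags"])    # nested, optional fields
print(len(unstructured.split()))  # text must be parsed/mined to extract meaning
```

Structured data can be queried by column, semi-structured data by (possibly missing) keys, while unstructured data needs text processing before any analysis.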
7. DIFFERENCE BETWEEN DATA AND BIG DATA?
• Any definition is a bit circular, as “Big” data is still data of course. Data is a set of qualitative
or quantitative variables – it can be structured or unstructured, machine readable or not, digital
or analogue, personal or not. ... Hence, BIG DATA, is not just “more” data.
• What is the size of big data?
• The term big data implies a large amount of information (terabytes and petabytes). It is
important to understand that, for a particular business case, the value usually lies not in the
entire volume but in only a small part of it. However, that valuable component cannot be
identified in advance without analysis.
8. HOW HADOOP WORKS
• Hadoop makes it easier to use all the storage and processing capacity in cluster servers,
and to execute distributed processes against huge amounts of data.
• Applications that collect data in various formats can place data into the Hadoop cluster
by using an API operation to connect to the NameNode.
• To run a job that queries the data, submit a MapReduce job made up of many map and
reduce tasks that run against the data in HDFS, spread across the DataNodes.
• Map tasks run on each node against the input files supplied, and reducers run to
aggregate and organize the final output.
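The map/shuffle/reduce flow described above can be sketched in plain Python. This is a toy word count that simulates Hadoop's phases in a single process, not real Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map task: emit a (word, 1) pair for every word in an input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce task: aggregate all counts emitted for one key.
    return word, sum(counts)

lines = ["big data needs big tools", "hadoop stores big data"]

# "Shuffle" phase: sort/group intermediate pairs by key, as Hadoop
# does between the map and reduce stages.
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(reducer(k, (c for _, c in g))
              for k, g in groupby(pairs, key=itemgetter(0)))
print(result["big"])  # 3
```

In a real cluster, map tasks would run on the DataNodes holding each input split, and the framework would shuffle intermediate pairs across the network to the reducers.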
9. • Spark – An open-source, distributed processing system commonly used for big data
workloads. Apache Spark uses in-memory caching and optimized execution for fast
performance, and it supports general batch processing, streaming analytics, machine learning,
graph processing, and ad hoc queries.
• Presto – An open source, distributed SQL query engine optimized for low-latency, ad-hoc
analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations,
joins, and window functions. Presto can process data from multiple data sources including the
Hadoop Distributed File System (HDFS) and Amazon S3.
• Hive – Allows users to leverage Hadoop MapReduce using a SQL interface, enabling analytics
at a massive scale, in addition to distributed and fault-tolerant data warehousing.
• HBase– An open source, non-relational, versioned database that runs on top of Amazon S3
(using EMRFS) or the Hadoop Distributed File System (HDFS). HBase is a massively
scalable, distributed big data store built for random, strictly consistent, real-time access for
tables with billions of rows and millions of columns.
• Zeppelin – A web-based notebook that enables interactive data exploration.
10. RUNNING HADOOP ON AWS
• Amazon EMR is a managed service that lets you process and analyze large datasets using the
latest versions of big data processing frameworks such as Apache Hadoop, Spark, HBase, and
Presto on fully customizable clusters.
• Easy to use: You can launch an Amazon EMR cluster in minutes. You don’t need to worry
about node provisioning, cluster setup, Hadoop configuration, or cluster tuning.
• Low cost: Amazon EMR pricing is simple and predictable: you pay an hourly rate for every
instance hour you use and you can leverage Spot Instances for greater savings.
11. • Elastic: With Amazon EMR, you can provision one, hundreds, or thousands of compute
instances to process data at any scale.
• Transient: You can use EMRFS to run clusters on demand based on HDFS data stored
persistently in Amazon S3. As jobs finish, you can shut down a cluster and have the data remain
saved in Amazon S3. You pay only for the compute time that the cluster is running.
• Secure: Amazon EMR uses the common security features of AWS services:
• Identity and Access Management (IAM) roles and policies to manage permissions.
• Encryption in-transit and at-rest to help you protect your data and meet compliance
standards, such as HIPAA.
• Security groups to control inbound and outbound network traffic to your cluster nodes.
12. HADOOP ECOSYSTEM
• The term Hadoop is a general term that may refer to any of the following: The overall Hadoop
Ecosystem, which encompasses both the core modules and related sub-modules.
• The core Hadoop modules, including Hadoop Distributed File System (HDFS™), Yet Another
Resource Negotiator (YARN), MapReduce, and Hadoop Common (discussed below). These are
the basic building blocks of a typical Hadoop deployment.
• Hadoop-related sub-modules, including: Apache Hive™, Apache Impala™,
Apache Pig™, and Apache Zookeeper™, among others. These related pieces of software can
be used to customize, improve upon, or extend the functionality of core Hadoop.
13. HADOOP MODULES
• HDFS — Hadoop Distributed File System. HDFS is a Java-based system that allows large
data sets to be stored across nodes in a cluster in a fault-tolerant manner.
• YARN — Yet Another Resource Negotiator. YARN is used for cluster resource management,
planning tasks, and scheduling jobs that are running on Hadoop.
• MapReduce — MapReduce is both a programming model and a big data processing engine
used for the parallel processing of large data sets. Originally, MapReduce was the only
execution engine available in Hadoop, but Hadoop later added support for others,
including Apache Tez™ and Apache Spark™.
• Hadoop Common — Hadoop Common provides a set of shared libraries and utilities that
support the other Hadoop modules.
14. BENEFITS OF HADOOP
• Scalability — Unlike traditional systems that limit data storage, Hadoop is scalable because it
operates in a distributed environment. This allowed data architects to build early data lakes on
Hadoop.
• Resilience — The Hadoop Distributed File System (HDFS) is fundamentally resilient. Data
stored on any node of a Hadoop cluster is also replicated on other nodes of the cluster to
prepare for the possibility of hardware or software failures. This intentionally redundant
design ensures fault tolerance. If one node goes down, there is always a backup of the data
available in the cluster.
• Flexibility — Unlike traditional relational database management systems, when working with
Hadoop, you can store data in any format, including semi-structured or unstructured formats.
Hadoop enables businesses to easily access new data sources and tap into different types of
data.
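The replication behind the resilience benefit above can be sketched as a toy model in Python. This is a simplified illustration, not how HDFS is actually implemented; the replication factor of 3 is HDFS's real default, but the class and method names here are invented:

```python
import random

REPLICATION_FACTOR = 3  # HDFS default

class MiniCluster:
    """Toy model of HDFS-style block replication (illustration only)."""

    def __init__(self, num_nodes):
        self.nodes = {i: {} for i in range(num_nodes)}  # node id -> {block: data}

    def put(self, block_id, data):
        # NameNode-like placement: copy each block onto several DataNodes.
        for node in random.sample(list(self.nodes), REPLICATION_FACTOR):
            self.nodes[node][block_id] = data

    def fail_node(self, node_id):
        # Simulate a hardware failure by dropping the whole node.
        del self.nodes[node_id]

    def get(self, block_id):
        # Any surviving replica can serve the read.
        for blocks in self.nodes.values():
            if block_id in blocks:
                return blocks[block_id]
        raise IOError("all replicas lost")

cluster = MiniCluster(num_nodes=5)
cluster.put("block-0", b"big data payload")
cluster.fail_node(0)            # one node goes down...
print(cluster.get("block-0"))   # ...but the block is still readable
```

With three replicas spread across five nodes, losing any single node always leaves at least two copies, which is why the read after the failure still succeeds.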