WEB DATA
 A few years back, data mining was all manual, and web data mining took many days for almost all small and medium players in the market. Today the technology has evolved: in the era of big data, manual mining is no longer the right method, and web data extraction is mostly done with automation tools, custom scripts, or the Hadoop framework.
 Web data extraction is the process of collecting data from the World Wide Web using a web scraper, a crawler, manual mining, etc. A web scraper or crawler is a tool for harvesting information available on the internet.
 In other words, web data extraction is the process of crawling websites and extracting data from their pages using a tool or a program.
 Web extraction is related to web indexing, which refers to various methods of indexing the contents of web pages using a bot or web crawler. A web crawler is an automated program, script, or tool with which we can 'crawl' web pages and collect information from websites.
1 The first step in this whole process is web data extraction, which can be done using the different scraping tools available in the market (both free and paid tools exist) or by creating a custom script in a scripting language such as Python or Ruby, with the help of an expert.
2 The second step is to find insight in the data. For this, we first need to process the data using the right tool, based on the size of the data and the availability of expert resources. The Hadoop framework is the most popular and widely used tool for big data processing.
3 Also, if sentiment analysis of the data is needed, we use MapReduce, one of the components of the Hadoop big data stack.
To summarize, for web data extraction we can choose different automation tools or develop scripts in a programming language.
4 Developing a script often minimizes effort, as it is reusable with minimal modification. Moreover, as the volume of extracted web data is huge, it is advisable to use the Hadoop framework for quick processing.
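As a sketch of what such a scraping script does, the standard-library parser below extracts every `<h2>` heading from a page. This is a minimal illustration, not a production scraper: in a real script the HTML would come from an HTTP fetch (e.g. urllib, or a library such as requests), and the page used here is a made-up stand-in.

```python
from html.parser import HTMLParser

# A minimal extractor: collects the text of every <h2> heading on a page.
class HeadingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

# A hard-coded page stands in for a fetched document.
page = "<html><body><h2>Prices</h2><p>...</p><h2>Reviews</h2></body></html>"
scraper = HeadingScraper()
scraper.feed(page)
print(scraper.headings)  # ['Prices', 'Reviews']
```

The same pattern scales up: fetch each page, parse out the fields of interest, and write the records somewhere a processing framework can reach them.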
Media companies use web scraping to collect recent and popular topics of interest from social media and popular websites.
Business directories use web scraping to collect information such as business profile, address, phone, location, zip code, etc.
In the healthcare sector, physicians scrape data from multiple websites to collect information on diseases, medicines, components, etc.
When companies decide to go for web data extraction today, they plan for big data from the start, because they know the data will come in bulk, i.e. millions of records, mostly in semi-structured or unstructured format. So we need to treat it as big data and use the Hadoop framework and its tools to convert it into a form fit for decision making.
Challenges of Conventional Systems
Analytics has been used in the business intelligence world to provide tools and intelligence to gain insight into data.
Data mining is used in enterprises to keep pace with the critical monitoring and analysis of mountains of data.
The question is how to unearth all the hidden information in this vast amount of data.
Common challenges of a conventional system:
It cannot work on unstructured data efficiently.
It is built around the relational data model.
It is batch oriented: we need to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
Parallelism in a traditional analytics system is achieved through costly hardware like MPP (Massively Parallel Processing) systems.
It has inadequate support for aggregated summaries of data.
Data Challenges
• Volume, Velocity, Variety & Veracity
• Data discovery and comprehensiveness
• Scalability
• Storage issues
Process Challenges
• Capturing data
• Aligning data from different sources
• Transforming data into a suitable form for data analysis
• Modeling data (mathematical modeling, simulation)
• Understanding output, visualizing results and display issues on
mobile devices
Management Challenges
• Security
• Privacy
• Governance
• Ethical issues
Traditional / RDBMS
• Designed to handle well-structured data
• Traditional storage vendor solutions are very expensive
• Shared block-level storage is too slow
• Reads data in 8 KB or 16 KB block sizes
• Schema-on-write requires data to be validated before it can be
written to disk
• Software licenses are too expensive
• Getting data from disk and loading it into memory requires an application
Solution constraints
• Inexpensive storage
• A data platform that could handle large volumes of data and be linearly
scalable in cost and performance
• A highly parallel processing model that was highly distributed to
access and compute the data very fast
• A data repository that could break down the silos and store structured,
semi-structured, and unstructured data to make it easy to correlate and
analyze the data together
The Evolution of Analytic Scalability
• Scalability: The ability of a system to handle an increasing amount of work required
to perform its task
• Data storage capacity has grown in recent years to accommodate
the need for big data
• Measures of Data Size – Kilo, Mega, Giga , Tera, Peta, Exa, Zetta, Yotta
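The scale of these units can be made concrete with a small sketch (using decimal SI prefixes, each a factor of 1000; binary prefixes such as kibi/mebi are a separate convention):

```python
# Each SI prefix is a factor of 1000 larger than the previous one.
prefixes = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
sizes = {name + "byte": 1000 ** power
         for power, name in enumerate(prefixes, start=1)}
print(sizes["petabyte"])  # 1000000000000000
```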
Basic Definitions
• Data: – Known facts that can be recorded and have an implicit meaning.
• Database: – Organized collection of related data.
• Database Management System (DBMS) – A software package to facilitate the
creation and maintenance of a computerized database.
• Relational Database Management System (RDBMS) – A DBMS based on the relational model.
• A relation is a set of tuples.
• Enterprise Data Warehouse (EDW) – A central warehouse for all sources of data.
Massively Parallel Processing (MPP) Systems
– Have many processors
– All these processors work in parallel
– Big data is split into many parts, and the processors work on each part in parallel
– A divide-and-conquer strategy
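The divide-and-conquer strategy can be sketched in a few lines. This is only an in-process illustration using threads; a real MPP system runs each part on a separate processor or machine:

```python
from concurrent.futures import ThreadPoolExecutor

# Divide-and-conquer sketch of the MPP idea: split the data into parts,
# compute on each part concurrently, then combine the partial results.
def partial_sum(chunk):
    return sum(chunk)

data = list(range(1_000_000))
parts = [data[i::4] for i in range(4)]        # divide: split into 4 parts
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, parts))
total = sum(partials)                         # conquer: combine the results
print(total)  # 499999500000
```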
Data Preparation
• Manipulation of data into a form suitable for analysis
– Join
• Combining columns of different data sources
– Aggregation
• Combining all data into one, e.g. a statistical summary
• Combining rows of different data sources
– Derivations
• Creating new columns of data, e.g. calculating a ratio
– Transformation
• Converting data into a useful format, e.g. taking a log, or converting date of birth to age
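The four preparation steps above can be sketched on toy records (the names and values below are invented for illustration):

```python
from datetime import date

# Toy records standing in for two data sources.
customers = [{"id": 1, "name": "Asha", "dob": date(1990, 5, 1)},
             {"id": 2, "name": "Ravi", "dob": date(1985, 1, 1)}]
orders = [{"cust_id": 1, "amount": 200}, {"cust_id": 1, "amount": 100},
          {"cust_id": 2, "amount": 50}]

# Join: combine columns of different data sources on a key.
by_id = {c["id"]: c for c in customers}
joined = [{**by_id[o["cust_id"]], **o} for o in orders]

# Aggregation: a statistical summary combining many rows into one per key.
total_spend = {}
for o in orders:
    total_spend[o["cust_id"]] = total_spend.get(o["cust_id"], 0) + o["amount"]

# Derivation: create a new column, e.g. each customer's share of all spend.
grand_total = sum(total_spend.values())
share = {cid: amt / grand_total for cid, amt in total_spend.items()}

# Transformation: convert data into a more useful format,
# e.g. date of birth to (approximate) age.
as_of = date(2024, 1, 1)
ages = {c["id"]: (as_of - c["dob"]).days // 365 for c in customers}

print(total_spend)  # {1: 300, 2: 50}
print(ages)         # {1: 33, 2: 39}
```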
Ways of doing in-database data preparation:
• SQL
• User-defined functions / embedded processes
– e.g. SELECT customer, attrition_score ...
– The analytic tool's engine running inside the database
• Predictive Model Markup Language (PMML)
– Based on XML
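The SQL route can be shown end to end with an in-memory database. The engine performs the aggregation and derived-column calculation and returns only the prepared rows; the table and column names here are invented for illustration:

```python
import sqlite3

# In-database preparation: push the join/aggregation work into the engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("A", 200.0), ("A", 100.0), ("B", 50.0)])

rows = con.execute("""
    SELECT customer,
           SUM(amount) AS total,                                    -- aggregation
           SUM(amount) / (SELECT SUM(amount) FROM orders) AS share  -- derivation
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(rows)
```

Only the prepared summary rows cross the wire, which is the point of doing the preparation in the database rather than in the analytic tool.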
Cloud Computing
• McKinsey definition:
– Enterprises incur no infrastructure or capital cost; they pay on a pay-per-use basis
– It should be scalable
– The architectural specifics of the underlying hardware are abstracted from the user
• Public clouds and private clouds differ in:
– Security
– Specialized services
– Long-term cost
MapReduce
• A parallel processing framework
• Computational processing can occur on data (even semi-structured and
unstructured data) stored in a file system, without loading it into any kind
of database
Analytic process and tools:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment
• Here we need to:
– plan the deployment, monitoring, and maintenance,
– produce a final report, and
– review the project.
• In this phase we deploy the results of the analysis; this is also known as
reviewing the project.
Step 2: Business Understanding
• This step consists of business understanding.
– Whenever a requirement arises, we first need to determine the business objective,
– assess the situation,
– determine the data mining goals, and then
– produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 3: Data Exploration
• This step consists of data understanding.
– For the further process, we need to gather the initial data, describe and explore it, and
verify its quality to ensure it contains the data we require.
– The data collected from the various sources is described in terms of its application and the
needs of the project in this phase.
– This is also known as data exploration.
• This step is necessary to verify the quality of the data collected.
Step 4: Data Preparation
• From the data collected in the last step,
– we need to select the data as per the need, clean it, and construct it to get useful
information,
– and then integrate it all.
• Finally, we need to format the data appropriately.
• In this phase the data is selected, cleaned, and integrated into the format finalized
for the analysis.
Step 5: Data Modeling
• We need to:
– select a modeling technique, generate a test design, build a model, and assess the model
built.
• In this phase the data model is built to analyze relationships between the various selected
objects in the data;
– test cases are built for assessing the model, and the model is tested and implemented on
the data.
• Where processing is hosted?
– Distributed Servers / Cloud (e.g. Amazon EC2)
• Where data is stored?
– Distributed Storage (e.g. Amazon S3)
• What is the programming model?
– Distributed Processing (e.g. MapReduce)
• How data is stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
– Analytic / Semantic Processing
• Big data tools for HPC and supercomputing
– MPI
• Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
• Other BDA tools
– SAS
– R
– Hadoop
Analysis vs. Reporting
What is analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
Comparing analysis vs. reporting:
Reporting is “the process of organizing data into informational summaries
in order to monitor how different areas of a business are performing.”
• Measuring core metrics and presenting them — whether in an email, a
slidedeck, or online dashboard — falls under this category.
• Analytics is “the process of exploring data and reports in order to
extract meaningful insights, which can be used to better understand and
improve business performance.”
• Reporting helps companies monitor their online business and alerts them
when data falls outside expected ranges.
• Good reporting should raise questions about the business from its end users.
• The goal of analysis is to answer those questions by interpreting the data at a
deeper level and providing actionable recommendations.
• A firm may be focused on the general area of analytics (strategy,
implementation, reporting, etc.)
– but not necessarily on the specific aspect of analysis.
• It is almost as if some organizations run out of gas after the initial set-up-
related activities and never make it to the analysis stage.
Analysis vs. Reporting:
1. Analysis provides what is needed; reporting provides what is asked for.
2. Analysis is typically customized; reporting is typically standardized.
3. Analysis involves a person; reporting does not involve a person.
4. Analysis is extremely flexible; reporting is fairly inflexible.
Reporting translates raw data into information.
1. Analysis transforms data and information into insights.
2. Reporting shows you what is happening,
3. while analysis focuses on explaining why it is happening and what you can do
about it.
4. Reports are like robots: they monitor and alert you. Analysis is like a parent:
it can figure out what is going on (hungry, dirty diaper, no pacifier,
teething, tired, ear infection, etc.).
5. Reporting and analysis can go hand in hand:
6. Reporting provides limited context about what is happening in the data, and
context is critical to good analysis.
History of Hadoop:
1. Hadoop was started by Doug Cutting to support two of his other well-known
projects, Lucene and Nutch.
2. Hadoop was inspired by the Google File System (GFS), which was
detailed in a paper released by Google in 2003.
3. Hadoop's storage layer, originally called the Nutch Distributed File System (NDFS),
split from Nutch in 2006 to become a sub-project of Lucene. At this point
it was renamed Hadoop.
Apache Hadoop:
Apache Hadoop is the most important framework for working with big data.
Hadoop's biggest strength is scalability: it scales seamlessly from a single
node to thousands of nodes without any issue.
The web was generating loads of information on a daily basis, and it was
becoming very difficult to manage the data of around one billion pages of
content.
To address this, Google invented a new methodology for processing data,
popularly known as MapReduce, and about a year later published a white paper
describing it.
Hadoop runs applications on the basis of MapReduce, where the data is
processed in parallel, and accomplishes complete statistical analysis on large
amounts of data.
It is a framework written in Java.
It is intended to scale from a single server to thousands of machines, each
offering local computation and storage.
It supports large data sets in a distributed computing environment.
The Apache Hadoop software library is a framework that allows the distributed
processing of huge data sets across clusters of computers using simple
programming models.
Analyzing data with Hadoop:
To take advantage of the parallel processing that Hadoop provides, we need to
express our query as a MapReduce job.
MapReduce works by breaking the processing into two phases:
the map phase and the reduce phase. Each phase has key-value pairs as input and
output, the types of which may be chosen by the programmer.
The programmer also specifies two functions: the map function and the reduce
function.
The input to our map phase is the raw NCDC weather data.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature from each record
and emits them as its output (the temperature values have been interpreted as
integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework
before being sent to the reduce function.
This processing sorts and groups the key-value pairs by key. So, continuing the
example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings.
All the reduce function has to do now is iterate through the list and pick up the
maximum reading:
(1949, 111)
(1950, 22)
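The whole pipeline above can be sketched in plain Python. Instead of the fixed-width NCDC records, each input line here is a simplified "year,temperature" pair; a real job would be written against the Hadoop API, and the framework would handle the shuffle between the phases:

```python
from collections import defaultdict

# Simplified input standing in for the NCDC records.
lines = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]

# Map phase: emit (year, temperature) key-value pairs.
mapped = []
for line in lines:
    year, temp = line.split(",")
    mapped.append((year, int(temp)))

# Shuffle: the framework sorts and groups the pairs by key.
grouped = defaultdict(list)
for year, temp in mapped:
    grouped[year].append(temp)

# Reduce phase: iterate through each list and pick the maximum reading.
result = {year: max(temps) for year, temps in sorted(grouped.items())}
print(result)  # {'1949': 111, '1950': 22}
```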
Hadoop Streaming:
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility
allows you to create and run MapReduce jobs with any executable or script as the
mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
As the mapper task runs, it feeds input lines to the stdin of the executable, collects
the line-oriented outputs from its stdout, and converts each line into a key/value
pair, which is collected as the output of the mapper.
By default, the prefix of a line up to the first tab character is the key, and the rest of
the line (excluding the tab character) is the value.
Streaming command options: streaming supports both streaming-specific command
options and generic command options.
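The mapper and reducer passed to streaming can be any executables that follow the stdin/stdout, tab-separated convention described above. A minimal word-count pair in Python might look like this; the local pipeline at the bottom stands in for Hadoop's sort-and-shuffle between the two stages:

```python
from io import StringIO

# A word-count mapper and reducer as they could be passed to
# hadoop-streaming with -mapper and -reducer. Each reads lines on
# stdin and writes tab-separated key/value lines on stdout.
def mapper(stdin, stdout):
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    counts = {}
    for line in stdin:
        word, n = line.rstrip("\n").split("\t")
        counts[word] = counts.get(word, 0) + int(n)
    for word in sorted(counts):
        stdout.write(f"{word}\t{counts[word]}\n")

# Simulate the pipeline locally: map, sort (the shuffle), reduce.
map_out = StringIO()
mapper(StringIO("big data big\n"), map_out)
shuffled = StringIO("".join(sorted(map_out.getvalue().splitlines(True))))
red_out = StringIO()
reducer(shuffled, red_out)
print(red_out.getvalue())
```

In a real job each function would live in its own script reading `sys.stdin` and writing `sys.stdout`, and Hadoop would run many copies of each in parallel.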
unit 1 big data.pptx

Mais conteúdo relacionado

Semelhante a unit 1 big data.pptx

Hadoop-based architecture approaches
Hadoop-based architecture approachesHadoop-based architecture approaches
Hadoop-based architecture approaches
Miraj Godha
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
Priyadarshini648418
 

Semelhante a unit 1 big data.pptx (20)

TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Hadoop-based architecture approaches
Hadoop-based architecture approachesHadoop-based architecture approaches
Hadoop-based architecture approaches
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Lecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdfLecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdf
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdf
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Big data Question bank.pdf
Big data Question bank.pdfBig data Question bank.pdf
Big data Question bank.pdf
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptx
 

Último

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Último (20)

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

unit 1 big data.pptx

  • 1. WEB DATA  A few years back, it was all manual data mining and it took long days for almost all small and medium players in the market for web data mining. Today, technology is evolving a lot and we are in an era of Big data and manual data mining is no more a right method and it is mostly about automation tools, custom scripts, or Hadoop framework. thing about web data extraction.  It is a process of collecting data from World Wide Web using some web scrapper, crawler, manual mining, etc. A web scrapper or crawler is a cutting tool for harvesting information available on internet.  In other word web data extraction is a process of crawling websites and extract data from that page using a tool or programming.  Web extraction is related to web indexing which refers to various methods of indexing the contents of web page using a bot or web crawler. A web crawler is an automated program, script or tool using that we can ‘crawl’ web pages to collect multiple information from websites.
  • 2. 1 In this whole process, first step is web data extraction, that can be done using different scraping tools available in market (there are free and paid tools are available) or create custom script using programming language with the help of expert in scripting language like Python, ruby, etc. 2 Second step is to find insight from the data. For this, first we need to process the data using the right tool based on the size of the data and availability of the expert resources. Hadoop framework is the most popular and highly used tool for big data processing. 3 Also, for sentimental analysis of those data, if needed, we need MapReduce which is one of the components of big data (Hadoop). To summarize, for web data extraction, we can choose different tools for automation or develop scripts using programming language. 4 Developing a script is often minimize effort as it is reusable with minimal modification. Moreover, as the volume of web data is huge-what we extract, it is always advisable to go for Hadoop framework for quick processing.
  • 3. Media companies use web scraping to collect recent and popular topics of interest from different social media and popular websites. Business directories use web scraping to collect information about the business profile, address, phone, location, zip code, etc. In healthcare sector, health physician scrap data from multiple websites to collect information on diseases, medicine, components, etc. When companies decide to go for web data extraction today, then they move ahead thinking about big data because they know that data will come in bulk i.e. in millions of records will be there and it will be mostly in semi or unstructured format. So, we will need to treat it as big data and use Hadoop framework and tools for converting it for any decision making.
  • 4. Challenges of Conventional Systems Analytics' has been used in the business intelligence world to provide tools and intelligence to gain insight into the data Data mining is used in enterprises to keep pace with the critical monitoring and analysis of mountains of data How to unearth all the hidden information through the vast amount of data
  • 5. Common changes: It cannot work on unstructured data efficiently It is built on to profile the relational data model It is batch oriented and we need to wait for nightly ETL(extract, transform and load)and transformation jobs to complete before the required insight is obtained Parallelism in a traditional analytics system is achieved through costly hardware like MPP(Massively Parallel Processing) systems In adequate support of aggregated summaries of data
  • 6. Data Challenges
• Volume, Velocity, Variety & Veracity
• Data discovery and comprehensiveness
• Scalability
• Storage issues
Process Challenges
• Capturing data
• Aligning data from different sources
• Transforming data into a suitable form for data analysis
• Modeling data (mathematically, simulation)
• Understanding output, visualizing results, and display issues on mobile devices
  • 7. Management Challenges
• Security
• Privacy
• Governance
• Ethical issues
Traditional / RDBMS
• Designed to handle well-structured data
• Traditional storage vendor solutions are very expensive
• Shared block-level storage is too slow
• Reads data in 8k or 16k block sizes
• Schema-on-write requires data to be validated before it can be written to disk
• Software licenses are too expensive
• Getting data from disk and loading it into memory requires an application
  • 8. Solution constraints
• Inexpensive storage
• A data platform that could handle large volumes of data and be linearly scalable in cost and performance
• A highly parallel processing model that was highly distributed, to access and compute the data very fast
• A data repository that could break down the silos and store structured, semi-structured, and unstructured data, making it easy to correlate and analyze the data together
  • 9. The Evolution of Analytic Scalability
• Scalability: the ability of a system to handle an increasing amount of work required to perform its task
• The increase in data storage ability has grown in recent years to accommodate the need for big data
• Measures of data size: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
Basic Definitions
• Data: known facts that can be recorded and that have an implicit meaning
• Database: an organized collection of related data
• Database Management System (DBMS): a software package to facilitate the creation and maintenance of a computerized database
• Relational Database Management System (RDBMS): a DBMS based on the relational model, where a relation is a group of tuples
• Enterprise Data Warehouse (EDW): a central warehouse of all sources of data
  • 10. Massively Parallel Processing Systems (MPP)
– Have many processors
– All these processors work in parallel
– Big data is split into many parts, and the processors work in parallel on each part
– A divide-and-conquer strategy
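The divide-and-conquer strategy can be sketched in a few lines of Python. This is a toy illustration only: a real MPP system spreads the parts across many machines, whereas here worker threads on one machine stand in for the processors.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_max(chunk):
    """Each 'processor' computes a result on its own part of the data."""
    return max(chunk)

def mpp_style_max(data, n_workers=4):
    """Divide and conquer: split the data into parts, process the parts
    in parallel, then combine the partial results into one answer."""
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_max, chunks))
    return max(partials)

print(mpp_style_max([3, 41, 7, 29, 56, 8, 12, 90, 4]))  # 90
```

The combine step here is trivial (a max of maxes); for other computations, such as averages, the partial results must carry enough information (e.g. sums and counts) to be merged correctly.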
  • 11. Data Preparation
• Manipulation of data into a suitable form for analysis
– Join: combining columns of different data sources
– Aggregation: combining many rows of data into one, e.g. a statistical summary, or combining rows of different data sources
– Derivations: creating new columns of data, e.g. calculating a ratio
– Transformation: converting data into a useful format, e.g. taking a log, or converting date of birth to age
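The four preparation operations above can be sketched with plain Python data structures. The customer and order records are invented for the example; a fixed "today" is used so the result is reproducible.

```python
from datetime import date

customers = [{"id": 1, "dob": date(1990, 5, 1)}, {"id": 2, "dob": date(1985, 1, 15)}]
orders = [{"id": 1, "spend": 120.0}, {"id": 2, "spend": 80.0}, {"id": 1, "spend": 30.0}]

# Aggregation: combine many order rows into one total per customer
totals = {}
for o in orders:
    totals[o["id"]] = totals.get(o["id"], 0.0) + o["spend"]

today = date(2024, 1, 1)  # fixed reference date for reproducibility
prepared = []
for c in customers:
    spend = totals.get(c["id"], 0.0)            # join: attach spend column
    age = (today - c["dob"]).days // 365        # transformation: DOB -> age
    prepared.append({
        "id": c["id"],
        "age": age,
        "spend": spend,
        "spend_per_year": spend / age,          # derivation: a new ratio column
    })

print(prepared)
```

In practice these steps would run in a dataframe library or inside the database, but the logic is the same: aggregate, join, derive, transform.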
  • 12. Ways for in-database data preparation
• SQL
• User-defined functions / embedded processes, e.g. Select customer, attrition_score, with the analytic tool's engine running inside the database
• Predictive Model Markup Language (PMML), based on XML
Cloud Computing
• McKinsey definition:
– Enterprises incur no infrastructure or capital cost; they pay on a pay-per-use basis
– Should be scalable
– The architectural specifics of the underlying hardware are abstracted from the user
• Public clouds and private clouds differ in:
– Security
– Specialized service
– Long-term cost
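The SQL route can be shown with Python's built-in sqlite3 module: the join and aggregation run inside the database engine, and only the prepared result comes back to the client. The tables and values here are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

rows = con.execute("""
    SELECT c.region, SUM(o.amount) AS total_spend   -- join + aggregation in SQL
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

print(rows)  # [('north', 150.0), ('south', 75.0)]
```

Pushing the preparation into the database this way avoids moving raw rows over the network, which is the main argument for in-database preparation at scale.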
  • 13. MapReduce
• A parallel processing framework
• Computational processing can occur on data (even semi-structured and unstructured data) stored in a file system, without loading it into any kind of database
Analytic process and tools:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
  • 15. Step 1: Deployment
• Here we need to:
– plan the deployment, monitoring, and maintenance,
– produce a final report, and review the project.
• In this phase we deploy the results of the analysis. This is also known as reviewing the project.
Step 2: Business Understanding
• This step consists of business understanding.
– Whenever any requirement occurs, we first need to determine the business objective,
– assess the situation,
– determine the data mining goals, and then
– produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 3: Data Exploration
• This step consists of data understanding.
– For the further process, we need to gather the initial data, describe and explore the data, and verify data quality to ensure it contains the data we require.
  • 16. – Data collected from the various sources is described in terms of its application and the need for the project in this phase. This is also known as data exploration.
• This is necessary to verify the quality of the data collected.
Step 4: Data Preparation
• From the data collected in the last step, we need to select data as per the need, clean it, construct it to get useful information, and then integrate it all.
• Finally, we need to format the data to get the appropriate data.
• Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
Step 5: Data Modeling
• We need to select a modeling technique, generate a test design, build a model, and assess the model built.
• The data model is built to analyze relationships between the various selected objects in the data; test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
  • 17. • Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed storage (e.g. Amazon S3)
• What is the programming model?
– Distributed processing (e.g. MapReduce)
• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on the data?
– Analytic / semantic processing
• Big data tools for HPC and supercomputing
– MPI
• Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
• Other BDA tools
– SAS
– R
– Hadoop
  • 18. Analysis vs. Reporting
What is analysis?
• The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.
Comparing analysis and reporting:
• Reporting is "the process of organizing data into informational summaries in order to monitor how different areas of a business are performing."
• Measuring core metrics and presenting them, whether in an email, a slide deck, or an online dashboard, falls under this category.
• Analytics is "the process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance."
• Reporting helps companies to monitor their online business and be alerted when data falls outside of expected ranges.
  • 19. • Good reporting should raise questions about the business from its end users.
• The goal of analysis is to answer questions by interpreting the data at a deeper level and providing actionable recommendations.
• A firm may be focused on the general area of analytics (strategy, implementation, reporting, etc.) but not necessarily on the specific aspect of analysis.
• It is almost as if some organizations run out of gas after the initial set-up-related activities and never make it to the analysis stage.
  • 20.
Analysis                          Reporting
1. Provides what is needed        Provides what is asked for
2. Is typically customized        Is typically standardized
3. Involves a person              Does not involve a person
4. Is extremely flexible          Is fairly inflexible

1. Reporting translates raw data into information; analysis transforms data and information into insights.
2. Reporting shows you what is happening, while analysis focuses on explaining why it is happening and what you can do about it.
3. Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
4. Reporting and analysis can go hand in hand: reporting provides limited context about what is happening in the data, and context is critical to good analysis.
  • 21. History of Hadoop:
1. Hadoop was started by Doug Cutting to support two of his other well-known projects, Lucene and Nutch.
2. Hadoop was inspired by the Google File System (GFS), which was detailed in a paper released by Google in 2003.
3. Hadoop, originally called the Nutch Distributed File System (NDFS), split from Nutch in 2006 to become a sub-project of Lucene. At this point it was renamed Hadoop.
  • 23. Apache Hadoop: Apache Hadoop is the most important framework for working with big data. Hadoop's biggest strength is scalability: it scales from working on a single node to thousands of nodes without any issue, in a seamless manner. The web was generating loads of information on a daily basis, and it was becoming very difficult to manage the data of around one billion pages of content. In response, Google invented a revolutionary new methodology of processing data, popularly known as MapReduce. A year later, Google published a white paper describing it.
  • 25. Hadoop runs applications on the basis of MapReduce, where the data is processed in parallel, and accomplishes complete statistical analysis on large amounts of data. It is a framework based on Java programming. It is intended to scale from a single server to thousands of machines, each offering local computation and storage. It supports the processing of large data sets in a distributed computing environment. The Apache Hadoop software library is a framework that allows the distributed processing of huge data sets across clusters of computers using simple programming models.
  • 27. Analyzing data with Hadoop:
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
  • 28. These lines are presented to the map function as key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
  • 29. The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
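The whole job can be simulated in a few lines of Python. The field positions below are an assumption that matches the shortened sample lines above (year at a fixed offset, signed temperature after the "N9" marker); a real NCDC record would use the full fixed-width layout, and the shuffle/sort would be done by the framework rather than by a dictionary.

```python
import re
from collections import defaultdict

SAMPLE = [
    "0067011990999991950051507004...9999999N9+00001+99999999999...",
    "0043011990999991950051512004...9999999N9+00221+99999999999...",
    "0043011990999991950051518004...9999999N9-00111+99999999999...",
    "0043012650999991949032412004...0500001N9+01111+99999999999...",
    "0043012650999991949032418004...0500001N9+00781+99999999999...",
]

def map_fn(line):
    """Emit (year, temperature) pairs from one record."""
    year = line[15:19]
    temp = int(re.search(r"N9([+-]\d{4})", line).group(1))
    yield (year, temp)

def reduce_fn(year, temps):
    """Pick the maximum reading for one year."""
    return (year, max(temps))

# Shuffle/sort: group the map outputs by key, as the framework would
groups = defaultdict(list)
for line in SAMPLE:
    for year, temp in map_fn(line):
        groups[year].append(temp)

result = sorted(reduce_fn(y, ts) for y, ts in groups.items())
print(result)  # [('1949', 111), ('1950', 22)]
```

This reproduces the (1949, 111) and (1950, 22) outputs derived by hand above.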
  • 31. Hadoop Streaming:
Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc
As the mapper task runs, it collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value.
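A custom streaming job would supply two scripts that read lines on stdin and write tab-separated key/value lines on stdout. The sketch below shows the logic of such a pair for a word count (an illustrative assumption, not from the slides); here the map, sort, and reduce phases are simulated in-process so the example is self-contained, whereas real scripts would loop over sys.stdin.

```python
from itertools import groupby

def mapper(lines):
    """mapper.py: emit one 'key<TAB>value' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """reducer.py: input arrives sorted by key; sum the counts per key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# Simulate the streaming pipeline: map -> sort (the shuffle) -> reduce
data = ["big data is big", "data is everywhere"]
mapped = sorted(mapper(data))  # Hadoop sorts by key between the phases
print(list(reducer(mapped)))   # ['big\t2', 'data\t2', 'everywhere\t1', 'is\t2']
```

To run this on a cluster, the two functions would be saved as executable scripts and passed via -mapper and -reducer in place of /bin/cat and /bin/wc in the command above.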
  • 32. Streaming Command Options Streaming supports streaming command options as well as generic command options.