WEB DATA
 A few years back, data mining was all manual, and web data mining took many days for almost all small and medium players in the market. Today the technology has evolved: in the era of big data, manual mining is no longer the right method, and web data extraction is mostly done with automation tools, custom scripts, or the Hadoop framework.
 Web data extraction is the process of collecting data from the World Wide Web using a web scraper, a crawler, manual mining, etc. A web scraper or crawler is a tool for harvesting information available on the internet.
 In other words, web data extraction is the process of crawling websites and extracting data from their pages using a tool or a program.
 Web extraction is related to web indexing, which refers to various methods of indexing the contents of web pages using a bot or web crawler. A web crawler is an automated program, script, or tool with which we can 'crawl' web pages and collect information from websites.
1 The first step in this whole process is web data extraction, which can be done using the different scraping tools available in the market (both free and paid tools exist) or by creating a custom script in a scripting language such as Python or Ruby, with the help of an expert.
2 The second step is to find insight in the data. For this, we first need to process the data using the right tool, based on the size of the data and the availability of expert resources. The Hadoop framework is the most popular and widely used tool for big data processing.
3 Also, if sentiment analysis of the data is needed, we use MapReduce, one of the components of the Hadoop big data stack.
To summarize, for web data extraction we can choose different automation tools or develop scripts in a programming language.
4 Developing a script often minimizes effort, as it is reusable with minimal modification. Moreover, as the volume of extracted web data is huge, it is advisable to use the Hadoop framework for quick processing.
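As a sketch of what such a scraping script does, the standard-library parser below extracts every `<h2>` heading from a page. This is a minimal illustration, not a production scraper: in a real script the HTML would come from an HTTP fetch (e.g. urllib, or a library such as requests), and the page used here is a made-up stand-in.

```python
from html.parser import HTMLParser

# A minimal extractor: collects the text of every <h2> heading on a page.
class HeadingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

# A hard-coded page stands in for a fetched document.
page = "<html><body><h2>Prices</h2><p>...</p><h2>Reviews</h2></body></html>"
scraper = HeadingScraper()
scraper.feed(page)
print(scraper.headings)  # ['Prices', 'Reviews']
```

The same pattern scales up: fetch each page, parse out the fields of interest, and write the records somewhere a processing framework can reach them.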
Media companies use web scraping to collect recent and popular topics of interest from social media and popular websites.
Business directories use web scraping to collect information such as business profile, address, phone, location, zip code, etc.
In the healthcare sector, physicians scrape data from multiple websites to collect information on diseases, medicines, components, etc.
When companies decide to go for web data extraction today, they plan for big data from the start, because they know the data will come in bulk, i.e. millions of records, mostly in semi-structured or unstructured format. So we need to treat it as big data and use the Hadoop framework and its tools to convert it into a form fit for decision making.
Challenges of Conventional Systems
Analytics has been used in the business intelligence world to provide tools and intelligence to gain insight into data.
Data mining is used in enterprises to keep pace with the critical monitoring and analysis of mountains of data.
The question is how to unearth all the hidden information in this vast amount of data.
Common challenges of a conventional system:
It cannot work on unstructured data efficiently.
It is built around the relational data model.
It is batch oriented: we need to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
Parallelism in a traditional analytics system is achieved through costly hardware like MPP (Massively Parallel Processing) systems.
It has inadequate support for aggregated summaries of data.
Data Challenges
• Volume, Velocity, Variety & Veracity
• Data discovery and comprehensiveness
• Scalability
• Storage issues
Process Challenges
• Capturing data
• Aligning data from different sources
• Transforming data into a suitable form for data analysis
• Modeling data (mathematical modeling, simulation)
• Understanding output, visualizing results and display issues on
mobile devices
Management Challenges
• Security
• Privacy
• Governance
• Ethical issues
Traditional / RDBMS
• Designed to handle well-structured data
• Traditional storage vendor solutions are very expensive
• Shared block-level storage is too slow
• Reads data in 8 KB or 16 KB block sizes
• Schema-on-write requires data to be validated before it can be
written to disk
• Software licenses are too expensive
• Getting data from disk and loading it into memory requires an application
Solution constraints
• Inexpensive storage
• A data platform that could handle large volumes of data and be linearly
scalable in cost and performance
• A highly parallel processing model that was highly distributed to
access and compute the data very fast
• A data repository that could break down the silos and store structured,
semi-structured, and unstructured data to make it easy to correlate and
analyze the data together
The Evolution of Analytic Scalability
• Scalability: The ability of a system to handle an increasing amount of work required
to perform its task
• Data storage capacity has grown in recent years to accommodate
the need for big data
• Measures of Data Size – Kilo, Mega, Giga , Tera, Peta, Exa, Zetta, Yotta
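The scale of these units can be made concrete with a small sketch (using decimal SI prefixes, each a factor of 1000; binary prefixes such as kibi/mebi are a separate convention):

```python
# Each SI prefix is a factor of 1000 larger than the previous one.
prefixes = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
sizes = {name + "byte": 1000 ** power
         for power, name in enumerate(prefixes, start=1)}
print(sizes["petabyte"])  # 1000000000000000
```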
Basic Definitions
• Data: – Known facts that can be recorded and have an implicit meaning.
• Database: – Organized collection of related data.
• Database Management System (DBMS) – A software package to facilitate the
creation and maintenance of a computerized database.
• Relational Database Management System (RDBMS) – A DBMS based on the relational model.
• A relation is a set of tuples.
• Enterprise Data Warehouse (EDW) – A central warehouse for all sources of data.
Massively Parallel Processing (MPP) Systems
– Have many processors
– All these processors work in parallel
– Big data is split into many parts, and the processors work on each part in parallel
– A divide-and-conquer strategy
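The divide-and-conquer strategy can be sketched in a few lines. This is only an in-process illustration using threads; a real MPP system runs each part on a separate processor or machine:

```python
from concurrent.futures import ThreadPoolExecutor

# Divide-and-conquer sketch of the MPP idea: split the data into parts,
# compute on each part concurrently, then combine the partial results.
def partial_sum(chunk):
    return sum(chunk)

data = list(range(1_000_000))
parts = [data[i::4] for i in range(4)]        # divide: split into 4 parts
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, parts))
total = sum(partials)                         # conquer: combine the results
print(total)  # 499999500000
```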
Data Preparation
• Manipulation of data into a form suitable for analysis
– Join
• Combining columns of different data sources
– Aggregation
• Combining all data into one, e.g. a statistical summary
• Combining rows of different data sources
– Derivations
• Creating new columns of data, e.g. calculating a ratio
– Transformation
• Converting data into a useful format, e.g. taking a log, or converting date of birth to age
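The four preparation steps above can be sketched on toy records (the names and values below are invented for illustration):

```python
from datetime import date

# Toy records standing in for two data sources.
customers = [{"id": 1, "name": "Asha", "dob": date(1990, 5, 1)},
             {"id": 2, "name": "Ravi", "dob": date(1985, 1, 1)}]
orders = [{"cust_id": 1, "amount": 200}, {"cust_id": 1, "amount": 100},
          {"cust_id": 2, "amount": 50}]

# Join: combine columns of different data sources on a key.
by_id = {c["id"]: c for c in customers}
joined = [{**by_id[o["cust_id"]], **o} for o in orders]

# Aggregation: a statistical summary combining many rows into one per key.
total_spend = {}
for o in orders:
    total_spend[o["cust_id"]] = total_spend.get(o["cust_id"], 0) + o["amount"]

# Derivation: create a new column, e.g. each customer's share of all spend.
grand_total = sum(total_spend.values())
share = {cid: amt / grand_total for cid, amt in total_spend.items()}

# Transformation: convert data into a more useful format,
# e.g. date of birth to (approximate) age.
as_of = date(2024, 1, 1)
ages = {c["id"]: (as_of - c["dob"]).days // 365 for c in customers}

print(total_spend)  # {1: 300, 2: 50}
print(ages)         # {1: 33, 2: 39}
```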
Ways of doing in-database data preparation:
• SQL
• User-defined functions / embedded processes
– e.g. SELECT customer, attrition_score ...
– The analytic tool's engine running inside the database
• Predictive Model Markup Language (PMML)
– Based on XML
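The SQL route can be shown end to end with an in-memory database. The engine performs the aggregation and derived-column calculation and returns only the prepared rows; the table and column names here are invented for illustration:

```python
import sqlite3

# In-database preparation: push the join/aggregation work into the engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("A", 200.0), ("A", 100.0), ("B", 50.0)])

rows = con.execute("""
    SELECT customer,
           SUM(amount) AS total,                                    -- aggregation
           SUM(amount) / (SELECT SUM(amount) FROM orders) AS share  -- derivation
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(rows)
```

Only the prepared summary rows cross the wire, which is the point of doing the preparation in the database rather than in the analytic tool.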
Cloud Computing
• McKinsey definition:
– Enterprises incur no infrastructure or capital cost; they pay on a pay-per-use basis
– It should be scalable
– The architectural specifics of the underlying hardware are abstracted from the user
• Public clouds and private clouds differ in:
– Security
– Specialized services
– Long-term cost
MapReduce
• A parallel processing framework
• Computational processing can occur on data (even semi-structured and
unstructured data) stored in a file system, without loading it into any kind
of database
Analytic process and tools:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment
• Here we need to:
– plan the deployment, monitoring, and maintenance,
– produce a final report, and
– review the project.
• In this phase we deploy the results of the analysis; this is also known as
reviewing the project.
Step 2: Business Understanding
• This step consists of business understanding.
– Whenever a requirement arises, we first need to determine the business objective,
– assess the situation,
– determine the data mining goals, and then
– produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 3: Data Exploration
• This step consists of data understanding.
– For the further process, we need to gather the initial data, describe and explore it, and
verify its quality to ensure it contains the data we require.
– The data collected from the various sources is described in terms of its application and the
needs of the project in this phase.
– This is also known as data exploration.
• This step is necessary to verify the quality of the data collected.
Step 4: Data Preparation
• From the data collected in the last step,
– we need to select the data as per the need, clean it, and construct it to get useful
information,
– and then integrate it all.
• Finally, we need to format the data appropriately.
• In this phase the data is selected, cleaned, and integrated into the format finalized
for the analysis.
Step 5: Data Modeling
• We need to:
– select a modeling technique, generate a test design, build a model, and assess the model
built.
• In this phase the data model is built to analyze relationships between the various selected
objects in the data;
– test cases are built for assessing the model, and the model is tested and implemented on
the data.
• Where processing is hosted?
– Distributed Servers / Cloud (e.g. Amazon EC2)
• Where data is stored?
– Distributed Storage (e.g. Amazon S3)
• What is the programming model?
– Distributed Processing (e.g. MapReduce)
• How data is stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
– Analytic / Semantic Processing
• Big data tools for HPC and supercomputing
– MPI
• Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
• Other BDA tools
– SAS
– R
– Hadoop
Analysis vs. Reporting
What is analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
Comparing analysis vs. reporting:
Reporting is “the process of organizing data into informational summaries
in order to monitor how different areas of a business are performing.”
• Measuring core metrics and presenting them — whether in an email, a
slidedeck, or online dashboard — falls under this category.
• Analytics is “the process of exploring data and reports in order to
extract meaningful insights, which can be used to better understand and
improve business performance.”
• Reporting helps companies monitor their online business and alerts them
when data falls outside expected ranges.
• Good reporting should raise questions about the business from its end users.
• The goal of analysis is to answer those questions by interpreting the data at a
deeper level and providing actionable recommendations.
• A firm may be focused on the general area of analytics (strategy,
implementation, reporting, etc.)
– but not necessarily on the specific aspect of analysis.
• It is almost as if some organizations run out of gas after the initial set-up-
related activities and never make it to the analysis stage.
Analysis vs. Reporting:
1. Analysis provides what is needed; reporting provides what is asked for.
2. Analysis is typically customized; reporting is typically standardized.
3. Analysis involves a person; reporting does not involve a person.
4. Analysis is extremely flexible; reporting is fairly inflexible.
Reporting translates raw data into information.
1. Analysis transforms data and information into insights.
2. Reporting shows you what is happening,
3. while analysis focuses on explaining why it is happening and what you can do
about it.
4. Reports are like robots: they monitor and alert you. Analysis is like a parent:
it can figure out what is going on (hungry, dirty diaper, no pacifier,
teething, tired, ear infection, etc.).
5. Reporting and analysis can go hand in hand:
6. Reporting provides limited context about what is happening in the data, and
context is critical to good analysis.
History of Hadoop:
1. Hadoop was started by Doug Cutting to support two of his other well-known
projects, Lucene and Nutch.
2. Hadoop was inspired by the Google File System (GFS), which was
detailed in a paper released by Google in 2003.
3. Hadoop's storage layer, originally called the Nutch Distributed File System (NDFS),
split from Nutch in 2006 to become a sub-project of Lucene. At this point
it was renamed Hadoop.
Apache Hadoop:
Apache Hadoop is the most important framework for working with big data.
Hadoop's biggest strength is scalability: it scales seamlessly from a single
node to thousands of nodes without any issue.
The web was generating loads of information on a daily basis, and it was
becoming very difficult to manage the data of around one billion pages of
content.
To address this, Google invented a new methodology for processing data,
popularly known as MapReduce, and about a year later published a white paper
describing it.
Hadoop runs applications on the basis of MapReduce, where the data is
processed in parallel, and accomplishes complete statistical analysis on large
amounts of data.
It is a framework written in Java.
It is intended to scale from a single server to thousands of machines, each
offering local computation and storage.
It supports large data sets in a distributed computing environment.
The Apache Hadoop software library is a framework that allows the distributed
processing of huge data sets across clusters of computers using simple
programming models.
Analyzing data with Hadoop:
To take advantage of the parallel processing that Hadoop provides, we need to
express our query as a MapReduce job.
MapReduce works by breaking the processing into two phases:
the map phase and the reduce phase. Each phase has key-value pairs as input and
output, the types of which may be chosen by the programmer.
The programmer also specifies two functions: the map function and the reduce
function.
The input to our map phase is the raw NCDC weather data.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature from each record
and emits them as its output (the temperature values have been interpreted as
integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework
before being sent to the reduce function.
This processing sorts and groups the key-value pairs by key. So, continuing the
example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings.
All the reduce function has to do now is iterate through the list and pick up the
maximum reading:
(1949, 111)
(1950, 22)
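The whole pipeline above can be sketched in plain Python. Instead of the fixed-width NCDC records, each input line here is a simplified "year,temperature" pair; a real job would be written against the Hadoop API, and the framework would handle the shuffle between the phases:

```python
from collections import defaultdict

# Simplified input standing in for the NCDC records.
lines = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]

# Map phase: emit (year, temperature) key-value pairs.
mapped = []
for line in lines:
    year, temp = line.split(",")
    mapped.append((year, int(temp)))

# Shuffle: the framework sorts and groups the pairs by key.
grouped = defaultdict(list)
for year, temp in mapped:
    grouped[year].append(temp)

# Reduce phase: iterate through each list and pick the maximum reading.
result = {year: max(temps) for year, temps in sorted(grouped.items())}
print(result)  # {'1949': 111, '1950': 22}
```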
Hadoop Streaming:
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility
allows you to create and run MapReduce jobs with any executable or script as the
mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
As the mapper task runs, it feeds input lines to the stdin of the executable, collects
the line-oriented outputs from its stdout, and converts each line into a key/value
pair, which is collected as the output of the mapper.
By default, the prefix of a line up to the first tab character is the key, and the rest of
the line (excluding the tab character) is the value.
Streaming command options: streaming supports both streaming-specific command
options and generic command options.
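The mapper and reducer passed to streaming can be any executables that follow the stdin/stdout, tab-separated convention described above. A minimal word-count pair in Python might look like this; the local pipeline at the bottom stands in for Hadoop's sort-and-shuffle between the two stages:

```python
from io import StringIO

# A word-count mapper and reducer as they could be passed to
# hadoop-streaming with -mapper and -reducer. Each reads lines on
# stdin and writes tab-separated key/value lines on stdout.
def mapper(stdin, stdout):
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    counts = {}
    for line in stdin:
        word, n = line.rstrip("\n").split("\t")
        counts[word] = counts.get(word, 0) + int(n)
    for word in sorted(counts):
        stdout.write(f"{word}\t{counts[word]}\n")

# Simulate the pipeline locally: map, sort (the shuffle), reduce.
map_out = StringIO()
mapper(StringIO("big data big\n"), map_out)
shuffled = StringIO("".join(sorted(map_out.getvalue().splitlines(True))))
red_out = StringIO()
reducer(shuffled, red_out)
print(red_out.getvalue())
```

In a real job each function would live in its own script reading `sys.stdin` and writing `sys.stdout`, and Hadoop would run many copies of each in parallel.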
unit 1 big data.pptx

Mais conteúdo relacionado

Semelhante a unit 1 big data.pptx

Hadoop-based architecture approaches
Hadoop-based architecture approachesHadoop-based architecture approaches
Hadoop-based architecture approaches
Miraj Godha
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
Priyadarshini648418
 

Semelhante a unit 1 big data.pptx (20)

TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Hadoop-based architecture approaches
Hadoop-based architecture approachesHadoop-based architecture approaches
Hadoop-based architecture approaches
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Lecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdfLecture 1-big data engineering (Introduction).pdf
Lecture 1-big data engineering (Introduction).pdf
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdf
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Big data Question bank.pdf
Big data Question bank.pdfBig data Question bank.pdf
Big data Question bank.pdf
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptx
 

Último

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Último (20)

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

unit 1 big data.pptx

  • 1. WEB DATA  A few years back, it was all manual data mining and it took long days for almost all small and medium players in the market for web data mining. Today, technology is evolving a lot and we are in an era of Big data and manual data mining is no more a right method and it is mostly about automation tools, custom scripts, or Hadoop framework. thing about web data extraction.  It is a process of collecting data from World Wide Web using some web scrapper, crawler, manual mining, etc. A web scrapper or crawler is a cutting tool for harvesting information available on internet.  In other word web data extraction is a process of crawling websites and extract data from that page using a tool or programming.  Web extraction is related to web indexing which refers to various methods of indexing the contents of web page using a bot or web crawler. A web crawler is an automated program, script or tool using that we can ‘crawl’ web pages to collect multiple information from websites.
  • 2. 1 In this whole process, first step is web data extraction, that can be done using different scraping tools available in market (there are free and paid tools are available) or create custom script using programming language with the help of expert in scripting language like Python, ruby, etc. 2 Second step is to find insight from the data. For this, first we need to process the data using the right tool based on the size of the data and availability of the expert resources. Hadoop framework is the most popular and highly used tool for big data processing. 3 Also, for sentimental analysis of those data, if needed, we need MapReduce which is one of the components of big data (Hadoop). To summarize, for web data extraction, we can choose different tools for automation or develop scripts using programming language. 4 Developing a script is often minimize effort as it is reusable with minimal modification. Moreover, as the volume of web data is huge-what we extract, it is always advisable to go for Hadoop framework for quick processing.
  • 3. Media companies use web scraping to collect recent and popular topics of interest from different social media and popular websites. Business directories use web scraping to collect information about the business profile, address, phone, location, zip code, etc. In healthcare sector, health physician scrap data from multiple websites to collect information on diseases, medicine, components, etc. When companies decide to go for web data extraction today, then they move ahead thinking about big data because they know that data will come in bulk i.e. in millions of records will be there and it will be mostly in semi or unstructured format. So, we will need to treat it as big data and use Hadoop framework and tools for converting it for any decision making.
  • 4. Challenges of Conventional Systems Analytics' has been used in the business intelligence world to provide tools and intelligence to gain insight into the data Data mining is used in enterprises to keep pace with the critical monitoring and analysis of mountains of data How to unearth all the hidden information through the vast amount of data
  • 5. Common changes: It cannot work on unstructured data efficiently It is built on to profile the relational data model It is batch oriented and we need to wait for nightly ETL(extract, transform and load)and transformation jobs to complete before the required insight is obtained Parallelism in a traditional analytics system is achieved through costly hardware like MPP(Massively Parallel Processing) systems In adequate support of aggregated summaries of data
  • 6. Data Challenges
• Volume, Velocity, Variety & Veracity
• Data discovery and comprehensiveness
• Scalability
• Storage issues
Process Challenges
• Capturing data
• Aligning data from different sources
• Transforming data into a suitable form for data analysis
• Modeling data (mathematically, simulation)
• Understanding output, visualizing results, and display issues on mobile devices
  • 7. Management Challenges
• Security
• Privacy
• Governance
• Ethical issues
Traditional / RDBMS
• Designed to handle well-structured data
• Traditional storage vendor solutions are very expensive
• Shared block-level storage is too slow
• Reads data in 8k or 16k block sizes
• Schema-on-write requires data to be validated before it can be written to disk
• Software licenses are too expensive
• Getting data from disk and loading it into memory requires an application
  • 8. Solution constraints
• Inexpensive storage
• A data platform that could handle large volumes of data and be linearly scalable in cost and performance
• A highly parallel processing model that was highly distributed, to access and compute the data very fast
• A data repository that could break down the silos and store structured, semi-structured, and unstructured data, making it easy to correlate and analyze the data together
  • 9. The Evolution of Analytic Scalability
• Scalability: the ability of a system to handle an increasing amount of work required to perform its task
• The increase in data storage ability has grown in recent years to accommodate the need for big data
• Measures of data size: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
Basic Definitions
• Data: known facts that can be recorded and that have an implicit meaning
• Database: an organized collection of related data
• Database Management System (DBMS): a software package to facilitate the creation and maintenance of a computerized database
• Relational Database Management System (RDBMS): a DBMS based on the relational model, where a relation is a group of tuples
• Enterprise Data Warehouse (EDW): a central warehouse of all sources of data
  • 10. Massively Parallel Processing Systems (MPP)
– Have many processors
– All these processors work in parallel
– Big data is split into many parts, and the processors work in parallel on each part
– A divide-and-conquer strategy
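The divide-and-conquer strategy can be sketched in a few lines of Python. This is a toy illustration only: a real MPP system spreads the parts across many machines, whereas here worker threads on one machine stand in for the processors.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_max(chunk):
    """Each 'processor' computes a result on its own part of the data."""
    return max(chunk)

def mpp_style_max(data, n_workers=4):
    """Divide and conquer: split the data into parts, process the parts
    in parallel, then combine the partial results into one answer."""
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_max, chunks))
    return max(partials)

print(mpp_style_max([3, 41, 7, 29, 56, 8, 12, 90, 4]))  # 90
```

The combine step here is trivial (a max of maxes); for other computations, such as averages, the partial results must carry enough information (e.g. sums and counts) to be merged correctly.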
  • 11. Data Preparation
• Manipulation of data into a suitable form for analysis
– Join: combining columns of different data sources
– Aggregation: combining many rows of data into one, e.g. a statistical summary, or combining rows of different data sources
– Derivations: creating new columns of data, e.g. calculating a ratio
– Transformation: converting data into a useful format, e.g. taking a log, or converting date of birth to age
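The four preparation operations above can be sketched with plain Python data structures. The customer and order records are invented for the example; a fixed "today" is used so the result is reproducible.

```python
from datetime import date

customers = [{"id": 1, "dob": date(1990, 5, 1)}, {"id": 2, "dob": date(1985, 1, 15)}]
orders = [{"id": 1, "spend": 120.0}, {"id": 2, "spend": 80.0}, {"id": 1, "spend": 30.0}]

# Aggregation: combine many order rows into one total per customer
totals = {}
for o in orders:
    totals[o["id"]] = totals.get(o["id"], 0.0) + o["spend"]

today = date(2024, 1, 1)  # fixed reference date for reproducibility
prepared = []
for c in customers:
    spend = totals.get(c["id"], 0.0)            # join: attach spend column
    age = (today - c["dob"]).days // 365        # transformation: DOB -> age
    prepared.append({
        "id": c["id"],
        "age": age,
        "spend": spend,
        "spend_per_year": spend / age,          # derivation: a new ratio column
    })

print(prepared)
```

In practice these steps would run in a dataframe library or inside the database, but the logic is the same: aggregate, join, derive, transform.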
  • 12. Ways for in-database data preparation
• SQL
• User-defined functions / embedded processes, e.g. Select customer, attrition_score, with the analytic tool's engine running inside the database
• Predictive Model Markup Language (PMML), based on XML
Cloud Computing
• McKinsey definition:
– Enterprises incur no infrastructure or capital cost; they pay on a pay-per-use basis
– Should be scalable
– The architectural specifics of the underlying hardware are abstracted from the user
• Public clouds and private clouds differ in:
– Security
– Specialized service
– Long-term cost
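The SQL route can be shown with Python's built-in sqlite3 module: the join and aggregation run inside the database engine, and only the prepared result comes back to the client. The tables and values here are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

rows = con.execute("""
    SELECT c.region, SUM(o.amount) AS total_spend   -- join + aggregation in SQL
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

print(rows)  # [('north', 150.0), ('south', 75.0)]
```

Pushing the preparation into the database this way avoids moving raw rows over the network, which is the main argument for in-database preparation at scale.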
  • 13. MapReduce
• A parallel processing framework
• Computational processing can occur on data (even semi-structured and unstructured data) stored in a file system, without loading it into any kind of database
Analytic process and tools:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
  • 15. Step 1: Deployment
• Here we need to:
– plan the deployment, monitoring, and maintenance,
– produce a final report, and review the project.
• In this phase we deploy the results of the analysis. This is also known as reviewing the project.
Step 2: Business Understanding
• This step consists of business understanding.
– Whenever any requirement occurs, we first need to determine the business objective,
– assess the situation,
– determine the data mining goals, and then
– produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 3: Data Exploration
• This step consists of data understanding.
– For the further process, we need to gather the initial data, describe and explore the data, and verify data quality to ensure it contains the data we require.
  • 16. – Data collected from the various sources is described in terms of its application and the need for the project in this phase. This is also known as data exploration.
• This is necessary to verify the quality of the data collected.
Step 4: Data Preparation
• From the data collected in the last step, we need to select data as per the need, clean it, construct it to get useful information, and then integrate it all.
• Finally, we need to format the data to get the appropriate data.
• Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
Step 5: Data Modeling
• We need to select a modeling technique, generate a test design, build a model, and assess the model built.
• The data model is built to analyze relationships between the various selected objects in the data; test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
  • 17. • Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed storage (e.g. Amazon S3)
• What is the programming model?
– Distributed processing (e.g. MapReduce)
• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on the data?
– Analytic / semantic processing
• Big data tools for HPC and supercomputing
– MPI
• Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
• Other BDA tools
– SAS
– R
– Hadoop
  • 18. Analysis vs. Reporting
What is analysis?
• The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.
Comparing analysis and reporting:
• Reporting is "the process of organizing data into informational summaries in order to monitor how different areas of a business are performing."
• Measuring core metrics and presenting them, whether in an email, a slide deck, or an online dashboard, falls under this category.
• Analytics is "the process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance."
• Reporting helps companies to monitor their online business and be alerted when data falls outside of expected ranges.
  • 19. • Good reporting should raise questions about the business from its end users.
• The goal of analysis is to answer questions by interpreting the data at a deeper level and providing actionable recommendations.
• A firm may be focused on the general area of analytics (strategy, implementation, reporting, etc.) but not necessarily on the specific aspect of analysis.
• It is almost as if some organizations run out of gas after the initial set-up-related activities and never make it to the analysis stage.
  • 20.
Analysis                          Reporting
1. Provides what is needed        Provides what is asked for
2. Is typically customized        Is typically standardized
3. Involves a person              Does not involve a person
4. Is extremely flexible          Is fairly inflexible

1. Reporting translates raw data into information; analysis transforms data and information into insights.
2. Reporting shows you what is happening, while analysis focuses on explaining why it is happening and what you can do about it.
3. Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
4. Reporting and analysis can go hand in hand: reporting provides limited context about what is happening in the data, and context is critical to good analysis.
  • 21. History of Hadoop:
1. Hadoop was started by Doug Cutting to support two of his other well-known projects, Lucene and Nutch.
2. Hadoop was inspired by the Google File System (GFS), which was detailed in a paper released by Google in 2003.
3. Hadoop, originally called the Nutch Distributed File System (NDFS), split from Nutch in 2006 to become a sub-project of Lucene. At this point it was renamed Hadoop.
  • 23. Apache Hadoop: Apache Hadoop is the most important framework for working with big data. Hadoop's biggest strength is scalability: it scales from working on a single node to thousands of nodes without any issue, in a seamless manner. The web was generating loads of information on a daily basis, and it was becoming very difficult to manage the data of around one billion pages of content. In response, Google invented a revolutionary new methodology of processing data, popularly known as MapReduce. A year later, Google published a white paper describing it.
  • 25. Hadoop runs applications on the basis of MapReduce, where the data is processed in parallel, and accomplishes complete statistical analysis on large amounts of data. It is a framework based on Java programming. It is intended to scale from a single server to thousands of machines, each offering local computation and storage. It supports the processing of large data sets in a distributed computing environment. The Apache Hadoop software library is a framework that allows the distributed processing of huge data sets across clusters of computers using simple programming models.
  • 27. Analyzing data with Hadoop:
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
  • 28. These lines are presented to the map function as key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
  • 29. The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
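The whole job can be simulated in a few lines of Python. The field positions below are an assumption that matches the shortened sample lines above (year at a fixed offset, signed temperature after the "N9" marker); a real NCDC record would use the full fixed-width layout, and the shuffle/sort would be done by the framework rather than by a dictionary.

```python
import re
from collections import defaultdict

SAMPLE = [
    "0067011990999991950051507004...9999999N9+00001+99999999999...",
    "0043011990999991950051512004...9999999N9+00221+99999999999...",
    "0043011990999991950051518004...9999999N9-00111+99999999999...",
    "0043012650999991949032412004...0500001N9+01111+99999999999...",
    "0043012650999991949032418004...0500001N9+00781+99999999999...",
]

def map_fn(line):
    """Emit (year, temperature) pairs from one record."""
    year = line[15:19]
    temp = int(re.search(r"N9([+-]\d{4})", line).group(1))
    yield (year, temp)

def reduce_fn(year, temps):
    """Pick the maximum reading for one year."""
    return (year, max(temps))

# Shuffle/sort: group the map outputs by key, as the framework would
groups = defaultdict(list)
for line in SAMPLE:
    for year, temp in map_fn(line):
        groups[year].append(temp)

result = sorted(reduce_fn(y, ts) for y, ts in groups.items())
print(result)  # [('1949', 111), ('1950', 22)]
```

This reproduces the (1949, 111) and (1950, 22) outputs derived by hand above.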
  • 31. Hadoop Streaming:
Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc
As the mapper task runs, it collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value.
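A custom streaming job would supply two scripts that read lines on stdin and write tab-separated key/value lines on stdout. The sketch below shows the logic of such a pair for a word count (an illustrative assumption, not from the slides); here the map, sort, and reduce phases are simulated in-process so the example is self-contained, whereas real scripts would loop over sys.stdin.

```python
from itertools import groupby

def mapper(lines):
    """mapper.py: emit one 'key<TAB>value' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """reducer.py: input arrives sorted by key; sum the counts per key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# Simulate the streaming pipeline: map -> sort (the shuffle) -> reduce
data = ["big data is big", "data is everywhere"]
mapped = sorted(mapper(data))  # Hadoop sorts by key between the phases
print(list(reducer(mapped)))   # ['big\t2', 'data\t2', 'everywhere\t1', 'is\t2']
```

To run this on a cluster, the two functions would be saved as executable scripts and passed via -mapper and -reducer in place of /bin/cat and /bin/wc in the command above.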
  • 32. Streaming Command Options Streaming supports streaming command options as well as generic command options.